Python gets down to (the) Metal

April 26, 2023

Are you a mac & Python person?

Do you have a few trillion numbers to multiply together?

But are you also a busy person who doesn’t have all day to wait for the results?

If the answer to all of these questions is “Yes!”, then this post might just be for you.

Python is quite slow

I have been a happy Python user for many years.

I feel that Python (and the vast array of packages built for it) gives me a way to make computers do the things I want to do with data quickly and easily.

When those things involve images or videos, that often means multiplying lots of numbers together.

Python has a reputation for being slow at that kind of work compared to (compiled) languages like C or Rust, and it is true: basic Python does that kind of work much more slowly.

Packages like NumPy can help, because they offload heavy compute work to much faster C-based libraries which can make better use of the CPU.
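If you want to feel the difference yourself, here is a rough (and unscientific) comparison sketch; exact timings will vary from machine to machine:

import time
import numpy as np

n = 10_000_000
a = [1.5] * n
b = [2.5] * n

# Multiply 10 million pairs of numbers in pure Python...
start = time.perf_counter()
c = [x * y for x, y in zip(a, b)]
print(f"pure Python: {time.perf_counter() - start:.3f}s")

# ...and the same job with NumPy, which hands the whole loop to fast C code
a_np = np.full(n, 1.5, dtype=np.float32)
b_np = np.full(n, 2.5, dtype=np.float32)
start = time.perf_counter()
c_np = a_np * b_np
print(f"NumPy:       {time.perf_counter() - start:.3f}s")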

But CPUs are not the fastest devices for multiplying lots of numbers.

If your computer has a graphics processor (GPU), you can calculate things 10x or even 100x faster than with a CPU.

mac is quite fast

I recently became a happy mac user.

Apple started to make their own chips for macs, with their own graphics processors inside.

The latest chip family used in the newest macs is the M2, and even the basic 8-core GPU model used in the mini & Air can (theoretically) multiply around 3 trillion numbers together in a second.

The bigger models in the same family can do more than 10 trillion multiplies in the same time.

That is a lot of multiplying.

Metal is the way (to program mac GPUs)

Like it or not, Apple do things their own way.

With GPUs, that way is a language called Metal (described in great detail here in the Metal Shading Language Specification document).

Metal looks a bit like C++.

It has functions (called “kernels”) which can describe how to take numbers from memory locations, multiply them together, and put the results back in different memory locations.

Then there is a way to call those functions many times with different memory locations.

The trick that makes it go very fast is that the same function can effectively be called many times at once.

On an M2 with “8 GPU cores”, each “core” can run up to 256 different multiplications or additions from different kernel calls at once. That is more than 2000 kernel calls.

A new result is available after every clock cycle, and the GPU clock ticks around 1.4 billion times every second. In total that means around 2.8 trillion results in a second.
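Those claims are just multiplication; here is the back-of-envelope arithmetic as a sketch (the core count, lane count and clock speed are the approximate figures from above):

cores = 8            # "GPU cores" in the basic M2
lanes = 256          # multiplications/additions each core can run at once
clock = 1.4e9        # approximate GPU clock, cycles per second

in_flight = cores * lanes        # more than 2000 kernel calls at once
per_second = in_flight * clock   # one result per lane per clock cycle
print(f"{per_second:.2e} results per second")  # ~2.87e12, the "2.8 trillion" above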

Quickly coding quick code with Python and Metal together

So to recap, I have a new mac with a GPU and I have a lot of numbers to multiply.

I would like to use Python to quickly describe the kind of multiplying I want to do.

But then I also want that multiplying to be run quickly in the fastest bit of my mac, which is the GPU.

What I need(ed) is a way to write kernels with Metal to describe the calculations I want to do, and then use Python to control the data to give to the kernels, and how many of them to run at once.

That way is a package for Python on macs, called metalcompute.

metalcompute

metalcompute is now an installable package here on PyPI.

On a mac you can install it into a Python virtual environment just by doing:

  python3 -m pip install metalcompute

Source code and examples can also be found here on GitHub.

Using metalcompute

The idea with metalcompute is to let you get down close to the “bare Metal”, with very little in the way to slow things down (you or the computer).

That means you do need to learn a little bit about the Metal language, and how to write kernels, but in return you get flexibility and freedom.

The simplest working program with metalcompute can be quite small. Here is a minimal example, the “hello world” of metalcompute if you like.

import metalcompute as mc
from array import array
# The metal kernel to run
kernel = """
#include <metal_stdlib>
using namespace metal;
kernel void helloABC(const device float *A [[buffer(0)]],
                       const device float *B [[buffer(1)]],
                       device float *C [[buffer(2)]],
                       uint id [[thread_position_in_grid]]) {
    C[id] = A[id] * B[id]; 
}
"""
count = 10
# Create a metal device to run kernel with
dev = mc.Device()
# Create a compiled kernel to run
helloABC = dev.kernel(kernel).function("helloABC")
# Make the input arrays and a buffer for the result
A = array('f',range(count)) # 0..9
B = array('f',range(count,2*count)) # 10..19
C = dev.buffer(count * 4) # 4 because a float32 needs 4 bytes
C_view = memoryview(C).cast('f')
# Run the kernel count times
helloABC(count, A, B, C)
# Print the second result which should be 1.0 * 11.0 = 11.0
print(C_view[1])

When I run it:

 % time python3 helloworld.py
 11.0

First we import the metalcompute library with the short name “mc”.

Then we write our Metal kernel in a string. This needs one kernel function (here called “helloABC”) which defines the input and output memory buffers, and the “id” of the kernel.

This kernel just reads one number from buffer A and buffer B, multiplies them together, and saves the result to buffer C.

These buffers are indexed with the kernel id so that each kernel call will use different memory locations.

Then back to python…

We create a “metal device”, which is a way to talk to the GPU.

Then we compile our kernel string into a function we can call.

Finally we make the A and B input buffers (as Python array objects) and the C output buffer as a special metalcompute GPU memory object (we just have to say how many bytes of memory it needs). C_view just lets us see the data inside C from Python later.

Then we call the kernel, giving the number of functions to run together (10), and the buffers to use.

At the end, the results can be seen in C_view.
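For an example this small, you can also print the whole buffer to check every result; C[i] should be i * (i + 10):

print(list(C_view))
# [0.0, 11.0, 24.0, 39.0, 56.0, 75.0, 96.0, 119.0, 143.0, 171.0]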

Show me the Trillions

On my mac, this program takes about 0.07 seconds (70 milliseconds) to run from the command line. But as it only does 10 multiplications, that is not very impressive.

If I make the value of count much bigger, let's say 1,000,000,000 (1 billion), how long does it take now?

0.34 seconds.

OK, now we are getting somewhere.

But how much of that was actually the multiplying part, and how much was the creation of the input data? If I measure just the call to “helloABC” we see that part takes 0.19s, about half.

So we can do 5 billion multiplies per second like this. Not bad, but quite far from the 3 trillion multiplies we were promised. Where did it all go?

Moving things takes time

While the GPU can do trillions of multiplies in a second, the main memory in a computer is much slower. For an M2 with the largest memory configuration (24GB), “only” 100 billion bytes can be moved to or from memory in a second.

Worse, if the numbers you are moving are in “32-bit floating point” format, then 4 bytes are needed for each number. So, now we are down to only 25 billion floats moved every second.

The hello world kernel we showed needed 3 memory movements:

  • Reading 1 billion values from A
  • Reading 1 billion values from B
  • Writing 1 billion values into C
  • In total: 3 billion float moves per run

There is still a gap, but not so big now.
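Here is that back-of-envelope arithmetic as a quick sketch, using the figures from above:

floats_per_second = 100e9 / 4  # 100 GB/s of memory, 4 bytes per float32
moves = 3e9                    # read A, read B, write C: 1 billion values each
print(f"{moves / floats_per_second:.2f}s")  # ~0.12s of the 0.19s measured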

We can slightly change the kernel call to speed things up:

 [helloABC(count, A, B, C) for _ in range(100)][-1]

This now runs the same kernel 100 times with no gaps. On a 24GB M2 this should take about 4 seconds.
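To measure that yourself, you can wrap the line in a timer; a small sketch reusing the names from the hello world example (with count set to 1 billion):

import time
start = time.perf_counter()
[helloABC(count, A, B, C) for _ in range(100)][-1]
print(f"100 runs took {time.perf_counter() - start:.2f}s")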

So metalcompute can compute at the full speed of the memory.

… Billions, and Billions, and Trillions, and….

Can we get to trillions of operations?

Yes, if we reduce the movement of numbers.

We need to do more calculations in a kernel using local results instead of new numbers from memory.

Here is a slightly more complex kernel:

kernel = """
#include <metal_stdlib>
using namespace metal;
kernel void helloABC(const device float *A [[buffer(0)]],
                       const device float *B [[buffer(1)]],
                       device float *C [[buffer(2)]],
                       uint id [[thread_position_in_grid]]) {
    float Ai = A[id];
    float Bi = B[id];
    float ABi = Ai * Bi;
    float D = Ai;
    float E = Bi;
    float F = Ai+Bi;
    float G = Ai-Bi;
"""
kernel += 400*"""
    D += Ai * G;
    E += Bi * F; 
    F += Ai * E;
    G += Bi * D; 
"""
kernel += """
    C[id] = D + E + F + G;
}
"""

This kernel does 8 operations (4 multiplies and 4 adds) in the middle block, which is repeated 400 times. Then there are 6 other operations. It should keep every arithmetic unit the GPU has busy.

(If you are wondering why the kernel += part was needed instead of a normal loop, it was so the GPU would not waste time with a loop counter. This was an “unrolled” loop).
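With the count of 1 billion carried over from the earlier runs (an assumption, but the natural one here), the arithmetic works out to trillions of operations per run:

ops_per_block = 8          # 4 multiplies + 4 adds in the repeated block
blocks = 400
other_ops = 6              # the setup and the final sum
ops_per_call = ops_per_block * blocks + other_ops  # 3206
count = 1_000_000_000
print(f"{ops_per_call * count:.1e} operations per run")  # ~3.2e12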

Is this useful?

Not always. Sometimes you really need new data, and then you are stuck with the speed of your memory.

But when you have a computation with a small bunch of numbers to mix together with different combinations of basic operations, GPUs can do that very quickly.

One example is calculating the Julia set which was the thumbnail of this post, and which you should be able to see below, being calculated right now by your GPU. Here is the fragment shader (GLSL, running in the browser) that draws it:

void main() {
    vec2 co = (gl_FragCoord.xy - (0.5 * vec2(256., 256.))) / 256.;
    float speed = 0.1;
    float t = speed * time + 1.2 * sin(speed * time * 0.3828);
    float amp = 0.8 + 0.2 * sin(t * 9.119281);
    vec2 c = vec2(amp * sin(t), amp * cos(t));
    co *= 4.0 + 3.0 * sin(t);
    // Calculate julia sets
    float R = 100.0;
    vec2 z = vec2(co.x, co.y);
    float i = 0.0;
    float m_old = 0.0;
    float m_new = 0.0;
    for (int iter = 0; iter < 100; iter++) {
        vec2 z2 = vec2(z.x*z.x - z.y*z.y + c.x, +2.0*z.x*z.y + c.y);
        z = z2;
        m_new = (z.x*z.x) + (z.y*z.y);
        if (m_new > (R*R)) { break; }
        m_old = m_new;
        i += 1.0;
    }
    float add = log2(log(sqrt(m_new)) / log(R)) - 1.;
    i -= add;
    float huespeed = 8.;
    i = log2(i);
    vec3 rgb = vec3(0.5 + 0.4*sin(huespeed*0.31*i),
                    0.5 + 0.4*sin(huespeed*0.53*i),
                    0.5 + 0.4*sin(huespeed*0.71*i));
    gl_FragColor = vec4(rgb, 1.0);
}