Custom function evaluation on the GPU with MPSGraph

MPSGraph enables us to run math on the GPU without needing to write C++ based Metal shaders. We can combine logical and arithmetic nodes in a graph that the framework then stitches together.

In this post we will go over the steps needed to evaluate a polynomial function of variable size rapidly on the GPU. This can be easily modified for other arithmetic functions. In a future post I plan on documenting how we can use control flow nodes to provide GPU based solvers for finding approximate numerical solutions to computationally complex functions.

# Building the graph

We start by building the graph. Here are the inputs to our function. The exponents and coefficients will be used in the function f(location) = coefficients[0]*(location^ exponents[0]) + coefficients[1]*(location^exponents[1]) + ....

import Metal
import MetalPerformanceShaders
import MetalPerformanceShadersGraph

let graph = MPSGraph()

// Create placeholder for the exponents of polynomial
let exponents = graph.placeholder(
    shape: nil,
    dataType: .float32,
    name: "exponents"
)
     
// Create placeholder coefficients of the polynomial
let coefficients = graph.placeholder(
    shape: nil,
    dataType: .float32,
    name: "coefficients"
)

// Create placeholder for the x value that we will evaluate
let location = graph.placeholder(
    shape: nil,
    dataType: .float32,
    name: "location"
)

At this point we are not constraining the shape of the values, since we would like to be able to reuse this code for many different sizes of the polynomial.

The first thing we need to do is raise the location to the power in the exponents. The framework provides power(::name) method that creates a node to do it.

let locationHat = graph.power(
    location,
    exponents,
    name: "power"
)

Since location is a single value, this node will return a new vector of the same length as the exponents, raising the location to the power of each corresponding component.

Next we create a multiplication(::name) node combining the coefficients vector with the pairwise values from the locationHat vector.

let value = graph.reductionSum(
    with: values,
    axes: nil,
    name: "sum"
)

Then we'll need to apply a reductionSum(with:axis:name) to this output to resolve a single value as our output.

# Preparing data

There are multiple solutions for how to prepare data. Depending on how frequently you need to change the data, you might need to consider a more involved method. For simplicity here we will create the exponents and coefficients in Swift on the CPU and copy/share them with the GPU.

Let's consider a simple example of f(x) = 2 + 2x + 2x^2 + 2x^3 and define the values.

let exponentValues: [Float32] = [0, 1, 2, 3]
let coefficientValues: [Float32] = [2, 2, 2, 2]

With these values defined in Swift we need to copy them to the GPU. Create an extension on Array so that we can produce a matching MTLBuffer on a target metal device.

extension Array where Element == Float32 {   
    func asBuffer(on device: MTLDevice) -> MTLBuffer? {
        self.withContiguousStorageIfAvailable { pointer -> MTLBuffer? in
            guard let baseAddress = pointer.baseAddress else {
                return nil
            }
            return device.makeBuffer(
                bytes: baseAddress,
                length: pointer.count * MemoryLayout<Float32>.size,
                options: MTLResourceOptions.storageModeShared
            )
        } ?? nil
    }
}

This buffer lacks information about the data type that is stored within it. We need to create an MPSVector that wraps the buffer describing its value type.

extension Array where Element == Float32 {   
    func asVector(on device: MTLDevice) -> MPSVector? {
        guard let buffer = self.asBuffer(on: device) else {
            return nil
        }
        return MPSVector(
            buffer: buffer,
            descriptor: MPSVectorDescriptor(
                length: self.count,
                dataType: MPSDataType.float32
            )
        )
    }
}

We can now create instances of MPSGraphTensorData that the graph takes as input.

// Get the GPU
let device = MTLCreateSystemDefaultDevice()!

let exponentsTensorData = MPSGraphTensorData(
    exponentValues.asVector(on: device)!
)

let coefficientsTensorData = MPSGraphTensorData(
    coefficientsValues.asVector(on: device)!
)

We still need to create one more input: the value of x that we want to evaluate. For simplicity we are just going to evaluate a single location.

let locationTensorData = MPSGraphTensorData(
    MPSNDArray(device: device, scalar: 2.0)
)

# Running the graph

For running the graph we are going to use the simple run() method. It will compile your graph into a metal shader on the first call and then dispatch it to compute.

let results = graph.run(
    feeds: [
        location : locationTensorData,
        exponents: exponentsTensorData,
        coefficients: coefficientsTensorData
    ],
    targetTensors: [value],
    targetOperations: nil
) 

You can instead pre-compile it and run it as part of a larger metal compute shader pipeline. This could be useful if you need to pre-compute/post-compute some of the data using other Metal shaders.

To retrieve the computed result from our graph we need to copy the value back from the GPU to the CPU, so that we can use the value within our regular Swift code.

// We can index the results returned by the `run()` method
// using the `value` tensor.
let result = results[value]

let outputNDArray = result?.mpsndarray()

// Allocate a local array of the correct size
// so we can copy the values in.
var outputValues: [Float32] = [0]

// Copy all the values from the Metal array to the Swift array.
outputNDArray?.readBytes(&outputValues, strideBytes: nil)

The outputValues array will now contain the function evaluation. This method of evaluating polynomials is only worth it if you have a high order polynomial with 100s if not 1000s of terms.

# Evaluating multiple locations at once

We can modify the above code a little bit to enable us to evaluate many locations at once.

Firstly, we need to create a matrix of our locations with the values saved in the first column.

func asMatrix(on device: MTLDevice) -> MPSMatrix? {
    guard let buffer = self.asBuffer(on: device) else {
        return nil
    }

    return MPSMatrix(
        buffer: buffer,
        descriptor: .init(
            rows: self.count,
            columns: 1, rowBytes: MemoryLayout<Float32>.size,
            dataType: .float32
        )
    )
}

We also need to use this when we declare the locationTensorData.

let locationValues: [Float32] = [-1, 0, 1]
let locationTensorData = MPSGraphTensorData(   
    locationValues.asMatrix(on: device)!
)

Then we update the reductionSum to correctly sum up the values for each location rather than all the values over all locations.

let value = graph.reductionSum(
    with: values,
    axes: [NSNumber(1)],
    name: "sum"
)

And finally, when the graph finishes we need to ensure that outputValues array is large enough to have a result for each input location.

var outputValues: [Float32] = [0, 0, 0]

With these changes we can evaluate many points at once on the GPU in parallel.