<img height="1" width="1" style="display:none" src="https://www.facebook.com/tr?id=145304570664993&amp;ev=PageView&amp;noscript=1">
Julia header (1)

Sep 20, 2023

Running Julia on Graphcore IPUs

Written By:

Mosè Giordano

We're Hiring

Join us and build the next generation AI stack - including silicon, hardware and software - the worldwide standard for AI compute

Join our team

I am a Research Software Developer at University College London and user of the Julia programming language since 2016. I am also curious about trying out Julia on interesing hardware - which led me to implementing it on the Graphcore IPUs. However, I am not a compiler engineer.

You can try running code written in Julia on the IPU for free using a number of sample notebooks on Paperspace Gradient.

You may also like to watch a presentation I gave on implementing Julia on the Graphcore IPU at JuliaCon 2023.

What is Julia?

Julia is a modern, dynamic, general-purpose, compiled programming language. It's interactive ("like Python"), can be used in a REPL or notebooks, like Jupyter (it's the "Ju") or Pluto (this one🎈). Julia has a runtime which includes a just-in-time (JIT) compiler and a garbage collector (GC), for automatic memory management.

Julia is mainly used for numerical computing, diffential equations solvers suite is quite popular.

Main paradigm of Julia is multiple dispatch, what functions do depend on type and number of all arguments.

Why Julia?

Target 1Target 2

From "My Target Audience" by Matthijs Cox

  • Explorable & Understandable
  • Composability thanks to multiple dispatch (ask me more about this at the end!)
  • User-defined types are as fast and compact as built-ins
  • Code that is close to the mathematics
  • No need to switch languages for performance...
  • ...but you can still call C-like shared libraries with simple Foreign Function Interface (FFI) if you want to
  • MIT licensed: free and open source

My first IPU program in Julia

We developed a package called IPUToolkit.jl to interface the Poplar SDK:


using IPUToolkit.Poplar

let     # Set up graph and program     device = Poplar.get_ipu_device()     target = Poplar.DeviceGetTarget(device)     graph = Poplar.Graph(target)     prog = Poplar.ProgramSequence()     # Create CPU array     h3 = zeros(Float32, 4, 4)     # Create IPU tensors     c1 = Poplar.GraphAddConstant(graph, Float32[1.0, 1.5, 2.0, 2.5])     v1 = similar(graph, c1, "v1")     v2 = similar(graph, c1, "v2")     v3 = similar(graph, h3, "v3")     v4 = Poplar.GraphAddVariable(graph, Poplar.INT(), UInt64[10], "v4")     # Tensors tile mapping     Poplar.GraphSetTileMapping(graph, v1, 0)     for i in UInt64(0):UInt64(3)         Poplar.GraphSetTileMapping(graph, v2[i], i)     end     Poplar.GraphSetTileMapping(graph, v3, 0)     Poplar.GraphSetTileMapping(graph, v4, 0)     Poplar.GraphSetTileMapping(graph, c1, 0)     # Copy `c1` to `v1` and print `v1`     Poplar.ProgramSequenceAdd(prog, Poplar.ProgramCopy(c1, v1))     Poplar.ProgramSequenceAdd(prog, Poplar.ProgramPrintTensor("v1-debug", v1))     # Copy `v1` to `v2` and print `v2` (should be same as `v1` above)     Poplar.ProgramSequenceAdd(prog, Poplar.ProgramCopy(v1, v2))     Poplar.ProgramSequenceAdd(prog, Poplar.ProgramPrintTensor("v2-debug", v2))     # Prepare copying data between CPU and IPU     Poplar.GraphCreateHostWrite(graph, "v3-write", v3)     Poplar.GraphCreateHostRead(graph, "v3-read", v3)     v1slice = Poplar.TensorSlice(v1, 0, 3)     v3slice = Poplar.TensorSlice(v3, UInt64[1, 1], UInt64[2, 4])     Poplar.ProgramSequenceAdd(prog, Poplar.ProgramCopy(v1slice, v3slice))     # Read three batches of 10 `Int32`s from the strea `inStream` into `v4` and print values     inStream = Poplar.GraphAddHostToDeviceFIFO(graph, "v4-input-stream", Poplar.INT(), 10)     Poplar.ProgramSequenceAdd(prog, Poplar.ProgramCopy(inStream, v4))     Poplar.ProgramSequenceAdd(prog, Poplar.ProgramPrintTensor("v4-0", v4))     Poplar.ProgramSequenceAdd(prog, Poplar.ProgramCopy(inStream, v4))     Poplar.ProgramSequenceAdd(prog, Poplar.ProgramPrintTensor("v4-1", v4))     Poplar.ProgramSequenceAdd(prog, Poplar.ProgramCopy(inStream, v4))     Poplar.ProgramSequenceAdd(prog, Poplar.ProgramPrintTensor("v4-2", v4))     flags = Poplar.OptionFlags()     Poplar.OptionFlagsSet(flags, "debug.instrument", "true")     engine = Poplar.Engine(graph, prog, flags)     Poplar.EngineLoad(engine, device)     Poplar.EngineWriteTensor(engine, "v3-write", h3)     # Create data to stream to `v4`     inData = Int32.(1:30)     # Connect data to stream     Poplar.EngineConnectStream(engine, "v4-input-stream", inData)     # Run the engine     Poplar.EngineRun(engine, 0)     # Write IPU tensor `v3` to CPU array `h3`     Poplar.EngineReadTensor(engine, "v3-read", h3)     # Release device     Poplar.detach_devices()     # Print value of CPU array `h3`     print("h3 data: ")     display(h3') end

Trying to attach to device 0... Successfully attached to device 0

v1-debug: [1.0000000 1.5000000 2.0000000 2.5000000] v2-debug: [1.0000000 1.5000000 2.0000000 2.5000000] v4-0: [ 1 2 3 4 5 6 7 8 9 10] v4-1: [11 12 13 14 15 16 17 18 19 20] v4-2: [21 22 23 24 25 26 27 28 29 30] h3 data: 4×4 adjoint(::Matrix{Float32}) with eltype Float32: 0.0 0.0 0.0 0.0 0.0 1.0 1.5 2.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

But we are just writing C++ code in Julia, nothing excitingly new... Can we do more?

What's a compiler?

LysTgpt (1)

(A high-level diagram of a compiler pipeline, with emphasis on the front-end, from "What can we learn from how compilers are designed?")

The only existing implementation of Julia is based on the LLVM compiler, a popular modular compilation framework.

With Julia we can easily inspect each stage of the compilation pipeline.


Meta.@dump 1.0 + 2.0

Expr head: Symbol call args: Array{Any}((3,)) 1: Symbol + 2: Float64 1.0 3: Float64 2.0

CodeInfo( 1 ─ %1 = Base.add_float(x, y)::Float64 └── return %1 ) =>Float64) @code_typed 1.0 + 2.0

@code_llvm debuginfo=:none 1.0 + 2.0

define double @"julia_+_4759"(double %0, double %1) #0 { top:   %2 = fadd double %0, %1   ret double %2 }

@code_native debuginfo=:none 1.0 + 2.0

 .text  .file "+"  .globl "julia_+_4795"                  # -- Begin function julia_+_4795  .p2align 4, 0x90  .type "julia_+_4795",@function "julia_+_4795":                         # @"julia_+_4795" # %bb.0:                                # %top  push rbp  mov rbp, rsp  vaddsd xmm0, xmm0, xmm1  pop rbp  ret .Lfunc_end0:  .size "julia_+_4795", .Lfunc_end0-"julia_+_4795"                                         # -- End function  .section ".note.GNU-stack","",@progbits

For more details on the Julia compiler watch the talk "Compiling Julia: Making dynamic programs run fast" by Valentin Churavy.

The common factor: LLVM

It turns out also the Poplar compiler, the compiler developed by Graphcore for generating the native code for the IPU, is based on LLVM. When you compile a codelet for the IPU you, the Poplar compiler performs the same pipeline we have seen above.

This means that Julia and the Poplar compiler can actually talk the same language: the LLVM IR.

We added to IPUToolkit.jl the capability to generate code for the IPU with Julia. All in all, this package has the following goals:

  • interfacing the Poplar SDK for writing an IPU program in Julia
  • exploring Julia's metaprogramming capabilities for reducing boilerplate code in IPU programs
  • using Julia code generation capabilities to produce native code for the IPU through the Poplar compiler, using the pipeline outlined in this picture:

LLVM (1)

Writing IPU codelets in Julia

We can use the IPUToolkit.jl package to write IPU codelets in Julia, generating LLVM IR code which is then compiled to native code with the Poplar compiler. The code generation via LLVM is based on the GPUCompiler.jl package, which is a generic framework for generating LLVM IR code for specialised targets, not limited to GPUs despite the historical name.

The code inside a codelet has the same limitations as all the compilation models based on GPUCompiler.jl:

  • the code has to be statically inferred and compiled, dynamic dispatch is not admitted;
  • you cannot use functionalities which require the Julia runtime, most notably the garbage collector;
  • you cannot call into any other external binary library at runtime, for example you cannot call into a BLAS library.

Colossus target and performance

The LLVM IR is mostly target-independent, but not completely. Also, some details, like width of vectorisation registers, can be better tailored for the target if its properties are known correctly. For most of this project I actually generated LLVM IR for the host CPU, which "works", but is not optimal.

I recently managed to link Julia to Graphcore's fork of LLVM, which allowed me to directly generate code for the IPU (the "Colossus" target), however this combination is highly experimental and causes some unexpected errors.

Computing 𝜫

Inspired by Owain Kenway's pi_examples in several different languages.


using IPUToolkit.IPUCompiler

# Disable spinner to show progress of codelet compilation, # Pluto notebooks have their own progress indicator. IPUCompiler.PROGRESS_SPINNER[] = false;

# Do not automatically delete LLVM modules of codelets. IPUCompiler.KEEP_LLVM_FILES[] = true;

# Limit run-time of codelets ENV["POPLAR_RUNTIME_OPTIONS"] = """{"target.hostSyncTimeout":"60"}""";

let device = Poplar.get_ipu_device() target = Poplar.DeviceGetTarget(device) graph = Poplar.Graph(target)   num_tiles = Int(Poplar.TargetGetNumTiles(target)) n::UInt32 = typemax(UInt32) ÷ num_tiles unroll::UInt32 = 1 iszero(rem(n, unroll)) || error("n ($(n)) is not an integer multiple of unroll factor ($(unroll))") num_steps::UInt32 = num_tiles * n slice::Float32 = 1 / num_steps   tile_clock_frequency = Poplar.TargetGetTileClockFrequency(target)   ids = collect(UInt32.(0:(num_tiles - 1))) sums = similar(ids, Float32) cycles = similar(ids)   # With `@eval` we can interpolate (using `$`) global variables # inside the kernel. @eval function pi_kernel(i::T) where {T<:Integer}     sum = 0f0 for j in (i * $(n)):$(unroll):((i + one(T)) * $(n) - one(T)) Base.Cartesian.@nexprs $(Int64(unroll)) idx -> begin x = (j + Float32(idx - 1) - 0.5f-1) * $(slice) sum += 4 / (1 + x ^ 2) end end     return sum end   @codelet graph function Pi(in::VertexVector{UInt32, In},                            out::VertexVector{Float32, Out},                            cycles::VertexVector{UInt32, Out})     cycles[begin] = @ipuelapsed(out[begin] = pi_kernel(in[begin])) end   # Type and shape of tensors automatically inferred from input arrays ids_ipu = Poplar.GraphAddConstant(graph, ids) sums_ipu = similar(graph, sums, Float32, "sums"); cycles_ipu = similar(graph, cycles, UInt32, "cycles");   prog = Poplar.ProgramSequence()   # Automatically add the codelet defined above to the program # and spread it across all the tiles add_vertex(graph, prog, 0:(num_tiles - 1), Pi, ids_ipu, sums_ipu, cycles_ipu)   # Simplified API for `GraphCreateHostRead`, we can use fewer arguments Poplar.GraphCreateHostRead(graph, "sums-read", sums_ipu) Poplar.GraphCreateHostRead(graph, "cycles-read", cycles_ipu)   engine = Poplar.Engine(graph, prog) Poplar.EngineLoadAndRun(engine, device)   # Simplified API also for `EngineReadTensor` Poplar.EngineReadTensor(engine, "sums-read", sums) Poplar.EngineReadTensor(engine, "cycles-read", cycles)   Poplar.detach_devices()   pi = sum(sums) * slice time = round(maximum(cycles) / tile_clock_frequency; sigdigits=4)   print("""           Calculating PI using:             $(num_steps) slices             $(num_tiles) IPU tiles             loop unroll factor $(Int(unroll))           Obtained value of PI: $(pi)           Time taken: $(time) seconds ($(maximum(cycles)) cycles at $(round(tile_clock_frequency / 1e9; sigdigits=3)) GHz)           """) end;

Trying to attach to device 0... Successfully attached to device 0

Calculating PI using: 4294966272 slices 1472 IPU tiles loop unroll factor 1 Obtained value of PI: 3.1499734 Time taken: 0.123 seconds (227586882 cycles at 1.85 GHz)

Comparison with C++ program

We can write an equivalent C++ program to compare performance, to check how Julia code generation is faring. This is the codelet:


#include <poplar/Vertex.hpp> using namespace poplar; class VertexPi : public Vertex { public:     Input<unsigned> id;     Input<unsigned> n;     Input<float> slice;     Output<Vector<float>> out;     Output<Vector<unsigned>> cycles;     bool compute() {         float sum = 0;         unsigned start = __builtin_ipu_get_scount_l();         for (unsigned j=id * n; j < (id + 1) * n; j++) {             float x = (j - 0.5f) * slice;             sum += 4.0f / (1.0f + x * x);         }         unsigned end = __builtin_ipu_get_scount_l();         out[0] = sum;         cycles[0] = end - start;         return true;     } };

run(`$(cpp_exe_path)`);

Using HW device ID: 0 Calculating PI using: 4294966272 slices 1472 IPU tiles Obtained value of PI: 0.146298 Time taken: 0.141946 seconds (262599924 cycles at 1.85 GHz)

Main performance difference between the C++ and the Julia codelets in this case is due to loop unrolling. Some experimentations showed that when both code have loops similarly unrolled then performance is perfectly comparable. This showed that Julia can be a valid front-end for writing code for the IPU.

Benchmarking within codelets

IPUToolkit.jl provides some macros for quickly benchmarking parts of a codelet: @ipucycles, @ipushowcycles, @ipuelapsed (the latter was already used above).


let device = Poplar.get_ipu_device()     target = Poplar.DeviceGetTarget(device)     graph = Poplar.Graph(target)       @codelet graph function Benchmark(input::VertexVector{Float32, In},   output::VertexVector{Float32, Out})         for idx in eachindex(input, output)             x = input[idx]             @ipushow x             x = @ipushowcycles sin(x)             x = @ipushowcycles cos(x)             x = @ipushowcycles exp(x)             output[idx] = x         end     end       in_ipu = Poplar.GraphAddConstant(graph, randn(Float32, 5))     out_ipu = similar(graph, in_ipu);       prog = Poplar.ProgramSequence()       add_vertex(graph, prog, Benchmark, in_ipu, out_ipu)       engine = Poplar.Engine(graph, prog)     Poplar.EngineLoadAndRun(engine, device)     Poplar.detach_devices() end

Trying to attach to device 0... Successfully attached to device 0

T[0.5]: x = 0.433483 T[0.5]: sin(x): T[0.5]: 14418 cycles T[0.5]: cos(x): T[0.5]: 13320 cycles T[0.5]: exp(x): T[0.5]: 18 cycles T[0.5]: x = -1.012771 T[0.5]: sin(x): T[0.5]: 14448 cycles T[0.5]: cos(x): T[0.5]: 15834 cycles T[0.5]: exp(x): T[0.5]: 18 cycles T[0.5]: x = 0.436158 T[0.5]: sin(x): T[0.5]: 14430 cycles T[0.5]: cos(x): T[0.5]: 13302 cycles T[0.5]: exp(x): T[0.5]: 18 cycles T[0.5]: x = 1.835203 T[0.5]: sin(x): T[0.5]: 14562 cycles T[0.5]: cos(x): T[0.5]: 15798 cycles T[0.5]: exp(x): T[0.5]: 18 cycles T[0.5]: x = -0.725937 T[0.5]: sin(x): T[0.5]: 14592 cycles T[0.5]: cos(x): T[0.5]: 13308 cycles T[0.5]: exp(x): T[0.5]: 18 cycles

Note: This isn't a replacement for profiling, which is still an invaluable tool (but JIT complicats stacktraces), however these macros can be useful for quick feedback.

Cycle-count-based benchmarking is not reliable for blocks taking longer than typemax(UInt32) = 4294967295 cycles, about 2-3 seconds depending on your IPU model. You need to have a feeling whether the block you want to benchmark would overflow the counter.

Using external packages: StaticArrays.jl

We can also use third-party packages inside codelets, as long as the requirements mentioned above are met: no memory allocations, fully inferrable code. The StaticArrays.jl allows you to create arrays on the stack and do basic linear algebra operations with them. Code involving static arrays is also usally easy to infer for the compiler.

StaticArrays.jl isn't more efficient than using specialised Popops linear algebra routines, but nonetheless is a good example of usage of external packages inside IPU codelets written in Julia.


using StaticArrays

let device = Poplar.get_ipu_device() target = Poplar.DeviceGetTarget(device) graph = Poplar.Graph(target)   N = 16     mat1 = randn(Float32, N)     mat2 = randn(Float32, N)       mul = PoplarVector{Float32}(undef, N)     inverse = PoplarVector{Float32}(undef, N)       @ipuprogram device begin         function MatMul(in1::VertexVector{Float32, In}, in2::VertexVector{Float32, In}, mul::VertexVector{Float32, Out})             m1 = @inbounds SMatrix{4,4,Float32,16}(in1)             m2 = @inbounds SMatrix{4,4,Float32,16}(in2)             mul .= (m1 * m2)[:]         end         function Inverse(input::VertexVector{Float32, In}, inverse::VertexVector{Float32, Out})             m = @inbounds SMatrix{4,4,Float32,16}(input)             inverse .= inv(m)[:]         end           MatMul(mat1, mat2, mul)         Inverse(mul, inverse)           jl_mul = mul         jl_inv = inverse     end     Poplar.detach_devices()   jl_mul = reshape(jl_mul, 4, 4) @show reshape(mat1, 4, 4) * reshape(mat2, 4, 4) ≈ jl_mul @show reshape(jl_inv, 4, 4) ≈ inv(jl_mul) end;

Trying to attach to device 0... Successfully attached to device 0

reshape(mat1, 4, 4) * reshape(mat2, 4, 4) ≈ jl_mul = true reshape(jl_inv, 4, 4) ≈ inv(jl_mul) = true

Stochastic rounding

Real numbers constitue a continuous set \(\mathbb{R}\), but finite precision numbers used in computers are a part of a discrete set \(F\subset\mathbb{R}\). When computers do operations involving floating point numbers in \(F\), the true result \(x\in\mathbb{R}\) will be approximated by a number \(\hat x \in F\), which is typically chosen deterministically to be the nearest number in \(F\): this is called "nearest rounding".

Stochastic rounding is an alternative rounding mode to classic deterministic rounding, which randomly rounds a number \(x\in\mathbb{R}\) to either of the two nearest floating point numbers of the result \([x]\) (previous number in \(F\)) or \([x]\) (following number in \(F\)) with the following rule:

2023-09-07 09 57 30Common choices are \(P(x) = 1/2\)  or, more interestingly,

2023-09-07 09 57 34In the following we'll always talk about the latter probability function \(P(x)\).

2023-09-07 10 07 29(source: "What Is Stochastic Rounding?" by Nick Higham)

Stochastic rounding is useful because the average result of operations matches the expected mathematical result. In a statistical sense, it retains some of the information that is discarded by a deterministic rounding scheme, smoothing out numerical rounding errors due to limited precisions. This is particularly important when using low-precision floating point numbers like Float16. For contrast, deterministic rounding modes like nearest rounding introduce a bias, which is more severe as the precision of the numbers is lower.

The IPU is one of the very few processors available with hardware support for stochastic rounding.

Let's do an exercise on the CPU with classical nearest rounding. We define a function to do the naive sequential sum of a vector of numbers, because the sum function in Julia uses pairwise summation, which would have better accuracy.

naive_sum (generic function with 1 method)

naive_sum(v) = foldl(+, v)

x_sr = fill(Float16(0.9), 1000);
Float16(965.5)

naive_sum(x_sr)
false

naive_sum(x_sr) ≈ x_sr[1] * length(x_sr)

Now let's write an IPU program which computes the sum of x_sr multiple times with stochastic rounding:


sums_sr = let     device = Poplar.get_ipu_device()     target = Poplar.DeviceGetTarget(device)     graph = Poplar.Graph(target)     prog = Poplar.ProgramSequence()     Poplar.PoplarSetStochasticRounding(graph, prog, true, "")       @codelet graph function stochastic_rounding(x::VertexVector{Float16, In}, sums::VertexVector{Float16, Out})         for idx in eachindex(sums)             sums[idx] = naive_sum(x)         end     end   sums_sr = Vector{Float16}(undef, 100_000)     x_ipu =  Poplar.GraphAddConstant(graph, x_sr)     sums_ipu = similar(graph, sums_sr)     add_vertex(graph, prog, stochastic_rounding, x_ipu, sums_ipu)       # Prepare tensors read     Poplar.GraphCreateHostRead(graph, "sums-read", sums_ipu)       # Run the program     engine = Poplar.Engine(graph, prog)     Poplar.EngineLoadAndRun(engine, device)       # Read the output tensors back to the CPU     Poplar.EngineReadTensor(engine, "sums-read", sums_sr)       Poplar.detach_devices()   sums_sr end;

Trying to attach to device 0... Successfully attached to device 0

using Plots: plot, plot!, contour, surface, heatmap, histogram, vline!, plotlyjs

import PlotlyJS

plotlyjs();

2023-09-07 10 23 19-1


using Statistics
(879.0, 919.0)

extrema(sums_sr)
899.90252

mean(Float64.(sums_sr))
4.683982498528412

std(Float64.(sums_sr))
Float16(900.0)

median(sums_sr)
true

all(isapprox(sum(Float64.(x_sr))), sums_sr)

Solving differential equations on the IPU

If we want to use external packages on the IPU, we can look for similar solutions for the GPU. If they have been made generic enough to work with different backends chances are good that they can be used on the IPU as well.

An example is DiffEqGPU.jl, a suite of differential equations solvers part of the SciML ecosystem that can be run on different kinds of GPUs (Nvidia, AMD, Intel, Metal), but it provides basic functionalities that we can reuse on the IPU as well.


using DiffEqGPU, OrdinaryDiffEq

N_lotka_volterra = 20_000;

begin     T_lotka_volterra = Float16     α_lotka_volterra_range = range(T_lotka_volterra(0.1); step=T_lotka_volterra(0.1), length=64)     β_lotka_volterra_range = range(T_lotka_volterra(0.1); step=T_lotka_volterra(0.1), length=23) end;

# Create a range of different input parameters lotka_volterra_parameters = let     α = repeat(α_lotka_volterra_range; inner=23)     β = repeat(β_lotka_volterra_range; outer=64)     γ = repeat([T_lotka_volterra(3)], length(β))     δ = repeat([T_lotka_volterra(1)], length(β))     zip(α, β, γ, δ) |> Iterators.flatten |> collect end;

function lotka_volterra(u, p, t)     α, β, γ, δ = p     🐰, 🦊 = u     d🐰 = α * 🐰 - β * 🐰 * 🦊     d🦊 = -γ * 🦊 + δ * 🐰 * 🦊     return SVector{2}(d🐰, d🦊) end;

function lotka_volterra_cpu(p::Vector, n::Integer, timestep::Real, FT::DataType=Float64) u1 = Vector{FT}(undef, n) u2 = Vector{FT}(undef, n) u0 = @SVector [FT(1); FT(1)]     svp = @inbounds SVector{4, FT}(p)     integ = DiffEqGPU.init(GPUTsit5(), lotka_volterra, false, u0, FT(0), FT(timestep), svp, nothing, CallbackSet(nothing), true, false)   for idx in eachindex(u1, u2)         DiffEqGPU.step!(integ, integ.t + integ.dt, integ.u)         u1[idx] = integ.u[1]         u2[idx] = integ.u[2]     end   return u1, u2 end;

u1_16, u2_16, u1_16_sr, u2_16_sr = let device = Poplar.get_ipu_device() target = Poplar.DeviceGetTarget(device) graph = Poplar.Graph(target)   tile_clock_frequency = Poplar.TargetGetTileClockFrequency(target) num_tiles = Int(Poplar.TargetGetNumTiles(target)) tiles = 0:(num_tiles - 1)   @codelet graph function solve_lv(u1::VertexVector{Float16, Out},                             u2::VertexVector{Float16, Out},                                  p::VertexVector{Float16, In}, cycles::VertexVector{UInt32, Out})     u0 = @SVector [Float16(1); Float16(1)]     svp = @inbounds SVector{4, Float16}(p)     integ = DiffEqGPU.init(GPUTsit5(), lotka_volterra, false, u0, Float16(0), Float16(0.001), svp, nothing, CallbackSet(nothing), true, false)     cycles[begin] = @ipuelapsed for idx in eachindex(u1, u2)         DiffEqGPU.step!(integ, integ.t + integ.dt, integ.u)         u1[idx] = integ.u[1]         u2[idx] = integ.u[2]     end end   FT = T_lotka_volterra len = N_lotka_volterra * (length(lotka_volterra_parameters) ÷ 4) u1_16 = Vector{FT}(undef, len) u2_16 = Vector{FT}(undef, len) u1_16_sr = Vector{FT}(undef, len) u2_16_sr = Vector{FT}(undef, len) cycles = Vector{UInt32}(undef, num_tiles) cycles_sr = Vector{UInt32}(undef, num_tiles)   p_ipu =  Poplar.GraphAddConstant(graph, lotka_volterra_parameters)   # Convenient function for defining the name of a read operation _read(name::String, sr::Bool) = "$(name)-$(sr ? "sr-" : "")read"   for sr in (true, false) prog = Poplar.ProgramSequence() Poplar.PoplarSetStochasticRounding(graph, prog, sr, "")   u1_ipu = similar(graph, sr ? u1_16_sr : u1_16) u2_ipu = similar(graph, sr ? u2_16_sr : u2_16) cycles_ipu = similar(graph, sr ? cycles_sr : cycles)   # Run the codelet defined above on all tiles, # with tensors evenly spread across all cores. add_vertex(graph, prog, tiles, solve_lv, u1_ipu, u2_ipu, p_ipu, cycles_ipu)   # Prepare tensors read Poplar.GraphCreateHostRead(graph, _read("u1", sr), u1_ipu) Poplar.GraphCreateHostRead(graph, _read("u2", sr), u2_ipu) Poplar.GraphCreateHostRead(graph, _read("cycles", sr), cycles_ipu)   # Run the program engine = Poplar.Engine(graph, prog) Poplar.EngineLoadAndRun(engine, device)   # Read the output tensors back to the CPU Poplar.EngineReadTensor(engine, _read("u1", sr), sr ? u1_16_sr : u1_16) Poplar.EngineReadTensor(engine, _read("u2", sr), sr ? u2_16_sr : u2_16) Poplar.EngineReadTensor(engine, _read("cycles", sr), sr ? cycles_sr : cycles) end   Poplar.detach_devices()   time = round(maximum(cycles) / tile_clock_frequency; sigdigits=3) time_sr = round(maximum(cycles_sr) / tile_clock_frequency; sigdigits=3)   @info "Solving $(length(lotka_volterra_parameters) ÷ 4) ODEs on $(num_tiles) tiles took $(time) seconds with NR, and $(time_sr) with SR."   u1_16, u2_16, u1_16_sr, u2_16_sr end;

Trying to attach to device 0... Successfully attached to device 0 Solving 1472 ODEs on 1472 tiles took 0.0145 seconds with NR, and 0.0145 with SR.

2023-09-18 09 53 44

2023-09-07 13 01 09Automatic differentiation

In mathematics, contrary to integration, computing the derivative of an expression is a mechanical process. Wouldn't it be nice to automatically compute the exact derivative of mathematical functions in your code?

Automatic differentiation (autodiff in short) is a set of techniques to automatically compute the derivative of a function in a computer program.

Enzyme is a source code transformation autodiff engine operating at the LLVM level: it analysis the LLVM IR of the source code and takes the derivative of all the instructions, based on the built-in differentiation rules. The fact that it works at the LLVM level means that it's mostly front-end-language-agnostic and can be used to take derivative of source code combining multiple languages or using parallelisation frameworks ("Scalable automatic differentiation of multiple parallel paradigms through compiler augmentation" won Best Student Paper award at SuperComputing 2022). Also, Enzyme generates code at compile-time, it doesn't run itself on the target system: it's the perfect candidate for use in an IPU program!


using Enzyme

Enzyme.Compiler.enzyme_code_llvm(stdout, sin, Active, Tuple{Active{Float64}})

; Function Attrs: alwaysinline nofree nosync readnone define [1 x [1 x double]] @diffejulia_sin_10561mustwrap_inner_1wrap(double %0, double %1) #2 { entry: %2 = call fast double @llvm.cos.f64(double %0) %3 = fmul fast double %2, %1 %.unpack1 = insertvalue [1 x double] zeroinitializer, double %3, 0 %4 = insertvalue [1 x [1 x double]] zeroinitializer, [1 x double] %.unpack1, 0 ret [1 x [1 x double]] %4 }

Rosenbrock function

Evereyone's favourite example function for optimisation problems.

2023-09-07 12 41 33

rosenbrock (generic function with 2 methods)

rosenbrock(x, y=4) = (1 - x) ^ 2 + 100 * (y - x ^ 2) ^ 2

2023-09-07 13 00 282023-09-07 13 00 38Finding the valley is trivial, but converging to the global minimum is hard.

The Rosenbrock function is a polynomial function, easy to compute its gradient:

2023-09-07 13 04 57We can ask Enzyme to compute the partial derivative on the second argument:


Enzyme.Compiler.enzyme_code_llvm(stdout, rosenbrock, Active, Tuple{Const{Float64}, Active{Float64}})

;  @ /home/cceamgi/repo/rsdg-meetings/2023-07-20-julia-ipu/julia-ipu.jl#==#c30b8552-03f4-4f79-b4c0-7179ead69192 within `diffejulia_rosenbrock_11601_inner_1wrap` ; Function Attrs: alwaysinline nofree define [1 x { double }] @diffejulia_rosenbrock_11601_inner_1wrap(double %0, double %1, double %2) #1 { entry:   %thread_ptr = call i8* asm "movq %fs:0, $0", "=r"() #6   %tls_ppgcstack = getelementptr i8, i8* %thread_ptr, i64 -8   %3 = bitcast i8* %tls_ppgcstack to {}****   %tls_pgcstack = load {}***, {}**** %3, align 8   %ptls_field.i2.i = getelementptr inbounds {}**, {}*** %tls_pgcstack, i64 2   %4 = bitcast {}*** %ptls_field.i2.i to i64***   %ptls_load.i34.i = load i64**, i64*** %4, align 8   %5 = getelementptr inbounds i64*, i64** %ptls_load.i34.i, i64 2   %safepoint.i.i = load i64*, i64** %5, align 8   fence syncscope("singlethread") seq_cst ; ┌ @ /home/cceamgi/repo/rsdg-meetings/2023-07-20-julia-ipu/julia-ipu.jl#==#c30b8552-03f4-4f79-b4c0-7179ead69192 within `rosenbrock` @ /home/cceamgi/repo/rsdg-meetings/2023-07-20-julia-ipu/julia-ipu.jl#==#c30b8552-03f4-4f79-b4c0-7179ead69192:1    %6 = load volatile i64, i64* %safepoint.i.i, align 8    fence syncscope("singlethread") seq_cst ; │┌ @ intfuncs.jl:332 within `literal_pow` ; ││┌ @ float.jl:411 within `*`      %7 = fmul double %0, %0 ; │└└ ; │┌ @ float.jl:410 within `-`     %8 = fsub double %1, %7 ; │└ ; │┌ @ intfuncs.jl:332 within `literal_pow` ; ││┌ @ float.jl:411 within `*`      %9 = fmul fast double %2, 2.000000e+02      %10 = fmul fast double %9, %8      fence syncscope("singlethread") seq_cst      %.unpack1 = insertvalue { double } zeroinitializer, double %10, 0      %11 = insertvalue [1 x { double }] zeroinitializer, { double } %.unpack1, 0      ret [1 x { double }] %11 ; └└└ }

Minimising a function with Enzyme

We can run an IPU program to find the minima of the Rosenbrock function on a large grid of different starting points, and plot the number of iterations taken before stopping, when the terminating criteria are met. In particular, we will use the Adam, an optimisation method which requires the gradient of the function to optimise as input, and we will use Enzyme to automatically compute it on the host system at compile time.


rosenbrock(x, y) = (1 - x) ^ 2 + 100 * (y - x ^ 2) ^ 2;

∂(f, x, y) = first(autodiff_deferred(Reverse, f, Active(x), Active(y)));

rosenbrock′(x, y) = ∂(rosenbrock, x, y);

function adam(∂f, x₀::T, y₀::T) where {T}     x, y = x₀, y₀     # Some constants     α = T(0.001) # learning rate     β₁ = T(0.9)     β₂ = T(0.999)     ϵ = T(1e-8)     # Momenta     m = zero(T), zero(T)     v = zero(T), zero(T)     # Stopping criteria     Δ = 10 * eps(T)     δ_x, δ_y = one(T), one(T)     max_t = Int32(500_000)     t = one(max_t)     while abs(δ_x) > Δ < abs(δ_y) && t ≤ max_t       g = ∂f(x, y)       m = β₁ .* m .+ (1 - β₂) .* g       v = β₂ .* v .+ (1 - β₂) .* g .^ 2       m̂ = m ./ (1 - β₁ ^ t)       v̂ = v ./ (1 - β₂ ^ t)       δ_x, δ_y = α .* m̂ ./ (.√(v̂) .+ ϵ)       x -= δ_x       y -= δ_y       t += one(t)     end # Subtract one because at the end of the last iteration we # added 1 to the counter but didn't run the following iteration.     return t - one(t) end;

begin make_grid(x::T, y::T) where {T<:AbstractVector} =     repeat(x; inner=length(y)), repeat(y; outer=length(x))   # High resolution range x_range = y_range = collect(Float32.(range(-6; step=2^-4, length=23 * 8)))   # # Low resolution range # x_range = collect(Float32.(range(-6; step=2^-2, length=23 * 2))) # y_range = collect(Float32.(range(-8; step=2^-2, length=64))) end;

m = let # Inputs to the codelet x₀, y₀ = make_grid(x_range, y_range) # Output of the codelet num_iterations = similar(x₀, Int32)   device = Poplar.get_ipu_device() target = Poplar.DeviceGetTarget(device) graph = Poplar.Graph(target)   num_tiles = Int(Poplar.TargetGetNumTiles(target))   @codelet graph function RosenAdam(x::VertexVector{Float32, In},                                   y::VertexVector{Float32, In},                                   num_iterations::VertexVector{Int32, Out})     for idx in eachindex(num_iterations)         num_iterations[idx] = adam(rosenbrock′, x[idx], y[idx])     end end   x₀_ipu = Poplar.GraphAddConstant(graph, x₀) y₀_ipu = Poplar.GraphAddConstant(graph, y₀) num_iterations_ipu = similar(graph, num_iterations, Int32);   prog = Poplar.ProgramSequence()   add_vertex(graph, prog, 0:(num_tiles - 1), RosenAdam, x₀_ipu, y₀_ipu, num_iterations_ipu)   Poplar.GraphCreateHostRead(graph, "iterations-read", num_iterations_ipu)   engine = Poplar.Engine(graph, prog) elapsed = @elapsed Poplar.EngineLoadAndRun(engine, device) Poplar.EngineReadTensor(engine, "iterations-read", num_iterations)   Poplar.detach_devices()   m = reshape(sqrt.(Int.(num_iterations)), length(y_range), length(x_range))   @info "Loading and running the program on $(length(x₀)) points took $(round(elapsed; sigdigits=4)) seconds"   m end;

Trying to attach to device 0... Successfully attached to device 0 Loading and running the program on 33856 points took 30.06 seconds

2023-09-07 13 16 16


; ModuleID = 'start' source_filename = "start" target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128" target triple = "x86_64-linux-gnu" define void @_Z13_333___Pi_235() local_unnamed_addr #0 !dbg !35 { top:   %0 = call i64 @get_vec_ptr_Pi(i32 0), !dbg !39   %1 = call i32 @get_vec_size_Pi(i32 0), !dbg !39   %2 = call i64 @get_vec_ptr_Pi(i32 1), !dbg !39   %3 = call i32 @get_vec_size_Pi(i32 1), !dbg !39   %4 = call i64 @get_vec_ptr_Pi(i32 2), !dbg !39   %5 = call i32 @get_vec_size_Pi(i32 2), !dbg !39   %6 = call i32 @llvm.colossus.get.scount.l(), !dbg !40   %coercion = inttoptr i64 %0 to i32*, !dbg !49   %pointerref = load i32, i32* %coercion, align 1, !dbg !49, !tbaa !61, !alias.scope !65, !noalias !68   %7 = call fastcc float @julia_pi_kernel_5874(i32 zeroext %pointerref), !dbg !60   %coercion1 = inttoptr i64 %2 to float*, !dbg !73   store float %7, float* %coercion1, align 1, !dbg !73, !tbaa !61, !alias.scope !65, !noalias !68   %8 = call i32 @llvm.colossus.get.scount.l(), !dbg !81   %9 = sub i32 %8, %6, !dbg !83   %coercion2 = inttoptr i64 %4 to i32*, !dbg !87   store i32 %9, i32* %coercion2, align 1, !dbg !87, !tbaa !61, !alias.scope !65, !noalias !68   ret void, !dbg !91 } declare i64 @get_vec_ptr_Pi(i32) local_unnamed_addr declare i32 @get_vec_size_Pi(i32) local_unnamed_addr declare i32 @llvm.colossus.get.scount.l() local_unnamed_addr define internal fastcc float @julia_pi_kernel_5874(i32 zeroext %0) unnamed_addr #0 !dbg !92 { top:   %1 = mul i32 %0, 2917776, !dbg !93   %.not16 = icmp ult i32 %1, -2917775, !dbg !96   %value_phi.v = select i1 %.not16, i32 2917775, i32 -1, !dbg !101   %value_phi = add i32 %1, %value_phi.v, !dbg !101   %.not = icmp ugt i32 %1, %value_phi, !dbg !111   br i1 %.not, label %L78, label %L56, !dbg !95 L56:                                              ; preds = %L56, %top   %value_phi4 = phi i32 [ %9, %L56 ], [ %1, %top ]   %value_phi6 = phi float [ %8, %L56 ], [ 0.000000e+00, %top ]   %2 = uitofp i32 %value_phi4 to float, !dbg !118   %3 = fadd float %2, 0xBFA99999A0000000, !dbg !132   %4 = fmul float %3, 0x3DF0000040000000, !dbg !134   %5 = fmul float %4, %4, !dbg !136   %6 = fadd float %5, 1.000000e+00, !dbg !141   %7 = fdiv float 4.000000e+00, %6, !dbg !144   %8 = fadd float %value_phi6, %7, !dbg !148   %.not15 = icmp eq i32 %value_phi4, %value_phi, !dbg !149   %9 = add i32 %value_phi4, 1, !dbg !151   br i1 %.not15, label %L78, label %L56, !dbg !152 L78:                                              ; preds = %L56, %top   %value_phi10 = phi float [ 0.000000e+00, %top ], [ %8, %L56 ]   ret float %value_phi10, !dbg !153 } attributes #0 = { "probe-stack"="inline-asm" } !llvm.module.flags = !{!0, !1} !llvm.dbg.cu = !{!2, !4, !5, !6, !7, !8, !9, !10, !11, !12, !13, !14, !15, !16, !17, !18, !19, !20, !21, !22, !23, !24, !25, !26, !27, !28, !29, !30, !31, !32, !33} !julia.kernel = !{!34} !0 = !{i32 2, !"Dwarf Version", i32 4} !1 = !{i32 2, !"Debug Info Version", i32 3} !2 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !3, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None) !3 = !DIFile(filename: "julia", directory: ".") !4 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !3, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None) !5 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !3, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None) !6 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !3, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None) !7 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !3, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None) !8 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !3, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None) !9 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !3, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None) !10 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !3, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None) !11 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !3, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None) !12 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !3, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None) !13 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !3, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None) !14 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !3, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None) !15 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !3, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None) !16 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !3, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None) !17 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !3, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None) !18 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !3, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None) !19 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !3, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None) !20 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !3, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None) !21 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !3, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None) !22 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !3, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None) !23 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !3, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None) !24 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !3, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None) !25 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !3, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None) !26 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !3, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None) !27 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !3, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None) !28 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !3, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None) !29 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !3, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None) !30 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !3, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None) !31 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !3, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None) !32 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !3, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None) !33 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !3, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, nameTableKind: None) !34 = !{void ()* @_Z13_333___Pi_235} !35 = distinct !DISubprogram(name: "#333###Pi#235", linkageName: "julia_#333###Pi#235_5871", scope: null, file: !36, line: 68, type: !37, scopeLine: 68, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !2, retainedNodes: !38) !36 = !DIFile(filename: "/home/cceamgi/.julia/packages/IPUToolkit/1u83e/src/compiler/codelet.jl#@#==#892eee7a-73be-4479-8d00-6e1ee8bdd97b", directory: ".") !37 = !DISubroutineType(types: !38) !38 = !{} !39 = !DILocation(line: 69, scope: !35) !40 = !DILocation(line: 45, scope: !41, inlinedAt: !43) !41 = distinct !DISubprogram(name: "get_scount_l;", linkageName: "get_scount_l", scope: !42, file: !42, type: !37, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !2, retainedNodes: !38) !42 = !DIFile(filename: "/home/cceamgi/.julia/packages/IPUToolkit/1u83e/src/compiler/runtime.jl", directory: ".") !43 = !DILocation(line: 100, scope: !44, inlinedAt: !46) !44 = distinct !DISubprogram(name: "macro expansion;", linkageName: "macro expansion", scope: !45, file: !45, type: !37, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !2, retainedNodes: !38) !45 = !DIFile(filename: "/home/cceamgi/.julia/packages/IPUToolkit/1u83e/src/compiler/timing.jl#@#==#892eee7a-73be-4479-8d00-6e1ee8bdd97b", directory: ".") !46 = !DILocation(line: 35, scope: !47, inlinedAt: !39) !47 = distinct !DISubprogram(name: "Pi;", linkageName: "Pi", scope: !48, file: !48, type: !37, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !2, retainedNodes: !38) !48 = !DIFile(filename: "/home/cceamgi/repo/rsdg-meetings/2023-07-20-julia-ipu/julia-ipu.jl#==#892eee7a-73be-4479-8d00-6e1ee8bdd97b", directory: ".") !49 = !DILocation(line: 119, scope: !50, inlinedAt: !52) !50 = distinct !DISubprogram(name: "unsafe_load;", linkageName: "unsafe_load", scope: !51, file: !51, type: !37, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !2, retainedNodes: !38) !51 = !DIFile(filename: "pointer.jl", directory: ".") !52 = !DILocation(line: 46, scope: !53, inlinedAt: !55) !53 = distinct !DISubprogram(name: "getindex;", linkageName: "getindex", scope: !54, file: !54, type: !37, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !2, retainedNodes: !38) !54 = !DIFile(filename: "/home/cceamgi/.julia/packages/IPUToolkit/1u83e/src/compiler/vertices.jl", directory: ".") !55 = !DILocation(line: 1336, scope: !56, inlinedAt: !58) !56 = distinct !DISubprogram(name: "_getindex;", linkageName: "_getindex", scope: !57, file: !57, type: !37, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !2, retainedNodes: !38) !57 = !DIFile(filename: "abstractarray.jl", directory: ".") !58 = !DILocation(line: 1286, scope: !59, inlinedAt: !60) !59 = distinct !DISubprogram(name: "getindex;", linkageName: "getindex", scope: !57, file: !57, type: !37, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !2, retainedNodes: !38) !60 = !DILocation(line: 101, scope: !44, inlinedAt: !46) !61 = !{!62, !62, i64 0} !62 = !{!"jtbaa_data", !63, i64 0} !63 = !{!"jtbaa", !64, i64 0} !64 = !{!"jtbaa"} !65 = !{!66} !66 = !{!"jnoalias_data", !67} !67 = !{!"jnoalias"} !68 = !{!69, !70, !71, !72} !69 = !{!"jnoalias_gcframe", !67} !70 = !{!"jnoalias_stack", !67} !71 = !{!"jnoalias_typemd", !67} !72 = !{!"jnoalias_const", !67} !73 = !DILocation(line: 146, scope: !74, inlinedAt: !75) !74 = distinct !DISubprogram(name: "unsafe_store!;", linkageName: "unsafe_store!", scope: !51, file: !51, type: !37, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !2, retainedNodes: !38) !75 = !DILocation(line: 42, scope: !76, inlinedAt: !77) !76 = distinct !DISubprogram(name: "setindex!;", linkageName: "setindex!", scope: !54, file: !54, type: !37, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !2, retainedNodes: !38) !77 = !DILocation(line: 1419, scope: !78, inlinedAt: !79) !78 = distinct !DISubprogram(name: "_setindex!;", linkageName: "_setindex!", scope: !57, file: !57, type: !37, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !2, retainedNodes: !38) !79 = !DILocation(line: 1389, scope: !80, inlinedAt: !60) !80 = distinct !DISubprogram(name: "setindex!;", linkageName: "setindex!", scope: !57, file: !57, type: !37, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !2, retainedNodes: !38) !81 = !DILocation(line: 45, scope: !41, inlinedAt: !82) !82 = !DILocation(line: 102, scope: !44, inlinedAt: !46) !83 = !DILocation(line: 86, scope: !84, inlinedAt: !86) !84 = distinct !DISubprogram(name: "-;", linkageName: "-", scope: !85, file: !85, type: !37, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !2, retainedNodes: !38) !85 = !DIFile(filename: "int.jl", directory: ".") !86 = !DILocation(line: 103, scope: !44, inlinedAt: !46) !87 = !DILocation(line: 146, scope: !74, inlinedAt: !88) !88 = !DILocation(line: 42, scope: !76, inlinedAt: !89) !89 = !DILocation(line: 1419, scope: !78, inlinedAt: !90) !90 = !DILocation(line: 1389, scope: !80, inlinedAt: !46) !91 = !DILocation(line: 70, scope: !35) !92 = distinct !DISubprogram(name: "pi_kernel", linkageName: "julia_pi_kernel_5874", scope: null, file: !48, line: 21, type: !37, scopeLine: 21, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !4, retainedNodes: !38) !93 = !DILocation(line: 88, scope: !94, inlinedAt: !95) !94 = distinct !DISubprogram(name: "*;", linkageName: "*", scope: !85, file: !85, type: !37, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !4, retainedNodes: !38) !95 = !DILocation(line: 23, scope: !92) !96 = !DILocation(line: 513, scope: !97, inlinedAt: !98) !97 = distinct !DISubprogram(name: "<;", linkageName: "<", scope: !85, file: !85, type: !37, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !4, retainedNodes: !38) !98 = !DILocation(line: 376, scope: !99, inlinedAt: !101) !99 = distinct !DISubprogram(name: ">;", linkageName: ">", scope: !100, file: !100, type: !37, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !4, retainedNodes: !38) !100 = !DIFile(filename: "operators.jl", directory: ".") !101 = !DILocation(line: 343, scope: !102, inlinedAt: !104) !102 = distinct !DISubprogram(name: "steprange_last;", linkageName: "steprange_last", scope: !103, file: !103, type: !37, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !4, retainedNodes: !38) !103 = !DIFile(filename: "range.jl", directory: ".") !104 = !DILocation(line: 325, scope: !105, inlinedAt: !106) !105 = distinct !DISubprogram(name: "StepRange;", linkageName: "StepRange", scope: !103, file: !103, type: !37, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !4, retainedNodes: !38) !106 = !DILocation(line: 379, scope: !105, inlinedAt: !107) !107 = !DILocation(line: 24, scope: !108, inlinedAt: !109) !108 = distinct !DISubprogram(name: "_colon;", linkageName: "_colon", scope: !103, file: !103, type: !37, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !4, retainedNodes: !38) !109 = !DILocation(line: 22, scope: !110, inlinedAt: !95) !110 = distinct !DISubprogram(name: "Colon;", linkageName: "Colon", scope: !103, file: !103, type: !37, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !4, retainedNodes: !38) !111 = !DILocation(line: 38, scope: !112, inlinedAt: !114) !112 = distinct !DISubprogram(name: "&;", linkageName: "&", scope: !113, file: !113, type: !37, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !4, retainedNodes: !38) !113 = !DIFile(filename: "bool.jl", directory: ".") !114 = !DILocation(line: 669, scope: !115, inlinedAt: !116) !115 = distinct !DISubprogram(name: "isempty;", linkageName: "isempty", scope: !103, file: !103, type: !37, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !4, retainedNodes: !38) !116 = !DILocation(line: 897, scope: !117, inlinedAt: !95) !117 = distinct !DISubprogram(name: "iterate;", linkageName: "iterate", scope: !103, file: !103, type: !37, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !4, retainedNodes: !38) !118 = !DILocation(line: 165, scope: !119, inlinedAt: !121) !119 = distinct !DISubprogram(name: "Float32;", linkageName: "Float32", scope: !120, file: !120, type: !37, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !4, retainedNodes: !38) !120 = !DIFile(filename: "float.jl", directory: ".") !121 = !DILocation(line: 7, scope: !122, inlinedAt: !124) !122 = distinct !DISubprogram(name: "convert;", linkageName: "convert", scope: !123, file: !123, type: !37, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !4, retainedNodes: !38) !123 = !DIFile(filename: "number.jl", directory: ".") !124 = !DILocation(line: 370, scope: !125, inlinedAt: !127) !125 = distinct !DISubprogram(name: "_promote;", linkageName: "_promote", scope: !126, file: !126, type: !37, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !4, retainedNodes: !38) !126 = !DIFile(filename: "promotion.jl", directory: ".") !127 = !DILocation(line: 393, scope: !128, inlinedAt: !129) !128 = distinct !DISubprogram(name: "promote;", linkageName: "promote", scope: !126, file: !126, type: !37, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !4, retainedNodes: !38) !129 = !DILocation(line: 422, scope: !130, inlinedAt: !131) !130 = distinct !DISubprogram(name: "+;", linkageName: "+", scope: !126, file: !126, type: !37, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !4, retainedNodes: !38) !131 = !DILocation(line: 25, scope: !92) !132 = !DILocation(line: 410, scope: !133, inlinedAt: !131) !133 = distinct !DISubprogram(name: "-;", linkageName: "-", scope: !120, file: !120, type: !37, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !4, retainedNodes: !38) !134 = !DILocation(line: 411, scope: !135, inlinedAt: !131) !135 = distinct !DISubprogram(name: "*;", linkageName: "*", scope: !120, file: !120, type: !37, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !4, retainedNodes: !38) !136 = !DILocation(line: 411, scope: !135, inlinedAt: !137) !137 = !DILocation(line: 332, scope: !138, inlinedAt: !140) !138 = distinct !DISubprogram(name: "literal_pow;", linkageName: "literal_pow", scope: !139, file: !139, type: !37, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !4, retainedNodes: !38) !139 = !DIFile(filename: "intfuncs.jl", directory: ".") !140 = !DILocation(line: 26, scope: !92) !141 = !DILocation(line: 409, scope: !142, inlinedAt: !143) !142 = distinct !DISubprogram(name: "+;", linkageName: "+", scope: !120, file: !120, type: !37, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !4, retainedNodes: !38) !143 = !DILocation(line: 422, scope: !130, inlinedAt: !140) !144 = !DILocation(line: 412, scope: !145, inlinedAt: !146) !145 = distinct !DISubprogram(name: "/;", linkageName: "/", scope: !120, file: !120, type: !37, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !4, retainedNodes: !38) !146 = !DILocation(line: 425, scope: !147, inlinedAt: !140) !147 = distinct !DISubprogram(name: "/;", linkageName: "/", scope: !126, file: !126, type: !37, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !4, retainedNodes: !38) !148 = !DILocation(line: 409, scope: !142, inlinedAt: !140) !149 = !DILocation(line: 521, scope: !150, inlinedAt: !151) !150 = distinct !DISubprogram(name: "==;", linkageName: "==", scope: !126, file: !126, type: !37, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !4, retainedNodes: !38) !151 = !DILocation(line: 901, scope: !117, inlinedAt: !152) !152 = !DILocation(line: 28, scope: !92) !153 = !DILocation(line: 29, scope: !92) Conclusions 🎉

Conclusions

2023-09-07 14 39 07

What was achieved?

  • We developed the first (to our knowledge) third-party programming model for the IPU with a language not officially supported by Graphcore. This was made possible by the ubiquitous LLVM compiler framework
  • Julia allowed us doing general-purpose programming on the IPU with a high-level language (although while still using low-level Poplar functionalities), outside of the machine learning niche
  • Proposed non-trivial programs using third-party packages (e.g. differential equations solver and zero-runtime-cost autodiff with Enzyme), showcasing possibility of code reuse, much larger than what can be achieved in C++
  • Thanks to Julia's introspection and metaprogramming capabilities we were able to simplify some aspects of writing IPU programs
  • We can use both single and half precision floating point numbers, including stochastic rounding for the latter
  • First use of Graphcore's LLVM fork
  • Overall, we were not concerned about performance throughout this project, the main goal being to make the basic interface between Julia and the IPU work. However, we showed that at least in the specific example of the π program performance of the LLVM IR generated by Julia (for the host CPU!) was competitive with a native C++ codelet.

Limitations and possible future work

  • There are many limitations for the reuse of Julia code on the IPU (no runtime: no compilation, no garbage collection, etc; no external binary libraries), but these aren't specific to IPU, in common with other offloading techniques, like GPU
  • Julia doesn't have native support for 8-bit floating point numbers
  • Only a subset of Poplar/Poplibs libraries have been wrapped
  • We're currently using a light C++ shim to define a vertex ➡ fully define codelet in LLVM IR
  • Targeting Colossus backend is very experimental (sometimes optimisation passes optimise too much) ➡ iron out integration with Colossus backend
  • Couldn't figure out a way to integrate with GPUArrays.jl ➡ explore other programming models (KernelAbstracts.jl, GPUArrays.jl?)
  • Currently no access to tile-level multi-threading ➡ is it possible to express tile-level multi-threading?

Give it a try in the cloud

Head to https://github.com/JuliaIPU/JuliaIpuDemo and follow the instructions in the README.md for how to play with Julia on the IPU in the cloud with Paperspace for free.

Acknowledgements

The work of integrating Julia and the IPU was originally started by Emily Dietrich and Luk Burchard at Simula, I recently picked it up and expanded it.

My work was funded by UCL Centre for Advanced Research Computing. I'd like to thank my colleague Sofía Miñano for fruitful discussions.

I'd also like to thank developers and users of GPUCompiler.jl (Valentin Churavy, Tim "maleadt" Besard, Gabriel Baraldi, Valentin "Sukera" Bogad) for their patience in answering my questions, and Graphcore engineers (Mark Pupilli, Andrew Fitzgibbon, Dave Bozier) for their support along the way, and Simula eX³ for the use of their facilities.