IPU Programmer’s Guide


This programmer’s guide describes the architecture of the Graphcore Intelligence Processing Unit (IPU) range of accelerators. The IPU is a highly parallel processor, specifically designed for machine learning and artificial intelligence applications. The IPU is used as an accelerator for a host computer, which runs code on one or more IPUs.

The first section of this document describes the high-level architecture of the IPU and its interface to the host. This is followed by a description of the graph-based programming model used and an overview of the programming tools available.

The document also includes a Glossary of terms used to describe the architecture and programming model.

IPU hardware overview

The IPU is based on a highly parallel, memory-centric architecture designed to accelerate machine learning applications. It provides very high floating-point performance on mixed-precision floating-point data. There is a large amount of distributed SRAM, allowing fast and flexible access to data.

The IPU is used as an accelerator for a host computer. The host processor can create IPU code to be executed across one or more IPUs. The host computer communicates with the IPUs through the PCI Express (PCIe) interface. This allows the host to offload computation tasks to the IPU, and the IPU to transfer data to and from the host memory.

A single IPU can carry out a wide range of machine learning or artificial intelligence tasks. Multiple IPUs can be used together on a single task. In this case they communicate through custom IPU-Link® interconnect cables.

Host computer with IPU accelerator


The IPU itself is made up of many independent processing units, called tiles. All the tiles are connected to an ultra-fast, all-to-all communication fabric called the exchange. When multiple IPUs are connected together, the exchange fabric extends to all tiles on all of the IPUs.

IPU internal architecture


Tile architecture

Each tile consists of a single processor and its local memory.

Architecture of a single tile


Each tile is a complete processor capable of running independent programs with arbitrary control flow.

The processor supports a fixed number of hardware threads, which are serviced in round-robin order. Memory accesses and most instructions take a single cycle. Code can run in two modes: supervisor or worker. Supervisor code controls the execution of worker tasks, which perform floating point computation.
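The round-robin servicing of hardware threads can be pictured with a simple software model. The sketch below is conceptual only: the number of contexts and the notion of an "issue slot" are illustrative, not taken from the hardware specification.

```cpp
#include <cassert>
#include <vector>

// Toy model of round-robin thread servicing. Only contexts marked
// active receive issue slots; the schedule cycles through the
// contexts in a fixed order, one slot each per round.
// Assumes at least one context is active.
std::vector<int> roundRobin(const std::vector<bool>& active, int slots) {
  std::vector<int> schedule;
  int ctx = 0;
  const int n = static_cast<int>(active.size());
  while (static_cast<int>(schedule.size()) < slots) {
    if (active[ctx]) schedule.push_back(ctx);  // ctx gets this issue slot
    ctx = (ctx + 1) % n;                       // advance in fixed order
  }
  return schedule;
}
```

For example, with contexts 0 and 2 active out of three, the first four slots are issued to contexts 0, 2, 0, 2 in turn.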

The instruction set of the processor has been designed from scratch for machine learning and artificial intelligence. The processor can perform single-precision (32 bit) and half-precision (16 bit) floating point operations. These can be vectorised, enabling up to 32 multiply-accumulate operations per cycle. There is hardware support for common transcendentals, random number generation and stochastic rounding.
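Stochastic rounding rounds a value up or down with probability proportional to its distance from each of the two neighbouring representable values, so that the result is unbiased in expectation. The sketch below illustrates the principle for rounding to an integer; this simplification is ours, and the hardware applies the idea when converting between floating-point formats.

```cpp
#include <cassert>
#include <cmath>

// Stochastic rounding of x to an integer: round up with probability
// equal to the fractional part of x, where u is drawn uniformly
// from [0, 1). In expectation the result equals x, so the rounding
// error is unbiased.
long stochasticRound(double x, double u) {
  const double lo = std::floor(x);
  const double frac = x - lo;          // distance to the lower neighbour
  return static_cast<long>(lo) + (u < frac ? 1 : 0);
}
```

With x = 2.25, the result is 3 for 25% of uniform draws and 2 for the remaining 75%, so the average over many draws is 2.25.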

Memory architecture

In the current IPU, Colossus GC2, each tile has 256 kilobytes of SRAM. This means that an IPU with 1,216 tiles has about 300 MB of memory in total (304 MiB, to be precise).
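The total can be checked directly from the per-tile figure:

```cpp
#include <cassert>

// 1,216 tiles at 256 KiB of SRAM per tile.
constexpr long long kTileBytes  = 256LL * 1024;
constexpr long long kNumTiles   = 1216;
constexpr long long kTotalBytes = kNumTiles * kTileBytes;       // 318,767,104 bytes
constexpr long long kTotalMiB   = kTotalBytes / (1024 * 1024);  // exactly 304 MiB
```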

This local memory is the only memory directly accessible by tile instructions. It is used for both the code and data used by that tile. There is no shared memory access between tiles.

The tile uses a contiguous unsigned 21-bit address space, beginning at address 0x0. In practice, only part of this address space is populated with memory: the available memory starts at address 0x40000 and ends at 0x7FFFF. (A non-zero start address is a simple way to ensure that null pointers are never dereferenced as valid memory.) All invalid addresses are mapped to valid ones by masking off the upper bits (bit 18 and above).
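One plausible reading of this masking rule is that any 21-bit address is folded into the populated range by keeping the low 18 bits and setting bit 18. The sketch below is our illustration of that reading, not a hardware specification.

```cpp
#include <cassert>
#include <cstdint>

// Illustrative mapping of a 21-bit tile address into the populated
// range 0x40000..0x7FFFF: keep the low 18 bits and force bit 18.
// Valid addresses map to themselves.
uint32_t mapAddress(uint32_t addr) {
  return (addr & 0x3FFFF) | 0x40000;
}
```

Under this model, address 0x0 (a null pointer) maps to 0x40000, and every address in the valid range is left unchanged.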

Memory architecture for Colossus GC2


The memory is organised as two regions, each made up of eight 64-bit wide banks. Concurrent accesses can be made to different banks; multiple accesses to the same bank must be sequential.

Region 0 is selected when bit 17 of the address is 0, and addressed with bits [16:3].

Instructions can only be fetched from region 0.

The banks in region 1 are interleaved, with bit 3 of the address selecting 64-bit words from alternating odd and even banks. Interleaving allows two 64-bit aligned addresses to be accessed simultaneously, as long as they are an odd number of words apart. This means, for example, that an instruction can perform a 128-bit load, since this reads two consecutive 64-bit words from two different banks.

All loads and stores must be naturally aligned.
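The address decode described above can be modelled as follows. This is an illustrative model of the text, not a register-level specification.

```cpp
#include <cassert>
#include <cstdint>

// Bit 17 of the address selects the region (0 or 1).
int region(uint32_t addr) { return (addr >> 17) & 1; }

// In region 1 the banks are interleaved: bit 3 of the address
// alternates between the odd and even bank groups for successive
// 64-bit words.
int interleaveGroup(uint32_t addr) { return (addr >> 3) & 1; }

// Two 64-bit aligned addresses an odd number of words apart fall in
// different interleave groups, so they can be accessed simultaneously.
bool canAccessTogether(uint32_t a, uint32_t b) {
  return interleaveGroup(a) != interleaveGroup(b);
}
```

For example, 0x60000 and 0x60008 (one word apart) fall in different groups and can be accessed together, while 0x60000 and 0x60010 (two words apart) fall in the same group and cannot.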

Parity errors

Memory parity errors can occur when data is read from memory; for example, by a load instruction or an instruction fetch. A parity error detected in a fetched instruction prevents the execution of that instruction.

Parity is reset when the device is powered on or reset.

Programming model

The main difference between the IPU and other processors is the parallel execution of code on the tiles. When the IPU runs a program, the tiles all work together in parallel on the task. Each tile can execute different code and operates on data stored in its local memory. Tiles can exchange data at the end of each computation task; all tiles must synchronise to do this communication.

Typical IPU applications operate on variables that are large multi-dimensional arrays of data (tensors); these can be distributed across the tiles, with parts of each variable stored on different tiles.
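A simple way to picture this distribution is a near-even contiguous split of a variable's elements across tiles. Poplar's actual tile mappings are more sophisticated; the sketch below is purely illustrative.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Split n elements across numTiles tiles as evenly as possible:
// the first (n % numTiles) tiles receive one extra element.
// Returns a [begin, end) element range for each tile.
std::vector<std::pair<size_t, size_t>> splitAcrossTiles(size_t n,
                                                        size_t numTiles) {
  std::vector<std::pair<size_t, size_t>> ranges;
  const size_t base = n / numTiles, extra = n % numTiles;
  size_t begin = 0;
  for (size_t t = 0; t < numTiles; ++t) {
    const size_t len = base + (t < extra ? 1 : 0);
    ranges.push_back({begin, begin + len});
    begin += len;
  }
  return ranges;
}
```

For example, a 10-element variable split across 4 tiles gives ranges of 3, 3, 2 and 2 elements.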

Graph representation

When it executes, an IPU program reads data from one or more tensors and writes its results back to other tensors. The tensors can be processed by multiple tiles, each operating on the elements of the tensors stored locally.

This computation can be represented as a graph where a vertex represents the code executed by a tile. The edges are the data operated on by the vertex. If the data is in the memory of the tile executing the vertex code, then these edges represent reads and writes to local memory. If there are variables stored on another tile, then the edges represent communication through the exchange fabric.

Graph representation of variables and processing


The function performed by a vertex can be anything from a simple arithmetic operation to reshaping/transposing tensor data, or performing an N-dimensional convolution.

As an example, here’s a simple vertex that adds two floating-point inputs, x and y, with the sum available as the output sum. The inputs and outputs could be connected to other vertices in a larger graph that is performing more complex computation.

A vertex that adds two numbers


Code to implement this vertex is shown in Writing vertex code.

The Poplar framework includes a large number of predefined functions that implement common operations on tensors. See the Poplar and Poplibs User Guide for more information on creating graphs and writing vertex code.

Executing code in parallel

A graph can be distributed across the IPU so that vertices are executed in parallel. Each tile executes one or more vertices, operating on data stored locally, and then communicates results to other tiles.

The set of vertices that execute in parallel is known as a compute set. The order in which steps are executed is defined by a control program which is loaded onto every tile. The resulting computation graph can be visualised as shown below.

Execution of compute sets


The IPU uses the bulk-synchronous parallel (BSP) model of execution, where the execution of a task is split into steps. Each step consists of the following phases:

  • local compute
  • global synchronisation
  • data exchange

In the compute phase, all tiles execute in parallel, operating on their local data. After each tile finishes executing, it enters the synchronisation state. When all tiles have reached the synchronisation state, the IPU enters the exchange phase where data is copied between the tiles.

Phases of task execution


After the exchange phase, the process repeats: the tiles move into a new compute phase, performing computation using their local data and the new data received during the exchange.

The program continues by executing a series of such steps, alternating between exchange and compute phases. Viewed as a timeline, we can see that each tile repeatedly performs the sequence of sync, exchange and compute, as shown in the diagram below.

Sync, exchange and compute activity across tiles


Each step occurs in parallel across all tiles, but the whole IPU can be viewed as executing a sequence of steps in a serial fashion, each consisting of the sync, exchange and compute phases.

Execution activity of all tiles in IPU


To determine the order of steps to be executed, each tile contains a control program. This is loaded onto every tile, by the host, to control the execution of compute and exchange sequences.

The synchronisation and exchange phases are normally handled implicitly by the Poplar library functions, so you do not need to specify them explicitly.

Host interface

The host starts by loading the code and initial data onto the IPU. This can include multiple control programs.

Loading programs on to the IPU


Once the program is deployed, all the code and data structures required to run the program reside in the IPU’s distributed memory. The CPU can then instruct the accelerator to run one of the control programs in order to execute the appropriate vertices.

Selecting a control program to run


During the program’s execution, the accelerator can read and write data from host memory. For example, the program can issue instructions to pull items of training data onto the processor when training a neural network model (the model itself is already resident on the processor).

Programming tools

The Poplar framework is shown below. This supports programming the IPU at multiple levels from high-level machine-learning frameworks to C++ and assembly language.

Software development infrastructure


At the highest level of abstraction, you can use standard ML frameworks such as TensorFlow, ONNX, PyTorch and Keras to generate code to run on the IPU. This allows existing code to be run largely unchanged. You can then optimise the code to take advantage of the parallel execution and other features of the IPU.

Graphcore provides an implementation of TensorFlow for the IPU. The PopART library enables models in ONNX format to be imported and run on the IPU.

For more information, refer to the TensorFlow and PopART user guides.

Poplar libraries

These high-level frameworks call the functions defined in the Poplar and Poplibs libraries. You can also call these directly from your code to implement algorithms on the IPU.

The Poplar library contains the functions required for creating, executing and profiling graphs on IPUs. The Poplibs library contains many predefined functions such as linear algebra operations, element-wise tensor operations, non-linearities and reductions.

For more information, refer to the Poplar and Poplibs User Guide.

Writing vertex code

Finally, it is possible to write your own code that runs directly on the IPU as a codelet. A codelet is a multi-input, multi-output function which defines the operations performed by a vertex.

For example, the Adder vertex described earlier (see A vertex that adds two numbers) can be implemented in C++ as shown below:

#include <poplar/Vertex.hpp>

using namespace poplar;

class AdderVertex : public Vertex {
public:
  Input<float> x;
  Input<float> y;
  Output<float> sum;

  bool compute() {
    *sum = x + y;
    return true;
  }
};
Note that the output of the vertex function is defined by the Output member, rather than the return value of the compute function. A vertex can have more than one output.

Writing codelets in C++ is documented in the Poplar and Poplibs User Guide.

You can also write vertices in assembly code. See the Vertex Assembly Programming Guide for more information.



Glossary

Bulk-synchronous parallel (BSP)
A programming methodology for parallel algorithms. See: Leslie G. Valiant. 1990. A bridging model for parallel computation. Commun. ACM 33, 8 (August 1990), 103-111. DOI=10.1145/79173.79181
Codelet
A multi-input, multi-output function which defines the state and behaviour of vertices.
Compute set
A set of vertices that are executed in parallel during the BSP compute phase.
Exchange phase
Communication phase of a step, where data is communicated between tiles and between IPUs.
Exchange fabric
The communication network used to transfer data between tiles and external interfaces.
External exchange
An exchange phase where data is communicated between tiles on different IPUs, or between an IPU and the host processor.
Half precision
A 16-bit floating-point value.
IPU
Intelligence Processing Unit.
ONNX
The Open Neural Network Exchange (ONNX) is an open format for representing machine learning and AI models.
Poplar
A library of functions for creating and deploying parallel programs on IPUs.
PopART
The Poplar advanced run-time (PopART) provides support for importing, creating and running ONNX graphs on the IPU.
Single precision
A 32-bit floating-point value.
Step
A sequence of phases consisting of: local computation, system-wide synchronisation and global communication (exchange phase).
Supervisor
The code responsible for initiating worker threads and performing the exchange and synchronisation phases of a step. Supervisor code cannot perform floating point operations.
Synchronisation (sync)
A system-wide synchronisation; the first phase in a step, following which it is safe to perform an exchange phase. Synchronisation can be internal (between all of the tiles on a single IPU) or external (between all tiles on every IPU).
Tensor
A tensor is a variable that contains a multi-dimensional array of values. In the IPU, the storage of a tensor can be distributed across the tiles.
TensorFlow
An open-source library of high-level functions for machine learning and AI, created by Google.
Tile
An individual processor core in the IPU consisting of a processing unit and memory. All tiles are connected to the exchange fabric.
Vertex
Code that forms a unit of computation in the graph. Vertices have inputs and outputs that are connected to tensors, and a compute function to process the tensor data. Each vertex is stored and executed on a single tile.
Worker
Code that can perform floating point operations and is typically responsible for performing the compute phase of a step. A tile has hardware support for multiple worker contexts.