Poplar and Poplibs User Guide


Poplar™ is a graph-programming framework for the Graphcore Intelligence Processing Unit (IPU), a new type of processor aimed at artificial intelligence and machine learning applications. An overview of the IPU architecture and programming model can be found in the IPU Programmer’s Guide. You should familiarize yourself with this document before reading this guide.

The Poplar SDK includes tools and libraries to support programming the IPU. The Poplar SDK libraries provide a C++ interface. Poplar also supports industry-standard machine learning frameworks such as TensorFlow, MXNet, ONNX, Keras, and PyTorch, which can be accessed from Python.

There are a number of example programs included with the SDK in the examples directory of the Poplar installation. Further examples and benchmarks are available on the Graphcore GitHub (contact your support representative to get access).

The Poplar library provides classes and functions to implement and deploy parallel programs on the IPU. It uses a graph-based method to describe programs and, although it can be used to describe arbitrary accelerated computation, it has been designed specifically to suit the needs of artificial intelligence and machine learning applications.

The Poplibs libraries are a set of application libraries that implement operations commonly required by machine learning applications, such as linear algebra operations, elementwise tensor operations, non-linearities and reductions. These provide a fast and easy way to create programs that run efficiently using the parallelism of the IPU.

There are several command line tools to manage the IPU hardware. These are described in the “Getting Started” guide for your IPU system and the IPU Command Line Tools document.

Programming with Poplar

You can use Poplar library functions to define graph operations and control the execution and profiling of code on the IPU.

Code can be compiled to run on IPU hardware, a simulated IPU Model or the host CPU. Running on an IPU Model or the CPU is useful when you do not have access to IPU hardware.

The IPU Model is a simulation of the behaviour of the IPU hardware. It does not completely implement every aspect of a real IPU. For example, the IPU Model does not fully support replicated graphs (see Replicated graphs).

If you encounter an out of memory error, it may be useful to run on the IPU Model device to debug the problem.

Consider the situation in which the event trace is being used to investigate a graph that creates a tile memory imbalance. In this case, running on the IPU will lead to an out of memory exception before the report is generated. Running on the IPU Model instead of actual hardware will still run out of memory, but the code will run to completion so the report can be generated.

Code running on the CPU will be faster than the IPU Model, because it does not have the overhead of modelling the IPU. CPU code runs with a single worker thread as if on a single tile. This means you do not need to think about tile allocation or the limited tile memory when initially developing your code.

If you want to profile your code, you will need to run on either IPU hardware or the IPU Model.

Poplar programming model

For a more detailed introduction to the IPU architecture and programming model, see the IPU Programmer’s Guide.

A Poplar computation graph defines the input/output relationship between variables and operations. Each variable is a multi-dimensional tensor of typed values and can be distributed across multiple tiles.

Graph representation of variables and processing


The vertices of the graph are the code executed in parallel by the tiles. Each tile executes a sequence of steps; each step runs a compute set, which contains one or more vertices.

The edges of the graph define the data that is read and written by the vertices. Each tile only has direct access to the tensor elements that are stored locally.

Each vertex always reads and writes the same tensor elements. In other words, the connections defined by the execution graph are static and cannot be changed at run time. However, the host program can calculate the mapping and graph connectivity at run time when it constructs the execution graph. See Tutorial 7: matrix-vector multiplication optimisation for an example.

The placement of vertices and tensor elements onto tiles is known as the tile mapping.

Mapping tensors and vertices to tiles


The structure of a Poplar program

A Poplar program performs the following tasks:

  • Find or create the target device type as a Device representing physical IPU hardware, a simulated IPUModel or code running on the CPU.
  • Create a Graph object which will define the connections between computation operations and data, and how they are mapped onto the IPUs.
  • Create one or more Program objects which will control the execution of the graph operations.
  • Define the computations to be performed and add them to the Graph and Program objects. You can use the functions defined in Poplar and Poplibs, or you can write your own device code.
  • Create an Engine object, which represents a session on the target device, using the Graph and Program objects.
  • Connect input and output streams to the Engine object, to allow data to be transferred to and from the host.
  • Execute the computation with the Engine object. This will compile your graph code and load it onto the IPU, along with any library functions required, and start execution.

A program object can be constructed by combining other program objects in various ways. For example, Poplar provides several standard Program sub-classes such as Sequence, which executes a sequence of sub-programs, Repeat for executing loops, and If for conditional execution. The Poplar and Poplibs libraries also include programs for a wide range of operations on tensor data.
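
As a sketch of how these sub-classes compose (assuming a graph and a compute set `cs` have already been constructed; the function name is illustrative):

```cpp
#include <poplar/Graph.hpp>
#include <poplar/Program.hpp>

using namespace poplar;
using namespace poplar::program;

// Illustrative only: build a program that runs a compute set ten times.
Program makeLoop(ComputeSet cs) {
  Sequence body;
  body.add(Execute(cs));     // run all vertices in the compute set
  return Repeat(10, body);   // repeat the sequence a fixed number of times
}
```

The returned Program can itself be added to a larger Sequence, which is how complex control flow is built up from simple parts.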

For more detailed descriptions and examples of each of these steps, see the Tutorials.

What happens at run time

When you run your program on the host, the Poplar run-time will compile your graph to create object code for each tile. The code may come from Poplar or Poplibs library functions, or from vertex code you write yourself (see Device code), and will be linked with any required libraries.

This object code will contain:

  1. The control-program code from your graph
  2. Code to manage exchange sequences
  3. Initialised vertex data
  4. The tensor data mapped to that tile

The host program will load the object code onto the target device, which is then ready to execute the program.

Virtual graphs

A graph is created for a target device with a specific number of tiles. It is possible to create a new graph from that, which is a virtual graph for a subset of the tiles. This is effectively a new view onto the parent graph for a virtual target, which has a subset of the real target’s tiles and can be treated like a new graph. You can add vertices and tensors to the virtual sub-graphs. These will also appear in the parent graph.

Any change made to the parent graph, such as adding variables or vertices, may also affect the virtual sub-graph. For example, a variable added to the parent graph will appear in the sub-graph if it is mapped to tiles that are within the subset of tiles in the virtual target.

Virtual graphs can be used to manage the assignment of operations to a subset of the available tiles. This can be used, for example, to implement a pipeline of operations by creating a virtual graph for each stage of the pipeline and adding the operations to be performed on those tiles.

Mapping a pipeline of operations to tiles using virtual graphs


There are several versions of the createVirtualGraph function, which provide different ways of selecting the subset of tiles to include in the virtual target.
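
For example, a two-stage pipeline split might look like this (a sketch; the tile ranges are illustrative and `graph` is assumed to exist):

```cpp
// Sketch: split the tiles of an existing graph into two virtual graphs,
// one per pipeline stage. Tile ranges are illustrative.
unsigned numTiles = graph.getTarget().getNumTiles();
Graph stage0 = graph.createVirtualGraph(0, numTiles / 2);
Graph stage1 = graph.createVirtualGraph(numTiles / 2, numTiles);
// Vertices and tensors added to stage0 or stage1 are confined to that
// stage's tiles, and also appear in the parent graph.
```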

Replicated graphs

You can also create a replicated graph. This effectively creates a number of identical copies, or replicas, of the same graph. Each replica targets a different subset of the available tiles (all subsets are the same size). This may be useful, for example, where the target consists of multiple IPUs and you want to create a replica to run on each IPU (or group of IPUs) in parallel.

Any change made to the replicated graph, such as adding variables or vertices, will affect all the replicas. A variable mapped to tile 0, for example, will have an instance on tile 0 in each of the replicas.

Replicated graphs can be created in two ways:

  • Splitting an existing graph into a number of replicas with the createReplicatedGraph function (see Replicating an existing graph).

  • Creating a new replicated top-level graph by passing a replication factor to the Graph constructor (see Creating a replicated graph).

    Note: Replicated graphs created in this way are not supported when running on an IPU Model.

As an example, imagine you have a graph which targets two IPUs. You can run four copies of it, in parallel, on eight of the IPUs in your system by creating the two-IPU graph and replicating it four times. This can be done using either of the techniques above, each of which has advantages and disadvantages, summarised in the following descriptions.

Replicating an existing graph


We can start by creating a graph for eight IPUs, and then creating a replicated graph from that:

 // Create a graph for 'target' which has 8 IPUs
Graph g  = Graph(target);
// Create 4 replicas each of which targets 2 IPUs
Graph rg = g.createReplicatedGraph(4);

Any changes, such as adding code or variables, made to the replica rg will be duplicated over all four replicas.

However, you can still do things with the original “parent” graph g that do not affect all the replicas. For example, a variable or an operation can be added to the parent graph and mapped to only one IPU. This will only be present on the replica that targets that IPU. It is also possible to access a variable that exists on all the replicas as a single tensor, using the getNonReplicatedTensor function. This adds an extra dimension to the variable to represent the mapping across the replicas.

This approach provides more flexibility but means that the graph of each replica needs to be compiled separately. This can make it slower to build the program.

Creating a replicated graph


In this case, we start by creating a replicated graph using the graph constructor:

 // Create a graph with 4 replicas, each targeting 2 IPUs
Graph rg = Graph(target, replication_factor(4));

We can add variables and vertices to this graph as usual. These additions will be applied to every replica. This graph only exists as a replica, with no parent graph that can be used to make modifications differently to each replica. Therefore, as all the replicas are guaranteed to be identical, the graph only needs to be compiled once. Copies of the object code are then loaded onto each of the pairs of IPUs when the program runs. Each instance of the replica is given a unique ID at load time; this can be used to identify it in functions such as crossReplicaCopy.

Any functions that rely on the existence of a parent, such as getTopLevelGraph or getNonReplicatedTensor, will fail.

Data streams and remote buffers

Memory external to the IPU can be accessed in two ways. Data streams enable the IPU to transfer data to and from host memory. Remote buffers enable the IPU to store data in external (off-chip) memory.

Data streams

Data streams are used for communication between the host and the IPU device. The data transfers are controlled by the IPU.

Each stream is a unidirectional communication from the host to the device, or from the device to the host. A stream is defined to transfer a specific number of elements of a given type. This means the buffer storage required by the stream is known (the size of the data elements times the number of elements).

The Poplar graph compiler will merge multiple stream transfers into a single transfer (up to the limits described in Stream buffer size limit).

Device-side streams

A stream object, represented by the DataStream class, is created and added to a graph using the addHostToDeviceFIFO or addDeviceToHostFIFO functions. The stream is defined to have:

  • A name for the stream
  • The type of data to be transferred
  • The number of elements to be transferred

A host-to-device stream can also have a replication mode, if it is connected to a replicated graph. This defines whether a single stream will send the same data to all the replicated graphs (broadcast mode) or there will be a stream per replica.

Stream data transfer is done with a Copy program which copies data from the stream to a tensor, or from a tensor to the stream.
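
A minimal sketch of defining streams and the copies that use them (the stream names, tensor names and sizes are illustrative; `graph`, `in` and `out` are assumed to exist):

```cpp
// Sketch: one stream in each direction, transferring 40 floats.
// 'in' and 'out' are 40-element FLOAT tensors already mapped to tiles.
DataStream inStream  = graph.addHostToDeviceFIFO("input-stream", FLOAT, 40);
DataStream outStream = graph.addDeviceToHostFIFO("output-stream", FLOAT, 40);

program::Sequence prog;
prog.add(program::Copy(inStream, in));    // stream -> tensor
// ... computation on 'in', producing 'out' ...
prog.add(program::Copy(out, outStream));  // tensor -> stream
```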

Host-side stream access

On the host side, a data stream is connected to a buffer allocated in memory. The buffer is connected to the stream using the connectStream function of an Engine object. This can, optionally, be implemented as a circular buffer to support more flexible transfers.

In order to synchronise with the data transfers from the IPU, a callback is connected to the stream using the Engine::connectStreamToCallback function. Callback implementations are derived from the StreamCallback interface; their methods receive a pointer to the stream buffer as an argument.

  • For a device-to-host transfer, the callback function will be called when the transfer is complete so that the host can read the data from the buffer.
  • For a host-to-device stream, by default the callback function will be called immediately before the IPU transfers the buffer contents to device memory. The host-side code should populate the stream buffer and then return.
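
A sketch of a host-to-device stream callback follows. The method set shown here follows the StreamCallback interface, but treat the exact signatures, and the stream name in the final comment, as assumptions to be checked against the API reference:

```cpp
// Sketch of a host-to-device stream callback (signatures are assumptions).
class FillCallback : public poplar::StreamCallback {
public:
  using Result = poplar::StreamCallback::Result;

  Result prefetch(void *p) override {
    // Called early when prefetch is enabled; return NotAvailable if the
    // data is not ready yet.
    fill(p);
    return Result::Success;
  }
  void fetch(void *p) override { fill(p); }  // called just before transfer
  void complete() override {}                // transfer has finished
  void invalidatePrefetched() override {}    // discard prefetched data
private:
  void fill(void *p) { /* write the stream's elements into p */ }
};

// Connect it to a stream by name (name is illustrative):
// engine.connectStreamToCallback("my-stream", std::make_unique<FillCallback>());
```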

Optimising host data transfers

There are several things you can do to optimise the use of data streams to and from the host. These are described below.


You can specify that the IPU should call the callback function as early as possible (for example, immediately after it releases the stream buffer from a previous transfer). The host is then able to fill the buffer in advance of the transfer, meaning the IPU spends less time waiting for the host.

This mode of operation, known as prefetch, is enabled by setting the exchange.enablePrefetch option to “true” when the engine object is created.

Prefetch is only possible if the address range of the stream’s data buffer does not overlap with another stream’s buffer (buffers may be overlapped to optimise memory use).

This means that the engine option exchange.streamBufferOverlap must be set to either “HostRearrangeOnly” or “None”. The first of these is most useful as the performance of streams that are being rearranged is often less important. Setting the option to “None” may use too much memory.

The callback function returns a value that indicates if the buffer was filled.

If there is data available to fill the buffer, the callback function should return Result::Success. The device code will then call the complete callback when it has transferred the data.

Otherwise, if data is not available (either because it is the end of the stream, or the data is not ready yet), then the callback returns Result::NotAvailable.

Sync configurations

In a multi-IPU system, synchronisation (sync) signals are used to ensure that IPUs are ready to exchange data and that data exchange is complete. These sync signals are also used to synchronise host transfers and access to remote buffers.

Each IPU can be allocated to one or more “sync groups”. At a synchronisation point, all the IPUs in a sync group will wait until all the other IPUs in the group are ready.

Sync groups can be used to allow subsets of IPUs to overlap their operations. For example, one sync group can be performing data transfers to or from the host, while another group is processing a previous batch of data.

You can configure the sync groups as appropriate for your application. The allocation of IPUs to the sync groups (GS1 and GS2) can be configured using the syncConfiguration option when creating a target.

The options are:

  • intraReplicaAndAll:
    • GS1 is used for synchronisation between the IPUs in each replica of a replicated graph (or all IPUs if there is no replication).
    • GS2 is used for synchronisation between all IPUs.
  • ipuAndAll:
    • GS1 is used for synchronisation of each IPU individually.
    • GS2 is used for synchronisation between all IPUs.
  • intraReplicaAndLadder:
    • GS1 is used for synchronisation between the IPUs in each replica of a replicated graph (or all IPUs if there is no replication).
    • GS2 is used by two independent subsets of IPUs. These can then synchronise independently of one another, so that they can alternate between one set doing host I/O, for example, while the other is computing.

The way in which Poplar uses these sync groups is summarised below, for each value of syncConfiguration, depending on whether target.syncReplicasIndependently is set:

  • intraReplicaAndAll (default) and intraReplicaAndLadder:
    • GS1, syncReplicasIndependently false (default): communication between IPUs within each replica (or all IPUs if the graph is not replicated); remote buffer access.
    • GS1, syncReplicasIndependently true: communication between IPUs within each replica (or all IPUs if the graph is not replicated); remote buffer access; host communication.
    • GS2, syncReplicasIndependently false: communication between replicas (all IPUs); host communication.
    • GS2, syncReplicasIndependently true: communication between replicas (all IPUs).
  • ipuAndAll:
    • GS1, syncReplicasIndependently false: remote buffer access.
    • GS1, syncReplicasIndependently true: remote buffer access; host communication.
    • GS2, syncReplicasIndependently false: communication between all IPUs; host communication.
    • GS2, syncReplicasIndependently true: communication between all IPUs.
Software sync

Software sync provides a third synchronisation mechanism that can replace the hardware sync that happens after a host exchange. Software sync is disabled by default. You can enable it by setting the option opt.enableSwSyncs to true when creating the engine object.

With software sync enabled, each IPU synchronises with the host independently. This means that each IPU can move onto the next operation as soon as its host data transfer is complete, instead of having to wait for all the other IPUs to finish.

If two IPUs don’t need to synchronise then they can operate in parallel, completely independently. For example, this allows one to do I/O while the other is computing. But this applies more generally: each IPU can do an arbitrary sequence of compute and I/O operations without needing to synchronise with the other IPU until they need to communicate with one another.


If you use software sync then the default sync configuration (intraReplicaAndAll) must be used and the target.syncReplicasIndependently option must not be set.

Remote memory buffers

The IPU can also access off-chip memory as a remote buffer. This may be host memory or memory associated with the IPU system. This is not used for transferring data to the host, but just for data storage by the IPU program.

A RemoteBuffer object is created and added to the graph with the addRemoteBuffer function of the graph object. Data transfers to and from the remote buffer are performed using a Copy program which copies data from the buffer to a tensor, or from a tensor to the buffer.
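
A sketch of the pattern (the buffer name and size are illustrative, and `graph` and a 1024-element FLOAT tensor `t` are assumed to exist):

```cpp
// Sketch: stage a tensor in off-chip memory via a remote buffer.
RemoteBuffer rb = graph.addRemoteBuffer("spill", FLOAT, 1024);

program::Sequence prog;
prog.add(program::Copy(t, rb));  // tensor -> remote buffer
prog.add(program::Copy(rb, t));  // remote buffer -> tensor
```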

Stream buffer size limit

The IPU has a memory address translation table which defines the external memory address range it can access. As a result, there is a maximum buffer size for data transferred by a stream. This limit is currently 128 MBytes per stream copy operation. More data can be transferred by a sequence of copies, separated by sync operations, so that the buffer memory can be reused for each transfer.

Each IPU has its own translation table. So, if there are multiple IPUs, this limit applies to each IPU individually.

Device code

Each vertex of the graph is associated with some device code. This can come from a library function or you can write your own as a codelet. Codelets are specified as a class that inherits from the poplar::Vertex type. For example:

 #include <poplar/Vertex.hpp>

using namespace poplar;

class AdderVertex : public Vertex {
public:
  Input<float> x;
  Input<float> y;
  Output<float> sum;

  bool compute() {
    *sum = x + y;
    return true;
  }
};
The Input and Output fields connect the vertex to the tensor data that it reads and writes. An Input field should not be written and an Output field should not be read; the results are undefined. If you need a field that is read and written, then it should be defined as InOut.

These fields have begin, end, operator[] and operator* methods so they can be iterated over and accessed like other C++ containers. For Input fields all of these methods are const.
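
For example, a vertex with a vector input can iterate over it like a container (a sketch; the vertex and field names are illustrative):

```cpp
#include <poplar/Vertex.hpp>

using namespace poplar;

// Sketch: sum a vector input into a scalar output.
class SumVertex : public Vertex {
public:
  Input<Vector<float>> in;  // iteration methods on Input are const
  Output<float> sum;

  bool compute() {
    float total = 0;
    for (float v : in)      // begin()/end() support range-for
      total += v;
    *sum = total;
    return true;
  }
};
```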

The Output field can be successfully updated even if the corresponding tensor is on another tile. This is because the data is not transferred to the destination tile until the compute is complete. However, reading an Output field is not guaranteed to return the expected value. If you need to both write to and read from a field, then it should be declared as an InOut type.

The types used in vertex code are described in the runtime API section of the Poplar and Poplibs API Reference.

You can add a codelet to your graph by using the Graph::addCodelets function. This will load the source file and compile the codelet when the host program runs. See the adder example provided with the Poplar distribution.

You can also pass compilation options (for example “-O3”). The code is compiled for both the host and for the IPU so the program can be run on IPU hardware or on the host.

There are a couple of predefined macros that may be useful when writing vertex code. __POPC__ is defined when code is compiled by the codelet compiler. The macro __IPU__ is defined when code is being compiled for the IPU (rather than the host).
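
A sketch of how these macros are typically used in a codelet source file:

```cpp
// Sketch: conditional compilation in a codelet.
#ifdef __POPC__
// Seen whenever the file is compiled by popc (both host and IPU passes).
#endif

#ifdef __IPU__
// IPU-only code, for example a hand-tuned inner loop.
#else
// Host fallback used when the codelet runs on the CPU.
#endif
```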

You can also write codelets in assembly language for the IPU. See the Vertex Assembly Programming Guide for more information. You might find that document useful even if you are not programming in assembly, as it contains a lot of information about calling conventions, memory use and the implementation of various data structures.

Pre-compiling codelets

There is a command line tool to pre-compile codelets. This reduces loading time, and allows you to check for errors before running the host program.

The codelet compiler, popc, takes your source code as input and creates a graph program object file (conventionally, with a .gp file extension). For example:

 $ popc codelets.cpp -o codelets.gp

This object file can be added to your graph in the same way as source codelets, using the same Graph::addCodelets function. See the adder_popc example provided with the Poplar distribution.

The general form of the popc command is:

 $ popc [options] <input file> -o <output file>

The command takes several command line options. Most are similar to any other C compiler. For example:

-D<macro> Add a macro definition
-I<path> Add a directory to the include search path
-g Enable debugging
-On Set the optimization level (n = 0 to 3)

For a full list of options, use the --help option.

Using the Poplar library

The Poplar library provides classes and functions to implement graph programs. These can be accessed by including the appropriate header files. For example:

 #include <poplar/Graph.hpp>
#include <poplar/Program.hpp>
#include <poplar/Engine.hpp>

using namespace poplar;

You do not need any special command line tools to compile Poplar programs. You just use the standard host C++ compiler and link with the Poplar library, as shown below:

 $ g++ -std=c++11 my-program.cpp -lpoplar

The header files are in the include/poplar directory of the Poplar installation. The library files are in the lib directory.

The main classes defined in the Poplar library are summarised below.

  • Graph: The class used to create a graph structure defining the connections between tensors and operations.
  • Device: Represents a physical device or simulation that will execute a graph.
  • Tensor: A class for representing and operating on tensors.
  • Type: Represents data types on the target device (to distinguish them from types on the host). These include:
    • INT: 32-bit integer
    • SHORT: 16-bit integer
    • CHAR: 8-bit integer (signed by default)
    • FLOAT: IEEE 32-bit floating point
    • HALF: IEEE 16-bit floating point
    • BOOL: Boolean value (stored as one byte)
  • Program: The base class for creating control programs that define how vertices will be executed. Complex control programs can be built up by combining sub-programs in various ways. The sub-classes for creating and combining programs include:
    • Execute: The basic class for creating a program from a compute set
    • Sequence: Executes a sequence of sub-programs sequentially
    • Repeat: Execute a sub-program a fixed number of times
    • If: Conditionally execute a sub-program
  • Engine: From a graph and one or more control programs, creates an object that can be used to execute the graph on a device.

For full details of all the classes and functions in the Poplar library, see the Poplar and Poplibs API Reference.

The Poplibs libraries

The Poplibs libraries provide application-level functions that can be used in programs for the IPU. The available libraries are listed in the table below.

Library    Description
poputil    General utility functions for building graphs
popops     Functions for operations on tensors in control programs (elementwise functions and reductions)
poplin     Linear algebra functions (matrix multiplications, convolutions)
poprand    Functions for populating tensors with random numbers
popnn      Functions used in neural networks (for example, non-linearities, pooling and loss functions)
popsolver  Model solving functions

Examples of using the library functions can be found in the Tutorials.

For details of all the functions in the Poplibs libraries, see the Poplar and Poplibs API Reference.

Using Poplibs

The Poplibs libraries are in the lib directory of the Poplar installation. Each library has its own include directory and library object file. For example, the include files for the popops library are in the include/popops directory:

 #include <popops/ElementWise.hpp>

You will need to link the relevant Poplibs libraries with your program, in addition to the Poplar library. For example:

 $ g++ -std=c++11 my-program.cpp -lpoplar -lpopops

Some libraries are dependent on other libraries, which you will also need to link with your program. See the Poplar and Poplibs API Reference for details.


These tutorials provide hands-on programming exercises to enable you to familiarise yourself with creating and running programs using Poplar and Poplibs. They are intended to complement the rest of this user guide. It is assumed that you have already downloaded and installed Poplar, and that you are familiar with C++ and command-line compilation tools.

You can find the tutorials in the examples/tutorials directory of the Poplar installation. For most of the tutorials we’ve included two directories. One, called start_here, contains the bare structure of the tutorial as a starting point and the other, complete, contains the finished code for reference.

All the tutorials are in C++ and by default use a simulated IPU, so you should be able to create the code, compile and run them as you work through this text.

Tutorial 1: programs and variables

Copy the file tut1_variables/start_here/tut1.cpp to your working directory and open it in an editor. The file contains just the bare bones of a C++ program including some Poplar library headers and a namespace.

Graphs, variables and programs

All Poplar programs require a Graph object to construct the computation graph. Graphs are always created for a specific target (where the target is a description of the hardware being targeted, such as an IPU). To obtain the target we need to choose a device.

All the tutorials here use a simulated target by default, so will run on any machine even if it has no Graphcore hardware attached. On systems with accelerator hardware, the header file poplar/DeviceManager.hpp contains API calls to enumerate and return Device objects for the attached hardware.

Simulated devices are created with the IPUModel class, which models the functionality of an IPU on the host. The createDevice function creates a new virtual device to work with. Once we have this device we can create a Graph object to target it.

  • Add the following code to the body of main:

     // Create the IPU Model device
    IPUModel ipuModel;
    Device device = ipuModel.createDevice();
    Target target = device.getTarget();
    // Create the Graph object
    Graph graph(target);

Any program running on an IPU needs data to work on. These are defined as variables in the graph.

  • Add the following code:

     // Add variables to the graph
    Tensor v1 = graph.addVariable(FLOAT, {4}, "v1");

This adds one vector variable with four elements of type float to the graph. The final string parameter, "v1", is used to identify the data in debugging/profiling tools.

  • Add three more variables:

    • v2: another vector of 4 floats.
    • v3: a two-dimensional 4x4 tensor of floats.
    • v4: a vector of 10 integers (of type INT).

Note that the return type of addVariable is Tensor. The Tensor type represents data on the device in multi-dimensional tensor form. This type is used to reference the whole variable but, as we will see later, it can also be used to reference partial slices of variables, or data constructed from multiple variables.
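
For example, given the variables above, partial slices and combined tensors can be created as views (a sketch; these calls create references to the data rather than copies):

```cpp
// Sketch: Tensor views over the variables defined above.
Tensor row0  = v3[0];                     // first row of v3: shape {4}
Tensor block = v3.slice({0, 0}, {2, 2});  // top-left 2x2 sub-tensor
Tensor flat  = v3.flatten();              // same elements, shape {16}
Tensor both  = concat(v1, v2);            // one tensor from two variables
```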

Variables must be allocated to tiles. One option is to allocate the whole variable to one tile.

  • Add the following code:

     // Allocate v1 to reside on tile 0
    graph.setTileMapping(v1, 0);

Most of the time, programs actually deal with data spread over multiple tiles.

  • Add the following code:

     // Spread v2 over tiles 0..3
    for (unsigned i = 0; i < 4; ++i)
      graph.setTileMapping(v2[i], i);

This calls setTileMapping on sub-tensors of the variable v2 to spread it over multiple tiles.

  • Add code to allocate v3 and v4 to other tiles.

Now that we have created some variables in the graph, we can create a control program to run on the device. Programs are represented as sub-classes of the Program class. In this example we will use the Sequence sub-class, which represents a number of steps executed sequentially.

  • Add this declaration:

     // Create a control program that is a sequence of steps
    program::Sequence prog;
    // Debug print the tensor to the host console
    prog.add(program::PrintTensor("v1-debug", v1));

Here, the sequence has one step that will perform a debug print (via the host) of the data on the device.

Now that we have a graph and a program, we can see what happens when it is deployed on the device. To do this we must first create an Engine object.

  • Add to the code:

     // Create the engine
    Engine engine(graph, prog);

This object represents the compiled graph and program, which are ready to run on the device.

  • Add code to run the control program:

     // Run the control program
    std::cout << "Running program\n";
    engine.load(device);
    engine.run(0);
    std::cout << "Program complete\n";
  • Now compile the host program (remembering to link in the Poplar library using the -lpoplar flag):

     $ g++ -std=c++11 tut1.cpp -lpoplar -o tut1
  • Then run the compiled program:

     $ ./tut1

When the program runs, the debug output prints out uninitialised values, because we allocated a variable in the graph which is never initialised or written to:

 v1-debug: {0,0,0,0}

Initialising variables

In addition to variables, the graph can contain constant values. This is one way to initialise data in the graph.

  • After the code adding variables to the graph, add the following:

     // Add a constant tensor to the graph
    Tensor c1 = graph.addConstant<float>(FLOAT, {4}, {1.0, 1.5, 2.0, 2.5});

This line adds a new constant tensor to the graph whose elements have the values shown.

  • Allocate the data in c1 to tile 0:

     // Allocate c1 to tile 0
    graph.setTileMapping(c1, 0);
  • Now add the following to the sequence program, just before the PrintTensor program:

     // Add a step to initialise v1 with the constant value in c1
    prog.add(program::Copy(c1, v1));

Here we have used a predefined control program called Copy, which copies data between tensors on the device. Copying the constant tensor c1 into the variable v1 will result in v1 containing the same data as c1.

Note that the synchronisation and exchange phases of IPU execution described in the IPU Programmer’s Guide are performed automatically by the Poplar library functions and do not need to be specified explicitly.

If you recompile and run the program you should see the debug print of v1 has initialised values:

 v1-debug: {1,1.5,2,2.5}

Copying also works between variables.

  • After the v1 debug print command, add the following:

     // Copy the data in v1 to v2
    prog.add(program::Copy(v1, v2));
    // Debug print v2
    prog.add(program::PrintTensor("v2-debug", v2));

Now running the program will print both v1 and v2 with the same values.

Getting data into and out of the device

Most initial data will not be constant, but will come from the host. There are a couple of ways of getting data in and out of the device from the host, the simplest of which is to create a read or write handle connected to a tensor. This allows the host to transfer data directly to and from that variable.

  • Add code (before the engine creation instruction) to create read and write handles for the v3 variables:

     // Create host read/write handles for v3
    graph.createHostWrite("v3-write", v3);
    graph.createHostRead("v3-read", v3);

These handles are used after the engine is created.

  • Add the following code after the engine creation instruction:

     // Copy host data via the write handle to v3 on the device
    std::vector<float> h3(4 * 4, 0);
    engine.writeTensor("v3-write", h3.data());

Here, h3 holds data on the host (initialised to zeros) and the writeTensor call performs a synchronous write over the PCIe bus (simulated in this case) to the tensor on the device. After this call, the values of v3 on the device will be set to zero.

  • After the call to engine.run(0), add the following:

     // Copy v3 back to the host via the read handle
    engine.readTensor("v3-read", h3.data());
    // Output the copied back values of v3
    std::cout << "\nh3 data:\n";
    for (unsigned i = 0; i < 4; ++i) {
      std::cout << "  ";
      for (unsigned j = 0; j < 4; ++j) {
        std::cout << h3[i * 4 + j] << " ";
      }
      std::cout << "\n";
    }

Here, we are copying device data back to the host and printing it out. When the program is re-compiled and re-run, this prints all zeros (because the program on the device doesn’t modify the v3 variable):

 h3 data:
  0 0 0 0
  0 0 0 0
  0 0 0 0
  0 0 0 0

Let’s see what happens when v3 is modified on the device. We will use Copy again, but also start to look at the flexible data referencing capabilities of the Tensor type.

  • Add the following code to create slices of v1 and v3, immediately after the creation of the host read/write handles for v3:

 // Copy a slice of v1 into v3
Tensor v1slice = v1.slice(0, 3);
Tensor v3slice = v3.slice({1,1},{2,4});

These lines create a new Tensor object that references data in the graph. They do not create new state but reference parts of v1 and v3.

  • Now add this copy program:

     prog.add(program::Copy(v1slice, v3slice));

This step copies three elements from v1 into the middle of v3. Re-compile and re-run the program to see the results:

 h3 data:
  0 0 0 0
  0 1 1.5 2
  0 0 0 0
  0 0 0 0

Data streams

The most efficient way to get data in and out of the device is to use data streams (see Data streams and remote buffers).

During machine learning training, for example, data streams are the best mechanism to use to get example data into the device. Data streams need to be created and explicitly named in the graph.

  • Add the following code to the program definition:

     // Add a data stream to fill v4
    DataStream inStream = graph.addHostToDeviceFIFO("v4-input-stream", INT, 10);
    // Add program steps to copy from the stream
    prog.add(program::Copy(inStream, v4));
    prog.add(program::PrintTensor("v4-0", v4));
    prog.add(program::Copy(inStream, v4));
    prog.add(program::PrintTensor("v4-1", v4));

These instructions copy from the input stream to the variable v4 twice. After each copy, v4 holds new data from the host.

After the engine is created, the data streams need to be connected to data on the host. This is achieved with the Engine::connectStream function.

  • Add the following code after the creation of the engine:

     // Create a buffer to hold data to be fed via the data stream
    std::vector<int> inData(10 * 3);
    for (unsigned i = 0; i < 10 * 3; ++i)
      inData[i] = i;
    // Connect the data stream
    engine.connectStream("v4-input-stream", &inData[0], &inData[10 * 3]);

Here, we’ve connected the stream to a data buffer on the host, using it as a circular buffer of data. Recompile and run the program again, and you can see that after each copy from the stream, v4 holds new data copied from the host memory buffer:

 v4-0: {0,1,2,3,4,5,6,7,8,9}
v4-1: {10,11,12,13,14,15,16,17,18,19}

(Optional) Using the IPU

This section describes how to modify the program to use the IPU hardware.

  • Copy tut1.cpp to tut1_ipu_hardware.cpp and open it in an editor.

  • Remove this include directive:

     #include <poplar/IPUModel.hpp>
  • Add this include directive:

     #include <poplar/DeviceManager.hpp>
  • Replace the following lines from the start of main:

     // Create the IPU Model device
    IPUModel ipuModel;
    Device device = ipuModel.createDevice();

    with this code:

     // Create the DeviceManager which is used to discover devices
    DeviceManager manager = DeviceManager::createDeviceManager();
    // Attempt to attach to a single IPU:
    Device device;
    bool success = false;
    // Loop over all single IPU devices on the host
    // Break the loop when an IPU is successfully acquired
    for (auto &hwDevice : manager.getDevices(poplar::TargetType::IPU, 1)) {
      device = std::move(hwDevice);
      std::cerr << "Trying to attach to IPU " << device.getId() << std::endl;
      if ((success = device.attach())) {
        std::cerr << "Attached to IPU " << device.getId() << std::endl;
        break;
      }
    }
    if (!success) {
      std::cerr << "Error attaching to device" << std::endl;
      return -1;
    }

This gets a list of all devices consisting of a single IPU that are attached to the host and tries to attach to each one in turn until successful. This is a useful approach if there are multiple users on the host. It is also possible to get a specific device by its device manager ID using the getDevice function.

  • Compile the program.

     $ g++ --std=c++11 tut1_ipu_hardware.cpp -lpoplar -o tut1_ipu_hardware

Before running it you need to make sure that you have set the environment variables for the Graphcore drivers (see the Getting Started Guide for your IPU system).

  • Run the program to see the same results.

     $ ./tut1_ipu_hardware

You can make similar modifications to the programs for the other tutorials in order to use the IPU hardware.

Tutorial 2: using Poplibs

Make a copy of the file tut2_operations/start_here/tut2.cpp from the Poplar tutorials, and open it in an editor. This file contains a basic Poplar program structure similar to that seen in tutorial 1. It creates a graph with a couple of variables and initialises them. However, this time it includes some extra headers from the popops library:

 #include <popops/codelets.hpp>
#include <popops/ElementWise.hpp>

This gives us access to library functions for data manipulation, which have been highly optimised for IPU devices.

  • To use this, you need to add the device-side library code to the graph, so that it is loaded when the code is run:

     // Add the popops codelets to the graph
    popops::addCodelets(graph);
A similar addCodelets call is required for each of the Poplibs libraries you use in your program.

  • Compile and run the code (remember to link in the popops library):

     $ g++ --std=c++11 tut2.cpp -lpoplar -lpopops -o tut2
    $ ./tut2

The code doesn’t do anything at the moment so let’s add an operation to the graph.

  • Add the following, before the engine creation, to extend the program sequence with an add operation:

     // Extend program with elementwise add (this will add to the sequence)
    Tensor v3 = popops::add(graph, v1, v2, prog, "Add");
    prog.add(PrintTensor("v3", v3));

The popops::add function extends the sequence prog with extra steps to perform an elementwise add. We’ve also created a new variable, v3, in the graph for the returned result. So, after the add operation, v3 holds the result of adding the elements of v1 to v2.

  • Re-compile and re-run the program. You should see the results of the addition:

     v3: {
  • Add code to add v2 to the result tensor v3 and print the result.

That is all that is required to use the Poplibs library functions. You can see the capability of these libraries by browsing the Poplibs API documentation or the header files in the include directories of the Poplar installation.

Reshaping and transposing data

When calling libraries to perform operations, there are many functions to arrange how data is passed to the operation. These can be found in the Tensor.hpp header. In tutorial 1 we used slicing, but there are also various functions for reshaping and transposing data.

  • Add the following code:

     // Example element wise addition using a transposed view of the data
    Tensor v5 = popops::add(graph, v1, v2.transpose(), prog, "Add");
    prog.add(PrintTensor("v5", v5));

Here the add function adds v1 to the transpose of the 2x2 matrix v2.

  • Re-compile and re-run the program to see the result.

Tutorial 3: writing vertex code

In this tutorial we will look at how compute steps are built up from compute sets: collections of parallel pieces of work (the vertices of the compute graph) that execute together. The process for constructing compute sets described here is the same method that the Poplibs libraries use.

Make a copy of the file tut3_vertices/start_here/tut3.cpp and open it in an editor. This file has a skeleton program like tutorial 2, but does not use the Poplibs libraries. Instead, we will write the device code for the vertices in C++.

The program initially adds two 4-element vectors to the graph (v1 and v2). The step we are going to add will set each element of v2 to the suffix sum of v1. So v2[0] will contain the sum of all the elements of v1, v2[1] will contain the sum of the suffix of v1, starting at element 1, and so on.

Creating a codelet

To implement this operation, we have to write some code to run on the device, known as a codelet. A file is provided for this in the tutorial directory, called tut3_codelets.cpp. Make a copy of this file in your local directory.

  • Add the following code to tut3.cpp after the graph object is created:

     // Add codelets to the graph
    graph.addCodelets("tut3_codelets.cpp");

This instructs the host program to load the device code into the graph and compile it to run on the device.

Inside tut3_codelets.cpp is the skeleton of a codelet. Like all Poplar codelets, it is a C++ class derived from the poplar::Vertex class, with a single member function called compute. This function defines the work done by the vertex. The compute function returns true to indicate successful completion.

We’ll code this vertex to take in a set of numbers and write the sum of those numbers out.

  • Alter the class in the codelets file, adding the following fields to the vertex definition:

     class SumVertex : public poplar::Vertex {
    public:
      // Fields
      poplar::Input<poplar::Vector<float>> in;
      poplar::Output<float> out;

The fields named in and out represent the vertex’s connections to external tensors. They are used in the body of the compute function to read and write the tensor data being operated on.

  • Fill in the body of the compute function to calculate the output as the sum of the inputs:

     // Compute function
    bool compute() {
      *out = 0;
      for (const auto &v : in) {
        *out += v;
      }
      return true;
    }

Note that the out field can be updated even if the destination tensor is on another tile. This is because the vertex operates on a local copy of the data. The final result is transferred to the destination tile in the exchange phase after the compute is complete.

Creating a compute set

Now that we have some device code, we can build a step to execute it and add this to our control program. To do this, you need to:

  1. Create a compute set, which defines the set of vertices that are executed in parallel at each step
  2. Add vertices to the compute set to execute the task
  3. Connect data to the vertices (in other words, define the edges of the graph)
  4. Set the tile mapping of the vertices

These are described in more detail below.

  • Create a compute set: add the following declaration to the control program in tut3.cpp, after the code to initialise v1 (the string argument is a debug identifier):

     ComputeSet computeSet = graph.addComputeSet("computeSet");
  • Add vertices to the compute set: add four vertices to the compute set. Add the following loop to the code, after the compute set definition. This passes the name of the class defined in the codelet, which will create an instance of that class for each vertex. Each vertex will output to a different element of v2.

     for (unsigned i = 0; i < 4; ++i) {
      VertexRef vtx = graph.addVertex(computeSet, "SumVertex");

    Note that the "SumVertex" argument specifies the type of vertex to use, in this case it’s the one we defined in the tut3_codelets.cpp file that was loaded into the graph.

  • Define the connections: add the following code to the body of the loop you just created to connect the input and output variables to the vertices. By using tensor operators and the loop index, each vertex is connected to different tensor elements.

     graph.connect(vtx["in"], v1.slice(i, 4));
    graph.connect(vtx["out"], v2[i]);
  • Set the tile mapping: Add the following code to the body of the same loop:

     graph.setTileMapping(vtx, i);

    Here, each vertex is mapped to a different tile.

Executing the compute set

If you are using the IPU Model simulation and want to profile the performance, you can set a cycle estimate for the vertex, if known. This is the number of cycles it takes to execute the codelet on the IPU. Here we set the cycle estimate to be 20 cycles.

 graph.setCycleEstimate(vtx, 20);

After creating the compute set, the final task is to add a step to the control program to execute the compute set:

  • Add the following code (anywhere after the prog sequence has been defined, but before v2 is printed):

     // Add step to execute the compute set
    prog.add(program::Execute(computeSet));
  • Now you can compile and run the program. You do not need to compile the codelet. Your program will load and compile the vertex at run time.

You should now see that the v2 tensor has been updated to the expected values:

 v2: {7,6,4.5,2.5}

You can also compile the vertex code from the command line, with the popc command:

 $ popc tut3_codelets.cpp -o tut3_codelets.gp

You can then use the compiled code by loading it, instead of the source, in your program:

 // Add codelets to the graph
graph.addCodelets("tut3_codelets.gp");

Tutorial 4: profiling output

Make a copy of the file tut4_profiling/start_here/tut4.cpp from the Poplar installation and open it in an editor.

  • Use the MatMul function from the poplin library to extend this example to calculate ((m1 * m2) * m3). The MatMul function is documented in the Poplar and Poplibs API Reference.

  • Compile and run the program.

    When the program runs it prints profiling data. You should redirect this to a file to make it easier to study.

    Take some time to review and understand the execution profile. Refer to the Profiling section for an explanation of the profiling data. For example:

    • Determine what percentage of the memory of the IPU is being used
    • Determine how long the computation took
    • Determine which steps belong to which matrix-multiply operation
    • Identify how much time is taken by communication during the exchange phases

Tutorial 5: a basic machine learning example

This tutorial contains a complete training program that performs a logistic regression on the MNIST data set, using gradient descent. The files for the demo are in tut5_ml. There are no coding steps in the tutorial. The task is to understand the code, build it and run it. The program accepts an optional command line argument to make it use the IPU hardware instead of a simulated IPU. As you would expect, training is significantly faster on the IPU hardware.

Before you can run the code you will need to run the get_mnist.sh script to download the MNIST data.

Tutorial 6: matrix-vector multiplication

This tutorial builds up a more complex calculation on vertices: multiplying a matrix by a vector. Make a copy of the files in tut6_matrix_vector/start_here in your local directory.

The file matrix-mul-codelets.cpp contains the outline for the vertex code that will perform a dot product. Its input and output fields are already defined:

 class DotProductVertex : public Vertex {
public:
  Input<Vector<float>> a;
  Input<Vector<float>> b;
  Output<float> out;
  • Complete the compute function of DotProductVertex.

The host code follows a similar pattern to the host code in the previous tutorials. There are three tensors defined for the input matrix, input vector and output vector:

 Tensor matrix = graph.addVariable(FLOAT, {numRows, numCols}, "matrix");
Tensor inputVector = graph.addVariable(FLOAT, {numCols}, "inputVector");
Tensor outputVector = graph.addVariable(FLOAT, {numRows}, "outputVector");

The function buildMultiplyProgram creates the graph and control program for performing the multiplication. The control program executes a single compute set called mulCS. This compute set consists of a vertex for each output element of the output vector (in other words, one vertex for each row of the input matrix).

The next task in this tutorial is to write the host code to add the vertices to the compute set.

  • Create a loop that performs numRows iterations, each of which will add a vertex to the graph.

    • Use the addVertex function of the graph object to add a vertex of type DotProductVertex to the mulCS compute set.
    • Use the final argument of addVertex to connect the fields of the vertex to the relevant tensor slices for that row. Each vertex takes one row of the matrix (you can use the index operator on the matrix tensor) as its a input, the entire inputVector tensor as its b input, and outputs to a single element of the outputVector tensor.

After adding this code, you can build and run the example. As you can see from the host program code, you’ll need to provide two arguments to the execution command that specify the size of the matrix. For example, running the program as shown below will multiply a 40x50 matrix by a vector of size 50:

 $ ./matrix-vector 40 50

The host code includes a check that the result is correct.

Tutorial 7: matrix-vector multiplication optimisation

For a massively parallel machine such as the IPU, the strategy in the last tutorial is not the most efficient. In particular:

  • Allocating one vertex to each row may not create enough vertices to occupy all the workers on the machine.
  • The input vector needs to be broadcast to every tile, which results in a large communication cost.

A more efficient strategy is to split each row into several segments and have the vertices calculate the dot product of that row segment with the corresponding segment of the input vector. After these partial sums have been calculated, a reduction is needed to add all the partial sums together for each output element to get the final output value.

This tutorial uses a simple algorithm to estimate the best way of splitting the data across the tiles in order to get the best performance. The Poplibs matrix-multiply functions use a similar, but more sophisticated, method that also considers the best instructions to use and different ways of reshaping the tensor data.

Make a copy of the files in tut7_matrix_vector_opt from the Poplar installation. In this tutorial, there is no code for you to complete; the aim is to understand the code and experiment with different matrix sizes.

The device code in matrix-mul-codelets.cpp includes an extra vertex class, called ReduceVertex, which sums a set of values in a vector.

The host file follows the same structure as the previous tutorial. The difference in this example is in the buildMultiplyProgram function. The first thing this does is work out how many segments to split the matrix rows into:

 // Get the optimal column axis split to split the number of columns
// into partial sums
unsigned colAxisSplit = calcOptimalColAxisSplit(graph, numRows, numCols);

Looking at the calcOptimalColAxisSplit function, you can see that it just iterates through all possible splits and calls the estimateCycles function for that split. The estimateCycles function itself tries to estimate how many cycles the calculation will take to perform. This is done by looking at the worst-case running time and exchange time of the tiles involved in both the partial-sum calculation phase and the reduction phase.

Once the split is determined, the code creates a new tensor to hold the intermediate partial-sum calculations:

 // Create a tensor to hold the intermediate calculated partial sums
auto partials = graph.addVariable(FLOAT, {numRows, colAxisSplit}, "partials");

The calculation is split into two phases. The first phase calculates the dot product of all the row segments and writes to the partials tensor. The second phase reads the partials tensor, adds up the partial sums and writes the output to the final out tensor.

These two phases are built with two loops. The first populates the mulCS compute set:

 // Create a compute set to hold the vertices to perform the
// partial sum calculations.
ComputeSet mulCS = graph.addComputeSet("mulCS");

// Create a vertex for each segment, for each row.
for (unsigned i = 0; i < colAxisSplit; ++i) {
    auto v = graph.addVertex(mulCS, "DotProductVertex",

The second loop builds up the reduceCS compute set:

 // Create a compute set to calculate the reduction.
auto reduceCS = graph.addComputeSet("reduceCS");

// For each output element create a vertex.
for (unsigned row = 0; row < numRows; ++row) {
  auto v = graph.addVertex(reduceCS, "ReduceVertex",

The final program, which performs the entire multiplication, consists of executing the two compute sets in order:

 return Sequence(Execute(mulCS), Execute(reduceCS));

This example has a Makefile so you can build it by running make. After that, try running the program on some input data:

 $ ./matrix-vector 10000 1000
Multiplying matrix of size 10000x1000 by vector of size 1000
Creating environment (compiling vertex programs)
Constructing compute graph and control program
Best split chosen:
colsAxisSplit=5, total cost=4751 (compute cost=4410, exchange cost=200,
                                reduce exchange cost=45,
                                reduce compute cost=96)
Worst cost seen: 64373
Running graph program to multiply matrix by vector
Multiplication result OK
Program cycles: 5071

Here you can see that the program splits each row into five segments with an estimated cycle cost of 4,751 cycles. The IPU Model report shows that the program simulation ran with an actual runtime of 5,071 cycles.

  • Try running the program for other matrix sizes.



Profiling

Profiling is a rapidly changing part of Poplar, so this information may be slightly out of date.

This section describes the profiling information that can be generated by Poplar.

Poplar is able to instrument the code to produce detailed compile-time information about the graph program and run-time information about the execution, including how memory is used, and where memory and processor cycles are consumed.

This information can be output to JSON (JavaScript Object Notation) format file. You can generate a summary version of the information, as described below. There is also a graphical viewer, called gc-profile, available for download from the Graphcore customer support portal.

The profiling information available depends on the target:

 Target      Memory Profiling    Execution Profiling
 CPU         None                None
 IPU         Exact               Hardware measurement with limitations
 IPU Model   Exact (optional)    Detailed but based on estimates

The IPUModel::compileIPUCode option, described below, can be used to generate exact memory profiling information for an IPU Model.

Because profiling adds code and extra variables to extract the profiling information, it can change the performance and memory usage of your program.

Profile summary output

A summary of the profiling information can be generated in a more readable form. For example, a histogram of memory usage per tile is displayed, as well as the raw numbers.

The format of the summary output is described in Profiling summary format.

A command line program is provided to summarise the profiling output. Alternatively, the Poplar program can write the summary information directly. These two options are described below.

Command line conversion

A command is included with the Poplar SDK to convert the generated JSON files into a readable summary. This has options --graph-profile and --execution-profile to specify the graph and execution profile files to use. For example:

 $ poplar_profile_summary --graph-profile graph_profile.json

The following command line options can be used to control the summary output:

  • --show-execution-steps: Show execution steps
  • --show-optimizations: Show optimisation information
  • --show-per-ipu-memory-usage: Report the memory usage for each IPU
  • --show-var-storage: Show liveness information for each program

From a Poplar program

To print a summary of both the graph and execution profiling information you can call the function:

 poplar::printProfileSummary(std::cout, graphProfile,
                            executionProfile, options);

The parameter executionProfile is optional. If empty, execution profile data will not be printed.

You can also print the graph and execution summaries separately by calling printGraphSummary or printExecutionSummary (see the Poplar and Poplibs API Reference for more information).

The following options are available:

  • colours: Control the use of colours in the summary output
  • showVarStorage: Show liveness information for each program
  • showOptimizations: Show compile optimisation details
  • showExecutionSteps: Show the execution steps
  • showPerIpuMemoryUsage: Show memory usage for each IPU

Colours can be used to highlight different sections of the output, to make it easier to understand. By default, colour is enabled when output is to a supported terminal. By setting the colours option to "true", colour output will be generated even if outputting to a file or piping the output to another program.

Generating profiling information

After you have loaded your Graph into an Engine, you can get static profile information about the graph and the resources required. This includes cycle estimates (for an IPU Model) and memory information.

 ProfileValue graphProfile = engine.getGraphProfile();

After you have run the program one or more times you can get dynamic profiling information (which programs were run, cycle counts, and so on).

 ProfileValue executionProfile = engine.getExecutionProfile();

ProfileValue contains JSON-compatible data. You can serialise it to JSON or CBOR using the global functions:

 poplar::serializeToJSON(std::cout, graphProfile, true);
poplar::serializeToCBOR(std::cout, graphProfile);

The last parameter of serializeToJSON() controls whether or not to pretty print the data.

For example, the output for the graph profile contains the following:


Profiling options

There are some options for controlling profiling on IPU targets (hardware or simulator).

By default, profiling is disabled. The instrumentation of compute cycles and external exchange cycles can be enabled with the following options:

  • debug.instrument Set to “true” to enable instrumentation.
  • debug.instrumentCompute Set to “true” or “false” to enable or disable instrumentation of compute cycles.
  • debug.instrumentExternalExchange Set to “true” or “false” to enable or disable instrumentation of cycles used for exchanges between IPUs, or between the IPU and the host.

Note that there is no option to instrument internal exchanges because all internal exchange is statically scheduled.

If the instrumentation of compute is enabled, then the compute cycles counted can be specified with the debug.computeInstrumentationLevel option. This can have the following values:

  • “tile” Store the cycle count for the last execution of each compute set on every tile (default).
  • “vertex” Store the cycle count for the last execution of each vertex on every tile.
  • “device” Store the cycle count for the last execution of each compute set on a single tile. This measures the execution time of the longest-running tile in the compute set. This saves memory compared to “tile” but loses all the per-tile cycle information.
  • “ipu”: Similar to “device”, but instead of storing the cycle counts on a single tile across all IPUs, it stores them on one tile per IPU which avoids the need for global syncs.

These can be specified when the Engine constructor is called. For example:

 Engine engine(graph, prog,
              OptionFlags{{"debug.instrument", "true"}});
These options can also be defined in the environment variable POPLAR_ENGINE_OPTIONS. For example:

 export POPLAR_ENGINE_OPTIONS='{"debug.instrument": "true",
                               "debug.computeInstrumentationLevel": "vertex"}'

For IPU Model targets you can optionally tell Poplar to compile code for the IPU (in addition to the CPU code that is actually executed by the model). If this is not done, then the reported memory usage will not include memory used for code. If this is enabled then the memory profile should give the same results as an IPU target. This option is specified by setting the compileIPUCode member of the model, for example:

 // Create the IPU Model device
IPUModel ipuModel;
ipuModel.compileIPUCode = true;

Graph profile

The structure of the graph profile is organised in the following areas:

  • Target information
  • Optimisation information
  • Graph information
  • Vertex types
  • Compute sets
  • Exchanges
  • Program structure
  • Memory use

These are described in detail in the following sections.

Target information

The target contains some useful information about the target hardware.

  • type: The target type, which is one of CPU, IPU or IPU_MODEL.
  • bytesPerIPU: The number of bytes of memory on an IPU.
  • bytesPerTile: The number of bytes of memory on a tile.
  • clockFrequency: The tile clock frequency in Hertz.
  • numIPUs: The number of IPU chips in the system.
  • tilesPerIPU: The number of tiles on each IPU chip.
  • numTiles: The total number of tiles. This is the product of numIPUs and tilesPerIPU. It is stored redundantly for convenience.
  • totalMemory: The total memory. This is the product of bytesPerTile and numTiles (or bytesPerIPU and numIPUs). It is stored redundantly for convenience.
  • relativeSyncDelayByTile: The sync delay for each tile (relative to the minimum value).
  • minSyncDelay: The minimum sync delay for any tile.

The sync delay for a tile is the number of cycles that it takes for the tile to send a sync request to the sync controller and receive a sync release signal back from the sync controller. It is smaller for tiles closer to the sync controller. This can be used for calculating how long a sync takes. The values are given for each tile on one IPU. In other words, there are tilesPerIPU values, not numTiles, because the sync delay values are the same on every IPU. The sync delay for each tile is given by minSyncDelay + relativeSyncDelayByTile[tile].

Optimisation information

optimizationInfo contains a map<string, double> of internal metrics related to compilation. The keys may change but this will always be a map from strings to doubles.

Graph information

graph includes some basic information about the graph, such as the number of compute sets.


Vertex types

vertexTypes lists the vertex types that are actually used in the graph. There may be many more vertex types but unused ones are ignored. In the rest of the profile data, references to vertex types are specified as an index into these arrays.

  • names lists the names of the vertex types. This includes built-in vertices like poplar_rt::LongMemcpy.
  • sizes contains the size of the vertex state (the class members) of each vertex type. For example, Doubler might have 4 bytes of state.
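Because names and sizes are parallel arrays indexed by vertex type, they can be paired up directly. A minimal sketch, assuming the vertexTypes object has been loaded into a Python dict (the example entries and sizes are made up for illustration):

```python
# Pair each vertex type name with its vertex-state size in bytes.
vertex_types = {
    "names": ["Doubler", "poplar_rt::LongMemcpy"],
    "sizes": [4, 12],
}

for name, size in zip(vertex_types["names"], vertex_types["sizes"]):
    print(f"{name}: {size} bytes of vertex state")
```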

Compute sets

computeSets contains the names and the number of vertices in each compute set. For the IPU_MODEL target it also includes a cycleEstimates field.

  • names: The name of each compute set. These are mainly for debugging purposes and are not necessarily unique. This includes compute sets generated during compilation.
  • vertexCounts and vertexTypes: The number of each type of vertex in the compute set. For each compute set there are vertexCounts[compute_set][i] vertices of type vertexTypes[compute_set][i]. The type is an index into the top-level "vertexTypes" array.
  • cycleEstimates: A cycle estimate is calculated for each vertex and then the vertices are scheduled in the same way that they would be run on real hardware. This results in three cycle-estimate arrays:
    • activeCyclesByTile: This is the number of cycles during which a vertex was being run. Tiles have six hardware threads that are serviced in a round-robin fashion. If only one vertex is running then out of every six cycles only one cycle is “active”, and the other five cycles are idle. activeCyclesByTile counts the total number of active cycles in each compute set for each tile. It is indexed as [compute_set][tile].
    • activeCyclesByVertexType: This is the total number of active cycles in each compute set, by vertex type. It is indexed as [compute_set][vertex_type] where vertex_type is an index into "vertexTypes".
    • cyclesByTile: This is similar to activeCyclesByTile but it also counts idle cycles where a thread is not executing. This therefore gives the actual number of cycles that each tile takes running this compute set.
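The difference between the active and total cycle counts gives a rough per-tile utilisation figure. A small sketch, assuming the two arrays for one compute set have been extracted from the profile (the values are hypothetical):

```python
# Per-tile utilisation for one compute set: activeCyclesByTile counts
# cycles where a vertex was actually executing; cyclesByTile also
# counts idle cycles, so active/total gives the busy fraction.
def utilisation(active_by_tile, total_by_tile):
    return [a / t if t else 0.0 for a, t in zip(active_by_tile, total_by_tile)]

# Hypothetical values for one compute set on four tiles:
active = [60, 30, 0, 90]
total = [120, 120, 0, 120]
print(utilisation(active, total))  # [0.5, 0.25, 0.0, 0.75]
```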


Exchanges

exchanges lists some basic information about internal exchanges.

  • bytesReceivedByTile is the number of bytes received by each tile in the exchange. It is indexed as [internal_exchange_id][tile].
  • bytesSentByTile is the number of bytes sent by each tile in the exchange. It is indexed as [internal_exchange_id][tile].
  • cyclesByTile is the number of cycles that each tile used for internal exchanges. It is indexed as [internal_exchange_id][tile]. This is known exactly for internal exchanges, which are statically scheduled.

externalExchanges lists the same information for IPU-to-IPU exchanges.

  • bytesReceivedByTile is the number of bytes received by each tile in the exchange. It is indexed as [external_exchange_id][tile].
  • bytesSentByTile is the number of bytes sent by each tile in the exchange. It is indexed as [external_exchange_id][tile].
  • estimatedCyclesByTile is the estimated number of cycles that each tile used for exchanges with other IPUs. It is indexed as [external_exchange_id][tile].

hostExchanges lists the same information for exchanges between the host and IPU.

  • bytesReceivedByTile is the number of bytes received by each tile in the exchange. It is indexed as [host_exchange_id][tile].
  • bytesSentByTile is the number of bytes sent by each tile in the exchange. It is indexed as [host_exchange_id][tile].
  • estimatedCyclesByTile is the estimated number of cycles that each tile used for exchanges to or from the host. It is indexed as [host_exchange_id][tile].
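Since each of these arrays is indexed as [exchange_id][tile], the total data moved in each exchange is just a per-row sum. A minimal sketch, assuming bytesReceivedByTile has been extracted from the profile as a nested list (the values are invented for illustration):

```python
# Total bytes received in each exchange, summing the per-tile counts.
# bytes_by_tile is indexed as [exchange_id][tile].
def totals_per_exchange(bytes_by_tile):
    return [sum(tile_bytes) for tile_bytes in bytes_by_tile]

received = [[0, 128, 0], [64, 64, 64]]   # two exchanges, three tiles
print(totals_per_exchange(received))     # [128, 192]
```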

Program structure

The graph profile includes a serialisation of the program structure. This can include some programs generated during compilation, such as exchange and sync operations, in addition to the programs explicitly specified in the source code.

programs is a flattened array of all the programs given to the engine. This includes control programs (programs the user has provided) and functions (internally generated programs to reduce code duplication).

The arrays controlPrograms and functionPrograms contain the indexes of control and function programs in the programs array. Normally user programs are wrapped in a single control program so controlPrograms will nearly always contain only [0].

"functionPrograms":[31, 45],

Each entry in the programs array is a tagged union. The tag is the type field, which indicates the type of the program. The following table summarises the tags generated by each program class.

Program class   Program type tags

Execute         OnTileExecute
                This may be preceded or followed by DoExchange or GlobalExchange
                if exchanges are needed before/after execution.
Repeat          Repeat
If, IfElse      A sequence of:
                  1. SetLocalConsensusFromVar
                  2. Sync
                  3. If or IfElse
Switch          Switch
Sequence        Sequence
Copy            DoExchange, GlobalExchange or StreamCopy,
                corresponding to internal exchange, inter-IPU exchange
                and host exchange respectively.
                This may be preceded or followed by OnTileExecute or DoExchange
                if data rearrangement is needed before/after the copy.
WriteUndef      WriteUndef
Sync            Sync
Call            Call
PrintTensor     StreamCopy

The type determines which other fields are present. The most useful are described below.

Programs that have sub-programs encode this with the children field (even those with a fixed number of children like If). The sub-programs are specified as indexes into the programs array.
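Because children holds indexes back into the flat programs array, the program tree can be reconstructed with a simple recursive walk. A minimal sketch, assuming the profile's programs array has been loaded as a Python list of dicts (the three entries below are a made-up fragment in the shape described above):

```python
# A made-up fragment of the flattened "programs" array.
programs = [
    {"type": "Sequence", "children": [1, 2]},
    {"type": "OnTileExecute", "children": []},
    {"type": "DoExchange", "children": []},
]

# Walk the tree by following "children" indexes, producing one
# indented line per program.
def tree_lines(programs, index=0, depth=0):
    prog = programs[index]
    lines = ["  " * depth + prog["type"]]
    for child in prog.get("children", []):
        lines += tree_lines(programs, child, depth + 1)
    return lines

print("\n".join(tree_lines(programs)))
```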


The exchange programs (DoExchange, GlobalExchange and StreamCopy) reference the exchange ID, which is an index into exchanges, externalExchanges or hostExchanges respectively.


DoExchange also includes a breakdown of the number of each type of exchange instruction.


OnTileExecute contains the compute set ID, which is an index into the arrays in computeSets.


Programs can have a name field.


Call programs call a sub-graph as a function. They contain an index into the functionPrograms array that identifies the function called.


Memory use

The memory object contains a lot of information about memory use. All memory is statically allocated so you don’t need to run the program to gather this data.

The memory usage is reported for each tile, and also by category (what the memory is used for), by compute set and by vertex type. There is also a report of variable liveness, including a tree of the liveness for all possible call stacks (this is a finite list because recursion is not allowed).

There are two memory regions on each tile, interleaved and non-interleaved; the use of each of these is reported separately. If the memory requirement is greater than the available memory, this is reported as overflowed. The memory usage in each region is provided both with and without gaps. Gaps arise because of memory allocation constraints, such as alignment requirements. For more information on the tile memory architecture, refer to the IPU Programmer’s Guide.

The memory used by some variables can be overlapped with others, because they are not live at the same time. Hence, the usage is split into overlapped and nonOverlapped components.

For top-level replicated graphs (those created by Graph(target, replication_factor)) the memory use will be reported for a single replica (the memory used by all replicas will be identical).

Memory per tile
 "byTile": {
  "interleaved": [ 536, 408 ],
  "interleavedIncludingGaps": [ 536, 408 ],
  "nonInterleaved": [ 19758, 3896 ],
  "nonInterleavedIncludingGaps": [ 65568, 19596 ],
  "overflowed": [ 0, 0 ],
  "overflowedIncludingGaps": [ 0, 0 ],
  "total": [ 20294, 3896 ],
  "totalIncludingGaps": [ 131608, 19596 ]
  • total is the sum of interleaved, nonInterleaved and overflowed. This is the total amount of memory used for data (not including padding) on each tile. However, due to memory constraints leading to padding, more memory may actually be required. Therefore this is usually not the number you want.
  • totalIncludingGaps is the actual amount of memory that is required on each tile. This is not simply the sum of the previous “including gaps” figures because adding those up does not take account of the gaps between the regions.

If any of these numbers is larger than the number of bytes per tile then the program will not fit on the hardware.
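That fit check can be performed directly on the byTile arrays. A minimal sketch, assuming the profile has been loaded into Python; it uses totalIncludingGaps (the actual per-tile requirement) together with bytesPerTile from the target information, with 262144 standing in for a 256 KB tile:

```python
# Return the indexes of any tiles whose actual memory requirement
# (totalIncludingGaps) exceeds the memory available on a tile.
def tiles_over_budget(total_including_gaps, bytes_per_tile):
    return [tile for tile, used in enumerate(total_including_gaps)
            if used > bytes_per_tile]

# Hypothetical two-tile figures (cf. the "byTile" example above):
by_tile = [131608, 19596]
print(tiles_over_budget(by_tile, 262144))  # [] -> everything fits
```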

Memory by category

byCategory is a breakdown of memory usage across the whole system by the type of data, and the region it is in.

  "controlCode": {
    "interleaved": {
      "nonOverlapped": [ 0, 0 ],
      "overlapped": [ 0, 0 ]
    "nonInterleaved": {
      "nonOverlapped": [ 1216, 356 ],
      "overlapped": [ 0, 0 ]
    "overflowed": {
      "nonOverlapped": [ 0, 0 ],
      "overlapped": [ 0, 0 ]
    "total": [ 1216, 356 ]

The categories are:

  • constant: Constants added by the user. Variables added by the compiler that happen to be constant will be in the variable category.
  • controlCode: Code for Programs and running compute sets.
  • controlId: Program and sync IDs.
  • controlTable: A table that lists the vertices to run in each compute set. Only used if the table scheduler is enabled.
  • copyDescriptor: Copy descriptors are special variable-sized fields used by copy vertices.
  • globalExchangeCode: Code for performing exchange operations between IPUs.
  • globalExchangePacketHeader: Packet headers for inter-IPU exchanges.
  • globalMessage: Message data for inter-IPU exchanges.
  • hostExchangeCode: Code for performing exchange operations to and from the host.
  • hostExchangePacketHeader: Packet headers for host exchanges.
  • hostMessage: Message data for host exchanges.
  • instrumentationResults: Variables to store profiling information.
  • internalExchangeCode: Code for performing internal exchange operations.
  • message: Message data for internal exchanges.
  • multiple: Space shared by variables from multiple different categories.
  • outputEdge: Storage for output edge data before an exchange takes place.
  • rearrangement: Variables holding rearranged versions of tensor data.
  • sharedCodeStorage: Code shared by vertices.
  • sharedDataStorage: Data shared by vertices.
  • stack: Worker and supervisor stacks.
  • variable: Space allocated for variables in worker and supervisor code.
  • vectorListDescriptor: The data for VectorList<Input<...>, DeltaN> fields.
  • vertexCode: Code for vertex functions (codelets).
  • vertexFieldData: Variable-sized fields. For example, the data for Vector<float> or Vector<Input<...>> fields.
  • vertexInstanceState: Vertex class instances. This will be sizeof(VertexName) for each vertex.

Memory by compute set

byComputeSet is a breakdown of memory usage across the whole system. It includes several 2D arrays indexed by compute set, then tile.

 "byComputeSet": {
  "codeBytes": [[0, ...],[0, ...], ...],
  "copyPtrBytes": [[0, ...],[0, ...], ...],
  "descriptorBytes": [[0, ...],[0, ...], ...],
  "edgePtrBytes": [[0, ...],[0, ...], ...],
  "paddingBytes": [[0, ...],[0, ...], ...],
  "vertexDataBytes": [[0, ...],[0, ...], ...],
  "totalBytes": [[0, ...],[0, ...], ...]
  • codeBytes is the amount of memory used for code by a compute set. Because that code may be shared by several compute sets, these numbers cannot be added in a meaningful way.
  • totalBytes is the sum of the above for convenience. Because it includes codeBytes it cannot be added in a meaningful way.

Memory by vertex type

byVertexType is a breakdown of memory usage across the whole system, like byComputeSet but for vertex types instead. The index into these arrays is also an index into the top level vertexTypes object.

  "codeBytes": [[0, ...],[0, ...], ...],
  "copyPtrBytes": [[0, ...],[0, ...], ...],
  "descriptorBytes": [[0, ...],[0, ...], ...],
  "edgePtrBytes": [[0, ...],[0, ...], ...],
  "paddingBytes": [[0, ...],[0, ...], ...],
  "totalBytes": [[0, ...],[0, ...], ...],
  "vertexDataBytes": [[0, ...],[0, ...], ...]

Execution profile

The execution profile contains information about the programs that have been run since the execution profile was last reset. Because the profiling data varies for different target types and profiling methods, the entire object is a tagged union.

Profiler mode

The profilerMode is the tag for this object. It can be one of the following:

  • NONE
  • CPU

It has the following fields, some of which are only present for certain modes.


  • computeSetCyclesByTile: A 2D array indexed by compute set id, then tile, that gives the total number of cycles taken to execute that compute set on that tile.


  • computeSetCycles: A 1D array indexed by compute set id that gives the total number of cycles taken to execute that compute set on all tiles. For this mode an internal sync is inserted before and after the compute set.


  • vertexCycles: A 1D array indexed by vertex ID that contains the number of cycles each vertex took the last time it was run.
  • vertexComputeSet: A 1D array indexed by vertex ID giving the compute set the vertex is in.
  • vertexType: A 1D array indexed by vertex ID giving an index into the list of vertex types.


  • externalExchangeCycles: A 2D array indexed by external exchange ID, and then tile, that gives the number of cycles used for each external (that is, from one IPU to another) exchange on each tile.


  • hostExchangeCycles: This is the same as externalExchangeCycles but for exchanges between the host and the IPU.

Additionally, for all modes except NONE and CPU, the profile contains program trace and simulation information.

Program trace information

  • programTrace is a 1D array of the program IDs that were run. These are indexes into programs in the graph profile.

Simulation information

  • simulation has a list of execution steps based on the simulation of the programs that are listed in programTrace. This information is redundant. It is calculated entirely from the graph profile and the programTrace but it is included for convenience.

The fields of simulation are as follows.

  • cycles is the total number of cycles it took to execute all of the programs in programTrace.
  • tileCycles is the number of cycles spent doing each kind of activity. Unlike cycles, this counts cycles from different tiles as distinct. That is, if two tiles each do a computation that takes 10 cycles in parallel, then cycles will be 10 but tileCycles.compute will be 20. activeCompute counts cycles where the active thread is computing; compute counts cycles where the active thread or any of the other threads is computing.
  • steps lists the compute, sync and exchange steps that are run. Each entry is a tagged union based on the type field which may be one of
    • OnTileExecute
    • StreamCopy
    • CopySharedStructure
    • Sync
    • DoExchange
    • GlobalExchange.

When running on actual hardware, the simulation uses computeSetCycles or computeSetCyclesByTile for the compute set cycles. If hardware cycles are not available (for example, under IPU_MODEL) then cycle estimates are used.

The other fields in each step depend on its type. Sync only contains the sync type: External or Internal.


All other types contain the following fields:

  • type: The step type as described above.
  • program: The program ID for this step (an index into programs).
  • name: This field may be present if the program has a name. If the program has no name this field is omitted.
  • tileBalance: A fraction from 0 to 1 which indicates how balanced computation was between the tiles. It is calculated as the total number of compute cycles used divided by (cycles × numTiles). If all tiles take the same number of cycles to finish then this will be 1.0. If, for example, one tile takes 10 cycles and another takes 5, then this will be 0.75.
  • activeTiles: The number of tiles that are computing (or exchanging for exchanges).
  • activeTileBalance: The same as tileBalance but it ignores completely idle tiles.
  • cycles: The number of cycles taken by the longest running tile. Because OnTileExecute calls can overlap with each other and with exchanges this may be non-zero even if the execution doesn’t actually take any extra time.
  • cyclesFrom: The first cycle number where this program was executing on any tile.
  • cyclesTo: The last cycle number where this program was executing on any tile.
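The tileBalance formula can be checked with a few lines of code. A minimal sketch, assuming the per-tile compute cycle counts for one step are available as a Python list (the values reproduce the worked example in the text):

```python
# tileBalance = total compute cycles used / (cycles * numTiles),
# where "cycles" is the count for the longest-running tile.
def tile_balance(cycles_per_tile):
    longest = max(cycles_per_tile)
    return sum(cycles_per_tile) / (longest * len(cycles_per_tile))

print(tile_balance([10, 5]))   # 0.75, matching the example in the text
print(tile_balance([10, 10]))  # 1.0 when perfectly balanced
```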

The exchange types (DoExchange, StreamCopy, GlobalExchange and CopySharedStructure) also contain these fields:

  • totalData: The total amount of data transferred during the exchange.
  • dataBalance: Exactly like tileBalance but for the amount of data sent and received by each tile, instead of cycles.

OnTileExecute also contains these fields:

  • threadBalance: Similar in concept to tileBalance except it measures how well-utilised the hardware threads are. If you always run 6 threads or 0 threads this will be 1.0 even if the total computation on each tile takes a different amount of time.
  • computeSet: The ID of the compute set executed by this step.

DoExchange, GlobalExchange and StreamCopy contain a field that is an index into the corresponding exchange lists, called exchange, externalExchange or hostExchange respectively.

Finally, OnTileExecute, DoExchange and CopySharedStructure contain this field:

  • cyclesOverlapped: How many cycles were overlapped with previous steps.

Profiling summary format

There are two environment variables that can also be used to control colour output. These are:

  • CLICOLOR: If set to 0, then colour output will not be generated. This overrides the colours option value.
  • CLICOLOR_FORCE: If set to 1, then colour output will always be generated, even if output is not to the terminal. This overrides the colours option value and the CLICOLOR environment variable.

If there are replicated graphs then the memory usage will only be shown for one replica, as it will be the same for all of them.

If any tiles are out of memory, the memory used by a few tiles with the largest usage will be shown.

The output with showVarStorage and showExecutionSteps will be similar to that shown below.


  Number of IPUs:         1
  Tiles per IPU:          1,216
  Total Tiles:            1,216
  Memory Per-Tile:        256.0 kB
  Total Memory:           304.0 MB
  Clock Speed (approx):   1,600.0 MHz
  Number of Replicas:     1
  IPUs per Replica:       1
  Tiles per Replica:      1,216
  Memory per Replica:     304.0 MB

  Number of vertices:            5,564
  Number of edges:              18,564
  Number of variables:          47,973
  Number of compute sets:           43

Memory on all IPUs:
Memory Usage:
    Including Gaps:         67,725,024 B
    Excluding Gaps:
      By Memory Region:
        Non-interleaved:     6,522,641 B
        Interleaved:           197,696 B
        Overflowed:                  0 B
      Total:                 6,720,337 B
      By Data Type:
        Not Overlapped
            Variables:                              254,280 B
            Internal Exchange Message Buffers:      109,788 B
            Data Rearrangement Buffers:                  96 B
            Host Exchange Packet Headers:            10,720 B
            Stack:                                3,852,288 B
            Vertex Instances:                        65,768 B
            Copy Descriptors:                        16,161 B
            VectorList Descriptors:                     528 B
            Vertex Field Data:                       24,052 B
            Control Code:                           467,696 B
            Vertex Code:                          1,097,868 B
            Internal Exchange Code:                 215,364 B
            Host Exchange Code:                     211,132 B
          Total:                                  6,325,741 B
        Overlapped
            Variables:                              340,056 B
            Internal Exchange Message Buffers:      697,128 B
            Data Rearrangement Buffers:              15,424 B
          Total:                                  1,052,608 B
          Total After Overlapping:                  394,596 B
      Vertex Data (106,509B):
        By Category:
          Internal vertex state:         34,114 B
          Edge pointers:                 38,876 B
          Copy pointers:                 16,668 B
          Padding:                          346 B
          Descriptors:                   16,505 B
        By Type:
          poplin::ConvPartial1x1Out<float,float,true,false>                                185,792 B
          poplin::ReduceAdd<float,float>                                                   117,728 B
          poprand::SetSeedSupervisor                                                        34,048 B
          poplar_rt::Memcpy64BitSupervisor                                                  31,468 B
          poplar_rt::DstStridedCopyDA32                                                     30,242 B
          poplin::Transpose2d<float>                                                        23,784 B
          popops::ScaledAddSupervisor<float,float,true>                                     18,336 B
          popnn::NonLinearityGradSupervisor<float,popnn::NonLinearityType::SIGMOID>         13,024 B
          popnn::NonLinearitySupervisor<float,popnn::NonLinearityType::SIGMOID>             12,512 B
          poplar_rt::StridedCopyDA32                                                         9,043 B
          popops::EncodeOneHot<unsigned int,float>                                           3,120 B
          poplar_rt::DstStridedCopy64BitMultiAccess                                          2,677 B
          popops::Reduce<popops::ReduceAdd,float,float,false,2>                              1,632 B
          poplar_rt::ShortMemcpy                                                             1,324 B
          popops::ScaledAdd2D<float,true>                                                    1,164 B
          popops::ScaledReduce<popops::ReduceAdd,float,float,true,1>                           888 B
          popnn::LossSumSquaredTransform<float>                                                816 B
          poplar_rt::StridedCopy64BitMultiAccess                                               719 B
          poplar_rt::DstStridedMemsetZero64Bit                                                 416 B
          popnn::ReduceMaxClassGather<float,unsigned int>                                      384 B
          popops::Reduce<popops::ReduceAdd,float,float,false,1>                                368 B
          poplar_rt::MemsetZero                                                                172 B
          popops::Reduce<popops::ReduceAdd,float,float,false,3>                                152 B
          popnn::CalcAccuracy<unsigned int>                                                     72 B
          poplar_rt::MemsetZero64Bit                                                            68 B
          popops::ScaledReduce<popops::ReduceAdd,float,float,true,0>                            56 B

  By Tile (Excluding Gaps):
    Range (KB) Histogram (Excluding Gaps)               Count (tiles)
        3 -  4 ****************************************    824
        4 -  5                                               0
        5 -  6                                               0
        6 -  7                                               0
        7 -  8 ********                                    168
        8 -  9 ***                                          44
        9 - 10 ******                                      126
       10 - 11 **                                           30
       11 - 12 *                                            16
       12 - 13                                               0
       13 - 14                                               0
       14 - 15 *                                             2
       15 - 16 *                                             2
       16 - 17 *                                             3
       17 - 18                                               0
       18 - 19                                               0
       19 - 20 *                                             1

    Maximum (Including Gaps): 131,608 (128.5 K) on tile 0
    Maximum (Excluding Gaps): 20,294 (19.8 K) on tile 0
    0 tile(s) out of memory

  Variable Storage Liveness: All tiles

    Always-live bytes: 5,976,425
    Always-live variables:
      <anon>                                                                                               48
      <const>                                                                                             120
      Layer1/Fwd/Conv_1/worklists                                                                      14,112
      Layer3/Bwd/Conv_1/worklists                                                                          72
      Layer3/Fwd/Conv_1/worklists                                                                         144
      Layer3/Wu/ReduceFinalStage/IntermediateToOutput/numPartials                                          60
      Layer5/Bwd/LossSumSquared/offset                                                                    112
      Layer5/Bwd/LossSumSquared/reduce_loss/ReduceOnTile/InToIntermediateNoExchange/numPartials            32
      Layer5/Bwd/LossSumSquared/sliceLen                                                                  128
      ValuePadder/padding                                                                                  16
      controlCode                                                                                     467,696
      copyDescriptor                                                                                   16,161
      hostExchangeCode                                                                                211,132
      hostExchangePacketHeader                                                                         10,720
      internalExchangeCode                                                                            215,364
      numCorrect                                                                                            4
      stack                                                                                         3,852,288
      vectorListDescriptor                                                                                528
      vertexCode                                                                                    1,097,868
      vertexFieldData                                                                                  24,052
      vertexInstanceState                                                                              65,768

    Maximum live bytes (including always-live): 6,556,737

        Live Bytes (excluding always-live): 95,452
        Live Vars (excluding always-live):
          <anon>                        12
          Layer1/Fwd/biases            120
          Layer1/Fwd/weights        94,080
          Layer3/Fwd/biases             40
          Layer3/Fwd/weights         1,200
        Live Bytes (excluding always-live): 95,456
        Live Vars (excluding always-live):
          <anon>                        12
          Layer1/Fwd/biases            120
          Layer1/Fwd/weights        94,080
          Layer3/Fwd/biases             40
          Layer3/Fwd/weights         1,200
          programId                      4
      DoExchange: switchControlBroadcast13/ExchangePre
        Live Bytes (excluding always-live): 100,312
        Live Vars (excluding always-live):
          <anon>                         8
          Layer1/Fwd/biases            120
          Layer1/Fwd/weights        94,080
          Layer3/Fwd/biases             40
          Layer3/Fwd/weights         1,200
          broadcastProgramId         4,860
          programId                      4
        Live Bytes (excluding always-live): 100,312
        Live Vars (excluding always-live):
          <anon>                         8
          Layer1/Fwd/biases            120
          Layer1/Fwd/weights        94,080
          Layer3/Fwd/biases             40
          Layer3/Fwd/weights         1,200
          broadcastProgramId         4,860
          programId                      4
          DoExchange: init/setMasterSeed/ExchangePre
            Live Bytes (excluding always-live): 105,168
            Live Vars (excluding always-live):
              <anon>                         8
              <message:anon>             9,720
              Layer1/Fwd/biases            120
              Layer1/Fwd/weights        94,080
              Layer3/Fwd/biases             40
              Layer3/Fwd/weights         1,200
          OnTileExecute: init/setMasterSeed
            Live Bytes (excluding always-live): 105,168
            Live Vars (excluding always-live):
              <anon>                         8
              <message:anon>             9,720
              Layer1/Fwd/biases            120
              Layer1/Fwd/weights        94,080
              Layer3/Fwd/biases             40
              Layer3/Fwd/weights         1,200


  Total cycles:                                  54,202 (approx 33.9 microseconds)
  Total compute cycles (including idle threads): 1,933,324
  Total compute cycles (excluding idle threads): 1,769,242
  Total IPU exchange cycles:                     482,221
  Total global exchange cycles:                  0
  Total host exchange cycles:                    4,194,423
  Total shared structure copy cycles:            0
  Total sync cycles:                             58,852,900
  Total tile balance:                            2.9%
  Total thread balance:                          91.5%

  Cycles by vertex type:
    poplin::ConvPartial1x1Out<float,float,true,false>                                (808 instances):      764,028
    poplar_rt::DstStridedCopyDA32                                                   (1393 instances):      509,173
    poprand::SetSeedSupervisor                                                      (1216 instances):      177,536
    poplin::ReduceAdd<float,float>                                                   (376 instances):      154,116
    popops::ScaledAddSupervisor<float,float,true>                                    (460 instances):      139,812
    poplar_rt::StridedCopyDA32                                                       (455 instances):       83,995
    poplar_rt::Memcpy64BitSupervisor                                                 (203 instances):       59,048
    popnn::NonLinearitySupervisor<float,popnn::NonLinearityType::SIGMOID>             (68 instances):       15,780
    popnn::NonLinearityGradSupervisor<float,popnn::NonLinearityType::SIGMOID>         (68 instances):       14,628
    poplar_rt::ShortMemcpy                                                            (73 instances):        5,256
    poplin::Transpose2d<float>                                                        (99 instances):        4,770
    popops::EncodeOneHot<unsigned int,float>                                          (20 instances):        4,408
    popops::Reduce<popops::ReduceAdd,float,float,false,2>                            (112 instances):        3,584
    poplar_rt::DstStridedCopy64BitMultiAccess                                         (41 instances):        2,437
    popops::ScaledReduce<popops::ReduceAdd,float,float,true,1>                        (32 instances):        1,600
    poplar_rt::StridedCopy64BitMultiAccess                                            (33 instances):        1,388
    popops::ScaledAdd2D<float,true>                                                   (39 instances):        1,371
    popnn::ReduceMaxClassGather<float,unsigned int>                                   (16 instances):        1,056
    popops::Reduce<popops::ReduceAdd,float,float,false,1>                             (16 instances):          912
    popnn::LossSumSquaredTransform<float>                                              (8 instances):          792
    poplar_rt::DstStridedMemsetZero64Bit                                              (18 instances):          740
    popops::Reduce<popops::ReduceAdd,float,float,false,3>                              (4 instances):          144
    popops::ScaledReduce<popops::ReduceAdd,float,float,true,0>                         (2 instances):          100
    popnn::CalcAccuracy<unsigned int>                                                  (1 instances):           78
    poplar_rt::MemsetZero64Bit                                                         (2 instances):           26
    poplar_rt::MemsetZero                                                              (1 instances):           15
  --- External Sync ---
  StreamCopy (cycles 118 - 118)
    Cycles:                             7
    Active Tiles:                       1 / 1,216
    Tile Balance:                       0.1%
    Active Tile Balance:                100.0%
    Total Data:                         2
    Data Balance (mean / max data per tile): 0.1%
  --- Internal Sync ---
  --- Internal Sync ---
  --- External Sync ---
  StreamCopy (cycles 472 - 472)
    Cycles:                             9
    Active Tiles:                       1 / 1,216
    Tile Balance:                       0.1%
    Active Tile Balance:                100.0%
    Total Data:                         4
    Data Balance (mean / max data per tile): 0.1%
  --- Internal Sync ---
  --- Internal Sync ---
  --- Internal Sync ---
  DoExchange (cycles 826 - 946): switchControlBroadcast13/ExchangePre
    Cycles:                             120 (0 overlapped with previous)
    Active Tiles:                       1,216 / 1,216
    Tile Balance:                       86.6%
    Active Tile Balance:                86.6%
    Total Data:                         2,432
    Data Balance (mean / max data per tile): 100.0%
  --- External Sync ---
  StreamCopy (cycles 1,064 - 1,064)
    Cycles:                             7
    Active Tiles:                       1 / 1,216
    Tile Balance:                       0.1%
    Active Tile Balance:                100.0%
    Total Data:                         2
    Data Balance (mean / max data per tile): 0.1%
  --- Internal Sync ---
  --- Internal Sync ---
  --- External Sync ---
  StreamCopy (cycles 1,418 - 12,098)
    Cycles:                             10,700
    Active Tiles:                       392 / 1,216
    Tile Balance:                       32.2%
    Active Tile Balance:                100.0%
    Total Data:                         25,122
    Data Balance (mean / max data per tile): 21.1%
  --- Internal Sync ---
  --- Internal Sync ---
  --- Internal Sync ---
  DoExchange (cycles 12,432 - 12,581): Layer1/Fwd/Conv_1/Convolve/ExchangePre
    Cycles:                             169 (20 overlapped with previous)
    Active Tiles:                       392 / 1,216
    Tile Balance:                       27.9%
    Active Tile Balance:                86.6%
    Total Data:                         125,440
    Data Balance (mean / max data per tile): 32.2%
  OnTileExecute (cycles 12,581 - 13,580): Layer1/Fwd/Conv_1/Convolve
    Cycles:                             999 (0 overlapped with previous)
    Active Tiles:                       392 / 1,216
    Thread Balance:                     100.0%
    Tile Balance:                       31.7%
    Active Tile Balance:                98.3%
    By vertex type:
      poplin::ConvPartial1x1Out<float,float,true,false>        (392 instances):      384,944

Environment Variables

There are several environment variables which you can use to control the behaviour of the Poplar SDK.


The behaviour of a Poplar program can be traced by enabling logging with the POPLAR_LOG_LEVEL environment variable. Logging for the Poplibs libraries is enabled separately with the POPLIBS_LOG_LEVEL variable.

The supported logging levels are shown in the table below.

OFF    No logging information.
ERR    Only error conditions will be reported.
WARN   Warnings when the software cannot achieve what was requested (for example, if the convolution planner cannot keep to the memory budget, or Poplar has determined that the model will not fit in memory but the debug.allowOutOfMemory option is enabled).
INFO   Very high-level information, such as Poplibs function calls.
DEBUG  Useful per-graph information.
TRACE  The most verbose level. All useful per-tile information.

All Poplar log messages are prefixed with PO. Messages from Poplibs are prefixed with PL.

The logging information is sent to standard error by default. This can be changed by setting the POPLAR_LOG_DEST or POPLIBS_LOG_DEST environment variables. The value can be “stdout”, “stderr” or a file name.
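For example, the variables above can be set in the shell before running a Poplar program. The log file name here is arbitrary and used only for illustration:

```shell
# Enable debug-level logging for Poplar and trace-level logging for Poplibs.
export POPLAR_LOG_LEVEL=DEBUG
export POPLIBS_LOG_LEVEL=TRACE

# Redirect Poplar log output to a file instead of standard error
# (the file name is an arbitrary example).
export POPLAR_LOG_DEST=poplar_log.txt

# Then run your program as usual; log messages prefixed with PO (Poplar)
# go to the file, while PL (Poplibs) messages still go to standard error.
```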

Graph report

Deprecated in favour of the debug.graphReportDestination engine option.

You can set the POPLAR_GRAPH_REPORT_DEST variable to write out a graph report when the graph has been compiled. This is equivalent to calling engine.printProfileSummary() without the execution summary.

The value of POPLAR_GRAPH_REPORT_DEST can be “stdout”, “stderr” or a file name.
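As a sketch, the variable can be set like any other environment variable; the file name below is an arbitrary example, not a required value:

```shell
# Write the graph report to a file when graph compilation completes.
# Note this mechanism is deprecated; prefer the
# debug.graphReportDestination engine option in new code.
export POPLAR_GRAPH_REPORT_DEST=graph_report.txt
```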

Setting options

The following environment variables can be used to override the option values specified in the program:


For more information, see the Poplar and Poplibs API Reference.