PopART User Guide

Introduction

The Poplar Advanced Run Time (PopART) is part of the Poplar SDK for implementing and running algorithms on networks of Graphcore IPU processors. It enables you to import models using the Open Neural Network Exchange (ONNX) and run them using the Poplar tools. ONNX is a serialisation format for neural network systems that can be created and read by several frameworks including Caffe2, PyTorch and MXNet.

This document describes the features of PopART. It assumes that you are familiar with machine learning and the ONNX framework.

An overview of the IPU architecture and programming model can be found in the IPU Programmer’s Manual. For more information on the Poplar framework refer to the Poplar and Poplibs User Guide.

PopART has three main features:

  1. It can import ONNX graphs into a runtime environment (see Importing graphs).

  2. It provides a simple interface for constructing ONNX graphs without needing a third party framework (described in Building graphs in PopART).

  3. It runs imported graphs in inference, evaluation or training modes, by building a Poplar engine, connecting data feeds and scheduling the execution of the Engine (see Executing graphs).

IPU-specific annotations on ONNX operations allow the provider of the graph to control IPU-specific features, such as mapping an algorithm across multiple IPUs.

APIs are available for C++ and Python. Most of the examples in this document use the Python API.

Importing graphs

The PopART Session class creates the runtime environment for executing graphs on IPU hardware. It can read an ONNX graph from a serialised ONNX model protobuf (ModelProto), either directly from a file or from memory. A session object can be constructed either as an InferenceSession or a TrainingSession.

Some metadata must be supplied to augment the data present in the ONNX graph in order to run it, as described below.

In the following example of importing a graph for inference, TorchVision is used to create a pre-trained AlexNet graph with a 4 x 3 x 224 x 224 input. The graph has an ONNX output called output, and the DataFlow object contains an entry to fetch that anchor.

 import popart

import torch.onnx
import torchvision

input_ = torch.randn(4, 3, 224, 224)
model = torchvision.models.alexnet(pretrained=True)

output_name = "output"

torch.onnx.export(model, input_, "alexnet.onnx", output_names=[output_name])

# Create a runtime environment
anchors = {output_name: popart.AnchorReturnType("ALL")}
dataFeed = popart.DataFlow(100, anchors)
device = popart.DeviceManager().createCpuDevice()

session = popart.InferenceSession("alexnet.onnx", dataFeed, device)

The DataFlow object is described in more detail in Executing graphs.

Creating a session

The Session class takes the name of a protobuf file, or the protobuf itself. It also takes a DataFlow object which has information about how to execute the graph:

  • The number of times to conduct a forward pass (and a backward pass, if training) of the graph on the IPU before returning to the host for more data.

  • The names of the tensors in the graph used to return the results to the host.

In some ONNX graphs, the sizes of input tensors might not be specified. In this case, the inputShapeInfo parameter can be used to specify the input shapes. The Poplar framework uses statically allocated memory buffers and so it needs to know the sizes of all tensors before compilation.

The patterns parameter allows the user to select a set of graph transformation patterns which will be applied to the graph. Without this parameter, a default set of optimisation transformations will be applied.

Other parameters to the Session object are used when you are training the network instead of performing inference. They describe the types of loss to apply to the network and the optimiser to use.

An example of creating a session object from an ONNX model is shown below.

 import popart

import torch.onnx
import torchvision

input_ = torch.randn(4, 3, 224, 224)
model = torchvision.models.alexnet(pretrained=False)

labels_name = "labels"
output_name = "output"

torch.onnx.export(model, input_, "alexnet.onnx", output_names=[output_name])

# Describe the labels input shape
inputShapeInfo = popart.InputShapeInfo()
inputShapeInfo.add(labels_name, popart.TensorInfo("INT32", [4]))

# Create a runtime environment
anchors = {output_name: popart.AnchorReturnType("ALL")}
dataFeed = popart.DataFlow(100, anchors)

losses = [popart.NllLoss(output_name, labels_name, "loss")]
optimizer = popart.ConstSGD(0.001)

# Run session on CPU
device = popart.DeviceManager().createCpuDevice()
session = popart.TrainingSession("alexnet.onnx",
                                 deviceInfo=device,
                                 dataFeed=dataFeed,
                                 losses=losses,
                                 optimizer=optimizer,
                                 inputShapeInfo=inputShapeInfo)

In this example, when the Session object is asked to train the graph, an NllLoss node will be added to the end of the graph, and a ConstSGD optimiser will be used to optimise the parameters in the network.

Session control options

The userOptions parameter passes options to the session. The available options are listed in the PopART C++ API Reference. As well as options to control specific features of the PopART session, there are also some that allow you to pass options to the underlying Poplar functions:

  • engineOptions passes options to the Poplar Engine object created to run the graph.

  • convolutionOptions passes options to the Poplibs convolution functions.

  • reportOptions controls the instrumentation and generation of profiling information.

See Retrieving profiling reports for examples of using some of these options.

Full details of the Poplar options can be found in the Poplar and Poplibs API Reference.

Building graphs in PopART

PopART has a Builder class for constructing ONNX graphs without needing a third party framework.

In the example below, a simple addition is prepared for execution. The steps involved are described in the following sections and in Executing graphs.

 import popart

builder = popart.Builder()

# Build a simple graph
i1 = builder.addInputTensor(popart.TensorInfo("FLOAT", [1, 2, 32, 32]))
i2 = builder.addInputTensor(popart.TensorInfo("FLOAT", [1, 2, 32, 32]))

o = builder.aiOnnx.add([i1, i2])

builder.addOutputTensor(o)

# Get the ONNX protobuf from the builder to pass to the Session
proto = builder.getModelProto()

# Create a runtime environment
anchors = {o : popart.AnchorReturnType("ALL")}
dataFeed = popart.DataFlow(1, anchors)
device = popart.DeviceManager().createCpuDevice()

# Create the session from the graph, data feed and device information
session = popart.InferenceSession(proto, dataFeed, device)

The DataFlow object is described in more detail in Executing graphs.

Adding operations to the graph

The builder adds operations to the graph by calling one of the many operation methods. Each of these methods has a common signature. For example, relu will add an ONNX Relu operation to the graph:

 output = builder.aiOnnx.relu([input], "debug-name")

Each method takes a list of arguments which are the input tensor names, and an optional string to assign to the node. This name is passed to the Poplar nodes and is used in debugging and profiling reports.

The operation method returns the name of the tensor that is an output of the newly added node.

In some cases other arguments are required, for instance:

 output = builder.aiOnnx.gather(['input', 'indices'], axis=1, debugPrefix="My-Gather")

Adding parameters to the graph

Parameters, for instance the weights of a convolution, are represented as initialised inputs to the graph. They can be added with the addInitializedInputTensor method:

 w_data = np.random.rand(64, 4, 3, 3).astype(np.float16)
w1 = builder.addInitializedInputTensor(w_data)

Setting outputs

The outputs of the graph should be marked appropriately, using the addOutputTensor method:

 builder.addOutputTensor(output)

Setting the IPU number for operations

When creating a graph that will run on a multi-IPU system, nodes need to be marked with an annotation to describe which IPU they will run on.

For instance, to place a specific convolution onto IPU 1:

 we = builder.addInitializedInputTensor(np.zeros([32, 4, 3, 3], np.float16))
bi = builder.addInitializedInputTensor(np.zeros([32], np.float16))
o = builder.aiOnnx.conv([x, we, bi],
                        dilations=[1, 1],
                        pads=[1, 1, 1, 1],
                        strides=[1, 1])
# place operation on IPU 1
builder.virtualGraph(o, 1)

A context manager is available for placing multiple operations together onto a specific IPU:

 builder = popart.Builder()

i1 = builder.addInputTensor(popart.TensorInfo("FLOAT", [1]))
i2 = builder.addInputTensor(popart.TensorInfo("FLOAT", [1]))
i3 = builder.addInputTensor(popart.TensorInfo("FLOAT", [1]))
i4 = builder.addInputTensor(popart.TensorInfo("FLOAT", [1]))

# place two add operations on IPU 0
with builder.virtualGraph(0):
    o1 = builder.aiOnnx.add([i1, i2])
    o2 = builder.aiOnnx.add([i3, i4])

# place one add operation on IPU 1
with builder.virtualGraph(1):
    o = builder.aiOnnx.add([o1, o2])

Alternatively, for automatic placement of nodes on available IPUs, use the session option autoVirtualGraph. See SessionOptions in the PopART C++ API Reference.

Executing graphs

The Session class is used to run graphs on an IPU device. Before the graph can be run, the way in which data will be transferred to and from the IPU must be specified. Then an IPU device can be selected to execute the graph.

Setting input/output data buffers for an execution

The PyStepIO class defines the input data for a specific execution. It takes a dictionary with the input tensor names as keys, and Python arrays for the data values. It also takes a similar dictionary of names and buffers for the output values.

A convenience method initAnchorArrays can create the output buffers and dictionary for you, given the anchors (output nodes) which were specified in the DataFlow object during session construction.

 # Create buffers to receive results from the execution
anchors = session.initAnchorArrays()

# Generate some random input data
data_a = np.random.rand(1).astype(np.float32)
data_b = np.random.rand(1).astype(np.float32)

stepio = popart.PyStepIO({'a': data_a, 'b': data_b}, anchors)

If there are any pre-defined inputs (weights, biases, etc.) in the graph then they will not be specified in the PyStepIO object. However, before executing the graph, they will need to be copied to the hardware. If there are any optimiser-specific parameters which can be modified, then these must also be written to the device. For example:

 session.weightsFromHost()
session.optimizerFromHost()

These can also be updated between executions.

 # Update learning rate parameter between training steps
stepLr = learningRate[step]
session.updateOptimizer(popart.SGD(stepLr))
session.optimizerFromHost()

Retrieving results

The DataFlow class describes how to execute the graph. The second parameter is a description of the anchors, the results to fetch from the graph.

 df = popart.DataFlow(1, {o: popart.AnchorReturnType("ALL")})

This is a Python dictionary with keys that are the names of the tensors to retrieve from the model. The associated values are an AnchorReturnType, which is one of:

  • popart.AnchorReturnType("ALL"): a vector of results is returned, one for each iteration of the graph.

  • popart.AnchorReturnType("EVERYN", N): a vector containing the tensor, but only for iterations which are divisible by N.

  • popart.AnchorReturnType("FINAL"): the value of the tensor on the final iteration through the graph.
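As a plain-Python sketch (no PopART required) of which iterations each return type captures, assuming batchesPerStep is 8 and iterations are counted from 1:

```python
# Illustration only: the selection logic of each anchor return type
# for an 8-iteration step. The real values are returned by the session.
batches_per_step = 8
iterations = list(range(1, batches_per_step + 1))

returned_all = iterations                                 # "ALL"
returned_every_4 = [i for i in iterations if i % 4 == 0]  # "EVERYN" with N=4
returned_final = iterations[-1:]                          # "FINAL"

print(returned_all)      # [1, 2, 3, 4, 5, 6, 7, 8]
print(returned_every_4)  # [4, 8]
print(returned_final)    # [8]
```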

Selecting a device for execution

The device manager allows the selection of an IPU configuration for executing the session. The device must be passed into the session constructor.

 df = popart.DataFlow(1, {o: popart.AnchorReturnType("ALL")})
device = popart.DeviceManager().createCpuDevice()
s = popart.InferenceSession("onnx.pb", deviceInfo=device, dataFeed=df)

The device manager can enumerate the available devices with the enumerateDevices method. The acquireAvailableDevice method will acquire the next available device. The first parameter specifies how many IPUs to acquire.

 # Acquire a two-IPU pair
dev = popart.DeviceManager().acquireAvailableDevice(2)

Using acquireDeviceById will select a device from the list of IPU configurations, as given by the enumerateDevices method, or by the gc-info command-line tool. This may be a single IPU or a group of IPUs.

 # Acquire IPU configuration 5
dev = popart.DeviceManager().acquireDeviceById(5)

The method createIpuModelDevice is used to create a Poplar software emulation of an IPU device. Similarly, the method createCpuDevice creates a simple Poplar CPU backend. See PopART C++ API Reference for details.

Executing a session

Once the device has been selected, the graph can be compiled for it, and loaded into the hardware. The prepareDevice method is used for this:

 session.prepareDevice()

To execute the session you need to call the session’s run method.

 session.run(stepio)

If the session was created for inference, the user is responsible for ensuring that the forward graph finishes with the appropriate operation for inference. If losses are provided to the inference session, the forward pass and the losses will be executed, and the final loss value will be returned.

If the session was created for training, any pre-initialised parameters will be updated to reflect the changes made to them by the optimiser.

Saving and loading a model

The method modelToHost writes a model with updated weights to the specified file.

 session.modelToHost("trained_model.onnx")

A file of saved parameters, for example from an earlier execution session, can be loaded into the current session.

 session.resetHostWeights("test.onnx")
session.weightsFromHost()

Retrieving profiling reports

Poplar can provide profiling information on the compilation and execution of the graph. Profiling is not enabled by default.

To get profiling reports in PopART, you will need to enable profiling in the Poplar engine. For example:

 opts = popart.SessionOptions()
opts.engineOptions = {"debug.instrument": "true"}

You can also control what information is included in the profiling report:

 opts.reportOptions = {"showExecutionSteps": "true"}

The session object has three methods for accessing the profiling information:

  • getSummaryReport retrieves a text summary of the compilation and execution of the graph.

  • getGraphReport returns a JSON format report on the compilation of the graph.

  • getExecutionReport returns a JSON format report on all executions of the graph since the last report was fetched.

If profiling is not enabled, then the summary report will say 'Execution profiling not enabled' and the execution report will contain '{"profilerMode":"NONE"}'.

Both getGraphReport and getExecutionReport can optionally return a Concise Binary Object Representation (CBOR) formatted report.

For more information on profiling control and the information returned by these functions, see the Profiling chapter of the Poplar and Poplibs User Guide.

Turning on execution tracing

PopART contains an internal logging system that can show the progress of graph compilation and execution.

Logging information is generated from the following modules:

  • session: the ONNX session (the PopART API)

  • ir: the intermediate representation

  • devicex: the Poplar backend

  • transform

  • pattern

  • builder

  • op

  • opx

  • ces

  • python

The logging levels, in decreasing verbosity, are shown below.

  • TRACE: the highest level; shows the order of method calls

  • DEBUG

  • INFO

  • WARN: warnings

  • ERR: errors

  • CRITICAL: only critical errors

  • OFF: no logging

The default is “OFF”. You can change this, and where the logging information is written to, by setting environment variables, see Environment variables.

Programming interface

You can also control the logging level for each module in your program.

For example, in Python:

 # Set all modules to DEBUG level
popart.getLogger().setLevel("DEBUG")
# Turn off logging for the session module
popart.getLogger("session").setLevel("OFF")

And in C++:

 // Set all modules to DEBUG level
popart::logger::setLevel("popart", "DEBUG");
// Turn off logging for the session module
popart::logger::setLevel("session", "OFF");

Output format

The information is output in the following format:

 [<timestamp>] [<module>] [<level>] <logging string>

For example:

 [2019-10-16 13:55:05.359] [popart:devicex] [debug] Creating poplar::Tensor 1
[2019-10-16 13:55:05.359] [popart:devicex] [debug] Creating host-to-device FIFO 1
[2019-10-16 13:55:05.359] [popart:devicex] [debug] Creating device-to-host FIFO 1

Examples

Examples are provided in the examples directory of the PopART installation.

This is a simple example of constructing a network using the builder, and performing a simple inference pass to perform an addition.

 import numpy as np
import popart

# Create a builder and construct a graph
builder = popart.Builder()

data_shape = popart.TensorInfo("FLOAT", [1])

a = builder.addInputTensor(data_shape)
b = builder.addInputTensor(data_shape)

o = builder.aiOnnx.add([a, b])

builder.addOutputTensor(o)

proto = builder.getModelProto()

# Describe how to run the model
dataFlow = popart.DataFlow(1, {o: popart.AnchorReturnType("ALL")})

# Create a session to compile and execute the graph
session = popart.InferenceSession(
    fnModel=proto,
    dataFeed=dataFlow,
    deviceInfo=popart.DeviceManager().createIpuModelDevice({}))

# Compile graph
session.prepareDevice()

# Create buffers to receive results from the execution
anchors = session.initAnchorArrays()

# Generate some random input data
data_a = np.random.rand(1).astype(np.float32)
data_b = np.random.rand(1).astype(np.float32)

stepio = popart.PyStepIO({a: data_a, b: data_b}, anchors)
session.run(stepio)

print("Input a is " + str(data_a))
print("Input b is " + str(data_b))
print("Result is " + str(anchors[o]))

Supported operators

PopART is compatible with ONNX version 1.3 (see ONNX Versioning). This section lists the supported operators.

The Graphcore (ai.graphcore) and ONNX (ai.onnx) operators, and versions supported, are listed below. See ONNX Operators for more information.

Domain: ai.graphcore

  • ConvFlipWeights-1

  • Gelu-1

  • GroupNormalization-1

  • LSTM-1

  • PrintTensor-1

  • Scale-1

  • Square-1

  • Subsample-1

Domain: ai.onnx

  • Abs-6

  • Add-6

  • Add-7

  • And-1

  • And-7

  • ArgMax-1

  • ArgMax-11

  • ArgMin-1

  • ArgMin-11

  • Asin-7

  • Atan-7

  • AveragePool-1

  • AveragePool-7

  • AveragePool-10

  • AveragePool-11

  • BatchNormalization-6

  • BatchNormalization-7

  • BatchNormalization-9

  • Cast-6

  • Cast-9

  • Ceil-1

  • Ceil-6

  • Clip-6

  • Clip-11

  • Concat-1

  • Concat-4

  • Concat-11

  • Conv-1

  • Conv-11

  • Cos-7

  • Cosh-9

  • Div-6

  • Div-7

  • Dropout-6

  • Dropout-7

  • Dropout-10

  • Equal-1

  • Equal-7

  • Equal-11

  • Exp-6

  • Flatten-1

  • Flatten-9

  • Flatten-11

  • Floor-1

  • Floor-6

  • Gather-1

  • Gather-11

  • Gemm-6

  • Gemm-7

  • Gemm-9

  • Gemm-11

  • GlobalAveragePool-1

  • GlobalMaxPool-1

  • Greater-1

  • Greater-7

  • Greater-9

  • Identity-1

  • If-1

  • If-11

  • InstanceNormalization-6

  • IsInf-10

  • IsNaN-9

  • LRN-1

  • LSTM-1

  • LSTM-7

  • Less-7

  • Less-9

  • Log-6

  • LogSoftmax-1

  • LogSoftmax-11

  • MatMul-1

  • MatMul-9

  • Max-6

  • Max-8

  • MaxPool-1

  • MaxPool-8

  • MaxPool-10

  • MaxPool-11

  • Mean-6

  • Mean-8

  • Min-6

  • Min-8

  • Mul-6

  • Mul-7

  • Neg-6

  • Not-1

  • OneHot-9

  • OneHot-11

  • Or-1

  • Or-7

  • Pad-2

  • Pad-11

  • Pow-1

  • Pow-7

  • Reciprocal-6

  • ReduceL1-1

  • ReduceL1-11

  • ReduceL2-1

  • ReduceL2-11

  • ReduceLogSum-1

  • ReduceLogSum-11

  • ReduceLogSumExp-1

  • ReduceLogSumExp-11

  • ReduceMax-1

  • ReduceMax-11

  • ReduceMean-1

  • ReduceMean-11

  • ReduceMin-1

  • ReduceMin-11

  • ReduceProd-1

  • ReduceProd-11

  • ReduceSum-1

  • ReduceSum-11

  • ReduceSumSquare-1

  • ReduceSumSquare-11

  • Relu-6

  • Reshape-5

  • Scatter-9

  • Shrink-9

  • Sigmoid-6

  • Sign-9

  • Sin-7

  • Sinh-9

  • Slice-1

  • Slice-10

  • Slice-11

  • Softmax-1

  • Softmax-11

  • Split-2

  • Split-11

  • Sqrt-6

  • Squeeze-1

  • Squeeze-11

  • Sub-6

  • Sub-7

  • Sum-6

  • Sum-8

  • Tan-7

  • Tanh-6

  • Tile-1

  • Tile-6

  • TopK-1

  • TopK-10

  • TopK-11

  • Transpose-1

  • Unsqueeze-1

  • Unsqueeze-11

Environment variables

There are several environment variables which you can use to control the behaviour of PopART.

Logging

PopART can output information about its activity as described in Turning on execution tracing. You can control the default level of logging information using environment variables.

POPART_LOG_LEVEL

This controls the amount of information written to the log output for all modules. Finer control can be achieved using POPART_LOG_CONFIG.
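For example, to set the default level for all modules to DEBUG:

```shell
export POPART_LOG_LEVEL=DEBUG
```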

POPART_LOG_DEST

This variable defines the output for the logging information. The value can be “stdout”, “stderr” or a file name.

The default, if not defined, is “stderr”.
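For example, to send logging output to standard output instead:

```shell
export POPART_LOG_DEST=stdout
```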

POPART_LOG_CONFIG

If set, this variable defines the name of a configuration file which specifies the logging level for each module. This is a JSON format file with pairs of module:level strings. For example, a file called conf.py can be specified by setting the environment variable:

 export POPART_LOG_CONFIG=conf.py

To set the logging level of the devicex and session modules, conf.py would contain:

 {
  "devicex":"INFO",
  "session":"WARN"
}

These values override the value specified in POPART_LOG_LEVEL.

Generating DOT files

POPART_DOT_CHECKS

PopART can output a graphical representation of the graph, in DOT format, when it constructs the intermediate representation (IR). The stages of IR construction at which the DOT files are generated are controlled by this variable.

Supported values:

  • FWD0

  • FWD1

  • BWD0

  • PREALIAS

  • FINAL

These values may be combined using “:” as a separator. The example below shows how to set POPART_DOT_CHECKS to export DOT graphs for the FWD0 and FINAL stages.

 export POPART_DOT_CHECKS=FWD0:FINAL

The values in POPART_DOT_CHECKS will be combined with any values that are defined in the session options.

Saving the tensor map

POPART_TENSOR_TILE_MAP

The mapping of tensors to tiles in the session can be saved to a file by setting this variable to the name of a file. The tensor tile map will be written in JSON format.

The tensor tile map will be saved when you call Session::prepareDevice. For example, to save the tensor tile map to ttm.js you would set the variable as shown:

 export POPART_TENSOR_TILE_MAP=ttm.js

Inspecting the IR

POPART_IR_DUMP

If set, this variable defines the name of a file where the serialised IR will be written. The IR will be written either at the end of the IR preparation phase, or when an exception is thrown during IR preparation.

Glossary

Sample

The smallest division of a data set.

Micro-batch size

The number of samples processed in a single execution of a graph on a single device. Also referred to as the machine batch size. The micro-batch shape, or the shape of input data as defined in the ONNX model, is therefore [micro_batch_size, *sample_shape].
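As a concrete illustration in plain Python, using the values from the AlexNet examples earlier in this guide:

```python
# Micro-batch shape: [micro_batch_size, *sample_shape]
micro_batch_size = 4
sample_shape = (3, 224, 224)  # channels, height, width

micro_batch_shape = (micro_batch_size, *sample_shape)
print(micro_batch_shape)  # (4, 3, 224, 224)
```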

Replication factor

The number of graphs to be run in parallel over multiple devices. The weight gradients from each device will be accumulated before a weight update. Also referred to as “device replication factor” or “spatial replication factor”. This is sometimes called data-parallel execution.

Accumulation factor

The weight gradients will be accumulated over this number of micro-batches in series before a weight update. Also referred to as “temporal replication factor”.

Accumulation can be thought of as doing replication on a single device.

Batch size

This is defined as micro-batch size * replication factor * accumulation factor. This is the number of samples per weight update.

Batches per step

The number of batches to run in a single call to Session::run.

Step size

This is defined as batch size * batches per step. This is the number of samples per step.
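These definitions can be checked with simple arithmetic; the factor values below are illustrative only, not defaults:

```python
# Illustrative values only
micro_batch_size = 4
replication_factor = 2
accumulation_factor = 8
batches_per_step = 10

# Batch size: the number of samples per weight update
batch_size = micro_batch_size * replication_factor * accumulation_factor
print(batch_size)  # 64

# Step size: the number of samples per Session::run call
step_size = batch_size * batches_per_step
print(step_size)  # 640
```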

Input data shape

Inputs to a session.run() call are read in with the assumption that data is arranged in the shape:

[batches_per_step, accl_factor, repl_factor, micro_batch_size, *sample_shape]

However, there is no constraint on the shape of the input array, except that it contains the correct number of elements.
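A NumPy sketch, with illustrative factor values, of preparing an input buffer either in the full shape or flattened; only the element count matters:

```python
import numpy as np

# Illustrative factor values (not defaults)
batches_per_step = 2
accl_factor = 2
repl_factor = 2
micro_batch_size = 4
sample_shape = (3, 224, 224)

# The full input shape expected by a session.run() call
full_shape = (batches_per_step, accl_factor, repl_factor,
              micro_batch_size) + sample_shape
data = np.zeros(full_shape, dtype=np.float32)

# A flat array with the same number of elements is equally acceptable
flat = np.zeros(data.size, dtype=np.float32)
assert flat.size == data.size
```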