PopART User Guide

Introduction

The Poplar Advanced Run Time (PopART) is part of the Poplar SDK for implementing and running algorithms on networks of Graphcore IPU processors. It enables you to import models using the Open Neural Network Exchange (ONNX) and run them using the Poplar tools. ONNX is a serialisation format for neural network systems that can be created and read by several frameworks including Caffe2, PyTorch and MXNet.

This document describes the features of PopART. It assumes that you are familiar with machine learning and the ONNX framework.

An overview of the IPU architecture and programming model can be found in the IPU Programmer’s Manual. For more information on the Poplar framework refer to the Poplar and Poplibs User Guide.

PopART has three main features:

  1. It can import ONNX graphs into a runtime environment (see Importing graphs).
  2. It provides a simple interface for constructing ONNX graphs without needing a third party framework (described in Building graphs in PopART).
  3. It runs imported graphs in inference, evaluation or training modes, by building a Poplar engine, connecting data feeds and scheduling the execution of the engine (see Executing graphs).

IPU-specific annotations on ONNX operations allow the provider of the graph to control IPU-specific features, such as mapping an algorithm across multiple IPUs.

APIs are available for C++ and Python. Most of the examples in this document use the Python API.

Importing graphs

The PopART Session class creates the runtime environment for executing graphs on IPU hardware. It can read an ONNX graph from a serialised ONNX model protobuf (ModelProto), either directly from a file or from memory. A session object can be constructed either as an InferenceSession or a TrainingSession.

Some metadata must be supplied to augment the data present in the ONNX graph in order to run it, as described below.

In the following example of importing a graph for inference, TorchVision is used to create a pre-trained AlexNet graph, with a 4 x 3 x 224 x 224 input. The graph has an ONNX output called output, and the DataFlow object contains an entry to fetch that anchor.

 # Copyright (c) 2020 Graphcore Ltd. All rights reserved.
import popart

import torch.onnx
import torchvision

input_ = torch.FloatTensor(torch.randn(4, 3, 224, 224))
model = torchvision.models.alexnet(pretrained=True)

output_name = "output"

torch.onnx.export(model, input_, "alexnet.onnx", output_names=[output_name])

# Create a runtime environment
anchors = {output_name: popart.AnchorReturnType("ALL")}
dataFeed = popart.DataFlow(100, anchors)
device = popart.DeviceManager().createCpuDevice()

session = popart.InferenceSession("alexnet.onnx", dataFeed, device)

The DataFlow object is described in more detail in Executing graphs.

Creating a session

The Session class takes the name of a protobuf file, or the protobuf itself. It also takes a DataFlow object which has information about how to execute the graph:

  • The number of times to conduct a forward pass (and a backward pass, if training) of the graph on the IPU before returning to the host for more data.
  • The names of the tensors in the graph used to return the results to the host.

In some ONNX graphs, the sizes of input tensors might not be specified. In this case, the inputShapeInfo parameter can be used to specify the input shapes. The Poplar framework uses statically allocated memory buffers, so it needs to know the sizes of all tensors before compilation.

The patterns parameter allows the user to select a set of graph transformation patterns which will be applied to the graph. Without this parameter, a default set of optimisation transformations will be applied.
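
For example, a minimal sketch of selecting patterns explicitly (the PatternsLevel value and the patterns keyword argument are assumptions for this release; check the PopART C++ API Reference for the exact names):

 # Sketch: explicitly request the default set of patterns
patterns = popart.Patterns(popart.PatternsLevel.DEFAULT)  # assumed enum value
session = popart.InferenceSession("model.onnx", dataFeed, device,
                                  patterns=patterns)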

Other parameters to the Session object are used when you are training the network instead of performing inference. They describe the types of loss to apply to the network and the optimiser to use.

An example of creating a session object from an ONNX model is shown below.

 # Copyright (c) 2020 Graphcore Ltd. All rights reserved.
import popart

import torch.onnx
import torchvision

input_ = torch.FloatTensor(torch.randn(4, 3, 224, 224))
model = torchvision.models.alexnet(pretrained=False)

labels_name = "labels"
output_name = "output"

torch.onnx.export(model, input_, "alexnet.onnx", output_names=[output_name])

# Describe the labels input shape
inputShapeInfo = popart.InputShapeInfo()
inputShapeInfo.add(labels_name, popart.TensorInfo("INT32", [4]))

# Create a runtime environment
anchors = {output_name: popart.AnchorReturnType("ALL")}
dataFeed = popart.DataFlow(100, anchors)

losses = [popart.NllLoss(output_name, labels_name, "loss")]
optimizer = popart.ConstSGD(0.001)

# Run session on CPU
device = popart.DeviceManager().createCpuDevice()
session = popart.TrainingSession("alexnet.onnx",
                                 deviceInfo=device,
                                 dataFeed=dataFeed,
                                 losses=losses,
                                 optimizer=optimizer,
                                 inputShapeInfo=inputShapeInfo)

In this example, when the Session object is asked to train the graph, an NllLoss node will be added to the end of the graph, and a ConstSGD optimiser will be used to optimise the parameters in the network.

Session control options

The userOptions parameter passes options to the session. The available options are listed in the PopART C++ API Reference. As well as options to control specific features of the PopART session, there are also some that allow you to pass options to the underlying Poplar functions:

  • engineOptions passes options to the Poplar Engine object created to run the graph.
  • convolutionOptions passes options to the Poplibs convolution functions.
  • reportOptions controls the instrumentation and generation of profiling information.

See Retrieving profiling reports for examples of using some of these options.

Full details of the Poplar options can be found in the Poplar and Poplibs API Reference.

Building graphs in PopART

PopART has a Builder class for constructing ONNX graphs without needing a third party framework.

In the example below, a simple addition is prepared for execution. The steps involved are described in the following sections and in Executing graphs.

 import popart

builder = popart.Builder()

# Build a simple graph
i1 = builder.addInputTensor(popart.TensorInfo("FLOAT", [1, 2, 32, 32]))
i2 = builder.addInputTensor(popart.TensorInfo("FLOAT", [1, 2, 32, 32]))

o = builder.aiOnnx.add([i1, i2])

builder.addOutputTensor(o)

# Get the ONNX protobuf from the builder to pass to the Session
proto = builder.getModelProto()

# Create a runtime environment
anchors = {o : popart.AnchorReturnType("ALL")}
dataFeed = popart.DataFlow(1, anchors)
device = popart.DeviceManager().createCpuDevice()

# Create the session from the graph, data feed and device information
session = popart.InferenceSession(proto, dataFeed, device)

The DataFlow object is described in more detail in Executing graphs.

Adding operations to the graph

The builder adds operations to the graph by calling one of the many operation methods. Each of these methods has a common signature. For example, relu will add an ONNX Relu operation to the graph:

 output = builder.aiOnnx.relu([input], "debug-name")

They take a list of arguments which are the input tensor names, and an optional string to assign to the node. This name is passed to the Poplar nodes and used in debugging and profiling reports.

The operation method returns the name of the tensor that is an output of the newly added node.

In some cases other arguments are required, for instance:

 output = builder.aiOnnx.gather(['input', 'indices'], axis=1, debugPrefix="My-Gather")

Adding parameters to the graph

Parameters, for instance the weights of a convolution, are represented as initialised inputs to the graph. They can be added with the addInitializedInputTensor method:

 w_data = np.random.rand(64, 4, 3, 3).astype(np.float16)
w1 = builder.addInitializedInputTensor(w_data)

Setting outputs

The outputs of the graph should be marked appropriately, using the addOutputTensor method:

 builder.addOutputTensor(output)

Setting the IPU number for operations

When creating a graph which will run on a multi-IPU system, nodes need to be annotated to indicate the IPU on which they will run.

For instance, to place a specific convolution onto IPU 1:

 we = builder.addInitializedInputTensor(np.zeros([32, 4, 3, 3], np.float16))
bi = builder.addInitializedInputTensor(np.zeros([32], np.float16))
o = builder.aiOnnx.conv([x, we, bi],
                        dilations=[1, 1],
                        pads=[1, 1, 1, 1],
                        strides=[1, 1])
# place operation on IPU 1
builder.virtualGraph(o, 1)

A context manager is available for placing multiple operations together onto a specific IPU:

 builder = popart.Builder()

i1 = builder.addInputTensor(popart.TensorInfo("FLOAT", [1]))
i2 = builder.addInputTensor(popart.TensorInfo("FLOAT", [1]))
i3 = builder.addInputTensor(popart.TensorInfo("FLOAT", [1]))
i4 = builder.addInputTensor(popart.TensorInfo("FLOAT", [1]))

# place two add operations on IPU 0
with builder.virtualGraph(0):
    o1 = builder.aiOnnx.add([i1, i2])
    o2 = builder.aiOnnx.add([i3, i4])

# place one add operation on IPU 1
with builder.virtualGraph(1):
    o = builder.aiOnnx.add([o1, o2])

Alternatively, for automatic placement of nodes on available IPUs, use the session option autoVirtualGraph. See SessionOptions in the PopART C++ API Reference.
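
A minimal sketch, assuming the boolean session options enableVirtualGraphs and autoVirtualGraph in this release:

 # Sketch: enable automatic placement of nodes on available IPUs
opts = popart.SessionOptions()
opts.enableVirtualGraphs = True  # assumed option name in this release
opts.autoVirtualGraph = True     # assumed option name in this release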

Executing graphs

The Session class is used to run graphs on an IPU device. Before the graph can be run, the way in which data will be transferred to and from the IPU must be specified. Then an IPU device can be selected to execute the graph.

Setting input/output data buffers for an execution

The PyStepIO class defines the input data for a specific execution. It takes a dictionary with the input tensor names as keys, and Python arrays for the data values. It also takes a similar dictionary of names and buffers for the output values.

A convenience method initAnchorArrays can create the output buffers and dictionary for you, given the anchors (output nodes) which were specified in the DataFlow object during session construction.

 # Create buffers to receive results from the execution
anchors = session.initAnchorArrays()

# Generate some random input data
data_a = np.random.rand(1).astype(np.float32)
data_b = np.random.rand(1).astype(np.float32)

stepio = popart.PyStepIO({'a': data_a, 'b': data_b}, anchors)

If there are any pre-defined inputs (weights, biases, etc.) in the graph then they will not be specified in the PyStepIO object. However, before executing the graph, they will need to be copied to the hardware. If there are any optimiser-specific parameters which can be modified, then these must also be written to the device. For example:

 session.weightsFromHost()
session.optimizerFromHost()

These can also be updated between executions.

 # Update learning rate parameter between training steps
stepLr = learningRate[step]
session.updateOptimizer(popart.SGD(stepLr))
session.optimizerFromHost()

Retrieving results

The DataFlow class describes how to execute the graph. Its first parameter is the number of batches to process before returning control to the host; the second is a description of the anchors, the results to fetch from the graph.

 df = popart.DataFlow(1, {o: popart.AnchorReturnType("ALL")})

This is a Python dictionary with keys that are the names of the tensors to retrieve from the model. The associated values are an AnchorReturnType, which is one of:

  • popart.AnchorReturnType("ALL"): a vector of results is returned, one for each iteration of the graph.
  • popart.AnchorReturnType("EVERYN", N): a vector containing the tensor, but only for iterations which are divisible by N.
  • popart.AnchorReturnType("FINAL"): the value of the tensor on the final iteration through the graph.

Selecting a device for execution

The device manager allows the selection of an IPU configuration for executing the session. The device must be passed into the session constructor.

 df = popart.DataFlow(1, {o: popart.AnchorReturnType("ALL")})
device = popart.DeviceManager().createCpuDevice()
s = popart.InferenceSession("onnx.pb", deviceInfo=device, dataFeed=df)

The device manager can enumerate the available devices with the enumerateDevices method. The acquireAvailableDevice method will acquire the next available device. The first parameter specifies how many IPUs to acquire.

 # Acquire a two-IPU pair
dev = popart.DeviceManager().acquireAvailableDevice(2)
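
You can also list the available devices before acquiring one. For example:

 # List the available IPU devices and configurations
for device in popart.DeviceManager().enumerateDevices():
    print(device)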

Using acquireDeviceById will select a device from the list of IPU configurations, as given by the enumerateDevices method, or by the gc-info command-line tool. This may be a single IPU or a group of IPUs.

 # Acquire IPU configuration 5
dev = popart.DeviceManager().acquireDeviceById(5)

The method createIpuModelDevice is used to create a Poplar software emulation of an IPU device. Similarly, the method createCpuDevice creates a simple Poplar CPU backend. See PopART C++ API Reference for details.
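
For example, an IPU Model device can be created with an options dictionary (the keys shown here are the same ones used in the full example later in this document):

 # Create an IPU Model device with one IPU of 1216 tiles
device = popart.DeviceManager().createIpuModelDevice(
    {"numIPUs": 1, "tilesPerIPU": 1216})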

Executing a session

Once the device has been selected, the graph can be compiled for it, and loaded into the hardware. The prepareDevice method is used for this:

 session.prepareDevice()

To execute the session you need to call the session’s run method.

 session.run(stepio)

If the session is created for inference, the user is responsible for ensuring that the forward graph finishes with the appropriate operation for an inference. If losses are provided to the inference session, the forward pass and the losses will be executed, and the final loss value will be returned.

If the session was created for training, any pre-initialised parameters will be updated to reflect the changes made to them by the optimiser.

Saving and loading a model

The method modelToHost writes a model with updated weights to the specified file.

 session.modelToHost("trained_model.onnx")

A file of saved parameters, for example from an earlier execution session, can be loaded into the current session.

 session.resetHostWeights("test.onnx")
session.weightsFromHost()

Retrieving profiling reports

Poplar can provide profiling information on the compilation and execution of the graph. Profiling is not enabled by default.

To get profiling reports in PopART, you will need to enable profiling in the Poplar engine. For example:

 opts = popart.SessionOptions()
opts.engineOptions = {"debug.instrument": "true"}

You can also control what information is included in the profiling report:

 opts.reportOptions = {"showExecutionSteps": "true"}

The session object has three methods for accessing the profiling information:

  • getSummaryReport retrieves a text summary of the compilation and execution of the graph.
  • getGraphReport returns a JSON format report on the compilation of the graph.
  • getExecutionReport returns a JSON format report on all executions of the graph since the last report was fetched.

If profiling is not enabled, then the summary report will say 'Execution profiling not enabled' and the execution report will contain '{"profilerMode":"NONE"}'.

Both getGraphReport and getExecutionReport can optionally return a Concise Binary Object Representation (CBOR) formatted report.
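
For example, after running the session you can print the text summary directly:

 # Print a summary of graph compilation and execution
print(session.getSummaryReport())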

For more information on profiling control and the information returned by these functions, see the Profiling chapter of the Poplar and Poplibs User Guide.

Turning on execution tracing

PopART contains an internal logging system that can show the progress of graph compilation and execution.

Logging information is generated from the following modules:

session The ONNX session (the PopART API)
ir The intermediate representation
devicex The Poplar backend
transform The graph transformations applied to the intermediate representation
pattern The graph patterns applied to the intermediate representation
builder The graph builder
op The operations in the intermediate representation
opx The Poplar implementations of the operations
ces Constant expression evaluation
python The Python API bindings

The logging levels, in decreasing verbosity, are shown below.

TRACE The highest level, shows the order of method calls
DEBUG Detailed debugging information
INFO General information
WARN Warnings
ERR Errors
CRITICAL Only critical errors
OFF No logging

The default is “OFF”. You can change this, and where the logging information is written to, by setting environment variables (see Environment variables).

Programming interface

You can also control the logging level for each module in your program.

For example, in Python:

 # Set all modules to DEBUG level
popart.getLogger().setLevel("DEBUG")
# Turn off logging for the session module
popart.getLogger("session").setLevel("OFF")

And in C++:

 // Set all modules to DEBUG level
popart::logger::setLevel("popart", "DEBUG");
// Turn off logging for the session module
popart::logger::setLevel("session", "OFF");

Output format

The information is output in the following format:

 [<timestamp>] [<module>] [<level>] <logging string>

For example:

 [2019-10-16 13:55:05.359] [popart:devicex] [debug] Creating poplar::Tensor 1
[2019-10-16 13:55:05.359] [popart:devicex] [debug] Creating host-to-device FIFO 1
[2019-10-16 13:55:05.359] [popart:devicex] [debug] Creating device-to-host FIFO 1

Examples

Examples are provided in the examples directory of the PopART installation.

This is a simple example of constructing a network using the builder, and running a single inference step that performs an addition.

 # Copyright (c) 2018 Graphcore Ltd. All rights reserved.
import numpy as np
import popart

# Create a builder and construct a graph
builder = popart.Builder()

data_shape = popart.TensorInfo("FLOAT", [1])

a = builder.addInputTensor(data_shape)
b = builder.addInputTensor(data_shape)

o = builder.aiOnnx.add([a, b])

builder.addOutputTensor(o)

proto = builder.getModelProto()

# Describe how to run the model
dataFlow = popart.DataFlow(1, {o: popart.AnchorReturnType("ALL")})

# Create a session to compile and execute the graph
session = popart.InferenceSession(
    fnModel=proto,
    dataFeed=dataFlow,
    deviceInfo=popart.DeviceManager().createIpuModelDevice({}))

# Compile graph
session.prepareDevice()

# Create buffers to receive results from the execution
anchors = session.initAnchorArrays()

# Generate some random input data
data_a = np.random.rand(1).astype(np.float32)
data_b = np.random.rand(1).astype(np.float32)

stepio = popart.PyStepIO({a: data_a, b: data_b}, anchors)
session.run(stepio)

print("Input a is " + str(data_a))
print("Input b is " + str(data_b))
print("Result is " + str(anchors[o]))

Distributed training with Horovod

To scale training with PopART across multiple machines, Horovod is used to set up and run collective operations. There is currently support for the following MPI-based collective operations: Broadcast and AllReduce. The Broadcast operation is typically run at the start of training to initialise the weights to the same values across all instances. Gradients produced during the backward pass are aggregated and averaged across the instances by the AllReduce operation, which ensures that each rank applies the same gradients to its weights during the weight update step.

How to modify a PopART program for distributed training

Import the Horovod PopART extension:

 import horovod.popart as hvd

Enable the hostAllReduce PopART session option:

 userOpts = popart.SessionOptions()

# Enable host side AllReduce operations in the graph
userOpts.hostAllReduce = True

Initialise the Horovod runtime:

 hvd.init()

Initialise the Horovod DistributedOptimizer object. The constructor takes the PopART optimiser, training session and session options objects as arguments. The DistributedOptimizer object will add operations to copy gradients into and out of the IPU and run the Horovod AllReduce operation:

 distributed_optimizer = hvd.DistributedOptimizer(optimizer, training.session, userOpts)

Broadcast the initial weights from the rank zero process to the other PopART instances:

 hvd.broadcast_weights(training.session, root_rank=0)

Install

Requirements for installing the Horovod PopART extension can be found here: Horovod install.

Configuring and running distributed training

Running distributed training with the Horovod PopART extension can be done in the same way as with other frameworks. For instance, running distributed training across two processes on the same machine can be done with the following command:

 $ horovodrun -np 2 -H localhost:2 python train.py

Additional documentation on running Horovod can be found here: Horovod documentation.

Full distributed training example

 import numpy as np
import os
from collections import namedtuple

# import the PopART Horovod extension
import horovod.popart as hvd
import popart

Session = namedtuple('Session', ['session', 'anchors'])
batch_size = 1
IN_SHAPE = 784
OUT_SHAPE = 10


def create_model():
    builder = popart.Builder()
    dtype = np.float32

    np.random.seed(42)
    input_shape = popart.TensorInfo(dtype, [batch_size, IN_SHAPE])
    x = builder.addInputTensor(input_shape)
    init_weights = np.random.normal(0, 1, [IN_SHAPE, OUT_SHAPE]).astype(dtype)
    w = builder.addInitializedInputTensor(init_weights)
    init_biases = np.random.normal(0, 1, [OUT_SHAPE]).astype(dtype)
    b = builder.addInitializedInputTensor(init_biases)
    h = builder.aiOnnx.matmul([x, w])
    a = builder.aiOnnx.add([h, b])

    output = a
    builder.addOutputTensor(output)
    probs = builder.aiOnnx.softmax([output])

    label_shape = popart.TensorInfo("INT32", [batch_size])
    label = builder.addInputTensor(label_shape)

    loss = popart.NllLoss(probs, label, "nllLossVal")

    proto = builder.getModelProto()

    return builder, proto, x, label, output, loss


def get_device(simulation=True):
    num_ipus = 1
    deviceManager = popart.DeviceManager()
    if simulation:
        print("Creating ipu sim")
        ipu_options = {
            "compileIPUCode": True,
            'numIPUs': num_ipus,
            "tilesPerIPU": 1216
        }
        device = deviceManager.createIpuModelDevice(ipu_options)
        if device is None:
            raise OSError("Failed to acquire IPU.")
    else:
        print("Aquiring IPU")
        device = deviceManager.acquireAvailableDevice(num_ipus)
        if device is None:
            raise OSError("Failed to acquire IPU.")
        else:
            print("Acquired IPU: {}".format(device))

    return device


def init_session(proto, loss, dataFlow, userOpts, device):
    # Create a session to compile and execute the graph
    optimizer = popart.SGD({"defaultLearningRate": (0.1, False)})
    session = popart.TrainingSession(fnModel=proto,
                                     losses=[loss],
                                     deviceInfo=device,
                                     optimizer=optimizer,
                                     dataFeed=dataFlow,
                                     userOptions=userOpts)

    session.prepareDevice()
    session.setRandomSeed(42)

    # Create buffers to receive results from the execution
    anchors = session.initAnchorArrays()

    return Session(session, anchors), optimizer


def train():
    builder, proto, data_in, labels_in, output, loss = create_model()

    batches_per_step = 32
    anchor_desc = {
        output: popart.AnchorReturnType("ALL"),
        loss.output(0): popart.AnchorReturnType("ALL")
    }
    dataFlow = popart.DataFlow(batches_per_step, anchor_desc)

    userOpts = popart.SessionOptions()
    device = get_device()

    # Enable host side AllReduce operations in the graph
    userOpts.hostAllReduce = True
    training, optimizer = init_session(proto, loss, dataFlow, userOpts, device)
    if userOpts.hostAllReduce:
        hvd.init()

        distributed_optimizer = hvd.DistributedOptimizer(
            optimizer, training.session, userOpts)

        # Broadcast weights to all the other processes
        hvd.broadcast_weights(training.session, root_rank=0)

    training.session.weightsFromHost()
    training.session.optimizerFromHost()

    # Synthetic data
    data = np.random.normal(size=(batches_per_step, batch_size, 784)).astype(
        np.float32)
    labels = np.zeros((batches_per_step, batch_size, 1)).astype(np.int32)

    num_training_steps = 10

    for _ in range(num_training_steps):
        stepio = popart.PyStepIO({
            data_in: data,
            labels_in: labels
        }, training.anchors)
        training.session.run(stepio)


train()

Performance optimisation

Sync configuration

In a multi-IPU system, synchronisation (sync) signals are used to ensure that IPUs are ready to exchange data and that data exchange is complete. These sync signals are also used to synchronise host transfers and access to remote buffers.

Each IPU can be allocated to one or more “sync groups”. At a synchronisation point, all the IPUs in a sync group will wait until all the other IPUs in the group are ready.

Sync groups can be used to allow subsets of IPUs to overlap their operations. For example, one sync group can be performing data transfers to or from the host, while another group is processing a previous batch of data.

You can configure the sync groups using the PopART syncPatterns option when creating a device.

For example, the following code shows how to set the sync configuration to “ping-pong” mode.

 sync_pattern = popart.SyncPattern.Full
if args.execution_mode == "PINGPONG":
    sync_pattern = popart.SyncPattern.PingPong
device = popart.DeviceManager().acquireAvailableDevice(
    request_ipus,
    1216,
    pattern=sync_pattern)

Sync patterns

There are three sync patterns available. These control how the IPUs are allocated to the two sync groups, GS1 and GS2.

The sync patterns are described with reference to the diagram below, which shows four IPUs: A, B, C and D.

[Figure: Sync patterns in PopART]

  • Full: All four IPUs are in both sync groups. Any communication between the IPUs, or with the host, will require all IPUs to synchronise.

  • SinglePipeline: One sync group contains all four of the IPUs. So any communication using that sync group will synchronise all the IPUs.

    The other sync group is used separately by each IPU. This means that they can each sync with the host independently, without syncing with each other. This allows any IPU to be doing host IO, for example, while others are processing data.

  • PingPong: One sync group contains all the IPUs. The other sync group is used independently by sets of IPUs, for example A+C and B+D. This means that each subset can communicate independently of each other. The two groups of IPUs can then alternate between host I/O and processing.

For more information on how the sync groups are used by the Poplar framework, please refer to the Poplar and Poplibs User Guide.

Supported operators

PopART is compatible with ONNX version 1.3 (see ONNX Versioning). This section lists the supported operators.

The Graphcore (ai.graphcore) and ONNX (ai.onnx) operators, and versions supported, are listed below. See ONNX Operators for more information.

Domain: ai.graphcore

  • CacheLoad-1
  • CacheStore-1
  • Call-1
  • ConvFlipWeights-1
  • DynamicAdd-1
  • DynamicSlice-1
  • DynamicUpdate-1
  • DynamicZero-1
  • Gelu-1
  • GroupNormalization-1
  • Init-1
  • LSTM-1
  • PrintTensor-1
  • Scale-1
  • Square-1
  • Subsample-1

Domain: ai.onnx

  • Abs-6
  • Add-6
  • Add-7
  • And-1
  • And-7
  • ArgMax-1
  • ArgMax-11
  • ArgMin-1
  • ArgMin-11
  • Asin-7
  • Atan-7
  • AveragePool-1
  • AveragePool-7
  • AveragePool-10
  • AveragePool-11
  • BatchNormalization-6
  • BatchNormalization-7
  • BatchNormalization-9
  • Cast-6
  • Cast-9
  • Ceil-1
  • Ceil-6
  • Clip-6
  • Clip-11
  • Concat-1
  • Concat-4
  • Concat-11
  • Conv-1
  • Conv-11
  • Cos-7
  • Cosh-9
  • Div-6
  • Div-7
  • Dropout-6
  • Dropout-7
  • Dropout-10
  • Equal-1
  • Equal-7
  • Equal-11
  • Exp-6
  • Expand-8
  • Flatten-1
  • Flatten-9
  • Flatten-11
  • Floor-1
  • Floor-6
  • Gather-1
  • Gather-11
  • Gemm-6
  • Gemm-7
  • Gemm-9
  • Gemm-11
  • GlobalAveragePool-1
  • GlobalMaxPool-1
  • Greater-1
  • Greater-7
  • Greater-9
  • Identity-1
  • If-1
  • If-11
  • InstanceNormalization-6
  • IsInf-10
  • IsNaN-9
  • LRN-1
  • LSTM-1
  • LSTM-7
  • LeakyRelu-1
  • LeakyRelu-6
  • Less-7
  • Less-9
  • Log-6
  • LogSoftmax-1
  • LogSoftmax-11
  • Loop-1
  • Loop-11
  • MatMul-1
  • MatMul-9
  • Max-6
  • Max-8
  • MaxPool-1
  • MaxPool-8
  • MaxPool-10
  • MaxPool-11
  • Mean-6
  • Mean-8
  • Min-6
  • Min-8
  • Mul-6
  • Mul-7
  • Neg-6
  • Not-1
  • OneHot-9
  • OneHot-11
  • Or-1
  • Or-7
  • Pad-2
  • Pad-11
  • Pow-1
  • Pow-7
  • Reciprocal-6
  • ReduceL1-1
  • ReduceL1-11
  • ReduceL2-1
  • ReduceL2-11
  • ReduceLogSum-1
  • ReduceLogSum-11
  • ReduceLogSumExp-1
  • ReduceLogSumExp-11
  • ReduceMax-1
  • ReduceMax-11
  • ReduceMean-1
  • ReduceMean-11
  • ReduceMin-1
  • ReduceMin-11
  • ReduceProd-1
  • ReduceProd-11
  • ReduceSum-1
  • ReduceSum-11
  • ReduceSumSquare-1
  • ReduceSumSquare-11
  • Relu-6
  • Reshape-5
  • Scatter-9
  • Shrink-9
  • Sigmoid-6
  • Sign-9
  • Sin-7
  • Sinh-9
  • Slice-1
  • Slice-10
  • Slice-11
  • Softmax-1
  • Softmax-11
  • Split-2
  • Split-11
  • Sqrt-6
  • Squeeze-1
  • Squeeze-11
  • Sub-6
  • Sub-7
  • Sum-6
  • Sum-8
  • Tan-7
  • Tanh-6
  • Tile-1
  • Tile-6
  • TopK-1
  • TopK-10
  • TopK-11
  • Transpose-1
  • Unsqueeze-1
  • Unsqueeze-11

Environment variables

There are several environment variables which you can use to control the behaviour of PopART.

Logging

PopART can output information about its activity as described in Turning on execution tracing. You can control the default level of logging information using environment variables.

POPART_LOG_LEVEL

This controls the amount of information written to the log output for all modules. Finer control can be achieved using POPART_LOG_CONFIG.
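
The value is one of the logging levels listed in Turning on execution tracing. For example:

 export POPART_LOG_LEVEL=DEBUG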

POPART_LOG_DEST

This variable defines the output for the logging information. The value can be “stdout”, “stderr” or a file name.

The default, if not defined, is “stderr”.
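
For example, to write the log output to a file:

 export POPART_LOG_DEST=popart.log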

POPART_LOG_CONFIG

If set, this variable defines the name of a configuration file which specifies the logging level for each module. This is a JSON format file with pairs of module:level strings. For example, a file called conf.py can be specified by setting the environment variable:

 export POPART_LOG_CONFIG=conf.py

To set the logging level of the devicex and session modules, conf.py would contain:

 {
  "devicex":"INFO",
  "session":"WARN"
}

These values override the value specified in POPART_LOG_LEVEL.

Generating DOT files

POPART_DOT_CHECKS

PopART can output a graphical representation of the graph, in DOT format, when it constructs the intermediate representation (IR). This variable controls the stages of IR construction at which the DOT files are generated.

Supported values:

  • FWD0
  • FWD1
  • BWD0
  • PREALIAS
  • FINAL

These values may be combined using “:” as a separator. The example below shows how to set POPART_DOT_CHECKS to export DOT graphs for the FWD0 and FINAL stages.

 export POPART_DOT_CHECKS=FWD0:FINAL

The values in POPART_DOT_CHECKS will be combined with any values that are defined in the session options.

Saving the tensor map

POPART_TENSOR_TILE_MAP

The mapping of tensors to tiles in the session can be saved to a file by setting this variable to the name of a file. The tensor tile map will be written in JSON format.

The tensor tile map will be saved when you call Session::prepareDevice. For example, to save the tensor tile map to ttm.js you would set the variable as shown:

 export POPART_TENSOR_TILE_MAP=ttm.js

Inspecting the Ir

POPART_IR_DUMP

If set, this variable defines the name of a file to which the serialised IR will be written. The IR will be written either at the end of the IR preparation phase, or when an exception is thrown during IR preparation.
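
For example:

 export POPART_IR_DUMP=ir_dump.log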

Glossary

Sample

The smallest division of a data set.

Micro-batch size

The number of samples processed in a single execution of a graph on a single device. Also referred to as the machine batch size. The micro-batch shape, or the shape of input data as defined in the ONNX model, is therefore [micro_batch_size, *sample_shape].

Replication factor

The number of graphs to be run in parallel over multiple devices. The weight gradients from each device will be accumulated before a weight update. Also referred to as “device replication factor” or “spatial replication factor”. This is sometimes called data-parallel execution.

Accumulation factor

The weight gradients will be accumulated over this number of micro-batches in series before a weight update. Also referred to as “temporal replication factor”.

Accumulation can be thought of as doing replication on a single device.

Batch size

This is defined as micro-batch size * replication factor * accumulation factor. This is the number of samples per weight update.

Batches per step

The number of batches to run in a single call to Session::run.

Step size

This is defined as batch size * batches per step. This is the number of samples per step.
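
For example, with a micro-batch size of 4, a replication factor of 2, an accumulation factor of 8 and 100 batches per step, the batch size is 4 x 2 x 8 = 64 samples and the step size is 64 x 100 = 6,400 samples.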

Input data shape

Inputs to a session.run() call are read in with the assumption that data is arranged in the shape:

[batches_per_step, accl_factor, repl_factor, micro_batch_size, *sample_shape]

However, there is no constraint on the shape of the input array, except that it must contain the correct number of elements.
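
As a sketch, with hypothetical sizes (100 batches per step, accumulation factor 4, replication factor 2, micro-batch size 8 and a [3, 224, 224] sample), the input array would be created as follows (input_name and anchors are hypothetical placeholders):

 import numpy as np

# Hypothetical sizes: batches_per_step=100, accl_factor=4,
# repl_factor=2, micro_batch_size=8, sample_shape=[3, 224, 224]
data = np.zeros((100, 4, 2, 8, 3, 224, 224), dtype=np.float32)
stepio = popart.PyStepIO({input_name: data}, anchors)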