Targeting the IPU from TensorFlow

Introduction

The purpose of this document is to introduce the TensorFlow framework from the perspective of developing and training models for the IPU. It assumes you have some knowledge of TensorFlow and machine learning.

See the “Getting Started” guide for your IPU system on the Graphcore support portal for installation instructions.

To some extent, development at the framework level is independent of the underlying hardware, since it concerns the specifics of defining a graph and its components (for example, how a convolutional layer is defined).

However, there are critical elements of targeting the IPU from TensorFlow that need to be understood to successfully use it as a training and inference engine. These include IPU-specific API configurations, model parallelism, error logging and report generation, as well as strategies for dealing with out-of-memory (OOM) issues.

Requirements

The Graphcore TensorFlow implementation requires Ubuntu 18.04 and Python 3.6. It will only run on a processor that supports the Intel AVX-512 extension to the instruction set.

Tutorial

TensorFlow is a powerful graph-modelling framework that can be used for the development, training and deployment of deep learning models. In the Graphcore software stack, TensorFlow sits at the highest level of abstraction. Poplar and Poplibs provide a software interface to operations running on the IPU.

TensorFlow abstraction in relation to Poplar and the IPU

For the discussion that follows, it is important to understand the three key concepts of graph, session and device as well as their functional interdependence.

Relationship between session, graph and device in TensorFlow

  • Graph: A computational graph is the connectivity framework of a deep learning model, where nodes are operators and edges are the data streams that connect them. Building a deep learning model in TensorFlow is the functional equivalent of designing a graph, where specified layer operations (for example, fully-connected layers) are nodes, and the sequence and connectivity of layers (such as a convolutional layer followed by max-pooling) define the edges.

  • Session: A session is the computational platform that encapsulates a graph. It handles data flow into and out of the graph, variable initialisation, model/weight storage and weight restoration, along with a number of other operations that are required to manage the computational task.

  • Device: The device identifies the hardware on which a session is run, such as the IPU, CPU or TPU. In many of the applications targeting the IPU, it will be helpful to segregate tasks between the CPU and IPU to leverage those aspects of the computation that they are each best suited for.

In the sections that follow, these three concepts will form a recurrent theme in building and deploying models from TensorFlow.

There are a number of references, user guides, model repositories and texts that can be valuable in learning the framework. See the References section for further reading.

Preliminary graphs

The focus now is to implement our first basic graphs targeting the IPU. The first step will be a straightforward additive graph with nothing save the fundamental components required for running on an IPU.

From there, we add the XLA library, which is required for a number of TensorFlow operators.

Finally, we add the concept of sharding, taking our first steps towards model parallelism by splitting a basic graph across four IPUs and then consolidating the calculations from the separate IPUs to produce a single final result.

A basic graph

We begin with the most humble of aspirations: the ability to add.

 import numpy as np

from tensorflow.python import ipu
from tensorflow.python.ipu.scopes import ipu_scope
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

# Configure arguments for targeting the IPU
cfg = ipu.utils.create_ipu_config(profiling=True, use_poplar_text_report=True)
cfg = ipu.utils.set_ipu_model_options(cfg, compile_ipu_code=False)
cfg = ipu.utils.auto_select_ipus(cfg, 1)
ipu.utils.configure_ipu_system(cfg)

with tf.device("cpu"):
  pa = tf.placeholder(np.float32, [2], name="a")
  pb = tf.placeholder(np.float32, [2], name="b")
  pc = tf.placeholder(np.float32, [2], name="c")


def basic_graph(pa, pb, pc):
  # Do basic addition with tensors
  o1 = pa + pb
  o2 = pa + pc
  simple_graph_output = o1 + o2
  return simple_graph_output


with ipu_scope("/device:IPU:0"):
  result = basic_graph(pa, pb, pc)

with tf.Session() as sess:
  # Run the graph through the session feeding it an arbitrary dictionary
  result = sess.run(result,
                    feed_dict={
                        pa: [1., 1.],
                        pb: [0., 1.],
                        pc: [1., 5.]
                    })
  print(result)

Let’s review the key sections of the code as they are presented. The script begins with the basic import statements, two of which pertain to the IPU specifically. The ipu module is the main interface for setting configuration options and running sessions on the IPU. ipu_scope is a helper function that ensures that the device and resource scopes are set (that is, that the hardware is properly initialised when called by the script).

 # Configure arguments for targeting the IPU
cfg = ipu.utils.create_ipu_config(profiling=True, use_poplar_text_report=True)
cfg = ipu.utils.set_ipu_model_options(cfg, compile_ipu_code=False)
cfg = ipu.utils.auto_select_ipus(cfg, 1)
ipu.utils.configure_ipu_system(cfg)

In this section of the code, basic configuration options are defined. Boolean flags are passed to create_ipu_config to turn on profiling and a text-format report.

  • The profiling parameter enables trace event logging on the IPU. This will monitor operations on the chip, providing detailed data about the session as it runs on hardware.

  • use_poplar_text_report configures the text format of the generated report, making it more readable for debugging purposes.

Because profiling adds code and extra variables to extract the profiling information, it can change the performance and memory usage of your program.

Running on the IPU Model simulator

You can run the graph on IPU hardware or on an IPU Model running on the host. The IPU Model is a simulation of the behaviour of the IPU hardware. It does not implement every aspect of a real IPU. For example, the IPU Model does not support replicated graphs in TensorFlow (see Replicated graphs).

When using an IPU Model instead of actual IPU hardware, the runtime operations will behave exactly as they would on hardware. However, the profiler estimates the performance of operations and the memory use, so the profiling information will not be as precise as when running on hardware. By default, the memory use will not include that required for IPU code.

If you set the set_ipu_model_options option compile_ipu_code to True then Poplar will compile code for the IPU (in addition to the CPU code that is actually executed by the host). In this case, the reported IPU memory usage will include the memory used for code.
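
For example, to include IPU code in the reported memory usage when running on the IPU Model, this option can be enabled as a small variation of the configuration code shown earlier:

 # Report IPU code memory when using the IPU Model
cfg = ipu.utils.create_ipu_config(profiling=True, use_poplar_text_report=True)
cfg = ipu.utils.set_ipu_model_options(cfg, compile_ipu_code=True)
cfg = ipu.utils.auto_select_ipus(cfg, 1)
ipu.utils.configure_ipu_system(cfg)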

The IPU Model can be a useful tool for debugging OOM-related issues. See Using the IPU Model device for debugging for more information.

By default, the code will be run on IPU hardware. To run on the IPU Model instead, you need to set the environment variable TF_POPLAR_FLAGS='--use_ipu_model', for example:

 import os

# Using IPU model instead of IPU hardware
if self.base_dictionary['ipu_model']:
    os.environ['TF_POPLAR_FLAGS'] = '--use_ipu_model'

Selecting hardware to run on

The auto_select_ipus function enables you to select from the available IPUs in a system. In this example, one IPU is selected. This can be changed to any number between 1 and 16 on a system such as the Dell EMC DSS8440 IPU Server, which has eight C2 cards installed, each with two IPUs.

This option will be important when we explore sharding, in which a single graph is segregated into separate sections, each section targeting a distinct IPU.

 with tf.device("cpu"):
  pa = tf.placeholder(np.float32, [2], name="a")
  pb = tf.placeholder(np.float32, [2], name="b")
  pc = tf.placeholder(np.float32, [2], name="c")

In this section, TensorFlow placeholders are being placed into the CPU part of the graph. These will be used to feed data using a feed dictionary when executing session.run().

 def basic_graph(pa, pb, pc):
  # Do basic addition with tensors
  o1 = pa + pb
  o2 = pa + pc
  simple_graph_output = o1 + o2
  return simple_graph_output


with ipu_scope("/device:IPU:0"):
  result = basic_graph(pa, pb, pc)

In this section, a graph of operations is created to do simple arithmetic on three input tensors. The ipu_scope directive is used to ensure that these operations are placed on the IPU system.

When the graph is executed using session.run(), the following output can be seen in the console log:

 ... [VARIOUS OUTPUT LINES FROM SCRIPT]...
...: I tensorflow/compiler/plugin/poplar/driver/executor.cc:660] Device /device:IPU:0 attached to IPU: 0
[3. 8.]

Beyond summing the vectors correctly, the line directly preceding the result tells us that the targeted device was the IPU, and that the index of the IPU that ran the graph was IPU 0.

Note that "/device:IPU:0" in the script is a logical identifier for the IPU, and so when using auto_select_ipus, the actual IPU selected to run the graph may not be IPU 0, but could be any of the other IPUs that are free and available on the server. This will be covered in more detail in Sharding a graph.

An XLA graph

The previous script introduced a very basic graph consisting of the summation of three vectors, and printed the result of a forward pass. For certain applications, it will be necessary to incorporate control-flow structures, such as conditional if or while statements. Certain recurrent neural network (RNN) layers and long short-term memory (LSTM) cells have conditionals implicitly defined in their source code. In those cases, it is necessary to use the XLA library to define the graph. XLA is an optimising compiler for linear algebra that transforms the graph into highly efficient computation sets.

Using XLA has certain restrictions, the most pertinent of which for the current discussion is that the dimensions of all tensors involved in the computational graph must be fully defined at compile time. Dealing with this restriction can at times require some meticulous refactoring of placeholders or input tensors (especially when dealing with mini-batch processing) but does not constitute a significant development overhead.

The main interface to the XLA library is ipu.ipu_compiler.compile(), which takes a function defining the graph and a list of input tensors, and returns the compiled output tensors. ipu.ipu_compiler.compile sits between the graph definition and the session construct, as shown below:

xla.compile in relation to a session and graph

In most IPU-specific implementations, it is likely that an entire graph will be parsed through ipu.ipu_compiler.compile. However, it is also possible to compile only a portion of a graph with XLA and then combine the resulting tensor set with another, non-XLA, graph.

Further details about XLA compilation are available on the TensorFlow website: https://www.tensorflow.org/xla/tutorials/xla_compile.

Let’s now build on our previous TensorFlow script by adding ipu.ipu_compiler.compile to the session definition.

 import numpy as np

from tensorflow.python import ipu
from tensorflow.python.ipu.scopes import ipu_scope
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

# Configure argument for targeting the IPU
cfg = ipu.utils.create_ipu_config(profiling=True, use_poplar_text_report=True)
cfg = ipu.utils.set_ipu_model_options(cfg, compile_ipu_code=False)
cfg = ipu.utils.auto_select_ipus(cfg, 1)
ipu.utils.configure_ipu_system(cfg)

with tf.device("cpu"):
  pa = tf.placeholder(np.float32, [2], name="a")
  pb = tf.placeholder(np.float32, [2], name="b")
  pc = tf.placeholder(np.float32, [2], name="c")


def basic_graph(pa, pb, pc):
  # Do basic addition on tensors
  o1 = pa + pb
  o2 = pa + pc
  simple_graph_output = o1 + o2
  return simple_graph_output


with ipu_scope("/device:IPU:0"):
  xla_result = ipu.ipu_compiler.compile(basic_graph, [pa, pb, pc])

with tf.Session() as sess:
  # Base run
  result = sess.run(xla_result,
                    feed_dict={
                        pa: [1., 1.],
                        pb: [0., 1.],
                        pc: [1., 5.]
                    })

  print(result)

The script has now gone from calling basic_graph directly to passing it as the graph input to ipu.ipu_compiler.compile, along with the corresponding placeholders.

Note that the dimensions of the placeholders fed to ipu.ipu_compiler.compile have been defined on the CPU. The values of these tensors are not defined until the session.run call.

In other words, it is only the dimensions of the placeholders that are the critical information for ipu.ipu_compiler.compile so that it can parse the graph correctly at compile time.

Given that this graph and the one in the previous example are the same, it is apparent that ipu.ipu_compiler.compile is not actually required to execute the graph. However, if the following code:

 def basic_graph(pa, pb, pc):
    # Do basic addition on tensors
    o1 = pa + pb
    o2 = pa + pc
    simple_graph_output = o1 + o2
    return simple_graph_output

were to be replaced with:

 def while_loop_graph(pa):
  c = tf.constant(0)

  def body_of_while_loop(i):
    return i + 1

  cond = lambda i: i < 10
  loop = tf.while_loop(cond, body_of_while_loop, [c])
  square = pa * pa
  return loop, square, tf.no_op()

Then ipu.ipu_compiler.compile would be strictly required, because of the use of the tf.while_loop() control-flow construct.
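
As a minimal sketch, following the same pattern as the example above (the variable names here are illustrative), the new graph function would be compiled and run like this:

 with ipu_scope("/device:IPU:0"):
  xla_loop_result = ipu.ipu_compiler.compile(while_loop_graph, [pa])

with tf.Session() as sess:
  # The outputs include the final loop counter and the element-wise square of pa
  print(sess.run(xla_loop_result, feed_dict={pa: [1., 2.]}))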

Sharding a graph

The final script of this introductory series focuses on sharding: the process of splitting a graph across multiple IPUs. In essence, the session continues to be a single entity, so that the graph construct is treated as a single model, but distinct portions of the graph live on different IPUs, as illustrated below:

Sharding across two IPUs

Let’s now return to our basic script and add the sharding component.

 import numpy as np

from tensorflow.python import ipu
from tensorflow.python.ipu.scopes import ipu_scope
from tensorflow.compiler.plugin.poplar.ops import gen_ipu_ops
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

NUM_IPUS = 4

# Configure the IPU system
cfg = ipu.utils.create_ipu_config(profiling=True, use_poplar_text_report=True)
cfg = ipu.utils.set_ipu_model_options(cfg, compile_ipu_code=False)
cfg = ipu.utils.auto_select_ipus(cfg, NUM_IPUS)
ipu.utils.configure_ipu_system(cfg)

# Create the CPU section of the graph
with tf.device("cpu"):
  pa = tf.placeholder(np.float32, [2], name="a")
  pb = tf.placeholder(np.float32, [2], name="b")
  pc = tf.placeholder(np.float32, [2], name="c")

# Define a trace event
with tf.device('cpu'):
  report = gen_ipu_ops.ipu_event_trace()


# Distribute the computation across four shards
def sharded_graph(pa, pb, pc):
  with ipu.scopes.ipu_shard(0):
    o1 = pa + pb
  with ipu.scopes.ipu_shard(1):
    o2 = pa + pc
  with ipu.scopes.ipu_shard(2):
    o3 = pb + pc
  with ipu.scopes.ipu_shard(3):
    out = o1 + o2 + o3
    return out


# Create the IPU section of the graph
with ipu_scope("/device:IPU:0"):
  result = ipu.ipu_compiler.compile(sharded_graph, [pa, pb, pc])

with tf.Session() as sess:
  # sharded run
  result = sess.run(result,
                    feed_dict={
                        pa: [1., 1.],
                        pb: [0., 1.],
                        pc: [1., 5.]
                    })

  print(result)

Focusing on the sharding parts of this new script, auto_select_ipus is now used to select four separate IPUs for the task. This allows the script to go through the IPUs accessible by the host, determine which are being utilised and which are free, and then attach to those that are available.

In sharded_graph, the standard sum graph is defined (with the addition of one more sum on shard 2). Now each portion of the sum is performed on a distinct shard, using

 with ipu.scopes.ipu_shard(shard_index):

As a result, shards 0 to 2 perform independent tensor sums, while shard 3 accumulates the results from the other three shards. The graph is again compiled with ipu.ipu_compiler.compile.

Note that sharding can also be performed without running through the XLA library.

The output of the session run will be something similar to this:

 ... [VARIOUS OUTPUT LINES FROM SCRIPT]...
...:  I tensorflow/compiler/plugin/poplar/driver/executor.cc:660] Device /device:IPU:0 attached to IPUs: 24
[array([ 4., 14.], dtype=float32)]

The first thing to note is that the sum is correct so we know that the sharded implementation works correctly.

The second thing to note is that the IPU ID is reported as 24. This is a multi-IPU ID and corresponds to the individual IPUs 4, 5, 6 and 7. These are the IPUs selected to host the graph and to process respective shards as indexed in the code. See the IPU Command Line Tools document for more information about how IPU IDs are allocated.

Targeting the Poplar XLA device

The Poplar XLA devices are named /device:IPU:X, where X is an integer which identifies that logical device. This can consist of one or more physical IPU devices, as described below.

A Python context handler is available for setting up all appropriate scoping while creating the graph:

 # Create the IPU section of the graph
with ipu_scope("/device:IPU:0"):
  result = ipu.ipu_compiler.compile(sharded_graph, [pa, pb, pc])

For very simple graphs, it is sufficient to use the IPU scope to define the parts of the graph which will be compiled. For most graphs, the function ipu_compiler.compile() must be used. This must be placed inside an IPU device scope.

The function ipu_compiler.compile() will cause all operations created by the Python function passed into its first argument to be placed on the IPU system, and be compiled together into a single Poplar executable.

Supported types

Poplar and the Poplibs libraries support the following data types:

  • tf.float32

  • tf.float16

  • tf.int32

  • tf.bool

Device selection

Hardware configuration options enable you to select the number of IPU devices. By default, TensorFlow will create one device, for a single IPU; the first available single IPU will be used.

Two API calls are available for selecting the number and configuration of the IPU system:

  • auto_select_ipus allows the selection of a number of IPUs. The function returns a single logical device containing the requested number of IPUs.

  • select_ipus allows the selection of specific IPU hardware devices, using ID numbers as returned by the gc-info tool.

Both of these functions take the options structure returned by the create_ipu_config function as their first argument.

The second argument to auto_select_ipus is the number of IPUs required.

The second argument to select_ipus is either an integer or a list.

When a single integer is specified, this will be treated as the ID of the IPU device or devices to use. The ID specifies a single IPU, if it is in the range 0 to 15. Larger numbers represent “multi-IPU” IDs that specify groups of closely connected IPUs.

For example, to use all the IPUs in a 16-IPU system the appropriate ID is 30. (See the IPU Command Line Tools document for details of how device IDs map to available IPUs.) This will allocate a single TensorFlow device (/device:IPU:0) configured with all 16 IPUs.

You can also use a list of IDs as the argument to select_ipus. This configures a TensorFlow device for each ID in the list (/device:IPU:0, /device:IPU:1, and so on). Again, each ID value can specify a single IPU or multiple IPUs.
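
For example, the following sketch configures two logical TensorFlow devices from two specific single-IPU device IDs (the ID values here are illustrative; use gc-info -l to find the IDs on your system):

 # Create two logical devices, /device:IPU:0 and /device:IPU:1,
# from two specific single-IPU device IDs (illustrative values)
cfg = ipu.utils.create_ipu_config()
cfg = ipu.utils.select_ipus(cfg, [2, 3])
ipu.utils.configure_ipu_system(cfg)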

For more examples, see the documentation in Python API.

Once the hardware structure has been specified, the API call ipu.utils.configure_ipu_system must be used to attach to and initialise the hardware.

 # Configure the IPU system
cfg = ipu.utils.create_ipu_config(profiling=True, use_poplar_text_report=True)
cfg = ipu.utils.set_ipu_model_options(cfg, compile_ipu_code=False)
cfg = ipu.utils.auto_select_ipus(cfg, NUM_IPUS)
ipu.utils.configure_ipu_system(cfg)

Configuring compilation options

The create_ipu_config function has many options for system configuration. They are divided into roughly three categories:

  1. Profiling and report generation.

  2. IO control.

  3. Graph creation.

In addition to auto_select_ipus and select_ipus, several other functions exist for configuring the hardware and compiler.

  • set_compilation_options sets general options to be passed to the Poplar compiler.

  • set_convolution_options, set_matmul_options and set_pooling_options pass specific options directly to the Poplibs convolution and pooling operations.

  • set_report_options passes options directly to the Poplar summary report generator.

  • set_ipu_model_options controls the Poplar IPU Model device type.

  • set_recomputation_options turns on recomputation, to reduce the memory requirement at the expense of speed.

  • set_floating_point_behaviour_options controls the IPU's floating-point control register.

  • set_optimization_options controls the performance and memory use trade-offs.

More options are available on the create_ipu_config function itself. These mostly control specific features of the Poplar and Poplibs operations. Some of the main ones are described below:

  • max_scheduler_lookahead_depth controls how far the scheduler can look beyond a given scheduling decision to understand the max-liveness implications. This search space grows very quickly and can take an unacceptable amount of time for large values.

  • max_scheduler_search_space_size introduces an upper-limit to the size of the schedule search space to guarantee that it will terminate in a reasonable amount of time.

  • scheduler_selection controls the particular scheduler that is selected to perform the scheduling of instructions in the compilation stage (see the sketch after this list). By default, several schedules will be created and the one with the lowest predicted liveness chosen. This can sometimes produce a poor choice, because the overall peak liveness isn’t always a good measure of the maximum liveness on one tile of the processor.

    The available schedulers are:

    • Clustering, which groups clusters of operations together in order to look through stretches of instructions with potentially high liveness.

    • PostOrder, which schedules the instructions in the order which is obtained by walking the graph in ‘post order’.

    • LookAhead, which looks ahead a number of operations from any schedulable one, as given by the max_scheduler_lookahead_depth and max_scheduler_search_space_size options described above. It attempts to look through areas of high liveness.

    • ShortestPath, which schedules the graph giving priority to the shortest path to the root.
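
As an illustrative sketch, these scheduler-related options are passed directly to create_ipu_config (parameter names as listed above; the values shown are examples only, and passing the scheduler name as a string is an assumption):

 # Limit the scheduler search and select a specific scheduler (example values)
cfg = ipu.utils.create_ipu_config(max_scheduler_lookahead_depth=5,
                                  max_scheduler_search_space_size=64,
                                  scheduler_selection='LookAhead')
cfg = ipu.utils.auto_select_ipus(cfg, 1)
ipu.utils.configure_ipu_system(cfg)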

See the documentation in Python API for more details.

TF_POPLAR_FLAGS environment variable

The options passed through create_ipu_config and configure_ipu_system can be directed at any machine in a TensorFlow cluster. Some configuration options are provided by an environment variable called TF_POPLAR_FLAGS.

If you set TF_POPLAR_FLAGS=--help and execute a TF session, it will output some help for each option. Some of the more common options are described below. For a full list, refer to Python API.

  • --help will print the information for all the flags.

  • --use_synthetic_data will prevent the system from downloading or uploading data to the card when executing code. This is used for testing performance without the overhead of data transfer.

  • --synthetic_data_initializer is used in combination with the --use_synthetic_data flag to control how the inputs to the graph will be initialised on the IPU. The values will be either random (--synthetic_data_initializer=random) or a constant value X (--synthetic_data_initializer=X)

  • --use_ipu_model will use the Poplar IPUModel for graph compilation and execution.

  • --log_cycle_count will log the number of cycles used in evaluating the main graph. The numeric argument indicates the tile on which the cycle count operation will be created. This may be used as an alternative to profiling for graphs with dynamic control flow.

  • --while_loop_brute_force_max_trip_count is the upper bound for how many iterations a while loop will be simulated for in order to brute force the number of times it will be executed.

  • --max_compilation_threads sets the maximum number of threads which Poplar is allowed to use for compiling the executable.

  • --max_infeed_threads sets the maximum number of threads which each infeed queue is allowed to use when accessing data from datasets.

  • --save_vertex_graph dumps the Poplar vertex graph (as a DOT file) to the given directory.

  • --save_interval_report dumps the Poplar interval report to the given directory.

  • --executable_cache_path enables the Poplar executable cache. See Caching of compiled executables.

  • --tensor_map_file_path will cause a JSON file containing the tile mapping of all tensors to be written to this directory.

  • --dump_schedule_as_dot will dump the schedule of the XLA graph to the user console.

  • --fallback_scheduler uses the standard TensorFlow scheduler, instead of the Graphcore specific one.

  • --allow_nans will allow NaNs.

  • --null_data_feed will cause any infeed queues to copy garbage data to the IPU rather than real data. This option can be used to determine whether the dataset provided to the infeed queue is the bottleneck during execution.

  • --dump_text_reports_to_stdio will dump a text summary of the profile to the standard output, in addition to the normal report processing, if profiling is enabled.

Multiple options can be specified at the same time by concatenating them like command line switches, for example: --executable_cache_path=/tmp/cache --allow_nans.
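
For example, a typical invocation combining two of the flags described above might look like this:

 $ TF_POPLAR_FLAGS="--executable_cache_path=/tmp/cache --allow_nans" python basic_graph.py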

Caching of compiled executables

It can take a long time to compile a large fused graph into an executable suitable for the IPU. To prevent the need for compiling every time a TensorFlow process is started, you can enable an executable cache.

You can use the flag --executable_cache_path to specify a directory where compiled files will be placed. Fused XLA/HLO graphs are hashed with a 64-bit hash and stored in this directory. For example:

 TF_POPLAR_FLAGS='--executable_cache_path=/tmp/cachedir'

A pair of files will be saved for each compiled graph, the TensorFlow metadata and the Poplar executable.

The cache does not manage the files within the directory. It is your responsibility to delete files. No index is kept of the files, so they can be deleted without risk.

Supported operations

A list of supported TensorFlow operations is provided in TensorFlow operators supported by the IPU.

Unsupported operations

TensorFlow core operations which use variable buffers or strings are not supported. For instance, JpegDecode.

Unsupported operations will cause the compilation to fail.

By including config=tf.ConfigProto(log_device_placement=True) as an argument to the creation of the session, you can check whether the operations in your graph have been targeted at the Poplar device. For example:

 # Creates a session with log_device_placement set to True.
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))

Adding variables

Do not add variables using tf.Variable([shape], initializer), because they will fail to work with certain operations, such as assign_add.

Make sure that all variables are added using a variable scope that is marked as a resource. This can be done globally, as shown below:

 vscope = tf.get_variable_scope()
vscope.set_use_resource(True)
...
var = tf.get_variable(name, shape=[...], dtype=tf.float32, initializer=tf.constant_initializer(0.5))
...

Or it can be done locally, in a specific scope:

 with tf.variable_scope("vs", use_resource=True):
  var = tf.get_variable(name, shape=[...], dtype=tf.float32, initializer=tf.constant_initializer(0.5))

Troubleshooting

If you get an error similar to the following (especially the lines containing VariableV2) it indicates that a variable has been created which is not a resource variable.

 InvalidArgumentError (see above for traceback): Cannot assign a device for operation
  'InceptionV1/Logits/Conv2d_0c_1x1/biases': Could not satisfy explicit device specification
  '/device:IPU:0' because no supported kernel for IPU devices is available.
Colocation Debug Info:
Colocation group had the following types and devices:
Const: CPU IPU XLA_CPU
Identity: CPU IPU XLA_CPU
Fill: CPU IPU XLA_CPU
Assign: CPU
VariableV2: CPU

Note on the global_step counter

More advanced execution control frameworks in TensorFlow use a scalar counter called global_step to count the number of iterations of training which have occurred. This counter is serialised along with the model. It allows the model to base parameters on the step count, even if the model is run multiple times.

There is an add operation which adds to the global_step scalar on each training pass. If the global_step variable is placed on the IPU device, then this increment operation will occur on the IPU too. This will cause the Poplar training engine to be swapped out for the increment engine on each training step, causing very poor performance.

To avoid this, in the CPU context, use the expression tf.train.get_or_create_global_step() before you create any special training sessions. This will ensure that the global_step variable is on the CPU.

 with tf.device("cpu"):
  tf.train.get_or_create_global_step()

with ipu.ops.ipu_scope("/device:IPU:0"):
  out = ipu.ipu_compiler.compile(model_fn, [...])

Half-precision floating point and stochastic rounding

The IPU supports IEEE half-precision floating-point numbers, and supports stochastic rounding in hardware. The IPU extensions to TensorFlow expose this floating point functionality through the functions described below. See the Python API for more detail.

Controlling the half-precision floating-point unit

You can configure the behaviour of the floating-point hardware using the function tensorflow.python.ipu.utils.set_floating_point_behaviour_options().

The esr bit enables the stochastic rounding unit. Three of the remaining options control the generation of hardware exceptions on various conditions. The nanoo bit selects between clipping or generating a NaN when a half-precision number overflows.
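
As a hedged sketch (only the esr and nanoo options are named in this section; passing them as keyword arguments is an assumption, and the other exception-control parameters are not shown), stochastic rounding and NaN-on-overflow could be enabled like this:

 # Enable stochastic rounding and NaN generation on half-precision overflow
cfg = ipu.utils.create_ipu_config()
cfg = ipu.utils.set_floating_point_behaviour_options(cfg, esr=True, nanoo=True)
cfg = ipu.utils.auto_select_ipus(cfg, 1)
ipu.utils.configure_ipu_system(cfg)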

Resetting the global random number seed

The stochastic rounding unit and the TensorFlow stateful random number generators both use a common global random number seed to initialise the random number generator hardware. Each IPU device has its own seed.

By default this seed is set randomly, but it can be reset by using the function tensorflow.python.ipu.utils.reset_ipu_seed().

Due to the hardware threading in the device, if the seed reset function is used then the target.deterministicWorkers Poplar Engine option will need to be set to true.

This can be done using tensorflow.python.ipu.utils.set_compilation_options().
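
A sketch of how this might be combined (assuming the Poplar engine option is passed as a dictionary of option names to values; the seed value is arbitrary):

 # Make the hardware worker threads deterministic, then reset the seed
cfg = ipu.utils.create_ipu_config()
cfg = ipu.utils.set_compilation_options(cfg, {"target.deterministicWorkers": "true"})
cfg = ipu.utils.auto_select_ipus(cfg, 1)
ipu.utils.configure_ipu_system(cfg)

ipu.utils.reset_ipu_seed(42)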

Debugging numerical issues

The values held in a tensor can be printed by calling ipu.ops.internal_ops.print_tensor. This function takes a tensor and will print it to standard error as a side effect.

See tensorflow.python.ipu.ops.internal_ops.print_tensor().

Retrieving information about compilation and execution

When developing models for the IPU, it is important to be able to see how compute tiles are being used and what the balance of memory use across them is. In certain cases, such as when investigating memory over-consumption of a model or investigating any tile imbalance issues, it is useful to produce a trace report that will show a number of different aspects of graph deployment on the IPU.

Several mechanisms are available to retrieve trace information about the Poplar IPU compilation and execution. Firstly, there are environment variables provided by Poplar itself to dump the compilation and execution reports into a file. See the “Profiling” chapter in the Poplar and Poplibs User Guide for more information.

Within TensorFlow, the basic steps for this are:

  • Include an operation in the graph to retrieve the reports

  • Enable tracing in the hardware configuration options

  • Execute the graph, including the operation to retrieve the reports

  • Extract the reports from the returned events

Adding an operation to get compilation and execution events

Two operations are available to fetch events from the Poplar backend. The first is an operation which fetches the reporting events into a tensor, and is typically executed independently of the main graph. The second is a summary event which will extract the reports along with any other summary events. These events will typically be written into a file using the tensorflow.summary.FileWriter class.

ipu_event_trace()

This is an operation which retrieves all IPU events since the last time it was executed. The operation must be placed on the CPU, and returns the events as a one dimensional tensor of strings containing serialised IPU event protobufs, from tensorflow.compiler.plugin.poplar.driver.trace_pb2.IpuTraceEvent.

This is the example from the tutorial with a few lines of additional code to create a trace report:

 import numpy as np

# IPU imports
from tensorflow.compiler.plugin.poplar.ops import gen_ipu_ops
from tensorflow.python.ipu import utils
from tensorflow.python.ipu.scopes import ipu_scope
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

# Configure argument for targeting the IPU
cfg = utils.create_ipu_config(profiling=True, use_poplar_text_report=True)
cfg = utils.set_ipu_model_options(cfg, compile_ipu_code=False)
cfg = utils.auto_select_ipus(cfg, 1)
utils.configure_ipu_system(cfg)

with tf.device("cpu"):
  pa = tf.placeholder(np.float32, [2], name="a")
  pb = tf.placeholder(np.float32, [2], name="b")
  pc = tf.placeholder(np.float32, [2], name="c")

  # Create a trace event
  report = gen_ipu_ops.ipu_event_trace()


def basic_graph(pa, pb, pc):
  # Do basic addition with tensors
  o1 = pa + pb
  o2 = pa + pc
  simple_graph_output = o1 + o2
  return simple_graph_output


with ipu_scope("/device:IPU:0"):
  result = basic_graph(pa, pb, pc)

with tf.Session() as sess:
  # Run the graph through the session feeding it an arbitrary dictionary
  result = sess.run(result,
                    feed_dict={
                        pa: [1., 1.],
                        pb: [0., 1.],
                        pc: [1., 5.]
                    })

  # Generate report based on the event run in session
  trace_out = sess.run(report)
  trace_report = utils.extract_all_strings_from_event_trace(trace_out)

  # Write trace report to file
  with open('Trace_Event_Report.rep', "w") as f:
    f.write(trace_report)

  # Print the result
  print(result)

The example starts by importing two new elements that are IPU-specific APIs. The first import is gen_ipu_ops, which will generate the event trace. The second import is an assortment of utility functions, one of which is used here to parse the event trace to a readable output.

The event trace operation is created when gen_ipu_ops.ipu_event_trace() is called, and the returned operation is stored in report. This is then passed to the TensorFlow session as a run argument, directly following the session.run call that performs the forward pass through basic_graph. In essence, the report is generated from the last graph executed in the session. The trace output is then parsed with extract_all_strings_from_event_trace, and near the end of the example a file is opened and the parsed trace data written to it.

ipu_compile_summary(name, [op list])

This produces a summary which can be tied into the rest of the summary system to produce output for Tensorboard. The parameter name is the name of the summary, and op is one of the operations in the IPU graph. It is best to choose either the inference output for an inference graph, the loss output for an evaluation graph, or the train op for a training graph.

 import tensorflow as tf
from tensorflow.python import ipu

...

tf.summary.scalar('c_out', c)
ipu.summary_ops.ipu_compile_summary('report', [c])
all_sum = tf.summary.merge_all()

...

f = tf.summary.FileWriter('logs')
with tf.Session() as s:
  sum_out, ... = s.run([all_sum, ...])
  f.add_summary(sum_out, 0)

  print("c = {}".format(c))

Enabling tracing in the hardware configuration options

The main function for producing an IPU system hardware configuration is called create_ipu_config. It provides several options for controlling the logging and tracing of Poplar compilations.

  • profiling: This enables compilation and execution graph reports in Poplar, and generates COMPILE_BEGIN and COMPILE_END events in the trace.

  • enable_ipu_events: Setting this to True while leaving profiling as False will generate trace events without creating the Poplar compilation and execution reports in them. This is useful for getting timing information from the event trace without the overhead of the Poplar reporting.

  • use_poplar_text_report: Normally, the Poplar reports are generated in JSON format. Setting this parameter to True will generate a text summary report instead of JSON.

  • use_poplar_cbor_report: Instead of a JSON format report, a CBOR format report will be generated.

  • profile_execution: When this is set to True, then EXECUTE events will be generated in addition to compilation events. By default the execution events will contain a device type trace. If a different type of execution trace is required, then instead of True, one of ExecutionProfileType.DEVICE_PROFILE, ExecutionProfileType.IPU_PROFILE or ExecutionProfileType.TILE_PROFILE can be used.

  • report_every_nth_execution: This will restrict the number of execution reports to a subset of all executions.

  • max_report_size: Poplar reports can get very large. This parameter can be used to restrict the maximum size of report generated. Reports larger than this value will be discarded and a warning message sent to the TensorFlow log.

  • report_directory: Rather than reports being placed directly into the events, they can be written to a file, and the file name written into the event log. This behaviour is enabled by setting this parameter to a directory name.
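
For example, using the parameters described above, a configuration that enables execution profiling and writes the reports to files might look like this (the directory name is illustrative):

 # Enable compilation and execution profiling; write reports to files
cfg = ipu.utils.create_ipu_config(profiling=True,
                                  profile_execution=True,
                                  report_directory='/tmp/ipu_reports')
cfg = ipu.utils.auto_select_ipus(cfg, 1)
ipu.utils.configure_ipu_system(cfg)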

Extract the reports from the returned events

If the summary event generator has been used then the events will be inside Tensor type events in the Tensorboard logs.

If the individual report gathering event is used then executing it will return an array of tensors. Within each tensor is a string which is an IpuTraceEvent of one type.

The IpuTraceEvent is within the tensorflow namespace at tensorflow.compiler.plugin.poplar.driver.trace_pb2.IpuTraceEvent. It is a protobuf that can be decoded from the string into an object with fields containing trace information.

Several utility functions are available for extracting fields, for example:

 rep = sess.run(report)
compile_reports = ipu.utils.extract_compile_reports(rep)
execute_reports = ipu.utils.extract_execute_reports(rep)
events = ipu.utils.extract_all_events(rep)

See the Python API section for more information.

COMPILE_BEGIN

This event is generated when the Poplar compilation begins. It contains the XLA module name, a timestamp and the ordinal of the device that the code was compiled for.

COMPILE_END

This is generated when the Poplar compilation ends. It contains the module name, a timestamp, an ordinal and the following compilation trace fields:

  • compilation_report is the Poplar compilation report.

  • duration is the duration of the compilation.

  • tensor_map is a mapping of tensors generated by XLA/HLO instructions to the IPU tiles where those tensors are mapped.

Tensor map

The tensor_map field has the following format. It is JSON but, in order to keep it dense, it is mostly JSON lists instead of keyed dictionaries.

At the top level there is a map called mapping, which contains an entry for each XLA computation, keyed by the name of that computation. The value is a list of tensors generated by that computation.

 { 'mapping' : {'computation_0' : [ ... ], 'computation_1' : [ ... ] } }

Each tensor in that list is also a list, consisting of the following items:

  • 0 - name of the XLA/HLO instruction generating the tensor.

  • 1 - the ordinal of the tensor produced by that instruction.

  • 2 - a list of integers indicating the shape of the tensor.

  • 3 - a string indicating the tensor element type.

  • 4 - a Boolean indicating if the tensor contains any constant elements.

  • 5 - a Boolean indicating if the tensor contains any aliases.

  • 6 - the total number of elements in the tensor.

  • 7 - a list of information about the elements on each tile, for example:

     [ 'add.0', 0, [32, 32], 'float', 0, 0, 2, 256, [ ... ] ]
    

The list of elements on each tile has one entry per tile that contains elements of the tensor. Each entry is itself a list, containing the following items:

  • the tile index number.

  • the total number of elements on that tile.
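
As an illustrative sketch (assuming the structure described above, with the top-level 'mapping' key), a tensor map document can be loaded and summarised with a few lines of Python:

 import json

# Count the tensors recorded for each XLA computation in a tensor_map document
def summarise_tensor_map(tensor_map_json):
  mapping = json.loads(tensor_map_json)['mapping']
  for computation, tensors in mapping.items():
    print("{}: {} tensors".format(computation, len(tensors)))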

The instruction_info field contains information about how the specific HLO instructions were mapped to Poplar API calls. Its format is as follows:

 { 'ml_types': {'instruction': <ml_type>, ... } }

The instruction is the name of the instruction at the HLO level, which is similar to the name in the main compilation report. The ml_type field takes one of the following values, for instructions which are convolution or matmul:

  • 0 - Unclassified

  • 1 - Standalone

  • 2 - The forward pass of training

  • 3 - The input gradient of training

  • 4 - The filter gradient of training

EXECUTE

This event contains the Poplar execution report in the execution_report field.

Using the IPU Model device for debugging

The IPU Model is an emulator that mimics the IPU computational framework on the host device. It is functionally equivalent to the IPU, but obviously the compute performance will be completely different.

If you encounter an out of memory error, it may be useful to use the IPU Model device to debug the problem.

Consider the situation in which the event trace is being used to investigate a graph that creates a tile memory imbalance. In this case, running on the IPU will lead to an out of memory exception before the report is generated. Running on the IPU Model instead of actual hardware will still run out of memory, but the code will run to completion so the report can be generated.

There are a number of ways to target the IPU Model, but the simplest is to pass a flag to TensorFlow using the TF_POPLAR_FLAGS environment variable. For example:

 $ TF_POPLAR_FLAGS="--use_ipu_model" python basic_graph.py

See TF_POPLAR_FLAGS environment variable for more information about this environment variable.

When the graph is executed on hardware, the console log reports which IPU the TensorFlow device was attached to:

 ...] Device /device:IPU:0 attached to IPU: 0

Here, “Device /device:IPU:0 attached to IPU: 0” indicates that the device known to TensorFlow as “/device:IPU:0” is IPU 0. The numbering of IPUs in your machine can be found by using the gc-info -l command.

TensorFlow options for reporting

Some tracing and reporting options are provided by TensorFlow as standard, and can be useful when developing graphs for the IPU.

TF_CPP_MIN_VLOG_LEVEL is an environment variable that enables the logging of the main C++ backend. Setting TF_CPP_MIN_VLOG_LEVEL=1 will show a lot of output. Included in this is the compilation and execution of the IPU code. The output of TF_CPP_MIN_VLOG_LEVEL can be overwhelming. TF_CPP_VMODULE provides a mechanism to reduce the logging to certain translation units (source files). This combination is quite useful:

 TF_CPP_VMODULE='poplar_compiler=1,poplar_executable=1'

Finally, there is an environment variable called XLA_FLAGS which provides options to the general XLA backend. For example, the following will produce a Graphviz DOT file of the optimised HLO graph which is passed to the Poplar compiler.

 XLA_FLAGS='--xla_dump_to=. --xla_dump_hlo_as_dot --xla_dump_hlo_pass_re=forward-allocation --xla_hlo_graph_sharding_color'

The HLO pass forward-allocation is the final pass to run before the HLO instructions are scheduled for passing to the Poplar graph compiler. Running with these options will create a file called something like module_0001.0001.IPU.after_forward-allocation.before_hlo-memory-scheduler.dot. (The way that the file names are generated is explained in XLA graph file naming.) The Graphviz dot command can be used to convert this data to an image, as shown below.
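
For example, using the file name above (the output file name is arbitrary):

 $ dot -Tpng module_0001.0001.IPU.after_forward-allocation.before_hlo-memory-scheduler.dot -o hlo_graph.png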

More information on the XLA flags can be found in the definition of the XLA proto here: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/compiler/xla/xla.proto

Reading the Poplar textual summary report

When the example code is run, a new file is generated called Trace_Event_Report.rep. This is the Poplar compilation report. The report is broken into a number of sections, but here, we will focus on the first three: Target, Graph, and Memory Usage.

Target

The “Target” section describes the target hardware which, in the absence of sharding, will be a single IPU. For instance:

 Target:
  Number of IPUs:         1
  Tiles per IPU:          1,216
  Total Tiles:            1,216
  Memory Per-Tile:        256.0 kB
  Total Memory:           304.0 MB
  Clock Speed (approx):   1,600.0 MHz

It is important to note that this section of the report does not distinguish between hardware and the IPU Model; in essence it depends only on the number of IPUs selected for deployment via the sharding utility.

Graph

The next section is “Graph”, which describes the topology of the deployed graph.

For instance:

 Graph:
  Number of vertices:            1,219
  Number of edges:               1,223
  Number of variables:          30,562
  Number of compute sets:            4

You may see different numbers, depending on the version of the software.

This is from the report generated by the adder example. The graph map includes control code, not just compute graph components. Note that the number of vertices in the graph is very close to the 1,216 tiles on the IPU.

Memory usage

The “Memory Usage” section gives the memory consumption profile of the graph from a number of different perspectives:

 Memory Usage:
  Total:
    Including Gaps:         23,878,396 B
    Excluding Gaps:
      By Memory Region:
        Non-interleaved:     5,355,604 B
        Interleaved:                 0 B
        Overflowed:                  0 B
      By Data Type:
          Variables:                            39,108 B
          Constants:                                 0 B
          Host Exchange Packet Headers:         10,512 B
          Global Exchange Packet Headers:            0 B
          Stack:                             3,852,288 B
          Vertex Instances:                     14,640 B
          Copy Descriptors:                          0 B
          VectorList Descriptors:                    0 B
          Vertex Field Data:                         0 B
          Control Table:                             0 B
          Control Code:                        851,272 B
          Vertex Code:                         170,788 B
          Internal Exchange Code:               60,792 B
          Host Exchange Code:                  351,328 B
          Global Exchange Code:                      0 B
          Instrumentation Results:               4,876 B
          Shared Code Storage:                       0 B
          Shared Data Storage:                       0 B
        Vertex Data (14,640B):
          By Category:
            Internal vertex state:          9,736 B
            Edge pointers:                  4,904 B
            Copy pointers:                      0 B
            Padding:                            0 B
            Descriptors:                        0 B
          By Type:
            poprand::SetSeedSupervisor                                                  34,048 B
            popops::ScaledAddSupervisor<float,float,true>                                   60 B
            popops::BinaryOp1DSupervisor<popops::expr::BinaryOpType::ADD,float>             16 B

  By Tile (Excluding Gaps):
    Range (KB) Histogram (Excluding Gaps)               Count (tiles)
         4 - 5 ****************************************  1,215
         5 - 6 *                                             1

    Maximum (Including Gaps): 49,184 (48.0 K) on tile 0
    Maximum (Excluding Gaps): 5,780 (5.6 K) on tile 0
    0 tile(s) out of memory

The information is presented in several sections. The first is the total memory used, including gaps. This is followed by a breakdown of the gap-excluding memory: first in terms of interleaved and non-interleaved usage, then by data type, followed by vertex data.

A useful portion of the report is the tile histogram memory consumption profile, which in this simple case is confined to two categories. When the graph is more complex, the histogram will most likely have a more distributed profile. In those instances, where there is in fact a tile imbalance, the histogram produced may look more like this:

 By Tile (Excluding Gaps):
    Range (KB) Histogram (Excluding Gaps)               Count (tiles)
       0 -   8 *                                            20
       8 -  16 ****************************************  1,192
      16 -  24 *                                             2
      24 -  32                                               0
      32 -  40                                               0
    .
    .
    .
     488 - 496                                               0
     496 - 504                                               0
     504 - 512 *                                             1
     512 - 520                                               0
     520 - 528                                               0
    .
    .
    .
     784 - 792                                               0
     792 - 800                                               0
     800 - 808                                               0
     808 - 816 *                                             1

    Maximum (Including Gaps): 834,416 (814.9 K) on tile 0
    Maximum (Excluding Gaps): 834,339 (814.8 K) on tile 0
    2 tile(s) out of memory

In this case, two tiles are out of physical memory, while most of the allocation is well within the single tile budget.

In those instances where a memory imbalance occurs, the report will produce a detailed description of the operations running on five of the most memory-subscribed tiles (regardless of whether they are over their physical limit or not) and list them in descending order of memory consumption.

In the above case, tile 0 is the most over-subscribed tile, and the report produces the following:

 Tile # 0 memory usage:
Memory Usage:
  Total:
    Including Gaps:            834,416 B
    Excluding Gaps:
      By Memory Region:
        Non-interleaved:       122,880 B
        Interleaved:           131,072 B
        Overflowed:            580,387 B
      By Data Type:
          Variables:                           807,658 B
          Constants:                                 0 B
          Host Exchange Packet Headers:          1,160 B
          Global Exchange Packet Headers:            0 B
          Stack:                                 3,168 B
          Vertex Instances:                     12,074 B
          Copy Descriptors:                      1,385 B
          VectorList Descriptors:                  960 B
          Vertex Field Data:                     7,934 B
          Control Table:                             0 B
          Control Code:                              0 B
            .
            .
            .

        Vertex Data (22,353B):
          By Category:
            Internal vertex state:          4,152 B
            Edge pointers:                 10,798 B
            .
            .
            .
          By Type:
            poplin::ConvPartial1x1Out<float,float,true,false>                                6,648 B
            poplar_rt::DstStridedCopy64BitMultiAccess                                        2,669 B
            popops::Reduce<popops::ReduceAdd,float,float,false,0>                            2,542 B
            popops::ScaledAddSupervisor<float,float,true>                                    1,440 B
            poplar_rt::StridedCopyDA32                                                       1,374 B
            poplar_rt::DstStridedCopyDA32                                                    1,101 B
            popops::BinaryOp1DSupervisor<popops::expr::BinaryOpType::MULTIPLY,float>           752 B
            .
            .
            .

This information can be very useful when tracking down the source of the over-allocation.

Producing an ELF image of the compilation

There is another method to produce much of the same detailed information provided in the trace event report. This generates code for IPU hardware (not an emulator on the host) and then extracts the memory allocation information from the generated ELF object file created at compile time. This technique will be described briefly here, only showing how the object file is created and memory-per-tile information extracted.

When compiling the graph, a Poplar engine option can be used to dump the ELF file to a specified location.

 POPLAR_ENGINE_OPTIONS='{"target.saveArchive":"binaries.a", "debug.allowOutOfMemory": "true"}' python basic_graph.py

The file binaries.a is created, which is an archive file of the compiled graph. To extract the memory size information from it, run the following command:

 $ size -A binaries.a > tiles_elf.txt

This writes a tile-by-tile rendition of the memory consumed, in bytes, to the file tiles_elf.txt. All of the memory allocated is part of the text section. This can be extracted from the tiles’ ELF files to produce a single column where each entry is the size of the text section corresponding to a tile:

 $ size -A binaries.a | grep -e ".text" | awk '{print $2}' > memory_usage_per_tile.txt

The file memory_usage_per_tile.txt will contain this memory allocation information. Further details of the deployed graph can be extracted with this approach.

Dumping auxiliary Poplar information

Two environment variable flags are available for obtaining extra Poplar information: --save_vertex_graph and --save_interval_report.

Poplar vertex graph

The Poplar vertex graph is a DOT file containing a complete description of the lowered Poplar graph. Each node in the graph represents one vertex in the Poplar graph operating on one region of a tensor.

Poplar interval report

The interval report is a CSV file describing the number of tiles executing, exchanging and syncing on each instruction cycle.

See TF_POPLAR_FLAGS environment variable for details of how to set these flags.

XLA graph file naming

The number of files produced depends on the number of TensorFlow HLO modules generated. This can generally be predicted from the number of sess.run calls on distinct graphs that you make. For example, if your program contains a variable initialisation then this will be compiled as a separate XLA graph and appear as a separate file when dumped. If your program creates a report operation, then that will also be compiled as a separate XLA graph.

When you use ipu_compiler.compile, you force everything inside the compile call to be compiled into a single XLA graph. If you don’t use ipu_compiler.compile, then the results depend on the XLA scheduler, which will combine or split up parts of the TensorFlow graph as it sees fit, creating many arbitrary distinct XLA graphs. If you do not use ipu_compiler.compile, expect to see a larger number of XLA graphs generated. Please note, there is no guarantee your compiled op will only produce one XLA graph. Sometimes others are created for operations such as casting.

The following description provides a breakdown of the names of the generated files. These are of the general form:

module_XXXX.YYYY.IPU.after_allocation-finder.before_forward-allocation.dot

  • There is always a module_ prefix, which indicates that this is the graph for an HLO Module.

  • The first XXXX is the HLO module’s unique ID, generated here: https://github.com/tensorflow/tensorflow/blob/r2.1/tensorflow/compiler/xla/service/dump.cc#L263

    There is no guarantee about the spacing between IDs, only that they are unique and increasing.

  • To understand the rest of the name, YYYY.IPU.......dot, we need to understand that the XLA graph is operated on by multiple different HLO passes, each modifying the XLA graph by optimizing, shuffling or otherwise rewriting it. After these passes, the graph is then lowered to Poplar. There are some TensorFlow native HLO passes, and there are some IPU specific ones.

    When dumping the XLA graphs, we can render the XLA graph before and after any HLO pass (for example, to see the effect of that pass on the graph) by supplying the argument --xla_dump_hlo_pass_re=xxxx, where xxxx is a regular expression describing which passes you want. TensorFlow will then render the XLA graph before and after every pass whose name matches that regex. For example, if you wanted to see the effect of every XLA HLO IPU pass involving while loops, you could use --xla_dump_hlo_pass_re=.*While.*.

    The number YYYY is simply an ID related to the order in which these graphs are generated.

  • Finally, the passes which the graph was “between” when it was rendered are appended to the filename.

    The before_optimizations graph is always rendered if dumping XLA.

  • The HLO passes have CamelCase class names by convention. For the file names, these are converted to snake_case.

Using IPU optimised operations

Several custom versions of operators are provided to target functions available in Poplibs. See the Python API for more details.

Dropout

The Poplibs version of dropout does not need to store the dropout mask between the forward and backward parts of the graph, saving memory.

See tensorflow.python.ipu.ops.rand_ops.dropout().
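
As a minimal sketch, the IPU dropout op can be used inside an IPU scope as shown below. The rate argument is an assumption here for illustration; check the Python API for the exact signature.

 from tensorflow.python.ipu import ipu_compiler
from tensorflow.python.ipu import scopes
from tensorflow.python.ipu.ops import rand_ops
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()


def model(x):
  # Poplibs dropout: no mask is stored between the forward and backward passes.
  # The `rate` argument is an assumption for illustration.
  return rand_ops.dropout(x, rate=0.5)


with scopes.ipu_scope('/device:IPU:0'):
  x = tf.placeholder(tf.float32, shape=[32, 128])
  [out] = ipu_compiler.compile(model, inputs=[x])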

Embedding lookup

This is a version of embedding lookup which will produce a smaller memory footprint for small lookups. Instead of using dynamic lookup into the main embedding dictionary, it uses a one-hot operator and a multiply.

See tensorflow.python.ipu.embedding_ops.embedding_lookup().
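
A minimal sketch of the IPU embedding lookup is shown below; the (params, ids) argument order is assumed from the API reference, so check the Python API for details.

 from tensorflow.python.ipu import embedding_ops
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

# A small embedding table and a batch of indices to look up.
params = tf.get_variable("embedding", shape=[1000, 64], dtype=tf.float32)
ids = tf.constant([1, 7, 42])

# IPU-optimised lookup, using a one-hot and multiply for small lookups.
embedded = embedding_ops.embedding_lookup(params, ids)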

Group normalisation

Group normalisation is an alternative to batch normalisation, and produces smaller and more optimised graphs.

The original paper on group normalisation is “Group Normalization”, Yuxin Wu, Kaiming He.

See tensorflow.python.ipu.normalization_ops.group_norm().
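
A minimal sketch of applying group normalisation to a convolutional activation is shown below; the groups argument is an assumption for illustration, so check the Python API for the exact parameters.

 from tensorflow.python.ipu import normalization_ops
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

# A batch of NHWC activations.
x = tf.placeholder(tf.float32, shape=[8, 32, 32, 16])

# Normalise over 4 groups of channels. The `groups` argument is assumed here.
y = normalization_ops.group_norm(x, groups=4)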

Instance normalisation

Instance normalisation is another alternative to batch normalisation.

The original paper on instance normalisation is “Instance Normalization: The Missing Ingredient for Fast Stylization” Dmitry Ulyanov, Andrea Vedaldi, Victor Lempitsky.

See tensorflow.python.ipu.normalization_ops.instance_norm().

Layer normalisation

Layer normalisation is another alternative to batch normalisation.

The original paper on layer normalisation is “Layer Normalization” Jimmy Lei Ba, Jamie Ryan Kiros, Geoffrey E. Hinton.

See tensorflow.python.ipu.normalization_ops.layer_norm().

GeLU activation

GeLU (Gaussian Error Linear Unit) is an alternative to the ReLU non-linearity. It is described in the paper at https://arxiv.org/pdf/1606.08415.pdf.

See tensorflow.python.ipu.nn_ops.gelu().
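
A minimal sketch of using the IPU GeLU activation in place of ReLU is shown below:

 from tensorflow.python.ipu import nn_ops
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

x = tf.placeholder(tf.float32, shape=[32, 128])

# Apply the Poplibs GeLU non-linearity to the output of a dense layer.
y = nn_ops.gelu(tf.layers.dense(x, 64))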

Training a model

TensorFlow XLA and Poplar provide the ability to combine an entire training graph into a single operation in the TensorFlow graph. This accelerates training by removing the need to make calls to the IPU hardware for each operation in the graph.

However, if the Python code with the training pass is called multiple times, once for each batch in the training data set, then there is still the overhead of calling the hardware for each batch.

The Graphcore IPU support for TensorFlow provides three mechanisms to improve the training performance: training loops, data set feeds, and replicated graphs.

Training loops, data sets and feed queues

Placing the training operations inside a loop allows them to be executed multiple times without returning control to the host. It is possible to use a standard TensorFlow while_loop operation to wrap the training operation, but the IPU library provides a convenient and feature-rich version.

Normally when TensorFlow runs, operations which are not inside a loop will be executed once, and those operations will return one or more tensors with fixed values. However, when a training operation is placed into a loop, the inputs to that training operation need to provide a stream of values. Standard TensorFlow Python feed dictionaries cannot provide data in this form, so when training in a loop, data must be fed from a TensorFlow DataSet.

More information about the DataSet class and its use can be found at https://www.tensorflow.org/guide/performance/datasets. TensorFlow provides many pre-configured DataSets for use in training models; see https://www.tensorflow.org/datasets.

To construct a system that will train in a loop, you will need to do the following:

  • Wrap your optimiser training operation in a loop.

  • Create an IPUInfeedQueue to feed data to that loop.

  • Create an IPUOutfeedQueue to take results out of that loop.

  • Create a TensorFlow DataSet to provide data to the input queue.

The following example shows how to construct a trivial DataSet, attach it to a model using an IPUInfeedQueue, feed results into an IPUOutfeedQueue, and construct a loop.

 from tensorflow.python.ipu import ipu_compiler
from tensorflow.python.ipu import ipu_infeed_queue
from tensorflow.python.ipu import ipu_outfeed_queue
from tensorflow.python.ipu import loops
from tensorflow.python.ipu import scopes
from tensorflow.python.ipu import utils
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

# The dataset for feeding the graphs
ds = tf.data.Dataset.from_tensors(tf.constant(1.0, shape=[800]))
ds = ds.map(lambda x: [x, x])
ds = ds.repeat()

# The host side queues
infeed_queue = ipu_infeed_queue.IPUInfeedQueue(ds, feed_name="infeed")
outfeed_queue = ipu_outfeed_queue.IPUOutfeedQueue(feed_name="outfeed")


# The device side main
def body(x1, x2):
  d1 = x1 + x2
  d2 = x1 - x2
  outfeed = outfeed_queue.enqueue({'d1': d1, 'd2': d2})
  return outfeed


def my_net():
  r = loops.repeat(10, body, [], infeed_queue)
  return r


with scopes.ipu_scope('/device:IPU:0'):
  run_loop = ipu_compiler.compile(my_net, inputs=[])

# The outfeed dequeue has to happen after the outfeed enqueue
dequeue_outfeed = outfeed_queue.dequeue()

# Configure the hardware
config = utils.create_ipu_config()
config = utils.auto_select_ipus(config, 1)
utils.configure_ipu_system(config)

with tf.Session() as sess:
  sess.run(infeed_queue.initializer)

  sess.run(run_loop)
  result = sess.run(dequeue_outfeed)
  print(result)

In this case the DataSet is a trivial one. It constructs a base DataSet from a single TensorFlow constant, and then maps the output of that DataSet into a pair of tensors. It then arranges for the DataSet to be repeated indefinitely.

After the DataSet is constructed, the two data feed queues are constructed. The IPUInfeedQueue takes the DataSet as a parameter, along with a name. Every queue in the system must have a unique name.

The IPUOutfeedQueue has extra options to control how it collects and outputs the data sent to it. None of these are used in this example.

Now that we have the DataSet and the queues for getting data in and out of the device-side code, we can construct the device-side part of the model. In this example, the body function constructs a very simple model, which does not even have an optimiser. It takes the two data samples which will be provided by the DataSet, and performs some simple maths on them, and inserts the results into the output queue.

Typically, in this function, the full ML model would be constructed and a TensorFlow Optimizer would be used to generate a backward pass and variable update operations. The returned data would typically be a loss value, or perhaps nothing at all if all we do is call the training operation.

The my_net function is where the loops.repeat function is called. This wraps the body function in a loop. It takes as the first parameter the number of times to execute the operation, in this case 10. It also takes the function that generated the body of the loop, in this case the function body, a list of extra parameters to pass to the body, in this case none, and finally the infeed queue which will feed data into the loop.

Next we create an IPU scope at the top level and call ipu_compiler.compile passing the my_net function, to create the training loop in the main graph. The output of the ipu_compiler.compile will be an operation that can be called to execute the training loop.

Finally, we create an operation which can be used to fetch results from the outfeed queue. Note that it isn’t necessary to use an outfeed queue if you do not wish to receive any per-sample output from the training loop. If all you require is the final value of a tensor, then it can be output normally without the need for a queue.

If you run this example then you will find that the result is a Python dictionary containing two numpy arrays. The first is the d1 array and will contain x1 + x2 for each iteration in the loop. The second is the d2 array and will contain x1 - x2 for each iteration in the loop.

See entries in the Python API for more details.

Replicated graphs

To improve performance, multiple IPUs can be configured to run in a data parallel mode. The graph is said to be replicated across multiple IPUs. See the Poplar and Poplibs User Guide for more background about replicated graphs.

Note: replicated graphs are not supported when running on an IPU Model.

Selecting the number of replicas

During system configuration, you specify the number of IPUs for the TensorFlow device using the auto_select_ipus() function, or the select_ipus() function.

A graph can be sharded across multiple IPUs (model parallelism), and then replicated across IPUs (data parallelism). When specifying the number of IPUs in the system, you must specify a multiple of the number of shards used by the graph.

For instance, if a graph is sharded over two IPUs, and you specify eight IPUs to the auto_select_ipus function, then the graph will be replicated four times.
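
A sketch of that configuration, using the configuration functions shown earlier, might look like this:

 from tensorflow.python.ipu import utils

# The graph is sharded over 2 IPUs; requesting 8 IPUs gives 4 replicas.
config = utils.create_ipu_config()
config = utils.auto_select_ipus(config, 8)
utils.configure_ipu_system(config)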

Data feeds

When used with a replicated graph, the IPUInfeedQueue and IPUOutfeedQueue classes require the number of replicas to be passed into the constructor in the replication_factor parameter.
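
For example, for a graph replicated four times, the queues could be constructed as in the following sketch (the DataSet is the trivial one from the training loop example; see the Python API for the full constructor arguments):

 import tensorflow.compat.v1 as tf
from tensorflow.python.ipu import ipu_infeed_queue
from tensorflow.python.ipu import ipu_outfeed_queue
tf.disable_v2_behavior()

# A trivial DataSet, as in the training loop example.
ds = tf.data.Dataset.from_tensors(tf.constant(1.0, shape=[800])).repeat()

# Four replicas, so both queues are given replication_factor=4.
infeed_queue = ipu_infeed_queue.IPUInfeedQueue(ds, feed_name="infeed",
                                               replication_factor=4)
outfeed_queue = ipu_outfeed_queue.IPUOutfeedQueue(feed_name="outfeed",
                                                  replication_factor=4)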

Performing parameter updates

Each replica maintains its own copy of the graph, but during training it is important to ensure that the graph parameters are updated so that they are in sync across replicas.

A wrapper for standard TensorFlow optimisers is used to add extra operations to the parameter update nodes in the graph to average updates across replicas. It is called CrossReplicaOptimizer. See the Python API for more details.
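
A minimal sketch of wrapping a standard optimiser is shown below; the variable and loss are placeholders standing in for a real model.

 import tensorflow.compat.v1 as tf
from tensorflow.python.ipu import cross_replica_optimizer
tf.disable_v2_behavior()

# A trivial variable and loss, standing in for a real model.
w = tf.get_variable("w", shape=[10], initializer=tf.zeros_initializer())
loss = tf.reduce_sum(tf.square(w - 1.0))

# Wrap the optimiser so that updates are averaged across all replicas
# before being applied to the parameters.
optimizer = tf.train.GradientDescentOptimizer(0.01)
optimizer = cross_replica_optimizer.CrossReplicaOptimizer(optimizer)
train_op = optimizer.minimize(loss)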

Pipelined training

The IPU pipeline API creates a series of computational stages, where the outputs of one stage are the inputs to the next one. These stages are then executed in parallel across multiple IPUs. This approach can be used to split a model so that different layers, or groups of layers, execute on different IPUs.

This improves utilisation of the hardware when a model is too large to fit into a single IPU and must be sharded across multiple IPUs.

Each stage is a set of operations described by a Python function, in much the same way that ipu_compiler.compile takes a function describing the graph to compile onto the IPU.

See the Python API for more specific details of the ipu.pipeline operator.

The pipeline API requires data inputs to be provided by a tf.DataSet source connected via an infeed operation. If you would like per-sample output, for instance the loss, then this will have to be provided by an outfeed operation.

The computational stages can be interleaved on the devices in two different ways, as described by the pipeline_schedule parameter. By default the API will use the PipelineSchedule.Grouped mode, where the forward passes are grouped together and the backward passes are grouped together. The alternative is PipelineSchedule.Interleaved, where the forward and backward passes are interleaved so that fewer activations need to be stored.

Sharded scheduling

Sharded pipeline schedule illustration

Interleaved scheduling

Interleaved pipeline schedule illustration

Grouped scheduling

Grouped pipeline schedule illustration

Pipeline stage inputs and outputs

The first pipeline stage needs to have inputs which are a combination of the tensors from the DataSet, and the tensors given as arguments to the pipeline operation. Any data which changes for every sample or minibatch of the input should be included in the DataSet, while data which can vary only on each run of the pipeline should be passed as arguments to the pipeline operation. Parameters like the learning rate would fit into this latter case.

Every subsequent pipeline stage must have its inputs as the outputs of the previous stage. Note that things like the learning rate must be threaded through each pipeline stage until they are used.

Applying an optimiser to the graph

The optimiser must be applied by creating it in a special optimiser function and then returning a handle to it from that function. The function is passed into the optimizer_function argument of the pipeline operation.

When a pipeline is running it will accumulate the gradients from each step of the pipeline and only apply the updates to the graph parameters at the end of each pipeline run, whose length is given by the pipeline_depth parameter. The system therefore needs more knowledge of the optimiser, so it must be given to the pipeline operator using this function.

Device mapping

By default the pipeline operation will map the pipeline stages onto IPUs in order to minimise the inter-IPU communication lengths. If you need to override this order, then you can use the device_mapping parameter.
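
To show how the pieces described above fit together, the following sketch builds a two-stage pipeline. The parameter names (computational_stages, pipeline_depth, infeed_queue, outfeed_queue, optimizer_function, pipeline_schedule and device_mapping) are the ones described in this section, but the exact name and signature of the pipeline operator are assumptions here, so see the Python API for the definitive interface.

 import tensorflow.compat.v1 as tf
from tensorflow.python.ipu import ipu_compiler
from tensorflow.python.ipu import ipu_infeed_queue
from tensorflow.python.ipu import ipu_outfeed_queue
from tensorflow.python.ipu import scopes
from tensorflow.python.ipu.ops import pipelining_ops
tf.disable_v2_behavior()

# A DataSet of (features, labels) pairs, as in the training loop example.
ds = tf.data.Dataset.from_tensors(
    (tf.zeros([32, 784], tf.float32), tf.zeros([32], tf.int32)))
ds = ds.repeat()
infeed_queue = ipu_infeed_queue.IPUInfeedQueue(ds, feed_name="pipeline_infeed")
outfeed_queue = ipu_outfeed_queue.IPUOutfeedQueue(feed_name="pipeline_outfeed")


# First pipeline stage, placed on the first IPU.
def stage1(features, labels):
  x = tf.layers.dense(features, 128, activation=tf.nn.relu)
  return x, labels


# Second pipeline stage, placed on the second IPU, returning the loss.
def stage2(x, labels):
  logits = tf.layers.dense(x, 10)
  loss = tf.reduce_mean(
      tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels,
                                                     logits=logits))
  return loss


# The optimiser is handed to the pipeline through this function.
def optimizer_function(loss):
  opt = tf.train.GradientDescentOptimizer(0.01)
  return pipelining_ops.OptimizerFunctionOutput(opt, loss)


def my_net():
  # The exact name and signature of the pipeline operator are assumed here.
  return pipelining_ops.pipeline(
      computational_stages=[stage1, stage2],
      pipeline_depth=16,
      infeed_queue=infeed_queue,
      outfeed_queue=outfeed_queue,
      optimizer_function=optimizer_function,
      pipeline_schedule=pipelining_ops.PipelineSchedule.Grouped,
      device_mapping=[0, 1])


with scopes.ipu_scope('/device:IPU:0'):
  pipeline_op = ipu_compiler.compile(my_net, inputs=[])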

Dataset benchmarking

In order to fully utilise the potential of the IPU, the tf.data.Dataset used by the IPUInfeedQueue needs to be optimised so that the IPU is not constantly waiting for more data to become available.

To benchmark your tf.data.Dataset, you can make use of the ipu.dataset_benchmark tool. See the Python API for details of the ipu.dataset_benchmark functions, which allow you to obtain the maximum throughput of your tf.data.Dataset.

If the throughput of your tf.data.Dataset is the bottleneck, you can try to optimise it using the information on the TensorFlow website (see the dataset performance guide at https://www.tensorflow.org/guide/performance/datasets).

Accessing the JSON data

The functions in ipu.dataset_benchmark return the JSON as a string which can be loaded into a JSON object using the native JSON library, for example:

 import json

import tensorflow.compat.v1 as tf

from tensorflow.python import ipu

# Create your tf.data.Dataset
dataset = ...
benchmark_op = ipu.dataset_benchmark.dataset_benchmark(dataset, 10, 512)

with tf.Session() as sess:
    json_string = sess.run(benchmark_op)
    json_object = json.loads(json_string[0])

Targeting the IPU with TensorFlow 2

In TensorFlow version 2, eager execution is enabled by default and Keras has become the main API for constructing models. Distribution strategies are the new way of targeting different pieces of hardware.

As in TensorFlow version 1, there are a small number of things that need to be done when constructing and executing a model in order to target the IPU efficiently. The IPU achieves its performance by fusing operations into a single kernel that is executed repeatedly, amortising the cost of control and I/O.

IPUStrategy

Distribution strategies are a more advanced and flexible version of device tagging. The IPUStrategy is a subclass of distribution strategy which specifically targets a system with one or more IPUs attached. A separate class, IPUMultiWorkerStrategy, targets configurations with multiple host systems.

Use the strategy.scope() context to ensure that everything within that context will be compiled for the IPU device. You should do this instead of using the tf.device context.

 from tensorflow.python import ipu

# Create an IPU distribution strategy
strategy = ipu.ipu_strategy.IPUStrategy()

with strategy.scope():
    ...

It is important to construct any Keras model within the scope of the IPUStrategy, because a Keras Model class may create some of the model at construction time, and some other parts of it at execution time.

See the online documentation for more details.

Function annotation with @tf.function

The function annotation @tf.function is well documented in the standard TensorFlow documentation. It converts the body of the annotated function into a fused set of operations that are executed as a group, in the same way as a whole graph would have been in TensorFlow version 1. In addition, a library called autograph will convert Python flow control constructs into TensorFlow graph operations.

Best practice is to ensure that anything which is intended to be executed on the IPU is placed into a function and annotated with @tf.function. This does not apply to constructing a Keras model or using the Keras Model.fit() API. See below for details on Keras.

When calling a function that is marked with @tf.function from within a distribution strategy such as IPUStrategy, you should not call it directly, but instead use the experimental_run_v2 method.

See the following online resources for more information.

Keras

The Keras API is used for constructing models using a set of high-level Layer objects. See https://www.tensorflow.org/guide/keras for more information.

Full support is available for Keras on the IPU. It is important to ensure that the model is both instantiated and called from within an IPUStrategy context.

The Model.fit method

This method of the Keras Model class can be used within an IPUStrategy to train a model without the need for a specialised training loop.

For high performance training, the fit API should be avoided, because it does not provide an on-device training loop.

Custom training loops

If a more sophisticated training loop is required, then it can be described inside a function which is marked as a @tf.function. See the examples section for a full example.

The outer training function should be called using the experimental_run_v2 method on the IPUStrategy object, to ensure that it is executed using the strategy’s configuration.

PipelinedModel

PipelinedModel is a substitute for the Keras Sequential model class, with support for multi-device IPU pipelines. Using pipelined execution allows the IPU to achieve high compute efficiency while utilising multiple devices.

The PipelinedModel has the same API as the standard Keras Model and Sequential classes, but will train the model on multiple IPUs and stream the data into the devices using an Infeed queue which is created automatically.

Rather than a single list of layers, as with the standard Sequential model, the constructor takes a list of lists of layers, one for each IPU pipeline stage. See the examples section to see how the API is used.

In a machine learning model, a step is often considered to be one pass through the model in which the forward pass is done, the gradients are calculated and the parameters are updated. Since a pipeline accumulates multiple gradients before applying them collectively to the parameters, we call one of those pipeline operations a step. The number of data samples processed per step is therefore the batch size multiplied by the pipeline depth; for example, with a batch size of 32 and a pipeline depth of 16, each step processes 512 samples.

This will be reflected in the rate at which the progress bar advances, and the entries in the Keras History.

TensorFlow 2 examples

This example shows the Keras API and the IPUStrategy being used to train a model using the Keras Model.fit() method.

 from __future__ import absolute_import, division, print_function, unicode_literals

import tensorflow as tf

from tensorflow import keras
from tensorflow.python import ipu

#
# Configure the IPU system
#
cfg = ipu.utils.create_ipu_config()
cfg = ipu.utils.auto_select_ipus(cfg, 1)
ipu.utils.configure_ipu_system(cfg)


#
# The input data and labels
#
def create_dataset():
  mnist = tf.keras.datasets.mnist

  (x_train, y_train), (_, _) = mnist.load_data()
  x_train = x_train / 255.0

  train_ds = tf.data.Dataset.from_tensor_slices(
      (x_train, y_train)).shuffle(10000).batch(32)
  train_ds = train_ds.map(lambda d, l:
                          (tf.cast(d, tf.float32), tf.cast(l, tf.float32)))

  return train_ds.repeat()


#
# The model.  Because this model does not have a specific shape for its inputs
# it will be constructed when it is first called (when `Model.fit` runs). So
# it does not need to be an IPU device targeted model.
#
def create_model():
  m = keras.models.Sequential([
      keras.layers.Flatten(),
      keras.layers.Dense(128, activation='relu'),
      keras.layers.Dense(10, activation='softmax')
  ])
  return m


# Create an IPU distribution strategy
strategy = ipu.ipu_strategy.IPUStrategy()

with strategy.scope():
  # Create an instance of the model
  model = create_model()

  # Get the training dataset
  ds = create_dataset()

  # Train the model
  model.compile(loss=tf.keras.losses.SparseCategoricalCrossentropy(),
                optimizer=tf.keras.optimizers.SGD())
  model.fit(ds, steps_per_epoch=2000, epochs=4)

This example shows the same model being trained using a custom training function.

 from __future__ import absolute_import, division, print_function, unicode_literals

import tensorflow as tf

from tensorflow import keras
from tensorflow.python import ipu

step_count = 10000

#
# Configure the IPU system
#
cfg = ipu.utils.create_ipu_config()
cfg = ipu.utils.auto_select_ipus(cfg, 1)
ipu.utils.configure_ipu_system(cfg)


#
# The input data and labels
#
def create_dataset():
  mnist = tf.keras.datasets.mnist

  (x_train, y_train), (_, _) = mnist.load_data()
  x_train = x_train / 255.0

  train_ds = tf.data.Dataset.from_tensor_slices(
      (x_train, y_train)).shuffle(10000).batch(32)
  train_ds = train_ds.map(lambda d, l:
                          (tf.cast(d, tf.float32), tf.cast(l, tf.int32)))

  return train_ds.repeat()


#
# The model.  Because this model does not have a specific shape for its inputs
# it will be constructed when it is first called (in the `training_step` function). So
# it does not need to be an IPU device targeted model.
#
def create_model():
  m = keras.models.Sequential([
      keras.layers.Flatten(),
      keras.layers.Dense(128, activation='relu'),
      keras.layers.Dense(10, activation='softmax')
  ])
  return m


# The custom training loop
@tf.function
def training_step(features, labels, model, opt):
  with tf.GradientTape() as tape:
    predictions = model(features, training=True)
    prediction_loss = keras.losses.sparse_categorical_crossentropy(
        labels, predictions)
    loss = tf.reduce_mean(prediction_loss)

  grads = tape.gradient(loss, model.trainable_variables)
  opt.apply_gradients(zip(grads, model.trainable_variables))
  return loss


# Create an IPU distribution strategy
strategy = ipu.ipu_strategy.IPUStrategy()

with strategy.scope():
  # An optimizer for updating the trainable variables
  opt = tf.keras.optimizers.SGD(0.01)

  # Create an instance of the model
  model = create_model()

  # Get the training dataset
  ds = create_dataset()

  # Train the model
  for (x, y), c in zip(ds, range(step_count)):
    loss = strategy.experimental_run_v2(training_step, args=[x, y, model, opt])

    if not c % 50:
      print("Step " + str(c) + " loss = " + str(loss.numpy()))

This example shows how to use the IPU specific Keras pipelined model class to train a network.

 import argparse
import tensorflow as tf

from tensorflow.python import ipu

from tensorflow.python.ipu.keras.layers import Embedding
from tensorflow.python.ipu.keras.layers import LSTM

from tensorflow.python.keras.datasets import imdb
from tensorflow.python.keras.layers import Dense
from tensorflow.python.keras.preprocessing import sequence
from tensorflow.python.keras.optimizer_v2.adam import Adam

max_features = 20000


# Define the dataset
def get_dataset():
  (x_train, y_train), (_, _) = imdb.load_data(num_words=max_features)

  x_train = sequence.pad_sequences(x_train, maxlen=80)

  ds = tf.data.Dataset.from_tensor_slices((x_train, y_train))
  ds = ds.repeat()
  ds = ds.map(lambda x, y: (x, tf.cast(y, tf.int32)))
  ds = ds.batch(32, drop_remainder=True)
  return ds


# Define the model
def get_model():
  return ipu.keras.PipelinedModel(
      [[Embedding(max_features, 128)],
       [LSTM(128, dropout=0.2),
        Dense(1, activation='sigmoid')]],
      pipeline_depth=16)


#
# Main code
#

# Parse command line args
parser = argparse.ArgumentParser("Config Parser", add_help=False)
parser.add_argument('--steps-per-epoch',
                    type=int,
                    default=768,
                    help="Number of steps in each epoch.")
parser.add_argument('--epochs',
                    type=int,
                    default=10,
                    help="Number of epochs to run.")
args = parser.parse_args()

# Configure IPUs
cfg = ipu.utils.create_ipu_config()
cfg = ipu.utils.auto_select_ipus(cfg, 2)
ipu.utils.configure_ipu_system(cfg)

# Set up IPU strategy
strategy = ipu.ipu_strategy.IPUStrategy()
with strategy.scope():

  model = get_model()

  model.compile(loss='binary_crossentropy', optimizer=Adam(0.005))

  model.fit(get_dataset(),
            steps_per_epoch=args.steps_per_epoch,
            epochs=args.epochs)

Example using IPUEstimator

This example shows how to use the IPUEstimator to train a simple CNN on the CIFAR-10 dataset. The XLA compilation is already handled while using the IPUEstimator, so the model_fn should not be manually compiled with ipu_compiler.

 from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import argparse
import time

import tensorflow.compat.v1 as tf

from tensorflow.keras import Sequential
from tensorflow.keras.datasets import cifar10
from tensorflow.keras.layers import Conv2D, MaxPooling2D
from tensorflow.keras.layers import Dense, Dropout, Activation, Flatten
from tensorflow.python import ipu

NUM_CLASSES = 10


def model_fn(features, labels, mode, params):
  """A simple CNN based on https://keras.io/examples/cifar10_cnn/"""

  model = Sequential()
  model.add(Conv2D(32, (3, 3), padding="same"))
  model.add(Activation("relu"))
  model.add(Conv2D(32, (3, 3)))
  model.add(Activation("relu"))
  model.add(MaxPooling2D(pool_size=(2, 2)))
  model.add(Dropout(0.25))

  model.add(Conv2D(64, (3, 3), padding="same"))
  model.add(Activation("relu"))
  model.add(Conv2D(64, (3, 3)))
  model.add(Activation("relu"))
  model.add(MaxPooling2D(pool_size=(2, 2)))
  model.add(Dropout(0.25))

  model.add(Flatten())
  model.add(Dense(512))
  model.add(Activation("relu"))
  model.add(Dropout(0.5))
  model.add(Dense(NUM_CLASSES))

  logits = model(features, training=mode == tf.estimator.ModeKeys.TRAIN)

  loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)

  if mode == tf.estimator.ModeKeys.EVAL:
    predictions = tf.argmax(input=logits, axis=-1)
    eval_metric_ops = {
        "accuracy": tf.metrics.accuracy(labels=labels,
                                        predictions=predictions),
    }
    return tf.estimator.EstimatorSpec(mode,
                                      loss=loss,
                                      eval_metric_ops=eval_metric_ops)

  if mode == tf.estimator.ModeKeys.TRAIN:
    optimizer = tf.train.GradientDescentOptimizer(params["learning_rate"])
    if params["replicas"] > 1:
      optimizer = ipu.cross_replica_optimizer.CrossReplicaOptimizer(optimizer)
    train_op = optimizer.minimize(loss=loss)
    return tf.estimator.EstimatorSpec(mode=mode, loss=loss, train_op=train_op)

  raise NotImplementedError(mode)


def parse_args():
  parser = argparse.ArgumentParser()

  parser.add_argument(
      "--test-only",
      action="store_true",
      help="Skip training and test using latest checkpoint from model_dir.")

  parser.add_argument("--batch-size",
                      type=int,
                      default=32,
                      help="The batch size.")

  parser.add_argument(
      "--iterations-per-loop",
      type=int,
      default=100,
      help="The number of iterations (batches) per loop on IPU.")

  parser.add_argument("--log-interval",
                      type=int,
                      default=10,
                      help="Interval at which to log progress.")

  parser.add_argument("--summary-interval",
                      type=int,
                      default=1,
                      help="Interval at which to write summaries.")

  parser.add_argument("--training-steps",
                      type=int,
                      default=200000,
                      help="Total number of training steps.")

  parser.add_argument(
      "--learning-rate",
      type=float,
      default=0.01,
      help="The learning rate used with stochastic gradient descent.")

  parser.add_argument(
      "--replicas",
      type=int,
      default=1,
      help="The replication factor. Increases the number of IPUs "
      "used and the effective batch size by this factor.")

  parser.add_argument(
      "--model-dir",
      help="Directory where checkpoints and summaries are stored.")

  return parser.parse_args()


def create_ipu_estimator(args):
  ipu_options = ipu.utils.create_ipu_config(
      profiling=False,
      use_poplar_text_report=False,
  )

  ipu.utils.auto_select_ipus(ipu_options, num_ipus=args.replicas)

  ipu_run_config = ipu.ipu_run_config.IPURunConfig(
      iterations_per_loop=args.iterations_per_loop,
      num_replicas=args.replicas,
      ipu_options=ipu_options,
  )

  config = ipu.ipu_run_config.RunConfig(
      ipu_run_config=ipu_run_config,
      log_step_count_steps=args.log_interval,
      save_summary_steps=args.summary_interval,
      model_dir=args.model_dir,
  )

  return ipu.ipu_estimator.IPUEstimator(
      config=config,
      model_fn=model_fn,
      params={
          "learning_rate": args.learning_rate,
          "replicas": args.replicas
      },
  )


def train(ipu_estimator, args, x_train, y_train):
  """Train a model on IPU and save checkpoints to the given `args.model_dir`."""
  def input_fn():
    # If using Dataset.from_tensor_slices(), the data will be embedded
    # into the graph as constants, which makes the training graph very
    # large and impractical. So use Dataset.from_generator() here instead,
    # but add prefetching and caching to improve performance.

    def generator():
      return zip(x_train, y_train)

    types = (x_train.dtype, y_train.dtype)
    shapes = (x_train.shape[1:], y_train.shape[1:])

    dataset = tf.data.Dataset.from_generator(generator, types, shapes)
    dataset = dataset.prefetch(len(x_train)).cache()
    dataset = dataset.repeat()
    dataset = dataset.shuffle(len(x_train))
    dataset = dataset.batch(args.batch_size, drop_remainder=True)

    return dataset

  # Training progress is logged as INFO, so enable that logging level
  tf.logging.set_verbosity(tf.logging.INFO)

  t0 = time.time()
  ipu_estimator.train(input_fn=input_fn, steps=args.training_steps)
  t1 = time.time()

  duration_seconds = t1 - t0
  images_per_step = args.batch_size * args.replicas
  images_per_second = args.training_steps * images_per_step / duration_seconds
  print("Took {:.2f} minutes, i.e. {:.0f} images per second".format(
      duration_seconds / 60, images_per_second))


def calc_batch_size(num_examples, batches_per_loop, batch_size):
  """Reduce the batch size if needed to cover all examples without a remainder."""
  assert batch_size > 0
  assert num_examples % batches_per_loop == 0
  while num_examples % (batch_size * batches_per_loop) != 0:
    batch_size -= 1
  return batch_size


def test(ipu_estimator, args, x_test, y_test):
  """Test the model on IPU by loading weights from the final checkpoint in the
  given `args.model_dir`."""

  num_test_examples = len(x_test)

  batches_per_loop = args.replicas * args.iterations_per_loop
  test_batch_size = calc_batch_size(num_test_examples, batches_per_loop,
                                    args.batch_size)

  if test_batch_size != args.batch_size:
    print("Test batch size changed to {}.".format(test_batch_size))

  def input_fn():
    dataset = tf.data.Dataset.from_tensor_slices((x_test, y_test))
    dataset = dataset.batch(test_batch_size, drop_remainder=True)
    return dataset

  num_steps = num_test_examples // (test_batch_size * args.replicas)
  metrics = ipu_estimator.evaluate(input_fn=input_fn, steps=num_steps)
  test_loss = metrics["loss"]
  test_accuracy = metrics["accuracy"]

  print("Test loss: {:g}".format(test_loss))
  print("Test accuracy: {:.2f}%".format(100 * test_accuracy))


def main():
  args = parse_args()
  train_data, test_data = cifar10.load_data()

  num_test_examples = len(test_data[0])
  batches_per_loop = args.replicas * args.iterations_per_loop
  if num_test_examples % batches_per_loop != 0:
    raise ValueError(("replicas * iterations_per_loop ({} * {}) must evenly " +
                      "divide the number of test examples ({})").format(
                          args.replicas, args.iterations_per_loop,
                          num_test_examples))

  ipu_estimator = create_ipu_estimator(args)

  def normalise(x, y):
    return x.astype("float32") / 255.0, y.astype("int32")

  if not args.test_only:
    print("Training...")
    x_train, y_train = normalise(*train_data)
    train(ipu_estimator, args, x_train, y_train)

  print("Testing...")
  x_test, y_test = normalise(*test_data)
  test(ipu_estimator, args, x_test, y_test)


if __name__ == "__main__":
  main()

Example using IPUPipelineEstimator

This example shows how to use the IPUPipelineEstimator to train a simple CNN on the CIFAR-10 dataset. It can be compared to the example using the IPUEstimator (Example using IPUEstimator) to see the changes required to add pipelined execution to a model.

 from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import argparse
import time

import tensorflow.compat.v1 as tf

from tensorflow.keras.datasets import cifar10
from tensorflow.keras.layers import Conv2D, MaxPooling2D
from tensorflow.keras.layers import Dense, Dropout, Activation, Flatten
from tensorflow.python import ipu

NUM_CLASSES = 10


def model_fn(mode, params):
  """A simple CNN based on https://keras.io/examples/cifar10_cnn/ split
  into two pipeline stages placed on different IPUs."""

  # Tell the dropout layers whether we are training to avoid a placeholder.
  is_training = mode == tf.estimator.ModeKeys.TRAIN

  def stage1(features, labels):
    partial = Conv2D(32, (3, 3), padding="same")(features)
    partial = Activation("relu")(partial)
    partial = Conv2D(32, (3, 3))(partial)
    partial = Activation("relu")(partial)
    partial = MaxPooling2D(pool_size=(2, 2))(partial)
    partial = Dropout(0.25)(partial, training=is_training)

    return partial, labels

  def stage2(partial, labels):
    partial = Conv2D(64, (3, 3), padding="same")(partial)
    partial = Activation("relu")(partial)
    partial = Conv2D(64, (3, 3))(partial)
    partial = Activation("relu")(partial)
    partial = MaxPooling2D(pool_size=(2, 2))(partial)
    partial = Dropout(0.25)(partial, training=is_training)

    partial = Flatten()(partial)
    partial = Dense(512)(partial)
    partial = Activation("relu")(partial)
    partial = Dropout(0.5)(partial, training=is_training)
    logits = Dense(NUM_CLASSES)(partial)

    loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)

    if mode == tf.estimator.ModeKeys.TRAIN:
      # This return value is passed to the `optimizer_function`.
      return loss

    if mode == tf.estimator.ModeKeys.EVAL:
      predictions = tf.argmax(input=logits, axis=1, output_type=tf.int32)
      # These return values are passed to the `eval_metrics_fn`.
      return loss, predictions, labels

    raise NotImplementedError(mode)

  def optimizer_function(loss):
    optimizer = tf.train.GradientDescentOptimizer(params["learning_rate"])
    return ipu.ops.pipelining_ops.OptimizerFunctionOutput(optimizer, loss)

  def eval_metrics_fn(loss, predictions, labels):
    # This is executed on the host.
    return {
        "loss": loss,
        "accuracy": tf.metrics.accuracy(predictions=predictions,
                                        labels=labels),
    }

  return ipu.ipu_pipeline_estimator.IPUPipelineEstimatorSpec(
      mode,
      computational_stages=[stage1, stage2],
      optimizer_function=optimizer_function,
      eval_metrics_fn=eval_metrics_fn,
      pipeline_depth=params["pipeline_depth"])


def parse_args():
  parser = argparse.ArgumentParser()

  parser.add_argument(
      "--test-only",
      action="store_true",
      help="Skip training and test using latest checkpoint from model_dir.")

  parser.add_argument("--batch-size",
                      type=int,
                      default=16,
                      help="The batch size.")

  parser.add_argument(
      "--pipeline-depth",
      type=int,
      default=4,
      help="The the number of batches that will be pipelined together.")

  parser.add_argument(
      "--iterations-per-loop",
      type=int,
      default=100,
      help="The number of iterations (pipelines executions) per loop on IPU.")

  parser.add_argument("--log-interval",
                      type=int,
                      default=10,
                      help="Interval at which to log progress.")

  parser.add_argument("--summary-interval",
                      type=int,
                      default=1,
                      help="Interval at which to write summaries.")

  parser.add_argument("--training-steps",
                      type=int,
                      default=100000,
                      help="Total number of training steps.")

  parser.add_argument(
      "--learning-rate",
      type=float,
      default=0.01,
      help="The learning rate used with stochastic gradient descent.")

  parser.add_argument(
      "--model-dir",
      help="Directory where checkpoints and summaries are stored.")

  return parser.parse_args()


def create_ipu_estimator(args):
  num_ipus_in_pipeline = 2

  ipu_options = ipu.utils.create_ipu_config()
  ipu.utils.auto_select_ipus(ipu_options, num_ipus_in_pipeline)

  ipu_run_config = ipu.ipu_run_config.IPURunConfig(
      num_shards=num_ipus_in_pipeline,
      iterations_per_loop=args.iterations_per_loop,
      ipu_options=ipu_options,
  )

  config = ipu.ipu_run_config.RunConfig(
      ipu_run_config=ipu_run_config,
      log_step_count_steps=args.log_interval,
      save_summary_steps=args.summary_interval,
      model_dir=args.model_dir,
  )

  return ipu.ipu_pipeline_estimator.IPUPipelineEstimator(
      config=config,
      model_fn=model_fn,
      params={
          "learning_rate": args.learning_rate,
          "pipeline_depth": args.pipeline_depth,
      },
  )


def train(ipu_estimator, args, x_train, y_train):
  """Train a model on IPU and save checkpoints to the given `args.model_dir`."""
  def input_fn():
    # If using Dataset.from_tensor_slices(), the data will be embedded
    # into the graph as constants, which makes the training graph very
    # large and impractical. So use Dataset.from_generator() here instead,
    # but add prefetching and caching to improve performance.

    def generator():
      return zip(x_train, y_train)

    types = (x_train.dtype, y_train.dtype)
    shapes = (x_train.shape[1:], y_train.shape[1:])

    dataset = tf.data.Dataset.from_generator(generator, types, shapes)
    dataset = dataset.prefetch(len(x_train)).cache()
    dataset = dataset.repeat()
    dataset = dataset.shuffle(len(x_train))
    dataset = dataset.batch(args.batch_size, drop_remainder=True)

    return dataset

  # Training progress is logged as INFO, so enable that logging level
  tf.logging.set_verbosity(tf.logging.INFO)

  t0 = time.time()
  ipu_estimator.train(input_fn=input_fn, steps=args.training_steps)
  t1 = time.time()

  duration_seconds = t1 - t0
  images_per_step = args.batch_size * args.pipeline_depth
  images_per_second = args.training_steps * images_per_step / duration_seconds
  print("Took {:.2f} minutes, i.e. {:.0f} images per second".format(
      duration_seconds / 60, images_per_second))


def calc_batch_size(num_examples, batches_per_loop, batch_size):
  """Reduce the batch size if needed to cover all examples without a remainder."""
  assert batch_size > 0
  assert num_examples % batches_per_loop == 0
  while num_examples % (batch_size * batches_per_loop) != 0:
    batch_size -= 1
  return batch_size


def test(ipu_estimator, args, x_test, y_test):
  """Test the model on IPU by loading weights from the final checkpoint in the
  given `args.model_dir`."""

  num_test_examples = len(x_test)

  batches_per_loop = args.pipeline_depth * args.iterations_per_loop
  test_batch_size = calc_batch_size(num_test_examples, batches_per_loop,
                                    args.batch_size)

  if test_batch_size != args.batch_size:
    print("Test batch size changed to {}.".format(test_batch_size))

  def input_fn():
    dataset = tf.data.Dataset.from_tensor_slices((x_test, y_test))
    dataset = dataset.batch(test_batch_size, drop_remainder=True)
    return dataset

  num_steps = num_test_examples // (test_batch_size * args.pipeline_depth)
  metrics = ipu_estimator.evaluate(input_fn=input_fn, steps=num_steps)
  test_loss = metrics["loss"]
  test_accuracy = metrics["accuracy"]

  print("Test loss: {:g}".format(test_loss))
  print("Test accuracy: {:.2f}%".format(100 * test_accuracy))


def main():
  args = parse_args()
  train_data, test_data = cifar10.load_data()

  num_test_examples = len(test_data[0])
  batches_per_loop = args.pipeline_depth * args.iterations_per_loop
  if num_test_examples % batches_per_loop != 0:
    raise ValueError(("pipeline_depth * iterations_per_loop ({} * {}) must " +
                      "evenly divide the number of test examples ({})").format(
                          args.pipeline_depth, args.iterations_per_loop,
                          num_test_examples))

  ipu_estimator = create_ipu_estimator(args)

  def normalise(x, y):
    return x.astype("float32") / 255.0, y.astype("int32")

  if not args.test_only:
    print("Training...")
    x_train, y_train = normalise(*train_data)
    train(ipu_estimator, args, x_train, y_train)

  print("Testing...")
  x_test, y_test = normalise(*test_data)
  test(ipu_estimator, args, x_test, y_test)


if __name__ == "__main__":
  main()

Distributed training example

This example shows how to use the IPUEstimator with the IPUMultiWorkerStrategy to perform distributed training of a model on the MNIST dataset.

The example is based on the following official tutorial with some modifications for use with the IPU: https://www.tensorflow.org/tutorials/distribute/multi_worker_with_estimator

We highlight the changes needed to convert code using IPUEstimator to support distributed training below.

The input function

In multi-worker training, it is necessary to shard the dataset such that each worker processes distinct portions of the dataset.

When used in a distributed context, the input function is passed an additional argument input_context that can be used to get the current worker index and the total number of workers. We pass this information to the Dataset.shard() function to perform the sharding.

Note that the batch size provided by the input function is the per-worker batch size. The global batch size will be this multiplied by the number of workers.

The model function

The optimiser will automatically divide the loss by the number of workers, so in the model function we should only divide the loss by the local batch size.

We make some changes to how the weights of the model are updated. Instead of using the high-level Optimizer.minimize() function, we use Optimizer.compute_gradients() and Optimizer.apply_gradients() separately in order to control their placement. The Optimizer.compute_gradients() call (the backward pass) is placed on the IPU, while the Optimizer.apply_gradients() call (the allreduce of gradients and the weight updates) is placed on the host. This is done by using the host_call parameter in IPUEstimatorSpec.

In practice this means that the gradients will be streamed from the IPU to the host as soon as they are computed. The workers will then start reducing the gradients amongst themselves, allowing overlap between the backward pass on the IPUs with the reductions on the hosts. After a gradient is reduced across the workers, the corresponding weight update is also done on the host.

The reduction is done using a ring-based collectives implementation with gRPC as the cross-host communication layer.

One benefit of this approach is that any additional optimiser state (such as momentum) is only needed in host memory, so there is no additional IPU memory consumption when using stateful optimisers with this approach.

Cluster definition

We use the TFConfigClusterResolver which reads the TF_CONFIG environment variable to determine the cluster definition.

There are two components of TF_CONFIG: cluster and task.

  • cluster provides information about the entire cluster, namely the workers and parameter servers in the cluster.

  • task provides information about the current task.

In this example, the task type is worker and the task index is 0. You could run this example with two workers on the same machine (in different terminals) like this:

 $ TF_CONFIG='{"cluster":{"worker":["localhost:3737","localhost:3738"]},"task":{"type":"worker","index":0}}' python distributed_training_example.py
$ TF_CONFIG='{"cluster":{"worker":["localhost:3737","localhost:3738"]},"task":{"type":"worker","index":1}}' python distributed_training_example.py

Complete example

 from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import argparse
import numpy as np

import tensorflow.compat.v1 as tf

from tensorflow.python import ipu

BATCH_SIZE = 64


def input_fn(mode, input_context=None):  # pylint: disable=unused-argument
  train_data, _ = tf.keras.datasets.mnist.load_data()

  def normalise(image, label):
    image = image.astype(np.float32) / 255.0
    image = np.expand_dims(image, axis=-1)
    label = label.astype(np.int32)
    return image, label

  x_train, y_train = normalise(*train_data)

  def generator():
    return zip(x_train, y_train)

  types = (x_train.dtype, y_train.dtype)
  shapes = (x_train.shape[1:], y_train.shape[1:])
  mnist_dataset = tf.data.Dataset.from_generator(generator, types, shapes)

  if input_context:
    mnist_dataset = mnist_dataset.shard(input_context.num_input_pipelines,
                                        input_context.input_pipeline_id)

  mnist_dataset = mnist_dataset.shuffle(len(y_train)) \
      .cache().batch(BATCH_SIZE, drop_remainder=True).repeat()
  return mnist_dataset


def model_fn(features, labels, mode):
  model = tf.keras.Sequential([
      tf.keras.layers.Conv2D(32, 3, activation="relu"),
      tf.keras.layers.MaxPooling2D(),
      tf.keras.layers.Flatten(),
      tf.keras.layers.Dense(64, activation="relu"),
      tf.keras.layers.Dense(10)
  ])
  logits = model(features, training=mode == tf.estimator.ModeKeys.TRAIN)

  if mode == tf.estimator.ModeKeys.PREDICT:
    predictions = {"logits": logits}
    return tf.estimator.EstimatorSpec(mode, predictions=predictions)

  optimizer = tf.compat.v1.train.AdamOptimizer()
  loss = tf.keras.losses.SparseCategoricalCrossentropy(
      from_logits=True, reduction=tf.compat.v1.losses.Reduction.NONE)(labels,
                                                                      logits)
  loss = tf.reduce_sum(loss) * (1. / BATCH_SIZE)
  if mode == tf.estimator.ModeKeys.EVAL:
    predictions = tf.argmax(input=logits, axis=-1)
    eval_metric_ops = {
        "accuracy":
        tf.compat.v1.metrics.accuracy(labels=labels, predictions=predictions),
    }
    return tf.estimator.EstimatorSpec(mode,
                                      loss=loss,
                                      eval_metric_ops=eval_metric_ops)

  variables = model.trainable_variables

  def host_model_fn(*host_gradients):
    # This will allreduce the gradients and update the weights on the host.
    return optimizer.apply_gradients(zip(host_gradients, variables))

  train_op = tf.identity(loss)
  grads_and_vars = optimizer.compute_gradients(loss, var_list=variables)
  gradients = [g for (g, _) in grads_and_vars]
  host_call = (host_model_fn, gradients)

  return ipu.ipu_estimator.IPUEstimatorSpec(mode=mode,
                                            loss=loss,
                                            train_op=train_op,
                                            host_call=host_call)


# Get the cluster configuration from the TF_CONFIG environment variable.
cluster = tf.distribute.cluster_resolver.TFConfigClusterResolver()
# Create strategy that places variables (including momentums) on the host.
strategy = ipu.ipu_multi_worker_strategy.IPUMultiWorkerStrategy(
    cluster, variables_on_host=True)

ipu_options = ipu.utils.create_ipu_config()
ipu.utils.auto_select_ipus(ipu_options, num_ipus=1)
ipu_run_config = ipu.ipu_run_config.IPURunConfig(ipu_options=ipu_options)

config = ipu.ipu_run_config.RunConfig(
    session_config=tf.ConfigProto(allow_soft_placement=False),
    ipu_run_config=ipu_run_config,
    train_distribute=strategy,
)

parser = argparse.ArgumentParser()
parser.add_argument("--num-steps", type=int, default=10000)
parser.add_argument("--model-dir")
args = parser.parse_args()

classifier = ipu.ipu_estimator.IPUEstimator(
    config=config,
    model_fn=model_fn,
    model_dir=args.model_dir,
)

# Training progress is logged as INFO, so enable that logging level.
tf.logging.set_verbosity(tf.logging.INFO)

tf.estimator.train_and_evaluate(
    classifier,
    train_spec=tf.estimator.TrainSpec(input_fn=input_fn,
                                      max_steps=args.num_steps),
    eval_spec=tf.estimator.EvalSpec(input_fn=input_fn))

Custom IPU operations

There are three mechanisms for providing custom operations to the IPU through the TensorFlow interface. The first uses a fully custom codelet and host build file.

The second case is a custom operation which is executed on the CPU.

The third possibility is a custom, fused elementwise arithmetic operation. In this last case, the gradient creation in the optimisers will not produce a gradient operation for the custom operation.

Fully customised IPU operations

You can provide a custom operation to be compiled into the Poplar executable and run on the IPU hardware. You must provide a host-side shared object library that implements the action of adding vertices to a Poplar graph, given some Poplar tensor inputs. You can optionally provide a Poplar source code or binary file containing one or more “codelets” (code that runs on the IPU).

For more information about writing codelets, please refer to the Poplar and Poplibs User Guide.

These operations are added with ipu.user_ops.precompiled_user_op. More information about this can be found in Python API. An example of this is shown below.

The shared object file must contain an undecorated symbol, declared as shown below, which adds vertices to the graph that perform the custom operation. The name of the symbol should match the name of the operation in the graph. By default these types of operations are called Build.

 extern "C"
poplar::program::Program Build(
  poplar::Graph& graph, const std::vector<poplar::Tensor>& inputs,
  std::vector<poplar::Tensor>& outputs, const std::string &debug_prefix)

The arguments are:

  • graph: the Poplar graph into which to add tensors and vertices.

  • inputs: a vector of Poplar tensors which are inputs to the operation.

  • outputs: a vector into which to store the outputs of the operation. The vector will contain zero entries when the Build function is called.

  • debug_prefix: the debug name that has been given to the operation in the TensorFlow graph.

If the operation can have its gradient taken, then the shared object can contain a separate gradient builder function. This must be given the same name as the forward operation with _grad appended. The signature of the builder function is slightly different, as it takes the forward pass outputs and inputs as arguments, as well as the gradient outputs.

 extern "C"
poplar::program::Program Build_grad(
    poplar::Graph& graph, int input_grad_index,
    const std::vector<poplar::Tensor>& gradients,
    const std::vector<poplar::Tensor>& fwd_outputs,
    const std::vector<poplar::Tensor>& fwd_inputs,
    std::vector<poplar::Tensor>& outputs,
    const std::string& debug_prefix)

The arguments are:

  • graph: the Poplar graph into which to add tensors and vertices.

  • input_grad_index: The index of the input for which this operation is producing the partial derivative. If the gradient operation calculates all of the partial derivatives, then this input should be ignored.

  • gradients: the inputs to the gradient operation, from the previous gradient operation or loss.

  • fwd_outputs: the tensors which are the outputs of the forward operation.

  • fwd_inputs: the tensors which are the inputs to the forward operation.

  • outputs: the outputs of this gradient operation. There must be one per input of the original forward operation. Inputs which are not differentiable can have a null Poplar tensor.

  • debug_prefix: the name of the operation.

Metadata

The shared object file can optionally contain an undecorated symbol that is the same as the builder function with _metadata appended. This function must have the following signature:

 extern "C"
void Build_metadata(std::vector<std::int64_t>& allocating_indices,
  std::uint32_t& num_inplace, bool& is_elementwise,
  std::uint32_t num_inputs)

The arguments are:

  • allocating_indices: indicates which of the inputs should be allocated using the tensor allocation function. See the description in Tensor allocation.

  • num_inplace: indicates the number of inputs which are ‘in place’. The first num_inplace of the inputs will be considered to be in-place.

  • is_elementwise: indicates that this operation is element-wise.

  • num_inputs: indicates how many inputs are on the operation.

The function should fill in the values of the first three arguments, which are all reference types.

In-place operations

If an operation does an in-place modification of an input tensor, as opposed to creating a new output tensor, then num_inplace can be used to indicate that this is the case. The system will ensure that, when a tensor is updated in place, any other uses of that tensor are complete before the operation is run.

If a tensor is not marked as in place then the operation must not modify it. If it is modified then other operations which consume it may see an incorrect value on their input.

Elementwise operations

The IPU driver can do a better job of allocating the layout of Poplar tensors if it can associate them with specific operations. If the output of an operation is the same shape and layout as its first input, then it should be marked as elementwise.

Typically, the graph building code for the operation will clone the input in order to generate the output Poplar tensor.

Tensor allocation

When generating the Poplar graph, sometimes the backend has the freedom to allocate an input to an operation. This happens when an input to an operation is also the input to the graph, or when previous operations do not put constraints on the input tensor.

If this condition occurs, then by default the backend will create the Poplar tensor with linear mapping. See the section on tile mapping in the Poplar and Poplibs API Reference.

To override this behaviour and allocate a tensor using a specific layout mapping, the custom operation can provide a function with the following signature:

 extern "C" poplar::Tensor Build_allocator(
  poplar::Graph& graph, std::uint32_t operand,
  const std::vector<size_t>& shape, poplar::Type type,
  const std::string& debug_prefix)

The arguments are:

  • graph: the Poplar graph where the tensor should be created.

  • operand: the operand number of the input to allocate.

  • shape: the shape of the tensor.

  • type: the Poplar data type for the tensor.

Gradient operations

As described above, when the gradient of the forward operation is generated, either a single operation, or multiple operations can be inserted into the graph.

You can use the parameter separate_gradients on the precompiled_user_op function to select which of the two options are required. The compiled code must match this setting.

If the separate_gradients parameter is set to False, then the compiled function for generating the gradient operation should fill in one output for each of the inputs of the forward pass function. Each output should be the partial derivative with respect to one of the inputs.

If the separate_gradients parameter is True, then the gradient operation building function should produce an operation with a single output, which is the partial differential with respect to only one of the forward pass inputs.

The specific input will be given by the input_grad_index input of the call to the sharded object Build_grad function.

Example

This example shows the source file for a rotate operation, which takes three vectors and rotates the x and y ones by the angle one:

 /* Copyright 2020 The TensorFlow Authors. All Rights Reserved.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
==============================================================================*/

#include <vector>

#include <poplar/Graph.hpp>
#include <poplar/Tensor.hpp>
#include <poputil/Util.hpp>
#include <poputil/VertexTemplates.hpp>
#include <poputil/exceptions.hpp>

extern "C" void Build_metadata(std::vector<std::int64_t>& allocating_indices,
                               std::uint32_t& num_inplace, bool& is_elementwise,
                               std::uint32_t num_inputs) {
  allocating_indices.clear();
  num_inplace = 0;
  is_elementwise = true;
}

extern "C" poplar::program::Program Build(
    poplar::Graph& graph, const std::vector<poplar::Tensor>& inputs,
    std::vector<poplar::Tensor>& outputs, const std::string& debugPrefix) {
  if (inputs.size() != 3) {
    throw poputil::poplibs_error("Rotate requires 3 inputs");
  }

  if (inputs[0].numElements() == 0) {
    return poplar::program::Sequence();
  }

  if (inputs[0].rank() != 1 || inputs[1].rank() != 1 || inputs[2].rank() != 1) {
    throw poputil::poplibs_error("All inputs must be rank 1");
  }

  if (inputs[0].dim(0) != inputs[1].dim(0) ||
      inputs[0].dim(0) != inputs[2].dim(0)) {
    throw poputil::poplibs_error(
        "Length of rotate vector and data vectors must match");
  }

  if (inputs[0].elementType() != inputs[1].elementType() ||
      inputs[0].elementType() != inputs[2].elementType()) {
    throw poputil::poplibs_error(
        "Data types of angle vector and data vectors must match");
  }

  auto dType = inputs[0].elementType();

  /*
   * Create a ComputeSet which will be executed, and contains the vertices
   */
  auto cs = graph.addComputeSet(debugPrefix + "/rotate");

  /*
   * Get the tile mapping for the complete tensor.  We will map the vertices so
   * that they match the layout of the 'x' input tensor (input[0]).  If the 'x'
   * tensor was layed out differently to the other ones, then Poplar will
   * insert code to move the data in the other tensors to the mapped tile. So
   * ideally we would choose the best mapping for the vertices by analysing
   * all of the tensor mappings.
   */
  auto tileMapping = graph.getTileMapping(inputs[0]);

  /*
   * Get the target, which descibes properties of the hardware.
   */
  auto target = graph.getTarget();

  /*
   * Get the vector width of the particular data type, so that later we can
   * divide the tensor up between workers in an appropriate way.
   */
  const auto vectorWidth = target.getVectorWidth(dType);

  /*
   * Create the output tensors
   */
  outputs.push_back(graph.clone(inputs[0]));
  outputs.push_back(graph.clone(inputs[1]));

  auto xFlat = inputs[0].flatten();
  auto yFlat = inputs[1].flatten();
  auto aFlat = inputs[2].flatten();
  auto xOutputFlat = outputs[0].flatten();
  auto yOutputFlat = outputs[1].flatten();

  for (unsigned tile = 0; tile != tileMapping.size(); ++tile) {
    /*
     * If a tile contains no elements of the tensor then do not create any
     * vertices for it.
     */
    if (tileMapping[tile].empty()) {
      continue;
    }

    /*
     * Split up the regions of the inputs tensors so that they are evenly
     * distributed between the workers on the tile.
     */
    auto vertexRegions = poputil::splitRegionsBetweenWorkers(
        target, tileMapping[tile], vectorWidth, 2 * vectorWidth);

    for (const auto& regions : vertexRegions) {
      /*
       * If a region has no elements, then there is no need to add a vertex for
       * it.
       */
      if (regions.empty()) {
        continue;
      }

      /*
       * Add codelets to tiles which work over the regions in the input
       * tensors.
       */
      auto v = graph.addVertex(cs, poputil::templateVertex("Rotate", dType),
                               );

      /* Map the vertex onto the appropriate tile. */
      graph.setTileMapping(v, tile);

      /* Provide a bogus cycle count estimate for the profiler. */
      graph.setCycleEstimate(v, 1);
    }
  }

  return poplar::program::Execute(cs);
}

This is the associated codelet file:

 #include <cmath>

#include <poplar/HalfFloat.hpp>
#include <poplar/Vertex.hpp>

using namespace poplar;

/*
 * A codelet to rotate a tensors 'x' and 'y', by the angle (radians) in the
 * tensor 'angle', around the origin.
 */
template <typename FPType>
class Rotate : public Vertex {
 public:
  Vector<Output<Vector<FPType>>> x_out;
  Vector<Output<Vector<FPType>>> y_out;
  Vector<Input<Vector<FPType>>> x_in;
  Vector<Input<Vector<FPType>>> y_in;
  Vector<Input<Vector<FPType>>> angle;

  bool compute() {
    for (unsigned i = 0; i < angle.size(); ++i) {
      for (unsigned j = 0; j != angle[i].size(); ++j) {
        float a = angle[i][j];
        float x = x_in[i][j];
        float y = y_in[i][j];
        x_out[i][j] = x * cos(a) - y * sin(a);
        y_out[i][j] = x * sin(a) + y * cos(a);
      }
    }
    return true;
  }
};

template class Rotate<float>;
template class Rotate<half>;

This is an example of it in use:

 import os
import numpy as np

from tensorflow.python import ipu
from tensorflow.python.ipu.scopes import ipu_scope
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

# Configure argument for targeting the IPU
cfg = ipu.utils.create_ipu_config(profiling=True, use_poplar_text_report=True)
cfg = ipu.utils.set_ipu_model_options(cfg, compile_ipu_code=False)
cfg = ipu.utils.auto_select_ipus(cfg, 1)
ipu.utils.configure_ipu_system(cfg)

with tf.device("cpu"):
  x_data = tf.placeholder(np.float32, [4])
  y_data = tf.placeholder(np.float32, [4])
  p_angle = tf.placeholder(np.float32, [4])


def rotate_op(x, y, a):
  outputs = {
      "output_types": [tf.float32, tf.float32],
      "output_shapes": [tf.TensorShape([4]),
                        tf.TensorShape([4])],
  }

  base_path = os.path.join(os.getcwd(), "tensorflow/compiler/plugin/poplar")
  lib_path = os.path.join(base_path, "libcustom_rotate_op.so")
  gp_path = os.path.join(base_path, "custom_codelet.gp")

  o = ipu.custom_ops.precompiled_user_op([x, y, a],
                                         lib_path,
                                         gp_path,
                                         outs=outputs)
  return o


def my_net(x, y, a):
  return rotate_op(x, y, a)


with ipu_scope("/device:IPU:0"):
  xla_result = ipu.ipu_compiler.compile(my_net, [x_data, y_data, p_angle])

with tf.Session() as sess:
  # Base run
  result = sess.run(xla_result,
                    feed_dict={
                        x_data: [2., 4., 6., -1.],
                        y_data: [2., 3., 8., -1.],
                        p_angle: [np.pi, np.pi / 2., 3. * np.pi / 2., 0]
                    })

  print(result)

When compiling the host-size shared object file, it is not necessary to include or link against any TensorFlow header or library files. Only the Poplar headers and link libraries should be necessary.

Fully customised CPU operations

The framework also allows a custom operation that executes code on the CPU instead of on the IPU. A shared object, much like the builder function of the device-side custom operation, must be written. The signature of this function should be:

 extern "C" void Callback(const std::vector<void*>& data,
                         const std::vector<std::uint32_t>& number_of_elements,
                         std::vector<void*>& outputs,
                         const std::string& name);

The arguments are:

  • data: the input data. the function should be written to expect a certain data type so the void pointer can be cast into the expected type.

  • number_of_elements: indicates the number of elements in the input data.

  • outputs: should be filled in by the operation.

  • name: is the name of the operation within the XLA/HLO graph.

Custom elementwise expressions

The Python class ipu.custom_ops.codelet_expression_op provides an interface for giving a custom fused expression to the compiler. This will be encoded into a single compute set.

The arguments to the Python function are a callable Python function which encodes the arithmetic expression, and the tensor arguments to the operation.

For instance:

 def my_custom_op(x, y, z):
    return x * x + y * z

ipu.custom_ops.codelet_expression_op(my_custom_op, a, b, c)

In this example, the Python function my_custom_op provides the expression, and the arguments a, b and c are the three inputs from other parts of the TensorFlow graph.

Python operators which are supported in the function are +, -, *, and abs.

References

The following documents may be useful.

Python API

Automatic graph sharding

tensorflow.python.ipu.autoshard.automatic_sharding(num_shards, input_ts, loss_ts, edge_filter=None, frozen_inference=False)

Automatically set shards for all connected nodes in graph.

Parameters
  • num_shards – number of shards to split graph over.

  • input_ts – tensor closest to the datafeed in graph.

  • loss_ts – tensor closest to the loss in graph.

  • edge_filter – a callable predicate, with the signature fn(edge), where edge is a tuple containing the name of the source op and the name of the destination op. If the predicate returns True then the graph will not be split at that edge. Only used if frozen_inference is False.

  • frozen_inference – Flag set to True if running inference on a frozen graph.

tensorflow.python.ipu.autoshard.ipu_autoshard()

Provides a context for autosharding. All operations created within this context will be automatically sharded.

Compiler interface

tensorflow.python.ipu.ipu_compiler.compile(computation, inputs=None)
Builds an operator that compiles and runs computation with the Graphcore

IPU XLA backend.

Parameters
  • computation

    A Python function that builds a computation to apply to the input. If the function takes n inputs, inputs should be a list of n tensors.

    computation may return a list of operations and tensors. Tensors must come before operations in the returned list. The return value of compile is a list of tensors corresponding to the tensors from the output of computation.

    All Operation`s returned from `computation will be executed when evaluating any of the returned output tensors.

     

  • inputs – A list of inputs or None (equivalent to an empty list). Each input can be a nested structure containing values that are convertible to tensors. Note that passing an N-dimension list of compatible values will result in a N-dimension list of scalar tensors rather than a single Rank-N tensors. If you need different behaviour, convert part of inputs to tensors with tf.convert_to_tensor.

Returns

 

Same data structure as if computation(inputs) is called directly with some exceptions for correctness.

  1. None output. a NoOp would be returned which control-depends on computation.

  2. Single value output. A tuple containing the value would be returned.

  3. Operation-only outputs. a NoOp would be returned which control-depends on computation.

 

Raises

Exception – If the computation was not compiled for an IPU device.

Scoping contexts for IPUs

tensorflow.python.ipu.scopes.frontend_attribute(attribute_name, attribute_value, restore_to=None)

Sets the specified scope attribute to the specified value in the graph.

Parameters
  • attribute_name – Name of the attribute.

  • attribute_value – Attribute’s value as a string.

  • restore_to – If at the end of the scope the attribute was to be undefined sets it to this value instead.

Returns

A context

tensorflow.python.ipu.scopes.ipu_jit_scope(ipu_scope)

Provides a scope for compilation of operations.

If you would like to compile several sets of operations together, then this can provide that mechanism.

Parameters

ipu_scope – A name to differentiate between different JIT scopes

Returns

A context

tensorflow.python.ipu.scopes.ipu_scope(device)

Provides a scope for placing operations onto a particular IPU/IPU cluster.

Parameters

device – The name of the Tensorflow device, eg ‘/device:IPU:0’

Returns

A context

tensorflow.python.ipu.scopes.ipu_shard(index)

Control sharding for a set of operations.

Provides a scope which targets operations onto a particular shard (IPU) of a multi-IPU sharded device.

Parameters

index – The index of the IPU on which to place the enclosed operations.

Returns

A context

tensorflow.python.ipu.scopes.outside_compilation_scope(name='outside')

Provides a scope for placing operations on the host, outside the current compilation scope. The operations will be placed on the default host device. This allows for offloading computations from the IPU to the host, which can be useful for operations that are not supported or suitable for execution on the IPU.

Example:

 def my_net(a):
  with ipu_scope("/device:IPU:0"):
    b = a * a
    with outside_compilation_scope():
      c = b + 2  # Placed on the host.
    d = b + c
    return d
Parameters

name – A name for the outside compilation scope.

Returns

A context

tensorflow.python.ipu.scopes.partials_type(override_type)

Override the default type used to store intermediate results by some operations.

Parameters

override_type – Numpy type of the partials (float16 or float32)

Returns

A context

tensorflow.python.ipu.scopes.stochastic_rounding(override)

Control stochastic rounding for a set of operations.

Manually sets the stochastic rounding method to use.

Returns

A context

Infeed queue

class tensorflow.python.ipu.ipu_infeed_queue.IPUInfeedQueue(dataset, feed_name, device_ordinal=0, replication_factor=1, data_to_prefetch=1)

Wraps a tf.Dataset object with infeed operations specific to the IPU.

This class, along with tensorflow.python.ipu.loops is used to create a data pipeline from a dataset into a training/inference loop on the IPU inside a single session.run which reduces the overheads of calling session.run for each iteration of the loop.

You should pass the infeed queue as an argument to a loop from tensorflow.python.ipu.loops. These loops will then handle the dequeuing of the data to the device automatically.

The feed_name allows individual feeds to be named. When including more than one feed in the same graph, each should be independently named.

The following skeleton shows how to use this method when building a training loop. Note how the body signature contains variables which correspond to the nested structure of tf.Tensor objects representing the next element in the infeed queue:

 # Create an example dataset.
dataset = ...  # A `tf.data.Dataset` object.

def dataset_parser(value):
  features, labels = parse_record(value)
  return {"features": features,
          "labels": labels}
# The resulting dataset has a nested structure of: {features, labels}.
dataset = dataset.map(dataset_parser)

infeed_queue = ipu.ipu_infeed_queue.IPUInfeedQueue(dataset, feed_name="training_infeed")

# dataset can no longer be used beyond this point.

def my_net():
  # Note how the nested structure forms part of the loop body signature.
  def body(loss, features, labels):
    with variable_scope.variable_scope("vs", use_resource=True):
      y = tf.conv2d(features, .....)
      ...
      ...
      logits = tf.nn.xw_plus_b(....)
    loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits_v2(logits=logits, labels=labels))
    optimizer = gradient_descent.GradientDescentOptimizer(0.000001)
    train = optimizer.minimize(loss)
    with ops.control_dependencies([train]):
      return array_ops.identity(loss)

  loss = 0.0
  return = tf.python.ipu.loops.repeat(10000, body, [loss], infeed_queue)

with ipu.scopes.ipu_scope("/device:IPU:0"):
  res = ipu_compiler.compile(my_net, inputs=[])

with tf.Session() as sess:
  sess.run(infeed_queue.initializer)
  sess.run(variables.global_variables_initializer())
  result = sess.run(res)
property deleter

A tf.Operation that can be run to delete the resources owned by this IPUInfeedQueue. This allows creating a new IPUInfeedQueue with the same name afterwards.

Returns

A tf.Operation that can be run to delete this IPUInfeedQueue

property dequeued

Returns whether this queue has been dequeued.

Returns

A nested structure of tf.Tensor objects.

get_next()

Obsolete function.

property initializer

A tf.Operation that should be run to initialize this IPUInfeedQueue.

Returns

A tf.Operation that should be run to initialize this IPUInfeedQueue

Raises

ValueError – if the function initializer has already been called.

property number_of_tuple_elements

Returns the number of arguments supplied by this IPUInfeedQueue.

Optimizer wrapper for replicated graphs

class tensorflow.python.ipu.optimizers.cross_replica_optimizer.CrossReplicaOptimizer(opt, name='CrossReplicaOptimizer')

An optimizer that averages gradients across IPU replicas.

apply_gradients(grads_and_vars, global_step=None, name=None)

Apply gradients to variables.

Calls popops_cross_replica_sum.cross_replica_sum() to sum gradient contributions across replicas, and then applies the real optimizer.

Parameters
  • grads_and_vars – List of (gradient, variable) pairs as returned by compute_gradients().

  • global_step – Optional Variable to increment by one after the variables have been updated.

  • name – Optional name for the returned operation. Default to the name passed to the Optimizer constructor.

Returns

An Operation that applies the gradients. If global_step was not None, that operation also increments global_step.

Raises

ValueError – If the grads_and_vars is malformed.

compute_gradients(loss, var_list=None, **kwargs)

Compute gradients of “loss” for the variables in “var_list”.

This simply wraps the compute_gradients() from the real optimizer. The gradients will be aggregated in the apply_gradients() so that user can modify the gradients like clipping with per replica global norm if needed. The global norm with aggregated gradients can be bad as one replica’s huge gradients can hurt the gradients from other replicas.

Parameters
  • loss – A Tensor containing the value to minimize.

  • var_list – Optional list or tuple of tf.Variable to update to minimize loss. Defaults to the list of variables collected in the graph under the key GraphKey.TRAINABLE_VARIABLES.

  • **kwargs – Keyword arguments for compute_gradients().

Returns

A list of (gradient, variable) pairs.

get_slot(*args, **kwargs)

Return a slot named “name” created for “var” by the Optimizer.

This simply wraps the get_slot() from the actual optimizer.

Parameters
  • *args – Arguments for get_slot().

  • **kwargs – Keyword arguments for get_slot().

Returns

The Variable for the slot if it was created, None otherwise.

get_slot_names(*args, **kwargs)

Return a list of the names of slots created by the Optimizer.

This simply wraps the get_slot_names() from the actual optimizer.

Parameters
  • *args – Arguments for get_slot().

  • **kwargs – Keyword arguments for get_slot().

Returns

A list of strings.

variables()

Forwarding the variables from the underlying optimizer.

Outfeed queue

class tensorflow.python.ipu.ipu_outfeed_queue.IPUOutfeedMode

Types used to control the IPUOutfeedQueue modes.

Contains the following values:

  • ALL: When used with an IPUOutfeedQueue, all the elements which were

    enqueued to the queue will be returned by the outfeed.

  • LAST: When used with an IPUOutfeedQueue, only the last element which was

    enqueued to the queue will be returned by the outfeed.

class tensorflow.python.ipu.ipu_outfeed_queue.IPUOutfeedQueue(feed_name, outfeed_mode=None, outfeed_all=None, device_ordinal=0, replication_factor=1, io_batch_size=1)

Generates and adds outfeed enqueue/dequeue operations to the graph.

The queue has two modes of operation - outfeed all or outfeed last. In outfeed all mode every element that is enqueued will be stored for a subsequent dequeue. All of the enqueued elements will be returned when the dequeue operation is run.

In outfeed last mode only the last enqueued element is stored. The dequeue operation will in this case return a single element.

property deleter

A tf.Operation that can be run to delete the resources owned by this IPUOutfeedQueue. This allows creating a new IPUOutfeedQueue with the same name afterwards. The behaviour is undefined if this op is executed concurrently with the dequeue op.

Returns

A tf.Operation that can be run to delete this IPUOutfeedQueue

dequeue()

Generate host side operation to dequeue the outfeed values. The operation generated by this function will block if called prior to any enqueues.

The return value of this operation depends on the enqueued tensors, replication factor and the execution mode.

  1. Outfeed returning a single tensor:

 outfeed_queue = ipu_outfeed_queue.IPUOutfeedQueue(feed_name="outfeed", replication_factor=2)

def body(input):
  output = input + 1
  outfeed = outfeed_queue.enqueue(output)
  return (output, outfeed)

def my_net(input):
  r = loops.repeat(20, body, (input))
  return r

with ipu.scopes.ipu_scope("/device:IPU:0"):
  res = ipu_compiler.compile(my_net, inputs=[v])

with ops.device('cpu'):
  v = tf.placeholder(np.float32, [4, 4])

outfeed = outfeed_queue.dequeue()
with tf.Session() as sess:
  result = sess.run(res, {v:np.ones([4, 4], np.float32)})
  outfed = sess.run(outfeed)

In this example the tensor output is of shape [4, 4] and it’s enqueued into the outfeed with replication_factor = 2. If the outfeed_mode is outfeed_mode == IPUOutfeedMode.ALL, then the shape of the resulting outfed tensor will be [20, 2, 4, 4], where the first dimension represents the number of times we have enqueued a tensor to the outfeed - in this example the loop is repeated 20 times, and therefore we get 20 values back from the outfeed. The second dimension is the replication_factor, which allows us to see the individual values from each replicated graph. If the outfeed_mode is outfeed_mode == IPUOutfeedMode.LAST, then the shape of the resulting outfed tensor will be [2, 4, 4], which represents the value of the output tensor the last time it was enqueued during execution for each of the replicated graphs.

  1. Outfeed returning a tuple of tensors:

 outfeed_queue = ipu_outfeed_queue.IPUOutfeedQueue(feed_name="outfeed")

def body(input):
  output = input + 1
  sum = tf.reduce_sum(output)
  outfeed = outfeed_queue.enqueue((output, sum))
  return (output, outfeed)

def my_net(input):
  r = loops.repeat(20, body, (input))
  return r

with ipu.scopes.ipu_scope("/device:IPU:0"):
  res = ipu_compiler.compile(my_net, inputs=[v])

with ops.device('cpu'):
  v = tf.placeholder(np.float32, [4, 4])

outfeed = outfeed_queue.dequeue()
with tf.Session() as sess:
  result = sess.run(res, {v:np.ones([4, 4], np.float32)})
  outfed = sess.run(outfeed)

In this example we outfeed a tuple of tensors, output and sum, where the former is of shape [4, 4] and latter [1]. If the outfeed_mode is outfeed_mode == IPUOutfeedMode.ALL, then the resulting outfed is a two-tuple of tensors with shapes ([20, 4, 4], [20, 1]), where the first dimension in each of the tensors represents the number of times we have enqueued these tensors to the outfeed - in this example the loop is repeated 20 times, and therefore we get 20 values back from the outfeed for each of the tensors in the tuple. If the outfeed_mode is outfeed_mode == IPUOutfeedMode.LAST, then the outfed is a two tuple of tensors with shapes ([4, 4], [1]), which represents the values of the output and sum tensors the last time they were enqueued during execution.

Note that replication_factor here is the default (=1), which means that the extra replication dimension is not added.

  1. Outfeed returning a dictionary of tensors:

 outfeed_queue = ipu_outfeed_queue.IPUOutfeedQueue(feed_name="outfeed", replication_factor=8)

def body(input):
  output = input + 1
  sum = tf.reduce_sum(output)
  outfeed = outfeed_queue.enqueue({"x": output,
                                   "y": sum})
  return (output, outfeed)

def my_net(input):
  r = loops.repeat(40, body, (input))
  return r

with ipu.scopes.ipu_scope("/device:IPU:0"):
  res = ipu_compiler.compile(my_net, inputs=[v])

with ops.device('cpu'):
  v = tf.placeholder(np.float32, [4, 4])

outfeed = outfeed_queue.dequeue()
with tf.Session() as sess:
  result = sess.run(res, {v:np.ones([4, 4], np.float32)})
  outfed = sess.run(outfeed)

In this example we outfeed a dictionary of tensors, output and sum, where the former is of shape [4, 4] and latter [1]. If the outfeed_mode is outfeed_mode == IPUOutfeedMode.ALL, then the resulting outfed is a dictionary of tensors with shapes: {“x”: [40, 8, 4, 4], “y”: [40, 8, 1]}, where the first dimension in each of the tensors represents the number of times we have enqueued these tensors to the outfeed - in this example the loop is repeated 40 times, and therefore we get 40 values back from the outfeed for each of the tensors in the tuple. The second dimension is the replication_factor, which allows us to see the individual values from each replicated graph. If the outfeed_mode is outfeed_mode == IPUOutfeedMode.LAST, then the outfed is a dictionary of tensors with shapes: {“x”: [8, 4, 4], “y”: [8, 1]}, which represents the values of the output and sum tensors the last time they were enqueued during execution for each of the replicated graphs.

enqueue(tensors)

Enqueue a tensor, tuple or a dictionary of tensors for being outfed from the IPU graph. This operation is placed on the IPU device. This function returns an Operation which needs be executed (by either returning it or using tf.control_dependencies(…))

Examples: 1. Outfeed returning a single tensor:

   outfeed_queue = ipu_outfeed_queue.IPUOutfeedQueue(feed_name="outfeed")

  def body(v):
    v = v + 1
    outfeed = outfeed_queue.enqueue(v)
    return (v, outfeed)

  def my_net(v):
    r = loops.repeat(20, body, (v))
    return r

  with ipu.scopes.ipu_scope("/device:IPU:0"):
    res = ipu_compiler.compile(my_net, inputs=[v])

  ...
  ...

2. Outfeed returning a tuple of tensors:

.. code-block:: python

  outfeed_queue = ipu_outfeed_queue.IPUOutfeedQueue(feed_name="outfeed")

  def body(v):
    v = v + 1
    x = v * 2
    outfeed = outfeed_queue.enqueue((v, x))
    return (v, outfeed)

  def my_net(v):
    r = loops.repeat(20, body, (v))
    return r

  with ipu.scopes.ipu_scope("/device:IPU:0"):
    res = ipu_compiler.compile(my_net, inputs=[v])

  ...
  ...

3. Outfeed returning a dictionary of tensors:

.. code-block:: python

  outfeed_queue = ipu_outfeed_queue.IPUOutfeedQueue(feed_name="outfeed")

  def body(v):
    v = v + 1
    x = v * 2
    outfeed = outfeed_queue.enqueue({"output_1": v,
                                     "output_2": x})
    return (v, outfeed)

  def my_net(v):
    r = loops.repeat(20, body, (v))
    return r

  with ipu.scopes.ipu_scope("/device:IPU:0"):
    res = ipu_compiler.compile(my_net, inputs=[v])

  ...
  ...

IPUEstimator

class tensorflow.python.ipu.ipu_estimator.IPUEstimator(model_fn, model_dir=None, config=None, params=None, warm_start_from=None)

Estimator with IPU support.

IPUEstimator handles many of the details of running on IPUs, such as placement of operations and tensors, graph compilation and usage of data feeds. It also provides a simple way to use multiple IPUs in the form of either data parallelism or model parallelism.

The data parallelism is based on graph replication. One batch from the dataset returned by the input_fn (of size batch_size) is sent to each replica, giving an effective batch size of num_replicas * batch_size. The only change needed to the model_fn is that the optimizer should be wrapped in an CrossReplicaOptimizer in order to average the gradients across the replicas.

This can also be combined with distributed multi-worker training using the IPUMultiWorkerStrategy, giving a total effective batch size of num_workers * num_replicas * batch_size.

The model parallelism supported by this class is basic sharding. Consider using the IPUPipelineEstimator to get pipelined execution.

For efficiency, it supports compiling a graph that contains multiple iterations of the training/prediction/evaluation loop, which will be fully executed on the IPU before yielding back to the TensorFlow Python runtime on the CPU.

See https://tensorflow.org/guide/estimators for general information about estimators.

Parameters
  • model_fn – The model function. Refer to https://www.tensorflow.org/guide/custom_estimators#write_a_model_function for details on how to write this function.

  • model_dir – Directory to save model parameters, graph and etc. This can also be used to load checkpoints from the directory into an estimator to continue training a previously saved model. If PathLike object, the path will be resolved. If None, the model_dir in config will be used if set. If both are set, they must be same. If both are None, a temporary directory will be used.

  • configtf.ipu.ipu_run_config.RunConfig configuration object.

  • paramsdict of hyper parameters that will be passed into model_fn. Keys are names of parameters, values are basic python types.

  • warm_start_from – Optional string filepath to a checkpoint or SavedModel to warm-start from, or a tf.estimator.WarmStartSettings object to fully configure warm-starting. If the string filepath is provided instead of a tf.estimator.WarmStartSettings, then all variables are warm-started, and it is assumed that vocabularies and tf.Tensor names are unchanged.

eval_dir(name=None)

Shows the directory name where evaluation metrics are dumped.

Parameters

name – Name of the evaluation if user needs to run multiple evaluations on different data sets, such as on training data vs test data. Metrics for different evaluations are saved in separate folders, and appear separately in tensorboard.

Returns

A string which is the path of directory contains evaluation metrics.

evaluate(input_fn, steps=None, hooks=None, checkpoint_path=None, name=None)

Evaluates the model given evaluation data input_fn.

Parameters
  • input_fn

    A function that constructs the input data for evaluation. The function should return a tf.data.Dataset object. The outputs of the Dataset object must be a tuple (features, labels) where

    • features is a tf.Tensor or a dictionary of string feature name to Tensor

    • labels is a Tensor or a dictionary of string label name to Tensor

    Both features and labels are consumed by model_fn.

     

  • steps – Number of steps for which to evaluate model.

  • hooks – List of tf.train.SessionRunHook subclass instances. Used for callbacks inside the evaluation call.

  • checkpoint_path – Path of a specific checkpoint to evaluate. If None, the latest checkpoint in model_dir is used. If there are no checkpoints in model_dir, evaluation is run with newly initialized Variables instead of ones restored from checkpoint.

  • name – Name of the evaluation if user needs to run multiple evaluations on different data sets, such as on training data vs test data. Metrics for different evaluations are saved in separate folders, and appear separately in tensorboard.

Returns

A dict containing the evaluation metrics specified in model_fn keyed by name, as well as an entry global_step which contains the value of the global step for which this evaluation was performed.

experimental_export_all_saved_models(export_dir_base, input_receiver_fn_map, assets_extra=None, as_text=False, checkpoint_path=None)

Exports a SavedModel with tf.MetaGraphDefs for each requested mode.

For each mode passed in via the input_receiver_fn_map, this method builds a new graph by calling the input_receiver_fn to obtain feature and label Tensor`s. Next, this method calls the `Estimator’s model_fn in the passed mode to generate the model graph based on those features and labels, and restores the given checkpoint (or, lacking that, the most recent checkpoint) into the graph. Only one of the modes is used for saving variables to the SavedModel (order of preference: tf.estimator.ModeKeys.TRAIN, tf.estimator.ModeKeys.EVAL, then tf.estimator.ModeKeys.PREDICT), such that up to three tf.MetaGraphDefs are saved with a single set of variables in a single SavedModel directory.

For the variables and tf.MetaGraphDefs, a timestamped export directory below export_dir_base, and writes a SavedModel into it containing the tf.MetaGraphDef for the given mode and its associated signatures.

For prediction, the exported MetaGraphDef will provide one SignatureDef for each element of the export_outputs dict returned from the model_fn, named using the same keys. One of these keys is always tf.saved_model.signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY, indicating which signature will be served when a serving request does not specify one. For each signature, the outputs are provided by the corresponding tf.estimator.export.ExportOutput`s, and the inputs are always the input receivers provided by the `serving_input_receiver_fn.

For training and evaluation, the train_op is stored in an extra collection, and loss, metrics, and predictions are included in a SignatureDef for the mode in question.

Extra assets may be written into the SavedModel via the assets_extra argument. This should be a dict, where each key gives a destination path (including the filename) relative to the assets.extra directory. The corresponding value gives the full path of the source file to be copied. For example, the simple case of copying a single file without renaming it is specified as {‘my_asset_file.txt’: ‘/path/to/my_asset_file.txt’}.

Parameters
  • export_dir_base – A string containing a directory in which to create timestamped subdirectories containing exported `SavedModel`s.

  • input_receiver_fn_map – dict of tf.estimator.ModeKeys to input_receiver_fn mappings, where the input_receiver_fn is a function that takes no arguments and returns the appropriate subclass of InputReceiver.

  • assets_extra – A dict specifying how to populate the assets.extra directory within the exported SavedModel, or None if no extra assets are needed.

  • as_text – whether to write the SavedModel proto in text format.

  • checkpoint_path – The checkpoint path to export. If None (the default), the most recent checkpoint found within the model directory is chosen.

Returns

The string path to the exported directory.

Raises

ValueError – if any input_receiver_fn is None, no export_outputs are provided, or no checkpoint can be found.

export_saved_model(export_dir_base, serving_input_receiver_fn, assets_extra=None, as_text=False, checkpoint_path=None, experimental_mode='infer')

Exports inference graph as a SavedModel into the given dir.

For a detailed guide, see [Using SavedModel with Estimators](https://tensorflow.org/guide/saved_model#using_savedmodel_with_estimators).

This method builds a new graph by first calling the serving_input_receiver_fn to obtain feature Tensor`s, and then calling this `Estimator’s model_fn to generate the model graph based on those features. It restores the given checkpoint (or, lacking that, the most recent checkpoint) into this graph in a fresh session. Finally it creates a timestamped export directory below the given export_dir_base, and writes a SavedModel into it containing a single tf.MetaGraphDef saved from this session.

The exported MetaGraphDef will provide one SignatureDef for each element of the export_outputs dict returned from the model_fn, named using the same keys. One of these keys is always tf.saved_model.signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY, indicating which signature will be served when a serving request does not specify one. For each signature, the outputs are provided by the corresponding tf.estimator.export.ExportOutput`s, and the inputs are always the input receivers provided by the `serving_input_receiver_fn.

Extra assets may be written into the SavedModel via the assets_extra argument. This should be a dict, where each key gives a destination path (including the filename) relative to the assets.extra directory. The corresponding value gives the full path of the source file to be copied. For example, the simple case of copying a single file without renaming it is specified as {‘my_asset_file.txt’: ‘/path/to/my_asset_file.txt’}.

The experimental_mode parameter can be used to export a single train/eval/predict graph as a SavedModel. See experimental_export_all_saved_models for full docs.

Parameters
  • export_dir_base – A string containing a directory in which to create timestamped subdirectories containing exported `SavedModel`s.

  • serving_input_receiver_fn – A function that takes no argument and returns a tf.estimator.export.ServingInputReceiver or tf.estimator.export.TensorServingInputReceiver.

  • assets_extra – A dict specifying how to populate the assets.extra directory within the exported SavedModel, or None if no extra assets are needed.

  • as_text – whether to write the SavedModel proto in text format.

  • checkpoint_path – The checkpoint path to export. If None (the default), the most recent checkpoint found within the model directory is chosen.

  • experimental_modetf.estimator.ModeKeys value indicating with mode will be exported. Note that this feature is experimental.

Returns

The string path to the exported directory.

Raises
  • ValueError – if no serving_input_receiver_fn is provided, no

  • export_outputs

export_savedmodel(export_dir_base, serving_input_receiver_fn, assets_extra=None, as_text=False, checkpoint_path=None, strip_default_attrs=False)

Exports inference graph as a SavedModel into the given dir. (deprecated)

Warning: THIS FUNCTION IS DEPRECATED. It will be removed in a future version. Instructions for updating: This function has been renamed, use export_saved_model instead.

For a detailed guide, see [Using SavedModel with Estimators](https://tensorflow.org/guide/saved_model#using_savedmodel_with_estimators).

This method builds a new graph by first calling the serving_input_receiver_fn to obtain feature Tensor`s, and then calling this `Estimator’s model_fn to generate the model graph based on those features. It restores the given checkpoint (or, lacking that, the most recent checkpoint) into this graph in a fresh session. Finally it creates a timestamped export directory below the given export_dir_base, and writes a SavedModel into it containing a single tf.MetaGraphDef saved from this session.

The exported MetaGraphDef will provide one SignatureDef for each element of the export_outputs dict returned from the model_fn, named using the same keys. One of these keys is always tf.saved_model.signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY, indicating which signature will be served when a serving request does not specify one. For each signature, the outputs are provided by the corresponding tf.estimator.export.ExportOutput`s, and the inputs are always the input receivers provided by the `serving_input_receiver_fn.

Extra assets may be written into the SavedModel via the assets_extra argument. This should be a dict, where each key gives a destination path (including the filename) relative to the assets.extra directory. The corresponding value gives the full path of the source file to be copied. For example, the simple case of copying a single file without renaming it is specified as {‘my_asset_file.txt’: ‘/path/to/my_asset_file.txt’}.

Parameters
  • export_dir_base – A string containing a directory in which to create timestamped subdirectories containing exported `SavedModel`s.

  • serving_input_receiver_fn – A function that takes no argument and returns a tf.estimator.export.ServingInputReceiver or tf.estimator.export.TensorServingInputReceiver.

  • assets_extra – A dict specifying how to populate the assets.extra directory within the exported SavedModel, or None if no extra assets are needed.

  • as_text – whether to write the SavedModel proto in text format.

  • checkpoint_path – The checkpoint path to export. If None (the default), the most recent checkpoint found within the model directory is chosen.

  • strip_default_attrs – Boolean. If True, default-valued attributes will be removed from the `NodeDef`s. For a detailed guide, see [Stripping Default-Valued Attributes]( https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/saved_model/README.md#stripping-default-valued-attributes).

Returns

The string path to the exported directory.

Raises
  • ValueError – if no serving_input_receiver_fn is provided, no

  • export_outputs

get_variable_names()

Returns list of all variable names in this model.

Returns

List of names.

Raises

ValueError – If the Estimator has not produced a checkpoint yet.

get_variable_value(name)

Returns value of the variable given by name.

Parameters

name – string or a list of string, name of the tensor.

Returns

Numpy array - value of the tensor.

Raises

ValueError – If the Estimator has not produced a checkpoint yet.

latest_checkpoint()

Finds the filename of the latest saved checkpoint file in model_dir.

Returns

The full path to the latest checkpoint or None if no checkpoint was found.

property model_fn

Returns the model_fn which is bound to self.params.

Returns

def model_fn(features, labels, mode, config)

Return type

The model_fn with following signature

predict(input_fn, predict_keys=None, hooks=None, checkpoint_path=None, yield_single_examples=True, num_predictions=None)

Yields predictions for given features.

Parameters
  • input_fn

    A function that constructs the features. The function should return a tf.data.Dataset object. The outputs of the Dataset object should be one of the following:

    • features: A Tensor or a dictionary of string feature name to Tensor. features are consumed by model_fn.

    • A tuple, in which case the first item is extracted as features.

     

  • predict_keys – list of str, name of the keys to predict. It is used if the tf.estimator.EstimatorSpec.predictions is a dict. If predict_keys is used then rest of the predictions will be filtered from the dictionary. If None, returns all.

  • hooks – List of tf.train.SessionRunHook subclass instances. Used for callbacks inside the prediction call.

  • checkpoint_path – Path of a specific checkpoint to predict. If None, the latest checkpoint in model_dir is used. If there are no checkpoints in model_dir, prediction is run with newly initialized Variables instead of ones restored from checkpoint.

  • yield_single_examples – If False, yields the whole batch as returned by the model_fn instead of decomposing the batch into individual elements. This is useful if model_fn returns some tensors whose first dimension is not equal to the batch size.

  • num_predictions – If not None, the generator will raise StopIteration after yielding this number of predictions. This allows draining the generator by using list(predictions). If None, the returned generator is infinite and will trigger a fatal error if you try to consume more predictions from it than what is actually generated, instead of raising the StopIteration exception. This is caused by the current behaviour when requesting to run a loop on the IPU for more iterations than there are elements remaining in the dataset. In this case you cannot drain it by using list(predictions), you have to consume the expected number of elements yourself, e.g. using [next(predictions) for _ in range(num_predictions)].

Yields

Evaluated values of predictions tensors.

train(input_fn, hooks=None, steps=None, max_steps=None, saving_listeners=None)

Trains a model given training data input_fn.

Parameters
  • input_fn

    A function that provides input data for training as minibatches. The function should return a tf.data.Dataset object. The outputs of the Dataset object must be a tuple (features, labels) where

    • features is a tf.Tensor or a dictionary of string feature name to Tensor

    • labels is a Tensor or a dictionary of string label name to Tensor

    Both features and labels are consumed by model_fn.

     

  • hooks – List of tf.train.SessionRunHook subclass instances. Used for callbacks inside the training loop.

  • steps – Number of steps for which to train the model. steps works incrementally. If you call two times train(steps=10) then training occurs in total 20 steps. If you don’t want to have incremental behavior please set max_steps instead. If set, max_steps must be None.

  • max_steps – Number of total steps for which to train model. If set, steps must be None. Two calls to train(steps=100) means 200 training iterations. On the other hand, two calls to train(max_steps=100) means that the second call will not do any iteration since first call did all 100 steps.

  • saving_listeners – list of CheckpointSaverListener objects. Used for callbacks that run immediately before or after checkpoint savings.

Returns

self, for chaining.

class tensorflow.python.ipu.ipu_estimator.IPUEstimatorSpec

Ops and objects returned from a model_fn and passed to IPUEstimator.

static __new__(cls, mode, predictions=None, loss=None, train_op=None, eval_metric_ops=None, host_call=None, training_hooks=None, evaluation_hooks=None, prediction_hooks=None)

Create new instance of IPUEstimatorSpec(mode, predictions, loss, train_op, eval_metric_ops, host_call, training_hooks, evaluation_hooks, prediction_hooks)

IPUPipelineEstimator

class tensorflow.python.ipu.ipu_pipeline_estimator.IPUPipelineEstimator(model_fn, model_dir=None, config=None, params=None, warm_start_from=None)

Estimator for pipelining on IPUs.

IPUPipelineEstimator, like IPUEstimator, handles many of the details of running on IPUs, such as placement of operations and tensors, graph compilation and usage of data feeds. Additionaly, it adds support for pipelined execution over multiple IPUs.

The major API difference from the IPUEstimator is that the provided model_fn must return a IPUPipelineEstimatorSpec that contains the information needed for pipelined execution.

Data parallelism based on graph replication is supported. Each replica will consume pipeline_depth batches from the dataset returned by the input_fn and accumulate the gradients, giving an effective batch size of num_replicas * pipeline_depth * batch_size. The optimizer in the model_fn should be wrapped in an CrossReplicaOptimizer in order to average the gradients across the replicas.

This can further be combined with distributed multi-worker training using the IPUMultiWorkerStrategy, giving a total effective batch size of num_workers * num_replicas * pipeline_depth * batch_size.

Refer to the pipelining_ops documentation for more details about pipelining.

eval_dir(name=None)

Shows the directory name where evaluation metrics are dumped.

Parameters

name – Name of the evaluation if user needs to run multiple evaluations on different data sets, such as on training data vs test data. Metrics for different evaluations are saved in separate folders, and appear separately in tensorboard.

Returns

A string which is the path of directory contains evaluation metrics.

evaluate(input_fn, steps=None, hooks=None, checkpoint_path=None, name=None)

Evaluates the model given evaluation data input_fn.

Parameters
  • input_fn

    A function that constructs the input data for evaluation. The function should return a tf.data.Dataset object. The outputs of the Dataset object must be a tuple (features, labels) where

    • features is a tf.Tensor or a dictionary of string feature name to Tensor

    • labels is a Tensor or a dictionary of string label name to Tensor

    Both features and labels are consumed by model_fn.

     

  • steps – Number of steps for which to evaluate model.

  • hooks – List of tf.train.SessionRunHook subclass instances. Used for callbacks inside the evaluation call.

  • checkpoint_path – Path of a specific checkpoint to evaluate. If None, the latest checkpoint in model_dir is used. If there are no checkpoints in model_dir, evaluation is run with newly initialized Variables instead of ones restored from checkpoint.

  • name – Name of the evaluation if user needs to run multiple evaluations on different data sets, such as on training data vs test data. Metrics for different evaluations are saved in separate folders, and appear separately in tensorboard.

Returns

A dict containing the evaluation metrics specified in model_fn keyed by name, as well as an entry global_step which contains the value of the global step for which this evaluation was performed.

experimental_export_all_saved_models(export_dir_base, input_receiver_fn_map, assets_extra=None, as_text=False, checkpoint_path=None)

Exports a SavedModel with tf.MetaGraphDefs for each requested mode.

For each mode passed in via the input_receiver_fn_map, this method builds a new graph by calling the input_receiver_fn to obtain feature and label Tensor`s. Next, this method calls the `Estimator’s model_fn in the passed mode to generate the model graph based on those features and labels, and restores the given checkpoint (or, lacking that, the most recent checkpoint) into the graph. Only one of the modes is used for saving variables to the SavedModel (order of preference: tf.estimator.ModeKeys.TRAIN, tf.estimator.ModeKeys.EVAL, then tf.estimator.ModeKeys.PREDICT), such that up to three tf.MetaGraphDefs are saved with a single set of variables in a single SavedModel directory.

For the variables and tf.MetaGraphDefs, a timestamped export directory below export_dir_base, and writes a SavedModel into it containing the tf.MetaGraphDef for the given mode and its associated signatures.

For prediction, the exported MetaGraphDef will provide one SignatureDef for each element of the export_outputs dict returned from the model_fn, named using the same keys. One of these keys is always tf.saved_model.signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY, indicating which signature will be served when a serving request does not specify one. For each signature, the outputs are provided by the corresponding tf.estimator.export.ExportOutput`s, and the inputs are always the input receivers provided by the `serving_input_receiver_fn.

For training and evaluation, the train_op is stored in an extra collection, and loss, metrics, and predictions are included in a SignatureDef for the mode in question.

Extra assets may be written into the SavedModel via the assets_extra argument. This should be a dict, where each key gives a destination path (including the filename) relative to the assets.extra directory. The corresponding value gives the full path of the source file to be copied. For example, the simple case of copying a single file without renaming it is specified as {‘my_asset_file.txt’: ‘/path/to/my_asset_file.txt’}.

Parameters
  • export_dir_base – A string containing a directory in which to create timestamped subdirectories containing exported `SavedModel`s.

  • input_receiver_fn_map – dict of tf.estimator.ModeKeys to input_receiver_fn mappings, where the input_receiver_fn is a function that takes no arguments and returns the appropriate subclass of InputReceiver.

  • assets_extra – A dict specifying how to populate the assets.extra directory within the exported SavedModel, or None if no extra assets are needed.

  • as_text – whether to write the SavedModel proto in text format.

  • checkpoint_path – The checkpoint path to export. If None (the default), the most recent checkpoint found within the model directory is chosen.

Returns

The string path to the exported directory.

Raises

ValueError – if any input_receiver_fn is None, no export_outputs are provided, or no checkpoint can be found.

export_saved_model(export_dir_base, serving_input_receiver_fn, assets_extra=None, as_text=False, checkpoint_path=None, experimental_mode='infer')

Exports inference graph as a SavedModel into the given dir.

For a detailed guide, see [Using SavedModel with Estimators](https://tensorflow.org/guide/saved_model#using_savedmodel_with_estimators).

This method builds a new graph by first calling the serving_input_receiver_fn to obtain feature Tensor`s, and then calling this `Estimator’s model_fn to generate the model graph based on those features. It restores the given checkpoint (or, lacking that, the most recent checkpoint) into this graph in a fresh session. Finally it creates a timestamped export directory below the given export_dir_base, and writes a SavedModel into it containing a single tf.MetaGraphDef saved from this session.

The exported MetaGraphDef will provide one SignatureDef for each element of the export_outputs dict returned from the model_fn, named using the same keys. One of these keys is always tf.saved_model.signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY, indicating which signature will be served when a serving request does not specify one. For each signature, the outputs are provided by the corresponding tf.estimator.export.ExportOutput`s, and the inputs are always the input receivers provided by the `serving_input_receiver_fn.

Extra assets may be written into the SavedModel via the assets_extra argument. This should be a dict, where each key gives a destination path (including the filename) relative to the assets.extra directory. The corresponding value gives the full path of the source file to be copied. For example, the simple case of copying a single file without renaming it is specified as {‘my_asset_file.txt’: ‘/path/to/my_asset_file.txt’}.

The experimental_mode parameter can be used to export a single train/eval/predict graph as a SavedModel. See experimental_export_all_saved_models for full docs.

Parameters
  • export_dir_base – A string containing a directory in which to create timestamped subdirectories containing exported `SavedModel`s.

  • serving_input_receiver_fn – A function that takes no argument and returns a tf.estimator.export.ServingInputReceiver or tf.estimator.export.TensorServingInputReceiver.

  • assets_extra – A dict specifying how to populate the assets.extra directory within the exported SavedModel, or None if no extra assets are needed.

  • as_text – whether to write the SavedModel proto in text format.

  • checkpoint_path – The checkpoint path to export. If None (the default), the most recent checkpoint found within the model directory is chosen.

  • experimental_modetf.estimator.ModeKeys value indicating with mode will be exported. Note that this feature is experimental.

Returns

The string path to the exported directory.

Raises
  • ValueError – if no serving_input_receiver_fn is provided, no

  • export_outputs

export_savedmodel(export_dir_base, serving_input_receiver_fn, assets_extra=None, as_text=False, checkpoint_path=None, strip_default_attrs=False)

Exports inference graph as a SavedModel into the given dir. (deprecated)

Warning: THIS FUNCTION IS DEPRECATED. It will be removed in a future version. Instructions for updating: This function has been renamed, use export_saved_model instead.

For a detailed guide, see [Using SavedModel with Estimators](https://tensorflow.org/guide/saved_model#using_savedmodel_with_estimators).

This method builds a new graph by first calling the serving_input_receiver_fn to obtain feature Tensors, and then calling this Estimator's model_fn to generate the model graph based on those features. It restores the given checkpoint (or, lacking that, the most recent checkpoint) into this graph in a fresh session. Finally it creates a timestamped export directory below the given export_dir_base, and writes a SavedModel into it containing a single tf.MetaGraphDef saved from this session.

The exported MetaGraphDef will provide one SignatureDef for each element of the export_outputs dict returned from the model_fn, named using the same keys. One of these keys is always tf.saved_model.signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY, indicating which signature will be served when a serving request does not specify one. For each signature, the outputs are provided by the corresponding tf.estimator.export.ExportOutput objects, and the inputs are always the input receivers provided by the serving_input_receiver_fn.

Extra assets may be written into the SavedModel via the assets_extra argument. This should be a dict, where each key gives a destination path (including the filename) relative to the assets.extra directory. The corresponding value gives the full path of the source file to be copied. For example, the simple case of copying a single file without renaming it is specified as {'my_asset_file.txt': '/path/to/my_asset_file.txt'}.

Parameters
  • export_dir_base – A string containing a directory in which to create timestamped subdirectories containing exported SavedModels.

  • serving_input_receiver_fn – A function that takes no argument and returns a tf.estimator.export.ServingInputReceiver or tf.estimator.export.TensorServingInputReceiver.

  • assets_extra – A dict specifying how to populate the assets.extra directory within the exported SavedModel, or None if no extra assets are needed.

  • as_text – whether to write the SavedModel proto in text format.

  • checkpoint_path – The checkpoint path to export. If None (the default), the most recent checkpoint found within the model directory is chosen.

  • strip_default_attrs – Boolean. If True, default-valued attributes will be removed from the NodeDefs. For a detailed guide, see [Stripping Default-Valued Attributes](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/saved_model/README.md#stripping-default-valued-attributes).

Returns

The string path to the exported directory.

Raises
  • ValueError – if no serving_input_receiver_fn is provided, no export_outputs are provided, or no checkpoint can be found.

get_variable_names()

Returns list of all variable names in this model.

Returns

List of names.

Raises

ValueError – If the Estimator has not produced a checkpoint yet.

get_variable_value(name)

Returns value of the variable given by name.

Parameters

name – string or a list of string, name of the tensor.

Returns

Numpy array - value of the tensor.

Raises

ValueError – If the Estimator has not produced a checkpoint yet.
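
As a small usage sketch, the two methods above can be combined to inspect the trained weights. The estimator instance is assumed to have already written a checkpoint, for example after a call to train().

 # 'estimator' is assumed to be an (IPU)Estimator with an existing checkpoint.
for name in estimator.get_variable_names():
  value = estimator.get_variable_value(name)  # numpy array
  print(name, value.shape)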

latest_checkpoint()

Finds the filename of the latest saved checkpoint file in model_dir.

Returns

The full path to the latest checkpoint or None if no checkpoint was found.

property model_fn

Returns the model_fn which is bound to self.params.

Returns

The model_fn with the following signature: def model_fn(features, labels, mode, config)

predict(input_fn, predict_keys=None, hooks=None, checkpoint_path=None, yield_single_examples=True, num_predictions=None)

Yields predictions for given features.

Parameters
  • input_fn

    A function that constructs the features. The function should return a tf.data.Dataset object. The outputs of the Dataset object should be one of the following:

    • features: A Tensor or a dictionary of string feature name to Tensor. features are consumed by model_fn.

    • A tuple, in which case the first item is extracted as features.

     

  • predict_keys – list of str, name of the keys to predict. It is used if the tf.estimator.EstimatorSpec.predictions is a dict. If predict_keys is used then the rest of the predictions will be filtered from the dictionary. If None, returns all.

  • hooks – List of tf.train.SessionRunHook subclass instances. Used for callbacks inside the prediction call.

  • checkpoint_path – Path of a specific checkpoint to predict. If None, the latest checkpoint in model_dir is used. If there are no checkpoints in model_dir, prediction is run with newly initialized Variables instead of ones restored from checkpoint.

  • yield_single_examples – If False, yields the whole batch as returned by the model_fn instead of decomposing the batch into individual elements. This is useful if model_fn returns some tensors whose first dimension is not equal to the batch size.

  • num_predictions – If not None, the generator will raise StopIteration after yielding this number of predictions. This allows draining the generator by using list(predictions). If None, the returned generator is infinite and will trigger a fatal error if you try to consume more predictions from it than what is actually generated, instead of raising the StopIteration exception. This is caused by the current behaviour when requesting to run a loop on the IPU for more iterations than there are elements remaining in the dataset. In this case you cannot drain it by using list(predictions); you have to consume the expected number of elements yourself, for example using [next(predictions) for _ in range(num_predictions)].

Yields

Evaluated values of predictions tensors.
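
The following sketch shows how num_predictions makes the returned generator finite so that it can be drained with list(). The feature shape and the trained estimator instance are assumptions for illustration.

 import numpy as np
import tensorflow as tf

def predict_input_fn():
  # An infinite dataset of feature vectors; the shape is illustrative.
  features = np.zeros([32, 4], dtype=np.float32)
  return tf.data.Dataset.from_tensor_slices(features).batch(
      4, drop_remainder=True).repeat()

# 'estimator' is assumed to be a trained IPUEstimator.
predictions = estimator.predict(input_fn=predict_input_fn, num_predictions=8)
results = list(predictions)  # the generator stops after 8 predictions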

train(input_fn, hooks=None, steps=None, max_steps=None, saving_listeners=None)

Trains a model given training data input_fn.

Parameters
  • input_fn

    A function that provides input data for training as minibatches. The function should return a tf.data.Dataset object. The outputs of the Dataset object must be a tuple (features, labels) where

    • features is a tf.Tensor or a dictionary of string feature name to Tensor

    • labels is a Tensor or a dictionary of string label name to Tensor

    Both features and labels are consumed by model_fn.

     

  • hooks – List of tf.train.SessionRunHook subclass instances. Used for callbacks inside the training loop.

  • steps – Number of steps for which to train the model. steps works incrementally: if you call train(steps=10) twice, training occurs for a total of 20 steps. If you do not want incremental behaviour, set max_steps instead. If set, max_steps must be None.

  • max_steps – Number of total steps for which to train the model. If set, steps must be None. Two calls to train(steps=100) means 200 training iterations. On the other hand, two calls to train(max_steps=100) means that the second call will not do any iterations, since the first call already did all 100 steps.

  • saving_listeners – list of CheckpointSaverListener objects. Used for callbacks that run immediately before or after checkpoint savings.

Returns

self, for chaining.
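
A minimal training-call sketch follows. The input function, tensor shapes and the estimator instance are illustrative assumptions; with an IPURunConfig, steps would typically be a multiple of iterations_per_loop.

 import tensorflow as tf

def train_input_fn():
  # A toy (features, labels) dataset; the shapes are illustrative.
  features = tf.random.uniform([128, 4])
  labels = tf.random.uniform([128, 1])
  dataset = tf.data.Dataset.from_tensor_slices((features, labels))
  return dataset.batch(16, drop_remainder=True).repeat()

# 'estimator' is assumed to be an IPUEstimator built with a suitable model_fn.
estimator.train(input_fn=train_input_fn, steps=100)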

class tensorflow.python.ipu.ipu_pipeline_estimator.IPUPipelineEstimatorSpec

Ops and objects returned from a model_fn and passed to IPUPipelineEstimator.

static __new__(cls, mode, computational_stages, pipeline_depth, eval_metrics_fn=None, optimizer_function=None, device_mapping=None, pipeline_schedule=None, offload_weight_update_variables=True)

Creates a validated IPUPipelineEstimatorSpec instance.

Depending on the value of mode, different arguments are required. Namely

  • For mode == ModeKeys.TRAIN: the optimizer_function is required.

  • For mode == ModeKeys.EVAL: the eval_metrics_fn is required.

Refer to the pipelining_ops documentation for more details about pipelining.

Parameters
  • mode – A ModeKeys. Specifies if this is training, evaluation or prediction.

  • computational_stages – a list of Python functions, where each function represents a computational pipeline stage. The function takes the outputs of the previous pipeline stage as its inputs.

  • pipeline_depth – the number of times each pipeline stage will be executed.

  • eval_metrics_fn – a Python function which takes the output of the last computational stage as parameters and returns a dict of evaluation metrics. The dict must contain a loss tensor value with the key "loss". This function will be called on the host.

  • optimizer_function – a Python function which takes the output of the last computational stage as parameters and returns an instance of OptimizerFunctionOutput in order to generate the back-propagation and weight-update parts of the model suitable for training.

  • device_mapping – optional stage to IPU mapping override.

  • pipeline_schedule – the scheduling algorithm to use for pipeline lowering. Must be of type PipelineSchedule.

  • offload_weight_update_variables – not supported in SDK 1.1.

Returns

A validated IPUPipelineEstimatorSpec object.

Raises

ValueError – If validation fails.
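
As an illustrative sketch, a model_fn returning an IPUPipelineEstimatorSpec for training might look as follows. The stage functions, layer sizes and optimiser are assumptions, and it is assumed that the dataset elements are passed directly to the first pipeline stage rather than to the model_fn itself.

 import tensorflow as tf
from tensorflow.python.ipu import ipu_pipeline_estimator
from tensorflow.python.ipu.ops import pipelining_ops

def my_model_fn(mode):
  def stage1(features, labels):
    hidden = tf.layers.dense(features, 64, activation=tf.nn.relu)
    return hidden, labels

  def stage2(hidden, labels):
    logits = tf.layers.dense(hidden, 10)
    loss = tf.losses.sparse_softmax_cross_entropy(labels, logits)
    return loss

  def optimizer_function(loss):
    optimizer = tf.train.GradientDescentOptimizer(0.01)
    return pipelining_ops.OptimizerFunctionOutput(optimizer, loss)

  return ipu_pipeline_estimator.IPUPipelineEstimatorSpec(
      mode,
      computational_stages=[stage1, stage2],
      optimizer_function=optimizer_function,
      pipeline_depth=8)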

class tensorflow.python.ipu.ipu_run_config.IPURunConfig

IPU related configuration required by IPUEstimator.

Parameters
  • iterations_per_loop – This is the number of iterations running on the IPU device before returning to the CPU host for each Session.run. This means that the global step is increased iterations_per_loop times in one Session.run.

  • ipu_options – An IpuOptions configuration protobuf which is populated prior to being passed into IPURunConfig. Note that if more than one device is being used then ipu_options needs to be populated with a device_config.

  • compile_summary – Generate compilation summary

  • num_replicas – Number of replicated graphs (data parallelism)

  • num_shards – Number of IPU devices on which the graph is sharded (model parallelism)

  • autosharding – Use the IPU automatic_sharding to automatically shard the graph across num_shards devices

class tensorflow.python.ipu.ipu_run_config.RunConfig(ipu_run_config=None, master=None, **kwargs)

RunConfig with IPU support.

Parameters
  • ipu_run_config – IPURunConfig object for IPU-specific configuration.

  • master – a string. The address of the distributed master to use for training.

  • **kwargs – keyword config parameters.
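
Putting these together, a configuration sketch for a single-IPU estimator might look as follows. The model directory and the model function my_model_fn are hypothetical names used only for illustration.

 from tensorflow.python import ipu
from tensorflow.python.ipu import ipu_estimator, ipu_run_config

# Describe the IPU system: one automatically selected IPU.
ipu_options = ipu.utils.create_ipu_config()
ipu_options = ipu.utils.auto_select_ipus(ipu_options, num_ipus=1)

config = ipu_run_config.RunConfig(
    ipu_run_config=ipu_run_config.IPURunConfig(
        iterations_per_loop=100,
        ipu_options=ipu_options),
    model_dir="/tmp/my_model")

# 'my_model_fn' is a hypothetical model function returning an EstimatorSpec.
estimator = ipu_estimator.IPUEstimator(model_fn=my_model_fn, config=config)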

Distribution strategy for a single system

class tensorflow.python.ipu.ipu_strategy.IPUStrategy(ipu_device='/device:IPU:0', cpu_device='/device:CPU:0')

This is a distribution strategy for targeting a system with one or more IPUs.

Creating variables and Keras models within the scope of the IPUStrategy will ensure that they are placed on the IPU.

A tf.function can be executed on the IPU by calling it from the experimental_run_v2 function.

Variables will automatically be placed onto the IPUs, but the initializers for the variables will be performed on the CPU device.

 from tensorflow.python import ipu

# Create an IPU distribution strategy
strategy = ipu.ipu_strategy.IPUStrategy()

with strategy.scope():

    # Instantiate a keras model here
    m = MyModel()

    # And train it
    m.fit(...)

    # Or call a tf.function
    res = strategy.experimental_run_v2(my_fn, [...])

Dropout Keras layer

class tensorflow.python.ipu.keras.layers.dropout.Dropout(rate=0.5, noise_shape=None, seed=None, scale=1, seed_modifier=1, **kwargs)

Base class for implementing XLA and Popnn compatible Dropout layer.

build(input_shape)

Creates the variables of the layer (optional, for subclass implementers).

This is a method that implementers of subclasses of Layer or Model can override if they need a state-creation step in-between layer instantiation and layer call.

This is typically used to create the weights of Layer subclasses.

Parameters

input_shape – Instance of TensorShape, or list of instances of TensorShape if the layer expects a list of inputs (one instance per input).

call(x, training=True)

This is where the layer’s logic lives.

Parameters
  • inputs – Input tensor, or list/tuple of input tensors.

  • **kwargs – Additional keyword arguments.

Returns

A tensor or list/tuple of tensors.

get_config()

Returns the config of the layer.

A layer config is a Python dictionary (serializable) containing the configuration of a layer. The same layer can be reinstantiated later (without its trained weights) from this configuration.

The config of a layer does not include connectivity information, nor the layer class name. These are handled by Network (one layer of abstraction above).

Returns

Python dictionary.
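
A minimal sketch of using this layer inside an IPU graph follows; the tensor shape and dropout rate are illustrative assumptions.

 import tensorflow as tf
from tensorflow.python import ipu
from tensorflow.python.ipu.keras.layers.dropout import Dropout

def my_net(x):
  # Popnn-backed dropout, dropping half of the activations during training.
  dropout = Dropout(rate=0.5)
  return dropout(x, training=True)

with ipu.scopes.ipu_scope("/device:IPU:0"):
  x = tf.placeholder(tf.float32, shape=[8, 16])
  out = ipu.ipu_compiler.compile(my_net, inputs=[x])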

Embedding Keras layer

class tensorflow.python.ipu.keras.layers.embedding_lookup.Embedding(input_dim, output_dim, embeddings_initializer='uniform', **kwargs)

This is designed to be a replacement for the typical use cases of the Keras Embedding layer.

Parameters
  • input_dim – int > 0. Size of the vocabulary, i.e. maximum integer index + 1.

  • output_dim – int >= 0. Dimension of the dense embedding.

  • embeddings_initializer – Initializer for the embeddings matrix.

Input shape:

2D tensor with shape: (batch_size, input_length).

Output shape:

3D tensor with shape: (batch_size, input_length, output_dim).

call(inputs)

Perform an embedding lookup.

Parameters

inputs – An integer tensor of indices into the embedding variable.

Returns

The entries of the embedding tensor corresponding to the ids tensor indices.

get_config()

Returns the config of the layer.

A layer config is a Python dictionary (serializable) containing the configuration of a layer. The same layer can be reinstantiated later (without its trained weights) from this configuration.

The config of a layer does not include connectivity information, nor the layer class name. These are handled by Network (one layer of abstraction above).

Returns

Python dictionary.
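
A minimal sketch of an embedding lookup with this layer follows; the vocabulary size, embedding dimension and input shape are illustrative assumptions.

 import tensorflow as tf
from tensorflow.python import ipu
from tensorflow.python.ipu.keras.layers.embedding_lookup import Embedding

def my_net(ids):
  # Look up 32-dimensional embeddings for a vocabulary of 1000 tokens.
  embedding = Embedding(input_dim=1000, output_dim=32)
  return embedding(ids)

with ipu.scopes.ipu_scope("/device:IPU:0"):
  ids = tf.placeholder(tf.int32, shape=[2, 10])
  out = ipu.ipu_compiler.compile(my_net, inputs=[ids])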

Normalization Keras layers

class tensorflow.python.ipu.keras.layers.normalization.GroupNorm(dtype=tf.float32, groups=2, channels_axis=- 1, center=True, scale=True, epsilon=1e-06, beta_initializer=None, gamma_initializer=None, name=None)

Group normalization layer optimized for running on the IPU.

This layer is used like the standard Keras BatchNormalization layer. However, it has beta and gamma trainable parameters, but no statistics gathering.

Group normalization is described in this paper: https://arxiv.org/abs/1803.08494.

Parameters
  • dtype – The data type for the trainable weights.

  • groups – The number of groups to use in the normalization.

  • channels_axis – Integer, the axis that should be normalized (typically the features axis).

  • center – If True, add offset of beta to normalized tensor. If False, beta is ignored.

  • scale – If True, multiply by gamma. If False, gamma is not used.

  • epsilon – Small float added to variance to avoid dividing by zero.

  • beta_initializer – Initializer for the beta weight.

  • gamma_initializer – Initializer for the gamma weight.

  • name – Optional name for the layer.

build(input_shape)

Creates the variables of the layer (optional, for subclass implementers).

This is a method that implementers of subclasses of Layer or Model can override if they need a state-creation step in-between layer instantiation and layer call.

This is typically used to create the weights of Layer subclasses.

Parameters

input_shape – Instance of TensorShape, or list of instances of TensorShape if the layer expects a list of inputs (one instance per input).

call(inputs, training=True)

This is where the layer’s logic lives.

Parameters
  • inputs – Input tensor, or list/tuple of input tensors.

  • **kwargs – Additional keyword arguments.

Returns

A tensor or list/tuple of tensors.

class tensorflow.python.ipu.keras.layers.normalization.InstanceNorm(dtype=tf.float32, channels_axis=- 1, center=True, scale=True, epsilon=1e-06, beta_initializer=None, gamma_initializer=None, name=None)

Instance normalization layer optimized for use on the IPU.

This layer is used like the standard Keras BatchNormalization layer. However, it has beta and gamma trainable parameters, but no statistics gathering.

Instance normalization is described in this paper: https://arxiv.org/abs/1607.08022.

Parameters
  • dtype – The data type for the trainable weights.

  • groups – The number of groups to use in the normalization.

  • channels_axis – Integer, the axis that should be normalized (typically the features axis).

  • center – If True, add offset of beta to normalized tensor. If False, beta is ignored.

  • scale – If True, multiply by gamma. If False, gamma is not used.

  • epsilon – Small float added to variance to avoid dividing by zero.

  • beta_initializer – Initializer for the beta weight.

  • gamma_initializer – Initializer for the gamma weight.

  • name – Optional name for the layer.

build(input_shape)

Creates the variables of the layer (optional, for subclass implementers).

This is a method that implementers of subclasses of Layer or Model can override if they need a state-creation step in-between layer instantiation and layer call.

This is typically used to create the weights of Layer subclasses.

Parameters

input_shape – Instance of TensorShape, or list of instances of TensorShape if the layer expects a list of inputs (one instance per input).

call(inputs, training=True)

This is where the layer’s logic lives.

Parameters
  • inputs – Input tensor, or list/tuple of input tensors.

  • **kwargs – Additional keyword arguments.

Returns

A tensor or list/tuple of tensors.

class tensorflow.python.ipu.keras.layers.normalization.LayerNorm(dtype=tf.float32, channels_axis=- 1, center=True, scale=True, epsilon=1e-06, beta_initializer=None, gamma_initializer=None, name=None)

Layer normalization layer optimized for use on the IPU.

This layer is used like the standard Keras BatchNormalization layer. However, it has beta and gamma trainable parameters, but no statistics gathering.

Layer normalization is described in this paper: https://arxiv.org/abs/1607.06450.

Parameters
  • dtype – The data type for the trainable weights.

  • groups – The number of groups to use in the normalization.

  • channels_axis – Integer, the axis that should be normalized (typically the features axis).

  • center – If True, add offset of beta to normalized tensor. If False, beta is ignored.

  • scale – If True, multiply by gamma. If False, gamma is not used.

  • epsilon – Small float added to variance to avoid dividing by zero.

  • beta_initializer – Initializer for the beta weight.

  • gamma_initializer – Initializer for the gamma weight.

  • name – Optional name for the layer.

build(input_shape)

Creates the variables of the layer (optional, for subclass implementers).

This is a method that implementers of subclasses of Layer or Model can override if they need a state-creation step in-between layer instantiation and layer call.

This is typically used to create the weights of Layer subclasses.

Parameters

input_shape – Instance of TensorShape, or list of instances of TensorShape if the layer expects a list of inputs (one instance per input).

call(inputs, training=True)

This is where the layer’s logic lives.

Parameters
  • inputs – Input tensor, or list/tuple of input tensors.

  • **kwargs – Additional keyword arguments.

Returns

A tensor or list/tuple of tensors.
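
A minimal sketch using the GroupNorm layer on an image-like tensor follows; the shapes and group count are illustrative assumptions. InstanceNorm and LayerNorm are used in the same way, without the groups argument.

 import tensorflow as tf
from tensorflow.python import ipu
from tensorflow.python.ipu.keras.layers.normalization import GroupNorm

def my_net(x):
  # Normalise the 16 channels in 4 groups.
  norm = GroupNorm(groups=4, channels_axis=-1)
  return norm(x, training=True)

with ipu.scopes.ipu_scope("/device:IPU:0"):
  x = tf.placeholder(tf.float32, shape=[8, 32, 32, 16])
  out = ipu.ipu_compiler.compile(my_net, inputs=[x])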

Recurrent Keras layers

class tensorflow.python.ipu.keras.layers.rnn.PopnnLSTM(units, activation='tanh', recurrent_activation='hard_sigmoid', use_bias=True, kernel_initializer='glorot_uniform', recurrent_initializer='orthogonal', bias_initializer='zeros', unit_forget_bias=True, kernel_regularizer=None, recurrent_regularizer=None, bias_regularizer=None, activity_regularizer=None, kernel_constraint=None, recurrent_constraint=None, bias_constraint=None, dropout=0.0, recurrent_dropout=0.0, implementation=1, return_sequences=False, return_state=False, go_backwards=False, stateful=False, unroll=False, partials_dtype=tf.float32, seed=None, time_major=False, **kwargs)

XLA compatible, Popnn implementation of an LSTM layer.

Below is a typical workflow:

 with tf.Graph().as_default():
  lstm = PopnnLSTM(num_units, ...)

  outputs = lstm(inputs)
Parameters
  • num_units – the number of units within the RNN model.

  • partials_dtype – the type used by Popnn to perform partial calculations. Either tf.float16 or tf.float32.

  • kernel_initializer – starting value to initialize the weight (default is the Glorot uniform initializer).

  • bias_initializer – starting value to initialize the bias (default is all zeros).

  • recurrent_initializer – This optional parameter will partition weight initialization into two stages: first initializing the input kernel using kernel_initializer, then initializing a kernel for the recurrent state. This partitioning is what the Keras LSTM layer does. (default is None, meaning off)

  • dropout – Float between 0 and 1. Fraction of the units to drop for the linear transformation of the inputs.

  • return_state – When True, the layer returns a tuple containing the output and the state tensors. Otherwise it returns only the output tensor.

  • seed – A Python integer. Used to create the default Glorot uniform initializer kernel_initializer.

  • time_major – The input should be of the form [sequence, batch, units] instead of the default [batch, sequence, units].

build(input_shape)

Create variables of the PopnnLSTM.

It can be called manually before __call__() or automatically through __call__(). In the former case, any subsequent __call__() will skip creating variables.

Parameters

input_shape – a TensorShape object with 3 dimensions.

Raises

ValueError – if input_shape has wrong dimension or unknown 3rd dimension.

call(inputs, initial_state=None, training=True)

Runs the forward step for the LSTM layer.

Parameters
  • inputs – 3-D tensor with shape [batch_size, seq_len, input_size]. If the time_major parameter is set to True, then the shape should be [seq_len, batch_size, input_size].

  • initial_state – An LSTMStateTuple of state tensors, each shaped [batch_size, num_units]. If not provided, the state is initialized to zeros.

  • training – whether this operation will be used in training or inference.

Returns

output: When return_sequences is set, the LSTM returns a tensor of shape [batch_size, seq_len, num_units]; otherwise it returns a tensor of shape [batch_size, num_units].

output_state: The output state of the last cell, when the parameter return_state is set to True.

state_shape(batch_size)

Shape of Popnn LSTM states.

Shape is a 2-element tuple. Each is [batch_size, num_units]

Parameters

batch_size – an int

Returns

a tuple of python arrays.
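
Expanding the workflow above into a compilable sketch; the batch size, sequence length and feature size are illustrative assumptions.

 import tensorflow as tf
from tensorflow.python import ipu
from tensorflow.python.ipu.keras.layers.rnn import PopnnLSTM

def my_net(x):
  # 256-unit Popnn LSTM; with the defaults only the output tensor is returned.
  lstm = PopnnLSTM(256)
  return lstm(x, training=True)

with ipu.scopes.ipu_scope("/device:IPU:0"):
  # Default layout is [batch_size, seq_len, input_size].
  x = tf.placeholder(tf.float32, shape=[4, 20, 64])
  out = ipu.ipu_compiler.compile(my_net, inputs=[x])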

class tensorflow.python.ipu.keras.layers.rnn.PopnnGRU(units, activation='tanh', recurrent_activation='hard_sigmoid', use_bias=True, kernel_initializer='glorot_uniform', recurrent_initializer='orthogonal', bias_initializer='zeros', kernel_regularizer=None, recurrent_regularizer=None, bias_regularizer=None, activity_regularizer=None, kernel_constraint=None, recurrent_constraint=None, bias_constraint=None, dropout=0.0, recurrent_dropout=0.0, implementation=1, return_sequences=False, return_state=False, go_backwards=False, stateful=False, unroll=False, reset_after=False, partials_dtype=tf.float32, seed=None, time_major=False, **kwargs)

XLA compatible, Popnn implementation of a GRU layer.

Below is a typical workflow:

 with tf.Graph().as_default():
  gru = PopnnGRU(num_units, ...)

  outputs = gru(inputs)
Parameters
  • units – the number of units within the RNN model.

  • partials_dtype – the type used by Popnn to perform partial calculations. Either tf.float16 or tf.float32.

  • kernel_initializer – starting value to initialize the weight (default is the Glorot uniform initializer).

  • bias_initializer – starting value to initialize the bias (default is all zeros).

  • dropout – Float between 0 and 1. Fraction of the units to drop for the linear transformation of the inputs.

  • return_state – When True, the layer returns a tuple containing the output and the state tensors. Otherwise it returns only the output tensor.

  • seed – A Python integer. Used to create the default Glorot uniform initializer kernel_initializer.

  • time_major – The input should be of the form [sequence, batch, units] instead of the default [batch, sequence, units].

build(input_shape)

Create variables of the PopnnGRU.

It can be called manually before __call__() or automatically through __call__(). In the former case, any subsequent __call__() will skip creating variables.

Parameters

input_shape – a TensorShape object with 3 dimensions.

Raises

ValueError – if input_shape has wrong dimension or unknown 3rd dimension.

call(inputs, initial_state=None, training=True)

Runs the forward step for the GRU layer.

Parameters
  • inputs – 3-D tensor with shape [batch_size, seq_len, input_size]. If the time_major parameter is True, then the shape should be [seq_len, batch_size, input_size].

  • initial_state – Initial state tensor, shaped [batch_size, num_units] If not provided, the state is initialized to zeros.

  • training – whether this operation will be used in training or inference.

Returns

output: When return_sequences is set, the GRU returns a tensor of shape [batch_size, seq_len, num_units]; otherwise it returns a tensor of shape [batch_size, num_units].

output_state: The output state of the last cell, when the parameter return_state is set to True.

Raises

ValueError – if initial_state is not valid.

state_shape(batch_size)

Shape of Popnn GRU state.

State shape is [batch_size, num_units].

Parameters

batch_size – an int

Returns

A python array.

Looping utilities

tensorflow.python.ipu.loops.repeat(n, body, inputs=None, infeed_queue=None, use_while_v1=True)

Builds a loop that executes a fixed number of iterations.

The set of loop-carried tensors correspond to inputs. body must be a function that takes and returns the values of the loop-carried tensors.

Parameters
  • n – the number of loop iterations

  • body – a Python function that builds the loop body.

  • inputs – a list of initial values passed into the loop or None (equivalent to an empty list).

  • infeed_queue – if not None, the IPUInfeedQueue from which data is consumed.

  • use_while_v1 – if True, then use a Tensorflow v1.x dataflow while loop.

Returns

The final values of the loop-carried tensors.

Raises

TypeError – if body has the wrong signature.

tensorflow.python.ipu.loops.while_loop(condition, body, inputs=None, infeed_queue=None, maximum_iterations=None, use_while_v1=True)

Builds a while loop for IPUs.

The set of loop-carried tensors corresponds to inputs. Both condition and body take the current value of the loop-carried tensors. condition must return a single boolean value that determines whether iteration continues. body must return an updated list of values for the loop-carried tensors.

Parameters
  • condition – a Python function that builds the loop condition.

  • body – a Python function that builds the loop body.

  • inputs – a list of initial values passed into the loop, or None (equivalent to an empty list).

  • infeed_queue – if not None, the IPUInfeedQueue from which data is consumed.

  • use_while_v1 – if True, then use a Tensorflow v1.x dataflow while loop.

Returns

The final values of the loop-carried tensors.

Raises

TypeError – if body or condition has the wrong signature.
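
As a small sketch of repeat (the loop body and initial values are illustrative assumptions), note that the body must return updated values for all loop-carried tensors on every iteration:

 import tensorflow as tf
from tensorflow.python import ipu
from tensorflow.python.ipu import loops

def body(total, x):
  # Return updated values for both loop-carried tensors.
  return total + x, x

def my_net():
  # Run the body 10 times on the IPU, starting from total=0.0, x=1.0.
  return loops.repeat(10, body, inputs=[tf.constant(0.0), tf.constant(1.0)])

with ipu.scopes.ipu_scope("/device:IPU:0"):
  out = ipu.ipu_compiler.compile(my_net)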

Utility functions for sharding graphs

tensorflow.python.ipu.sharding.dependencies(roots)

Find a list of ancestor operations for a given set of root operations

Parameters

roots – The root operations from which to start.

tensorflow.python.ipu.sharding.get_shard_from_colocation(op)

Find the shard number from an op which shares co-location information with the given operation.

Parameters

op – The operation to apply sharding to.

tensorflow.python.ipu.sharding.has_attr(o, attr_name)

Test for the presence of a specific attribute.

Parameters
  • o – An operation.

  • attr_name – The name of an attribute to test for.

Returns

True if the operation has the given attribute.

tensorflow.python.ipu.sharding.propagate_sharding(g)

Move the sharding from the forward pass operations onto their co-located backward pass operations.

Parameters

g – The graph.

General utility functions

class tensorflow.python.ipu.utils.DeviceConnectionType

Enumeration to describe the mechanism used to attach to the Poplar device.

  • ALWAYS indicates that the system will attach when configuring the device.

  • ON_DEMAND will defer connection to when the IPU is needed.

  • NEVER will never try to attach to a device. Used when compiling offline.

class tensorflow.python.ipu.utils.ExecutionProfileType

The execution profile type indicates the desired information in the execution profile.

  • NO_PROFILE indicates that there should be no execution profiling.

  • DEVICE_PROFILE indicates that the execution profile should contain only device wide events.

  • IPU_PROFILE indicates that the profile should contain IPU level execution events.

  • TILE_PROFILE indicates that the profile should contain Tile level execution events.

class tensorflow.python.ipu.utils.SelectionOrder

Depending on the communication pattern of the model, the order in which the IPUs are selected and mapped to shards can impact the performance.

For example, given a model which executes on multiple IPUs:

 def sharded_graph(pa, pb, pc, pd):
  with ipu.scopes.ipu_shard(0):
    o1 = pa + pb
  with ipu.scopes.ipu_shard(1):
    o2 = o1 + pc
  with ipu.scopes.ipu_shard(2):
    o3 = o2 + pd
    return o3

and a typical machine with 8 Graphcore C2 cards:

  _______               _______
|       |             |       |
|  14   |=============|  15   |
|_______|             |_______|
    ||                    ||
 _______               _______
|       |             |       |
|  12   |=============|  13   |
|_______|             |_______|
    ||                    ||
 _______               _______
|       |             |       |
|  10   |=============|  11   |
|_______|             |_______|
    ||                    ||
 _______               _______
|       |             |       |
|   8   |=============|   9   |
|_______|             |_______|
    ||                    ||
 _______               _______
|       |             |       |
|   6   |=============|   7   |
|_______|             |_______|
    ||                    ||
 _______               _______
|       |             |       |
|   4   |=============|   5   |
|_______|             |_______|
    ||                    ||
 _______               _______
|       |             |       |
|   2   |=============|   3   |
|_______|             |_______|
    ||                    ||
 _______               _______
|       |             |       |
|   0   |=============|   1   |
|_______|             |_______|

(where each numbered square represents an IPU with the given device ID and the == and || connections represent IPUs being directly connected via IPU-Links)

we can see that the ipu_shard(0) directly communicates with ipu_shard(1) and that ipu_shard(1) directly communicates with ipu_shard(2). If the shards 0, 1, 2 were mapped to IPUs 0, 1, 2 in that order, then the communication between shards 1 and 2 would not have a direct connection via an IPU-Link and would have to perform a “hop” via an IPU. If the shards 0, 1, 2 were mapped to IPUs 0, 1, 3 in that order, then the communication between shards 1 and 2 would have a direct connection via an IPU-Link which will reduce the communication cost.

This Enum class is used to control the order in which the IPUs are selected. Currently, the following IPU selection orderings are supported:

  • AUTO: automatically try and select the best selection given the network.

  • ZIGZAG: follow the natural ordering of IPUs. In the above example, the IPUs would be selected in the following order: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15.

  • SNAKE: select IPUs such that each consecutive shard is directly connected via IPU-Links to the shard before and after. In the above example, the IPUs would be selected in the following order: 0, 1, 3, 2, 4, 5, 7, 6, 8, 9, 11, 10, 12, 13, 15, 14.

  • HOOF: select IPUs such that each consecutive shard is directly connected via IPU-Links to the shard before and after and the last and first shard are on the same C2 cards. In the above example, the IPUs would be selected in the following order: 0, 2, 4, 6, 8, 10, 12, 14, 15, 13, 11, 9, 7, 5, 3, 1.

The SNAKE and HOOF IPU selection orders are particularly beneficial for pipelined models.
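
For example, a snake selection order can be requested when creating the IPU configuration. This sketch assumes a four-IPU pipelined model.

 from tensorflow.python import ipu

opts = ipu.utils.create_ipu_config(
    selection_order=ipu.utils.SelectionOrder.SNAKE)
opts = ipu.utils.auto_select_ipus(opts, num_ipus=4)
ipu.utils.configure_ipu_system(opts)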

tensorflow.python.ipu.utils.auto_select_ipus(opts, num_ipus)

Configure the IPUs to be used by the session.

The configuration describes a system consisting of multiple Tensorflow devices, each with control of one or more IPUs. The devices will be labeled /device:IPU:0, /device:IPU:1 and so on.

Each device can control a specific number of IPUs, given by the num_ipus parameter. The system will automatically select IPU configurations from the available IPUs, where they match the desired number of IPUs.

Examples:

 # Create a single device, with one IPU
opts = create_ipu_config()
opts = auto_select_ipus(opts, num_ipus=1)
ipu.utils.configure_ipu_system(opts)
with tf.Session() as s:
  ...
 # Create two devices, with 2 IPUs per device.
opts = create_ipu_config()
opts = auto_select_ipus(opts, num_ipus=[2,2])
ipu.utils.configure_ipu_system(opts)
with tf.Session() as s:
  ...
 # Create two devices, with 1 IPU in the first device and 2 IPUs
# in the second device.
opts = create_ipu_config()
opts = auto_select_ipus(opts, num_ipus=[1,2])
ipu.utils.configure_ipu_system(opts)
with tf.Session() as s:
  ...
Parameters
  • opts – An IpuOptions session control protobuf.

  • num_ipus – List of IPUs per Tensorflow device

Returns

The IpuOptions configuration protobuf, configured for auto-selecting a set of IPU devices.

tensorflow.python.ipu.utils.configure_ipu_system(config, device='cpu')

Configure an IPU system by passing an IpuOptions protobuf created by the create_ipu_config function.

Parameters
  • config – An IpuOptions configuration protobuf

  • device – The CPU device which is local to the IPU hardware

Returns

None

tensorflow.python.ipu.utils.create_ipu_config(profiling=False, enable_ipu_events=False, use_poplar_text_report=False, use_poplar_cbor_report=False, profile_execution=None, report_every_nth_execution=0, max_report_size=268435456, report_directory='', scheduler_selection='', always_rearrange_copies_on_the_host=False, merge_infeed_io_copies=False, disable_graph_convolution_caching=False, disable_graph_outlining=False, retain_control_dependencies=False, max_cross_replica_sum_buffer_size=0, max_inter_ipu_copies_buffer_size=0, max_scheduler_lookahead_depth=5, max_scheduler_search_space_size=64, prefetch_data_streams=True, selection_order=None)

Create an empty IPU session configuration structure. (deprecated arguments)

Warning: SOME ARGUMENTS ARE DEPRECATED: (max_cross_replica_sum_buffer_size, max_inter_ipu_copies_buffer_size). They will be removed in a future version. Instructions for updating: Use set_optimization_options() instead.

Parameters
  • profiling – Enable compilation reports, and IPU trace events.

  • enable_ipu_events – Enable IPU trace events without poplar reports.

  • use_poplar_text_report – Enable the poplar textual report summary

  • use_poplar_cbor_report – Enable the poplar CBOR reports

  • profile_execution – Include Poplar execution profiles in the execution events. Can only be enabled if profiling is also enabled. If set, can be True, False, or a member of the ExecutionProfileType enumeration. A True value indicates ExecutionProfileType.DEVICE_PROFILE.

  • report_every_nth_execution – Only produce an execution report on every Nth execution. 0 = One report only.

  • max_report_size – The maximum size of Poplar profiles to include in the profile events.

  • report_directory – When set, reports will be written to files in this directory, instead of being written into the events. The events will contain the full paths of the report files.

  • scheduler_selection – When set, this forces the compiler to use a specific scheduler when ordering the instructions. See the documentation for a list of valid schedulers.

  • always_rearrange_copies_on_the_host – (Experimental flag) The data which is streamed to/from the device might be stored in different layouts on the device and on the host. If that is the case, the rearrangement is performed on the device by default. By enabling this option the rearrangement will be performed on the host at the expense of latency.

  • merge_infeed_io_copies – When true, this flag will merge the streamed host->device input copies into one larger copy. This may reduce the time to copy data from the host, at the expense of increasing the live tensor memory on the device.

  • disable_graph_convolution_caching – By default, the convolution operation searches for an equivalent cached operation, and uses this instead of creating a new convolution. Setting this flag forces the creation of a new convolution. This can improve runtime at the expense of graph size.

  • disable_graph_outlining – By default, some operations, such as matrix multiplications, which occur in the graph multiple times but with different input tensors might be optimised to reduce the total code size of the graph at the expense of the execution time. Setting this flag will disable these optimisations. This option is not valid for the convolution operation (also see disable_graph_convolution_caching)

  • retain_control_dependencies – Deprecated.

  • max_cross_replica_sum_buffer_size – The maximum number of bytes that can be waiting before a cross replica sum op is scheduled.

  • max_inter_ipu_copies_buffer_size – The maximum number of bytes that can be waiting before an inter-IPU copy is scheduled.

  • max_scheduler_lookahead_depth – The maximum distance to look into the future when considering valid schedules.

  • max_scheduler_search_space_size – The maximum number of nodes to consider when building the tree of future schedules.

  • prefetch_data_streams – When set to true, the prefetching of data for data streams on the host will be overlapped with execution on the IPU.

  • selection_order – the order in which IPUs are selected and mapped to physical IPU devices when using multi-IPU devices (see SelectionOrder). Must be an instance of SelectionOrder; when not specified, the automatic selection order is used.

Returns

An IpuOptions configuration protobuf, suitable for passing to configure_ipu_system

tensorflow.python.ipu.utils.export_dataset_to_file(dataset_or_infeed, output_filename, num_elements, feed_name='')

Export as binary num_elements from the given infeed to the specified output_filename.

If the infeed elements are tuples then one file per tuple element will be created. For example, if the dataset looks like [{"a": A_0, "b": B_0}, {"a": A_1, "b": B_1}, ...] then export_dataset_to_file(dataset, "my_dataset.bin", 100) will generate:

my_dataset.0.bin  # Contains tensors [A_0, A_1, ..., A_99]
my_dataset.1.bin  # Contains tensors [B_0, B_1, ..., B_99]

Parameters
  • dataset_or_infeed – A unary dataset with the same input and output structure, or an IPUInfeedQueue.

  • output_filename – Where to export the tensors to.

  • num_elements – Number of elements to export from the dataset.

  • feed_name – Specify the feed name
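
A minimal sketch exporting 100 elements of a two-tensor dataset follows; the tensor shapes are illustrative assumptions. As described above, one binary file is produced per tuple element.

 import tensorflow as tf
from tensorflow.python import ipu

dataset = tf.data.Dataset.from_tensor_slices(
    {"a": tf.zeros([100, 4]), "b": tf.ones([100, 2])})

# Writes my_dataset.0.bin and my_dataset.1.bin.
ipu.utils.export_dataset_to_file(dataset, "my_dataset.bin", 100)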

tensorflow.python.ipu.utils.export_inputs_to_file(inputs, output_filename, feed_dict)

Export as binary the list of inputs provided to the specified output_filename.

Parameters
  • inputs – List of graph inputs to export.

  • output_filename – Where to export the tensors to.

  • feed_dict – Feed dictionary containing the inputs’ values.

tensorflow.python.ipu.utils.extract_all_events(events)

Extract a list containing each event as an event object

Parameters

events – A tensor containing a list of IPU events as protobuf strings

Returns

A list containing IpuTraceEvent objects

tensorflow.python.ipu.utils.extract_all_strings_from_event_trace(events)

Extract a concatenation of all data strings from an IPU event trace.

Parameters

events – An array of IPU events as returned from the ipu_compile_summary operation.

Returns

A string containing the concatenation of all of the data fields of the events.

tensorflow.python.ipu.utils.extract_all_types_from_event_trace(events)

Return a list of the types of each event in an event trace tensor

Parameters

events – A tensor containing a list of IPU events as protobuf strings

Returns

A list containing the type of each event

tensorflow.python.ipu.utils.extract_compile_reports(events)

Get a list of all compiler reports in the event list.

Parameters

events – A list of trace event serialized protobufs

Returns

A list of tuples containing the module name and report.

tensorflow.python.ipu.utils.extract_execute_reports(events)

Get a list of all execution reports in the event list.

Parameters

events – A list of trace event serialized protobufs

Returns

A list of tuples containing the module name and report.

tensorflow.python.ipu.utils.move_variable_initialization_to_cpu(graph=None)

For all variables in the VARIABLES collection, move any initialization ops onto the CPU.

Parameters

graph – Operations are moved around on this graph. The default graph will be used if not specified.

Returns

None

tensorflow.python.ipu.utils.reset_ipu_seed(seed, device='/device:IPU:0', cpu_device='cpu')

Reset the seed used to generate stateful random numbers and perform stochastic rounding.

Parameters
  • seed – The new random number generator seed.

  • device – The device to which the seed will be applied.

  • cpu_device – The CPU device which is on the same hardware as the IPU device.

Returns

None

tensorflow.python.ipu.utils.running_on_ipu_model()

Check if XLA is configured to run on the IPU Model.

Returns

True if XLA is configured to run on the IPU Model. False if XLA is configured to run on real hardware.

tensorflow.python.ipu.utils.select_ipus(opts, indices)

Configure the IPUs to be used by the session.

The configuration describes a system consisting of multiple Tensorflow devices, each with control of one or more IPUs. The Tensorflow devices will be labeled /device:IPU:0, /device:IPU:1 and so on.

Each Tensorflow device uses a specific configuration consisting of one or more IPUs from the list of devices. These can be found by running the Graphcore utility gc-info -l. For instance, the following listing shows the device configurations available on a system with 16 IPUs.

 user@host:~$ gc-info -l
Graphcore device listing:

-+- Id:  [0], type:      [PCIe], PCI Domain: [0000:1a:00.0]
-+- Id:  [1], type:      [PCIe], PCI Domain: [0000:1b:00.0]
-+- Id:  [2], type:      [PCIe], PCI Domain: [0000:1c:00.0]
-+- Id:  [3], type:      [PCIe], PCI Domain: [0000:1d:00.0]
-+- Id:  [4], type:      [PCIe], PCI Domain: [0000:60:00.0]
-+- Id:  [5], type:      [PCIe], PCI Domain: [0000:61:00.0]
-+- Id:  [6], type:      [PCIe], PCI Domain: [0000:62:00.0]
-+- Id:  [7], type:      [PCIe], PCI Domain: [0000:63:00.0]
-+- Id:  [8], type:      [PCIe], PCI Domain: [0000:b1:00.0]
-+- Id:  [9], type:      [PCIe], PCI Domain: [0000:b2:00.0]
-+- Id: [10], type:      [PCIe], PCI Domain: [0000:b3:00.0]
-+- Id: [11], type:      [PCIe], PCI Domain: [0000:b4:00.0]
-+- Id: [12], type:      [PCIe], PCI Domain: [0000:da:00.0]
-+- Id: [13], type:      [PCIe], PCI Domain: [0000:db:00.0]
-+- Id: [14], type:      [PCIe], PCI Domain: [0000:dc:00.0]
-+- Id: [15], type:      [PCIe], PCI Domain: [0000:dd:00.0]
-+- Id: [32], type: [Multi IPU]
 |--- PCIe Id:  [7], DNC Id: [0], PCI Domain: [0000:63:00.0]
 |--- PCIe Id:  [6], DNC Id: [1], PCI Domain: [0000:62:00.0]
 |--- PCIe Id:  [5], DNC Id: [2], PCI Domain: [0000:61:00.0]
 |--- PCIe Id:  [4], DNC Id: [3], PCI Domain: [0000:60:00.0]
 |--- PCIe Id:  [3], DNC Id: [4], PCI Domain: [0000:1d:00.0]
 |--- PCIe Id:  [2], DNC Id: [5], PCI Domain: [0000:1c:00.0]
 |--- PCIe Id:  [1], DNC Id: [6], PCI Domain: [0000:1b:00.0]
 |--- PCIe Id:  [0], DNC Id: [7], PCI Domain: [0000:1a:00.0]
 |--- PCIe Id: [11], DNC Id: [8], PCI Domain: [0000:b4:00.0]
 |--- PCIe Id: [10], DNC Id: [9], PCI Domain: [0000:b3:00.0]
 |--- PCIe Id:  [9], DNC Id: [10], PCI Domain: [0000:b2:00.0]
 |--- PCIe Id:  [8], DNC Id: [11], PCI Domain: [0000:b1:00.0]
 |--- PCIe Id: [15], DNC Id: [12], PCI Domain: [0000:dd:00.0]
 |--- PCIe Id: [14], DNC Id: [13], PCI Domain: [0000:dc:00.0]
 |--- PCIe Id: [13], DNC Id: [14], PCI Domain: [0000:db:00.0]
 |--- PCIe Id: [12], DNC Id: [15], PCI Domain: [0000:da:00.0]
-+- Id: [33], type: [Multi IPU]
 |--- PCIe Id:  [7], DNC Id: [0], PCI Domain: [0000:63:00.0]
 |--- PCIe Id:  [6], DNC Id: [1], PCI Domain: [0000:62:00.0]
 |--- PCIe Id:  [5], DNC Id: [2], PCI Domain: [0000:61:00.0]
 |--- PCIe Id:  [4], DNC Id: [3], PCI Domain: [0000:60:00.0]
 |--- PCIe Id:  [3], DNC Id: [4], PCI Domain: [0000:1d:00.0]
 |--- PCIe Id:  [2], DNC Id: [5], PCI Domain: [0000:1c:00.0]
 |--- PCIe Id:  [1], DNC Id: [6], PCI Domain: [0000:1b:00.0]
 |--- PCIe Id:  [0], DNC Id: [7], PCI Domain: [0000:1a:00.0]
-+- Id: [34], type: [Multi IPU]
 |--- PCIe Id: [11], DNC Id: [0], PCI Domain: [0000:b4:00.0]
 |--- PCIe Id: [10], DNC Id: [1], PCI Domain: [0000:b3:00.0]
 |--- PCIe Id:  [9], DNC Id: [2], PCI Domain: [0000:b2:00.0]
 |--- PCIe Id:  [8], DNC Id: [3], PCI Domain: [0000:b1:00.0]
 |--- PCIe Id: [15], DNC Id: [4], PCI Domain: [0000:dd:00.0]
 |--- PCIe Id: [14], DNC Id: [5], PCI Domain: [0000:dc:00.0]
 |--- PCIe Id: [13], DNC Id: [6], PCI Domain: [0000:db:00.0]
 |--- PCIe Id: [12], DNC Id: [7], PCI Domain: [0000:da:00.0]
-+- Id: [35], type: [Multi IPU]
 |--- PCIe Id:  [7], DNC Id: [0], PCI Domain: [0000:63:00.0]
 |--- PCIe Id:  [6], DNC Id: [1], PCI Domain: [0000:62:00.0]
 |--- PCIe Id:  [5], DNC Id: [2], PCI Domain: [0000:61:00.0]
 |--- PCIe Id:  [4], DNC Id: [3], PCI Domain: [0000:60:00.0]
-+- Id: [36], type: [Multi IPU]
 |--- PCIe Id:  [3], DNC Id: [0], PCI Domain: [0000:1d:00.0]
 |--- PCIe Id:  [2], DNC Id: [1], PCI Domain: [0000:1c:00.0]
 |--- PCIe Id:  [1], DNC Id: [2], PCI Domain: [0000:1b:00.0]
 |--- PCIe Id:  [0], DNC Id: [3], PCI Domain: [0000:1a:00.0]
-+- Id: [37], type: [Multi IPU]
 |--- PCIe Id: [11], DNC Id: [0], PCI Domain: [0000:b4:00.0]
 |--- PCIe Id: [10], DNC Id: [1], PCI Domain: [0000:b3:00.0]
 |--- PCIe Id:  [9], DNC Id: [2], PCI Domain: [0000:b2:00.0]
 |--- PCIe Id:  [8], DNC Id: [3], PCI Domain: [0000:b1:00.0]
-+- Id: [38], type: [Multi IPU]
 |--- PCIe Id: [15], DNC Id: [0], PCI Domain: [0000:dd:00.0]
 |--- PCIe Id: [14], DNC Id: [1], PCI Domain: [0000:dc:00.0]
 |--- PCIe Id: [13], DNC Id: [2], PCI Domain: [0000:db:00.0]
 |--- PCIe Id: [12], DNC Id: [3], PCI Domain: [0000:da:00.0]
-+- Id: [39], type: [Multi IPU]
 |--- PCIe Id:  [7], DNC Id: [0], PCI Domain: [0000:63:00.0]
 |--- PCIe Id:  [6], DNC Id: [1], PCI Domain: [0000:62:00.0]
-+- Id: [40], type: [Multi IPU]
 |--- PCIe Id:  [5], DNC Id: [0], PCI Domain: [0000:61:00.0]
 |--- PCIe Id:  [4], DNC Id: [1], PCI Domain: [0000:60:00.0]
-+- Id: [41], type: [Multi IPU]
 |--- PCIe Id:  [3], DNC Id: [0], PCI Domain: [0000:1d:00.0]
 |--- PCIe Id:  [2], DNC Id: [1], PCI Domain: [0000:1c:00.0]
-+- Id: [42], type: [Multi IPU]
 |--- PCIe Id:  [1], DNC Id: [0], PCI Domain: [0000:1b:00.0]
 |--- PCIe Id:  [0], DNC Id: [1], PCI Domain: [0000:1a:00.0]
-+- Id: [43], type: [Multi IPU]
 |--- PCIe Id: [11], DNC Id: [0], PCI Domain: [0000:b4:00.0]
 |--- PCIe Id: [10], DNC Id: [1], PCI Domain: [0000:b3:00.0]
-+- Id: [44], type: [Multi IPU]
 |--- PCIe Id:  [9], DNC Id: [0], PCI Domain: [0000:b2:00.0]
 |--- PCIe Id:  [8], DNC Id: [1], PCI Domain: [0000:b1:00.0]
-+- Id: [45], type: [Multi IPU]
 |--- PCIe Id: [15], DNC Id: [0], PCI Domain: [0000:dd:00.0]
 |--- PCIe Id: [14], DNC Id: [1], PCI Domain: [0000:dc:00.0]
-+- Id: [46], type: [Multi IPU]
 |--- PCIe Id: [13], DNC Id: [0], PCI Domain: [0000:db:00.0]
 |--- PCIe Id: [12], DNC Id: [1], PCI Domain: [0000:da:00.0]

Examples based on the listing above:

 # Create a single device with 1 IPU at PCI address 0000:1a:00.0 by using
# IPU configuration index 0
opts = create_ipu_config()
opts = select_ipus(opts, indices=[0])
ipu.utils.configure_ipu_system(opts)
with tf.Session() as s:
  ...
 # Create a single device with 1 IPU at PCI address 0000:b1:00.0 by using
# IPU configuration index 8
opts = create_ipu_config()
opts = select_ipus(opts, indices=[8])
ipu.utils.configure_ipu_system(opts)
with tf.Session() as s:
  ...
 # Create two Tensorflow devices, with one IPU each, being devices at
# indices 0 and 1
opts = create_ipu_config()
opts = select_ipus(opts, indices=[0, 1])
ipu.utils.configure_ipu_system(opts)
with tf.Session() as s:
  ...
 # Create two Tensorflow devices, with four IPUs each. The device
# configurations at indices 37 (0000:b4:00.0, 0000:b3:00.0, 0000:b2:00.0,
# 0000:b1:00.0) and 38 (0000:dd:00.0, 0000:dc:00.0, 0000:db:00.0,
# 0000:da:00.0)
opts = create_ipu_config()
opts = select_ipus(opts, indices=[37, 38])
ipu.utils.configure_ipu_system(opts)
with tf.Session() as s:
  ...
 # Create four Tensorflow devices each with one IPU, at addresses
# 0000:1a:00.0, 0000:1b:00.0, 0000:1c:00.0, 0000:1d:00.0.
opts = create_ipu_config()
opts = select_ipus(opts, indices=[0, 1, 2, 3])
ipu.utils.configure_ipu_system(opts)
with tf.Session() as s:
  ...
Parameters
  • opts – An IpuOptions session control protobuf.

  • indices – List of IPU configuration indices.

Returns

The IpuOptions configuration protobuf, with a number of devices selected by IPU configuration index.

tensorflow.python.ipu.utils.set_compilation_options(opts, compilation_options=None)

Set the IPU compilation options for the session.

 # Set Poplar compilation options, e.g. debug instrumentation and worker stack size
opts = create_ipu_config()
opts = set_compilation_options(opts,
    compilation_options={"debug.instrument": "true",
                         "target.workerStackSizeInBytes": "64"})
ipu.utils.configure_ipu_system(opts)
with tf.Session() as s:
  ...
Parameters
  • opts – An IpuOptions session control protobuf.

  • compilation_options – A dictionary of poplar compilation option flags to be sent to the executor.

Returns

The IpuOptions configuration protobuf, with engine compilation options set.

tensorflow.python.ipu.utils.set_convolution_options(opts, convolution_options=None)

Set the IPU convolution options for the session.

 # Set "availableMemoryProportion" flag to "0.1"
opts = create_ipu_config()
opts = set_convolution_options(opts,
    convolution_options={"availableMemoryProportion": "0.1"})
ipu.utils.configure_ipu_system(opts)
with tf.Session() as s:
  ...
Parameters
  • opts – An IpuOptions session control protobuf.

  • convolution_options – A dictionary of poplar option flags for convolutions. The “availableMemoryProportion” flag indicates the proportion of tile memory to be made available as temporary memory for convolutions (float between 0 and 1.0). Less temporary memory will generally result in a convolution that takes more cycles to complete. However, because always live memory (such as control code and vertex state) is not tracked when planning it, a convolution using less temporary memory may use more memory overall, due to an increase of always live memory.

Returns

The IpuOptions configuration protobuf, with convolution options set.

tensorflow.python.ipu.utils.set_floating_point_behaviour_options(opts, inv=True, div0=True, oflo=True, esr=True, nanoo=True)

Set the IPU floating point control behaviour bits

See the Poplar API documentation for poplar::FloatingPointBehaviour.

Parameters
  • inv – If true a floating point invalid operation (defined by IEEE 754) will cause an exception.

  • div0 – If true a floating point divide by zero operation will cause an exception.

  • oflo – If true a floating point overflow will cause an exception.

  • esr – Enable stochastic rounding.

  • nanoo – Enable Not-a-Number on overflow mode.
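
A configuration sketch enabling the exception bits and stochastic rounding before configuring the system:

 from tensorflow.python import ipu

opts = ipu.utils.create_ipu_config()
opts = ipu.utils.auto_select_ipus(opts, num_ipus=1)
# Trap invalid operations, divide-by-zero and overflow, and enable
# stochastic rounding and NaN-on-overflow (all default to True).
opts = ipu.utils.set_floating_point_behaviour_options(opts)
ipu.utils.configure_ipu_system(opts)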

tensorflow.python.ipu.utils.set_ipu_connection_type(opts, connection_type=None, ipu_version=None)

Configure when to attach to the device.

 # Compile without attaching to the device.
opts = create_ipu_config()
opts = set_ipu_connection_type(opts,
                               DeviceConnectionType.ON_DEMAND)
ipu.utils.configure_ipu_system(opts)
with tf.Session() as s:
  ...
Parameters
  • opts – An IpuOptions session control protobuf.

  • connection_type – One of DeviceConnectionType. Defaults to DeviceConnectionType.ALWAYS if None.

  • ipu_version – Version of the IPU hardware used. Required if the connection_type provided is DeviceConnectionType.NEVER.

Returns

The IpuOptions configuration protobuf.

tensorflow.python.ipu.utils.set_ipu_model_options(opts, compile_ipu_code=True)

Set the IPU Model options.

Parameters

compile_ipu_code – Whether or not to actually compile real IPU code for modelling.

Returns

The IpuOptions configuration protobuf, with IPU model options set.

tensorflow.python.ipu.utils.set_matmul_options(opts, matmul_options=None, clear_pass_type=False)

Set the IPU matrix multiplication options for the session.

 # Set "availableMemoryProportion" flag to "0.5"
opts = create_ipu_config()
opts = set_matmul_options(opts,
    matmul_options={"availableMemoryProportion": "0.5"})
ipu.utils.configure_ipu_system(opts)
with tf.Session() as s:
  ...
Parameters
  • opts – An IpuOptions session control protobuf.

  • matmul_options – A dictionary containing the poplar option flag “availableMemoryProportion” for the matrix multiplication operations. It indicates the proportion of tile memory to be made available as temporary memory for the matrix multiplications (float between 0 and 1.0). Less temporary memory will generally result in a multiplication that takes more cycles to complete. However, because always live memory (like code and vertex state) is not tracked when planning it, a multiplication using less temporary memory may use more memory overall, due to an increase of always live memory.

  • clear_pass_type – When set to True, the Pass type will not be set in the options passed to the poplar operation.

Returns

The IpuOptions configuration protobuf, with matmul options set.

tensorflow.python.ipu.utils.set_norm_options(opts, use_stable_statistics=False)

Set the IPU options related to norms.

Parameters

use_stable_statistics – If True, computes the mean first and subtracts it from the activations before computing the variance. The implementation with this flag set to True is slower than when set to False.

Returns

The IpuOptions configuration protobuf.

tensorflow.python.ipu.utils.set_optimization_options(opts, combine_embedding_lookups=False, combine_matmuls=False, max_cross_replica_sum_buffer_size=0, max_reduce_scatter_buffer_size=0, max_inter_ipu_copies_buffer_size=0, max_send_recv_cluster_size=0, gather_simplifier=False)

Set the IPU options related to performance / optimizations.

 # Create a device with fusion for multiSlices sharing the same input
# enabled.
opts = create_ipu_config()
opts = set_optimization_options(opts,
                                combine_embedding_lookups=True)
ipu.utils.configure_ipu_system(opts)
with tf.Session() as s:
  ...
Parameters
  • combine_embedding_lookups – Fuse embedding lookups on the same tensor. This might improve performance but increase memory usage.

  • combine_matmuls – Fuse matmul operations if they share the same weights or the same input.

  • max_cross_replica_sum_buffer_size – The maximum number of bytes that can be waiting before a cross replica sum op is scheduled.

  • max_reduce_scatter_buffer_size – The maximum number of bytes that can be waiting before a reduce scatter op is scheduled.

  • max_inter_ipu_copies_buffer_size – The maximum number of bytes that can be waiting before an inter-IPU copy is scheduled.

  • max_send_recv_cluster_size – The maximum number of bytes that can be waiting before a cluster of send/recv instructions to/from the host is scheduled. These are lowered to stream copies that can be merged by Poplar.

  • gather_simplifier – Will enable more aggressive optimisation for embedding lookups.

Returns

The IpuOptions configuration protobuf.

tensorflow.python.ipu.utils.set_pooling_options(opts, pooling_options=None)

Set the IPU pooling compilation options for the session.

 # Set "poolUseIntrospectiveMapping" flag to "false"
opts = create_ipu_config()
opts = set_pooling_options(opts,
    pooling_options={"poolUseIntrospectiveMapping": "false"})
ipu.utils.configure_ipu_system(opts)
with tf.Session() as s:
  ...
Parameters
  • opts – An IpuOptions session control protobuf.

  • pooling_options – A dictionary of poplar option flags for the pooling operation.

Returns

The IpuOptions configuration protobuf, with pooling options set.

tensorflow.python.ipu.utils.set_recomputation_options(opts, allow_recompute=True, allow_stateful_recompute=True)

Set re-computation options.

Parameters
  • allow_recompute – Whether or not to re-compute instructions during training. If this is enabled then we will attempt to pattern match instructions/pipeline stages in the forward pass and recompute them in the backward pass to avoid having to preserve activations which increase the maximum memory liveness. Enabling this option can reduce memory usage at the expense of extra computation.

  • allow_stateful_recompute – Whether or not to extend the re-compute of pipeline stages to stages containing stateful operations (Has no effect if allow_recompute is False).

Returns

The IpuOptions configuration protobuf.
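
For example, recomputation could be enabled before configuring the system (a minimal sketch in the same style as the examples above):

 opts = create_ipu_config()
opts = set_recomputation_options(opts, allow_recompute=True)
ipu.utils.configure_ipu_system(opts)
with tf.Session() as s:
  ...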

tensorflow.python.ipu.utils.set_report_options(opts, report_options=None, graph_options=None, execution_options=None)

Set the options used to influence Poplar graph and execution report generation. (deprecated arguments)

Warning: SOME ARGUMENTS ARE DEPRECATED: (report_options). They will be removed in a future version. Instructions for updating: report_options is deprecated, use graph_options and execution_options instead.

 opts = create_ipu_config()
opts = set_report_options(opts,
    report_options={"reportOption1": "false"},
    graph_options={"graphOptions": "false"},
    execution_options={"executionOptions": "false"})
ipu.utils.configure_ipu_system(opts)
with tf.Session() as s:
  ...
Parameters
  • opts – An IpuOptions session control protobuf.

  • report_options – (Deprecated) A dictionary of poplar option flags for the report generation.

  • graph_options – A dictionary of poplar option flags for the graph report generation.

  • execution_options – A dictionary of poplar option flags for the execution report generation.

Returns

The IpuOptions configuration protobuf, with report options set.

tensorflow.python.ipu.utils.set_serialization_options(opts, output_folder='')

Enable / disable the serialization to disk of the compiled executables.

 # Create a device that will save to disk all the compiled executables.
opts = create_ipu_config()
opts = set_serialization_options(opts,
                                output_folder="/tmp/my_network")
ipu.utils.configure_ipu_system(opts)
with tf.Session() as s:
  ...
Parameters

output_folder – Where to save the compiled executables. Set to “” to disable serialization.

Returns

The IpuOptions configuration protobuf.

Popops all to all and all gather operators

tensorflow.python.ipu.ops.all_to_all_op.all_gather(x, replication_factor, name)

Gather the data on all replicas to all other replicas. Each replica will have the exact same output.

Parameters
  • x – The tensor to gather

  • replication_factor – The replication factor of the model.

  • name – Optional op name.

Returns

A tensor of [num_replicas][x] with each replica having the same tensor.
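
As an illustration, the sketch below gathers a per-replica tensor so that every replica sees all values; it assumes a replication factor of 4 and that the replicated IPU device has been configured elsewhere:

 from tensorflow.python.ipu.ops import all_to_all_op

REPLICATION_FACTOR = 4  # assumed replication factor of the model

def my_net(x):
  # Every replica receives the same [REPLICATION_FACTOR, ...] tensor
  # containing the value of x from each replica.
  return all_to_all_op.all_gather(x, REPLICATION_FACTOR, name="all_gather")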

tensorflow.python.ipu.ops.all_to_all_op.all_to_all(x, split_dimension, concat_dimension, replication_factor, name=None)

Perform an XLA all to all operation across all replicas (https://www.tensorflow.org/xla/operation_semantics#alltoall)

Parameters
  • x – The input tensor.

  • split_dimension – A value in the interval [0,n) that names the dimension along which the operand is split.

  • concat_dimension – A value in the interval [0,n) that names the dimension along which the split blocks are concatenated.

  • replication_factor – The replication factor of the model.

  • name – Optional op name.

Returns

A tensor of the same size where each replica will have a different value.
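
A minimal sketch of the call, assuming a replication factor of 2 and an already-configured replicated device; here x is split along dimension 0 and the received blocks are concatenated along dimension 1:

 from tensorflow.python.ipu.ops import all_to_all_op

REPLICATION_FACTOR = 2  # assumed replication factor of the model

def my_net(x):
  return all_to_all_op.all_to_all(x,
                                  split_dimension=0,
                                  concat_dimension=1,
                                  replication_factor=REPLICATION_FACTOR)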

Popops embedding operators

class tensorflow.python.ipu.ops.embedding_ops.HostEmbedding(name, embedding_tensor, optimizer_spec=None)

Host Embedding wrapper.

HostEmbedding encapsulates the embedding tensor and the additional meta-data required to coordinate the host embedding and the device lookup. Through an instance of this class, an IPU can perform lookups on an embedding that resides on the host.

It is assumed that the given embedding will be rank two, where the outermost dimension (dimension zero) is the token dimension, and the innermost dimension is the encoding dimension.

Parameters
  • name – The name which uniquely identifies the embedding.

  • embedding_tensor – The tensor which holds the embedding.

  • optimizer_spec – A description of how the embedding will be optimized. When None, the embedding is assumed to not be trainable.

lookup(indices, count=1, clip_indices=True)

Perform a host embedding lookup on an IPU.

Parameters
  • indices – The indices to lookup.

  • count – The number of times, per iteration, that this op will be executed.

  • clip_indices – Whether to enforce the valid range on the lookup indices by clipping. When False, out-of-range values have undefined behaviour.

Returns

A Tensor containing the elements requested by the user indices.

class tensorflow.python.ipu.ops.embedding_ops.HostEmbeddingOptimizerSpec(learning_rate)

Description of the Host Embedding optimizer.

Despite the embedding living on the host, we want to compute the gradients on the device. Additionally, the communication channel between the device and host is opaque to TensorFlow. For these reasons we need to describe the optimiser parameters separately.

Currently only supports SGD.

Parameters

learning_rate – The SGD learning rate.

tensorflow.python.ipu.ops.embedding_ops.embedding_lookup(params, ids, name=None, one_hot_threshold=0, min_encoding_size=1216)

Looks up ids in a list of embedding tensors. (deprecated arguments)

Warning: SOME ARGUMENTS ARE DEPRECATED: (min_encoding_size, one_hot_threshold). They will be removed in a future version. Instructions for updating: stop passing this argument.

This is designed to be a drop-in replacement for the typical use cases with tf.nn.embedding_lookup for the IPU.

Parameters
  • params – A single tensor representing the complete embedding tensor.

  • ids – A Tensor with type int32 containing the slices to be extracted from params.

  • name – A name for the operation.

  • one_hot_threshold – The threshold below which the embedding lookup will become a one-hot with matmul.

  • min_encoding_size – The minimum encoding size for the embedding. This is used to decide whether to split the embedding tensor.

Returns

A Tensor with the same type as the tensors in params.
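
As a sketch of typical use, the lookup can replace tf.nn.embedding_lookup inside an IPU-compiled function; the table shape below is purely illustrative:

 import tensorflow as tf
from tensorflow.python.ipu.ops import embedding_ops

def my_net(ids):
  # ids is an int32 tensor of token indices.
  table = tf.get_variable("embedding", shape=[10000, 128],
                          dtype=tf.float32)
  return embedding_ops.embedding_lookup(table, ids)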

Popnn normalization operators

tensorflow.python.ipu.ops.normalization_ops.group_norm(inputs, groups=2, channels_axis=- 1, reduction_axes=None, center=True, scale=True, epsilon=1e-06, param_initializers=None, reuse=None, variables_collections=None, training=True, trainable=True, scope=None)

Functional interface for the group normalization layer. (deprecated arguments)

Warning: SOME ARGUMENTS ARE DEPRECATED: (reduction_axes). They will be removed in a future version. Instructions for updating: reduction_axes is deprecated as it has no effect.

Reference: https://arxiv.org/abs/1803.08494.

“Group Normalization”, Yuxin Wu, Kaiming He

Parameters
  • inputs – A Tensor with at least two dimensions, one of which is the channels dimension. All shape dimensions must be fully defined.

  • groups – Integer. Divide the channels into this number of groups over which normalization statistics are computed. This number must be commensurate with the number of channels in inputs.

  • channels_axis – An integer. Specifies the index of the channels axis, which will be broken into groups; normalization statistics are computed across each group. Preferred usage is to specify negative integers to be agnostic as to whether a batch dimension is included.

  • reduction_axes – Deprecated.

  • center – If True, add offset of beta to normalized tensor. If False, beta is ignored.

  • scale – If True, multiply by gamma. If False, gamma is not used. When the next layer is linear (also e.g. nn.relu), this can be disabled since the scaling can be done by the next layer.

  • epsilon – Small float added to variance to avoid dividing by zero.

  • param_initializers – Optional initializers for beta and gamma.

  • reuse – Whether or not the layer and its variables should be reused. To be able to reuse the layer scope must be given.

  • variables_collections – Optional collections for the variables.

  • training – Whether this operation is being used in a training network.

  • trainable – If True also add variables to the graph collection GraphKeys.TRAINABLE_VARIABLES (see tf.Variable).

  • scope – Optional scope for variable_scope.

Returns

A Tensor representing the output of the operation.

Raises
  • ValueError – If the rank of inputs is undefined.

  • ValueError – If rank or channels dimension of inputs is undefined.

  • ValueError – If channels dimension is not 1 or 3.

  • ValueError – If number of groups is not commensurate with number of channels.
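
For example, group normalization might be applied to a NHWC activation tensor as follows (a sketch; the layer sizes are illustrative and the channel count must be divisible by groups):

 import tensorflow as tf
from tensorflow import keras
from tensorflow.python.ipu.ops import normalization_ops

def my_net(images):
  x = keras.layers.Conv2D(64, 3, padding="same")(images)
  # Normalize over 8 groups of 8 channels each.
  x = normalization_ops.group_norm(x, groups=8, channels_axis=-1)
  return tf.nn.relu(x)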

tensorflow.python.ipu.ops.normalization_ops.instance_norm(inputs, channels_axis=- 1, reduction_axes=None, center=True, scale=True, epsilon=1e-06, param_initializers=None, reuse=None, variables_collections=None, training=True, trainable=True, scope=None)

Functional interface for the instance normalization layer. (deprecated arguments)

Warning: SOME ARGUMENTS ARE DEPRECATED: (reduction_axes). They will be removed in a future version. Instructions for updating: reduction_axes is deprecated as it has no effect.

Reference: https://arxiv.org/abs/1607.08022.

“Instance Normalization: The Missing Ingredient for Fast Stylization” Dmitry Ulyanov, Andrea Vedaldi, Victor Lempitsky

Instance normalization will generate normalization statistics across the spatial (X,Y,…) dimensions. Each slice along the feature channels dimension (C) is normalized independently. It is equivalent to a group normalization where the number of groups is the same as the size of the feature channels dimension.

Parameters
  • inputs – A Tensor with at least two dimensions, one of which is the channels dimension. All shape dimensions must be fully defined.

  • channels_axis – An integer. Specifies index of channels axis. Preferred usage is to specify negative integers to be agnostic as to whether a batch dimension is included.

  • reduction_axes – Deprecated.

  • center – If True, add offset of beta to normalized tensor. If False, beta is ignored.

  • scale – If True, multiply by gamma. If False, gamma is not used. When the next layer is linear (also e.g. nn.relu), this can be disabled since the scaling can be done by the next layer.

  • epsilon – Small float added to variance to avoid dividing by zero.

  • param_initializers – Optional initializers for beta and gamma.

  • reuse – Whether or not the layer and its variables should be reused. To be able to reuse the layer scope must be given.

  • variables_collections – Optional collections for the variables.

  • training – Whether this operation is being used in a training network.

  • trainable – If True also add variables to the graph collection GraphKeys.TRAINABLE_VARIABLES (see tf.Variable).

  • scope – Optional scope for variable_scope.

Returns

A Tensor representing the output of the operation.

Raises
  • ValueError – If data_format is neither NHWC nor NCHW.

  • ValueError – If the rank of inputs is undefined.

  • ValueError – If rank or channels dimension of inputs is undefined.

tensorflow.python.ipu.ops.normalization_ops.layer_norm(inputs, channels_axis=- 1, reduction_axes=None, center=True, scale=True, epsilon=1e-06, param_initializers=None, reuse=None, variables_collections=None, training=True, trainable=True, scope=None)

Adds a Layer Normalization layer. (deprecated arguments)

Warning: SOME ARGUMENTS ARE DEPRECATED: (reduction_axes). They will be removed in a future version. Instructions for updating: reduction_axes is deprecated as it has no effect.

Based on the paper:

“Layer Normalization”

Jimmy Lei Ba, Jamie Ryan Kiros, Geoffrey E. Hinton

https://arxiv.org/abs/1607.06450.

Layer normalization will generate normalization statistics across the spatial (X,Y,…) dimensions and the feature channels dimension (C). It is equivalent to a group normalization where all of the features in the feature channels dimension are put into a single group.

The shapes of beta and gamma are inputs.shape[begin_params_axis:], and this part of the inputs’ shape must be fully defined.

Parameters
  • inputs – A Tensor with at least two dimensions, one of which is the channels dimension. All shape dimensions must be fully defined.

  • channels_axis – An integer. Specifies index of channels axis. Preferred usage is to specify negative integers to be agnostic as to whether a batch dimension is included.

  • reduction_axes – Deprecated.

  • center – If True, add offset of beta to normalized tensor. If False, beta is ignored.

  • scale – If True, multiply by gamma. If False, gamma is not used. When the next layer is linear (also e.g. nn.relu), this can be disabled since the scaling can be done by the next layer.

  • epsilon – Small float added to variance to avoid dividing by zero.

  • param_initializers – Optional initializers for beta and gamma.

  • reuse – Whether or not the layer and its variables should be reused. To be able to reuse the layer scope must be given.

  • variables_collections – Optional collections for the variables.

  • training – Whether this operation is being used in a training network.

  • trainable – If True also add variables to the graph collection GraphKeys.TRAINABLE_VARIABLES (see tf.Variable).

  • scope – Optional scope for variable_scope.

Returns

A Tensor representing the output of the operation, having the same shape and dtype as inputs.

Raises

ValueError – If the rank of inputs is not known at graph build time, or if inputs.shape[begin_params_axis:] is not fully defined at graph build time.

Pipelining operators

class tensorflow.python.ipu.ops.pipelining_ops.OptimizerFunctionOutput(opt, loss)

A helper class used for returning a structured output from an optimizer_function in a pipeline.

__init__(opt, loss)

Creates an OptimizerFunctionOutput object.

Parameters
  • opt – An instance of optimizer.Optimizer which is used to generate the back-propagation and the weight update pipeline stages.

  • loss – The loss which is passed to the optimizer.

class tensorflow.python.ipu.ops.pipelining_ops.PipelineSchedule

The PipelineSchedule describes how stages are interleaved on the IPUs servicing the pipeline. The forward and backward passes of each stage will execute on the same IPUs. So, in the core of the pipeline, there is a choice as to whether to run the forward stages together, or to interleave the forward and backward stages.

Grouped

This groups the forward passes on multiple IPUs. This requires more memory since activations need to be stored until the backward stages run together. However, since forward passes tend to be smaller than backward passes, Grouped tends to improve the speed of the execution, as different IPUs don’t spend so much time waiting for each other.

Interleaved

This schedules the backward passes whenever the forward passes have just generated some activations. Consequently fewer activations are required to be stored between the forward and backward pipeline stages, so less memory is required. However, since forward and backward stages tend to be very different in terms of execution cycles, the overall performance of the pipeline tends to be slower.

Sequential

This is a debug mode, where the pipeline is scheduled in the same way as if it were a sharded model.
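
The schedule is selected through the pipeline_schedule argument of pipelining_ops.pipeline. A minimal sketch, assuming the stage functions and queues from the pipeline example later in this section:

 pipeline_op = pipelining_ops.pipeline(
    computational_stages=[stage1, stage2],
    pipeline_depth=250,
    infeed_queue=infeed_queue,
    outfeed_queue=outfeed_queue,
    # Trade throughput for lower memory usage by interleaving the
    # forward and backward stages.
    pipeline_schedule=pipelining_ops.PipelineSchedule.Interleaved,
    name="Pipeline")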

class tensorflow.python.ipu.ops.pipelining_ops.PipelineStageOptions(convolution_options=None, matmul_options=None)

A helper class which can be used to configure Poplar compilation options (such as ‘availableMemoryProportion’) inside a pipeline forward, backward and weight update stage. This will override the global options set by ipu.utils.set_convolution_options and ipu.utils.set_matmul_options.
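
For instance, the available memory proportion for the matmuls of the second forward stage could be lowered while leaving the first stage untouched (a sketch, assuming the two stage functions and queues from the pipeline example below; the value 0.3 is illustrative):

 stage_options = [
    pipelining_ops.PipelineStageOptions(),
    pipelining_ops.PipelineStageOptions(
        matmul_options={"availableMemoryProportion": "0.3"}),
]

pipeline_op = pipelining_ops.pipeline(
    computational_stages=[stage1, stage2],
    pipeline_depth=250,
    infeed_queue=infeed_queue,
    outfeed_queue=outfeed_queue,
    forward_propagation_stages_poplar_options=stage_options,
    name="Pipeline")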

__init__(convolution_options=None, matmul_options=None)

Creates a PipelineStageOptions object.

Parameters
  • convolution_options – If provided, a dictionary of Poplar option flags for all the convolution operations in the stage.

  • matmul_options – If provided, a dictionary of Poplar option flags for all the matmul operations in the stage.

tensorflow.python.ipu.ops.pipelining_ops.pipeline(computational_stages, pipeline_depth, repeat_count=1, inputs=None, infeed_queue=None, outfeed_queue=None, optimizer_function=None, device_mapping=None, pipeline_schedule=None, forward_propagation_stages_poplar_options=None, backward_propagation_stages_poplar_options=None, weight_update_poplar_options=None, offload_weight_update_variables=True, continuous_weight_updates=False, outfeed_loss=False, name=None)

Sets up a series of computational stages, where the outputs of one stage are the inputs to the next one. These stages are then executed in parallel across multiple IPUs. This approach can be used to split the model where layer(s) are executed on different IPUs.

The first stage takes the inputs and the infeed_queue (if provided) as its inputs. If the infeed_queue is provided, it is automatically dequeued (similar to the ipu.loops API) therefore care needs to be taken to make sure the signature of the first pipeline stage matches both the arguments from inputs and the infeed_queue, otherwise an error is thrown.

All tensors which are used in the pipeline which are not TensorFlow Variables need to be explicitly passed as inputs to the pipeline. If an input does not change its value during the execution of the pipeline op (for example hyperparameters such as learning rate), it needs to be passed as part of inputs. Alternatively, if these values change during execution (for example the model processes different batches of data) the input should be passed through the infeed_queue (see ipu.ipu_infeed_queue.IPUInfeedQueue).

When training a model, an optional optimizer_function function can be provided. This function takes all the outputs from the last computational stage as inputs, and returns an instance of OptimizerFunctionOutput that is used to generate the backwards pass of the model using the TensorFlow Optimizer API. This will internally create corresponding backpropagation pipeline stages for each pipeline stage and colocate them such that the activations and weights required for the gradient calculation and application stay on the device in order to minimise the number of copies between IPUs.

Note that the gradients, which are calculated by the compute_gradients function, will be accumulated automatically during the execution of the pipeline, unless continuous_weight_updates is enabled.

If the last computational stage has any outputs, then an outfeed_queue (see ipu.ipu_outfeed_queue.IPUOutfeedQueue) is required and all the outputs from the last computational stage are enqueued to the outfeed_queue.

Note that pipelining also supports recomputation, to enable it, use the tensorflow.ipu.utils.set_recomputation_options() function when configuring the device.

For example a simple inference network for the MNIST can be split across two IPUs:

 from tensorflow import keras

# Create the dataset
#...

# Create the data queues from/to IPU.
infeed_queue = ipu_infeed_queue.IPUInfeedQueue(dataset, "infeed")
outfeed_queue = ipu_outfeed_queue.IPUOutfeedQueue("outfeed")

# Create a pipelined model which is split across two stages.
def stage1(image):
  partial = keras.layers.Dense(256, activation=tf.nn.relu)(image)
  partial = keras.layers.Dense(128, activation=tf.nn.relu)(partial)
  return partial

def stage2(partial):
  logits = keras.layers.Dense(10)(partial)
  probabilities = tf.nn.softmax(logits)
  classes = tf.argmax(input=logits, axis=1)
  return probabilities, classes

def model():
  with variable_scope.variable_scope("vs", use_resource=True):
    pipeline_op = pipelining_ops.pipeline(
                      computational_stages=[stage1, stage2],
                      pipeline_depth=250,
                      repeat_count=2,
                      inputs=[],
                      infeed_queue=infeed_queue,
                      outfeed_queue=outfeed_queue,
                      device_mapping=[3,1],
                      name="Pipeline")
  return pipeline_op

with ops.device("/device:IPU:0"):
  compiled_model = ipu_compiler.compile(model, inputs=[])

outfeed_op = outfeed_queue.dequeue()
with tf.Session() as sess:
  result = sess.run(compiled_model)
  probabilities, classes = sess.run(outfeed_op)

In this setup, the model is split across two IPUs. By default the first two layers would be executed on the first IPU and the third layer and the probabilities and classes on the second IPU, but here device_mapping is used to override the default IPU allocation, so the first two layers will be executed on the fourth IPU and the third layer and the probabilities and classes on the second IPU.

This creates a pipeline of depth 250 (specified by the pipeline_depth), which means each pipeline stage is executed 250 times.

This pipeline is then executed 2 times (specified by repeat_count). The results of the pipeline (probabilities and classes) are returned to the host by the outfeed queue.

We can also train this network by providing optimizer_function:

 from tensorflow import keras

# Create the dataset
#...

# Create the data queues from/to IPU.
infeed_queue = ipu_infeed_queue.IPUInfeedQueue(dataset, "infeed")
outfeed_queue = ipu_outfeed_queue.IPUOutfeedQueue("outfeed")

# Create a pipelined model which is split across two stages.
def stage1(lr, images, labels):
  partial = keras.layers.Dense(256, activation=tf.nn.relu)(images)
  partial = keras.layers.Dense(128, activation=tf.nn.relu)(partial)
  return lr, partial, labels

def stage2(lr, partial, labels):
  logits = keras.layers.Dense(10)(partial)
  cross_entropy = tf.nn.sparse_softmax_cross_entropy_with_logits(
                        labels=labels, logits=logits)
  loss = tf.reduce_mean(cross_entropy)
  return lr, loss

def optimizer_function(lr, loss):
  optimizer = tf.train.GradientDescentOptimizer(lr)
  return pipelining_ops.OptimizerFunctionOutput(optimizer, loss)

def model(lr):
  with variable_scope.variable_scope("vs", use_resource=True):
    pipeline_op = pipelining_ops.pipeline(
                      computational_stages=[stage1, stage2],
                      pipeline_depth=128,
                      repeat_count=10,
                      inputs=[lr],
                      infeed_queue=infeed_queue,
                      outfeed_queue=outfeed_queue,
                      optimizer_function=optimizer_function,
                      name="Pipeline")
  return pipeline_op

with ops.device('cpu'):
  lr = tf.placeholder(np.float16, [])

with ops.device("/device:IPU:0"):
  compiled_model = ipu_compiler.compile(model, inputs=[lr])

outfeed_op = outfeed_queue.dequeue()
with tf.Session() as sess:
  result = sess.run(compiled_model, {lr: 0.01})
  losses = sess.run(outfeed_op)

Here the tf.train.GradientDescentOptimizer generates the pipeline stages which calculate the gradients and apply them to the weights. Note how the loss is returned to the host by the outfeed queue.

If a model requires multiple computational pipeline stages to access the same tf.Variable, then all of these computational stages need to be placed on the same IPU using the device_mapping argument.

Note that modifying tf.Variable values in a pipeline stage and/or during the gradient calculation will result in undefined behavior. These variables can only be modified by the apply_gradients member function of the applied Optimizer.

Parameters
  • computational_stages – a list of python functions, where each function represents a computational pipeline stage. The function takes the outputs of the previous pipeline stage as its inputs.

  • pipeline_depth – the number of times each pipeline stage will be executed.

  • repeat_count – the number of times the pipeline will be executed.

  • inputs – arguments passed to the first pipeline stage.

  • infeed_queue – optional IPUInfeedQueue, if passed, it is dequeued and passed as an input in the first pipeline stage.

  • outfeed_queue – IPUOutfeedQueue, required if the last computational stage has any outputs. The outputs of these are enqueued to this queue and they can be accessed on the host.

  • optimizer_function – optional Python function which takes the output of the last computational stage as parameters and returns an instance of pipelining_ops.OptimizerFunctionOutput in order to generate the back-propagation and weight-update parts of the model suitable for training.

  • device_mapping – If provided, a list of length equal to the number of computational stages. An element at index i in the list represents which IPU the computational stage computational_stages[i] should reside on. This can be used to make sure computational stages which share `tf.Variable`s are resident on the same IPU.

  • pipeline_schedule – Which scheduling algorithm to use for pipeline lowering. Defaults to PipelineSchedule.Grouped.

  • forward_propagation_stages_poplar_options – If provided, a list of length equal to the number of computational stages. Each element is a PipelineStageOptions object which allows for fine grain control of the Poplar options for a given forward propagation computational stage.

  • backward_propagation_stages_poplar_options – If provided, a list of length equal to the number of computational stages. Each element is a PipelineStageOptions object which allows for fine grained control of the Poplar options for a given backward propagation computational stage.

  • weight_update_poplar_options – If provided, a PipelineStageOptions object which allows for fine grained control of the Poplar options for the weight update stage.

  • offload_weight_update_variables – Not supported in SDK 1.1.

  • continuous_weight_updates – ** CURRENTLY UNIMPLEMENTED ** When training, this option will apply the gradients to the resource variables immediately, rather than accumulating the gradients and applying them at the end of each execution of the pipeline.

  • outfeed_loss – If True, the loss given by the optimizer_function will be enqueued on the outfeed, instead of the outputs from the last computational stage.

  • name – name of this pipeline.

Returns

An Operation that executes the pipeline.

Popops reduce scatter operator

tensorflow.python.ipu.ops.reduce_scatter_op.reduce_scatter(x, replication_factor, name=None)

Reduce (sum) the given replicated tensor with the result scattered across the replicas. For an input of shape [num_elements], the output will have shape [ceil(num_elements / replication_factor)]. If replication_factor does not evenly divide num_elements, the result is zero-padded. Example:

 Input:  Replica0: [x0, y0, z0]
        Replica1: [x1, y1, z1]
Output: Replica0: [x0 + x1, y0 + y1]
        Replica1: [z0 + z1, 0]
Parameters
  • x – The input Tensor. Must have rank 1.

  • replication_factor – The replication factor of the model.

  • name – Optional op name.

Returns

A Tensor with the result for this replica.
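
A minimal sketch of the call, assuming a replication factor of 2 and an already-configured replicated device; each replica receives a chunk of the summed, zero-padded result:

 from tensorflow.python.ipu.ops import reduce_scatter_op

REPLICATION_FACTOR = 2  # assumed replication factor of the model

def my_net(x):
  # x must be rank 1; the output has
  # ceil(num_elements / REPLICATION_FACTOR) elements per replica.
  return reduce_scatter_op.reduce_scatter(x, REPLICATION_FACTOR)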

Popnn recurrent operators

class tensorflow.python.ipu.ops.rnn_ops.PopnnLSTM(num_units, dtype=tf.float32, partials_dtype=tf.float32, seed=None, weights_initializer=None, bias_initializer=None, name=None)

XLA compatible, time-major Popnn implementation of an LSTM layer.

Below is a typical workflow:

 with tf.Graph().as_default():
  lstm = PopnnLSTM(num_units, ...)

  outputs, output_states = lstm(inputs, initial_states, training=True)
build(input_shape)

Create variables of the PopnnLSTM.

It can be called manually before __call__() or automatically through __call__(). In the former case, any subsequent __call__() will skip creating variables.

Parameters

input_shape – a TensorShape object with 3 dimensions.

Raises

ValueError – if input_shape has wrong dimension or unknown 3rd dimension.

call(inputs, initial_state=None, training=True)

Runs the forward step for the LSTM model.

Parameters
  • inputs – 3-D tensor with shape [time_len, batch_size, input_size].

  • initial_state – An LSTMStateTuple of state tensors, each shaped [batch_size, num_units]. If not provided, the state is initialized to zeros. DEPRECATED a tuple of tensor (input_h_state, input_c_state) each of shape [batch_size, num_units].

  • training – whether this operation will be used in training or inference.

Returns
  • output: a tensor of shape [time_len, batch_size, num_units].

  • output_states: an LSTMStateTuple of the same shape and structure as initial_state. If the initial state used the deprecated behaviour of not passing LSTMStateTuple, then a tuple (output_h_state, output_c_state) is returned.

Return type

tuple of output and output states

Raises

ValueError – if initial_state is not valid.

state_shape(batch_size)

Shape of Popnn LSTM states.

Shape is a 2-element tuple. Each is [batch_size, num_units]

Parameters

batch_size – an int

Returns

a tuple of python arrays.

class tensorflow.python.ipu.ops.rnn_ops.PopnnGRU(num_units, dtype=tf.float32, partials_dtype=tf.float32, seed=None, weights_initializer=None, bias_initializer=None, name=None)

XLA compatible, time-major Popnn implementation of a GRU layer.

Below is a typical workflow:

 with tf.Graph().as_default():
  gru = PopnnGRU(num_units, ...)

  outputs, output_state = gru(inputs, initial_state, training=True)
build(input_shape)

Create variables of the PopnnGRU.

It can be called manually before __call__() or automatically through __call__(). In the former case, any subsequent __call__() will skip creating variables.

Parameters

input_shape – a TensorShape object with 3 dimensions.

Raises

ValueError – if input_shape has wrong dimension or unknown 3rd dimension.

call(inputs, initial_state=None, training=True)

Runs the forward step for the GRU model.

Parameters
  • inputs – 3-D tensor with shape [time_len, batch_size, input_size].

  • initial_state – Initial state tensor, shaped [batch_size, num_units]. If not provided, the state is initialized to zeros.

  • training – whether this operation will be used in training or inference.

Returns
  • output: a tensor of shape [time_len, batch_size, num_units].

  • output_state: the output state of the last cell.

Return type

tuple of output and output state

Raises

ValueError – if initial_state is not valid.

state_shape(batch_size)

Shape of Popnn GRU state.

State shape is [batch_size, num_units].

Parameters

batch_size – an int

Returns

A python array.

Popnn random operators

tensorflow.python.ipu.ops.rand_ops.dropout(x, seed=None, rate=0.5, scale=1, seed_modifier=1, name=None)

This targets the poplibs popnn dropout operation, optimized for execution on the IPU.

Parameters
  • x – The input tensor.

  • rate – The probability that a given element will be zeroed out.

  • scale – An optional factor to apply to all other elements.

  • seed_modifier – An optional parameter given to poplar which uses it to modify the seed.

  • name – Optional op name.

Returns

A Tensor with some elements set to zero, randomly selected based on the other parameters.
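
For example, the IPU dropout op can be used in place of tf.nn.dropout inside an IPU-compiled function (a sketch; the layer size is illustrative):

 import tensorflow as tf
from tensorflow import keras
from tensorflow.python.ipu.ops import rand_ops

def my_net(x):
  x = keras.layers.Dense(256, activation=tf.nn.relu)(x)
  # Zero each element with probability 0.5.
  return rand_ops.dropout(x, rate=0.5)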

Popops cross replica operators

tensorflow.python.ipu.ops.cross_replica_ops.cross_replica_sum(x, name=None)

Sum the input tensor across replicas.

Parameters
  • x – The local tensor to sum.

  • name – Optional op name.

Returns

A Tensor which is summed across replicas.
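
A sketch of summing a per-replica loss across replicas (the replicated device configuration is assumed to be set up elsewhere):

 from tensorflow.python.ipu.ops import cross_replica_ops

def my_net(local_loss):
  # Every replica receives the sum of local_loss over all replicas.
  return cross_replica_ops.cross_replica_sum(local_loss)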

Summary operations for IPUs

tensorflow.python.ipu.ops.summary_ops.get_ipu_reports()

Extracts all reports and converts them from an EagerTensor to an array of events.

Parameters

None

Returns

A two dimensional numpy.ndarray of IPUTraceEvents protobufs.

tensorflow.python.ipu.ops.summary_ops.ipu_compile_summary(name, op_list, collections=None)

Create an IPU compiler summary operation.

Parameters
  • name – A name for the summary.

  • op_list – An operation or list of operations to make this summary dependent upon.

  • collections – Optional collections to add the summary into.

Returns

The new summary operation
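
A sketch of attaching the compile summary to a training op so that it is emitted alongside other summaries (train_op is a hypothetical training operation assumed to be built elsewhere):

 import tensorflow as tf
from tensorflow.python.ipu.ops import summary_ops

compile_summary = summary_ops.ipu_compile_summary("ipu_compile", [train_op])
all_summaries = tf.summary.merge_all()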

Custom operations

class tensorflow.python.ipu.optimizers.map_gradient_optimizer.MapGradientOptimizer(wrapped_optimizer, gradient_mapping_function, name='MapGradientOptimizer')

This class enables modification of the computed gradients, before they are passed to the final optimizer for application.

MapGradientOptimizer needs a map function that will modify the gradients, and an optimizer to which the modified gradients are passed.

The map function has two arguments: gradient and variable. The map function must return the modified gradient.

Example

 # Define a function which will modify the computed gradients.
# This is a gradient decay function; WEIGHT_DECAY = 0.01 gives the
# gradient values quoted at the end of this example.
WEIGHT_DECAY = 0.01

def map_fn_decay(grad, var):
  return grad + (WEIGHT_DECAY * var)

# To run the code we need a session (cached_session is available inside
# a tf.test.TestCase; a plain tf.Session can be used instead):
with self.cached_session():
  optimizer = gradient_descent.GradientDescentOptimizer(0.000001)
  # We define MapGradientOptimizer
  map_optimizer = map_gradient_optimizer.MapGradientOptimizer(
      optimizer, map_fn_decay)
  # Gradients are computed by compute_gradients(), where our map function
  # modifies computed gradients. compute_gradients(loss, var_list) arguments
  # are loss and var_list so define arguments and call
  # map_optimizer.compute_gradients().
  values = [1.0, 2.0, 3.0]
  vars_ = [variables.Variable([v], dtype=dtypes.float32) for v in values]
  grads_and_vars = map_optimizer.compute_gradients(
      vars_[0] * vars_[1] + vars_[0] * vars_[2] + vars_[1] * vars_[2],
      vars_)
  # The output grads_and_vars contains computed gradients modified by
  # the decay map function.
  # grads are 5.01, 4.02 and 3.03. If we did not use MapGradientOptimizer
  # they would be 5, 4 and 3.
Parameters
  • wrapped_optimizer – tensorflow (derived) optimizer.

  • gradient_mapping_function – A function applied to the gradients and variables provided by wrapped_optimizer.compute_gradients().

Returns

compute_gradients() returns a list of (gradient, variable) pairs.

apply_gradients(grads_and_vars, global_step=None, name=None)

Apply gradients to variables.

This is the second part of minimize(). It returns an Operation that applies gradients.

Parameters
  • grads_and_vars – List of (gradient, variable) pairs as returned by compute_gradients().

  • global_step – Optional Variable to increment by one after the variables have been updated.

  • name – Optional name for the returned operation. Default to the name passed to the Optimizer constructor.

Returns

An Operation that applies the specified gradients. If global_step was not None, that operation also increments global_step.

Raises
  • TypeError – If grads_and_vars is malformed.

  • ValueError – If none of the variables have gradients.

  • RuntimeError – If you should use _distributed_apply() instead.

compute_gradients(loss, var_list=None, gate_gradients=1, aggregation_method=None, colocate_gradients_with_ops=False, grad_loss=None)

Compute gradients of loss for the variables in var_list.

This is the first part of minimize(). It returns a list of (gradient, variable) pairs where “gradient” is the gradient for “variable”. Note that “gradient” can be a Tensor, an IndexedSlices, or None if there is no gradient for the given variable.

Parameters
  • loss – A Tensor containing the value to minimize or a callable taking no arguments which returns the value to minimize. When eager execution is enabled it must be a callable.

  • var_list – Optional list or tuple of tf.Variable to update to minimize loss. Defaults to the list of variables collected in the graph under the key GraphKeys.TRAINABLE_VARIABLES.

  • gate_gradients – How to gate the computation of gradients. Can be GATE_NONE, GATE_OP, or GATE_GRAPH.

  • aggregation_method – Specifies the method used to combine gradient terms. Valid values are defined in the class AggregationMethod.

  • colocate_gradients_with_ops – If True, try colocating gradients with the corresponding op.

  • grad_loss – Optional. A Tensor holding the gradient computed for loss.

Returns

A list of (gradient, variable) pairs. Variable is always present, but gradient can be None.

Raises
  • TypeError – If var_list contains anything else than Variable objects.

  • ValueError – If some arguments are invalid.

  • RuntimeError – If called with eager execution enabled and loss is not callable.

Eager compatibility: when eager execution is enabled, gate_gradients, aggregation_method, and colocate_gradients_with_ops are ignored.

get_slot(var, name)

Return a slot named name created for var by the Optimizer.

Some Optimizer subclasses use additional variables. For example Momentum and Adagrad use variables to accumulate updates. This method gives access to these Variable objects if for some reason you need them.

Use get_slot_names() to get the list of slot names created by the Optimizer.

Parameters
  • var – A variable passed to minimize() or apply_gradients().

  • name – A string.

Returns

The Variable for the slot if it was created, None otherwise.

get_slot_names()

Return a list of the names of slots created by the Optimizer.

See get_slot().

Returns

A list of strings.

minimize(loss, global_step=None, var_list=None, gate_gradients=1, aggregation_method=None, colocate_gradients_with_ops=False, name=None, grad_loss=None)

Add operations to minimize loss by updating var_list.

This method simply combines calls compute_gradients() and apply_gradients(). If you want to process the gradient before applying them call compute_gradients() and apply_gradients() explicitly instead of using this function.

Parameters
  • loss – A Tensor containing the value to minimize.

  • global_step – Optional Variable to increment by one after the variables have been updated.

  • var_list – Optional list or tuple of Variable objects to update to minimize loss. Defaults to the list of variables collected in the graph under the key GraphKeys.TRAINABLE_VARIABLES.

  • gate_gradients – How to gate the computation of gradients. Can be GATE_NONE, GATE_OP, or GATE_GRAPH.

  • aggregation_method – Specifies the method used to combine gradient terms. Valid values are defined in the class AggregationMethod.

  • colocate_gradients_with_ops – If True, try colocating gradients with the corresponding op.

  • name – Optional name for the returned operation.

  • grad_loss – Optional. A Tensor holding the gradient computed for loss.

Returns

An Operation that updates the variables in var_list. If global_step was not None, that operation also increments global_step.

Raises

ValueError – If some of the variables are not Variable objects.

Eager compatibility: when eager execution is enabled, loss should be a Python function that takes no arguments and computes the value to be minimized. Minimization (and gradient computation) is done with respect to the elements of var_list if not None, else with respect to any trainable variables created during the execution of the loss function. gate_gradients, aggregation_method, colocate_gradients_with_ops and grad_loss are ignored when eager execution is enabled.

variables()

A list of variables which encode the current state of Optimizer.

Includes slot variables and additional global variables created by the optimizer in the current default graph.

Returns

A list of variables.

Dataset benchmarking

tensorflow.python.ipu.dataset_benchmark.dataset_benchmark(dataset, number_of_epochs, elements_per_epochs, print_stats=True, apply_options=True)

Allows the user to benchmark performance of a tf.data.Dataset.

Parameters
  • dataset – An instance of tf.data.Dataset which will be benchmarked.

  • number_of_epochs – The number of epochs this dataset will be run for.

  • elements_per_epochs – The number of elements there are in each epoch.

  • print_stats – Whether to print statistics about the performance to the console.

  • apply_options – Whether to apply optimization options which can improve the dataset performance.

Returns

A JSON string with performance statistics, which records the following metrics every epoch:

  • elements_processed - number of elements processed.

  • total_bytes_processed - total number of bytes processed.

  • time_elapsed - the time it took (in seconds) for the epoch to complete.

  • elements_per_second - number of elements processed per second.

  • bandwidth - the bandwidth achieved, measured in GB/s.

The returned JSON string can be parsed using Python's native json library (see https://docs.python.org/3/library/json.html).

Raises
  • TypeError – if dataset is not an instance of tf.data.Dataset.

  • ValueError – if number_of_epochs or elements_per_epochs is less than 1.
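
A minimal sketch of benchmarking a simple dataset; the dataset, epoch count and element count below are illustrative:

 import tensorflow as tf
from tensorflow.python.ipu import dataset_benchmark

dataset = tf.data.Dataset.from_tensor_slices(tf.zeros([1024, 128]))
dataset = dataset.repeat().batch(32, drop_remainder=True)

# Benchmark 5 epochs of 100 elements each.
benchmark_op = dataset_benchmark.dataset_benchmark(dataset, 5, 100)

with tf.Session() as sess:
  json_string = sess.run(benchmark_op)
  # json_string holds the performance statistics described above and
  # can be parsed with Python's json module.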

tensorflow.python.ipu.dataset_benchmark.infeed_benchmark(infeed_queue, number_of_epochs, elements_per_epochs, print_stats=True)

Allows the user to benchmark performance of an ipu.ipu_infeed_queue.IPUInfeedQueue.

Parameters
  • infeed_queue – An instance of ipu.ipu_infeed_queue.IPUInfeedQueue which will be benchmarked.

  • number_of_epochs – The number of epochs this infeed queue will be run for.

  • elements_per_epochs – The number of elements there are in each epoch.

  • print_stats – Whether to print statistics about the performance to the console.

Returns

A JSON string with performance statistics, which records the following metrics every epoch:

  • elements_processed - number of elements processed.

  • total_bytes_processed - total number of bytes processed.

  • time_elapsed - the time it took (in seconds) for the epoch to complete.

  • elements_per_second - number of elements processed per second.

  • bandwidth - the bandwidth achieved, measured in GB/s.

The returned JSON string can be parsed using Python's native json library (see https://docs.python.org/3/library/json.html).

Raises
  • TypeError – if infeed_queue is not an instance of ipu.ipu_infeed_queue.IPUInfeedQueue.

  • ValueError – if number_of_epochs or elements_per_epochs is less than 1.

TensorFlow operators supported by the IPU

Supported operators for device: XLA_IPU_JIT

Operator

Type Constraint

Abs

T={float,half,int32,int64}

Acos

T={float,half,int32,int64}

Acosh

T={float,half}

Add

T={float,half,int32,int64}

AddN

T={float,half,int32,int64,variant}

AddV2

T={float,half,int32,int64}

AdjustContrastv2

T={float,half}

AdjustHue

T={float,half}

AdjustSaturation

T={float,half}

All

Tidx={int32,int64}

Any

Tidx={int32,int64}

ApproximateEqual

T={float,half,int32,int64}

ArgMax

output_type={int32,int64}
T={float,half,int32,int64}
Tidx={int32,int64}

ArgMin

output_type={int32,int64}
T={float,half,int32,int64}
Tidx={int32,int64}

Asin

T={float,half,int32,int64}

Asinh

T={float,half}

AssignAddVariableOp

dtype={float,half,int32,int64}

AssignSubVariableOp

dtype={float,half,int32,int64}

AssignVariableOp

dtype={bool,float,half,int32,int64}

Atan

T={float,half,int32,int64}

Atan2

T={float,half}

Atanh

T={float,half}

AvgPool

T={float,half}

AvgPool3D

T={float,half}

AvgPool3DGrad

T={float,half}

AvgPoolGrad

T={float,half}

BatchMatMul

T={float,half,int32,int64}

BatchMatMulV2

T={float,half,int32,int64}

BatchToSpace

Tidx={int32,int64}
T={bool,float,half,int32,int64}

BatchToSpaceND

Tcrops={int32,int64}
T={bool,float,half,int32,int64}
Tblock_shape={int32,int64}

BesselI0e

T={float,half}

BesselI1e

T={float,half}

Betainc

T={float}

BiasAdd

T={float,half,int32,int64}

BiasAddGrad

T={float,half,int32,int64}

BiasAddV1

T={float,half,int32,int64}

Bitcast

type={float,half,int32,int64}
T={float,half,int32,int64}

BitwiseAnd

T={int32,int64}

BitwiseOr

T={int32,int64}

BitwiseXor

T={int32,int64}

BroadcastArgs

T={int32,int64}

BroadcastGradientArgs

T={int32,int64}

BroadcastTo

Tidx={int32,int64}
T={bool,float,half,int32,int64}

Bucketize

T={float,int32,int64}

Case

Tout={bool,float,half,int32,int64,resource,variant}
Tin={bool,float,half,int32,int64,resource,variant}

Cast

DstT={bool,float,half,int32,int64}
SrcT={bool,float,half,int32,int64}

Ceil

T={float,half}

Cholesky

T={float,half}

ClipByValue

T={float,half,int32,int64}

Concat

T={bool,float,half,int32,int64}

ConcatOffset

 

ConcatV2

Tidx={int32}
T={bool,float,half,int32,int64}

ConjugateTranspose

Tperm={int32,int64}
T={bool,float,half,int32,int64}

Const

dtype={bool,float,half,int32,int64,string}

ControlTrigger

 

Conv2D

T={float,half,int32}

Conv2DBackpropFilter

T={float,half}

Conv2DBackpropInput

T={float,half,int32}

Conv3D

T={float,half}

Conv3DBackpropFilterV2

T={float,half}

Conv3DBackpropInputV2

Tshape={int32,int64}
T={float,half}

Cos

T={float,half}

Cosh

T={float,half}

Cross

T={float,half,int32,int64}

Cumprod

Tidx={int32,int64}
T={float,half,int32}

Cumsum

Tidx={int32,int64}
T={float,half,int32}

DataFormatDimMap

T={int32,int64}

DataFormatVecPermute

T={int32,int64}

DepthToSpace

T={bool,float,half,int32,int64}

DepthwiseConv2dNative

T={float,half}

DepthwiseConv2dNativeBackpropFilter

T={float,half}

DepthwiseConv2dNativeBackpropInput

T={float,half}

Diag

T={float,half,int32,int64}

DiagPart

T={float,half,int32,int64}

Digamma

T={float,half}

Div

T={float,half,int32,int64}

DivNoNan

T={float,half}

DynamicStitch

T={bool,float,half,int32,int64}

Einsum

T={float,half,int32}

Elu

T={float,half}

EluGrad

T={float,half}

Empty

dtype={bool,float,half,int32,int64}

EmptyTensorList

shape_type={int32,int64,variant}
element_dtype={bool,float,half,int32,int64,variant}

Equal

T={bool,float,half,int32,int64}

Erf

T={float,half}

Erfc

T={float,half}

Exp

T={float,half}

ExpandDims

Tdim={int32,int64}
T={bool,float,half,int32,int64}

Expm1

T={float,half}

ExtractImagePatches

T={float,half,int32,int64}

FakeParam

dtype={bool,float,half,int32,int64}

FakeQuantWithMinMaxArgs

 

FakeQuantWithMinMaxArgsGradient

 

FakeQuantWithMinMaxVars

 

FakeQuantWithMinMaxVarsGradient

 

Fill

index_type={int32,int64}
T={bool,float,half,int32,int64}

Floor

T={float,half}

FloorDiv

T={float,half,int32,int64}

FloorMod

T={float,half,int32,int64}

FusedBatchNorm

T={float}

FusedBatchNormGrad

T={float}

FusedBatchNormGradV2

V={float,half}
T={float,half}
U={float,half}

FusedBatchNormGradV3

V={float,half}
T={float,half}
U={float,half}

FusedBatchNormV2

U={float,half}
T={float,half}

FusedBatchNormV3

U={float,half}
T={float,half}

Gather

Tindices={int32,int64}
Tparams={bool,float,half,int32,int64}

GatherNd

Tindices={int32,int64}
Tparams={bool,float,half,int32,int64}

GatherV2

Taxis={int32,int64}
Tparams={bool,float,half,int32,int64}
Tindices={int32,int64}

Greater

T={float,half,int32,int64}

GreaterEqual

T={float,half,int32,int64}

HSVToRGB

T={float,half}

Identity

T={bool,float,half,int32,int64,resource,variant}

IdentityN

T={bool,float,half,int32,int64,resource,variant}

If

Tout={bool,float,half,int32,int64,resource,variant}
Tcond={bool,float,half,int32,int64,resource,variant}
Tin={bool,float,half,int32,int64,resource,variant}

InTopKV2

T={int32,int64}

Inv

T={float,half,int32,int64}

Invert

T={int32,int64}

InvertPermutation

T={int32}

IsFinite

T={float,half}

IsInf

T={float,half}

IsNan

T={float,half}

L2Loss

T={float,half}

LRN

T={float,half}

LRNGrad

T={float,half}

LeakyRelu

T={float,half}

LeakyReluGrad

T={float,half}

LeftShift

T={int32,int64}

Less

T={float,half,int32,int64}

LessEqual

T={float,half,int32,int64}

Lgamma

T={float,half}

LinSpace

Tidx={int32,int64}
T={float,half}

ListDiff

out_idx={int32,int64}
T={int32,int64}

Log

T={float,half}

Log1p

T={float,half}

LogSoftmax

T={float,half}

LogicalAnd

 

LogicalNot

 

LogicalOr

 

MatMul

T={float,half}

MatrixBandPart

Tindex={int32,int64}
T={bool,float,half,int32,int64}

MatrixDiag

T={bool,float,half,int32,int64}

MatrixDiagPart

T={bool,float,half,int32,int64}

MatrixDiagPartV2

T={bool,float,half,int32,int64}

MatrixDiagV2

T={bool,float,half,int32,int64}

MatrixInverse

T={float,half}

MatrixSetDiag

T={bool,float,half,int32,int64}

MatrixSetDiagV2

T={bool,float,half,int32,int64}

MatrixSolve

T={float,half}

MatrixTriangularSolve

T={float,half}

Max

Tidx={int32,int64}
T={float,half,int32,int64}

MaxPool

T={float,half,int32,int64}

MaxPool3D

T={float,half}

MaxPool3DGrad

TInput={float,half}
T={float,half}

MaxPoolGrad

T={float,half,int32,int64}

MaxPoolGradGradV2

T={float}

MaxPoolGradV2

T={float,half,int32,int64}

MaxPoolV2

T={float,half,int32,int64}

Maximum

T={float,half,int32,int64}

Mean

Tidx={int32,int64}
T={float,half,int32,int64}

Min

Tidx={int32,int64}
T={float,half,int32,int64}

Minimum

T={float,half,int32,int64}

MirrorPad

Tpaddings={int32,int64}
T={bool,float,half,int32,int64}

Mod

T={float,half,int32,int64}

Mul

T={float,half,int32,int64}

MulNoNan

T={float,half}

Multinomial

output_dtype={int32,int64}
T={float,half,int32,int64}

Neg

T={float,half,int32,int64}

NextAfter

T={float}

NoOp

 

NotEqual

T={bool,float,half,int32,int64}

OneHot

TI={int32,int64}
T={bool,float,half,int32,int64}

OnesLike

T={bool,float,half,int32,int64}

Pack

T={bool,float,half,int32,int64}

Pad

Tpaddings={int32,int64}
T={bool,float,half,int32,int64}

PadV2

Tpaddings={int32,int64}
T={bool,float,half,int32,int64}

ParallelDynamicStitch

T={bool,float,half,int32,int64}

ParameterizedTruncatedNormal

T={int32,int64}
dtype={float}

PartitionedCall

Tout={bool,float,half,int32,int64,resource,string,variant}
Tin={bool,float,half,int32,int64,resource,string,variant}

PlaceholderWithDefault

dtype={bool,float,half,int32,int64}

Pow

T={float,half,int32,int64}

PreventGradient

T={bool,float,half,int32,int64}

Prod

Tidx={int32,int64}
T={float,half,int32,int64}

QuantizeAndDequantizeV2

T={float,half}

QuantizeAndDequantizeV3

T={float,half}

RGBToHSV

T={float,half}

RandomShuffle

T={bool,float,half,int32,int64}

RandomStandardNormal

T={int32,int64}
dtype={float,half}

RandomUniform

T={int32,int64}
dtype={float,half}

RandomUniformInt

T={int32,int64}
Tout={int32,int64}

Range

Tidx={float,half,int32,int64}

Rank

T={bool,float,half,int32,int64}

ReadVariableOp

dtype={bool,float,half,int32,int64}

RealDiv

T={float,half,int32,int64}

Reciprocal

T={float,half,int32,int64}

ReciprocalGrad

T={float,half}

Relu

T={float,half,int32,int64}

Relu6

T={float,half,int32,int64}

Relu6Grad

T={float,half,int32,int64}

ReluGrad

T={float,half,int32,int64}

Reshape

Tshape={int32,int64}
T={bool,float,half,int32,int64}

ResizeBilinear

T={float,half,int32,int64}

ResizeBilinearGrad

T={float,half}

ResizeNearestNeighbor

T={float,half,int32,int64}

ResourceApplyAdaMax

T={float,half}

ResourceApplyAdadelta

T={float,half}

ResourceApplyAdagrad

T={float,half}

ResourceApplyAdagradDA

T={float,half}

ResourceApplyAdagradV2

T={float,half}

ResourceApplyAdam

T={float,half}

ResourceApplyAddSign

T={float,half}

ResourceApplyCenteredRMSProp

T={float,half}

ResourceApplyFtrl

T={float,half}

ResourceApplyFtrlV2

T={float,half}

ResourceApplyGradientDescent

T={float,half}

ResourceApplyKerasMomentum

T={float,half}

ResourceApplyMomentum

T={float,half}

ResourceApplyPowerSign

T={float,half}

ResourceApplyProximalAdagrad

T={float,half}

ResourceApplyProximalGradientDescent

T={float,half}

ResourceApplyRMSProp

T={float,half}

ResourceGather

Tindices={int32,int64}
dtype={bool,float,half,int32,int64}

ResourceScatterAdd

Tindices={int32,int64}
dtype={float,half,int32,int64}

ResourceScatterDiv

Tindices={int32,int64}
dtype={float,half,int32,int64}

ResourceScatterMax

Tindices={int32,int64}
dtype={float,half,int32,int64}

ResourceScatterMin

Tindices={int32,int64}
dtype={float,half,int32,int64}

ResourceScatterMul

Tindices={int32,int64}
dtype={float,half,int32,int64}

ResourceScatterNdAdd

Tindices={int32,int64}
T={bool,float,half,int32,int64}

ResourceScatterNdSub

Tindices={int32,int64}
T={bool,float,half,int32,int64}

ResourceScatterNdUpdate

Tindices={int32,int64}
T={bool,float,half,int32,int64}

ResourceScatterSub

Tindices={int32,int64}
dtype={float,half,int32,int64}

ResourceScatterUpdate

Tindices={int32,int64}
dtype={bool,float,half,int32,int64}

ResourceStridedSliceAssign

Index={int32,int64}
T={bool,float,half,int32,int64}

Reverse

T={bool,float,half,int32,int64}

ReverseSequence

Tlen={int32,int64}
T={bool,float,half,int32,int64}

ReverseV2

T={bool,float,half,int32,int64}
Tidx={int32,int64}

RightShift

T={int32,int64}

Rint

T={float,half}

Roll

Taxis={int32,int64}
T={bool,float,half,int32,int64}
Tshift={int32,int64}

Round

T={float,half,int32,int64}

Rsqrt

T={float,half}

RsqrtGrad

T={float,half}

ScatterNd

Tindices={int32,int64}
T={bool,float,half,int32,int64}

Select

T={bool,float,half,int32,int64}

SelectV2

T={bool,float,half,int32,int64}

SelfAdjointEigV2

T={float,half}

Selu

T={float,half}

SeluGrad

T={float,half}

Shape

out_type={int32,int64}
T={bool,float,half,int32,int64}

ShapeN

out_type={int32,int64}
T={bool,float,half,int32,int64}

Sigmoid

T={float,half}

SigmoidGrad

T={float,half}

Sign

T={float,half,int32,int64}

Sin

T={float,half}

Sinh

T={float,half}

Size

out_type={int32,int64}
T={bool,float,half,int32,int64}

Slice

Index={int32,int64}
T={bool,float,half,int32,int64}

Snapshot

T={bool,float,half,int32,int64}

Softmax

T={float,half}

SoftmaxCrossEntropyWithLogits

T={float,half}

Softplus

T={float,half}

SoftplusGrad

T={float,half}

Softsign

T={float,half}

SoftsignGrad

T={float,half}

SpaceToBatch

Tpaddings={int32,int64}
T={bool,float,half,int32,int64}

SpaceToBatchND

Tpaddings={int32,int64}
T={bool,float,half,int32,int64}
Tblock_shape={int32,int64}

SpaceToDepth

T={bool,float,half,int32,int64}

SparseMatMul

Tb={float}
Ta={float}

SparseSoftmaxCrossEntropyWithLogits

Tlabels={int32,int64}
T={float,half}

SparseToDense

Tindices={int32,int64}
T={bool,float,half,int32,int64}

Split

T={bool,float,half,int32,int64}

SplitV

Tlen={int32,int64}
T={bool,float,half,int32,int64}

Sqrt

T={float,half}

SqrtGrad

T={float,half}

Square

T={float,half,int32,int64}

SquaredDifference

T={float,half,int32,int64}

Squeeze

T={bool,float,half,int32,int64}

StackCloseV2

 

StackPopV2

elem_type={bool,float,half,int32,int64}

StackPushV2

T={bool,float,half,int32,int64}

StackV2

elem_type={bool,float,half,int32,int64}

StatefulPartitionedCall

Tout={bool,float,half,int32,int64,resource,string,variant}
Tin={bool,float,half,int32,int64,resource,string,variant}

StatefulStandardNormalV2

shape_dtype={bool,float,half,int32,int64}
dtype={float}

StatefulTruncatedNormal

shape_dtype={bool,float,half,int32,int64}
dtype={float}

StatefulUniform

shape_dtype={bool,float,half,int32,int64}
dtype={float}

StatefulUniformFullInt

shape_dtype={bool,float,half,int32,int64}
dtype={int32,int64}

StatefulUniformInt

shape_dtype={bool,float,half,int32,int64}
dtype={int32,int64}

StatelessIf

Tout={bool,float,half,int32,int64,resource,variant}
Tcond={bool,float,half,int32,int64,resource,variant}
Tin={bool,float,half,int32,int64,resource,variant}

StatelessMultinomial

T={float}
output_dtype={int32,int64}
Tseed={int32}

StatelessRandomNormal

Tseed={int32}
dtype={float}
T={int32,int64}

StatelessRandomUniform

Tseed={int32}
dtype={float}
T={int32,int64}

StatelessRandomUniformInt

Tseed={int32}
dtype={int32,int64}
T={int32,int64}

StatelessTruncatedNormal

Tseed={int32}
dtype={float}
T={int32,int64}

StatelessWhile

T={bool,float,half,int32,int64,resource,variant}

StopGradient

T={bool,float,half,int32,int64}

StridedSlice

Index={int32,int64}
T={bool,float,half,int32,int64}

StridedSliceGrad

Index={int32,int64}
T={bool,float,half,int32,int64}

Sub

T={float,half,int32,int64}

Sum

Tidx={int32,int64}
T={float,half,int32,int64}

Svd

T={float,half}

SymbolicGradient

Tout={bool,float,half,int32,int64}
Tin={bool,float,half,int32,int64}

Tan

T={float,half,int32,int64}

Tanh

T={float,half}

TanhGrad

T={float,half}

TensorArrayCloseV3

 

TensorArrayConcatV3

dtype={bool,float,half,int32,int64}

TensorArrayGatherV3

dtype={bool,float,half,int32,int64}

TensorArrayGradV3

 

TensorArrayReadV3

dtype={bool,float,half,int32,int64}

TensorArrayScatterV3

T={bool,float,half,int32,int64}

TensorArraySizeV3

 

TensorArraySplitV3

T={bool,float,half,int32,int64}

TensorArrayV3

dtype={bool,float,half,int32,int64}

TensorArrayWriteV3

T={bool,float,half,int32,int64}

TensorListElementShape

shape_type={int32,int64}

TensorListFromTensor

shape_type={int32,int64}
element_dtype={bool,float,half,int32,int64}

TensorListGather

element_dtype={bool,float,half,int32,int64}

TensorListGetItem

element_dtype={bool,float,half,int32,int64}

TensorListLength

 

TensorListPopBack

element_dtype={bool,float,half,int32,int64,variant}

TensorListPushBack

element_dtype={bool,float,half,int32,int64,variant}

TensorListReserve

shape_type={int32,int64}
element_dtype={bool,float,half,int32,int64}

TensorListSetItem

element_dtype={bool,float,half,int32,int64}

TensorListStack

element_dtype={bool,float,half,int32,int64}

TensorScatterAdd

Tindices={int32,int64}
T={bool,float,half,int32,int64}

TensorScatterSub

Tindices={int32,int64}
T={bool,float,half,int32,int64}

TensorScatterUpdate

Tindices={int32,int64}
T={bool,float,half,int32,int64}

Tile

Tmultiples={int32,int64}
T={bool,float,half,int32,int64}

TopKV2

T={float,int32}

Transpose

Tperm={int32,int64}
T={bool,float,half,int32,int64}

TruncateDiv

T={float,half,int32,int64}

TruncateMod

T={float,half,int32,int64}

TruncatedNormal

T={int32,int64}
dtype={float}

Unpack

T={bool,float,half,int32,int64}

UnsortedSegmentMax

Tnumsegments={int32,int64}
T={float,half,int32,int64}
Tindices={int32,int64}

UnsortedSegmentMin

Tnumsegments={int32,int64}
T={float,half,int32,int64}
Tindices={int32,int64}

UnsortedSegmentProd

Tnumsegments={int32,int64}
T={float,half,int32,int64}
Tindices={int32,int64}

UnsortedSegmentSum

Tnumsegments={int32,int64}
T={float,half,int32,int64}
Tindices={int32,int64}

VarIsInitializedOp

 

VariableShape

out_type={int32,int64}

While

T={bool,float,half,int32,int64,resource,variant}

Xdivy

T={float,half}

XlaBroadcastHelper

Tindices={int32,int64}
T={float,half,int32,int64}

XlaConv

Tindices={int32,int64}
T={float,half,int32,int64}

XlaDequantize

 

XlaDot

T={float,half,int32,int64}

XlaDynamicSlice

Tindices={int32,int64}
T={bool,float,half,int32,int64}

XlaDynamicUpdateSlice

Tindices={int32,int64}
T={bool,float,half,int32,int64}

XlaEinsum

T={float}

XlaIf

Tout={bool,float,half,int32,int64,resource,variant}
Tcond={bool,float,half,int32,int64,resource,variant}
Tin={bool,float,half,int32,int64,resource,variant}

XlaKeyValueSort

V={bool,float,half,int32,int64}
K={float,half,int32,int64}

XlaPad

Tindices={int32,int64}
T={bool,float,half,int32,int64}

XlaRecv

dtype={bool,float,half,int32,int64}

XlaReduce

T={float,half,int32,int64}

XlaReduceWindow

Tindices={int32,int64}
T={float,half,int32,int64}

XlaReplicaId

 

XlaSelectAndScatter

Tindices={int32,int64}
T={float,half,int32,int64}

XlaSelfAdjointEig

T={float,half}

XlaSend

T={bool,float,half,int32,int64}

XlaSharding

T={bool,float,half,int32,int64}

XlaSort

T={bool,float,half,int32,int64}

XlaSvd

T={float,half}

XlaWhile

T={bool,float,half,int32,int64,resource,variant}

Xlogy

T={float,half}

ZerosLike

T={bool,float,half,int32,int64,variant}

_Arg

T={bool,float,half,int32,int64,resource,variant}

_ArrayToList

out_types={bool,float,half,int32,int64}
T={bool,float,half,int32,int64}

_FusedBatchNormEx

U={float,half}
T={float,half}

_ListToArray

T={bool,float,half,int32,int64}
Tin={bool,float,half,int32,int64}

_Retval

T={bool,float,half,int32,int64,resource,variant}

_UnaryOpsComposition

T={float,half}

To regenerate this table, run:

 bazel run -c opt -- tensorflow/compiler/tf2xla:tf2xla_supported_ops --device=XLA_IPU_JIT