Targeting the IPU from TensorFlow

Introduction

The purpose of this document is to introduce the TensorFlow framework from the perspective of developing and training models for the IPU. It assumes you have some knowledge of TensorFlow and machine learning.

See the “Getting Started” guide for your IPU system on the Graphcore support portal for installation instructions.

To some extent, implementing at the framework level is relatively independent of the underlying hardware as it relates to the specifics of defining a graph and its components (for example, how a convolutional layer is defined).

However, there are critical elements of targeting the IPU from TensorFlow that need to be understood to successfully use it as a training and inference engine. These include IPU-specific API configurations, model parallelism, error logging and report generation, as well as strategies for dealing with out-of-memory (OOM) issues.

Tutorial

TensorFlow is a powerful graph-modelling framework that can be used for the development, training and deployment of deep learning models. In the Graphcore software stack, TensorFlow sits at the highest level of abstraction, with Poplar and Poplibs providing the interface between TensorFlow and the actual IPU operations.

Figure: TensorFlow abstraction in relation to Poplar and the IPU

For the discussion that follows, it is important to understand the three key concepts of graph, session and device as well as their functional interdependence.

Figure: Interdependence between session, graph and device in TensorFlow

Graph

A computational graph is the connectivity framework of a deep learning model, where nodes are operators and edges are the data streams that connect them. Building a deep learning model in TensorFlow is the functional equivalent of designing a graph, where specified layer operations (for example, fully-connected layers) are nodes, and the sequence and connectivity of layers (such as a convolutional layer followed by max-pooling) define the edges.

Session

A session is the computational platform that encapsulates a graph. It handles data flow into and out of the graph, variable initialisation, model/weight storage and weight restoration along with a number of other operations that are required to manage the computational task.

Device

The device is an object that identifies the hardware to which a session is ported, such as the IPU, CPU or TPU. In many of the applications targeting the IPU, it will be helpful to segregate tasks between the CPU and IPU to leverage those aspects of the computation that each is ideally suited to undertake.

In the sections that follow, these three concepts will form a recurrent theme in building and deploying models from TensorFlow.

There are a number of references, user guides, model repos and texts that can be valuable in learning the framework. See the References section.

Preliminary graphs

The focus now is to implement our first basic graphs targeting the IPU. The first step will be a straightforward additive graph with nothing save the fundamental components required for running on an IPU.

From there, we add the XLA library, which is required for a number of TensorFlow operators.

Finally, we add the concept of sharding, in which we take our first steps towards model parallelism by splitting a basic graph across four IPUs and consolidating the calculations from the separate IPUs to produce a single final result.

A basic graph

We begin with the most humble of aspirations: the ability to add.

 import numpy as np

from tensorflow.python import ipu
from tensorflow.python.ipu.scopes import ipu_scope
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

# Configure arguments for targeting the IPU
cfg = ipu.utils.create_ipu_config(profiling=True, use_poplar_text_report=True)
cfg = ipu.utils.set_ipu_model_options(cfg, compile_ipu_code=False)
cfg = ipu.utils.auto_select_ipus(cfg, 1)
ipu.utils.configure_ipu_system(cfg)

with tf.device("cpu"):
  pa = tf.placeholder(np.float32, [2], name="a")
  pb = tf.placeholder(np.float32, [2], name="b")
  pc = tf.placeholder(np.float32, [2], name="c")


def basic_graph(pa, pb, pc):
  # Do basic addition with tensors
  o1 = pa + pb
  o2 = pa + pc
  simple_graph_output = o1 + o2
  return simple_graph_output


with ipu_scope("/device:IPU:0"):
  result = basic_graph(pa, pb, pc)

with tf.Session() as sess:
  # Run the graph through the session feeding it an arbitrary dictionary
  result = sess.run(result,
                    feed_dict={
                        pa: [1., 1.],
                        pb: [0., 1.],
                        pc: [1., 5.]
                    })
  print(result)

Let’s review the key sections of the code as they are presented. Lines 1-5 contain the basic import statements, two of which pertain specifically to the IPU. Line 3 imports the IPU API, which is the main interface for setting configuration options when running an IPU session. ipu_scope is a helper function that ensures that the device and resource scopes are set (that is, the hardware is properly initialised when called by the script).

 # Configure arguments for targeting the IPU
cfg = ipu.utils.create_ipu_config(profiling=True, use_poplar_text_report=True)
cfg = ipu.utils.set_ipu_model_options(cfg, compile_ipu_code=False)
cfg = ipu.utils.auto_select_ipus(cfg, 1)
ipu.utils.configure_ipu_system(cfg)

In this section of the code, the basic configuration options are defined. Boolean flags are passed to create_ipu_config to turn on profiling and a text-format report.

  • The profiling parameter enables trace event logging on the IPU. This will monitor operations on the chip, providing a detailed description of the session as it runs on hardware.

  • use_poplar_text_report configures the text format of the generated report, making it more readable for debugging purposes.

Because profiling adds code and extra variables to extract the profiling information, it can change the performance and memory usage of your program.

Running on the IPU Model simulator

You can run the graph on IPU hardware or on an IPU Model running on the host. The IPU model is an accurate simulation of the behaviour of the IPU hardware. It may be easier to debug problems, such as out-of-memory errors, using the IPU Model.

When using an IPU Model instead of actual IPU hardware, the runtime operations will behave exactly as they would on hardware. However, the profiler will estimate the performance of operations and the memory use, so the profiling information will not be as precise as when running on hardware. By default, the memory use will not include that required for IPU code.

If you set the set_ipu_model_options option compile_ipu_code to True then Poplar will compile code for the IPU (in addition to the CPU code that is actually executed by the host). In this case, the reported IPU memory usage will include the memory used for code.

Because the IPU Model has access to all the memory of the host CPU it is possible to run graphs that would run out of memory on the IPU. This makes the IPU Model an important tool for debugging OOM-related issues. See Using the IPU Model device for more information.

By default, the code will be run on IPU hardware. To actually run on the IPU Model instead, you would need to set the environment variable TF_POPLAR_FLAGS='--use_ipu_model', for example:

 # Using IPU model instead of IPU hardware
if self.base_dictionary['ipu_model']:
    os.environ['TF_POPLAR_FLAGS'] = '--use_ipu_model'

Selecting hardware to run on

The auto_select_ipus function enables you to select from the available IPUs on a server. In this instance, one IPU is selected. This can be changed to any number between 1 and 16 on a system such as the Dell EMC DSS8440 IPU Server, which has eight C2 cards installed, each with two IPUs. This option will be important when we explore sharding, in which a single graph is split into separate sections, each targeting a distinct IPU.

 with tf.device("cpu"):
  pa = tf.placeholder(np.float32, [2], name="a")
  pb = tf.placeholder(np.float32, [2], name="b")
  pc = tf.placeholder(np.float32, [2], name="c")

In this section, TensorFlow placeholders are being placed into the CPU part of the graph. These will be used to feed data using a feed dictionary when executing session.run().

 def basic_graph(pa, pb, pc):
  # Do basic addition with tensors
  o1 = pa + pb
  o2 = pa + pc
  simple_graph_output = o1 + o2
  return simple_graph_output


with ipu_scope("/device:IPU:0"):
  result = basic_graph(pa, pb, pc)

In this section, a graph of operations is created to do simple arithmetic on three input tensors. The ipu_scope directive is used to ensure that these operations are placed on the IPU system.

When the graph is executed using session.run(), the following output can be seen in the console log.

 ... [VARIOUS OUTPUT LINES FROM SCRIPT]...
...: I tensorflow/compiler/plugin/poplar/driver/executor.cc:660] Device /device:IPU:0 attached to IPU: 0
[3. 8.]

Beyond summing the vectors correctly, the line directly preceding the result informs us that the targeted device was the IPU, and that the index of the actual IPU that ran the graph was IPU 0.

Note that "/device:IPU:0" in the script is an identifier for the IPU, and so when using auto_select_ipus, the actual IPU selected to run the graph may not be IPU 0, but could be any of the other IPUs that are free and available on the server. This will be covered in more detail in Sharding a graph.

An XLA graph

The previous script introduced a very basic graph that consisted of the summation of three vectors and published the results of a forward pass. For certain applications, it will be necessary to incorporate control flow structures, such as if or while statements. Certain recurrent neural network (RNN) layers and long short-term memory (LSTM) cells have conditionals implicitly defined in their source code. In those cases, it will be necessary to use the XLA library to define the graph. XLA is an optimised linear algebra library that interfaces the graph to a set of optimisation parsers that render highly efficient computation sets.

Using XLA has certain restrictions, the most pertinent of which for the current discussion is that the dimensions of all tensors involved in the computational graph must be fully defined at compile time. Dealing with this restriction can at times require some meticulous refactoring of placeholders or input tensors (especially when dealing with mini-batch processing), but does not constitute a significant development overhead.
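
To illustrate the restriction, the minimal sketch below (using the same placeholder style as the other examples in this document) contrasts a fully defined shape with one whose batch dimension is left undefined; the latter cannot be compiled through the XLA path described next:

 import numpy as np
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

with tf.device("cpu"):
  # Fully defined shape: every dimension is known at compile time, so this
  # placeholder can be used in an XLA-compiled graph.
  pa = tf.placeholder(np.float32, [16, 128], name="a")

  # Batch dimension left as None: the shape is not fully defined, so a graph
  # using this placeholder would fail to compile and the placeholder would
  # need to be refactored to a fixed size.
  pb = tf.placeholder(np.float32, [None, 128], name="b")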

The main interface to the XLA library is ipu.ipu_compiler.compile(), which takes a Python function describing the graph together with a list of input tensors, and returns a list of output tensors. ipu.ipu_compiler.compile sits between the graph definition and the session construct, as shown below:

Figure: xla.compile in relation to a session and graph

In most IPU-specific implementations, it is likely that an entire graph will be parsed through ipu.ipu_compiler.compile. However, it is also possible to compile only a portion of a graph with XLA and then combine the resulting tensor set with other, non-XLA, parts of the graph. Further details of XLA compilation are available on the TensorFlow website: https://www.tensorflow.org/xla/tutorials/xla_compile

Let’s now build on our previous TensorFlow script by adding ipu.ipu_compiler.compile to the session definition.

 import numpy as np

from tensorflow.python import ipu
from tensorflow.python.ipu.scopes import ipu_scope
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

# Configure argument for targeting the IPU
cfg = ipu.utils.create_ipu_config(profiling=True, use_poplar_text_report=True)
cfg = ipu.utils.set_ipu_model_options(cfg, compile_ipu_code=False)
cfg = ipu.utils.auto_select_ipus(cfg, 1)
ipu.utils.configure_ipu_system(cfg)

with tf.device("cpu"):
  pa = tf.placeholder(np.float32, [2], name="a")
  pb = tf.placeholder(np.float32, [2], name="b")
  pc = tf.placeholder(np.float32, [2], name="c")


def basic_graph(pa, pb, pc):
  # Do basic addition on tensors
  o1 = pa + pb
  o2 = pa + pc
  simple_graph_output = o1 + o2
  return simple_graph_output


with ipu_scope("/device:IPU:0"):
  xla_result = ipu.ipu_compiler.compile(basic_graph, [pa, pb, pc])

with tf.Session() as sess:
  # Base run
  result = sess.run(xla_result,
                    feed_dict={
                        pa: [1., 1.],
                        pb: [0., 1.],
                        pc: [1., 5.]
                    })

  print(result)

The script has now gone from calling basic_graph directly, to feeding it as the graph input to ipu.ipu_compiler.compile. This takes the graph, along with the corresponding placeholders, as input.

Note that the dimensions of the placeholders fed to ipu.ipu_compiler.compile have been defined on the CPU. The actual values of these tensors are not defined until the session.run call.

In other words, it is only the dimensions of the placeholders that are the critical information for ipu.ipu_compiler.compile so that it can parse the graph correctly at compile time.

Given that this graph and the one in the previous example are the same, it is apparent that ipu.ipu_compiler.compile is not actually required to execute the graph. However, if the following code:

 def basic_graph(pa, pb, pc):
    # Do basic addition on tensors
    o1 = pa + pb
    o2 = pa + pc
    simple_graph_output = o1 + o2
    return simple_graph_output

were to be replaced with:

 def while_loop_graph(pa):
  c = tf.constant(0)

  def body_of_while_loop(i):
    return i + 1

  cond = lambda i: i < 10
  loop = tf.while_loop(cond, body_of_while_loop, [c])
  square = pa * pa
  return loop, square, tf.no_op()

then ipu.ipu_compiler.compile would be strictly required, because of the use of the tf.while_loop() conditional statement.

Sharding a graph

The final script of this introductory series focuses on sharding: the process of splitting a graph across multiple IPUs. In essence, the session continues to be a single entity, so that the graph construct is treated as a single model, but distinct portions of the graph live on different IPUs, as illustrated below:

Figure: Sharding across two IPUs

Let’s now return to our basic script and add the sharding component.

 import numpy as np

from tensorflow.python import ipu
from tensorflow.python.ipu.scopes import ipu_scope
from tensorflow.compiler.plugin.poplar.ops import gen_ipu_ops
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

NUM_IPUS = 4

# Configure the IPU system
cfg = ipu.utils.create_ipu_config(profiling=True, use_poplar_text_report=True)
cfg = ipu.utils.set_ipu_model_options(cfg, compile_ipu_code=False)
cfg = ipu.utils.auto_select_ipus(cfg, NUM_IPUS)
ipu.utils.configure_ipu_system(cfg)

# Create the CPU section of the graph
with tf.device("cpu"):
  pa = tf.placeholder(np.float32, [2], name="a")
  pb = tf.placeholder(np.float32, [2], name="b")
  pc = tf.placeholder(np.float32, [2], name="c")

# Define a trace event
with tf.device('cpu'):
  report = gen_ipu_ops.ipu_event_trace()


# Distribute the computation across four shards
def sharded_graph(pa, pb, pc):
  with ipu.scopes.ipu_shard(0):
    o1 = pa + pb
  with ipu.scopes.ipu_shard(1):
    o2 = pa + pc
  with ipu.scopes.ipu_shard(2):
    o3 = pb + pc
  with ipu.scopes.ipu_shard(3):
    out = o1 + o2 + o3
    return out


# Create the IPU section of the graph
with ipu_scope("/device:IPU:0"):
  result = ipu.ipu_compiler.compile(sharded_graph, [pa, pb, pc])

with tf.Session() as sess:
  # sharded run
  result = sess.run(result,
                    feed_dict={
                        pa: [1., 1.],
                        pb: [0., 1.],
                        pc: [1., 5.]
                    })

  print(result)

Focusing on the sharding facets of this new script, line 14 uses auto_select_ipus to select four separate IPUs for the task. This allows the script to go through the IPUs currently available to the host, determine which are in use and which are free, and then attach to those IPUs that are available.

In lines 29-38, the standard sum graph is defined (with the addition of one more sum for shard 2), and each portion of the sum is now performed on a distinct shard, using:

 with ipu.scopes.ipu_shard(shard_index):

As a result, shards 0 through 2 perform independent tensor sums, while shard 3 accumulates the sums from the other three shards. In line 43 we use ipu.ipu_compiler.compile to parse the graph. Note that sharding can be performed without running through the XLA library.

Reviewing the output of the session run,

 ... [VARIOUS OUTPUT LINES FROM SCRIPT]...
...:  I tensorflow/compiler/plugin/poplar/driver/executor.cc:660] Device /device:IPU:0 attached to IPUs: 24
[array([ 4., 14.], dtype=float32)]

The first thing to note is that the sum is correct.

The second thing to note is that the IPU ID is reported as 24. This is a multi-IPU ID and corresponds to the individual IPUs 4, 5, 6 and 7. These are the IPUs selected to host the graph and to process respective shards as indexed in the code. See the IPU Command Line Tools document for more information about how IPU IDs are allocated.

Targeting the Poplar XLA device

The Poplar XLA devices are named /device:IPU:X.

A Python context handler is available for setting up all appropriate scoping while creating the graph:

 # Create the IPU section of the graph
with ipu_scope("/device:IPU:0"):
  result = ipu.ipu_compiler.compile(sharded_graph, [pa, pb, pc])

For very simple graphs, it is sufficient to use the IPU scope to define the parts of the graph which will be compiled. For most graphs, the function ipu_compiler.compile() must be used. This must be placed inside an IPU device scope.

The function ipu_compiler.compile() will cause all operations created by the Python function passed into its first argument to be placed on the IPU system, and be compiled together into a single Poplar executable.

Supported types

Poplar and the Poplibs libraries support the following data types:

 tf.float32
tf.float16
tf.int32
tf.bool

Device selection

Hardware configuration options enable you to select the number of IPU devices. By default, TensorFlow will create one device for a single IPU, using the first available IPU.

Two API calls are available for selecting the number and configuration of the IPU system.

auto_select_ipus allows the selection of a number of IPUs. The process searches for the first set of IPUs which match the number requested.

select_ipus allows the selection of a specific IPU hardware device, as returned by the gc-info tool.

Each of these functions takes as a first argument the options structure returned by the create_ipu_config function. The second argument is either an integer or a list. When an integer is supplied, then you will get a single TensorFlow device (/device:IPU:0) configured with the appropriate number of IPUs. When a list of integers is provided, then the system is configured with multiple TensorFlow IPU devices (/device:IPU:0, /device:IPU:1, and so on), configured as specified. For examples, see the documentation in the Python API.
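
As a minimal sketch of the two forms (based on the behaviour described above), a single integer requests one TensorFlow device backed by two IPUs, while a list requests two TensorFlow devices with two IPUs each:

 from tensorflow.python.ipu import utils

cfg = utils.create_ipu_config()

# One TensorFlow device (/device:IPU:0) backed by two IPUs:
cfg = utils.auto_select_ipus(cfg, 2)

# Or: two TensorFlow devices (/device:IPU:0 and /device:IPU:1), each backed
# by two IPUs:
# cfg = utils.auto_select_ipus(cfg, [2, 2])

utils.configure_ipu_system(cfg)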

Once the hardware configuration structure has been configured, the API call ipu.utils.configure_ipu_system must be used to attach and to configure the hardware.

 # Configure the IPU system
cfg = ipu.utils.create_ipu_config(profiling=True, use_poplar_text_report=True)
cfg = ipu.utils.set_ipu_model_options(cfg, compile_ipu_code=False)
cfg = ipu.utils.auto_select_ipus(cfg, NUM_IPUS)
ipu.utils.configure_ipu_system(cfg)

Configuring compilation options

The create_ipu_config function has many options for system configuration. They are divided into roughly three categories.

  1. Profiling and report generation.

  2. IO control.

  3. Graph creation.

In addition to auto_select_ipus and select_ipus, several other functions exist for configuring the hardware and compiler.

  • set_compilation_options sets general options to be passed to the Poplar compiler.

  • set_convolution_options, set_matmul_options and set_pooling_options pass specific options directly to the Poplibs convolution, matmul and pooling operations.

  • set_report_options passes options directly to the Poplar summary report generator.

  • set_ipu_model_options controls the Poplar IPU Model device type.

  • set_recomputation_options turns on recomputation, to reduce the memory requirement at the expense of speed.

  • set_floating_point_behaviour_options controls the IPU's floating point control register.

  • set_optimization_options controls the performance and memory use trade-offs.

More options are available on the create_ipu_config function itself. These mostly control specific features of the Poplar and Poplibs operations.

  • max_scheduler_lookahead_depth controls how far the scheduler can look beyond a given scheduling decision to understand the max-liveness implications. This search space grows very quickly, and searching it can take an unacceptable amount of time for large values of max_scheduler_lookahead_depth.

  • max_scheduler_search_space_size introduces an upper-limit to the size of the schedule search space to guarantee that it will terminate in a reasonable amount of time.

  • scheduler_selection controls the particular scheduler that is selected to perform the scheduling of instructions in the compilation stage. By default, several schedules will be created and the one with the lowest predicted liveness chosen. This can sometimes lead to a poor choice, because the overall peak liveness isn’t always a good measure of the maximum liveness on any one tile of the processor. The available schedulers are:

    • Clustering, which groups clusters of operations together in order to look through stretches of instructions with potentially high liveness.

    • PostOrder, which schedules the instructions in the order which is obtained by walking the graph in ‘post order’.

    • LookAhead, which looks ahead a number of operations from any schedulable one, as given by the max_scheduler_lookahead_depth and max_scheduler_search_space_size options described above. It attempts to look through areas of high liveness.

    • ShortestPath, which schedules the graph giving priority to the shortest path to the root.

See the documentation in Python API for more details.
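
As a hedged illustration of how these scheduler controls might be set (the option names follow the list above, but the exact accepted string values should be checked against the Python API), they are passed as keyword arguments to create_ipu_config:

 from tensorflow.python.ipu import utils

# Select the LookAhead scheduler and bound its search explicitly.
# The string value and the numeric limits here are illustrative.
cfg = utils.create_ipu_config(
    max_scheduler_lookahead_depth=5,
    max_scheduler_search_space_size=64,
    scheduler_selection='LookAhead')
cfg = utils.auto_select_ipus(cfg, 1)
utils.configure_ipu_system(cfg)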

TF_POPLAR_FLAGS environment variable

The options passed through create_ipu_config and configure_ipu_system can be directed at any machine in a TensorFlow cluster. Some configuration options are provided by an environment variable called TF_POPLAR_FLAGS.

Setting TF_POPLAR_FLAGS=--help and executing a TF session will produce some help for each option. For a full list, refer to the API documentation.

--use_synthetic_data will prevent the system from downloading or uploading data to the card when executing code. This is used for testing performance without the overhead of data transfer.

--synthetic_data_initializer is used in combination with the --use_synthetic_data flag. All the inputs to the graph will be initialised directly on the IPU, either randomly (--synthetic_data_initializer=random) or to a constant value X (--synthetic_data_initializer=X).

--max_compilation_threads sets the maximum number of threads which Poplar is allowed to use for compiling the executable.

--max_infeed_threads sets the maximum number of threads which each infeed queue is allowed to use when accessing data from datasets.

--save_vertex_graph dumps the Poplar vertex graph (DOT file) to the given directory.

--save_interval_report dumps the Poplar interval report to the given directory.

--executable_cache_path enables the Poplar executable cache. See below.

--dump_schedule_as_dot creates a file containing the scheduled HLO graph as a graphviz DOT file.

--tensor_map_file_path will cause a JSON file containing the tile mapping of all tensors to be written to this directory.

--fallback_scheduler uses the standard TensorFlow scheduler, instead of the Graphcore specific one.

--allow_nans will allow NaNs.

--log_cycle_count will log the number of cycles used in evaluating the main graph. The numeric argument indicates on which tile the cycle count operation will be created. This may be used as an alternative to profiling for graphs with dynamic control flow.

Multiple options can be specified at the same time by concatenating them like command line switches, for example: --executable_cache_path=/tmp/cache --allow_nans.
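
Because these flags are read from the environment, they can also be set from within a Python script before the IPU system is configured, as in this minimal sketch (the flag values are illustrative):

 import os

# Combine several flags, separated by spaces, as with command line switches.
os.environ['TF_POPLAR_FLAGS'] = ('--executable_cache_path=/tmp/cache '
                                 '--allow_nans')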

Caching of compiled executables

It can take a long time to compile a large fused graph into an executable suitable for the IPU. To prevent the need for compiling every time a TensorFlow process is started, it is possible to enable an executable cache.

The environment variable TF_POPLAR_FLAGS can have the argument --executable_cache_path set to a directory where compiled files will be placed. Fused XLA/HLO graphs are hashed into a 64 bit hash and stored in this directory.

 TF_POPLAR_FLAGS='--executable_cache_path=/tmp/cachedir'

A pair of files will be saved for each compiled graph, the TensorFlow metadata and the Poplar executable.

The cache does not manage the files within the directory. It is your responsibility to delete files. No index is kept of the files, so they can be deleted without risk.

Supported operations

A list of all TensorFlow operations is provided in TensorFlow operators supported by the IPU.

Unsupported operations

TensorFlow core operations which use variable buffers or strings are not supported. For instance, JpegDecode.

Unsupported operations will cause the compilation to fail. By including config=tf.ConfigProto(log_device_placement=True) as an argument to the creation of the session, you can check whether the operations in your graph have been targeted at the Poplar device. For example:

 # Creates a session with log_device_placement set to True.
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))

Adding variables

Do not add variables using tf.Variable([shape], initializer), because they will not work correctly with certain operations, such as assign_add. Make sure that all variables are added using a variable scope that is marked as a resource. This can be done globally, as shown below:

 vscope = tf.get_variable_scope()
vscope.set_use_resource(True)
...
var = tf.get_variable(name, shape=[...], dtype=tf.float32, initializer=tf.constant_initializer(0.5))
...

or locally, in a specific scope:

 with tf.variable_scope("vs", use_resource=True):
  var = tf.get_variable(name, shape=[...], dtype=tf.float32, initializer=tf.constant_initializer(0.5))

Note on the global_step counter

More advanced execution control frameworks in TensorFlow use a scalar counter called global_step to count the number of iterations of training which have occurred. This counter is serialised along with the model. It allows the model to base parameters on the step count, even if the model is run multiple times.

There is an add operation which adds to the global_step scalar on each training pass. If the global_step variable is placed on the IPU device, then this increment operation will occur on the IPU too. This will cause the Poplar training engine to be swapped out for the increment engine on each training step, causing very poor performance.

To avoid this, in the CPU context, use the expression tf.train.get_or_create_global_step() before you create any special training sessions. This will ensure that the global_step variable is on the CPU.

 with tf.device("cpu"):
  tf.train.get_or_create_global_step()

with ipu.ops.ipu_scope("/device:IPU:0"):
  out = ipu.ipu_compiler.compile(model_fn, [...])

IEEE half precision floating point and stochastic rounding

The IPU supports IEEE half-precision floating-point numbers and hardware stochastic rounding. The IPU extensions to TensorFlow expose this floating point functionality through two interfaces.

See the Python API for more details of the functions described here.

Controlling the half precision floating point unit

The floating point unit has a control register that controls its behaviour. When configuring the IPU system hardware, the function tensorflow.python.ipu.utils.set_floating_point_behaviour_options() will set the control register.

The esr bit enables the stochastic rounding unit. Three of the remaining options control the generation of hardware exceptions on various conditions. The nanoo bit selects between clipping on overflow of a half precision number or generating a NaN.
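
As a minimal sketch (the keyword names esr and nanoo follow the bits described above, but the full signature, including the exception-control flags, should be checked against the Python API), the control register might be configured like this:

 from tensorflow.python.ipu import utils

cfg = utils.create_ipu_config()
# Enable stochastic rounding (esr) and NaN-on-overflow (nanoo) behaviour for
# half precision. Keyword names are assumed from the description above.
cfg = utils.set_floating_point_behaviour_options(cfg, esr=True, nanoo=True)
cfg = utils.auto_select_ipus(cfg, 1)
utils.configure_ipu_system(cfg)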

Resetting the global random number seed

The stochastic rounding unit and the TensorFlow stateful random number generators both use a common global random number seed to initialise the random number generator hardware. Each TensorFlow IPU device has its own seed.

By default this seed is set randomly, but it can be reset by using the tensorflow.python.ipu.utils.reset_ipu_seed() function.

Due to the hardware threading in the device, if the seed reset function is used then the target.deterministicWorkers Poplar Engine option will need to be set to true.

This can be done using the tensorflow.python.ipu.utils.set_compilation_options() function.
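
Putting these together, a hedged sketch might look like the following (the dictionary form of the compilation options and the seed value are assumptions; consult the Python API for the exact signatures):

 from tensorflow.python.ipu import utils

cfg = utils.create_ipu_config()
# Needed for reproducible results across the hardware worker threads when
# the seed is reset.
cfg = utils.set_compilation_options(cfg, {"target.deterministicWorkers": "true"})
cfg = utils.auto_select_ipus(cfg, 1)
utils.configure_ipu_system(cfg)

# Reset the seed of the TensorFlow IPU device to a known value.
utils.reset_ipu_seed(42)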

Debugging numerical issues

The values held in a tensor can be printed by calling ipu.ops.internal_ops.print_tensor. This function takes a tensor and will print it to standard error as a side effect.

See tensorflow.python.ipu.ops.internal_ops.print_tensor().
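
A minimal sketch of how this might be used inside a model function (we assume here that the returned op needs to be executed for the print to happen, so it is attached as a control dependency):

 import tensorflow.compat.v1 as tf
from tensorflow.python.ipu.ops import internal_ops

def debug_model(x, w):
  y = x * w
  # Print the intermediate value to standard error as a side effect.
  print_op = internal_ops.print_tensor(y)
  with tf.control_dependencies([print_op]):
    return tf.identity(y)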

Retrieving information about compilation and execution

When developing models for the IPU, it is important to be able to see how compute tiles are being utilised and how the memory is balanced across them. In certain cases, such as when investigating memory over-consumption of a model or investigating tile imbalance issues, it is useful to produce a trace report that will disclose a number of different aspects of graph deployment to the IPU.

Several mechanisms are available to retrieve trace information about the Poplar IPU compilation and execution. Firstly, there are environment variables provided by Poplar itself to dump the compilation and execution reports into a file. The Poplar documentation can give more information about these.

Within TensorFlow, the basic steps for this are:

  • Include an operation in the graph that can retrieve reports

  • Enable tracing in the hardware configuration options

  • Execute the graph, including the operation to retrieve the reports

  • Extract the reports from the returned events

Adding an operation to the graph to get compilation and execution events

Two operations are available to fetch events from the Poplar backend. The first is an operation which fetches the reporting events into a tensor, and is typically executed independently of the main graph. The second is a summary event which will extract the reports along with any other summary events. These events will typically be written into a file using the tensorflow.summary.FileWriter class.

ipu_event_trace()

This is an op which retrieves all IPU events since the last time it was executed. The operation must be placed on the CPU, and returns the events as a one-dimensional tensor of strings containing serialised IPU event protobufs, from tensorflow.compiler.plugin.poplar.driver.trace_pb2.IpuTraceEvent.

This is the example from the tutorial with a few lines of additional code to create a trace report:

 import numpy as np

# IPU imports
from tensorflow.compiler.plugin.poplar.ops import gen_ipu_ops
from tensorflow.python.ipu import utils
from tensorflow.python.ipu.scopes import ipu_scope
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

# Configure argument for targeting the IPU
cfg = utils.create_ipu_config(profiling=True, use_poplar_text_report=True)
cfg = utils.set_ipu_model_options(cfg, compile_ipu_code=False)
cfg = utils.auto_select_ipus(cfg, 1)
utils.configure_ipu_system(cfg)

with tf.device("cpu"):
  pa = tf.placeholder(np.float32, [2], name="a")
  pb = tf.placeholder(np.float32, [2], name="b")
  pc = tf.placeholder(np.float32, [2], name="c")

  # Create a trace event
  report = gen_ipu_ops.ipu_event_trace()


def basic_graph(pa, pb, pc):
  # Do basic addition with tensors
  o1 = pa + pb
  o2 = pa + pc
  simple_graph_output = o1 + o2
  return simple_graph_output


with ipu_scope("/device:IPU:0"):
  result = basic_graph(pa, pb, pc)

with tf.Session() as sess:
  # Run the graph through the session feeding it an arbitrary dictionary
  result = sess.run(result,
                    feed_dict={
                        pa: [1., 1.],
                        pb: [0., 1.],
                        pc: [1., 5.]
                    })

  # Generate report based on the event run in session
  trace_out = sess.run(report)
  trace_report = utils.extract_all_strings_from_event_trace(trace_out)

  # Write trace report to file
  with open('Trace_Event_Report.rep', "w") as f:
    f.write(trace_report)

  # Print the result
  print(result)

The example starts by importing two new elements that are IPU-specific APIs. The first import is gen_ipu_ops, which will generate the actual event trace. The second import is an assortment of utility functions, one of which is used here to parse the event trace to a readable output.

The event trace operation is created by calling gen_ipu_ops.ipu_event_trace(), and the resulting op is assigned to report. This is then fed to the TensorFlow session as a run argument, directly following the session run call for the feed-forward pass through basic_graph. In essence, the report is generated based on the last session graph call. The trace output is then parsed by extract_all_strings_from_event_trace, and a log file is generated. The final step of writing the trace to a file is done near the end of the example, where a file is opened and the parsed trace data written to it.

ipu_compile_summary(name, [op list])

This produces a summary, which can be tied into the rest of the summary system to produce output for TensorBoard. The parameter name is the name of the summary, and the second parameter is a list of ops in the IPU graph. It is best to choose either the inference output for an inference graph, the loss output for an evaluation graph, or the train op for a training graph.

 import tensorflow as tf
from tensorflow.python import ipu

...

tf.summary.scalar('c_out', c)
ipu.summary_ops.ipu_compile_summary('report', [c])
all_sum = tf.summary.merge_all()

...

f = tf.summary.FileWriter('logs')
with tf.Session() as s:
  sum_out, ... = s.run([all_sum, ...])
  f.add_summary(sum_out, 0)

  print("c = {}".format(c))

Enabling tracing in the hardware configuration options

The main function for producing an IPU system hardware configuration is called create_ipu_config. It provides several options for controlling the logging and tracing of Poplar compilations.

profiling

This enables compilation and execution graph reports in Poplar, and generates COMPILE_BEGIN and COMPILE_END events in the trace.

enable_ipu_events

Setting this to True while leaving profiling as False will generate trace events without including the Poplar compilation and execution reports in them. This is useful for getting timing information from the event trace without the overhead of the Poplar reporting.

use_poplar_text_report

Normally, the Poplar reports are generated in JSON format. Setting this parameter to True will generate a text summary report instead of JSON.

use_poplar_cbor_report

Instead of a JSON format report, a CBOR format report will be generated.

profile_execution

When this is set to True, then EXECUTE events will be generated in addition to compilation events. By default the execution events will contain a device type trace. If a different type of execution trace is required, then instead of True, one of ExecutionProfileType.DEVICE_PROFILE, ExecutionProfileType.IPU_PROFILE or ExecutionProfileType.TILE_PROFILE can be used.

report_every_nth_execution

This will restrict the number of execution reports to a subset of all executions.

max_report_size

Poplar reports can get very large. This parameter can be used to restrict the maximum size of report generated. Reports larger than this value will be discarded and a warning message sent to the TensorFlow log.

report_directory

Rather than reports being placed directly into the events, they can be written to a file, and the filename written into the event log. This behaviour is enabled by setting this parameter to a directory name.
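
Taken together, a configuration that enables both compilation and execution profiling, restricts the report volume, and writes the reports to files might look like the following hedged sketch (the option names follow the descriptions above; the numeric values are illustrative):

 from tensorflow.python.ipu import utils

cfg = utils.create_ipu_config(
    profiling=True,                  # COMPILE_BEGIN/COMPILE_END events
    profile_execution=True,          # EXECUTE events as well
    use_poplar_text_report=True,     # text summary instead of JSON
    report_every_nth_execution=10,   # keep only a subset of execution reports
    max_report_size=0x10000000,      # discard reports larger than this
    report_directory='./reports')    # write reports to files, not into events
cfg = utils.auto_select_ipus(cfg, 1)
utils.configure_ipu_system(cfg)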

Extract the reports from the returned events

If the summary event generator has been used, then the events will be inside Tensor type events in the TensorBoard logs. A tool for extracting all of these from the log is available in the Graphcore Toolshed repository on GitHub.

If the individual report gathering event is used then executing it will return an array of Tensors. Within each Tensor is a string which is an IpuTraceEvent of one type.

The IpuTraceEvent is within the tensorflow namespace at tensorflow.compiler.plugin.poplar.driver.trace_pb2.IpuTraceEvent. It is a protobuf that can be decoded from the string into an object with fields containing trace information.

Several utility functions are available for extracting fields.

 rep = sess.run(report)
compile_reports = ipu.utils.extract_compile_reports(rep)
execute_reports = ipu.utils.extract_execute_reports(rep)
events = ipu.utils.extract_all_events(rep)

See the Python API section.

COMPILE_BEGIN

This event is generated when the Poplar compilation begins. It contains the XLA module name, a timestamp and the ordinal of the device that the compilation was done for.

COMPILE_END

This is generated when the Poplar compilation ends. It contains the module name, a timestamp, an ordinal and some compilation trace fields.

  • compilation_report is the Poplar compilation report.

  • duration is the duration of the compilation.

  • tensor_map is a mapping of tensors generated by XLA/HLO instructions to the IPU tiles where those tensors are mapped.

  • instruction_info describes how specific HLO instructions were mapped to Poplar API calls (its format is described below).

The tensor_map field has the following format. It is JSON, but in order to keep it dense, it is mostly JSON lists, instead of keyed dictionaries.

At the top level there is a map called 'mapping', which contains an entry for each XLA computation, keyed by the name of that computation. The value is a list of tensors generated by that computation.

 { 'mapping' : {'computation_0' : [ ... ], 'computation_1' : [ ... ] } }

Each tensor in that list is also a list, consisting of the following items:

  • 0 - name of the XLA/HLO instruction generating the tensor.

  • 1 - the ordinal of the tensor produced by that instruction.

  • 2 - a list of integers indicating the shape of the tensor.

  • 3 - a string indicating the tensor element type.

  • 4 - a Boolean indicating if the tensor contains any constant elements.

  • 5 - a Boolean indicating if the tensor contains any aliases.

  • 6 - the total number of elements in the tensor.

  • 7 - a list of information about the elements on each tile. For example, a tensor entry might look like this:

     [ 'add.0', 0, [32, 32], 'float', 0, 0, 2, 256, [ ... ] ]
    

The list of elements on each tile has one entry per tile that contains elements of the tensor. Each entry is itself a list, containing the following items.

  • the tile index number.

  • the total number of elements on that tile.
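
For example, a small hedged helper like the following could total the number of tensor elements mapped to each tile. The indices follow the layout described above; in particular, the per-tile list is taken as the last entry of each tensor record, and each per-tile entry is read as [tile index, number of elements]:

 import json
from collections import defaultdict

def elements_per_tile(tensor_map_json):
  # tensor_map_json is assumed to be the tensor_map field of a COMPILE_END
  # event, as a JSON string in the format described above.
  mapping = json.loads(tensor_map_json)['mapping']
  totals = defaultdict(int)
  for computation, tensors in mapping.items():
    for tensor in tensors:
      for tile_entry in tensor[-1]:
        tile_index, num_elements = tile_entry[0], tile_entry[1]
        totals[tile_index] += num_elements
  return totals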

The instruction_info field contains information about how the specific HLO instructions were mapped to Poplar API calls. Its format is as follows.

 { 'ml_types': {'instruction': <ml_type>, ... } }

The instruction is the name of the instruction at the HLO level, which is similar to the name in the main compilation report. The ml_type field takes one of the following values, for instructions which are convolution or matmul.

  • 0 - Unclassified

  • 1 - Standalone

  • 2 - The forward pass of training

  • 3 - The input gradient of training

  • 4 - The filter gradient of training

EXECUTE

This event contains the Poplar execution report in the execution_report field.

Using the IPU Model device

If you encounter an out-of-memory error, it can be useful to do the compilation using the IPU Model device, to make debugging the problem easier.

Consider the situation in which the event trace is being monitored to investigate a graph that creates a tile memory imbalance. In those instances, running on the IPU will lead to an out of memory exception before the actual report is generated, and so it is important to target the IPU Model instead of actual hardware.

The IPU Model is an emulator that mimics the IPU computational framework on the host device. It is functionally equivalent to the IPU, but obviously the compute timings will be completely different. There are a number of ways to target the IPU Model, but let’s assume the previous code is saved in the current directory as basic_graph.py, and that all the relevant environment variables required by the IPU are set correctly.

At the terminal command line, one could then type:

 $ TF_POPLAR_FLAGS="--use_ipu_model" python basic_graph.py

See TF_POPLAR_FLAGS environment variable for more details about this environment variable. The console output will include a line such as:

 ...] Device /device:IPU:0 attached to IPU: 0

where the “Device /device:IPU:0 attached to IPU: 0” indicates that the device known to TensorFlow as “/device:IPU:0” is IPU 0. The numbering of IPUs in your machine can be found by using the gc-info -l command.

TensorFlow options for reporting

Some tracing and reporting options are provided by TensorFlow as standard, and can be useful when developing graphs for the IPU.

TF_CPP_MIN_VLOG_LEVEL is an environment variable that enables the logging of the main C++ backend. Setting TF_CPP_MIN_VLOG_LEVEL=1 will show a lot of output. Included in this is the compilation and execution of the IPU code. The output of TF_CPP_MIN_VLOG_LEVEL can be overwhelming. TF_CPP_VMODULE provides a mechanism to reduce the logging to certain translation units (source files). This combination is quite useful:

 TF_CPP_VMODULE='poplar_compiler=1,poplar_executable=1'

Finally, there is an environment variable called XLA_FLAGS which provides options to the general XLA backend. For example, the following will produce a Graphviz DOT file of the optimized HLO graph which is passed to the Poplar compiler.

 XLA_FLAGS='--xla_dump_to=. --xla_dump_hlo_as_dot --xla_dump_hlo_pass_re=forward-allocation --xla_hlo_graph_sharding_color'

The HLO pass forward-allocation is the final pass to run before the HLO instructions are scheduled for passing to the Poplar graph compiler. Running with these options will create a file called something like module_0001.0001.IPU.after_forward-allocation.before_hlo-memory-scheduler.dot. The Graphviz dot command can be used to convert this to an image.

Reading the Poplar textual summary report

If the example code is run, a new file is generated called Trace_Event_Report.rep. This is the Poplar compilation report. The report is broken into a number of sections, but here, we will focus on the first three: Target, Graph, and Memory Usage.

Target describes the target hardware which, in the absence of sharding, will be a single IPU. For instance:

 Target:
  Number of IPUs:         1
  Tiles per IPU:          1,216
  Total Tiles:            1,216
  Memory Per-Tile:        256.0 kB
  Total Memory:           304.0 MB
  Clock Speed (approx):   1,600.0 MHz

It is important to note that this section of the report does not distinguish between hardware and the IPU Model, and in essence it is only dependent on the number of IPUs selected for deployment via the sharding utility.

The next section is Graph, which describes the topology of the deployed graph.

For instance:

 Graph:
  Number of vertices:            1,219
  Number of edges:               1,223
  Number of variables:          30,562
  Number of compute sets:            4

You may see different numbers, depending on the version of the software.

This is from the report generated by the adder example. The graph includes control code, not just compute graph components. Note that the number of vertices in the graph is suspiciously close to the 1,216 tiles on the IPU.

The Memory Usage section gives the memory consumption profile of the graph from a number of different perspectives:

 Memory Usage:
  Total:
    Including Gaps:         23,878,396 B
    Excluding Gaps:
      By Memory Region:
        Non-interleaved:     5,355,604 B
        Interleaved:                 0 B
        Overflowed:                  0 B
      By Data Type:
          Variables:                            39,108 B
          Constants:                                 0 B
          Host Exchange Packet Headers:         10,512 B
          Global Exchange Packet Headers:            0 B
          Stack:                             3,852,288 B
          Vertex Instances:                     14,640 B
          Copy Descriptors:                          0 B
          VectorList Descriptors:                    0 B
          Vertex Field Data:                         0 B
          Control Table:                             0 B
          Control Code:                        851,272 B
          Vertex Code:                         170,788 B
          Internal Exchange Code:               60,792 B
          Host Exchange Code:                  351,328 B
          Global Exchange Code:                      0 B
          Instrumentation Results:               4,876 B
          Shared Code Storage:                       0 B
          Shared Data Storage:                       0 B
        Vertex Data (14,640B):
          By Category:
            Internal vertex state:          9,736 B
            Edge pointers:                  4,904 B
            Copy pointers:                      0 B
            Padding:                            0 B
            Descriptors:                        0 B
          By Type:
            poprand::SetSeedSupervisor                                                  34,048 B
            popops::ScaledAddSupervisor<float,float,true>                                   60 B
            popops::BinaryOp1DSupervisor<popops::expr::BinaryOpType::ADD,float>             16 B

  By Tile (Excluding Gaps):
    Range (KB) Histogram (Excluding Gaps)               Count (tiles)
         4 - 5 ****************************************  1,215
         5 - 6 *                                             1

    Maximum (Including Gaps): 49,184 (48.0 K) on tile 0
    Maximum (Excluding Gaps): 5,780 (5.6 K) on tile 0
    0 tile(s) out of memory

The information is presented in distinct sections. The first is the total memory usage including gaps. This is followed by a breakdown of the gap-excluding memory: first in terms of interleaved and non-interleaved usage, then by data type, followed by vertex data.

A useful portion of the report is the tile memory consumption histogram, which in this simple case is confined to two categories. When the graph is more complex, the histogram will most likely have a more distributed profile. In those instances where there is in fact a tile imbalance, the histogram produced may look more like this:

 By Tile (Excluding Gaps):
    Range (KB) Histogram (Excluding Gaps)               Count (tiles)
       0 -   8 *                                            20
       8 -  16 ****************************************  1,192
      16 -  24 *                                             2
      24 -  32                                               0
      32 -  40                                               0
    .
    .
    .
     488 - 496                                               0
     496 - 504                                               0
     504 - 512 *                                             1
     512 - 520                                               0
     520 - 528                                               0
    .
    .
    .
     784 - 792                                               0
     792 - 800                                               0
     800 - 808                                               0
     808 - 816 *                                             1

    Maximum (Including Gaps): 834,416 (814.9 K) on tile 0
    Maximum (Excluding Gaps): 834,339 (814.8 K) on tile 0
    2 tile(s) out of memory

In this case, two tiles are out of physical memory, while most of the allocation is well within the single tile budget. In those instances where a memory imbalance occurs, the report will produce a detailed description of the operations running on the five most memory-subscribed tiles (regardless of whether they are over their physical limit or not) and list them in descending order of memory consumption.

In the above case, tile 0 is the most over-subscribed tile, and the report produces the following:

 Tile # 0 memory usage:
Memory Usage:
  Total:
    Including Gaps:            834,416 B
    Excluding Gaps:
      By Memory Region:
        Non-interleaved:       122,880 B
        Interleaved:           131,072 B
        Overflowed:            580,387 B
      By Data Type:
          Variables:                           807,658 B
          Constants:                                 0 B
          Host Exchange Packet Headers:          1,160 B
          Global Exchange Packet Headers:            0 B
          Stack:                                 3,168 B
          Vertex Instances:                     12,074 B
          Copy Descriptors:                      1,385 B
          VectorList Descriptors:                  960 B
          Vertex Field Data:                     7,934 B
          Control Table:                             0 B
          Control Code:                              0 B
            .
            .
            .

        Vertex Data (22,353B):
          By Category:
            Internal vertex state:          4,152 B
            Edge pointers:                 10,798 B
            .
            .
            .
          By Type:
            poplin::ConvPartial1x1Out<float,float,true,false>                                                             6,648 B
            poplar_rt::DstStridedCopy64BitMultiAccess                                                                     2,669 B
            popops::Reduce<popops::ReduceAdd,float,float,false,0>                                                         2,542 B
            popops::ScaledAddSupervisor<float,float,true>                                                                 1,440 B
            poplar_rt::StridedCopyDA32                                                                                    1,374 B
            poplar_rt::DstStridedCopyDA32                                                                                 1,101 B
            popops::BinaryOp1DSupervisor<popops::expr::BinaryOpType::MULTIPLY,float>                                        752 B
            .
            .
            .

This information can be very useful when tracking down the source of the over-allocation.

Producing an ELF image of the compilation

There is another method to produce much of the same detailed information provided in the trace event report. This generates code for IPU hardware (not an emulator on the host) and then extracts the memory allocation information from the ELF object file created at compile time. This technique is described briefly here, showing only how the object file is created and how the memory-per-tile information is extracted.

When compiling the graph, a Poplar engine option can be used to dump the ELF file to a specified location.

 POPLAR_ENGINE_OPTIONS='{"target.saveArchive":"binaries.a", "debug.allowOutOfMemory": "true"}' python basic_graph.py

The file binaries.a is created, which is an archive file of the compiled graph. To extract the memory size information from it, run the following command:

 $ size -A binaries.a > tiles_elf.txt

This writes a tile-by-tile breakdown of the memory consumed, in bytes, to the file tiles_elf.txt. All of the memory allocated is part of the text section. This can be extracted from the tiles’ ELF files to produce a single column where each entry is the size of the text section corresponding to a tile:

 $ size -A binaries.a | grep -e ".text" | awk '{print $2}' > memory_usage_per_tile.txt

The file memory_usage_per_tile.txt will contain this memory allocation information. Further details of the deployed graph can be extracted with this approach.
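
As a further hedged example, the single-column file produced above can be post-processed in Python to find the most heavily used tiles (this assumes one integer, in bytes, per line, one line per tile, as generated by the command above):

 # Read the per-tile .text sizes extracted above and report the largest tiles.
with open('memory_usage_per_tile.txt') as f:
  sizes = [int(line) for line in f if line.strip()]

ranked = sorted(enumerate(sizes), key=lambda t: t[1], reverse=True)
print("Total memory: {} bytes over {} tiles".format(sum(sizes), len(sizes)))
for tile, size in ranked[:5]:
  print("Tile {}: {} bytes".format(tile, size))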

Dumping auxiliary Poplar information

Two environment variable flags are available to obtain extra Poplar information.

Poplar vertex graph

The Poplar vertex graph is a DOT file containing a complete description of the lowered Poplar graph. Each node in the graph represents one vertex in the Poplar graph operating on one region of a tensor.

It can be used for generating a Graphcore circular graph image.

Poplar interval report

The interval report is a CSV file describing the number of tiles executing, exchanging and syncing on each instruction cycle.

It can be used for generating a Graphcore linear activity diagram.

The TF_POPLAR_FLAGS environment variable describes how to set the environment flags correctly.

Using IPU optimized operations

Several custom versions of operators are provided to target operators available in Poplibs. See the Python API for more details.

Dropout

The Poplibs version of dropout does not need to store the dropout mask between the forward and backward parts of the graph, saving memory.

See tensorflow.python.ipu.ops.rand_ops.dropout().

Embedding lookup

This is a version of embedding lookup which will produce a smaller memory footprint for small lookups. Instead of using a dynamic lookup into the main embedding dictionary, it uses a one-hot operator and a multiply.

See tensorflow.python.ipu.embedding_ops.embedding_lookup().

Group normalization

Group normalization is an alternative to batch normalization, and produces smaller and more optimized graphs.

The original paper on group normalization: “Group Normalization”, Yuxin Wu, Kaiming He.

See tensorflow.python.ipu.normalization_ops.group_norm().

Instance normalization

Instance normalization is another alternative to batch normalization.

The original paper on instance normalization: “Instance Normalization: The Missing Ingredient for Fast Stylization” Dmitry Ulyanov, Andrea Vedaldi, Victor Lempitsky.

See tensorflow.python.ipu.normalization_ops.group_norm().

Layer normalization

Layer normalization is another alternative to batch normalization.

The original paper on layer normalization: “Layer Normalization” Jimmy Lei Ba, Jamie Ryan Kiros, Geoffrey E. Hinton.

See tensorflow.python.ipu.normalization_ops.layer_norm().

Training a model

TensorFlow XLA and Poplar provide the ability to combine an entire training graph into a single operation in the TensorFlow graph. This accelerates training by removing the need to make calls to the IPU hardware for each operation in the graph.

However, if the Python code with the training pass in it is called multiple times, once for each batch in the training data set, then there is still the overhead of calling the hardware for each batch.

The Graphcore IPU support for TensorFlow provides three mechanisms to improve the training performance: training loops, data set feeds, and replicated graphs.

Training loops, datasets and feed queues

By placing the training operations inside a loop, they can be executed multiple times without returning control to the host. It is possible to use a standard TensorFlow while_loop operation to wrap the training operation, but the IPU library provides a convenient and feature-rich version.

Normally when TensorFlow runs, operations which are not inside a loop will be executed once, and those operations will return one or more tensors with fixed values. However, when a training operation is placed into a loop, the inputs to that training operation need to provide a stream of values. Standard TensorFlow Python feed dictionaries cannot provide data in this form, so when training in a loop, data must be fed from a TensorFlow DataSet.

More information can be found on the DataSet class and its use in normal operation at https://www.tensorflow.org/guide/performance/datasets. TensorFlow provides many pre-configured DataSets for use in training models. See the site https://www.tensorflow.org/datasets.

To construct a system that will train in a loop, you will need to do the following:

  • Wrap your optimizer training operation in a loop.

  • Create an IPUInfeedQueue to feed data to that loop.

  • Create an IPUOutfeedQueue to take results out of that loop.

  • Create a TensorFlow DataSet to provide data to the input queue.

The following example shows how to construct a trivial DataSet, attach it to a model using an IPUInfeedQueue, feed results into an IPUOutfeedQueue, and construct a loop.

 from tensorflow.python.ipu import ipu_compiler
from tensorflow.python.ipu import ipu_infeed_queue
from tensorflow.python.ipu import ipu_outfeed_queue
from tensorflow.python.ipu import loops
from tensorflow.python.ipu import scopes
from tensorflow.python.ipu import utils
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

# The dataset for feeding the graphs
ds = tf.data.Dataset.from_tensors(tf.constant(1.0, shape=[800]))
ds = ds.map(lambda x: [x, x])
ds = ds.repeat()

# The host side queues
infeed_queue = ipu_infeed_queue.IPUInfeedQueue(ds, feed_name="infeed")
outfeed_queue = ipu_outfeed_queue.IPUOutfeedQueue(feed_name="outfeed")


# The device side main
def body(x1, x2):
  d1 = x1 + x2
  d2 = x1 - x2
  outfeed = outfeed_queue.enqueue({'d1': d1, 'd2': d2})
  return outfeed


def my_net():
  r = loops.repeat(10, body, [], infeed_queue)
  return r


with scopes.ipu_scope('/device:IPU:0'):
  run_loop = ipu_compiler.compile(my_net, inputs=[])

# The outfeed dequeue has to happen after the outfeed enqueue
dequeue_outfeed = outfeed_queue.dequeue()

# Configure the hardware
config = utils.create_ipu_config()
config = utils.auto_select_ipus(config, 1)
utils.configure_ipu_system(config)

with tf.Session() as sess:
  sess.run(infeed_queue.initializer)

  sess.run(run_loop)
  result = sess.run(dequeue_outfeed)
  print(result)

In this case the DataSet is a trivial one. It constructs a base DataSet from a single TensorFlow constant, and then maps the output of that DataSet into a pair of tensors. It then arranges for the DataSet to be repeated indefinitely.

After the DataSet is constructed, the two data feed queues are constructed. The IPUInfeedQueue takes the DataSet as a parameter, along with its name. Every queue in the system must have a unique name.

The IPUOutfeedQueue has extra options to control how it collects and outputs the data sent to it. None of these are used in this example.

Now that we have the DataSet and the queues for getting data in and out of the device side code, we can construct the device side part of the model. In this example, the body function constructs a very simple model, which does not even have an optimizer. It takes the two data samples provided by the DataSet, performs some simple maths on them, and inserts the results into the output queue.

Typically, in this function, the full ML model would be constructed, and a TensorFlow Optimizer would be used to generate a backward pass and variable update operations. The returned data would typically be a loss value, or perhaps nothing at all if all we do is call the training operation.

The my_net function is where the loops.repeat function is called. This wraps the body function in a loop. It takes as the first parameter the number of times to execute the operation, in this case 10. It also takes the function that generated the body of the loop, in this case the function body, a list of extra parameters to pass to the body, in this case none, and finally the infeed queue which will feed data into the loop.

Next we create an IPU scope at the top level and call ipu_compiler.compile passing the my_net function, to create the training loop in the main graph. The output of the ipu_compiler.compile will be an operation that can be called to execute the training loop.

Finally, we create an operation which can be used to fetch results from the outfeed queue. Note that it isn’t necessary to use an outfeed queue if you do not wish to receive any per-sample output from the training loop. If all you require is the final value of a tensor, then it can be output normally without the need for a queue.

If you run this example then you will find that the result is a Python dictionary containing two numpy arrays. The first is the d1 array and will contain x1 + x2 for each iteration in the loop. The second is the d2 array and will contain x1 - x2 for each iteration in the loop.

See entries in the Python API for more details.

Replicated graphs

To improve performance, multiple IPUs can be configured to run in a data parallel mode. The graph is said to be replicated across multiple IPUs.

Selecting the number of replicas

During system configuration, you specify the number of IPUs for the TensorFlow device using the auto_select_ipus() function, or the select_ipus() function.

A graph can be sharded across multiple IPUs (model parallelism), and then replicated across IPUs (data parallelism). When specifying the number of IPUs in the system, you must specify a multiple of the number of shards used by the graph.

For instance, if a graph is sharded over 2 IPUs, and you specify 8 IPUs to the auto_select_ipus function, then the graph will be replicated four times.
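For example, the following configuration requests eight IPUs; a graph sharded over two IPUs would then run as four replicas.

 from tensorflow.python.ipu import utils

# Request 8 IPUs in total. A graph sharded over 2 IPUs will be
# replicated 8 / 2 = 4 times.
config = utils.create_ipu_config()
config = utils.auto_select_ipus(config, 8)
utils.configure_ipu_system(config)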

Supplying data

Data must be fed to a replicated graph using DataSets and infeeds. The IPUInfeedQueue and IPUOutfeedQueue classes require the number of replicas to be passed into the constructor in the replication_factor parameter.
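For example, with four replicas the queues might be constructed as follows. This is a sketch based on the constructor signatures documented in the Python API section below.

 from tensorflow.python.ipu import ipu_infeed_queue
from tensorflow.python.ipu import ipu_outfeed_queue
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

ds = tf.data.Dataset.from_tensors(tf.constant(1.0, shape=[800]))
ds = ds.repeat()

# Both queues must be told how many replicas will be fed.
infeed_queue = ipu_infeed_queue.IPUInfeedQueue(
    ds, feed_name="replicated_infeed", replication_factor=4)
outfeed_queue = ipu_outfeed_queue.IPUOutfeedQueue(
    feed_name="replicated_outfeed", replication_factor=4)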

Performing parameter updates

Each replica maintains its own copy of the graph, but during training it is important to ensure that the graph parameters are updated so that they are in sync across replicas.

A wrapper for standard TensorFlow optimizers is used to add extra operations to the parameter update nodes in the graph to average updates across replicas. It is called CrossReplicaOptimizer. See the Python API for more details.
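A sketch of wrapping a standard optimizer is shown below. The import path is an assumption and may differ between releases; check the Python API for the exact module.

 import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()
# Assumed import path; see the Python API for the exact module.
from tensorflow.python.ipu.optimizers import CrossReplicaOptimizer

def training_op(loss):
  # The wrapper averages gradients across replicas before applying
  # the update, keeping the parameters in sync.
  optimizer = CrossReplicaOptimizer(tf.train.GradientDescentOptimizer(0.01))
  return optimizer.minimize(loss)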

Pipelined training

The IPU pipeline API creates a series of computational stages, where the outputs of one stage are the inputs to the next one. These stages are then executed in parallel across multiple IPUs. This approach can be used to split a model so that different layers, or groups of layers, are executed on different IPUs.

This improves utilization of the hardware when a model is too large to fit into a single IPU and must be sharded across multiple IPUs.

Each of the stages is a set of operations, and is described using a Python function, in much the same way as ipu_compiler.compile takes a function that describes the graph to compile onto the IPU.

See the Python API for more specific details of the ipu.pipeline operator.

The pipeline API requires data inputs to be provided by a tf.DataSet source connected via an infeed operation. If you would like per-sample output, for instance the loss, then this will have to be provided by an outfeed operation.

The computational stages can be interleaved on the devices in two different ways as described by the pipeline_schedule parameter. By default the API will use the PipelineSchedule.Grouped mode, where the forward passes are grouped together, and the backward passes are grouped together. The alternative is the PipelineSchedule.Interleaved, where the forward and backward passes are interleaved, so that fewer activations need to be stored.

Sharded scheduling

Sharded pipeline schedule illustration

Interleaved scheduling

Interleaved pipeline schedule illustration

Grouped scheduling

Grouped pipeline schedule illustration

Pipeline stage inputs and outputs

The first pipeline stage needs to have inputs which are a combination of the tensors from the DataSet, and the tensors given as arguments to the pipeline operation. Any data which changes for every sample or minibatch of the input should be included in the DataSet, while data which can vary only on each run of the pipeline should be passed as arguments to the pipeline operation. Parameters like the learning rate would fit into this latter case.

Every subsequent pipeline stage must have its inputs as the outputs of the previous stage. Note that things like the learning rate must be threaded through each pipeline stage until they are used.

Applying an optimizer to the graph

The optimizer must be applied by creating it in a special optimizer function and then returning a handle to it from that function. The function is passed into the optimizer_function argument of the pipeline operation.

When a pipeline is running it will accumulate the gradients from each step of the pipeline and only apply the updates to the graph parameters at the end of each pipeline run, given by the pipeline_depth parameter. Consequently it is important for the system to have more knowledge of the optimizer and so it must be given to the pipeline operator using this function.

Device mapping

By default the pipeline operation will map the pipeline stages onto IPUs in order to minimise the inter-IPU communication lengths. If you need to override this order, you can use the device_mapping parameter.
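
The sketch below pulls the preceding pieces together: two stages, a learning rate threaded through the stages, an optimizer function, a pipeline depth, an explicit device mapping and a schedule. It is an illustrative outline only; the exact keyword names of the pipeline operator (inputs, pipeline_depth, device_mapping, pipeline_schedule) should be checked against the Python API.

 from tensorflow.python.ipu import ipu_compiler, ipu_infeed_queue, ipu_outfeed_queue, scopes
from tensorflow.python.ipu.ops import pipelining_ops
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

dataset = ...  # a tf.data.Dataset yielding (features, labels)
infeed_queue = ipu_infeed_queue.IPUInfeedQueue(dataset, feed_name="pipeline_infeed")
outfeed_queue = ipu_outfeed_queue.IPUOutfeedQueue(feed_name="pipeline_outfeed")

def stage1(lr, features, labels):
  partial = tf.layers.dense(features, 256, activation=tf.nn.relu)
  # The learning rate is threaded through to the stage that needs it.
  return lr, partial, labels

def stage2(lr, partial, labels):
  logits = tf.layers.dense(partial, 10)
  loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)
  return lr, loss

def optimizer_function(lr, loss):
  optimizer = tf.train.GradientDescentOptimizer(lr)
  return pipelining_ops.OptimizerFunctionOutput(optimizer, loss)

def my_net(lr):
  return pipelining_ops.pipeline(
      computational_stages=[stage1, stage2],
      pipeline_depth=8,
      inputs=[lr],                     # varies per run, not per sample
      infeed_queue=infeed_queue,
      outfeed_queue=outfeed_queue,
      optimizer_function=optimizer_function,
      device_mapping=[0, 1],           # stage1 on IPU 0, stage2 on IPU 1
      pipeline_schedule=pipelining_ops.PipelineSchedule.Grouped)

with tf.device("cpu"):
  lr = tf.placeholder(tf.float32, [])

with scopes.ipu_scope("/device:IPU:0"):
  pipeline_op = ipu_compiler.compile(my_net, inputs=[lr])

Running pipeline_op in a session executes one full pipeline run of pipeline_depth mini-batches, after which the accumulated gradients are applied.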

Dataset benchmarking

In order to fully utilize the potential of the IPU, the tf.data.Dataset used by the IPUInfeedQueue needs to be optimal so that the IPU is not constantly waiting for more data to become available.

To benchmark your tf.data.Dataset, you can use the ipu.dataset_benchmark tool. See the Python API for details of the ipu.dataset_benchmark functions, which allow you to measure the maximum throughput of your tf.data.Dataset.

If the throughput of your tf.data.Dataset is the bottleneck, you can try to optimize it using the techniques described in the TensorFlow guide to tf.data performance (linked above), such as caching, prefetching and parallel transformations.

Accessing the JSON data

The functions in ipu.dataset_benchmark return the JSON as a string which can be loaded into a JSON object using the native JSON library, for example:

 import json

from tensorflow.python import ipu
import tensorflow.compat.v1 as tf

# Create your tf.data.Dataset
dataset = ...
benchmark_op = ipu.dataset_benchmark.dataset_benchmark(dataset, 10, 512)

with tf.Session() as sess:
    json_string = sess.run(benchmark_op)
    json_object = json.loads(json_string[0])

Troubleshooting

The following error (especially the lines containing VariableV2) indicates that a variable has been created which is not a resource variable.

 InvalidArgumentError (see above for traceback): Cannot assign a device for operation
  'InceptionV1/Logits/Conv2d_0c_1x1/biases': Could not satisfy explicit device specification
  '/device:IPU:0' because no supported kernel for IPU devices is available.
Colocation Debug Info:
Colocation group had the following types and devices:
Const: CPU IPU XLA_CPU
Identity: CPU IPU XLA_CPU
Fill: CPU IPU XLA_CPU
Assign: CPU
VariableV2: CPU

Example using IPUEstimator

This example shows how to use the IPUEstimator to train a simple CNN on the CIFAR-10 dataset. XLA compilation is handled automatically by the IPUEstimator, so the model_fn should not be manually compiled with ipu_compiler.

 from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import argparse
import time

import tensorflow.compat.v1 as tf

from tensorflow.keras import Sequential
from tensorflow.keras.datasets import cifar10
from tensorflow.keras.layers import Conv2D, MaxPooling2D
from tensorflow.keras.layers import Dense, Dropout, Activation, Flatten
from tensorflow.python import ipu

NUM_CLASSES = 10


def model_fn(features, labels, mode, params):
  """A simple CNN based on https://keras.io/examples/cifar10_cnn/"""

  model = Sequential()
  model.add(Conv2D(32, (3, 3), padding="same"))
  model.add(Activation("relu"))
  model.add(Conv2D(32, (3, 3)))
  model.add(Activation("relu"))
  model.add(MaxPooling2D(pool_size=(2, 2)))
  model.add(Dropout(0.25))

  model.add(Conv2D(64, (3, 3), padding="same"))
  model.add(Activation("relu"))
  model.add(Conv2D(64, (3, 3)))
  model.add(Activation("relu"))
  model.add(MaxPooling2D(pool_size=(2, 2)))
  model.add(Dropout(0.25))

  model.add(Flatten())
  model.add(Dense(512))
  model.add(Activation("relu"))
  model.add(Dropout(0.5))
  model.add(Dense(NUM_CLASSES))

  logits = model(features, training=mode == tf.estimator.ModeKeys.TRAIN)

  loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)

  if mode == tf.estimator.ModeKeys.EVAL:
    predictions = tf.argmax(input=logits, axis=-1)
    eval_metric_ops = {
        "accuracy": tf.metrics.accuracy(labels=labels,
                                        predictions=predictions),
    }
    return tf.estimator.EstimatorSpec(mode,
                                      loss=loss,
                                      eval_metric_ops=eval_metric_ops)
  if mode == tf.estimator.ModeKeys.TRAIN:
    optimizer = tf.train.GradientDescentOptimizer(params["learning_rate"])
    train_op = optimizer.minimize(loss=loss)
    return tf.estimator.EstimatorSpec(mode=mode, loss=loss, train_op=train_op)
  raise NotImplementedError(mode)


def parse_args():
  parser = argparse.ArgumentParser()

  parser.add_argument(
      "--test-only",
      action="store_true",
      help="Skip training and test using latest checkpoint from model_dir.")

  parser.add_argument("--batch-size",
                      type=int,
                      default=32,
                      help="The batch size.")

  parser.add_argument(
      "--iterations-per-loop",
      type=int,
      default=100,
      help="The number of iterations (batches) per loop on IPU.")

  parser.add_argument("--log-interval",
                      type=int,
                      default=10,
                      help="Interval at which to log progress.")

  parser.add_argument("--summary-interval",
                      type=int,
                      default=1,
                      help="Interval at which to write summaries.")

  parser.add_argument("--training-steps",
                      type=int,
                      default=200000,
                      help="Total number of training steps.")

  parser.add_argument(
      "--learning-rate",
      type=float,
      default=0.01,
      help="The learning rate used with stochastic gradient descent.")

  parser.add_argument(
      "--model-dir",
      help="Directory where checkpoints and summaries are stored.")

  return parser.parse_args()


def create_ipu_estimator(args):
  ipu_options = ipu.utils.create_ipu_config(
      profiling=False,
      use_poplar_text_report=False,
  )

  ipu.utils.auto_select_ipus(ipu_options, num_ipus=1)

  ipu_run_config = ipu.ipu_run_config.IPURunConfig(
      iterations_per_loop=args.iterations_per_loop,
      ipu_options=ipu_options,
  )

  config = ipu.ipu_run_config.RunConfig(
      ipu_run_config=ipu_run_config,
      log_step_count_steps=args.log_interval,
      save_summary_steps=args.summary_interval,
      model_dir=args.model_dir,
  )

  return ipu.ipu_estimator.IPUEstimator(
      config=config,
      model_fn=model_fn,
      params={"learning_rate": args.learning_rate},
  )


def train(ipu_estimator, args, x_train, y_train):
  """Train a model on IPU and save checkpoints to the given `args.model_dir`."""
  def input_fn():
    # If using Dataset.from_tensor_slices(), the data will be embedded
    # into the graph as constants, which makes the training graph very
    # large and impractical. So use Dataset.from_generator() here instead,
    # but add prefetching and caching to improve performance.

    def generator():
      return zip(x_train, y_train)

    types = (x_train.dtype, y_train.dtype)
    shapes = (x_train.shape[1:], y_train.shape[1:])

    dataset = tf.data.Dataset.from_generator(generator, types, shapes)
    dataset = dataset.prefetch(len(x_train)).cache()
    dataset = dataset.repeat()
    dataset = dataset.shuffle(len(x_train))
    dataset = dataset.batch(args.batch_size, drop_remainder=True)

    return dataset

  # Training progress is logged as INFO, so enable that logging level
  tf.logging.set_verbosity(tf.logging.INFO)

  t0 = time.time()
  ipu_estimator.train(input_fn=input_fn, steps=args.training_steps)
  t1 = time.time()

  duration_seconds = t1 - t0
  images_per_second = args.training_steps * args.batch_size / duration_seconds
  print("Took {:.2f} minutes, i.e. {:.0f} images per second".format(
      duration_seconds / 60, images_per_second))


def calc_batch_size(num_examples, batches_per_loop, batch_size):
  """Reduce the batch size if needed to cover all examples without a remainder."""
  assert batch_size > 0
  assert num_examples % batches_per_loop == 0
  while num_examples % (batch_size * batches_per_loop) != 0:
    batch_size -= 1
  return batch_size


def test(ipu_estimator, args, x_test, y_test):
  """Test the model on IPU by loading weights from the final checkpoint in the
  given `args.model_dir`."""

  num_test_examples = len(x_test)

  test_batch_size = calc_batch_size(num_test_examples,
                                    args.iterations_per_loop, args.batch_size)

  if test_batch_size != args.batch_size:
    print("Test batch size changed to {}.".format(test_batch_size))

  def input_fn():
    dataset = tf.data.Dataset.from_tensor_slices((x_test, y_test))
    dataset = dataset.batch(test_batch_size, drop_remainder=True)
    return dataset

  num_steps = num_test_examples // test_batch_size
  metrics = ipu_estimator.evaluate(input_fn=input_fn, steps=num_steps)
  test_loss = metrics["loss"]
  test_accuracy = metrics["accuracy"]

  print("Test loss: {:g}".format(test_loss))
  print("Test accuracy: {:.2f}%".format(100 * test_accuracy))


def main():
  args = parse_args()
  train_data, test_data = cifar10.load_data()

  num_test_examples = len(test_data[0])
  if num_test_examples % args.iterations_per_loop != 0:
    raise ValueError(("iterations_per_loop ({}) must evenly " +
                      "divide the number of test examples ({})").format(
                          args.iterations_per_loop, num_test_examples))

  ipu_estimator = create_ipu_estimator(args)

  def normalise(x, y):
    return x.astype("float32") / 255.0, y.astype("int32")

  if not args.test_only:
    print("Training...")
    x_train, y_train = normalise(*train_data)
    train(ipu_estimator, args, x_train, y_train)

  print("Testing...")
  x_test, y_test = normalise(*test_data)
  test(ipu_estimator, args, x_test, y_test)


if __name__ == "__main__":
  main()

Example using IPUPipelineEstimator

This example shows how to use the IPUPipelineEstimator to train a simple CNN on the CIFAR-10 dataset. It can be compared to the example using the IPUEstimator (Example using IPUEstimator) to see the changes required to add pipelined execution to a model.

 from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import argparse
import time

import tensorflow.compat.v1 as tf

from tensorflow.keras.datasets import cifar10
from tensorflow.keras.layers import Conv2D, MaxPooling2D
from tensorflow.keras.layers import Dense, Dropout, Activation, Flatten
from tensorflow.python import ipu

NUM_CLASSES = 10


def model_fn(mode, params):
  """A simple CNN based on https://keras.io/examples/cifar10_cnn/ split
  into two pipeline stages placed on different IPUs."""

  # Tell the dropout layers whether we are training to avoid a placeholder.
  is_training = mode == tf.estimator.ModeKeys.TRAIN

  def stage1(features, labels):
    partial = Conv2D(32, (3, 3), padding="same")(features)
    partial = Activation("relu")(partial)
    partial = Conv2D(32, (3, 3))(partial)
    partial = Activation("relu")(partial)
    partial = MaxPooling2D(pool_size=(2, 2))(partial)
    partial = Dropout(0.25)(partial, training=is_training)

    return partial, labels

  def stage2(partial, labels):
    partial = Conv2D(64, (3, 3), padding="same")(partial)
    partial = Activation("relu")(partial)
    partial = Conv2D(64, (3, 3))(partial)
    partial = Activation("relu")(partial)
    partial = MaxPooling2D(pool_size=(2, 2))(partial)
    partial = Dropout(0.25)(partial, training=is_training)

    partial = Flatten()(partial)
    partial = Dense(512)(partial)
    partial = Activation("relu")(partial)
    partial = Dropout(0.5)(partial, training=is_training)
    logits = Dense(NUM_CLASSES)(partial)

    loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)

    if mode == tf.estimator.ModeKeys.TRAIN:
      # This return value is passed to the `optimizer_function`.
      return loss

    if mode == tf.estimator.ModeKeys.EVAL:
      predictions = tf.argmax(input=logits, axis=1, output_type=tf.int32)
      # These return values are passed to the `eval_metrics_fn`.
      return loss, predictions, labels

    raise NotImplementedError(mode)

  def optimizer_function(loss):
    optimizer = tf.train.GradientDescentOptimizer(params["learning_rate"])
    return ipu.ops.pipelining_ops.OptimizerFunctionOutput(optimizer, loss)

  def eval_metrics_fn(loss, predictions, labels):
    # This is executed on the host.
    return {
        "loss": loss,
        "accuracy": tf.metrics.accuracy(predictions=predictions,
                                        labels=labels),
    }

  return ipu.ipu_pipeline_estimator.IPUPipelineEstimatorSpec(
      mode,
      computational_stages=[stage1, stage2],
      optimizer_function=optimizer_function,
      eval_metrics_fn=eval_metrics_fn,
      pipeline_depth=params["pipeline_depth"])


def parse_args():
  parser = argparse.ArgumentParser()

  parser.add_argument(
      "--test-only",
      action="store_true",
      help="Skip training and test using latest checkpoint from model_dir.")

  parser.add_argument("--batch-size",
                      type=int,
                      default=16,
                      help="The batch size.")

  parser.add_argument(
      "--pipeline-depth",
      type=int,
      default=4,
      help="The the number of batches that will be pipelined together.")

  parser.add_argument(
      "--iterations-per-loop",
      type=int,
      default=100,
      help="The number of iterations (pipelines executions) per loop on IPU.")

  parser.add_argument("--log-interval",
                      type=int,
                      default=10,
                      help="Interval at which to log progress.")

  parser.add_argument("--summary-interval",
                      type=int,
                      default=1,
                      help="Interval at which to write summaries.")

  parser.add_argument("--training-steps",
                      type=int,
                      default=100000,
                      help="Total number of training steps.")

  parser.add_argument(
      "--learning-rate",
      type=float,
      default=0.01,
      help="The learning rate used with stochastic gradient descent.")

  parser.add_argument(
      "--model-dir",
      help="Directory where checkpoints and summaries are stored.")

  return parser.parse_args()


def create_ipu_estimator(args):
  num_ipus_in_pipeline = 2

  ipu_options = ipu.utils.create_ipu_config()
  ipu.utils.auto_select_ipus(ipu_options, num_ipus_in_pipeline)

  ipu_run_config = ipu.ipu_run_config.IPURunConfig(
      num_shards=num_ipus_in_pipeline,
      iterations_per_loop=args.iterations_per_loop,
      ipu_options=ipu_options,
  )

  config = ipu.ipu_run_config.RunConfig(
      ipu_run_config=ipu_run_config,
      log_step_count_steps=args.log_interval,
      save_summary_steps=args.summary_interval,
      model_dir=args.model_dir,
  )

  return ipu.ipu_pipeline_estimator.IPUPipelineEstimator(
      config=config,
      model_fn=model_fn,
      params={
          "learning_rate": args.learning_rate,
          "pipeline_depth": args.pipeline_depth,
      },
  )


def train(ipu_estimator, args, x_train, y_train):
  """Train a model on IPU and save checkpoints to the given `args.model_dir`."""
  def input_fn():
    # If using Dataset.from_tensor_slices(), the data will be embedded
    # into the graph as constants, which makes the training graph very
    # large and impractical. So use Dataset.from_generator() here instead,
    # but add prefetching and caching to improve performance.

    def generator():
      return zip(x_train, y_train)

    types = (x_train.dtype, y_train.dtype)
    shapes = (x_train.shape[1:], y_train.shape[1:])

    dataset = tf.data.Dataset.from_generator(generator, types, shapes)
    dataset = dataset.prefetch(len(x_train)).cache()
    dataset = dataset.repeat()
    dataset = dataset.shuffle(len(x_train))
    dataset = dataset.batch(args.batch_size, drop_remainder=True)

    return dataset

  # Training progress is logged as INFO, so enable that logging level
  tf.logging.set_verbosity(tf.logging.INFO)

  t0 = time.time()
  ipu_estimator.train(input_fn=input_fn, steps=args.training_steps)
  t1 = time.time()

  duration_seconds = t1 - t0
  images_per_step = args.batch_size * args.pipeline_depth
  images_per_second = args.training_steps * images_per_step / duration_seconds
  print("Took {:.2f} minutes, i.e. {:.0f} images per second".format(
      duration_seconds / 60, images_per_second))


def calc_batch_size(num_examples, batches_per_loop, batch_size):
  """Reduce the batch size if needed to cover all examples without a remainder."""
  assert batch_size > 0
  assert num_examples % batches_per_loop == 0
  while num_examples % (batch_size * batches_per_loop) != 0:
    batch_size -= 1
  return batch_size


def test(ipu_estimator, args, x_test, y_test):
  """Test the model on IPU by loading weights from the final checkpoint in the
  given `args.model_dir`."""

  num_test_examples = len(x_test)

  batches_per_loop = args.pipeline_depth * args.iterations_per_loop
  test_batch_size = calc_batch_size(num_test_examples, batches_per_loop,
                                    args.batch_size)

  if test_batch_size != args.batch_size:
    print("Test batch size changed to {}.".format(test_batch_size))

  def input_fn():
    dataset = tf.data.Dataset.from_tensor_slices((x_test, y_test))
    dataset = dataset.batch(test_batch_size, drop_remainder=True)
    return dataset

  num_steps = num_test_examples // (test_batch_size * args.pipeline_depth)
  metrics = ipu_estimator.evaluate(input_fn=input_fn, steps=num_steps)
  test_loss = metrics["loss"]
  test_accuracy = metrics["accuracy"]

  print("Test loss: {:g}".format(test_loss))
  print("Test accuracy: {:.2f}%".format(100 * test_accuracy))


def main():
  args = parse_args()
  train_data, test_data = cifar10.load_data()

  num_test_examples = len(test_data[0])
  batches_per_loop = args.pipeline_depth * args.iterations_per_loop
  if num_test_examples % batches_per_loop != 0:
    raise ValueError(("pipeline_depth * iterations_per_loop ({} * {}) must " +
                      "evenly divide the number of test examples ({})").format(
                          args.pipeline_depth, args.iterations_per_loop,
                          num_test_examples))

  ipu_estimator = create_ipu_estimator(args)

  def normalise(x, y):
    return x.astype("float32") / 255.0, y.astype("int32")

  if not args.test_only:
    print("Training...")
    x_train, y_train = normalise(*train_data)
    train(ipu_estimator, args, x_train, y_train)

  print("Testing...")
  x_test, y_test = normalise(*test_data)
  test(ipu_estimator, args, x_test, y_test)


if __name__ == "__main__":
  main()

Distributed training example

This example shows how to use the IPUEstimator with the IPUMultiWorkerStrategy to perform distributed training of a model on the MNIST dataset.

The example is based on the following official tutorial with some modifications for usage with the IPU: https://www.tensorflow.org/tutorials/distribute/multi_worker_with_estimator

We highlight the changes needed to convert code using IPUEstimator to support distributed training below.

The input function

In multi-worker training, it is necessary to shard the dataset such that each worker processes distinct portions of the dataset.

When used in a distributed context, the input function is passed an additional argument input_context that can be used to get the current worker index and the total number of workers. We pass this information to the Dataset.shard() function to perform the sharding.

Note that the batch size provided by the input function is the per-worker batch size. The global batch size will be this multiplied by the number of workers.

The model function

The optimizer will automatically divide the loss by the number of workers, so in the model function we should only divide the loss by the local batch size.

We make some changes to how the weights of the model are updated. Instead of using the high-level Optimizer.minimize() function, we call Optimizer.compute_gradients() and Optimizer.apply_gradients() separately in order to control their placement. The Optimizer.compute_gradients() call (the backward pass) is placed on the IPU, while the Optimizer.apply_gradients() call (the allreduce of gradients and weight updates) is placed on the host. This is done by using the host_call parameter in IPUEstimatorSpec.

In practice this means that the gradients will be streamed from the IPU to the host as soon as they are computed. The worker hosts will then start reducing the gradients amongst themselves, allowing overlap between the backward pass on the IPUs with the reductions on the hosts. After a gradient is reduced across the workers, the corresponding weight update is also done on the host.

The reduction is done using a ring-based collectives implementation with gRPC as the cross-host communication layer.

One benefit of this approach is that any additional optimizer state (such as momentum) is only needed in host memory, so there is no additional IPU memory consumption when using stateful optimizers with this approach.

Cluster definition

We use the TFConfigClusterResolver which reads the TF_CONFIG environment variable to determine the cluster definition.

There are two components of TF_CONFIG: cluster and task. cluster provides information about the entire cluster, namely the workers and parameter servers in the cluster. task provides information about the current task. In this example, the task type is worker and the task index is 0.

You could run this example with two workers on the same machine (in different terminals) like this:

 $ TF_CONFIG='{"cluster":{"worker":["localhost:3737","localhost:3738"]},"task":{"type":"worker","index":0}}' python distributed_training_example.py
$ TF_CONFIG='{"cluster":{"worker":["localhost:3737","localhost:3738"]},"task":{"type":"worker","index":1}}' python distributed_training_example.py

Complete example

 from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import argparse
import numpy as np
import tensorflow as tf
from tensorflow.python import ipu

BATCH_SIZE = 64


def input_fn(mode, input_context=None):  # pylint: disable=unused-argument
  train_data, _ = tf.keras.datasets.mnist.load_data()

  def normalise(image, label):
    image = image.astype(np.float32) / 255.0
    image = np.expand_dims(image, axis=-1)
    label = label.astype(np.int32)
    return image, label

  x_train, y_train = normalise(*train_data)

  def generator():
    return zip(x_train, y_train)

  types = (x_train.dtype, y_train.dtype)
  shapes = (x_train.shape[1:], y_train.shape[1:])
  mnist_dataset = tf.data.Dataset.from_generator(generator, types, shapes)

  if input_context:
    mnist_dataset = mnist_dataset.shard(input_context.num_input_pipelines,
                                        input_context.input_pipeline_id)

  mnist_dataset = mnist_dataset.shuffle(len(y_train)) \
      .cache().batch(BATCH_SIZE, drop_remainder=True).repeat()
  return mnist_dataset


def model_fn(features, labels, mode):
  model = tf.keras.Sequential([
      tf.keras.layers.Conv2D(32, 3, activation="relu"),
      tf.keras.layers.MaxPooling2D(),
      tf.keras.layers.Flatten(),
      tf.keras.layers.Dense(64, activation="relu"),
      tf.keras.layers.Dense(10)
  ])
  logits = model(features, training=mode == tf.estimator.ModeKeys.TRAIN)

  if mode == tf.estimator.ModeKeys.PREDICT:
    predictions = {"logits": logits}
    return tf.estimator.EstimatorSpec(mode, predictions=predictions)

  optimizer = tf.compat.v1.train.AdamOptimizer()
  loss = tf.keras.losses.SparseCategoricalCrossentropy(
      from_logits=True, reduction=tf.compat.v1.losses.Reduction.NONE)(labels,
                                                                      logits)
  loss = tf.reduce_sum(loss) * (1. / BATCH_SIZE)
  if mode == tf.estimator.ModeKeys.EVAL:
    predictions = tf.argmax(input=logits, axis=-1)
    eval_metric_ops = {
        "accuracy":
        tf.compat.v1.metrics.accuracy(labels=labels, predictions=predictions),
    }
    return tf.estimator.EstimatorSpec(mode,
                                      loss=loss,
                                      eval_metric_ops=eval_metric_ops)

  variables = model.trainable_variables

  def host_model_fn(*host_gradients):
    # This will allreduce the gradients and update the weights on the host.
    return optimizer.apply_gradients(zip(host_gradients, variables))

  train_op = tf.identity(loss)
  grads_and_vars = optimizer.compute_gradients(loss, var_list=variables)
  gradients = [g for (g, _) in grads_and_vars]
  host_call = (host_model_fn, gradients)

  return ipu.ipu_estimator.IPUEstimatorSpec(mode=mode,
                                            loss=loss,
                                            train_op=train_op,
                                            host_call=host_call)


# Get the cluster configuration from the TF_CONFIG environment variable.
cluster = tf.distribute.cluster_resolver.TFConfigClusterResolver()
strategy = ipu.ipu_multi_worker_strategy.IPUMultiWorkerStrategy(cluster)

ipu_options = ipu.utils.create_ipu_config()
ipu.utils.auto_select_ipus(ipu_options, num_ipus=1)
ipu_run_config = ipu.ipu_run_config.IPURunConfig(ipu_options=ipu_options)

config = ipu.ipu_run_config.RunConfig(
    ipu_run_config=ipu_run_config,
    train_distribute=strategy,
)

parser = argparse.ArgumentParser()
parser.add_argument("--num-steps", type=int, default=10000)
parser.add_argument("--model-dir")
args = parser.parse_args()

classifier = ipu.ipu_estimator.IPUEstimator(
    config=config,
    model_fn=model_fn,
    model_dir=args.model_dir,
)

# Training progress is logged as INFO, so enable that logging level.
tf.logging.set_verbosity(tf.logging.INFO)

tf.estimator.train_and_evaluate(
    classifier,
    train_spec=tf.estimator.TrainSpec(input_fn=input_fn,
                                      max_steps=args.num_steps),
    eval_spec=tf.estimator.EvalSpec(input_fn=input_fn))

Custom IPU operations

There are three mechanisms for providing custom operations to the IPU through the TensorFlow interface. The first uses a fully custom codelet and host build file.

The second case is a custom operation which is executed on the CPU.

The third possibility is a custom, fused elementwise arithmetic operation. In this last case, the gradient creation in the Optimizers will not produce a gradient operation for the custom operation.

Fully customized IPU operations

You can provide a custom operation to be compiled into the Poplar executable and run on the IPU hardware. You must provide a host-side shared object library that implements the action of adding vertices to a Poplar graph, given some Poplar tensor inputs. You can optionally provide a Poplar source code or binary file containing one or more “codelets” (code that runs on the IPU).

For more details on writing codelets, please refer to the Poplar and Poplibs User Guide.

These operations are added with ipu.user_ops.precompiled_user_op. More information about this can be found in the Python API. An example of this can be found below.

The shared object file must contain an undecorated symbol, declared as shown below, which adds vertices to the graph that perform the custom operation. The name of the symbol should match the name of the operation in the graph. By default, these operations are called Build.

 extern "C"
poplar::program::Program Build(
  poplar::Graph& graph, const std::vector<poplar::Tensor>& inputs,
  std::vector<poplar::Tensor>& outputs, const std::string &debug_prefix)

The arguments are:

graph

the poplar graph into which to add tensors and vertices.

inputs

a vector of poplar tensors which are inputs to the operation.

outputs

a vector into which to store the outputs of the operation. The vector will contain zero entries when the Build function is called.

debug_prefix

the debug name that has been given to the operation in the TensorFlow graph.

If the operation can have its gradient taken, then the shared object can contain a separate function with the same name as the forward pass builder. The function must be given the same name as the forward operation with _grad appended. The signature of the builder function is slightly different, as it takes the forward pass outputs and inputs as arguments, as well as the gradient outputs.

 extern "C"
poplar::program::Program Build_grad(
    poplar::Graph& graph, int input_grad_index,
    const std::vector<poplar::Tensor>& gradients,
    const std::vector<poplar::Tensor>& fwd_outputs,
    const std::vector<poplar::Tensor>& fwd_inputs,
    std::vector<poplar::Tensor>& outputs,
    const std::string& debug_prefix)

The arguments are:

graph

the poplar graph into which to add tensors and vertices.

input_grad_index

The index of the input for which this op is producing the partial derivative. If the gradient operation calculates all of the partial derivatives, then this input should be ignored.

gradients

the inputs to the gradient op, from the previous gradient op or loss.

fwd_outputs

the tensors which are the outputs of the forward operation.

fwd_inputs

the tensors which are the inputs to the forward operation.

outputs

the outputs of this gradient operation. There must be one per input of the original forward operation. Inputs which are not differentiable can have a null Poplar tensor.

debug_prefix

the name of the operation.

Metadata

The shared object file can optionally contain an undecorated symbol that is the same as the builder function with _metadata appended. This function must have the following signature:

 extern "C"
void Build_metadata(std::vector<std::int64_t>& allocating_indices,
  std::uint32_t& num_inplace, bool& is_elementwise,
  std::uint32_t num_inputs)

The arguments are:

allocating_indices

indicates which of the inputs should be allocated using the tensor allocation function. See the description in the Tensor allocation section below.

num_inplace

indicates the number of inputs which are ‘in place’. The first num_inplace of the inputs will be considered to be in-place.

is_elementwise

indicates that this operation is element-wise.

num_inputs

indicates how many inputs are on the operation.

The function should fill in the values of the first three arguments, which are all reference types.

In place operations

If an operation does an in-place modification of an input tensor, as opposed to creating a new output tensor, then num_inplace can be used to indicate that this is the case. The system will ensure that, when a tensor is updated in place, any other uses of that tensor are complete before the operation is run.

If a tensor is not marked as in place then the operation must not modify it. If it is modified then other operations which consume it may see an incorrect value on their input.

Elementwise operations

The IPU driver can do a better job of allocating the layout of Poplar tensors if it can associate them with specific operations. If the output of an operation is the same shape and layout as its first input, then it should be marked as elementwise.

Typically the graph building code for the operation will clone the input in order to generate the output Poplar tensor.

Tensor allocation

When generating the Poplar graph, sometimes the backend has the freedom to allocate an input to an operation. This happens when an input to an op is also the input to the graph, or when previous operations do not put constraints on the input tensor.

If this condition occurs, then by default the backend will create the Poplar tensor with linear mapping. See the section on tile mapping in the Poplar API guide.

To override this behaviour and allocate a tensor using a specific layout mapping, the custom operation can provide a function with the following signature:

 extern "C" poplar::Tensor Build_allocator(
  poplar::Graph& graph, std::uint32_t operand,
  const std::vector<size_t>& shape, poplar::Type type,
  const std::string& debug_prefix)

The arguments are:

graph

the Poplar graph where the tensor should be created.

operand

the operand number of the input to allocate.

shape

the shape of the tensor.

type

the Poplar data type for the tensor.

debug_prefix

the name of the operation.

Gradient operations

As described above, when the gradient of the forward operation is generated, either a single op, or multiple operations can be inserted into the graph.

You can use the parameter separate_gradients on the precompiled_user_op function to select which of the two options is required. The compiled code must match this setting.

If the separate_gradients parameter is set to False, then the compiled function for generating the gradient operation should fill in one output for each of the inputs of the forward pass function. Each output should be the partial derivative with respect to one of the inputs.

If the separate_gradients parameter is True, then the gradient operation building function should produce an operation with a single output, which is the partial derivative with respect to only one of the forward pass inputs.

The specific input will be given by the input_grad_index argument of the call to the shared object Build_grad function.

Example

This example shows the source file for a rotate op, which takes three vectors (x, y and angle) and rotates each (x, y) point by the corresponding angle.

 /* Copyright 2020 The TensorFlow Authors. All Rights Reserved.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
==============================================================================*/

#include <vector>

#include <poplar/Graph.hpp>
#include <poplar/Tensor.hpp>
#include <poputil/Util.hpp>
#include <poputil/VertexTemplates.hpp>
#include <poputil/exceptions.hpp>

extern "C" void Build_metadata(std::vector<std::int64_t>& allocating_indices,
                               std::uint32_t& num_inplace, bool& is_elementwise,
                               std::uint32_t num_inputs) {
  allocating_indices.clear();
  num_inplace = 0;
  is_elementwise = true;
}

extern "C" poplar::program::Program Build(
    poplar::Graph& graph, const std::vector<poplar::Tensor>& inputs,
    std::vector<poplar::Tensor>& outputs, const std::string& debugPrefix) {
  if (inputs.size() != 3) {
    throw poputil::poplibs_error("Rotate requires 3 inputs");
  }

  if (inputs[0].numElements() == 0) {
    return poplar::program::Sequence();
  }

  if (inputs[0].rank() != 1 || inputs[1].rank() != 1 || inputs[2].rank() != 1) {
    throw poputil::poplibs_error("All inputs must be rank 1");
  }

  if (inputs[0].dim(0) != inputs[1].dim(0) ||
      inputs[0].dim(0) != inputs[2].dim(0)) {
    throw poputil::poplibs_error(
        "Length of rotate vector and data vectors must match");
  }

  if (inputs[0].elementType() != inputs[1].elementType() ||
      inputs[0].elementType() != inputs[2].elementType()) {
    throw poputil::poplibs_error(
        "Data types of angle vector and data vectors must match");
  }

  auto dType = inputs[0].elementType();

  /*
   * Create a ComputeSet which will be executed, and contains the vertices
   */
  auto cs = graph.addComputeSet(debugPrefix + "/rotate");

  /*
   * Get the tile mapping for the complete tensor.  We will map the vertices so
   * that they match the layout of the 'x' input tensor (input[0]).  If the 'x'
   * tensor was laid out differently to the other ones, then Poplar will
   * insert code to move the data in the other tensors to the mapped tile. So
   * ideally we would choose the best mapping for the vertices by analysing
   * all of the tensor mappings.
   */
  auto tileMapping = graph.getTileMapping(inputs[0]);

  /*
   * Get the target, which describes properties of the hardware.
   */
  auto target = graph.getTarget();

  /*
   * Get the vector width of the particular data type, so that later we can
   * divide the tensor up between workers in an appropriate way.
   */
  const auto vectorWidth = target.getVectorWidth(dType);

  /*
   * Create the output tensors
   */
  outputs.push_back(graph.clone(inputs[0]));
  outputs.push_back(graph.clone(inputs[1]));

  auto xFlat = inputs[0].flatten();
  auto yFlat = inputs[1].flatten();
  auto aFlat = inputs[2].flatten();
  auto xOutputFlat = outputs[0].flatten();
  auto yOutputFlat = outputs[1].flatten();

  for (unsigned tile = 0; tile != tileMapping.size(); ++tile) {
    /*
     * If a tile contains no elements of the tensor then do not create any
     * vertices for it.
     */
    if (tileMapping[tile].empty()) {
      continue;
    }

    /*
     * Split up the regions of the inputs tensors so that they are evenly
     * distributed between the workers on the tile.
     */
    auto vertexRegions = poputil::splitRegionsBetweenWorkers(
        target, tileMapping[tile], vectorWidth, 2 * vectorWidth);

    for (const auto& regions : vertexRegions) {
      /*
       * If a region has no elements, then there is no need to add a vertex for
       * it.
       */
      if (regions.empty()) {
        continue;
      }

      /*
       * Add codelets to tiles which work over the regions in the input
       * tensors.
       */
      auto v = graph.addVertex(cs, poputil::templateVertex("Rotate", dType),
                               {{"x_out", xOutputFlat.slices(regions)},
                                {"y_out", yOutputFlat.slices(regions)},
                                {"x_in", xFlat.slices(regions)},
                                {"y_in", yFlat.slices(regions)},
                                {"angle", aFlat.slices(regions)}});

      /* Map the vertex onto the appropriate tile. */
      graph.setTileMapping(v, tile);

      /* Provide a bogus cycle count estimate for the profiler. */
      graph.setCycleEstimate(v, 1);
    }
  }

  return poplar::program::Execute(cs);
}

This is the associated codelet file.

 #include <cmath>

#include <poplar/HalfFloat.hpp>
#include <poplar/Vertex.hpp>

using namespace poplar;

/*
 * A codelet to rotate the tensors 'x' and 'y' by the angles (radians) in the
 * tensor 'angle', around the origin.
 */
template <typename FPType>
class Rotate : public Vertex {
 public:
  Vector<Output<Vector<FPType>>> x_out;
  Vector<Output<Vector<FPType>>> y_out;
  Vector<Input<Vector<FPType>>> x_in;
  Vector<Input<Vector<FPType>>> y_in;
  Vector<Input<Vector<FPType>>> angle;

  bool compute() {
    for (unsigned i = 0; i < angle.size(); ++i) {
      for (unsigned j = 0; j != angle[i].size(); ++j) {
        float a = angle[i][j];
        float x = x_in[i][j];
        float y = y_in[i][j];
        x_out[i][j] = x * cos(a) - y * sin(a);
        y_out[i][j] = x * sin(a) + y * cos(a);
      }
    }
    return true;
  }
};

template class Rotate<float>;
template class Rotate<half>;

This is an example of it in use:

 import os
import numpy as np

from tensorflow.python import ipu
from tensorflow.python.ipu.scopes import ipu_scope
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

# Configure argument for targeting the IPU
cfg = ipu.utils.create_ipu_config(profiling=True, use_poplar_text_report=True)
cfg = ipu.utils.set_ipu_model_options(cfg, compile_ipu_code=False)
cfg = ipu.utils.auto_select_ipus(cfg, 1)
ipu.utils.configure_ipu_system(cfg)

with tf.device("cpu"):
  x_data = tf.placeholder(np.float32, [4])
  y_data = tf.placeholder(np.float32, [4])
  p_angle = tf.placeholder(np.float32, [4])


def rotate_op(x, y, a):
  outputs = {
      "output_types": [tf.float32, tf.float32],
      "output_shapes": [tf.TensorShape([4]),
                        tf.TensorShape([4])],
  }

  base_path = os.path.join(os.getcwd(), "tensorflow/compiler/plugin/poplar")
  lib_path = os.path.join(base_path, "libcustom_rotate_op.so")
  gp_path = os.path.join(base_path, "custom_codelet.gp")

  o = ipu.custom_ops.precompiled_user_op([x, y, a],
                                         lib_path,
                                         gp_path,
                                         outs=outputs)
  return o


def my_net(x, y, a):
  return rotate_op(x, y, a)


with ipu_scope("/device:IPU:0"):
  xla_result = ipu.ipu_compiler.compile(my_net, [x_data, y_data, p_angle])

with tf.Session() as sess:
  # Base run
  result = sess.run(xla_result,
                    feed_dict={
                        x_data: [2., 4., 6., -1.],
                        y_data: [2., 3., 8., -1.],
                        p_angle: [np.pi, np.pi / 2., 3. * np.pi / 2., 0]
                    })

  print(result)

When compiling the host-side shared object file, it is not necessary to include or link against any TensorFlow header or library files. Only the Poplar headers and link libraries should be necessary.

Fully customized CPU operations

The framework also allows a custom operation which executes code on the CPU instead of on the IPU. A shared object, much like the builder function of the device side custom operation must be written. The signature of this function should be:

 extern "C" void Callback(const std::vector<void*>& data,
                         const std::vector<std::uint32_t>& number_of_elements,
                         std::vector<void*>& outputs,
                         const std::string& name);

The arguments are:

data

the input data. The function should be written to expect a certain data type, so the void pointer can be cast into the expected type.

number_of_elements

indicates the number of elements in the input data.

outputs

should be filled in by the operation.

name

is the name of the operation within the XLA/HLO graph.

Custom elementwise expressions

The Python class ipu.custom_ops.codelet_expression_op provides an interface for giving a custom fused expression to the compiler. This will be encoded into a single compute set.

The arguments to the Python function are a callable Python function which encodes the arithmetic expression, and the tensor arguments to the operation.

For instance:

 def my_custom_op(x, y, z):
    return x * x + y * z

ipu.custom_ops.codelet_expression_op(my_custom_op, a, b, c)

In this example, the Python function my_custom_op provides the expression, and the arguments a, b and c are the three inputs from other parts of the TensorFlow graph.

Python operators which are supported in the function are +, -, *, and abs.

References

The following documents may be useful.

Python API

Automatic graph sharding

tensorflow.python.ipu.autoshard.automatic_sharding(num_shards, input_ts, loss_ts, edge_filter=None, frozen_inference=False)

Automatically set shards for all connected nodes in the graph.

Parameters
  • num_shards – number of shards to split graph over.

  • input_ts – tensor closest to the datafeed in graph.

  • loss_ts – tensor closest to the loss in graph.

  • edge_filter – a callable predicate, with the signature fn(edge), where edge is a tuple with the name of the source op, and the name of the destination op.

  • frozen_inference – Flag set to True if running inference on a frozen graph.

tensorflow.python.ipu.autoshard.ipu_autoshard()

Provides a context for autosharding. All operations created within this context will be automatically sharded.

Compiler interface

tensorflow.python.ipu.ipu_compiler.compile(computation, inputs=None)

Builds an operator that compiles and runs computation with the Graphcore IPU XLA backend.

Parameters
  • computation

    A Python function that builds a computation to apply to the input. If the function takes n inputs, inputs should be a list of n tensors.

    computation may return a list of operations and tensors. Tensors must come before operations in the returned list. The return value of compile is a list of tensors corresponding to the tensors from the output of computation.

    All Operations returned from computation will be executed when evaluating any of the returned output tensors.

  • inputs – A list of inputs or None (equivalent to an empty list). Each input can be a nested structure containing values that are convertible to tensors. Note that passing an N-dimension list of compatible values will result in an N-dimension list of scalar tensors rather than a single Rank-N tensor. If you need different behaviour, convert parts of inputs to tensors with tf.convert_to_tensor.

Returns

Same data structure as if computation(inputs) is called directly, with some exceptions for correctness:

  1. None output: a NoOp would be returned which control-depends on computation.

  2. Single value output: a tuple containing the value would be returned.

  3. Operation-only outputs: a NoOp would be returned which control-depends on computation.

Raises

Exception – If the computation was not compiled for an IPU device.

Scoping contexts for IPUs

tensorflow.python.ipu.scopes.frontend_attribute(attribute_name, attribute_value, restore_to=None)

Sets the specified scope attribute to the specified value in the graph.

Parameters
  • attribute_name – Name of the attribute.

  • attribute_value – Attribute’s value as a string.

  • restore_to – If the attribute would otherwise be undefined at the end of the scope, it is set to this value instead.

Returns

A context

tensorflow.python.ipu.scopes.ipu_jit_scope(ipu_scope)

Provides a scope for compilation of operations.

If you would like to compile several sets of operations together, then this can provide that mechanism.

Parameters

ipu_scope – A name to differentiate between different JIT scopes

Returns

A context

tensorflow.python.ipu.scopes.ipu_scope(device)

Provides a scope for placing operations onto a particular IPU/IPU cluster.

Parameters

device – The name of the TensorFlow device, for example ‘/device:IPU:0’

Returns

A context

tensorflow.python.ipu.scopes.ipu_shard(index)

Control sharding for a set of operations.

Provides a scope which targets operations onto a particular shard (IPU) of a multi-IPU sharded device.

Parameters

index – The index of the IPU on which to place the enclosed operations.

Returns

A context
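
For example (a minimal sketch), operations can be placed on individual shards of a multi-IPU sharded device like this:

 from tensorflow.python.ipu import scopes

def sharded_net(a, b, c):
  with scopes.ipu_shard(0):
    x = a + b          # placed on shard (IPU) 0
  with scopes.ipu_shard(1):
    y = x * c          # placed on shard (IPU) 1
  return y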

tensorflow.python.ipu.scopes.outside_compilation_scope(name='outside')

Provides a scope for placing operations on the host, outside the current compilation scope. The operations will be placed on the default host device. This allows for offloading computations from the IPU to the host, which can be useful for operations that are not supported or suitable for execution on the IPU.

Example:

 def my_net(a):
  with ipu_scope("/device:IPU:0"):
    b = a * a
    with outside_compilation_scope():
      c = b + 2  # Placed on the host.
    d = b + c
    return d

Parameters

name – A name for the outside compilation scope.

Returns

A context

tensorflow.python.ipu.scopes.partials_type(override_type)

Override the default type used to store intermediate results by some operations.

Parameters

override_type – Numpy type of the partials (float16 or float32)

Returns

A context

tensorflow.python.ipu.scopes.stochastic_rounding(override)

Control stochastic rounding for a set of operations.

Manually sets the stochastic rounding method to use.

Returns

A context

Infeed queue

class tensorflow.python.ipu.ipu_infeed_queue.IPUInfeedQueue(dataset, feed_name, device_ordinal=0, replication_factor=1, data_to_prefetch=1)

Wraps a tf.Dataset object with infeed operations specific to the IPU.

This class, along with tensorflow.python.ipu.loops is used to create a data pipeline from a dataset into a training/inference loop on the IPU inside a single session.run which reduces the overheads of calling session.run for each iteration of the loop.

You should pass the infeed queue as an argument to a loop from tensorflow.python.ipu.loops. These loops will then handle the dequeuing of the data to the device automatically.

The feed_name allows individual feeds to be named. When including more than one feed in the same graph, each should be independently named.

The following skeleton shows how to use this method when building a training loop. Note how the body signature contains variables which correspond to the nested structure of tf.Tensor objects representing the next element in the infeed queue:

 # Create an example dataset.
dataset = ...  # A `tf.data.Dataset` object.

def dataset_parser(value):
  features, labels = parse_record(value)
  return {"features": features,
          "labels": labels}
# The resulting dataset has a nested structure of: {features, labels}.
dataset = dataset.map(dataset_parser)

infeed_queue = ipu.ipu_infeed_queue.IPUInfeedQueue(dataset, feed_name="training_infeed")

# dataset can no longer be used beyond this point.

def my_net():
  # Note how the nested structure forms part of the loop body signature.
  def body(loss, features, labels):
    with variable_scope.variable_scope("vs", use_resource=True):
      y = tf.conv2d(features, .....)
      ...
      ...
      logits = tf.nn.xw_plus_b(....)
    loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits_v2(logits=logits, labels=labels))
    optimizer = gradient_descent.GradientDescentOptimizer(0.000001)
    train = optimizer.minimize(loss)
    with ops.control_dependencies([train]):
      return array_ops.identity(loss)

  loss = 0.0
  return ipu.loops.repeat(10000, body, [loss], infeed_queue)

with ipu.scopes.ipu_scope("/device:IPU:0"):
  res = ipu_compiler.compile(my_net, inputs=[])

with tf.Session() as sess:
  sess.run(infeed_queue.initializer)
  sess.run(variables.global_variables_initializer())
  result = sess.run(res)
property deleter

A tf.Operation that can be run to delete the resources owned by this IPUInfeedQueue. This allows creating a new IPUInfeedQueue with the same name afterwards.

Returns

A tf.Operation that can be run to delete this IPUInfeedQueue

property dequeued

Returns whether this queue has been dequeued.

Returns

A nested structure of tf.Tensor objects.

get_next()

Obsolete function.

property initializer

A tf.Operation that should be run to initialize this IPUInfeedQueue.

Returns

A tf.Operation that should be run to initialize this IPUInfeedQueue

Raises

ValueError – if the function initializer has already been called.

property number_of_tuple_elements

Returns the number of arguments supplied by this IPUInfeedQueue.

Outfeed queue

class tensorflow.python.ipu.ipu_outfeed_queue.IPUOutfeedMode

Types used to control the IPUOutfeedQueue modes.

Contains the following values:

  • ALL: When used with an IPUOutfeedQueue, all the elements which were enqueued to the queue will be returned by the outfeed.

  • LAST: When used with an IPUOutfeedQueue, only the last element which was enqueued to the queue will be returned by the outfeed.
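The mode is chosen when the outfeed queue is constructed. For example, to keep only the most recently enqueued element:

from tensorflow.python.ipu import ipu_outfeed_queue

outfeed_queue = ipu_outfeed_queue.IPUOutfeedQueue(
    feed_name="outfeed",
    outfeed_mode=ipu_outfeed_queue.IPUOutfeedMode.LAST)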

class tensorflow.python.ipu.ipu_outfeed_queue.IPUOutfeedQueue(feed_name, outfeed_mode=None, outfeed_all=None, device_ordinal=0, replication_factor=1, io_batch_size=1)

Generates and adds outfeed enqueue/dequeue operations to the graph.

The queue has two modes of operation - outfeed all or outfeed last. In outfeed all mode every element that is enqueued will be stored for a subsequent dequeue. All of the enqueued elements will be returned when the dequeue operation is run.

In outfeed last mode only the last enqueued element is stored. The dequeue operation will in this case return a single element.

property deleter

A tf.Operation that can be run to delete the resources owned by this IPUOutfeedQueue. This allows creating a new IPUOutfeedQueue with the same name afterwards. The behaviour is undefined if this op is executed concurrently with the dequeue op.

Returns

A tf.Operation that can be run to delete this IPUOutfeedQueue

dequeue()

Generate host side operation to dequeue the outfeed values. The operation generated by this function will block if called prior to any enqueues.

The return value of this operation depends on the enqueued tensors, replication factor and the execution mode.

  1. Outfeed returning a single tensor:

outfeed_queue = ipu_outfeed_queue.IPUOutfeedQueue(feed_name="outfeed", replication_factor=2)

def body(input):
  output = input + 1
  outfeed = outfeed_queue.enqueue(output)
  return (output, outfeed)

def my_net(input):
  r = loops.repeat(20, body, [input])
  return r

# The placeholder must be created before it is passed to ipu_compiler.compile.
with ops.device('cpu'):
  v = tf.placeholder(np.float32, [4, 4])

with ipu.scopes.ipu_scope("/device:IPU:0"):
  res = ipu_compiler.compile(my_net, inputs=[v])

outfeed = outfeed_queue.dequeue()
with tf.Session() as sess:
  result = sess.run(res, {v:np.ones([4, 4], np.float32)})
  outfed = sess.run(outfeed)

In this example the tensor output is of shape [4, 4] and it’s enqueued into the outfeed with replication_factor = 2. If the outfeed_mode is outfeed_mode == IPUOutfeedMode.ALL, then the shape of the resulting outfed tensor will be [20, 2, 4, 4], where the first dimension represents the number of times we have enqueued a tensor to the outfeed - in this example the loop is repeated 20 times, and therefore we get 20 values back from the outfeed. The second dimension is the replication_factor, which allows us to see the individual values from each replicated graph. If the outfeed_mode is outfeed_mode == IPUOutfeedMode.LAST, then the shape of the resulting outfed tensor will be [2, 4, 4], which represents the value of the output tensor the last time it was enqueued during execution for each of the replicated graphs.

  2. Outfeed returning a tuple of tensors:

outfeed_queue = ipu_outfeed_queue.IPUOutfeedQueue(feed_name="outfeed")

def body(input):
  output = input + 1
  sum = tf.reduce_sum(output)
  outfeed = outfeed_queue.enqueue((output, sum))
  return (output, outfeed)

def my_net(input):
  r = loops.repeat(20, body, [input])
  return r

# The placeholder must be created before it is passed to ipu_compiler.compile.
with ops.device('cpu'):
  v = tf.placeholder(np.float32, [4, 4])

with ipu.scopes.ipu_scope("/device:IPU:0"):
  res = ipu_compiler.compile(my_net, inputs=[v])

outfeed = outfeed_queue.dequeue()
with tf.Session() as sess:
  result = sess.run(res, {v:np.ones([4, 4], np.float32)})
  outfed = sess.run(outfeed)

In this example we outfeed a tuple of tensors, output and sum, where the former is of shape [4, 4] and latter [1]. If the outfeed_mode is outfeed_mode == IPUOutfeedMode.ALL, then the resulting outfed is a two-tuple of tensors with shapes ([20, 4, 4], [20, 1]), where the first dimension in each of the tensors represents the number of times we have enqueued these tensors to the outfeed - in this example the loop is repeated 20 times, and therefore we get 20 values back from the outfeed for each of the tensors in the tuple. If the outfeed_mode is outfeed_mode == IPUOutfeedMode.LAST, then the outfed is a two tuple of tensors with shapes ([4, 4], [1]), which represents the values of the output and sum tensors the last time they were enqueued during execution.

Note that replication_factor here is the default (=1), which means that the extra replication dimension is not added.

  3. Outfeed returning a dictionary of tensors:

outfeed_queue = ipu_outfeed_queue.IPUOutfeedQueue(feed_name="outfeed", replication_factor=8)

def body(input):
  output = input + 1
  sum = tf.reduce_sum(output)
  outfeed = outfeed_queue.enqueue({"x": output,
                                   "y": sum})
  return (output, outfeed)

def my_net(input):
  r = loops.repeat(40, body, [input])
  return r

# The placeholder must be created before it is passed to ipu_compiler.compile.
with ops.device('cpu'):
  v = tf.placeholder(np.float32, [4, 4])

with ipu.scopes.ipu_scope("/device:IPU:0"):
  res = ipu_compiler.compile(my_net, inputs=[v])

outfeed = outfeed_queue.dequeue()
with tf.Session() as sess:
  result = sess.run(res, {v:np.ones([4, 4], np.float32)})
  outfed = sess.run(outfeed)

In this example we outfeed a dictionary of tensors, output and sum, where the former is of shape [4, 4] and latter [1]. If the outfeed_mode is outfeed_mode == IPUOutfeedMode.ALL, then the resulting outfed is a dictionary of tensors with shapes: {“x”: [40, 8, 4, 4], “y”: [40, 8, 1]}, where the first dimension in each of the tensors represents the number of times we have enqueued these tensors to the outfeed - in this example the loop is repeated 40 times, and therefore we get 40 values back from the outfeed for each of the tensors in the tuple. The second dimension is the replication_factor, which allows us to see the individual values from each replicated graph. If the outfeed_mode is outfeed_mode == IPUOutfeedMode.LAST, then the outfed is a dictionary of tensors with shapes: {“x”: [8, 4, 4], “y”: [8, 1]}, which represents the values of the output and sum tensors the last time they were enqueued during execution for each of the replicated graphs.

enqueue(tensors)

Enqueue a tensor, tuple or a dictionary of tensors to be outfed from the IPU graph. This operation is placed on the IPU device. This function returns an Operation which needs to be executed (by either returning it or using tf.control_dependencies(…)).

Examples:

1. Outfeed returning a single tensor:

   outfeed_queue = ipu_outfeed_queue.IPUOutfeedQueue(feed_name="outfeed")

  def body(v):
    v = v + 1
    outfeed = outfeed_queue.enqueue(v)
    return (v, outfeed)

  def my_net(v):
    r = loops.repeat(20, body, (v))
    return r

  with ipu.scopes.ipu_scope("/device:IPU:0"):
    res = ipu_compiler.compile(my_net, inputs=[v])

  ...
  ...

2. Outfeed returning a tuple of tensors:

  outfeed_queue = ipu_outfeed_queue.IPUOutfeedQueue(feed_name="outfeed")

  def body(v):
    v = v + 1
    x = v * 2
    outfeed = outfeed_queue.enqueue((v, x))
    return (v, outfeed)

  def my_net(v):
    r = loops.repeat(20, body, (v))
    return r

  with ipu.scopes.ipu_scope("/device:IPU:0"):
    res = ipu_compiler.compile(my_net, inputs=[v])

  ...
  ...

3. Outfeed returning a dictionary of tensors:

  outfeed_queue = ipu_outfeed_queue.IPUOutfeedQueue(feed_name="outfeed")

  def body(v):
    v = v + 1
    x = v * 2
    outfeed = outfeed_queue.enqueue({"output_1": v,
                                     "output_2": x})
    return (v, outfeed)

  def my_net(v):
    r = loops.repeat(20, body, (v))
    return r

  with ipu.scopes.ipu_scope("/device:IPU:0"):
    res = ipu_compiler.compile(my_net, inputs=[v])

  ...
  ...

IPUEstimator

class tensorflow.python.ipu.ipu_estimator.IPUEstimator(model_fn, model_dir=None, config=None, params=None, warm_start_from=None)

Estimator with IPU support.

IPUEstimator handles many of the details of running on IPUs, such as placement of operations and tensors, graph compilation and usage of data feeds. It also provides a simple way to use multiple IPUs in the form of either data parallelism or model parallelism.

For efficiency, it supports compiling a graph that contains multiple iterations of the training/prediction/evaluation loop, which will be fully executed on the IPU before yielding back to the TensorFlow Python runtime on the CPU.

See https://tensorflow.org/guide/estimators for general information about estimators.

Parameters
  • model_fn – The model function. Refer to https://www.tensorflow.org/guide/custom_estimators#write_a_model_function for details on how to write this function.

  • model_dir – Directory to save model parameters, graph, etc. This can also be used to load checkpoints from the directory into an estimator to continue training a previously saved model. If a PathLike object, the path will be resolved. If None, the model_dir in config will be used if set. If both are set, they must be the same. If both are None, a temporary directory will be used.

  • config – tf.ipu.ipu_run_config.RunConfig configuration object.

  • params – dict of hyper parameters that will be passed into model_fn. Keys are names of parameters, values are basic python types.

  • warm_start_from – Optional string filepath to a checkpoint or SavedModel to warm-start from, or a tf.estimator.WarmStartSettings object to fully configure warm-starting. If the string filepath is provided instead of a tf.estimator.WarmStartSettings, then all variables are warm-started, and it is assumed that vocabularies and tf.Tensor names are unchanged.
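As an illustrative sketch of how these pieces fit together (my_model_fn and my_input_fn are hypothetical user-defined functions, not part of the API):

from tensorflow.python import ipu

# Configure one IPU and run 100 iterations on the device per Session.run.
ipu_options = ipu.utils.create_ipu_config()
ipu_options = ipu.utils.auto_select_ipus(ipu_options, num_ipus=1)
ipu_run_config = ipu.ipu_run_config.IPURunConfig(
    iterations_per_loop=100, ipu_options=ipu_options)
config = ipu.ipu_run_config.RunConfig(ipu_run_config=ipu_run_config)

estimator = ipu.ipu_estimator.IPUEstimator(model_fn=my_model_fn, config=config)
estimator.train(input_fn=my_input_fn, steps=1000)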

eval_dir(name=None)

Shows the directory name where evaluation metrics are dumped.

Parameters

name – Name of the evaluation if user needs to run multiple evaluations on different data sets, such as on training data vs test data. Metrics for different evaluations are saved in separate folders, and appear separately in tensorboard.

Returns

A string which is the path of the directory containing the evaluation metrics.

evaluate(input_fn, steps=None, hooks=None, checkpoint_path=None, name=None)

Evaluates the model given evaluation data input_fn.

Parameters
  • input_fn

    A function that constructs the input data for evaluation. The function should return a tf.data.Dataset object. The outputs of the Dataset object must be a tuple (features, labels) where

    • features is a tf.Tensor or a dictionary of string feature name to Tensor

    • labels is a Tensor or a dictionary of string label name to Tensor

    Both features and labels are consumed by model_fn.

     

  • steps – Number of steps for which to evaluate model.

  • hooks – List of tf.train.SessionRunHook subclass instances. Used for callbacks inside the evaluation call.

  • checkpoint_path – Path of a specific checkpoint to evaluate. If None, the latest checkpoint in model_dir is used. If there are no checkpoints in model_dir, evaluation is run with newly initialized Variables instead of ones restored from checkpoint.

  • name – Name of the evaluation if user needs to run multiple evaluations on different data sets, such as on training data vs test data. Metrics for different evaluations are saved in separate folders, and appear separately in tensorboard.

Returns

A dict containing the evaluation metrics specified in model_fn keyed by name, as well as an entry global_step which contains the value of the global step for which this evaluation was performed.

experimental_export_all_saved_models(export_dir_base, input_receiver_fn_map, assets_extra=None, as_text=False, checkpoint_path=None)

Exports a SavedModel with tf.MetaGraphDefs for each requested mode.

For each mode passed in via the input_receiver_fn_map, this method builds a new graph by calling the input_receiver_fn to obtain feature and label Tensors. Next, this method calls the Estimator’s model_fn in the passed mode to generate the model graph based on those features and labels, and restores the given checkpoint (or, lacking that, the most recent checkpoint) into the graph. Only one of the modes is used for saving variables to the SavedModel (order of preference: tf.estimator.ModeKeys.TRAIN, tf.estimator.ModeKeys.EVAL, then tf.estimator.ModeKeys.PREDICT), such that up to three tf.MetaGraphDefs are saved with a single set of variables in a single SavedModel directory.

For the variables and tf.MetaGraphDefs, this method builds a timestamped export directory below export_dir_base, and writes a SavedModel into it containing the tf.MetaGraphDef for the given mode and its associated signatures.

For prediction, the exported MetaGraphDef will provide one SignatureDef for each element of the export_outputs dict returned from the model_fn, named using the same keys. One of these keys is always tf.saved_model.signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY, indicating which signature will be served when a serving request does not specify one. For each signature, the outputs are provided by the corresponding tf.estimator.export.ExportOutputs, and the inputs are always the input receivers provided by the serving_input_receiver_fn.

For training and evaluation, the train_op is stored in an extra collection, and loss, metrics, and predictions are included in a SignatureDef for the mode in question.

Extra assets may be written into the SavedModel via the assets_extra argument. This should be a dict, where each key gives a destination path (including the filename) relative to the assets.extra directory. The corresponding value gives the full path of the source file to be copied. For example, the simple case of copying a single file without renaming it is specified as {‘my_asset_file.txt’: ‘/path/to/my_asset_file.txt’}.

Parameters
  • export_dir_base – A string containing a directory in which to create timestamped subdirectories containing exported SavedModels.

  • input_receiver_fn_map – dict of tf.estimator.ModeKeys to input_receiver_fn mappings, where the input_receiver_fn is a function that takes no arguments and returns the appropriate subclass of InputReceiver.

  • assets_extra – A dict specifying how to populate the assets.extra directory within the exported SavedModel, or None if no extra assets are needed.

  • as_text – whether to write the SavedModel proto in text format.

  • checkpoint_path – The checkpoint path to export. If None (the default), the most recent checkpoint found within the model directory is chosen.

Returns

The string path to the exported directory.

Raises

ValueError – if any input_receiver_fn is None, no export_outputs are provided, or no checkpoint can be found.

export_saved_model(export_dir_base, serving_input_receiver_fn, assets_extra=None, as_text=False, checkpoint_path=None, experimental_mode='infer')

Exports inference graph as a SavedModel into the given dir.

For a detailed guide, see [Using SavedModel with Estimators](https://tensorflow.org/guide/saved_model#using_savedmodel_with_estimators).

This method builds a new graph by first calling the serving_input_receiver_fn to obtain feature Tensors, and then calling this Estimator’s model_fn to generate the model graph based on those features. It restores the given checkpoint (or, lacking that, the most recent checkpoint) into this graph in a fresh session. Finally it creates a timestamped export directory below the given export_dir_base, and writes a SavedModel into it containing a single tf.MetaGraphDef saved from this session.

The exported MetaGraphDef will provide one SignatureDef for each element of the export_outputs dict returned from the model_fn, named using the same keys. One of these keys is always tf.saved_model.signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY, indicating which signature will be served when a serving request does not specify one. For each signature, the outputs are provided by the corresponding tf.estimator.export.ExportOutputs, and the inputs are always the input receivers provided by the serving_input_receiver_fn.

Extra assets may be written into the SavedModel via the assets_extra argument. This should be a dict, where each key gives a destination path (including the filename) relative to the assets.extra directory. The corresponding value gives the full path of the source file to be copied. For example, the simple case of copying a single file without renaming it is specified as {‘my_asset_file.txt’: ‘/path/to/my_asset_file.txt’}.

The experimental_mode parameter can be used to export a single train/eval/predict graph as a SavedModel. See experimental_export_all_saved_models for full docs.

Parameters
  • export_dir_base – A string containing a directory in which to create timestamped subdirectories containing exported SavedModels.

  • serving_input_receiver_fn – A function that takes no argument and returns a tf.estimator.export.ServingInputReceiver or tf.estimator.export.TensorServingInputReceiver.

  • assets_extra – A dict specifying how to populate the assets.extra directory within the exported SavedModel, or None if no extra assets are needed.

  • as_text – whether to write the SavedModel proto in text format.

  • checkpoint_path – The checkpoint path to export. If None (the default), the most recent checkpoint found within the model directory is chosen.

  • experimental_mode – tf.estimator.ModeKeys value indicating which mode will be exported. Note that this feature is experimental.

Returns

The string path to the exported directory.

Raises

ValueError – if no serving_input_receiver_fn is provided, no export_outputs are provided, or no checkpoint can be found.

export_savedmodel(export_dir_base, serving_input_receiver_fn, assets_extra=None, as_text=False, checkpoint_path=None, strip_default_attrs=False)

Exports inference graph as a SavedModel into the given dir. (deprecated)

Warning: THIS FUNCTION IS DEPRECATED. It will be removed in a future version. Instructions for updating: This function has been renamed, use export_saved_model instead.

For a detailed guide, see [Using SavedModel with Estimators](https://tensorflow.org/guide/saved_model#using_savedmodel_with_estimators).

This method builds a new graph by first calling the serving_input_receiver_fn to obtain feature Tensors, and then calling this Estimator’s model_fn to generate the model graph based on those features. It restores the given checkpoint (or, lacking that, the most recent checkpoint) into this graph in a fresh session. Finally it creates a timestamped export directory below the given export_dir_base, and writes a SavedModel into it containing a single tf.MetaGraphDef saved from this session.

The exported MetaGraphDef will provide one SignatureDef for each element of the export_outputs dict returned from the model_fn, named using the same keys. One of these keys is always tf.saved_model.signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY, indicating which signature will be served when a serving request does not specify one. For each signature, the outputs are provided by the corresponding tf.estimator.export.ExportOutputs, and the inputs are always the input receivers provided by the serving_input_receiver_fn.

Extra assets may be written into the SavedModel via the assets_extra argument. This should be a dict, where each key gives a destination path (including the filename) relative to the assets.extra directory. The corresponding value gives the full path of the source file to be copied. For example, the simple case of copying a single file without renaming it is specified as {‘my_asset_file.txt’: ‘/path/to/my_asset_file.txt’}.

Parameters
  • export_dir_base – A string containing a directory in which to create timestamped subdirectories containing exported SavedModels.

  • serving_input_receiver_fn – A function that takes no argument and returns a tf.estimator.export.ServingInputReceiver or tf.estimator.export.TensorServingInputReceiver.

  • assets_extra – A dict specifying how to populate the assets.extra directory within the exported SavedModel, or None if no extra assets are needed.

  • as_text – whether to write the SavedModel proto in text format.

  • checkpoint_path – The checkpoint path to export. If None (the default), the most recent checkpoint found within the model directory is chosen.

  • strip_default_attrs – Boolean. If True, default-valued attributes will be removed from the `NodeDef`s. For a detailed guide, see [Stripping Default-Valued Attributes]( https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/saved_model/README.md#stripping-default-valued-attributes).

Returns

The string path to the exported directory.

Raises

ValueError – if no serving_input_receiver_fn is provided, no export_outputs are provided, or no checkpoint can be found.

get_variable_names()

Returns list of all variable names in this model.

Returns

List of names.

Raises

ValueError – If the Estimator has not produced a checkpoint yet.

get_variable_value(name)

Returns value of the variable given by name.

Parameters

name – string or a list of string, name of the tensor.

Returns

Numpy array - value of the tensor.

Raises

ValueError – If the Estimator has not produced a checkpoint yet.

latest_checkpoint()

Finds the filename of the latest saved checkpoint file in model_dir.

Returns

The full path to the latest checkpoint or None if no checkpoint was found.

property model_fn

Returns the model_fn which is bound to self.params.

Returns

def model_fn(features, labels, mode, config)

Return type

The model_fn with the following signature

predict(input_fn, predict_keys=None, hooks=None, checkpoint_path=None, yield_single_examples=True)

Yields predictions for given features.

Note: The returned generator will block forever if you try to consume more elements than what is generated, instead of raising the regular StopIteration exception. This is caused by the current behaviour when requesting to run a loop on the IPU for more iterations than there are elements remaining in the dataset. So you cannot simply drain it by using list(predictions), you have to consume the expected number of elements, e.g. using [next(predictions) for _ in range(num_examples)].

Parameters
  • input_fn

    A function that constructs the features. The function should return a tf.data.Dataset object. The outputs of the Dataset object should be one of the following:

    • features: A Tensor or a dictionary of string feature name to Tensor. features are consumed by model_fn.

    • A tuple, in which case the first item is extracted as features.

     

  • predict_keys – list of str, name of the keys to predict. It is used if the tf.estimator.EstimatorSpec.predictions is a dict. If predict_keys is used then rest of the predictions will be filtered from the dictionary. If None, returns all.

  • hooks – List of tf.train.SessionRunHook subclass instances. Used for callbacks inside the prediction call.

  • checkpoint_path – Path of a specific checkpoint to predict. If None, the latest checkpoint in model_dir is used. If there are no checkpoints in model_dir, prediction is run with newly initialized Variables instead of ones restored from checkpoint.

  • yield_single_examples – If False, yields the whole batch as returned by the model_fn instead of decomposing the batch into individual elements. This is useful if model_fn returns some tensors whose first dimension is not equal to the batch size.

Yields

Evaluated values of predictions tensors.

train(input_fn, hooks=None, steps=None, max_steps=None, saving_listeners=None)

Trains a model given training data input_fn.

Parameters
  • input_fn

    A function that provides input data for training as minibatches. The function should return a tf.data.Dataset object. The outputs of the Dataset object must be a tuple (features, labels) where

    • features is a tf.Tensor or a dictionary of string feature name to Tensor

    • labels is a Tensor or a dictionary of string label name to Tensor

    Both features and labels are consumed by model_fn.

     

  • hooks – List of tf.train.SessionRunHook subclass instances. Used for callbacks inside the training loop.

  • steps – Number of steps for which to train the model. steps works incrementally. If you call two times train(steps=10) then training occurs in total 20 steps. If you don’t want to have incremental behavior please set max_steps instead. If set, max_steps must be None.

  • max_steps – Number of total steps for which to train model. If set, steps must be None. Two calls to train(steps=100) means 200 training iterations. On the other hand, two calls to train(max_steps=100) means that the second call will not do any iteration since first call did all 100 steps.

  • saving_listeners – list of CheckpointSaverListener objects. Used for callbacks that run immediately before or after checkpoint savings.

Returns

self, for chaining.

class tensorflow.python.ipu.ipu_estimator.IPUEstimatorSpec

Ops and objects returned from a model_fn and passed to IPUEstimator.

static __new__(cls, mode, predictions=None, loss=None, train_op=None, eval_metric_ops=None, host_call=None, training_hooks=None, evaluation_hooks=None, prediction_hooks=None)

Create new instance of IPUEstimatorSpec(mode, predictions, loss, train_op, eval_metric_ops, host_call, training_hooks, evaluation_hooks, prediction_hooks)
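A hedged sketch of a training model_fn that returns an IPUEstimatorSpec (the layer size and learning rate are arbitrary, and features/labels are supplied by the input function):

import tensorflow as tf
from tensorflow.python import ipu

def my_model_fn(features, labels, mode, params):
  logits = tf.layers.dense(features, 10)
  loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)
  optimizer = tf.train.GradientDescentOptimizer(0.01)
  train_op = optimizer.minimize(loss)
  return ipu.ipu_estimator.IPUEstimatorSpec(mode=mode, loss=loss,
                                            train_op=train_op)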

IPUPipelineEstimator

class tensorflow.python.ipu.ipu_pipeline_estimator.IPUPipelineEstimator(model_fn, model_dir=None, config=None, params=None, warm_start_from=None)

Estimator for pipelining on IPUs.

IPUPipelineEstimator, like IPUEstimator, handles many of the details of running on IPUs, such as placement of operations and tensors, graph compilation and usage of data feeds. Additionally, it adds support for pipelined execution over multiple IPUs.

The major API difference from the IPUEstimator is that the provided model_fn must return an IPUPipelineEstimatorSpec that contains the information needed for pipelined execution.

Refer to the pipelining_ops documentation for more details about pipelining.

eval_dir(name=None)

Shows the directory name where evaluation metrics are dumped.

Parameters

name – Name of the evaluation if user needs to run multiple evaluations on different data sets, such as on training data vs test data. Metrics for different evaluations are saved in separate folders, and appear separately in tensorboard.

Returns

A string which is the path of the directory containing the evaluation metrics.

evaluate(input_fn, steps=None, hooks=None, checkpoint_path=None, name=None)

Evaluates the model given evaluation data input_fn.

Parameters
  • input_fn

    A function that constructs the input data for evaluation. The function should return a tf.data.Dataset object. The outputs of the Dataset object must be a tuple (features, labels) where

    • features is a tf.Tensor or a dictionary of string feature name to Tensor

    • labels is a Tensor or a dictionary of string label name to Tensor

    Both features and labels are consumed by model_fn.

     

  • steps – Number of steps for which to evaluate model.

  • hooks – List of tf.train.SessionRunHook subclass instances. Used for callbacks inside the evaluation call.

  • checkpoint_path – Path of a specific checkpoint to evaluate. If None, the latest checkpoint in model_dir is used. If there are no checkpoints in model_dir, evaluation is run with newly initialized Variables instead of ones restored from checkpoint.

  • name – Name of the evaluation if user needs to run multiple evaluations on different data sets, such as on training data vs test data. Metrics for different evaluations are saved in separate folders, and appear separately in tensorboard.

Returns

A dict containing the evaluation metrics specified in model_fn keyed by name, as well as an entry global_step which contains the value of the global step for which this evaluation was performed.

experimental_export_all_saved_models(export_dir_base, input_receiver_fn_map, assets_extra=None, as_text=False, checkpoint_path=None)

Exports a SavedModel with tf.MetaGraphDefs for each requested mode.

For each mode passed in via the input_receiver_fn_map, this method builds a new graph by calling the input_receiver_fn to obtain feature and label Tensors. Next, this method calls the Estimator’s model_fn in the passed mode to generate the model graph based on those features and labels, and restores the given checkpoint (or, lacking that, the most recent checkpoint) into the graph. Only one of the modes is used for saving variables to the SavedModel (order of preference: tf.estimator.ModeKeys.TRAIN, tf.estimator.ModeKeys.EVAL, then tf.estimator.ModeKeys.PREDICT), such that up to three tf.MetaGraphDefs are saved with a single set of variables in a single SavedModel directory.

For the variables and tf.MetaGraphDefs, this method builds a timestamped export directory below export_dir_base, and writes a SavedModel into it containing the tf.MetaGraphDef for the given mode and its associated signatures.

For prediction, the exported MetaGraphDef will provide one SignatureDef for each element of the export_outputs dict returned from the model_fn, named using the same keys. One of these keys is always tf.saved_model.signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY, indicating which signature will be served when a serving request does not specify one. For each signature, the outputs are provided by the corresponding tf.estimator.export.ExportOutputs, and the inputs are always the input receivers provided by the serving_input_receiver_fn.

For training and evaluation, the train_op is stored in an extra collection, and loss, metrics, and predictions are included in a SignatureDef for the mode in question.

Extra assets may be written into the SavedModel via the assets_extra argument. This should be a dict, where each key gives a destination path (including the filename) relative to the assets.extra directory. The corresponding value gives the full path of the source file to be copied. For example, the simple case of copying a single file without renaming it is specified as {‘my_asset_file.txt’: ‘/path/to/my_asset_file.txt’}.

Parameters
  • export_dir_base – A string containing a directory in which to create timestamped subdirectories containing exported SavedModels.

  • input_receiver_fn_map – dict of tf.estimator.ModeKeys to input_receiver_fn mappings, where the input_receiver_fn is a function that takes no arguments and returns the appropriate subclass of InputReceiver.

  • assets_extra – A dict specifying how to populate the assets.extra directory within the exported SavedModel, or None if no extra assets are needed.

  • as_text – whether to write the SavedModel proto in text format.

  • checkpoint_path – The checkpoint path to export. If None (the default), the most recent checkpoint found within the model directory is chosen.

Returns

The string path to the exported directory.

Raises

ValueError – if any input_receiver_fn is None, no export_outputs are provided, or no checkpoint can be found.

export_saved_model(export_dir_base, serving_input_receiver_fn, assets_extra=None, as_text=False, checkpoint_path=None, experimental_mode='infer')

Exports inference graph as a SavedModel into the given dir.

For a detailed guide, see [Using SavedModel with Estimators](https://tensorflow.org/guide/saved_model#using_savedmodel_with_estimators).

This method builds a new graph by first calling the serving_input_receiver_fn to obtain feature Tensors, and then calling this Estimator’s model_fn to generate the model graph based on those features. It restores the given checkpoint (or, lacking that, the most recent checkpoint) into this graph in a fresh session. Finally it creates a timestamped export directory below the given export_dir_base, and writes a SavedModel into it containing a single tf.MetaGraphDef saved from this session.

The exported MetaGraphDef will provide one SignatureDef for each element of the export_outputs dict returned from the model_fn, named using the same keys. One of these keys is always tf.saved_model.signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY, indicating which signature will be served when a serving request does not specify one. For each signature, the outputs are provided by the corresponding tf.estimator.export.ExportOutputs, and the inputs are always the input receivers provided by the serving_input_receiver_fn.

Extra assets may be written into the SavedModel via the assets_extra argument. This should be a dict, where each key gives a destination path (including the filename) relative to the assets.extra directory. The corresponding value gives the full path of the source file to be copied. For example, the simple case of copying a single file without renaming it is specified as {‘my_asset_file.txt’: ‘/path/to/my_asset_file.txt’}.

The experimental_mode parameter can be used to export a single train/eval/predict graph as a SavedModel. See experimental_export_all_saved_models for full docs.

Parameters
  • export_dir_base – A string containing a directory in which to create timestamped subdirectories containing exported SavedModels.

  • serving_input_receiver_fn – A function that takes no argument and returns a tf.estimator.export.ServingInputReceiver or tf.estimator.export.TensorServingInputReceiver.

  • assets_extra – A dict specifying how to populate the assets.extra directory within the exported SavedModel, or None if no extra assets are needed.

  • as_text – whether to write the SavedModel proto in text format.

  • checkpoint_path – The checkpoint path to export. If None (the default), the most recent checkpoint found within the model directory is chosen.

  • experimental_mode – tf.estimator.ModeKeys value indicating which mode will be exported. Note that this feature is experimental.

Returns

The string path to the exported directory.

Raises

ValueError – if no serving_input_receiver_fn is provided, no export_outputs are provided, or no checkpoint can be found.

export_savedmodel(export_dir_base, serving_input_receiver_fn, assets_extra=None, as_text=False, checkpoint_path=None, strip_default_attrs=False)

Exports inference graph as a SavedModel into the given dir. (deprecated)

Warning: THIS FUNCTION IS DEPRECATED. It will be removed in a future version. Instructions for updating: This function has been renamed, use export_saved_model instead.

For a detailed guide, see [Using SavedModel with Estimators](https://tensorflow.org/guide/saved_model#using_savedmodel_with_estimators).

This method builds a new graph by first calling the serving_input_receiver_fn to obtain feature Tensors, and then calling this Estimator’s model_fn to generate the model graph based on those features. It restores the given checkpoint (or, lacking that, the most recent checkpoint) into this graph in a fresh session. Finally it creates a timestamped export directory below the given export_dir_base, and writes a SavedModel into it containing a single tf.MetaGraphDef saved from this session.

The exported MetaGraphDef will provide one SignatureDef for each element of the export_outputs dict returned from the model_fn, named using the same keys. One of these keys is always tf.saved_model.signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY, indicating which signature will be served when a serving request does not specify one. For each signature, the outputs are provided by the corresponding tf.estimator.export.ExportOutputs, and the inputs are always the input receivers provided by the serving_input_receiver_fn.

Extra assets may be written into the SavedModel via the assets_extra argument. This should be a dict, where each key gives a destination path (including the filename) relative to the assets.extra directory. The corresponding value gives the full path of the source file to be copied. For example, the simple case of copying a single file without renaming it is specified as {‘my_asset_file.txt’: ‘/path/to/my_asset_file.txt’}.

Parameters
  • export_dir_base – A string containing a directory in which to create timestamped subdirectories containing exported SavedModels.

  • serving_input_receiver_fn – A function that takes no argument and returns a tf.estimator.export.ServingInputReceiver or tf.estimator.export.TensorServingInputReceiver.

  • assets_extra – A dict specifying how to populate the assets.extra directory within the exported SavedModel, or None if no extra assets are needed.

  • as_text – whether to write the SavedModel proto in text format.

  • checkpoint_path – The checkpoint path to export. If None (the default), the most recent checkpoint found within the model directory is chosen.

  • strip_default_attrs – Boolean. If True, default-valued attributes will be removed from the `NodeDef`s. For a detailed guide, see [Stripping Default-Valued Attributes]( https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/saved_model/README.md#stripping-default-valued-attributes).

Returns

The string path to the exported directory.

Raises

ValueError – if no serving_input_receiver_fn is provided, no export_outputs are provided, or no checkpoint can be found.

get_variable_names()

Returns list of all variable names in this model.

Returns

List of names.

Raises

ValueError – If the Estimator has not produced a checkpoint yet.

get_variable_value(name)

Returns value of the variable given by name.

Parameters

name – string or a list of string, name of the tensor.

Returns

Numpy array - value of the tensor.

Raises

ValueError – If the Estimator has not produced a checkpoint yet.

latest_checkpoint()

Finds the filename of the latest saved checkpoint file in model_dir.

Returns

The full path to the latest checkpoint or None if no checkpoint was found.

property model_fn

Returns the model_fn which is bound to self.params.

Returns

def model_fn(features, labels, mode, config)

Return type

The model_fn with the following signature

predict(input_fn, predict_keys=None, hooks=None, checkpoint_path=None, yield_single_examples=True)

Yields predictions for given features.

Note: The returned generator will block forever if you try to consume more elements than what is generated, instead of raising the regular StopIteration exception. This is caused by the current behaviour when requesting to run a loop on the IPU for more iterations than there are elements remaining in the dataset. So you cannot simply drain it by using list(predictions), you have to consume the expected number of elements, e.g. using [next(predictions) for _ in range(num_examples)].

Parameters
  • input_fn

    A function that constructs the features. The function should return a tf.data.Dataset object. The outputs of the Dataset object should be one of the following:

    • features: A Tensor or a dictionary of string feature name to Tensor. features are consumed by model_fn.

    • A tuple, in which case the first item is extracted as features.

     

  • predict_keys – list of str, name of the keys to predict. It is used if the tf.estimator.EstimatorSpec.predictions is a dict. If predict_keys is used then rest of the predictions will be filtered from the dictionary. If None, returns all.

  • hooks – List of tf.train.SessionRunHook subclass instances. Used for callbacks inside the prediction call.

  • checkpoint_path – Path of a specific checkpoint to predict. If None, the latest checkpoint in model_dir is used. If there are no checkpoints in model_dir, prediction is run with newly initialized Variables instead of ones restored from checkpoint.

  • yield_single_examples – If False, yields the whole batch as returned by the model_fn instead of decomposing the batch into individual elements. This is useful if model_fn returns some tensors whose first dimension is not equal to the batch size.

Yields

Evaluated values of predictions tensors.

train(input_fn, hooks=None, steps=None, max_steps=None, saving_listeners=None)

Trains a model given training data input_fn.

Parameters
  • input_fn

    A function that provides input data for training as minibatches. The function should return a tf.data.Dataset object. The outputs of the Dataset object must be a tuple (features, labels) where

    • features is a tf.Tensor or a dictionary of string feature name to Tensor

    • labels is a Tensor or a dictionary of string label name to Tensor

    Both features and labels are consumed by model_fn.

     

  • hooks – List of tf.train.SessionRunHook subclass instances. Used for callbacks inside the training loop.

  • steps – Number of steps for which to train the model. steps works incrementally. If you call two times train(steps=10) then training occurs in total 20 steps. If you don’t want to have incremental behavior please set max_steps instead. If set, max_steps must be None.

  • max_steps – Number of total steps for which to train model. If set, steps must be None. Two calls to train(steps=100) means 200 training iterations. On the other hand, two calls to train(max_steps=100) means that the second call will not do any iteration since first call did all 100 steps.

  • saving_listeners – list of CheckpointSaverListener objects. Used for callbacks that run immediately before or after checkpoint savings.

Returns

self, for chaining.

class tensorflow.python.ipu.ipu_pipeline_estimator.IPUPipelineEstimatorSpec

Ops and objects returned from a model_fn and passed to IPUPipelineEstimator.

static __new__(cls, mode, computational_stages, pipeline_depth, eval_metrics_fn=None, optimizer_function=None, device_mapping=None, pipeline_schedule=None)

Creates a validated IPUPipelineEstimatorSpec instance.

Depending on the value of mode, different arguments are required. Namely

  • For mode == ModeKeys.TRAIN: the optimizer_function is required.

  • For mode == ModeKeys.EVAL: the eval_metrics_fn is required.

Refer to the pipelining_ops documentation for more details about pipelining.

Parameters
  • mode – A ModeKeys. Specifies if this is training, evaluation or prediction.

  • computational_stages – a list of Python functions, where each function represents a computational pipeline stage. The function takes the outputs of the previous pipeline stage as its inputs.

  • pipeline_depth – the number of times each pipeline stage will be executed.

  • eval_metrics_fn – a Python function which takes the output of the last computational stage as parameters and returns a dict of evaluation metrics. The dict must contain a loss tensor value with the key “loss”. This function will be called on the host.

  • optimizer_function – a Python function which takes the output of the last computational stage as parameters and returns an instance of OptimizerFunctionOutput in order to generate the back-propagation and weight-update parts of the model suitable for training.

  • device_mapping – optional stage to IPU mapping override.

  • pipeline_schedule – the scheduling algorithm to use for pipeline lowering. Must be of type PipelineSchedule.

Returns

A validated IPUPipelineEstimatorSpec object.

Raises

ValueError – If validation fails.
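A rough sketch of a training model_fn with two computational stages. It assumes the data feeds deliver (features, labels) to the first stage and that OptimizerFunctionOutput (from pipelining_ops) is constructed from an optimizer and a loss; see the pipelining_ops documentation for the exact requirements:

import tensorflow as tf
from tensorflow.python import ipu

def my_model_fn(mode):
  def stage1(features, labels):
    hidden = tf.layers.dense(features, 128, activation=tf.nn.relu)
    return hidden, labels

  def stage2(hidden, labels):
    logits = tf.layers.dense(hidden, 10)
    loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)
    return loss

  def optimizer_function(loss):
    optimizer = tf.train.GradientDescentOptimizer(0.01)
    return ipu.pipelining_ops.OptimizerFunctionOutput(optimizer, loss)

  return ipu.ipu_pipeline_estimator.IPUPipelineEstimatorSpec(
      mode,
      computational_stages=[stage1, stage2],
      pipeline_depth=8,
      optimizer_function=optimizer_function)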

class tensorflow.python.ipu.ipu_run_config.IPURunConfig

IPU related configuration required by IPUEstimator.

Parameters
  • iterations_per_loop – This is the number of iterations running on the IPU device before returning to the CPU host for each Session.run. This means that the global step is increased iterations_per_loop times in one Session.run.

  • ipu_options – An IpuOptions configuration protobuf which is populated prior to being passed into IPURunConfig. Note that if more than one device is being used then ipu_options needs to be populated with a device_config.

  • compile_summary – Generate compilation summary

  • num_replicas – Number of replicated graphs (data parallelism)

  • num_shards – Number of IPU devices on which the graph is sharded (model parallelism)

  • autosharding – Use the IPU automatic_sharding to automatically shard the graph across num_shards devices

class tensorflow.python.ipu.ipu_run_config.RunConfig(ipu_run_config=None, master=None, **kwargs)

RunConfig with IPU support.

Parameters
  • ipu_run_configIPURunConfig object for IPU-specific configuration.

  • master – a string. The address of the distributed master to use for training.

  • **kwargs – keyword config parameters.

Distributed training with IPUs

class tensorflow.python.ipu.ipu_multi_worker_strategy.IPUMultiWorkerStrategy(cluster_resolver)

This is a distribution strategy for synchronous training using IPUs on multiple workers with between-graph replication.

It places variables on the host device of each worker, and uses multi-worker all-reduce to keep the variables in sync, using TensorFlow’s implementation of collective operations over gRPC.

It is the responsibility of the user to place the operations on the IPU, while this strategy will make sure that the variables are kept on the host and in sync between the multiple workers.

When used during training with an Optimizer, this means that the variables will be streamed from the host to the IPU when needed, and that the gradients will be streamed back to the host and then all-reduced across the workers. Then the workers will do identical updates to their copies of the variables. In other words, optimizer.compute_gradients() is done on the device, while optimizer.apply_gradients() is done on the host. All the “slot” variables used by the optimizer (e.g. the momentum accumulator) are kept only in host memory and never used on the device, saving device memory.

The default behavior is to sync (allreduce) the variables when they are written (sync-on-write). This is a good choice when reads are at least as common as writes. However, for variables where writes are more common than reads (like metrics or population statistics in batch normalization layers), it is beneficial to only sync (allreduce) the variables when they are read (sync-on-read). In both cases, it is important that all the workers participate in the sync, otherwise progress will be blocked. Take special care in the latter case (with sync-on-read variables), because it implies that all the workers need to read these variables at the same time. For example, it implies that all the workers must checkpoint the model at the same time.
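A minimal sketch of creating the strategy, assuming the cluster is described via the TF_CONFIG environment variable and resolved with TFConfigClusterResolver (any compatible cluster resolver could be used):

import tensorflow as tf
from tensorflow.python.ipu.ipu_multi_worker_strategy import IPUMultiWorkerStrategy

cluster_resolver = tf.distribute.cluster_resolver.TFConfigClusterResolver()
strategy = IPUMultiWorkerStrategy(cluster_resolver)

with strategy.scope():
  # Variables created here are placed on each worker's host device and
  # kept in sync with collective all-reduce over gRPC.
  ...  # build the model and training ops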

Looping utilities

tensorflow.python.ipu.loops.repeat(n, body, inputs=None, infeed_queue=None, use_while_v1=True)

Builds a loop that executes a fixed number of iterations.

The set of loop-carried tensors correspond to inputs. body must be a function that takes and returns the values of the loop-carried tensors.

Parameters
  • n – the number of loop iterations

  • body – a Python function that builds the loop body.

  • inputs – a list of initial values passed into the loop or None (equivalent to an empty list).

  • infeed_queue – if not None, the IPUInfeedQueue from which data is consumed.

  • use_while_v1 – if True, then use a TensorFlow v1.x dataflow while loop.

Returns

The final values of the loop-carried tensors.

Raises

TypeError – if body has the wrong signature.

tensorflow.python.ipu.loops.while_loop(condition, body, inputs=None, infeed_queue=None, maximum_iterations=None, use_while_v1=True)

Builds a while loop for IPUs.

The set of loop-carried tensors corresponds to inputs. Both condition and body take the current value of the loop-carried tensors. condition must return a single boolean value that determines whether iteration continues. body must return an updated list of values for the loop-carried tensors.

Parameters
  • condition – a Python function that builds the loop condition.

  • body – a Python function that builds the loop body.

  • inputs – a list of initial values passed into the loop, or None (equivalent to an empty list).

  • infeed_queue – if not None, the IPUInfeedQueue from which data is consumed.

  • use_while_v1 – if True, then use a TensorFlow v1.x dataflow while loop.

Returns

The final values of the loop-carried tensors.

Raises

TypeError – if body or condition has the wrong signature.
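A small sketch of a while loop body suitable for ipu_compiler.compile, which doubles a value ten times (i and x are the loop-carried tensors):

from tensorflow.python.ipu import loops

def my_net(x):
  def condition(i, x):
    return i < 10

  def body(i, x):
    return i + 1, x * 2.0

  return loops.while_loop(condition, body, inputs=[0, x])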

Utility functions for sharding graphs

tensorflow.python.ipu.sharding.dependencies(roots)

Find a list of ancestor operations for a given set of root operations

Parameters

roots – The root operations from which to start.

tensorflow.python.ipu.sharding.get_shard_from_colocation(op)

Find the shard number from an op which shares co-location information with the given operation.

Parameters

op – The operation to apply sharding to.

tensorflow.python.ipu.sharding.has_attr(o, attr_name)

Test for the presence of a specific attribute.

Parameters
  • o – An operation.

  • attr_name – The name of an attribute to test for.

Returns

True if the operation has the given attribute.

tensorflow.python.ipu.sharding.propagate_sharding(g)

Move the sharding from the forward pass operations onto their co-located backward pass operations.

Parameters

g – The graph.

General utility functions

class tensorflow.python.ipu.utils.DeviceConnectionType

Enumeration to describe the mechanism used to attach to the Poplar device.

  • ALWAYS indicates that the system will attach when configuring the device.

  • ON_DEMAND will defer connection to when the IPU is needed.

  • NEVER will never try to attach to a device. Used when compiling offline.

class tensorflow.python.ipu.utils.ExecutionProfileType

The execution profile type indicates the desired information in the execution profile.

  • NO_PROFILE indicates that there should be no execution profiling.

  • DEVICE_PROFILE indicates that the execution profile should contain only device wide events.

  • IPU_PROFILE indicates that the profile should contain IPU level execution events.

  • TILE_PROFILE indicates that the profile should contain Tile level execution events.

class tensorflow.python.ipu.utils.SelectionOrder

Depending on the communication pattern of the model, the order in which the IPUs are selected and mapped to shards can impact the performance.

For example, given a model which executes on multiple IPUs:

 def sharded_graph(pa, pb, pc, pd):
  with ipu.scopes.ipu_shard(0):
    o1 = pa + pb
  with ipu.scopes.ipu_shard(1):
    o2 = o1 + pc
  with ipu.scopes.ipu_shard(2):
    o3 = o2 + pd
    return o3

and a typical machine with 8 Graphcore C2 cards:

  _______               _______
|       |             |       |
|  14   |=============|  15   |
|_______|             |_______|
    ||                    ||
 _______               _______
|       |             |       |
|  12   |=============|  13   |
|_______|             |_______|
    ||                    ||
 _______               _______
|       |             |       |
|  10   |=============|  11   |
|_______|             |_______|
    ||                    ||
 _______               _______
|       |             |       |
|   8   |=============|   9   |
|_______|             |_______|
    ||                    ||
 _______               _______
|       |             |       |
|   6   |=============|   7   |
|_______|             |_______|
    ||                    ||
 _______               _______
|       |             |       |
|   4   |=============|   5   |
|_______|             |_______|
    ||                    ||
 _______               _______
|       |             |       |
|   2   |=============|   3   |
|_______|             |_______|
    ||                    ||
 _______               _______
|       |             |       |
|   0   |=============|   1   |
|_______|             |_______|

(where each numbered square represents an IPU with the given device ID and the == and || connections represent IPUs being directly connected via IPU-Links)

we can see that the ipu_shard(0) directly communicates with ipu_shard(1) and that ipu_shard(1) directly communicates with ipu_shard(2). If the shards 0, 1, 2 were mapped to IPUs 0, 1, 2 in that order, then the communication between shards 1 and 2 would not have a direct connection via an IPU-Link and would have to perform a “hop” via an IPU. If the shards 0, 1, 2 were mapped to IPUs 0, 1, 3 in that order, then the communication between shards 1 and 2 would have a direct connection via an IPU-Link which will reduce the communication cost.

This Enum class is used to control the order in which the IPUs are selected. Currently, the following IPU selection orderings are supported:

  • AUTO: automatically try and select the best selection given the network.

  • ZIGZAG: follow the natural ordering of IPUs. In the above example, the IPUs would be selected in the following order: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15.

  • SNAKE: select IPUs such that each consecutive shard is directly connected via IPU-Links to the shard before and after. In the above example, the IPUs would be selected in the following order: 0, 1, 3, 2, 4, 5, 7, 6, 8, 9, 11, 10, 12, 13, 15, 14.

  • HOOF: select IPUs such that each consecutive shard is directly connected via IPU-Links to the shard before and after and the last and first shard are on the same C2 cards. In the above example, the IPUs would be selected in the following order: 0, 2, 4, 6, 8, 10, 12, 14, 15, 13, 11, 9, 7, 5, 3, 1.

The SNAKE and HOOF IPU selection orders are particularly beneficial for pipelined models.
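The ordering is chosen when the IPU system is configured, via the selection_order argument of create_ipu_config. For example:

from tensorflow.python import ipu

opts = ipu.utils.create_ipu_config(
    selection_order=ipu.utils.SelectionOrder.SNAKE)
opts = ipu.utils.auto_select_ipus(opts, num_ipus=16)
ipu.utils.configure_ipu_system(opts)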

tensorflow.python.ipu.utils.auto_select_ipus(opts, num_ipus)

Configure the IPUs to be used by the session.

The configuration describes a system consisting of multiple TensorFlow devices, each with control of one or more IPUs. The devices will be labeled /device:IPU:0, /device:IPU:1 and so on.

Each device can control a specific number of IPUs, given by the num_ipus parameter. The system will automatically select IPU configurations from the available IPUs, where they match the desired number of IPUs.

Examples:

 # Create a single device, with one IPU
opts = create_ipu_config()
opts = auto_select_ipus(opts, num_ipus=1)
ipu.utils.configure_ipu_system(opts)
with tf.Session() as s:
  ...
 # Create two devices, with 2 IPUs per device.
opts = create_ipu_config()
opts = auto_select_ipus(opts, num_ipus=[2,2])
ipu.utils.configure_ipu_system(opts)
with tf.Session() as s:
  ...
 # Create two devices, with 1 IPU in the first device and 2 IPUs
# in the second device.
opts = create_ipu_config()
opts = auto_select_ipus(opts, num_ipus=[1,2])
ipu.utils.configure_ipu_system(opts)
with tf.Session() as s:
  ...
Parameters
  • opts – An IpuOptions session control protobuf.

  • num_ipus – List of IPUs per TensorFlow device

Returns

The IpuOptions configuration protobuf, configured for auto-selecting a set of IPU devices.

tensorflow.python.ipu.utils.configure_ipu_system(config, device='cpu')

Configure an IPU system by passing an IpuOptions protobuf created by the create_ipu_config function.

Parameters
  • config – An IpuOptions configuration protobuf

  • device – The CPU device which is local to the IPU hardware

Returns

None

tensorflow.python.ipu.utils.create_ipu_config(profiling=False, enable_ipu_events=False, use_poplar_text_report=False, use_poplar_cbor_report=False, profile_execution=None, report_every_nth_execution=0, max_report_size=268435456, report_directory='', scheduler_selection='', always_rearrange_copies_on_the_host=False, merge_infeed_io_copies=False, disable_graph_convolution_caching=False, disable_graph_outlining=False, retain_control_dependencies=False, max_cross_replica_sum_buffer_size=0, max_inter_ipu_copies_buffer_size=0, max_scheduler_lookahead_depth=5, max_scheduler_search_space_size=64, prefetch_data_streams=True, selection_order=None)

Create an empty IPU session configuration structure. (deprecated arguments)

Warning: SOME ARGUMENTS ARE DEPRECATED: (max_cross_replica_sum_buffer_size, max_inter_ipu_copies_buffer_size). They will be removed in a future version. Instructions for updating: Use set_optimization_options() instead.

Parameters
  • profiling – Enable compilation reports, and IPU trace events.

  • enable_ipu_events – Enable IPU trace events without poplar reports.

  • use_poplar_text_report – Enable the poplar textual report summary

  • use_poplar_cbor_report – Enable the poplar CBOR reports

  • profile_execution – Include Poplar execution profiles in the execution events. Can only be enabled if profiling is also enabled. If set, can be True, False, or a member of the ExecutionProfileType enumeration. A True value indicates ExecutionProfileType.DEVICE_PROFILE.

  • report_every_nth_execution – Only produce an execution report on every Nth execution. 0 = One report only.

  • max_report_size – The maximum size of Poplar profiles to include in the profile events.

  • report_directory – When set, reports will be written to files in this directory, instead of being written into the events. The events will contain the full paths of the report files.

  • scheduler_selection – When set, this forces the compiler to use a specific scheduler when ordering the instructions. See the documentation for a list of valid schedulers.

  • always_rearrange_copies_on_the_host – *Experimental flag* The data which is streamed to/from the device might be stored in different layouts on the device and on the host. If that is the case, the rearrangement is performed on the device by default. Enabling this option moves the rearrangement to the host, at the expense of latency.

  • merge_infeed_io_copies – When true, this flag will merge the streamed host->device input copies into one larger copy. This may reduce the time to copy data from the host, at the expense of increasing the live tensor memory on the device.

  • disable_graph_convolution_caching – By default, the convolution operation searches for an equivalent cached operation, and uses this instead of creating a new convolution. Setting this flag forces the creation of a new convolution. This can improve runtime at the expense of graph size.

  • disable_graph_outlining – By default, some operations, such as matrix multiplications, which occur in the graph multiple times but with different input tensors might be optimised to reduce the total code size of the graph at the expense of the execution time. Setting this flag will disable these optimisations. This option is not valid for the convolution operation (also see disable_graph_convolution_caching).

  • retain_control_dependencies – When set to true, control dependencies from the Tensorflow graph are passed through to the backend. This can result in a different memory size due to differing constraints on the operation scheduler.

  • max_cross_replica_sum_buffer_size – The maximum number of bytes that can be waiting before a cross replica sum op is scheduled.

  • max_inter_ipu_copies_buffer_size – The maximum number of bytes that can be waiting before an inter-IPU copy is scheduled.

  • max_scheduler_lookahead_depth – The maximum distance to look into the future when considering valid schedules.

  • max_scheduler_search_space_size – The maximum number of nodes to consider when building the tree of future schedules.

  • prefetch_data_streams – When set to true, the prefetching of data for data streams on the host will be overlapped with execution on the IPU.

  • selection_order – the order in which IPUs are selected and mapped to physical IPU devices when using multi-IPU devices (see SelectionOrder). When not specified, the automatic selection order is used; otherwise this should be an instance of SelectionOrder.

Returns

An IpuOptions configuration protobuf, suitable for passing to configure_ipu_system

tensorflow.python.ipu.utils.extract_all_events(events)

Extract a list containing each event as an event object

Parameters

events – A tensor containing a list of IPU events as protobuf strings

Returns

A list containing IpuTraceEvent objects

tensorflow.python.ipu.utils.extract_all_strings_from_event_trace(events)

Extract a concatenation of all data strings from an IPU event trace.

Parameters

events – An array of IPU events as returned from the ipu_compile_summary operation.

Returns

A string containing the concatenation of all of the data fields of the events.

tensorflow.python.ipu.utils.extract_all_types_from_event_trace(events)

Return a list of the types of each event in an event trace tensor

Parameters

events – A tensor containing a list of IPU events as protobuf strings

Returns

A list containing the type of each event

tensorflow.python.ipu.utils.extract_compile_reports(events)

Get a list of all compiler reports in the event list.

Parameters

events – A list of trace event serialized protobufs

Returns

A list of tuples containing the module name and report.

tensorflow.python.ipu.utils.extract_execute_reports(events)

Get a list of all execution reports in the event list.

Parameters

events – A list of trace event serialized protobufs

Returns

A list of tuples containing the module name and report.

tensorflow.python.ipu.utils.move_variable_initialization_to_cpu(graph=None)

For all variables in the VARIABLES collection, move any initialization ops onto the CPU.

Parameters

graph – Operations are moved around on this graph. The default graph will be used if not specified.

Returns

None
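A minimal sketch of the intended usage: build the graph, move the initialization ops onto the host, then run the initializer in the session as usual (the variable here is only illustrative):

 # Build the graph first, then move variable initialization to the host.
with tf.device("/device:IPU:0"):
  w = tf.get_variable("w", shape=[2, 2], initializer=tf.zeros_initializer())

move_variable_initialization_to_cpu()

with tf.Session() as sess:
  sess.run(tf.global_variables_initializer())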

tensorflow.python.ipu.utils.reset_ipu_seed(seed, device='/device:IPU:0', cpu_device='cpu')

Reset the seed used to generate stateful random numbers and perform stochastic rounding.

Parameters
  • seed – The new random number generator seed.

  • device – The device to which the seed will be applied.

  • cpu_device – The CPU device which is on the same hardware as the IPU device.

Returns

None
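For example, to make a run reproducible the seed can be reset after the IPU system has been configured (the value 42 is arbitrary):

 # Reset the on-device random number generator seed.
ipu.utils.reset_ipu_seed(42)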

tensorflow.python.ipu.utils.running_on_ipu_model(device='cpu')

Check if XLA is configured to run on the ipu model.

Parameters

device – The CPU device which is local to the IPU hardware

Returns

True if XLA is configured to run on the ipu model. False if XLA is configured to run on real hardware.

tensorflow.python.ipu.utils.select_ipus(opts, indices)

Configure the IPUs to be used by the session.

The configuration describes a system consisting of multiple Tensorflow devices, each with control of one or more IPUs. The Tensorflow devices will be labeled /device:IPU:0, /device:IPU:1 and so on.

Each Tensorflow device uses a specific configuration consisting of one or more IPUs from the list of devices. These can be found by running the Graphcore utility gc-info -l. For instance, the following listing shows the device configurations available on a system with 16 IPUs.

 user@host:~$ gc-info -l
Graphcore device listing:

-+- Id:  [0], type:      [PCIe], PCI Domain: [0000:1a:00.0]
-+- Id:  [1], type:      [PCIe], PCI Domain: [0000:1b:00.0]
-+- Id:  [2], type:      [PCIe], PCI Domain: [0000:1c:00.0]
-+- Id:  [3], type:      [PCIe], PCI Domain: [0000:1d:00.0]
-+- Id:  [4], type:      [PCIe], PCI Domain: [0000:60:00.0]
-+- Id:  [5], type:      [PCIe], PCI Domain: [0000:61:00.0]
-+- Id:  [6], type:      [PCIe], PCI Domain: [0000:62:00.0]
-+- Id:  [7], type:      [PCIe], PCI Domain: [0000:63:00.0]
-+- Id:  [8], type:      [PCIe], PCI Domain: [0000:b1:00.0]
-+- Id:  [9], type:      [PCIe], PCI Domain: [0000:b2:00.0]
-+- Id: [10], type:      [PCIe], PCI Domain: [0000:b3:00.0]
-+- Id: [11], type:      [PCIe], PCI Domain: [0000:b4:00.0]
-+- Id: [12], type:      [PCIe], PCI Domain: [0000:da:00.0]
-+- Id: [13], type:      [PCIe], PCI Domain: [0000:db:00.0]
-+- Id: [14], type:      [PCIe], PCI Domain: [0000:dc:00.0]
-+- Id: [15], type:      [PCIe], PCI Domain: [0000:dd:00.0]
-+- Id: [32], type: [Multi IPU]
 |--- PCIe Id:  [7], DNC Id: [0], PCI Domain: [0000:63:00.0]
 |--- PCIe Id:  [6], DNC Id: [1], PCI Domain: [0000:62:00.0]
 |--- PCIe Id:  [5], DNC Id: [2], PCI Domain: [0000:61:00.0]
 |--- PCIe Id:  [4], DNC Id: [3], PCI Domain: [0000:60:00.0]
 |--- PCIe Id:  [3], DNC Id: [4], PCI Domain: [0000:1d:00.0]
 |--- PCIe Id:  [2], DNC Id: [5], PCI Domain: [0000:1c:00.0]
 |--- PCIe Id:  [1], DNC Id: [6], PCI Domain: [0000:1b:00.0]
 |--- PCIe Id:  [0], DNC Id: [7], PCI Domain: [0000:1a:00.0]
 |--- PCIe Id: [11], DNC Id: [8], PCI Domain: [0000:b4:00.0]
 |--- PCIe Id: [10], DNC Id: [9], PCI Domain: [0000:b3:00.0]
 |--- PCIe Id:  [9], DNC Id: [10], PCI Domain: [0000:b2:00.0]
 |--- PCIe Id:  [8], DNC Id: [11], PCI Domain: [0000:b1:00.0]
 |--- PCIe Id: [15], DNC Id: [12], PCI Domain: [0000:dd:00.0]
 |--- PCIe Id: [14], DNC Id: [13], PCI Domain: [0000:dc:00.0]
 |--- PCIe Id: [13], DNC Id: [14], PCI Domain: [0000:db:00.0]
 |--- PCIe Id: [12], DNC Id: [15], PCI Domain: [0000:da:00.0]
-+- Id: [33], type: [Multi IPU]
 |--- PCIe Id:  [7], DNC Id: [0], PCI Domain: [0000:63:00.0]
 |--- PCIe Id:  [6], DNC Id: [1], PCI Domain: [0000:62:00.0]
 |--- PCIe Id:  [5], DNC Id: [2], PCI Domain: [0000:61:00.0]
 |--- PCIe Id:  [4], DNC Id: [3], PCI Domain: [0000:60:00.0]
 |--- PCIe Id:  [3], DNC Id: [4], PCI Domain: [0000:1d:00.0]
 |--- PCIe Id:  [2], DNC Id: [5], PCI Domain: [0000:1c:00.0]
 |--- PCIe Id:  [1], DNC Id: [6], PCI Domain: [0000:1b:00.0]
 |--- PCIe Id:  [0], DNC Id: [7], PCI Domain: [0000:1a:00.0]
-+- Id: [34], type: [Multi IPU]
 |--- PCIe Id: [11], DNC Id: [0], PCI Domain: [0000:b4:00.0]
 |--- PCIe Id: [10], DNC Id: [1], PCI Domain: [0000:b3:00.0]
 |--- PCIe Id:  [9], DNC Id: [2], PCI Domain: [0000:b2:00.0]
 |--- PCIe Id:  [8], DNC Id: [3], PCI Domain: [0000:b1:00.0]
 |--- PCIe Id: [15], DNC Id: [4], PCI Domain: [0000:dd:00.0]
 |--- PCIe Id: [14], DNC Id: [5], PCI Domain: [0000:dc:00.0]
 |--- PCIe Id: [13], DNC Id: [6], PCI Domain: [0000:db:00.0]
 |--- PCIe Id: [12], DNC Id: [7], PCI Domain: [0000:da:00.0]
-+- Id: [35], type: [Multi IPU]
 |--- PCIe Id:  [7], DNC Id: [0], PCI Domain: [0000:63:00.0]
 |--- PCIe Id:  [6], DNC Id: [1], PCI Domain: [0000:62:00.0]
 |--- PCIe Id:  [5], DNC Id: [2], PCI Domain: [0000:61:00.0]
 |--- PCIe Id:  [4], DNC Id: [3], PCI Domain: [0000:60:00.0]
-+- Id: [36], type: [Multi IPU]
 |--- PCIe Id:  [3], DNC Id: [0], PCI Domain: [0000:1d:00.0]
 |--- PCIe Id:  [2], DNC Id: [1], PCI Domain: [0000:1c:00.0]
 |--- PCIe Id:  [1], DNC Id: [2], PCI Domain: [0000:1b:00.0]
 |--- PCIe Id:  [0], DNC Id: [3], PCI Domain: [0000:1a:00.0]
-+- Id: [37], type: [Multi IPU]
 |--- PCIe Id: [11], DNC Id: [0], PCI Domain: [0000:b4:00.0]
 |--- PCIe Id: [10], DNC Id: [1], PCI Domain: [0000:b3:00.0]
 |--- PCIe Id:  [9], DNC Id: [2], PCI Domain: [0000:b2:00.0]
 |--- PCIe Id:  [8], DNC Id: [3], PCI Domain: [0000:b1:00.0]
-+- Id: [38], type: [Multi IPU]
 |--- PCIe Id: [15], DNC Id: [0], PCI Domain: [0000:dd:00.0]
 |--- PCIe Id: [14], DNC Id: [1], PCI Domain: [0000:dc:00.0]
 |--- PCIe Id: [13], DNC Id: [2], PCI Domain: [0000:db:00.0]
 |--- PCIe Id: [12], DNC Id: [3], PCI Domain: [0000:da:00.0]
-+- Id: [39], type: [Multi IPU]
 |--- PCIe Id:  [7], DNC Id: [0], PCI Domain: [0000:63:00.0]
 |--- PCIe Id:  [6], DNC Id: [1], PCI Domain: [0000:62:00.0]
-+- Id: [40], type: [Multi IPU]
 |--- PCIe Id:  [5], DNC Id: [0], PCI Domain: [0000:61:00.0]
 |--- PCIe Id:  [4], DNC Id: [1], PCI Domain: [0000:60:00.0]
-+- Id: [41], type: [Multi IPU]
 |--- PCIe Id:  [3], DNC Id: [0], PCI Domain: [0000:1d:00.0]
 |--- PCIe Id:  [2], DNC Id: [1], PCI Domain: [0000:1c:00.0]
-+- Id: [42], type: [Multi IPU]
 |--- PCIe Id:  [1], DNC Id: [0], PCI Domain: [0000:1b:00.0]
 |--- PCIe Id:  [0], DNC Id: [1], PCI Domain: [0000:1a:00.0]
-+- Id: [43], type: [Multi IPU]
 |--- PCIe Id: [11], DNC Id: [0], PCI Domain: [0000:b4:00.0]
 |--- PCIe Id: [10], DNC Id: [1], PCI Domain: [0000:b3:00.0]
-+- Id: [44], type: [Multi IPU]
 |--- PCIe Id:  [9], DNC Id: [0], PCI Domain: [0000:b2:00.0]
 |--- PCIe Id:  [8], DNC Id: [1], PCI Domain: [0000:b1:00.0]
-+- Id: [45], type: [Multi IPU]
 |--- PCIe Id: [15], DNC Id: [0], PCI Domain: [0000:dd:00.0]
 |--- PCIe Id: [14], DNC Id: [1], PCI Domain: [0000:dc:00.0]
-+- Id: [46], type: [Multi IPU]
 |--- PCIe Id: [13], DNC Id: [0], PCI Domain: [0000:db:00.0]
 |--- PCIe Id: [12], DNC Id: [1], PCI Domain: [0000:da:00.0]

Examples based on the listing above:

 # Create a single device with 1 IPU at PCI address 0000:1a:00.0 by using
# IPU configuration index 0
opts = create_ipu_config()
opts = select_ipus(opts, indices=[0])
ipu.utils.configure_ipu_system(opts)
with tf.Session() as s:
  ...
 # Create a single device with 1 IPU at PCI address 0000:b1:00.0 by using
# IPU configuration index 8
opts = create_ipu_config()
opts = select_ipus(opts, indices=[8])
ipu.utils.configure_ipu_system(opts)
with tf.Session() as s:
  ...
 # Create two Tensorflow devices, with one IPU each, being devices at
# indices 0 and 1
opts = create_ipu_config()
opts = select_ipus(opts, indices=[0, 1])
ipu.utils.configure_ipu_system(opts)
with tf.Session() as s:
  ...
 # Create two Tensorflow devices, with four IPUs each. The device
# configurations at indices 37 (0000:b4:00.0, 0000:b3:00.0, 0000:b2:00.0,
# 0000:b1:00.0) and 38 (0000:dd:00.0, 0000:dc:00.0, 0000:db:00.0,
# 0000:da:00.0)
opts = create_ipu_config()
opts = select_ipus(opts, indices=[37, 38])
ipu.utils.configure_ipu_system(opts)
with tf.Session() as s:
  ...
 # Create four Tensorflow devices each with one IPU, at addresses
# 0000:1a:00.0, 0000:1b:00.0, 0000:1c:00.0, 0000:1d:00.0.
opts = create_ipu_config()
opts = select_ipus(opts, indices=[0, 1, 2, 3])
ipu.utils.configure_ipu_system(opts)
with tf.Session() as s:
  ...
Parameters
  • opts – An IpuOptions session control protobuf.

  • indices – List of IPU configuration indices.

Returns

The IpuOptions configuration protobuf, with a number of devices selected by IPU configuration index.

tensorflow.python.ipu.utils.set_compilation_options(opts, compilation_options=None)

Set the IPU compilation options for the session.

 # Set the Poplar compilation options "debug.instrument" and
# "target.workerStackSizeInBytes"
opts = create_ipu_config()
opts = set_compilation_options(opts,
    compilation_options={"debug.instrument": "true",
                         "target.workerStackSizeInBytes": "64"})
ipu.utils.configure_ipu_system(opts)
with tf.Session() as s:
  ...
Parameters
  • opts – An IpuOptions session control protobuf.

  • compilation_options – A dictionary of poplar compilation option flags to be sent to the executor.

Returns

The IpuOptions configuration protobuf, with engine compilation options set.

tensorflow.python.ipu.utils.set_convolution_options(opts, convolution_options=None)

Set the IPU convolution options for the session.

 # Set "tempMemoryBudget" flag to "1000000"
opts = create_ipu_config()
opts = set_convolution_options(opts,
    convolution_options={"tempMemoryBudget": "1000000"})
ipu.utils.configure_ipu_system(opts)
with tf.Session() as s:
  ...
Parameters
  • opts – An IpuOptions session control protobuf.

  • convolution_options – A dictionary of poplar option flags for the convolutions.

Returns

The IpuOptions configuration protobuf, with convolution options set.

tensorflow.python.ipu.utils.set_floating_point_behaviour_options(opts, inv=True, div0=True, oflo=True, esr=True, nanoo=True)

Set the IPU floating point control behaviour bits

See the Poplar API documentation for poplar::FloatingPointBehaviour.

Parameters
  • inv – If true a floating point invalid operation (defined by IEEE 754) will cause an exception.

  • div0 – If true a floating point divide by zero operation will cause an exception.

  • oflo – If true a floating point overflow will cause an exception.

  • esr – Enable stochastic rounding.

  • nanoo – Enable Not-a-Number on overflow mode.
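A minimal sketch of enabling stochastic rounding while trapping invalid operations and divide-by-zero, assuming the function returns the updated IpuOptions protobuf in the same way as the other setters:

 # Enable stochastic rounding; trap invalid operations and divide-by-zero.
opts = create_ipu_config()
opts = set_floating_point_behaviour_options(opts, inv=True, div0=True,
                                            oflo=False, esr=True, nanoo=True)
ipu.utils.configure_ipu_system(opts)
with tf.Session() as s:
  ...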

tensorflow.python.ipu.utils.set_ipu_connection_type(opts, connection_type=None, ipu_version=1)

Configure when to attach to the device.

 # Compile without attaching to the device.
opts = create_ipu_config()
opts = set_ipu_connection_type(opts,
                               DeviceConnectionType.ON_DEMAND)
ipu.utils.configure_ipu_system(opts)
with tf.Session() as s:
  ...
Parameters
  • opts – An IpuOptions session control protobuf.

  • connection_type – One of DeviceConnectionType. Defaults to DeviceConnectionType.ALWAYS if None.

  • ipu_version – Version of the IPU hardware used.

Returns

The IpuOptions configuration protobuf.

tensorflow.python.ipu.utils.set_ipu_model_options(opts, compile_ipu_code=True)

Set the IPU Model options.

Parameters

compile_ipu_code – Whether or not to actually compile real IPU code for modelling.

Returns

The IpuOptions configuration protobuf, with IPU model options set.
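A minimal sketch of configuring the IPU Model so that real IPU code is not compiled:

 # Use the IPU Model without compiling real IPU code.
opts = create_ipu_config()
opts = set_ipu_model_options(opts, compile_ipu_code=False)
ipu.utils.configure_ipu_system(opts)
with tf.Session() as s:
  ...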

tensorflow.python.ipu.utils.set_matmul_options(opts, matmul_options=None, clear_pass_type=False)

Set the IPU matrix multiplication options for the session.

 # Set "availableMemoryProportion" flag to "0.5"
opts = create_ipu_config()
opts = set_matmul_options(opts,
    matmul_options={"availableMemoryProportion": "0.5"})
ipu.utils.configure_ipu_system(opts)
with tf.Session() as s:
  ...
Parameters
  • opts – An IpuOptions session control protobuf.

  • matmul_options – A dictionary containing the poplar option flag “availableMemoryProportion” for the matrix multiplication operations. It indicates the proportion of tile memory to be made available as temporary memory for the matrix multiplications (float between 0 and 1.0). Less temporary memory will generally result in a multiplication that takes more cycles to complete. However, because always live memory (like code and vertex state) is not tracked when planning it, a multiplication using less temporary memory may use more memory overall, due to an increase of always live memory.

  • clear_pass_type – When set to True, the Pass type will not be set in the options passed to the poplar operation.

Returns

The IpuOptions configuration protobuf, with matmul options set.

tensorflow.python.ipu.utils.set_optimization_options(opts, combine_embedding_lookups=False, combine_matmuls=False, max_cross_replica_sum_buffer_size=0, max_inter_ipu_copies_buffer_size=0, gather_simplifier=False)

Set the IPU options related to performance / optimizations.

 # Create a device with fusion for multiSlices sharing the same input
# enabled.
opts = create_ipu_config()
opts = set_optimization_options(opts,
                                combine_embedding_lookups=True)
ipu.utils.configure_ipu_system(opts)
with tf.Session() as s:
  ...
Parameters
  • combine_embedding_lookups – Fuse embedding lookups on the same tensor. This might improve performance but increase memory usage.

  • combine_matmuls – Fuse matmul operations if they share the same weights or the same input.

  • max_cross_replica_sum_buffer_size – The maximum number of bytes that can be waiting before a cross replica sum op is scheduled.

  • max_inter_ipu_copies_buffer_size – The maximum number of bytes that can be waiting before a inter IPU copy between IPUs is scheduled.

  • gather_simplifier – Will enable more aggressive optimisation for embedding lookups.

Returns

The IpuOptions configuration protobuf.

tensorflow.python.ipu.utils.set_pooling_options(opts, pooling_options=None)

Set the IPU pooling compilation options for the session.

 # Set "poolUseIntrospectiveMapping" flag to "false"
opts = create_ipu_config()
opts = set_pooling_options(opts,
    pooling_options={"poolUseIntrospectiveMapping": "false"})
ipu.utils.configure_ipu_system(opts)
with tf.Session() as s:
  ...
Parameters
  • opts – An IpuOptions session control protobuf.

  • pooling_options – A dictionary of poplar option flags for the pooling operation.

Returns

The IpuOptions configuration protobuf, with pooling options set.

tensorflow.python.ipu.utils.set_recomputation_options(opts, allow_recompute=True, allow_stateful_recompute=True)

Set re-computation options.
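A minimal sketch of enabling recomputation, typically in combination with pipelining (see the pipelining operators below):

 # Enable recomputation of activations during the backward pass.
opts = create_ipu_config()
opts = set_recomputation_options(opts, allow_recompute=True)
ipu.utils.configure_ipu_system(opts)
with tf.Session() as s:
  ...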

Parameters
  • allow_recompute – Whether or not to re-compute instructions during training. If this is enabled then we will attempt to pattern match instructions/pipeline stages in the forward pass and recompute them in the backward pass to avoid having to preserve activations which increase the maximum memory liveness. Enabling this option can reduce memory usage at the expense of extra computation.

  • allow_stateful_recompute – Whether or not to extend the re-compute of pipeline stages to stages containing stateful operations (Has no effect if allow_recompute is False).

Returns

The IpuOptions configuration protobuf.

tensorflow.python.ipu.utils.set_report_options(opts, report_options=None)

Set the options used to influence Poplar report generation.

The options are added to both the compile and execution report generations.

 opts = create_ipu_config()
opts = set_report_options(opts,
    report_options={"reportOption1": "false"})
ipu.utils.configure_ipu_system(opts)
with tf.Session() as s:
  ...
Parameters
  • opts – An IpuOptions session control protobuf.

  • report_options – A dictionary of poplar option flags for the report generation.

Returns

The IpuOptions configuration protobuf, with report generation options set.

tensorflow.python.ipu.utils.set_serialization_options(opts, output_folder='')

Enable / disable the serialization to disk of the compiled executables.

 # Create a device that will save to disk all the compiled executables.
opts = create_ipu_config()
opts = set_serialization_options(opts,
                                output_folder="/tmp/my_network")
ipu.utils.configure_ipu_system(opts)
with tf.Session() as s:
  ...
Parameters

output_folder – Where to save the compiled executables. Set to “” to disable serialization.

Returns

The IpuOptions configuration protobuf.

Popops all to all and all gather operators

tensorflow.python.ipu.ops.all_to_all_op.all_gather(x, replication_factor, name)

Gather the data on all replicas to all other replicas. Each replica will have the exact same output.

Parameters
  • x – The tensor to gather

  • replication_factor – The replication factor of the model.

  • name – Optional op name.

Returns

A tensor of [num_replicas][x] with each replica having the same tensor.
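A minimal sketch, assuming the graph has been built with a replication factor of 2 and that each replica holds a rank-1 tensor of 4 elements; every replica then receives the gathered [2, 4] result:

 # Gather the per-replica tensors onto every replica.
from tensorflow.python.ipu.ops import all_to_all_op

def my_net(x):
  return all_to_all_op.all_gather(x, replication_factor=2, name="all_gather")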

tensorflow.python.ipu.ops.all_to_all_op.all_to_all(x, split_dimension, concat_dimension, replication_factor, name=None)

Perform an XLA all to all operation across all replicas (https://www.tensorflow.org/xla/operation_semantics#alltoall)

Parameters
  • split_dimension – A value in the interval [0,n) that names the dimension along which the operand is split

  • concat_dimension – A value in the interval [0,n) that names the dimension along which the split blocks are concatenated.

  • replication_factor – The replication factor of the model.

  • name – Optional op name.

Returns

A tensor of the same size where each replica will have a different value.

Popops embedding operators

tensorflow.python.ipu.ops.embedding_ops.embedding_lookup(params, ids, name=None, one_hot_threshold=0, min_encoding_size=1216)

Looks up ids in a list of embedding tensors. (deprecated arguments)

Warning: SOME ARGUMENTS ARE DEPRECATED: (min_encoding_size, one_hot_threshold). They will be removed in a future version. Instructions for updating: stop passing this argument.

This is designed to be a drop-in replacement for the typical use cases with tf.nn.embedding_lookup for the IPU.

Parameters
  • params – A single tensor representing the complete embedding tensor.

  • ids – A Tensor with type int32 containing the slices to be extracted from params.

  • name – A name for the operation.

  • one_hot_threshold – The threshold below which the embedding lookup will become a one-hot with matmul.

  • min_encoding_size – The minimum encoding size for the embedding. This is used to decide whether to split the embedding tensor.

Returns

A Tensor with the same type as the tensors in params.
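A minimal sketch of using this as a drop-in replacement for tf.nn.embedding_lookup; the vocabulary and embedding sizes are illustrative:

 # Look up 128-dimensional embeddings for a batch of token ids.
from tensorflow.python.ipu.ops import embedding_ops

def my_net(ids):
  params = tf.get_variable("embedding", shape=[10000, 128], dtype=tf.float32)
  return embedding_ops.embedding_lookup(params, ids)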

Popnn normalization operators

tensorflow.python.ipu.ops.normalization_ops.group_norm(inputs, groups=2, channels_axis=-1, reduction_axes=(-3, -2), center=True, scale=True, epsilon=1e-06, param_initializers=None, reuse=None, variables_collections=None, training=True, trainable=True, scope=None)

Functional interface for the group normalization layer.

Reference: https://arxiv.org/abs/1803.08494.

“Group Normalization”, Yuxin Wu, Kaiming He

Parameters
  • inputs – A Tensor with at least 2 dimensions, one of which is channels. All shape dimensions must be fully defined.

  • groups – Integer. Divide the channels into this number of groups over which normalization statistics are computed. This number must be commensurate with the number of channels in inputs.

  • channels_axis – An integer. Specifies the index of the channels axis, which will be broken into groups, and statistics will be computed across each group. Must be mutually exclusive with reduction_axes. Preferred usage is to specify negative integers to be agnostic as to whether a batch dimension is included.

  • reduction_axes – Tuple of integers. Specifies dimensions over which statistics will be accumulated. Must be mutually exclusive with channels_axis. Statistics will not be accumulated across axes not specified in reduction_axes nor channels_axis. Preferred usage is to specify negative integers to be agnostic to whether a batch dimension is included.

    Some sample usage cases:

    NHWC format: channels_axis=-1, reduction_axes=[-3, -2]
    NCHW format: channels_axis=-3, reduction_axes=[-2, -1]

  • center – If True, add offset of beta to normalized tensor. If False, beta is ignored.

  • scale – If True, multiply by gamma. If False, gamma is not used. When the next layer is linear (also e.g. nn.relu), this can be disabled since the scaling can be done by the next layer.

  • epsilon – Small float added to variance to avoid dividing by zero.

  • activation_fn – Activation function, default set to None to skip it and maintain a linear activation.

  • param_initializers – Optional initializers for beta and gamma.

  • reuse – Whether or not the layer and its variables should be reused. To be able to reuse the layer scope must be given.

  • variables_collections – Optional collections for the variables.

  • training – Whether this operation is being used in a training network.

  • trainable – If True also add variables to the graph collection GraphKeys.TRAINABLE_VARIABLES (see tf.Variable).

  • scope – Optional scope for variable_scope.

Returns

A Tensor representing the output of the operation.

Raises
  • ValueError – If the rank of inputs is undefined.

  • ValueError – If rank or channels dimension of inputs is undefined.

  • ValueError – If channels dimension is not 1 or 3.

  • ValueError – If number of groups is not commensurate with number of channels.

  • ValueError – If reduction_axes or channels_axis are out of bounds.

  • ValueError – If reduction_axes are not mutually exclusive with channels_axis.
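A minimal sketch of normalising an NHWC activation tensor over four channel groups; the tensor shape and group count are illustrative:

 # Group-normalise a [batch, height, width, channels] tensor.
from tensorflow.python.ipu.ops import normalization_ops

def my_net(x):
  # x is assumed to have a channel count divisible by 4, e.g. 16 channels.
  return normalization_ops.group_norm(x, groups=4, channels_axis=-1,
                                      reduction_axes=[-3, -2])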

tensorflow.python.ipu.ops.normalization_ops.instance_norm(inputs, channels_axis=-1, reduction_axes=(-3, -2), center=True, scale=True, epsilon=1e-06, param_initializers=None, reuse=None, variables_collections=None, training=True, trainable=True, scope=None)

Functional interface for the instance normalization layer.

Reference: https://arxiv.org/abs/1607.08022.

“Instance Normalization: The Missing Ingredient for Fast Stylization” Dmitry Ulyanov, Andrea Vedaldi, Victor Lempitsky

Parameters
  • inputs – A Tensor with at least 2 dimensions, one of which is channels. All shape dimensions must be fully defined.

  • channels_axis – An integer. Specifies the index of the channels axis, which will be broken into groups, and statistics will be computed across each group. Must be mutually exclusive with reduction_axes. Preferred usage is to specify negative integers to be agnostic as to whether a batch dimension is included.

  • reduction_axes – Tuple of integers. Specifies dimensions over which statistics will be accumulated. Must be mutually exclusive with channels_axis. Statistics will not be accumulated across axes not specified in reduction_axes nor channels_axis. Preferred usage is to specify negative integers to be agnostic to whether a batch dimension is included.

    Some sample usage cases:

    NHWC format: channels_axis=-1, reduction_axes=[-3, -2]
    NCHW format: channels_axis=-3, reduction_axes=[-2, -1]

  • center – If True, add offset of beta to normalized tensor. If False, beta is ignored.

  • scale – If True, multiply by gamma. If False, gamma is not used. When the next layer is linear (also e.g. nn.relu), this can be disabled since the scaling can be done by the next layer.

  • epsilon – Small float added to variance to avoid dividing by zero.

  • activation_fn – Activation function, default set to None to skip it and maintain a linear activation.

  • param_initializers – Optional initializers for beta and gamma.

  • reuse – Whether or not the layer and its variables should be reused. To be able to reuse the layer scope must be given.

  • variables_collections – Optional collections for the variables.

  • training – Whether this operation is being used in a training network.

  • trainable – If True also add variables to the graph collection GraphKeys.TRAINABLE_VARIABLES (see tf.Variable).

  • scope – Optional scope for variable_scope.

Returns

A Tensor representing the output of the operation.

Raises
  • ValueError – If data_format is neither NHWC nor NCHW.

  • ValueError – If the rank of inputs is undefined.

  • ValueError – If rank or channels dimension of inputs is undefined.

tensorflow.python.ipu.ops.normalization_ops.layer_norm(inputs, channels_axis=-1, reduction_axes=(-3, -2), center=True, scale=True, epsilon=1e-06, param_initializers=None, reuse=None, variables_collections=None, training=True, trainable=True, scope=None)

Adds a Layer Normalization layer.

Based on the paper:

“Layer Normalization”

Jimmy Lei Ba, Jamie Ryan Kiros, Geoffrey E. Hinton

https://arxiv.org/abs/1607.06450.

Given a tensor inputs of rank R, moments are calculated and normalization is performed over axes begin_norm_axis … R - 1. Scaling and centering, if requested, is performed over axes begin_params_axis .. R - 1.

By default, begin_norm_axis = 1 and begin_params_axis = -1, meaning that normalization is performed over all but the first axis (the HWC if inputs is NHWC), while the beta and gamma trainable parameters are calculated for the rightmost axis (the C if inputs is NHWC). Scaling and recentering is performed via broadcast of the beta and gamma parameters with the normalized tensor.

The shapes of beta and gamma are inputs.shape[begin_params_axis:], and this part of the inputs’ shape must be fully defined.

Parameters
  • inputs – A Tensor with at least 2 dimensions, one of which is channels. All shape dimensions must be fully defined.

  • channels_axis – An integer. Specifies the index of the channels axis, which will be broken into groups, and statistics will be computed across each group. Must be mutually exclusive with reduction_axes. Preferred usage is to specify negative integers to be agnostic as to whether a batch dimension is included.

  • reduction_axes – Tuple of integers. Specifies dimensions over which statistics will be accumulated. Must be mutually exclusive with channels_axis. Statistics will not be accumulated across axes not specified in reduction_axes nor channels_axis. Preferred usage is to specify negative integers to be agnostic to whether a batch dimension is included.

    Some sample usage cases:

    NHWC format: channels_axis=-1, reduction_axes=[-3, -2]
    NCHW format: channels_axis=-3, reduction_axes=[-2, -1]

  • center – If True, add offset of beta to normalized tensor. If False, beta is ignored.

  • scale – If True, multiply by gamma. If False, gamma is not used. When the next layer is linear (also e.g. nn.relu), this can be disabled since the scaling can be done by the next layer.

  • epsilon – Small float added to variance to avoid dividing by zero.

  • activation_fn – Activation function, default set to None to skip it and maintain a linear activation.

  • param_initializers – Optional initializers for beta and gamma.

  • reuse – Whether or not the layer and its variables should be reused. To be able to reuse the layer scope must be given.

  • variables_collections – Optional collections for the variables.

  • training – Whether this operation is being used in a training network.

  • trainable – If True also add variables to the graph collection GraphKeys.TRAINABLE_VARIABLES (see tf.Variable).

  • scope – Optional scope for variable_scope.

Returns

A Tensor representing the output of the operation, having the same shape and dtype as inputs.

Raises

ValueError – If the rank of inputs is not known at graph build time, or if inputs.shape[begin_params_axis:] is not fully defined at graph build time.

Pipelining operators

class tensorflow.python.ipu.ops.pipelining_ops.OptimizerFunctionOutput(opt, loss)

A helper class used for returning a structured output from an optimizer_function in a pipeline.

__init__(opt, loss)

Creates an OptimizerFunctionOutput object.

Parameters
  • opt – An instance of optimizer.Optimizer which is used to generate the back-propagation and the weight update pipeline stages.

  • loss – The loss which is passed to the optimizer.

class tensorflow.python.ipu.ops.pipelining_ops.PipelineSchedule

The PipelineSchedule describes how stages are interleaved on the IPUs servicing the pipeline. The forward and backward passes of each stage will execute on the same IPUs. So, in the core of the pipeline, there is a choice as to whether to run the forward stages together, or to interleave them with the backward stages.

Grouped

This groups the forward passes on multiple IPUs. This requires more memory since activations need to be stored until the backward stages run together. However, since forward passes tend to be smaller than backward passes, Grouped tends to improve the speed of the execution, as different IPUs don’t spend so much time waiting for each other.

Interleaved

This schedules the backward passes whenever the forward passes have just generated some activations. Consequently fewer activations are required to be stored between the forward and backward pipeline stages, so less memory is required. However, since forward and backward stages tend to be very different in terms of execution cycles, the overall performance of the pipeline tends to be slower.

Sequential

This is a debug mode, where the pipeline is scheduled in the same way as if it were a sharded model.

class tensorflow.python.ipu.ops.pipelining_ops.PipelineStageOptions(convolution_options=None, matmul_options=None)

A helper class which can be used to configure Poplar compilation options (such as ‘availableMemoryProportion’) inside a pipeline forward, backward and weight update stage. This will override the global options set by ipu.utils.set_convolution_options and ipu.utils.set_matmul_options.

__init__(convolution_options=None, matmul_options=None)

Creates a PipelineStageOptions object.

Parameters
  • convolution_options – If provided, a dictionary of Poplar option flags for all the convolution operations in the stage.

  • matmul_options – If provided, a dictionary of Poplar option flags for all the matmul operations in the stage.

tensorflow.python.ipu.ops.pipelining_ops.pipeline(computational_stages, pipeline_depth, repeat_count=1, inputs=None, infeed_queue=None, outfeed_queue=None, optimizer_function=None, device_mapping=None, pipeline_schedule=None, forward_propagation_stages_poplar_options=None, backward_propagation_stages_poplar_options=None, weight_update_poplar_options=None, continuous_weight_updates=False, outfeed_loss=False, name=None)

Sets up a series of computational stages, where the outputs of one stage are the inputs to the next one. These stages are then executed in parallel across multiple IPUs. This approach can be used to split the model where layer(s) are executed on different IPUs.

The first stage takes the inputs and the infeed_queue (if provided) as its inputs. If the infeed_queue is provided, it is automatically dequeued (similar to the ipu.loops API) therefore care needs to be taken to make sure the signature of the first pipeline stage matches both the arguments from inputs and the infeed_queue, otherwise an error is thrown.

All tensors which are used in the pipeline which are not TensorFlow Variables need to be explicitly passed as inputs to the pipeline. If an input does not change its value during the execution of the pipeline op (for example hyperparameters such as learning rate), it needs to be passed as part of inputs. Alternatively, if these values change during execution (for example the model processes different batches of data) the input should be passed through the infeed_queue (see ipu.ipu_infeed_queue.IPUInfeedQueue).

When training a model, an optional optimizer_function function can be provided. This function takes all the outputs from the last computational stage as inputs, and returns an instance of OptimizerFunctionOutput that is used to generate the backwards pass of the model using the TensorFlow Optimizer API. This will internally create corresponding backpropagation pipeline stages for each pipeline stage and colocate them such that the activations and weights required for the gradient calculation and application stay on the device in order to minimise the number of copies between IPUs.

Note that the gradients, which are calculated by the compute_gradients function, will be accumulated automatically during the execution of the pipeline, unless continuous_weight_updates is enabled.

If the last computational stage has any outputs, then an outfeed_queue (see ipu.ipu_outfeed_queue.IPUOutfeedQueue) is required and all the outputs from the last computational stage are enqueued to the outfeed_queue.

Note that pipelining also supports recomputation; to enable it, use the tensorflow.python.ipu.utils.set_recomputation_options() function when configuring the device.

For example, a simple inference network for MNIST can be split across two IPUs:

 from tensorflow import keras

# Create the dataset
#...

# Create the data queues from/to IPU.
infeed_queue = ipu_infeed_queue.IPUInfeedQueue(dataset, "infeed")
outfeed_queue = ipu_outfeed_queue.IPUOutfeedQueue("outfeed")

# Create a pipelined model which is split across two stages.
def stage1(image):
  partial = keras.layers.Dense(256, activation=tf.nn.relu)(image)
  partial = keras.layers.Dense(128, activation=tf.nn.relu)(partial)
  return partial

def stage2(partial):
  logits = keras.layers.Dense(10)(partial)
  probabilities = tf.nn.softmax(logits)
  classes = tf.argmax(input=logits, axis=1)
  return probabilities, classes

def model():
  with variable_scope.variable_scope("vs", use_resource=True):
    pipeline_op = pipelining_ops.pipeline(
                      computational_stages=[stage1, stage2],
                      pipeline_depth=250,
                      repeat_count=2,
                      inputs=[],
                      infeed_queue=infeed_queue,
                      outfeed_queue=outfeed_queue,
                      device_mapping=[3,1],
                      name="Pipeline")
  return pipeline_op

with ops.device("/device:IPU:0"):
  compiled_model = ipu_compiler.compile(model, inputs=[])

outfeed_op = outfeed_queue.dequeue()
with tf.Session() as sess:
  result = sess.run(compiled_model)
  probabilities, classes = sess.run(outfeed_op)

In this setup, the model is split across two IPUs. By default, the first two layers would be executed on the first IPU and the third layer, probabilities and classes on the second IPU, but here device_mapping is used to override the default IPU allocation: the first two layers will instead be executed on the fourth IPU, and the third layer, probabilities and classes on the second IPU.

This creates a pipeline of depth 250 (specified by the pipeline_depth), which means each pipeline stage is executed 250 times.

This pipeline is then executed 2 times (specified by the repeat_count). The results of the pipeline (probabilities and classes) are returned to the host by the outfeed queue.

We can also train this network by providing optimizer_function:

 from tensorflow import keras

# Create the dataset
#...

# Create the data queues from/to IPU.
infeed_queue = ipu_infeed_queue.IPUInfeedQueue(dataset, "infeed")
outfeed_queue = ipu_outfeed_queue.IPUOutfeedQueue("outfeed")

# Create a pipelined model which is split across two stages.
def stage1(lr, images, labels):
  partial = keras.layers.Dense(256, activation=tf.nn.relu)(images)
  partial = keras.layers.Dense(128, activation=tf.nn.relu)(partial)
  return lr, partial, labels

def stage2(lr, partial, labels):
  logits = keras.layers.Dense(10)(partial)
  cross_entropy = tf.nn.sparse_softmax_cross_entropy_with_logits(
                        labels=labels, logits=logits)
  loss = tf.reduce_mean(cross_entropy)
  return lr, loss

def optimizer_function(lr, loss):
  optimizer = tf.train.GradientDescentOptimizer(lr)
  return pipelining_ops.OptimizerFunctionOutput(optimizer, loss)

def model(lr):
  with variable_scope.variable_scope("vs", use_resource=True):
    pipeline_op = pipelining_ops.pipeline(
                      computational_stages=[stage1, stage2],
                      pipeline_depth=128,
                      repeat_count=10,
                      inputs=[lr],
                      infeed_queue=infeed_queue,
                      outfeed_queue=outfeed_queue,
                      optimizer_function=optimizer_function,
                      name="Pipeline")
  return pipeline_op

with ops.device('cpu'):
  lr = tf.placeholder(np.float16, [])

with ops.device("/device:IPU:0"):
  compiled_model = ipu_compiler.compile(model, inputs=[lr])

outfeed_op = outfeed_queue.dequeue()
with tf.Session() as sess:
  result = sess.run(compiled_model, {lr: 0.01})
  losses = sess.run(outfeed_op)

Here the tf.train.GradientDescentOptimizer generates the pipeline stages which calculate the gradients and apply them to the weights. Note how the loss is returned to the host by the outfeed queue.

Note that modifying tf.Variable values in a pipeline stage and/or during the gradient calculation will result in undefined behavior. These variables can only be modified by the apply_gradients member function of the applied Optimizer.

Parameters
  • computational_stages – a list of python functions, where each function represents a computational pipeline stage. The function takes the outputs of the previous pipeline stage as its inputs.

  • pipeline_depth – the number of times each pipeline stage will be executed.

  • repeat_count – the number of times the pipeline will be executed.

  • inputs – arguments passed to the first pipeline stage.

  • infeed_queue – optional IPUInfeedQueue, if passed, it is dequeued and passed as an input in the first pipeline stage.

  • outfeed_queue – IPUOutfeedQueue, required if the last computational stage has any outputs. The outputs of these are enqueued to this queue and they can be accessed on the host.

  • optimizer_function – optional Python function which takes the output of the last computational stage as parameters and returns an instance of pipelining_ops.OptimizerFunctionOutput in order to generate the back-propagation and weight-update parts of the model suitable for training.

  • device_mapping – optional stage-to-IPU mapping override.

  • pipeline_schedule – Which scheduling algorithm to use for pipeline lowering. Defaults to PipelineSchedule.Grouped.

  • forward_propagation_stages_poplar_options – If provided, a list of length equal to the number of computational stages. Each element is a PipelineStageOptions object which allows for fine grain control of the Poplar options for a given forward propagation computational stage.

  • backward_propagation_stages_poplar_options – If provided, a list of length equal to the number of computational stages. Each element is a PipelineStageOptions object which allows for fine grained control of the Poplar options for a given backward propagation computational stage.

  • weight_update_poplar_options – If provided, a PipelineStageOptions object which allows for fine grained control of the Poplar options for the weight update stage.

  • continuous_weight_updates – ** CURRENTLY UNIMPLEMENTED ** When training, this option will apply the gradients to the resource variables immediately, rather than accumulating the gradients and applying them at the end of each execution of the pipeline.

  • outfeed_loss – If True, the loss given by the optimizer_function will be enqueued on the outfeed, instead of the outputs from the last computational stage.

  • name – name of this pipeline.

Returns

An Operation that executes the pipeline.

Popops reduce scatter operator

tensorflow.python.ipu.ops.reduce_scatter_op.reduce_scatter(x, replication_factor, name=None)

Reduce (sum) the given replicated tensor with the result scattered across the replicas. For an input of shape [num_elements], the output will have shape [ceil(num_elements / replication_factor)]. If replication_factor does not evenly divide num_elements, the result is zero-padded. Example:

 Input:  Replica0: [x0, y0, z0]
        Replica1: [x1, y1, z1]
Output: Replica0: [x0 + x1, y0 + y1]
        Replica1: [z0 + z1, 0]
Parameters
  • x – The input Tensor. Must have rank 1.

  • replication_factor – The replication factor of the model.

  • name – Optional op name.

Returns

A Tensor with the result for this replica.

Popnn recurrent operators

class tensorflow.python.ipu.ops.rnn_ops.PopnnLSTM(num_units, dtype=tf.float32, partials_dtype=tf.float32, seed=None, weights_initializer=None, bias_initializer=None, name=None)

XLA compatible, time-major Popnn implementation of an LSTM layer.

Below is a typical workflow:

 with tf.Graph().as_default():
  lstm = PopnnLSTM(num_units, ...)

  outputs, output_states = lstm(inputs, initial_states, training=True)
build(input_shape)

Create variables of the PopnnLSTM.

It can be called manually before __call__() or automatically through __call__(). In the former case, any subsequent __call__() will skip creating variables.

Parameters

input_shape – a TensorShape object with 3 dimensions.

Raises

ValueError – if input_shape has wrong dimension or unknown 3rd dimension.

call(inputs, initial_state=None, training=True)

Runs the forward step for the LSTM model.

Parameters
  • inputs – 3-D tensor with shape [time_len, batch_size, input_size].

  • initial_state – An LSTMStateTuple of state tensors, each shaped [batch_size, num_units]. If not provided, the state is initialized to zeros. DEPRECATED: a tuple of tensors (input_h_state, input_c_state), each of shape [batch_size, num_units], is also accepted.

  • training – whether this operation will be used in training or inference.

Returns

 

  • output: a tensor of shape [time_len, batch_size, num_units].

  • output_states: An LSTMStateTuple of the same shape and structure as initial_state. If the initial state used the deprecated behaviour of not passing LSTMStateTuple, then a tuple (output_h_state, output_c_state) is returned.

 

Return type

tuple of output and output states

Raises

ValueError – if initial_state is not valid.

state_shape(batch_size)

Shape of Popnn LSTM states.

Shape is a 2-element tuple. Each is [batch_size, num_units]

Parameters

batch_size – an int

Returns

a tuple of python arrays.

class tensorflow.python.ipu.ops.rnn_ops.PopnnGRU(num_units, dtype=tf.float32, partials_dtype=tf.float32, seed=None, weights_initializer=None, bias_initializer=None, name=None)

XLA compatible, time-major Popnn implementation of a GRU layer.

Below is a typical workflow:

 with tf.Graph().as_default():
  gru = PopnnGRU(num_units, ...)

  outputs, output_state = gru(inputs, initial_state, training=True)
build(input_shape)

Create variables of the PopnnGRU.

It can be called manually before __call__() or automatically through __call__(). In the former case, any subsequent __call__() will skip creating variables.

Parameters

input_shape – a TensorShape object with 3 dimensions.

Raises

ValueError – if input_shape has wrong dimension or unknown 3rd dimension.

call(inputs, initial_state=None, training=True)

Runs the forward step for the GRU model.

Parameters
  • inputs – 3-D tensor with shape [time_len, batch_size, input_size].

  • initial_state – Initial state tensor, shaped [batch_size, num_units]. If not provided, the state is initialized to zeros.

  • training – whether this operation will be used in training or inference.

Returns

a tensor of shape [time_len, batch_size, num_units]. output_state: The output state of the last cell.

Return type

output

Raises

ValueError – if initial_state is not valid.

state_shape(batch_size)

Shape of Popnn GRU state.

State shape is [batch_size, num_units].

Parameters

batch_size – an int

Returns

A python array.

Popnn random operators

tensorflow.python.ipu.ops.rand_ops.dropout(x, seed=None, rate=0.5, scale=1, seed_modifier=1, name=None)

This targets the poplibs popnn dropout operation, optimized for execution on the IPU.

Parameters
  • x – The input tensor.

  • rate – The probability that a given element will be zeroed out.

  • scale – An optional factor to apply to all other elements.

  • seed_modifier – An optional parameter given to poplar which uses it to modify the seed.

  • name – Optional op name.

Returns

A Tensor which has some nodes set to zero, as randomly selected based on other parameters.
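A minimal sketch of applying IPU-optimised dropout inside a model function:

 # Zero roughly half of the elements of x; scale defaults to 1.
from tensorflow.python.ipu.ops import rand_ops

def my_net(x):
  return rand_ops.dropout(x, rate=0.5)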

Popops cross replica operators

tensorflow.python.ipu.ops.cross_replica_ops.cross_replica_sum(x, name=None)

Sum the input tensor across replicas.

Parameters
  • x – The local tensor to sum.

  • name – Optional op name.

Returns

A Tensor which is summed across replicas.
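A minimal sketch, assuming the graph is replicated (for example by configuring a Tensorflow device with more than one IPU); each replica contributes its local value and receives the total:

 # Sum a per-replica scalar across all replicas.
from tensorflow.python.ipu.ops import cross_replica_ops

def my_net(x):
  local_value = tf.reduce_sum(x)
  return cross_replica_ops.cross_replica_sum(local_value)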

Summary operations for IPUs

tensorflow.python.ipu.ops.summary_ops.ipu_compile_summary(name, op_list, collections=None)

Create an IPU compiler summary operation.

Parameters
  • name – A name for the summary.

  • op_list – An operation or list of operations to make this summary dependent upon.

  • collections – Optional collections to add the summary into.

Returns

The new summary operation
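A minimal sketch of attaching a compile summary to a graph output and evaluating it in the session; writing the result out (for example with a summary writer) follows the usual TensorFlow mechanisms, and the name output below stands for an op or tensor from the model:

 # Make the summary dependent on the model output ("output" is an assumption).
from tensorflow.python.ipu.ops import summary_ops

summary = summary_ops.ipu_compile_summary("ipu_compile", [output])
with tf.Session() as sess:
  compile_summary = sess.run(summary)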

Custom operations

class tensorflow.python.ipu.optimizers.map_gradient_optimizer.MapGradientOptimizer(wrapped_optimizer, gradient_mapping_function, name='MapGradientOptimizer')

This class enables modification of the computed gradients, before they are passed to the final optimizer for application.

MapGradientOptimizer needs a map function that will modify the gradients, and an optimizer to which the modified gradients are passed.

The map function has two arguments: gradient and variable. The map function must return the modified gradient.

Example

 # Define a function which will modify the computed gradients.
# This is a gradient decay function.
import tensorflow as tf
from tensorflow.python.framework import dtypes
from tensorflow.python.ops import variables
from tensorflow.python.training import gradient_descent
from tensorflow.python.ipu.optimizers import map_gradient_optimizer

WEIGHT_DECAY = 0.01

def map_fn_decay(grad, var):
  return grad + (WEIGHT_DECAY * var)

# To run the code we need a session:
with tf.Session():
  optimizer = gradient_descent.GradientDescentOptimizer(0.000001)
  # Wrap the optimizer in a MapGradientOptimizer.
  map_optimizer = map_gradient_optimizer.MapGradientOptimizer(
      optimizer, map_fn_decay)
  # Gradients are computed by compute_gradients(), where our map function
  # modifies the computed gradients. compute_gradients(loss, var_list) takes
  # the loss and the variable list as arguments, so define them and call
  # map_optimizer.compute_gradients().
  values = [1.0, 2.0, 3.0]
  vars_ = [variables.Variable([v], dtype=dtypes.float32) for v in values]
  grads_and_vars = map_optimizer.compute_gradients(
      vars_[0] * vars_[1] + vars_[0] * vars_[2] + vars_[1] * vars_[2],
      vars_)
  # The output grads_and_vars contains the computed gradients, modified by
  # the decay map function. With WEIGHT_DECAY = 0.01 the gradients are
  # 5.01, 4.02 and 3.03; without MapGradientOptimizer they would be 5, 4 and 3.
Parameters
  • wrapped_optimizer – tensorflow (derived) optimizer.

  • gradient_mapping_function – a function which is applied to the gradients and variables provided by wrapped_optimizer.compute_gradients().

Returns

compute_gradients() returns a list of (gradient, variable) pairs.

apply_gradients(grads_and_vars, global_step=None, name=None)

Apply gradients to variables.

This is the second part of minimize(). It returns an Operation that applies gradients.

Parameters
  • grads_and_vars – List of (gradient, variable) pairs as returned by compute_gradients().

  • global_step – Optional Variable to increment by one after the variables have been updated.

  • name – Optional name for the returned operation. Default to the name passed to the Optimizer constructor.

Returns

An Operation that applies the specified gradients. If global_step was not None, that operation also increments global_step.

Raises
  • TypeError – If grads_and_vars is malformed.

  • ValueError – If none of the variables have gradients.

  • RuntimeError – If you should use _distributed_apply() instead.

compute_gradients(loss, var_list=None, gate_gradients=1, aggregation_method=None, colocate_gradients_with_ops=False, grad_loss=None)

Compute gradients of loss for the variables in var_list.

This is the first part of minimize(). It returns a list of (gradient, variable) pairs where “gradient” is the gradient for “variable”. Note that “gradient” can be a Tensor, an IndexedSlices, or None if there is no gradient for the given variable.

Parameters
  • loss – A Tensor containing the value to minimize or a callable taking no arguments which returns the value to minimize. When eager execution is enabled it must be a callable.

  • var_list – Optional list or tuple of tf.Variable to update to minimize loss. Defaults to the list of variables collected in the graph under the key GraphKeys.TRAINABLE_VARIABLES.

  • gate_gradients – How to gate the computation of gradients. Can be GATE_NONE, GATE_OP, or GATE_GRAPH.

  • aggregation_method – Specifies the method used to combine gradient terms. Valid values are defined in the class AggregationMethod.

  • colocate_gradients_with_ops – If True, try colocating gradients with the corresponding op.

  • grad_loss – Optional. A Tensor holding the gradient computed for loss.

Returns

A list of (gradient, variable) pairs. Variable is always present, but gradient can be None.

Raises
  • TypeError – If var_list contains anything else than Variable objects.

  • ValueError – If some arguments are invalid.

  • RuntimeError – If called with eager execution enabled and loss is not callable.

@compatibility(eager) When eager execution is enabled, gate_gradients, aggregation_method, and colocate_gradients_with_ops are ignored. @end_compatibility

get_slot(var, name)

Return a slot named name created for var by the Optimizer.

Some Optimizer subclasses use additional variables. For example Momentum and Adagrad use variables to accumulate updates. This method gives access to these Variable objects if for some reason you need them.

Use get_slot_names() to get the list of slot names created by the Optimizer.

Parameters
  • var – A variable passed to minimize() or apply_gradients().

  • name – A string.

Returns

The Variable for the slot if it was created, None otherwise.

get_slot_names()

Return a list of the names of slots created by the Optimizer.

See get_slot().

Returns

A list of strings.

minimize(loss, global_step=None, var_list=None, gate_gradients=1, aggregation_method=None, colocate_gradients_with_ops=False, name=None, grad_loss=None)

Add operations to minimize loss by updating var_list.

This method simply combines calls to compute_gradients() and apply_gradients(). If you want to process the gradients before applying them, call compute_gradients() and apply_gradients() explicitly instead of using this function.

Parameters
  • loss – A Tensor containing the value to minimize.

  • global_step – Optional Variable to increment by one after the variables have been updated.

  • var_list – Optional list or tuple of Variable objects to update to minimize loss. Defaults to the list of variables collected in the graph under the key GraphKeys.TRAINABLE_VARIABLES.

  • gate_gradients – How to gate the computation of gradients. Can be GATE_NONE, GATE_OP, or GATE_GRAPH.

  • aggregation_method – Specifies the method used to combine gradient terms. Valid values are defined in the class AggregationMethod.

  • colocate_gradients_with_ops – If True, try colocating gradients with the corresponding op.

  • name – Optional name for the returned operation.

  • grad_loss – Optional. A Tensor holding the gradient computed for loss.

Returns

An Operation that updates the variables in var_list. If global_step was not None, that operation also increments global_step.

Raises

ValueError – If some of the variables are not Variable objects.

Eager compatibility: when eager execution is enabled, loss should be a Python function that takes no arguments and computes the value to be minimized. Minimization (and gradient computation) is done with respect to the elements of var_list if it is not None, otherwise with respect to any trainable variables created during the execution of the loss function. gate_gradients, aggregation_method, colocate_gradients_with_ops and grad_loss are ignored when eager execution is enabled.
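
A minimal graph-mode sketch of minimize() with a global step counter is shown below; the variable and loss are placeholders for illustration.

  import tensorflow as tf

  w = tf.get_variable("w", shape=[], initializer=tf.constant_initializer(5.0))
  loss = tf.square(w)

  global_step = tf.train.get_or_create_global_step()
  opt = tf.train.GradientDescentOptimizer(learning_rate=0.1)

  # minimize() combines compute_gradients() and apply_gradients();
  # global_step is incremented every time train_op is run.
  train_op = opt.minimize(loss, global_step=global_step)

  with tf.Session() as sess:
      sess.run(tf.global_variables_initializer())
      sess.run(train_op)
      print(sess.run(global_step))  # 1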

variables()

A list of variables which encode the current state of the Optimizer.

Includes slot variables and additional global variables created by the optimizer in the current default graph.

Returns

A list of variables.

Dataset benchmarking

tensorflow.python.ipu.dataset_benchmark.dataset_benchmark(dataset, number_of_epochs, elements_per_epochs, print_stats=True)

Allows the user to benchmark the performance of a tf.data.Dataset.

Parameters
  • dataset – An instance of tf.data.Dataset which will be benchmarked.

  • number_of_epochs – The number of epochs this dataset will be run for.

  • elements_per_epochs – The number of elements in each epoch.

  • print_stats – Whether to print statistics about the performance to the console.

Returns

A JSON string with performance statistics, which records the following metrics every epoch:

  • elements_processed - number of elements processed.

  • total_bytes_processed - total number of bytes processed.

  • time_elapsed - the time it took (in seconds) for the epoch to complete.

  • elements_per_second - number of elements processed per second.

  • bandwidth - the bandwidth achieved, measured in GB/s.

The returned JSON string can be parsed with Python's native json library (see https://docs.python.org/3/library/json.html).

Raises
  • TypeError – if dataset is not an instance of tf.data.Dataset.

  • ValueError – if number_of_epochs or elements_per_epochs is less than 1.

tensorflow.python.ipu.dataset_benchmark.infeed_benchmark(infeed_queue, number_of_epochs, elements_per_epochs, print_stats=True)

Allows the user to benchmark the performance of an ipu.ipu_infeed_queue.IPUInfeedQueue.

Parameters
  • infeed_queue – An instance of ipu.ipu_infeed_queue.IPUInfeedQueue which will be benchmarked.

  • number_of_epochs – The number of epochs this infeed queue will be run for.

  • elements_per_epochs – The number of elements in each epoch.

  • print_stats – Whether to print statistics about the performance to the console.

Returns

A JSON string with performance statistics, which records the following metrics every epoch:

  • elements_processed - number of elements processed.

  • total_bytes_processed - total number of bytes processed.

  • time_elapsed - the time it took (in seconds) for the epoch to complete.

  • elements_per_second - number of elements processed per second.

  • bandwidth - the bandwidth achieved, measured in GB/s.

The returned JSON string can be parsed with Python's native json library (see https://docs.python.org/3/library/json.html).

Raises
  • TypeError – if infeed_queue is not an instance of ipu.ipu_infeed_queue.IPUInfeedQueue.

  • ValueError – if number_of_epochs or elements_per_epochs is less than 1.
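
Both benchmarking functions follow the same calling pattern; the sketch below shows dataset_benchmark with a placeholder dataset and epoch sizes chosen purely for illustration. It follows the documented return value (a JSON string); depending on the execution mode, the returned value may first need to be evaluated (for example with session.run) before it can be parsed.

  import json
  import tensorflow as tf
  from tensorflow.python.ipu import dataset_benchmark

  # Placeholder dataset: 1000 float32 vectors of length 32 per epoch.
  ds = tf.data.Dataset.from_tensor_slices(tf.zeros([1000, 32], dtype=tf.float32))
  ds = ds.repeat()

  # Benchmark 5 epochs of 1000 elements each; print_stats=True also writes a
  # summary to the console.
  json_stats = dataset_benchmark.dataset_benchmark(
      dataset=ds, number_of_epochs=5, elements_per_epochs=1000, print_stats=True)

  # The statistics are returned as a JSON string (in graph mode this may be a
  # string tensor that needs to be evaluated before parsing).
  stats = json.loads(json_stats)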

TensorFlow operators supported by the IPU

Supported operators for device: XLA_IPU_JIT

Each entry below lists a supported operator followed by its type constraints; an entry with no constraints shown has none.

Abs

T={float,half,int32,int64}

Acos

T={float,half,int32,int64}

Acosh

T={float,half}

Add

T={float,half,int32,int64}

AddN

T={float,half,int32,int64,variant}

AddV2

T={float,half,int32,int64}

AdjustContrastv2

T={float,half}

AdjustHue

T={float,half}

AdjustSaturation

T={float,half}

All

Tidx={int32,int64}

Any

Tidx={int32,int64}

ApproximateEqual

T={float,half,int32,int64}

ArgMax

output_type={int32,int64}
T={float,half,int32,int64}
Tidx={int32,int64}

ArgMin

output_type={int32,int64}
T={float,half,int32,int64}
Tidx={int32,int64}

Asin

T={float,half,int32,int64}

Asinh

T={float,half}

AssignAddVariableOp

dtype={float,half,int32,int64}

AssignSubVariableOp

dtype={float,half,int32,int64}

AssignVariableOp

dtype={bool,float,half,int32,int64}

Atan

T={float,half,int32,int64}

Atan2

T={float,half}

Atanh

T={float,half}

AvgPool

T={float,half}

AvgPool3D

T={float,half}

AvgPool3DGrad

T={float,half}

AvgPoolGrad

T={float,half}

BatchMatMul

T={float,half,int32,int64}

BatchMatMulV2

T={float,half,int32,int64}

BatchToSpace

Tidx={int32,int64}
T={bool,float,half,int32,int64}

BatchToSpaceND

Tcrops={int32,int64}
T={bool,float,half,int32,int64}
Tblock_shape={int32,int64}

BiasAdd

T={float,half,int32,int64}

BiasAddGrad

T={float,half,int32,int64}

BiasAddV1

T={float,half,int32,int64}

Bitcast

type={float,half,int32,int64}
T={float,half,int32,int64}

BitwiseAnd

T={int32,int64}

BitwiseOr

T={int32,int64}

BitwiseXor

T={int32,int64}

BroadcastArgs

T={int32,int64}

BroadcastGradientArgs

T={int32,int64}

BroadcastTo

Tidx={int32,int64}
T={bool,float,half,int32,int64}

Bucketize

T={float,int32,int64}

Case

Tout={bool,float,half,int32,int64,resource,variant}
Tin={bool,float,half,int32,int64,resource,variant}

Cast

DstT={bool,float,half,int32,int64}
SrcT={bool,float,half,int32,int64}

Ceil

T={float,half}

Cholesky

T={float,half}

ClipByValue

T={float,half,int32,int64}

Concat

T={bool,float,half,int32,int64}

ConcatOffset

 

ConcatV2

Tidx={int32}
T={bool,float,half,int32,int64}

ConjugateTranspose

Tperm={int32,int64}
T={bool,float,half,int32,int64}

Const

dtype={bool,float,half,int32,int64,string}

ControlTrigger

 

Conv2D

T={float,half}

Conv2DBackpropFilter

T={float,half}

Conv2DBackpropInput

T={float,half}

Conv3D

T={float,half}

Conv3DBackpropFilterV2

T={float,half}

Conv3DBackpropInputV2

Tshape={int32,int64}
T={float,half}

Cos

T={float,half}

Cosh

T={float,half}

Cross

T={float,half,int32,int64}

Cumprod

Tidx={int32,int64}
T={float,half,int32}

Cumsum

Tidx={int32,int64}
T={float,half,int32}

DataFormatDimMap

T={int32,int64}

DataFormatVecPermute

T={int32,int64}

DepthToSpace

T={bool,float,half,int32,int64}

DepthwiseConv2dNative

T={float,half}

DepthwiseConv2dNativeBackpropFilter

T={float,half}

DepthwiseConv2dNativeBackpropInput

T={float,half}

Diag

T={float,half,int32,int64}

DiagPart

T={float,half,int32,int64}

Digamma

T={float,half}

Div

T={float,half,int32,int64}

DivNoNan

T={float,half}

DynamicStitch

T={bool,float,half,int32,int64}

Elu

T={float,half}

EluGrad

T={float,half}

Empty

dtype={bool,float,half,int32,int64}

EmptyTensorList

shape_type={int32,int64,variant}
element_dtype={bool,float,half,int32,int64,variant}

Equal

T={bool,float,half,int32,int64}

Erf

T={float,half}

Erfc

T={float,half}

Exp

T={float,half}

ExpandDims

Tdim={int32,int64}
T={bool,float,half,int32,int64}

Expm1

T={float,half}

ExtractImagePatches

T={float,half,int32,int64}

FakeParam

dtype={bool,float,half,int32,int64}

FakeQuantWithMinMaxArgs

 

FakeQuantWithMinMaxArgsGradient

 

FakeQuantWithMinMaxVars

 

FakeQuantWithMinMaxVarsGradient

 

Fill

index_type={int32,int64}
T={bool,float,half,int32,int64}

Floor

T={float,half}

FloorDiv

T={float,half,int32,int64}

FloorMod

T={float,half,int32,int64}

FusedBatchNorm

T={float}

FusedBatchNormGrad

T={float}

FusedBatchNormGradV2

V={float,half}
T={float,half}
U={float,half}

FusedBatchNormGradV3

V={float,half}
T={float,half}
U={float,half}

FusedBatchNormV2

U={float,half}
T={float,half}

FusedBatchNormV3

U={float,half}
T={float,half}

Gather

Tindices={int32,int64}
Tparams={bool,float,half,int32,int64}

GatherNd

Tindices={int32,int64}
Tparams={bool,float,half,int32,int64}

GatherV2

Taxis={int32,int64}
Tparams={bool,float,half,int32,int64}
Tindices={int32,int64}

Greater

T={float,half,int32,int64}

GreaterEqual

T={float,half,int32,int64}

HSVToRGB

T={float,half}

IRFFT

 

IRFFT2D

 

IRFFT3D

 

Identity

T={bool,float,half,int32,int64,resource,variant}

IdentityN

T={bool,float,half,int32,int64,resource,variant}

If

Tout={bool,float,half,int32,int64,resource,variant}
Tcond={bool,float,half,int32,int64,resource,variant}
Tin={bool,float,half,int32,int64,resource,variant}

InTopKV2

T={int32,int64}

Inv

T={float,half,int32,int64}

Invert

T={int32,int64}

InvertPermutation

T={int32}

IsFinite

T={float,half}

IsInf

T={float,half}

IsNan

T={float,half}

L2Loss

T={float,half}

LRN

T={float,half}

LRNGrad

T={float,half}

LeakyRelu

T={float,half}

LeakyReluGrad

T={float,half}

LeftShift

T={int32,int64}

Less

T={float,half,int32,int64}

LessEqual

T={float,half,int32,int64}

Lgamma

T={float,half}

LinSpace

Tidx={int32,int64}
T={float}

ListDiff

out_idx={int32,int64}
T={int32,int64}

Log

T={float,half}

Log1p

T={float,half}

LogSoftmax

T={float,half}

LogicalAnd

 

LogicalNot

 

LogicalOr

 

MatMul

T={float,half}

MatrixBandPart

Tindex={int32,int64}
T={bool,float,half,int32,int64}

MatrixDiag

T={bool,float,half,int32,int64}

MatrixDiagPart

T={bool,float,half,int32,int64}

MatrixDiagPartV2

T={bool,float,half,int32,int64}

MatrixDiagV2

T={bool,float,half,int32,int64}

MatrixInverse

T={float,half}

MatrixSetDiag

T={bool,float,half,int32,int64}

MatrixSetDiagV2

T={bool,float,half,int32,int64}

MatrixTriangularSolve

T={float,half}

Max

Tidx={int32,int64}
T={float,half,int32,int64}

MaxPool

T={float,half,int32,int64}

MaxPool3D

T={float,half}

MaxPool3DGrad

TInput={float,half}
T={float,half}

MaxPoolGrad

T={float,half,int32,int64}

MaxPoolGradGradV2

T={float}

MaxPoolGradV2

T={float,half,int32,int64}

MaxPoolV2

T={float,half,int32,int64}

Maximum

T={float,half,int32,int64}

Mean

Tidx={int32,int64}
T={float,half,int32,int64}

Min

Tidx={int32,int64}
T={float,half,int32,int64}

Minimum

T={float,half,int32,int64}

MirrorPad

Tpaddings={int32,int64}
T={bool,float,half,int32,int64}

Mod

T={float,half,int32,int64}

Mul

T={float,half,int32,int64}

MulNoNan

T={float,half}

Multinomial

output_dtype={int32,int64}
T={float,half,int32,int64}

Neg

T={float,half,int32,int64}

NextAfter

T={float}

NoOp

 

NotEqual

T={bool,float,half,int32,int64}

OneHot

TI={int32,int64}
T={bool,float,half,int32,int64}

OnesLike

T={bool,float,half,int32,int64}

Pack

T={bool,float,half,int32,int64}

Pad

Tpaddings={int32,int64}
T={bool,float,half,int32,int64}

PadV2

Tpaddings={int32,int64}
T={bool,float,half,int32,int64}

ParallelDynamicStitch

T={bool,float,half,int32,int64}

ParameterizedTruncatedNormal

T={int32,int64}
dtype={float}

PartitionedCall

Tout={bool,float,half,int32,int64,resource,string,variant}
Tin={bool,float,half,int32,int64,resource,string,variant}

PlaceholderWithDefault

dtype={bool,float,half,int32,int64}

Pow

T={float,half,int32,int64}

PreventGradient

T={bool,float,half,int32,int64}

Prod

Tidx={int32,int64}
T={float,half,int32,int64}

QuantizeAndDequantizeV2

T={float,half}

QuantizeAndDequantizeV3

T={float,half}

RFFT

 

RFFT2D

 

RFFT3D

 

RGBToHSV

T={float,half}

RandomShuffle

T={bool,float,half,int32,int64}

RandomStandardNormal

T={int32,int64}
dtype={float,half}

RandomUniform

T={int32,int64}
dtype={float,half}

RandomUniformInt

T={int32,int64}
Tout={int32,int64}

Range

Tidx={float,int32,int64}

Rank

T={bool,float,half,int32,int64}

ReadVariableOp

dtype={bool,float,half,int32,int64}

RealDiv

T={float,half,int32,int64}

Reciprocal

T={float,half,int32,int64}

ReciprocalGrad

T={float,half}

Relu

T={float,half,int32,int64}

Relu6

T={float,half,int32,int64}

Relu6Grad

T={float,half,int32,int64}

ReluGrad

T={float,half,int32,int64}

Reshape

Tshape={int32,int64}
T={bool,float,half,int32,int64}

ResizeBilinear

T={float,half,int32,int64}

ResizeBilinearGrad

T={float,half}

ResizeNearestNeighbor

T={float,half,int32,int64}

ResourceApplyAdaMax

T={float,half}

ResourceApplyAdadelta

T={float,half}

ResourceApplyAdagrad

T={float,half}

ResourceApplyAdagradDA

T={float,half}

ResourceApplyAdagradV2

T={float,half}

ResourceApplyAdam

T={float,half}

ResourceApplyAddSign

T={float,half}

ResourceApplyCenteredRMSProp

T={float,half}

ResourceApplyFtrl

T={float,half}

ResourceApplyFtrlV2

T={float,half}

ResourceApplyGradientDescent

T={float,half}

ResourceApplyKerasMomentum

T={float,half}

ResourceApplyMomentum

T={float,half}

ResourceApplyPowerSign

T={float,half}

ResourceApplyProximalAdagrad

T={float,half}

ResourceApplyProximalGradientDescent

T={float,half}

ResourceApplyRMSProp

T={float,half}

ResourceGather

Tindices={int32,int64}
dtype={bool,float,half,int32,int64}

ResourceScatterAdd

Tindices={int32,int64}
dtype={float,half,int32,int64}

ResourceScatterDiv

Tindices={int32,int64}
dtype={float,half,int32,int64}

ResourceScatterMax

Tindices={int32,int64}
dtype={float,half,int32,int64}

ResourceScatterMin

Tindices={int32,int64}
dtype={float,half,int32,int64}

ResourceScatterMul

Tindices={int32,int64}
dtype={float,half,int32,int64}

ResourceScatterNdAdd

Tindices={int32,int64}
T={bool,float,half,int32,int64}

ResourceScatterNdSub

Tindices={int32,int64}
T={bool,float,half,int32,int64}

ResourceScatterNdUpdate

Tindices={int32,int64}
T={bool,float,half,int32,int64}

ResourceScatterSub

Tindices={int32,int64}
dtype={float,half,int32,int64}

ResourceScatterUpdate

Tindices={int32,int64}
dtype={bool,float,half,int32,int64}

ResourceStridedSliceAssign

Index={int32,int64}
T={bool,float,half,int32,int64}

Reverse

T={bool,float,half,int32,int64}

ReverseSequence

Tlen={int32,int64}
T={bool,float,half,int32,int64}

ReverseV2

T={bool,float,half,int32,int64}
Tidx={int32,int64}

RightShift

T={int32,int64}

Rint

T={float,half}

Roll

Taxis={int32,int64}
T={bool,float,half,int32,int64}
Tshift={int32,int64}

Round

T={float,half,int32,int64}

Rsqrt

T={float,half}

RsqrtGrad

T={float,half}

ScatterNd

Tindices={int32,int64}
T={bool,float,half,int32,int64}

Select

T={bool,float,half,int32,int64}

SelectV2

T={bool,float,half,int32,int64}

SelfAdjointEigV2

T={float,half}

Selu

T={float,half}

SeluGrad

T={float,half}

Shape

out_type={int32,int64}
T={bool,float,half,int32,int64}

ShapeN

out_type={int32,int64}
T={bool,float,half,int32,int64}

Sigmoid

T={float,half}

SigmoidGrad

T={float,half}

Sign

T={float,half,int32,int64}

Sin

T={float,half}

Sinh

T={float,half}

Size

out_type={int32,int64}
T={bool,float,half,int32,int64}

Slice

Index={int32,int64}
T={bool,float,half,int32,int64}

Snapshot

T={bool,float,half,int32,int64}

Softmax

T={float,half}

SoftmaxCrossEntropyWithLogits

T={float,half}

Softplus

T={float,half}

SoftplusGrad

T={float,half}

Softsign

T={float,half}

SoftsignGrad

T={float,half}

SpaceToBatch

Tpaddings={int32,int64}
T={bool,float,half,int32,int64}

SpaceToBatchND

Tpaddings={int32,int64}
T={bool,float,half,int32,int64}
Tblock_shape={int32,int64}

SpaceToDepth

T={bool,float,half,int32,int64}

SparseMatMul

Tb={float}
Ta={float}

SparseSoftmaxCrossEntropyWithLogits

Tlabels={int32,int64}
T={float,half}

SparseToDense

Tindices={int32,int64}
T={bool,float,half,int32,int64}

Split

T={bool,float,half,int32,int64}

SplitV

Tlen={int32,int64}
T={bool,float,half,int32,int64}

Sqrt

T={float,half}

SqrtGrad

T={float,half}

Square

T={float,half,int32,int64}

SquaredDifference

T={float,half,int32,int64}

Squeeze

T={bool,float,half,int32,int64}

StackCloseV2

 

StackPopV2

elem_type={bool,float,half,int32,int64}

StackPushV2

T={bool,float,half,int32,int64}

StackV2

elem_type={bool,float,half,int32,int64}

StatefulPartitionedCall

Tout={bool,float,half,int32,int64,resource,string,variant}
Tin={bool,float,half,int32,int64,resource,string,variant}

StatefulStandardNormalV2

shape_dtype={bool,float,half,int32,int64}
dtype={float}

StatefulTruncatedNormal

shape_dtype={bool,float,half,int32,int64}
dtype={float}

StatefulUniform

shape_dtype={bool,float,half,int32,int64}
dtype={float}

StatefulUniformFullInt

shape_dtype={bool,float,half,int32,int64}
dtype={int32,int64}

StatefulUniformInt

shape_dtype={bool,float,half,int32,int64}
dtype={int32,int64}

StatelessIf

Tout={bool,float,half,int32,int64,resource,variant}
Tcond={bool,float,half,int32,int64,resource,variant}
Tin={bool,float,half,int32,int64,resource,variant}

StatelessMultinomial

T={float}
output_dtype={int32,int64}
Tseed={int32}

StatelessRandomNormal

Tseed={int32}
dtype={float}
T={int32,int64}

StatelessRandomUniform

Tseed={int32}
dtype={float}
T={int32,int64}

StatelessRandomUniformInt

Tseed={int32}
dtype={int32,int64}
T={int32,int64}

StatelessTruncatedNormal

Tseed={int32}
dtype={float}
T={int32,int64}

StatelessWhile

T={bool,float,half,int32,int64,resource,variant}

StopGradient

T={bool,float,half,int32,int64}

StridedSlice

Index={int32,int64}
T={bool,float,half,int32,int64}

StridedSliceGrad

Index={int32,int64}
T={bool,float,half,int32,int64}

Sub

T={float,half,int32,int64}

Sum

Tidx={int32,int64}
T={float,half,int32,int64}

Svd

T={float,half}

SymbolicGradient

Tout={bool,float,half,int32,int64}
Tin={bool,float,half,int32,int64}

Tan

T={float,half,int32,int64}

Tanh

T={float,half}

TanhGrad

T={float,half}

TensorArrayCloseV3

 

TensorArrayConcatV3

dtype={bool,float,half,int32,int64}

TensorArrayGatherV3

dtype={bool,float,half,int32,int64}

TensorArrayGradV3

 

TensorArrayReadV3

dtype={bool,float,half,int32,int64}

TensorArrayScatterV3

T={bool,float,half,int32,int64}

TensorArraySizeV3

 

TensorArraySplitV3

T={bool,float,half,int32,int64}

TensorArrayV3

dtype={bool,float,half,int32,int64}

TensorArrayWriteV3

T={bool,float,half,int32,int64}

TensorListElementShape

shape_type={int32,int64}

TensorListFromTensor

shape_type={int32,int64}
element_dtype={bool,float,half,int32,int64}

TensorListGather

element_dtype={bool,float,half,int32,int64}

TensorListGetItem

element_dtype={bool,float,half,int32,int64}

TensorListLength

 

TensorListPopBack

element_dtype={bool,float,half,int32,int64,variant}

TensorListPushBack

element_dtype={bool,float,half,int32,int64,variant}

TensorListReserve

shape_type={int32,int64}
element_dtype={bool,float,half,int32,int64}

TensorListSetItem

element_dtype={bool,float,half,int32,int64}

TensorListStack

element_dtype={bool,float,half,int32,int64}

TensorScatterAdd

Tindices={int32,int64}
T={bool,float,half,int32,int64}

TensorScatterSub

Tindices={int32,int64}
T={bool,float,half,int32,int64}

TensorScatterUpdate

Tindices={int32,int64}
T={bool,float,half,int32,int64}

Tile

Tmultiples={int32,int64}
T={bool,float,half,int32,int64}

TopKV2

T={float,int32}

Transpose

Tperm={int32,int64}
T={bool,float,half,int32,int64}

TruncateDiv

T={float,half,int32,int64}

TruncateMod

T={float,half,int32,int64}

TruncatedNormal

T={int32,int64}
dtype={float}

Unpack

T={bool,float,half,int32,int64}

UnsortedSegmentMax

Tnumsegments={int32,int64}
T={float,half,int32,int64}
Tindices={int32,int64}

UnsortedSegmentMin

Tnumsegments={int32,int64}
T={float,half,int32,int64}
Tindices={int32,int64}

UnsortedSegmentProd

Tnumsegments={int32,int64}
T={float,half,int32,int64}
Tindices={int32,int64}

UnsortedSegmentSum

Tnumsegments={int32,int64}
T={float,half,int32,int64}
Tindices={int32,int64}

VarIsInitializedOp

 

VariableShape

out_type={int32,int64}

While

T={bool,float,half,int32,int64,resource,variant}

Xdivy

T={float,half}

XlaBroadcastHelper

Tindices={int32,int64}
T={float,half,int32,int64}

XlaConv

Tindices={int32,int64}
T={float,half,int32,int64}

XlaDequantize

 

XlaDot

T={float,half,int32,int64}

XlaDynamicSlice

Tindices={int32,int64}
T={bool,float,half,int32,int64}

XlaDynamicUpdateSlice

Tindices={int32,int64}
T={bool,float,half,int32,int64}

XlaEinsum

T={float}

XlaIf

Tout={bool,float,half,int32,int64,resource,variant}
Tcond={bool,float,half,int32,int64,resource,variant}
Tin={bool,float,half,int32,int64,resource,variant}

XlaKeyValueSort

V={bool,float,half,int32,int64}
K={float,half,int32,int64}

XlaPad

Tindices={int32,int64}
T={bool,float,half,int32,int64}

XlaRecv

dtype={bool,float,half,int32,int64}

XlaReduce

T={float,half,int32,int64}

XlaReduceWindow

Tindices={int32,int64}
T={float,half,int32,int64}

XlaReplicaId

 

XlaSelectAndScatter

Tindices={int32,int64}
T={float,half,int32,int64}

XlaSelfAdjointEig

T={float,half}

XlaSend

T={bool,float,half,int32,int64}

XlaSort

T={bool,float,half,int32,int64}

XlaSvd

T={float,half}

XlaWhile

T={bool,float,half,int32,int64,resource,variant}

Xlogy

T={float,half}

ZerosLike

T={bool,float,half,int32,int64,variant}

_Arg

T={bool,float,half,int32,int64,resource,variant}

_ArrayToList

out_types={bool,float,half,int32,int64}
T={bool,float,half,int32,int64}

_FusedBatchNormEx

U={float,half}
T={float,half}

_ListToArray

T={bool,float,half,int32,int64}
Tin={bool,float,half,int32,int64}

_Retval

T={bool,float,half,int32,int64,resource,variant}

_UnaryOpsComposition

T={float,half}

To regenerate this table, run:

 bazel run -c opt -- tensorflow/compiler/tf2xla:tf2xla_supported_ops --device=XLA_IPU_JIT