Preliminary IPU benchmarks

Written by Dave Lacey

Posted Oct 26, 2017

Graphcore's IPU (Intelligence Processing Unit) is a new AI accelerator bringing an unprecedented level of performance to both current and future machine learning workloads. Its unique combination of massively parallel multi-tasking compute, synchronized execution within an IPU or across multiple IPUs, an innovative data exchange fabric and large amounts of on-chip SRAM gives unheard-of capabilities for both training and inference across a wide range of machine learning algorithms.

When we announced our Series A funding back in October 2016, we made three statements about the performance of the IPU:

- it improves performance by 10x to 100x compared with other AI accelerators

- it excels at both training and inference

- it lets machine learning developers innovate with models and algorithms that just don’t work on even the best alternative architectures

Since then we have been inundated with requests for more detail. Today we are delighted to share three preliminary benchmarks that support these claims.

We understood from the beginning that a full solution requires more than just a new chip design. The software infrastructure needs to be comprehensive and easy to use to allow machine learning developers to quickly adapt the hardware to their needs. As a result, we have been focused on bringing up a full software stack early to ensure that the IPU can be used for real applications from the outset. 

Our Poplar® graph programming framework and application libraries provide these capabilities. We have developed a port of TensorFlow that targets the Poplar libraries, with support for other machine learning frameworks underway. With these software tools, we can run a wide variety of real applications through both cycle-accurate chip simulations and real hardware.
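
To give a flavour of what this means in practice, here is a minimal sketch of the kind of model such a port is intended to run. It is ordinary TensorFlow (1.x-era) code with nothing Graphcore-specific in it; the post does not describe the port's device-targeting details, so none are shown, and the layer sizes are purely illustrative.

```python
import tensorflow as tf

# A plain TensorFlow graph: placeholders for a batch of images and labels,
# a small convolutional model, and a gradient-descent training op.
images = tf.placeholder(tf.float32, [None, 224, 224, 3])
labels = tf.placeholder(tf.int64, [None])

x = tf.layers.conv2d(images, filters=64, kernel_size=7, strides=2,
                     padding="same", activation=tf.nn.relu)
x = tf.layers.max_pooling2d(x, pool_size=3, strides=2)
x = tf.reduce_mean(x, axis=[1, 2])            # global average pooling
logits = tf.layers.dense(x, 1000)             # ImageNet-style 1000-way classifier

loss = tf.reduce_mean(
    tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=logits))
train_op = tf.train.GradientDescentOptimizer(0.1).minimize(loss)
```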

Having this platform to experiment with lets us run a range of different machine learning applications and progress from rough estimates to the preliminary performance results we can expect for our IPU systems, which we will refine further once we have production systems.

CNN model training (even at low batch sizes)

Convolutional neural networks (CNNs) are used widely in image processing. A CNN model will typically contain several layers performing multiple convolution operations. The convolution operations have parameters that must be learnt via a training algorithm. Training is usually performed by stochastic gradient descent, which involves repeatedly running the model on image data, calculating the gradients of the loss with respect to the parameters, and then updating the parameters of the model.

When training machine learning models, the batch size is the number of data items processed together using the current set of parameters. The batch size limits how often you can update those parameters, since a whole batch must be processed before each update. A large batch size may not be ideal for training your model. One property of IPU systems is that they perform well even at relatively small batch sizes.
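
As a rough illustration of the training loop described above, here is a minimal minibatch SGD sketch in NumPy. The model and gradient computation are stand-ins (`grad_fn` is a hypothetical placeholder); the point is simply that the parameters are updated once per batch.

```python
import numpy as np

def sgd_train(params, dataset, grad_fn, batch_size=32, lr=0.1, epochs=1):
    """grad_fn(params, batch) -> gradient with the same shape as params.
    Both grad_fn and the flat params array are illustrative placeholders."""
    n = len(dataset)
    for _ in range(epochs):
        np.random.shuffle(dataset)
        for start in range(0, n, batch_size):
            batch = dataset[start:start + batch_size]   # process a whole batch...
            grad = grad_fn(params, batch)               # ...to get one gradient estimate
            params = params - lr * grad                 # ...then make one parameter update
    return params
```

So the batch size sets the number of updates per pass over the data: with ImageNet's roughly 1.28 million training images and a batch size of 64, that is about 20,000 parameter updates per epoch.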

The chart below shows the estimated performance, in images per second, of training the ResNet-50 neural network for image classification on the ImageNet dataset.

[Chart: estimated ResNet-50 ImageNet training performance in images per second]

The performance gain is substantial even at smaller batch sizes. When we scale up to eight C2 accelerator cards, we use a batch size of only 64.

At any point in this space, using an IPU system is a substantial performance leap over existing technologies. For example, the best performance reported on a 300W GPU accelerator (the same power budget as a C2 accelerator) is approximately 580 images per second.

LSTM inference

Recurrent networks are used to process sequence data, for example, in language translation or text-to-speech applications. LSTM (long short-term memory) networks are a form of recurrent network that contain several gating elements which choose whether to remember or forget historical data from the sequence being processed when producing an output.

All recurrent networks contain data dependencies that are a challenge for current chip architectures. The data dependencies limit the amount of parallelism available, and the number of compute operations per data fetch from memory is low. The IPU and Poplar libraries handle these limitations better thanks to the large amount of on-chip memory and the flexibility of compute and data movement within the IPU.
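
To make the data dependency concrete, here is a small NumPy sketch of a standard LSTM cell (not Graphcore's implementation): each time step consumes the previous step's hidden and cell state, so the time dimension must be processed sequentially. Shapes and the gate layout are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step. W: [input_size + hidden_size, 4 * hidden_size],
    b: [4 * hidden_size]; gates laid out as i, f, o, g (illustrative)."""
    z = np.concatenate([x_t, h_prev], axis=-1) @ W + b
    i, f, o, g = np.split(z, 4, axis=-1)
    c_t = sigmoid(f) * c_prev + sigmoid(i) * np.tanh(g)  # keep or forget history
    h_t = sigmoid(o) * np.tanh(c_t)                      # output for this step
    return h_t, c_t

def lstm_forward(xs, h0, c0, W, b):
    # The serial loop below is the dependency referred to above: each step
    # needs the previous step's output, so it cannot be parallelised over time.
    h, c, outputs = h0, c0, []
    for x_t in xs:
        h, c = lstm_step(x_t, h, c, W, b)
        outputs.append(h)
    return outputs
```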

For a server performing inference there will be a latency constraint, i.e. a maximum allowed time from requesting an inference to receiving the result. The chart below shows the performance of a single-layer LSTM network under various latency constraints compared to a GPU:

[Chart: single-layer LSTM inference performance under different latency constraints, compared to a GPU]

The parameters of this single layer are taken from the Baidu DeepBench suite, which describes typical recurrent network layers used in deep learning applications.
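
The interaction between a latency budget and throughput can be spelled out with a small, entirely hypothetical calculation: a server can only batch together as many requests as it can still answer within the budget, so tighter latency constraints cap the usable batch size and hence the achievable throughput. The cost model below is made up purely for illustration.

```python
def max_throughput(latency_budget_s, batch_latency_fn, max_batch=1024):
    """batch_latency_fn(batch_size) -> seconds for one batched inference.
    Returns (best_batch_size, inferences_per_second). Purely illustrative."""
    best_batch, best_rate = 0, 0.0
    for batch in range(1, max_batch + 1):
        t = batch_latency_fn(batch)
        if t <= latency_budget_s and batch / t > best_rate:
            best_batch, best_rate = batch, batch / t
    return best_batch, best_rate

# Made-up cost model: 2 ms fixed overhead plus 0.5 ms per item in the batch.
# A 10 ms budget allows a batch of 16; halving the budget roughly halves the rate.
print(max_throughput(0.01, lambda b: 0.002 + 0.0005 * b))
```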

Generative networks

The final application we will look at is a generative neural network. This is a recurrent neural network that generates new items of data one piece at a time. In particular, WaveNet networks generate audio waveforms one sample at a time to provide text-to-speech functionality. We will look at Deep Voice, a variation of WaveNet.

Here our application experiments have considered two performance metrics. Firstly, how quickly can samples be generated? In particular, samples need to be generated quickly enough to produce a real-time stream (e.g. at 16kHz). If a real-time stream can be generated, we can then consider how many channels of audio can be produced at once (each generating a different utterance).
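
The arithmetic behind the real-time requirement is simple: at a 16kHz sample rate, each new sample must be produced within 1/16000 of a second on average, and any headroom beyond that can be spent serving additional channels. The sketch below just spells this out; the achieved generation rate used is a made-up figure.

```python
sample_rate_hz = 16000
per_sample_budget_s = 1.0 / sample_rate_hz        # 62.5 microseconds per sample
print(per_sample_budget_s * 1e6, "us per sample")

# If the generator can sustain achieved_rate_hz samples per second in total
# (e.g. by batching across utterances), the number of real-time channels is roughly:
achieved_rate_hz = 48000                          # hypothetical figure for illustration
print(achieved_rate_hz // sample_rate_hz, "real-time channels")
```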

The charts below show the performance for the Deep Voice generative algorithm compared to other platforms considered in the original research paper:

[Chart: Deep Voice sample generation rate compared to other platforms]

[Chart: number of concurrent real-time audio channels]

These applications are just a taster. The IPU and Poplar software stack provide a fully flexible and programmable platform. We are truly excited to see what kind of applications users will find for this platform in the coming years.

As we get closer to shipping product, we are starting to share our Poplar framework with early-access customers, and we will be publicly releasing documentation and code over the next couple of months. If you would like to receive early access to Poplar documentation and code, please sign up here.
