Written by Dave Lacey
Posted Nov 13, 2019
We are delighted to share a selection of performance benchmarks based on IPU systems. These systems are now available as an IPU Cloud preview on Microsoft Azure and as IPU-Server products from Dell, with full support from our production-release Poplar® software stack.
Customers are able to replicate our benchmarks using code we will make available as public examples on the Graphcore GitHub site from Monday 18th November 2019. All of the benchmarks in this blog, plus several other application examples that are not included here, will be available publicly. Customer implementations will not be available as public examples, as you would expect. Please check back regularly as we will be adding more applications on an ongoing basis.
The Graphcore IPU is a new type of processor designed from the ground up for machine intelligence. The IPU has a number of distinguishing architectural features that result in much higher performance for both training and inference, especially on new, more complex machine learning models. Below we have identified some of the benefits that the IPU enables:
- The IPU delivers much better arithmetic efficiency at small batch sizes for both training and inference. In training, this results in faster model convergence, models that generalise better, and the ability to parallelise over many more IPU processors to reduce training time for a given batch size. In inference, it delivers much higher throughput at lower latencies.
- IPUs deliver dramatic performance breakthroughs with new higher accuracy models, like ResNext, that use group separable convolutions. Legacy architectures struggle on non-aligned and sparse data accesses but these more complex data structures are critical for the next generation of machine intelligence models. The IPU has been designed to support complex data access efficiently and at much higher speeds.
- In real-world applications, including those where data changes over time, data points are usually inter-related. Today’s deep learning algorithms are based around highly structured, feed-forward operations that do not take into account the natural uncertainty arising from these relationships. However, for many applications, including language, video and any time-series analysis, the machine intelligence model needs to take into account both the contextual relationships in the data and uncertainty, such as how safe a decision is. Probabilistic machine learning approaches allow this. Unlike today’s architectures, the IPU has been designed to efficiently support stochastic computations and the much more complex data structures that are required with higher dimensional models.
- Regular, and even continuous, training updates are becoming essential in many applications, such as content filtering and more accurate search. The IPU efficiently supports both training and inference, so as machine intelligence systems start to learn from experience, it will support the transition to more intelligent systems.
Many of today’s standard benchmarks are backward looking and do not take into account the important characteristics and performance capabilities that will be required to support the next breakthroughs in machine intelligence. So rather than focus on these older approaches we have instead focused on working with customers to support real world applications and to help them try out new approaches and solve more complex problems. We are now delighted to share a selection of performance benchmarks that include newer model structures and approaches.
The IPU does deliver state of the art performance on today’s image processing and language models but we are also seeing significant performance gains in several new model types, like ResNext and models using MCMC (Markov Chain Monte Carlo) based methods.
Machine Intelligence innovation is just in the very early stages and we expect to see many new innovations developed over the next few years. The IPU has been designed to help innovators create these new breakthroughs.
Natural Language Processing (NLP): BERT
BERT (Bidirectional Encoder Representations from Transformers), published by researchers at Google AI Language, represents an important development in the field of Natural Language Processing (NLP). Attention based Transformer models allow for unsupervised learning of language structure and meaning in text. With BERT, a key innovation is the application of bidirectional training of the Transformer attention model, to provide a more complete and accurate sense of language understanding and interpretation. In addition, after pre-training the BERT model on a wide corpus of language, fine tuning on more specific language data can be used to target the model to the specific NLP use case.
Graphcore shows state of the art time to train and accuracy with BERT base, a significant proof point for the IPU architecture. To date, only three processor providers have demonstrated training capability with BERT: Google, NVIDIA, and now Graphcore. In addition, the architecture of the IPU is particularly well suited to the next breakthroughs in NLP, including such innovations as Block-Sparse based Transformer models.
For NLP Inference, as with many other inference use cases, there is a strong emphasis on the highest possible throughput at the lowest possible latencies. For example, this requirement is highlighted in a report on the importance of performance, in particular latency, for search engine companies. The report mentions an Amazon analysis showing that a 100ms slowdown decreases sales by 1%. Likewise experiments by Microsoft Bing indicated a 100ms speedup improves revenue by 0.6%.
The inference benchmark reported below is therefore focused on evaluating throughput at the lowest possible latency. Throughput becomes less meaningful as batch sizes increase, since the latency that comes with larger batch sizes becomes problematic in a real application. With BERT base inference, Graphcore is able to demonstrate 3x the throughput at 1.3x lower latency compared to today’s solutions.
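The trade-off between throughput and latency can be made concrete with a small calculation. The sketch below assumes one batch is processed at a time, so throughput is simply batch size divided by latency. The measurement pairs are hypothetical values chosen for illustration and are not Graphcore results.

```python
def throughput(batch_size: int, latency_ms: float) -> float:
    """Samples per second when one batch of `batch_size` completes every `latency_ms`."""
    return batch_size / (latency_ms / 1000.0)


def best_under_budget(measurements, latency_budget_ms):
    """Return the (batch_size, latency_ms) pair with the highest throughput
    whose latency fits the budget, or None if no pair qualifies."""
    eligible = [m for m in measurements if m[1] <= latency_budget_ms]
    if not eligible:
        return None
    return max(eligible, key=lambda m: throughput(*m))


# Hypothetical (batch_size, latency_ms) pairs, for illustration only.
MEASUREMENTS = [(1, 2.0), (4, 5.0), (16, 16.0), (64, 60.0)]

if __name__ == "__main__":
    for batch, latency in MEASUREMENTS:
        rate = throughput(batch, latency)
        print(f"batch={batch:3d} latency={latency:5.1f} ms throughput={rate:7.1f}/s")
    print("best within 10 ms:", best_under_budget(MEASUREMENTS, 10.0))
```

In this made-up data the largest batch has the highest raw throughput, but it is excluded as soon as a latency budget is applied, which is why the benchmark reports throughput at the lowest achievable latency.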
Image Recognition: ResNext-101
In addition to the importance of high throughput at low latency, accuracy also has a strong impact on revenue for internet companies. In ad placement or search engine use cases, a percentage point increase in accuracy directly maps to revenue gains. A new class of image classification model, ResNext, uses innovative approaches, like group and depthwise separable convolutions, to increase accuracy while reducing the parameter count. These approaches are not well suited to legacy architectures that struggle with the unaligned data accesses critical for these newer, more accurate models. This means companies have been held back from moving beyond today’s simple CNN models that work well on today’s processors.
However, the use of group separable convolutions, which involves splitting the convolution filters into smaller separable blocks, is much more suited to the IPU’s massively parallel architecture.
As seen in the chart below, the Graphcore C2 IPU-Processor PCIe card achieved 3.4x higher throughput at 18x lower latency compared to the most common alternative processor and achieves a 40x advantage at the lowest possible latency for each solution. High throughput at the lowest possible latency is key in many of the important use cases today and becomes even more important as companies focus on developing new solutions for video content.
Separable Convolution Analysis
The micro-benchmarks below provide greater insight into how the architecture of the IPU fits with increasing degrees of separable convolution. The IPU is able to support increasing levels of group separable convolutions with its ability to flexibly map smaller blocks of data to thousands of fully independent processing threads and as a result of its much more flexible and higher throughput memory architecture. In this micro-benchmark we vary the group dimension to show coverage from a standard convolution with a group size dimension of 512, through multiple group convolutions, all the way down to group size dimension of 1 which corresponds to a fully depthwise convolution (filter depth of a single layer).
The chart below shows the results for a Graphcore C2 IPU-Processor PCIe card versus a leading alternative processor at equivalent power, starting with standard convolution on the far right, moving right to left through increasingly smaller group convolutions, until the fully depthwise convolution (group dimension 1) on the far left. The results show the IPU advantage across the sweep of group convolutions, with significant advantage for group convolutions, and delivering up to a 77x throughput advantage.
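To illustrate what the sweep over group sizes means for the amount of work per layer, here is a minimal sketch of standard grouped-convolution arithmetic. It is not the benchmark code. It assumes 512 input and 512 output channels (matching the group size of 512 for a standard convolution) and a 3x3 kernel, which is an assumption made only for this example.

```python
def grouped_conv_weights(channels: int, group_size: int, kernel: int = 3) -> int:
    """Weight count for a convolution with `channels` inputs and outputs,
    split into groups of `group_size` channels each."""
    if channels % group_size != 0:
        raise ValueError("group_size must divide channels")
    groups = channels // group_size
    # Each group connects group_size input channels to group_size output channels.
    per_group = group_size * group_size * kernel * kernel
    return groups * per_group


def sweep(channels: int = 512):
    """Yield (group_size, groups, weights) from standard down to depthwise."""
    group_size = channels
    while group_size >= 1:
        yield group_size, channels // group_size, grouped_conv_weights(channels, group_size)
        group_size //= 2


if __name__ == "__main__":
    for group_size, groups, weights in sweep():
        print(f"group size {group_size:3d}: {groups:3d} groups, {weights:9d} weights")
```

As the group size shrinks, the layer breaks down into many small independent blocks rather than one large dense one, which is the access pattern the micro-benchmark is designed to exercise.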
Time Series Analysis: Sales Forecast Model Training
This benchmark shows a typical model used in time series analysis consisting of MLP (Multi-Layer Perceptron) networks combined with feature embeddings. The model predicts the amount of sales on a particular day given a set of features in the original Rossmann competition dataset. The results from our comparative testing show a performance advantage for the Graphcore C2 IPU-Processor PCIe card of 15x versus an alternative leading processor at equivalent power and batch size (batch size of 1,024). Even when we increase the batch size for the leading alternative processor up to 512,000 to maximise its throughput, we still see >5x improvement in throughput for the Graphcore C2 platform, which is still using the smaller 1,024 batch size.
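For readers unfamiliar with this model family, the following is a minimal forward-pass sketch of an MLP fed by feature embeddings, written in plain Python. It is not the benchmarked model: the class name `EmbeddingMLP`, the layer sizes, the embedding width and the choice of features are all assumptions made for illustration, and no training is shown.

```python
import random


def make_matrix(rows, cols, rng):
    """Small random matrix stored as a list of rows."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def dense(x, weights, bias, relu=True):
    """Compute W x + b (one row of W per output), optionally followed by ReLU."""
    out = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]
    return [max(0.0, o) for o in out] if relu else out


class EmbeddingMLP:
    """Hypothetical regressor: embeddings for categorical features plus an MLP."""

    def __init__(self, cardinalities, embed_dim, n_numeric, hidden, seed=0):
        rng = random.Random(seed)
        # One lookup table per categorical feature: cardinality x embed_dim.
        self.tables = [make_matrix(c, embed_dim, rng) for c in cardinalities]
        sizes = [len(cardinalities) * embed_dim + n_numeric] + list(hidden) + [1]
        self.layers = [(make_matrix(o, i, rng), [0.0] * o)
                       for i, o in zip(sizes, sizes[1:])]

    def forward(self, categorical, numeric):
        # Look up each categorical index, then concatenate with numeric features.
        x = []
        for table, index in zip(self.tables, categorical):
            x.extend(table[index])
        x.extend(numeric)
        for n, (weights, bias) in enumerate(self.layers):
            is_last = n == len(self.layers) - 1
            x = dense(x, weights, bias, relu=not is_last)
        return x[0]  # single predicted value, e.g. sales for one day


if __name__ == "__main__":
    # Illustrative sizes: two categorical features and two numeric features.
    model = EmbeddingMLP(cardinalities=[1000, 7], embed_dim=4, n_numeric=2, hidden=[16, 8])
    print(model.forward(categorical=[3, 5], numeric=[1.0, 0.5]))
```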
Recommender/Ranking: Dense AutoEncoder Training
Autoencoder models can be used to perform collaborative filtering in recommender systems in order to provide useful predictions, for example recommending films for online TV viewers based on previous viewing experiences. This autoencoder model shows significant improvement in results compared to previous models when tested using a publicly available Netflix dataset made up of 3m data samples.
The model architecture is a deep autoencoder with 6 fully connected layers and a constrained decoder. Dense re-feeding is used for training to overcome the sparseness of data in collaborative filtering. It is implemented in TensorFlow with a model size of about 10 million parameters. This model is taken from the paper “Training Deep AutoEncoders for Collaborative Filtering”.
The results from our comparative testing show a performance advantage for the Graphcore C2 card of more than 2x versus a leading alternative processor at equivalent power.
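The two ideas named above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the TensorFlow benchmark code: it takes "constrained decoder" to mean that decoder weights are the transposes of the encoder weights, treats zero as "no rating", uses tiny layer sizes, and only computes the two losses of a dense re-feeding step without performing the gradient updates that real training would apply after each one.

```python
import math
import random


def selu(v):
    """SELU activation (assumed here; any non-linearity would do for the sketch)."""
    alpha, scale = 1.6732632423543772, 1.0507009873554805
    return scale * (v if v > 0 else alpha * (math.exp(v) - 1.0))


def matvec(w, x):
    return [sum(a * b for a, b in zip(row, x)) for row in w]


def transpose(w):
    return [list(col) for col in zip(*w)]


class TiedAutoencoder:
    """Encoder layers plus a decoder that reuses the transposed encoder weights."""

    def __init__(self, sizes, seed=0):
        rng = random.Random(seed)
        self.enc = [[[rng.gauss(0.0, 0.1) for _ in range(i)] for _ in range(o)]
                    for i, o in zip(sizes, sizes[1:])]
        self.enc_bias = [[0.0] * o for o in sizes[1:]]
        self.dec_bias = [[0.0] * i for i in reversed(sizes[:-1])]

    def forward(self, x):
        h = x
        for w, b in zip(self.enc, self.enc_bias):
            h = [selu(v + c) for v, c in zip(matvec(w, h), b)]
        for w, b in zip(reversed(self.enc), self.dec_bias):
            h = [selu(v + c) for v, c in zip(matvec(transpose(w), h), b)]
        return h


def masked_mse(pred, target):
    """Mean squared error over observed (non-zero) targets only."""
    pairs = [(p, t) for p, t in zip(pred, target) if t != 0]
    return sum((p - t) ** 2 for p, t in pairs) / len(pairs) if pairs else 0.0


def refeeding_losses(model, sparse_ratings):
    """Losses for one dense re-feeding step (gradient updates omitted)."""
    dense_prediction = model.forward(sparse_ratings)
    first = masked_mse(dense_prediction, sparse_ratings)
    # Re-feed the dense output as a new training example whose target is itself.
    second = masked_mse(model.forward(dense_prediction), dense_prediction)
    return first, second


if __name__ == "__main__":
    model = TiedAutoencoder([10, 8, 6, 4])  # 3 encoder + 3 mirrored decoder layers
    ratings = [5.0, 0.0, 3.0, 0.0, 0.0, 1.0, 0.0, 0.0, 4.0, 0.0]
    print(refeeding_losses(model, ratings))
```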
Probabilistic Learning – Markov Chain Monte Carlo (MCMC)
Early access IPU customers in the finance sector have been able to train their proprietary, optimised models using MCMC in just 4½ minutes on IPUs, compared to over 2 hours with their existing hardware. This represents a 26x speed-up in training time.
MCMC Implementation using TensorFlow
We took an implementation using the off the shelf TensorFlow Probability (TFP) library to assess the performance of probabilistic models on IPU comparing against other leading hardware accelerators.
What we discovered was that even when implemented using standard TensorFlow code and not including any optimisations, the MCMC algorithms still train 8x faster on an IPU compared to the next best alternative.
In this example the model is a neural network with 3 fully connected layers. The input data set consists of features generated from time series of stock prices. Distributions of model parameters are represented by their samples. The samples are obtained using the Hamiltonian Monte Carlo (HMC) algorithm, an MCMC method that is efficient in high-dimensional cases. Sampling is performed in a sliding time window on subsets of the data. This is done to test the historical predictive power of the model. Using the IPU platform we were able to train the model in 45 minutes, down from over 6.5 hours on the best alternative processor.
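For readers new to HMC, the sketch below shows the core of the algorithm in plain Python on a one-dimensional Gaussian target. It is not the TensorFlow Probability implementation used in the benchmark. The target distribution, step size, number of leapfrog steps and the function name `hmc_sample` are all assumptions chosen to keep the example small.

```python
import math
import random

MEAN, STD = 3.0, 2.0  # hypothetical target: a 1-D Gaussian


def log_prob(q):
    """Unnormalised log density of the target."""
    return -0.5 * ((q - MEAN) / STD) ** 2


def grad_log_prob(q):
    return -(q - MEAN) / STD ** 2


def hmc_sample(log_prob, grad_log_prob, initial, n_samples,
               step_size=0.4, n_leapfrog=10, seed=0):
    """Draw samples with Hamiltonian Monte Carlo; returns (samples, acceptance_rate)."""
    rng = random.Random(seed)
    q = initial
    samples, accepted = [], 0
    for _ in range(n_samples):
        p = rng.gauss(0.0, 1.0)  # resample the momentum
        q_new, p_new = q, p
        # Leapfrog integration: half step in p, full steps in q, half step in p.
        p_new += 0.5 * step_size * grad_log_prob(q_new)
        for step in range(n_leapfrog):
            q_new += step_size * p_new
            if step < n_leapfrog - 1:
                p_new += step_size * grad_log_prob(q_new)
        p_new += 0.5 * step_size * grad_log_prob(q_new)
        # Metropolis correction on H(q, p) = -log_prob(q) + p^2 / 2.
        h_old = -log_prob(q) + 0.5 * p * p
        h_new = -log_prob(q_new) + 0.5 * p_new * p_new
        if rng.random() < math.exp(min(0.0, h_old - h_new)):
            q = q_new
            accepted += 1
        samples.append(q)
    return samples, accepted / n_samples


if __name__ == "__main__":
    samples, rate = hmc_sample(log_prob, grad_log_prob, initial=0.0, n_samples=5000)
    kept = samples[500:]  # discard warm-up samples
    print("sample mean:", sum(kept) / len(kept), "acceptance rate:", rate)
```

In the benchmark the same idea is applied to the parameters of a neural network, where each gradient evaluation is a full backward pass, which is what makes the method compute intensive.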
Reinforcement Learning
Reinforcement Learning (RL) can be thought of as a framework for solving complex problems and is fundamental to the future of machine intelligence. It provides a clean, simple language in which to state general AI problems: there is a set of actions, a set of observations, and a reward. The goal is to learn a policy that maps histories of observations, actions and rewards to actions so as to maximise the expected sum of observed rewards. As an example, reinforcement learning has been used to teach machines how to play games in a completely unsupervised way.
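The notions of a policy, a history and an expected sum of rewards can be illustrated with a toy example. The sketch below uses a hypothetical two-action environment with made-up payout probabilities and two hand-written policies. It estimates each policy's expected return by simulation and does not learn a policy, so it is far simpler than the policy models used in the benchmark.

```python
import random

# Hypothetical environment: probability that each action yields a reward of 1.
PAYOUT = (0.2, 0.8)


def random_policy(history, rng):
    """Ignore the history and choose an action uniformly at random."""
    return rng.randrange(len(PAYOUT))


def win_stay_lose_shift(history, rng):
    """Repeat the last action if it was rewarded, otherwise switch."""
    if not history:
        return 0
    last_action, last_reward = history[-1]
    return last_action if last_reward > 0 else 1 - last_action


def run_episode(policy, horizon, rng):
    """Play one episode and return the sum of observed rewards."""
    history = []  # list of (action, reward) pairs seen so far
    for _ in range(horizon):
        action = policy(history, rng)
        reward = 1.0 if rng.random() < PAYOUT[action] else 0.0
        history.append((action, reward))
    return sum(reward for _, reward in history)


def estimate_expected_return(policy, horizon=10, episodes=2000, seed=0):
    """Monte Carlo estimate of the expected sum of rewards under a policy."""
    rng = random.Random(seed)
    total = sum(run_episode(policy, horizon, rng) for _ in range(episodes))
    return total / episodes


if __name__ == "__main__":
    print("random policy:      ", estimate_expected_return(random_policy))
    print("win-stay lose-shift:", estimate_expected_return(win_stay_lose_shift))
```

Under these assumed payouts the random policy has an expected return of 5 over ten steps, while the history-dependent policy does better on average (about 6.3 analytically), which shows why a policy that uses its history can collect more reward.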
Reinforcement learning requires the machine intelligence system to remember previous histories and uses these to help learn the policy. Low latency and fast access to complex state is critical. To show the potential performance of IPU on reinforcement learning policy training problems we have taken a typical policy model such as those found within RL problems and compared performance against existing processor solutions. With no optimisation the IPU delivers an order of magnitude (10x) improvement in throughput which results in much faster time to train for these complex and compute intensive problems. Work with some of our early access customers has shown even higher levels of performance gain.
We are only at the beginning of optimising performance across a wide range of models for training and inference, and we will continue to share code and results publicly.
For more information or to be contacted by one of our sales team please register your interest here.
The products, systems, software, and results are based on configurations existing at the time of the measurements, and as such are subject to change at any time, without notice. For more information regarding methodology or results, please contact Graphcore.