The IPU has many unique architectural features that enable significant performance gains for both training and inference. Here we provide our latest training and inference performance benchmark results for our second-generation MK2 IPU platforms: the IPU-M2000, IPU-POD16, and IPU-POD64. Benchmarks were generated using our examples on the Graphcore GitHub page.
Last updated on Tuesday, April 20, 2021
The results are obtained on the IPU-M2000, IPU-POD16, and IPU-POD64 systems with 4, 16, and 64 MK2 IPUs respectively. The host server is a dual-socket AMD EPYC 7742 running Ubuntu 18.04.
Training a machine learning model involves running the algorithm over an input dataset (training data) until the model converges, meaning that it has learned to produce the desired output to a specified accuracy. Throughput in this context is defined as the number of input data points (sequences, images, or rows) processed by the model per second. Throughput is often used as a measure of hardware performance because it directly determines the time taken to train the model to a specified accuracy: the higher the throughput, the shorter the training time.
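As a rough illustration of the definition above, throughput can be measured by timing a fixed number of training steps and dividing the samples processed by the elapsed time. This is a generic sketch, not the benchmarking harness used for these results; `run_step` is a hypothetical stand-in for one training step in whatever framework is in use.

```python
import time

def measure_throughput(run_step, num_batches, batch_size):
    """Return training throughput in samples/second.

    `run_step` is a hypothetical stand-in for executing one
    training step on a batch of `batch_size` samples.
    """
    start = time.perf_counter()
    for _ in range(num_batches):
        run_step()
    elapsed = time.perf_counter() - start
    # samples processed per second of wall-clock time
    return (num_batches * batch_size) / elapsed

# Example with a dummy step that just sleeps for ~1 ms:
throughput = measure_throughput(lambda: time.sleep(0.001),
                                num_batches=10, batch_size=32)
```

In practice the first few steps are usually discarded as warm-up (compilation, caching) so that the steady-state rate is reported.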
The results provided below detail the obtained throughput values for each of the referenced models in the specified configuration. All configurations running on real data are verified for convergence.
[*1] Preliminary results – pending convergence verification
Training: Time to Result
Model inference in this context refers to running a trained model on input data to produce output predictions. Inference performance in production setups is typically measured on two metrics: throughput (as defined previously) and latency, which is defined as the time taken to execute a single inference.
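To make the latency metric concrete, a simple way to characterise it is to time each inference call individually and report summary statistics such as the mean and a high percentile. This is an illustrative sketch, not the measurement methodology behind the published numbers; `infer` is a hypothetical stand-in for a single inference call.

```python
import time
import statistics

def measure_latency(infer, num_requests=100):
    """Time each inference call and return latency stats in milliseconds.

    `infer` is a hypothetical stand-in for one inference request.
    """
    latencies = []
    for _ in range(num_requests):
        start = time.perf_counter()
        infer()
        latencies.append((time.perf_counter() - start) * 1000.0)
    ordered = sorted(latencies)
    # simple p99: the value below which ~99% of requests complete
    p99 = ordered[min(len(ordered) - 1, int(0.99 * len(ordered)))]
    return {"mean_ms": statistics.mean(latencies), "p99_ms": p99}

# Example with a dummy model that takes ~1 ms per request:
stats = measure_latency(lambda: time.sleep(0.001), num_requests=20)
```

Tail percentiles (p99 rather than the mean) are usually what matters in production, since they bound the worst-case response time most users see.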
[*2] Latency results are provided for synthetic (on-IPU) data
Precision Terminology: X.Y is defined as follows: X is the precision for storing the activations and gradients, and Y is the precision for storing the weights. When training in 16.16, other variables (such as norms or momentum) may still be stored in FP32, and stochastic rounding is used.
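A minimal sketch of what the 16.16 scheme described above can look like in practice, assuming an SGD-with-momentum update: activations, gradients, and weights are held in FP16, while the optimizer state (here, momentum) is accumulated in FP32 so that small updates are not lost. This is an illustrative NumPy example, not Graphcore's implementation, and it omits stochastic rounding, which the IPU performs in hardware.

```python
import numpy as np

def sgd_momentum_step(weights_fp16, momentum_fp32, grad_fp16,
                      lr=0.01, beta=0.9):
    """One hypothetical 16.16-style update step.

    Weights ("Y") and gradients ("X") are FP16; the momentum
    buffer is kept in FP32 to preserve small accumulated updates.
    """
    # accumulate momentum in FP32
    momentum_fp32 = beta * momentum_fp32 + grad_fp16.astype(np.float32)
    # apply the update in FP32, then cast the weights back to FP16
    weights_fp16 = (weights_fp16.astype(np.float32)
                    - lr * momentum_fp32).astype(np.float16)
    return weights_fp16, momentum_fp32

rng = np.random.default_rng(0)
weights = rng.standard_normal(8).astype(np.float16)   # FP16 weights
momentum = np.zeros(8, dtype=np.float32)              # FP32 optimizer state
grad = rng.standard_normal(8).astype(np.float16)      # FP16 gradient
weights, momentum = sgd_momentum_step(weights, momentum, grad)
```

Keeping the momentum buffer in FP32 avoids the case where `lr * momentum` underflows to zero when added to an FP16 value, which is one of the main reasons mixed-precision schemes retain some FP32 state.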