The IPU has many unique architectural features that enable significant performance gains for both training and inference. Here we provide our latest training and inference benchmark results for our second-generation Mk2 IPU platforms, the IPU-M2000 and IPU-POD64. All benchmarks were generated using our examples on the Graphcore GitHub page.
Last updated on Wednesday, December 9, 2020
The results were obtained on the 1xIPU-M2000, 4xIPU-M2000 and IPU-POD64 systems, which have 4, 16 and 64 Mk2 IPUs respectively. The host server is a dual-socket AMD EPYC 7742 running Ubuntu 18.04.
Training a machine learning model involves running the algorithm over an input dataset (training data) until the model converges, meaning that it has learned to produce the desired output to a specified accuracy. Throughput in this context is defined as the number of input data points (sequences, images, or rows) processed by the model per second. Throughput is often used as a measure of hardware performance because it is directly related to the time the model takes to train to a specified accuracy.
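As a rough illustration of the throughput definition above, the following Python sketch computes throughput from an item count and an elapsed time, then estimates training time from it. The function names and the numbers are hypothetical and are not taken from the benchmark scripts or results. The estimate also assumes that the number of epochs needed to converge is known in advance and that throughput stays constant.

```python
def throughput(items_processed: int, elapsed_seconds: float) -> float:
    """Items (sequences, images, or rows) processed per second."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return items_processed / elapsed_seconds


def estimated_training_time(
    items_per_epoch: int, epochs: int, items_per_second: float
) -> float:
    """Seconds needed to process epochs * items_per_epoch items at a steady rate.

    Assumption: the epoch count required for convergence is fixed and known.
    """
    if items_per_second <= 0:
        raise ValueError("items_per_second must be positive")
    return items_per_epoch * epochs / items_per_second


# Illustrative numbers only, not measured results.
rate = throughput(items_processed=12_800, elapsed_seconds=8.0)
seconds = estimated_training_time(1_000_000, epochs=10, items_per_second=rate)
print(f"{rate:.1f} items/s, about {seconds:.0f} s to train")
```

Under these assumptions, doubling throughput halves the estimated training time, which is why throughput is a convenient proxy for time to result.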
The results below report the throughput obtained for each of the referenced models in the specified configuration. All configurations running on real data have been verified for convergence.
Training: Time to Result
Model inference in this context refers to running a model on input data to infer output. Inference performance in production setups is typically measured by two metrics: throughput (as defined previously) and latency, defined as the time taken to execute an inference. Inference results are replicated on IPU-M2000. Below we provide results for both throughput and latency for a given batch size.
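To make the two inference metrics concrete, here is a minimal Python sketch of how throughput and latency could be measured for a fixed batch size. It is not the harness used to produce the published results. `benchmark_inference`, `infer_fn`, and the `clock` parameter are hypothetical names, and `infer_fn` stands in for whatever call runs one batch through the model.

```python
import statistics
import time
from typing import Callable, Dict, Sequence


def benchmark_inference(
    infer_fn: Callable[[Sequence], object],
    batch: Sequence,
    n_iters: int = 100,
    warmup: int = 10,
    clock: Callable[[], float] = time.perf_counter,
) -> Dict[str, float]:
    """Time n_iters calls of infer_fn on one batch and summarize the results."""
    # Warm-up calls are run but not timed, so one-off start-up costs
    # do not distort the measurements.
    for _ in range(warmup):
        infer_fn(batch)

    latencies = []
    for _ in range(n_iters):
        start = clock()
        infer_fn(batch)
        latencies.append(clock() - start)  # seconds for one batch

    total = sum(latencies)
    return {
        # Items processed per second across all timed iterations.
        "throughput_items_per_s": n_iters * len(batch) / total,
        # Latency here is the time to execute one batch, in milliseconds.
        "mean_latency_ms": 1000.0 * statistics.mean(latencies),
        "max_latency_ms": 1000.0 * max(latencies),
    }
```

A call such as `benchmark_inference(model_fn, batch)` would return both metrics for that batch size. Larger batches typically raise throughput at the cost of higher latency, which is why the two are reported together for a given batch size.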