RESEARCH PAPERS

Charité Universitätsmedizin, Cornell University, Lawrence Berkeley National Laboratory, Simula Research: Space Efficient Sequence Alignment for SRAM-Based Computing: X-Drop on the Graphcore IPU
Luk Burchard, Max Xiaohang Zhao, Johannes Langguth, Aydın Buluç & Giulia Guidi

Graphcore & PNNL: Extreme Acceleration of Graph Neural Network-based Prediction Models for Quantum Chemistry
Hatem Helal, Jesun Firoz, Jenna Bilbrey, Mario Michael Krell, Tom Murray, Ang Li, Sotiris Xantheas, Sutanay Choudhury
This paper demonstrates a novel hardware-software co-design approach to scale up the training of graph neural networks for molecular property prediction.
We introduce an algorithm that can reduce the training time of such molecular property prediction models from days to less than two hours, opening new possibilities for AI-driven scientific discovery.

Graphcore, Valence, MILA: GPS++: An Optimised Hybrid GNN/Transformer for Molecular Property Prediction
Dominic Masters, Josef Dean, Kerstin Klaser, Zhiyi Li, Sam Maddrell-Mander, Adam Sanders, Hatem Helal, Deniz Beker, Ladislav Rampášek, Dominique Beaini
We present GPS++, a hybrid Message Passing Neural Network / Graph Transformer model for molecular property prediction. Our model integrates a well-tuned local message passing component and biased global attention with other key ideas from prior literature to achieve state-of-the-art results on the large-scale molecular dataset PCQM4Mv2. Through a thorough ablation study we highlight the impact of individual components and, contrary to expectations set by recent trends, find that nearly all of the model's performance can be maintained without any use of global self-attention. We also show that our approach is significantly more accurate than prior art when 3D positional information is not available.

Graphcore: BESS: Balanced Entity Sampling and Sharing for Large-Scale Knowledge Graph Completion
Alberto Cattaneo, Daniel Justus, Harry Mellor, Douglas Orr, Jerome Maloberti, Zhenying Liu, Thorin Farnsworth, Andrew Fitzgibbon, Blazej Banaszewski, Carlo Luschi
We present the award-winning submission to the WikiKG90Mv2 track of OGB-LSC@NeurIPS 2022. The task is link prediction on the large-scale knowledge graph WikiKG90Mv2, consisting of 90M+ nodes and 600M+ edges. Our solution uses a diverse ensemble of 85 Knowledge Graph Embedding models combining five different scoring functions (TransE, TransH, RotatE, DistMult, ComplEx) and two different loss functions (log-sigmoid, sampled softmax cross-entropy).
Our final model achieved 1st place with a validation MRR of 0.2922 and a test-challenge MRR of 0.2562.
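For readers unfamiliar with these scoring functions, here is a minimal NumPy sketch of two of them, TransE and DistMult, in their usual textbook forms. It is illustrative only and is not the BESS implementation; the embedding dimension and values are arbitrary.

```python
import numpy as np

# Illustrative only: two of the scoring functions named above, in their usual forms
# (not the BESS implementation). Higher score = more plausible (head, relation, tail) triple.

def transe_score(h, r, t):
    # TransE: -||h + r - t||, i.e. the tail should sit near the translated head.
    return -np.linalg.norm(h + r - t, axis=-1)

def distmult_score(h, r, t):
    # DistMult: <h, r, t>, a trilinear dot product.
    return np.sum(h * r * t, axis=-1)

rng = np.random.default_rng(0)
h, r, t = rng.normal(size=(3, 256))          # toy 256-dimensional embeddings
print(transe_score(h, r, t), distmult_score(h, r, t))
```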

Graphcore, PNNL, IBM Research, University of Washington: Reducing Down(stream)time: Pretraining Molecular GNNs using Heterogeneous AI Accelerators
Jenna A. Bilbrey, Kristina M. Herman, Henry Sprueill, Sotiris S. Xantheas, Payel Das, Manuel Lopez Roldan, Mike Kraus, Hatem Helal, Sutanay Choudhury
We demonstrate finetuning for downstream tasks on a graph neural network (GNN) trained over a molecular database containing 2.7 million water clusters.
The use of Graphcore IPUs for training molecular GNNs reduces training time from a reported 2.7 days on 0.5M clusters to 1.2 hours on 2.7M clusters. Finetuning the pretrained model for downstream tasks of molecular dynamics and transfer to a different potential energy surface took only 8.3 hours and 28 minutes, respectively, on a single GPU.

Texas A&M University & Graphcore: Benchmarking the Performance of Accelerators on National Cyberinfrastructure Resources for Artificial Intelligence / Machine Learning Workloads
Abhinand S. Nasari, Tim Cockerill, Hieu T. Le, Richard Lawrence, Zhenhua He, Xin Yang, Mario M. Krell, Alex Tsyplikhin, Mahidhar Tatineni, Lisa M. Perez, Dhruva K. Chakravorty, Honggao Liu

Graphcore Research: 8-bit Numerical Formats for Deep Neural Networks
Badreddine Noune, Philip Jones, Daniel Justus, Dominic Masters, Carlo Luschi

Microsoft Research & Graphcore: Confidential Machine Learning within Graphcore IPUs
Kapil Vaswani, Stavros Volos, Cédric Fournet, Antonio Nino Diaz, Ken Gordon, Balaji Vembu, Sam Webster, David Chisnall, Saurabh Kulkarni, Graham Cunningham, Richard Osbourne, Dan Wilkinson
This paper presents IPU Trusted Extensions (ITX), a set of experimental hardware extensions that enable trusted execution environments in Graphcore's IPUs.
Its evaluation on a development board using standard DNN training workloads suggests that ITX adds less than 5% performance overhead, and delivers up to 17x better performance compared to CPU-based confidential computing systems relying on AMD SEV-SNP.

Imperial College London: Incremental Abstraction in Distributed Probabilistic SLAM Graphs
Joseph Ortiz, Talfan Evans, Edgar Sucar, Andrew Davison
The Robot Vision Laboratory at Imperial College London proposes a method for efficient incremental construction of probabilistic scene graphs from monocular input, based on two novel components: first, an incremental scene abstraction framework combining amortized inference with probabilistic inference; and second, a routing procedure that enables inference on dynamic graphs with Gaussian Belief Propagation (GBP), leveraging the parallelism of the Graphcore IPU.
This paper demonstrates the advantage of GBP over direct methods for complex factor graphs due to the structure-agnostic time per iteration.

Imperial College London - Dyson Robotics Laboratory: From Scene Flow to Visual Odometry through Local and Global Regularisation in Markov Random Fields
Raluca Scona, Hidenobu Matsuki, Andrew Davison
This paper revisits pairwise Markov Random Field (MRF) formulations for RGB-D scene flow and leverages recent advances in processor design for real-time implementation.
The Dyson Robotics Lab shows that visual odometry and non-rigid scene flow can be unified into a single joint factor graph and optimised highly efficiently with Gaussian Belief Propagation on the Graphcore IPU, leveraging the processor's distributed per-tile memory and ultrafast all-to-all communication fabric.

Graphcore: A Fast Hardware Pseudorandom Number Generator Based on xoroshiro128
James Hanlon, Stephen Felix
The IPU contains an original pseudorandom number generator (PRNG) called xoroshiro128aox, based on the F2-linear generator xoroshiro128. It is designed for cheap hardware implementation and high-quality statistical randomness.
We assess the generator's quality using standard statistical test suites and compare results against the PRNGs xoroshiro128+, pcg64 and philox4x32-10. We show that the new 'AOX' output function mitigates a known weakness in xoroshiro128+, allowing xoroshiro128aox to pass the BigCrush and PractRand suites.
We conclude that the non-uniformities and inherited linear artefacts are hard to detect, and so xoroshiro128aox provides a good trade-off between statistical quality and hardware implementation cost.
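For context, a minimal Python sketch of the underlying xoroshiro128 state update with the standard '+' output function (reference rotation constants 24/16/37) is shown below; Graphcore's 'AOX' output function replaces that output step and is not reproduced here.

```python
MASK = (1 << 64) - 1                      # work modulo 2^64, as the hardware does

def rotl(x, k):
    return ((x << k) | (x >> (64 - k))) & MASK

def xoroshiro128plus(s0, s1):
    """Reference-style xoroshiro128+ (rotation constants 24/16/37)."""
    while True:
        yield (s0 + s1) & MASK            # the '+' output function; xoroshiro128aox
                                          # swaps this step for Graphcore's AOX output
        s1 ^= s0                          # the shared xoroshiro128 state transition
        s0 = rotl(s0, 24) ^ s1 ^ ((s1 << 16) & MASK)
        s1 = rotl(s1, 37)

gen = xoroshiro128plus(0x0123456789ABCDEF, 0x0FEDCBA987654321)
print([hex(next(gen)) for _ in range(3)])
```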

Stanford University & Graphcore: NanoBatch DPSGD: Exploring Differentially Private learning on ImageNet with low batch sizes on the IPU
Edward H. Lee, Mario Michael Krell, Alexander Tsyplikhin, Victoria Rege, Errol Colak, Kristen W. Yeom
Differentially private SGD (DPSGD) has recently shown promise in deep learning. However, compared to non-private SGD, the DPSGD algorithm introduces computational overheads that can undo the benefit of batching on GPUs.
In our work, we show that training ResNet-50 with group normalization at low batch sizes can yield high accuracy and strong privacy on Graphcore IPUs. This enables DPSGD training of ResNet-50 on ImageNet in just 6 hours (100 epochs) on an IPU-POD16 system.
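As a rough illustration of where the per-example overhead comes from, here is a minimal DPSGD-style step on a toy linear model: gradients are clipped per example before being summed, noised and averaged. All constants are arbitrary and unrelated to the paper's settings.

```python
import numpy as np

# Toy DPSGD step on a linear-regression micro-batch: clip each per-example gradient,
# sum, add Gaussian noise, then average. Constants are arbitrary illustrations.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(8, 4)), rng.normal(size=8)
w = np.zeros(4)
clip_norm, noise_mult, lr = 1.0, 1.1, 0.1

clipped = []
for xi, yi in zip(X, y):                                  # per-example gradients
    g = 2.0 * (xi @ w - yi) * xi                          # grad of (xi.w - yi)^2
    clipped.append(g / max(1.0, np.linalg.norm(g) / clip_norm))

noise = rng.normal(scale=noise_mult * clip_norm, size=w.shape)
w -= lr * (np.sum(clipped, axis=0) + noise) / len(X)      # noisy, averaged update
print(w)
```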

Université de Paris: Comparison of Graphcore IPUs and NVIDIA GPUs for cosmology applications
Bastien Arcelin
This paper represents the first investigation of the suitability and performance of Graphcore Intelligence Processing Units (IPUs) for deep learning applications in cosmology. It presents a benchmark comparing an NVIDIA V100 GPU and a Graphcore Mk1 (GC2) IPU on three cosmological use cases: a classical deep neural network and a Bayesian neural network (BNN) for galaxy shape estimation, and a generative network for producing galaxy images.
The results show that IPUs can accelerate various cosmology applications, in some cases training as much as 4x faster than the GPU.

Graphcore Research: Dynamic Sparse Pre-Training of BERT
Anastasia S. D. Dietrich, Frithjof Gressmann, Douglas Orr, Ivan Chelombiev, Daniel Justus, Carlo Luschi
In this work, we develop and study a simple, dynamic always-sparse pre-training approach for BERT language models, which leverages periodic compression steps based on magnitude pruning followed by random parameter re-allocation.
As a result, we achieve Pareto improvements in terms of number of FLOPs over both static and dense baselines across model sizes. Furthermore, we demonstrate that training remains FLOP-efficient when using coarse-grained block sparsity, making it particularly promising for efficient execution on modern hardware accelerators.
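A simplified sketch of one such compression step (magnitude pruning followed by random re-allocation at constant sparsity) might look as follows; the schedule, sparsity level and block structure used in the paper differ.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=(256, 256))
mask = rng.random(w.shape) < 0.1                 # always-sparse: ~10% of weights active

def prune_and_regrow(w, mask, prune_frac=0.3):
    active = np.flatnonzero(mask)
    k = int(prune_frac * active.size)
    # Magnitude pruning: drop the k smallest-magnitude active weights.
    drop = active[np.argsort(np.abs(w.ravel()[active]))[:k]]
    mask.ravel()[drop] = False
    # Random re-allocation: grow k connections at inactive positions, initialised to zero.
    grow = rng.choice(np.flatnonzero(~mask), size=k, replace=False)
    mask.ravel()[grow] = True
    w.ravel()[grow] = 0.0
    return w, mask

w, mask = prune_and_regrow(w, mask)
print(mask.mean())                               # overall sparsity is unchanged (~0.1)
```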

Graphcore: Packing: Towards 2x NLP BERT Acceleration
Matej Kosec, Sheng Fu, Mario Michael Krell
By using a new packing algorithm, Graphcore engineers have sped up Natural Language Processing by more than 2x while training BERT-Large. Our new packing technique removes padding, enabling significantly more efficient computation. We suspect this could also be applied to genomics and protein-folding models, and to other models with skewed sequence-length distributions, to make a much broader impact across different industries and applications.
In the paper we introduce Graphcore's highly efficient Non-Negative Least Squares Histogram-Packing algorithm (NNLSHP), as well as the adjustments needed to apply BERT training to packed sequences.
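To illustrate why packing helps, here is a deliberately simple first-fit-decreasing packer; the paper's NNLSHP algorithm solves the histogram packing problem far more efficiently and with stronger guarantees, so treat this only as a toy.

```python
import random

def greedy_pack(lengths, max_len=512):
    """First-fit-decreasing packing of sequence lengths into slots of max_len tokens."""
    packs = []                                   # each pack is a list of sequence lengths
    for length in sorted(lengths, reverse=True):
        for pack in packs:
            if sum(pack) + length <= max_len:
                pack.append(length)
                break
        else:
            packs.append([length])
    return packs

random.seed(0)
lengths = [int(random.triangular(5, 512, 80)) for _ in range(2000)]   # skewed lengths
packs = greedy_pack(lengths)
padded_util = sum(lengths) / (len(lengths) * 512)   # useful-token fraction, one sequence per row
packed_util = sum(lengths) / (len(packs) * 512)     # useful-token fraction after packing
print(f"{padded_util:.2f} -> {packed_util:.2f}")
```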

Simula: iPUG: Accelerating Breadth-First Graph Traversals Using Manycore Graphcore IPUs
Luk Burchard, Johannes Moe, Daniel Thilo Schroeder, Konstantin Pogorelov, Johannes Langguth
This paper aims to test the IPU’s suitability for algorithms with hard-to-predict memory accesses by implementing a breadth-first search (BFS) that complies with the Graph500 specifications. Precisely because of its apparent simplicity, BFS is an established benchmark that is not only a subroutine for a variety of more complex graph algorithms, but also allows comparability across a wide range of architectures.
The results indicate that the IPU delivers speedups of up to 4× over the fastest competing result on an NVIDIA V100 GPU, with typical speedups of about 1.5× on most test instances.
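For reference, the level-synchronous BFS pattern specified by Graph500 looks roughly as follows in serial Python; the paper's contribution is mapping the frontier expansion across the IPU's tiles, which this sketch does not attempt.

```python
def bfs(adj, source):
    """Level-synchronous BFS returning the parent map Graph500 asks for."""
    parent = {source: source}
    frontier = [source]
    while frontier:
        next_frontier = []
        for u in frontier:                 # each level's expansion is the parallel part
            for v in adj[u]:
                if v not in parent:
                    parent[v] = u
                    next_frontier.append(v)
        frontier = next_frontier
    return parent

adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2], 4: []}   # tiny undirected graph
print(bfs(adj, 0))                         # vertex 4 is unreachable and absent from the result
```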

Graphcore Research: GroupBERT - Enhanced Transformer Architecture with Efficient Grouped Structures
Ivan Chelombiev, Daniel Justus, Douglas Orr, Anastasia Dietrich, Frithjof Gressmann, Alexandros Koliousis, Carlo Luschi
Attention-based language models have become a critical component in state-of-the-art NLP systems. However, these models have significant computational requirements, due to long training times, dense operations and large parameter counts.
In this paper, Graphcore Research demonstrates a set of modifications to the structure of a Transformer layer, producing a more efficient architecture. Applied to language representation learning, this architecture outperforms BERT models across a range of scales, improving efficiency both in terms of floating-point operations (FLOPs) and time-to-train.

Oxford-Man Institute & University of Oxford: Multi-Horizon Forecasting for Limit Order Books: Novel Deep Learning Approaches and Hardware Acceleration using Intelligent Processing Units
Zihao Zhang, Stefan Zohren
Researchers at the Oxford-Man Institute of Quantitative Finance have used Graphcore’s Intelligence Processing Unit (IPU) to dramatically accelerate the training of advanced price prediction models, using techniques which are typically plagued by computational bottlenecks when run on other types of processor.
The IPU’s designed-for-AI architecture allowed the OMI team to reduce the training times for their multi-horizon forecasting models to the point where they could deliver significant commercial advantage by more accurately estimating market price movements. Such models can be used in the development of alpha for fast trading and in market making strategies.

Graphcore Research: Proxy-Normalizing Activations to Match Batch Normalization while Removing Batch Dependence
Antoine Labatie, Dominic Masters, Zach Eaton-Rosen, Carlo Luschi
We investigate the reasons for the performance degradation incurred with batch-independent normalization. We find that the prototypical techniques of layer normalization and instance normalization both induce the appearance of failure modes in the neural network's pre-activations: (i) layer normalization induces a collapse towards channel-wise constant functions; (ii) instance normalization induces a lack of variability in instance statistics, symptomatic of an alteration of the expressivity.
To alleviate failure mode (i) without aggravating failure mode (ii), we introduce the technique "Proxy Normalization" that normalizes post-activations using a proxy distribution. When combined with layer normalization or group normalization, this batch-independent normalization emulates batch normalization's behavior and consistently matches or exceeds its performance.
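A rough, sampled approximation of the idea (our simplified reading, with ReLU as the activation) is sketched below; the paper parameterises the proxy distribution with dedicated learnable parameters and computes its statistics more carefully, so this is only meant to convey the shape of the operation.

```python
import numpy as np

rng = np.random.default_rng(0)

def proxy_norm_relu(y_normed, gamma, beta, n_proxy=4096):
    # y_normed: (batch, channels) pre-activations after layer/group norm.
    z = rng.standard_normal((n_proxy, 1))                  # Gaussian proxy pre-activations
    proxy_post = np.maximum(z * gamma + beta, 0.0)         # activation applied to the proxy
    mu = proxy_post.mean(axis=0)
    sigma = proxy_post.std(axis=0) + 1e-5
    post = np.maximum(y_normed * gamma + beta, 0.0)        # actual affine + ReLU
    return (post - mu) / sigma                             # channel-wise proxy normalization

x = rng.standard_normal((8, 16))                           # toy batch with 16 channels
out = proxy_norm_relu(x, gamma=np.ones(16), beta=np.zeros(16))
print(out.mean(), out.std())                               # roughly re-centred and re-scaled
```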

Graphcore Research: Making EfficientNet More Efficient: Exploring Batch-Independent Normalization, Group Convolutions and Reduced Resolution Training
Dominic Masters, Antoine Labatie, Zach Eaton-Rosen
Graphcore Research examines three methods for optimising the performance of the state-of-the-art computer vision model EfficientNet on Intelligence Processing Units (IPUs) in a new paper. These approaches are: (i) generalising depthwise convolutions to group convolutions; (ii) adding proxy-normalized activations to match batch normalization performance with batch-independent statistics; (iii) reducing compute by lowering the training resolution and inexpensively fine-tuning at higher resolution.
By combining all three techniques, IPUs delivered accelerations of up to 7x on training and more than 3.6x on inference.
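Approach (i) can be illustrated with standard PyTorch layers: a depthwise convolution is the extreme case of a group convolution (one channel per group), and relaxing it to larger groups adds cheap cross-channel mixing. Channel and group counts below are arbitrary examples, not the paper's configuration.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 32, 32)
depthwise = nn.Conv2d(64, 64, kernel_size=3, padding=1, groups=64)   # one channel per group
grouped   = nn.Conv2d(64, 64, kernel_size=3, padding=1, groups=4)    # 16 channels per group

print(depthwise(x).shape, grouped(x).shape)           # both keep the (1, 64, 32, 32) shape
print(sum(p.numel() for p in depthwise.parameters()),
      sum(p.numel() for p in grouped.parameters()))   # the grouped layer mixes across channels
```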

University of Bristol: Using the Graphcore IPU for traditional HPC applications
Thorben Louw, Simon McIntosh-Smith
The increase in ML workloads means that AI accelerators are expected to become common in supercomputers, evoking considerable interest in the scientific HPC community about how these devices might also be exploited for traditional HPC workloads.
In this paper, we report our early results using Graphcore's IPU for stencil computations on structured grid problems, which are used for solvers for differential equations in domains such as computational fluid dynamics. We demonstrate that the IPU and its low-level programming framework, Poplar, expose sufficient programmability to express these HPC problems, and achieve performance comparable to that of modern GPUs.
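As a point of reference, the structured-grid kernels in question have the following form: here, a 5-point Jacobi stencil for the 2D Laplace equation written in NumPy, whereas the paper expresses such kernels per-tile in Poplar rather than as array slices.

```python
import numpy as np

def jacobi_step(u):
    """One 5-point Jacobi relaxation step for the 2D Laplace equation."""
    new = u.copy()
    new[1:-1, 1:-1] = 0.25 * (u[:-2, 1:-1] + u[2:, 1:-1] + u[1:-1, :-2] + u[1:-1, 2:])
    return new

u = np.zeros((128, 128))
u[0, :] = 1.0                        # fixed boundary value along one edge
for _ in range(500):
    u = jacobi_step(u)
print(u[1:4, 64])                    # interior values relaxing towards the boundary condition
```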

Graphcore & UMass Amherst: Accelerating Simulation-based Inference with Emerging AI Hardware
Sourabh Kulkarni, Alexander Tsyplikhin, Mario Michael Krell, Csaba Andras Moritz
In this work, we explore hardware-accelerated simulation-based inference over probabilistic models, combining a massively parallelized Approximate Bayesian Computation (ABC) inference algorithm with cutting-edge AI chips that are uniquely suited for this purpose. As a proof of concept, we demonstrate inference over a probabilistic epidemiology model used to predict the spread of COVID-19. Two hardware acceleration platforms are compared - the Tesla V100 GPU and the Graphcore Mk1 IPU. Our results show that while both of these platforms outperform multi-core CPUs, the Mk1 IPUs are 7.5x faster than the Tesla V100 GPUs for this workload.
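The simulation-based inference pattern is easy to sketch: draw parameters from a prior, simulate, and accept draws whose simulated data lie close to the observations. The toy one-parameter growth model below is ours, not the paper's epidemiology model; each simulation is independent, which is what makes the method so amenable to massive parallelism.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(beta, days=30, i0=10.0):
    """Toy stochastic growth model: daily growth rate ~ N(beta, 0.02)."""
    series = [i0]
    for _ in range(days - 1):
        series.append(series[-1] * (1.0 + rng.normal(beta, 0.02)))
    return np.array(series)

observed = simulate(0.15)                          # pretend this is the observed case curve

accepted = []
for _ in range(5000):                              # each trial is embarrassingly parallel
    beta = rng.uniform(0.0, 0.5)                   # draw from the prior
    if np.mean(np.abs(simulate(beta) - observed)) < 0.2 * observed.mean():
        accepted.append(beta)                      # keep draws whose simulation is close

print(len(accepted), np.mean(accepted))            # accepted samples concentrate near 0.15
```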

Google Research, UC Berkeley & Graphcore Research: Parallel Training of Deep Networks with Local Updates
Michael Laskin, Luke Metz, Seth Nabarro, Mark Saroufim, Badreddine Noune, Carlo Luschi, Jascha Sohl-Dickstein, Pieter Abbeel
In this paper, we investigate how to continue scaling compute efficiently beyond the point of diminishing returns for large batches through local parallelism, a framework which parallelizes training of individual layers in deep networks by replacing global backpropagation with truncated layer-wise backpropagation. Local parallelism enables fully asynchronous layer-wise parallelism with a low memory footprint, and requires little communication overhead compared with model parallelism. We show results in both vision and language domains across a diverse set of architectures, and find that local parallelism is particularly effective in the high-compute regime.
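A minimal PyTorch sketch of the local-update idea follows: each block trains against its own auxiliary loss and detaches its output, so no global backward pass couples the blocks. Sizes, heads and optimiser settings are illustrative, not those used in the paper.

```python
import torch
import torch.nn as nn

blocks = nn.ModuleList([nn.Sequential(nn.Linear(32, 32), nn.ReLU()) for _ in range(3)])
heads = nn.ModuleList([nn.Linear(32, 10) for _ in range(3)])        # local auxiliary classifiers
opts = [torch.optim.SGD(list(b.parameters()) + list(h.parameters()), lr=0.1)
        for b, h in zip(blocks, heads)]

x, y = torch.randn(64, 32), torch.randint(0, 10, (64,))
h = x
for block, head, opt in zip(blocks, heads, opts):
    h = block(h)
    loss = nn.functional.cross_entropy(head(h), y)   # each block has its own local loss
    opt.zero_grad(); loss.backward(); opt.step()     # and its own optimiser step
    h = h.detach()                                   # gradients never cross block boundaries
```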

Graphcore Research: Improving Neural Network Training in Low Dimensional Random Bases
Frithjof Gressmann, Zach Eaton-Rosen, Carlo Luschi
Graphcore Research is exploring novel ways to train neural networks that could allow us to scale to substantially larger models in future.
In this paper, we revisit a simple approach to reduce the effective network dimensionality using random projections. We leverage the hardware-accelerated random number generation of the IPU to train in randomly selected directions of the weight space. Applying smaller independent random projections to different parts of the network and re-drawing them at every step significantly improves the obtained accuracy.
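The mechanism can be sketched on a toy objective: parameters are updated only along a small set of random directions that are re-drawn at every step. The quadratic loss and all constants below are stand-ins, not the networks studied in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
D, d = 1000, 16                                # full and reduced dimensionality
theta = rng.normal(size=D)
lr = 0.4

for step in range(500):
    P = rng.normal(size=(D, d)) / np.sqrt(D)   # fresh random basis at every step
    grad = 2.0 * theta                         # gradient of the toy loss ||theta||^2
    theta -= lr * P @ (P.T @ grad)             # move only along the d random directions
print(np.linalg.norm(theta))                   # shrinks towards 0 despite d << D
```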

Graphcore & Ford: A Follow-The-Leader Strategy using Hierarchical Deep Neural Networks with Grouped Convolutions
José Solomon, François Charette
A follow-the-leader strategy can be implemented using a hierarchical Deep Neural Network (DNN) end-to-end driving model to match the direction and speed of a target pedestrian. Using a classifier DNN, pedestrian movements can be tracked to determine if the pedestrian is in the camera sensor’s field of view. The autonomous vehicle’s steering and throttle can then be adjusted by a regression DNN. These DNNs also incorporate grouped convolutions to boost model performance.
In this paper, Graphcore Research and Ford Motor Company leverage the fine-grain compute capabilities of the Graphcore IPU to minimise time-to-train for these Hierarchical Deep Neural Networks.

University of Bristol: Studying the potential of Graphcore IPUs for applications in Particle Physics
Lakshan Ram Madhan Mohan, Alexander Marshall, Samuel Maddrell-Mander, Daniel O'Hanlon, Konstantinos Petridis, Jonas Rademacker, Victoria Rege, Alexander Titterton
This paper presents the first study of Graphcore's Intelligence Processing Unit (IPU) in the context of particle physics applications.
Comparisons are made for neural-network-based event simulation, multiple-scattering correction, and flavour tagging, implemented on IPUs, GPUs and CPUs, using a variety of neural network architectures and hyperparameters. Additionally, a Kálmán filter for track reconstruction is implemented with promising results.
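For readers unfamiliar with the latter, a minimal one-dimensional constant-velocity Kalman filter is sketched below; the paper's track-reconstruction filter operates on detector hits across IPU tiles, which this toy NumPy version does not attempt to model.

```python
import numpy as np

F = np.array([[1.0, 1.0], [0.0, 1.0]])    # constant-velocity transition: position += velocity
H = np.array([[1.0, 0.0]])                # we only measure position
Q = 1e-3 * np.eye(2)                      # process noise
R = np.array([[0.25]])                    # measurement noise

x, P = np.zeros(2), np.eye(2)
for z in [1.1, 2.0, 2.9, 4.2, 5.0]:       # noisy position measurements ("hits")
    x, P = F @ x, F @ P @ F.T + Q                      # predict
    K = P @ H.T @ np.linalg.inv(H @ P @ H.T + R)       # Kalman gain
    x = x + K @ (np.array([z]) - H @ x)                # update with the measurement
    P = (np.eye(2) - K @ H) @ P
print(x)                                  # estimated [position, velocity]
```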

Imperial College London: Bundle Adjustment on a Graph Processor
Joseph Ortiz, Mark Pupilli, Stefan Leutenegger, Andrew J. Davison
This paper shows for the first time that the classical computer vision problem of bundle adjustment (BA) can be solved extremely fast on a graph processor such as Graphcore's Intelligence Processing Unit (IPU) using Gaussian Belief Propagation.
Gaussian Belief Propagation is an effective algorithmic framework for spatial AI problems where estimates are needed in real time with new measurements constantly being fed into the algorithm.
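A minimal sketch of Gaussian Belief Propagation on a toy one-dimensional pose chain (information-form messages, synchronous updates) is given below to convey the flavour of the algorithm; the paper applies GBP to full bundle adjustment factor graphs on the IPU, which is far beyond this toy.

```python
import numpy as np

# Toy 1D pose chain x0 -- x1 -- x2: a strong unary prior anchors x0 near 0, and two
# relative "odometry" factors say x_{i+1} - x_i ~ N(1, sigma^2). Messages are kept in
# information form (precision lam, information eta) and updated synchronously.

sigma2 = 0.01
priors = {0: (1e4, 0.0)}                        # unary prior pinning x0 near 0.0
odom = [(0, 1, 1.0), (1, 2, 1.0)]               # (i, j, measured x_j - x_i)
msgs = {(f, v): (0.0, 0.0) for f, (i, j, _) in enumerate(odom) for v in (i, j)}

def belief(v):
    lam, eta = priors.get(v, (0.0, 0.0))
    for (f, u), (l, e) in msgs.items():
        if u == v:
            lam, eta = lam + l, eta + e
    return lam, eta

for _ in range(20):                             # synchronous GBP sweeps
    new = {}
    for f, (i, j, d) in enumerate(odom):
        L = np.array([[1.0, -1.0], [-1.0, 1.0]]) / sigma2      # factor precision block
        h = np.array([-d, d]) / sigma2                          # factor information vector
        for src, dst, s, t in [(i, j, 0, 1), (j, i, 1, 0)]:
            lam_in, eta_in = belief(src)
            lam_in -= msgs[(f, src)][0]                         # remove this factor's own
            eta_in -= msgs[(f, src)][1]                         # message (cavity belief)
            A, b = L.copy(), h.copy()
            A[s, s] += lam_in
            b[s] += eta_in
            # Marginalise out the source variable (scalar Schur complement).
            new[(f, dst)] = (A[t, t] - A[t, s] * A[s, t] / A[s, s],
                             b[t] - A[t, s] * b[s] / A[s, s])
    msgs = new

for v in range(3):
    lam, eta = belief(v)
    print(f"x{v} ~ {eta / lam:.3f}")            # expect approximately 0.0, 1.0, 2.0
```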

Qwant: Graphcore C2 Card performance for image-based deep learning application: A Report
Ilyes Kacher, Maxime Portaz, Hicham Randrianarivo, Sylvain Peyronnet
Graphcore's processor architecture has been designed to achieve state-of-the-art performance on current machine intelligence models for both training and inference.
In this paper, we report on a benchmark in which we have evaluated the performance of IPU processors on deep neural networks for inference. We focus on deep vision models such as ResNeXt. We report the observed latency, throughput and energy efficiency.

Citadel: Dissecting the Graphcore IPU Architecture via Microbenchmarking
Zhe Jia, Blake Tillman, Marco Maggioni, Daniele Paolo Scarpazza
This report focuses on the architecture and performance of the Intelligence Processing Unit (IPU), a novel, massively parallel platform introduced by Graphcore and aimed at Artificial Intelligence/Machine Learning (AI/ML) workloads.
The study dissects the IPU’s performance behavior using microbenchmarks that were crafted for the purpose.

Graphcore Research: Revisiting Small Batch Training for Deep Neural Networks
Dominic Masters, Carlo Luschi
The team at Graphcore Research addresses mini-batch stochastic gradient optimization of modern deep network architectures.
In this paper, we review common assumptions on learning rate scaling and training duration, as a basis for an experimental comparison of test performance for different mini-batch sizes. Our experiments show that small mini-batch sizes consistently produce the best test performance.