RESEARCH PAPERS

Graphcore & PNNL: Extreme Acceleration of Graph Neural Network-based Prediction Models for Quantum Chemistry

Hatem Helal, Jesun Firoz, Jenna Bilbrey, Mario Michael Krell, Tom Murray, Ang Li, Sotiris Xantheas, Sutanay Choudhury

This paper demonstrates a novel hardware-software co-design approach to scale up the training of graph neural networks for molecular property prediction.

We introduce an algorithm that can reduce the training time of such molecular property prediction models from days to less than two hours, opening new possibilities for AI-driven scientific discovery.

Graphcore, Valence, MILA: GPS++: An Optimised Hybrid GNN/Transformer for Molecular Property Prediction

Dominic Masters, Josef Dean, Kerstin Klaser, Zhiyi Li, Sam Maddrell-Mander, Adam Sanders, Hatem Helal, Deniz Beker, Ladislav Rampášek, Dominique Beaini

This technical report presents GPS++, the winning method in the Open Graph Benchmark Large-Scale Challenge (OGB-LSC 2022) for the PCQM4Mv2 molecular property prediction task. Our method is a hybrid GNN/Transformer model that incorporates 3D atom positions and an auxiliary denoising task, achieving 0.0719 mean absolute error on the PCQM4Mv2 test-challenge set.

Using IPUs, GPS++ scales to deep architectures (16 layers), training at 3 minutes per epoch, and to a large ensemble (112 models), completing the final predictions in 1 hour 32 minutes, well under the 4-hour inference budget.

Graphcore: BESS: Balanced Entity Sampling and Sharing for Large-Scale Knowledge Graph Completion

Alberto Cattaneo, Daniel Justus, Harry Mellor, Douglas Orr, Jerome Maloberti, Zhenying Liu, Thorin Farnsworth, Andrew Fitzgibbon, Blazej Banaszewski, Carlo Luschi

We present the award-winning submission of the WikiKG90Mv2 track of OGB-LSC@NeurIPS 2022. The task is link-prediction on the large-scale knowledge graph WikiKG90Mv2, consisting of 90M+ nodes and 600M+ edges. Our solution uses a diverse ensemble of 85 Knowledge Graph Embedding models combining five different scoring functions (TransE, TransH, RotatE, DistMult, ComplEx) and two different loss functions (log-sigmoid, sampled softmax cross-entropy).

Our final model achieved 1st place with a validation MRR of 0.2922 and a test-challenge MRR of 0.2562.
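
For readers unfamiliar with the metric quoted above, the following is a minimal sketch of how mean reciprocal rank (MRR) is computed. The function name and the example ranks are illustrative assumptions, not taken from the submission.

```python
def mean_reciprocal_rank(ranks):
    """Average of 1/rank over all queries.

    `ranks` holds the 1-based position of the correct entity in each
    query's sorted candidate list (hypothetical input format).
    """
    if not ranks:
        raise ValueError("need at least one rank")
    return sum(1.0 / r for r in ranks) / len(ranks)


# Three toy queries whose correct answers were ranked 1st, 2nd and 4th.
example_mrr = mean_reciprocal_rank([1, 2, 4])
```

An MRR of 1.0 would mean the correct entity is always ranked first; lower values mean it tends to appear further down the candidate list.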

Graphcore, PNNL, IBM Research, University of Washington: Reducing Down(stream)time: Pretraining Molecular GNNs using Heterogeneous AI Accelerators

Jenna A. Bilbrey, Kristina M. Herman, Henry Sprueill, Sotiris S. Xantheas, Payel Das, Manuel Lopez Roldan, Mike Kraus, Hatem Helal, Sutanay Choudhury

We demonstrate finetuning for downstream tasks on a graph neural network (GNN) trained over a molecular database containing 2.7 million water clusters.

The use of Graphcore IPUs for training molecular GNNs reduces training time from a reported 2.7 days on 0.5M clusters to 1.2 hours on 2.7M clusters. Finetuning the pretrained model for downstream tasks of molecular dynamics and transfer to a different potential energy surface took only 8.3 hours and 28 minutes, respectively, on a single GPU.

Texas A&M University & Graphcore: Benchmarking the Performance of Accelerators on National Cyberinfrastructure Resources for Artificial Intelligence / Machine Learning Workloads

Abhinand S. Nasari, Tim Cockerill, Hieu T. Le, Richard Lawrence, Zhenhua He, Xin Yang, Mario M. Krell, Alex Tsyplikhin, Mahidhar Tatineni, Lisa M. Perez, Dhruva K. Chakravorty, Honggao Liu

This paper compares the performance of two different architectures, the commonly used GPU and the new generation of Intelligence Processing Units (IPUs), by running training benchmarks of common AI/ML models on national cyberinfrastructure resources.

Microsoft Research & Graphcore: Confidential Machine Learning within Graphcore IPUs

Kapil Vaswani, Stavros Volos, Cédric Fournet, Antonio Nino Diaz, Ken Gordon, Balaji Vembu, Sam Webster, David Chisnall, Saurabh Kulkarni, Graham Cunningham, Richard Osbourne, Dan Wilkinson

This paper presents IPU Trusted Extensions (ITX), a set of experimental hardware extensions that enable trusted execution environments in Graphcore's IPUs.

Its evaluation on a development board using standard DNN training workloads suggests that ITX adds less than 5% performance overhead, and delivers up to 17x better performance compared to CPU-based confidential computing systems relying on AMD SEV-SNP.

Imperial College London: Incremental Abstraction in Distributed Probabilistic SLAM Graphs

Joseph Ortiz, Talfan Evans, Edgar Sucar, Andrew Davison

The Robot Vision Laboratory at Imperial College London propose a method for efficient incremental construction of probabilistic scene graphs from monocular input, based on two novel components: first, an incremental scene abstraction framework combining amortized inference with probabilistic inference; and second, a routing procedure that enables inference on dynamic graphs with Gaussian Belief Propagation (GBP), leveraging the parallelism of the Graphcore IPU.

This paper demonstrates the advantage of GBP over direct methods for complex factor graphs due to the structure-agnostic time per iteration. 

Imperial College London - Dyson Robotics Laboratory: From Scene Flow to Visual Odometry through Local and Global Regularisation in Markov Random Fields

Raluca Scona, Hidenobu Matsuki, Andrew Davison

This paper revisits pairwise Markov Random Field (MRF) formulations for RGB-D scene flow and leverages novel advances in processor design for real-time implementations.

Dyson Robotics Lab show that visual odometry and non-rigid scene flow can be unified into a single joint factor graph, and optimised highly efficiently with Gaussian Belief Propagation on the Graphcore IPU by leveraging the processor's distributed per-tile memory and ultrafast all-to-all communication fabric.

Graphcore: A Fast Hardware Pseudorandom Number Generator Based on the xoroshiro128 LFSR

James Hanlon, Diya Rajan, Stephen Felix

In this paper, we present a rigorous assessment of the quality of our new PRNG using standard statistical test suites and compare the results with the fast contemporary PRNGs xoroshiro128+, pcg64 and philox4x32. As a baseline for the analysis, we include the widely-used Mersenne Twister PRNG. In our experiments, we show that xoroshiro128aox mitigates the known weakness in the lower order bits of xoroshiro128+ with our new AOX output function by passing the BigCrush and PractRand test suites.

We extend our testing with the Gjrand test suite and a Hamming-Weight dependency test to highlight the linear weaknesses of both xoroshiro128 PRNGs, but conclude that these linearities are hard to detect, and the xoroshiro128aox PRNG otherwise provides an excellent trade off between statistical quality and hardware implementation cost.
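
As background for the comparison above, here is a minimal software sketch of the baseline xoroshiro128+ step. The rotation and shift constants (24, 16, 37) are an assumption taken from the public-domain reference implementation and may differ from those used in the hardware generator. The AOX output function introduced in the paper is not reproduced here.

```python
MASK64 = (1 << 64) - 1  # keep all arithmetic within 64 bits


def rotl(x, k):
    """Rotate a 64-bit value left by k bits."""
    return ((x << k) | (x >> (64 - k))) & MASK64


def xoroshiro128plus_next(s0, s1):
    """Return (output, new_s0, new_s1) for one xoroshiro128+ step.

    The output is the plain sum of the two state words; its lowest bits
    are the ones with the known linear weakness mentioned in the text.
    """
    result = (s0 + s1) & MASK64
    s1 ^= s0
    new_s0 = rotl(s0, 24) ^ s1 ^ ((s1 << 16) & MASK64)
    new_s1 = rotl(s1, 37)
    return result, new_s0, new_s1


# Toy seed (hypothetical); a real seed must not be all zeros.
out, s0, s1 = xoroshiro128plus_next(1, 2)
```

The state update is purely linear (XORs, shifts and rotations), so the statistical quality of the generator depends heavily on the output function applied to the state.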

Stanford University & Graphcore: NanoBatch DPSGD: Exploring Differentially Private learning on ImageNet with low batch sizes on the IPU

Edward H. Lee, Mario Michael Krell, Alexander Tsyplikhin, Victoria Rege, Errol Colak, Kristen W. Yeom

Differentially private SGD (DPSGD) has recently shown promise in deep learning. However, compared to non-private SGD, the DPSGD algorithm imposes computational overheads that can undo the benefit of batching on GPUs.

In our work, we argue that low batch sizes using group normalization on ResNet-50 can yield high accuracy and privacy on Graphcore IPUs. This enables DPSGD training of ResNet-50 on ImageNet in just 6 hours (100 epochs) on an IPU-POD16 system.
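
The overhead mentioned above comes from treating every example's gradient individually. The following is a minimal NumPy sketch of one generic DPSGD gradient aggregation step, assuming the per-example gradients are already available as rows of a matrix. The function and parameter names are illustrative and not taken from the paper.

```python
import numpy as np


def dpsgd_aggregate(per_example_grads, clip_norm, noise_multiplier, rng):
    """Clip each example's gradient, sum, add Gaussian noise, then average."""
    grads = np.asarray(per_example_grads, dtype=float)  # (batch, n_params)
    norms = np.linalg.norm(grads, axis=1, keepdims=True)
    # Scale down any gradient whose L2 norm exceeds clip_norm.
    scale = np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    clipped = grads * scale
    # Noise standard deviation is proportional to the clipping threshold.
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=grads.shape[1])
    return (clipped.sum(axis=0) + noise) / grads.shape[0]


rng = np.random.default_rng(0)
toy_grads = [[3.0, 4.0], [0.3, 0.4]]  # norms 5.0 and 0.5
private_grad = dpsgd_aggregate(toy_grads, clip_norm=1.0,
                               noise_multiplier=1.1, rng=rng)
```

Because clipping is applied per example rather than to the batch-averaged gradient, the usual efficiency gains from large batches are reduced, which is why small batch sizes are of interest here.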

Université de Paris: Comparison of Graphcore IPUs and NVIDIA GPUs for cosmology applications

Bastien Arcelin

This paper represents the first investigation of the suitability and performance of Graphcore Intelligence Processing Units (IPUs) for deep learning applications in cosmology. It presents a benchmark comparing an NVIDIA V100 GPU with a Graphcore Mk1 (GC2) IPU on three cosmological use cases: a classical deep neural network and a Bayesian neural network (BNN) for galaxy shape estimation, and a generative network for galaxy image production.

The results show that IPUs can accelerate various cosmology applications, in some cases training as much as 4x faster than GPUs.

Graphcore Research: Dynamic Sparse Pre-Training of BERT

Anastasia S. D. Dietrich, Frithjof Gressmann, Douglas Orr, Ivan Chelombiev, Daniel Justus, Carlo Luschi

In this work, we develop and study a simple, dynamic always-sparse pre-training approach for BERT language models, which leverages periodic compression steps based on magnitude pruning followed by random parameter re-allocation.

As a result, we achieve Pareto improvements in terms of number of FLOPs over both static and dense baselines across model sizes. Furthermore, we demonstrate that training remains FLOP-efficient when using coarse-grained block sparsity, making it particularly promising for efficient execution on modern hardware accelerators.
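
The compression step described above (magnitude pruning followed by random re-allocation) can be sketched as follows. This is a simplified, unstructured-sparsity illustration under stated assumptions: the function name, the zero initialisation of regrown weights and the pruning fraction are hypothetical and not taken from the paper.

```python
import numpy as np


def prune_and_regrow(weights, mask, prune_fraction, rng):
    """Drop the smallest-magnitude active weights, then activate the same
    number of randomly chosen, previously inactive positions."""
    shape = np.shape(weights)
    flat_w = np.array(weights, dtype=float).ravel()
    flat_m = np.array(mask, dtype=bool).ravel()

    active = np.flatnonzero(flat_m)
    inactive = np.flatnonzero(~flat_m)  # candidates for re-allocation
    n_prune = int(prune_fraction * active.size)
    if n_prune == 0:
        return flat_w.reshape(shape), flat_m.reshape(shape)

    # Magnitude pruning: the n_prune active weights closest to zero.
    order = np.argsort(np.abs(flat_w[active]))
    pruned = active[order[:n_prune]]
    flat_m[pruned] = False
    flat_w[pruned] = 0.0

    # Random re-allocation keeps the number of non-zeros constant.
    n_grow = min(n_prune, inactive.size)
    grown = rng.choice(inactive, size=n_grow, replace=False)
    flat_m[grown] = True
    flat_w[grown] = 0.0  # assumption: regrown weights start at zero
    return flat_w.reshape(shape), flat_m.reshape(shape)
```

Calling this periodically during training keeps the model sparse at every step, in contrast to approaches that train densely and prune afterwards.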

Graphcore: Packing: Towards 2x NLP BERT Acceleration

Matej Kosec, Sheng Fu, Mario Michael Krell

By using a new packing algorithm, Graphcore engineers have sped up Natural Language Processing by more than 2 times while training BERT-Large. Our new packing technique removes padding, enabling significantly more efficient computation. We suspect this could also be applied to genomics and protein folding models and other models with skewed length distributions to make a much broader impact in different industries and applications.

We introduce Graphcore's highly efficient Non-Negative Least Squares Histogram-Packing algorithm (or NNLSHP) as well as our BERT algorithm applied to packed sequences in a new paper.
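
To illustrate why packing removes padding, here is a minimal sketch using a simple first-fit-decreasing heuristic. This is a swapped-in stand-in for illustration only, not the NNLSHP algorithm from the paper, and the sequence lengths are made up.

```python
def pack_sequences(lengths, max_len):
    """Group sequence lengths into packs whose total does not exceed max_len."""
    packs = []  # each pack is a list of sequence lengths
    free = []   # remaining token slots in each pack
    for length in sorted(lengths, reverse=True):
        if length > max_len:
            raise ValueError("sequence longer than max_len")
        for i, space in enumerate(free):
            if length <= space:
                packs[i].append(length)
                free[i] -= length
                break
        else:
            # No existing pack has room, so open a new one.
            packs.append([length])
            free.append(max_len - length)
    return packs


toy_lengths = [400, 120, 300, 60, 200, 500]
packs = pack_sequences(toy_lengths, max_len=512)
```

In this toy case six padded sequences of length 512 become four packs, so the model processes 2048 token slots instead of 3072 for the same 1580 real tokens.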

Simula: iPUG: Accelerating Breadth-First Graph Traversals Using Manycore Graphcore IPUs

Luk Burchard, Johannes Moe, Daniel Thilo Schroeder, Konstantin Pogorelov, Johannes Langguth

This paper aims to test the IPU’s suitability for algorithms with hard-to-predict memory accesses by implementing a breadth-first search (BFS) that complies with the Graph500 specifications. Precisely because of its apparent simplicity, BFS is an established benchmark that is not only a subroutine of a variety of more complex graph algorithms, but also allows comparability across a wide range of architectures.

The results indicate that the IPU delivers speedups of up to 4× over the fastest competing result on an NVIDIA V100 GPU, with typical speedups of about 1.5× on most test instances.
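
For reference, a minimal sequential sketch of breadth-first search is shown below. It is not the IPU implementation and does not aim for Graph500 compliance; the adjacency-dictionary input format is an assumption made for illustration.

```python
from collections import deque


def bfs_levels(adjacency, source):
    """Return a dict mapping each reachable node to its distance from source."""
    level = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        # Which neighbours are visited next depends on the graph data,
        # which is what makes the memory accesses hard to predict.
        for neighbour in adjacency.get(node, ()):
            if neighbour not in level:
                level[neighbour] = level[node] + 1
                queue.append(neighbour)
    return level


toy_graph = {0: [1, 2], 1: [3], 2: [3], 3: [4], 4: []}
levels = bfs_levels(toy_graph, source=0)
```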

Graphcore Research: GroupBERT - Enhanced Transformer Architecture with Efficient Grouped Structures

Ivan Chelombiev, Daniel Justus, Douglas Orr, Anastasia Dietrich, Frithjof Gressmann, Alexandros Koliousis, Carlo Luschi

Attention based language models have become a critical component in state-of-the-art NLP systems. However, these models have significant computational requirements, due to long training times, dense operations and large parameter count.

In this paper, Graphcore Research demonstrate a set of modifications to the structure of a Transformer layer, producing a more efficient architecture. This architecture is applied to language representation learning and demonstrates superior performance compared to BERT models of different scales, improving efficiency in terms of both floating-point operations (FLOPs) and time-to-train.

Oxford-Man Institute & University of Oxford: Multi-Horizon Forecasting for Limit Order Books: Novel Deep Learning Approaches and Hardware Acceleration using Intelligent Processing Units

Zihao Zhang, Stefan Zohren

Researchers at the Oxford-Man Institute of Quantitative Finance have used Graphcore’s Intelligence Processing Unit (IPU) to dramatically accelerate the training of advanced price prediction models, using techniques which are typically plagued by computational bottlenecks when run on other types of processor.

The IPU’s designed-for-AI architecture allowed the OMI team to reduce the training times for their multi-horizon forecasting models to the point where they could deliver significant commercial advantage by more accurately estimating market price movements. Such models can be used in the development of alpha for fast trading and in market making strategies.

Graphcore Research: Proxy-Normalizing Activations to Match Batch Normalization while Removing Batch Dependence

Antoine Labatie, Dominic Masters, Zach Eaton-Rosen, Carlo Luschi

We investigate the reasons for the performance degradation incurred with batch-independent normalization. We find that the prototypical techniques of layer normalization and instance normalization both induce the appearance of failure modes in the neural network's pre-activations: (i) layer normalization induces a collapse towards channel-wise constant functions; (ii) instance normalization induces a lack of variability in instance statistics, symptomatic of an alteration of the expressivity.

To alleviate failure mode (i) without aggravating failure mode (ii), we introduce the technique "Proxy Normalization" that normalizes post-activations using a proxy distribution. When combined with layer normalization or group normalization, this batch-independent normalization emulates batch normalization's behavior and consistently matches or exceeds its performance.

Graphcore Research: Making EfficientNet More Efficient: Exploring Batch-Independent Normalization, Group Convolutions and Reduced Resolution Training

Dominic Masters, Antoine Labatie, Zach Eaton-Rosen,

Graphcore Research examines three methods for optimising the performance of the state-of-the-art computer vision model EfficientNet on Intelligence Processing Units (IPUs) in a new paper. These approaches are: (i) generalising depthwise convolutions to group convolutions; (ii) adding proxy-normalized activations to match batch normalization performance with batch-independent statistics; (iii) reducing compute by lowering the training resolution and inexpensively fine-tuning at higher resolution.

By combining all three techniques, IPUs delivered accelerations of up to 7x on training and more than 3.6x on inference.

University of Bristol: Using the Graphcore IPU for traditional HPC applications

Thorben Louw, Simon McIntosh-Smith

The increase in ML workloads means that AI accelerators are expected to become common in supercomputers, evoking considerable interest in the scientific HPC community about how these devices might also be exploited for traditional HPC workloads.

In this paper, we report our early results using Graphcore's IPU for stencil computations on structured grid problems, which are used for solvers for differential equations in domains such as computational fluid dynamics. We demonstrate that the IPU and its low-level programming framework, Poplar, expose sufficient programmability to express these HPC problems, and achieve performance comparable to that of modern GPUs.
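
As an illustration of the kind of stencil computation referred to above, the following NumPy sketch performs one Jacobi sweep of a 5-point stencil on a 2D structured grid. It is a generic textbook example, not the Poplar code from the paper, and the grid values are hypothetical.

```python
import numpy as np


def jacobi_step(grid):
    """Replace each interior cell by the mean of its four neighbours;
    boundary cells are left unchanged."""
    grid = np.asarray(grid, dtype=float)
    new = grid.copy()
    new[1:-1, 1:-1] = 0.25 * (
        grid[:-2, 1:-1] + grid[2:, 1:-1]    # neighbours above and below
        + grid[1:-1, :-2] + grid[1:-1, 2:]  # neighbours left and right
    )
    return new


# Boundary held at 1.0, interior initialised to 0.0.
toy_grid = np.ones((5, 5))
toy_grid[1:-1, 1:-1] = 0.0
after_one_sweep = jacobi_step(toy_grid)
```

Each cell only needs its immediate neighbours, so the grid can be partitioned across many processing units that exchange just their boundary values between sweeps.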

Graphcore & UMass Amherst: Accelerating Simulation-based Inference with Emerging AI Hardware

Sourabh Kulkarni, Alexander Tsyplikhin, Mario Michael Krell, Csaba Andras Moritz

In this work, we explore hardware-accelerated simulation-based inference over probabilistic models by combining a massively parallelized ABC inference algorithm with cutting-edge AI chip solutions that are uniquely suited for this purpose. As a proof of concept, we demonstrate inference over a probabilistic epidemiology model used to predict the spread of COVID-19. Two hardware acceleration platforms are compared: the Tesla V100 GPU and the Graphcore Mk1 IPU. Our results show that while both of these platforms outperform multi-core CPUs, the Mk1 IPUs are 7.5x faster than the Tesla V100 GPUs for this workload.
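
The idea behind ABC (approximate Bayesian computation) can be shown with a minimal rejection-sampling sketch. The toy binomial model, uniform prior and tolerance below are assumptions chosen for illustration and are far simpler than the epidemiology model used in the paper.

```python
import numpy as np


def abc_rejection(observed, simulate, sample_prior, tolerance, n_draws, rng):
    """Keep the prior draws whose simulated data land close to the observation."""
    accepted = []
    for _ in range(n_draws):
        theta = sample_prior(rng)
        if abs(simulate(theta, rng) - observed) <= tolerance:
            accepted.append(theta)
    return np.array(accepted)


# Toy model: 30 infections observed among 100 contacts; theta is the
# unknown per-contact infection probability.
rng = np.random.default_rng(0)
posterior = abc_rejection(
    observed=30,
    simulate=lambda theta, r: r.binomial(100, theta),
    sample_prior=lambda r: r.uniform(0.0, 1.0),
    tolerance=2,
    n_draws=5000,
    rng=rng,
)
```

Every draw is independent of the others, which is why this style of inference lends itself to massive parallelisation.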

Google Research, UC Berkeley & Graphcore Research: Parallel Training of Deep Networks with Local Updates

Michael Laskin, Luke Metz, Seth Nabarro, Mark Saroufim, Badreddine Noune, Carlo Luschi, Jascha Sohl-Dickstein, Pieter Abbeel

In this paper, we investigate how to continue scaling compute efficiently beyond the point of diminishing returns for large batches through local parallelism, a framework which parallelizes training of individual layers in deep networks by replacing global backpropagation with truncated layer-wise backpropagation. Local parallelism enables fully asynchronous layer-wise parallelism with a low memory footprint, and requires little communication overhead compared with model parallelism. We show results in both vision and language domains across a diverse set of architectures, and find that local parallelism is particularly effective in the high-compute regime.

Graphcore Research: Improving Neural Network Training in Low Dimensional Random Bases

Frithjof Gressmann, Zach Eaton-Rosen, Carlo Luschi

Graphcore Research is exploring novel ways to train neural networks that could allow us to scale to substantially larger models in future.

In this paper, we revisit a simple approach to reduce the effective network dimensionality using random projections. We leverage the hardware-accelerated random number generation of the IPU to train in randomly selected directions of the weight space. Applying smaller independent random projections to different parts of the network and re-drawing them at every step significantly improves the obtained accuracy.
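
The core idea of training in randomly selected directions can be sketched as follows. This is a simplified illustration that assumes the full gradient is available and that the random directions are normalised Gaussian vectors re-drawn at every step; the function name, learning rate and toy loss are hypothetical rather than taken from the paper.

```python
import numpy as np


def random_bases_step(theta, grad, n_directions, lr, rng):
    """Update theta using only the gradient components that lie in a freshly
    drawn low-dimensional random subspace."""
    basis = rng.normal(size=(theta.size, n_directions))
    basis /= np.linalg.norm(basis, axis=0, keepdims=True)
    coords = basis.T @ grad  # gradient expressed in the random basis
    return theta - lr * (basis @ coords)


# Toy problem: minimise 0.5 * ||theta||^2, whose gradient is theta itself.
rng = np.random.default_rng(0)
theta = np.ones(50)
initial_loss = 0.5 * float(theta @ theta)
for _ in range(200):
    theta = random_bases_step(theta, grad=theta, n_directions=4, lr=0.1, rng=rng)
final_loss = 0.5 * float(theta @ theta)
```

Only the low-dimensional coordinates and the random seed would need to be communicated per step, since the directions can be regenerated from the seed.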

Graphcore & Ford: A Follow-The-Leader Strategy using Hierarchical Deep Neural Networks with Grouped Convolutions

José Solomon, François Charette

A follow-the-leader strategy can be implemented using a hierarchical Deep Neural Network (DNN) end-to-end driving model to match the direction and speed of a target pedestrian. Using a classifier DNN, pedestrian movements can be tracked to determine if the pedestrian is in the camera sensor’s field of view. The autonomous vehicle’s steering and throttle can then be adjusted by a regression DNN. These DNNs also incorporate grouped convolutions to boost model performance.

In this paper, Graphcore Research and Ford Motor Company leverage the fine-grain compute capabilities of the Graphcore IPU to minimise time-to-train for these Hierarchical Deep Neural Networks.

University of Bristol: Studying the potential of Graphcore IPUs for applications in Particle Physics

Lakshan Ram Madhan Mohan, Alexander Marshall, Samuel Maddrell-Mander, Daniel O'Hanlon, Konstantinos Petridis, Jonas Rademacker, Victoria Rege, Alexander Titterton

This paper presents the first study of Graphcore's Intelligence Processing Unit (IPU) in the context of particle physics applications. 

Comparisons are made for neural-network-based event simulation, multiple-scattering correction, and flavour tagging, implemented on IPUs, GPUs and CPUs, using a variety of neural network architectures and hyperparameters. Additionally, a Kálmán filter for track reconstruction is implemented with promising results.

Imperial College London: Bundle Adjustment on a Graph Processor

Joseph Ortiz, Mark Pupilli, Stefan Leutenegger, Andrew J. Davison

This paper shows for the first time that the classical computer vision problem of bundle adjustment (BA) can be solved extremely fast on a graph processor such as Graphcore's Intelligence Processing Unit (IPU) using Gaussian Belief Propagation.

Gaussian Belief Propagation is an effective algorithmic framework for spatial AI problems where estimates are needed in real time with new measurements constantly being fed into the algorithm.

Qwant: Graphcore C2 Card performance for image-based deep learning application: A Report

Ilyes Kacher, Maxime Portaz, Hicham Randrianarivo, Sylvain Peyronnet

Graphcore's processor architecture has been designed to achieve state-of-the-art performance on current machine intelligence models for both training and inference.

In this paper, we report on a benchmark in which we have evaluated the performance of IPU processors on deep neural networks for inference. We focus on deep vision models such as ResNeXt. We report the observed latency, throughput and energy efficiency.

Citadel: Dissecting the Graphcore IPU Architecture via Microbenchmarking

Zhe Jia, Blake Tillman, Marco Maggioni, Daniele Paolo Scarpazza

This report focuses on the architecture and performance of the Intelligence Processing Unit (IPU), a novel, massively parallel platform introduced by Graphcore and aimed at Artificial Intelligence/Machine Learning (AI/ML) workloads.

The study dissects the IPU’s performance behavior using microbenchmarks that were crafted for the purpose.

Graphcore Research: Revisiting Small Batch Training for Deep Neural Networks

Dominic Masters, Carlo Luschi

The team at Graphcore Research addresses mini-batch stochastic gradient optimization of modern deep network architectures.

In this paper, we review common assumptions on learning rate scaling and training duration, as a basis for an experimental comparison of test performance for different mini-batch sizes. Our experiments show that small batch sizes produce the best results.
