<img height="1" width="1" style="display:none" src="https://www.facebook.com/tr?id=145304570664993&amp;ev=PageView&amp;noscript=1">

RESEARCH PAPERS

Generating QM1B with PySCFIPU

Generating QM1B with PySCFIPU

Alexander Mathiasen, Hatem Helal, Kerstin Klaser, Paul Balanca, Josef Dean, Carlo Luschi, Dominique Beaini, Andrew Fitzgibbon, Dominic Masters

The emergence of foundation models in Computer Vision and Natural Language Processing have resulted in immense progress on downstream tasks.
This progress was enabled by datasets with billions of training examples. Similar benefits are yet to be unlocked for quantum chemistry, where the potential of deep learning is constrained by comparatively small datasets with 100k to 20M training examples.These datasets are limited in size because the labels are computed using the accurate (but computationally demanding) predictions of Density Functional Theory (DFT). Notably, prior DFT datasets were created using CPU supercomputers without leveraging hardware acceleration. In this paper, we take a first step towards utilising hardware accelerators by introducing the data generator PySCFIPU using Intelligence Processing Units (IPUs). This allowed us to create the dataset QM1B with one billion training examples containing 9-11 heavy atoms. We demonstrate
that a simple baseline neural network (SchNet 9M) improves its performance by simply increasing the amount of training data without additional inductive biases.
To encourage future researchers to use QM1B responsibly, we highlight several limitations of QM1B and emphasise the low-resolution of our DFT options, which also serves as motivation for even larger, more accurate datasets.

Training and inference of large language models using 8-bit floating point

Training and inference of large language models using 8-bit floating point

Sergio P. Perez, Yan Zhang, James Briggs, Charlie Blake, Josh Levy-Kramer, Paul Balanca, Carlo Luschi, Stephen Barlow, Andrew William Fitzgibbon

FP8 formats are gaining popularity to boost the computational efficiency for training and inference of large deep learning models. Their main challenge is that a careful choice of scaling is needed to prevent degradation due to the reduced dynamic range compared to higher-precision formats. Although there exists ample literature about selecting such scalings for INT formats, this critical aspect has yet to be addressed for FP8. This paper presents a methodology to select the scalings for FP8 linear layers, based on dynamically updating per-tensor scales for the weights,gradients and activations. We apply this methodology to train and validate large language models of the type of GPT and Llama 2 using FP8, for model sizes ranging from 111M to 70B. To facilitate the understanding of the FP8 dynamics, our results are accompanied by plots of the per-tensor scale distribution for weights, activations and gradients during both training and inference.
Unit Scaling: Out-of-the-Box Low-Precision Training

Unit Scaling: Out-of-the-Box Low-Precision Training

Charlie Blake, Douglas Orr, Carlo Luschi

Unit scaling is a paradigm for designing deep learning models that simplifies the use of low-precision number formats. Training in FP16 or the recently proposed FP8 formats offers substantial efficiency gains, but can lack sufficient range for out-of-the-box training. Unit scaling addresses this by introducing a principled approach to model numerics: seeking unit variance of all weights, activations and gradients at initialisation. Unlike alternative methods, this approach neither requires multiple training runs to find a suitable scale nor has significant computational overhead. We demonstrate the efficacy of unit scaling across a range of models and optimisers. We further show that existing models can be adapted to be unit-scaled, training BERT-Large in FP16 and then FP8 with no degradation in accuracy.
GPS++: Reviving the Art of Message Passing for Molecular Property Prediction

GPS++: Reviving the Art of Message Passing for Molecular Property Prediction

Dominic Masters, Josef Dean, Kerstin Klaser, Zhiyi Li, Sam Maddrell-Mander, Adam Sanders, Hatem Helal, Deniz Beker, Andrew Fitzgibbon, Shenyang Huang, Ladislav Rampášek, Dominique Beaini

GPS++ is a hybrid Message Passing Neural Network / Graph Transformer model for molecular property prediction. Our model integrates a well-tuned local message passing component and biased global attention with other key ideas from prior literature to achieve state-of-the-art results on large-scale molecular dataset PCQM4Mv2. Through a thorough ablation study we highlight the impact of individual components and find that nearly all of the model's performance can be maintained without any use of global self-attention, showing that message passing is still a competitive approach for 3D molecular property prediction despite the recent dominance of graph transformers. We also find that our approach is significantly more accurate than prior art when 3D positional information is not available.
PopSparse: Accelerated block sparse matrix multiplication on IPU

PopSparse: Accelerated block sparse matrix multiplication on IPU

Zhiyi Li, Douglas Orr, Valeriu Ohan, Godfrey Da costa, Tom Murray, Adam Sanders, Deniz Beker, Dominic Masters

Reducing the computational cost of running large scale neural networks using sparsity has attracted great attention in the deep learning community. While much success has been achieved in reducing FLOP and parameter counts while maintaining acceptable task performance, achieving actual speed improvements has typically been much more difficult, particularly on general purpose accelerators (GPAs) such as NVIDIA GPUs using low precision number formats. In this work we introduce PopSparse, a library that enables fast sparse operations on Graphcore IPUs by leveraging both the unique hardware characteristics of IPUs as well as any block structure defined in the data. We target two different types of sparsity: static, where the sparsity pattern is fixed at compile-time; and dynamic, where it can change each time the model is run. We present benchmark results for matrix multiplication for both of these modes on IPU with a range of block sizes, matrix sizes and densities. Results indicate that the PopSparse implementations are faster than dense matrix multiplications on IPU at a range of sparsity levels with large matrix size and block size. Furthermore, static sparsity in general outperforms dynamic sparsity. While previous work on GPAs has shown speedups only for very high sparsity (typically 99\% and above), the present work demonstrates that our static sparse implementation outperforms equivalent dense calculations in FP16 at lower sparsity (around 90%).
SPARQ ATTENTION: BANDWIDTH-EFFICIENT LLM INFERENCE

SPARQ ATTENTION: BANDWIDTH-EFFICIENT LLM INFERENCE

Luka Ribar Ivan Chelombiev Luke Hudlass-Galley Charlie Blake Carlo Luschi Douglas Orr

Generative large language models (LLMs) have opened up numerous novel possibilities, but due to their significant computational requirements their ubiquitous use remains challenging. Some of the most useful applications require processing large numbers of samples at a time and using long contexts, both significantly increasing the memory communication load of the models. We introduce SparQ Attention, a technique for increasing the inference throughput of LLMs by reducing the memory bandwidth requirements within the attention blocks through selective fetching of the cached history. Our proposed technique can be applied directly to off-the-shelf LLMs during inference, without requiring any modification to the pre-training setup or additional fine-tuning. We show how SparQ Attention can decrease the attention memory bandwidth requirements up to eight times without any loss in accuracy by evaluating Llama 2 and Pythia models on a wide range of downstream tasks.

Characterizing the Performance of Triangle Counting on Graphcore's IPU Architecture

Characterizing the Performance of Triangle Counting on Graphcore's IPU Architecture

Reet Barik, Siddhisanket Raskar, Murali Emani, Venkatram Vishwanath, Authors Info & Claims

In recent years, we have seen an emergence of novel spatial architectures to accelerate domain-specific workloads like Machine Learning. There is a need to investigate their performance characteristics for traditional HPC workloads for their tighter integration with current and future heterogeneous compute resources. In this work, we implement, optimize and evaluate a parallel algorithm for Triangle Counting for graphs in Bulk Synchronous Parallel (BSP) model for Graphcore’s IPU architecture as well as discuss lessons learned. This study demonstrates IPU’s competency in handling such irregular workloads by providing an average speedup of up to 5.3x over NVIDIA A100 GPU on real-world datasets.
Harnessing Manycore Processors with Distributed Memory for Accelerated Training of Sparse and Recurrent Models

Harnessing Manycore Processors with Distributed Memory for Accelerated Training of Sparse and Recurrent Models

Jan Finkbeiner, Thomas Gmeinder, Mark Pupilli, Alexander Titterton, Emre Neftci

Current AI training is dominated by GPUs which excel at parallel workloads but struggle with sparse, recurrent models. This limits progress in more efficient AI. MIMD architectures like IPUs better support sparse models. We implement sparse, recurrent Spiking Neural Networks (SNNs) on IPUs. SNNs have binary, sparse activations. For training, we use backpropagation through time. We achieve 5-10x speedups over GPUs, up to 38x for high sparsity, with no loss in accuracy. Scales well to larger models. This demonstrates IPUs' advantage for sparse models, enabling more efficient AI beyond standard GPU-based training.
Evaluating Emerging AI/ML Accelerators: IPU, RDU, and NVIDIA/AMD GPUs

Evaluating Emerging AI/ML Accelerators: IPU, RDU, and NVIDIA/AMD GPUs

Hongwu Peng, Caiwen Ding, Tong Geng, Sutanay Choudhury, Kevin Barker, Ang Li

This research evaluates and compares emerging AI/ML hardware accelerators like the IPU, RDU, and enhanced GPUs. These accelerators use innovative dataflow architectures to deliver superior performance on AI workloads beyond traditional processors. Through benchmarking common DNN operators and other AI tasks, we analyze the strengths and tradeoffs of each platform. The findings provide insights into current accelerator capabilities and guide future accelerator development and design for advancing AI/ML applications. Our analysis aims to further overall understanding of specialized hardware needed to meet the growing demands of AI/ML.
Towards Foundational Models for Molecular Learning on Large-Scale Multi-Task Datasets

Towards Foundational Models for Molecular Learning on Large-Scale Multi-Task Datasets

Dominique Beaini, Shenyang Huang, Joao Alex Cunha, Gabriela Moisescu-Pareja, Oleksandr Dymov, Samuel Maddrell-Mander, Callum McLean, Frederik Wenkel, Luis Müller, Jama Hussein Mohamud, Ali Parviz, Michael Craig, Michał Koziarski, Jiarui Lu, Zhaocheng Zhu, Cristian Gabellini, Kerstin Klaser, Josef Dean, Cas Wognum, Maciej Sypetkowski, Guillaume Rabusseau, Reihaneh Rabbany, Jian Tang, Christopher Morris, Ioannis Koutis, Mirco Ravanelli, Guy Wolf, Prudencio Tossou, Hadrien Mary, Therence Bois, Andrew Fitzgibbon, Błażej Banaszewski, Chad Martin, Dominic Masters

Recently, pre-trained foundation models have enabled significant advancements in multiple fields. In molecular machine learning, however, where datasets are often hand-curated, and hence typically small, the lack of datasets with labeled features, and codebases to manage those datasets, has hindered the development of foundation models. In this work, we present seven novel datasets categorized by size into three distinct categories: ToyMix, LargeMix and UltraLarge. These datasets push the boundaries in both the scale and the diversity of supervised labels for molecular learning. They cover nearly 100 million molecules and over 3000 sparsely defined tasks, totaling more than 13 billion individual labels of both quantum and biological nature. In comparison, our datasets contain 300 times more data points than the widely used OGB-LSC PCQM4Mv2 dataset, and 13 times more than the quantum-only QM1B dataset. 

Charité Universitätsmedizin, Cornell University, Lawrence Berkeley National Laboratory, Simula Research: Space Efficient Sequence Alignment for SRAM-Based Computing: X-Drop on the Graphcore IPU

Charité Universitätsmedizin, Cornell University, Lawrence Berkeley National Laboratory, Simula Research: Space Efficient Sequence Alignment for SRAM-Based Computing: X-Drop on the Graphcore IPU

Luk Burchard, Max Xiaohang Zhao, Johannes Langguth, Aydın Buluç & Giulia Guidi

The sequence alignment problem is fundamental in bioinformatics; we have implemented the 𝑋 -Drop algorithm, a heuristic method for pairwise alignment that reduces search space, on the Graphcore IPU. Our implementation achieves 10× speedup over a state-of-the-art GPU implementation and up to 4.65× compared to CPU.
Graphcore & PNNL: Extreme Acceleration of Graph Neural Network-based Prediction Models for Quantum Chemistry

Graphcore & PNNL: Extreme Acceleration of Graph Neural Network-based Prediction Models for Quantum Chemistry

Hatem Helal, Jesun Firoz, Jenna Bilbrey, Mario Michael Krell, Tom Murray, Ang Li, Sotiris Xantheas, Sutanay Choudhury

This paper demonstrates a novel hardware-software co-design approach to scale up the training of graph neural networks for molecular property prediction.

We introduce an algorithm that can reduce the training time of such molecular property prediction models from days to less than two hours, opening new possibilities for AI-driven scientific discovery.

Graphcore, Valence, MILA: GPS++: An Optimised Hybrid GNN/Transformer for Molecular Property Prediction

Graphcore, Valence, MILA: GPS++: An Optimised Hybrid GNN/Transformer for Molecular Property Prediction

Dominic Masters, Josef Dean, Kerstin Klaser, Zhiyi Li, Sam Maddrell-Mander, Adam Sanders, Hatem Helal, Deniz Beker, Ladislav Rampášek, Dominique Beaini

We present GPS++, a hybrid Message Passing Neural Network / Graph Transformer model for molecular property prediction. Our model integrates a well-tuned local message passing component and biased global attention with other key ideas from prior literature to achieve state-of-the-art results on large-scale molecular dataset PCQM4Mv2. Through a thorough ablation study we highlight the impact of individual components and, contrary to expectations set by recent trends, find that nearly all of the model's performance can be maintained without any use of global self-attention. We also show that our approach is significantly more accurate than prior art when 3D positional information is not available.

Graphcore: BESS: Balanced Entity Sampling and Sharing for Large-Scale Knowledge Graph Completion

Graphcore: BESS: Balanced Entity Sampling and Sharing for Large-Scale Knowledge Graph Completion

Alberto Cattaneo, Daniel Justus, Harry Mellor, Douglas Orr, Jerome Maloberti, Zhenying Liu, Thorin Farnsworth, Andrew Fitzgibbon, Blazej Banaszewski, Carlo Luschi

We present the award-winning submission of the WikiKG90Mv2 track of OGB-LSC@NeurIPS 2022. The task is link-prediction on the large-scale knowledge graph WikiKG90Mv2, consisting of 90M+ nodes and 600M+ edges. Our solution uses a diverse ensemble of 85 Knowledge Graph Embedding models combining five different scoring functions (TransE, TransH, RotatE, DistMult, ComplEx) and two different loss functions (log-sigmoid, sampled softmax cross-entropy).

Our final model achieved 1st place with a validation MRR of 0.2922 and a test-challenge MRR of 0.2562.

Graphcore, PNNL, IBM Research, University of Washington: Reducing Down(stream)time: Pretraining Molecular GNNs using Heterogeneous AI Accelerators

Graphcore, PNNL, IBM Research, University of Washington: Reducing Down(stream)time: Pretraining Molecular GNNs using Heterogeneous AI Accelerators

Jenna A. Bilbrey, Kristina M. Herman, Henry Sprueill, Sotiris S. Xantheas, Payel Das, Manuel Lopez Roldan, Mike Kraus, Hatem Helal, Sutanay Choudhury

We demonstrate finetuning for downstream tasks on a graph neural network (GNN) trained over a molecular database containing 2.7 million water clusters.

The use of Graphcore IPUs for training molecular GNNs reduces training time from a reported 2.7 days on 0.5M clusters to 1.2 hours on 2.7M clusters. Finetuning the pretrained model for downstream tasks of molecular dynamics and transfer to a different potential energy surface took only 8.3 hours and 28 minutes, respectively, on a single GPU.

Texas A&M University & Graphcore: Benchmarking the Performance of Accelerators on National Cyberinfrastructure Resources for Artificial Intelligence / Machine Learning Workloads

Texas A&M University & Graphcore: Benchmarking the Performance of Accelerators on National Cyberinfrastructure Resources for Artificial Intelligence / Machine Learning Workloads

Abhinand S. Nasari, Tim Cockerill, Hieu T. Le, Richard Lawrence, Zhenhua He, Xin Yang, Mario M. Krell, Alex Tsyplikhin, Mahidhar Tatineni, Lisa M. Perez, Dhruva K. Chakravorty, Honggao Liu

This papers compares the performance of two different architectures: the commonly used GPU and the new generation of Intelligence Processing Units (IPUs), by running training benchmarks on national cyberinfrastructure resources of common AI/ML models. 
Graphcore Research: 8-bit Numerical Formats for Deep Neural Networks

Graphcore Research: 8-bit Numerical Formats for Deep Neural Networks

Badreddine Noune, Philip Jones, Daniel Justus, Dominic Masters, Carlo Luschi

Given the current trend of increasing size and complexity of machine learning architectures, it has become of critical importance to identify new approaches to improve the computational efficiency of model training. In this context, we address the advantages of floating-point over fixed-point representation, and present an in-depth study on the use of 8-bit floating-point number formats for activations, weights, and gradients for both training and inference. We explore the effect of different bit-widths for exponents and significands and different exponent biases. 
Microsoft Research & Graphcore: Confidential Machine Learning within Graphcore IPUs

Microsoft Research & Graphcore: Confidential Machine Learning within Graphcore IPUs

Kapil Vaswani, Stavros Volos, Cédric Fournet, Antonio Nino Diaz, Ken Gordon, Balaji Vembu, Sam Webster, David Chisnall, Saurabh Kulkarni, Graham Cunningham, Richard Osbourne, Dan Wilkinson

This paper presents IPU Trusted Extensions (ITX), a set of experimental hardware extensions that enable trusted execution environments in Graphcore's IPUs.

Its evaluation on a development board using standard DNN training workloads suggests that ITX adds less than 5% performance overhead, and delivers up to 17x better performance compared to CPU-based confidential computing systems relying on AMD SEV-SNP.

Imperial College London: Incremental Abstraction in Distributed Probabilistic SLAM Graphs

Imperial College London: Incremental Abstraction in Distributed Probabilistic SLAM Graphs

Joseph Ortiz, Talfan Evans, Edgar Sucar, Andrew Davison

The Robot Vision Laboratory at Imperial College London propose a method for efficient incremental construction of probabilistic scene graphs from monocular input based on two novel components. Firstly, an incremental scene abstraction framework combing amortized inference with probabilistic inference and secondly, a routing procedure that enables inference on dynamic graphs with GBP leveraging the parallelism of the Graphcore IPU.

This paper demonstrates the advantage of GBP over direct methods for complex factor graphs due to the structure-agnostic time per iteration. 

Imperial College London - Dyson Robotics Laboratory: From Scene Flow to Visual Odometry through Local and Global Regularisation in Markov Random Fields

Imperial College London - Dyson Robotics Laboratory: From Scene Flow to Visual Odometry through Local and Global Regularisation in Markov Random Fields

Raluca Scona, Hidenobu Matsuki, Andrew Davison

This paper revisits pairwise Markov Random Field (MRF) formulations for RGB-D scene flow and leverage novel advances in processor design for real-time implementations.

Dyson Robotics Lab show that visual odometry and non-rigid scene flow can be unified into a single joint factor graph, and optimised highly efficiently with Gaussian Belief Propagation on the Graphcore IPU by leveraging the processor's distributed per-tile memory and ultrafast all-to-all communication fabric.

Graphcore: A Fast Hardware Pseudorandom Number Generator Based on xoroshiro128

Graphcore: A Fast Hardware Pseudorandom Number Generator Based on xoroshiro128

James Hanlon, Stephen Felix

The IPU contains an original pseudorandom number generator (PRNG) called xoroshiro128aox, based on the F2-linear generator xoroshiro128. It is designed for cheap hardware implementation and high-quality statistical randomness.

We assess the generator's quality using standard statistical test suites and compare results against PRNGs xoroshiro128+, pcg64 and philox4x32-10. We show that xoroshiro128aox mitigates a known weakness in xoroshiro128+ with a new 'AOX' output function by passing the BigCrush and PractRand suites.

We conclude that the non-uniformities and inherited linear artefacts are hard to detect, and so xoroshiro128aox provides a good trade-off between statistical quality and hardware implementation cost.

Stanford University & Graphcore: NanoBatch DPSGD: Exploring Differentially Private learning on ImageNet with low batch sizes on the IPU

Stanford University & Graphcore: NanoBatch DPSGD: Exploring Differentially Private learning on ImageNet with low batch sizes on the IPU

Edward H. Lee, Mario Michael Krell, Alexander Tsyplikhin, Victoria Rege, Errol Colak, Kristen W. Yeom

Differentially private SGD (DPSGD) has recently shown promise in deep learning. However, compared to non-private SGD, the DPSGD algorithm places computational overheads that can undo the benefit of batching in GPUs.

In our work, we argue that low batch sizes using group normalization on ResNet-50 can yield high accuracy and privacy on Graphcore IPUs. This enables DPSGD training of ResNet-50 on ImageNet in just 6 hours (100 epochs) on an IPU-POD16 system.

Université de Paris: Comparison of Graphcore IPUs and NVIDIA GPUs for cosmology applications

Université de Paris: Comparison of Graphcore IPUs and NVIDIA GPUs for cosmology applications

Bastien Arcelin

This paper represents the first investigation of the suitability and performance of Graphcore Intelligence Processing Units (IPUs) for deep learning applications in cosmology. It presents the benchmark between a Nvidia V100 GPU and a Graphcore Mk1 (GC2) IPU on three cosmological use cases: a classical deep neural network and a Bayesian neural network (BNN) for galaxy shape estimation, and a generative network for galaxy images production.

The results show that IPUs can accelerate various cosmology applications, outperforming GPUs in some cases by as much as 4x faster time to train.

Graphcore Research: Dynamic Sparse Pre-Training of BERT

Graphcore Research: Dynamic Sparse Pre-Training of BERT

Anastasia S. D. Dietrich, Frithjof Gressmann, Douglas Orr, Ivan Chelombiev, Daniel Justus, Carlo Luschi

In this work, we develop and study a simple, dynamic always-sparse pre-training approach for BERT language models, which leverages periodic compression steps based on magnitude pruning followed by random parameter re-allocation.

As a result, we achieve Pareto improvements in terms of number of FLOPs over both static and dense baselines across model sizes. Furthermore, we demonstrate that training remains FLOP-efficient when using coarse-grained block sparsity, making it particularly promising for efficient execution on modern hardware accelerators.

Graphcore: Packing: Towards 2x NLP BERT Acceleration

Graphcore: Packing: Towards 2x NLP BERT Acceleration

Matej Kosec, Sheng Fu, Mario Michael Krell

By using a new packing algorithm, Graphcore engineers have sped up Natural Language Processing by more than 2 times while training BERT-Large. Our new packing technique removes padding, enabling significantly more efficient computation. We suspect this could also be applied to genomics and protein folding models and other models with skewed length distributions to make a much broader impact in different industries and applications.

We introduce Graphcore's highly efficient Non-Negative Least Squares Histogram-Packing algorithm (or NNLSHP) as well as our BERT algorithm applied to packed sequences in a new paper.

Simula: iPUG: Accelerating Breadth-First Graph Traversals Using Manycore Graphcore IPUs

Simula: iPUG: Accelerating Breadth-First Graph Traversals Using Manycore Graphcore IPUs

Luk Burchard, Johannes Moe, Daniel Thilo Schroeder, Konstantin Pogorelov, Johannes Langguth

This paper aims to test the IPU’s suitability for algorithms with hard-to-predict memory accesses by implementing a breadth-first search (BFS) that complies with the Graph500 specifications. Precisely because of its apparent simplicity, BFS is an established benchmark that is not only subroutine for a variety of more complex graph algorithms, but also allows comparability across a wide range of architectures.

The results indicate that the IPU delivers speedups of up to 4× over the fastest competing result on an NVIDIA V100 GPU, with typical speedups of about 1.5× on most test instances.

Graphcore Research: GroupBERT - Enhanced Transformer Architecture with Efficient Grouped Structures

Graphcore Research: GroupBERT - Enhanced Transformer Architecture with Efficient Grouped Structures

Ivan Chelombiev, Daniel Justus, Douglas Orr, Anastasia Dietrich, Frithjof Gressmann, Alexandros Koliousis, Carlo Luschi

Attention based language models have become a critical component in state-of-the-art NLP systems. However, these models have significant computational requirements, due to long training times, dense operations and large parameter count.

In this paper, Graphcore Research demonstrate a set of modifications to the structure of a Transformer layer, producing a more efficient architecture. This architecture is applied to language representation learning and demonstrates a superior performance compared to BERT models of different scales. This results in improved efficiency, both in terms of floating-point operations (FLOPs) and time-to-train.

Oxford-Man Institute & University of Oxford: Multi-Horizon Forecasting for Limit Order Books: Novel Deep Learning Approaches and Hardware Acceleration using Intelligent Processing Units

Oxford-Man Institute & University of Oxford: Multi-Horizon Forecasting for Limit Order Books: Novel Deep Learning Approaches and Hardware Acceleration using Intelligent Processing Units

Zihao Zhang, Stefan Zohren

Researchers at the Oxford-Man Institute of Quantitative Finance have used Graphcore’s Intelligence Processing Unit (IPU) to dramatically accelerate the training of advanced price prediction models, using techniques which are typically plagued by computational bottlenecks when run on other types of processor.

The IPU’s designed-for-AI architecture allowed the OMI team to reduce the training times for their multi-horizon forecasting models to the point where they could deliver significant commercial advantage by more accurately estimating market price movements. Such models can be used in the development of alpha for fast trading and in market making strategies.

Graphcore Research: Proxy-Normalizing Activations to Match Batch Normalization while Removing Batch Dependence

Graphcore Research: Proxy-Normalizing Activations to Match Batch Normalization while Removing Batch Dependence

Antoine Labatie, Dominic Masters, Zach Eaton-Rosen, Carlo Luschi

We investigate the reasons for the performance degradation incurred with batch-independent normalization. We find that the prototypical techniques of layer normalization and instance normalization both induce the appearance of failure modes in the neural network's pre-activations: (i) layer normalization induces a collapse towards channel-wise constant functions; (ii) instance normalization induces a lack of variability in instance statistics, symptomatic of an alteration of the expressivity.

To alleviate failure mode (i) without aggravating failure mode (ii), we introduce the technique "Proxy Normalization" that normalizes post-activations using a proxy distribution. When combined with layer normalization or group normalization, this batch-independent normalization emulates batch normalization's behavior and consistently matches or exceeds its performance.

Graphcore Research: Making EfficientNet More Efficient: Exploring Batch-Independent Normalization, Group Convolutions and Reduced Resolution Training

Graphcore Research: Making EfficientNet More Efficient: Exploring Batch-Independent Normalization, Group Convolutions and Reduced Resolution Training

Dominic Masters, Antoine Labatie, Zach Eaton-Rosen,

Graphcore Research examines three methods for optimising state-of-the-art computer vision model EfficientNet’s performance on Intelligence Processing Units (IPUs), in a new paper. These approaches are :(i) generalising depthwise convolutions to group convolutions; (ii) adding proxy-normalized activations
to match batch normalization performance with batch-independent statistics; (iii) reducing compute by lowering the training resolution and inexpensively fine-tuning at higher resolution.

By combining all three techniques, IPUs delivered accelerations of up to 7x on training and more than 3.6x on inference.

University of Bristol: Using the Graphcore IPU for traditional HPC applications

University of Bristol: Using the Graphcore IPU for traditional HPC applications

Thorben Louw, Simon McIntosh-Smith

The increase in ML workloads means that AI accelerators are expected to become common in supercomputers, evoking considerable interest in the scientific HPC community about how these devices might also be exploited for traditional HPC workloads.

In this paper, we report our early results using Graphcore's IPU for stencil computations on structured grid problems, which are used for solvers for differential equations in domains such as computational fluid dynamics. We demonstrate that the IPU and its low-level programming framework, Poplar, expose sufficient programmability to express these HPC problems, and achieve performance comparable to that of modern GPUs.

Graphcore & UMass Amherst: Accelerating Simulation-based Inference with Emerging AI Hardware

Graphcore & UMass Amherst: Accelerating Simulation-based Inference with Emerging AI Hardware

Sourabh Kulkarni, Alexander Tsyplikhin, Mario Michael Krell, Csaba Andras Moritz

In this work, we explore hardware accelerated simulation-based inference over probabilistic models, by combining massively parallelized ABC inference algorithm with the cutting-edge AI chip solutions that are uniquely suited for this purpose. As a proof-of-concept, we demonstrate inference over a probabilistic epidemiology model used to predict the spread of COVID-19. Two hardware acceleration platforms are compared - the Tesla V100 GPU and the Graphcore Mk1 IPU. Our results show that while both of these platforms outperform multi-core CPUs, the Mk1 IPUs are 7.5x faster than the Tesla V100 GPUs for this workload.

Google Research, UC Berkeley & Graphcore Research: Parallel Training of Deep Networks with Local Updates

Google Research, UC Berkeley & Graphcore Research: Parallel Training of Deep Networks with Local Updates

Michael Laskin, Luke Metz, Seth Nabarro, Mark Saroufim, Badreddine Noune, Carlo Luschi, Jascha Sohl-Dickstein, Pieter Abbeel

In this paper, we investigate how to continue scaling compute efficiently beyond the point of diminishing returns for large batches through local parallelism, a framework which parallelizes training of individual layers in deep networks by replacing global backpropagation with truncated layer-wise backpropagation. Local parallelism enables fully asynchronous layer-wise parallelism with a low memory footprint, and requires little communication overhead compared with model parallelism. We show results in both vision and language domains across a diverse set of architectures, and find that local parallelism is particularly effective in the high-compute regime.

Graphcore Research: Improving Neural Network Training in Low Dimensional Random Bases

Graphcore Research: Improving Neural Network Training in Low Dimensional Random Bases

Frithjof Gressmann, Zach Eaton-Rosen, Carlo Luschi

Graphcore Research is exploring novel ways to train neural networks that could allow us to scale to substantially larger models in future.

In this paper, we revisit a simple approach to reduce the effective network dimensionality using random projections. We leverage the hardware-accelerated random number generation of the IPU to train in randomly selected directions of the weight space. Applying smaller independent random projections to different parts of the network and re-drawing them at every step significantly improves the obtained accuracy.

Graphcore & Ford: A Follow-The-Leader Strategy using Hierarchical Deep Neural Networks with Grouped Convolutions

Graphcore & Ford: A Follow-The-Leader Strategy using Hierarchical Deep Neural Networks with Grouped Convolutions

José Solomon, François Charette

A follow-the-leader strategy can be implemented using a hierarchical Deep Neural Network (DNN) end-to-end driving model to match the direction and speed of a target pedestrian. Using a classifier DNN, pedestrian movements can be tracked to determine if the pedestrian is in the camera sensor’s field of view. The autonomous vehicle’s steering and throttle can then be adjusted by a regression DNN. These DNNs also incorporate grouped convolutions to boost model performance.

In this paper, Graphcore Research and Ford Motor Company leverage the fine-grain compute capabilities of the Graphcore IPU to minimise time-to-train for these Hierarchical Deep Neural Networks.

University of Bristol: Studying the potential of Graphcore IPUs for applications in Particle Physics

University of Bristol: Studying the potential of Graphcore IPUs for applications in Particle Physics

Lakshan Ram Madhan Mohan, Alexander Marshall, Samuel Maddrell-Mander, Daniel O'Hanlon, Konstantinos Petridis, Jonas Rademacker, Victoria Rege, Alexander Titterton

This paper presents the first study of Graphcore's Intelligence Processing Unit (IPU) in the context of particle physics applications. 

Comparisons are made for neural-network-based event simulation, multiple-scattering correction, and flavour tagging, implemented on IPUs, GPUs and CPUs, using a variety of neural network architectures and hyperparameters. Additionally, a Kálmán filter for track reconstruction is implemented with promising results.

Imperial College London: Bundle Adjustment on a Graph Processor

Imperial College London: Bundle Adjustment on a Graph Processor

Joseph Ortiz, Mark Pupilli, Stefan Leutenegger, Andrew J. Davison

This paper shows for the first time that the classical computer vision problem of bundle adjustment (BA) can be solved extremely fast on a graph processor such as Graphcore's Intelligence Processing Unit (IPU) using Gaussian Belief Propagation.

Gaussian Belief Propagation is an effective algorithmic framework for spatial AI problems where estimates are needed in real time with new measurements constantly being fed into the algorithm.

Qwant: Graphcore C2 Card performance for image-based deep learning application: A Report

Qwant: Graphcore C2 Card performance for image-based deep learning application: A Report

Ilyes Kacher, Maxime Portaz, Hicham Randrianarivo, Sylvain Peyronnet

Graphcore's architecture of the processor has been designed to achieve state of the art performance on current machine intelligence models for both training and inference.

In this paper, we report on a benchmark in which we have evaluated the performance of IPU processors on deep neural networks for inference. We focus on deep vision models such as ResNeXt. We report the observed latency, throughput and energy efficiency.

Citadel: Dissecting the Graphcore IPU Architecture via Microbenchmarking

Citadel: Dissecting the Graphcore IPU Architecture via Microbenchmarking

Zhe Jia, Blake Tillman, Marco Maggioni, Daniele Paolo Scarpazza

This report focuses on the architecture and performance of the Intelligence Processing Unit (IPU), a novel, massively parallel platform introduced by Graphcore and aimed at Artificial Intelligence/Machine Learning (AI/ML) workloads.

The study dissects the IPU’s performance behavior using microbenchmarks that were crafted for the purpose.

Graphcore Research: Revisiting Small Batch Training for Deep Neural Networks

Graphcore Research: Revisiting Small Batch Training for Deep Neural Networks

Dominic Masters, Carlo Luschi

The team at Graphcore Research addresses mini-batch stochastic gradient optimization of modern deep network architectures.

In this paper, we review common assumptions on learning rate scaling and training duration, as a basis for an experimental comparison of test performance for different mini-batch sizes. Our experiments show that small batch sizes produce the best results.