Written by Carlo Luschi
Posted Jan 20, 2020
Before we dive into what we’re working on this year, here’s a quick recap of some of our biggest achievements in 2019.
Graphcore Research in 2019
In November, we released new IPU benchmarks highlighting our state-of-the-art performance on many of today’s most advanced models, including BERT and ResNeXt. We also made significant headway with our probabilistic modelling research, achieving outstanding results on IPU when combining MCMC and variational inference, and 26x performance gains for our finance customers with MCMC modelling. With the launch of our IPU Preview on Microsoft Azure and IPU Server with Dell EMC last year, we’re now in the fortunate position of being able to use what we have learned to give practical support throughout 2020 to AI innovators working with IPUs, which is very exciting.
Research Directions for 2020
As we presented at NeurIPS in December, during 2020 we’re looking at six key research areas:
- Arithmetic Efficiency
- Memory Efficient Training
- Distributed Learning in Large-Scale Machines
- Sparse Structures for Training of Overparametrized Models
- Neural Architecture Search and Evolutionary Computing
- Self-Supervised Learning and Probabilistic Models
Arithmetic Efficiency
The use of low-precision numerical formats is an effective means to improve computational efficiency and speed up machine intelligence computation. Over the last few years, lower precision floating point formats have become a valuable tool for reducing memory requirements and speeding up model training. The state of the art until very recently has been mixed precision training with 16-bit floating point. At Graphcore Research, we have been working on mixed precision training with more aggressive numerical formats such as float-16 and float-8. We believe that progress in this field will be fundamental to improving both learning and efficiency during training.
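To make the trade-off concrete, here is a minimal sketch of why low-precision formats need care. It rounds a small gradient value to float-16 and shows that it underflows to zero unless it is scaled first. The loss-scaling step is a widely used mixed precision technique that we bring in purely for illustration; the helper `to_float16`, the gradient value and the scale factor are all assumptions, not a description of the IPU's arithmetic.

```python
import struct

def to_float16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

grad = 1e-8             # illustrative gradient, smaller than float-16 can resolve
loss_scale = 2.0 ** 16  # hypothetical scale factor (a power of two, so scaling is exact)

naive = to_float16(grad)                # stored directly in float-16: rounds to zero
scaled = to_float16(grad * loss_scale)  # scaled first: lands in the representable range
recovered = scaled / loss_scale         # unscaled in higher precision before the update

print(naive, recovered)
```

The unscaled value is lost entirely, while the scaled path recovers the gradient to within float-16 rounding error.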
Memory Efficient Training
There are additional methods that make it possible to consume less memory and further speed up training. We have been working on activation recomputation to reduce memory requirements when training deep neural networks. This involves storing only a subset of layer activations during the forward pass and then using reversible blocks to recompute, during backpropagation, the activations that were not stored.
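The idea of recomputation can be sketched on a toy chain of scalar layers. The example below uses plain checkpointing (keep every `every`-th layer input, recompute the rest segment by segment) rather than reversible blocks, and every name in it is hypothetical. It is only meant to show that the gradient is unchanged while fewer activations are held between the forward and backward passes.

```python
import math

def layer(w, x):
    return math.tanh(w * x)

def layer_grad_x(w, x):
    # d/dx tanh(w * x) = w * (1 - tanh(w * x)^2)
    t = math.tanh(w * x)
    return w * (1.0 - t * t)

def grad_full_storage(weights, x0):
    """Gradient of the output w.r.t. x0, storing every layer input."""
    inputs, x = [], x0
    for w in weights:
        inputs.append(x)
        x = layer(w, x)
    g = 1.0
    for i in reversed(range(len(weights))):
        g *= layer_grad_x(weights[i], inputs[i])
    return g, len(inputs)

def grad_with_recompute(weights, x0, every):
    """Same gradient, storing only every `every`-th layer input."""
    n, stored, x = len(weights), {}, x0
    for i, w in enumerate(weights):
        if i % every == 0:
            stored[i] = x  # checkpoint: the input to layer i
        x = layer(w, x)
    g = 1.0
    for start in reversed(range(0, n, every)):
        end = min(start + every, n)
        # Recompute the inputs of this segment from its checkpoint.
        # This temporary buffer holds at most `every` values at a time.
        seg = [stored[start]]
        for i in range(start, end - 1):
            seg.append(layer(weights[i], seg[-1]))
        for i in reversed(range(start, end)):
            g *= layer_grad_x(weights[i], seg[i - start])
    return g, len(stored)

weights = [0.5, -1.2, 0.8, 1.5, -0.7, 0.9, 1.1, -0.4]
g_full, n_full = grad_full_storage(weights, 0.3)
g_ckpt, n_ckpt = grad_with_recompute(weights, 0.3, every=3)
```

The cost of the saving is one extra forward computation per segment, which is the usual memory-for-compute trade.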
We are also working on fundamental optimization algorithms for stochastic learning, addressing a number of topics including normalization techniques as well as training stability and generalization for small batch training.
Distributed Learning in Large-Scale Machines
Graphcore Research is working on multiple strategies in this field, including model-/data-parallel distributed optimization. To accelerate training, processing is distributed over a certain number of parallel workers. Conventionally, this means using an approach based on model and data parallelism, employing a larger and larger number of parallel processors to reduce training time. The problem with this method is that increasing the number of workers requires the use of progressively larger batch sizes, reducing the efficiency of SGD and similar optimization algorithms.
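The link between worker count and batch size can be seen in a small data-parallel sketch. It assumes a one-parameter model `y = w * x`, equal-sized shards and a simple average standing in for the all-reduce; all names are illustrative. Averaging the per-worker gradients gives exactly the gradient of one batch that is `num_workers` times larger, which is why adding workers at a fixed per-worker batch size grows the effective batch.

```python
def grad_mse(w, batch):
    """Gradient of the mean squared error of y = w * x over a batch."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)

def data_parallel_grad(w, data, num_workers):
    """Split the data evenly, compute local gradients, then average them."""
    shard = len(data) // num_workers  # assumes len(data) is divisible by num_workers
    shards = [data[i * shard:(i + 1) * shard] for i in range(num_workers)]
    local = [grad_mse(w, s) for s in shards]  # one gradient per worker
    return sum(local) / num_workers           # stands in for an all-reduce average

data = [(float(x), 3.0 * x) for x in range(1, 9)]  # 8 samples of y = 3x
w = 0.0
g_workers = data_parallel_grad(w, data, num_workers=4)  # 4 workers, 2 samples each
g_full = grad_mse(w, data)                              # one batch of 8 samples
```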
Though model-parallel training of extremely large models can be effective, eventually a point is reached at which increasing the number of parallel workers no longer corresponds to a speedup in training. One of the ways we are working to address this challenge is through multi-model training. Instead of training a large overparametrized model, we train multiple smaller models in parallel. Rather than exchanging weights, parameters or gradients, these models are trained virtually independently. This has been shown to provide a significant performance advantage.
Neural Architecture Search based on Evolutionary Computing
We are using evolutionary computing techniques to achieve a distributed implementation of both black box optimization and neural architecture search. Evolutionary computing is an essential tool for black box optimization, allowing us to naturally extend and parallelize training over a massive number of workers with reduced communication. This enables efficient search for meta-learning over a high-dimensional space.
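One way to see why evolutionary methods need so little communication is the following sketch of a simple evolution strategy. It is an illustration under our own assumptions, not our implementation: the objective `loss`, the helper `perturb` and all hyperparameters are hypothetical. Each simulated worker evaluates one perturbed candidate and reports only a scalar loss and the random seed that generated it, so no weights or gradients are exchanged.

```python
import random

def loss(params):
    """Hypothetical black-box objective: squared distance from a hidden target."""
    target = [1.0, -2.0, 0.5]
    return sum((p - t) ** 2 for p, t in zip(params, target))

def perturb(params, seed, sigma):
    """Rebuild a candidate from its seed; any worker can do this locally."""
    rng = random.Random(seed)
    return [p + sigma * rng.gauss(0.0, 1.0) for p in params]

def evolve(params, generations=200, workers=16, sigma=0.1):
    seed_source = random.Random(0)
    best_loss = loss(params)
    for _ in range(generations):
        seeds = [seed_source.randrange(2 ** 31) for _ in range(workers)]
        # Each worker evaluates one candidate and sends back (loss, seed) only.
        results = [(loss(perturb(params, s, sigma)), s) for s in seeds]
        cand_loss, cand_seed = min(results)
        if cand_loss < best_loss:
            # Everyone regenerates the winning candidate from its seed.
            params = perturb(params, cand_seed, sigma)
            best_loss = cand_loss
    return params, best_loss

start = [0.0, 0.0, 0.0]
final_params, final_loss = evolve(start)
```

Because a candidate is fully determined by the current parameters and a seed, the per-generation traffic is a pair of numbers per worker, regardless of model size.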
Sparse Structures for Training of Overparametrized Models
There is a recent and growing trend towards using larger and larger overparametrized networks that are easier to train and produce better results. The main issue with this is that the larger the model is, the longer it will take to train. One attractive solution is to train a sparse subnetwork of the large overparametrized model. We are developing this further by implementing a mechanism for the evolution of the sparse connectivity pattern during training.
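As an illustration of what evolving a sparse connectivity pattern can look like, the sketch below applies one prune-and-regrow step to a flat list of weights. It drops the smallest-magnitude active connections and activates the same number of previously inactive ones at random, so the overall sparsity stays fixed. This is one common scheme from the literature, chosen here as an assumption; it is not a description of the mechanism we are developing, and `evolve_mask` and its parameters are hypothetical.

```python
import random

def evolve_mask(weights, mask, drop_fraction, rng):
    """Drop the smallest-magnitude active weights and regrow as many elsewhere."""
    active = [i for i, m in enumerate(mask) if m]
    inactive = [i for i, m in enumerate(mask) if not m]
    n_drop = min(int(drop_fraction * len(active)), len(inactive))
    dropped = sorted(active, key=lambda i: abs(weights[i]))[:n_drop]
    grown = rng.sample(inactive, n_drop)  # chosen among positions inactive before pruning
    for i in dropped:
        mask[i] = False
        weights[i] = 0.0
    for i in grown:
        mask[i] = True
        weights[i] = 0.0  # new connections start at zero and are trained afterwards
    return weights, mask

rng = random.Random(0)
n = 20
mask = [i < 5 for i in range(n)]  # 25% of connections active
rng.shuffle(mask)
weights = [rng.uniform(-1.0, 1.0) if m else 0.0 for m in mask]

old_mask = list(mask)
active_before = sum(mask)
weights, mask = evolve_mask(weights, mask, drop_fraction=0.4, rng=rng)
```

In a training loop, a step like this would run periodically between ordinary gradient updates on the active weights.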
Self-Supervised Learning and Probabilistic Models
One of the most innovative directions of our research is unsupervised pre-training and, more specifically, self-supervised learning. Unsupervised training enables researchers to leverage the vast amounts of unlabelled data that many organizations hold for a variety of applications. In the case of self-supervised learning, probabilistic models can be trained to understand the structure of the data, such as in model-based reinforcement learning, where this approach can be used to learn the structure and dynamics of an environment.
Within the area of probabilistic modelling, we are currently focusing on the use of energy-based models as very effective generative models that can perform implicit sampling. This provides new opportunities to use advanced MCMC models for model-based reinforcement learning. In particular, the use of energy-based models has been shown to allow maximum entropy planning, which enables sampling and learning of multi-modal distributions over trajectories.
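For readers less familiar with energy-based models, here is a small sketch of sampling from one with MCMC. An energy function E(x) defines an unnormalized density proportional to exp(-E(x)), and a Metropolis sampler draws from it without ever computing the normalizing constant. The double-well energy, the step size and the function names are assumptions chosen to give a distribution with two modes; the example does not represent our models or the planning method mentioned above.

```python
import math
import random

def energy(x):
    """Hypothetical double-well energy with minima at x = -1 and x = +1."""
    return (x * x - 1.0) ** 2

def metropolis(n_samples, step=1.0, x0=1.0, seed=0):
    """Random-walk Metropolis sampler targeting the density exp(-energy(x))."""
    rng = random.Random(seed)
    x, samples = x0, []
    for _ in range(n_samples):
        proposal = x + rng.gauss(0.0, step)
        # Accept with probability min(1, exp(E(x) - E(proposal))).
        delta = energy(x) - energy(proposal)
        if delta >= 0 or rng.random() < math.exp(delta):
            x = proposal
        samples.append(x)
    return samples

samples = metropolis(5000)
left = sum(1 for s in samples if s < 0)  # samples near the mode at -1
right = len(samples) - left              # samples near the mode at +1
```

Starting from one mode, the chain still visits both, which is the multi-modal behaviour that a single Gaussian approximation would miss.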
As we look ahead to the next 12 months of developments in AI research, it is exciting to see the opportunities opening up for innovators across many industries all over the world. We will no doubt see further progress this year in fields that were prominent in 2019 such as Natural Language Processing, Image Recognition and Predictive Modelling.
Traditional processing approaches have long been too inefficient for the needs of today’s leading AI researchers, who are increasingly working with huge data batch sizes and deeply complex algorithms. Processors such as our IPU can facilitate these advanced model implementations by delivering arithmetic efficiency on small batch sizes and quickly processing the sparse data structures often required for faster training techniques.
We are expecting to see the pace of change pick up in 2020 as new hardware is adopted that can significantly accelerate new machine learning models and facilitate innovation on a broader scale.