Feb 12, 2020 \ Machine Learning, Machine Intelligence, AI, Memory, Small Batch Training, Research
The Best Machine Intelligence Papers from NeurIPS 2019
With so many new and exciting areas of research discussed at NeurIPS in December, we’ve been thinking about how AI innovation will evolve in the coming year. We already revealed which directions of research Graphcore will be focusing on in 2020 in a separate post in January. Here, we’ll be concentrating more on the wider trends we are seeing in the world of AI research – and what this will mean for how we process models for machine intelligence in the future.
Recent experimental evidence favours the use of progressively larger, overparametrized neural network architectures. Not only have these been shown to be easier to train, but they have produced improved generalization performance in several applications. At the same time, it has also been observed that only a small percentage of parameters of a large model are significantly different from zero at each point during training. Pruning has therefore been considered as a natural approach to reduce the model size to speed up inference and training.
Promising approaches to exploit sparsity have been based on pruning the trained overparametrized model and then suitably re-initializing the resulting subset of parameters (H. Zhou et al., 2019; J. Frankle et al., 2019). More recently, however, it has been shown that an improved solution can be found without having to train the original overparametrized model first. Instead, the aim is to directly train a sparse subnetwork and evolve the sparse connectivity pattern during a single training run. This can be achieved by periodically pruning and regrowing the current subnetwork, so that the available degrees of freedom of the original overparametrized model can be effectively explored (for more information, see U. Evci et al. (2019) and the further references cited in that paper).
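The periodic prune-and-regrow step described above can be sketched in a few lines. This is a minimal illustration, not the method of any of the cited papers: the function name `prune_and_regrow`, the flat weight list, the magnitude-based pruning rule and the random regrowth rule are all assumptions made for the example.

```python
import random


def prune_and_regrow(weights, mask, fraction, rng):
    """One connectivity update: drop the weakest active weights, regrow as many elsewhere.

    weights and mask are flat lists of equal length; mask[i] is 1 for an
    active connection and 0 for a pruned one. Both are updated in place.
    """
    active = [i for i, m in enumerate(mask) if m == 1]
    inactive = [i for i, m in enumerate(mask) if m == 0]
    n_update = min(int(fraction * len(active)), len(inactive))

    # Prune: deactivate the active connections with the smallest magnitude.
    to_prune = sorted(active, key=lambda i: abs(weights[i]))[:n_update]
    for i in to_prune:
        mask[i] = 0
        weights[i] = 0.0

    # Regrow: activate the same number of previously inactive connections.
    # Random selection is an assumption of this sketch; another criterion
    # (for example gradient magnitude) could be substituted here.
    for i in rng.sample(inactive, n_update):
        mask[i] = 1
        weights[i] = 0.0  # regrown connections start from zero

    return to_prune


weights = [0.9, -0.05, 0.0, 0.4, 0.0, -0.7, 0.01, 0.0]
mask = [1, 1, 0, 1, 0, 1, 1, 0]
pruned = prune_and_regrow(weights, mask, fraction=0.5, rng=random.Random(0))
```

In a training loop, a step like this would be called every few hundred updates, with ordinary gradient steps applied only to the active weights in between. The number of active connections stays constant, so the model never has to be held in memory in its dense form.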
In addition, as Yann LeCun recently pointed out at NeurIPS 2019, for sparse computations it would be especially efficient to resort to structured sparsity by evolving a structured topology of the sparse subnetwork.
Meta-Objectives for Representation Learning
In machine intelligence, learning relies on building an effective representation of the input data distribution, so that predictions can be made in a low-dimensional abstract space instead of the high-dimensional sensory space (e.g. pixel space for images). In this scenario, the computational efficiency of learning, and of tracking changes in the non-stationary distribution of the world, crucially depends on realizing a representation in which a change of the input translates into isolated changes and allows fast adaptation.
When building a compressed latent space representation, a crucial approach for generalizing to combinations of features that have zero probability over the training distribution is to rely on compositionality, which allows the accumulated knowledge to be factorized into a small number of exchangeable variables.
As suggested by Yoshua Bengio in his talk at NeurIPS 2019, the association between the variables of this high-level latent representation can then be modelled by a sparse factor graph, where each factor in the graph quantifies the direct dependence between variables. The graph joint distribution is assumed to be sparse, in the sense that each node in the graph has very few edges. This may be intuitively associated with the abstraction of natural language, where each sentence makes the association between only a small number of words.
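One way to write down the sparse factor graph described above is the following. The notation is an assumption made for illustration: V denotes the set of high-level latent variables, f_j the j-th factor, and s_j the small subset of variables that factor j touches.

```latex
P(V) \;\propto\; \prod_{j} f_j\left(V_{s_j}\right), \qquad |s_j| \ll |V|
```

Sparsity here means that each factor depends on only a handful of variables, in the same way that a sentence relates only a few words.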
Changes in the representation from new observations will also need to be very localized. As discussed in Y. Bengio et al. (2019), an effective way to learn the right disentangled representation would be to meta-learn the causal structure and factorization that takes less time (i.e. less computation) to recover from a change in the input distribution.
Unsupervised / Self-Supervised Learning
Recent progress in different applications based on training larger and larger overparametrized models has also driven a renewed effort in exploiting the significant amount of unlabelled data that currently exists. The use of unsupervised learning also furthers advancement in representation learning by extracting features that are not specialized to solve a single supervised task.
This approach has already achieved significant results for natural language understanding with Google’s Bidirectional Encoder Representations from Transformers (BERT) model and OpenAI’s Generative Pre-Training-2 (GPT-2) model. Both rely on unsupervised pretraining of latent representations: BERT is trained as a masked language model that randomly masks input tokens, while GPT-2 is trained as an autoregressive language model that predicts the next token.
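The random masking of input tokens can be illustrated with a short sketch. It is deliberately simplified and is not the exact BERT procedure: the function name `mask_tokens`, the masking probability and the whitespace tokenization are assumptions made for the example.

```python
import random


def mask_tokens(tokens, mask_prob, rng, mask_token="[MASK]"):
    """Return (masked_tokens, targets).

    targets maps each masked position to the original token, which is what
    the model would be trained to predict from the surrounding context.
    """
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(mask_token)
            targets[i] = tok
        else:
            masked.append(tok)
    return masked, targets


tokens = "the cat sat on the mat".split()
masked, targets = mask_tokens(tokens, mask_prob=0.15, rng=random.Random(0))
```

No labels are needed: the training targets are produced from the unlabelled text itself, which is why this kind of pretraining can use very large corpora.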
A recent model proposed by DeepMind, Contrastive Predictive Coding (CPC) (A. van den Oord et al., 2019), uses a probabilistic contrastive loss to obtain a useful latent space representation from high-dimensional unlabelled data by predicting future input samples. While the use of predictive coding is not new, its application to representation learning together with a contrastive loss facilitates the automatic learning of context and high-level shared information between input features. This method has been successfully applied to different domains, including speech, text, images, and reinforcement learning in 3D environments. For images, CPC predictions are made on neighbouring patches with partial overlap.
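A contrastive loss of this kind can be sketched as a softmax classification problem: given a prediction made from the context, pick out the encoding of the true future sample among a set of negative samples. The sketch below is an illustration under assumptions, not the implementation from the paper: the function name `info_nce_loss` and the plain dot-product score are choices made for the example, and the prediction vector is assumed to have been computed from the context already.

```python
import math


def info_nce_loss(prediction, positive, negatives):
    """Contrastive loss for one prediction: -log softmax probability of the positive.

    prediction: vector predicted from the context
    positive:   encoding of the true future sample
    negatives:  encodings of samples drawn from elsewhere in the data
    """
    def score(a, b):
        # Dot-product similarity (an assumption of this sketch).
        return sum(x * y for x, y in zip(a, b))

    logits = [score(prediction, positive)] + [score(prediction, n) for n in negatives]
    top = max(logits)  # subtract the maximum for numerical stability
    log_sum_exp = top + math.log(sum(math.exp(v - top) for v in logits))
    return log_sum_exp - logits[0]


negatives = [[0.0, 1.0], [-1.0, 0.0]]
aligned = info_nce_loss([1.0, 0.0], [1.0, 0.0], negatives)
misaligned = info_nce_loss([1.0, 0.0], [-1.0, 0.0], negatives)
```

The loss is small when the prediction scores the true sample higher than the negatives (`aligned`) and large otherwise (`misaligned`). Minimizing it pushes the encoder to keep the information that the context and the future sample share.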
Self-supervised representation learning based on the CPC approach has recently been used by S. Löwe, P. O’Connor and B. Veeling (S. Löwe et al., 2019) to replace the conventional end-to-end training of deep neural networks based on backpropagation with local training of individual groups of neurons, or modules. Instead of relying on supervised training based on backpropagation of the gradients of a global loss function, the proposed Greedy InfoMax approach trains different modules only locally, using a separate greedy self-supervised loss. For each module, the local greedy CPC self-supervised loss aims at maximally preserving the information at its input, building useful representations from sequential inputs by maximizing (a lower bound on) the mutual information between representations, e.g. representations of temporally consecutive utterances of speech, or representations of neighbouring patches of an image. This enables the stacked modules to create a compact representation that can then be used for specific downstream tasks.
While it has long been observed that biological neurons appear to learn without backpropagating a global error signal, techniques previously proposed for training artificial neural networks using only local information have failed to achieve performance comparable with that of conventional supervised end-to-end training based on backpropagation (but see also Akrout et al., 2019). In contrast, the Greedy InfoMax approach has been shown to achieve strong performance on audio and image classification (S. Löwe et al., 2019).
Self-supervised training has the immediate advantage of relying only on unlabelled high-dimensional data, whereas supervised training requires labelled input data that is often expensive to obtain. Even more importantly, the Greedy InfoMax training of individual modules based on the local CPC loss breaks the backpropagation forward-backward lock: it eliminates the need to store the activations of all layers during the forward pass before starting the backward pass, and significantly reduces the corresponding memory overhead (S. Löwe et al., 2019). In addition, because this approach enables different parts of the model to undergo decoupled, asynchronous and isolated training, the training of large deep neural networks becomes naturally parallelizable over a large number of processors with very low communication overhead, as each module only needs to exchange its input and output signals.
Model-based Reinforcement Learning (RL) has been successfully applied to improve the sample efficiency of reinforcement learning algorithms. It is starting to produce remarkable results for visual predictive control by building a compact latent state space representation of the world from high-dimensional sensory inputs. This representation is then used for efficient sampling in the latent space to perform online planning or to learn a policy (D. Hafner et al., 2019). This approach has been applied to solve tasks of increased complexity: the method recently proposed by D. Hafner et al. (2019), based on an agent that learns action and value models by latent space imagination from a world model, has been shown to provide a significant improvement for visual control tasks, in terms of both data efficiency and final performance.
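The idea of sampling in the latent space to plan online can be sketched with a simple random-shooting planner. This is an illustration under assumptions, not the algorithm of the cited papers: the function name `plan_first_action`, the random-shooting strategy, and the toy dynamics and reward functions standing in for learned models are all hypothetical.

```python
import random


def plan_first_action(state, dynamics, reward, actions, horizon, n_candidates, rng):
    """Random-shooting planner: imagine rollouts in the model, return the best first action."""
    best_return, best_action = float("-inf"), None
    for _ in range(n_candidates):
        plan = [rng.choice(actions) for _ in range(horizon)]
        s, total = state, 0.0
        for a in plan:
            s = dynamics(s, a)   # imagined next latent state
            total += reward(s)   # predicted reward, no environment interaction
        if total > best_return:
            best_return, best_action = total, plan[0]
    return best_action


# Toy stand-ins for learned models (hypothetical): a one-dimensional latent
# state that each action shifts, and a reward that peaks when the state is 3.
def toy_dynamics(s, a):
    return s + a


def toy_reward(s):
    return -abs(s - 3.0)


action = plan_first_action(0.0, toy_dynamics, toy_reward, actions=[-1.0, 0.0, 1.0],
                           horizon=3, n_candidates=200, rng=random.Random(0))
```

All candidate rollouts are evaluated inside the model, so the only interaction with the real environment is executing the chosen first action. This is where the sample efficiency of the model-based approach comes from.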
Model-based approaches have also been recently applied to object-centric prediction and planning. R. Veerapaneni et al. (2019) studied an effective model-based RL method where objects and their local interactions with the environment are modelled from raw visual observations, with object entities represented as latent space state variables. This approach, based on the unsupervised learning of factorized latent representations of local object models rather than of a global scene model, has resulted in remarkable performance compared to state-of-the-art video prediction and planning baselines.
The Future of AI Acceleration
Overall, from the developments we are seeing in the AI research community, it is becoming increasingly clear that legacy approaches to processing architectures are not going to be sufficient for the changing needs of modern AI.
To exploit larger model sizes, more processing power is needed. This will only be possible with new architecture types that are built specifically for machine intelligence workloads.
Traditional processors do not deal with sparsity efficiently as they are not designed for parallel processing. As a massively parallel processor, the IPU is optimized to efficiently train the sparse, low-dimensional data structures that will become increasingly prevalent.
And finally, the world is complicated by nature and it is not always possible to find large-scale, neatly structured and labelled datasets for training. This should not hold back innovators from leveraging machine intelligence. Unsupervised and self-supervised learning will be fundamental in enabling researchers to make the best possible use of their organizations’ data. Processors that can facilitate the acceleration of advanced models for self-supervised learning, such as probabilistic models, will provide a future-proof solution.