How to build a processor for machine intelligence (part 2)

Last week, our co-founder and Chief Technology Officer, Simon Knowles, was invited to give a talk at the 3rd Research and Applied AI Summit (RAAIS) in London. RAAIS is a community for entrepreneurs and researchers who accelerate the science and applications of AI technology. The event is organized annually by Nathan Benaich, a London-based investor, technologist and scientist focused on companies creating transformational solutions for the world’s most challenging problems.

Let me start off having you remind us how you define an Intelligent Machine?

Let’s imagine you built an autonomous car which drove more safely than a human, but nevertheless one day it crashed. This event does not imply a lack of intelligence. After all, humans crash. The machine may have made the best driving decisions possible, given the information it had. But if you try it again with exactly the same test scenario and the machine crashes in exactly the same way, then you probably wouldn’t regard it as intelligent, because it didn’t learn from the first experience. So here’s my attempt to define intelligence in one sentence: the capacity for rational judgment based on imperfect knowledge, adapted with experience. Intelligence is required where perfect results aren’t computable, usually because there’s not enough data available. And there has to be some learning component for a machine to be considered intelligent.

We hear a lot about chips designed for training and chips designed for inference. In this context, what is the Graphcore IPU designed for?

I don’t think it is helpful to think of there being a hard distinction between training and inference, from a computational viewpoint. Any intelligence computer requires just two components: an inference engine and a knowledge model. Learning is a form of inference, in which we are trying to condense experience (data) into the knowledge model. We do this by inferring the parameters and perhaps the structure of the model from the training data. We can then use the learned model to infer some unknowns from some new input data, such as what motor actions to take in response to incoming video frames.

So training is inference which adapts the model, and there is also inference which does not adapt the model, but computationally they are very similar. We should not need two types of machine architecture. There might be large machines and small machines, but they can have the same architecture and run the same software efficiently.

As I said earlier, there is no intelligence without learning. But it’s possible that some “edge” machines could delegate their learning function to a computing “core”. For example, an autonomous car might send condensed sensor data back to the cloud and receive periodic model updates. Personally, I think this is a mode which may justify machines of different capacity, but not machines of different architecture. The great beauty of having a common architecture is that all the software can be common.

You mentioned in your talk that you shouldn’t design a processor today just for neural networks. Instead, you claim that you need to have a 20 year time horizon. How do you do that?

Inventing a new processor architecture and rendering it as a leading-edge chip is difficult and expensive. Building all the software on top to make it useful and easy to use is similarly difficult and even more expensive, so unless it’s going to be useful for 20 years or more, don’t bother.

If you had set out to build an intelligence processor in 2005 you might have focused on the computational needs of support vector machines. By 2010, you would probably have been more concerned with random forests. Today, the high fashion is deep neural networks. What will be the leading paradigm in 5 years time?

When we started thinking about building a machine to accelerate intelligence processing at Graphcore, we knew that we had to look beyond today’s deep neural networks. Neural networks are clearly important, as are many of the earlier paradigms. Our IPU has to out-perform GPUs and CPUs at all these tasks. But perhaps more importantly, it has to provide a flexible platform for the discoveries yet to come.

So we have tried to stand back from specific algorithms, and to consider what is fundamental about intelligence as a computational workload. We knew that if we successfully captured the fundamentals in our design, then our IPU would automatically perform well on neural networks et al.

So, what is the fundamental workload for machine intelligence compute?

In all forms of machine intelligence computation we are trying to infer knowledge models from data and then to use these learned models to infer new results from further data. So to a large extent the workload is defined by the nature of the knowledge models.

Knowledge models and the algorithms required to perform inference on them are most naturally represented as graphs. A graph comprises vertices connected by edges. The vertices contain the learnable parameters of the model, and also some code which defines how to update those parameters in response to new inputs, and what derived information to provide to other vertices so that they may update in turn. The edges define causal pathways of information flowing through the model. In a deep neural network, the edges often pass tensors which are stacks of feature maps. In a Bayesian net or a Markov random field they pass probability distributions.

The first relevant characteristic of these graphs is that they are large, often thousands or millions of vertices, so there is scope for a parallel processor to achieve high performance by computing on many independent vertices at once. This is serendipitous, because to build higher performance processors in this age of power-limited silicon we must use parallelism not speed.

The second characteristic of the models is that they are sparse – most vertices are connected to only a small subset of the other vertices. A machine designed for intelligence processing must recognise this sparsity architecturally, because it has a major impact on memory access patterns and communication patterns between processors.

In fact the effective sparsity of memory access increases when we embed graph models in physically realizable memories. This is because the graphs usually have high “fractal dimensionality”, and the reduction in dimensionality when embedded in a linear address space induces sparsity. For example, a vertex in the graph might be connected to 100 neighbours which are equally close. But if I store the state of that vertex in a memory with one address, it can have only two immediate neighbours. The effect of embedding a high-dimensional graph in a low-dimensional memory is to cause neighbours to be scattered across the memory. This is a latent problem for very wide vector machines such as GPUs. Processing on graphs favours much finer-grain machines which can scatter and gather data efficiently. This is why IPU has many more processors than a GPU, each designed to process narrower vectors.

The third fundamental characteristic of knowledge models is that they capture statistical approximations of their data experience. They are usually stochastically learned from noisy and incomplete data, and when we use them it is to obtain approximate inferences. A consequence is that individual model parameters and activation states are never very accurately resolved. A large ensemble of them in a model can work together to deliver a highly resolved output, but no single internal component is precise. From a machine design viewpoint this favours low-precision floating-point arithmetic. In fact, this is probably the first major application for very high performance computing on very low precision data. It’s the polar opposite of traditional High Performance Computing (HPC).

Another fundamental is the re-use of model parameters. Convolution is all about spatial re-use and recurrence is all about re-use over time. Such re-use captures spatial or temporal invariances in the data, but equally important is that if you re-use parameters, you expose them to more data so they learn faster.

The last fundamental characteristic I’ll mention is that the graph structures can be regarded as static – at least for long periods of time. We don’t need to change the connections in our knowledge models during computation, just like you don’t rewire the ethernet cables in a datacenter when you change applications, and the axons in your brain don’t rewire themselves as you think (actually they do, but only very, very slowly).

Having a static graph is vital to building efficient highly-parallel machines, because it allows the compiler to do three NP hard tasks involved in mapping a parallel program to a parallel machine:

balancing computational work across processors,
partitioning and allocating memory,
scheduling message traffic between processors.

There are many other fundamental characteristics of the machine intelligence workload that we have considered; I mentioned a few more in my talk.

You said in your talk that Graphcore takes a different approach to memory. Could you explain?

For machine intelligence workloads, the performance of today’s GPUs and CPUs is limited by their bandwidth to external memory. If you invented a GPU-like processor with 10x the arithmetic performance, there is no known technology which will deliver the necessary 10x in external memory bandwidth to keep up with it. To achieve a step change in system performance we have to do something different.

Traditional external DRAM systems, such as DDR4 or GDDR5, are already left behind because bandwidth is too low and not scaling. They are being replaced with High Bandwidth Memory (HBM), which uses vertically stacked DRAM die physically very close to the processor die, on a silicon substrate with very high wiring density. This is what today’s state-of-the-art GPUs are using. It is “heroic” technology, very difficult to manufacture at good yield. Also it brings the DRAM into the thermal envelope of the processor, which is bad news for leakage and steals power from the processor budget, so the processor must run slower. Today’s second-generation HBM2 systems are struggling towards 1TByte/s, but this is still a hard limit on compute performance.

The radical alternative is to bring the model memory onto the processor chip. Allocating, say, half the die area to SRAM delivers an order of magnitude improvement in bandwidth and latency over an HBM2 external memory system. Then partitioning this SRAM into many blocks distributed across a massively parallel processor-on-chip yields another order of magnitude. This is the approach we have taken with the IPU.

You obviously can’t integrate as much SRAM on an IPU as you can attach as DRAM to a GPU. But we can fit plenty enough to hold large and complex machine learning models entirely on chip. And the ability to access them at 100x the bandwidth and 1/100^th the latency delivers great performance dividends. The final part of the jigsaw is that IPUs are designed to be clustered together such that they appear to software like one larger chip. So if you need huge models you can distribute them over multiple chips.

What if we still run out of memory? It turns out there’s huge scope for reducing the memory footprint of machine intelligence processing, not least because several of the techniques currently used to mitigate DRAM bandwidth and latency problems cause memory bloat.

For example, GPUs use a transformation called “lowering” to turn convolutions into matrix-matrix multiplication. And they require large training batches to reach their best efficiency. Each of these can bloat the memory by an order of magnitude. IPUs require neither.

We can save another order of magnitude in memory footprint by trading re-computation for memorization. For example, most of the memory used during DNN training on a GPU is to save activations on the forward pass, so that we can use them during back-propagation. An efficient alternative is to just save periodic snapshots, then re-compute the missing activations from the nearest snapshot when required. This trade provides an order of magnitude memory footprint improvement for around 25% increase in compute. An IPU has less memory but far greater compute than a GPU, so it’s a good trade.

Most importantly, all these memory optimizations can be done “under the hood” by the compiler, so they don’t require clever programming.

One of the takeaways from your talk is that you provided a new definition of what it means for a chip to be efficient – “Silicon efficiency is the full use of available power.” You’ve mentioned keeping the memory local. How else do you maximize use of the available power?

One way is by serializing computation and communication, which may seem counter-intuitive. Surely it’s more efficient to overlap them? Not necessarily so in a power-limited world.

Consider the design of a new multi-processor. You aim to overlap inter-processor communication with computation, so you allocate your power budget between these two according to your expected workload. Then you run an application which has a different balance and the result is that the program executes in the same time but consumes less power, because either the communication or the computation part was bottle-necked and the other under-used its power budget.

Now consider an alternative design, in which you assume that computation and communication do not overlap, so you provision them both to use all the chip power. Now whatever the balance of a real workload, the program has the potential to complete in minimum time, at constant full power. In this age of “dark silicon” this is a more efficient design.

You also talked about Bulk Synchronous Parallel. How does that fit in?

An IPU machine is a pure, distributed, parallel processor, so we can build little ones for embedding in robots and big ones for data-centres and, in the future, huge ones and it will still be the same machine to software.

An IPU consists of very many high-performance, independent processors, each with their own local memory, which can send messages to each other through a perfect stateless interconnect. This interconnect allows an IPU to support any type of knowledge model and inference algorithm, including ones that we haven’t thought of yet.

The static nature of the computation graph allows the compiler to divide the program into independent sub-programs running on the many processors. But we need a mechanism for maintaining overall program order, so that messages are sent between the sub-programs at the right time.

There is a paradigm for doing this: it’s called Bulk Synchronous Parallel (or BSP), and was invented in the 1990’s by Leslie Valliant and Bill McColl. It works by alternating between a “compute phase”, in which every processor just does compute on its own local memory, and an “exchange phase” in which they just send and receive messages, memory-to-memory. Notice that this serializes computation and communication, which is now efficient as discussed earlier. Once around each pair of phases, the whole machine synchronises so all processors remain in step. There is hardware to make this synchronization super-cheap, on chip as well as across chips in a cluster.

The beauty of BSP is that it effectively creates a sequential machine at the whole-program level, but supports massive parallelism in each of the steps. It’s also the only parallel program execution paradigm which is automatically free of concurrency hazards, so you can’t write a program with races or deadlocks or livelocks.

The sequence of compute and exchange phases is extracted from the graph automatically by the compiler, so the BSP mechanics are not something the programmer has to think about. It’s just the IPU way of mapping a graph program to an efficient, scalable, and safe massively-parallel machine.

I know that you have lots of potential customers who are eager to try your IPU technology – when will we be able to get our hands on an IPU?

We will be shipping products to our early access customers before the end of this year and more general availability will start early next year. We are looking forward to helping engineers deliver the first generation of really intelligent machines and providing researchers with a more flexible and potent engine for the discovery of future general intelligence models and inference algorithms.

Nathan Benaich is an investor, technologist and scientist focused on companies creating transformational solutions to the world’s challenging problems. Previously a partner at Playfair Capital, Nathan invested in 33 companies including Mapillary, Ravelin and Starship Technologies. Nathan previously earned a Ph.D. in computational cancer biology at the University of Cambridge and a BA from Williams College.

You can watch the talk in full, below: