Douglas Adams’ seminal book ‘The Hitchhiker’s Guide to the Galaxy’ is full of magical innovations and strange objects. One of his most famous fictional creations is the Babel fish:
"The Babel fish is small, yellow, leech-like and probably the oddest thing in the universe…if you stick one in your ear, you can instantly understand anything said to you in any form of language."
With recent advances in machine learning and new processors designed for machine intelligence, the Babel fish, or an AI version of one, may not be too far off.
Researchers at Baidu in Santa Clara, CA have been working on an end-to-end deep learning approach to speech recognition that can recognize English or Mandarin in the same system. The two languages are completely different in construction, yet a single deep learning system can recognize both. Baidu’s Deep Speech 2 system can also handle noisy environments and regional accents. More information is in their paper, Deep Speech 2: End-to-End Speech Recognition in English and Mandarin https://arxiv.org/abs/1512.02595. Their approach uses Recurrent Neural Networks (RNNs), and more specifically Gated Recurrent Units (GRUs), a gated recurrent cell that is closely related to, but simpler than, Long Short-Term Memory (LSTM).
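To make the recurrent unit concrete, here is a minimal sketch of a single GRU time step in plain Python. It is an illustration only, not Baidu's implementation: the parameter names (Wz, Uz, bz and so on), the helper functions and the random initialisation are all assumptions made for this example, and the sketch uses the common convention in which the update gate z blends the old state with the new candidate state.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def init_params(input_size, hidden_size, seed=0):
    """Hypothetical helper: small random weights and zero biases."""
    rng = random.Random(seed)

    def mat(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

    params = {}
    for gate in ("z", "r", "h"):
        params["W" + gate] = mat(hidden_size, input_size)   # input weights
        params["U" + gate] = mat(hidden_size, hidden_size)  # recurrent weights
        params["b" + gate] = [0.0] * hidden_size            # biases
    return params


def gru_step(x, h, p):
    """One GRU time step: returns the new hidden state.

    Convention assumed here: h_new = (1 - z) * h + z * h_candidate.
    Some papers swap the roles of z and (1 - z).
    """
    # Update gate: how much of the candidate state to let in.
    z = [sigmoid(a + b + c)
         for a, b, c in zip(matvec(p["Wz"], x), matvec(p["Uz"], h), p["bz"])]
    # Reset gate: how much of the old state feeds the candidate.
    r = [sigmoid(a + b + c)
         for a, b, c in zip(matvec(p["Wr"], x), matvec(p["Ur"], h), p["br"])]
    reset_h = [ri * hi for ri, hi in zip(r, h)]
    # Candidate state computed from the input and the reset-scaled old state.
    h_cand = [math.tanh(a + b + c)
              for a, b, c in zip(matvec(p["Wh"], x), matvec(p["Uh"], reset_h), p["bh"])]
    return [(1.0 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, h_cand)]


def run_sequence(xs, p, hidden_size):
    """Feed a sequence of input vectors through the cell, one step at a time."""
    h = [0.0] * hidden_size
    for x in xs:
        h = gru_step(x, h, p)
    return h
```

Each step depends on the hidden state produced by the step before it, so the time steps of a sequence have to be processed in order. With all weights and biases set to zero, the update gate sits at 0.5 and the candidate state is zero, so the hidden state is simply halved at each step.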
Separately, Google DeepMind in London have been working on speech synthesis using a new type of neural network that they call a WaveNet. They recently published a blog post with examples and some useful animated descriptions that show how each discrete output in time is a function of the previous inputs and outputs: https://deepmind.com/blog/wavenet-generative-model-raw-audio/. There’s also a more detailed paper: https://arxiv.org/abs/1609.03499. The results are impressive. The paper also shows how WaveNets can be used as a discriminative model, and how this approach could be applied to phoneme recognition for speech understanding as well as to speech generation.
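The idea that each output depends only on earlier samples can be sketched with a stack of causal dilated convolutions, the building block described in the WaveNet paper. The code below is a simplified illustration in plain Python and not DeepMind's implementation: the function names are made up for this example, the same filter weights are reused in every layer for brevity, and the gated activations, residual and skip connections and the output softmax of the real model are left out.

```python
def causal_dilated_conv(x, weights, dilation):
    """y[t] = sum_k weights[k] * x[t - k * dilation].

    Samples before the start of the signal are treated as zero, so
    y[t] never depends on any x[s] with s > t (the filter is causal).
    """
    y = []
    for t in range(len(x)):
        acc = 0.0
        for k, w in enumerate(weights):
            i = t - k * dilation
            if i >= 0:
                acc += w * x[i]
        y.append(acc)
    return y


def stack(x, weights, dilations):
    """Apply one causal dilated convolution per entry in `dilations`."""
    for d in dilations:
        x = causal_dilated_conv(x, weights, d)
    return x


def receptive_field(kernel_size, dilations):
    """Number of input samples that can influence a single output sample."""
    return 1 + (kernel_size - 1) * sum(dilations)


if __name__ == "__main__":
    impulse = [1.0] + [0.0] * 7
    # A single impulse spreads across 8 output samples after three layers.
    print(stack(impulse, [1.0, 1.0], [1, 2, 4]))
    # Ten layers with dilations 1, 2, 4, ..., 512 and a kernel of size 2.
    print(receptive_field(2, [2 ** i for i in range(10)]))
```

Doubling the dilation at each layer makes the receptive field grow exponentially with depth: ten layers with a kernel of size 2 and dilations 1 to 512 cover 1024 input samples.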
Most recently, DeepMind has published a further piece of research introducing ByteNets, which are designed for text translation: https://arxiv.org/abs/1610.10099. This system combines two networks, one to encode a source-language sequence and a second to decode it into the target language. The decoder is stacked on top of the encoder, so the target sequence is built up as a function of the sequence of source inputs. Linking the networks in this way allows language sequences of arbitrary length to be handled. Testing shows that this system is able to produce state-of-the-art results.
When you look at research projects like these (and there are many others), it’s clear that there is massive progress in machine language understanding and translation. It’s not hard to imagine that a real AI Babel fish may become available in just a few years. A Bluetooth headset and a mobile phone connected to a cloud service will bring real-time language translation when we travel. Audio and video conference calls will be available with highly accurate real-time language translation. If you wanted, the system could generate a transcript of the discussion in any language you chose.
However, there are challenges that still need to be overcome. RNNs and the associated LSTM and GRU structures are compute-intensive to train and only train well with small mini-batch sizes. This type of machine learning doesn’t suit current processors, which slows progress towards commercialisation of a real AI Babel fish.
If memory were closer to the processor, these language systems could run significantly faster. WaveNets and the related ByteNets require very large amounts of compute and increase the requirement for high-bandwidth memory close to the processing units. State-of-the-art research like this demands a new processor for intelligent compute. Innovations and major steps forward like these from Baidu and DeepMind are just the type of machine learning advances that Graphcore’s IPU is designed to support.