For the last few years, Graphcore has primarily been focused on slinging its IPU chips for training and inference systems of varying sizes, but that is changing now as the six-year-old British chip designer is joining the conversation about the convergence of AI and high-performance computing.
There are now 168 supercomputers in the Top500 and quite a few more outside of that list that use accelerators to power these increasingly converging workloads. Most of these systems use Nvidia’s GPUs, but the appearance of seven new systems with AMD’s fresh Instinct MI250X GPUs, including Oak Ridge National Laboratory’s Frontier, the United States’ first exascale system, shows there is an appetite for alternative architectures when they can provide an advantage.
Graphcore hopes it can soon get a slice of this action with its massively parallel processors.
Phil Brown, a Cray veteran who returned to Graphcore in May as vice president of scaled systems after a four-month stint at chip startup NextSilicon, tells The Next Platform that the IPU maker has recently “seen significant, sustained interest” from organizations that are considering deploying Graphcore’s specialized silicon for these converged AI and HPC needs, and this includes large deployments.
“I think we’re now at the point where there is going to be significant interest in doing large-scale deployments with the systems. The technology space and machine learning capability has evolved sufficiently that it can deliver significant value to the scientific organizations, and so I’m expecting those to follow quite rapidly in the future,” he says.
Graphcore sees three key opportunities in the convergence of HPC and AI: using the IPU’s “class-leading” performance for 32-bit floating point math to tackle HPC applications, training large foundation models like DeepMind’s 280-billion-parameter language model, and “using AI to complement and accelerate traditional HPC workloads” to create a feedback loop of sorts.
It’s the last of these areas that Brown says is likely the largest opportunity for Graphcore in HPC.
“This may be having surrogate models, elements of a traditional HPC simulation, replaced by a machine learning kernel parameterization in a weather forecast, for example,” he says. The simulation elements being replaced are computationally expensive, he adds, so swapping them for machine learning models that are “much cheaper but equally accurate” can help reduce the overall cost of running simulations.
These opportunities are based on exploratory work Graphcore has conducted with partners that has yielded promising results. For instance, the company says its IPUs were used to train a gravity wave drag model for weather forecasting five times faster than Nvidia’s V100. In another example, Hewlett Packard Enterprise trained a deep learning model for protein folding using Graphcore’s IPU-M2000 system and found that the second-generation IPU was around three times faster than Nvidia’s A100.