What's the Best AI Hardware for Graph Neural Networks?

While large language models (LLMs) like ChatGPT and generative AI image models have captured the public imagination, one of the most eagerly anticipated developments in terms of commercial potential has been Graph Neural Network applications - described by Meta AI chief Yann LeCun as “a major conceptual advance” in AI.

A graph neural network (GNN) lets artificial intelligence practitioners work with data that lacks, or does not lend itself to, a regular structure. Examples include the composition of molecules, the organisation of social networks, and the movement of people and vehicles within cities.

Graphcore IPUs have secured double-first position in the Open Graph Benchmark Large-Scale Challenge, making the specialist AI hardware the top choice for Graph Neural Network (GNN) applications.

At OGB-LSC, the AI industry’s leading test of graph network model capability, IPUs won first place in categories for predicting quantum properties of molecular graphs and knowledge graph completion.

Graphcore researchers partnered with molecular machine learning specialists Valence Discovery and with teams from Université de Montréal and the Montreal-based AI lab Mila on the OGB-LSC submission.

Commenting on the experience of working with Graphcore systems, Dominique Beaini, Research Team Lead at Valence Discovery and Associate Professor at Mila, said: “As I started applying IPUs for molecular property predictions, I was shocked to see the speed improvements over traditional methods. With such an advantage in terms of compute, it was clear in my mind that winning the OGB-LSC competition was within reach.”

Graphcore and partners placed ahead of teams from Microsoft, Tencent, and NVIDIA, as well as researchers from Peking University, University of Science and Technology of China, and UCLA.

PCQM4Mv2 - Molecular property prediction with GNNs

PCQM4Mv2 defines a molecular property prediction problem that involves building a Graph Neural Network to predict the HOMO-LUMO energy gap (a quantum chemistry property) given a dataset of 3.4 million labelled molecules.
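Entries in this track are compared by mean absolute error, the "Test MAE" figure quoted in the results at the end of this post. As a minimal sketch of how that metric is computed, the snippet below uses a hypothetical helper and invented toy values, not data from the benchmark.

```python
def mean_absolute_error(predicted, target):
    """Average absolute difference between predictions and labels."""
    if len(predicted) != len(target):
        raise ValueError("predicted and target must have the same length")
    return sum(abs(p - t) for p, t in zip(predicted, target)) / len(target)


# Toy HOMO-LUMO gap values, invented purely for illustration.
predicted_gaps = [5.0, 6.5, 4.0]
true_gaps = [5.5, 6.0, 4.0]

mae = mean_absolute_error(predicted_gaps, true_gaps)
print(f"MAE: {mae:.4f}")
```

A lower value is better: it means the predicted gaps sit closer, on average, to the reference values.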

This kind of graph prediction problem crops up in a wide range of scientific areas such as drug discovery, computational chemistry and materials science, but it can be painfully slow to run using conventional methods and can even require costly lab experiments. For this reason, science-driven AI labs including DeepMind, Microsoft and Mila have taken a keen interest in OGB-LSC.

We partnered with Valence Discovery, leaders in molecular machine learning for drug discovery, and Mila to build our submission because we felt that their real-world knowledge and research expertise, combined with the ultra-fast Graphcore hardware, gave us a great opportunity to build something special.

The key to the success of our GPS++ model is its hybrid architecture that takes the best qualities of conventional graph neural networks and merges them with transformer-style attention.
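To make the hybrid idea concrete, here is a deliberately simplified sketch of one layer that adds a local message-passing update to a global, biased self-attention update. It is not the GPS++ implementation: there are no learned weights, and the function names, the mean aggregation and the way the two branches are summed are all assumptions made for illustration.

```python
import math


def mean_neighbour_message(features, neighbours):
    """Local branch: each node averages the feature vectors of its graph neighbours."""
    dim = len(features[0])
    out = []
    for nbrs in neighbours:
        if not nbrs:
            out.append([0.0] * dim)  # isolated node receives no message
            continue
        out.append([sum(features[n][d] for n in nbrs) / len(nbrs) for d in range(dim)])
    return out


def biased_self_attention(features, bias):
    """Global branch: every node attends to every node; bias[i][j] is added to the score."""
    dim = len(features[0])
    scale = 1.0 / math.sqrt(dim)
    out = []
    for i, query in enumerate(features):
        scores = [
            scale * sum(q * k for q, k in zip(query, key)) + bias[i][j]
            for j, key in enumerate(features)
        ]
        top = max(scores)  # subtract the maximum for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append(
            [sum(w * features[j][d] for j, w in enumerate(weights)) / total for d in range(dim)]
        )
    return out


def hybrid_layer(features, neighbours, bias):
    """Residual sum of the input, the local message and the global attention output."""
    local = mean_neighbour_message(features, neighbours)
    global_out = biased_self_attention(features, bias)
    return [
        [x + l + g for x, l, g in zip(fx, fl, fg)]
        for fx, fl, fg in zip(features, local, global_out)
    ]


# Toy three-node path graph 0 - 1 - 2 with two features per node.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
neighbours = [[1], [0, 2], [1]]
bias = [[0.0] * 3 for _ in range(3)]  # hypothetical bias, e.g. derived from graph distance

updated = hybrid_layer(features, neighbours, bias)
```

The local branch only sees direct neighbours, while the attention branch lets distant nodes exchange information in a single step. The bias matrix is one place where structural information about the graph can be fed into the global branch.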

In one sense, this hybridization is a natural idea, which may well have been considered by our competitors. However, on conventional AI accelerators, engineering such a model to run efficiently is a daunting challenge, making it impractical to test the original scientific hypothesis.

The IPU’s MIMD architecture and high memory bandwidth greatly simplify such engineering, allowing scientists to test novel ideas without being constrained by the vagaries of the “hardware lottery”.

As is true across modern AI, speeding up large models is the key to state-of-the-art accuracy. In developing such models, however, it is also crucial to be able to quickly iterate on smaller models in order to test hypotheses and most efficiently tune the large “production” models. Again, the IPU's flexibility shines here: models can easily be run on a single IPU, or a Pod of 16 or more IPUs, without loss of efficiency.

Using the excellent hyperparameter sweeping tools from Weights & Biases we were able to run hundreds of small models each night with a moderately sized compute budget. This enabled us to move fast and have confidence in our decisions.
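For readers unfamiliar with sweeps, a Weights & Biases sweep is described by a small configuration that names the search method, the metric to optimise and the parameters to vary. The definition below is a hypothetical example: the metric name, parameter names and ranges are illustrative assumptions, not the settings we used.

```python
# Hypothetical sweep definition; every name and range here is illustrative.
sweep_config = {
    "method": "random",  # sample parameter combinations at random
    "metric": {"name": "valid_mae", "goal": "minimize"},
    "parameters": {
        "learning_rate": {"values": [1e-4, 3e-4, 1e-3]},
        "hidden_size": {"values": [128, 256, 512]},
        "dropout": {"min": 0.0, "max": 0.3},
    },
}

# With the wandb package installed and an account configured, a sweep like this
# would typically be registered and run along these lines, where
# train_small_model is a placeholder for your own training function:
#   sweep_id = wandb.sweep(sweep_config, project="my-project")
#   wandb.agent(sweep_id, function=train_small_model, count=100)
```

Each agent run receives one sampled combination of parameters, so many small models can be trained in parallel and compared on the same metric.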

Our successful work for OGB-LSC paves the way for an ongoing collaboration, as noted by Valence Discovery and Mila’s Dominique Beaini: “We are currently pursuing our collaboration with Graphcore in the hopes that scaling the model on even larger datasets can provide by far the biggest pre-trained graph neural network for molecular property predictions.”

To find out more, see our technical report and code.

You can also try both our inference model and training model for free on Paperspace.

WikiKG90Mv2 - Predicting missing facts in a knowledge graph

WikiKG90Mv2 is a dataset extracted from Wikidata, the knowledge graph used to power Wikipedia. It is a database of 600 million facts, typically represented as ‘triples’ of head, relationship and tail, for example: Geoffrey Hinton, citizen, Canada.

In many instances, the relationship information between entities is incomplete. Knowledge graph completion is the process of inferring these connections.
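As a rough illustration of how completion works, the sketch below scores candidate tails for an incomplete triple using TransE, one of the scoring functions in our final ensemble. The two-dimensional embeddings are hand-picked toy values rather than trained parameters, and the helper names are hypothetical.

```python
def transe_score(head, relation, tail):
    """TransE plausibility: negative L1 distance between head + relation and tail."""
    return -sum(abs(h + r - t) for h, r, t in zip(head, relation, tail))


def rank_tails(head_id, relation_id, entity_emb, relation_emb):
    """Rank every other entity as a candidate tail for (head, relation, ?)."""
    head = entity_emb[head_id]
    relation = relation_emb[relation_id]
    scores = {
        name: transe_score(head, relation, vector)
        for name, vector in entity_emb.items()
        if name != head_id
    }
    return sorted(scores, key=scores.get, reverse=True)


# Toy embeddings chosen by hand so that "Canada" is the best match.
entity_emb = {
    "Geoffrey Hinton": [0.0, 0.0],
    "Canada": [1.0, 0.1],
    "Toronto": [0.2, 1.0],
}
relation_emb = {"citizen": [1.0, 0.0]}

ranking = rank_tails("Geoffrey Hinton", "citizen", entity_emb, relation_emb)
print(ranking)
```

In a trained model the embeddings are learned so that true triples score higher than corrupted ones, and the highest-ranked candidates are proposed as the missing facts.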

Standard techniques for training knowledge graph completion models struggle to cope with dataset scale, as the number of trainable parameters grows with the number of entities in the database.

Training on WikiKG90Mv2, our largest models consume over 300 GiB for parameters, optimiser state and features. It is challenging to partition these models for distributed training without introducing unwanted bias into the model.
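A back-of-the-envelope calculation shows why memory scales with the number of entities. The helper below is a hypothetical sketch, and its inputs (roughly 90 million entities, as the dataset name suggests, a 256-dimensional float32 embedding, and two optimiser state values per parameter) are illustrative assumptions. It does not reconstruct the 300 GiB figure above, which also includes features.

```python
def embedding_memory_gib(num_entities, embedding_dim, bytes_per_value=4, optimiser_copies=2):
    """Memory for an entity embedding table plus per-parameter optimiser state, in GiB."""
    values = num_entities * embedding_dim * (1 + optimiser_copies)
    return values * bytes_per_value / 2**30


# Illustrative inputs only; none of these settings are taken from our models.
small_graph = embedding_memory_gib(num_entities=1_000_000, embedding_dim=256)
large_graph = embedding_memory_gib(num_entities=90_000_000, embedding_dim=256)

print(f"1M entities:  {small_graph:.1f} GiB")
print(f"90M entities: {large_graph:.1f} GiB")
```

Because every entity needs its own embedding row, the footprint grows linearly with the entity count, which is what forces the model to be partitioned across devices.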

Our distributed training scheme called BESS (Balanced Entity Sampling and Sharing) tackles these problems directly without modifying the core model.

Starting with entities that are balanced across streaming memory of a Bow Pod₁₆, we fetch a large batch of facts and corrupted entities to contrast against, enabled by the 14.4 GB in-processor memory across 16 IPUs. These facts and entities are shared via perfectly balanced all-to-all collectives over fast IPU-Links to be processed by the model.
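The balancing idea can be sketched in a few lines of plain Python. This is an illustration under stated assumptions rather than the BESS implementation (which is written in Poplar): the function names, the random sharding and the sampling with replacement are all hypothetical simplifications.

```python
import random
from collections import defaultdict


def shard_entities(num_entities, num_shards, seed=0):
    """Randomly assign entities to shards so that shard sizes differ by at most one."""
    ids = list(range(num_entities))
    random.Random(seed).shuffle(ids)
    return {entity: position % num_shards for position, entity in enumerate(ids)}


def bucket_triples(triples, entity_to_shard):
    """Group triples by the (head shard, tail shard) pair they touch."""
    buckets = defaultdict(list)
    for head, relation, tail in triples:
        pair = (entity_to_shard[head], entity_to_shard[tail])
        buckets[pair].append((head, relation, tail))
    return buckets


def sample_balanced_batch(buckets, num_shards, per_pair, rng):
    """Draw the same number of triples for every shard pair (with replacement)."""
    batch = {}
    for head_shard in range(num_shards):
        for tail_shard in range(num_shards):
            pool = buckets.get((head_shard, tail_shard), [])
            batch[(head_shard, tail_shard)] = (
                [rng.choice(pool) for _ in range(per_pair)] if pool else []
            )
    return batch


def sample_negatives(entity_to_shard, num_shards, per_shard, rng):
    """Each shard draws the same number of corrupted entities from its own entities.

    Assumes every shard holds at least one entity.
    """
    local = defaultdict(list)
    for entity, shard in entity_to_shard.items():
        local[shard].append(entity)
    return {s: [rng.choice(local[s]) for _ in range(per_shard)] for s in range(num_shards)}


rng = random.Random(0)
entity_to_shard = shard_entities(num_entities=100, num_shards=4)
triples = [(rng.randrange(100), rng.randrange(5), rng.randrange(100)) for _ in range(2000)]
buckets = bucket_triples(triples, entity_to_shard)
batch = sample_balanced_batch(buckets, num_shards=4, per_pair=8, rng=rng)
negatives = sample_negatives(entity_to_shard, num_shards=4, per_shard=16, rng=rng)
```

Because every shard pair contributes the same number of triples and every shard contributes the same number of negatives, each device would send and receive equal amounts of data in the all-to-all exchange.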

This meant that we could train hundreds of models to convergence, allowing us to optimise 10 different combinations of scoring and loss functions for use in our final ensemble. Fast validation gave us plenty of information about our models as they trained.
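Models in this track are compared by mean reciprocal rank, the "Test MRR" figure quoted in the results at the end of this post. The sketch below shows how that metric is computed, using a hypothetical helper and invented toy rankings.

```python
def mean_reciprocal_rank(ranked_predictions, true_answers):
    """Average of 1 / rank of the correct answer; a miss contributes zero."""
    total = 0.0
    for ranking, answer in zip(ranked_predictions, true_answers):
        if answer in ranking:
            total += 1.0 / (ranking.index(answer) + 1)
    return total / len(true_answers)


# Toy rankings, invented for illustration: a hit at rank 1, a hit at rank 2, a miss.
rankings = [["Canada", "France"], ["Paris", "London", "Rome"], ["A", "B"]]
answers = ["Canada", "London", "C"]

mrr = mean_reciprocal_rank(rankings, answers)
print(f"MRR: {mrr:.3f}")
```

A higher value is better: a score of 1.0 would mean the correct entity was ranked first for every query.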

Our technique called for fine-grained control over processing, distribution and memory access. Therefore, we decided to implement the model directly in Poplar, a graph computation API for writing IPU programs.

To find out more, see our technical report and code.

You can also try Distributed KGE (training and build entity mapping) on Paperspace.

GNN Masterclass

For more information on these award-winning graph neural network and knowledge graph models and how to apply them to your own AI projects, watch Graphcore Research's GNNs Masterclass.

Results in full

Winners of PCQM4Mv2 Track

1st place: WeLoveGraphs
Team members: Dominic Masters (Graphcore), Kerstin Klaser (Graphcore), Josef Dean (Graphcore), Hatem Helal (Graphcore), Adam Sanders (Graphcore), Sam Maddrell-Mander (Graphcore), Zhiyi Li (Graphcore), Deniz Beker (Graphcore), Dominique Beaini (Valence Discovery/Université de Montréal/Mila Quebec), Ladislav Rampášek (Université de Montréal/Mila Quebec)
Method: GPS++ (112 model ensemble)
Short summary: We use GPS++, a hybrid message passing neural network (MPNN) and transformer that builds on the General, Powerful, Scalable (GPS) framework presented by Rampášek et al. [2022]. Specifically, this combines a large and expressive message passing module with a biased self-attention layer to maximise the benefit of local inductive biases while still allowing for effective global communication. Furthermore, we integrate a grouped input masking method to exploit available 3D positional information and use a denoising loss to alleviate oversmoothing.
Learn more: Technical report, code
Test MAE: 0.0719

2nd place (joint): ViSNet
Team members: Tong Wang (MSR AI4Science), Yusong Wang (Xi'an Jiaotong University, MSR AI4Science), Shaoning Li (MSR AI4Science), Zun Wang (MSR AI4Science), Xinheng He (MSR AI4Science), Bin Shao (MSR AI4Science), Tie-Yan Liu (MSR AI4Science)
Method: 20 Transformer-M-ViSNet + 2 Pretrained-3D-ViSNet
Short summary: We designed two kinds of models: Transformer-M-ViSNet, which is a geometry-enhanced graph neural network for fully connected molecular graphs, and Pretrained-3D-ViSNet, which is a ViSNet pretrained by distilling geometric information from optimized structures. 22 models with different settings are ensembled.
Learn more: Technical report, code
Test MAE: 0.0723

2nd place (joint): NVIDIA-PCQM4Mv2
Team members: Jean-Francois Puget (NVIDIA), Jiwei Liu (NVIDIA), Gilberto Titericz Junior (NVIDIA), Sajad Darabi (NVIDIA), Alexandre Milesi (NVIDIA), Pawel Morkisz (NVIDIA), Shayan Fazeli (UCLA)
Method: Heterogeneous Ensemble of Models
Short summary: Our method combines variants of the recent TransformerM architecture with Transformer, GNN, and ResNet backbone architectures. Models are trained on the 2D data, 3D data, and image modalities of molecular graphs. We ensemble these models with a HuberRegressor. The models are trained on 4 different train/validation splits of the original train + valid datasets.
Learn more: Technical report, code
Test MAE: 0.0723

Winners of WikiKG90Mv2 Track

1st place: wikiwiki
Team members: Douglas Orr (Graphcore), Alberto Cattaneo (Graphcore), Daniel Justus (Graphcore), Harry Mellor (Graphcore), Jerome Maloberti (Graphcore), Zhenying Liu (Graphcore), Thorin Farnsworth (Graphcore), Andrew Fitzgibbon (Graphcore), Blazej Banaszewski (Graphcore), Carlo Luschi (Graphcore)
Method: BESS + diverse shallow KGE ensemble (85 models)
Short summary: An ensemble of 85 shallow KGE models combining five different scoring functions (TransE, TransH, RotatE, DistMult, ComplEx) and two different loss functions (log-sigmoid, sampled softmax cross-entropy), using a new distribution framework, BESS (Balanced Entity Sampling and Sharing).
Learn more: Technical report, code
Test MRR: 0.2562

2nd place: DNAKG
Team members: Xu Chen (Microsoft Research Asia), Lun Du (Microsoft Research Asia), Xiaojun Ma (Microsoft Research Asia), Yue Fan (Peking University), Jiayi Liao (University of Science and Technology of China), Qiang Fu (Microsoft Research Asia), Shi Han (Microsoft Research Asia), Chongjian Yue (Peking University)
Method: DNAKG
Short summary: We generate candidates using a structure-based strategy and rule mining, and score them by 13 knowledge graph embedding models and 10 manual features. Finally we adopt the ensemble method to assemble the scores given by 13 knowledge graph embedding models and 10 manual features.
Learn more: Technical report, code
Test MRR: 0.2514

3rd place: TIEG-Youpu
Team members: Feng Nie (Tencent), Zhixiu Ye (Tencent), Sifa Xie (Tencent), Shuang Wu (Tencent), Xin Yuan (Tencent), Liang Yao (Tencent), Jiazhen Peng (Tencent)
Method: Structure enhanced retrieval and re-rank
Short summary: Semantic and structure enhanced retrieval with a TransE+ComplEx+NOTE 6-model ensemble
Learn more: Technical report, code
Test MRR: 0.2309