
GPUs aren’t always your best bet, Twitter ML tests suggest

GPUs are a powerful tool for machine-learning workloads, though they’re not necessarily the right tool for every AI job, according to Michael Bronstein, Twitter’s head of graph learning research.

His team recently showed Graphcore’s AI hardware offered an “order of magnitude speedup when comparing a single IPU processor to an Nvidia A100 GPU” in temporal graph network (TGN) models.

“The choice of hardware for implementing Graph ML models is a crucial, yet often overlooked problem,” reads a joint article penned by Bronstein with Emanuele Rossi, an ML researcher at Twitter, and Daniel Justus, a researcher at Graphcore.

Graph neural networks offer a means of finding order in complex systems, and are commonly used in social networks and recommender systems. However, the dynamic nature of these environments makes these models particularly challenging to train, the trio explained.

The group investigated the viability of Graphcore’s IPUs in handling several TGN models. Initial testing was done on a small TGN model based on the JODIE Wikipedia dataset that links users to edits they made to pages. The graph consisted of 8,227 users and 1,000 articles for a total of 9,227 nodes. JODIE is an open-source prediction system designed to make sense of temporal interaction networks.
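For readers who want to poke at the same data, the JODIE Wikipedia interaction graph is publicly available; the sketch below loads it via PyTorch Geometric's JODIEDataset loader. That framework choice is our assumption for illustration only, not something the Twitter and Graphcore researchers have said they used.

```python
# Minimal sketch: loading the JODIE "Wikipedia" temporal interaction graph.
# Assumes PyTorch Geometric is installed; the framework is our choice, not
# necessarily what the Twitter/Graphcore team ran on the IPU.
from torch_geometric.datasets import JODIEDataset

dataset = JODIEDataset(root="data/JODIE", name="Wikipedia")
data = dataset[0]  # one timestamped user-edits-article event per row

print(data.src.size(0))          # number of edit events
print(int(data.dst.max()) + 1)   # node ids cover users plus articles
                                 # (roughly the 9,227 nodes cited above)
```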

The trio's experimentation revealed that large batch sizes resulted in degraded validation and inference accuracy, compared to smaller batch sizes.

“The node memory and the graph connectivity are both only updated after a full batch is processed,” the trio wrote. “Therefore, the later events within one batch may rely on outdated information as they are not aware of earlier events.”
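The staleness problem is easiest to see in pseudocode. In a TGN-style training loop, the per-node memory is read for every event in a batch but only written back once the whole batch has been processed, so with a large batch the later events never see the effect of the earlier ones. The following is our own simplification of that loop, with placeholder update and prediction functions, not the team's actual code.

```python
def train_epoch(events, node_memory, batch_size, update_fn, predict_fn):
    """events: list of (src, dst, t, features) sorted by time.
    node_memory: dict or array of per-node state vectors (TGN's 'node memory').
    update_fn / predict_fn: hypothetical stand-ins for the model's learned functions."""
    predictions = []
    for start in range(0, len(events), batch_size):
        batch = events[start:start + batch_size]

        # Every event in the batch reads the memory as it stood *before* the batch,
        # so later events in a large batch rely on outdated node state.
        for s, d, t, f in batch:
            predictions.append(predict_fn(node_memory[s], node_memory[d], f))

        # The memory is only written back after the full batch has been processed.
        for s, d, t, f in batch:
            node_memory[s], node_memory[d] = update_fn(node_memory[s], node_memory[d], t, f)
    return predictions
```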

Using a batch size of 10, the group was able to achieve optimal validation and inference accuracy. They note, however, that the IPU still outperformed the GPU even at larger batch sizes.

“When using a batch size of 10, TGN can be trained on the IPU about 11-times faster, and even with a large batch size of 200, training is still three-times faster on the IPU,” the post reads. “Throughout all operations, the IPU handles small batch sizes more efficiently.”

The team posits that the fast memory access and high throughput offered by Graphcore’s large in-processor SRAM cache gave the IPU an edge.

This performance lead also extended to graph models that exceeded the IPU’s in-processor memory — each IPU features a 1GB SRAM cache — requiring the use of slower DRAM memory attached to the chips.

In testing on a graph model consisting of 261 million follows between 15.5 million Twitter users, the use of DRAM for the node memory curbed throughput by a factor of two, Bronstein’s team found.

However, when inducing several sub-graphs based on a synthetic dataset 10X the size of the Twitter graph, the team found throughput scaled independently of the graph size. In other words, the performance hit was the result of using slower memory and not the result of the model’s size.

“Using this technique on the IPU, TGN can be applied to almost arbitrary graph sizes, only limited by the amount of available host memory while retaining a very high throughput during training and inference,” the article reads.
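The scaling trick described here boils down to keeping the full node-memory table in host memory and moving only the rows a given batch touches onto the chip. The sketch below is our own reconstruction of that idea, with a made-up device_step helper standing in for the on-IPU compute; it is not the team's implementation.

```python
import numpy as np

# The full node-memory table lives in host memory, so graph size is bounded
# by host RAM rather than the IPU's on-chip SRAM.
NUM_NODES, MEM_DIM = 15_500_000, 100
host_node_memory = np.zeros((NUM_NODES, MEM_DIM), dtype=np.float32)

def process_batch(batch_node_ids, batch_events, device_step):
    """device_step: hypothetical stand-in for the model's on-chip computation."""
    # Gather: copy only the touched rows to the device.
    local = host_node_memory[batch_node_ids]
    # Compute: run the batch entirely in fast on-chip memory.
    local = device_step(local, batch_events)
    # Scatter: write the updated rows back to the host-side table.
    host_node_memory[batch_node_ids] = local
```

Because each batch only moves a fixed number of rows over the host link, throughput depends on the batch, not on how many nodes the overall graph contains, which matches the scaling behaviour the team reports.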

The team concluded that Graphcore’s IPU architecture shows a significant advantage over GPUs in workloads where compute and memory access are heterogeneous.

However, the broader takeaway is that ML researchers should carefully consider their choice of hardware and shouldn’t default to using GPUs.

“The availability of cloud computing services abstracting out the underlying hardware leads to certain laziness in this regard,” the trio wrote. “We hope that our study will draw more attention to this important topic and pave the way for future, more efficient algorithms and hardware architectures for Graph ML applications.”