In 1947, when engineers tried to identify why the Harvard Mark II computer was not working and found a moth interfering with a relay, it was noted as the “first actual case of bug being found” (the term ‘bug’ in engineering predates this incident by many decades).
Computing technology has advanced significantly since then, but we need to pay more attention than ever to the quality of our devices.
Developing advanced technology is hard. The additional complexity needed to build compute for artificial intelligence amplifies the potential for bugs, and each new development introduces new failure modes.
The risk scales with the system: a defect in a single transistor or memory cell, or in any of the supporting systems, software, power or cooling needed to run AI foundation models, could have a blast radius that takes down the entire supercomputer.
Advanced technology nodes, such as 3nm gate-all-around transistors, drive enhanced performance with less leakage and reduced power consumption, but we have reached a point where semiconductor manufacturing has to be so tightly controlled that a single atom in the wrong place could lead to a non-functioning device.
The need for resilience
Emergent failure modes such as Silent Data Corruption, where a device produces a computational error that goes undetected by traditional error-checking mechanisms, mean that new approaches to design, manufacturing and test are needed.
Prevention is crucial, but equally so is building resilience into the system.
In their 2024 research paper, Meta detailed a 54-day snapshot period of pre-training during which they experienced a total of 419 unexpected interruptions. Pareto analysis of these shows they were primarily due to faulty GPUs, memory, software bugs, and networking. About 78% of the interruptions were attributed to confirmed or suspected hardware issues. It’s an industry-wide problem.
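To put those figures in perspective, the short Python sketch below works through the arithmetic using only the numbers quoted above. The function and variable names are illustrative, and it assumes the interruptions were spread evenly across the period, which is a simplification.

```python
# Figures quoted from the 54-day snapshot described above.
SNAPSHOT_DAYS = 54
INTERRUPTIONS = 419
HARDWARE_SHARE = 0.78  # approximate share attributed to hardware


def interruption_stats(days, interruptions, hardware_share):
    """Return (interruptions per day, mean hours between them, hardware count).

    Assumes interruptions are evenly spread over the period (a simplification).
    """
    per_day = interruptions / days
    hours_between = days * 24 / interruptions
    hardware_count = round(interruptions * hardware_share)
    return per_day, hours_between, hardware_count


per_day, hours_between, hardware_count = interruption_stats(
    SNAPSHOT_DAYS, INTERRUPTIONS, HARDWARE_SHARE
)
print(f"{per_day:.1f} unexpected interruptions per day")
print(f"one roughly every {hours_between:.1f} hours")
print(f"about {hardware_count} of {INTERRUPTIONS} attributed to hardware")
```

On these numbers, that is close to eight unexpected interruptions a day, or one roughly every three hours, with around 327 of the 419 linked to hardware.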
This year's World Quality Week theme is “think differently”, and the extremely demanding environment of building systems to run AI foundation models requires just that.
At Graphcore, we’re dedicated to building the future of AI compute: a single purpose that justifies taking a different approach.
Our team of expert engineers, who care deeply about our customer experience, are given the freedom to innovate within a light-touch framework of policies and processes that ensure we deliver on time and to quality - every time.
But it’s how we do things - our behaviors, how we collaborate, take accountability, listen to each other, and learn - that makes us a great place to work. Check out our current vacancies.
Culture of quality
We build quality into everything that we do, from world-class hardware and software design and predictive resiliency modelling, to working with our supply chain partners to improve our product quality and product stress testing.
It’s what we do, it’s our culture, it’s part of our DNA.
The statistician and management consultant W. Edwards Deming said that “Quality is everyone’s responsibility”, and I truly believe that this is the case at Graphcore. Quality at every step, Impact at every scale.
Kevin Newey is Head of Quality at Graphcore.