Early data from Graphcore's first chip for machine learning and artificial intelligence shows a dramatic speedup for AI algorithms and the ability to scale in data centres.

“We can fit enough memory to hold large and complex machine learning models entirely on chip”


Most AI chips, such as graphics processing units (GPUs), struggle with scaling algorithms, says Simon Knowles, CTO and co-founder at Graphcore in Bristol. The company has raised over $60m (£50m) to develop a new type of chip.

Graphcore's IPU (Intelligence Processing Unit) has a unique combination of massively parallel, multi-tasking compute resources that support synchronised execution within an IPU or across multiple IPUs. A new data exchange sub-system and large amounts of on-chip SRAM boost both training and inference across a wide range of machine learning algorithms.
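As a purely illustrative sketch of what a synchronised compute-then-exchange cycle can look like, consider the toy Python below. The tile count, function names and rotation scheme are assumptions for the example, not Graphcore's hardware or API:

```python
from concurrent.futures import ThreadPoolExecutor

NUM_TILES = 8  # purely illustrative; a real IPU has far more parallel tiles

def compute_phase(tile_id, local_data):
    # Each tile works independently on data held in its own local memory.
    return [x * x for x in local_data]

def exchange_phase(results):
    # After the synchronisation barrier, tiles swap intermediate results
    # (a simple rotation stands in for a data exchange sub-system here).
    return results[1:] + results[:1]

tiles = [[float(tile_id)] * 4 for tile_id in range(NUM_TILES)]
for step in range(3):  # repeat the compute/sync/exchange cycle
    with ThreadPoolExecutor(max_workers=NUM_TILES) as pool:
        # Compute phase: all tiles run in parallel on local data.
        results = list(pool.map(compute_phase, range(NUM_TILES), tiles))
    # pool.map only returns once every tile is done -- the implicit barrier.
    tiles = exchange_phase(results)  # exchange phase, then repeat
```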

“We can fit enough memory to hold large and complex machine learning models entirely on chip,” says Knowles. “And the ability to access them at 100x the bandwidth and 1/100th the latency delivers great performance dividends. The final part of the jigsaw is that IPUs are designed to be clustered together such that they appear to software like one larger chip, so if you need huge models you can distribute them over multiple chips.”
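To make "distributing a model over multiple chips" concrete, here is a hypothetical sketch that greedily places consecutive layers onto devices by memory footprint. The function name, memory figures and placement strategy are invented for illustration and are not Graphcore's software:

```python
def place_layers(layer_sizes_mb, per_device_memory_mb):
    # Greedily assign consecutive layers to devices so that each
    # device's on-chip memory holds its share of the model.
    placement, used = [[]], 0
    for i, size in enumerate(layer_sizes_mb):
        if used + size > per_device_memory_mb:
            placement.append([])  # spill onto the next device
            used = 0
        placement[-1].append(i)
        used += size
    return placement

# e.g. a model whose six layers total 900MB, on devices with
# 300MB of on-chip memory each (figures invented for the example)
print(place_layers([120, 200, 150, 180, 130, 120], 300))
# -> [[0], [1], [2], [3], [4, 5]]
```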

The early data shows speed increases of 2x to 184x across a range of different AI algorithms, with results that scale across multiple chips.

“We understood from the beginning that a full solution requires more than just a new chip design,” said Dave Lacey, Distinguished Engineer (Software) at GraphCore. “The software infrastructure needs to be comprehensive and easy to use to allow machine learning developers to quickly adapt the hardware to their needs. As a result, we have been focused on bringing up a full software stack early to ensure that the IPU can be used for real applications from the outset.”

“The performance gain for convolutional neural networks is substantial”


The tests, using convolutional neural networks (CNNs) for image processing on the Graphcore C2 accelerator card, show linear performance increases across eight cards.

“The performance gain for CNNs is substantial,” said Lacey. “When we scale up to eight C2 accelerator cards, using an IPU system is a substantial performance leap over existing technologies. For example, the best performance reported on a 300W GPU accelerator (the same power budget as a C2 accelerator) is approximately 580 images per second, compared to 2,000 images per second for a single IPU and 16,000 images per second for the cluster of cards.”
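A quick back-of-envelope check of those figures (all numbers taken from the quote above):

```python
gpu_img_per_s = 580         # best reported 300W GPU accelerator
ipu_img_per_s = 2_000       # single IPU
cluster_img_per_s = 16_000  # eight C2 accelerator cards

print(f"single IPU vs GPU: {ipu_img_per_s / gpu_img_per_s:.1f}x")      # ~3.4x
print(f"cluster vs GPU:    {cluster_img_per_s / gpu_img_per_s:.1f}x")  # ~27.6x
print(f"cluster vs 1 IPU:  {cluster_img_per_s / ipu_img_per_s:.1f}x")  # 8.0x
```

The 8.0x gain over eight cards is consistent with the linear scaling reported above.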

Recurrent networks are used to process sequence data in applications such as language translation and text-to-speech. LSTM (long short-term memory) networks contain data dependencies that challenge current chip architectures: each step depends on the output of the previous one, which limits both the parallelism available and the number of operations performed per data fetch from memory. The IPU handles these limitations better through its large on-chip memory and the flexibility of compute and data movement within the chip. It can handle over 40,000 inferences per second at a low latency of 2ms, compared to a few hundred inferences per second for a GPU.
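To see why those data dependencies bite, here is a minimal NumPy sketch of single-batch LSTM inference. It is a generic textbook formulation, not Graphcore's implementation; the point is that the timestep loop cannot be parallelised, because each step consumes the hidden state produced by the previous one:

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def lstm_step(x, h, c, W, U, b):
    # One LSTM timestep: the gates depend on the *previous* hidden state h,
    # so step t cannot begin until step t-1 has finished.
    z = W @ x + U @ h + b          # all four gates in one fused matmul
    i, f, o, g = np.split(z, 4)
    c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)   # new cell state
    h = sigmoid(o) * np.tanh(c)                    # new hidden state
    return h, c

hidden, inputs, steps = 64, 32, 10   # toy sizes, chosen for illustration
rng = np.random.default_rng(0)
W = rng.standard_normal((4 * hidden, inputs))
U = rng.standard_normal((4 * hidden, hidden))
b = np.zeros(4 * hidden)
h = c = np.zeros(hidden)
for x in rng.standard_normal((steps, inputs)):   # inherently sequential loop
    h, c = lstm_step(x, h, c, W, U, b)
```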

For generative neural networks that are used for text-to-speech applications, the IPU can provide good quality voice at 16kHz, beating both CPU and GPU designs.
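For context, generating 16kHz audio sample by sample, as autoregressive text-to-speech models typically do, leaves a very small time budget per network pass (the per-sample assumption is ours, for illustration):

```python
sample_rate_hz = 16_000
budget_us = 1_000_000 / sample_rate_hz   # microseconds available per sample
print(f"per-sample time budget: {budget_us:.1f} us")   # 62.5 us
```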

This is just the start, says Lacey, as the software will improve in efficiency over time on the production chips, which are set to sample next month.

There is more information on the Poplar programming environment for machine learning at www.graphcore.ai, and you can keep up with Graphcore's latest developments by following @graphcoreai on Twitter.