The next AI battle: who can collect the most Nvidia chips in one place

Companies that run large data centers have spent the past two years vying to buy the artificial-intelligence processors that are Nvidia’s specialty. Now some of the most ambitious players are stepping up their efforts, building so-called superclusters of computer servers that cost billions of dollars and contain unprecedented numbers of Nvidia’s cutting-edge chips.

Elon Musk’s xAI built a supercomputer called Colossus in Memphis in a matter of months with 100,000 of Nvidia’s Hopper AI chips. Meta CEO Mark Zuckerberg said last month that his company is already training its most advanced artificial intelligence models on a cluster of chips that he called “bigger than anything I’ve seen others do.”

A year ago, clusters of tens of thousands of chips were considered very large. UBS analysts estimate that OpenAI used about 10,000 Nvidia chips to train the version of ChatGPT it released in late 2022.

This push toward larger superclusters could help Nvidia maintain a growth trajectory that has seen its quarterly revenue rise from about $7 billion two years ago to more than $35 billion today. The jump helped make it the most valuable publicly traded company in the world, with a market capitalization of more than $3.5 trillion.

Installing many chips in one place, connected by ultra-fast network cables, has so far allowed the creation of larger AI models at higher speeds. But there are questions about whether ever-larger superclusters will keep translating into smarter chatbots and more convincing image-generation tools.

The continuation of Nvidia’s artificial intelligence boom also depends in large part on how successful these biggest chip clusters prove to be. The trend not only promises a wave of purchases of its chips, but also drives demand for Nvidia’s networking equipment, which is fast becoming a significant business that generates billions of dollars in sales annually.

Nvidia CEO Jensen Huang said on a call with analysts after the company’s earnings report on Wednesday that so-called foundation models still have plenty of room for improvement with larger computing setups. He predicted continued investment as the company transitions to its next-generation artificial intelligence chips, called Blackwell, which are several times as powerful as the current chips.

Huang said that while the largest clusters for training giant AI models now contain about 100,000 of Nvidia’s current chips, “the next generation starts with about 100,000 Blackwell processors. And that gives you an idea of where the industry is going.”

The stakes are high for companies like xAI and Meta, which compete with each other for computing bragging rights but also bet that having more Nvidia chips, called GPUs, will lead to commensurately better artificial intelligence models.

“There’s no evidence that this scales to a million chips and a $100 billion system, but the observation is that it scales very well from tens of chips to 100,000,” said Dylan Patel, chief analyst at SemiAnalysis, a research firm.

In addition to xAI and Meta, OpenAI and Microsoft are working to create significant new computing power for artificial intelligence. Google is building huge data centers to house the chips that drive its artificial intelligence strategy.

In a podcast last month, Huang marveled at the speed with which Musk built his Colossus cluster and confirmed that more, larger clusters are on the way. He pointed to efforts to train models distributed across multiple data centers.

“Do we think we need millions of GPUs? Without a doubt. Now that is certain,” Huang said. “And the question is how do we design it from a data center perspective.”

Superclusters of unprecedented size are already in the works. Last month, Musk posted on his social media platform X that his 100,000-chip Colossus supercluster was “soon to become” a 200,000-chip cluster in the same building, with plans for an even larger cluster of the latest Nvidia chips next summer.

The rise of superclusters comes as their operators prepare for the release of Blackwell chips, which will begin shipping in the next couple of months. They are estimated to cost around $30,000 each, meaning a cluster of 100,000 would cost $3 billion, not including the cost of the power-generation infrastructure and IT equipment around the chips.

Those dollar figures make building superclusters with ever-increasing numbers of chips something of a gamble, industry officials say, since it’s unclear whether they can improve AI models to a degree that justifies their cost.

New engineering challenges also often arise when working with larger clusters. Meta researchers reported in a July paper that a cluster of more than 16,000 Nvidia GPUs regularly suffered unexpected failures of chips and other components as the company spent 54 days training an improved version of its Llama model.

Keeping Nvidia’s chips cool is a major challenge as clusters of power-hungry chips become ever more tightly packed together, industry executives say, driving a shift toward liquid cooling, in which coolant is piped directly to the chips to keep them from overheating.

And the sheer size of superclusters requires a higher level of management of the chips when failures occur. Mark Adams, chief executive of Penguin Solutions, a company that helps set up and operate computing infrastructure, says the increased complexity of managing large clusters of chips inevitably leads to problems.

“When you look at all the things that can go wrong, you could be eating up half your capital expenditure,” he said.

Write to Asa Fitch at [email protected].
