Elon Musk to double power of world's largest AI supercomputer

xAI announces plan to cram another 100,000 Nvidia Hopper GPUs into its Colossus supercomputer cluster.

Elon Musk's xAI has announced plans to bring the total number of Nvidia GPUs in its Colossus supercomputer to 200,000.

Currently, the mega-machine is equipped with a mammoth 100,000 Hopper GPUs - data centre-focused units that, as luck would have it, belong to a product line Nvidia once branded Tesla.

Colossus is based in Memphis, Tennessee, and is described as the world's largest AI supercomputer. It is used to train Grok, Elon's large language model (LLM). Once the extra 100,000 GPUs are installed, the machine will arguably be twice as powerful as it is today (although it's not entirely clear that throwing ever more compute at a model makes it more effective at a given task).

The GPUs will be connected to the Nvidia Spectrum-X Ethernet networking platform, which is "designed to deliver superior performance to multi-tenant, hyperscale AI factories" using standards-based Ethernet for its Remote Direct Memory Access (RDMA) network.


“Colossus is the most powerful training system in the world,” said Elon Musk on X. “Nice work by xAI team, NVIDIA and our many partners/suppliers.”

In a press release about the project, Nvidia also confirmed Elon's previous claims that Colossus broke the record for the world's fastest-ever AI supercomputer rollout.

Nvidia and xAI summoned the beast of Memphis in just 122 days, rather than the months or years similar supercomputers typically take to deploy. Just 19 days passed between the first rack being installed and the start of AI model training.

During the training of the "extremely large" Grok model, Colossus "achieves unprecedented network performance" across all three tiers of network fabric, Nvidia reports, with "zero application latency degradation or packet loss due to flow collisions" and 95% data throughput enabled by Spectrum-X's "congestion control".

Nvidia was quick to note that this "level of performance cannot be achieved at scale with standard Ethernet, which creates thousands of flow collisions while delivering only 60% data throughput."
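To put those percentages in context, here is a rough back-of-envelope sketch of the aggregate bandwidth gap they imply. The 400 Gb/s per-GPU link rate is an assumption for illustration only (Nvidia does not give a per-link figure in the announcement), as is the simplification that all 100,000 GPUs drive their links at once.

```python
# Back-of-envelope comparison of usable cluster bandwidth at the
# throughput figures Nvidia cites: 95% with Spectrum-X vs. 60% with
# standard Ethernet. The per-GPU link rate below is an assumption
# for illustration, not a number from the announcement.

GPUS = 100_000       # current Colossus GPU count
LINK_GBPS = 400      # assumed per-GPU network link rate (Gb/s)

def effective_tbps(gpu_count: int, link_gbps: int, efficiency: float) -> float:
    """Aggregate usable bandwidth in terabits per second."""
    return gpu_count * link_gbps * efficiency / 1_000

spectrum_x = effective_tbps(GPUS, LINK_GBPS, 0.95)
standard_eth = effective_tbps(GPUS, LINK_GBPS, 0.60)

print(f"Spectrum-X:        {spectrum_x:,.0f} Tb/s usable")
print(f"Standard Ethernet: {standard_eth:,.0f} Tb/s usable")
print(f"Difference:        {spectrum_x - standard_eth:,.0f} Tb/s")
```

On those assumptions, the gap works out to roughly 14 petabits per second of usable bandwidth across the cluster - the sort of difference Nvidia is pointing at when it says standard Ethernet cannot achieve this performance at scale.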

“AI is becoming mission-critical and requires increased performance, security, scalability and cost-efficiency,” said Gilad Shainer, senior vice president of networking at NVIDIA. “The NVIDIA Spectrum-X Ethernet networking platform is designed to provide innovators such as xAI with faster processing, analysis and execution of AI workloads, and in turn accelerates the development, deployment and time to market of AI solutions.”

British readers may note that Colossus is also the name of the world's first programmable electronic digital computer, which was once very super indeed. Housed at Bletchley Park, Colossus was developed by codebreakers during World War II and used thermionic valves (vacuum tubes), which are now more likely to be found in guitar amps than in cutting-edge technology.

In stark contrast to the open approach of many AI researchers and firms (particularly those that don't have the word "open" in their names), the original Colossus was developed in absolute secrecy, potentially hampering the UK's involvement in the computing revolution of the 20th century.

Do you have anything to say about AI scaling laws and whether bigger is better (or worse) for AI models? Get in touch with jasper@thestack.technology to let us know.

