Oracle and AMD Collaborate to Launch Ultra-large Scale AI Supercomputers


Oracle and AMD jointly announced on Thursday that AMD’s latest Instinct MI355X GPUs will be available on Oracle Cloud Infrastructure (OCI), offering more than twice the price-performance of the previous generation for large-scale AI training and inference workloads. OCI will build a zettascale AI supercomputing cluster accelerated by up to 131,072 MI355X GPUs to support customers’ large-scale AI development.

Mahesh Thiagarajan, Executive Vice President of OCI, said: “We are committed to providing the widest range of AI infrastructure options. The combination of AMD accelerators with OCI’s high-performance networking and flexible architecture will meet customers’ training and inference needs for new intelligent agent applications.” The solution uses a high-throughput, ultra-low-latency RDMA cluster network architecture. Compared with the previous generation, the MI355X delivers nearly three times the compute performance and 50% more high-bandwidth memory.

Forrest Norrod, Executive Vice President of AMD’s Data Center Solutions Division, said the collaboration between the two companies provides customers with open, efficient, and flexible solutions. The new generation of AMD accelerators and Pollara network cards will support more AI inference, fine-tuning, and training scenarios.

MI355X core advantages

The new platform offers 288 GB of HBM3E memory and 8 TB/s of memory bandwidth per GPU, supports the 4-bit floating-point (FP4) format, and uses a liquid-cooled design that supports a rack power density of 125 kW. Each rack holds 64 GPUs at 1,400 watts each, paired with AMD Turin high-frequency CPUs (up to 3 TB of system memory) for efficient job scheduling. Customers can migrate existing code using AMD’s open-source ROCm software stack and use the advanced RoCE capabilities of Pollara smart network cards to build high-performance networks.
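
For a sense of scale, the figures quoted above can be combined in a rough back-of-the-envelope calculation. The Python sketch below is illustrative only: it assumes the full 131,072-GPU cluster is built entirely from identical 64-GPU, 125 kW racks, which the announcement does not state, and the function and constant names are made up for this example.

```python
# Back-of-the-envelope arithmetic from the figures quoted in the article.
# Assumption (not stated by Oracle or AMD): the maximum cluster consists
# entirely of identical racks with 64 GPUs and a 125 kW power budget.

GPUS_PER_RACK = 64        # GPUs per rack (from the article)
GPU_POWER_W = 1400        # watts per GPU (from the article)
RACK_POWER_KW = 125       # rack power density in kW (from the article)
HBM_PER_GPU_GB = 288      # memory per GPU in GB (from the article)
MAX_GPUS = 131072         # maximum cluster size (from the article)


def rack_gpu_power_kw() -> float:
    """Power drawn by the GPUs alone in one rack, in kW."""
    return GPUS_PER_RACK * GPU_POWER_W / 1000


def racks_needed() -> int:
    """Number of racks needed to host the maximum GPU count."""
    return MAX_GPUS // GPUS_PER_RACK


def total_hbm_pb() -> float:
    """Aggregate GPU memory across the cluster, in decimal petabytes."""
    return MAX_GPUS * HBM_PER_GPU_GB / 1_000_000


def cluster_power_mw() -> float:
    """Upper-bound rack power for the whole cluster, in MW."""
    return racks_needed() * RACK_POWER_KW / 1000


if __name__ == "__main__":
    print(f"GPU power per rack: {rack_gpu_power_kw():.1f} kW of {RACK_POWER_KW} kW")
    print(f"Racks for {MAX_GPUS} GPUs: {racks_needed()}")
    print(f"Aggregate GPU memory: {total_hbm_pb():.1f} PB")
    print(f"Rack power at full scale: {cluster_power_mw():.0f} MW")
```

Under these assumptions, the GPUs account for about 89.6 kW of each rack’s 125 kW budget, leaving the remainder for CPUs, networking, and cooling, and the maximum configuration would span 2,048 racks.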