Enhancing AI Training Networks with MRC (Multipath Reliable Connection)

Summary
OpenAI, in collaboration with AMD, Broadcom, Intel, Microsoft, and NVIDIA, has introduced the MRC (Multipath Reliable Connection) protocol to improve GPU networking performance and resilience in AI training clusters. Released through the Open Compute Project, MRC uses multi-plane network topologies and adaptive packet spraying to mitigate congestion and route around failures. The protocol signals a strategic shift in AI infrastructure design toward redundancy and failure management as the foundation for scalable AI systems.
Key Updates
- OpenAI partnered with AMD, Broadcom, Intel, Microsoft, and NVIDIA to develop the MRC protocol.
- MRC sprays packets adaptively across multi-plane networks to handle congestion and route around failures (see the sketch after this list).
- The protocol was released through the Open Compute Project (OCP).
- MRC uses static source routing to bypass failed paths, eliminating entire classes of routing failures.
- The Stargate supercomputer, built on Oracle Cloud Infrastructure, is part of the network design for large AI model training.
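The OCP release describes these mechanisms at a high level, and the MRC wire format is not reproduced here. The Python sketch below is a toy model only: every name in it (Plane, SpraySender, queue_depth) is hypothetical. It illustrates the two ideas the bullets pair together: spraying packets across independent network planes weighted by a simple congestion signal, and attaching a precomputed static source route to each packet so a failed plane is bypassed at the sender rather than re-routed hop by hop.

```python
import random
from dataclasses import dataclass

@dataclass
class Plane:
    """One independent network plane with a precomputed source route."""
    name: str
    route: list           # hop-by-hop path, fixed at the source
    healthy: bool = True
    queue_depth: int = 0  # stand-in for a congestion signal

class SpraySender:
    """Toy sender: sprays packets across healthy planes, biased
    away from congested ones. Not the actual MRC protocol."""

    def __init__(self, planes):
        self.planes = planes

    def pick_plane(self):
        candidates = [p for p in self.planes if p.healthy]
        if not candidates:
            raise RuntimeError("all planes failed")
        # Adaptive spraying: weight plane choice by inverse queue depth,
        # so congested planes receive proportionally less traffic.
        weights = [1.0 / (1 + p.queue_depth) for p in candidates]
        return random.choices(candidates, weights=weights, k=1)[0]

    def send(self, packet_id):
        plane = self.pick_plane()
        plane.queue_depth += 1  # pretend the packet occupies a queue slot
        # Static source routing: the full path travels with the packet,
        # so no in-network routing decision is made (or can fail) mid-flight.
        return {"packet": packet_id, "plane": plane.name, "route": plane.route}

planes = [
    Plane("plane-0", ["tor-0", "spine-0", "tor-4"]),
    Plane("plane-1", ["tor-1", "spine-1", "tor-5"]),
]
sender = SpraySender(planes)
planes[0].healthy = False  # simulate a plane failure
print(sender.send(42))     # traffic shifts entirely to plane-1
```

In a real multi-plane fabric the congestion signal would come from NIC or switch telemetry rather than a local counter, but the design point is the same: the sender, not the network core, owns both load balancing and failure bypass.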
Why It Matters
MRC points to a deeper infrastructure shift in AI: frontier model training is increasingly constrained by networking reliability, cluster efficiency, and failure handling, not only by GPU availability or model architecture.
As AI systems scale, the network connecting thousands of accelerators becomes part of the performance ceiling. Releasing the protocol through the Open Compute Project suggests that training infrastructure is becoming a shared industry concern, especially for organizations building or operating large AI clusters.
For most builders, this is not an immediate implementation trigger. But it is a useful signal: the next phase of AI infrastructure may depend as much on resilient data movement and congestion management as on faster chips.
Builder Takeaway
Builders working close to AI infrastructure should watch how this develops across hardware vendors, cloud providers, and open compute environments. For teams not operating large training clusters, the practical lesson is broader: AI performance is becoming a full-stack infrastructure problem, where networking, reliability, and operational efficiency matter alongside models and GPUs.
Want more builder-focused AI and infrastructure signals?
Follow UniQubit Tech Radar or contact UniQubit about the systems you are building.