Building a supercomputer is always hard, but creating the industry’s first exascale-class system means running into the wholly unexpected and requires a lot of work on both hardware and software. Unfortunately, this appears to be happening with Oak Ridge National Laboratory’s Frontier supercomputer, which can barely last a day without numerous hardware failures.
ORNL’s Frontier is the industry’s first system designed to deliver up to 1.685 FP64 ExaFLOPS of peak performance, using AMD’s 64-core EPYC Trento processors, Instinct MI250X compute GPUs, and HPE’s Slingshot interconnects at 21 MW of power. HPE built the system on its Cray EX architecture, which is designed for scale-out applications, primarily ultra-fast supercomputers.
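As a rough illustration of what those two headline figures imply, the short sketch below divides the quoted peak performance by the quoted power. It assumes the 21 MW figure applies at peak load, which the article does not state, so the result is a back-of-the-envelope peak number and not a measured efficiency rating. The function name is made up for this example.

```python
# Figures quoted in the article.
PEAK_FLOPS = 1.685e18   # 1.685 FP64 ExaFLOPS, peak
POWER_WATTS = 21e6      # 21 MW


def gflops_per_watt(flops: float, watts: float) -> float:
    """Return performance per watt in GFLOPS/W (hypothetical helper)."""
    if watts <= 0:
        raise ValueError("watts must be positive")
    return flops / watts / 1e9


# Assumption: the 21 MW power draw corresponds to peak performance.
efficiency = gflops_per_watt(PEAK_FLOPS, POWER_WATTS)
print(f"Peak efficiency: about {efficiency:.1f} FP64 GFLOPS per watt")
```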
While the Frontier supercomputer looks exceptionally good on paper and its hardware has been delivered, hardware problems appear to keep the machine from coming online and becoming available to researchers who need performance of around 1 FP64 ExaFLOPS.
“We’re working through issues in hardware and making sure that we understand (what they are),” said Justin Whitt, program director for the Oak Ridge Leadership Computing Facility (OLCF), in an interview with InsideHPC. “You’ll have failures at this scale. Mean time between failure on a system this size is hours, it’s not days.”
Rumors about potential hardware failures in Frontier have been floating around for quite a while now. Some said that the system experienced problems with the Slingshot interconnect, according to another InsideHPC story. In addition, others indicated that AMD’s Instinct MI250X compute GPUs were not as reliable as expected this year. Remember that the X version, with a higher number of stream processors and high clocks, is only available to select customers.
Mr. Whitt did not confirm that the system is experiencing any particular issues with Instinct or Slingshot, but he stressed that the machine suffers from numerous hardware issues.
“A lot of challenges are focused around those [GPUs], but that’s not the majority of the challenges that we’re seeing,” the head of OLCF said. “It’s a pretty good spread among common culprits of parts failures that have been a big part of it. I don’t think that at this point that we have a lot of concern over the AMD products.”
Oak Ridge National Laboratory’s Frontier supercomputer is far from the only system to use HPE’s Cray EX architecture with Slingshot interconnects, AMD’s EPYC CPUs, and AMD’s Instinct compute GPUs. For example, Finland’s Lumi supercomputer (Cray EX, EPYC Milan, Instinct MI250X compute GPUs) delivers 550 PetaFLOPS of peak performance and is officially ranked as the world’s third most powerful supercomputer. Perhaps the problem lies in the scale of the machine, which uses 60 million parts in total.
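To see why scale alone could produce a mean time between failures measured in hours, consider a simplified model: if every part fails independently at a constant rate and any single failure counts as a system failure, the system-level MTBF is the per-part MTBF divided by the number of parts. The sketch below applies that model to the 60 million parts mentioned above. The per-part MTBF values are hypothetical placeholders, not measured Frontier data, and real parts neither share one failure rate nor fail independently.

```python
HOURS_PER_YEAR = 8760
PARTS = 60_000_000  # part count quoted in the article


def system_mtbf_hours(part_mtbf_hours: float, n_parts: int) -> float:
    """System MTBF for n identical, independent parts in series.

    Assumes constant (exponential) failure rates, so the rates add up:
    system rate = n_parts / part_mtbf_hours.
    """
    if part_mtbf_hours <= 0 or n_parts <= 0:
        raise ValueError("part_mtbf_hours and n_parts must be positive")
    return part_mtbf_hours / n_parts


# Hypothetical per-part MTBF values, chosen only for illustration.
for part_mtbf_years in (1_000, 10_000, 100_000):
    part_hours = part_mtbf_years * HOURS_PER_YEAR
    hours = system_mtbf_hours(part_hours, PARTS)
    print(f"per-part MTBF {part_mtbf_years:>7,} years -> system MTBF {hours:.2f} hours")
```

Under these assumptions, even parts that individually last tens of thousands of years on average would leave the whole machine with a failure every few hours, which is consistent with the order of magnitude Mr. Whitt describes.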
Only time will tell whether the Frontier supercomputer, which was originally promised to come online in 2022, will be available to researchers starting in 2023, given that it is still not officially deployed.