Just days before Supercomputing 22 kicks off, Intel launched its next-generation Xeon Max CPUs, previously codenamed Sapphire Rapids HBM, and its Data Center GPU Max Series compute GPUs, known as Ponte Vecchio. The new products cater to different types of high-performance computing workloads, or work together to tackle the most complex supercomputing tasks.
The Xeon Max CPU: Sapphire Rapids Gets 64GB of HBM2E
General-purpose x86 processors have been used for virtually all kinds of technical computing for decades and therefore support a vast range of applications. However, while the performance of general-purpose CPU cores has scaled rather quickly for years, today's processors face two significant limitations in artificial intelligence and HPC workloads: parallelization and memory bandwidth. Intel's Xeon Max 'Sapphire Rapids HBM' processors promise to remove both barriers.
Intel's Xeon Max processor features up to 56 high-performance Golden Cove cores (spread over four chiplets interconnected using Intel's EMIB technology), further enhanced with several accelerator engines for AI and HPC workloads, plus 64GB of on-package HBM2E memory. Like other Sapphire Rapids CPUs, the Xeon Max still supports eight channels of DDR5 memory and a PCIe Gen 5 interface with the CXL 1.1 protocol on top, so it will be able to attach CXL-enabled accelerators where that makes sense.
In addition to AVX-512 vector and Deep Learning Boost (AVX512_VNNI and AVX512_BF16) accelerator support, the new cores also bring the Advanced Matrix Extensions (AMX) tiled matrix multiplication accelerator, which is essentially a grid of fused multiply-add units supporting BF16 and INT8 input types. It can be programmed using only 12 instructions and performs up to 1024 TMUL BF16 or 2048 TMUL INT8 operations per cycle per core. The new CPU also supports the Data Streaming Accelerator (DSA), which offloads data copy and transformation work from the CPU cores.
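To put those per-cycle figures in perspective, here is a back-of-envelope estimate of chip-level AMX throughput. The per-cycle rates come from the figures above; the sustained clock is a placeholder assumption, not an Intel specification:

```python
# Back-of-envelope AMX throughput estimate for a 56-core Xeon Max.
# The per-cycle rates are Intel's stated AMX figures; the sustained
# all-core clock below is an assumed placeholder, not a spec.

CORES = 56
OPS_PER_CYCLE_BF16 = 1024   # TMUL BF16 operations per cycle per core
OPS_PER_CYCLE_INT8 = 2048   # TMUL INT8 operations per cycle per core
CLOCK_GHZ = 2.0             # assumed sustained AMX clock (hypothetical)

def peak_tops(ops_per_cycle: int) -> float:
    """Peak chip-level throughput in tera-operations per second."""
    return CORES * ops_per_cycle * CLOCK_GHZ * 1e9 / 1e12

print(f"BF16: {peak_tops(OPS_PER_CYCLE_BF16):.1f} TOPS")
print(f"INT8: {peak_tops(OPS_PER_CYCLE_INT8):.1f} TOPS")
```

At the assumed 2GHz, that works out to roughly 115 BF16 TOPS and 229 INT8 TOPS for the whole chip; actual sustained clocks under AMX load will determine the real figure.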
64GB of on-package HBM2E memory (four 16GB stacks) provides a peak bandwidth of around 1TB/s, which translates to ~1.14GB of HBM2E per core at 18.28 GB/s per core. To put those numbers into context, a 56-core Sapphire Rapids processor equipped with eight DDR5-4800 modules gets up to 307.2 GB/s of bandwidth, or 5.485 GB/s per core. Meanwhile, Xeon Max can use its HBM2E memory in several ways: as system memory, which requires no code changes; as a high-performance cache for the DDR5 memory subsystem, which also requires no code changes; or as part of a unified memory pool (HBM flat mode), which involves software optimizations.
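The per-core arithmetic above can be reproduced directly from the published capacities and channel counts:

```python
# Reproducing the article's per-core bandwidth arithmetic.

CORES = 56

# HBM2E: four 16GB stacks, ~1 TB/s (1024 GB/s) aggregate peak bandwidth.
hbm_capacity_gb = 4 * 16
hbm_bandwidth_gbs = 1024
print(hbm_capacity_gb / CORES)      # ~1.14 GB of HBM2E per core
print(hbm_bandwidth_gbs / CORES)    # ~18.28 GB/s per core

# DDR5-4800 across eight channels: 4.8 GT/s x 8 bytes x 8 channels.
ddr5_bandwidth_gbs = 4.8 * 8 * 8    # 307.2 GB/s
print(ddr5_bandwidth_gbs / CORES)   # ~5.49 GB/s per core
```

The roughly 3.3x jump in per-core bandwidth is the whole point of putting HBM2E on the package.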
Depending on the workload, Intel says its AMX-enabled Xeon Max processor can provide a 3X – 5.3X performance improvement over the currently available Xeon Scalable 8380 processor running the same workloads with conventional FP32 processing. Meanwhile, in applications like model development for molecular dynamics, the new HBM2E-equipped CPUs are up to 2.8X faster than AMD's EPYC 7773X, which features 3D V-Cache.
But HBM2E has another important implication for Intel: it significantly reduces data movement overhead between CPU and GPU, which is crucial for numerous HPC workloads. That brings us to the second of today's announcements: the Data Center GPU Max Series compute GPUs.
The Data Center GPU Max: The Pinnacle of Intel's Datacenter Innovations
Intel's Data Center GPU Max compute GPU series employs the company's Ponte Vecchio architecture, first introduced in 2019 and then detailed through 2020 – 2021. Intel bills Ponte Vecchio as its most complex processor ever created, as it packs over 100 billion transistors (not including memory) across 47 tiles (including 8 HBM2E tiles). In addition, the product makes extensive use of Intel's advanced packaging technologies (e.g., EMIB), as different tiles are made by different manufacturers using different process technologies.
Intel's Data Center GPU Max compute GPUs rely on the company's Xe-HPC architecture, tailored explicitly for AI and HPC workloads, and therefore support the appropriate data formats and instructions as well as 512-bit vector and 4096-bit matrix (tensor) engines.
| | Data Center Max 1100 | Data Center Max 1350 | Data Center Max 1550 | AMD Instinct MI250X | Nvidia H100 SXM | Nvidia H100 PCIe | Rialto Bridge |
|---|---|---|---|---|---|---|---|
| Form Factor | PCIe | OAM | OAM | OAM | SXM | PCIe | OAM |
| Tiles + Memory | ? | ? | 39 + 8 | 2 + 8 | 1 + 6 | 1 + 6 | many |
| Transistors | ? | ? | 100 billion | 58 billion | 80 billion | 80 billion | lots of them |
| Xe-HPC Cores / Compute Units | 56 | 112 | 128 | 220 | 132 | 114 | 160 (enhanced) |
| RT Cores | 56 | 112 | 128 | – | – | – | ? |
| 512-bit Vector Engines | 448 | 896 | 1024 | ? | ? | ? | ? |
| 4096-bit Matrix Engines | 448 | 896 | 1024 | ? | ? | ? | ? |
| L1 Cache | ? | ? | 64MB at 105 TB/s | ? | ? | ? | ? |
| L2 (Rambo) Cache | ? | ? | 408MB at 13 TB/s | ? | 50MB | 50MB | ? |
Compared to Xe-HPG, Xe-HPC has a considerably more sophisticated memory and caching subsystem and differently configured Xe cores: each Xe-HPG core features 16 256-bit vector and 16 1024-bit matrix engines, whereas each Xe-HPC core sports eight 512-bit vector and eight 4096-bit matrix engines. Additionally, Xe-HPC GPUs do not feature texture units or render back-ends, so they cannot render graphics using traditional methods. Meanwhile, Xe-HPC surprisingly supports ray tracing for supercomputer visualization.
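A quick sketch shows why the reconfiguration is a rebalancing rather than a widening, assuming 32-bit lanes and one fused multiply-add (2 FLOPs) per lane per cycle:

```python
# Rough per-core FP32 vector comparison between Xe-HPG and Xe-HPC,
# assuming 32-bit lanes and one FMA (2 FLOPs) per lane per cycle.

def vector_flops_per_cycle(engines: int, engine_width_bits: int) -> int:
    lanes = engine_width_bits // 32        # FP32 lanes per engine
    return engines * lanes * 2             # 2 FLOPs per FMA

xe_hpg = vector_flops_per_cycle(16, 256)   # 16 x 256-bit engines per core
xe_hpc = vector_flops_per_cycle(8, 512)    # 8 x 512-bit engines per core
print(xe_hpg, xe_hpc)  # both 256 FP32 FLOPs per cycle per core
```

Under those assumptions, both core layouts deliver the same raw FP32 vector throughput per cycle; the Xe-HPC arrangement trades engine count for width, and its much larger 4096-bit matrix engines carry the AI workload.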
One of the most important parts of Xe-HPC is Intel's Xe Matrix Extensions (XMX), which enable the rather formidable tensor/matrix performance of Intel's Data Center GPU Max 1550 (see the table below): up to 419 TF32 TFLOPS and up to 1678 INT8 TOPS, according to Intel. Of course, peak performance numbers provided by compute GPU makers matter, but they may not reflect the performance achievable on real-world supercomputers running real-world applications. Still, we cannot help but notice that Intel's range-topping Ponte Vecchio is considerably behind Nvidia's H100 in general and fails to deliver tangible advantages over AMD's Instinct MI250X in all cases except FP32 Tensor (TF32).
| | Data Center Max 1550 | AMD Instinct MI250X | Nvidia H100 SXM | Nvidia H100 PCIe |
|---|---|---|---|---|
| Form Factor | OAM | OAM | SXM | PCIe |
| Memory | 128GB HBM2E at 3.2 TB/s | 128GB HBM2E at 3.2 TB/s | 80GB HBM3 at 3.35 TB/s | 80GB HBM2E at 2 TB/s |
| Power | 600W | 560W | 700W | 350W |
| Peak INT8 Vector | ? | 383 TOPS | 133.8 TOPS | 102.4 TOPS |
| Peak FP16 Vector | 104 TFLOPS | 383 TFLOPS | 134 TFLOPS | 102.4 TFLOPS |
| Peak BF16 Vector | ? | 383 TFLOPS | 133.8 TFLOPS | 102.4 TFLOPS |
| Peak FP32 Vector | 52 TFLOPS | 47.9 TFLOPS | 67 TFLOPS | 51 TFLOPS |
| Peak FP64 Vector | 52 TFLOPS | 47.9 TFLOPS | 34 TFLOPS | 26 TFLOPS |
| Peak INT8 Tensor | 1678 TOPS | ? | 1979 TOPS (3958 with sparsity) | 1513 TOPS (3026 with sparsity) |
Meanwhile, Intel says that its Data Center GPU Max 1550 is 2.4x faster than Nvidia's A100 on Riskfuel credit option pricing and offers a 1.5x performance improvement over the A100 for NekRS virtual reactor simulations.
Intel plans to offer three Ponte Vecchio products: the top-of-the-range Data Center GPU Max 1550 in an OAM form factor, featuring 128 Xe-HPC cores and 128GB of HBM2E memory and rated for up to 600W of thermal design power; the cut-down Data Center GPU Max 1350 in an OAM form factor with 112 Xe-HPC cores, 96GB of memory, and a 450W TDP; and the entry-level Data Center GPU Max 1100, a dual-wide FLFH card carrying a processor with 56 Xe-HPC cores and 48GB of HBM2E memory, rated for a 300W TDP.
Meanwhile, for its supercomputer clients, Intel will offer Max Series Subsystems with four OAM modules on a carrier board, rated for 1,800W and 2,400W TDPs.
Intel's Rialto Bridge: Enhancing the Max
In addition to formally unveiling its Data Center GPU Max compute GPUs, Intel today also gave a sneak peek at its next-generation data center GPU, codenamed Rialto Bridge, which arrives in 2024. This AI and HPC compute GPU will be based on enhanced Xe-HPC cores, presumably with a slightly different architecture, but will maintain compatibility with Ponte Vecchio-based applications. Unfortunately, that extra complexity will increase the TDP of the next-generation flagship compute GPU to 800W, though there will be simpler, less power-hungry versions.
Availability
One of the first customers to get both Intel Xeon Max and Intel Data Center GPU Max products will be Argonne National Laboratory, which is building its >2 ExaFLOPS Aurora supercomputer from over 10,000 blades using Xeon Max CPUs and Data Center GPU Max devices (two CPUs and six GPUs per blade). In addition, Intel and Argonne are finishing building Sunspot, Aurora's test and development system consisting of 128 production blades, which will be available to interested parties in late 2022. The Aurora supercomputer should come online in 2023.
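The >2 ExaFLOPS figure is roughly consistent with the per-blade configuration. A sanity check, using the Max 1550's 52 peak FP64 vector TFLOPS from the table above and treating the blade count as an assumption (Argonne says "over 10,000"):

```python
# Sanity-checking Aurora's >2 ExaFLOPS claim from the per-blade
# configuration. Blade count is an assumed lower bound; 52 TFLOPS
# is the Max 1550's peak FP64 vector figure.

BLADES = 10_000           # article says "over 10,000"
GPUS_PER_BLADE = 6
FP64_TFLOPS_PER_GPU = 52

peak_exaflops = BLADES * GPUS_PER_BLADE * FP64_TFLOPS_PER_GPU / 1e6
print(f"~{peak_exaflops:.2f} EF peak FP64")
```

That yields roughly 3.1 EF of peak FP64 across 60,000+ GPUs, leaving comfortable headroom for the >2 EF target once real-world efficiency is factored in.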
Intel's partners among server makers will launch machines based on Xeon Max CPUs and Data Center GPU Max devices in January 2023.