On November 3, AMD revealed key details of its upcoming RDNA 3 GPU architecture and the Radeon RX 7900-series graphics cards. It was a public announcement that the entire world was invited to watch. Shortly after the announcement, AMD took press and analysts behind closed doors to dig a little deeper into what makes RDNA 3 tick. Or is it tock? Regardless.
We're allowed to talk about the additional RDNA 3 details and other briefings AMD provided now, which almost certainly has nothing to do with Nvidia's impending launch of the RTX 4080 on Wednesday. (That's sarcasm, just in case it wasn't clear. This sort of thing happens all the time with AMD and Nvidia, or AMD and Intel, and even Intel and Nvidia now that Team Blue has joined the GPU race.)
AMD's RDNA 3 architecture fundamentally changes several of the key design elements for GPUs, thanks to the use of chiplets. And that's as good a place to start as any. We also have separate articles covering AMD's Gaming and ISV Relations, Software and Platform details, and the Radeon RX 7900 Series graphics cards.
RDNA 3 and GPU Chiplets
Navi 31 consists of two core pieces, the Graphics Compute Die (GCD) and the Memory Cache Dies (MCDs). There are similarities to what AMD has done with its Zen 2/3/4 CPUs, but everything has been tailored to fit the needs of the graphics world.
For Zen 2 and later CPUs, AMD uses an Input/Output Die (IOD) that connects to system memory and provides all the necessary functionality for things like the PCI Express interface, USB ports, and more recently (Zen 4) graphics and video functionality. The IOD then connects to one or more Core Compute Dies (CCDs, alternatively "Core Complex Dies," depending on the day of the week) via AMD's Infinity Fabric, and the CCDs contain the CPU cores, cache, and other elements.
A key point in the design is that typical general computing algorithms, the stuff that runs on the CPU cores, will mostly fit within the various L1/L2/L3 caches. Modern CPUs up through Zen 4 only have two 64-bit memory channels for system RAM (though EPYC Genoa server processors can have up to twelve DDR5 channels).
The CCDs are small, and the IOD can range from around 125mm^2 (Ryzen 3000) to as large as 416mm^2 (EPYC xxx2 generation). Most recently, the Zen 4 Ryzen 7000-series CPUs have an IOD made using TSMC N6 that measures just 122mm^2, paired with one or two 70mm^2 CCDs manufactured on TSMC N5, while the EPYC xxx4 generation uses the same CCDs but with a relatively massive IOD measuring 396mm^2 (still made on TSMC N6).
GPUs have very different requirements. Large caches can help, but GPUs also really like having gobs of memory bandwidth to feed all the GPU cores. For example, even the beastly EPYC 9654 with a 12-channel DDR5 configuration 'only' delivers up to 460.8 GB/s of bandwidth. The fastest graphics cards like the RTX 4090 can easily double that.
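That EPYC figure is easy to sanity-check: peak DDR5 bandwidth is just channels times bus width times transfer rate. A quick sketch, assuming the official DDR5-4800 speed for EPYC Genoa:

```python
def ddr5_bandwidth_gbps(channels: int, mts: int, bus_bits: int = 64) -> float:
    """Peak DDR5 bandwidth in GB/s: channels x (bus width in bytes) x MT/s."""
    return channels * (bus_bits / 8) * mts / 1000

# EPYC Genoa: 12 channels of DDR5-4800
print(ddr5_bandwidth_gbps(12, 4800))  # 460.8, matching the quoted figure

# A two-channel desktop Zen 4 chip at the same speed manages only 76.8 GB/s
print(ddr5_bandwidth_gbps(2, 4800))
```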
In other words, AMD needed to do something different for GPU chiplets to work effectively. The solution ends up being almost the reverse of the CPU chiplets, with memory controllers and cache placed on multiple smaller dies while the main compute functionality resides in the central GCD chiplet.
The GCD houses all the Compute Units (CUs) along with other core functionality like video codec hardware, display interfaces, and the PCIe connection. The Navi 31 GCD has up to 96 CUs, which is where the typical graphics processing occurs. But it also has Infinity Fabric along the top and bottom edges (linked via some form of bus to the rest of the chip) that then connects to the MCDs.
The MCDs, as the name implies (Memory Cache Dies), primarily contain the large L3 cache blocks (Infinity Cache), plus the physical GDDR6 memory interface. They also have to contain Infinity Fabric links to connect to the GCD, which you can see in the die shot along the center-facing edge of the MCDs.
The GCD uses TSMC's N5 node and packs 45.7 billion transistors into a 300mm^2 die. The MCDs, meanwhile, are built on TSMC's N6 node, each packing 2.05 billion transistors on a chip that's only 37mm^2 in size. Cache and external interfaces are some of the elements of modern processors that scale the worst, and we can see that overall the GCD averages 152.3 million transistors per mm^2, while the MCDs only average 55.4 million transistors per mm^2.
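Those density figures fall straight out of the transistor counts and die areas quoted above:

```python
def density_mtx_per_mm2(transistors_billion: float, area_mm2: float) -> float:
    """Transistor density in millions of transistors per mm^2."""
    return transistors_billion * 1000 / area_mm2

gcd = density_mtx_per_mm2(45.7, 300)  # TSMC N5 compute die
mcd = density_mtx_per_mm2(2.05, 37)   # TSMC N6 memory/cache die
print(round(gcd, 1), round(mcd, 1))   # 152.3 vs 55.4
```

The nearly 3x gap is what you'd expect when one die is mostly dense logic on N5 and the other is mostly cache and physical interfaces on N6.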
AMD's High Performance Fanout Interconnect
Interconnect | Picojoules per Bit (pJ/b) |
---|---|
On-die | 0.1 |
Foveros | 0.2 |
EMIB | 0.3 |
UCIe | 0.25-0.5 |
Infinity Fabric (Navi 31) | 0.4 |
TSMC CoWoS | 0.56 |
Bunch of Wires (BoW) | 0.5-0.7 |
Infinity Fabric (Zen 4) | ??? |
NVLink-C2C | 1.3 |
Infinity Fabric (Zen 3) | 1.5 (?) |
One potential concern with a chiplet approach on GPUs is how much power all the Infinity Fabric links require; external chips almost always use more power. For example, the Zen CPUs use an organic substrate interposer that's relatively cheap to make, but it consumes 1.5 pJ/b (picojoules per bit). Scaling that up to a 384-bit interface would have consumed a fair amount of power, so AMD worked to refine the interface with Navi 31.
The result is what AMD calls the high performance fanout interconnect. The image above doesn't quite explain things clearly, but the larger interface on the left is the organic substrate interconnect used on Zen CPUs. To the right is the high performance fanout bridge used on Navi 31, "roughly to scale."
You can clearly see the 25 wires used for the CPUs, while the 50 wires used on the GPU equivalent are packed into a much smaller area, so you can't even see the individual wires. It's about 1/8 the height and width for the same purpose, meaning about 1/64 the total area. That, in turn, dramatically cuts power requirements, and AMD says all the Infinity Fanout links combined deliver 3.5 TB/s of effective bandwidth while only accounting for less than 5% of the total GPU power consumption.
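To see why the pJ/b figures in the table above matter, note that link power is simply energy per bit times bits per second. A rough sketch, applying the Zen 3-style 1.5 pJ/b organic substrate hypothetically to the same bandwidth for comparison:

```python
def link_power_watts(pj_per_bit: float, tb_per_s: float) -> float:
    """Interconnect power: (energy per bit) x (bits per second)."""
    bits_per_s = tb_per_s * 1e12 * 8  # TB/s -> bits/s
    return pj_per_bit * 1e-12 * bits_per_s

fanout = link_power_watts(0.4, 3.5)     # Navi 31 Infinity Fanout: ~11.2 W
substrate = link_power_watts(1.5, 3.5)  # same bandwidth over organic substrate: ~42 W
print(fanout, substrate)
```

At a 355 W board power, ~11 W is only about 3%, consistent with AMD's "less than 5%" claim, while the organic-substrate alternative would have cost nearly 4x as much power for the same bandwidth.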
There's an interesting quick aside here: all the Infinity Fabric logic on both the GCD and MCDs takes up a decent amount of die space. Looking at the die shot, the six Infinity Fabric interfaces on the GCD use about 9% of the die area, while the interfaces account for around 15% of the total die size on the MCDs.
Wipe out the Infinity Fabric interfaces and build the whole chip as a monolithic part on TSMC's N5 node, and it would probably only measure 400–425mm^2. Apparently, the cost of TSMC N5 is so much higher than N6 that it was worth taking the chiplet route, which says something about the rising costs of smaller fabrication nodes.
Related to this, we know that certain aspects of a chip design scale better with process shrinks than others. External interfaces, like the GDDR6 physical interface, have almost stopped scaling. Cache also tends to scale poorly. What will be interesting to see is whether AMD's next-generation GPUs (Navi 4x / RDNA 4) leverage the same MCDs as RDNA 3 while moving the GCD, presumably to the future TSMC N3 node.
RDNA 3 Architecture Upgrades
That takes care of the chiplet aspect of the design, so now let's get into the architectural changes to the various parts of the GPU. These can be broadly divided into four areas: general changes to the chip design, improvements to the GPU shaders (Stream Processors), updates to improve ray tracing performance, and improvements to the matrix operation hardware.
Looking at the raw specs, it might not seem like AMD has increased clock speeds all that much, but previously we only had the Game Clock figures. Now we can say that the boost clocks are higher, and in typical use, we expect AMD's RDNA 3 GPUs will exceed even the official boost clocks; they're conservative boosts, in other words.
AMD says that RDNA 3 has been architected to reach speeds of 3 GHz. The official boost clocks on the reference 7900 XTX / XT are well below that mark, but we also suspect AMD's reference designs focused more on maximizing efficiency. Third-party AIB cards could very well bump up power limits, voltages, and clock speeds quite a bit. Will we see 3 GHz out-of-factory overclocks? Perhaps, so we'll wait and see.
According to AMD, RDNA 3 GPUs can hit the same frequency as RDNA 2 GPUs while using half the power, or they can hit 1.3 times the frequency while using the same power. Of course, ultimately, AMD wants to balance frequency and power to deliver the best overall experience. Still, given that we see higher power limits on the 7900 XTX, we should also expect that to come with a decent bump to clock speeds and performance.
Another point AMD makes is that it has improved silicon utilization by roughly 20%. In other words, there were functional units on RDNA 2 GPUs where parts of the chip were frequently sitting idle even when the card was under full load. Unfortunately, we don't have a good way to measure this directly, so we'll take AMD's word on it, but ultimately this should result in higher performance.
Compute Unit Improvements
Outside of the chiplet stuff, many of the biggest changes occur within the Compute Units (CUs) and Workgroup Processors (WGPs). These include updates to the L0/L1/L2 cache sizes, more SIMD32 registers for FP32 and matrix workloads, and wider and faster interfaces between some elements.
AMD's Mike Mantor presented the above and following slides, which are dense! He basically talked non-stop for the better part of an hour, trying to cover everything that's been done with the RDNA 3 architecture, and that wasn't nearly enough time. The above slide covers the big-picture overview, but let's step through some of the details.
RDNA 3 comes with an enhanced Compute Unit pair, the dual CUs that became the main building block for RDNA chips. A cursory look at the above might not seem that different from RDNA 2, but then notice that the first block for the scheduler and Vector GPRs (general purpose registers) says "Float / INT / Matrix SIMD32," followed by a second block that says "Float / Matrix SIMD32." That second block is new for RDNA 3, and it basically means double the floating-point throughput.
You can choose to look at things in one of two ways: Either each CU now has 128 Stream Processors (SPs, or GPU shaders) and you get 12,288 total shader ALUs (Arithmetic Logic Units), or you can view it as 64 "full" SPs that just happen to have double the FP32 throughput compared to the previous generation RDNA 2 CUs.
That's kind of funny, because some places are saying that Navi 31 has 6,144 shaders and others are saying 12,288 shaders, so I specifically asked AMD's Mike Mantor, the chief GPU architect and the main man behind the RDNA 3 design, whether it was 6,144 or 12,288. He pulled out a calculator, punched in some numbers, and said, "Yeah, it should be 12,288." And yet, in some ways, it's not.
AMD's own slides in a different presentation (above) say 6,144 SPs and 96 CUs for the 7900 XTX, and 84 CUs with 5,376 SPs for the 7900 XT, so AMD is taking the approach of using the lower number. However, raw FP32 compute (and matrix compute) has doubled. Personally, it makes more sense to me to call it 128 SPs per CU rather than 64, and the overall design looks similar to Nvidia's Ampere and Ada Lovelace architectures. Those now have 128 FP32 CUDA cores per Streaming Multiprocessor (SM), but also 64 INT32 units.
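The two counting conventions are easy to reconcile; both describe the same hardware, just with a different per-CU multiplier:

```python
def shader_count(cus: int, sps_per_cu: int) -> int:
    """Total Stream Processors given a per-CU counting convention."""
    return cus * sps_per_cu

# AMD's official counting: 64 dual-issue SPs per CU
assert shader_count(96, 64) == 6144   # RX 7900 XTX
assert shader_count(84, 64) == 5376   # RX 7900 XT

# Counting both SIMD32 blocks: 128 FP32 ALUs per CU
assert shader_count(96, 128) == 12288
assert shader_count(84, 128) == 10752
```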
Along with the extra 32-bit floating-point compute, AMD also doubled the matrix (AI) throughput, as the AI Matrix Accelerators appear to at least partially share some of the execution resources. New to the AI units is BF16 (brain-float 16-bit) support, as well as INT4 WMMA Dot4 instructions (Wave Matrix Multiply Accumulate), and as with the FP32 throughput, there's an overall 2.7x increase in matrix operation speed.
That 2.7x appears to come from the overall 17.4% increase in clock-for-clock performance, plus 20% more CUs and double the SIMD32 units per CU. (But don't quote me on that, as AMD didn't specifically break down all the gains.)
Bigger and Faster Caches and Interconnects
The caches, and the interfaces between the caches and the rest of the system, have all received upgrades. For example, the L0 cache is now 32KB (double RDNA 2), and the L1 caches are 256KB (double RDNA 2 again), while the L2 cache increased to 6MB (1.5x larger than RDNA 2).
The link between the main processing units and the L1 cache is now 1.5x wider, with 6144 bytes per clock of throughput. Likewise, the link between the L1 and L2 cache is also 1.5x wider (3072 bytes per clock).
The L3 cache, also known as the Infinity Cache, did shrink relative to Navi 21. It's now 96MB vs. 128MB. However, the L3 to L2 link is now 2.25x wider (2304 bytes per clock), so the total throughput is much higher. In fact, AMD gives a figure of 5.3 TB/s: 2304 B/clk at a speed of 2.3 GHz. The RX 6950 XT only had a 1024 B/clk link to its Infinity Cache (maximum), and RDNA 3 delivers up to 2.7x the peak interface bandwidth.
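AMD's 5.3 TB/s figure checks out as width times clock, and the 2.7x claim follows once you also factor in the clock increase. The RX 6950 XT Infinity Cache clock used below (~1.94 GHz) is our inference from AMD's ratio, not a stated figure:

```python
def cache_bw_tbps(bytes_per_clk: int, ghz: float) -> float:
    """Cache interface bandwidth in TB/s: width x clock."""
    return bytes_per_clk * ghz / 1000

navi31 = cache_bw_tbps(2304, 2.3)   # ~5.3 TB/s, matching AMD's figure
navi21 = cache_bw_tbps(1024, 1.94)  # ~2.0 TB/s (assumed clock)
print(round(navi31 / navi21, 1))    # ~2.7x
```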
Note that these figures are only for the fully configured Navi 31 solution in the 7900 XTX. The 7900 XT has five MCDs, dropping down to a 320-bit GDDR6 interface and 1920 B/clk links to the combined 80MB of Infinity Cache. We'll likely see lower-tier RDNA 3 parts that cut back further on interface width and performance, naturally.
Finally, there are now up to six 64-bit GDDR6 interfaces for a combined 384-bit link to the GDDR6 memory. The VRAM also clocks in at 20 Gbps (vs. 18 Gbps on the later 6x50 cards and 16 Gbps on the original RDNA 2 chips) for a total bandwidth of 960 GB/s.
It's interesting how much the gap between GDDR6 and GDDR6X has narrowed with this generation, at least for shipping configurations. AMD's 960 GB/s on the RX 7900 XTX is just 5% less than the 1008 GB/s of the RTX 4090 now, whereas back in 2020 the RX 6900 XT was only pushing 512 GB/s compared to the RTX 3090's 936 GB/s.
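Both bandwidth figures follow directly from bus width and per-pin speed:

```python
def vram_bw_gbps(bus_bits: int, gbps_per_pin: float) -> float:
    """VRAM bandwidth in GB/s: (bus width in bytes) x (per-pin data rate)."""
    return bus_bits / 8 * gbps_per_pin

xtx = vram_bw_gbps(384, 20)       # RX 7900 XTX, GDDR6: 960 GB/s
rtx4090 = vram_bw_gbps(384, 21)   # RTX 4090, GDDR6X: 1008 GB/s
print(round((1 - xtx / rtx4090) * 100, 1))  # ~4.8% gap
```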
AMD 2nd Generation Ray Tracing
Ray tracing on the RDNA 2 architecture always felt like an afterthought, something tacked on to meet the required feature checklist for DirectX 12 Ultimate. AMD's RDNA 2 GPUs lack dedicated BVH traversal hardware, opting to do some of that work via other shared units, and that's at least partially to blame for their weak performance.
RDNA 2 Ray Accelerators could do up to four ray/box intersections per clock, or one ray/triangle intersection. By way of contrast, Intel's Arc Alchemist can do up to 12 ray/box intersections per RTU per clock, while Nvidia doesn't provide a specific number but can do up to two ray/triangle intersections per RT core on Ampere and up to four ray/triangle intersections per clock on Ada Lovelace.
It's not clear if RDNA 3 actually improves these figures directly or if AMD has focused on other enhancements to reduce the number of ray/box intersections performed. Perhaps both. What we do know is that RDNA 3 has improved BVH (Bounding Volume Hierarchy) traversal that should boost ray tracing performance.
RDNA 3 also has 1.5x larger VGPRs, which means 1.5x as many rays in flight. There are other stack optimizations to reduce the number of instructions needed for BVH traversal, and specialized box sorting algorithms (closest first, largest first, closest midpoint) can be used to extract improved efficiency.
Overall, thanks to the new features, higher frequency, and increased number of Ray Accelerators, AMD says RDNA 3 should deliver up to a 1.8x performance uplift for ray tracing compared to RDNA 2. That should narrow the gap between AMD and Nvidia's Ampere. However, Nvidia also appears to have doubled down on its ray tracing hardware for Ada Lovelace, so we wouldn't count on AMD delivering equivalent performance to RTX 40-series GPUs.
Other Architectural Improvements
Finally, RDNA 3 has tuned other elements of the architecture related to the command processor, geometry, and pixel pipelines. There's also a new Dual Media Engine with support for AV1 encode/decode, AI-enhanced video decoding, and the new Radiance Display Engine.
The Command Processor (CP) updates should improve performance for certain workloads while also reducing CPU bottlenecks on the driver and API side. Hardware-based culling performance is also 50% faster on the geometry side of things, and there's a 50% increase in peak rasterized pixels per clock.
That last item appears to be a result of increasing the number of ROPs (Render Outputs) from 128 on Navi 21 to 192 on Navi 31. That makes sense, as there's also a 50% increase in memory channels, and AMD would want to scale other elements in line with that.
The Dual Media Engine should bring AMD up to parity with Nvidia and Intel on the video side of things, though we'll have to test to see how quality and performance compare. We know from our Arc A380 video encoding tests that Intel generally delivered the best performance and quality, Nvidia wasn't far behind, and AMD was a relatively distant third on the quality front. Unfortunately, we haven't been able to test Nvidia's AV1 support yet, but we're looking forward to testing both of the new AMD and Nvidia AV1 implementations.
AMD also gains at least a few points for including DisplayPort 2.1 support. Intel also has DP2 support on its Arc GPUs, but it tops out at 40 Gbps (UHBR 10), while AMD can do 54 Gbps (UHBR 13.5). AMD's display outputs can drive up to 4K at 229 Hz without compression at 8-bit color depth, or 187 Hz with 10-bit color. Display Stream Compression can more than double that, allowing for 4K at 480 Hz or 8K at 165 Hz, not that we're anywhere near having displays that actually support such speeds.
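A hedged back-of-the-envelope on those refresh figures: UHBR 13.5 runs at 54 Gbps raw, and DP 2.x's 128b/132b encoding leaves roughly 52.4 Gbps of payload. Dividing by the bits per uncompressed RGB frame gives an upper bound before display blanking overhead, which is why AMD's quoted numbers land somewhat lower (attributing the gap to blanking is our assumption, not AMD's stated math):

```python
def max_refresh_hz(payload_gbps: float, width: int, height: int, bpc: int) -> float:
    """Upper-bound refresh rate for uncompressed RGB video, ignoring blanking."""
    bits_per_frame = width * height * bpc * 3  # 3 color channels
    return payload_gbps * 1e9 / bits_per_frame

payload = 54 * 128 / 132  # UHBR 13.5 after 128b/132b encoding: ~52.4 Gbps
print(round(max_refresh_hz(payload, 3840, 2160, 8)))   # ~263 Hz bound (AMD quotes 229)
print(round(max_refresh_hz(payload, 3840, 2160, 10)))  # ~210 Hz bound (AMD quotes 187)
```

Both of AMD's quoted figures sit comfortably below these no-blanking bounds, as expected.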
Realistically, we have to wonder how important DP2.1 UHBR 13.5 will be with the RDNA 3 graphics cards. You'll need a new monitor that supports DP2.1 to begin with, and second, there's the question of how much better something like 4K 180 Hz looks with and without DSC, because DP1.4a can still handle that resolution with DSC while UHBR 13.5 could do it without DSC.
RDNA 3 Architecture Final Thoughts
Graphics Card | RX 7900 XTX | RX 7900 XT | RX 6950 XT | RTX 4090 | RTX 4080 |
---|---|---|---|---|---|
Architecture | Navi 31 | Navi 31 | Navi 21 | AD102 | AD103 |
Process Technology | TSMC N5 + N6 | TSMC N5 + N6 | TSMC N7 | TSMC 4N | TSMC 4N |
Transistors (Billion) | 58 (45.7 + 6x 2.05) | 56 (45.7 + 5x 2.05) | 26.8 | 76.3 | 45.9 |
Die Size (mm^2) | 300 + 222 | 300 + 185 | 519 | 608.4 | 378.6 |
CUs / SMs | 96 | 84 | 80 | 128 | 76 |
SPs / Cores (Shaders) | 6144 (12288) | 5376 (10752) | 5120 | 16384 | 9728 |
Tensor / Matrix Cores | ? | ? | ? | 512 | 304 |
Ray Tracing "Cores" | 96 | 84 | 80 | 128 | 76 |
Boost Clock (MHz) | 2500 | 2400 | 2310 | 2520 | 2505 |
VRAM Speed (Gbps) | 20 | 20 | 18 | 21 | 22.4 |
VRAM (GB) | 24 | 20 | 16 | 24 | 16 |
VRAM Bus Width | 384 | 320 | 256 | 384 | 256 |
L2 / Infinity Cache (MB) | 96 | 80 | 128 | 72 | 64 |
ROPs | 192 | 192 | 128 | 176 | 112 |
TMUs | 384 | 336 | 320 | 512 | 304 |
TFLOPS FP32 (Boost) | 56.5 | 43 | 23.7 | 82.6 | 48.7 |
TFLOPS FP16 (FP8) | 113 | 86 | 47.4 | 661 (1321) | 390 (780) |
Bandwidth (GB/s) | 960 | 800 | 576 | 1008 | 717 |
TDP (watts) | 355 | 300 | 335 | 450 | 320 |
Launch Date | Dec 13, 2022 | Dec 13, 2022 | May 2022 | Oct 12, 2022 | Nov 16, 2022 |
Launch Price | $999 | $899 | $1,099 | $1,599 | $1,199 |
For those who want the full collection of slides on the RDNA 3 architecture, you can flip through them in the above gallery. Overall, it sounds like an impressive feat of engineering, and we're eager to see how the graphics cards based on the RDNA 3 GPUs stack up.
As we've noted before, we feel there's a good chance AMD can compete quite well against Nvidia's RTX 4080 card, which launches on November 16. On the other hand, it seems quite unlikely that AMD will be able to go head-to-head against the bigger RTX 4090 in most games.
Basic math provides plenty of food for thought. With 12,288 FP32 shaders running at 2.5 GHz vs. Nvidia's 16,384 shaders at 2.52 GHz, Nvidia clearly has the raw compute advantage: 61 teraflops vs. 83 teraflops. As noted, adding more FP32 units makes AMD's RDNA 3 look more like Ampere and Ada Lovelace, so there's a reasonable chance that real-world gaming performance will line up more closely with the teraflops. Memory bandwidth at least looks quite close, and the difference probably shouldn't matter too much.
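Those teraflops figures come from the standard peak-FP32 formula of two operations (a fused multiply-add) per shader per clock:

```python
def fp32_tflops(shaders: int, ghz: float) -> float:
    """Peak FP32 throughput in TFLOPS: 2 ops (FMA) per shader per clock."""
    return shaders * 2 * ghz / 1000

rx7900xtx = fp32_tflops(12288, 2.5)  # ~61.4 TFLOPS
rtx4090 = fp32_tflops(16384, 2.52)   # ~82.6 TFLOPS
print(round(rx7900xtx, 1), round(rtx4090, 1))
```

As always, peak teraflops are a theoretical ceiling; whether games get close depends on how well each architecture keeps its dual-issue units fed.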
Beyond raw compute, we have transistor counts and die sizes. Nvidia has built monolithic dies with its AD102, AD103, and AD104 GPUs. The largest has 76.3 billion transistors in a 608mm^2 chip. Even if AMD were doing a monolithic 522mm^2 chip with 58 billion transistors, we'd expect Nvidia to have some advantages. Still, the GPU chiplet approach means some of the area and transistors get used on things not directly related to performance.
Meanwhile, Nvidia's second-largest Ada chip, the AD103 used in the RTX 4080, falls on the other side of the fence. With a 256-bit interface, 45.9 billion transistors, and a 378.6mm^2 die size, Navi 31 should have some clear advantages, both with the RX 7900 XTX and the slightly lower-tier 7900 XT. And don't even get us started on the AD104 with 35.8 billion transistors and a 294.5mm^2 die. There's no way the "unlaunched" RTX 4080 12GB was going to keep pace with an RX 7900 XT, not without DLSS 3 being a major part of the story.
But there's more to performance than paper specs. Nvidia invests more transistors into features like DLSS (Tensor cores), DLSS 3 (the Optical Flow Accelerator), and ray tracing hardware. AMD seems more willing to give up some ray tracing performance while boosting the more common use cases. We'll see how the RTX 4080 performs in just a couple of days, and then we'll need to wait until December to see AMD's RX 7900 series response.
For those who aren't interested in graphics cards costing $900 or more, 2023 is when we'll get the RTX 4070 and lower-tier Ada Lovelace parts, and we'll likely get RX 7800, 7700, and maybe even 7600 series options from AMD. Navi 32 is rumored to use the same MCDs but with a smaller GCD, while further out, Navi 33 will supposedly be a monolithic die still built on the N6 node.
Based on what we've seen and heard so far, the future RTX 4070 and RX 7800 will likely deliver performance similar to the previous-generation RTX 3090 and RX 6950 XT, hopefully at significantly lower prices and while using less power. Check back next month for our full reviews of AMD's first and fastest RDNA 3 graphics cards.