AMD Vega Features Leaked – 4x Efficiency, 2x Performance/Clock, 8x Capacity Per HBM Stack & Next Gen Compute Engine

The features of AMD's upcoming Radeon RX 500 Series Vega architecture have been discovered in the code of the just-launched ve.ga teaser site, and they're incredibly impressive. The company's next generation Vega graphics architecture is due for a major preview at CES on Thursday, less than three days away.

However, thanks to our crafty friends over at 3DCenter, who have managed to dig up some major yet unreleased details regarding the brand new architecture, you don't have to wait one more minute. All the details have been pulled from within the code base of the Vega teaser website, ve.ga. This not only makes it the biggest Vega leak yet, it also makes it the most significant because of its impeccable authenticity and accuracy. So without any further delay, let's get to the juicy bits!

Vega, AMD's Most Advanced & Most Impressive Graphics Architecture To Date

AMD Vega Architecture Features

Word cloud pulled from AMD's ve.ga teaser site.

Let's start off with a simple summary of Vega's key features. This should help paint a picture of how drastic a step forward the new architecture is compared to Polaris.

Vega Architecture

- 4x Power Efficiency

- 2x Peak Throughput/Performance Per Clock

- High Bandwidth Cache

- 2x Bandwidth per pin

- 8x Capacity Per Stack (2nd Generation High Bandwidth Memory)

- 512TB Virtual Address Space

- Next Generation Compute Engine

- Next Generation Pixel Engine

- Next Compute Unit Architecture

- Rapid Packed Math

- Draw Stream Binning Rasterizer

- Primitive Shaders

AMD Vega Lineup

Graphics Card	Radeon R9 Fury X	Radeon RX 480	Radeon RX Vega Frontier Edition	Radeon Vega Pro	Radeon RX Vega (Gaming)	Radeon RX Vega Pro Duo
GPU	Fiji XT	Polaris 10	Vega 10	Vega 10	Vega 10	2x Vega 10
Process Node	28nm	14nm FinFET	FinFET	FinFET	FinFET	FinFET
Stream Processors	4096	2304	4096	3584	4096 (?)	Up to 8192
Performance	8.6 TFLOPS, 8.6 TFLOPS (FP16)	5.8 TFLOPS, 5.8 TFLOPS (FP16)	~13 TFLOPS, ~25 TFLOPS (FP16)	11 TFLOPS, 22 TFLOPS (FP16)	>13 TFLOPS, >25 TFLOPS (FP16)	TBA
Memory	4GB HBM	8GB GDDR5	16GB HBM2	TBA	TBA	TBA
Memory Bus	4096-bit	256-bit	2048-bit	2048-bit	2048-bit	4096-bit
Bandwidth	512GB/s	256GB/s	480GB/s	400GB/s	TBA	TBA
TDP	275W	150W	TBA	TBA	TBA	TBA
Launch	2015	2016	June 2017	June 2017	July 2017	TBA
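
The performance row in the table can be sanity-checked with some quick arithmetic. The sketch below is only an illustration: it assumes the usual convention that each stream processor performs one fused multiply-add (counted as two operations) per clock, and that packed FP16 doubles that to four. The function names are made up for this example, and the clock speeds it prints are implied by the table's figures rather than confirmed specifications.

```python
def implied_clock_ghz(tflops: float, stream_processors: int, ops_per_clock: int = 2) -> float:
    """Clock (in GHz) needed to hit `tflops` at the given ops per stream processor per clock."""
    return tflops * 1000.0 / (stream_processors * ops_per_clock)


def peak_tflops(stream_processors: int, clock_ghz: float, ops_per_clock: int = 2) -> float:
    """Peak throughput in TFLOPS: stream processors x ops per clock x clock."""
    return stream_processors * ops_per_clock * clock_ghz / 1000.0


# (stream processors, single precision TFLOPS) taken from the table above.
cards = {
    "Radeon R9 Fury X": (4096, 8.6),
    "Radeon RX 480": (2304, 5.8),
    "Radeon RX Vega Frontier Edition": (4096, 13.0),
    "Radeon Vega Pro": (3584, 11.0),
}

for name, (sps, fp32) in cards.items():
    clock = implied_clock_ghz(fp32, sps)
    # Assumption: packed FP16 issues 4 ops per stream processor per clock.
    packed_fp16 = peak_tflops(sps, clock, ops_per_clock=4)
    print(f"{name}: implied clock {clock:.2f} GHz, packed FP16 peak {packed_fp16:.1f} TFLOPS")
```

Under these assumptions the packed FP16 figure is always exactly twice the single precision one, which lines up with the Vega columns once rounding is taken into account. Fiji and Polaris show the same number for both precisions in the table because they run FP16 at the FP32 rate.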

Vega's Next Compute Unit (NCU), 2x Peak Throughput per Clock And 4x The Power Efficiency

According to the newly dug up data, Vega delivers four times the graphics performance at the same power compared to AMD's previous generation. There isn't much context to expand upon here. However, it's very clear that AMD is referring to half precision compute. Since Vega's half precision throughput is twice its single precision throughput, that would mean Vega delivers double the single precision compute at the same power.

This is the most impressive figure of the bunch. Doubling the power efficiency of a graphics architecture whilst maintaining or boosting performance is an incredibly challenging engineering feat, one that's made even harder in the case of Vega considering that it is built on the same 14nm manufacturing process as Polaris. If it holds true then AMD's engineers will have pulled off nothing short of a miracle.

2x peak throughput per clock is another impressive figure that stands as a testament to how radically different Vega is compared to AMD's previous generation GCN architecture. It means that Vega should deliver up to double the peak performance at any given clock speed compared to AMD's previous generation GCN based GPUs.

High Bandwidth Cache, 8x Capacity Per Stack, 2x Bandwidth Per Pin And 512TB Address Space

These specs and features are specific to Vega's second generation High Bandwidth Memory technology. HBM2 offers 8x the capacity per stack compared to first generation HBM and twice the bandwidth per stack thanks to a higher clock speed. First generation HBM found in AMD's Fury series of enthusiast graphics cards features a maximum of 1GB capacity per stack and 128GB/s of bandwidth per stack.

Second generation HBM comes in stacks of up to 8GB, each with 256GB/s of bandwidth. Interestingly, the Vega engineering sample that AMD demoed last month was actually an 8GB model with 512GB/s of bandwidth, which would indicate that it was equipped with two 4GB HBM2 stacks, each delivering 256GB/s of bandwidth, rather than a single 8GB stack. However, the Radeon Instinct MI25 deep-learning accelerator based on the same Vega GPU features 16GB of memory and 512GB/s of bandwidth, which means that AMD had to equip it with two 8GB stacks.
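
The stack-count reasoning above is simple division, and can be written out as a small sketch. It assumes every stack on a card is identical and runs at the full 256GB/s quoted for HBM2. The function name is hypothetical and only meant to illustrate the inference.

```python
def infer_hbm2_stacks(total_capacity_gb: int, total_bandwidth_gbs: int,
                      per_stack_bandwidth_gbs: int = 256) -> tuple[int, float]:
    """Return (number of stacks, capacity per stack in GB).

    Assumes identical stacks, each running at `per_stack_bandwidth_gbs`.
    """
    stacks, remainder = divmod(total_bandwidth_gbs, per_stack_bandwidth_gbs)
    if stacks == 0 or remainder != 0:
        raise ValueError("bandwidth is not a whole multiple of the per-stack figure")
    return stacks, total_capacity_gb / stacks


# The 8GB, 512GB/s engineering sample mentioned in the article.
print(infer_hbm2_stacks(8, 512))
# The 16GB, 512GB/s Radeon Instinct MI25.
print(infer_hbm2_stacks(16, 512))
```

Both cards come out at two stacks under these assumptions, with the only difference being 4GB versus 8GB per stack.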

Each HBM stack connects to the GPU via a 1024-bit memory controller. HBM2 comes out of the factory clocked at double the frequency of first generation HBM, which is how it delivers double the bandwidth per pin. The 512TB virtual address space feature is quite an interesting one and is likely achieved by quickly swapping data in and out of the HBM cache.
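
To make the per-pin figures concrete, here is a rough back-of-the-envelope sketch. The per-pin data rates of 1 Gbit/s and 2 Gbit/s are not stated in the article; they are simply what the quoted 128GB/s and 256GB/s work out to over a 1024-bit interface. The address-width calculation assumes "512TB" means 512 x 2^40 bytes. Function names are illustrative only.

```python
def stack_bandwidth_gbs(bus_width_bits: int, gbit_per_pin: float) -> float:
    """Bandwidth in GB/s: pins x per-pin rate (Gbit/s) / 8 bits per byte."""
    return bus_width_bits * gbit_per_pin / 8


def address_bits(space_bytes: int) -> int:
    """Number of address bits needed to cover `space_bytes` bytes."""
    return (space_bytes - 1).bit_length()


hbm1 = stack_bandwidth_gbs(1024, 1.0)   # assumed 1 Gbit/s per pin
hbm2 = stack_bandwidth_gbs(1024, 2.0)   # assumed 2 Gbit/s per pin
fury_x = stack_bandwidth_gbs(4096, 1.0)  # four first generation stacks side by side

print(f"HBM1 stack: {hbm1:.0f} GB/s, HBM2 stack: {hbm2:.0f} GB/s, Fury X: {fury_x:.0f} GB/s")
print(f"512TB address space needs {address_bits(512 * 2**40)} address bits")
```

Doubling the per-pin rate while keeping the 1024-bit interface is all it takes to go from 128GB/s to 256GB/s per stack, and the Fury X result matches the 512GB/s listed in the lineup table.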

Below you will find a quick recap of what we know about AMD's Vega architecture & the upcoming RX 500 series graphics cards.

A New Top-To-Bottom Range Of Radeon RX 500 Series Graphics Cards Based On The Vega Architecture

AMD will be rolling out its next generation Vega architecture across the entire range of its 2017 Radeon graphics cards and it'll do it "soon". The new lineup will span from a top-end 4K 60FPS triple-A gaming Radeon graphics card, the very same one that was demoed last week, to mid-range and entry level offerings for 1440p and 1080p gaming. The highest end models will feature HBM2 whilst the mid-range and more budget oriented cards will feature GDDR5/X memory.

We've already seen one upcoming Radeon graphics card based on Vega in action. The yet unreleased graphics card was demoed in a head-to-head comparison with NVIDIA's GTX 1080. The demo Vega graphics card had 8GB of HBM2 and it outperformed the 1080 by 10% whilst running Doom in Vulkan at 4K.

The Vega Architecture - AMD's Next Generation Compute Unit

One big announcement that AMD made in its recent press event where Vega was demoed is that the new architecture features what the company calls its NCU, short for Next Compute Unit. We had already detailed key parts of this new design in our exclusive piece about Vega 10 and Vega 11 a couple of months ago.

This new architecture holds several key advantages over its predecessor. Chief among them is that each SIMD inside a given Vega NCU is now capable of simultaneously processing variable length wavefronts. To the average person that sounds like a bunch of meaningless technical jargon; I know it did to me when I first learned about it. However, once you scratch the surface and truly understand what this means, you quickly begin to realize how much of a big deal this really is.


In AMD's current GCN implementation, each compute unit has four 16-wide vector SIMD units, capable of executing four 16-wide wavefronts (a wavefront being a group of threads) over four cycles, in addition to one scalar unit capable of executing one instruction per cycle. The scalar unit is delegated time-critical tasks, where the four-cycle turnaround of the SIMD units is simply not good enough.

Unfortunately, these 16-wide SIMD units work exactly the same no matter how small a wavefront they're fed. The SIMD unit has to spend four cycles executing whatever threads are presented to it, no matter what. This means that executing a 16-wide wavefront takes just as long as executing a 4-wide wavefront, for example, which leaves the other 12 ALUs inside the SIMD completely useless. Graphics workloads are inherently non-uniform, which means that it's effectively impossible to find any scenario where all 16-wide SIMD units would be fully occupied at any given time.
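
The wasted-lane argument is easy to put into numbers. The sketch below is a deliberately simplified model that follows the article's description: a 16-wide SIMD spends the same time on a wavefront regardless of its width, so utilization is just the wavefront width divided by 16. The function names are invented for this illustration.

```python
def idle_lanes(wavefront_width: int, simd_width: int = 16) -> int:
    """ALUs left with nothing to do while a narrow wavefront executes."""
    if not 0 < wavefront_width <= simd_width:
        raise ValueError("wavefront must fit within the SIMD")
    return simd_width - wavefront_width


def lane_utilization(wavefront_width: int, simd_width: int = 16) -> float:
    """Fraction of the SIMD's ALUs doing useful work."""
    return 1.0 - idle_lanes(wavefront_width, simd_width) / simd_width


for width in (16, 8, 4):
    print(f"{width}-wide wavefront: {idle_lanes(width)} idle ALUs, "
          f"{lane_utilization(width):.0%} utilization")
```

For the 4-wide example in the text this gives 12 idle ALUs and only a quarter of the SIMD doing useful work.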

Variable Width Wavefront SIMDs, Getting More Performance Out Of Fewer Cycles

This is no longer the case in AMD's new GCN implementation inside Vega. The V9 architecture includes clever new schedulers and coherency subsystems that allow several wavefronts, of different widths, to be executed simultaneously inside any compute unit that's able to accommodate the workload, so that more ALUs are doing useful work at any given time instead of idling or executing predicated-off threads that produce no results.


This in effect allows each NCU to finish considerably more work in the same amount of time compared to a traditional CU, in addition to freeing up valuable cache and memory resources for other compute units. It's very hard to predict how much of a difference an improvement this big in resource utilization and CU occupancy will yield given how unpredictable and inherently fluctuating graphics workloads are. Vega's Next Compute Units should therefore be not only faster but also more power efficient, although by how much exactly remains to be seen.
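
To get a feel for why running wavefronts of different widths side by side helps, here is a toy model. It is emphatically not AMD's scheduler: it assumes that wavefronts can share a 16-lane SIMD pass whenever their widths add up to 16 or less, and it uses a simple first-fit-decreasing packing to group them. The function names and the example workload are made up for illustration.

```python
def pack_wavefronts(widths: list[int], simd_width: int = 16) -> list[list[int]]:
    """Group wavefront widths into SIMD passes using first-fit-decreasing packing."""
    passes: list[list[int]] = []
    for width in sorted(widths, reverse=True):
        if not 0 < width <= simd_width:
            raise ValueError("wavefront must fit within the SIMD")
        for group in passes:
            if sum(group) + width <= simd_width:
                group.append(width)  # share an existing pass
                break
        else:
            passes.append([width])   # no room anywhere, start a new pass
    return passes


def utilization(widths: list[int], num_passes: int, simd_width: int = 16) -> float:
    """Useful lanes divided by total lanes across all passes."""
    return sum(widths) / (num_passes * simd_width)


workload = [4, 8, 16, 4, 2, 12]  # hypothetical mix of wavefront widths

fixed_passes = len(workload)     # one wavefront per pass, as in the fixed-width case
packed = pack_wavefronts(workload)

print(f"fixed:  {fixed_passes} passes, {utilization(workload, fixed_passes):.0%} utilization")
print(f"packed: {len(packed)} passes, {utilization(workload, len(packed)):.0%} utilization")
```

In this made-up workload the packed version needs half as many passes, which is the kind of occupancy gain the variable width design is aiming for. Real gains depend entirely on the workload and on how the hardware actually schedules wavefronts.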