AMD has presented a great deal more information on its upcoming Zen architecture at Hot Chips. Expected to launch later this year, the Zen architecture focuses on three key areas: performance, throughput and efficiency. With Zen, AMD plans to return to the high-performance CPU sector with a bang in the mainstream and enthusiast markets.
AMD Zen Architecture Fully Detailed - Wider, High-Performance and Efficient Core Design
To start off with the details, Zen is based on the latest 14nm FinFET node. The only two foundries that have this node are GlobalFoundries and Samsung, but we suspect AMD is using the former to produce Zen chips. The Zen core is said to deliver 40% more instructions per clock than the Excavator core.

AMD's full Zen Hot Chips presentation reveals complete architecture details. (Image Credits: Golem.de)
The Excavator core is featured in AMD's Carrizo and Godavari processors. The large jump in IPC would help AMD achieve performance parity with Intel chips. In fact, AMD has already demoed an 8-core Summit Ridge CPU based on Zen against an 8-core Broadwell-E chip. The demo showed AMD's solution delivering better rendering performance than Intel's HEDT solution.






AMD Zen Core Design and Core Engine
The basic building block of Zen is the core complex. The core complex consists of four cores connected to an L3 cache. The L3 cache is 16-way set associative and totals 8 MB (mostly exclusive of the L2 cache). It is divided into four slices, each made up of two 1 MB sub-slices. All cores can access these cache blocks with the same average latency.
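The slice arithmetic in this paragraph can be checked with a short sketch. The variable names are illustrative, and the even per-core split is only a way of picturing the layout, not a claim about how AMD partitions the cache.

```python
# Figures taken from the text: one core complex has four cores and an L3
# cache built from four slices, each made of two 1 MB sub-slices.
CORES_PER_COMPLEX = 4
L3_SLICES = 4
SUB_SLICES_PER_SLICE = 2
SUB_SLICE_MB = 1

# Size of one slice and of the whole L3, in MB.
slice_mb = SUB_SLICES_PER_SLICE * SUB_SLICE_MB
l3_total_mb = L3_SLICES * slice_mb

# Illustrative only: total L3 capacity divided evenly by core count.
l3_per_core_mb = l3_total_mb / CORES_PER_COMPLEX

print(f"L3 slice size: {slice_mb} MB")
print(f"L3 total per core complex: {l3_total_mb} MB")
print(f"L3 per core (even split): {l3_per_core_mb} MB")
```

The total matches the 8 MB figure quoted for the shared L3.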

Each core runs two threads. The core complex hence provides 8 threads, while the 8-core SKUs will provide 16 threads. On each core, branch prediction has been improved, with two branches per BTB entry, which reduces branch mispredicts. The large op cache helps improve throughput and latency at the same time. The integer cluster in each Zen core has six pipes: four ALUs (Arithmetic Logic Units) and two AGUs (Address Generation Units).
These AGUs can perform two 16-byte loads and one 16-byte store per cycle via a 32 KB 8-way set associative write-back L1 data cache. According to AMD, the move from a write-through to a write-back cache has noticeably reduced stalls in several types of code paths. Load/store cache operations in Zen also reportedly exhibit lower latency compared to Excavator.
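To put the load/store figures in perspective, here is a minimal sketch of the peak L1 data cache traffic they imply. The clock speed is a hypothetical value chosen for illustration; no Zen clock speed is given here, and real code rarely sustains peak rates.

```python
# Per-cycle figures quoted in the text.
LOADS_PER_CYCLE = 2
LOAD_BYTES = 16
STORES_PER_CYCLE = 1
STORE_BYTES = 16

def l1d_bytes_per_cycle():
    """Return (load_bytes, store_bytes) moved per cycle at peak."""
    return LOADS_PER_CYCLE * LOAD_BYTES, STORES_PER_CYCLE * STORE_BYTES

def peak_gb_per_s(bytes_per_cycle, clock_ghz):
    """Bytes per cycle times billions of cycles per second gives GB/s."""
    return bytes_per_cycle * clock_ghz

# Hypothetical clock, NOT from the article.
ASSUMED_CLOCK_GHZ = 3.0

load_bpc, store_bpc = l1d_bytes_per_cycle()
print(f"Peak load:  {load_bpc} B/cycle, "
      f"{peak_gb_per_s(load_bpc, ASSUMED_CLOCK_GHZ):.0f} GB/s per core")
print(f"Peak store: {store_bpc} B/cycle, "
      f"{peak_gb_per_s(store_bpc, ASSUMED_CLOCK_GHZ):.0f} GB/s per core")
```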




AMD has widened Zen's dispatch to 6 versus 4 on Excavator. The instruction schedulers for integer and floating point have also grown to 84 and 96 entries, respectively. The FPU is now quad-issue, while the retire, load and store queues have increased to 192, 72 and 44 entries, compared to 128, 44 and 32 on Excavator.
The two floating point units on the new core consist of 4 pipes with 128-bit FMACs. There are two FADD and two FMUL units for calculations on the FPU. The FPU has a 2-level scheduling queue with a 160-entry register file, 8-wide retire and a single pipe for 128-bit stores. It has its own two AES units and is SSE, AVX1, AVX2, AES, SHA and legacy MMX compliant.
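The Zen versus Excavator figures are easier to compare as relative increases. The sketch below only restates the numbers quoted in this paragraph; the dictionary keys are descriptive labels chosen for illustration.

```python
# (Zen, Excavator) figures quoted in the text.
FIGURES = {
    "dispatch width": (6, 4),
    "retire queue": (192, 128),
    "load queue": (72, 44),
    "store queue": (44, 32),
}

def percent_increase(new, old):
    """Relative growth from old to new, in percent."""
    return (new - old) / old * 100

for name, (zen, excavator) in FIGURES.items():
    growth = percent_increase(zen, excavator)
    print(f"{name}: {excavator} -> {zen} (+{growth:.1f}%)")
```

By this measure the load queue grows the most, while dispatch width and the retire queue both grow by half.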
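As a rough illustration of what 128-bit FMAC hardware means for peak throughput, the sketch below counts floating point operations per cycle. It rests on two assumptions that are not stated in the text: that two 128-bit fused multiply-add operations can issue per cycle, and that each fused multiply-add counts as two operations per vector lane.

```python
def peak_flops_per_cycle(fmac_ops_per_cycle, vector_bits, element_bits):
    """Peak FLOPs per cycle: each FMAC does a multiply and an add per lane."""
    lanes = vector_bits // element_bits
    return fmac_ops_per_cycle * lanes * 2

# Assumption (not from the article): two 128-bit FMAC operations per cycle.
ASSUMED_FMAC_OPS_PER_CYCLE = 2
VECTOR_BITS = 128  # FMAC width quoted in the text

single = peak_flops_per_cycle(ASSUMED_FMAC_OPS_PER_CYCLE, VECTOR_BITS, 32)
double = peak_flops_per_cycle(ASSUMED_FMAC_OPS_PER_CYCLE, VECTOR_BITS, 64)
print(f"Single precision: {single} FLOPs/cycle per core")
print(f"Double precision: {double} FLOPs/cycle per core")
```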
AMD Zen With SMT (Simultaneous Multi-Threading) Support
One of the most anticipated arrivals on the new core is SMT support. This brings the design much closer to Intel's implementation. The SMT design offers increased throughput by executing two threads simultaneously. These virtual threads will appear as independent cores to software and put more execution resources at the disposal of applications.
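Because SMT threads show up to software as separate processors, the logical CPU count an operating system reports is the core count times the threads per core. A minimal sketch, where `logical_cpus` is a helper name made up for this example and `os.cpu_count()` is the standard Python call that reports whatever the current machine exposes:

```python
import os

def logical_cpus(physical_cores, threads_per_core=2):
    """Logical processors visible to software with SMT enabled."""
    return physical_cores * threads_per_core

# Figures from the text: a 4-core complex and an 8-core SKU.
print("One core complex:", logical_cpus(4), "threads")
print("8-core SKU:", logical_cpus(8), "threads")

# What the machine running this script reports (may be None on some systems).
print("This machine:", os.cpu_count(), "logical CPUs")
```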

Along with SMT support, Zen also adds several new instructions and features. These include ADX, RDSEED, SMAP, SHA1, XSAVEC, CLZERO and PTE Coalescing. AMD also supports all the standard ISA extensions mentioned above.
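On Linux, support for most of these extensions can be checked from the `flags` line of `/proc/cpuinfo`. The sketch below is a minimal example under that assumption; the mapping from the names in this article to kernel flag names (for example `sha_ni` for the SHA extensions) is our own, and PTE Coalescing is left out because it is not reported as a flag there.

```python
# Article name -> Linux /proc/cpuinfo flag name (mapping is an assumption).
ZEN_FLAGS = {
    "ADX": "adx",
    "RDSEED": "rdseed",
    "SMAP": "smap",
    "SHA1": "sha_ni",
    "XSAVEC": "xsavec",
    "CLZERO": "clzero",
}

def parse_flags(cpuinfo_text):
    """Return the set of flags from the first 'flags' line, or an empty set."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags") and ":" in line:
            return set(line.split(":", 1)[1].split())
    return set()

def report(cpuinfo_text):
    """Map each article name to True/False depending on flag presence."""
    flags = parse_flags(cpuinfo_text)
    return {name: flag in flags for name, flag in ZEN_FLAGS.items()}

def read_cpuinfo(path="/proc/cpuinfo"):
    """Read cpuinfo; return an empty string where the file does not exist."""
    try:
        with open(path) as handle:
            return handle.read()
    except OSError:
        return ""

for name, present in report(read_cpuinfo()).items():
    print(f"{name}: {'yes' if present else 'no'}")
```

On non-Linux systems the file is absent, so every entry simply reports "no".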

AMD Zen High Bandwidth, Low Latency Cache System
AMD has been talking about a disruptive cache system on its new core for a while. With the details finally out, we can now better understand this system. The cache hierarchy is made up of a fast private L2 cache on each core (512 KB L2 I+D, 8-way) and a fast shared L3 cache (8 MB L3 I+D, 16-way).

This enables higher bandwidth, prefetch improvements and faster cache-to-cache transfers. The L3 cache is mostly filled with L2 victims, and the hierarchy offers larger queues for L1 and L2 misses.
Each core also includes a 64 KB L1 instruction cache (4-way) and a 32 KB L1 data cache (8-way). The entire system adds up to faster L1, L2 and L3 caches that offer faster loads to the FPU (7 cycles required). Bandwidth is improved to almost 2x on L1 and L2, while L3 cache bandwidth is improved by 5x.
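The sizes and associativities quoted in this section are enough to work out how many sets each cache has and how much cache one four-core complex carries in total. The sketch below assumes a 64-byte cache line, which is not stated here, and treats the L3 as a single 16-way structure rather than modelling its slices.

```python
KB = 1024
LINE_BYTES = 64  # assumption: line size is not given in the article

# name -> (size in bytes, ways), as quoted in the text.
CACHES = {
    "L1 I": (64 * KB, 4),
    "L1 D": (32 * KB, 8),
    "L2": (512 * KB, 8),
    "L3": (8 * 1024 * KB, 16),
}

def num_sets(size_bytes, ways, line_bytes=LINE_BYTES):
    """Sets in a set-associative cache: size / (ways * line size)."""
    return size_bytes // (ways * line_bytes)

def complex_total_kb(cores=4):
    """Private L1 and L2 for every core plus the shared L3, in KB."""
    private = sum(CACHES[name][0] for name in ("L1 I", "L1 D", "L2"))
    return (private * cores + CACHES["L3"][0]) // KB

for name, (size, ways) in CACHES.items():
    print(f"{name}: {size // KB} KB, {ways}-way, {num_sets(size, ways)} sets")
print(f"Total cache per core complex: {complex_total_kb()} KB")
```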
AMD Zen - A 14nm FinFET, Low Power and Faster Design
Performance is one thing, but efficiency is where AMD has really fallen short. With Zen, that is going to change. Zen has much higher efficiency than Excavator, which is a highly tuned design in itself. This is achieved through the use of aggressive clock-gating techniques on multi-level regions inside the core block. Some of the features that help achieve lower power on Zen include:




AMD Zen Low Power Features:
- Aggressive Clock Gating with multi-level regions
- Write Back L1 Cache
- Large Op Cache
- Stack Engine
- Move Elimination
- Power Focus from Project Inception
- Low Power Design Methodologies

| CPU Microarchitecture | AMD Phenom II / K10 | AMD BD/PD | AMD SR/XV | AMD Zen | Intel Skylake |
|---|---|---|---|---|---|
| Instruction Decode Width | 3-wide | 4-wide | 8-wide | 4-wide | 4-wide |
| Single Core Peak Decode Rate | 3 instructions | 4 instructions | 8 instructions | 4 instructions | 4 instructions |
| Dual Core Peak Decode Rate | 6 instructions | 4 instructions | 8 instructions | 8 instructions | 8 instructions |










