NVIDIA Volta has just been announced at GTC 2017, and it's a beast. The next-generation graphics processing unit is the world's first chip built on TSMC's industry-leading 12nm FinFET process, so let's cover every detail of this compute powerhouse.
NVIDIA Volta GV100 Unveiled - Tesla V100 With 5120 CUDA Cores, 16 GB HBM2 and 12nm FinFET Process
Last GTC, NVIDIA announced the Pascal-based GP100 GPU, which was at the time the fastest graphics chip designed for supercomputers. This year, NVIDIA is taking the next leap in graphics performance with its Volta-based GV100 GPU. We are going to take a very deep look at this next-generation GPU designed for AI and deep learning.

"Artificial intelligence is driving the greatest technology advances in human history," said Jensen Huang, founder and chief executive officer of NVIDIA, who unveiled Volta at his GTC keynote. "It will automate intelligence and spur a wave of social progress unmatched since the industrial revolution.
"Deep learning, a groundbreaking AI approach that creates computer software that learns, has insatiable demand for processing power. Thousands of NVIDIA engineers spent over three years crafting Volta to help meet this need, enabling the industry to realize AI's life-changing potential," he said.
Volta, NVIDIA's seventh-generation GPU architecture, is built with 21 billion transistors and delivers the equivalent performance of 100 CPUs for deep learning.
It provides a 5x improvement over Pascal, the current-generation NVIDIA GPU architecture, in peak teraflops, and 15x over the Maxwell architecture, launched two years ago. This performance surpasses by 4x the improvements that Moore's law would have predicted.
via NVIDIA

First of all, we need to talk about the workloads this specific chip is designed to handle. The NVIDIA Volta GV100 GPU is designed to power the most computationally intensive HPC, AI, and graphics workloads.
The GV100 GPU packs 21.1 billion transistors into a die size of 815 mm2. It is fabricated on a new TSMC 12 nm FFN high-performance manufacturing process customized for NVIDIA, and the die is much bigger than the 610 mm2 Pascal GP100 GPU. NVIDIA Volta GV100 delivers considerably more compute performance and adds many new features compared to its predecessor, the Pascal GP100 GPU and its architecture family. GV100 also improves GPU resource utilization, further simplifying GPU programming and application porting. It is an extremely power-efficient processor, delivering exceptional performance per watt.

The chip itself is a behemoth, featuring a brand new architecture with insane raw specifications. The NVIDIA Volta GV100 GPU is composed of six GPCs (Graphics Processing Clusters) containing 42 TPCs, each of which includes two streaming multiprocessors (SMs), for a total of 84 Volta SMs. Each SM carries 64 CUDA cores, so we are looking at a total of 5376 CUDA cores on the complete die. All 5376 CUDA cores can execute both FP32 and INT32 instructions, and there are also a total of 2688 FP64 (double-precision) cores. On top of these, the die features 672 Tensor cores and 336 texture units.
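The unit counts above all follow from the per-SM layout, so a quick back-of-the-envelope tally (a sketch using only the figures quoted in this article) makes the structure of the full die easy to verify:

```python
# Full GV100 die tally from the per-SM figures quoted above.
TPCS = 42            # 7 TPCs per GPC across 6 GPCs
SMS_PER_TPC = 2
FP32_PER_SM = 64     # also usable as INT32 cores
FP64_PER_SM = 32
TENSOR_PER_SM = 8
TEX_PER_SM = 4

sms = TPCS * SMS_PER_TPC                 # streaming multiprocessors
fp32_cores = sms * FP32_PER_SM           # FP32/INT32 CUDA cores
fp64_cores = sms * FP64_PER_SM           # double-precision cores
tensor_cores = sms * TENSOR_PER_SM       # Tensor cores
texture_units = sms * TEX_PER_SM         # texture units

print(sms, fp32_cores, fp64_cores, tensor_cores, texture_units)
# → 84 5376 2688 672 336
```

Note that these are full-die numbers; the shipping Tesla V100 product enables 80 of the 84 SMs, which is where its 5120 CUDA core figure comes from.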

The memory architecture is updated with eight 512-bit memory controllers, which adds up to a 4096-bit bus interface supporting up to 16 GB of HBM2 VRAM. The HBM2 memory now runs at 878 MHz, delivering an increased transfer rate of 900 GB/s compared to 720 GB/s on Pascal GP100. Each memory controller is attached to 768 KB of L2 cache, for a total of 6 MB of L2 cache across the entire chip.
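The quoted 900 GB/s figure can be reproduced from the bus width and clock alone, assuming HBM2's double-data-rate signaling (two transfers per clock):

```python
# HBM2 bandwidth from the figures above: eight 512-bit controllers
# form a 4096-bit bus running at 878 MHz, double data rate.
bus_bits = 8 * 512          # 4096-bit aggregate bus
clock_hz = 878e6            # HBM2 memory clock
transfers_per_clock = 2     # DDR: data on both clock edges

bandwidth_gbs = bus_bits / 8 * clock_hz * transfers_per_clock / 1e9
print(round(bandwidth_gbs))  # ≈ 899 GB/s, marketed as 900 GB/s
```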
NVIDIA Tesla Graphics Cards Comparison:
| Tesla Graphics Card Name | NVIDIA Tesla M2090 | NVIDIA Tesla K40 | NVIDIA Tesla K80 | NVIDIA Tesla P100 | NVIDIA Tesla V100 |
|---|---|---|---|---|---|
| GPU Architecture | Fermi | Kepler | Kepler | Pascal | Volta |
| GPU Process | 40nm | 28nm | 28nm | 16nm | 12nm |
| GPU Name | GF110 | GK110 | GK210 x 2 | GP100 | GV100 |
| Die Size | 520mm2 | 561mm2 | 561mm2 | 610mm2 | 815mm2 |
| Transistor Count | 3.00 Billion | 7.08 Billion | 7.08 Billion | 15 Billion | 21.1 Billion |
| CUDA Cores | 512 CCs (16 CUs) | 2880 CCs (15 CUs) | 2496 CCs (13 CUs) x 2 | 3584 CCs | 5120 CCs |
| Core Clock | Up To 650 MHz | Up To 875 MHz | Up To 875 MHz | Up To 1480 MHz | Up To 1455 MHz |
| FP32 Compute | 1.33 TFLOPs | 4.29 TFLOPs | 8.74 TFLOPs | 10.6 TFLOPs | 15.0 TFLOPs |
| FP64 Compute | 0.66 TFLOPs | 1.43 TFLOPs | 2.91 TFLOPs | 5.30 TFLOPs | 7.50 TFLOPs |
| VRAM Size | 6 GB | 12 GB | 12 GB x 2 | 16 GB | 16 GB |
| VRAM Type | GDDR5 | GDDR5 | GDDR5 | HBM2 | HBM2 |
| VRAM Bus | 384-bit | 384-bit | 384-bit x 2 | 4096-bit | 4096-bit |
| VRAM Speed | 3.7 GHz | 6 GHz | 5 GHz | 737 MHz | 878 MHz |
| Memory Bandwidth | 177.6 GB/s | 288 GB/s | 240 GB/s | 720 GB/s | 900 GB/s |
| Maximum TDP | 250W | 300W | 235W | 300W | 300W |
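The peak FP32 numbers in the table follow directly from core count and boost clock, since each CUDA core can retire one fused multiply-add (2 FLOPs) per clock. A minimal sanity check of the last two columns:

```python
# Peak FP32 throughput = cores x clock x 2 FLOPs (one FMA per core per clock).
def peak_fp32_tflops(cuda_cores, boost_mhz):
    return cuda_cores * boost_mhz * 1e6 * 2 / 1e12

print(round(peak_fp32_tflops(3584, 1480), 1))  # Tesla P100 → 10.6 TFLOPs
print(round(peak_fp32_tflops(5120, 1455), 1))  # Tesla V100 → 14.9 TFLOPs (quoted as 15.0)
```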
