The NVIDIA Volta V100 GPU based Tesla V100 for PCI Express platforms has just been announced at this year's International Supercomputing Conference (ISC). The new graphics card is aimed at compute powerhouses, delivering increased performance with superb efficiency.
NVIDIA Volta V100 GPU Based Tesla V100 PCIe Graphics Card Announced - 14 TFLOPs of FP32 and 27 TFLOPs of FP16 Compute Performance in a 250 Watt Package
NVIDIA announced their Volta V100 GPU based Tesla V100 accelerator at GTC 2017. The new chip from NVIDIA is a behemoth that utilizes the new TSMC 12nm FFN (FinFET NVIDIA) process, custom built to power NVIDIA's Volta GPUs. The chip houses an incredible 21.1 billion transistors under the hood and is a remarkable feat of engineering. The graphics accelerator we saw at GTC '17 utilized the SXM2 form factor and while NVIDIA did tease their PCI Express based variant, they are formally announcing it today.
The NVIDIA Tesla V100 for PCI Express based systems has the same Volta GV100 GPU as the SXM2 variant. It features a GPU die size of 815mm2 (the biggest chip to date) and houses 16 GB of HBM2 memory on board the main interposer. Let's do a rundown of the core specifications.
NVIDIA Volta V100 GPU Based Tesla V100 PCI Express Specifications
The chip itself is a behemoth, featuring a brand new chip architecture that is just insane in terms of raw specifications. The NVIDIA Volta GV100 GPU is composed of six GPCs (Graphics Processing Clusters), which house a total of 42 TPCs (each including two SMs) for 84 Volta streaming multiprocessor (SM) units in all.

The 84 SMs come with 64 CUDA cores per SM, so we are looking at a total of 5376 CUDA cores on the complete die, alongside 2688 FP64 (double precision) cores, 672 tensor cores and 336 texture units. The shipping Tesla V100 enables 80 of those SMs, giving 5120 CUDA cores, 2560 FP64 cores, 640 tensor cores and 320 texture units, as reflected in the table below. All of the CUDA cores can be used for FP32 and INT32 programming instructions. The core clocks are maintained at a boost clock of around 1370 MHz, which delivers 28 TFLOPs of FP16, 14 TFLOPs of FP32 and 7.0 TFLOPs of FP64 compute performance.
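The quoted throughput figures follow directly from the enabled core counts and the boost clock. A back-of-the-envelope sketch, assuming the shipping 80-SM configuration and one fused multiply-add (2 FLOPs) per core per clock, with FP16 running at twice the FP32 rate:

```python
# Peak compute for the Tesla V100 PCIe from its enabled core counts.
# Assumes 1 FMA (= 2 FLOPs) per core per clock; FP16 is packed, so
# it runs at twice the FP32 rate.
boost_clock_ghz = 1.37   # ~1370 MHz boost clock quoted above
fp32_cores = 5120        # 80 SMs x 64 CUDA cores
fp64_cores = 2560        # 80 SMs x 32 FP64 cores

fp32_tflops = fp32_cores * 2 * boost_clock_ghz / 1000
fp64_tflops = fp64_cores * 2 * boost_clock_ghz / 1000
fp16_tflops = fp32_tflops * 2

print(f"FP16: {fp16_tflops:.1f} TFLOPs")  # ~28 TFLOPs
print(f"FP32: {fp32_tflops:.1f} TFLOPs")  # ~14 TFLOPs
print(f"FP64: {fp64_tflops:.1f} TFLOPs")  # ~7 TFLOPs
```

The results line up with the 28 / 14 / 7.0 TFLOPs figures NVIDIA quotes for the PCIe card.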
The chip also delivers 112 TFLOPs of deep learning compute, the fastest any chip has delivered to date. This is achieved by the dedicated tensor cores that handle deep learning tasks. So while the clocks and compute performance are slightly lower than those of the SXM2 variant, the card features a TDP of just 250W. Compared to 300W on the SXM2 card, this is an incredible feat that delivers increased efficiency.
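The 112 TFLOPs deep learning figure can be sanity-checked from the tensor core count: each Volta tensor core performs a 4x4x4 mixed-precision matrix multiply-accumulate per clock, i.e. 64 FMAs or 128 FLOPs. A rough sketch, assuming the 640 enabled tensor cores (8 per SM across 80 SMs) at the PCIe boost clock:

```python
# Deep learning (tensor) throughput for the Tesla V100 PCIe.
# Each tensor core does a 4x4x4 matrix multiply-accumulate per clock:
# 64 FMAs = 128 FLOPs.
tensor_cores = 640               # 8 per SM x 80 enabled SMs
flops_per_core_per_clock = 128
boost_clock_ghz = 1.37

tensor_tflops = tensor_cores * flops_per_core_per_clock * boost_clock_ghz / 1000
print(f"Tensor: {tensor_tflops:.0f} TFLOPs")  # ~112 TFLOPs
```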

The memory architecture is updated with eight 512-bit memory controllers. This adds up to a 4096-bit bus interface that supports up to 16 GB of HBM2 VRAM. The memory runs at 878 MHz, delivering an increased transfer rate of 900 GB/s compared to 732 GB/s on the Pascal GP100. Each memory controller is attached to 768 KB of L2 cache, which totals 6 MB of L2 cache for the entire chip.
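The bandwidth figure follows from the bus width and memory speed. A quick sketch, assuming the HBM2 signals at double data rate (so 878 MHz works out to roughly 1.76 Gbps per pin):

```python
# Memory bandwidth for the Tesla V100 from the quoted HBM2 speed.
# 878 MHz at double data rate gives ~1.76 Gbps per pin across the
# 4096-bit interface; divide by 8 to convert bits to bytes.
bus_width_bits = 4096
pin_rate_gbps = 0.878 * 2        # double data rate

bandwidth_gbs = bus_width_bits * pin_rate_gbps / 8
print(f"{bandwidth_gbs:.0f} GB/s")  # ~899 GB/s, i.e. the quoted ~900 GB/s

# L2 cache: eight controllers x 768 KB each
l2_total_kb = 8 * 768
print(f"L2: {l2_total_kb} KB")   # 6144 KB = 6 MB
```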
NVIDIA Tesla Graphics Card Specs Comparison:
| NVIDIA Tesla Graphics Card | Tesla K40 (PCI-Express) | Tesla M40 (PCI-Express) | Tesla P100 (PCI-Express) | Tesla P100 (SXM2) | Tesla V100 (PCI-Express) | Tesla V100 (SXM2) | Tesla V100S (PCIe) |
|---|---|---|---|---|---|---|---|
| GPU | GK110 (Kepler) | GM200 (Maxwell) | GP100 (Pascal) | GP100 (Pascal) | GV100 (Volta) | GV100 (Volta) | GV100 (Volta) |
| Process Node | 28nm | 28nm | 16nm | 16nm | 12nm | 12nm | 12nm |
| Transistors | 7.1 Billion | 8 Billion | 15.3 Billion | 15.3 Billion | 21.1 Billion | 21.1 Billion | 21.1 Billion |
| GPU Die Size | 551 mm2 | 601 mm2 | 610 mm2 | 610 mm2 | 815mm2 | 815mm2 | 815mm2 |
| SMs | 15 | 24 | 56 | 56 | 80 | 80 | 80 |
| TPCs | 15 | 24 | 28 | 28 | 40 | 40 | 40 |
| CUDA Cores Per SM | 192 | 128 | 64 | 64 | 64 | 64 | 64 |
| CUDA Cores (Total) | 2880 | 3072 | 3584 | 3584 | 5120 | 5120 | 5120 |
| Texture Units | 240 | 192 | 224 | 224 | 320 | 320 | 320 |
| FP64 CUDA Cores / SM | 64 | 4 | 32 | 32 | 32 | 32 | 32 |
| FP64 CUDA Cores / GPU | 960 | 96 | 1792 | 1792 | 2560 | 2560 | 2560 |
| Base Clock | 745 MHz | 948 MHz | 1190 MHz | 1328 MHz | 1230 MHz | 1297 MHz | TBD |
| Boost Clock | 875 MHz | 1114 MHz | 1329MHz | 1480 MHz | 1380 MHz | 1530 MHz | 1601 MHz |
| FP16 Compute | N/A | N/A | 18.7 TFLOPs | 21.2 TFLOPs | 28.0 TFLOPs | 30.4 TFLOPs | 32.8 TFLOPs |
| FP32 Compute | 5.04 TFLOPs | 6.8 TFLOPs | 10.0 TFLOPs | 10.6 TFLOPs | 14.0 TFLOPs | 15.7 TFLOPs | 16.4 TFLOPs |
| FP64 Compute | 1.68 TFLOPs | 0.2 TFLOPs | 4.7 TFLOPs | 5.30 TFLOPs | 7.0 TFLOPs | 7.80 TFLOPs | 8.2 TFLOPs |
| Memory Interface | 384-bit GDDR5 | 384-bit GDDR5 | 4096-bit HBM2 | 4096-bit HBM2 | 4096-bit HBM2 | 4096-bit HBM2 | 4096-bit HBM2 |
| Memory Size | 12 GB GDDR5 @ 288 GB/s | 24 GB GDDR5 @ 288 GB/s | 16 GB HBM2 @ 732 GB/s 12 GB HBM2 @ 549 GB/s | 16 GB HBM2 @ 732 GB/s | 16 GB HBM2 @ 900 GB/s | 16 GB HBM2 @ 900 GB/s | 16 GB HBM2 @ 1134 GB/s |
| L2 Cache Size | 1536 KB | 3072 KB | 4096 KB | 4096 KB | 6144 KB | 6144 KB | 6144 KB |
| TDP | 235W | 250W | 250W | 300W | 250W | 300W | 250W |









