Nvidia has just unveiled its fastest GPU yet here at GTC 2016, a brand new graphics chip based on the company's next generation Pascal architecture. The GP100 is Nvidia's most advanced GPU to date, powering the company's next generation compute monster, the Tesla P100.

Nvidia claims that GP100 is the largest FinFET GPU that has ever been made, measuring 610mm² and packing over 15 billion transistors. The Tesla P100 features a slightly cut back GP100 GPU and delivers 5.3 teraflops of double precision compute, 10.6 TFLOPS of single precision compute and 21.2 TFLOPS of half precision FP16 compute. Keeping this massive GPU fed is 4MB of L2 cache and a whopping 14MB worth of register files.

The entire Tesla P100 package comprises many chips, not just the GPU, which collectively add up to over 150 billion transistors. It features 16GB of stacked HBM2 VRAM with a total of 720GB/s of bandwidth. Nvidia's CEO and co-founder Jen-Hsun Huang confirmed that this behemoth of a graphics card is already in volume production, with samples already delivered to customers, who will begin announcing their products in Q4 and shipping them in Q1 2017.

Pascal GP100 Architecture & Specs
Nvidia Press Release
Five Architectural Breakthroughs
The Tesla P100 delivers its unprecedented performance, scalability and programming efficiency based on five breakthroughs:
NVIDIA Pascal architecture for exponential performance leap -- A Pascal-based Tesla P100 solution delivers over a 12x increase in neural network training performance compared with a previous-generation NVIDIA Maxwell™-based solution.
NVIDIA NVLink for maximum application scalability -- The NVIDIA NVLink™ high-speed GPU interconnect scales applications across multiple GPUs, delivering a 5x acceleration in bandwidth compared to today's best-in-class solution. Up to eight Tesla P100 GPUs can be interconnected with NVLink to maximize application performance in a single node, and IBM has implemented NVLink on its POWER8 CPUs for fast CPU-to-GPU communication.
16nm FinFET for unprecedented energy efficiency -- With 15.3 billion transistors built on 16 nanometer FinFET fabrication technology, the Pascal GPU is the world's largest FinFET chip ever built. It is engineered to deliver the fastest performance and best energy efficiency for workloads with near-infinite computing needs.
CoWoS with HBM2 for big data workloads -- The Pascal architecture unifies processor and data into a single package to deliver unprecedented compute efficiency. An innovative approach to memory design, Chip on Wafer on Substrate (CoWoS) with HBM2, provides a 3x boost in memory bandwidth performance, or 720GB/sec, compared to the Maxwell architecture.
New AI algorithms for peak performance -- New half-precision instructions deliver more than 21 teraflops of peak performance for deep learning.
The GP100 GPU comprises 3840 CUDA cores, 240 texture units and a 4096-bit memory interface. The 3840 CUDA cores are arranged in six Graphics Processing Clusters, or GPCs for short. Each of these has 10 Pascal streaming multiprocessors. As mentioned earlier in the article, the Tesla P100 features a cut-down GP100 GPU. This cut-back version has 3584 CUDA cores and 224 texture mapping units.

Each Pascal streaming multiprocessor includes 64 FP32 CUDA cores, half that of Maxwell. Within each Pascal streaming multiprocessor there are two 32 CUDA core partitions, each with a warp scheduler, two dispatch units and a fairly large instruction buffer, matching that of Maxwell.
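
The core counts quoted above follow from simple multiplication. The short Python sketch below reproduces them from the article's figures; the variable names are illustrative only, and the per-SM texture unit count is inferred from the quoted totals rather than stated by Nvidia here.

```python
# Figures quoted in the article; names are illustrative, not Nvidia's.
GPCS = 6                 # Graphics Processing Clusters in the full GP100
SMS_PER_GPC = 10         # streaming multiprocessors per GPC
FP32_CORES_PER_SM = 64   # FP32 CUDA cores per Pascal SM

full_sms = GPCS * SMS_PER_GPC
full_cores = full_sms * FP32_CORES_PER_SM

# The Tesla P100 exposes 3584 cores, so work backwards to its enabled SM count.
p100_cores = 3584
p100_sms = p100_cores // FP32_CORES_PER_SM

# Inferred: 240 texture units spread evenly over the full chip's SMs.
tex_per_sm = 240 // full_sms

print("Full GP100:", full_sms, "SMs,", full_cores, "CUDA cores")
print("Tesla P100:", p100_sms, "SMs,", p100_sms * tex_per_sm, "texture units")
```

Under these figures the Tesla P100 has 56 of the full chip's 60 streaming multiprocessors enabled.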

The massive GP100 GPU has significantly more streaming multiprocessors, or CUDA core blocks, than its Maxwell predecessor. Each of these has access to a register file that's the same size as that of Maxwell's 128 CUDA core SMM, which means that each Pascal CUDA core has access to twice the register file capacity. In turn we should expect even more performance out of each Pascal CUDA core compared to Maxwell.
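
The "twice the register file" claim can be checked with a quick calculation. This is a minimal sketch that assumes the 256 KB per-SM register file size and the SM counts listed in the spec table of this article; the function name is hypothetical.

```python
# Register file size per SM, identical for Maxwell and Pascal per the spec table.
REGISTER_FILE_KB_PER_SM = 256

def register_kb_per_core(cores_per_sm):
    """Register file capacity available to each FP32 CUDA core, in KB."""
    return REGISTER_FILE_KB_PER_SM / cores_per_sm

maxwell_kb = register_kb_per_core(128)  # Maxwell SMM: 128 cores share 256 KB
pascal_kb = register_kb_per_core(64)    # Pascal SM: 64 cores share 256 KB
ratio = pascal_kb / maxwell_kb

# Chip-wide totals: SM count times the per-SM register file.
m40_total_kb = 24 * REGISTER_FILE_KB_PER_SM   # Tesla M40 (GM200), 24 SMs
p100_total_kb = 56 * REGISTER_FILE_KB_PER_SM  # Tesla P100 (GP100), 56 SMs

print(maxwell_kb, pascal_kb, ratio)
print(m40_total_kb, p100_total_kb)
```

The chip-wide total for the Tesla P100 works out to 14336 KB, which is the 14MB figure quoted earlier.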

Nvidia Press Release
Tesla P100 Specifications
Specifications of the Tesla P100 GPU accelerator include:
5.3 teraflops double-precision performance, 10.6 teraflops single-precision performance and 21.2 teraflops half-precision performance with NVIDIA GPU BOOST™ technology
160GB/sec bi-directional interconnect bandwidth with NVIDIA NVLink
16GB of CoWoS HBM2 stacked memory
720GB/sec memory bandwidth with CoWoS HBM2 stacked memory
Enhanced programmability with page migration engine and unified memory
ECC protection for increased reliability
Server-optimized for highest data center throughput and reliability
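
As a rough illustration of how the 720GB/sec figure relates to the 4096-bit memory interface, the sketch below works out the implied data rate per interface pin. It assumes that "GB" means 10^9 bytes and that the bandwidth is spread evenly across all 4096 bits; neither assumption comes from the press release.

```python
# Figures quoted in the article.
BANDWIDTH_GB_PER_S = 720   # total HBM2 memory bandwidth
BUS_WIDTH_BITS = 4096      # memory interface width

# Assumption: 1 GB = 10**9 bytes, the usual convention for bandwidth figures.
total_bits_per_s = BANDWIDTH_GB_PER_S * 10**9 * 8

# Assumption: bandwidth is divided evenly over every bit of the interface.
per_pin_gbit_per_s = total_bits_per_s / BUS_WIDTH_BITS / 10**9

print(f"{per_pin_gbit_per_s:.2f} Gbit/s per pin")
```

The result is a little over 1.4 Gbit/s per pin, which shows that HBM2 gets its bandwidth from a very wide interface rather than from high per-pin speeds.
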
Tesla P100 Boosts To Nearly 1.5GHz
Perhaps one of the most exciting, yet perhaps predictable, revelations about the GP100 Pascal flagship GPU is that it can achieve clocks even higher than Maxwell. Despite Nvidia opting for very conservative clock speeds on its professional GPUs like the Tesla and Quadro products, the P100 has a base clock speed of 1328MHz and a boost clock speed of 1480MHz.
Considering that GPU Boost 2.0 allows these cards to operate at even higher clock speeds than the nominal boost clock, we're looking at actual frequencies upwards of 1500MHz on the GeForce equivalent of the P100, which is inevitably going to launch as the next GTX Titan. This means boost clocks upwards of 1600MHz on factory overclocked models, and perhaps 2GHz+ manual overclocks. This should be extremely exciting news to all GeForce fans.
| Tesla Products | Tesla K40 | Tesla M40 | Tesla P100 |
|---|---|---|---|
| GPU | GK110 (Kepler) | GM200 (Maxwell) | GP100 (Pascal) |
| SMs | 15 | 24 | 56 |
| TPCs | 15 | 24 | 28 |
| FP32 CUDA Cores / SM | 192 | 128 | 64 |
| FP32 CUDA Cores / GPU | 2880 | 3072 | 3584 |
| FP64 CUDA Cores / SM | 64 | 4 | 32 |
| FP64 CUDA Cores / GPU | 960 | 96 | 1792 |
| Base Clock | 745 MHz | 948 MHz | 1328 MHz |
| GPU Boost Clock | 810/875 MHz | 1114 MHz | 1480 MHz |
| Compute Performance - FP32 | 5.04 TFLOPS | 6.82 TFLOPS | 10.6 TFLOPS |
| Compute Performance - FP64 | 1.68 TFLOPS | 0.21 TFLOPS | 5.3 TFLOPS |
| Texture Units | 240 | 192 | 224 |
| Memory Interface | 384-bit GDDR5 | 384-bit GDDR5 | 4096-bit HBM2 |
| Memory Size | Up to 12 GB | Up to 24 GB | 16 GB |
| L2 Cache Size | 1536 KB | 3072 KB | 4096 KB |
| Register File Size / SM | 256 KB | 256 KB | 256 KB |
| Register File Size / GPU | 3840 KB | 6144 KB | 14336 KB |
| TDP | 235 Watts | 250 Watts | 300 Watts |
| Transistors | 7.1 billion | 8 billion | 15.3 billion |
| GPU Die Size | 551 mm² | 601 mm² | 610 mm² |
| Manufacturing Process | 28-nm | 28-nm | 16-nm |
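
The compute figures in the table can be approximately reproduced from the core counts and boost clocks. The sketch below assumes each core retires one fused multiply-add per clock, counted as two floating point operations, and that FP16 runs at twice the FP32 rate, as the quoted 21.2 TFLOPS figure suggests. The function name and these assumptions are illustrative, not taken from Nvidia's material.

```python
def peak_tflops(cores, boost_mhz, flops_per_core_per_clock=2):
    """Theoretical peak throughput in TFLOPS.

    Assumes one fused multiply-add (2 FLOPs) per core per clock by default.
    """
    return cores * flops_per_core_per_clock * boost_mhz * 1e6 / 1e12

# Tesla P100 values from the table: core counts and 1480 MHz boost clock.
p100_fp32 = peak_tflops(3584, 1480)
p100_fp64 = peak_tflops(1792, 1480)
p100_fp16 = 2 * p100_fp32  # assumption: FP16 at double the FP32 rate

# Cross-check against the Tesla K40 at its higher 875 MHz boost clock.
k40_fp32 = peak_tflops(2880, 875)

print(f"P100 FP32: {p100_fp32:.1f} TFLOPS")
print(f"P100 FP64: {p100_fp64:.1f} TFLOPS")
print(f"P100 FP16: {p100_fp16:.1f} TFLOPS")
print(f"K40  FP32: {k40_fp32:.2f} TFLOPS")
```

These are theoretical peaks at the boost clock; sustained throughput in real workloads will be lower.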









