Today at the 2016 GPU Technology Conference, Nvidia announced the Tesla P100, the company's most ambitious graphics card to date. The P100 features the most powerful and most complex GPU Nvidia has ever built, code named GP100. This flagship Pascal GPU is an engineering marvel, and in this piece we'll provide an overview of the Pascal architecture and all the details that Nvidia has revealed about this spectacular graphics chip.

This overview is derived from an excellent talk given by Nvidia's Senior Architect, Lars Nyland, and Chief Technologist for GPU Computing Software, Mark Harris.
So let's get straight to it!
The Five "Miracles" Of Nvidia's GP100 GPU & Tesla P100 Accelerator
At his keynote earlier today, Jen-Hsun Huang, Nvidia's Co-Founder & CEO, jokingly said that Nvidia never relies on more than one technical miracle with a given architecture. Despite that, with the GP100 GPU the company succeeded in creating its most ambitious and most miraculous graphics chip to date by relying on not one but five technological "miracles".

Jen-Hsun summarized these miracles in the slide above. They are:
- Next generation Pascal graphics architecture.
- TSMC's 16nm FinFET manufacturing process technology.
- Next generation, vertically stacked High Bandwidth Memory (HBM2).
- The company's brand new revolution in platform atomics, the high speed NVLink GPU interconnect.
- And finally, the workload that GP100 was designed for and excels at, AI.
Nvidia's Pascal Architecture & The GP100 GPU, Opening The Taps
It has been a long tradition at Nvidia to introduce major performance and power efficiency advancements with each of its next generation graphics architectures, and Pascal is no exception. The basic building block of every Pascal GPU is called the SM, short for streaming multiprocessor. Maxwell before Pascal had the SMM, Streaming Maxwell Multiprocessor, as its building block, and Kepler before both had the SMX. The streaming multiprocessor is the engine that "creates, manages, schedules and executes instructions from many threads in parallel."
The full GP100 GPU comprises 3840 CUDA cores, 240 texture units and a 4096-bit memory interface arranged in eight 512-bit segments. The 3840 CUDA cores are spread across six Graphics Processing Clusters, or GPCs for short, each of which has 10 Pascal streaming multiprocessors.
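The headline figures follow from simple multiplication. The short Python sketch below reproduces that arithmetic; the constant names are illustrative only, and the 64 cores per SM value is the one this article quotes for Pascal.

```python
# Full GP100 configuration as described in this article.
GPCS = 6                  # Graphics Processing Clusters
SMS_PER_GPC = 10          # Pascal SMs in each GPC
FP32_CORES_PER_SM = 64    # FP32 CUDA cores in each Pascal SM
MEM_SEGMENTS = 8          # memory interface segments
BITS_PER_SEGMENT = 512    # width of each segment in bits

total_sms = GPCS * SMS_PER_GPC
total_cores = total_sms * FP32_CORES_PER_SM
bus_width_bits = MEM_SEGMENTS * BITS_PER_SEGMENT

print(total_sms)        # 60
print(total_cores)      # 3840
print(bus_width_bits)   # 4096
```

Note that the Tesla P100 product has 56 of these 60 SMs enabled, which is why its specification sheet lists 3584 cores (56 × 64) rather than 3840.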
Nvidia Pascal GP100 GPU Block Diagram
Each Pascal streaming multiprocessor includes 64 FP32 CUDA cores, half that of Maxwell. The SM is split into two partitions of 32 CUDA cores, each with a warp scheduler, two dispatch units and a fairly large instruction buffer, matching that of Maxwell.
The GP100 GPU is enormous, coming in at roughly 610mm² and 15 billion transistors, nearly double the transistor count of the GM200 GPU powering Nvidia's GTX Titan X and GTX 980 Ti graphics cards. GP100 also has significantly more streaming multiprocessors, or CUDA core blocks, than GM200, because each Pascal SM comprises only 64 CUDA cores as opposed to 128 in Maxwell.
Additionally, each Pascal SM has the same number of registers as Maxwell's 128 CUDA core SMM, which translates to each Pascal CUDA core having access to twice the registers. This in turn means that not only does GP100 have more threads than Nvidia's prior large GPUs, but each thread has access to more registers and thus a lot more throughput.
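To make the register arithmetic concrete, here is a small Python sketch. It assumes the 256 KB register file per SM that Nvidia lists for both Maxwell and Pascal, and it assumes 4-byte (32-bit) registers; the function name is purely illustrative.

```python
REGISTER_FILE_BYTES_PER_SM = 256 * 1024  # 256 KB per SM on both architectures
BYTES_PER_REGISTER = 4                   # assumption: 32-bit registers

def registers_per_core(cores_per_sm: int) -> int:
    """Return the number of 32-bit registers available per CUDA core."""
    registers_per_sm = REGISTER_FILE_BYTES_PER_SM // BYTES_PER_REGISTER
    return registers_per_sm // cores_per_sm

maxwell = registers_per_core(128)  # Maxwell SMM: 128 cores
pascal = registers_per_core(64)    # Pascal SM: 64 cores

print(maxwell, pascal, pascal / maxwell)  # 512 1024 2.0
```

The register file stays the same size while the core count per SM is halved, so the per-core share doubles.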
As always the goal was to deliver higher performance and improved power efficiency. As such Pascal builds on the changes that were implemented into Maxwell after Kepler.
The Pascal Streaming Multiprocessor
The combined 14MB of register files and 4MB of overall shared memory across the GP100 GPU result in a twofold increase in overall bandwidth inside the chip compared to GM200.
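The 14MB figure can be checked against the per-SM register file size. This is a minimal sketch, assuming 256 KB of registers per SM and the 56 SMs enabled on the Tesla P100; the helper name is hypothetical.

```python
REGISTER_FILE_KB_PER_SM = 256  # per-SM register file size, in KB

def register_file_mb(sm_count: int) -> float:
    """Total register file capacity in MB for a given number of SMs."""
    return sm_count * REGISTER_FILE_KB_PER_SM / 1024

print(register_file_mb(56))  # 14.0 (Tesla P100, 56 SMs enabled)
print(register_file_mb(60))  # 15.0 (full 60-SM GP100)
print(register_file_mb(24))  # 6.0  (GM200 as used in Tesla M40)
```

By this count the P100 carries more than twice the register capacity of the GM200-based Tesla M40.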
Chief Technologist, GPU Computing Software Mark Harris
A higher ratio of shared memory, registers, and warps per SM in GP100 allows the SM to more efficiently execute code. There are more warps for the instruction scheduler to choose from, more loads to initiate, and more per-thread bandwidth to shared memory (per thread).

According to Nvidia, the end result is that each Pascal SM actually requires less power and area to manage data transfers, even compared to a Kepler SMX, which improves both performance and power efficiency. Pascal also includes an updated scheduler that not only improves SM utilization (editorial note: better async compute performance, anyone?) but is also more intelligent and power efficient. Finally, each warp scheduler can dispatch two instructions per clock.
Nvidia's Senior Architect, Lars Nyland, admits that the 16nm FinFET process played an important role in realizing the team's power efficiency goals, but maintains that numerous architectural improvements helped further reduce the energy footprint of the architecture.




The table below is a high-level comparison of the Tesla P100's specifications with those of previous generation Tesla accelerators.
| Tesla Products | Tesla K40 | Tesla M40 | Tesla P100 |
|---|---|---|---|
| GPU | GK110 (Kepler) | GM200 (Maxwell) | GP100 (Pascal) |
| SMs | 15 | 24 | 56 |
| TPCs | 15 | 24 | 28 |
| FP32 CUDA Cores / SM | 192 | 128 | 64 |
| FP32 CUDA Cores / GPU | 2880 | 3072 | 3584 |
| FP64 CUDA Cores / SM | 64 | 4 | 32 |
| FP64 CUDA Cores / GPU | 960 | 96 | 1792 |
| Base Clock | 745 MHz | 948 MHz | 1328 MHz |
| GPU Boost Clock | 810/875 MHz | 1114 MHz | 1480 MHz |
| Compute Performance - FP32 | 5.04 TFLOPS | 6.82 TFLOPS | 10.6 TFLOPS |
| Compute Performance - FP64 | 1.68 TFLOPS | 0.21 TFLOPS | 5.3 TFLOPS |
| Texture Units | 240 | 192 | 224 |
| Memory Interface | 384-bit GDDR5 | 384-bit GDDR5 | 4096-bit HBM2 |
| Memory Size | Up to 12 GB | Up to 24 GB | 16 GB |
| L2 Cache Size | 1536 KB | 3072 KB | 4096 KB |
| Register File Size / SM | 256 KB | 256 KB | 256 KB |
| Register File Size / GPU | 3840 KB | 6144 KB | 14336 KB |
| TDP | 235 Watts | 250 Watts | 300 Watts |
| Transistors | 7.1 billion | 8 billion | 15.3 billion |
| GPU Die Size | 551 mm² | 601 mm² | 610 mm² |
| Manufacturing Process | 28-nm | 28-nm | 16-nm |
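The compute performance rows can be approximately reproduced from the core counts and boost clocks in the same table. The sketch below assumes the usual peak-rate convention of one fused multiply-add (counted as 2 floating point operations) per core per clock at the boost clock, using the higher 875 MHz boost for the K40; the function and dictionary names are illustrative.

```python
def peak_tflops(cores: int, boost_mhz: float) -> float:
    """Peak throughput assuming one FMA (2 FLOPs) per core per clock."""
    return cores * 2 * boost_mhz * 1e6 / 1e12

# (FP32 cores, FP64 cores, boost clock in MHz), taken from the table above.
cards = {
    "Tesla K40": (2880, 960, 875),
    "Tesla M40": (3072, 96, 1114),
    "Tesla P100": (3584, 1792, 1480),
}

for name, (fp32, fp64, mhz) in cards.items():
    print(f"{name}: FP32 {peak_tflops(fp32, mhz):.2f} TFLOPS, "
          f"FP64 {peak_tflops(fp64, mhz):.2f} TFLOPS")
```

Under these assumptions the P100 works out to roughly 10.6 TFLOPS FP32 and 5.3 TFLOPS FP64, in line with the table. The M40 result lands slightly above the listed 6.82 TFLOPS, which is likely down to rounding in the quoted clock.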









