Nvidia Pascal Architecture Detailed – DX12 Async Compute & Scheduling Improved, CUDA Core Clusters Entirely Redesigned

Today at the 2016 GPU Technology Conference, Nvidia announced the Tesla P100, the company's most ambitious graphics card to date. The P100 features Nvidia's most powerful and most complex GPU yet, code-named GP100. This flagship Pascal GPU is an engineering marvel, and in this piece we'll provide an overview of the Pascal architecture and, in particular, all the details that Nvidia has revealed about this spectacular graphics chip.

Nvidia Tesla P100 accelerator

This overview is derived from an excellent talk given by Lars Nyland, Nvidia's Senior Architect, and Mark Harris, Nvidia's Chief Technologist for GPU Computing Software.

So let's get straight to it!

The Five "Miracles" Of Nvidia's GP100 GPU & Tesla P100 Accelerator

At his keynote earlier today, Jen-Hsun Huang, Nvidia's co-founder and CEO, jokingly said that Nvidia never relies on more than one technical miracle with a given architecture. Despite that, with the GP100 GPU the company succeeded in creating its most ambitious and most miraculous graphics chip to date by relying on not one but five technological "miracles".

Slide from Jen-Hsun Huang's GTC 2016 keynote

Jen-Hsun summarized these miracles in the slide above. They are:

- Next generation Pascal graphics architecture.

- TSMC's 16nm FinFET manufacturing process technology.

- Next generation, vertically stacked High Bandwidth Memory (HBM2).

- The company's brand new high-speed NVLink GPU interconnect, which also enables platform atomics.

- And finally, the workload that GP100 was designed for and excels at: AI.

Nvidia's Pascal Architecture & The GP100 GPU, Opening The Taps

It has been a long tradition at Nvidia to introduce major performance and power efficiency advancements with each of its next generation graphics architectures, and Pascal is no exception. The pivotal structure that serves as the basic building block of every Pascal GPU is the SM, short for streaming multiprocessor. Maxwell before Pascal had the SMM (Maxwell streaming multiprocessor) as its building block, and Kepler before both had the SMX. The streaming multiprocessor is the engine that "creates, manages, schedules and executes instructions from many threads in parallel."
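To make the SM's role concrete, here is a minimal CUDA sketch; the `scale` kernel and its launch configuration are purely illustrative, not anything from Nvidia's talk. Each launched thread block is assigned to a single SM, which splits it into 32-thread warps whose instructions the warp schedulers issue onto the CUDA cores.

```cpp
#include <cuda_runtime.h>

// Illustrative kernel: each thread scales one element of an array.
// Every launched block lands on a single SM; the SM splits the block into
// 32-thread warps and its warp schedulers issue their instructions.
__global__ void scale(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int n = 1 << 20;
    float *d = nullptr;
    cudaMalloc(&d, n * sizeof(float));

    // 256 threads per block = 8 warps per block for an SM to schedule.
    scale<<<(n + 255) / 256, 256>>>(d, 2.0f, n);

    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}
```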

The GP100 GPU comprises 3840 CUDA cores, 240 texture units and a 4096-bit memory interface arranged in eight 512-bit segments. The 3840 CUDA cores are organized into six Graphics Processing Clusters, or GPCs for short, each containing 10 Pascal streaming multiprocessors.
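As a hedged sketch of how those counts can be checked from software: the CUDA runtime reports the SM count and memory bus width directly, while the figure of 64 FP32 cores per Pascal SM comes from Nvidia's disclosure, since the API does not expose cores per SM.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // 64 FP32 CUDA cores per SM is the Pascal GP100 figure from Nvidia's
    // disclosure; the runtime does not report cores per SM itself.
    const int fp32CoresPerSM = 64;

    printf("SMs:             %d\n", prop.multiProcessorCount);                    // 60 on a full GP100, 56 on Tesla P100
    printf("FP32 CUDA cores: %d\n", prop.multiProcessorCount * fp32CoresPerSM);
    printf("Memory bus:      %d-bit\n", prop.memoryBusWidth);                     // 4096-bit HBM2 on GP100
    return 0;
}
```

On a full GP100 this works out to 6 GPCs × 10 SMs × 64 cores = 3840 FP32 CUDA cores.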

Nvidia Pascal GP100 GPU Block Diagram

Each Pascal streaming multiprocessor includes 64 FP32 CUDA cores, half that of Maxwell. Within each Pascal streaming multiprocessor there are two 32-CUDA-core partitions, each with its own warp scheduler, two dispatch units and a fairly large instruction buffer, matching that of Maxwell.

The GP100 GPU is enormous, coming in at roughly 610mm² and over 15 billion transistors, nearly double the transistor count of the GM200 GPU powering Nvidia's GTX Titan X and GTX 980 Ti graphics cards. GP100 also has significantly more Pascal streaming multiprocessors, or CUDA core blocks, than GM200, again because each Pascal SM comprises only 64 CUDA cores as opposed to 128 in Maxwell.

Additionally, each Pascal SM carries the same number of registers as Maxwell's 128-CUDA-core SMM. This translates to each Pascal CUDA core having access to twice the registers. This in turn means that not only does GP100 support more threads in flight than Nvidia's prior large GPUs, but each thread also has access to more registers and thus a lot more throughput.
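A quick back-of-the-envelope on why that doubles the registers per core, assuming the 256 KB register file per SM that Nvidia quotes for both architectures (and that appears in the spec table further down):

```cpp
#include <cstdio>

int main() {
    // Both the Maxwell SMM and the Pascal SM carry a 256 KB register file,
    // i.e. 65,536 32-bit registers, but Pascal spreads it over half the cores.
    const int regsPerSM         = 256 * 1024 / 4;  // 65,536 registers
    const int maxwellCoresPerSM = 128;
    const int pascalCoresPerSM  = 64;

    printf("Registers per core, Maxwell SMM: %d\n", regsPerSM / maxwellCoresPerSM); // 512
    printf("Registers per core, Pascal SM:   %d\n", regsPerSM / pascalCoresPerSM);  // 1024
    return 0;
}
```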

As always, the goal was to deliver higher performance and improved power efficiency, so Pascal builds on the changes that were introduced with Maxwell after Kepler.

The Pascal Streaming Multiprocessor

The combined 14 MB of register files and roughly 4 MB of shared memory across the GP100 GPU result in a twofold increase in overall bandwidth inside the chip compared to GM200.
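Those aggregate figures follow from the per-SM numbers; a small sketch, assuming the 256 KB register file and 64 KB shared memory per SM that Nvidia quotes for GP100, and the 56 SMs enabled on the Tesla P100:

```cpp
#include <cstdio>

int main() {
    // Aggregate on-chip storage for the 56 SMs enabled on Tesla P100,
    // using Nvidia's per-SM figures.
    const int sms            = 56;
    const int regFileKBPerSM = 256;  // register file per SM
    const int sharedKBPerSM  = 64;   // shared memory per SM

    printf("Total register file: %d KB\n", sms * regFileKBPerSM); // 14336 KB = 14 MB
    printf("Total shared memory: %d KB\n", sms * sharedKBPerSM);  // 3584 KB, rounded to ~4 MB in the article
    return 0;
}
```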

As Mark Harris, Nvidia's Chief Technologist for GPU Computing Software, explains:

"A higher ratio of shared memory, registers, and warps per SM in GP100 allows the SM to more efficiently execute code. There are more warps for the instruction scheduler to choose from, more loads to initiate, and more bandwidth to shared memory (per thread)."

NVIDIA Pascal GP100 SM

According to Nvidia, the end result is that each Pascal SM actually requires less power and area to manage data transfers, even compared to a Kepler SMX, which improves both performance and power efficiency. Pascal also includes an updated scheduler that not only improves SM utilization (editorial note: better async compute performance, anyone?) but is also more intelligent and power efficient. Finally, each warp scheduler can dispatch two instructions per clock.
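The async compute aside maps naturally onto CUDA streams, the CUDA-side analogue of the multi-queue submission that DX12 async compute relies on. The sketch below is illustrative only (the `workA`/`workB` kernels are made up): two independent workloads are submitted on separate streams so the hardware scheduler is free to interleave them rather than serialize them on one queue.

```cpp
#include <cuda_runtime.h>

// Two independent dummy kernels standing in for independent workloads;
// the names and math are illustrative only.
__global__ void workA(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] * 2.0f + 1.0f;
}
__global__ void workB(float *y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = y[i] * 0.5f;
}

int main() {
    const int n = 1 << 20;
    float *a = nullptr, *b = nullptr;
    cudaMalloc(&a, n * sizeof(float));
    cudaMalloc(&b, n * sizeof(float));

    // Separate streams let the GPU's scheduler overlap the two workloads.
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    workA<<<(n + 255) / 256, 256, 0, s1>>>(a, n);
    workB<<<(n + 255) / 256, 256, 0, s2>>>(b, n);

    cudaDeviceSynchronize();
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(a);
    cudaFree(b);
    return 0;
}
```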

Nvidia's Senior Architect Lars Nyland admits that the 16nm FinFET process played an important role in realizing the team's power efficiency goals, but maintains that numerous architectural improvements helped further reduce the architecture's energy footprint.


The table below offers a high-level comparison of the Tesla P100's specifications with those of previous generation Tesla accelerators. Note that the shipping Tesla P100 enables 56 of GP100's 60 SMs, which is why it lists 3584 rather than 3840 FP32 CUDA cores.

| Tesla Products | Tesla K40 | Tesla M40 | Tesla P100 |
| --- | --- | --- | --- |
| GPU | GK110 (Kepler) | GM200 (Maxwell) | GP100 (Pascal) |
| SMs | 15 | 24 | 56 |
| TPCs | 15 | 24 | 28 |
| FP32 CUDA Cores / SM | 192 | 128 | 64 |
| FP32 CUDA Cores / GPU | 2880 | 3072 | 3584 |
| FP64 CUDA Cores / SM | 64 | 4 | 32 |
| FP64 CUDA Cores / GPU | 960 | 96 | 1792 |
| Base Clock | 745 MHz | 948 MHz | 1328 MHz |
| GPU Boost Clock | 810/875 MHz | 1114 MHz | 1480 MHz |
| FP32 Compute Performance | 5.04 TFLOPS | 6.82 TFLOPS | 10.6 TFLOPS |
| FP64 Compute Performance | 1.68 TFLOPS | 0.21 TFLOPS | 5.3 TFLOPS |
| Texture Units | 240 | 192 | 224 |
| Memory Interface | 384-bit GDDR5 | 384-bit GDDR5 | 4096-bit HBM2 |
| Memory Size | Up to 12 GB | Up to 24 GB | 16 GB |
| L2 Cache Size | 1536 KB | 3072 KB | 4096 KB |
| Register File Size / SM | 256 KB | 256 KB | 256 KB |
| Register File Size / GPU | 3840 KB | 6144 KB | 14336 KB |
| TDP | 235 W | 250 W | 300 W |
| Transistors | 7.1 billion | 8 billion | 15.3 billion |
| GPU Die Size | 551 mm² | 601 mm² | 610 mm² |
| Manufacturing Process | 28 nm | 28 nm | 16 nm FinFET |
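The peak compute rows in the table follow directly from cores × 2 FLOPs per fused multiply-add × boost clock; a quick sanity check of the P100 column:

```cpp
#include <cstdio>

int main() {
    // Peak throughput = cores x 2 FLOPs per fused multiply-add x boost clock.
    const double boostGHz  = 1.480;
    const int    fp32Cores = 3584;
    const int    fp64Cores = 1792;

    printf("FP32 peak: %.1f TFLOPS\n", fp32Cores * 2 * boostGHz / 1000.0); // ~10.6
    printf("FP64 peak: %.1f TFLOPS\n", fp64Cores * 2 * boostGHz / 1000.0); // ~5.3
    return 0;
}
```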
