NVIDIA has officially lifted the curtain off its greatest and most powerful GPU to date, the 7nm Ampere GPU. The first product to feature the new Ampere architecture is a GPU called GA100, and this chip is currently the largest GPU ever produced on TSMC's bleeding-edge 7nm process node. Today, we will be taking a deep dive into the Ampere GA100 GPU architecture, its specifications, and the first products it will be featured in.
NVIDIA's Ampere GA100 GPU Official - World's Biggest 7nm GPU With Insane Specs
The Ampere GA100 GPU is by far the largest 7nm GPU ever designed. The GPU is built entirely for the HPC market, with applications such as scientific research, artificial intelligence, deep neural networks, and AI inferencing. There are a lot of specifications and a lot of products to talk about, so let's start.

First of all, the NVIDIA Ampere GA100 GPU will be available in various form factors, ranging from a mezzanine module to a full-length PCIe 4.0 graphics card. The GPU also comes in various configurations, but the one NVIDIA is highlighting today is the Tesla A100, which is used in the DGX A100 and HGX A100 systems.
The NVIDIA 7nm Ampere GA100 GPU Architecture & Specifications
When it comes to core specifications, the Ampere GA100 GPU from NVIDIA is a complete monster, measuring in at a massive 826mm2, even bigger than the 815mm2 Volta GV100 GPU. It also features more than twice the number of transistors: 54 billion versus 21.1 billion on its predecessor, which is very impressive. Given the die size and the transistor count, the Ampere GA100 is simply the densest GPU ever built.
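A quick back-of-the-envelope check of that density claim, using only the figures above:
```python
# Transistor density from the die sizes and transistor counts above.
ga100_density = 54.0e9 / 826    # transistors per mm^2
gv100_density = 21.1e9 / 815
print(f"GA100: {ga100_density / 1e6:.1f}M transistors/mm^2")  # ~65.4M
print(f"GV100: {gv100_density / 1e6:.1f}M transistors/mm^2")  # ~25.9M
print(f"Density gain: {ga100_density / gv100_density:.2f}x")  # ~2.53x
```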
The full implementation of the NVIDIA Ampere GA100 GPU includes the following units:
- 8 GPCs, 8 TPCs/GPC, 2 SMs/TPC, 16 SMs/GPC, 128 SMs per full GPU
- 64 FP32 CUDA Cores/SM, 8192 FP32 CUDA Cores per full GPU
- 4 third-generation Tensor Cores/SM, 512 third-generation Tensor Cores per full GPU
- 6 HBM2 stacks, 12 512-bit memory controllers
The A100 Tensor Core GPU implementation of the NVIDIA Ampere GA100 GPU includes the following units:
- 7 GPCs, 7 or 8 TPCs/GPC, 2 SMs/TPC, up to 16 SMs/GPC, 108 SMs
- 64 FP32 CUDA Cores/SM, 6912 FP32 CUDA Cores per GPU
- 4 third-generation Tensor Cores/SM, 432 third-generation Tensor Cores per GPU
- 5 HBM2 stacks, 10 512-bit memory controllers
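Both configurations follow directly from the GPC/TPC/SM hierarchy; here is a small sanity check of the totals quoted above:
```python
# Derive unit totals from the hierarchy (2 SMs per TPC,
# 64 FP32 cores and 4 Tensor Cores per SM).
def totals(sms, fp32_per_sm=64, tensor_per_sm=4):
    return sms * fp32_per_sm, sms * tensor_per_sm

full_sms = 8 * 8 * 2        # 8 GPCs x 8 TPCs/GPC x 2 SMs/TPC = 128 SMs
print(full_sms, totals(full_sms))   # -> 128, (8192, 512)

a100_sms = 108              # 7 GPCs with 7 or 8 TPCs each
print(a100_sms, totals(a100_sms))   # -> 108, (6912, 432)
```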
While the Tesla A100 features cut-down specifications due to early 7nm yields, which are still great considering the size of this 'SUPER GPU', the full GA100 implementation packs all 128 SMs, and it's that full-fat version of the NVIDIA Ampere GA100 GPU we're going to be looking at first.

Featuring 128 SMs with 8192 CUDA cores, the NVIDIA Ampere GA100 also houses the largest core count we've ever seen on a single GPU: 8192 FP32 cores, 4096 FP64 cores, and 512 Tensor cores. There are 8 Graphics Processing Clusters on the GPU, each with 8 TPCs and 16 SM units. The GA100 GPU has a TDP of 400W in its Tesla A100 variant.
The NVIDIA A100 GPU is a technical design breakthrough fueled by five key innovations:
- NVIDIA Ampere architecture: At the heart of A100 is the NVIDIA Ampere GPU architecture, which contains more than 54 billion transistors, making it the world's largest 7-nanometer processor.
- Third-generation Tensor Cores with TF32: NVIDIA's widely adopted Tensor Cores are now more flexible, faster, and easier to use. Their expanded capabilities include the new TF32 format for AI, which allows for up to 20x the AI performance of FP32 precision without any code changes (a numerical sketch follows after this list). In addition, Tensor Cores now support FP64, delivering up to 2.5x more compute than the previous generation for HPC applications.
- Multi-instance GPU: MIG, a new technical feature, enables a single A100 GPU to be partitioned into as many as seven separate GPUs so it can deliver varying degrees of compute for jobs of different sizes, providing optimal utilization and maximizing return on investment.
- Third-generation NVIDIA NVLink: Doubles the high-speed connectivity between GPUs to provide efficient performance scaling in a server.
- Structural sparsity: This new efficiency technique harnesses the inherently sparse nature of AI math to double the performance.
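For a sense of what TF32 changes numerically: it keeps FP32's 8-bit exponent (so the dynamic range is unchanged, which is why code runs without modification) but stores only 10 mantissa bits, the same as FP16. Below is a rough numpy sketch of that reduced precision; it truncates the low mantissa bits, whereas the real Tensor Cores round, so treat it as an illustration rather than NVIDIA's exact behavior.
```python
import numpy as np

def tf32_truncate(x):
    """Keep the sign, 8-bit exponent, and top 10 mantissa bits of a
    float32 value, zeroing the remaining 13 mantissa bits. Real TF32
    hardware rounds instead of truncating; illustration only."""
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    return (bits & np.uint32(0xFFFFE000)).view(np.float32)

x = np.array([1 / 3, np.pi, 2 / 7], dtype=np.float32)
print(x)                 # full FP32 values
print(tf32_truncate(x))  # the ~3 decimal digits TF32 actually carries
```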
Other specifications for the NVIDIA Ampere GA100 GPU include a huge 6144-bit bus interface feeding up to 48 GB of HBM2e memory in six stacks arranged around the GPU die. Each DRAM die holds 2 GB, so reaching 48 GB requires 4-hi stacks: each 4-hi stack provides 8 GB of capacity, and six stacks total 48 GB. The memory is stated to be running at over 2.0 Gbps pin speeds, which works out to around 1.6 TB/s of bandwidth.
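The bandwidth figure falls straight out of the bus width and pin speed; a quick check:
```python
bus_bits = 6144      # six 1024-bit HBM2e stack interfaces
pin_gbps = 2.0       # quoted per-pin data rate, "over 2.0 Gbps"
gbs = bus_bits * pin_gbps / 8          # bits -> bytes
print(f"{gbs:.0f} GB/s (~{gbs / 1000:.2f} TB/s)")
# -> 1536 GB/s at exactly 2.0 Gbps; slightly higher pin speeds
#    yield the ~1.6 TB/s figure
```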
The NVIDIA Ampere GPU will come with several HBM memory configurations, but it maxes out at 48 GB unless NVIDIA offers a 6-hi or 8-hi variant in the future, which would raise the memory capacity to 72 or even 96 GB. NVIDIA's Tesla V100S already doubled the HBM capacity of the Tesla V100, offering 32 GB versus 16 GB, so it's entirely possible NVIDIA could do the same with a future variant of the Tesla A100.
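The hypothetical capacities work out the same way, with 2 GB per DRAM die and six stacks:
```python
# Total capacity = 6 stacks x (dies per stack) x 2 GB per die.
for height in (4, 6, 8):
    print(f"{height}-hi stacks: {6 * height * 2} GB")
# -> 48 GB, 72 GB, 96 GB
```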
NVIDIA Ampere GA100 GPU Block Diagram:

NVIDIA Ampere GA100 GPU SM Block Diagram:

NVIDIA GPU Compute Capability Comparison (Kepler GK110 to Hopper GH100)
| GPU | Kepler GK110 | Maxwell GM200 | Pascal GP100 | Volta GV100 | Ampere GA100 | Hopper GH100 |
|---|---|---|---|---|---|---|
| Compute Capability | 3.5 | 5.2 | 6.0 | 7.0 | 8.0 | 9.0 |
| Threads / Warp | 32 | 32 | 32 | 32 | 32 | 32 |
| Max Warps / Multiprocessor | 64 | 64 | 64 | 64 | 64 | 64 |
| Max Threads / Multiprocessor | 2048 | 2048 | 2048 | 2048 | 2048 | 2048 |
| Max Thread Blocks / Multiprocessor | 16 | 32 | 32 | 32 | 32 | 32 |
| Max 32-bit Registers / SM | 65536 | 65536 | 65536 | 65536 | 65536 | 65536 |
| Max Registers / Block | 65536 | 32768 | 65536 | 65536 | 65536 | 65536 |
| Max Registers / Thread | 255 | 255 | 255 | 255 | 255 | 255 |
| Max Thread Block Size | 1024 | 1024 | 1024 | 1024 | 1024 | 1024 |
| CUDA Cores / SM | 192 | 128 | 64 | 64 | 64 | 128 |
| Shared Memory Size / SM Configurations (bytes) | 16K/32K/48K | 96K | 64K | 96K | 164K | 228K |
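Those per-SM ceilings interact in practice: the number of thread blocks an SM can hold at once is capped both by the block limit and by the 2048-thread limit. Here is a minimal sketch of that calculation (our own helper, not a CUDA API call), ignoring register and shared-memory pressure, which can lower the figure further:
```python
def max_resident_blocks(threads_per_block,
                        max_threads_per_sm=2048,
                        max_blocks_per_sm=32):
    """Upper bound on concurrently resident blocks per SM, using only
    the thread and block ceilings from the table above."""
    return min(max_blocks_per_sm, max_threads_per_sm // threads_per_block)

for tpb in (64, 128, 256, 1024):
    print(f"{tpb:>4} threads/block -> {max_resident_blocks(tpb)} blocks/SM")
# 64 -> 32 (block-limited), 256 -> 8, 1024 -> 2 (thread-limited)
```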
The NVIDIA Tesla A100 Accelerator - Specs & Performance
With the specifications of the full-fat NVIDIA Ampere GA100 GPU covered, let's talk about the Tesla A100 graphics accelerator itself. The Tesla A100 makes use of a cut-down variant of the Ampere GA100 GPU that offers 108 SMs featuring 6912 FP32 cores, 3456 FP64 cores, and 432 Tensor cores. The card comes with a 5120-bit bus interface and a maximum VRAM capacity of 40 GB HBM2. The 5120-bit bus corresponds to ten active 512-bit memory controllers and five active 8 GB HBM2 stacks; the sixth stack position on the package is occupied by a non-functional spacer that fills up its space.
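The stack arithmetic matches the A100 configuration listed earlier (5 HBM2 stacks, 10 512-bit memory controllers):
```python
bus_bits    = 5120
controllers = bus_bits // 512    # -> 10 active 512-bit controllers
stacks      = bus_bits // 1024   # -> 5 active 1024-bit HBM2 stacks
capacity    = stacks * 8         # 8 GB per 4-hi stack
print(f"{controllers} controllers, {stacks} stacks, {capacity} GB")
# -> 10 controllers, 5 stacks, 40 GB
```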
The NVIDIA Ampere Tesla A100 features a 400W TDP, which is 100W more than the Tesla V100 mezzanine unit. The PCIe variant comes with a 300W TDP but has lower clock speeds. The mezzanine board gets its GPU-to-GPU connectivity through the new NVLink switches, enabling up to 600 GB/s of GPU-to-GPU interconnect, a 4.8 Tb/s bidirectional channel. The PCIe variant has a Mellanox switch on board along with two next-gen NVLink connections and two EDR ports.
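Note that the two interconnect figures are the same spec expressed in different units:
```python
nvlink_gb_s = 600                        # GB/s, total GPU-to-GPU bandwidth
print(f"{nvlink_gb_s * 8 / 1000} Tb/s")  # -> 4.8 Tb/s
```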
| Comparison | V100 | A100 | A100 with Sparsity | A100 Speedup | A100 Speedup with Sparsity |
|---|---|---|---|---|---|
| A100 FP16 vs. V100 FP16 | 31.4 TFLOPS | 78 TFLOPS | N/A | 2.5x | N/A |
| A100 FP16 TC vs. V100 FP16 TC | 125 TFLOPS | 312 TFLOPS | 624 TFLOPS | 2.5x | 5x |
| A100 BF16 TC vs. V100 FP16 TC | 125 TFLOPS | 312 TFLOPS | 624 TFLOPS | 2.5x | 5x |
| A100 FP32 vs. V100 FP32 | 15.7 TFLOPS | 19.5 TFLOPS | N/A | 1.25x | N/A |
| A100 TF32 TC vs. V100 FP32 | 15.7 TFLOPS | 156 TFLOPS | 312 TFLOPS | 10x | 20x |
| A100 FP64 vs. V100 FP64 | 7.8 TFLOPS | 9.7 TFLOPS | N/A | 1.25x | N/A |
| A100 FP64 TC vs. V100 FP64 | 7.8 TFLOPS | 19.5 TFLOPS | N/A | 2.5x | N/A |
| A100 INT8 TC vs. V100 INT8 | 62 TOPS | 624 TOPS | 1248 TOPS | 10x | 20x |
| A100 INT4 TC | N/A | 1248 TOPS | 2496 TOPS | N/A | N/A |
| A100 Binary TC | N/A | 4992 TOPS | N/A | N/A | N/A |
In terms of performance, the NVIDIA Ampere GA100 GPU delivers over 1 PetaOPS of INT8 inferencing throughput with sparsity, a 20x increase over the Volta GV100 GPU. Double-precision performance is rated at 2.5x that of the Volta GV100, which lands at 19.5 TFLOPs FP64 via the Tensor Cores, since Volta had around 7.8 TFLOPs of FP64 compute power. Single-precision performance comes in at 19.5 TFLOPs of standard FP32 and up to 156 TFLOPs using TF32 Tensor Cores, which is mind-blowing for the HPC segment.
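Those standard-rate figures can be reproduced from the core counts and the A100's 1410 MHz boost clock (listed in the table below), counting each fused multiply-add as two operations:
```python
# Peak FLOPS = cores x 2 (FMA counts as two ops) x boost clock.
boost_ghz = 1.41
fp32_tflops = 6912 * 2 * boost_ghz / 1000
fp64_tflops = 3456 * 2 * boost_ghz / 1000
print(f"FP32: {fp32_tflops:.1f} TFLOPS")  # ~19.5
print(f"FP64: {fp64_tflops:.1f} TFLOPS")  # ~9.7 (19.5 via FP64 Tensor Cores)
```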
NVIDIA HPC / AI GPUs
| NVIDIA Tesla Graphics Card | NVIDIA H200 (SXM5) | NVIDIA H100 (SXM5) | NVIDIA H100 (PCIe) | NVIDIA A100 (SXM4) | NVIDIA A100 (PCIe4) | Tesla V100S (PCIe) | Tesla V100 (SXM2) | Tesla P100 (SXM2) | Tesla P100 (PCI-Express) | Tesla M40 (PCI-Express) | Tesla K40 (PCI-Express) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| GPU | GH200 (Hopper) | GH100 (Hopper) | GH100 (Hopper) | GA100 (Ampere) | GA100 (Ampere) | GV100 (Volta) | GV100 (Volta) | GP100 (Pascal) | GP100 (Pascal) | GM200 (Maxwell) | GK110 (Kepler) |
| Process Node | 4nm | 4nm | 4nm | 7nm | 7nm | 12nm | 12nm | 16nm | 16nm | 28nm | 28nm |
| Transistors | 80 Billion | 80 Billion | 80 Billion | 54.2 Billion | 54.2 Billion | 21.1 Billion | 21.1 Billion | 15.3 Billion | 15.3 Billion | 8 Billion | 7.1 Billion |
| GPU Die Size | 814mm2 | 814mm2 | 814mm2 | 826mm2 | 826mm2 | 815mm2 | 815mm2 | 610 mm2 | 610 mm2 | 601 mm2 | 551 mm2 |
| SMs | 132 | 132 | 114 | 108 | 108 | 80 | 80 | 56 | 56 | 24 | 15 |
| TPCs | 66 | 66 | 57 | 54 | 54 | 40 | 40 | 28 | 28 | 24 | 15 |
| L2 Cache Size | 51200 KB | 51200 KB | 51200 KB | 40960 KB | 40960 KB | 6144 KB | 6144 KB | 4096 KB | 4096 KB | 3072 KB | 1536 KB |
| FP32 CUDA Cores Per SM | 128 | 128 | 128 | 64 | 64 | 64 | 64 | 64 | 64 | 128 | 192 |
| FP64 CUDA Cores / SM | 128 | 128 | 128 | 32 | 32 | 32 | 32 | 32 | 32 | 4 | 64 |
| FP32 CUDA Cores | 16896 | 16896 | 14592 | 6912 | 6912 | 5120 | 5120 | 3584 | 3584 | 3072 | 2880 |
| FP64 CUDA Cores | 16896 | 16896 | 14592 | 3456 | 3456 | 2560 | 2560 | 1792 | 1792 | 96 | 960 |
| Tensor Cores | 528 | 528 | 456 | 432 | 432 | 640 | 640 | N/A | N/A | N/A | N/A |
| Texture Units | 528 | 528 | 456 | 432 | 432 | 320 | 320 | 224 | 224 | 192 | 240 |
| Boost Clock | ~1850 MHz | ~1850 MHz | ~1650 MHz | 1410 MHz | 1410 MHz | 1601 MHz | 1530 MHz | 1480 MHz | 1329MHz | 1114 MHz | 875 MHz |
| TOPs (DNN/AI) | 3958 TOPs | 3958 TOPs | 3200 TOPs | 2496 TOPs | 2496 TOPs | 130 TOPs | 125 TOPs | N/A | N/A | N/A | N/A |
| FP16 Compute | 1979 TFLOPs | 1979 TFLOPs | 1600 TFLOPs | 624 TFLOPs | 624 TFLOPs | 32.8 TFLOPs | 30.4 TFLOPs | 21.2 TFLOPs | 18.7 TFLOPs | N/A | N/A |
| FP32 Compute | 67 TFLOPs | 67 TFLOPs | 800 TFLOPs | 156 TFLOPs (19.5 TFLOPs standard) | 156 TFLOPs (19.5 TFLOPs standard) | 16.4 TFLOPs | 15.7 TFLOPs | 10.6 TFLOPs | 10.0 TFLOPs | 6.8 TFLOPs | 5.04 TFLOPs |
| FP64 Compute | 34 TFLOPs | 34 TFLOPs | 48 TFLOPs | 19.5 TFLOPs (9.7 TFLOPs standard) | 19.5 TFLOPs (9.7 TFLOPs standard) | 8.2 TFLOPs | 7.80 TFLOPs | 5.30 TFLOPs | 4.7 TFLOPs | 0.2 TFLOPs | 1.68 TFLOPs |
| Memory Interface | 5120-bit HBM3e | 5120-bit HBM3 | 5120-bit HBM2e | 6144-bit HBM2e | 6144-bit HBM2e | 4096-bit HBM2 | 4096-bit HBM2 | 4096-bit HBM2 | 4096-bit HBM2 | 384-bit GDDR5 | 384-bit GDDR5 |
| Memory Size | Up To 141 GB HBM3e @ 6.5 Gbps | Up To 80 GB HBM3 @ 5.2 Gbps | Up To 80 GB HBM2e @ 2.0 Gbps | Up To 40 GB HBM2 @ 1.6 TB/s Up To 80 GB HBM2 @ 1.6 TB/s | Up To 40 GB HBM2 @ 1.6 TB/s Up To 80 GB HBM2 @ 2.0 TB/s | 16 GB HBM2 @ 1134 GB/s | 16 GB HBM2 @ 900 GB/s | 16 GB HBM2 @ 732 GB/s | 16 GB HBM2 @ 732 GB/s 12 GB HBM2 @ 549 GB/s | 24 GB GDDR5 @ 288 GB/s | 12 GB GDDR5 @ 288 GB/s |
| TDP | 700W | 700W | 350W | 400W | 250W | 250W | 300W | 300W | 250W | 250W | 235W |