NVIDIA A100 Ampere GPU Launched in PCIe Form Factor, 20 Times Faster Than Volta at 250W & 40 GB HBM2 Memory - March 2024

NVIDIA has added a third variant to its growing Ampere A100 GPU family: the A100 PCIe, which is PCIe 4.0 compliant and comes in the standard full-length, full-height form factor, unlike the mezzanine board we got to see earlier.

NVIDIA's A100 Ampere GPU Gets PCIe 4.0 Ready Form Factor - Same GPU Configuration But at 250W, Up To 90% Performance of the Full 400W A100 GPU

Just like the Pascal P100 and Volta V100 before it, the Ampere A100 GPU was bound to get a PCIe variant sooner or later. Now NVIDIA has announced that its A100 PCIe GPU accelerator is available for a diverse set of use cases, with systems ranging from a single A100 PCIe GPU to servers that link two cards through 12 NVLink channels delivering 600 GB/s of interconnect bandwidth.
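As a quick sanity check on the interconnect figure above, the following minimal sketch divides the quoted 600 GB/s across the 12 NVLink channels and sets it next to the theoretical bandwidth of a PCIe 4.0 x16 slot. The PCIe parameters (16 GT/s per lane, 128b/130b encoding) are assumptions taken from the PCIe 4.0 specification rather than from NVIDIA's announcement, and the function names are purely illustrative.

```python
# Figures quoted in the article.
NVLINK_CHANNELS = 12
NVLINK_TOTAL_GB_S = 600.0  # quoted interconnect bandwidth, GB/s


def nvlink_per_channel_gb_s(total_gb_s: float, channels: int) -> float:
    """Average bandwidth per NVLink channel, in GB/s."""
    return total_gb_s / channels


def pcie_x16_gb_s_per_direction(gt_per_s: float = 16.0, lanes: int = 16) -> float:
    """Theoretical PCIe x16 bandwidth in one direction, in GB/s.

    Assumes PCIe 4.0 signalling (16 GT/s per lane) and 128b/130b encoding;
    these values are not from the article.
    """
    encoding_efficiency = 128 / 130
    return gt_per_s * lanes * encoding_efficiency / 8  # bits -> bytes


if __name__ == "__main__":
    per_channel = nvlink_per_channel_gb_s(NVLINK_TOTAL_GB_S, NVLINK_CHANNELS)
    pcie = pcie_x16_gb_s_per_direction()
    print(f"NVLink per channel: {per_channel:.1f} GB/s")
    print(f"PCIe 4.0 x16 (one direction): {pcie:.1f} GB/s")
```

The article does not say whether the 600 GB/s is a per-direction or an aggregate figure, so any comparison between the two numbers should be treated as rough.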

In terms of specifications, the A100 PCIe GPU accelerator doesn't change much in its core configuration. The GA100 GPU retains the specifications we saw on the 400W variant, with 6912 CUDA cores arranged in 108 SM units, 432 Tensor Cores, and 40 GB of HBM2 memory that delivers the same memory bandwidth of 1.55 TB/s (rounded off to 1.6 TB/s). The main difference is the TDP, which is rated at 250W for the PCIe variant, whereas the standard variant comes with a 400W TDP.

We would expect the card to feature lower clocks to compensate for the lower TDP, but NVIDIA has provided the peak compute numbers, and those remain unaffected for the PCIe variant. FP64 performance is still rated at 9.7/19.5 TFLOPs (standard/Tensor Core), FP32 performance at 19.5 TFLOPs standard and 156/312 TFLOPs with TF32 Tensor Cores (without/with sparsity), FP16 performance at 312/624 TFLOPs (without/with sparsity), and INT8 at 624/1248 TOPs (without/with sparsity).
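The standard (non-Tensor) peak figures can be roughly reproduced from the core counts and the boost clock. The sketch below assumes each CUDA core retires one fused multiply-add (counted as 2 FLOPs) per clock, and takes the 1410 MHz boost clock and the FP64 core count of 3456 from the comparison table further down. The helper name is hypothetical, and this is a back-of-the-envelope estimate rather than NVIDIA's own methodology.

```python
FLOPS_PER_CORE_PER_CLOCK = 2  # assumption: one fused multiply-add per clock


def peak_tflops(cores: int, boost_mhz: float) -> float:
    """Theoretical peak throughput in TFLOPs for a core count and clock speed."""
    return cores * FLOPS_PER_CORE_PER_CLOCK * boost_mhz * 1e6 / 1e12


if __name__ == "__main__":
    boost_mhz = 1410.0  # A100 boost clock as listed in the comparison table
    print(f"FP32: {peak_tflops(6912, boost_mhz):.1f} TFLOPs")  # 6912 FP32 cores
    print(f"FP64: {peak_tflops(3456, boost_mhz):.1f} TFLOPs")  # 3456 FP64 cores
```

Under these assumptions the estimate lands on the quoted 19.5 and 9.7 TFLOPs. It says nothing about the clocks the 250W card can actually sustain.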

According to NVIDIA, the A100 PCIe accelerator can deliver 90% of the performance of the A100 HGX card (400W) in top server applications. This is mainly because the card needs less time to complete these tasks. In complex workloads that require sustained GPU capabilities, however, the card can deliver anywhere from 90% down to 50% of the performance of the 400W GPU in the most extreme cases. NVIDIA said that the 50% drop will be very rare and that only a few tasks can push the card to such an extent.
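To put those percentages in perspective, here is a minimal sketch of the implied performance per watt relative to the 400W card. It treats the rated TDP as a stand-in for actual power draw, which is a simplifying assumption, and the function name is hypothetical.

```python
def relative_perf_per_watt(relative_perf: float, tdp_w: float,
                           reference_tdp_w: float = 400.0) -> float:
    """Performance per watt relative to the reference card (1.0 means equal).

    relative_perf is the fraction of the reference card's performance,
    e.g. 0.9 for the 90% figure quoted by NVIDIA.
    """
    return relative_perf * reference_tdp_w / tdp_w


if __name__ == "__main__":
    for label, perf in (("best case (90%)", 0.9), ("worst case (50%)", 0.5)):
        ratio = relative_perf_per_watt(perf, tdp_w=250.0)
        print(f"{label}: {ratio:.2f}x the perf/W of the 400W card")
```

By this rough measure the PCIe card comes out ahead on efficiency at 90% (about 1.44x) and behind at 50% (about 0.80x).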

NVIDIA HPC / AI GPUs

NVIDIA Tesla Graphics Card	NVIDIA H200 (SXM5)	NVIDIA H100 (SXM5)	NVIDIA H100 (PCIe)	NVIDIA A100 (SXM4)	NVIDIA A100 (PCIe4)	Tesla V100S (PCIe)	Tesla V100 (SXM2)	Tesla P100 (SXM2)	Tesla P100 (PCI-Express)	Tesla M40 (PCI-Express)	Tesla K40 (PCI-Express)
GPU	GH200 (Hopper)	GH100 (Hopper)	GH100 (Hopper)	GA100 (Ampere)	GA100 (Ampere)	GV100 (Volta)	GV100 (Volta)	GP100 (Pascal)	GP100 (Pascal)	GM200 (Maxwell)	GK110 (Kepler)
Process Node	4nm	4nm	4nm	7nm	7nm	12nm	12nm	16nm	16nm	28nm	28nm
Transistors	80 Billion	80 Billion	80 Billion	54.2 Billion	54.2 Billion	21.1 Billion	21.1 Billion	15.3 Billion	15.3 Billion	8 Billion	7.1 Billion
GPU Die Size	814 mm²	814 mm²	814 mm²	826 mm²	826 mm²	815 mm²	815 mm²	610 mm²	610 mm²	601 mm²	551 mm²
SMs	132	132	114	108	108	80	80	56	56	24	15
TPCs	66	66	57	54	54	40	40	28	28	24	15
L2 Cache Size	51200 KB	51200 KB	51200 KB	40960 KB	40960 KB	6144 KB	6144 KB	4096 KB	4096 KB	3072 KB	1536 KB
FP32 CUDA Cores Per SM	128	128	128	64	64	64	64	64	64	128	192
FP64 CUDA Cores / SM	128	128	128	32	32	32	32	32	32	4	64
FP32 CUDA Cores	16896	16896	14592	6912	6912	5120	5120	3584	3584	3072	2880
FP64 CUDA Cores	16896	16896	14592	3456	3456	2560	2560	1792	1792	96	960
Tensor Cores	528	528	456	432	432	640	640	N/A	N/A	N/A	N/A
Texture Units	528	528	456	432	432	320	320	224	224	192	240
Boost Clock	~1850 MHz	~1850 MHz	~1650 MHz	1410 MHz	1410 MHz	1601 MHz	1530 MHz	1480 MHz	1329 MHz	1114 MHz	875 MHz
TOPs (DNN/AI)	3958 TOPs	3958 TOPs	3200 TOPs	2496 TOPs	2496 TOPs	130 TOPs	125 TOPs	N/A	N/A	N/A	N/A
FP16 Compute	1979 TFLOPs	1979 TFLOPs	1600 TFLOPs	624 TFLOPs	624 TFLOPs	32.8 TFLOPs	30.4 TFLOPs	21.2 TFLOPs	18.7 TFLOPs	N/A	N/A
FP32 Compute	67 TFLOPs	67 TFLOPs	800 TFLOPs	156 TFLOPs (19.5 TFLOPs standard)	156 TFLOPs (19.5 TFLOPs standard)	16.4 TFLOPs	15.7 TFLOPs	10.6 TFLOPs	10.0 TFLOPs	6.8 TFLOPs	5.04 TFLOPs
FP64 Compute	34 TFLOPs	34 TFLOPs	48 TFLOPs	19.5 TFLOPs (9.7 TFLOPs standard)	19.5 TFLOPs (9.7 TFLOPs standard)	8.2 TFLOPs	7.80 TFLOPs	5.30 TFLOPs	4.7 TFLOPs	0.2 TFLOPs	1.68 TFLOPs
Memory Interface	5120-bit HBM3e	5120-bit HBM3	5120-bit HBM2e	6144-bit HBM2e	6144-bit HBM2e	4096-bit HBM2	4096-bit HBM2	4096-bit HBM2	4096-bit HBM2	384-bit GDDR5	384-bit GDDR5
Memory Size	Up To 141 GB HBM3e @ 6.5 Gbps	Up To 80 GB HBM3 @ 5.2 Gbps	Up To 80 GB HBM2e @ 2.0 Gbps	Up To 40 GB HBM2 @ 1.6 TB/s; Up To 80 GB HBM2 @ 1.6 TB/s	Up To 40 GB HBM2 @ 1.6 TB/s; Up To 80 GB HBM2 @ 2.0 TB/s	16 GB HBM2 @ 1134 GB/s	16 GB HBM2 @ 900 GB/s	16 GB HBM2 @ 732 GB/s	16 GB HBM2 @ 732 GB/s; 12 GB HBM2 @ 549 GB/s	24 GB GDDR5 @ 288 GB/s	12 GB GDDR5 @ 288 GB/s
TDP	700W	700W	350W	400W	250W	250W	300W	300W	250W	250W	235W
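
The core-count rows in the table are related by simple multiplication: total cores should equal the SM count times the cores per SM. The sketch below, with values copied from the table and a hypothetical helper name, checks that relationship for every card. It only tests the table's internal consistency, not whether the values match NVIDIA's official specifications.

```python
# (card, SMs, FP32 cores per SM, FP64 cores per SM, FP32 cores, FP64 cores),
# copied from the table above.
ROWS = [
    ("H200 (SXM5)", 132, 128, 128, 16896, 16896),
    ("H100 (SXM5)", 132, 128, 128, 16896, 16896),
    ("H100 (PCIe)", 114, 128, 128, 14592, 14592),
    ("A100 (SXM4)", 108, 64, 32, 6912, 3456),
    ("A100 (PCIe4)", 108, 64, 32, 6912, 3456),
    ("V100S (PCIe)", 80, 64, 32, 5120, 2560),
    ("V100 (SXM2)", 80, 64, 32, 5120, 2560),
    ("P100 (SXM2)", 56, 64, 32, 3584, 1792),
    ("P100 (PCI-Express)", 56, 64, 32, 3584, 1792),
    ("M40 (PCI-Express)", 24, 128, 4, 3072, 96),
    ("K40 (PCI-Express)", 15, 192, 64, 2880, 960),
]


def inconsistent_rows(rows):
    """Return the cards whose totals do not equal SMs x cores per SM."""
    bad = []
    for name, sms, fp32_per_sm, fp64_per_sm, fp32_total, fp64_total in rows:
        if sms * fp32_per_sm != fp32_total or sms * fp64_per_sm != fp64_total:
            bad.append(name)
    return bad


if __name__ == "__main__":
    mismatches = inconsistent_rows(ROWS)
    print("Inconsistent rows:", mismatches if mismatches else "none")
```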