NVIDIA has announced its latest Pascal-based Tesla P40 and Tesla P4 GPU accelerators. The new cards are designed to accelerate AI / neural network inferencing, delivering up to a 45x boost over CPUs and around a 4x increase over past-generation GPUs. The GPU accelerators are backed by powerful software tools that deliver a massive increase in overall efficiency.
NVIDIA Tesla P40 and Tesla P4 Announced - Accelerating AI / Deep Neural Network Inferences
NVIDIA has created a platform for deep learning with its latest Tesla cards. The platform is segmented into training and inferencing GPUs. For AI training, NVIDIA offers the Tesla P100, the fastest compute solution available to date in both FP16 and FP64. Combined with the DIGITS training system and deep learning frameworks, it delivers higher efficiency and performance. On the other hand, the inferencing line is powered by the Tesla P40 and Tesla P4 accelerators.

The Tesla P4 and P40 are specifically designed for inferencing, which uses trained deep neural networks to recognize speech, images or text in response to queries from users and devices. Based on the Pascal architecture, these GPUs feature specialized inference instructions based on 8-bit (INT8) operations, delivering 45x faster response than CPUs and a 4x improvement over GPU solutions launched less than a year ago. via NVIDIA
Replacing the Tesla M40 and Tesla M4, the Pascal-based accelerators come with DeepStream SDK and TensorRT support. The two inferencing cards are based on the GP102 and GP104 GPUs, both of which are available on NVIDIA's consumer platforms in the form of GeForce and Quadro products. Let's take a look at the specifications of these cards:
NVIDIA Tesla P40 "Pascal GP102" Specifications:
The Tesla P40 is the faster of the two, featuring a full-fledged GP102 GPU core. The card consists of 3840 CUDA cores and 24 GB of GDDR5 memory. Clock speeds are maintained at 1303 MHz base and 1531 MHz boost. The memory is clocked at 7.2 GHz effective, which delivers 346 GB/s of bandwidth across a 384-bit interface. The chip packs 12 TFLOPs of FP32 and 47 TOPs of INT8 compute performance in a 250W TDP package. Like the Tesla M40 before it, the P40 comes in a passively cooled form factor.
NVIDIA Tesla P4 "Pascal GP104" Specifications:
The Tesla P4, on the other hand, features the GP104 core. It has the full 2560 CUDA cores but runs at much lower clock speeds of 810 MHz base and 1063 MHz boost. This has to do with the low-profile form factor the card is offered in, as it is designed for blade servers. The P4 also fits in a 50-75W power envelope, far below the GTX 1080's 180W TDP. The GTX 1080 features the same core count but higher clock speeds, reaching up to 2 GHz; the P4 is clocked at roughly half the rate of the 1080, hence the higher power efficiency.
The rest of the specifications include 8 GB of GDDR5 video memory. The memory is clocked at 6 GHz effective, offering 192 GB/s of bandwidth across a 256-bit bus. Compute performance for this card is rated at 5.5 TFLOPs (FP32) and 22 TOPs (INT8). No pricing has been announced for the Tesla P40 or Tesla P4, but they are expected to hit the market through OEM channels in Q4 (October-November) 2016.

NVIDIA Tesla P40 and Tesla P4 Specifications:
| Product Name | Tesla M4 | Tesla M40 | Tesla P4 | Tesla P40 |
|---|---|---|---|---|
| GPU Architecture | Maxwell GM206 | Maxwell GM200 | Pascal GP104 | Pascal GP102 |
| GPU Process | 28nm | 28nm | 16nm FinFET | 16nm FinFET |
| CUDA Cores | 1280 | 3072 | 2560 | 3840 |
| Boost Clock | 1072 MHz | 1114 MHz | 1063 MHz | 1531 MHz |
| FP32 Compute | 2.20 TFLOPs | 7.00 TFLOPs | 5.50 TFLOPs | 12.0 TFLOPs |
| INT8 Compute | N/A | N/A | 22 TOPs | 47 TOPs |
| VRAM | 4 GB GDDR5 | 24 GB GDDR5 | 8 GB GDDR5 | 24 GB GDDR5 |
| Memory Clock | 5.5 GHz | 6.0 GHz | 6.0 GHz | 7.2 GHz |
| Memory Bus | 128-bit | 384-bit | 256-bit | 384-bit |
| Memory Bandwidth | 88.0 GB/s | 288.0 GB/s | 192.0 GB/s | 346 GB/s |
| TDP | ~75W | 250W | ~75W | 250W |
| Launch | 2015 | 2015 | 2016 | 2016 |
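For readers who want to sanity-check the table, the headline figures follow directly from core counts and clocks: each CUDA core retires one FMA (two FP32 operations) per clock, and Pascal's 8-bit dot-product path quadruples that rate for INT8. A minimal Python sketch of the arithmetic, assuming those per-clock rates:

```python
# Back-of-the-envelope check of the figures in the table above.

def fp32_tflops(cuda_cores, boost_mhz):
    # one FMA = 2 FP32 operations per core per clock
    return cuda_cores * 2 * boost_mhz / 1e6

def int8_tops(cuda_cores, boost_mhz):
    # Pascal's 8-bit dot-product path does 4 INT8 ops per FP32 lane
    return 4 * fp32_tflops(cuda_cores, boost_mhz)

def bandwidth_gbs(effective_ghz, bus_bits):
    return effective_ghz * bus_bits / 8

# Tesla P40: 3840 cores @ 1531 MHz boost, 7.2 GHz GDDR5 on a 384-bit bus
print(fp32_tflops(3840, 1531))   # ~11.8 -> quoted as 12 TFLOPs
print(int8_tops(3840, 1531))     # ~47 TOPs
print(bandwidth_gbs(7.2, 384))   # 345.6 -> quoted as 346 GB/s

# Tesla P4: 2560 cores @ 1063 MHz boost, 6.0 GHz GDDR5 on a 256-bit bus
print(fp32_tflops(2560, 1063))   # ~5.4 -> quoted as 5.5 TFLOPs
print(int8_tops(2560, 1063))     # ~21.8 -> quoted as 22 TOPs
print(bandwidth_gbs(6.0, 256))   # 192.0 GB/s
```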
Software Tools for Faster Inferencing
Complementing the Tesla P4 and P40 are two software innovations to accelerate AI inferencing: NVIDIA TensorRT and the NVIDIA DeepStream SDK.
TensorRT is a library for optimizing deep learning models for production deployment, delivering instant responsiveness for the most complex networks. It maximizes the throughput and efficiency of deep learning applications by taking trained neural nets, defined with 32-bit or 16-bit operations, and optimizing them for reduced-precision INT8 operations.
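The reduced-precision idea itself is easy to illustrate: scale FP32 values onto 8-bit integers, accumulate in INT32, and rescale the result. The NumPy sketch below shows a simple symmetric per-tensor quantization scheme; this is only a conceptual illustration, not TensorRT's actual calibration algorithm, which chooses scales more carefully to preserve accuracy.

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor quantization: FP32 tensor -> INT8 tensor + scale."""
    scale = np.abs(x).max() / 127.0            # map the largest magnitude to 127
    q = np.clip(np.round(x / scale), -128, 127).astype(np.int8)
    return q, scale

def int8_matmul(x_q, x_scale, w_q, w_scale):
    """INT8 multiply with INT32 accumulation, rescaled back to FP32."""
    acc = x_q.astype(np.int32) @ w_q.astype(np.int32)
    return acc.astype(np.float32) * (x_scale * w_scale)

rng = np.random.default_rng(0)
w = rng.standard_normal((256, 256)).astype(np.float32)  # stands in for trained weights
x = rng.standard_normal((1, 256)).astype(np.float32)    # stands in for an input

w_q, w_s = quantize_int8(w)
x_q, x_s = quantize_int8(x)

print(np.abs(x @ w - int8_matmul(x_q, x_s, w_q, w_s)).max())  # small quantization error
```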
The NVIDIA DeepStream SDK taps into the power of a Pascal server to simultaneously decode and analyze up to 93 HD video streams in real time, compared with seven streams on dual CPUs. This addresses one of the grand challenges of AI: understanding video content at scale for applications such as self-driving cars, interactive robots, filtering and ad placement. Integrating deep learning into video applications allows companies to offer smart, innovative video services that were previously impossible to deliver.
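Much of that throughput comes from batching: frames decoded from many streams are grouped into one large inference call instead of dozens of small ones. The sketch below illustrates the idea with hypothetical `decode_frame` and `classify_batch` stubs; it is not the DeepStream SDK API.

```python
import numpy as np

NUM_STREAMS = 93      # HD streams quoted for a single Pascal-based server
BATCH_SIZE = 32       # frames grouped into one batched inference call

def decode_frame(stream_id):
    """Hypothetical stub standing in for a hardware video decoder."""
    return np.zeros((224, 224, 3), dtype=np.uint8)

def classify_batch(frames):
    """Hypothetical stub standing in for a batched INT8 inference call."""
    return [0] * len(frames)

# Round-robin across all streams, batching decoded frames so the GPU
# executes one large inference instead of 93 tiny ones.
results = {}
frames, owners = [], []
for stream_id in range(NUM_STREAMS):
    frames.append(decode_frame(stream_id))
    owners.append(stream_id)
    if len(frames) == BATCH_SIZE or stream_id == NUM_STREAMS - 1:
        for sid, label in zip(owners, classify_batch(frames)):
            results[sid] = label
        frames, owners = [], []
```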
NVIDIA Offers 10W, Palm-Sized, Energy-Efficient AI Computer for Self-Driving Cars
NVIDIA also announced a new DRIVE PX 2 board for self-driving cars. While the original design uses two Parker SoCs, the new model is a single-chip design. With a TDP of just 10W and a much smaller board footprint, the AI supercomputer makes the platform far more affordable.

"Baidu and NVIDIA are leveraging our AI skills together to create a cloud-to-car system for self-driving," said Liu Jun, vice president of Baidu. "The new, small form-factor DRIVE PX 2 will be used in Baidu's HD map-based self-driving solution for car manufacturers." via NVIDIA
The new single-processor DRIVE PX 2 will be available to production partners in the fourth quarter of 2016. DriveWorks software and the DRIVE PX 2 configuration with two SoCs and two discrete GPUs are available today for developers working on autonomous vehicles.