Intel Doubled FP8 GPT3 Performance For Its Gaudi 2 AI Chip In Latest MLPerf Benchmarks
Feb 13, 2026 12:43 AM

Intel has released the November 2023 update to its MLPerf Training 3.1 results, delivering a 103% performance increase versus the 90% it projected back in June. Only three vendors currently submit GPT-3 results to MLPerf - Intel, NVIDIA and Google - making Intel's Gaudi 2 currently the only viable alternative to NVIDIA's GPUs (is that even the correct term anymore?) for MLPerf AI workloads.

Intel showcases competitive price/performance to NVIDIA's leading edge Hopper chips in latest MLPerf 3.1

Intel was also quick to point out that Xeon is the only CPU submitting training results to the MLPerf benchmark as well. Without further ado, here are the slides presented:

As you can see, Intel's Gaudi team initially projected a 90% performance gain from FP8 but was able to deliver a 103% gain on the GPT-3 industry benchmark, cutting time-to-train (across 384 accelerators) from 311.94 minutes, or 5.2 hours, down to 153.58 minutes, or roughly 2.5 hours. Intel also presented several slides to aid TCO (total cost of ownership) based decision making, showing that the Gaudi 2 chip offers similar performance to the NVIDIA H100 at a lower server cost - making it competitive in price/performance.
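The headline claim of "doubled" performance follows directly from the two time-to-train figures above; a quick sanity check of the arithmetic:

```python
# Sanity-check the reported GPT-3 time-to-train improvement on 384 Gaudi 2
# accelerators (both figures taken from the article above).
june_minutes = 311.94   # June 2023 submission (BF16)
nov_minutes = 153.58    # November 2023 submission (FP8)

speedup = june_minutes / nov_minutes
gain_pct = (speedup - 1) * 100

print(f"Speedup: {speedup:.2f}x")            # ~2.03x
print(f"Performance gain: {gain_pct:.0f}%")  # ~103%, vs. the 90% projection
```

A shorter training run at the same cluster size translates one-to-one into a throughput gain, which is why a 2.03x time reduction is reported as a 103% performance increase.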

On GPTJ-99, Gaudi 2 shines even more, coming in just slightly behind NVIDIA's new Hopper chips. While the discussion back in June was about Gaudi 2 merely being a viable alternative to NVIDIA's chips and significantly behind the H100 (only trading blows with the older A100), the Gaudi 2 chip now sits just behind the H100 and GH200-96G setups. The H100 is just 9% faster and the GH200-96G just 12% faster than Gaudi 2 in Server throughput benchmarks; that lead extends to 28% in Offline benchmarks. Gaudi 2 outperformed the A100 by close to 2x in both cases.

Lastly, Intel reiterated that Xeon is the only CPU currently submitting MLPerf benchmarks and emphasized its commitment to AI workloads.

[Gallery: Intel MLPerf November briefing slides, pages 1-12]

About the Intel Gaudi2 Results:

Gaudi2 continues to be the only viable alternative to NVIDIA's H100 for AI compute needs, delivering a significant price-performance advantage. MLPerf results for Gaudi2 showed the AI accelerator's increasing training performance:

- Gaudi2 demonstrated a 2x performance leap with the implementation of the FP8 data type on the v3.1 training GPT-3 benchmark, reducing time-to-train by more than half compared to the June MLPerf benchmark and completing the training in 153.58 minutes on 384 Intel Gaudi2 accelerators. The Gaudi2 accelerator supports FP8 in both E5M2 and E4M3 formats, with the option of delayed scaling when necessary.
- Intel Gaudi2 demonstrated training on the Stable Diffusion multi-modal model with 64 accelerators in 20.2 minutes, using BF16. In future MLPerf training benchmarks, Stable Diffusion performance will be submitted on the FP8 data type.
- On eight Intel Gaudi2 accelerators, benchmark results were 13.27 and 15.92 minutes for BERT and ResNet-50, respectively, using BF16.

About the 4th Gen Xeon Results:

Intel remains the only CPU vendor to submit MLPerf results. The MLPerf results for 4th Gen Xeon highlighted its strong performance:

- Intel submitted results for ResNet-50, RetinaNet, BERT and DLRM dcnv2. The 4th Gen Intel Xeon Scalable processors' results for ResNet-50, RetinaNet and BERT were similar to the strong out-of-box performance results submitted for the June 2023 MLPerf benchmark.
- DLRM dcnv2 is a new model since June's submission, with the CPU demonstrating a time-to-train of 227 minutes using only four nodes.
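The Gaudi2 bullets above mention the E5M2 and E4M3 FP8 formats and "delayed scaling"; a minimal Python sketch of what those terms mean follows. The numeric properties come from the OCP 8-bit floating point specification; the function names are invented for illustration and are not Gaudi/Habana APIs.

```python
# E5M2: 5 exponent bits (bias 15), 2 mantissa bits, IEEE-style Inf/NaN,
#   so the largest finite value is (2 - 2**-2) * 2**15 = 57344.
# E4M3: 4 exponent bits (bias 7), 3 mantissa bits; the top exponent code is
#   reused for normal numbers (only mantissa = all-ones encodes NaN), so the
#   largest finite value is (2 - 2**-2) * 2**8 = 448.
E5M2_MAX = (2 - 2 ** -2) * 2 ** 15  # wide range, coarse precision
E4M3_MAX = (2 - 2 ** -2) * 2 ** 8   # narrow range, finer precision

def delayed_scale(amax_history: list, fp8_max: float) -> float:
    """Delayed scaling: choose the quantization scale from absolute maxima
    observed in *previous* steps, so the current tensor does not need an
    extra reduction pass before being cast to FP8 (illustrative sketch)."""
    amax = max(amax_history) if amax_history else 1.0
    return fp8_max / amax

def saturating_cast(x: float, scale: float, fp8_max: float) -> float:
    """Scale a value and saturate it to the format's finite range
    (ignoring the mantissa rounding a real hardware cast also performs)."""
    return max(-fp8_max, min(fp8_max, x * scale))
```

Because the scale is derived from stale statistics, a value that spikes beyond the recorded history simply saturates at the format's maximum instead of overflowing, which is one reason the finite-saturating E4M3 format pairs naturally with delayed scaling in training.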
