University of Florida Demonstrates Leadership in Frontier-Scale AI Training with First MLPerf Training Submission
The University of Florida (UF) is announcing its first performance result submission on MLPerf® Training benchmarks, published by MLCommons®. It marks a major milestone for UF, the nation’s first and leading AI university, and validates that strong, large-scale AI model training performance can be achieved on academic supercomputing infrastructure. The submission leveraged the newly completed fourth-generation installation of HiPerGator, UF’s supercomputer, which features an NVIDIA DGX B200 SuperPOD. The results demonstrate readiness to support diverse and demanding AI training workloads across research, education and industry.
UF’s submission included all seven models in the MLPerf Training v5.1 benchmark suite: SSD-RetinaNet, DLRMv2, Llama-3.1-405B, Llama-2-70B-LoRA, Llama-3.1-8B, RGAT, and Flux.1, using NVIDIA’s MLPerf containers lightly adjusted for UF’s environment. Benchmarks were executed on the B200 SuperPOD in configurations ranging from one node to 56 nodes (eight to 448 B200 GPUs). As the sole academic institution in this round, UF delivered competitive, industry-class performance results comparable to those of leading commercial vendors.
"We are thrilled to welcome UF and the HiPerGator team as the only academic submitter in this round,” said David Kanter, Head of MLPerf® for MLCommons. “I am impressed by the comprehensive breadth and scalability of the results and their dedication to driving the frontiers of AI performance."
The table below summarizes performance across the MLPerf Training benchmarks at a wide range of node counts, demonstrating both broad coverage and strong scaling efficiency. Notably, UF was the only organization to submit DGX B200 results for Llama 3.1 405B in 32-node and 56-node configurations, showcasing scalability and system stability at scales no other submitter attempted. Across all benchmarks, UF’s submissions consistently ranked among the top performers, with leading times to train and near-linear scaling efficiency. These results position HiPerGator’s DGX B200 system as a high-performing, production-ready platform for state-of-the-art AI workloads across a variety of domains, with excellent overall throughput and reliability.
To preserve reproducibility and compliance with MLPerf closed-division requirements, all runs used Apptainer containers with SLURM batch scheduling and a parallel Lustre file system. The runs employed secure, rootless container execution without Docker or elevated privileges. This practical methodology shows how multi-tenant academic systems can run compliant, high-performance AI workloads while retaining software-stack integrity.
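A minimal sketch of what such a run can look like on a SLURM cluster with rootless Apptainer. The job name, partition, image name, bind paths, and benchmark entry point below are illustrative assumptions, not UF's actual submission scripts:

```shell
#!/bin/bash
#SBATCH --job-name=mlperf-train        # illustrative job name
#SBATCH --nodes=8                      # e.g. an 8-node (64-GPU) configuration
#SBATCH --ntasks-per-node=8            # one task per B200 GPU
#SBATCH --gpus-per-node=8
#SBATCH --partition=hpg-b200           # hypothetical partition name

# Rootless Apptainer execution: no Docker daemon, no elevated privileges.
# --nv exposes the NVIDIA GPUs inside the container; --bind mounts the
# Lustre dataset path. The .sif image and paths are placeholders.
srun apptainer exec --nv \
    --bind /lustre/data:/data \
    mlperf_training.sif \
    ./run_and_time.sh
```

Because Apptainer runs unprivileged and consumes the same OCI-derived images used elsewhere, the NVIDIA MLPerf container stack can be reused largely unmodified under an ordinary batch-scheduler account.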
As the only academic institution in this submission round, UF is contributing operational insights to the community and helping advance transparent, trustworthy performance measurements for AI training. The submission highlights UF’s commitment to sharing experience, collaborating with peers, and enabling more institutions to run compliant AI workloads on shared HPC infrastructure.
MLPerf Training Benchmark v5.1 Results: HiPerGator NVIDIA Blackwell GPU (B200-SXM-180GB)

| Model | Nodes (Total Accelerators) | Avg. Time to Train (min) | UF Rank Among B200 Submissions | Model Description |
|---|---|---|---|---|
| Flux.1 | 2 (16) | 173.448 | #1 | A text-to-image generative model (11.9B parameters) trained on CC12M. Used for high-quality image generation, multimodal tasks, and scientific/creative AI applications. |
| Flux.1 | 4 (32) | 93.461 | #2 | |
| Flux.1 | 8 (64) | 60.078 | #1 | |
| Flux.1 | 9 (72) | 54.54 | #1 | |
| Flux.1 | 16 (128) | 47.716 | #1 | |
| Llama2 70B LoRA | 2 (16) | 6.296 | #2 | A 70B-parameter LLM fine-tuned with LoRA, trained on SCROLLS GovReport. Used for summarization, instruction following, and productivity tasks. |
| Llama2 70B LoRA | 4 (32) | 3.58 | #2 | |
| Llama2 70B LoRA | 8 (64) | 2.086 | #1 | |
| Llama2 70B LoRA | 16 (128) | 1.865 | #1 | |
| Llama3.1 405B | 32 (256) | 256.333 | #1 | A 405B-parameter LLM trained on the C4 corpus. Designed for advanced reasoning, coding, enterprise AI, and next-generation multimodal workloads. |
| Llama3.1 405B | 56 (448) | 147.155 | #1 | |
| Llama3.1 8B | 2 (16) | 59.561 | #2 | An 8B-parameter small LLM trained on C4. Optimized for efficient training, fast iteration, and edge/on-device AI applications. |
| Llama3.1 8B | 4 (32) | 35.682 | #2 | |
| Llama3.1 8B | 8 (64) | 18.1 | #1 | |
| Llama3.1 8B | 16 (128) | 11.569 | #1 | |
| RetinaNet | 2 (16) | 38.451 | #3 | A 37M-parameter object detection model trained on OpenImages. Widely used for real-time image recognition and computer vision workloads. |
| RetinaNet | 4 (32) | 27.686 | #1 | |
| RetinaNet | 8 (64) | 17.645 | #1 | |
| RetinaNet | 16 (128) | 10.107 | #1 | |
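The scaling behavior can be checked directly from the table. As a worked example, the snippet below computes scaling efficiency from the two Llama 3.1 405B rows (32 nodes at 256.333 min, 56 nodes at 147.155 min); the helper function is our own, not part of the MLPerf tooling.

```python
def scaling_efficiency(nodes_a, time_a, nodes_b, time_b):
    """Measured speedup between two node counts, relative to ideal linear scaling."""
    speedup = time_a / time_b   # measured speedup going from config a to config b
    ideal = nodes_b / nodes_a   # perfect linear speedup for the same node increase
    return speedup / ideal

# Llama 3.1 405B rows from the table above: 32 nodes -> 56 nodes.
eff = scaling_efficiency(32, 256.333, 56, 147.155)
print(f"{eff:.1%}")  # prints 99.5% -- near-linear scaling over this range
```

A value near 100% means the added nodes are being used almost perfectly; the 405B result above shows essentially no efficiency loss when growing from 256 to 448 GPUs.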