AI Solution Brief
AI Inference on Ampere Altra Max
The Ampere® Altra® Max processor is a complete system-on-chip (SOC) solution that supports up to 128 high-performance cores with an innovative architecture that delivers predictable high performance, linear scaling, and high energy efficiency. AI inference is a rapidly growing production workload in the cloud. While training deep neural networks requires a significant amount of GPU or similar hardware acceleration infrastructure, running inference on fully trained, deployment-ready AI models can be handled by CPUs in most situations. We demonstrate that Ampere Altra Max is ideal for running AI inference in the cloud, not only meeting latency and throughput requirements, but also outperforming CPUs based on the x86 architecture as well as other Arm-based processors currently used in the cloud.
Ampere Altra Max processors deliver exceptional performance and power efficiency for AI workloads. Running AI inference on Ampere Altra Max requires no modification or translation of your neural network, regardless of the platform it was trained on, as long as it was developed with one of the industry-standard AI frameworks such as TensorFlow, PyTorch or ONNX. Ampere's optimized TensorFlow, PyTorch and ONNX builds are available at no charge, either from our cloud partners or directly from Ampere.
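As a minimal sketch of what that looks like in practice (this is not Ampere-specific code; the SavedModel path and input shape below are hypothetical placeholders), the standard TensorFlow inference calls run unchanged:

    import numpy as np
    import tensorflow as tf

    # Ordinary TensorFlow inference code; nothing here is Altra-specific.
    model = tf.saved_model.load("./my_trained_model")    # hypothetical SavedModel path
    infer = model.signatures["serving_default"]          # standard serving signature

    batch = tf.constant(np.random.rand(1, 224, 224, 3), dtype=tf.float32)
    outputs = infer(batch)                               # same call on any platform
    print({name: tensor.shape for name, tensor in outputs.items()})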
Ampere Altra Max is the only cloud processor that currently supports the fp16 data format. The fp16 format provides up to a 2x speed-up over fp32 models with no or negligible loss of accuracy. Quantization from fp32 is straightforward and requires no retraining or rescaling of the weights. If a model was trained in fp16 on a GPU, inference can be run on it out of the box. Ampere Altra Max supports the fp32, fp16 and int8 data formats.
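As a hedged illustration of that casting step, here is a minimal sketch using standard Keras APIs (a toy two-layer network stands in for a trained model; Ampere's optimized frameworks may provide their own path to fp16):

    import numpy as np
    import tensorflow as tf

    def build_model():
        # Toy stand-in for a real network; the architecture is illustrative only.
        inputs = tf.keras.Input(shape=(64,))
        hidden = tf.keras.layers.Dense(128, activation="relu")(inputs)
        outputs = tf.keras.layers.Dense(10, activation="softmax")(hidden)
        return tf.keras.Model(inputs, outputs)

    fp32_model = build_model()                       # built under the default float32 policy
    fp32_weights = fp32_model.get_weights()          # pretend these were trained in fp32

    tf.keras.mixed_precision.set_global_policy("float16")
    fp16_model = build_model()                       # same architecture, fp16 compute and storage
    fp16_model.set_weights(fp32_weights)             # a plain cast: no retraining, no rescaling

    sample = np.random.rand(1, 64).astype("float32")
    print(fp16_model(sample, training=False).dtype)  # float16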
Ampere provides an ever-growing family of optimized, pre-trained models available for download to use for demos or to adapt and use in your applications.
Finally, Ampere Altra Max CPUs also work in tandem with NVIDIA GPUs for your training needs.
We have run a series of benchmarks following MLCommons guidelines to demonstrate and document the superior performance of Ampere Altra Max CPUs in many representative AI inference tasks, including Computer Vision and NLP applications.
Cloud Native: Designed from the ground up for 'born in the cloud' workloads, Ampere Altra Max delivers up to 2x higher inference performance than the best x86 servers and 5x higher than similar Arm-based processors.
Industry Standard Platforms: Ampere Altra Max runs AI inference workloads developed on TensorFlow, PyTorch or ONNX without modifications. Customers can run their applications by simply using our optimized frameworks, available free of charge from Ampere or our cloud partners.
Support for fp16 format: Ampere Altra Max is the only broadly available cloud CPU that natively supports the fp16 data format. Quantizing fp32 trained networks to fp16 is straightforward and results in no visible accuracy loss.
Scalable: With an innovative scale-out architecture, Ampere Altra Max processors combine a high core count with compelling single-threaded performance. Together with a consistent frequency across all cores, Ampere Altra Max delivers consistent socket-level performance greater than that of the best x86 servers. This also provides much higher resistance to noisy neighbors in multitenant environments.
Energy Efficiency: With up to 128 energy-efficient Arm cores, Ampere Altra Max has a 60% performance-per-watt advantage over leading x86 servers while also delivering higher performance. Industry-leading performance and high energy efficiency give Ampere Altra Max a smaller carbon footprint and reduce Total Cost of Ownership (TCO).
The benchmarks were performed using TensorFlow on bare-metal, single-socket servers with equivalent memory, networking, and storage configurations for the x86 platforms shown. Processors tested include AMD EPYC 7J13 “Milan” with TF 2.7 ZenDNN, Intel Xeon 8375C “Cascade Lake” with TF 2.7 DNNL, Intel Xeon 8380 “Ice Lake” with TF 2.7 DNNL, and Ampere Altra Max M128-30 with Ampere Optimized TF 2.7. The Arm-based “Graviton 2”, available exclusively through AWS (c6g shape), was tested in a 64-core configuration.
Detailed benchmark conditions and configurations for each device type can be found in the footnotes at the end of this document.
Having run various AI workloads according to MLCommons benchmarking guidelines, we present some of our results below.
In Computer Vision, using SSD ResNet-34 for a typical Object Detection application, Ampere Altra Max outperforms the Intel Xeon 8375C by 2x and the AMD EPYC 7J13 and Graviton 2 by 4x in fp32 latency. In fp16, Altra Max extends its lead by an additional factor of two while maintaining the same accuracy. See Figure 1.
Ampere Altra Max also has a significant advantage in performance per watt over its competitors. In fp16 precision, Altra Max is around 5x more power efficient than Intel Xeon and AMD EPYC. In fp32 precision, Altra Max maintains a 2x advantage over the same Intel and AMD devices (Figure 2).
Ampere Altra Max processors are a complete system-on-chip (SOC) solution built for Cloud Native workloads, designed to deliver exceptional performance and energy efficiency for AI inferencing. Ampere Altra Max delivers up to 4x higher performance than the Intel® Xeon® Platinum 8375C and AMD EPYC 7J13. In power efficiency, Ampere Altra Max also leads its competitors, consuming 60% less power at equivalent throughput.
Visit https://solutions.amperecomputing.com/solutions/ampere-ai to learn how to access Ampere systems from our partner Cloud Service Providers and experience the performance and power efficiency of Ampere processors.
All data and information contained herein is for informational purposes only and Ampere reserves the right to change it without notice. This document may contain technical inaccuracies, omissions and typographical errors, and Ampere is under no obligation to update or correct this information. Ampere makes no representations or warranties of any kind, including but not limited to express or implied guarantees of noninfringement, merchantability, or fitness for a particular purpose, and assumes no liability of any kind. All information is provided “AS IS.” This document is not an offer or a binding commitment by Ampere. Use of the products contemplated herein requires the subsequent negotiation and execution of a definitive agreement or is subject to Ampere’s Terms and Conditions for the Sale of Goods.
System configurations, components, software versions, and testing environments that differ from those used in Ampere’s tests may result in different measurements than those obtained by Ampere.
©2022 Ampere Computing. All Rights Reserved. Ampere, Ampere Computing, Altra and the ‘A’ logo are all registered trademarks or trademarks of Ampere Computing. Arm is a registered trademark of Arm Limited (or its subsidiaries). All other product names used in this publication are for identification purposes only and may be trademarks of their respective companies.
Ampere Computing® / 4655 Great America Parkway, Suite 601 / Santa Clara, CA 95054 / amperecomputing.com
TensorFlow benchmarks were performed on bare-metal, single-socket servers with equivalent memory, networking, and storage configurations for the x86 platforms shown. Processors tested include AMD EPYC 7J13 “Milan” with TF 2.7 ZenDNN, Intel Xeon 8375C “Cascade Lake” with TF 2.7 DNNL, Intel Xeon 8380 “Ice Lake” with TF 2.7 DNNL, and Ampere Altra Max M128-30 with Ampere Optimized TF 2.7. The Arm-based “Graviton 2”, available exclusively through AWS (c6g shape), was tested in a 64-core configuration.
Benchmarks were performed with Ampere’s internal testing software, which is based on the Ampere Model Library. This software is written entirely in Python and follows the MLCommons Inference (a.k.a. MLPerf) methodology for calculating latency and throughput. It uses the standard APIs of the frameworks in the common way, replicating their usage in real-life applications. For the latency benchmarks, a single system process was executed at a time for each configuration listed below. Each process, following a warm-up run, ran workloads with a batch size of 1 in a loop for a minimum of 60 seconds. A final latency value was then calculated from the net inference times collected for each pass through the network.
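A minimal sketch of this single-process latency loop is shown below (this is not Ampere's internal harness; the model path and input shape are hypothetical):

    import time
    import numpy as np
    import tensorflow as tf

    model = tf.saved_model.load("./ssd_resnet34_fp32")   # hypothetical SavedModel path
    infer = model.signatures["serving_default"]
    sample = tf.constant(np.random.rand(1, 1200, 1200, 3), dtype=tf.float32)  # batch size 1

    infer(sample)                                         # warm-up run

    latencies = []
    start = time.time()
    while time.time() - start < 60.0:                     # run for a minimum of 60 seconds
        t0 = time.time()
        infer(sample)
        latencies.append(time.time() - t0)                # net inference time of each pass

    print(f"mean latency: {1000 * np.mean(latencies):.2f} ms over {len(latencies)} passes")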
For the multi-process throughput benchmarks, a search space of different batch sizes and numbers of threads per process was covered. Final throughput values were estimated from the median (50th percentile) latencies observed during 60-second multi-process runs, as sketched below. All systems were benchmarked running workloads with the following batch sizes for each of n parallel processes: [1, 4, 16, 32, 64, 128, 256]. The numbers of threads per process vs. the total numbers of processes were, respectively:
Benchmarks on all platforms were run using the same scripts, the same datasets, and the same model representations. All platforms ran the same workloads, applying identical pre- and post-processing and making uniform inference calls. In the case of the fp16 Altra data, values were obtained with the same scripts, while the AI model representations differed from their fp32 counterparts only in the precision of the weights: the quantization process consisted solely of casting to the lower float precision.
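The multi-process throughput estimation described above can be sketched as follows (a dummy numpy workload stands in for the model call, and the helper names are hypothetical, not part of Ampere's software):

    import time
    import numpy as np
    from multiprocessing import Pool

    def run_latency_loop(batch_size, duration_s=5.0):
        # Stand-in for one benchmark process; the real harness would time the
        # TensorFlow inference call shown in the latency sketch above.
        latencies = []
        start = time.time()
        while time.time() - start < duration_s:
            t0 = time.time()
            np.ones((batch_size, 512)) @ np.ones((512, 512))   # dummy workload
            latencies.append(time.time() - t0)
        return latencies

    def estimate_throughput(num_processes, batch_size):
        with Pool(num_processes) as pool:
            per_process = pool.map(run_latency_loop, [batch_size] * num_processes)
        p50 = np.percentile(np.concatenate(per_process), 50)   # median per-pass latency
        return num_processes * batch_size / p50                # samples per second

    if __name__ == "__main__":
        # Sweep the batch sizes listed above; threads per process would be varied
        # analogously through the framework's threading settings.
        for bs in (1, 4, 16, 32, 64, 128, 256):
            print(bs, round(estimate_throughput(num_processes=4, batch_size=bs)))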
Across all systems tested, the TensorFlow library was used in the latest version available for the given platform: