Supermicro and NVIDIA Deliver Optimized Systems for AI, ML, and More

Making the Most of Advanced Data Access and Transfer To Boost Productivity

Modern enterprises are gaining considerable competitive advantages from using advanced applications and data processing in their businesses and operations. These include AI-based large language models (LLMs) such as ChatGPT and LLaMA; machine learning (ML) analyses based on enormous sets of training and real-world data; complex 3D and finite element models and simulations; and other data- and compute-intensive applications.

All such workloads have at least this much in common: they benefit significantly from fast access to storage across whatever tiered storage model is in use. That's one major reason why so many enterprises and service providers have turned to GPU-based servers to handle large, complicated datasets and the workloads that consume them. Such servers handle those workloads far better, and complete them more quickly, than conventional servers with more typical storage configurations (e.g., local RAM and NVMe SSDs, with additional storage tiers on the LAN or in the cloud).

The keys to boosting throughput are lower latency and higher storage bandwidth, achieved primarily through clever IO and networking techniques that rely on direct and remote memory access, as explained next. These gains translate directly into improved productivity and capability: faster model training and job completion mean AI-powered applications can be deployed sooner and get work done faster, speeding time to value.

Direct Memory Access and Remote Equivalents

Direct memory access (aka DMA) has been used to speed IO since the early days of computing. Basically, DMA performs memory-to-memory transfers from one device to another across a bus (or another interface of some kind). It works by copying the data in a range of memory addresses directly from the sender's memory to the receiver's memory (in both directions, for two-way transfers). This takes the CPU out of the data path and speeds transfers by reducing the number of copy operations involved: the CPU need not copy the sender's data into its own memory and then copy that data again from its memory to the receiver's memory.
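To make the copy-count argument concrete, here is a toy Python model. It is an illustration only: plain in-memory buffers stand in for device memory, the function names are hypothetical, and real DMA is carried out by hardware DMA engines rather than by application code.

```python
# Toy model only: Python buffers stand in for device memory, and the
# function names are hypothetical. Real DMA is carried out by hardware.

def cpu_mediated_transfer(src: bytearray, dst: bytearray) -> int:
    """Sender -> CPU memory -> receiver. Returns the number of copies made."""
    cpu_buffer = bytearray(len(src))
    cpu_buffer[:] = src   # copy 1: sender's memory into the CPU's memory
    dst[:] = cpu_buffer   # copy 2: the CPU's memory into the receiver's memory
    return 2

def direct_transfer(src: bytearray, dst: bytearray) -> int:
    """Sender -> receiver in one step, as a DMA engine would do it."""
    dst[:] = src          # the only copy: sender's memory straight to receiver's
    return 1

payload = bytearray(b"training batch")
via_cpu = bytearray(len(payload))
via_dma = bytearray(len(payload))
print(cpu_mediated_transfer(payload, via_cpu), direct_transfer(payload, via_dma))
```

Both paths deliver identical data; the direct path simply does it with half the copies, which is the saving DMA provides.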

Indeed, DMA performance on a single system is limited mainly by the speed of the bus (or other interface) that links the sending and receiving devices involved in a data transfer. For PCIe 4.0, that's 16 gigatransfers per second (GT/s) per lane, with double that rate for PCIe 5.0 (32 GT/s). Actual data rates are somewhat lower because of encoding and packet overheads, but the rated bandwidth of an x16 link (counting both directions) is 64 GBps for PCIe 4.0 and 128 GBps for PCIe 5.0. That's fast!
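The headline PCIe figures follow from simple arithmetic. The sketch below assumes an x16 link (the usual slot width for GPUs and high-end NICs), one bit per transfer per lane, and 128b/130b line encoding; it ignores packet-level overheads, so real throughput is lower still.

```python
# Back-of-the-envelope PCIe bandwidth per link.
# Assumes 128b/130b line encoding (used by PCIe 3.0 through 5.0) and ignores
# packet (TLP/DLLP) overheads, which reduce real throughput further.

def pcie_bandwidth_gbytes(gt_per_s: float, lanes: int = 16, encoded: bool = True) -> float:
    """Return one-direction bandwidth in GB/s for a PCIe link."""
    efficiency = 128 / 130 if encoded else 1.0
    bits_per_s = gt_per_s * 1e9 * lanes * efficiency   # one bit per transfer per lane
    return bits_per_s / 8 / 1e9                         # bits -> bytes -> GB

for name, rate in [("PCIe 4.0", 16), ("PCIe 5.0", 32)]:
    raw = pcie_bandwidth_gbytes(rate, encoded=False)
    eff = pcie_bandwidth_gbytes(rate)
    print(f"{name} x16: {raw:.0f} GB/s per direction raw, "
          f"{2 * raw:.0f} GB/s both directions; ~{eff:.1f} GB/s per direction after encoding")
```

Doubling the raw per-direction figure reproduces the 64 GBps and 128 GBps ratings, while the encoded figures (about 31.5 and 63.0 GB/s per direction) show why delivered rates fall a little short.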

Remote DMA (aka RDMA) extends DMA beyond a single computer so that it works between a pair of devices across a network connection. RDMA is typically accessed through a dedicated application programming interface (API) that works with specialized networking hardware and software to deliver as many of local DMA's benefits as the underlying network technology allows.

NVIDIA GPUs support three such networking technologies, in order of decreasing speed and cost (fastest, most expensive first):

NVIDIA NVLink uses the highest-speed proprietary interfaces and switch technologies to speed data transfers between GPUs. It currently posts the highest performance of any interconnect technology on the standard MLPerf Training v3.0 benchmarks. A single NVIDIA H100 Tensor Core GPU supports up to 18 NVLink connections for up to 900 GBps of total bandwidth, about 7 times the bandwidth of PCIe 5.0.
InfiniBand is a high-speed networking standard, overseen by the InfiniBand Trade Association (IBTA), that is widely implemented on high-performance networks. Its highest measured data rates ran around 1.2 Tbps (~150 GBps) as of 2020.
Ethernet is a standard networking technology with many variants, including the seldom-used Terabit Ethernet (TbE, ~125 GBps) and the more common 400 GbE (~50 GBps). Its advantages are that it is more affordable, widely deployed, and a familiar technology in many data centers.
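Because these figures mix bits (Gbps, Tbps) and bytes (GBps), a quick conversion keeps the comparison straight. This sketch uses decimal units and the approximate rates quoted above (not measured throughput); the 900 GBps NVLink value is NVIDIA's published total per-GPU figure for H100, and the ratio against PCIe 5.0 x16 (128 GBps, both directions) is where the "about 7 times" comparison comes from.

```python
# Convert quoted link rates to bytes per second (decimal units: 1 GB = 1e9 bytes).
# Rates are the approximate figures quoted above, not measured throughput.

LINKS_GBITS = {
    "Terabit Ethernet (TbE)": 1000.0,
    "InfiniBand (~1.2 Tbps)": 1200.0,
    "400 GbE": 400.0,
}

def gbits_to_gbytes(gbits: float) -> float:
    """8 bits per byte; no encoding or protocol overhead is modeled."""
    return gbits / 8

NVLINK_H100_GBYTES = 900.0   # total NVLink bandwidth per H100 GPU
PCIE5_X16_GBYTES = 128.0     # PCIe 5.0 x16, both directions combined

for name, rate in LINKS_GBITS.items():
    print(f"{name}: {gbits_to_gbytes(rate):.0f} GB/s")
print(f"NVLink vs PCIe 5.0 x16: {NVLINK_H100_GBYTES / PCIE5_X16_GBYTES:.1f}x")
```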

Putting NVIDIA GPUs to Work in Supermicro Servers

NVIDIA RDMA technologies can support GPU-based data access over all three of the preceding networking technologies. Each offers a different price-performance tradeoff, with higher cost buying greater speed and lower latency, so organizations can choose the connection type that best fits their budgets and needs. As AI, ML, and other data- and compute-intensive applications run on such a server, they can exploit the tiered architecture of GPU storage, which offers the following tiers (in descending order of performance and ascending order of size and capacity):

1st tier: GPU memory is the fastest, most expensive, and smallest data store (e.g., the NVIDIA H100 NVL provides 188GB of HBM3 across its two-GPU pair)
2nd tier: local SSDs on the PCIe bus are next fastest, still expensive, and offer 10 to 100 times the capacity of a high-end GPU
3rd tier: remote storage servers on the LAN can support more than 1,000 times the capacity of the GPUs that access them
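One simple way to reason about these tiers is a greedy placement policy: keep data in the fastest tier that can hold it and spill to the next tier down when it is full. The sketch below is a hypothetical illustration; the `Tier` class, the `place` function, and the capacities are assumptions chosen to match the ratios listed above, and real systems also weigh access frequency, not just size.

```python
from dataclasses import dataclass

@dataclass
class Tier:
    name: str
    capacity_gb: float   # illustrative sizes only, not product specifications
    used_gb: float = 0.0

    def free_gb(self) -> float:
        return self.capacity_gb - self.used_gb

def place(dataset_gb: float, tiers: list) -> str:
    """Put a dataset in the fastest tier that still has room for it.

    `tiers` must be ordered fastest-first, mirroring the tier list above.
    """
    for tier in tiers:
        if dataset_gb <= tier.free_gb():
            tier.used_gb += dataset_gb
            return tier.name
    raise ValueError(f"no tier can hold {dataset_gb} GB")

# Hypothetical capacities matching the ratios above:
# SSDs ~10-100x GPU memory, remote storage >1,000x.
tiers = [
    Tier("GPU memory", 80),
    Tier("local NVMe SSDs", 8_000),
    Tier("remote storage", 100_000),
]
print(place(40, tiers))   # fits in GPU memory
print(place(60, tiers))   # only 40 GB of GPU memory is left, so this spills to SSD
```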

Because AI and ML applications need both low latency and high bandwidth, RDMA helps extend the local advantages of DMA to network resources (within the limits of the underlying connections). It enables speedy access to external data through memory-to-memory transfers between devices, with a GPU on one end and a storage device on the other. Over NVLink, InfiniBand, or a high-speed Ethernet variant, the network adapter transfers data from memory in a remote system directly into the memory of a local GPU. NVIDIA Magnum IO is an IO acceleration platform for the data center that supports parallel, intelligent IO to maximize storage, network, and multi-node, multi-GPU communication performance for the demanding applications that need it.
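To see why cutting out the intermediate copy matters, the following toy model compares a path that stages data in host memory before copying it to the GPU with a direct RDMA path into GPU memory. The function names and the 10 µs per-hop latency are illustrative assumptions; the bandwidths reuse figures from this article (400 GbE at about 50 GBps and roughly 63 GBps per direction for PCIe 5.0 x16 after encoding). Real IO stacks pipeline transfers in chunks, so this store-and-forward model overstates the gap; it only shows why removing a hop helps.

```python
# Simplified store-and-forward model: each hop costs a fixed setup latency
# plus size / bandwidth. Numbers are illustrative, not measurements.

def hop_time_s(size_bytes: float, bandwidth_gbytes: float, latency_us: float) -> float:
    return latency_us * 1e-6 + size_bytes / (bandwidth_gbytes * 1e9)

def staged_path_s(size_bytes: float, net_gbytes: float, pcie_gbytes: float,
                  latency_us: float) -> float:
    # remote memory -> host (CPU) memory -> GPU memory: two hops
    return (hop_time_s(size_bytes, net_gbytes, latency_us)
            + hop_time_s(size_bytes, pcie_gbytes, latency_us))

def direct_path_s(size_bytes: float, net_gbytes: float, latency_us: float) -> float:
    # remote memory -> GPU memory via RDMA: one hop
    return hop_time_s(size_bytes, net_gbytes, latency_us)

size = 1e9  # a 1 GB shard of training data
staged = staged_path_s(size, net_gbytes=50, pcie_gbytes=63, latency_us=10)
direct = direct_path_s(size, net_gbytes=50, latency_us=10)
print(f"staged: {staged * 1e3:.1f} ms, direct: {direct * 1e3:.1f} ms")
```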

In its GPU server systems, Supermicro uses NVIDIA GPUs and their supporting access methods: local DMA, RDMA via its API, and high-performance networking through multiple NICs and switches that support all three connection types. Supermicro GPU servers also include one or two special-purpose ASICs called Data Processing Units (DPUs) to support the accelerated IO that GPUs can deliver; these offload additional IO overhead from the server CPUs. Likewise, such servers can support up to eight network adapters per server, sustaining access to network bandwidth and maximizing transfers between PCIe 5.0 devices and RDMA devices. This helps avoid bottlenecks, even on the PCIe bus, maximizing throughput and minimizing latency.

The implications for performance are strongly positive: gains from using NVIDIA's accelerated IO range from 20% to 30% at the low end up to 2 times for intensive workloads. It's also essential to design applications to use storage efficiently. For example, such applications should be configured to take regular checkpoints. Otherwise, they must restart from the very beginning should a node drop out of the network or be blocked for some time. With checkpoints, progress reverts only to the most recent snapshot in the event of a node failure or other blocking event. (In fact, such capabilities may be available from local and network data protection tools and need not be built into the application itself.)
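Application-level checkpointing can be as simple as periodically persisting progress and resuming from the last snapshot on restart. The sketch below assumes a hypothetical `train` loop whose state fits in a small JSON file; real training jobs would checkpoint model and optimizer state with their framework's own tools, but the resume logic is the same.

```python
import json
import os
import tempfile
from typing import Optional

def save_checkpoint(path: str, state: dict) -> None:
    """Write state to a temp file, then rename it over the previous snapshot."""
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)   # atomic on POSIX when both paths share a filesystem

def load_checkpoint(path: str) -> dict:
    """Return the most recent snapshot, or a fresh state if none exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "total": 0}

def train(path: str, num_steps: int, checkpoint_every: int,
          fail_at: Optional[int] = None) -> dict:
    state = load_checkpoint(path)          # resume instead of restarting from step 0
    for step in range(state["step"], num_steps):
        if step == fail_at:
            raise RuntimeError(f"simulated node failure at step {step}")
        state["total"] += step             # stand-in for one unit of real work
        state["step"] = step + 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    return state

ckpt = os.path.join(tempfile.mkdtemp(), "ckpt.json")
try:
    train(ckpt, num_steps=100, checkpoint_every=10, fail_at=57)
except RuntimeError:
    pass
print(load_checkpoint(ckpt)["step"])       # 50: only steps 50-56 must be redone
print(train(ckpt, num_steps=100, checkpoint_every=10)["total"])   # 4950
```

After the simulated failure, the job resumes at step 50 rather than step 0, and the final result matches an uninterrupted run.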

Overall, the real advantage of using DPU- and GPU-based servers for AI, ML, and other high-demand workloads (e.g., 3D or finite element models, simulations, and so forth) is that they separate infrastructure components from application activities. This saves 20% to 30% of the CPU cycles currently devoted to infrastructure access and management, freeing up resources and speeding access by pushing IO functions into hardware.
