
Supermicro and NVIDIA Deliver Optimized Systems for AI, ML, and More

Making the Most of Advanced Data Access and Transfer To Boost Productivity

Modern enterprises are gaining considerable competitive advantages from using advanced applications and data processing in their businesses and operations. These include AI-based large language models such as ChatGPT and Llama, machine learning analyses based on enormous sets of training and real-world data, complex 3D and finite element models and simulations, and other data- and compute-intensive applications.

All such workloads have at least this much in common: They benefit significantly from expedited access to storage across any kind of tiered model you might care to use. That’s one major reason why so many enterprises and service providers have turned to GPU-based servers to handle large, complicated datasets and the workloads that consume them. They’re much more capable of handling those workloads and can complete such tasks more quickly than conventional servers with more typical storage configurations (e.g., local RAM and NVMe SSDs, with additional storage tiers on the LAN or in the cloud).

The keys to boosting throughput are lower latency and higher storage bandwidth. Both translate directly into improved productivity and capability, primarily through clever IO and networking techniques that rely on direct and remote memory access, as explained next. Faster model training and job completion mean AI-powered applications can be deployed more quickly and deliver results sooner, speeding time to value.

Direct Memory Access and Remote Equivalents

Direct memory access (aka DMA) has been used to speed IO since the early days of computing. Basically, DMA involves memory-to-memory transfers across a bus (or another interface) from one device to another. It works by copying the contents of a range of memory addresses directly from the sender's memory to the receiver's memory (or in both directions for two-way transfers). This takes the CPU out of the process and speeds the transfer by reducing the number of copy operations involved: the CPU need not copy the sender's data into its own memory and then copy it from there to the receiver's memory.
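To make this concrete, here is a minimal sketch using the standard CUDA runtime: the host buffer is pinned (page-locked) so the GPU's DMA copy engine can read it directly, and cudaMemcpyAsync hands the transfer to that engine while the CPU moves on to other work. Buffer sizes are illustrative and error checking is omitted.

```
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    const size_t bytes = 256 << 20;            // 256 MiB payload (illustrative)
    float *host_buf = nullptr, *dev_buf = nullptr;

    // Pinned (page-locked) host memory lets the GPU's DMA engine address it
    // directly, with no CPU-driven copy to an intermediate staging buffer.
    cudaMallocHost(&host_buf, bytes);
    cudaMalloc(&dev_buf, bytes);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // The copy is queued on the stream and carried out by the DMA copy engine;
    // the call returns immediately and the CPU is free to do other work.
    cudaMemcpyAsync(dev_buf, host_buf, bytes, cudaMemcpyHostToDevice, stream);

    // ... kernels launched on the same stream would run after the copy ...

    cudaStreamSynchronize(stream);             // wait for the transfer to finish

    cudaFree(dev_buf);
    cudaFreeHost(host_buf);
    cudaStreamDestroy(stream);
    return 0;
}
```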

Indeed, DMA performance on a single system is limited only by the speed of the bus (or other interface) that links the sending and receiving devices involved in a data transfer. For PCIe 4.0, that's 16 gigatransfers/second (GT/s) per lane, with double that rate for PCIe 5.0 (32 GT/s). Actual data rates are somewhat lower because of encoding and packaging overheads, but the rated bandwidth of an x16 link works out to roughly 64 GB/s for PCIe 4.0 and 128 GB/s for PCIe 5.0 when both directions are counted. That's fast!
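Those figures can be reproduced from the raw transfer rate and the 128b/130b line encoding that PCIe 3.0 and later use. The short calculation below (plain host code, theoretical maxima only) arrives at roughly 31.5 GB/s and 63 GB/s per direction for an x16 link, i.e., about 64 GB/s and 128 GB/s with both directions combined.

```
#include <cstdio>

// Theoretical PCIe x16 bandwidth from the raw transfer rate and the
// 128b/130b line encoding used by PCIe 3.0 and later generations.
double pcie_x16_gbytes_per_sec(double gtransfers_per_sec) {
    const double encoding = 128.0 / 130.0;   // usable bits per raw bit
    const int lanes = 16;
    return gtransfers_per_sec * encoding * lanes / 8.0;  // bits -> bytes
}

int main() {
    printf("PCIe 4.0 x16: ~%.1f GB/s per direction\n", pcie_x16_gbytes_per_sec(16.0));
    printf("PCIe 5.0 x16: ~%.1f GB/s per direction\n", pcie_x16_gbytes_per_sec(32.0));
    return 0;
}
```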

Remote DMA (aka RDMA) extends the capability of DMA within a single computer to work between a pair of devices across a network connection. RDMA is typically exposed through a dedicated application programming interface (API) that works with specialized networking hardware and software to deliver as many of the benefits of local DMA as the underlying network technology allows.
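As a heavily simplified illustration, the fragment below sketches what memory registration and an RDMA write look like with the widely used libibverbs API. Here the buffer is allocated in GPU memory so the NIC can move it directly via GPUDirect RDMA (which assumes the nvidia-peermem kernel module is loaded). The function name, the already-connected queue pair, and the remote address/key parameters are placeholders; queue-pair setup, connection exchange, completion polling, and error handling are omitted.

```
#include <infiniband/verbs.h>   // libibverbs
#include <cuda_runtime.h>
#include <cstdint>

// Sketch only: queue-pair creation and connection setup are omitted; the
// remote address and rkey would be exchanged with the peer out of band.
void rdma_write_sketch(struct ibv_qp *qp, uint64_t remote_addr, uint32_t remote_rkey) {
    const size_t len = 64 << 20;
    void *gpu_buf = nullptr;
    cudaMalloc(&gpu_buf, len);                 // the data lives in GPU memory

    // Open the first RDMA-capable device and register the GPU buffer with it.
    struct ibv_device **devs = ibv_get_device_list(nullptr);
    struct ibv_context *ctx = ibv_open_device(devs[0]);
    struct ibv_pd *pd = ibv_alloc_pd(ctx);
    struct ibv_mr *mr = ibv_reg_mr(pd, gpu_buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ |
                                   IBV_ACCESS_REMOTE_WRITE);

    // Post an RDMA WRITE: the NIC pulls data straight from GPU memory and
    // places it at remote_addr on the peer -- no CPU copy on either side.
    struct ibv_sge sge = {};
    sge.addr   = (uintptr_t)gpu_buf;
    sge.length = (uint32_t)len;
    sge.lkey   = mr->lkey;

    struct ibv_send_wr wr = {}, *bad_wr = nullptr;
    wr.opcode              = IBV_WR_RDMA_WRITE;
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.send_flags          = IBV_SEND_SIGNALED;
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = remote_rkey;
    ibv_post_send(qp, &wr, &bad_wr);
    // ... poll the completion queue, then deregister and clean up ...
}
```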

NVIDIA GPUs support three such networking technologies, in order by decreasing speed and cost (fastest, most expensive first):

  • NVIDIA NVLink uses the highest-speed proprietary interfaces and switch technologies to speed data transfers between GPUs on a high-speed network; a peer-to-peer transfer sketch follows this list. It currently clocks the highest performance on standard MLPerf Training v3.0 benchmarks for any technology. A single NVIDIA H100 Tensor Core GPU supports up to 18 NVLink connections for up to 900 GB/s of bandwidth (about 7 times the effective speed of PCIe 5.0).
  • InfiniBand is a high-speed networking standard overseen by the InfiniBand Trade Association (IBTA) and widely implemented on high-performance networks. Its highest data rates run around 1.2 Tbps (~150 GB/s) as of 2020.
  • Ethernet is a standard networking technology with many variants, including seldom-used Terabit Ethernet (~125 GB/s) and the more common 400 GbE (~50 GB/s). It has the advantage of being a more affordable, widely deployed, and familiar technology in many data centers.
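For the NVLink case mentioned above, applications reach GPU-to-GPU transfers through the CUDA runtime's peer-to-peer calls. The sketch below assumes a system with at least two GPUs that share a direct peer path (NVLink where present, otherwise the PCIe fabric); sizes are illustrative and error checking is omitted.

```
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    int can_access = 0;
    cudaDeviceCanAccessPeer(&can_access, 0, 1);   // can GPU 0 reach GPU 1 directly?
    if (!can_access) { printf("No peer path between GPU 0 and GPU 1\n"); return 1; }

    const size_t bytes = 1 << 30;                 // 1 GiB (illustrative)
    float *buf0 = nullptr, *buf1 = nullptr;

    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);             // allow direct GPU 0 <-> GPU 1 traffic
    cudaMalloc(&buf0, bytes);

    cudaSetDevice(1);
    cudaMalloc(&buf1, bytes);

    // The copy travels over NVLink when the GPUs are linked by it,
    // otherwise over the PCIe fabric.
    cudaMemcpyPeer(buf1, 1, buf0, 0, bytes);

    cudaFree(buf1);
    cudaSetDevice(0);
    cudaFree(buf0);
    return 0;
}
```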

Putting NVIDIA GPUs to Work in Supermicro Servers

NVIDIA RDMA technologies can support GPU-based data access across all three of the preceding networking technologies. Each offers a different price-performance tradeoff, where more cost translates into greater speed and lower latency. Organizations can choose the underlying connection type that best fits their budgets and needs, understanding that each option represents a specific combination of price and performance upon which they can rely. As AI-, ML-based, and other data- and compute-intensive applications run on such a server, they can exploit the tiered architecture of GPU storage, where the following tiers are available (in descending order of performance, ascending order of size and capacity):

  • 1st tier: GPU memory is the fastest, most expensive, and smallest data store (e.g., the NVIDIA H100 NVL provides 188GB of HBM3 memory)
  • 2nd tier: local SSDs on the PCIe bus are next fastest, still expensive, and offer from 10 to 100 times the capacity of a high-end GPU (the conventional read path for this tier is sketched after the list)
  • 3rd tier: remote storage servers on the LAN can support more than 1,000 times the capacity of the GPUs that access them
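To make the second tier concrete, the conventional path from a local NVMe file into GPU memory stages each chunk through a pinned host "bounce" buffer before a DMA copy moves it to the device (the extra hop that GPUDirect Storage, discussed below, removes). The file path is hypothetical and error handling is minimal.

```
#include <cuda_runtime.h>
#include <cstdio>

// Conventional tier-2 read path: NVMe file -> pinned host buffer -> GPU memory.
int main() {
    const size_t chunk = 64 << 20;             // 64 MiB per read (illustrative)
    void *staging = nullptr, *dev_buf = nullptr;
    cudaMallocHost(&staging, chunk);           // pinned bounce buffer
    cudaMalloc(&dev_buf, chunk);

    FILE *f = fopen("/nvme/dataset.bin", "rb");   // hypothetical local NVMe path
    if (!f) return 1;

    size_t n;
    while ((n = fread(staging, 1, chunk, f)) > 0) {
        // DMA copy of each chunk from the pinned buffer into GPU memory.
        cudaMemcpy(dev_buf, staging, n, cudaMemcpyHostToDevice);
        // ... launch a kernel that consumes this chunk ...
    }

    fclose(f);
    cudaFree(dev_buf);
    cudaFreeHost(staging);
    return 0;
}
```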

Because AI and ML applications need both low latency and high bandwidth, RDMA helps extend the local advantages of DMA to network resources (subject to the underlying connections involved). This enables speedy access to external data via memory-to-memory transfers across devices (a GPU on one end, a storage device on the other). Working over NVLink, InfiniBand, or a high-speed Ethernet variant, the remote adapter transfers data from memory in a remote system to memory on a local GPU. NVIDIA Magnum IO provides an IO acceleration platform that delivers parallel, intelligent data center IO, maximizing storage, network, and multi-node, multi-GPU communications for the demanding applications that need them.
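One Magnum IO component, GPUDirect Storage, moves data from NVMe or NVMe-oF storage straight into GPU memory with no bounce buffer in host RAM. The fragment below is a minimal sketch of its cuFile API; the file path is hypothetical, alignment details and error handling are omitted, and a GPUDirect Storage-capable driver stack is assumed.

```
#include <cufile.h>             // GPUDirect Storage (cuFile) API
#include <cuda_runtime.h>
#include <fcntl.h>
#include <unistd.h>

int main() {
    const size_t bytes = 256 << 20;
    void *dev_buf = nullptr;
    cudaMalloc(&dev_buf, bytes);

    cuFileDriverOpen();                                  // initialize the cuFile driver

    // O_DIRECT keeps the page cache out of the way so the direct DMA path is used.
    int fd = open("/nvme/training-shard.bin", O_RDONLY | O_DIRECT);  // hypothetical path

    CUfileDescr_t descr = {};
    descr.handle.fd = fd;
    descr.type = CU_FILE_HANDLE_TYPE_OPAQUE_FD;
    CUfileHandle_t handle;
    cuFileHandleRegister(&handle, &descr);

    // Read straight from storage into GPU memory -- no staging buffer in host RAM.
    ssize_t got = cuFileRead(handle, dev_buf, bytes, /*file_offset=*/0, /*dev_offset=*/0);
    (void)got;

    cuFileHandleDeregister(handle);
    close(fd);
    cuFileDriverClose();
    cudaFree(dev_buf);
    return 0;
}
```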

In its GPU server systems, Supermicro uses NVIDIA GPUs and their supporting access methods. These include local DMA and RDMA via its API, plus high-performance networking via multiple NICs and switches that support all three connection types. In addition, Supermicro GPU servers include one or two special-purpose ASICs called Data Processing Units (DPUs) to support the accelerated IO that GPUs can deliver; these offload additional IO overhead from the server CPUs. Likewise, such servers can support up to eight network adapters per server to enable sustained and extended access to network bandwidth, maximizing transfers between PCIe 5.0 devices and RDMA devices. This ensures there are no bottlenecks, even on the PCIe bus, and helps maximize throughput and minimize latency.

The implications for performance are strongly positive. Performance gains from NVIDIA's accelerated IO range from as little as 20% to 30% up to 2x for intensive workloads. It's also essential to design applications to take advantage of this storage architecture and prevent inefficiencies. Thus, such applications should be configured to make regular checkpoints; otherwise, they must restart from the beginning should a node fall out of the network or be blocked for some time. With checkpoints, progress only reverts to the most recent snapshot in the event of a node failure or other blocking event (such capabilities may be available from local and network data protection tools and may not need to be built into the application itself).
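A checkpoint can be as simple as periodically copying model state out of GPU memory and writing it to durable storage so a restarted job resumes from the last snapshot rather than iteration zero. The sketch below illustrates that pattern; the function name, interval, sizes, and checkpoint path are all illustrative, and production frameworks typically provide this for you.

```
#include <cuda_runtime.h>
#include <cstdio>

// Illustrative checkpoint loop: every N steps, snapshot model parameters
// from GPU memory to a file so a failed job resumes from the last snapshot.
void train_with_checkpoints(float *dev_params, size_t count, int total_steps) {
    const int checkpoint_every = 1000;                  // illustrative interval
    float *host_copy = nullptr;
    cudaMallocHost(&host_copy, count * sizeof(float));  // pinned for fast DMA readback

    for (int step = 1; step <= total_steps; ++step) {
        // ... run one training step on the GPU ...

        if (step % checkpoint_every == 0) {
            cudaMemcpy(host_copy, dev_params, count * sizeof(float),
                       cudaMemcpyDeviceToHost);
            FILE *f = fopen("/checkpoints/model-latest.bin", "wb");  // hypothetical path
            if (f) {
                fwrite(host_copy, sizeof(float), count, f);
                fclose(f);
            }
        }
    }
    cudaFreeHost(host_copy);
}
```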

Overall, the real advantage of using DPU- and GPU-based servers for AI, ML, and other high-demand workloads (e.g., 3D or finite element models, simulations, and so forth) is that they separate infrastructure functions from application activities. Pushing IO functions into hardware saves the 20% to 30% of CPU cycles otherwise devoted to infrastructure access and management, freeing up resources and speeding access.