The Case for On-Prem AI Data Centers

AI has become, and will continue to be, a dominant technology for enterprises worldwide. Its potential to change business practices and improve decisions across a wide range of industries has led to unprecedented demand for servers that can run the training or inference phase of AI workloads. The infrastructure needed for the training phase can be costly, but a high-end system (multiple CPUs and GPUs) may not always be the best choice. By running AI training within their own data center, organizations can reduce costs while becoming more productive and flexible.

Graphic showing racks of Supermicro 4U 10-GPU systems

Cloud Benefits and Drawbacks

Many organizations are moving their workloads to public cloud infrastructure, which, by definition, is shared by many clients. While a public cloud can scale to a very large size, very few training jobs require thousands of GPUs working concurrently. One benefit of a public, shared cloud is that a large number of high-end (read: expensive) servers may be available. Conversely, those high-end servers may not be available when they are needed. In addition, the costs of data ingress and egress for large training jobs can be significant, especially if the training data must be imported from another public, shared cloud provider.

On-Prem for AI Training

Several reasons exist to consider and implement AI within an on-prem data center.

Cost – While the upfront cost of acquiring servers with GPUs may be high, the longer-term cost can be lower than using a public, shared cloud. Cloud fees can be relatively high over time, especially for data movement. In addition, a high-end GPU server is expensive whether or not all of its CPUs and GPUs are used 100% of the available time, which is unlikely.
Performance – There is a range of CPU and GPU combinations available, both in quantity and in performance. Once the enterprise's AI requirements are understood, choosing the right number (1, 2, 4, or 8) and performance level of the CPUs is essential. The latest generation of CPUs ranges from 16 to 128 cores, with base clock rates approaching 4 GHz. A range of GPUs exists, from older generations to the latest releases, with up to thousands of cores. Multiple configurations, each optimized for a project's CPU and GPU requirements, can be implemented in one data center.
Retraining – While there are various methods to estimate the cost of training a model of a particular size on a given number of GPUs, many models need to be continuously retrained with new parameters. To maintain inference accuracy, the model must be retrained with updated and more recent data, which can take as long as the original training, depending on the amount of new data. In an on-prem data center, the systems can be used repeatedly, whereas in the public cloud, expenses can pile up with each iteration and retraining of the model.
Software – There are many software choices to consider when creating an efficient and effective AI training solution. A public, shared cloud provider may not have all the available components, which may require additional setup and testing for each instance acquired in a public cloud infrastructure.
Data Location and Sovereignty – For many industries and geographies, there may be restrictions and requirements on where the data used for AI training must reside. An on-prem data center allows organizations to adhere to these regulations, whereas using a remote, public cloud data center may not be permitted.
Security – For many organizations, the security of both data and results is critical. In an on-prem data center, security teams can implement more stringent security policies regarding access to the systems or storage devices. When creating and using AI that needs access to internal processes and data, implementing AI in an on-prem data center is an obvious choice.
Compliance – When the data is subject to various regulations, creating a conformant on-prem data center may be ideal, compared to identifying a public cloud that adheres to these regulations.
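The cost and retraining points above can be made concrete with a rough break-even estimate. The following is a minimal sketch, not a pricing model: every rate, purchase price, and workload figure is a hypothetical placeholder, and the class and function names are invented for illustration. It assumes cloud spend recurs with every training or retraining run (GPU hours plus data transfer), while on-prem spend is a one-time purchase plus a flat monthly operating cost.

```python
from dataclasses import dataclass


@dataclass
class CloudPlan:
    gpu_hour_rate: float   # price per GPU-hour (hypothetical)
    transfer_per_gb: float  # data transfer fee per GB (hypothetical)


@dataclass
class OnPremPlan:
    purchase_cost: float   # one-time server acquisition cost (hypothetical)
    monthly_opex: float    # power, cooling, and staff share per month (hypothetical)


def cloud_cost(plan, months, gpus, hours_per_run, runs_per_month, gb_moved_per_run):
    """Cumulative cloud spend: each training or retraining run is billed again."""
    per_run = (gpus * hours_per_run * plan.gpu_hour_rate
               + gb_moved_per_run * plan.transfer_per_gb)
    return months * runs_per_month * per_run


def on_prem_cost(plan, months):
    """Cumulative on-prem spend: purchase once, then pay operating costs monthly."""
    return plan.purchase_cost + months * plan.monthly_opex


def break_even_month(cloud, on_prem, horizon_months=60, **workload):
    """First month in which on-prem is no more expensive than cloud, else None."""
    for month in range(1, horizon_months + 1):
        if on_prem_cost(on_prem, month) <= cloud_cost(cloud, month, **workload):
            return month
    return None


# Placeholder figures only; substitute real quotes and measured workloads.
cloud = CloudPlan(gpu_hour_rate=3.0, transfer_per_gb=0.09)
on_prem = OnPremPlan(purchase_cost=300_000, monthly_opex=4_000)
workload = dict(gpus=8, hours_per_run=100, runs_per_month=4, gb_moved_per_run=10_000)

print(break_even_month(cloud, on_prem, **workload))
```

With these placeholder numbers the on-prem system breaks even in month 33. If the same workload is retrained only once per month, the function returns None within the 60-month horizon, which reflects the point that the answer depends heavily on how often the model is retrained and how well the hardware is utilized.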

Trio of Supermicro AI GPU systems: 8U system, 4U system, 5U System

Summary

Implementing an effective and efficient on-prem AI-focused data center requires understanding the performance requirements for the workloads that best suit the enterprise. An on-prem data center, when properly designed, can decrease the time to get results for AI training and can deliver low latency inference results and decisions tuned to the type of model. An on-prem data center can be uniquely configured at a low cost to respond to the needs of the enterprise. Understanding workloads, the amount of data, the fine tuning of the AI workflow, and in-house expertise with various software layers will help determine the best option for the organization.
