Aligning AI High Performance Computing infrastructure with the ANSI/TIA-942 Ratings resilience model

Author: Barry Elliott, Director, Capitoline Ltd

 

The last few years have seen the rise of the Artificial Intelligence High Performance Computing model, or AI HPC, typified by Graphics Processing Units (GPUs) and exemplified by the NVIDIA GB200/300 NVL72 computers and their supporting infrastructure.

Some industry observers have commented that this is like going from ‘servers in racks’ to ‘data centers in cabinets’.

Although we have seen an uptake in immersion-cooling techniques over the last few years, the vast majority of data centers built over the last twenty years have been based on air-cooled racks with a power/cooling capacity of up to 15 kW per rack. We can describe the infrastructure support for these installations in terms of the ANSI/TIA-942 resilience model, based on four availability Ratings. We can summarize them as follows, with a short illustrative sketch after the list:

  • Rating 1: Enough components to do the job but no more; very little redundancy, if any.
  • Rating 2: Some redundancy, e.g. N+1 models for UPS and generators, but still many single points of failure.
  • Rating 3: Enough components to make the system concurrently maintainable, i.e. it is possible to repair or replace any single item in the power, cooling or cabling infrastructure and still maintain full operation of the Information Technology system.
  • Rating 4: Enough redundant components and systems that the support infrastructure is not only concurrently maintainable but also automatically fault tolerant, with no human intervention required to switch in a redundant system.
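
As a rough illustration only, the four Ratings can be read as a progression of redundancy expectations. The short Python sketch below captures that progression in a simple data structure; the field names and redundancy labels are informal shorthand for this article, not terms defined by the standard.

```python
from dataclasses import dataclass

@dataclass
class Rating:
    """Illustrative summary of an ANSI/TIA-942 availability Rating.

    Field names are informal shorthand, not wording from the standard.
    """
    number: int
    redundancy: str                  # e.g. "N", "N+1", "2N"
    concurrently_maintainable: bool  # can any single item be serviced without IT downtime?
    fault_tolerant: bool             # does failover happen with no human intervention?

RATINGS = [
    Rating(1, "N (little or no redundancy)", False, False),
    Rating(2, "N+1 on selected systems",     False, False),
    Rating(3, "N+1 throughout",              True,  False),
    Rating(4, "2N or 2(N+1)",                True,  True),
]

for r in RATINGS:
    print(f"Rating {r.number}: {r.redundancy}; "
          f"concurrently maintainable: {r.concurrently_maintainable}; "
          f"fault tolerant: {r.fault_tolerant}")
```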

Although ANSI/TIA-942 covers building management systems (BMS), fire detection and control, physical security and architectural issues, the main focus is on power, cooling and cabling. To achieve the much-desired ‘professional’ grade of data center, i.e. Rating 3 or 4, we have seen these models realized by redundant cooling systems, dual power supplies to racks and IT equipment, and multiple cabling routes between racks and to external sources.

When confronted with dense stacks of GPUs in ‘superpod’ racks, things start to look a little different.

Power

Most data center equipment racks consume between 5 and 15 kW of power, but GPU clusters now require up to 132 kW per rack, with talk of going up to 1 MW per rack.

In the GPU world the ‘compute units’ are powered at 48 V dc. This is provided by a three-phase UPS sending power to an ac distribution bus within the rack; connected to this ac bus are ac-to-dc power supply units, which in turn feed a dc distribution bus. The compute units connect into this dc distribution bus to receive their power. There is no redundancy in this basic system, although there is no reason why the ac-dc power supplies could not be arranged in an N+1 format. Some versions also have battery packs to act as a short-term in-rack backup power supply.
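
To give a feel for what an N+1 arrangement of the ac-dc power shelves could look like, here is a minimal sizing sketch. The 132 kW rack load and the 33 kW per-shelf capacity are assumptions for illustration only, not vendor specifications.

```python
import math

def shelves_required(rack_load_kw: float, shelf_capacity_kw: float,
                     redundancy: int = 0) -> int:
    """Number of ac-dc power shelves needed to carry the rack load.

    redundancy = 0 gives a plain N configuration, redundancy = 1 gives N+1, and so on.
    """
    n = math.ceil(rack_load_kw / shelf_capacity_kw)
    return n + redundancy

# Assumed figures, for illustration only.
load_kw = 132.0
shelf_kw = 33.0

print("N   :", shelves_required(load_kw, shelf_kw))      # 4 shelves, no spare
print("N+1 :", shelves_required(load_kw, shelf_kw, 1))   # 5 shelves, one can fail or be serviced
```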

To go beyond 132 kW per rack NVIDIA is proposing to move to an 800 V dc distribution system. This will reduce the number of ac-dc interfaces, but mainly the higher voltage will allow a much lower current and consequently much lower I²R losses. To deliver 1 MW with the current technology would require huge copper conductors.
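
A rough back-of-envelope calculation shows why the higher voltage matters. The sketch below ignores conversion losses and conductor details; it simply compares feed currents, and hence relative resistive losses, at the two voltages for the 1 MW figure mentioned above.

```python
def feed_current_amps(power_w: float, voltage_v: float) -> float:
    """DC feed current for a given load, ignoring conversion losses."""
    return power_w / voltage_v

POWER_W = 1_000_000   # the 1 MW per rack figure discussed above

i_48 = feed_current_amps(POWER_W, 48)
i_800 = feed_current_amps(POWER_W, 800)

print(f"48 V feed current : {i_48:,.0f} A")    # just over 20,000 A
print(f"800 V feed current: {i_800:,.0f} A")   # 1,250 A

# For the same conductors, I^2R losses scale with the square of the current.
print(f"Relative I^2R loss, 48 V vs 800 V: about {(i_48 / i_800) ** 2:,.0f}x")
```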

On a wider scale there is no reason why the pod cannot be supplied with multiple 800 V feeds, or three-phase feeds from multiple UPS sets, but within the racks it is almost as if we have to treat the pod as one single computer. There will also be considerable cost in duplicating such high-power infrastructure.

Cooling

Cooling at 132 kW, let alone 1 MW, goes way beyond air cooling, and even full immersion cooling might not be able to handle it. The chipmakers have taken it upon themselves to design in water cooling delivered to the chip itself, in what they call Direct Liquid Cooling (DLC).

When we look at the compute units, with their multiple water-cooled chips, we can see there is only one water feed pipe going in and only one coming out: no redundancy is possible. The cooling pipes connect into a manifold at the back of the rack which serves all the compute units. There is only one manifold, with only one connection going in and one coming out: again, no redundancy is possible.

Each rack manifold then connects into a pod manifold serving all the racks. This in turn connects to a Cooling Distribution Unit (CDU), which is the interface between the IT cooling system, called the Technology Cooling System (TCS), and the building cooling system, called the Facilities Water System (FWS), which in turn connects to conventional cooling towers, chillers, dry coolers, etc. There is a lot going on in the CDU in terms of temperature, flow rates, pressure and water quality, but we needn't go into the details here.
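
To give a sense of the quantities the TCS has to handle, the required coolant flow follows from the basic heat balance Q = ṁ·cp·ΔT. The sketch below assumes a 10 °C temperature rise across the rack purely for illustration; real TCS design temperatures vary.

```python
CP_WATER = 4186.0     # specific heat of water, J/(kg*K)
RHO_WATER = 1000.0    # density of water, kg/m^3

def flow_litres_per_min(heat_load_w: float, delta_t_k: float) -> float:
    """Water flow (L/min) needed to remove heat_load_w with a delta_t_k temperature rise."""
    mass_flow_kg_s = heat_load_w / (CP_WATER * delta_t_k)
    return mass_flow_kg_s / RHO_WATER * 1000.0 * 60.0

# Assumed 10 K rise across the rack, for illustration only.
for load_kw in (132, 1000):
    print(f"{load_kw} kW rack: ~{flow_litres_per_min(load_kw * 1000, 10):.0f} L/min")
```

At a 10 K rise this works out to roughly 190 L/min for a 132 kW rack and over 1,400 L/min at 1 MW, which gives some idea of the pipework, pumps and CDU capacity involved.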

At the level of the CDU, though, we can ask the question: why not two CDUs, or N+1 CDUs, which in turn would connect to a traditional N+1 or 2N external cooling infrastructure? Once again there would be questions about the huge cost of replicating such high-capacity equipment.

The cooling water going around the chips is forced through very small channels, so water purity is essential, and one manufacturer talks of a filter change every 1,000 hours, or roughly every 42 days. In other words, planned downtime, which is not in the Rating 3 or 4 philosophy.
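
As a rough illustration of why scheduled filter changes sit uneasily with the Rating 3 and 4 philosophy, the sketch below estimates the planned downtime they imply. The one-hour change-out duration is purely an assumed figure for the example.

```python
HOURS_PER_YEAR = 8760

def planned_downtime_hours(interval_h: float, outage_h: float) -> float:
    """Planned downtime per year if cooling must stop for outage_h every interval_h."""
    changes_per_year = HOURS_PER_YEAR / interval_h
    return changes_per_year * outage_h

# Assumptions for illustration: a filter change every 1,000 hours, 1 hour of lost cooling each time.
downtime = planned_downtime_hours(interval_h=1000, outage_h=1.0)
print(f"~{HOURS_PER_YEAR / 1000:.1f} filter changes per year")
print(f"~{downtime:.1f} hours of planned downtime per year "
      f"({downtime / HOURS_PER_YEAR:.2%} planned unavailability)")
```

Unless the filters can be isolated and swapped without interrupting flow, that planned outage is at odds with concurrent maintainability.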

Cabling

The traditional ANSI/TIA-942 cabling model proposes a structured cabling system overlaid onto the IT racks, with patch panels in every rack so that any piece of equipment can be physically connected to any other. Rating 3 and 4 models provide redundant routing for cables, culminating in at least two access providers entering the building at least 20 meters apart. Given the background of the TIA-942 standard, this model is very well developed.

Unfortunately, the model breaks down in the GPU supercluster world. Here we enter a world of front-end networks and scale-out and scale-up networks. These rely on very short (3 to 7 m) direct-attach cables, based on Ultra Ethernet and InfiniBand technology, interlinking many tens or hundreds of compute, storage and networking devices. One twelve-rack model contains 131,000 cable links! There is simply no room for redundancy or even conventional patch panels in such a set-up. Redundancy is more easily defined at the logical level, where networking equipment reroutes information around faulty nodes.
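
To make the 'redundancy at the logical level' idea concrete, here is a toy sketch of a fabric rerouting around a failed switch. The topology and node names are invented for the example and bear no relation to any real scale-up or scale-out design.

```python
from collections import deque

# Toy fabric: node -> directly cabled neighbours (invented topology).
FABRIC = {
    "gpu-a": ["sw-1", "sw-2"],
    "gpu-b": ["sw-1", "sw-2"],
    "sw-1":  ["gpu-a", "gpu-b", "spine"],
    "sw-2":  ["gpu-a", "gpu-b", "spine"],
    "spine": ["sw-1", "sw-2"],
}

def route(src, dst, failed):
    """Breadth-first search for a path that avoids nodes in the 'failed' set."""
    queue = deque([[src]])
    seen = {src}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == dst:
            return path
        for nxt in FABRIC[node]:
            if nxt not in seen and nxt not in failed:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

print(route("gpu-a", "gpu-b", failed=set()))      # e.g. ['gpu-a', 'sw-1', 'gpu-b']
print(route("gpu-a", "gpu-b", failed={"sw-1"}))   # rerouted: ['gpu-a', 'sw-2', 'gpu-b']
```

The physical cables themselves are not duplicated; it is the fabric's ability to find another path that provides the resilience.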

The AI HPC superpod still needs to communicate with other pods and the outside world, and this is handled by what is defined as the front-end network. This is the point where more conventional data center structured cabling models can come in, with multiple telecommunications rooms, entrance rooms, Main Distribution Areas, etc.

In the following diagrams we present some ideas as to what AI HPC looks like from an infrastructure point of view.

Figure 1: A generic view of a data center with AI HPC superpods connected to redundant power, cooling and communications cabling

 

Figure 2: An example of a cooling system layout. It could be Rating 3, up to the point of the equipment racks, if there were sufficient isolation valves at every pipe junction to allow isolation of a component and still allow water flow to the IT racks although one or two racks could lose cooling under maintenance.

 

Figure 3: By duplication of all the major components, and with every component having the capacity to take over the full load, a Rating 4 system could be envisaged up to the point of the IT rack distribution manifold. Although concurrent maintainability is possible, it is Rating 3 from that point, with the same issues of losing IT racks under maintenance as seen in Figure 2.

 

Conclusion

A sector of the data center market is moving towards AI HPC compute models with power, cooling and cabling requirements that are orders of magnitude greater than those of existing and previous installations. The implementations of these models have often not considered redundancy and resiliency as covered by ANSI/TIA-942 and its four rated levels of availability. This is at least partly because the standard does not currently address architectures tailored for this environment.

We propose that the racks that contain these compute devices be viewed as one single computer with its own cabling distribution architecture, internal power distribution and single liquid-based cooling provision.

ANSI/TIA-942 Rating requirements can still come into effect in ensuring dual telecommunications cabling into these pods, dual power supplies, and redundancy models for the facility water supply, so that the overall facility/building infrastructure aligns with the ANSI/TIA-942 model.

We will work with TIA's TR-42 committee, which oversees and revises ANSI/TIA-942, to propose solutions and architectures that offer redundancy and resiliency for AI HPC compute environments, so that data center operators and their users can provide their chosen level of availability.

Resources

For more information on TIA recognized training please see www.capitolinetraining.com

For more information on ANSI/TIA-942 design and facility certification please see www.capitoline.org

One can obtain a copy of the standard by attending one of Capitoline's courses that focus on TIA-942 (CTM, TDCD or CLAT). Visit https://www.capitolinetraining.com/product/ctm-tia942-masterclass-course-subscription/

For more information on the ANSI/TIA-942 standard please see https://tiaonline.org/products-and-services/tia942certification/

For more information on how to get involved at TIA’s TR-42 committee please see https://tiaonline.org/committee/tr-42-i-telecommunications-cabling-systems/

This blog was developed by a member of the TIA Data Center Program Workgroup. This workgroup includes participants from all aspects of the data center ecosystem. To participate in TIA's Data Center Program, contact membership@tiaonline.org. The ideas and views expressed in this guest blog article are those of the author and not necessarily those of TIA or its member companies.