As chip power moves towards 1,000 watts, cooling becomes the number one problem

AnandTech reports that an increasingly clear trend in high-performance computing (HPC) is that power consumption per chip and per rack unit is not stopping at the limits of air cooling. As supercomputers and other high-performance systems have met, and in some cases exceeded, these limits, power requirements and power densities have only kept expanding. According to news from TSMC's recent annual technology symposium, we should expect this trend to continue as TSMC lays the groundwork for even denser chip configurations.

The problem at hand is not a new one: transistor power consumption is not shrinking nearly as fast as transistor size. And since chipmakers are not about to leave performance on the table (or fail to deliver the regular generational gains their customers expect), power per transistor is growing rapidly in the HPC space. Compounding matters, chiplets are paving the way toward building chips with more silicon than traditional reticle limits allow, which is great for performance and latency, but even more problematic for cooling.
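
To make the scaling problem concrete, the sketch below runs the classic back-of-envelope calculation with purely illustrative round numbers (the per-node density gain and per-transistor power saving are assumptions, not TSMC-quoted figures): when transistor density improves faster than per-transistor power falls, watts per square millimeter rise with every node.

```python
# Illustrative back-of-envelope: why power density climbs when per-transistor
# power shrinks more slowly than transistor area. Scaling factors are assumed
# round numbers, not vendor-quoted figures.
density_gain = 1.6   # assume ~1.6x more transistors per mm^2 at the next node
power_per_tr = 0.75  # assume only ~25% less power per transistor at iso-speed

power_density_change = density_gain * power_per_tr
print(f"Relative power density after one node: {power_density_change:.2f}x")
# -> ~1.20x: even with per-transistor savings, watts per mm^2 keep rising,
#    which is exactly why air cooling is running out of headroom.
```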

Supporting this growth in silicon and power are modern packaging technologies such as TSMC's CoWoS and InFO, which allow chipmakers to build integrated multi-chiplet system-in-packages (SiPs) containing up to twice as much silicon as TSMC's reticle limit would otherwise allow. By 2024, advances in TSMC's CoWoS packaging technology are expected to enable even larger multi-chiplet SiPs, with TSMC anticipating stitching together upwards of four reticle-sized chiplets. This would deliver the enormous complexity (potentially over 300 billion transistors per SiP) and performance that TSMC and its partners are aiming for, but naturally at the cost of formidable power consumption and heat.
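
As a rough sanity check on that figure, the sketch below multiplies an assumed ~858 mm² reticle limit by four chiplets and an assumed mixed-logic density on the order of 90 million transistors per mm² (an N5-class guess, not a TSMC number); the result lands right around the 300-billion-transistor mark.

```python
# Rough sanity check of the ">300 billion transistors per SiP" claim.
# All inputs are assumptions for illustration, not TSMC-published figures.
reticle_mm2 = 858        # assumed lithography reticle limit, mm^2
chiplets = 4             # SiP stitched from ~four reticle-sized chiplets
density_per_mm2 = 90e6   # assumed mixed-logic density, ~90 million transistors/mm^2

total = reticle_mm2 * chiplets * density_per_mm2
print(f"~{total / 1e9:.0f} billion transistors")  # ~309 billion
```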

Flagship products such as NVIDIA's H100 accelerator module already require upwards of 700 watts to reach peak performance, so the prospect of placing multiple GH100-sized chiplets on a single product is raising eyebrows, and power budgets along with them. TSMC expects multi-chiplet SiPs drawing around 1,000 watts or more to appear within a few years, creating serious cooling challenges.

At 700 W, the H100 already needs liquid cooling to deliver its peak performance; the story is much the same for Intel's chiplet-based Ponte Vecchio and AMD's Instinct MI250X. But even conventional liquid cooling has its limits. As chips climb toward 1 kW, TSMC envisions data centers needing immersion liquid cooling systems for such extreme AI and HPC processors. Immersion cooling, in turn, would require a redesign of the data center itself, a major change in design as well as a major continuity challenge.
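
A simple heat-balance estimate, using assumed numbers rather than anything from TSMC, helps show why the jump from air to liquid (and eventually immersion) is forced: carrying 1 kW away in air takes an enormous volume flow, while a trickle of water does the same job, and the harder problem of moving that heat out of a few hundred square millimeters of silicon is worse still.

```python
# Back-of-envelope heat balance: Q = m_dot * c_p * dT  ->  m_dot = Q / (c_p * dT)
# All figures are illustrative assumptions.
Q = 1000.0   # heat load per SiP, watts
dT = 15.0    # assumed allowable coolant temperature rise, kelvin

# Air: c_p ~1005 J/(kg*K), density ~1.2 kg/m^3
m_air = Q / (1005 * dT)                # kg/s of air needed
cfm = (m_air / 1.2) * 2118.9           # convert m^3/s to cubic feet per minute
print(f"Air: ~{cfm:.0f} CFM per package just to carry 1 kW away")

# Water: c_p ~4186 J/(kg*K), density ~1000 kg/m^3
m_water = Q / (4186 * dT)              # kg/s of water needed
lpm = (m_water / 1000) * 60000         # convert m^3/s to litres per minute
print(f"Water: ~{lpm:.1f} L/min carries the same load")
# Moving the coolant is the easy part; transferring 1 kW from a small die into
# that coolant is where air (and eventually cold plates) run out of margin.
```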

Short-term challenges aside, once data centers are set up for immersion liquid cooling, they will be ready for hotter chips. Immersion cooling has great potential for handling large cooling loads, which is one of the reasons Intel is investing heavily in the technology to make it more mainstream.

In addition to immersion liquid cooling, there is another technology that can be used to cool ultra-hot chips: on-chip water cooling. Last year, TSMC revealed it had experimented with on-chip water cooling and said the technology could cool a SiP dissipating as much as 2.6 kW. But of course, on-chip water cooling is itself an extremely expensive technology, one that will push the cost of these extreme AI and HPC solutions to unprecedented levels.
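
The appeal of bringing the liquid onto the package shows up in a quick thermal-resistance estimate; the temperature budget and air-cooler figure below are assumptions chosen only to illustrate the order of magnitude involved in dissipating 2.6 kW.

```python
# Illustrative thermal-resistance estimate: R_required = dT_budget / P
P = 2600.0        # watts dissipated by the experimental SiP
dT_budget = 50.0  # assumed allowable junction-to-coolant rise, kelvin

r_required = dT_budget / P
print(f"Required thermal resistance: ~{r_required:.3f} K/W")  # ~0.019 K/W

r_air = 0.2       # rough figure for a good air cooler (assumption)
print(f"That is ~{r_air / r_required:.0f}x better than a typical air cooler")
```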

Still, while the future is not set in stone, it does seem to have been forged in silicon. TSMC's chipmaking customers already have clients willing to pay a premium for these ultra-high-performance solutions (think operators of hyperscale cloud data centers), even at the cost of their high price and technical complexity. To bring things back to where we started, that is why TSMC developed its CoWoS and InFO packaging processes in the first place: there are customers ready and eager to break the reticle limit with chiplet technology. We have already seen a glimpse of this in products like Cerebras' massive wafer-scale engine processors, and with large multi-chiplet SiPs, TSMC is preparing to make smaller (but still reticle-breaking) designs accessible to a much wider customer base.

These extreme demands on performance, packaging and cooling are not only pushing the makers of semiconductors, servers and cooling systems to their limits, but will also require modifications to cloud data centers themselves. If large multi-chiplet SiPs for AI and HPC workloads do become common, cloud data centers will look very different in the coming years.
