A big-picture view of constructing a supercomputer
It’s been a busy summer for the engineers at the Ohio Supercomputer Center (OSC).
Busy, but exciting, as our newest system, the Dell/Intel-Xeon Owens Cluster, is being installed. When it’s completely up and running later this year, we’ll have deployed the most powerful supercomputer system in the 29-year history of the OSC.
It will increase the center’s total computing capacity by a factor of four and its storage capacity by a factor of three. The system will draw enough power, for computing and cooling combined, to supply about 500 homes.
So, there’s plenty for those of us at OSC and the center’s clients to be excited about.
Of course, before Owens can be off and running like its namesake (Olympic and Ohio State University sprinter Jesse Owens), there’s the significant undertaking of installing this sparkling new system. And to put it mildly, it’s not as simple as opening a laptop box and plugging it in.
What all does it take to set up the Owens Cluster? Consider this comment from OSC’s Doug Johnson: “The way we talk about it as a whole is we’re kind of building a new supercomputer center inside the supercomputer center.”
That’s why we recently sat down with Johnson, OSC’s chief systems architect and HPC systems group manager, to get a bird’s-eye view of this massive installation project.
It turns out there are five major steps to installing the supercomputer system.
1) Requirements
“First of all,” Johnson said, “it’s a very large purchase, and coming up with the requirements we have in being able to support OSC’s user community is a difficult process.”
There’s no simple list of the types of usage OSC supports; in fact, a wide range of research and applications will use the computing, storage and networking. Evaluating everything that goes into that is a long process: coming up with the goals for the system, aligning those goals with the available budget, estimating performance targets, and weighing the administrative demands of such complex infrastructure, among other considerations.
2) Facilities
A system this large requires substantial facilities to house it. Consider these numbers:
- 828 interconnected servers will make up the cluster
- There will be 20 racks of computer and networking equipment
- Just one of these racks, about the size of a large refrigerator, can hold up to 72 servers and weigh nearly 2,500 pounds
As stated earlier, the combined power and cooling demand will match the consumption of about 500 homes. It’s a very large-scale undertaking, and a fascinating one.
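A quick back-of-the-envelope check on those rack numbers (using only the figures quoted above; the split between compute and networking racks is an assumption, not an OSC specification):

```python
import math

TOTAL_SERVERS = 828          # from the article
MAX_SERVERS_PER_RACK = 72    # from the article
TOTAL_RACKS = 20             # compute *and* networking racks, from the article

# Minimum racks needed if every rack were packed to its 72-server capacity.
min_racks = math.ceil(TOTAL_SERVERS / MAX_SERVERS_PER_RACK)  # 12

# Average servers per rack across all 20 deployed racks -- lower than 72,
# since some racks presumably hold networking gear and sparser equipment.
avg_servers_per_rack = TOTAL_SERVERS / TOTAL_RACKS  # 41.4

print(min_racks, avg_servers_per_rack)
```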
“In these very dense, large racks, the servers generate a lot of heat,” Johnson said. “So we’ve had to go to extreme measures to deal with that density of heat.”
For the Owens Cluster, refrigerant is pumped through Rear Door Heat Exchangers (RDHX), so the hot air expelled from the servers is cooled immediately as it leaves the racks. This allows the room to remain cool.
“Ultimately the heat is transferred to what’s called the Condenser Water Loop, a large flow of water that’s running through the data center,” Johnson said. “That water is then cooled using outside ambient temperatures in large outdoor cooling towers.”
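The sizing of a loop like this comes down to one relationship: heat rejected equals mass flow times the specific heat of water times the temperature rise, Q = ṁ·c_p·ΔT. A minimal sketch with illustrative, assumed numbers (the article gives neither the heat load nor the loop temperatures):

```python
# Q = m_dot * c_p * dT, solved for the required water mass flow.
# All numbers below are illustrative assumptions, not OSC figures.
HEAT_LOAD_KW = 1000.0   # assumed total heat rejected by the cluster, kW
CP_WATER = 4.186        # specific heat of water, kJ/(kg*K)
DELTA_T = 6.0           # assumed temperature rise across the loop, K

# Required condenser-water mass flow, kg/s:
m_dot = HEAT_LOAD_KW / (CP_WATER * DELTA_T)
print(f"{m_dot:.1f} kg/s")  # roughly 40 kg/s under these assumptions
```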
3) Software environment
Once the computers are powered on, there has to be a way to manage all the software, which includes the operating system and user applications. The HPC systems group and the scientific applications group have been working on the components of that environment so when the system is powered on, clients have an operating system, user accounts and the types of scientific and engineering applications, libraries, tools and compilers needed to complete their research.
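HPC centers typically expose compilers, libraries and applications to users through an environment-modules system, where “loading” a module adjusts environment variables such as PATH. A toy sketch of that idea — the module names and install paths here are hypothetical examples, not OSC’s actual software stack:

```python
import os

# Toy model of an environment-modules system: loading a module prepends
# its install directory to the relevant environment variable.
# Module names and paths are hypothetical, for illustration only.
MODULES = {
    "gcc/6.1.0":    {"PATH": "/apps/gcc/6.1.0/bin"},
    "openmpi/1.10": {"PATH": "/apps/openmpi/1.10/bin"},
}

def module_load(name, env):
    """Prepend the named module's paths to the given environment dict."""
    for var, path in MODULES[name].items():
        env[var] = path + os.pathsep + env.get(var, "")
    return env

env = {"PATH": "/usr/bin"}
env = module_load("gcc/6.1.0", env)
print(env["PATH"])  # the module's bin directory now precedes /usr/bin
```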
4) Networking and storage
As mentioned earlier, the Owens Cluster will have significantly more storage than any system we’ve ever had. So it’s an absolute must that we have network connectivity and storage with sufficient performance to match the capabilities of the computational cluster. For that to happen, we’re upgrading to larger and more durable project and scratch storage, and employing a high-speed network fabric that connects each compute node to the others, and to the storage servers, at 100 Gbits/second.
“All of that is going to be new for the Owens Cluster,” Johnson said. “So there are significant additions in capacity and performance and all sorts of capabilities we just don’t have today.”
5) Testing
Once the initial versions of the environment are established, the hardware and software need to be tested in multiple ways and in multiple combinations to see how everything is working.
- Is all the hardware performing to the expected level?
- How are all the components behaving individually?
- How does the system work under heavy load?
From the hardware to the software, the Owens Cluster will be tested both component by component and as an ensemble, to make sure we understand how the system behaves under all levels and types of workload.
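Acceptance testing at this scale is done with tuned benchmarks such as HPL, but the shape of a per-node check is simple: run a known workload, verify the result is correct, and compare the measured rate against an expected floor. A toy stand-in (the pure-Python loop and any threshold you'd compare against are illustrative, not a real benchmark):

```python
import time

def matmul_check(n=120):
    """Time a naive O(n^3) matrix multiply of two all-ones matrices.
    Returns (GFLOP/s, one result entry). A toy single-node check;
    real acceptance tests use tuned benchmarks such as HPL."""
    a = [[1.0] * n for _ in range(n)]
    b = [[1.0] * n for _ in range(n)]
    start = time.perf_counter()
    c = [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
         for i in range(n)]
    elapsed = time.perf_counter() - start
    return (2 * n**3) / elapsed / 1e9, c[0][0]

gflops, entry = matmul_check()
assert entry == 120.0           # correctness: every entry should equal n
print(f"{gflops:.3f} GFLOP/s")  # performance: compare against an expected floor
```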
Is that all? To put it mildly, no.
In fact, it sort of just scratches the surface. But it does provide a glimpse into the massive scope of what it takes for engineers to design our new Owens Cluster, install it and set it up to serve the needs of our users.
And once it is up and running, we’re eager to see the discoveries and innovations the Owens Cluster unlocks for our academic and industrial research communities.