Useful Sticky Notes

Monday, March 15, 2010

Exascale Computing: A ballpark guide to an epic journey.

The HPC community is taking concrete steps to conduct the research needed to carry us into the Exascale computing era. This post is really a reminder to myself about the scales and challenges we, as research scientists in the field, will face to varying degrees when Exascale computing arrives.

So, what scale of computing are we talking about? As a ballpark figure (ignoring the finer points of FLOPS vs. OPS as a measure of computing power), we are talking about 1,000,000,000,000,000,000 (10 to the 18th) operations per second.

What does this mean in terms of today's technology? One of the common processor chips in HPC today is AMD's 6-core Opteron, with a theoretical peak performance of 10.4 GFlops per core. 100 million of those cores buy us approximately an EFlop. We are now in an era of improving performance by placing more cores on a chip rather than making each core go faster. I believe it is reasonable to assume that theoretical peak performance will not go past 15.0 GFlops per core (current roadmaps appear a little fuzzy on the details, so I could be wrong), which would bring us down to roughly 60-70 million cores for an EFlop.

In terms of hardware, the challenge is to connect those cores efficiently and keep latency low. This may be helped by the expected increase in the number of cores per processor chip (Intel's Larrabee was planned for 24 cores), in addition to improvements in high-speed interconnects for traditional HPC setups.

Power will also be a serious issue. Today, 40W is sometimes considered "energy sipping". However, even if each core "sips" only 1W of power, 100 million of them would eat a whopping 100 MW. At 40W per core, this would mean 4 GW of power. These figures ignore the cooling required to combat the heat generated by the chips! If this is not addressed, we will need about 8 nuclear reactors to provide the necessary power (some figures I've seen indicate that a modern nuclear reactor can generate up to 500 MW).
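For concreteness, here is a quick back-of-the-envelope sketch of the arithmetic above in Python. The per-core peak and power figures are just the assumptions stated in this post, not measured numbers.

    # Back-of-the-envelope arithmetic for the core counts and power figures above.
    EXAFLOP = 1e18  # 10^18 floating-point operations per second

    for gflops_per_core in (10.4, 15.0):  # assumed per-core theoretical peaks
        cores_needed = EXAFLOP / (gflops_per_core * 1e9)
        print("%.1f GFlops/core -> %.1f million cores"
              % (gflops_per_core, cores_needed / 1e6))

    # Total power draw at 100 million cores, ignoring cooling entirely.
    cores = 100e6
    for watts_per_core in (1, 40):  # assumed per-core power draw
        total_mw = cores * watts_per_core / 1e6
        print("%d W/core -> %d MW (about %.1f reactors at 500 MW each)"
              % (watts_per_core, total_mw, total_mw / 500.0))

Running this gives roughly 96 million cores at 10.4 GFlops/core, 67 million at 15.0, and the 100 MW vs. 4 GW power spread quoted above.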

There is some hope, however, if one takes accelerator architectures into account. These could, for various specialized types of tasks, dramatically increase the performance of each of their cores (some existing literature uses accelerator cores and regular CPU cores interchangeably). Energy consumption per core would also be affected if one uses these definitions. For example, there are many, many GPU cores on a single GPU card. I do believe scientists in our field have already recognized these accelerator architectures as being essential to the success of exascale computing.

Using the above assumptions about the hardware, what then needs to be considered for the software? Instead of looking at the general field of software in HPC, I would like to focus on my sub-field of performance analysis of HPC applications. The very first thing that stares me in the face is the core count. Traditionally, the performance data captured on each core conveniently represents the most accurate description of "what happened?". In the above scenario, we would now have to deal with information coming from 100 million cores. Do we keep 100 million files? Do we merge them? Do we try to summarize the information with various compression techniques (as I have explored in my PhD thesis; a toy sketch of this idea appears below)?

The second challenge is the massive number of events we would have to deal with. If my hypothesis that core speeds will not climb much higher holds true, I would be grateful, for if they did skyrocket, the same events would occur at a much higher frequency. That would in turn mean, if memory speeds fail to keep up with processor speeds (as they have traditionally failed to), that recording these events would result in increased overheads and perturbation. Even if this did not happen, I would still have to deal with the sheer numbers, for, in lieu of speed, the application would be pumping the performance information out in parallel. The I/O subsystem had better keep up when buffers get filled.

Finally, what of the analysis? The new challenges I see, even if we do preserve all the details of the performance data, are making sense of the increased heterogeneity of the events and of the increased complexity of event interactions and how they affect performance. For example, communication between cores on the same chip should be vastly superior to communication between cores on different chips. The increased number of cores per chip could also make performance more sensitive to memory access, since the bandwidth between chip and memory gets tighter (given the assumptions above). The nature of GPGPU processing also adds new scenarios where performance may be sensitive to the way code is written, data is shared, or the datatypes used.
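As a loose illustration of that summarization idea (not the actual approach from my thesis), here is a minimal Python sketch that collapses one per-core metric into a fixed-size histogram plus a few summary statistics, so the stored summary no longer grows with the core count. The metric name and the synthetic distribution are made up purely for the example.

    import numpy as np

    def summarize_per_core_metric(values, bins=64):
        # Collapse one per-core metric (e.g., time spent waiting on messages)
        # into a fixed-size histogram plus a few summary statistics, so the
        # stored data no longer grows with the number of cores.
        counts, edges = np.histogram(values, bins=bins)
        return {
            "cores": values.size,
            "mean": float(values.mean()),
            "std": float(values.std()),
            "min": float(values.min()),
            "max": float(values.max()),
            "histogram": counts,
            "bin_edges": edges,
        }

    # Scaled-down stand-in for per-core measurements (imagine 100 million cores).
    rng = np.random.default_rng(0)
    per_core_wait_times = rng.gamma(shape=2.0, scale=0.5, size=1_000_000)

    summary = summarize_per_core_metric(per_core_wait_times)
    print(summary["cores"], round(summary["mean"], 3), len(summary["histogram"]))

The obvious trade-off is that per-core detail is lost; deciding which metrics can safely be summarized this way, and which cannot, is part of the research question.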

This is exciting stuff and I look forward to dealing with these challenges as we march forward.

2 comments:

  1. Chee Wai,

    Just out of curiosity, how do you envision cores in 100 million-core machines to be organized? Would we see a huge increase in #cores/node? Would the number of (shared memory) cores per node increase in proportion to the number of nodes?

    -Forrest

  2. Hey Forrest,

    Sorry about the lack of a reply for such a long time; I realized I hadn't even set this blog up to notify me of comments.

    As it stands right now, I see lots of cores per chip and a decent number of chips per node. I also see finer-grained hardware units in specialized hardware like GPGPUs considered as "cores" (some people treat them as such even now).
