Monitoring High-Performance Computing at Scale: Introduction and Worked Example

Type

Whitepaper

Description

There is still a long way to go before Tier-0 High-Performance Computers are used for engineering design every day. Moving from a handful of demonstration runs to the desired mass production will require a special pair of glasses to see how end users actually use each piece of software on each machine.

Takeaway

This whitepaper underlines the need for proper feedback to code developers when the simulation workflow must scale up. This feedback can focus on performance, crashes, user habits, or any other metric. Unfortunately, no systematic method is available yet.

Your organisation should consider adding a feedback process to your simulation workflow if:

HPC costs are not negligible.
Production is too large for a single person to track (more than 1000 jobs per year).
HPC simulations are part of your design process.
The usage will span several years.

However, there are still many situations where it would be too burdensome:

HPC tools are in the demonstration stage only.
The volume of simulation is still manually manageable.
HPC tools are used in a breakthrough action, and re-use is not expected.

Web-URL:

Access the whitepaper on this link.

Acknowledgements

The authors wish to thank Mr. Nicolas Monnier, head of the CERFACS Computer Support Group, for his cooperation and discussions; Corentin Lapeyre, data-science expert, who created our first “all-queries MongoDB crawler”; and Tamon Nakano, computer science and data science engineer, who followed up and created the crawler used to build the database behind these figures. Many thanks also to the many proofreaders from the EXCELLERAT initiative.

Elsa Gullaud holds a PhD in acoustics and is now a postdoctoral researcher in data science at CERFACS.

Antoine Dauptain is a research scientist focused on computer science and engineering topics for HPC.

Gabriel Staffelbach is a research scientist focused on new developments in HPC.

License

Public