Sep 12, 2007
Tests improve Grid performance
Being a Grid user isn't always straightforward. You might have 10,000 CPUs and terabytes of disk space at your fingertips, but can you get your job to work on all of them – or any of them?
Although there are many more Grid users than there used to be, getting started on a Grid, and getting it to do what you want, is still not for the faint-hearted.
Fortunately there are people trying to make it easier. One of them is Steve Lloyd of Queen Mary, University of London. As part of his work on the ATLAS experiment, Lloyd has been sending jobs to Grids such as GridPP and EGEE for years. But although many of his jobs went off without a hitch, Lloyd found that some just kept failing, even though they were sent to a working site that had passed all of the Grid's tests.
Six months ago he decided to find out why. And as chair of the GridPP collaboration in the UK, he was in a position to get problems fixed.
The result of Lloyd's work is a suite of three test jobs that run hourly on sites in the UK particle physics Grid. The complexity of these test jobs ranges from submitting "hello world" to analysing a file of particle physics data using the latest ATLAS software.
Lloyd explains: "When I first started running these tests their success rate was only around 50%. I'd get a massive range of problems: broken resource brokers, difficulties with the information system, proxy certificates timing out, sites that didn't have the latest version of the ATLAS software, and even sites without the required compiler."
Using the detailed log files provided by Lloyd's test jobs, and with the aid of the GridPP deployment team, each Grid site got to the bottom of its problems.
Lloyd's test jobs now run at a 90% success rate. This gives him some hope for future Grid users. "I used to wonder how users would ever be able to analyse the ATLAS data on the Grid. Now I'm more hopeful – but we've still got a lot of work to do." Lloyd's experience shows that things don't always go smoothly, even for experienced Grid users. But things are on the up.
• This article was published online in iSGTW on 13 June.
About the author
Sarah Pearce, GridPP