Since the beginning of the year, CNL has been visiting the groups in CERN’s IT Department that provide support to the LHC Computing Grid (LCG). In this issue we interview Tony Cass, leader of the Fabric Infrastructure and Operations (FIO) group. With nearly 60 members, the size of the group reflects the complexity of the tasks needed to run the CERN Computer Centre, which is the Tier-0 site of the Worldwide LCG collaboration and also offers Tier-1 functionality.

How is FIO structured to cope with the major changes that are occurring in the Computer Centre, in preparation for the LHC?

There are five sections in the group. In alphabetical order, the first is the Fabric Development section, which looks after ELFms, the Extremely Large Fabric management system; this includes the LEMON monitoring system, Quattor for system configuration, and CASTOR, the CERN Advanced STORage manager. These are grouped together to ensure that all the software we support has a common design and development methodology, and to enable people to move between different software projects as priorities change. This has worked over the last year: as development effort on ELFms has ramped down, people have moved across to support CASTOR. Additional effort is needed for CASTOR at present, as increasingly realistic tests in preparation for LHC operations have highlighted some challenges. However, CASTOR2 is proving to be generally more robust than CASTOR1, and has sustained data throughput of more than 1 GB/s as well as meeting targets for data transfer to the Tier-1 sites.

The second section is Fabric Service, which is at the front line of delivering services to physicists. The team looks after disk and batch services, configures them and makes sure they are running to meet the experiments’ needs. The team also works closely with the Grid Deployment group to make sure that what we have in the Computer Centre is available at the Grid level, and that we are reporting the status of our machines correctly.

Then there is the Linux and AFS section. Although aging, AFS is a reliable file system that, together with the Kerberos authentication system, underpins much of how we manage our clusters. On the Linux side, the team has a challenging task given our technologically aware and demanding user community. In addition to general support, the team focuses on optimizing the Linux kernel to deliver the highest performance, especially for the file systems, for CASTOR and for networking.

The Technology and Storage Infrastructure section looks after the tape robots. We now have about 12 PB of tape capacity and must ramp up significantly for LHC operations next year. We’re also switching over from the obsolete 9940 drives and Powderhorn silos to new IBM and StorageTek robots. So there’s a lot of effort behind the scenes to copy 5 PB of existing data to new media. This section also looks after procurement. We installed 1200 PC boxes last year and we expect to buy at least as many this year. Finally, this team also handles hardware monitoring. We have several thousand disks now and it is important to understand how they and the RAIDs (Redundant Arrays of Independent Disks) are functioning.

Last but not least we have the System Administration and Operations section. This brings together the system administration team, which provides a piquet (on-call) service and first-level cover for most of our systems, and the long-term planning for the machine room. To install 2500 PCs you have to sort out racks, cabling and a lot of logistics well in advance. And of course, a growing preoccupation is providing adequate cooling for all this equipment.

Power and cooling seems to be an increasing challenge for data centres. What is your view?

Looking into the future, there are some eye-opening statistics. At the moment we use less than 1 MW to power the equipment in the Computer Centre, and if the cooling stops the temperature rises by half a degree per second. By 2009, when the rest of the equipment will be installed, the temperature will rise by one degree per second if cooling fails. That leaves very little time to protect the equipment.

My biggest concern, though, is that computing power is projected to grow for years to come. Although chips are becoming more efficient, this is not happening fast enough to offset the overall growth in processing needs from the LHC experiments. A conservative estimate is that the Computer Centre will consume 20 MW by 2020; if we extrapolate from the growth seen during the Large Electron-Positron (LEP) collider era, however, 100 MW may be a more accurate estimate. Preliminary investigations indicate that it will not be possible to upgrade the Computer Centre to cope with this, so we will need another data centre at CERN or elsewhere. Nothing is excluded and various hosting solutions are being investigated.
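As a rough back-of-the-envelope illustration of what such an extrapolation implies (an illustration only, not a figure from the interview), the compound annual growth rates can be read off from the quoted end points, assuming the "less than 1 MW" figure above as a 2007 starting point:

```python
# Back-of-the-envelope illustration, not from the interview: the compound
# annual growth rates implied by the two end points, assuming ~1 MW in 2007.
def implied_annual_growth(start_mw: float, end_mw: float, years: int) -> float:
    """Compound annual growth rate needed to go from start_mw to end_mw."""
    return (end_mw / start_mw) ** (1.0 / years) - 1.0

years = 2020 - 2007
print(f"20 MW by 2020 implies ~{implied_annual_growth(1.0, 20.0, years):.0%} growth per year")
print(f"100 MW by 2020 implies ~{implied_annual_growth(1.0, 100.0, years):.0%} growth per year")
```

The conservative 20 MW figure corresponds to roughly 26% growth per year, while the LEP-style extrapolation to 100 MW corresponds to roughly 43% per year.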

How is the status of Tier-0 services visible to users and to the Grid?

The LEMON status displays are open for all to look at but are designed for a specialist audience. A recent success for the group is SLS, the Service Level Status display. This is a user-oriented view of the service status that helps users understand how the Tier-0 and Tier-1 services at CERN are behaving, and allows them to drill down to individual components if necessary. This year we aim to deliver an XML interface to our monitoring database. This will enable experiment production managers to take decisions on what to do: for example, throttling back on production if there are problems with CASTOR.
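As an illustration only, tooling on the experiment side might consume such an XML status feed along the following lines; the URL, the availability element and the 90% threshold are hypothetical placeholders, not the actual SLS or monitoring-database schema:

```python
# Minimal sketch, assuming a hypothetical XML status feed of the kind
# described above; the endpoint and element names are placeholders.
import urllib.request
import xml.etree.ElementTree as ET

STATUS_URL = "https://sls.example.cern.ch/castor-status.xml"  # hypothetical endpoint

def castor_availability(url: str = STATUS_URL) -> float:
    """Fetch the status feed and return the reported availability (0-100)."""
    with urllib.request.urlopen(url, timeout=10) as response:
        root = ET.fromstring(response.read())
    return float(root.findtext("availability", default="0.0"))

def adjusted_submission_rate(current_rate: int) -> int:
    """Throttle back production while the storage service looks unhealthy."""
    if castor_availability() < 90.0:
        return current_rate // 2  # halve the submission rate until it recovers
    return current_rate
```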

In a similar vein, we want to avoid parallel monitoring systems querying sites on the Grid, so our data is being fed into systems like Nagios and Ganglia even if we don’t use those monitoring tools ourselves.

Reliability remains an issue for the Grid. How is the Tier-0 faring in this respect?

We’re processing 5000–6000 jobs round the clock, with peaks of 40,000 jobs in the queue. We easily run 200,000 jobs each week, and I’m pleased that ours is the most reliable site on the Grid. We’ve met the target every month since measurements began, and we’re easily 95–97% reliable.

Considerable ingenuity is required to maintain this sort of reliability, especially given the evolving nature of the Grid middleware. For example, the middleware queries the batch system once per minute per job to check whether it is still running. At a Tier-2 site with a few hundred jobs that is manageable, but with 6000 jobs we’re getting hundreds of queries per second. So we had to develop a caching system to avoid overloading the batch system simply to report that everything is OK.
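The idea behind such a cache can be sketched as follows; this is an illustration of the technique rather than the actual CERN implementation, and the bulk_query callable stands in for whatever single "list all jobs" command the batch system provides:

```python
# Illustrative sketch of the caching idea: refresh one bulk snapshot of all
# job states periodically and answer per-job status queries from it, instead
# of querying the batch system once per job per minute.
import time
from typing import Callable, Dict

class JobStatusCache:
    def __init__(self, bulk_query: Callable[[], Dict[str, str]], ttl: float = 60.0):
        self._bulk_query = bulk_query        # returns {job_id: state} for every job
        self._ttl = ttl                      # snapshot lifetime in seconds
        self._snapshot: Dict[str, str] = {}
        self._refreshed_at = float("-inf")   # force a refresh on first use

    def status(self, job_id: str) -> str:
        """Answer a per-job status query from the cached bulk snapshot."""
        if time.monotonic() - self._refreshed_at > self._ttl:
            self._snapshot = self._bulk_query()   # one call covers all jobs
            self._refreshed_at = time.monotonic()
        return self._snapshot.get(job_id, "UNKNOWN")
```

With a 60-second snapshot, thousands of per-job checks per minute collapse into a single bulk query of the batch system.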

I would emphasize, though, that the bottom line in providing such high reliability is the level of commitment of everyone in the group. I’ve lost count of the number of times people have come in at weekends and at night to fix even minor problems. Ultimately, it is this commitment that makes a 24/7 service a reality.