The Grid Deployment (GD) group in CERN's IT department is in many ways the technical linchpin of CERN's Grid activities. As well as ensuring that the LHC Computing Grid (LCG) is primed for experimental data, it also supports the broader technical goals of the multiscience Grid infrastructure run by the Enabling Grids for E-sciencE (EGEE) project, now in its second phase. CNL interviewed Ian Bird, the leader of the GD group, about recent progress in LCG operations, and the road ahead.

How is the GD group structured and funded, and how will this evolve?

There are about 60 people in GD, organized into three sections: LCG service coordination; EGEE Grid operations; and middleware certification and data management development. EGEE funds about 20 of those people. Also integrated into the group is ETICS, an EGEE-related project for middleware certification that CERN manages. There are a further five associates who are funded by other related projects in which CERN participates. In total, about half the group is funded by EGEE and its related projects, so the Grid operations for LCG depend strongly on external funding. This is true not just at CERN but also at many other LCG sites.

The Grid operations are a success story of EGEE that is often overlooked. Much fuss is made about the EGEE middleware, but the operations actually form a larger part of the EGEE effort, and today we can safely say that they are working well. Certainly there is room for improvement. In particular, the operations still require many people. This is partly a reflection of the immaturity of the software, and partly of the fact that everything is so distributed. One focus at the moment is on automation and improved monitoring of operations, in order to reduce the number of people involved. This is not just a matter of cutting costs but also of providing a more reliable service. This trend towards rationalizing the operations of the Grid will be emphasized even more strongly during the third phase of the EGEE project, which is expected to start in 2008.

How have Grid operations evolved over the last year?

I think the main achievement is the general increase in scale. In the last year the amount of work being run on the EGEE service has increased by more than a factor of two. This is something that the operators don't notice – you only see it when you look at the accounting after the fact (figure 1). For me that is an indication that the processes we've got in place are capable of dealing with even more than we are currently running.

One of the important steps towards more reliable operations is the Service Availability Monitoring (SAM) developed in GD. SAM submits Grid jobs to each site and probes the site externally on an hourly basis, using a whole suite of tests of the services it is running. This has been the basis of the reliability measurements that we make for the Tier-1 centres in LCG. A criticism from the sites had been that the results were only available centrally, so over the last few months we have put effort into feeding this information back to the local fabric monitors at the sites. This is a prototype now, but it should make a big difference, since site administrators will receive alarms – previously only visible centrally – directly in their own fabric monitoring systems.
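
As a rough illustration of the mechanism described above, the sketch below mimics an external, periodic pass over a list of sites, testing each service and feeding failures back to the site's own fabric monitoring. It is a minimal Python sketch only: the site names, probe functions and alarm hook are hypothetical stand-ins, not SAM's real tests or interfaces.

```python
from datetime import datetime, timezone

# Illustrative sketch only: the real SAM framework submits genuine Grid jobs
# and runs its own test suites. Everything named below is a hypothetical
# stand-in showing the shape of an hourly, externally driven probe with
# feedback to the site's local fabric monitoring.

SITES = ["site-a.example.org", "site-b.example.org"]


def probe_compute_element(site: str) -> bool:
    """Stand-in for submitting a short test job to the site's compute service."""
    return True  # a real probe would submit a job and check its outcome


def probe_storage_element(site: str) -> bool:
    """Stand-in for copying a small test file to and from the site's storage."""
    return True


PROBES = {"compute": probe_compute_element, "storage": probe_storage_element}


def send_alarm_to_site(site: str, service: str, timestamp: str) -> None:
    """Feed a failure back to the site's local fabric monitoring system."""
    print(f"ALARM for {site}: {service} test failed at {timestamp}")


def run_test_suite() -> None:
    """One pass over all sites; the real service repeats this roughly hourly."""
    now = datetime.now(timezone.utc).isoformat()
    for site in SITES:
        for service, probe in PROBES.items():
            if not probe(site):
                send_alarm_to_site(site, service, now)


if __name__ == "__main__":
    # In production this would be scheduled hourly; here we run a single pass.
    run_test_suite()
```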

The philosophy we are following is that whatever can be done locally should be done locally. The system administrators should be able to address a problem at source, rather than us having to open a ticket centrally and then contact the site administrator. As part of this push, we have reached agreement on a common format for publishing data from monitoring systems, so that the data can be shared between systems. The result is that whatever appears on an experiment dashboard about a particular site can be fed directly back to that site, so the site administrator knows what is going on.
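
The agreed schema is not spelled out here, but a minimal sketch of the kind of shared record implied above might look like the following; all field names are hypothetical illustrations rather than the actual common format.

```python
import json
from dataclasses import dataclass, asdict

# Hypothetical illustration of a common monitoring record that both a
# central dashboard and a site's local fabric monitoring could consume.
# The real agreed format is not reproduced here.


@dataclass
class MonitoringRecord:
    site: str          # site the measurement refers to
    service: str       # e.g. a compute or storage service
    metric: str        # name of the test or quantity measured
    status: str        # e.g. "ok", "warning", "critical"
    timestamp: str     # UTC time of the measurement, ISO 8601
    detail: str = ""   # free-text explanation for site administrators


def publish(record: MonitoringRecord) -> str:
    """Serialize the record so any monitoring system can ingest it."""
    return json.dumps(asdict(record))


if __name__ == "__main__":
    rec = MonitoringRecord(
        site="site-a.example.org",
        service="storage",
        metric="file-transfer-test",
        status="critical",
        timestamp="2007-06-01T12:00:00Z",
        detail="transfer to the storage element timed out",
    )
    print(publish(rec))
```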

This year we've also started a collaboration with EDS, via CERN openlab, that will focus on better ways of visualizing the complex results recorded by the various monitors. The idea is to provide a high-level visualization for the whole Grid, so you can see what is going on at a glance. I'm quite optimistic that this will give us a more intuitive way to assess the situation.

What are the challenges facing the LCG service?

We've heard a lot about service challenges and data rates over the last couple of years. I don't think that data rates are the main challenge any more; it is reliability that needs to be improved. The issue is to understand site problems: things stop working and it is not always clear why. A lot of effort is spent understanding why one experiment obtains a high data rate from a site and another doesn't. So GD's role is one of coordination with the experiments, trying to systematize how we handle problems. Today the LCG service focuses on the data transfer to the Tier-1s. For the rest, we rely primarily on either EGEE or its US counterpart, the Open Science Grid (OSG), to manage operations.

Interoperability remains a challenge for the LCG service. There has been good progress on some fronts. We now have a joint operation between EGEE and OSG, with OSG participating in the weekly Grid operations meeting that GD runs. A positive outcome of this is that CMS has been using the interoperability between the two Grids in its production for nearly a year: it submits work through the EGEE Resource Broker middleware and the jobs run on OSG. We are also making progress with the Naregi Grid project in Japan and NorduGrid in the Nordic countries. The Japanese Tier-2 sites would like to receive their support locally through Naregi. How interoperable the two middleware systems will become remains to be seen, but this scenario is not excluded. There is funded effort for interoperability between NorduGrid's ARC middleware and EGEE's gLite. We're close to being able to run jobs from the EGEE Resource Broker into ARC, but the other way around is further off.

What is the status of middleware certification and data management development?

Getting gLite 3 out of the door last year was a major milestone. This was a merger of the two middleware stacks, LCG and gLite. But at the same time it was a merger of two groups, two build systems and all of the processes behind that. And we did it in just six months.

Today the certification activity takes middleware from EGEE and from the Virtual Data Toolkit (VDT), and builds the gLite distribution. There is a large certification test-bed, both at CERN and with EGEE partners, where we do functional testing, integration testing and stress testing of each release. This is an activity that started in the LCG days and has expanded considerably. One of the problems with the middleware has been its lack of portability, and there has been a significant effort in the last few months to untangle some of the dependencies in order to make it more portable.

The data management development team is responsible for the File Transfer Service (FTS) software and the corresponding service at CERN. The team is also responsible for the file catalogue that the experiments are using. There have been some major improvements in that area recently, in particular the development of the Disk Pool Manager (DPM), a storage element for sites that do not have the manpower to deploy, for example, the dCache system.

What lies ahead for GD as the LHC start-up approaches?

One challenge is whether this Grid can really cope with the level of work we anticipate when the LHC starts up. The EGEE infrastructure is now running about 100,000 jobs a day. The number of jobs is often a better indicator of complexity than sheer CPU time, since it is a measure of how busy the system is. CMS and ATLAS say that they will each need to run a quarter of a million jobs a day once the data start flowing. I'm confident that the operation itself will deal with this five-fold increase in activity, because resources at sites will increase. The issue is whether the middleware can really scale in a manageable way to these levels, or whether you have to introduce so many service nodes that it becomes hard for each site to manage them.
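
As a quick back-of-envelope check of that five-fold figure, using only the numbers quoted above:

```python
# Rough check of the anticipated load increase; figures are those quoted in
# the interview, and the variable names are purely illustrative.
current_jobs_per_day = 100_000       # EGEE infrastructure today
per_experiment_jobs = 250_000        # CMS and ATLAS each, once data flow
anticipated_jobs_per_day = 2 * per_experiment_jobs

print(anticipated_jobs_per_day / current_jobs_per_day)  # 5.0, the five-fold increase
```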

Another challenge is that we will reach the end of the third phase of EGEE project funding in 2010. The idea is that we will make the transition to a model of national Grid infrastructures with European coordination, which is a fundamentally different operations model from what we have now. How we make that transition without disrupting ongoing operations remains to be seen, and we don't yet know what responsibilities each stakeholder will take on. While 2010 may seem a long way off, this is something that needs to be worked out within the next year if it is to be implemented in time.

Finally, we don't have much experience yet with the experiments doing analysis on the Grid. There are probably about 100 people per experiment who have run a job on the Grid. By next year perhaps half the collaboration will want to look at the data. How the Grid will cope with that is still a big unknown.