Managing a Universe of Data at CERN

By Jamie Shiers

With its Large Hadron Collider expected to generate 100PB of data, CERN is looking to a modified, off-the-shelf ODBMS to manage a "big bang" of data

At the European Laboratory for Particle for Physics (better known by its acronym CERN), we're pushing all kinds of limits. CERN's Large Hadron Collider (LHC)--which, when completed in 2005, will be the largest particle accelerator ever built, is designed to further our understanding of forces and matter. But equally exciting may be our comparatively unknown efforts to redefine the limits of data storage, management, and access, considering that the LHC is expected to produce nearly 100PB of high-energy physics (HEP) data over its operational lifetime.
      In 1995, the RD45 project was launched by CERN to address these issues. The project was explicitly directed toward the management of persistent objects under the assumption that the LHC experiments would use object-oriented solutions and C++ as an implementation language. (Not surprisingly, we are now including Java in our investigations; more on this later.) In addition, the project was directed to focus on commercial, open-system solutions. Although we considered many alternatives, including language extensions, so-called lightweight object managers, and others, we rapidly reached the conclusion that only a full-fledged ODBMS could come close to meeting our initial requirements.
      We have successfully deployed an off-the-shelf ODBMS together with application specific enhancements in a production environment. Using this technology, we anticipate being able to manage the vast volumes of data that will be generated at the LHC, using essentially the same amount of labor as was required for a smaller project that produces three orders of magnitude less data. Although a small amount of code has been written to facilitate the use of this product, the cost savings associated through the adoption of object database technology have been extremely impressive. For example, one of our existing database applications took approximately 10 person-years to write in Fortran. The equivalent application based on an ODBMS was implemented last year by a single graduate student.
 
BIRTH OF A COLLIDER
CERN was established in Switzerland in the mid-1950s to rekindle the spirit of fundamental scientific research in war-ravaged Europe. From the outset, it has been an excellent example of international collaboration; some 20 nations jointly fund CERN, which is staffed primarily, but not exclusively, by personnel from these countries.
      Over the years, CERN has built several accelerators that have been reused as infrastructure for subsequent machines. The largest accelerator currently in operation at CERN is the Large Electron Positron (LEP) collider, a 17-mile ring located just outside Geneva. Following the CERN tradition of reuse, the LHC will be housed in the LEP tunnel and fed by the existing complex of accelerators.
      Four major experiments are planned for the LHC. Two of these, Atlas and CMS, will record about 1PB of data per year at data rates reaching 100MB/second. A third experiment, Alice, will also store about 1PB/year but record data at even higher rates--up to 1.5GB/second. A fourth experiment, LHC-B, will produce a total data volume of 5PB per year. Based on LHC's projected operational lifespan of 20 years, the total volume of data that is expected to be stored will reach 100PB--approximately twice the amount of what would be on record had one bit been stored every second since the "Big Bang" 15 billion years ago!
 
OBJECT DATABASES AND STORAGE
One of the many challenges of a project with such a long timeline is the uncertainties presented by the evolution of technology. It is not enough to be blindly optimistic; you must create realistic, conservative scenarios based on which products are likely to be available, as well as fallback possibilities should technology not advance as expected. For example, 25 years ago, many of today's giants in the computing industry either did not exist or hadn't entered the mainstream--hardware architectures such as Digital Equipment Corp.'s VAX were yet to be invented (and replaced), Codasyl databases were all the rage, and ODBMSs were virtually unknown.
      When planning for the future, being too specific is clearly unrealistic. You shouldn't depend on a given implementation or on precise characteristics; it is more meaningful to predict whether there will be a market for a product with certain functionality. For example, it's hard to believe that the storage market will do anything but grow, but the exact form factor of disks 10 years from now cannot be predicted. This analysis suggests that the basic hardware building blocks necessary to create a multipetabyte store will be available as well as affordable, although the relative costs of disk versus tape (or its replacement) are much harder to predict. Indeed, the long-term market for tape as a medium may simply disappear.
      In our storage strategy for RD45, we anticipate the need for a mass storage system (see sidebar, "Mass Storage Systems") because it is far from clear that it will be financially possible to store all the data on disk. Clearely, such a system must be transparently integrated with the overall data management solution. Latency aside, a user should not (and need not) be aware if data is stored on disk, tape, or anywhere else.
 
CHOOSING AN ODBMS
The first thing to realize about choosing an ODBMS is that there is no "ideal" product. The available systems differ widely in architecture and functionality, so the question is more one of matching your requirements against several potential solutions. In a perfect world, several alternatives would meet all of the mandatory requirements and you could select the primary supplier based on noncritical issues. In reality, this is typically not the case for large projects; requirements must be prioritized and compromises made.
      In CERN's case, the most critical requirement is that of scale. A solution that will not with a reasonable number of enhancements scale to meet our data volume needs is out of the question. Clearly, a single physical database of 100PB is currently impossible, so some form of distributed database is necessary. This fact raises even more scalability issues: Even if physical ODBMSs could be as large as 100GB--already very large by today's standards--then one million of such databases would be required to accommodate 100PB. These databases would have to appear to the user as a single, logical database with transparent, cross-database references and a single (ideally replicated), shared, and hence consistent, set of schema. We describe below how such features are supported in our preferred ODBMS, namely Objectivity/DB.
      Performance plays a close second to scalability in our ranking of criteria. This requirement can be expressed in terms of the raw data rate that must be supported--from 100MB/second for Atlas and CMS to 1.5GB/second for Alice--and the aggregate data rate needed to perform a typical physics analysis. Of course, absolute performance depends on many factors, including the layers that reside below the database. Consequently, we prefer to express performance as a percentage of the performance of the underlying systems.
 
FALL-BACK STRATEGY
In order to ensure maximum vendor independence, we adhere to the programming interfaces defined by the ODMG. In principle, an application that is built using the ODMG C++ binding can be ported from one conforming implementation to another simply by recompiling. A data interchange format is even specified for converting the data. However, adhering to this spec is unlikely to help us completely avoid vendor lock-in. The ODMG does not specify implementation, only interface, so the various products offering various degrees of conformance to the ODMG bindings have widely differing architectures. Apart from trivial example programs, production applications will almost certainly be influenced by the architecture of the product on which they are deployed.
      Which is best for your application, object server or page server? What level of locking granularity is required? How is object clustering implemented? These questions imply that a quick recompile and conversion of existing data are far from sufficient should you need to change vendors. This conversion should not be undertaken lightly, but it can be necessary and its probability will not reduce over time.
      Including an escrow clause into the contract is not necessarily the answer. (Can you afford to take over the development and support of the product in-house?) Nevertheless, the need for a fall-back strategy is clear; you cannot afford to cancel the whole program simply because of a problem in one area. However, that doesn't necessarily mean that two given solutions are equal in functionality or performance. The fall-back solution almost certainly implies reduced functionality, perhaps a more restrictive computing model, and maybe even a delay in results.
      As discussed previously, the best protection against such problems is a large, healthy market--in other words, adopt commodity solutions wherever you can. Unfortunately, the demand for fully distributed, highly scalable ODBMSs has yet to reach a commodity level, and the HEP market will probably not be large enough to sustain such products by itself. However, some of the markets in which ODBMSs are currently successful, such as the telecomm market, can be assumed to be large enough to support at least one ODBMS product. Performance and distribution are also critical in this area, so the better question is: Can we exploit technology "designed" for (and supported by) this market?
 
TESTS OF SCALABILITY
As our key requirement is scalability, particularly in terms of data volume, we designed a series of tests to gauge the suitability of the Objectivity/DB ODBMS from Objectivity Inc. These tests are described in detail at the RD45 Web page at http://wwwinfo.cern.ch/asd/cernlib/rd45/index.shtml.
      Objectivity/DB supports distributed databases with shared, consistent schema; as many as 216 physical databases form a federation in Objectivity's terminology. Currently, each of these databases maps to a normal file in the underlying file system. Hence, when using 32-bit file systems, databases are limited to 2GB, whereas in 64-bit file systems their size for all practical purposes is unlimited. Databases in turn comprise containers, each of which is composed of database "pages" on which the objects themselves are stored. Essentially, the four 16-bit fields of Objectivity/DB's 64-bit object identifier map to the database ID, the container ID, the logical page, and the slot on that page.
      Our tests hit every documented limit in this architecture (although not simultaneously) in order to find the undocumented limits. In these tests, due to limited disk space, the largest federation we were able to create was a mere 500GB. We were, however, able to demonstrate physical databases well beyond the 2GB limit from 32-bit file systems and found that a federation containing the maximum number of databases was indeed possible. By exploiting both of these potentialities--which remain to be verified--distributed databases up to the petabyte range are possible. Unfortunately, as our requirements go far beyond this range, the architecture should, on paper, scale to at least one order of magnitude beyond this limit to avoid arbitrary constraints such as filling every slot on every page, every container up to its limit, and so on. Based on these requirements, we required three main enhancements to the ODBMS: full ODMG compliance, including support for STL collection classes; architectural changes to permit scaling to at least 100PB; and integration with a mass storage system.
      ODBMSs typically store their data in files, using the native file system on the host machine. In other words, some level of the containment hierarchy maps to a standard file, albeit with complex internal structure. Although 64-bit file systems are becoming fairly widespread, a maximum realistic file size must still be imposed. For example, if inactive data is stored offline on tape, the basic quantum that is moved to/from tape should be no larger than a single tape volume and be recallable within a reasonable period of time (1,000 seconds would be acceptable, 100 seconds preferable). Although other reasons exist for limiting the file size, such rule-of-thumb arguments suggest that a maximum file size of a few gigabytes is reasonable today, 10GB will be acceptable in a year or so, and perhaps 100GB will be acceptable in 2005. Using such a maximum file size, it can be argued that at least two ODBMS architectures permit distributed databases in the petabyte range. However, as 100PB is our requirement, the architecture should permit much larger databases to avoid any arbitrary constraints.
      Basically, if the maximum file size is limited, there is only one way to make a very large object database possible: by increasing the maximum allowable number of files per logical database. A simple way this goal can be met is by increasing the object identifier. However, this solution is not necessarily ideal as it has a direct impact on storage overhead, which can be significant for small objects.
      An alternative solution involves logical-physical mapping. For example, in some systems, one database can be mapped to multiple files (volumes). In the case of Objectivity/DB, this tactic could be implemented by mapping containers, rather than databases, to files. Essentially, as this approach would permit 232 files--if these files are kept relatively small (a few gigabytes)--it is still architecturally possible to reach the EB region. In fact, extending the object identifier theoretically makes it possible to reach the zettabyte--1021 bytes--range!
      As for integrating an ODBMS with a mass storage system, suffice it to say that this integration is performed at the file-system level; the database issues "standard" I/O calls and the underlying system calls formerly inactive files that have been staged out to tape. This approach, of course, requires efficient clustering because the latency involved in such operations is significant, and you want to avoid reloading a file to retrieve just a few objects. Thus objects that will probably be accessed together should be stored together, at least for the major access patterns.
 
DATA CLUSTERING
Data clustering is another issue that will clearly become important. Already complex in its own right, the combination of an object database with a mass storage system adds further complications. The latency involved in bringing and object into memory from a tape-resident database is significantly larger than that of a disk resident or server-side cached object. Furthermore, while some architectures transfer individual objects from server to client, this will clearly not occur directly from tertiary storage; much larger data volumes than a single object must be moved to and from tape.
      A related issue is that of metadata. If much of the data is tape-resident, maintaining consistent metadata such as schema across large numbers of databases is essentially impossible. For example, if a single schema change required every database to be restored to disk, modified, and flushed out to tape again, the system could never work in production.
      In our environment, the nature of our data is a helpful factor here. The data corresponding to individual events--the results of interactions among the accelerated particles that are subsequently observed by the detector--is independent; no interevent relationships occur. Furthermore, individual events have a natural hierarchy. The information read from the detectors--the raw data that forms the bulk of the total information--is processed sequentially and infrequently (once per year or so). Information derived from the raw data can also be divided into several categories, leading to small subsets that are used for physics analysis. It is also possible to divide events into different categories, such as by physics "channel."
      Clearly, these data subsets will overlap and it will be impossible to define a clustering strategy that is ideal for all types of access. However, we believe that it will be possible to improve simple time-based clustering substantially, and that we will often, if not always, be able to use the random-access capabilities of the database to avoid duplicating data in multiple streams.
      Some classes of data will remain permanently disk-resident. This data includes "hot" physics channels such as Higgs-event candidates and metadata. Examples of metadata include detector calibration information, which are required for detector response correction and to determine if an event should be rejected, and event characteristics. The latter, which we call event tags, are expected to be the primary entry point into the data. Users will select collections of events based on high-level characteristics--which could be as simple as "Higgs candidates" or "high Pt selection"--or based on properties such as the number of electron candidates. Each event will be tagged by its properties, and users will be able to select subsets of data based on these properties, define new tags and collections, and so on.
      The tag objects will contain references to the main event data and will allow users to use not only the information that is "copied out" into the tag, but also the full information from the event right down to the raw data. This is a major advantage of a global, database-oriented solution over the file-based solutions traditionally used in HEP. Previously, navigating from the subsamples used for analysis to the full event information was an ad-hoc and time-consuming operation. Using the database approach, we can store not just the data but also the selection criteria used to produce (logical) subsamples, as well as final results, in a consistent, managed fashion.
 
PRODUCTION USAGE
Although our main goal is to find solutions that meet the requirements of the LHC experiments, an early demonstration of the feasibility of such an approach is clearly necessary. Our initial studies used legacy data from the LEP and other HEP experiments. However, in late 1995, we were contacted by a running CERN project called Ceres that had just made the decision to move to C++. This development was thought to be an excellent opportunity for both parties--Ceres was looking for help in converting to C++, and we were looking for people willing to test elements of the proposed strategy.
      During 1996, the Ceres group reimplemented much of its software in C++ using an ODBMS for persistence. Although the total amount of data stored was small--a few tens of gigabytes--it proved that such a system could be used for storing physics data, such as by a demonstration of writing from 32 processes concurrently into the database. This latter point is particularly important for the LHC, where widespread use of parallelism will be essential for managing the required data rates.
      Encouraged by this success, Ceres plans to use the ODBMS-based solution for future data-taking runs, storing some 30TB of data in 1999. Several other experiments have also adopted this solution, including CERN's Compass experiment, which will store over 300TB of data per year from 2000 on at data rates of around 35MB/second, and the BaBar project at the Stanford Linear Accelerator Center in the United States. If all proceeds according to plan, at least 1PB/year of HEP data will be stored in ODBMSs starting in 2000. Arguably, this is even more of a challenge than the 5PB/year expected at LHC, as these projects will not be able to benefit from long-term technology advances.
 
MIXED LANGUAGE SUPPORT AND JAVA
Given our project timeline, perhaps the only certainty is that we had better be prepared for change. In the past, it was possible to choose a computing environment at the beginning of a project and stick with it throughout the lifecycle. This approach has already broken down at CERN. During the construction phase of the LEP, the bulk of the offline computing was performed using Fortran on mainframes. Now, with perhaps three more years of data capture followed by another two years of analysis to go, mainframes are almost history. Considerable changes have had to be made to move from centralized to distributed computing; fortunately, much more development is already being performed in C or C++ than Fortran.
       Given LHC's long timeline and the increased rate of change in the computing industry, even more changes can be expected. Although C++ is currently the programming language of choice, more and more developments are being done in Java. On the database side, it is important to note that many products already offer multiple language bindings, Java being the most recent. To date, our experience with the Java binding has primarily been limited to its use for developing the user interface to database administration tools. Here, any performance arguments are hardly relevant and the standard "write-once, run-anywhere" arguments prevail. We have yet to perform detailed cross-language tests, but it is clear that we need to understand to what extent legacy, persistent C++ objects can be accessed by Java applications. Of course, this is a particularly easy case--the two languages are similar and coexistent. How we can plan for the languages that will inevitably come in the next 25 years is a much harder problem.
 
FUTURE DIRECTIONS
Our future plans can be stated in a single word: production. Although this work has already been used in production by Ceres and other experiments, in the next two years we'll see dramatic growth in the volume of data that will be stored in ODBMSs at CERN. We expect to remain at the order of a few terabytes during 1998, reaching perhaps even 100TB in 1999, and enter full swing in the year 2000; in the meantime, Compass will be storing one-third to one-half a petabyte per year. HEP experiments at other laboratories around the world are expected to store similar amounts of data on the same time scale.
 
Jamie Shiers is the project leader of the RD45 project at CERN. He can be reached at Jamie.Shiers@cern.ch.
 

 
search - home - archives - contacts - site index
 

Copyright © 1998 Miller Freeman Inc. All Rights Reserved
Redistribution without permission is prohibited.

Questions? Comments? We would love to hear from you!