| |
Managing a Universe of Data at CERN
By Jamie Shiers
With its Large Hadron Collider expected to generate
100PB of data, CERN is looking to a modified, off-the-shelf ODBMS to manage a "big bang" of
data
At the European Laboratory for Particle for Physics (better known by its
acronym CERN), we're pushing all kinds of limits. CERN's Large Hadron
Collider (LHC)--which, when completed in 2005, will be the largest
particle accelerator ever built, is designed to further our
understanding of forces and matter. But equally exciting may be our
comparatively unknown efforts to redefine the limits of data storage,
management, and access, considering that the LHC is expected to produce
nearly 100PB of high-energy physics (HEP) data over its operational
lifetime.
In 1995, the RD45 project was
launched by CERN to address these
issues. The project was explicitly directed toward the management of
persistent objects under the assumption that the LHC experiments would
use object-oriented solutions and C++ as an implementation language.
(Not surprisingly, we are now including Java in our investigations; more
on this later.) In addition, the project was directed to focus on
commercial, open-system solutions. Although we considered many
alternatives, including language extensions, so-called lightweight
object managers, and others, we rapidly reached the conclusion that only
a full-fledged ODBMS could come close to meeting our initial
requirements.
We have successfully deployed an off-the-shelf ODBMS together
with application specific enhancements in a production environment.
Using this technology, we anticipate being able to manage the vast
volumes of data that will be generated at the LHC, using essentially the
same amount of labor as was required for a smaller project that produces
three orders of magnitude less data. Although a small amount of code has
been written to facilitate the use of this product, the cost savings
associated through the adoption of object database technology have been
extremely impressive. For example, one of our existing database
applications took approximately 10 person-years to write in Fortran. The
equivalent application based on an ODBMS was implemented last year by a
single graduate student.
BIRTH OF A COLLIDER
CERN was established in Switzerland in the mid-1950s to rekindle the
spirit of fundamental scientific research in war-ravaged Europe. From
the outset, it has been an excellent example of international
collaboration; some 20 nations jointly fund CERN, which is staffed
primarily, but not exclusively, by personnel from these countries.
Over the years, CERN has built several accelerators that have
been reused as infrastructure for subsequent machines. The largest
accelerator currently in operation at CERN is the Large Electron
Positron (LEP) collider, a 17-mile ring located just outside Geneva.
Following the CERN tradition of reuse, the LHC will be housed in the LEP
tunnel and fed by the existing complex of accelerators.
Four major experiments are planned for the LHC. Two of these,
Atlas and CMS, will record about 1PB of data per year at data rates
reaching 100MB/second. A third experiment, Alice, will also store about
1PB/year but record data at even higher rates--up to 1.5GB/second. A
fourth experiment, LHC-B, will produce a total data volume of 5PB per
year. Based on LHC's projected operational lifespan of 20 years, the
total volume of data that is expected to be stored will reach
100PB--approximately twice the amount of what would be on record had one
bit been stored every second since the "Big Bang" 15 billion years ago!
OBJECT DATABASES AND STORAGE
One of the many challenges of a project with such a long timeline is the
uncertainties presented by the evolution of technology. It is not enough
to be blindly optimistic; you must create realistic, conservative
scenarios based on which products are likely to be available, as well as
fallback possibilities should technology not advance as expected. For
example, 25 years ago, many of today's giants in the computing industry
either did not exist or hadn't entered the mainstream--hardware
architectures such as Digital Equipment Corp.'s VAX were yet to be
invented (and replaced), Codasyl databases were all the rage, and ODBMSs
were virtually unknown.
When planning for the future, being too specific is clearly
unrealistic. You shouldn't depend on a given implementation or on
precise characteristics; it is more meaningful to predict whether there
will be a market for a product with certain functionality. For example,
it's hard to believe that the storage market will do anything but grow,
but the exact form factor of disks 10 years from now cannot be
predicted. This analysis suggests that the basic hardware building
blocks necessary to create a multipetabyte store will be available as
well as affordable, although the relative costs of disk versus tape (or
its replacement) are much harder to predict. Indeed, the long-term
market for tape as a medium may simply disappear.
In our storage strategy for RD45, we anticipate the need for a
mass storage system (see sidebar, "Mass Storage Systems") because it is
far from clear that it will be financially possible to store all the
data on disk. Clearely, such a system must be transparently integrated
with the overall data management solution. Latency aside, a user should
not (and need not) be aware if data is stored on disk, tape, or anywhere
else.
CHOOSING AN ODBMS
The first thing to realize about choosing an ODBMS is that there is no
"ideal" product. The available systems differ widely in architecture and
functionality, so the question is more one of matching your requirements
against several potential solutions. In a perfect world, several
alternatives would meet all of the mandatory requirements and you could
select the primary supplier based on noncritical issues. In reality,
this is typically not the case for large projects; requirements must be
prioritized and compromises made.
In CERN's case, the most critical requirement is that of scale.
A solution that will not with a reasonable number of enhancements scale
to meet our data volume needs is out of the question. Clearly, a single
physical database of 100PB is currently impossible, so some form of
distributed database is necessary. This fact raises even more
scalability issues: Even if physical ODBMSs could be as large as
100GB--already very large by today's standards--then one million of such
databases would be required to accommodate 100PB. These databases would
have to appear to the user as a single, logical database with
transparent, cross-database references and a single (ideally
replicated), shared, and hence consistent, set of schema. We describe
below how such features are supported in our preferred ODBMS, namely
Objectivity/DB.
Performance plays a close second to scalability in our ranking
of criteria. This requirement can be expressed in terms of the raw data
rate that must be supported--from 100MB/second for Atlas and CMS to
1.5GB/second for Alice--and the aggregate data rate needed to perform a
typical physics analysis. Of course, absolute performance depends on
many factors, including the layers that reside below the database.
Consequently, we prefer to express performance as a percentage of the
performance of the underlying systems.
FALL-BACK STRATEGY
In order to ensure maximum vendor independence, we adhere to the
programming interfaces defined by the ODMG. In principle, an application
that is built using the ODMG C++ binding can be ported from one
conforming implementation to another simply by recompiling. A data
interchange format is even specified for converting the data. However,
adhering to this spec is unlikely to help us completely avoid vendor
lock-in. The ODMG does not specify implementation, only interface, so
the various products offering various degrees of conformance to the ODMG
bindings have widely differing architectures. Apart from trivial example
programs, production applications will almost certainly be influenced by
the architecture of the product on which they are deployed.
Which is best for your application, object server or page server?
What level of locking granularity is required? How is object clustering
implemented? These questions imply that a quick recompile and conversion
of existing data are far from sufficient should you need to change
vendors. This conversion should not be undertaken lightly, but it can be
necessary and its probability will not reduce over time.
Including an escrow clause into the contract is not necessarily
the answer. (Can you afford to take over the development and support of
the product in-house?) Nevertheless, the need for a fall-back strategy
is clear; you cannot afford to cancel the whole program simply because
of a problem in one area. However, that doesn't necessarily mean that
two given solutions are equal in functionality or performance. The
fall-back solution almost certainly implies reduced functionality,
perhaps a more restrictive computing model, and maybe even a delay in
results.
As discussed previously, the best protection against such problems
is a large, healthy market--in other words, adopt commodity solutions
wherever you can. Unfortunately, the demand for fully distributed,
highly scalable ODBMSs has yet to reach a commodity level, and the HEP
market will probably not be large enough to sustain such products by
itself. However, some of the markets in which ODBMSs are currently
successful, such as the telecomm market, can be assumed to be large
enough to support at least one ODBMS product. Performance and
distribution are also critical in this area, so the better question is:
Can we exploit technology "designed" for (and supported by) this market?
TESTS OF SCALABILITY
As our key requirement is scalability, particularly in terms of data
volume, we designed a series of tests to gauge the suitability of the
Objectivity/DB ODBMS from Objectivity Inc. These tests are described in
detail at the RD45 Web page at
http://wwwinfo.cern.ch/asd/cernlib/rd45/index.shtml.
Objectivity/DB supports distributed databases with shared,
consistent schema; as many as 216 physical databases form a federation
in Objectivity's terminology. Currently, each of these databases maps to
a normal file in the underlying file system. Hence, when using 32-bit
file systems, databases are limited to 2GB, whereas in 64-bit file
systems their size for all practical purposes is unlimited. Databases in
turn comprise containers, each of which is composed of database "pages"
on which the objects themselves are stored. Essentially, the four 16-bit
fields of Objectivity/DB's 64-bit object identifier map to the database
ID, the container ID, the logical page, and the slot on that page.
Our tests hit every documented limit in this architecture (although
not simultaneously) in order to find the undocumented limits. In these
tests, due to limited disk space, the largest federation we were able to
create was a mere 500GB. We were, however, able to demonstrate physical
databases well beyond the 2GB limit from 32-bit file systems and found
that a federation containing the maximum number of databases was indeed
possible. By exploiting both of these potentialities--which remain to be
verified--distributed databases up to the petabyte range are possible.
Unfortunately, as our requirements go far beyond this range, the
architecture should, on paper, scale to at least one order of magnitude
beyond this limit to avoid arbitrary constraints such as filling every
slot on every page, every container up to its limit, and so on. Based on
these requirements, we required three main enhancements to the ODBMS:
full ODMG compliance, including support for STL collection classes;
architectural changes to permit scaling to at least 100PB; and
integration with a mass storage system.
ODBMSs typically store their data in files, using the native
file system on the host machine. In other words, some level of the
containment hierarchy maps to a standard file, albeit with complex
internal structure. Although 64-bit file systems are becoming fairly
widespread, a maximum realistic file size must still be imposed. For
example, if inactive data is stored offline on tape, the basic quantum
that is moved to/from tape should be no larger than a single tape volume
and be recallable within a reasonable period of time (1,000 seconds
would be acceptable, 100 seconds preferable). Although other reasons
exist for limiting the file size, such rule-of-thumb arguments suggest
that a maximum file size of a few gigabytes is reasonable today, 10GB
will be acceptable in a year or so, and perhaps 100GB will be acceptable
in 2005. Using such a maximum file size, it can be argued that at least
two ODBMS architectures permit distributed databases in the petabyte
range. However, as 100PB is our requirement, the architecture should
permit much larger databases to avoid any arbitrary constraints.
Basically, if the maximum file size is limited, there is only
one way to make a very large object database possible: by increasing the
maximum allowable number of files per logical database. A simple way
this goal can be met is by increasing the object identifier. However,
this solution is not necessarily ideal as it has a direct impact on
storage overhead, which can be significant for small objects.
An alternative solution involves logical-physical mapping. For
example, in some systems, one database can be mapped to multiple files
(volumes). In the case of Objectivity/DB, this tactic could be
implemented by mapping containers, rather than databases, to files.
Essentially, as this approach would permit 232 files--if these files are
kept relatively small (a few gigabytes)--it is still architecturally
possible to reach the EB region. In fact, extending the object
identifier theoretically makes it possible to reach the zettabyte--1021
bytes--range!
As for integrating an ODBMS with a mass storage system, suffice
it to say that this integration is performed at the file-system level;
the database issues "standard" I/O calls and the underlying system calls
formerly inactive files that have been staged out to tape. This
approach, of course, requires efficient clustering because the latency
involved in such operations is significant, and you want to avoid
reloading a file to retrieve just a few objects. Thus objects that will
probably be accessed together should be stored together, at least for
the major access patterns.
DATA CLUSTERING
Data clustering is another issue that will clearly become important.
Already complex in its own right, the combination of an object database
with a mass storage system adds further complications. The latency
involved in bringing and object into memory from a tape-resident
database is significantly larger than that of a disk resident or
server-side cached object. Furthermore, while some architectures
transfer individual objects from server to client, this will clearly not
occur directly from tertiary storage; much larger data volumes than a
single object must be moved to and from tape.
A related issue is that of metadata. If much of the data is
tape-resident, maintaining consistent metadata such as schema across
large numbers of databases is essentially impossible. For example, if a
single schema change required every database to be restored to disk,
modified, and flushed out to tape again, the system could never work in
production.
In our environment, the nature of our data is a helpful factor
here. The data corresponding to individual events--the results of
interactions among the accelerated particles that are subsequently
observed by the detector--is independent; no interevent relationships
occur. Furthermore, individual events have a natural hierarchy. The
information read from the detectors--the raw data that forms the bulk of
the total information--is processed sequentially and infrequently (once
per year or so). Information derived from the raw data can also be
divided into several categories, leading to small subsets that are used
for physics analysis. It is also possible to divide events into
different categories, such as by physics "channel."
Clearly, these data subsets will overlap and it will be impossible
to define a clustering strategy that is ideal for all types of access.
However, we believe that it will be possible to improve simple
time-based clustering substantially, and that we will often, if not
always, be able to use the random-access capabilities of the database to
avoid duplicating data in multiple streams.
Some classes of data will remain permanently disk-resident. This
data includes "hot" physics channels such as Higgs-event candidates and
metadata. Examples of metadata include detector calibration information,
which are required for detector response correction and to determine if
an event should be rejected, and event characteristics. The latter,
which we call event tags, are expected to be the primary entry point
into the data. Users will select collections of events based on
high-level characteristics--which could be as simple as "Higgs
candidates" or "high Pt selection"--or based on properties such as the
number of electron candidates. Each event will be tagged by its
properties, and users will be able to select subsets of data based on
these properties, define new tags and collections, and so on.
The tag objects will contain references to the main event data
and will allow users to use not only the information that is "copied
out" into the tag, but also the full information from the event right
down to the raw data. This is a major advantage of a global,
database-oriented solution over the file-based solutions traditionally
used in HEP. Previously, navigating from the subsamples used for
analysis to the full event information was an ad-hoc and time-consuming
operation. Using the database approach, we can store not just the data
but also the selection criteria used to produce (logical) subsamples, as
well as final results, in a consistent, managed fashion.
PRODUCTION USAGE
Although our main goal is to find solutions that meet the requirements
of the LHC experiments, an early demonstration of the feasibility of
such an approach is clearly necessary. Our initial studies used legacy
data from the LEP and other HEP experiments. However, in late 1995, we
were contacted by a running CERN project called Ceres that had just made
the decision to move to C++. This development was thought to be an
excellent opportunity for both parties--Ceres was looking for help in
converting to C++, and we were looking for people willing to test
elements of the proposed strategy.
During 1996, the Ceres group reimplemented much of its software
in C++ using an ODBMS for persistence. Although the total amount of data
stored was small--a few tens of gigabytes--it proved that such a system
could be used for storing physics data, such as by a demonstration of
writing from 32 processes concurrently into the database. This latter
point is particularly important for the LHC, where widespread use of
parallelism will be essential for managing the required data rates.
Encouraged by this success, Ceres plans to use the ODBMS-based
solution for future data-taking runs, storing some 30TB of data in 1999.
Several other experiments have also adopted this solution, including
CERN's Compass experiment, which will store over 300TB of data per year
from 2000 on at data rates of around 35MB/second, and the BaBar project
at the Stanford Linear Accelerator Center in the United States. If all
proceeds according to plan, at least 1PB/year of HEP data will be stored
in ODBMSs starting in 2000. Arguably, this is even more of a challenge
than the 5PB/year expected at LHC, as these projects will not be able to
benefit from long-term technology advances.
MIXED LANGUAGE SUPPORT AND JAVA
Given our project timeline, perhaps the only certainty is that we had
better be prepared for change. In the past, it was possible to choose a
computing environment at the beginning of a project and stick with it
throughout the lifecycle. This approach has already broken down at CERN.
During the construction phase of the LEP, the bulk of the offline
computing was performed using Fortran on mainframes. Now, with perhaps
three more years of data capture followed by another two years of
analysis to go, mainframes are almost history. Considerable changes have
had to be made to move from centralized to distributed computing;
fortunately, much more development is already being performed in C or
C++ than Fortran.
Given LHC's long timeline and the increased rate of change in
the computing industry, even more changes can be expected. Although C++
is currently the programming language of choice, more and more
developments are being done in Java. On the database side, it is
important to note that many products already offer multiple language
bindings, Java being the most recent. To date, our experience with the
Java binding has primarily been limited to its use for developing the
user interface to database administration tools. Here, any performance
arguments are hardly relevant and the standard "write-once,
run-anywhere" arguments prevail. We have yet to perform detailed
cross-language tests, but it is clear that we need to understand to what
extent legacy, persistent C++ objects can be accessed by Java
applications. Of course, this is a particularly easy case--the two
languages are similar and coexistent. How we can plan for the languages
that will inevitably come in the next 25 years is a much harder problem.
FUTURE DIRECTIONS
Our future plans can be stated in a single word: production. Although
this work has already been used in production by Ceres and other
experiments, in the next two years we'll see dramatic growth in the
volume of data that will be stored in ODBMSs at CERN. We expect to
remain at the order of a few terabytes during 1998, reaching perhaps
even 100TB in 1999, and enter full swing in the year 2000; in the
meantime, Compass will be storing one-third to one-half a petabyte per
year. HEP experiments at other laboratories around the world are
expected to store similar amounts of data on the same time scale.
Jamie Shiers is the project leader of the RD45 project at CERN. He can
be reached at Jamie.Shiers@cern.ch.
| |