
Data marts are a bit like children. Sometimes they come along whether those in charge of the warehouse program plan them or not.
Once you have a data mart in the "family," it's pretty much a permanent thing. And as they do with children, people get quite attached to their data marts and after a while really can't picture life without them.
Data marts, like children, have certain needs that must be met. For example, they have to be fed daily, and they all seem to need attention at the same time. But they can also contribute to the "family." Just as children can take responsibility for certain chores, data marts can handle some query, analysis, and reporting needs, which can relieve the load on the enterprise data warehouse.
Also like children, data marts develop and change in unpredictable ways. They are sometimes demanding and unruly. If their needs are ignored, they can cause real trouble--for themselves and the entire family. Even if things are going smoothly, the collective growth and changing needs of a half dozen data marts (or children) can result in dramatic management challenges every few months.
There is a certain amount of flexibility in how many data marts you can have and still make things work, a bit like the number of children in a family. If you have a small number of data marts, adding one more may not change things much. But if your family infrastructure is only equipped to meet the needs of three children and you "drift" into having 20, there are going to be some problems. It's like that with data marts, too.
As fascinating as the analogy to children may be, at least to those of us who are involved with both children and data marts, let's focus for a moment on data marts per se. The issue here, which seems to be confronting more large-scale warehouse users every month, is the way data marts just keep popping up. It is a remarkable phenomenon in a large, information-intensive company: Somebody in the company wants a new one every month!
This demand quickly leads to issues of infrastructure, architecture, and policy. In fact, if you believe in "Winter's Theory of Acceleration" that every successful information technology phenomenon accelerates beyond the point where it can be rationally understood, you will agree that soon we will have the "Data Mart of the Week" followed by the "Data Mart du Jour."
Perhaps this is an exaggeration, but you can find real companies today trying to figure out just how many dozens of good-sized data marts are likely to exist in their organizations in two to three years.
Organizations of all sizes are struggling with data mart issues, but VLDB users, as usual, get the fun of dealing with a few special issues:
The data marts tend to be big enough to have VLDB issues of their own. A lot of data marts are created to get away from VLDB issues. The idea is to extract only the data needed for a single, narrowly focused application. Ordinarily, that means data mart size is no big deal. But if you have data on 25 million customers and a few billion transactions in the enterprise warehouse, the marts can get mighty big.
When there are lots of marts, or even just a few big ones, the demands of the daily cycle at the enterprise level can quickly escalate to an unmanageable level. Think about what happens to the enterprise warehouse every night in most companies now: daily data transformations, updates, and refreshes at the enterprise level; backups; extracts for the marts; and downloads. Often, the same constricted (or virtually nonexistent) nightly batch window must accommodate enterprise reporting, analysis, or data mining. Few users expect that extracts and downloads for the marts will grow to comprise a really large workload, but once those marts catch on, that workload grows by leaps and bounds--fueled by end users flushed with the success of their first-generation data mart triumphs.
Problems of independent marts are exacerbated. Just as mountain climbing becomes more dangerous at higher altitudes, independent data marts--a problem at any size--are a particularly nasty phenomenon on a large scale. Independent data marts get their data directly from the source systems rather than from the enterprise warehouse. They usually get created for historical or organizational reasons. They exist in many places and, once you build them, it is extremely difficult to migrate them to feed off the enterprise warehouse. Large-scale independent marts are worse because their variances from any corporate standards for data are costly, time-consuming, and often impractical to correct, and the impact of those variances on the organization is often large.
The volume of metadata and the number of data sources are often large and the collective rate of change is high. Sure, there are a few enterprise data warehouses that just contain a handful of giant tables, but many on the VLDB scene also contain a few thousand other, more normal-sized tables as well. Suppose there are a mere 500 source files feeding the enterprise warehouse and that source files average 50 fields per record; then you have 25,000 source data fields. If you change the source applications on average only once a year and the yearly change affects on average only 10 percent of the data fields, that is 2,500 changes a year, or still about 50 changes per business week. Many large-scale data warehouse operations cope with much higher rates of change in the semantics of the inbound data. Some of these changes are "transformed away" on the inbound side of the enterprise warehouse. But others cannot be and become visible to the dependent data marts. In any case, the relationship between tens of daily changes in source data and dozens of changing data marts is a complex matter.
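The arithmetic behind that estimate is easy to check. The short sketch below uses the figures from the example (500 source files, 50 fields per record, 10 percent of fields changing per year). The figure of 50 business weeks per year is an assumption made here so the division comes out as in the text, and the variable names are purely illustrative.

```python
# Figures taken from the example in the text.
source_files = 500
fields_per_record = 50
percent_fields_changed_per_year = 10

# Assumption (not stated in the text): about 50 business weeks per year.
business_weeks_per_year = 50

# Total number of source data fields feeding the warehouse.
total_fields = source_files * fields_per_record

# Fields affected by change over a year, then spread across the business weeks.
changes_per_year = total_fields * percent_fields_changed_per_year // 100
changes_per_week = changes_per_year // business_weeks_per_year

print(total_fields, changes_per_year, changes_per_week)
```

Doubling either the number of source files or the share of fields that change each year doubles the weekly figure, which is why operations with more sources see much higher rates of change.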
Beyond that, it is an enormous problem to provide data mart designers enough information and support concerning the data in the enterprise data warehouse. This problem is exacerbated by the rate of change but is rooted in the sheer volume of information.
The enterprise data warehouse managers still own a lot of "data mart" problems. This takes me back to the children analogy; it's like having teenage or college-age children. Just because they are old enough to go out in the world and get themselves involved in very "adult" trouble doesn't mean they have an adult's knowledge and resources for getting out of it. When teenage kids get in trouble, their parents usually are, or should be, involved in dealing with it.
Similarly, when a data mart project gets in trouble, the enterprise warehouse managers usually get involved. For one thing, they control the environment and infrastructure within which the mart lives. Some issues, concerning data quality, integration, or timeliness, inevitably lead back to the enterprise warehouse. Others, such as data mart performance, usually end up there as a matter of expertise, if nothing else.
While a data mart program does allow users a certain measure of control and independence, it doesn't eliminate the need for ongoing support. Most crises lead back to the office of the enterprise warehouse manager.
The answer, of course, is organization and infrastructure. A best seller in my parents' day, Cheaper by the Dozen, is an entertaining memoir about life in a family in which there actually were a dozen children. According to the book, it was a happy, eventful family life in which things worked rather well. Even with the immense turbulence that 12 growing children can create, all the important needs were met, albeit with a great deal of flexibility and tolerance for the unexpected. And it happened, in part, through superb organization and infrastructure. That is what you need to create, feed, support, and manage even a half dozen data marts.
The most fundamental aspect of infrastructure in a world of many data marts has to do with the process of creating them.
There has to be a well-organized process whereby users propose and gain approval for the creation of data marts. They must develop a business case, a concept, and a direction for the data mart because a data mart is a long-term claim on the central information technology resources of an organization. It has a long-term impact on the health and feeding of other data marts--and on the quality and integrity of information in use across the organization for decision-making purposes.
You need balance here: Users create data marts partly in response to compelling business needs and partly because they legitimately want independence and control over the systems on which they depend for business results. Too much central control defeats the purpose of a data mart program, which is to give users a measure of independence in applying decision-support capabilities to their business objectives. Too much control loads users with baggage they resent and pushes them in the direction of completely independent data mart efforts, a nightmare for the organization. On the other hand, too little central control exacerbates the data mart problems described above.
An important issue, as I noted previously, is that data marts usually stick around once they have been created. So it's important to address certain questions before the "point of no return." This is sometimes difficult, because it is often not what the user wants in the heat of the moment. But here are the questions that really ought to be considered:
Could this need best be addressed by an existing data mart?
Could it best be addressed by direct access to the enterprise warehouse?
Could it best be addressed by a "virtual" data mart? (I'll explain this in a moment.)
Is it really best addressed by the creation of a new, physical data mart?
A "virtual" data mart is a set of views defined on the enterprise warehouse that appear to the user as a separate, self-contained database organized for one specific purpose. When sufficient performance can be obtained with a virtual data mart, this may be advantageous, as:
It eliminates the need for an additional copy of the data.
It eliminates the need for daily extracts and downloads.
It eliminates the need to acquire and operate a separate system.
It provides the opportunity to share load across multiple virtual data marts.
It is much easier to handle many types of changes to the virtual data mart, as long as the enterprise warehouse is able to deliver the necessary performance and scalability. As an example, providing access to an additional table is simply a matter of defining a new view--no changes to the extract process, download process or network workload are necessary.
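To make the idea of a virtual data mart concrete, here is a minimal sketch using Python's built-in sqlite3 module, with an in-memory database standing in for the enterprise warehouse. The table names, view names, and sample rows are all hypothetical, and a real warehouse would run on a very different platform; the point is only that the "mart" consists of view definitions rather than a second copy of the data.

```python
import sqlite3

# In-memory database standing in for the enterprise warehouse (illustrative only).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE transactions (
        txn_id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(customer_id),
        amount REAL
    );
    INSERT INTO customers VALUES (1, 'East'), (2, 'West');
    INSERT INTO transactions VALUES (10, 1, 100.0), (11, 1, 50.0), (12, 2, 75.0);

    -- The "virtual data mart": a view over warehouse tables, no extract or copy.
    CREATE VIEW mart_sales_by_region AS
        SELECT c.region AS region, SUM(t.amount) AS total_sales
        FROM transactions AS t
        JOIN customers AS c ON c.customer_id = t.customer_id
        GROUP BY c.region;
""")

# Giving the mart access to one more table is just one more view definition.
conn.execute(
    "CREATE VIEW mart_customers AS SELECT customer_id, region FROM customers"
)

# Mart users query the views as if they were a separate, purpose-built database.
sales = conn.execute(
    "SELECT region, total_sales FROM mart_sales_by_region ORDER BY region"
).fetchall()
print(sales)
```

Every query against these views runs on the warehouse itself, so the approach holds up only as long as the warehouse can deliver the necessary performance.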
And why does a new data mart affect others? Because there is a finite amount of capacity, over any given period of time, to extract data from source systems, transform it, load it into the warehouse, extract it from the warehouse, and feed it to the data marts. Each of these links in the pipeline is, in principle, scalable to meet all the needs of the organization. Certainly, the manufacturers of data warehouse products are always working to come closer to that ideal, and there is always some elasticity to the limits. But the organization will only pay for so much of it, even if there are no technical obstacles to further scaling.
And this is another point on which it is a bit different for the VLDB scene: If all the new data marts on the table are small operations, you really don't have to worry much about exceeding practical limits on capacity. But if every department wants its own copy of the 10 billion-row transaction table, updated daily, then maybe you do have to think through the needs, priorities, and potentials for sharing.
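A rough back-of-the-envelope calculation shows why the large case deserves that thought. In the sketch below, only the 10 billion-row table comes from the text. The row width, the sustained extract throughput, the six-hour nightly window, and the worst-case assumption that each mart's copy is fully refreshed one after another are all hypothetical values chosen for illustration.

```python
def refresh_hours(rows, bytes_per_row, marts, throughput_mb_per_s):
    """Hours needed to ship a full copy of the table to each mart in turn."""
    total_bytes = rows * bytes_per_row * marts
    seconds = total_bytes / (throughput_mb_per_s * 1_000_000)
    return seconds / 3600


# From the text: a 10 billion-row transaction table.
ROWS = 10_000_000_000

# Hypothetical assumptions, not taken from the text.
BYTES_PER_ROW = 100          # assumed average row width
THROUGHPUT_MB_PER_S = 100    # assumed sustained extract-and-load rate
WINDOW_HOURS = 6             # assumed nightly batch window

for marts in (1, 2, 3):
    hours = refresh_hours(ROWS, BYTES_PER_ROW, marts, THROUGHPUT_MB_PER_S)
    fits = hours <= WINDOW_HOURS
    print(f"{marts} mart(s): {hours:.1f} hours, fits in window: {fits}")
```

Under these assumed numbers a third full copy no longer fits in the window. Incremental updates would change the figures considerably, but the total still grows with every mart that wants its own copy.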
Finally, decision makers need to be aware that there are limits related to the human resources involved as well. If too many departments get the green light to go build a data mart at the same time, the enterprise organization will not be able to support them.
Designing a data mart well requires a good job of sourcing the data. You must understand the data requirements and the sources below the surface: For the intended uses, what do you need in the way of timeliness, precision, and 20 other characteristics of the data? If you need sales figures, do you need international sales as well as domestic? Do you need sales figures net of returns and discounts or do you need the gross figures? The questions go on and on. This is part of the reason people have data marts. But they end up with the wrong data in them if they can't get good enough support in designing them.
You can't just give them all access to the enterprise warehouse and turn them loose. Even if they are going to run on their own copies of the data, most users need some help if they are going to achieve their goals. One view is that data mart developers should not have to know anything about the data sources or even the enterprise view of the data. They should simply be able to specify what they want delivered and have an enterprise organization source the data, design the extract process, implement it, and deliver the information called for on the schedule. Such an approach would free data mart personnel to focus on their business objectives and should facilitate business progress. But it means in turn that there is a central service with a limited capacity over any brief period of time. That means data mart projects must be prioritized--and some may not qualify for the available resources in any given quarter or year.
Executives planning data mart programs need to aim for an environment in which creating a data mart is like a capital investment. It makes a claim on scarce resources in the organization; it requires a business case; it calls for some consideration of whether the best approach is being chosen. Then after an orderly process in which approval to proceed is obtained, there needs to be some control to provide the service levels on which all data marts--and all other users of the enterprise warehouse--depend. The best time to shape direction and avoid major errors is before the mart has actually been created--when it is being planned and proposed. And all these considerations are far more significant when the scale of the mart--in data volume, workload, or update volume--is large.
With this approach, data marts ought to come into the world with what they need to succeed and develop. And that is good data warehouse family planning.
Richard Winter is a specialist in large database technology and implementation and president of Boston-based Winter Corp. You can reach him via email at richard.winter@wintercorp.com or by fax at (617) 338-4499.