VLDB Vision

Is Your Data Warehouse Headed For VLDB Trouble?

Richard Winter

How to tell--and what to do about it

We have all been "up the creek without a paddle" at one time or another, and most of us have at least once looked at our companions and said something like, "Do you hear a roaring sound that seems to be getting louder?"

Unless you handle the VLDB issues skillfully, a big data warehouse project is like a waterfall, and if you need to take action or get help, you want to be sure to do that before the roaring gets too loud.

Many database professionals I encounter are unfamiliar with how fast the momentum builds in the later stages of the canoe trip, and they ask the important VLDB questions too late. In this column I am offering a test you can take to help you determine whether you are reasonably on course--do you have that metaphorical paddle, or are you too close to the waterfall?

WHAT YOU NEED TO MANAGE THE VLDB ASPECT OF A LARGE DATA WAREHOUSE

When I look at a large-scale data warehouse project from the engineering management perspective, I always ask myself the same question: If I had to call the shots, what information would I need? I am going to refer to these items as the crucial VLDB engineering inputs.

To deal with performance and availability issues, you need to know:

• Response time requirements

• Throughput requirements

• Data freshness requirements

• Data availability requirements (under what circumstances, if any, can the data be unavailable for use)

• Operating schedule (for example, what are the batch windows)

• A database concept

• An estimated workload.

To deal with scalability issues, you need to know how the above items are likely to change over the timeframe of the scaling plan (usually three to five years). It's a good idea to look at the highest likely, lowest likely, and most likely projections in the scaling analysis.

Under some circumstances, you need to look at extreme high-growth scenarios as well. For example, suppose the data warehouse is going to support an Internet offering that is most likely to have 1,000 concurrent users in three years. Suppose the lowest likely number of concurrent users in three years is 100, and the highest likely number is 2,500. But the business planners believe there is 0.1 probability that the service will be a smash and attract 5,000 concurrent users, and they want to be sure they can handle that number if they should have the good fortune to get there. In my view, it usually pays to do at least some examination and analysis of the upside success scenarios. I would consider 5,000 concurrent users an extreme growth scenario and carry it along, at least for the early stages of the analysis, which are not costly. At some point you will face questions of whether to devote major resources to testing the extreme growth scenarios. Then you can engage the business sponsor in the question of whether these scenarios are likely enough to warrant resource-intensive testing at that stage of the project. If the answer comes out "no," that's fine, but with this approach you won't develop yourself or your client out of a big success before you need to. This kind of question can be important, for example, in regard to whether you consider MPPs and SMP clusters or whether you zero in on SMP and NUMA only.

THE WINTER "CAN I STAY OUT OF TROUBLE HERE?" TEST

Ask yourself:

1. Do I have the items mentioned before?

2. Under the likely scenarios, do I have a rational plan to meet the requirements?

3. Is management signed up to provide the resources so we can implement the plan?

4. If the answer to number 3 is "no," can I convince management to sign up in time to avoid compromising the business value of the data warehouse?

Apply this test as a rough guide to any data warehouse involving more than 100GB of data. In NT, you should apply it to any data warehouse with more than 10GB. Apply it early and often to any data warehouse project anywhere near the frontier (see my column "The VLDB Death Zone," June 1998).

To feel confident that your data warehouse project will meet its VLDB challenge, you should be able to answer both 1 and 2 and either 3 or 4 with a "yes." Think of these questions as similar to the question (before jumping), "Do I believe this bungee cord is up to the job and firmly attached?" In other words, don't kid yourself.

And don't forget: When you answer "yes, I have the requirements" to number 1, there must be some reality to those requirements--that is, they must tie in to business needs.

WHY IS THIS TEST IMPORTANT?

You need the crucial VLDB engineering inputs to do any of the following on a rational basis for a VLDB warehouse:

• Select a platform

• Configure a system

• Create a physical database design

• Perform a benchmark to assist in any of the above

• Design applications or queries

• Select end-user tools

• Test the performance of a pilot system

• Test the performance of a production system

• If you skip all the above, figure out what to do when your production system hits its first VLDB crisis.

If you don't have the crucial VLDB engineering inputs, then you have to fake it and just guess about the platform, the configuration, the design--what to do as the roaring gets louder--and what to do when the system won't perform.

There is a trap here. In a garden-variety database project with a normal-sized scale it works just fine to fake it--with an ordinary-sized database, platform selection and configuration is mostly guesswork. And who cares? If you get it wrong, there isn't all that much at stake. Did you choose the wrong size server for your 1GB database? You can fix the problem for $10,000 or less.

Some lost souls, having faked it on these issues for years, take their canoes down the VLDB river and expect the same approach to work. These are the ones we must pray for: We pray they will notice that roaring sound before it gets too loud--maybe even that some will read this column and try the test.

WHEN DO YOU APPLY THE TEST?

My list of situations for which you need this information answers this question. The smartest thing to do is apply the test over and over as you face each new VLDB challenge--and it's best to start early.

In serious platform selection situations where millions (sometimes billions) of dollars of business benefits are at stake, I insist on obtaining the crucial VLDB engineering inputs before selection--that is really the right time to create the first version of them anyway. But if your project is already in progress, it's never too late to start.

WHAT IF THE INPUTS ARE WRONG?

If someone is driving toward a VLDB solution using a concrete set of engineering inputs, then the project team will tend to focus on the right issues, help refine the inputs until they are reasonable, and find the solutions before it's too late. However, when you first begin the process, you must be willing to work with rough estimates and rough concepts of what the database might ultimately look like. These early inputs are rough estimates that are subject to change. Horrible as this may sound, you could be wrong, and others may know that if you advertise your early inputs. But in my experience, the willingness to work with rough inputs--inputs that you will probably revise--is the critical success factor.

If you are part of the senior management of a data warehouse program, keep in mind that it helps to create an environment in which people estimate the engineering inputs without fear of criticism if they turn out to be wrong. The idea is to get the best estimate as early as you can, base your early decisions on it, then revise it as better information becomes available--and manage your implementation with an appreciation that these estimates will change.

WHAT IF YOU CAN'T GET THE INPUTS?

This is the hard part. In many organizations, it seems impossible to the data warehouse architects and DBAs to get these inputs. No one offers requirements, no one has a concrete idea of how to get them, and no one believes it is possible to get them. Part of the difficulty is that if the data warehouse program has no real business direction, then there are no requirements. To use the old analogy, if your ship has no destination, any compass setting will do. (My February 1997 column deals with this topic of how to establish business direction for the large-scale data warehouse.)

If you can't get the inputs, you have only two options:

• Establish a real business direction for the program. This approach gets the best results; it leads to a situation in which you can organically develop all the VLDB engineering inputs.

• If there is some business context, assume reasonable values for the inputs and play them back to the business audience along with some likely scenarios for how they will use the system. With a certain amount of skill and luck, you can do this successfully, and it's far better to do an analysis based on assumed inputs than to guess at the answers.

With both approaches, it is crucial that you repeat the process of refining the inputs, then refining analysis or design as the data warehouse progresses until you feel the inputs are reasonably sound. My recommendation is to pursue the first option. If you can't, go for a brief period with the second. When it becomes obvious to all involved that you need to know more about the business needs, use that experience to lead your organization back to the first option. That is, build support for the development of better business requirements so your organization can make better engineering decisions.

The bottom line is that there is always a way to come up with some set of inputs to guide the VLDB engineering process. The more basis they have in a business reality the better. Even a set of assumptions about the requirements shared by the technical team is a better basis than sheer guesswork. At the very least, the technical team can assume values, publish them, and invite all parties to comment.

WHAT IF YOU FAIL THE TEST?

If you fail the test, you are in the canoe without a paddle, and the current is going to carry you toward the waterfall. The longer you wait, the faster the current will be and the harder it will be to avoid real trouble. Of course, of the hundreds who go over the waterfall each year, a few survive. But for most, it's an awful conclusion to their data warehousing days.

And when are you in trouble? You are headed for trouble in a data warehouse more than 100GB (more than 10GB if you are using NT) if you fail the test during platform selection. You should choose your platform selection approach based on these engineering inputs and an appraisal of how close you are to the frontier.

If you fail the test after platform selection, you are in trouble now! Listen: Perhaps you can hear the roar of the waterfall in the distance!

If you fail the test, my advice is to go get yourself a paddle; one way or another, get those engineering inputs established, then start working on a plan that meets the requirements. As you progress, you can always improve your knowledge of the requirements. As usage builds up, you have the opportunity to improve your requirements knowledge greatly.

There is always a lot of uncertainty in data warehousing, and you will never have perfect requirements. You are much better off engineering to a set of requirements that involve some guesswork but are advertised than you are leaving the requirements issues unresolved. If you fail the test, you must get the requirements, develop an engineering plan, and build management support.

Get the requirements. The crucial VLDB engineering inputs--the requirements--are the answer to question number 1 in my test. If you don't have them, you are headed for trouble. You must get them.

Develop an engineering plan. Once you have them, then what do you do? You must construct a rational plan for how to meet them. You must have an argument--one that you believe--that you will meet the performance, scaling, and availability goals. The key to that is a combination of testing, measurement, and quantitative design. As the project progresses through its stages, you need to strengthen your argument with better measurements and better proofs. By building the plan, the concrete arguments, and the proofs, you will find you have the ammunition to develop the third major element of success: management support.

Build management support. When you have a concrete plan, you can usually get management support. If you go to management and say, "I am afraid our data warehouse is not going to perform; let's spend three million dollars to upgrade it," you are not likely to get much support. But when you have requirements tied to business needs, an analysis, benchmark tests, and a timetable that shows when the upgrade needs to be in place, then you have something that compels management attention.

In future columns, I will say more about how to accomplish these steps as well as more about what the crucial VLDB engineering inputs look like. In the meantime, if you use your own experience in these matters, you will be far better off than if you simply drift along in the canoe.

Take the test now. If you find your project is lacking with respect to VLDB requirements, a VLDB engineering plan, or management support, then go after these things. You are going to need them to be successful.
 

Richard Winter is a specialist in large database technology and implementation and president of Boston-based Winter Corp. You can reach him via email at richard.winter@wintercorp.com or by fax at (617) 338-4499.
 


 
search - home - archives - contacts - site index
 

Copyright © 1998 Miller Freeman Inc. All Rights Reserved
Redistribution without permission is prohibited.

Questions? Comments? We would love to hear from you!