
When you venture into the unknown, you must be prepared for trouble, and the best way to prepare for trouble is to avoid getting into it. If you're a lead climber, the biggest trouble you face is getting hurt from falling a great distance; you protect yourself by securing pitons into the rock face so you won't ever fall too far. If you're Hansel or Gretel, the biggest trouble you face is getting lost, so you leave breadcrumbs behind you (actually, little white stones would have worked better). If you're a DSS analyst, one of the biggest troubles you face is a calculation error. Ideally, like our story book characters, you would protect yourself by leaving dependency crumbs in your decision-support models. The problem is that very few tools let you do this.
The OLAP portion of DSS is often portrayed as a standard process. Data is brought in; dimensions are created. Data is aggregated in advance or on the fly, and end users and analysts are free to slice, dice, and drill their way to analytical query happiness.
Most of those queries are against derived data, which is after all, one of the hallmarks of DSS. End users query derived data, such as regional totals, yearly totals, monthly flow rates, and end-of-statement balances. Of course, derived numbers mean calculated numbers, and these numbers may or may not have been calculated correctly. (Input numbers may also be incorrect, but the reasons why they are incorrect and the ways of detecting them are substantially different from that of incorrect calculations.)
Most of you probably remember having some experience debugging a macro-based spreadsheet model and trying to figure out where the calculations went wrong. OLAP was supposed to make this whole thing easier, and in many ways it has. Calculations that used to require complex spreadsheet macros can be defined in a few lines of an OLAP calculation facility. But OLAP has introduced its own problems. Calculations frequently traverse multiple dimensions and multiple levels of granularity, and they may contain data or position-based conditionals.
So it's ironic that OLAP, which is all about derived data, is so weak when it comes to querying how a derived data value was calculated. Whether you're analyzing sales or costs, calculating customer or store profitability, making projections, or developing promotional campaigns, from the moment you are responsible for the information you pass along to your coworkers, your superiors, or your own decisions, you need to have confidence in and be able to test or query the validity of the derived data with which you are working. Can you imagine going into an executive meeting, presenting your forecasted sales report, being challenged on your forecast, and not being able to say how you got your numbers. much less demonstrate why you believed in them? Amazingly, most OLAP products are mute in this area. Many can't even tell you if a data point was derived, much less how. This a gaping hole in the OLAP space needs to be addressed.
Think about it for a minute. Take a popular, declaratively oriented OLAP product such as Essbase and look at a client view of the data. Then try to answer the question "Which numbers are calculated and which are input?" Except for special circumstances--such as dynamic nonstored data--you can't! (Now it may be the case that other products, such as Express, make it easier to see if a number was input or derived because derived numbers are defined differently, namely in terms of formulas or models instead of variables. But this doesn't always work, and when it comes to figuring out how a derived number was calculated, it can be even more difficult in Express because of its more procedural nature.)
Or look at an obviously aggregated number such as the all-product profit variance in a product profitability model, and ask yourself how the number was calculated. If it's a simple dimensional calculation and you know which dimensional formula took precedence, you can discover it (in some products, such as TM1 or Essbase) by looking at the dimensional outline; but what if its more complicated? What if the calculation was based on interdimensional references? Or what if it was based on runtime dependencies? Then it is most likely based on a calculation script and won't show up by perusing the dimension outline.
If you are the application developer, you can go back and look at the scripting code to figure out how something was calculated, but that process may take you through multiple lines scattered around the script. The experience would be similar to debugging a program. And again, if there are data dependencies, such as a commission rate being based on sales, you will have to look at the data.
But if you're the end user or analyst, you may neither want nor be able to go mucking around the application code. Calculation auditing must to be functionality that you have direct access to as an end user or analyst. So why are OLAP products so deficient in this area?
Data modeling was originally focused on application-independent designs, which generally meant nonderived data. Derivations were application dependent and could be created through views. Most of the work done in the '70s and '80s on data modeling was, again, about modeling raw data. Entity-relationship models, for example, focus on data structure and such things as cardinality relationships between identifiers within different entities, as opposed to processes, whereas derivations are decidedly processes.
In line with this focus on data structuring, most RDBMSs do not even capture formulas as metadata that a user can browse. To follow a calculation, someone would most likely have to look at some form of stored procedure code.
While so many folks were thinking about how to model base data rigorously, another rather large group of people was banging away on data derivations. Of course, we all know who they were: the spreadsheet users. But like the way two kids in a family may identify themselves differently, the spreadsheet crowd operated in a near total absence of any structural model.
In that sense, OLAP is like a hybrid of databases and spreadsheets. (It's no mystery that several OLAP companies, Arbor included, defined themselves as a combination spreadsheet/database--there was even a product that came out on the Mac called Spreadbase.)
Unfortunately, most OLAP vendors did not see the need to offer the kind of auditing capabilities necessary to ensure that the complex calculations they let users define are, themselves, correct. Nor did they let users simply trace their calculations.
At the heart of this OLAP gap are the methods by which tools track dependencies. These methods are generally a core component of the tools' calculation engines. Many tools only track intradimensional dependencies to provide a minimum level of navigation functionality. In other words, there's no way to drill down on a member without knowing the children of that family member. The tools would have to add considerable smarts to be able to track formulaic dependencies because these require runtime, cell-level dependency tracking.
Of course, the OLAP space is highly differentiated and different, vendors would have to perform different amounts of work to solve this problem. The more declarative a vendor's calculation engine is, the easier it is (in principle) to include dependency tracking. Some vendors' engines probably track most or all the required dependency information. They just need to bring the information to the surface. The more procedural the engine is--and the more flexible the tool--the harder it will be to build the dependency tracking without building a transaction log that is used to determine all the changes that have occurred to the data structures during the course of a calculation.
There are five degrees of calculation auditing:
Knowing which cells are calculated vs. input
For those that are calculated, knowing the formulas used to calculate the cell
Knowing the cells used as inputs to the formula
Knowing the cells that depend on the cell
Knowing that the formula was the right one to use, that is, that it calculates what we think it is calculating.
The most basic capability would be knowing whether a cell was input or derived. Because a user may override a formula with a manual entry, to calculate a formula properly, the system would still need to provide for runtime knowledge.
The next most basic capability is knowing what formula was used. If you calculate adjusted income differently depending on the state or province, if you're wondering why your compensation was lower than expected, or if you're trying to analyze the indirect costs imputed to a certain class of product, you need to be able to verify in each case that the correct formula was used. If you are a developer of the tool you are using, then tracking down the right formula may not be too difficult (although a noticeable expense of time). If you are just a user, though, you may not have the expertise or the time to track down the formula.
More advanced auditing capabilities include the knowledge of the actual inputs for the function as well. For example, anytime you want to know the elements that go into "total indirect costs," the "rate" cell referenced by the formula used to calculate a particular commission, or the data values that went into some profitability calculation, you need to be able to see the cells referenced by the formula that calculated the cell of interest. Again, while all OLAP tools are good at providing drill-down from a parent hierarchy member to its children, they generally do not support cell-based dependency drill downs. The one exception that I am aware of is Whitelight's cell auditor. It takes a big step in the right direction by letting you query the input values to a cell. (Actually, Power Thinking Tools'--my first company--product FreeThink, provided this functionality almost 10 years ago, but it is no longer commercially available.) The forthcoming Microsoft OLAP Server promises to allow descriptions of how a measure is aggregated, but not how it is derived if it was not aggregated--nor if any given cell was computed or derived.
Knowing which cells depend on a particular cell isn't necessarily more advanced than knowing the inputs, but some OLAP engines that track dependencies internally only do so in one direction (presumably because they support a pull-based approach to dynamic calculations).
The final capability--and the most difficult--is knowing that the calculation used was the right one, which requires semantic knowledge. In fact, some OLAP tools provide a fair bit of heuristics. They may guess, for example, that members of a measures dimension ought to get calculated last, or that ratios ought to follow from aggregations.
The auditing problem for OLAP tools is compounded when you are working with multiple tools. For example, when you are connecting a data mining tool to an OLAP tool, all the calculation metadata is lost. By the time the predicted values from a mining tool are brought into an OLAP tool, they are treated as input data, when in fact, they are highly derived. Also, when calculated data from an OLAP tool is stored back in a database, all calculation methods are lost. The same also occurs when a data mining tool uses calculated data from an OLAP tool; the mining tool has no record of the calculation methods used by the OLAP tool.
As OLAP becomes increasingly embedded within DSS applications, more end users and analysts will be dependent on these tools both alone and in conjunction with other tools for their day-to-day information. To avoid a market backlash and to let end users and analysts be as responsible for their OLAP-derived data as for their spreadsheet-derived data, vendors need to provide for dimensional auditing and for the intertool sharing of auditing information.
Erik Thomsen is an author, lecturer, researcher, and consultant focusing on OLAP and
decision-support applications. He is cofounder of the Cambridge, Mass.-based consultancy Dimensional
Systems and author of the book OLAP Solutions (John Wiley & Sons, 1997). You can reach him via email
at erik@dimsys.com.