
Decision SupportErik ThomsenComparing different approaches to OLAP calculations as revealed in benchmarksDarwin Would Be ProudWhether it's genes, sports, or business, we love competition. It's a Darwinian world out there. So it's about time that OLAP vendors have started to compete openly within the framework of the OLAP Council's Analytical Processing Benchmark (APB-1). In this column, I'll briefly describe that benchmark (you can get the full text at www.olapcouncil.org) and use it as a vehicle for highlighting substantial differences between the ways that different OLAP tools work to solve multidimensional problems.The OLAP Council was founded in 1995 by four vendors--IRI Inc. (whose technology is now owned by Oracle), Comshare Inc., Pilot Software Inc., and Arbor Software Corp.--to raise market awareness and define performance and interoperability standards. Now that this market awareness has been raised, interoperability standardization (in the form of APIs) is imminent. Performance standardization is covered by the benchmark, whose disclosure requirements give me the opportunity to unveil the inner workings of OLAP products. You can walk into any computer bookstore and purchase a book on how Oracle, Sybase, or Microsoft SQL Server works. The language constructs used by RDBMS products is public knowledge. RDBMS vendors compete on the basis of performance, price, marketing, and support. The same is not true of OLAP vendors. I suspect that a lot of the reason why OLAP language constructs are not public knowledge has to do with the "elf quotient." For all the end-user appeal of an OLAP solution, the reality is that it takes a lot of behind-the-scenes work for an OLAP tool to perform--work that can look pretty ugly at an application-programming level and that rarely enters into the purchasing organization's decision criteria. Regardless of the reason, the OLAP languages in use by the 40 or so OLAP tool vendors are not made publicly available. (Occasionally, I even have to sign a nondisclosure agreement before getting an evaluation copy of a vendor's software.) And there are substantial differences in the way different tools approach the same problem. One of the benefits of the APB-1 benchmark is that it forces participating vendors to disclose all the code used to build an application. Arbor was the first company to throw down the gauntlet by releasing an audited benchmark result; Oracle followed suit. So now there are two products whose application language is sufficiently in the public domain to merit a comparative discussion. What follows is a comparison of an illustrative section of the Oracle Express and Arbor Essbase code required to perform the benchmark and a few remarks about the usefulness of the benchmark itself. The APB-1 benchmark has six model dimensions: Time, Customer, Product, Channel, Measure, and Scenario. Figure 1 shows the dimensions, levels, and their relative cardinalities according to the diagramming style I call multidimensional structuring (MDS). The most instructive way to highlight the differences between products is to look at how they handle the more complicated forecasting calculations. | ||
| ||
|
The benchmark defines the forecasting requirements as follows: "Calculate total 1996 forecast values by the retailer level of the customer hierarchy as the annualized values of the second half actuals of 1995 increased by 15 percent. Allocate to the months in 1996 based on the 1996 budget. Allocate to the stores from the retailer based on the year-to-date actuals. Allocate to products based on year-to-date budget." Because one goal of the benchmark is to identify real-world business needs, and because it may seem confusing to use budgeted numbers to define forecast numbers, here is a story that is reasonably consistent with the calculations requested. (I say reasonably because I couldn't think of any business justification for using only the second half actuals, given that the entire year's actuals exist, to predict the subsequent year's values.) A company has its managers predict sales on a month-by-month basis for each product for budgeting purposes. These budget-based predictions, called "budget," might be made, say, midway through a fiscal year. Then, as the year gets closer to the end, a revised "forecast" is created at the corporate level for all products over the whole year. These globally forecasted numbers are used to adjust the local budget-based predictions while still respecting the relative differences in local values. The forecasting algorithm then allocates them to the local level as a function of the ratio of the locally budget-based number to the budget aggregate. As I described above, there are three separate allocations. I will focus here on the time- and customer-based allocations. Also, I should mention that the total effort required to perform these calculations is distributed across the tools application logic. Some effort is put into the immediate formula; some is put into dimensional formulas; some is encoded directly into the dimensional structures; and some is dispersed throughout the application. I'll begin by looking at the immediate formulas. The Arbor formula for allocating the year's total forecast to the month (leaf) level of the time hierarchy as a function of monthly variances in the budgeted values looks as follows:
FIX
(Product,@ICHILDREN(Customer),@DESCENDANTS ("1996"),
@DESCENDANTS("PTD's & Variances"))
Forecast
(
Forecast = Forecast->"1996" * (Budget->Chan
Input / Budget->ChanInput->"1996");
)
ENDFIX
The first set of statements sets the range for the formula. It selects a range defined by the top member of the Product dimension (or "Product total") and the top member of the Customer dimension and all its children (or "Customer total and all retailers") for all members below the year 1996 (that is, the quarters and months of 1996) and for all the particular Period-To-Date and Variance time members. No measures or channels are specified in the range, so the Forecast formula ranges over all measures and all channels. The statement within the parentheses is the forecast formula. It says that the value of the data associated with the forecast member (that is, sales forecast, units sold forecast, and so on) is computed to be the value of the forecast member for 1996 times the ratio of the budget value for the Channel input member at this quarter or year of 1996 to the budget value total for 1996. The relevant Oracle formula looks as follows: Limit TIME to descendants using TIME_TIME '1996' _TimePer1 = '1996' FCST = FCST(TIME _Per1) * (BUDGET/BUDGET(TIME _TimePer1)) across PRODCUST_CP Like the Essbase formula, Oracle's Express formula begins by identifying a scope (through the LIMIT command), and then defines the forecasting allocation. Scope, or range specification, is a crucial concept for all OLAP applications. Although the Express formula looks at first glance to require fewer statements to define the scope, this is because the overall scope was specified higher up in the application code with a series of limit commands, one for each dimension. The only limit command that needed to change for this formula was time. The key formula for this stage of allocating forecast values looks very similar between the two products. Now let's look at the formulas for allocating forecast values from the retailer to the store levels based on year-to-date actuals. In Essbase:
FIX(Product, @LEVMBRS(Customer,0),@DESCENDANTS ("1996"),
@DESCENDANTS("PTD's & Variances"),"Units Sold", Cost, "Dollar Sales", Margin)
Forecast
(
Forecast = @PARENTVAL(Customer, Forecast) *
(Actual->"199606YTD" / @PARENTVAL(Customer, Actual->"199606YTD"));
)
ENDFIX
As you can see, the code looks roughly equivalent in complexity. (The scope of measures is reduced to just "Units Sold," "Cost," "Dollar Sales," and "Margin" for this calculation.) Again, the essential part is the allocation formula. Now let's look at the Express formulation of the same problem: limit TIME add '1996' limit CUST to CUST_LEVEL 'Retailer' limit CUST to children using CUST_CUST " The Stores for these retailers. push TIME limit TIME to '1996Ytd' T_RATIO_TC = ACTUAL/ACTUAL(CUST CUST_CUST) pop TIME FCST = FCST(CUST CUST_CUST) * T_RATIO_TC " Compute SmoothedSales and MarginDollars here .. limit CUST add ancestors using CUST_CUST " Get back retailers, CUST_TOP push TIME MEASURE limit TIME keep TIME_LEVEL 'Month' limit MEASURE to 'SmoothedSales' 'MarginDollars' PreAgg_Model FCST rollup FCST over TIME using TIME_TIME pop TIME MEASURE limit MEASURE add 'SmoothedSales' 'MarginDollars' update Here you can see what appear to be some real differences between the two products. As you may have already guessed--and can clearly see from the snippet above--Express takes a largely procedural approach in contrast to Essbase's approach, which is decidedly more declarative. For example, notice the use of the limit command in Express, which serves to identify those members of a dimension that are in the scope. First the member "1996" is added to time, then it is pushed onto a stack while a separate time range "1996YTD" is temporarily added. Later, time is restricted to months. While this procedural approach is powerful and capable of useful optimizations, the application code created with a procedural approach is very hard to analyze for application errors. Notice also that something called PreAgg_Model is invoked for Forecast. This is the name of an Express "model" whose specification can be found at the end of the Express disclosure. A model is a separate method of calculation that operates along dimensions. And while models are powerful in the sense that their compiler can detect and resolve simultaneous equations, this extra step (and metadata object), to compute two measures makes it harder to trace calculations. This Express formula's procedural nature notwithstanding, some of you may be scratching your heads at this point, wondering why it is so much longer than the one from Essbase. If you look closely you will notice that the Express formula uses a different technique for calculating "SmoothedSales" and "MarginDollars" than it uses for calculating the other variables, whereas Essbase uses the same technique for all variables. The reason for this discrepancy is that the OLAP benchmark is ambiguous in a number of places. In directing an implementation to "calculate ... forecast values," it does not specify, for example, how dimensional precedence is to be resolved. Should margins be forecast by the same aggregation/allocation method as Dollar Sales, or should they be computed at the end as Dollar Sales minus Cost? You obtain different numbers depending on the method. Yet the very meaning of the forecast margin is its method of calculation. Essbase hides some of its complexity in the "PTDs and Variances" hierarchy of the time dimension, which controls the calculation of year-to-date aggregations and year-to-year and period-to-period comparisons. In Essbase, notions of cumulative and rolling aggregates are encoded into the dimension along which the range function is implemented, which is usually (but not always) the time dimension. The result of Essbase's approach is a substantial number of cells computed that contain irrelevant or meaningless data. For the purposes of the benchmark, year-to-date shipping cost, production cost, inventory, and smoothed sales are not called for, nor are period-to-period or year-to-year comparisons thereof; yet storage and time are taken up for them. And while they are expressed relatively cleanly here, a time dimension that incorporates multiple time hierarchies (calendar and fiscal time, for example) would have a corresponding growth in structural complexity and unnecessary calculation. Express provides very fine control over how computations are carried out, but it also demands a lot of thought on the part of the application developer. Examining the entire forecasting code requires tracing a set of limit steps both for the current stage of computation and for prior stages. Additionally, there are a large number of different places where you can find information about how an Express implementation works, such as the programs that define metadata structures and formulas, the programs that maintain metadata and perform calculations, and the models. Attempting to audit a particular forecast number requires scanning quite a bit of code. The OLAP benchmark, while a useful first step toward providing a public set of metrics for comparing OLAP products, needs to be differentiated to reflect the hurdles that are specific to different industries and functions. For example, the benchmark assumes a certain cardinality ratio between the different dimensions and thus could not be used for comparing how products handle very large dimensions, as might be the case in customer analyses for banking, insurance, and telecommunications companies. Nor does the benchmark count the initial load and calculation time. While this may be acceptable when the database is refreshed on a monthly basis, all time counts for applications that are built on the fly. Erik Thomsen is an author, lecturer, researcher, and consultant focusing on OLAP and decision-support applications. He is cofounder of the Cambridge, Mass.-based consultancy Dimensional Systems. You can reach him via email at erik@dimsys.com.
|