Modeling History for the Data Warehouse

Tim Quinlan, Roger Gilbert, and Matthew Ferguson

At long last--a model that standardizes the creation and maintenance of history tables in the data warehouse

As database design steadily becomes more information oriented, designers are building more generalized data warehouses. Users addressing trend analysis and pattern identification issues expect this capability, which means they want to be able to access data history over prolonged timeframes. The cost of data has fallen quickly, and it's better to extend data structure timeframes than to archive. To meet business needs, designers have to build a standardized data history model.

The Reasons Behind Standards

Why is it beneficial to have a database design standard? There are significant benefits, for example, in defining an optimal logical and physical design for each class of history and making it a standard. Physical design standards let you prepare and reuse documentation, as well as provide significant improvement in a data project's time to market. Additionally, standards provide easy data access. Your goal is to provide a logically sound and open model collection for capturing information on each class of history required by your business. When you implement each class of history as a physical design for specific data mart or data warehouse development projects, users expect that it will operate in the same fashion as every other member of that class. Such activity reduces training time spent learning a new set of databases. The result is rapid database system delivery that business clients will likely exploit successfully.

The business community seems to require only two types of history: A chronologically defined set of changes to an entity presumed to have a constant identity, such as customer or product, and a chronologically indexed event collection, such as sales or shipments. Each type is significantly different, particularly when you look at them with respect to related, postnormalization entities.

The Logical Model Phase

We'll analyze modeling historical data for the data warehouse by viewing it from a logical modeling perspective. It's common practice to leave history discussions to the physical design stage, but there are several reasons for addressing the logical modeling phase.

First, there are generally a number of people involved in this stage-- the data administrator, the database administrator and most important, business users. The latter will actually transform the data into information and then knowledge. At this stage, addressing history ensures DA task involvement.

Second, there's a better chance for reuse and data extension. You can reuse the logical model for other implementations of the same data and extend it to handle future business opportunities. You can also reuse the history template you develop for other history implementations. This type of modeling uses entity-relational (ER) diagramming techniques, and you can use such modeling to model the data warehouse. This is also the basis for data mart modeling and implementation. It's important to synchronize the data mart star schema models with the ER diagrams so future revisions are consistent.

You don't have to add historical entities to the logical data model; of course, you can do so to preserve the logical model's clarity. But we're considering history at an early stage of modeling, and the business perspective is crucial. By examining history at this stage, we're looking for significant advantages in capacity planning without forcing notation and model management changes on the architects and data modelers.

Two History Types

The two most important types of logical model history are event data and reference data. There's history built in event data history because there's generally a transaction date associated with an event. An example of event data history is an invoice with an invoice date as an attribute. An invoice history involves invoices collected over a specified period of time. You can also refer to these as transactions in an operational environment or Facts in a Star schema model.

You have to treat enduring reference data differently. We view the history of reference data itself from two perspectives. History for an entity means keeping track of all instances, past and present. It also means tracking and recording all changes to an instance.

History Meets the Logical Model

Figure 1 introduces a sample logical model. The model is simplified and consists of two entity types: Manufacturing Site (Mfg_Site) and Product. The model diagrams a business rule in which each Mfg_Site is the source for zero or more Products and you source each Product from only one Mfg_Site. This cardinality in Figure 1 (one to many) doesn't represent the fact that a Product may change Mfg Sites; over time, a Product may associate with more than one Mfg Site. The cardinality represents the Mfg_Site's potential Products.



Figure 1. The logical model.

In Figure 2, we introduce History to our model through two new entities: History for Manufacturing Site (Mfg Site History) and Product History. Mfg Site History contains all the site's changes or versions of an Mfg Site instance. Product History represents the same for a Product instance. The Mfg Site History key can't be the Mfg Site ID alone, because there may be more than one instance of history for each Mfg Site ID. When you analyze history, you consider other candidate attributes at the logical modeling stage: Effective Date (in these examples, Date refers to Date as well as Time), Expiry Date, Version Number, and Current Previous Indicator. We've only included Effective Date and Expiry Date in the logical model because Version Number and Current Previous Indicator are derivable; it's possible for you to add them when it comes to designing and implementing the physical model. The Effective Date and Expiry Date, however, are required to place the Mfg Site History instance in time context. One candidate key for Mfg Site History is the combination of the Mfg Site ID and Effective Date, but we've chosen to add a surrogate (meaningless) key called Mfg Site History ID to identify Mfg Site History.

 


Sample Values

MFG_SITE
Mfg_Site_Hist_ID Mfg_Site_ID Short_Name Description Effective_Dt Expiry_Dt
1 15 AA sitex 1990-01-31 1997-11-30
2 16 BB sitey 1995-02-03 1999-12-31
3 15 AA sitex2 1997-12-01 9999-12-31
PRODUCT
Product_Hist_ID Product_ID Description Effective_Dt Expiry_Dt Fk_Mfg_Site_Hist_ID
7 17 GGG 1990-01-31 1995-10-31 1
8 18 HHH 1995-02-03 9999-12-31 1
9 19 MMM 1998-01-20 9999-12-31 3
10 17 GGG2 1995-11-01 9999-12-31 1

Figure 2. Implementing current and historical data in the same table.

Figure 2 shows no direct relationship between Mfg Site History and Product History. To relate a Product History instance with the particular Mfg Site History instance, you can use the Mfg Site ID on both tables combined with the date you want to analyze. The date must fall between the historical entity types' Effective and Expiry Dates. You may want to implement the Mfg Site History and Product History association at a later time, depending on the business's access requirements.

Accessing Historical Data for Business Benefit

Up to this point, we've concentrated on storing historical data with integrity. However, it's one thing to get the data into the database but a completely different thing to get the data out quickly and consistently. You need to understand the user requirements in terms of access.

Accessing the current data is typically more important to the business, and the model supports this. The business needs answers to the complex questions the model supports. You have to build these models and systems so that users can directly access the data.

Through business discussions, you'll find that users are more concerned with current data. However, if you meet users' business needs, historical data requirements will emerge, often deriving from surprising places. When in doubt, store history, because it'll be required. It's extremely difficult, time-consuming, and costly to build history when you have to analyze historical data immediately. You could lose your opportunity to do so by the time the data group gathers the tapes from an offsite location, deals with changed file formats over time, develops the tables, loads the data, teaches the user how to access the data, and answers users' questions.

And such user questions are varied and unpredictable. How many sales involve a specific product? How many sales have involved a specific product historically, using different timeframes (month, week, quarter)? How do weekends and holidays affect the sales of specific products? Will Mfg Sites produce products to handle future sales? How is Mfg Site use changing over time? Answers to these questions may highlight requirements that you can document and use to test your new model.

Does Figure 1 address all of the above questions? Yes, though you need to add more detailed capacity information to the model in order to deal with this information, which can be left for the modeling's physical design stage.

You haven't completed your job yet. Once you test the model in this manner, you need to do several more things to help the business answer users' questions. Key elements to consider are:

Metadata. We can't overstate metadata's importance. Creating entity and attribute definitions, as well as documenting warehouse data's operational source and transformation rules should be readily available to the user. The business relies on consistent, accurate answers to users' questions. After all, they're making key data decisions; they need to be self sufficient. IT must facilitate metadata's development, maintenance, and accessibility.

Predicting historical data's future needs. Assume that all entities need history and implement those that will deliver a payback.

Performance. Provide ample physical server, network, and client capacity to handle peak data volumes and performance.

Separate current and historical data access for query purposes. You can do this physically by separating tables' "current" and "active" data from the historical data. In other words, implement the physical tables so that they're modeled logically. In this case, you can implement logically correct database referential integrity in the database at load time, as well as at query time. You can also separate current and historical data logically by combining such data in the same tables and providing information that lets you view each separately. No matter which approach you use, it is very important from a query point of view that you're able to find the reference table's "current" row. It's also important for you to be able to traverse history using date ranges (effective and expiry dates) when needed. You may break data out by a timeframe, such as year or quarter, logically or physically. Such activity can cause problems when users start analyzing data with varying timeframes. Whichever approach you choose, be consistent in your implementation.

Data-level security. This is an item not frequently discussed with respect to the data warehouse. In some cases, you might have to deliver a solution through the database, an end-user access tool, or a middle tier. These solutions aren't as straightforward in a data warehouse as they are for an operational application.

Access tools. You should be able to accomplish all of the above without force-fitting the database to get the access tool to work. There should be a layered level of independence between the database, metadata, data models, and access tool.

Let's Get Physical

You now need a way to take the logical models and implement them physically. We'll begin by implementing current and historical data in the same table, separated by effective and expiry dates. Figure 2 shows the model for these tables and some sample values.

These tables are really the historical implementation for Mfg_Site, and history for Product. Each new version of a Mfg_Site row gets a new Identifier. This indicates that the primary key doesn't contain the normalized key value of Mfg_Site; it contains a unique, historical data value. We've also implemented a separate Mfg_Site_ID that will remain the same across versions. For example, the Short_Name "AA," which has two versions, has two Mfg_Site_History_ID values (one and three) and a single Mfg_Site_ID value used to identify these values as different versions of the same row. This is the true Mfg_Site_ID--the key from the normalized, nonhistorical table. In other words, we've embedded a key from a table not created. It is very important to distinguish between these two keys.

Let's now turn our attention to the foreign key on the Product table. The Fk_Mfg_Site_Hist_ID is the foreign key to column Mfg_Site_Hist_ID of the Mfg_Hist table. This relationship between tables using the Mfg_Site's historical key may appear to be correct on the surface, but there are underlying inconsistencies in the data. This relationship doesn't account for effective and expiry dates overlapping on Mfg_Site and Product. The Product row identified by Product_Hist_ID 10 has foreign key fk_mfg_site_hist_id, which points to the first row in Mfg_Site. This Mfg_Site row has an effective_date of 1990-01-31 and an expiry date of 1997-11-30. On the other hand, the related Product has an effective_date of 1995-11-01 and an expiry_date of 9999-12-31.

The row Product_ID of 10 should be split into two rows. The first will have an effective/expiry date range of 1995-11-01 through 1997-11-30 and a foreign key which points to the first row on Mfg_Site. The second row will have an effective/expiry date range of 1997-12-01 through 9999-12-31 and a foreign key which points to row three on Mfg_Site.

The previous example highlights the fact that the relationship between tables storing historical data must account for time interval overlaps between related tables. In fact, the related tables have a many-to-many relationship that could be resolved by an intersection table containing the date interval overlaps between tables. Figure 3 shows this scheme.

 


Sample Values

MFG_SITE_HISTORY
Mfg_Site_Hist_ID Mfg_Site_ID Short_Name Description Effective_Dt Expiry_Dt
1 15 AA sitex 1990-01-31 1997-11-30
2 16 BB sitey 1995-02-03 1999-12-31
3 15 AA sitex2 1997-12-01 9999-12-31
MFG_SITE_PRODUCT_INTERSECTION_TABLE
Mfg_Site_Product_ID FK_Product_Hist_ID FK_Mfg_Site_Hist_Tbl_ID Effective_Dt Expiry_Dt  
30 7 1 1990-01-31 1995-10-31  
31 8 2 1995-02-03 9999-12-31  
32 9 3 1998-01-20 1997-11-30  
33 10 1 1995-11-01 1997-11-30  
34 10 3 1997-12-01 9999-12-31  
PRODUCT_HISTORY
Product_Hist_ID Product_ID Description Effective_Dt Expiry_Dt Fk_Mfg_Site_Hist_ID
7 17 GGG 1990-01-31 1995-10-31 1
8 18 HHH 1995-02-03 9999-12-31 1
9 19 MMM 1998-01-20 9999-12-31 3
10 17 GGG2 1995-11-01 9999-12-31 1

Figure 3. The many-to-many relationship.

There are unpleasant consequences from implementing historical data this way. You must either create an intersection table between every pair of related historical tables or denormalize the nonkey columns from the related table to the intersection table. With any new version you create on a parent table, you have to propagate to all related tables. You'll want a simpler implementation for most of your historical data.

Preserving Data Integrity

When you implement relationships between history tables, you can cause data integrity problems. There are two ways to solve this problem:

Implement the logical model. You must implement the physical database in precisely the same format as the logical model. In other words, you implement the five tables Figure 1 displays. You may add database-implemented referential integrity between the current tables (Mfg_Site and Product Table), as well as from the current tables to the historical tables (that is, from Mfg_Site to Mfg_Site_History) and event table (Sales). You can't implement database referential integrity between the historical tables.

This solution has its good and bad points. On the advantage side, you're implementing the logical model in the physical database. You can also maintain a separation of current and historical data. In addition, you can implement database-enforced referential integrity between the "current" tables, as well as between the "current" and "historical" tables.

Unfortunately, implementing the logical model this way requires maintaining twice as many tables in the database. Also, you have to perform the data population on twice as many tables. We believe these significant disadvantages call for a compromise in your solution implementation.

Implement a limited logical model.You can implement a limited, yet consistent logical model version by simply removing the "current" tables Mfg_Site and Product, and implementing the historical tables Mfg_Site_History and Product_History. Figure 4 shows an example of such activity.


 

Figure 4. Compromised physical implementation of historical data.


 

You implement the historical tables as though the current tables also exist. Don't implement database-enforced referential integrity between the historical tables. You can implement application-level referential integrity by the columns Mfg_Site_ID and FK_Mfg_Site_ID, which would have existed on the two current tables and have been brought over to the historical tables. You can create background tasks to manage the referential integrity dependencies between the historical tables. You'll need to separate the current rows from the historical rows with views that imbed date logic and provide end users with separate views of the data.

There are advantages and disadvantages to this type of logical model implementation. The upside is that you can maintain separate current and historical data views. You can also implement one table set, which provides the same functionality as twice the tables previously implemented. In addition, you have less overhead when populating data stemming from one table set. A disadvantage is that you can't implement database-enforced referential integrity between historical tables or between historical and current tables. But you can overcome the disadvantage by providing background tasks that give you a referential integrity error report.

Stay True to Your Model

When considering the best approach to use when you implement historical data in a warehouse, remember to stay true to the logical model. Use caution when creating relationships among the data. Because as soon as you overlook these key points, you'll confront integrity problems. Come to think of it, staying true to your model is a good rule to follow for any physical database implementer.

Tim Quinlan is a database consultant in the Toronto area specializing in relational warehousing and OLTP design and implementation. You can reach him at tquinlan@ican.net.

Roger Gilbert is the system architect at Eli Lilly Canada and has been working on enterprise scale information systems for the last 15 years. You can reach him at rgilbert@lilly.com.

Matthew Ferguson is a data analyst at Eli Lilly Canada with experience in multiple case tools, end-user access tools and warehouse implementations. You can reach him at ferguson_matthew@lilly.com.


 
search - home - archives - contacts - site index
 

Copyright © 1998 Miller Freeman Inc. All Rights Reserved
Redistribution without permission is prohibited.

Questions? Comments? We would love to hear from you!