
Online ExtraMike Lonigro
| ||
Figure 1. Eliminating the Effect of Change in Key Value. | ||
|
The second option is to change the value of the order's primary key. In this case, we must also change every foreign key reference to the order within the same commit unit! Each new reference must also be included in this update. Would it not be more correct and efficient simply to change the order's customer and continue processing the order? A surrogate key independent of all the order's attributes or relationships allows us to correct the order and maintain the information already contained in the database easily. Another problem illustrated here is that of concatenated primary keys. Because the order's primary key in our example has been migrated into the primary key of the line item, all references to the order line item must also change with a change in the order's key. The problem grows as we chain together more keys in the database. Migration of primary keys in this fashion has terrible implications for database flexibility. These examples illustrate how primary keys must be insulated from changes in value and so should not comprise attributes, relationships, or primary keys migrated from other tables. But even if its value is truly stable, the primary key must never be the same as the logical unique key because the set of attributes and relationships that make an entity unique may itself change. Associative entities provide a good example of this (see Figure 2). An associative entity contains attributes about a many-to-many relationship between two or more occurrences. Because the entity is defined by its relationships, we are tempted to use the foreign key columns in the primary key. For example, our company has a PART OFFERING associative entity, which stores pricing and configuration information for a PART as it is sold or offered by a VENDOR. A PART OFFERING is unique on the PART id, VENDOR id, and offering date. Why not use these columns as the primary key? 2A Here the PART OFFERING associative entity uses the logical unique key as the primary key. A change requiring PART OFFERINGs to be unique additionally on TIME requires changes to all related foreign key structures (TABLE A and TABLE B). 2B The PART OFFERING has a surrogate key different from the logical unique key. The new requirement to make PART OFFERING unique on an additional TIME attribute does not impact any foreign key structures (TABLE A and TABLE B). | ||
Figure 2. Eliminating the Effect of Change in Key Structure. | ||
|
The reason is that over time, the business may wish to capture PART OFFERINGs at a finer level of detail. We may begin to track multiple PART OFFERINGs for fuel offered by the same vendor at more than one price in a day. If we had used the unique key as the primary key, we would need to modify all programs and tables that reference unique PART OFFERINGs to include the offering time. A more stable design would give the PART OFFERING a surrogate key. The new business requirement would require only changes to the unique key index on the PART OFFERING table and any programs that insert or update its rows. The programs and other tables referencing the PART OFFERING by the surrogate key would be unaffected. The third issue I raised involves primary keys that contain attributes unrelated to the logical unique key. This type of key appears in many forms and is subject to the same problems I addressed above. Additionally, these keys are especially inflexible to expansion of the definition or usage of data. New subtypes of orders or accidents, for example, are typically handled by assigning different ranges or formats of primary key values. As these applications increase, smaller key ranges cause rollover and archiving issues while applications are forced to come up with imaginative new identification schemes just to insert a row in the database. Another example of this kind of key occurs when a key value implicitly represents a hierarchy among rows in a table. This type of key is inflexible to additional levels and limits the number of occurrences at any given level. For example, standard commodity codes used across the transportation industry are assigned key values whose format indicates an arbitrary level in a hierarchy of commodity classifications. A broad commodity classification is given a key of 100000; the classification level immediately beneath it has keys of the form 101000, 102000, and so on; and the next level keys of the form 101010, 101020. A better design would give each occurrence a surrogate primary key; child occurrences would have a foreign key back to the parent. In this way, classifications can have a varying and unlimited number of levels and occurrences. Finally, primary key values must be internally assigned. The first reason is, again, due to stability. Social Security numbers may be miskeyed, unknown, or changed; business users may invent new coding schemes to identify customers; and outside agencies may reformat or reuse identifiers at their own will. Another reason the values must be assigned internally is that we may define entities differently than the outside agencies that provide our data. For example, a vendor that sells us uniforms uses a different ITEM number to identify each different size of uniform. In our database, however, we create one ITEM with many sizes. This standard definition allows us to draw consistent information out of the database independent of vendor numbering schemes. Using the vendor's assigned ITEM number would cause us to create ITEMs based on each vendor's definition and not ours. The best way to avoid the pitfalls indicated by these illustrations is to choose a primary key that is logically meaningless; this is the surrogate primary key. Doing this means, above all, that the primary key must be owned by the data administrator. Intelligent, problematic primary keys will be assigned when application developers or users begin to take ownership of the primary key. When this happens, the real purpose of the primary key is lost; it becomes a "good idea" number that tells a story about the row it identifies. The primary key is not a "good idea" number; its purpose is to physically identify a row to the database. To summarize, the primary key must be:
THE CRITERIADoes every table require a surrogate primary key? The answer is no. The prerequisites derive from the definition of the primary key. Because the primary key identifies a row after it is in the database, it follows that a table requires a primary key if its rows will be referenced individually by other tables after entry into the database.There are two alternative views of this. The logical view asks: does the table participate in more than one relationship from the logical data model? For example, in a human resources database, an EMPLOYEE table will likely participate in more than one relationship; we may relate the EMPLOYEE to BENEFITs, POSITIONs, LOCATIONs, EVALUATIONs, and many other entities in our data model. For this reason, the EMPLOYEE gets a surrogate key (see Figure 3). But now suppose we need a table to store the EMPLOYEE DEPENDENTs. We must determine whether or not the DEPENDENT participates in more than one relationship from the data model. If we care to relate the DEPENDENT only to the EMPLOYEE, then the DEPENDENT table represents a repeating attribute of the EMPLOYEE entity. It does not require a surrogate key, but it may have a unique key consisting of the EMPLOYEE key and some combination of attributes of the DEPENDENT. If, however, we also wish to relate DEPENDENTs to BENEFITs, then the DEPENDENT participates in a second relationship, becomes an entity in the logical data model, and gets a surrogate key. 3A The DEPENDENT is related only to the EMPLOYEE, and has a unique key but no surrogate key. 3B The DEPENDENT has its own relationships (TABLE A and TABLE B) and gets a surrogate key. | ||
Figure 3. The Data Model in Primary Key Assignment. | ||
|
The physical view asks: Do we need to reference the rows individually for physical processing reasons? For example, we may wish to trigger certain DEPENDENTs to be processed through some application. Rather than reading all rows in the table looking for a few process flags, we may give the DEPENDENT a surrogate key and store the values of the "flagged" DEPENDENTs in a trigger table. After the DEPENDENT is processed, the row representing the DEPENDENT in the trigger table is deleted. This may be simpler than storing the series of columns that reference a single DEPENDENT. A SUGGESTED STANDARDOur shop has enjoyed success with the implementation of a simple primary key standard containing two critical elements. First, all our primary keys are sequentially assigned positive integers. In DB2, we use a large INTEGER column. This allows us to store more than 2.1 billion occurrences of any entity in our model. We can create an occurrence every second of every day for 68 years and still not run out of values. Certainly few business entities are created this frequently.Second, all primary keys are assigned through an enterprise module that receives a logical table name, obtains an exclusive lock on a row containing that table's next primary key value, increments the value, and passes it back to the caller. The primary key is assigned by logical table because we sometimes split a logical table into two or more physical tables for performance. The rows that form the one logical row share their primary key value. THE BENEFITSThe consistent format of primary keys throughout our database provides benefits in addition to flexibility. One is in the implementation of optional relationships (see Figure 4). For example, we create INVOICE ITEMs but may not relate them to an INVOICE until perhaps several days later. The INVOICE foreign key in the INVOICE ITEM table thus needs to be logically null. Physically, however, a zero or dummy INVOICE value would heavily skew the foreign key index. Instead, the null foreign key receives a value equal to the row's own primary key multiplied by negative one. This distributes the foreign key index values evenly across all rows, whether or not the foreign key relationship exists.4 The INVOICE ITEM has an optional relationship to the INVOICE. ITEMs without a related INVOICE receive a foreign key value equal to their primary key multiplied by -1. The index over the INVOICE foreign key is evenly distributed. | ||
Figure 4. Handling Absent Foreign Keys. | ||
|
A second benefit is that "Exclusive OR" relationships can be handled with one well-indexed column (see Figure 5). An "Exclusive OR" relationship exists when an entity occurrence may relate to at most one of a number of possible occurrences. A given INVOICE ITEM, for example, may pertain to one of any number of different possible entities in our model; we may invoice SHIPMENTs, STOPs, ACTIVITIES, or others. Different primary key domains would require a foreign key for each possible relationship, all but one of which would contain a value. Instead, we use a single indexed foreign key column. A second column names the table referenced by the foreign key. 5 The INVOICE ITEM can bill a SHIPMENT, STOP, or ACTIVITY, exclusively. A single foreign key handles all relationships. A second column names the table containing the primary key value. | ||
Figure 5. Handlign Exclusive or Foreign Keys. | ||
|
Poor primary key design is perhaps the most repeated and damaging flaw in database design. We would do well to remember that a smart key is a dumb thing to do, while a dumb key is a smart thing to do. Mike Lonigro is the senior data administrator responsible for all conceptual, logical, and physical database design at J. B. Hunt Transport Inc., the nation's largest publicly held truckload carrier. J.B. Hunt Transport runs on DB2 for OS/390 at more than 3 million transactions per day. Previously, Lonigro was a consultant with Price Waterhouse MCS in the Petroleum Special Practice Unit. You can reach him via email at lonigro@mail.jbhunt.com. |