According To Date

C.J. Date

But it's a lot better than the alternative!

Normalization Is No Panacea

This month, I complete my miniseries on normalization by discussing some of the things that normalization can't do. (It's always important to understand the limitations of any technology on which we rely heavily.) I must immediately make it clear that I don't want my remarks to be seen as any kind of attack! Au contraire: In the past, I've often characterized normalization as the one piece of true science in "an otherwise artistic endeavor"--the endeavor in question being, of course, database design. (Actually, we have a little more science available to us than we used to, but the basic point--that database design is still a very much a matter of subjective decisions--is still valid.) However, although normalization does have a good claim to being objective instead of subjective, the fact remains that there are several aspects of database design that it doesn't help with at all.

PRESERVING FUNCTIONAL DEPENDENCIES

I'll begin by introducing the notion of functional dependency (FD) preservation. It's often the case that a given relvar can be nonloss-decomposed in a variety of different ways--but some ways are better than others. For example, consider the relvar EMP { EMP#, DEPT#, BUDGET }, with candidate key EMP# and FDs EMP# ---> DEPT# and DEPT# ---> BUDGET. Note that, thanks to the transitivity property of FDs discussed in an earlier installment1, the FD EMP# ---> BUDGET also holds (see Figure 1, where the transitive FD is shown as a broken arrow).
     Now, it's well known that relvars like EMP suffer from certain update anomalies, and further, that those anomalies can be avoided by nonloss-decomposing the relvar into certain projections. In the case at hand, suitable projections (both 5NF) are:

ED { EMP#, DEPT# } 
CANDIDATE KEY { EMP# }
DB { DEPT#, BUDGET }
CANDIDATE KEY { DEPT# }

I'll call this decomposition "decomposition A." Now, it's easy to see there's another possible nonloss decomposition of EMP into 5NF projections ("decomposition B"):

ED { EMP#, DEPT# } 
CANDIDATE KEY { EMP# }
EB { EMP#, BUDGET }
CANDIDATE KEY { EMP# }


     Observe that projection ED is the same for both A and B. Note: Of course the third possibility, replacing EMP by its two projections on { EMP#, BUDGET } and { DEPT#, BUDGET }, isn't a valid decomposition, because it's not nonloss.
     I now claim that for several reasons, decomposition A is more satisfactory than decomposition B. For example, in B it's impossible to represent the fact that a given department has a budget unless that department also has at least one employee.
     Let's examine the example a little more closely. First, note that the projections in decomposition A correspond to the solid arrows in Figure 1. As a direct result of that fact, updates can be made to either projection without regard for the other (so long as the referential constraint from ED to DB is satisfied; this latter consideration, though important, is not germane to the present discussion). In other words, provided only that the update in question is legal within the context of the projection concerned--which means only that it must not violate the candidate key constraint for that projection--then the join of the two projections after the update will still be a valid value for relvar EMP (that is, the join cannot possibly violate the FD constraints on relvar EMP).
     In decomposition B, by contrast, one of the two projections corresponds to the broken arrow in Figure 1. As a result, updates to either projection must at least be monitored to ensure that the FD DEPT# ---> BUDGET isn't violated. (If two employees have the same department, they must have the same budget! Consider, for example, what's involved in decomposition B in moving an employee from department D1 to department D2.)
     The basic problem with decomposition B is that the FD DEPT#---> BUDGET spans two relvars. In decomposition A, by contrast, it's the transitive FD EMP#---> BUDGET that spans relvars, and that FD will be enforced automatically if the two FDs EMP# ---> DEPT# and DEPT# ---> BUDGET (which don't span relvars) are enforced. And enforcing these two latter FDs is very simple, of course, involving as it does nothing more than the enforcement of the corresponding candidate key uniqueness constraints.
     The net of this discussion is that we should generally decompose relvars in such a way as to preserve FDs. More precisely, let R be the relvar we're trying to normalize, and let F be a conical cover for the FDs satisfied by R. Then the decomposition of R, D say, should be such that every FD in F corresponds to just one projection in D, and there aren't any other projections in D.
     Note: I discussed the conical cover concept in my November 1997 column. Loosely speaking, a conical cover for a given set S of FDs is a set of FDs that a) implies all the FDs in S (and no others), b) includes no FDs that are unnecessary for a), and c) contains at most one FD with a given left-hand side. In the case of EMP, for example, with its FDs EMP# ---> DEPT#, DEPT# ---> BUDGET, and EMP# ---> BUDGET, it's easy to see that the set of FDs:

EMP# ---> DEPT# 
DEPT# ---> BUDGET

is an irreducible cover. (In fact, it's the only one.)

AN UNFORTUNATE CONFLICT

Consider a relvar SJT { S, J, T }, where S, J, and T stand for student, subject, and teacher, respectively. The meaning of an SJT row (s,j,t) is that student s is taught subject j by teacher t. Assume that:
     1. For each subject, each student of that subject is taught by only one teacher.
     2. Each teacher teaches only one subject (but each subject is taught by several teachers).
     A sample value for this relvar that conforms to these constraints is given in Figure 2. Note: For reasons that will quickly become apparent, I deliberately depart in that figure from my usual convention of indicating primary key columns by double underlining.
     What FDs hold in SJT? From 1 we have { S, J } ---> T; from 2 we have T ---> J. Also, the fact that each subject is taught by several teachers tells us that the FD J ---> T does not hold. So the situation is as shown in Figure 3. Note the nasty involuted structure of that figure!
     Observe that SJT has two candidate keys, { S, J } and { S, T }; both of these combinations have the property that all columns of SJT are functionally dependent on them, and this property no longer holds if any column is discarded from the combination in question. Yet SJT also satisfies the additional FD T ---> J, which isn't implied by either of those candidate keys; so SJT isn't in BCNF. And--like all non-BCNF relvars--SJT suffers from certain update anomalies; for example, if we wish to delete the information that Jones is studying physics, we cannot do so without at the same time losing the information that Professor Brown teaches physics.
     As usual, we can avoid these update anomalies by performing a nonloss decomposition into projections, in this case the projections:

ST { S, T }
CANDIDATE KEY { S, T }
TJ { T, J } 
CANDIDATE KEY { T }

(both of which are 5NF). Now we can delete the information that Jones is studying physics by deleting the row for Jones and Professor Brown from projection ST without losing the row for Professor Brown and Physics from projection TJ.
     The problem is, however, that this decomposition violates the FD preservation objective discussed in the previous section. To be specific, the FD { S, J } ---> T now spans two relvars (ST and TJ); it cannot be deduced from the FD T ---> J (which is the only nontrivial FD represented in ST and TJ). As a result, ST and TJ cannot be independently updated. For example, an attempt to insert a row for Smith and Professor Brown into ST must be rejected, because Professor Brown teaches physics and Smith is already being taught physics by Professor Green; yet this fact cannot be detected without examining TJ.
     The point of this discussion is that--sadly--the two objectives of a) decomposing relvars to the ultimate normal form (or indeed just to BCNF), and b) decomposing relvars in such a way as to preserve FDs, can occasionally be in conflict; that is, it isn't always possible to achieve both simultaneously.
     Other points arise from this example:

FORALL s IN S, t1,t2 IN T, j1,j2 IN J
IF EXISTS { S:s, T:t1 } IN ST 
AND EXISTS { S:s, T:t2 } IN ST 
AND EXISTS { T:t1, J:j1 } IN TJ 
AND EXISTS { T:t2, J:j2 } IN TJ 
THEN j1 ý j2 


     (I'm assuming here that the student, teacher, and subject domains are called S, T, and J, respectively.) This declaration is quite complex! But suppose we were allowed to declare candidate keys for views (or, more generally, for relational expressions of any kind). Then we might write:

CREATE VIEW SJT AS ( ST JOIN TJ )
                CANDIDATE KEY { S, J }


     This candidate key declaration serves very nicely to define the relvar-spanning FD! Thus the example provides another argument in support of a position I've argued elsewhere, to the effect that it should be possible to declare candidate keys for views as well as for base relvars.
     •In case you might be thinking that the SJT example is very contrived, and that involuted dependency structures such as the one shown in Figure 3 never occur in practice, let me give a much more familiar example:

ADDRESS { STREET, CITY, STATE, ZIP }


     This relvar satisfies the FDs ZIP ---> { CITY, STATE } and { STREET, CITY, STATE } ---> ZIP. In other words, it's the SJT example! (Take S as STREET, CITY and STATE as J, and ZIP as T.)

ELIMINATING REDUNDANCY

The broad objective of normalization is to reduce redundancy. But consider a relvar CTXD (a variation on relvar CTX from last month), with columns C (course), T (teacher), X (textbook), and D (days). The meaning of row (c,t,x,d) is that course c can be taught by teacher t and uses textbook x, and further, that d is the number of days spent with textbook x by teacher t on course c. Assume too that:
     1. Teachers and textbooks are independent of one another (as for CTX last month).
     2. A given teacher/textbook combination can occur in association with any number of courses (different from CTX last month).
     Figure 4 shows a sample CTXD relation value that's consistent with these assumptions.        


     Relvar CTXD has just one candidate key, the combination { C, T, X }, and it satisfies just one nontrivial dependency, the FD { C, T, X } ---> D. So it's 5NF! And yet it clearly involves a lot of redundancy. The trouble is that redundancy can't be eliminated by taking projections; D depends on all three of C, T, and X, and so can't appear in a relvar with anything less than all three. So the message is: Just because a relvar is in the ultimate (fifth) normal form, it doesn't mean there's no redundancy or no update anomalies.
     Note: Although CTXD doesn't involve any nontrivial MVDs (apart from the FD discussed above), it does involve two "embedded" MVDs (of T on C and X on C). A relvar R { A, B, ... } is said to be subject to the embedded MVD of B on A if the "ordinary" MVD A ---> ---> B holds in some projection of R. In the example, the two embedded MVDs would have to be stated as explicit constraints on CTXD.

WHY NORMALIZATION IS NO PANACEA

Normalization is clearly not a panacea, even when we limit our attention to purely logical considerations. (Naturally, I reject arguments based on physical considerations--that is, performance arguments--for reasons discussed in detail a few months back.2) To summarize the reasons why this is so, let's think about the objectives of normalization, which might be stated as follows:


     The problem is, however, that a) not all constraints are JDs or MVDs or FDs, and in any case b) as the SJT example shows, some JDs or MVDs or FDs can become relvar-spanning constraints--and hence, technically, no longer JDs or MVDs or FDs--during the normalization process. In other words, the two objectives of normalization and FD preservation can sometimes conflict with one another (as I explained earlier).
     I must mention too that:

DENORMALIZATION CAN BE HARMFUL

Despite the problems listed in the foregoing section, I must close by repeating my belief that, almost always, anything less than full normalization is strongly contraindicated. A fully normalized design can be regarded as a "good" representation of the real world--one that's intuitively easy to understand and is a good base for future growth. Incidentally, it's worth noting in this connection that good top-down design methodologies tend to generate fully normalized designs; why do you think this is?
     One last point: We all know that denormalization is bad for update (logically bad, I mean). But it can be bad for retrieval too; that is, it can make certain queries harder to formulate. What is more--even with today's less than perfect DBMSs, in which "denormalizing for performance" is, regrettably, sometimes necessary--denormalization can be bad for performance as well! In fact, "denormalizing for performance" usually means improving the performance of one application at the expense of others.

References

     1. Date, C. J. "Functional Dependencies Are Fun, Part 1," Database Programming & Design, 8(11), Nov. 1995; Part 2, Database Programming & Design, 8(12), Dec. 1995.
     2. C. J. Date: "The Normal Is So ý Interesting, Part 1," Database Programming & Design, 11(12), Dec. 1997; Part 2, Database Programming & Design, 12(1), Jan. 1998.

C. J. Date is an independent author, lecturer, researcher, and consultant, specializing in relational database systems. His most recent books, all published by Addison-Wesley, are An Introduction to Database Systems (6th edition, 1995); Relational Database Writings 1991-1994 (1995); and (with Hugh Darwen) A Guide to the SQL Standard (4th edition, 1997). Correspondence may be sent to him in care of Database Programming & Design, 411 Borel Ave., Ste. 100, San Mateo, CA 94402.
 
 


 
search - home - archives - contacts - site index
 

Copyright © 1998 Miller Freeman Inc. All Rights Reserved
Redistribution without permission is prohibited.

Questions? Comments? We would love to hear from you!