
This month, I complete my miniseries on normalization by discussing some of the things that normalization can't do. (It's always important to understand the limitations of any technology on which we rely heavily.) I must immediately make it clear that I don't want my remarks to be seen as any kind of attack! Au contraire: In the past, I've often characterized normalization as the one piece of true science in "an otherwise artistic endeavor"--the endeavor in question being, of course, database design. (Actually, we have a little more science available to us than we used to, but the basic point--that database design is still a very much a matter of subjective decisions--is still valid.) However, although normalization does have a good claim to being objective instead of subjective, the fact remains that there are several aspects of database design that it doesn't help with at all.
I'll begin by introducing the notion of functional dependency (FD)
preservation. It's often the case that a given relvar can be
nonloss-decomposed in a variety of different ways--but some ways are
better than others. For example, consider the relvar EMP { EMP#, DEPT#,
BUDGET }, with candidate key EMP# and FDs EMP# ---> DEPT# and DEPT#
---> BUDGET. Note that, thanks to the transitivity property of FDs
discussed in an earlier installment1, the FD EMP# ---> BUDGET also
holds (see Figure 1, where the transitive FD is shown as a broken
arrow).
Now, it's well known that relvars like
EMP suffer from certain update anomalies, and further, that those
anomalies can be avoided by nonloss-decomposing the relvar into certain
projections. In the case at hand, suitable projections (both 5NF) are:
ED { EMP#, DEPT# }
CANDIDATE KEY { EMP# }
DB { DEPT#, BUDGET }
CANDIDATE KEY { DEPT# }
I'll call this decomposition "decomposition A." Now, it's easy to see there's another possible nonloss decomposition of EMP into 5NF projections ("decomposition B"):
ED { EMP#, DEPT# }
CANDIDATE KEY { EMP# }
EB { EMP#, BUDGET }
CANDIDATE KEY { EMP# }
Observe that projection ED is the same
for both A and B. Note: Of course the third possibility, replacing EMP
by its two projections on { EMP#, BUDGET } and { DEPT#, BUDGET }, isn't
a valid decomposition, because it's not nonloss.
I now claim that for several reasons,
decomposition A is more satisfactory than decomposition B.
For example, in B it's impossible to represent the fact that a given
department has a budget unless that department also has at least one
employee.
Let's examine the example a little
more closely. First, note that the projections in decomposition A
correspond to the solid arrows in Figure 1. As a direct result of
that fact, updates can be made to either projection without regard for
the other (so long as the referential constraint from ED to DB is
satisfied; this latter consideration, though important, is not germane
to the present discussion). In other words, provided only that the
update in question is legal within the context of the projection
concerned--which means only that it must not violate the candidate key
constraint for that projection--then the join of the two projections
after the update will still be a valid value for relvar EMP (that
is, the join cannot possibly violate the FD constraints on relvar EMP).
In decomposition B, by contrast, one
of the two projections corresponds to the broken arrow in Figure 1. As a
result, updates to either projection must at least be monitored to
ensure that the FD DEPT# ---> BUDGET isn't violated. (If two
employees have the same department, they must have the same budget!
Consider, for example, what's involved in decomposition B in moving an
employee from department D1 to department D2.)
The basic problem with decomposition B
is that the FD DEPT#---> BUDGET spans two relvars. In
decomposition A, by contrast, it's the transitive FD EMP#--->
BUDGET that spans relvars, and that FD will be enforced automatically if
the two FDs EMP# ---> DEPT# and DEPT# ---> BUDGET (which don't
span relvars) are enforced. And enforcing these two latter FDs is very
simple, of course, involving as it does nothing more than the
enforcement of the corresponding candidate key uniqueness constraints.
The net of this discussion is that we
should generally decompose relvars in such a way as to preserve
FDs. More precisely, let R be the relvar we're trying to normalize, and
let F be a conical cover for the FDs satisfied by R. Then the
decomposition of R, D say, should be such that every FD in F corresponds
to just one projection in D, and there aren't any other projections in
D.
Note: I discussed the conical
cover concept in my November 1997 column. Loosely speaking, a conical
cover for a given set S of FDs is a set of FDs that a) implies all the
FDs in S (and no others), b) includes no FDs that are unnecessary for
a), and c) contains at most one FD with a given left-hand side. In the
case of EMP, for example, with its FDs EMP# ---> DEPT#, DEPT#
---> BUDGET, and EMP# ---> BUDGET, it's easy to see that the set
of FDs:
EMP# ---> DEPT# DEPT# ---> BUDGET
is an irreducible cover. (In fact, it's the only one.)
Consider a relvar SJT { S, J, T }, where S, J, and T stand for student,
subject, and teacher, respectively. The meaning of an SJT row
(s,j,t) is that student s is taught subject j by teacher t.
Assume that:
1. For each subject, each student of
that subject is taught by only one teacher.
2. Each teacher teaches only one
subject (but each subject is taught by several teachers).
A sample value for this relvar that
conforms to these constraints is given in Figure 2. Note: For
reasons that will quickly become apparent, I deliberately depart in that
figure from my usual convention of indicating primary key columns by
double underlining.
What FDs hold in SJT? From 1 we have {
S, J } ---> T; from 2 we have T ---> J. Also, the fact that
each subject is taught by several teachers tells us that the FD J
---> T does not hold. So the situation is as shown in Figure 3.
Note the nasty involuted structure of that figure!
Observe that SJT has two candidate
keys, { S, J } and { S, T }; both of these combinations have the
property that all columns of SJT are functionally dependent on them, and
this property no longer holds if any column is discarded from the
combination in question. Yet SJT also satisfies the additional FD T
---> J, which isn't implied by either of those candidate keys; so
SJT isn't in BCNF. And--like all non-BCNF relvars--SJT suffers from
certain update anomalies; for example, if we wish to delete the
information that Jones is studying physics, we cannot do so without at
the same time losing the information that Professor Brown teaches
physics.
As usual, we can avoid these update
anomalies by performing a nonloss decomposition into projections, in
this case the projections:
ST { S, T }
CANDIDATE KEY { S, T }
TJ { T, J }
CANDIDATE KEY { T }
(both of which are 5NF). Now we can delete the information that Jones is
studying physics by deleting the row for Jones and Professor Brown from
projection ST without losing the row for Professor Brown and Physics
from projection TJ.
The problem is, however, that this
decomposition violates the FD preservation objective discussed in the
previous section. To be specific, the FD { S, J } ---> T now spans
two relvars (ST and TJ); it cannot be deduced from the FD T ---> J
(which is the only nontrivial FD represented in ST and TJ). As a result,
ST and TJ cannot be independently updated. For example, an attempt to
insert a row for Smith and Professor Brown into ST must be rejected,
because Professor Brown teaches physics and Smith is already being
taught physics by Professor Green; yet this fact cannot be detected
without examining TJ.
The point of this discussion is
that--sadly--the two objectives of a) decomposing relvars to the
ultimate normal form (or indeed just to BCNF), and b) decomposing
relvars in such a way as to preserve FDs, can occasionally be in
conflict; that is, it isn't always possible to achieve both
simultaneously.
Other points arise from this example:
FORALL s IN S, t1,t2 IN T, j1,j2 IN J
IF EXISTS { S:s, T:t1 } IN ST
AND EXISTS { S:s, T:t2 } IN ST
AND EXISTS { T:t1, J:j1 } IN TJ
AND EXISTS { T:t2, J:j2 } IN TJ
THEN j1 ý j2
(I'm assuming here that the student,
teacher, and subject domains are called S, T, and J, respectively.) This
declaration is quite complex! But suppose we were allowed to declare
candidate keys for views (or, more generally, for relational expressions
of any kind). Then we might write:
CREATE VIEW SJT AS ( ST JOIN TJ )
CANDIDATE KEY { S, J }
This candidate key declaration serves
very nicely to define the relvar-spanning FD! Thus the example provides
another argument in support of a position I've argued elsewhere, to the
effect that it should be possible to declare candidate keys for views as
well as for base relvars.
In case you might be thinking
that the SJT example is very contrived, and that involuted dependency
structures such as the one shown in Figure 3 never occur in practice,
let me give a much more familiar example:
ADDRESS { STREET, CITY, STATE, ZIP }
This relvar satisfies the FDs ZIP
---> { CITY, STATE } and { STREET, CITY, STATE } ---> ZIP. In
other words, it's the SJT example! (Take S as STREET, CITY and STATE as
J, and ZIP as T.)
The broad objective of normalization is to reduce redundancy. But
consider a relvar CTXD (a variation on relvar CTX from last month),
with columns C (course), T (teacher), X (textbook), and D (days). The
meaning of row (c,t,x,d) is that course c can be taught by teacher t and
uses textbook x, and further, that d is the number of days spent with
textbook x by teacher t on course c. Assume too that:
1. Teachers and textbooks are
independent of one another (as for CTX last month).
2. A given teacher/textbook
combination can occur in association with any number of courses
(different from CTX last month).
Figure 4 shows a sample CTXD relation
value that's consistent with these assumptions.
Relvar CTXD has just one candidate
key, the combination { C, T, X }, and it satisfies just one nontrivial
dependency, the FD { C, T, X } ---> D. So it's 5NF! And yet it
clearly involves a lot of redundancy. The trouble is that redundancy
can't be eliminated by taking projections; D depends on all three of C,
T, and X, and so can't appear in a relvar with anything less than all
three. So the message is: Just because a relvar is in the ultimate
(fifth) normal form, it doesn't mean there's no redundancy or no update
anomalies.
Note: Although CTXD doesn't involve
any nontrivial MVDs (apart from the FD discussed above), it does involve
two "embedded" MVDs (of T on C and X on C). A relvar R { A, B, ... } is
said to be subject to the embedded MVD of B on A if the "ordinary" MVD A
---> ---> B holds in some projection of R. In the example, the
two embedded MVDs would have to be stated as explicit constraints on
CTXD.
Normalization is clearly not a panacea, even when we limit our attention to purely logical considerations. (Naturally, I reject arguments based on physical considerations--that is, performance arguments--for reasons discussed in detail a few months back.2) To summarize the reasons why this is so, let's think about the objectives of normalization, which might be stated as follows:
The problem is, however, that a) not all constraints are JDs or MVDs or FDs, and in any case b) as the SJT example shows, some JDs or MVDs or FDs can become relvar-spanning constraints--and hence, technically, no longer JDs or MVDs or FDs--during the normalization process. In other words, the two objectives of normalization and FD preservation can sometimes conflict with one another (as I explained earlier).
I must mention too that:
Despite the problems listed in the foregoing section, I must close by repeating my belief that, almost always, anything less than full normalization is strongly contraindicated. A fully normalized design can be regarded as a "good" representation of the real world--one that's intuitively easy to understand and is a good base for future growth. Incidentally, it's worth noting in this connection that good top-down design methodologies tend to generate fully normalized designs; why do you think this is?
One last point: We all know that denormalization is bad for update (logically bad, I mean). But it can be bad for retrieval too; that is, it can make certain queries harder to formulate. What is more--even with today's less than perfect DBMSs, in which "denormalizing for performance" is, regrettably, sometimes necessary--denormalization can be bad for performance as well! In fact, "denormalizing for performance" usually means improving the performance of one application at the expense of others.
References
1. Date, C. J. "Functional Dependencies Are Fun, Part 1," Database Programming & Design, 8(11), Nov. 1995; Part 2, Database Programming & Design, 8(12), Dec. 1995.
2. C. J. Date: "The Normal Is So ý Interesting, Part 1," Database Programming & Design, 11(12), Dec. 1997; Part 2, Database Programming & Design, 12(1), Jan. 1998.
C. J. Date is an independent author, lecturer, researcher, and consultant,
specializing in relational database systems. His most recent books, all published by Addison-Wesley,
are An Introduction to Database Systems (6th edition, 1995); Relational Database Writings 1991-1994
(1995); and (with Hugh Darwen) A Guide to the SQL Standard (4th edition, 1997).
Correspondence may be sent to him in care of Database Programming & Design, 411 Borel Ave., Ste. 100,
San Mateo, CA 94402.