The Fault with Defaults

By Tom Johnston

A response to C. J. Date in the default vs. special values debate

      I was looking forward to reading C. J. Date's position on default values because I was anticipating something that many of us could have used instead of the nulls and multivalued logic (MVL) provided by SQL. We would then have had two ways available to us to handle missing information. On the one hand, with a well-designed scheme expressing a two-valued logic approach to default values, the more conservative among us could have used this less expressive but also less complex and error-prone way of handling missing information. On the other hand, for those of us who felt comfortable dealing with the additional complexity that increased expressive power inevitably brings with it, we could have used nulls and MVL.
      But unfortunately, this is not what Date provided in his five-part "Faults and Defaults" series (November and December 1996 and January, February, and April 1997). He instead provided a "special values" scheme (as he calls it) that no current product supports and that would require SQL to abjure its commitment to MVL and build in support for this alternative approach. Date acknowledges this, while trying to put the shoe on the other foot, when he says that "today's SQL products might have difficulty with such a notion," that is, his special values scheme, "but that's their problem" (December 1996).
 
UNK and Equality
Let me move directly to the heart of the matter concerning Date's special values scheme.
      After explaining that his new scheme will require us to replace every domain--say domain XXXX--with a new domain XXXX_OR_UNK, Date goes on to consider the use of various operators with the UNK special value. (I find Date's name for the new domains misleading because the domains in question contain both the real values and UNK, not either one or the other. I suggest that readers will find Date's discussion a little clearer if they think of these domains with the names XXXX_OR_UNK as domains containing all the XXXX values and also the value UNK.) He begins with the equality operator and after defining its use with the UNK special value, says: "Note that this definition implies that the comparison 'UNK=UNK'ýreturns true. ýNo three-valued logic here!" (January 1998)
      Right here, on this clear and explicit point, we could have no better example of the need for three-valued logic! And because this goes right to the heart of whether a special values approach can be as expressive as an MVL approach, we must consider this claim of Date's about UNK and equality very carefully.
      Consider a two-column table of 100 rows, each of which is a pair of UNK values. Suppose the domain for each column is a) integers from one to 100 and b) UNK. Therefore, for each row with a real value in each column (a number from one to 100) the odds are only one in 10,000 that the row contains a pair of matching values (100 x 100). For all 100 rows, the odds are even worse--one in a million--that they all contain matching values (10,000 x 100).
      Let's issue the following query to a database interface using Date's special values scheme: "How many rows in this table contain matching values?" The answer we will get with Date's explicit definition is: "All 100 of them." But because we don't know the real values for the pairs of numbers in any row, the odds of this being the right answer are, literally, one in a million! Is this is a scheme that any of us would want to use when querying a database for information?
      How serious the real-world consequences of the inadequacies of Date's approach depends on the situation. So to make them serious, suppose that the columns of numbers are the firing coordinates for each of 100 targets computed by independent targeting mechanisms for a cruise missile launching system. Next, let the fail-safe rule be that a missile will launch if and only if both targeting mechanisms agree on the coordinates. And, finally, let us suppose that conditions for targeting are so bad in all 100 cases that both targeting mechanisms are unable to compute coordinates. Therefore, because the domain each is working with is TARGETING_COORDINATE_OR_UNK, they put the special value UNK into their column for each of the 100 rows.
      We now use an interface based on Date's new scheme to determine at which of the 100 targets we will fire. We say, "For each of the 100 targets, fire if and only if the comparison 'Does coordinate 1 = coordinate 2?' returns true." And lo, all 100 cruise missiles lift into the air--even though no coordinates are provided by either targeting mechanism for any of the targets!
      Of course, if we had only had the foresight to have said, instead, "For each of the 100 targets, fire if and only if the comparisons 'Does coordinate 1 = coordinate 2?' and 'Does coordinate 1 not = UNK'?' return true," then our missiles would not have fired.
      Now Date may call this revised query a "solution" if he wants to, but I think it is evidence that something is wrong. The user has been forced to compensate for an error in the semantics Date's scheme assigns to the equality operator. What has gone wrong, as we all know, is that the answer to the question "Are two unknown values equal?" is not "Yes, they are."
      And if Date's reply to this point is that he defined equality for his UNK special value not for "unknown values," then he has merely conceded that his scheme using UNK values does not represent the semantics of the "real world's" unknown values.
 
UNK and Inequality: The Basics
After this unfortunate attempt at defining equality, Date goes on to try to define greater than and less than. Running into immediate problems, he suggests that it is "debatable" that we would ever need to use these operators with an UNK operand. After struggling with a possible interpretation that would be as flawed as his interpretation of equality, he settles on the position that to use these two inequality operators when one or both operands is UNK "makes no sense."
      You can now see what I meant by saying that "what's wrong with the default value approach, from a semantic perspective, is that it has the semantics wrong, and so the ability of a database using default values to inform the inquirer fully is correspondingly hampered" (quoted in November 1996 and again, while saying that he still "completely disagrees" in April 1997). This position of Date's is precisely that his special values scheme cannot return any information to the user when either of these two operators is used in the presence of UNK. So together with his incorrect handling of equality, I suggest that Date has now demonstrated the truth of this claim of mine.
 
UNK, Inequality, and the Real World
Up to this point I have simply assumed that it makes sense to use the inequality operators (greater than and less than) with the UNK special value. But I think that Date would now attempt to argue that the basic question is what we would want to do in the real world when confronted with unknown values. For if refusing to give any answer to a question of the form "Is X greater than (or less than)Y," where X, Y or both may be unknown, is indeed what we would do in the real world, then we would have sacrificed nothing with Date's approach to these two operators. On the contrary, we would have accurately captured the semantics of unknown values with UNK and two-valued logic (for greater than and less than, at least).
      I shall now argue that the right response to a query using greater than or less than when a value for one or both operands is unknown is that the question certainly does make sense, but the answer is unknown because one or both of the operands is unknown. (And if this seems so obvious to you that you are wondering why I am trying to build an argument to demonstrate it, let me remind you that this is precisely what Date's scheme denies!)
      Let's look at exactly how much information is inexpressible with Date's scheme, even though the information that scheme can't give us is contained in the database!
      Suppose now that our table of pairs of numbers represents 100 cases in which two players cut cards--the numbers two through 14 representing card values two through ace, the first player being Jones, and the second player being Smith. Let's suppose next that a) in 10 cases, both values are unknown, b) in 23 cases one but not both values are unknown, and c) in the remaining 67 cases, both values are known. Of those 67 cases, let's assume that four were ties, that Jones beat Smith in 42 of those cases and, therefore, that Smith beat Jones in the remaining 21. (To avoid some unnecessary complexity relating to the highest and lowest values in a deck of cards, assume that in case b), the known value is never two or 14 because if it were the former, the player could tie but not win the card cut, and if it were the latter, the player could tie but not lose the card cut.)

Now consider the following questions:
1. How many times did Jones definitely beat Smith?
2. How many times may Jones have beaten Smith?
3. How many times did Jones and Smith definitely tie?
4. How many times may Jones and Smith have tied?

Clearly, the answers to the questions are these: 1. 42
2. 75 (42 + 10 + 23)
3. 4
4. 37 (4 + 10 + 23)

The answers that Date's special values scheme, would give, however, are these: 1. 42
2. At least 42, and no more than 65--but with 23 instances of "invalid question"
3. 14 (4 + 10)
4. At least 14 (4 + 10), but with 23 instances of "invalid question"

Now let's compare results.

For question 1, both approaches give the same correct result.

For question 2, Date's scheme should have been able to determine that the correct answer is 75, but it couldn't because it couldn't count the 23 cases in which one of the operands was UNK--Date's scheme will return "undefined" (or "raise a runtime error") whenever it encounters an UNK value while evaluating a greater than or less than expression. However, in all 23 cases, Jones may indeed have beaten Smith. This demonstrates my point that there is information in the database that, given Date's scheme, he cannot provide to the user.
      Note also that the correct answer (75) lies outside the range given by Date's scheme, so the answer we get using his scheme is not only vague, but actually incorrect. Date's scheme must count the 10 cases in which both numbers are UNK as ties, that is, as cases where the two values are equal! Being equal, they are not possible cases in which Jones beat Smith in Date's scheme. But in the real world (unfortunately for Date), because they are instances of unknown values, they are possible cases in which Jones beat Smith.
      For question 3, Date's scheme also gives an incorrect answer. Jones and Smith definitely tied in four cases, not in 14! And why did Date's scheme give so obviously incorrect an answer? Again, because of Date's rule "that the comparison UNK=UNK ý returns true ý No three-valued logic here!"
      Of course, understanding the limitations of Date's scheme, we could have phrased our query looking for definite ties as follows: "where column 1 = column 2, and neither column = UNK." This query would have returned the correct result (4). But this doesn't solve any problem in Date's scheme. It just puts the burden of compensating for the fundamental semantic errors in his scheme directly on the user.
      For question 4, Date's scheme gives a correct answer. However, this answer is less informative than it could be because in all 23 cases it could be the case that the two players tied.
      Note that the answer that a database interface using nulls and MVL would give are exactly those which I indicated are the clearly correct answers. With a three-valued logic, X = Y returns unknown when either or both operands are unknown, and similarly for X > Y and X < Y.
      So for question 4, for example, a database interface using nulls and MVL can reason as follows: "I know that Jones and Smith tied in four cases. In 10 cases, I don't know either value, so they may have tied in all 10 of those cases as well. And in 23 cases, I don't know one value, so they may have tied in all of those cases, too. The total of all these cases is 37."
      Date's attempt to convince us that in the real world we don't use three-valued logic, flies in the face of this kind of everyday reasoning. Yet again, Date has provided further illustration of the truth of my statement quoted earlier.
 
Complexity
In his final installment, Date anticipates an obvious objection: that his special values scheme makes queries a lot more complicated than they were before. I have three comments about this feature of his new scheme.
      First, a cursory glance at Date's extensive criticisms of multivalued logic will show that underlying them is the repeated refrain that MVL is too complex, and hence too prone to generate errors in the real world. I therefore find it interesting that Date seems to be now hoist on his own petard of complexity. Perhaps he will explain to us why the complexity of MVL is bad, leading to errors in the real world, while the complexity his special values scheme requires the user to master is good--supposedly not leading to such errors.
      Second, all this additional complexity in queries required by Date's scheme exists for only one reason: because the semantics of Date's scheme are either incorrect (for equals) or incapable of returning information that does exist in the database (for greater than and less than).
      Finally, I do not doubt that Date will have many words to offer in reply. But I would suggest to the reader that he who publishes the last word does not necessarily have the better position. Let us remember that the Copernican revolution did not succeed because it proved to the Ptolemaic theory's most ardent advocates that their theory was wrong. It succeeded because the complexities of epicycles within epicycles, which new observations forced the Ptolemaic theory into, convinced everyone else but those ardent advocates that the theory was fatally flawed. I am content, therefore, to stand back and watch Date propose additional epicycles to these already quite elaborate query-related epicycles he requires for his new special values scheme.
 
A Practical Special Values Scheme
In my introduction, I expressed a desire to see a two-valued logic default values scheme whose clarity and ease of use was judged a good trade-off for reduced expressive power. Since Date has not given us one, let me briefly outline such a scheme myself.
      First of all, I agree with Date that we should eliminate nullable attributes in our databases, as much as possible. With some reservations (which I will not explain here), I also agree that a good way to do this is to subtype entities which contain a nullable attribute into two subtype entities--one which contains the attribute as nonnullable, and the other which does not contain the attribute at all.
      Second, I suggest that we reserve one or more values from an attribute's usual domain for the special values we wish to identify. A single value, meaning "real value missing" may be all we need. Perhaps a distinction between inapplicable, unknown, or "one or the other, not known which" is more appropriate. And in response to Date's frequently expressed argument that there are any number of different kinds of nulls and, therefore, any number of truth values which an MVL would have to accommodate, I would reply that it seems to me somewhat naive to argue that every different way of stating a value-missing condition, in English, must correspond to a formally distinct kind of null. For the position that all the different ways in which a value-missing condition can be stated in English (or any natural language) can be reduced to the three mentioned here, see Atzeni and DeAntonellis, Relational Database Theory (Benjamin/Cummings, 1993).
      If possible, we should establish business rules that ensure that the value we choose is not a homonym. For example, if an enterprise enforces a rule that invoice amounts cannot be zero, then we can use zeroes in the invoice amount attribute to mean "real value missing."
      However, sometimes this is not possible. Perhaps our enterprise must allow zero-amount invoices. In that case, "$000,000.00" in the attribute means both a) a zero-amount invoice, and b) "real value missing."
      If we must make a homonym out of a value like this, we should be aware of the cost. Here, the cost is being unable to distinguish zero-amount invoices from invoices in which the actual amount is missing. This may well be too high a cost. In that case, we have other options, such as adding a flag to distinguish zero-amount invoices from all others or shifting to a less critical and very infrequently used real value--say "$999,999.98"--to mean "real value missing."
      In this case, we have compensated for the homonymity, for we can now distinguish zero-amount invoices from invoices in which the invoice-amount is missing. What we now can't do is distinguish the (presumably less common) invoices for exactly $999,999.98 from invoices in which the amount is missing. Although we have not eliminated the cost of the homonymity; we have significantly reduced it.
      We have been using this kind of reduced-cost strategy for decades, actually, and in doing so we have decided to let the semantic problem be resolved outside the system by users who understand the convention and the possibility of actually encountering instances of these homonyms.
      Finally, note that the strategy of avoiding homonyms, using a value to mean "real value missing" that does not mean anything else, gives us exactly the expressive power (and exactly the complexities) of Date's special values scheme. It does so because it amounts to finding a representation of Date's UNK within the standard domain for an attribute, instead of requiring vendors to extend that domain.
      With Date's new special values scheme, we have an alternative to MVL that can't get equality right, can't handle greater than or less than at all, consequently returns both incorrect results and less than fully informative results to queries, and turns simple queries into rocket-scientist queries.
      Moreover, now that we understand how far from being implementable Date's scheme is, we can see that this extended debate between Date and me has little immediate applicability to the practical work of developing databases and interfaces to them. As designers and users, we will continue to work within the constraints provided by our dialects of SQL.
      The more conservative among us will avoid nulls and MVL because of understandable caution at the complexity. As I argued throughout my articles on this topic, the use of the additional inferencing power of MVL is a foray into a more complex dialogue with the database--a foray which should not be undertaken unless one is comfortable handling the complexity.
      To assist the conservatives, therefore, I have sketched a two-valued logic approach to missing information that is practical and can be used right now with today's products. However, I claim no originality here. Indeed, I offer this sketch as little more than an existing best practice, dressed up in some fancy logical terminology, and accompanied by some explicit caveats.
      As for the more adventurous readers, they will make use of nulls, as provided by our SQL dialects and will probably get burned sometimes. But in the process, they will forge an understanding of nulls and MVL, which they might not obtain from any textbook study. They will begin the process of rendering the complexity concomitant with the greater expressive power of nulls and MVL less daunting to us all.

Tom Johnston is an independent consultant located in Atlanta. With a doctorate in philosophy and a strong background in relational theory and database technology, he provides logical and physical data modeling services to major corporate clients. You can reach him via email at tjohnston@acm.org.  


 
search - home - archives - contacts - site index
 

Copyright © 1998 Miller Freeman Inc. All Rights Reserved
Redistribution without permission is prohibited.

Questions? Comments? We would love to hear from you!