
Now consider the following questions:
1. How many times did Jones definitely beat Smith?
2. How many times may Jones have beaten Smith?
3. How many times did Jones and Smith definitely tie?
4. How many times may Jones and Smith have tied?
Clearly, the answers to the questions are these:
1. 42
2. 75 (42 + 10 + 23)
3. 4
4. 37 (4 + 10 + 23)
The answers that Date's special values scheme, would give, however, are these:
1. 42
2. At least 42, and no more than 65--but with 23 instances of "invalid question"
3. 14 (4 + 10)
4. At least 14 (4 + 10), but with 23 instances of "invalid question"
Now let's compare results.
For question 1, both approaches give the same correct result.
For question 2, Date's scheme should have been able to determine that the correct answer is 75, but it couldn't because it couldn't count the 23 cases in which one of the operands was UNK--Date's scheme will return "undefined" (or "raise a runtime error") whenever it encounters an UNK value while evaluating a greater than or less than expression. However, in all 23 cases, Jones may indeed have beaten Smith. This demonstrates my point that there is information in the database that, given Date's scheme, he cannot provide to the user.
Note also that the correct answer (75) lies outside the range given by Date's scheme, so the answer we get using his scheme is not only vague, but actually incorrect. Date's scheme must count the 10 cases in which both numbers are UNK as ties, that is, as cases where the two values are equal! Being equal, they are not possible cases in which Jones beat Smith in Date's scheme. But in the real world (unfortunately for Date), because they are instances of unknown values, they are possible cases in which Jones beat Smith.
For question 3, Date's scheme also gives an incorrect answer. Jones and Smith definitely tied in four cases, not in 14! And why did Date's scheme give so obviously incorrect an answer? Again, because of Date's rule "that the comparison UNK=UNK ý returns true ý No three-valued logic here!"
Of course, understanding the limitations of Date's scheme, we could have phrased our query looking for definite ties as follows: "where column 1 = column 2, and neither column = UNK." This query would have returned the correct result (4). But this doesn't solve any problem in Date's scheme. It just puts the burden of compensating for the fundamental semantic errors in his scheme directly on the user.
For question 4, Date's scheme gives a correct answer. However, this answer is less informative than it could be because in all 23 cases it could be the case that the two players tied.
Note that the answer that a database interface using nulls and MVL would give are exactly those which I indicated are the clearly correct answers. With a three-valued logic, X = Y returns unknown when either or both operands are unknown, and similarly for X > Y and X < Y.
So for question 4, for example, a database interface using nulls and MVL can reason as follows: "I know that Jones and Smith tied in four cases. In 10 cases, I don't know either value, so they may have tied in all 10 of those cases as well. And in 23 cases, I don't know one value, so they may have tied in all of those cases, too. The total of all these cases is 37."
Date's attempt to convince us that in the real world we don't use three-valued logic, flies in the face of this kind of everyday reasoning. Yet again, Date has provided further illustration of the truth of my statement quoted earlier.
Complexity
In his final installment, Date anticipates an obvious objection: that his special values scheme makes queries a lot more complicated than they were before. I have three comments about this feature of his new scheme.
First, a cursory glance at Date's extensive criticisms of multivalued logic will show that underlying them is the repeated refrain that MVL is too complex, and hence too prone to generate errors in the real world. I therefore find it interesting that Date seems to be now hoist on his own petard of complexity. Perhaps he will explain to us why the complexity of MVL is bad, leading to errors in the real world, while the complexity his special values scheme requires the user to master is good--supposedly not leading to such errors.
Second, all this additional complexity in queries required by Date's scheme exists for only one reason: because the semantics of Date's scheme are either incorrect (for equals) or incapable of returning information that does exist in the database (for greater than and less than).
Finally, I do not doubt that Date will have many words to offer in reply. But I would suggest to the reader that he who publishes the last word does not necessarily have the better position. Let us remember that the Copernican revolution did not succeed because it proved to the Ptolemaic theory's most ardent advocates that their theory was wrong. It succeeded because the complexities of epicycles within epicycles, which new observations forced the Ptolemaic theory into, convinced everyone else but those ardent advocates that the theory was fatally flawed. I am content, therefore, to stand back and watch Date propose additional epicycles to these already quite elaborate query-related epicycles he requires for his new special values scheme.
A Practical Special Values Scheme
In my introduction, I expressed a desire to see a two-valued logic default values scheme whose clarity and ease of use was judged a good trade-off for reduced expressive power. Since Date has not given us one, let me briefly outline such a scheme myself.
First of all, I agree with Date that we should eliminate nullable attributes in our databases, as much as possible. With some reservations (which I will not explain here), I also agree that a good way to do this is to subtype entities which contain a nullable attribute into two subtype entities--one which contains the attribute as nonnullable, and the other which does not contain the attribute at all.
Second, I suggest that we reserve one or more values from an attribute's usual domain for the special values we wish to identify. A single value, meaning "real value missing" may be all we need. Perhaps a distinction between inapplicable, unknown, or "one or the other, not known which" is more appropriate. And in response to Date's frequently expressed argument that there are any number of different kinds of nulls and, therefore, any number of truth values which an MVL would have to accommodate, I would reply that it seems to me somewhat naive to argue that every different way of stating a value-missing condition, in English, must correspond to a formally distinct kind of null. For the position that all the different ways in which a value-missing condition can be stated in English (or any natural language) can be reduced to the three mentioned here, see Atzeni and DeAntonellis, Relational Database Theory (Benjamin/Cummings, 1993).
If possible, we should establish business rules that ensure that the value we choose is not a homonym. For example, if an enterprise enforces a rule that invoice amounts cannot be zero, then we can use zeroes in the invoice amount attribute to mean "real value missing."
However, sometimes this is not possible. Perhaps our enterprise must allow zero-amount invoices. In that case, "$000,000.00" in the attribute means both a) a zero-amount invoice, and b) "real value missing."
If we must make a homonym out of a value like this, we should be aware of the cost. Here, the cost is being unable to distinguish zero-amount invoices from invoices in which the actual amount is missing. This may well be too high a cost. In that case, we have other options, such as adding a flag to distinguish zero-amount invoices from all others or shifting to a less critical and very infrequently used real value--say "$999,999.98"--to mean "real value missing."
In this case, we have compensated for the homonymity, for we can now distinguish zero-amount invoices from invoices in which the invoice-amount is missing. What we now can't do is distinguish the (presumably less common) invoices for exactly $999,999.98 from invoices in which the amount is missing. Although we have not eliminated the cost of the homonymity; we have significantly reduced it.
We have been using this kind of reduced-cost strategy for decades, actually, and in doing so we have decided to let the semantic problem be resolved outside the system by users who understand the convention and the possibility of actually encountering instances of these homonyms.
Finally, note that the strategy of avoiding homonyms, using a value to mean "real value missing" that does not mean anything else, gives us exactly the expressive power (and exactly the complexities) of Date's special values scheme. It does so because it amounts to finding a representation of Date's UNK within the standard domain for an attribute, instead of requiring vendors to extend that domain.
With Date's new special values scheme, we have an alternative to MVL that can't get equality right, can't handle greater than or less than at all, consequently returns both incorrect results and less than fully informative results to queries, and turns simple queries into rocket-scientist queries.
Moreover, now that we understand how far from being implementable Date's scheme is, we can see that this extended debate between Date and me has little immediate applicability to the practical work of developing databases and interfaces to them. As designers and users, we will continue to work within the constraints provided by our dialects of SQL.
The more conservative among us will avoid nulls and MVL because of understandable caution at the complexity. As I argued throughout my articles on this topic, the use of the additional inferencing power of MVL is a foray into a more complex dialogue with the database--a foray which should not be undertaken unless one is comfortable handling the complexity.
To assist the conservatives, therefore, I have sketched a two-valued logic approach to missing information that is practical and can be used right now with today's products. However, I claim no originality here. Indeed, I offer this sketch as little more than an existing best practice, dressed up in some fancy logical terminology, and accompanied by some explicit caveats.
As for the more adventurous readers, they will make use of nulls, as provided by our SQL dialects and will probably get burned sometimes. But in the process, they will forge an understanding of nulls and MVL, which they might not obtain from any textbook study. They will begin the process of rendering the complexity concomitant with the greater expressive power of nulls and MVL less daunting to us all.
Tom Johnston is an independent consultant located in Atlanta. With a doctorate in philosophy and a strong background in relational theory and database technology, he provides logical and physical data modeling services to major corporate clients. You can reach him via email at tjohnston@acm.org.