| |
According to Date C.J. Date
Continuing our discussion of redundancy in SQL
Grievous Bodily Harm (Part 2 of 2)
In my last column, I showed that the SQL GROUP BY and HAVING clauses (generically abbreviated GBH) are effectively redundant--meaning that any sensible query that can be expressed in SQL using either or both of these clauses can also be expressed without them. This month I want to say a little more about the HAVING clause. I also want to discuss the redundancy of range variables, and finally, I want to point out some of the implications of such redundancies.
As I did last time, I'll base my examples on the familiar "suppliers and parts" database:
S { S#, SNAME, STATUS, CITY }
PRIMARY KEY { S# }
P { P#, PNAME, COLOR, WEIGHT, CITY }
PRIMARY KEY { P# }
SP { S#, P#, QTY }
PRIMARY KEY { S#, P# }
FOREIGN KEY { S# } REFERENCES S
FOREIGN KEY { P# } REFERENCES P
HAVING Without GROUP BY (I)
It is a little-known (and little-understood) fact that it's legal for a SELECT expression in SQL to include a HAVING clause without a corresponding GROUP BY clause. For example, consider the query:
Q8: Get the total shipment quantity for all parts, if and only if the minimum quantity in which any part is supplied is greater than 50.
(I'm continuing the query numbering from last month's installment.) A possible SQL formulation of this query is:
SELECT SUM ( SP.QTY ) AS TQY
FROM SP
HAVING MIN ( SP.QTY ) > 50 ;
Given the sample data values shown last month, the result of this query is a relation of one column, called TQY, and one row, containing the value 3,100.
Explanation: Because there's no GROUP BY clause, the HAVING clause is regarded as applying to a "grouped" version of SP that contains exactly one group. If the HAVING clause evaluates to true for that group--which, given the sample data values from our last discussion, it does--then the SELECT clause returns the requested maximum value from that group. By contrast, if the HAVING clause had evaluated to false for that group, that group would then have been discarded; there would therefore have been no group to evaluate the SELECT clause against, and the final result would thus have been empty. (More precisely, it would have been a relation of one column, called TQY, containing no rows.)
What about a formulation that avoids the use of HAVING? The obvious one to try (using a simplified "Type 2 transformation") looks like this:
SELECT DISTINCT SUM ( SP.QTY ) AS TQY
FROM SP
WHERE ( SELECT MIN ( SP.QTY ) FROM SP ) > 50 ;
Unfortunately, this latter formulation is not logically equivalent to the previous one. To see why, suppose the minimum quantity had to be greater than 500 instead of 50. As I've already explained, the formulation with the HAVING clause will now produce a result containing one column and no rows (because, given the sample data values from last month, that HAVING clause will now evaluate to false). In contrast, the formulation without a HAVING clause will produce a result containing one column and one row (containing a null) because the WHERE clause evaluates to false for every row of SP, and the SELECT clause is therefore evaluated against an empty relation (instead of against no groups).
Note: The null in that result row is, of course, due to a logical mistake in SQL--namely, SQL's incorrect specification that the SUM of no numbers at all is null (instead of what it should be, zero).
It turns out that a formulation of the query that avoids the use of HAVING but is equivalent to the HAVING formulation does exist. However, it's a little subtle:
SELECT DISTINCT ( SELECT SUM ( SP.QTY ) FROM SP )
AS TQY
FROM SP
WHERE ( SELECT MIN ( SP.QTY ) FROM SP ) > 50 ;
In this formulation, if the (outer) FROM and WHERE clauses together produce an empty relation, then--believe it or not!--so does the overall expression. The reason is that the sole item mentioned in the SELECT clause is not (as it might appear) a reference to the aggregate function SUM, but rather a scalar subquery that happens to contain such a reference. The cardinality of the final result (that is, the number of rows in that result) is thus no different from what it would be if we were to replace that scalar subquery by any other scalar expression, such as SP.P#, or SP.QTY, or even just a literal. To spell it out in detail, what happens is this:
- Assume that the WHERE clause does indeed evaluate to false for every row in SP.
- Then the FROM and WHERE clauses together produce an empty relation (a relation with no rows).
- The subquery in the SELECT clause, of course, yields the value 3,100, as already noted under the discussion of query Q8. (More precisely, it yields a one-column, one-row relation containing that numeric value, but SQL then effectively extracts that numeric value from that relation. This nicety need not concern us too much here, though it does cause problems elsewhere in SQL.)
- So the query formulation under discussion is logically equivalent to the following expression:
SELECT 3100 AS TQY
FROM empty ;
(where empty is an empty relation). And this expression obviously returns a relation of one column (TQY) and no rows.
Now we can formulate another transformation rule, as follows: Given R { A, B ę }, let agg1 and agg2 be aggregate functions that are applicable to R.A and R.B, respectively. Then the expression:
SELECT agg1 ( R.A ) AS C
FROM R
HAVING agg2 ( R.B ) comp scalar ;
(where comp is some scalar comparison operator and scalar is some scalar-valued expression) can be logically transformed into the equivalent expression:
SELECT DISTINCT ( SELECT agg1 ( R.A ) FROM R ) AS C
FROM R
WHERE ( SELECT agg2 ( R.B ) FROM R ) comp scalar ;
In what follows, I'll refer to such a transformation as a Type 3 transformation. Note: Consideration of what happens if the HAVING formulation includes a WHERE clause is left as an exercise for the reader.
HAVING WITHOUT GROUP BY (II)
There's another point to be made in connection with HAVING without GROUP BY. The sad fact is that SQL suffers from a logical error (another one!) in this area. Consider the following expression:
SELECT SUM ( SP.QTY ) AS TQY
FROM SP
HAVING 0 = 0 ;
Suppose SP currently contains no rows at all. Logically then, the "grouped" version of SP to which the SELECT and HAVING clauses are evaluated contains no groups--certainly no groups can be derived from no rows!--and so the correct result is a relation containing one column and no rows. SQL, however, gives a result containing one column and one row (containing a null). The reason is that SQL (quite incorrectly) still regards the "grouped" version of SP as containing exactly one group (albeit an empty one); the HAVING clause evaluates (trivially) to true for that group, so the SELECT clause is evaluated against that group (instead of against no groups).
Exercise for the reader: Here's an "equivalent" formulation of the foregoing query that avoids the use of HAVING, obtained by performing a Type 3 transformation. Do the foregoing criticisms apply to this formulation? If they don't, then the second formulation is preferable, of course.
SELECT DISTINCT ( SELECT SUM ( SP.QTY ) FROM SP )
AS TQY
FROM SP
WHERE 0 = 0 ;
To close the discussions of these two sections, perhaps I should say that it seems to me that--like queries Q3 and Q7 last month--queries that can be formulated with HAVING but without GROUP BY tend not to be very "sensible" ones anyway.
RANGE VARIABLES
Now I want to turn my attention to a different topic. Consider the following query:
Q9: Get all pairs of supplier numbers such that the two suppliers concerned are located in the same city.
A possible SQL formulation for this query is as follows:
SELECT FIRST.S# AS SX,
SECOND.S# AS SY
FROM S AS FIRST,
S AS SECOND
WHERE FIRST.CITY = SECOND.CITY ;
FIRST and SECOND here are examples of what SQL calls correlation names. Note, however, that SQL has never said what it is that such names name! In relational calculus, by contrast, such names are defined specifically to refer to range variables, so I'll use this latter term in what follows.
A range variable is a variable that ranges over some specified relation; in other words, it's a variable whose permitted values are rows of the relation in question. Thus, if range variable r ranges over relation R, then at any given time the expression "r" stands for some row of R. In SQL, range variables are introduced as shown in the example above--namely, by means of an AS specification within the FROM clause. Note: Don't confuse AS specifications in the FROM clause and AS specifications in the SELECT clause! An AS specification in the SELECT clause introduces a name for a column, as many of our earlier examples (both this month and last) have already illustrated. By contrast, an AS specification in the FROM clause introduces a name for a range variable, as I've just explained.
The point I want to make now is that, just like the GROUP BY and HAVING clauses, range variables are also logically redundant in SQL--notwithstanding the fact that I've been making fairly heavy use of them in previous examples (both this month and last). To illustrate the point, here first is a formulation of query Q9 that avoids range variables:
SELECT SX, SY
FROM ( SELECT S.S# AS SX, S.CITY
FROM S ) AS POINTLESS1
NATURAL JOIN
( SELECT S.S# AS SY, S.CITY
FROM S ) AS POINTLESS2
Note: It's true that this formulation in fact does involve two range variables, POINTLESS1 and POINTLESS2, but logically it ought not to. Certainly those range variables are never referenced. They're there only because SQL has a syntax rule saying they must be.
For another example, I return to query Q1 from last month:
Q1: For each part supplied, get the part number, maximum quantity, and minimum quantity supplied of that part.
Here's a formulation that avoids range variables (as well as the GROUP BY and HAVING clauses):
SELECT DISTINCT PZ AS P#,
( SELECT MAX ( SP.QTY )
FROM SP
WHERE SP.P# = PZ ) AS MXQ,
( SELECT MIN ( SP.QTY )
FROM SP
WHERE SP.P# = PZ ) AS MNQ
FROM ( SELECT SP.P# AS PZ
FROM SP ) AS POINTLESS ;
(As with the previous example, this formulation does in fact involve a range variable, POINTLESS, but logically it ought not to. At least that range variable is never referenced.)
To close this section, let me point out that range variables must be logically unnecessary in SQL, because:
- The relational algebra has no range variables.1
- Every relational algebra expression has an exact counterpart in SQL that also involves no range variables (except for ones that are never referenced and are there just because of a weird SQL syntax rule).
- SQL includes no relational functionality that the relational algebra doesn't.
DISCUSSION
I've tried to show that GBH queries and range variables are essentially redundant in SQL.
It's interesting to note in passing that these aspects of SQL are among the hardest to teach,
understand, learn, and apply properly--as I know from direct experience. Be that as it may,
I'd like to focus now on GBH queries in particular; more specifically, Ięd like to suggest
that the alternative formulations are often actually preferable to the GBH versions. Letęs
consider query Q4 again from last month:
Q4: For each part supplied by more than one supplier, get the part number.
Here are the GBH and non-GBH formulations I gave last month:
SELECT SP.P#
FROM SP
GROUP BY SP.P#
HAVING COUNT(*) > 1 ;
SELECT DISTINCT SP.P#
FROM SP
WHERE ( SELECT COUNT(*)
FROM SP AS SPX
WHERE SPX.P# = SP.P# ) > 1 ;
It seems to me that:
- The non-GBH version is at least logically cleaner and simpler than the GBH version in that it does not require the additional language constructs under discussion (the GROUP BY and HAVING clauses).
- Itęs certainly not clear from the original statement of the problem that grouping per se is whatęs needed to express the query in SQL (and indeed it isnęt).
- Itęs far from obvious that a HAVING clause is needed rather than a WHERE clause (and indeed it isnęt).
The GBH version begins to look more like a prescription for solving the problemęthat is, a series of steps that must be gone through in order to find the answer to the problemęinstead of just a description of what the problem is. In other words, it looks procedural rather than declarative. And of course, a general objective for the relational model has always been to prefer declarative formulations over procedural ones, for a variety of well-known reasons (the most important being that declarative means the system does the work, whereas procedural means the user does the work).
Of course, thereęs no denying that the GBH version is more succinct than the non-GBH version. But if the sole advantage of GBH formulations is succinctness, it would have been better if such formulations had been defined as shorthands, which (to my knowledge) they never were. That way, the anomalies identified in our discussions of the transformation rules (Types 1, 2, and 3) this month and last would (probably) never have occurred, and implementation would have been easier too (again probably). Note: I should mention that the relational analog of SQLęs GROUP BY and HAVING, namely the SUMMARIZE operator, is defined as a shorthand.
In closing, I must point out that SQL includes many other redundancies beyond the ones Ięve been discussing (for example, the SQL EXISTS operator is 100 percent redundant). As a direct result, all but the most trivial of queries can be expressed in SQL in a huge number of different ways. Even a query as simple as ęGet names of suppliers who supply part P2ę can easily be expressed in at least eight different ways, all of them at least superficially distinct.2 (And those eight different ways are ways that use only features available in the original SQL standard SQL-86! The number increases dramatically when new features introduced with SQL-92 are taken into account.)
Why is this state of affairs undesirable? First, such redundancy makes the language bigger than it needs to be, with obvious implications for documentation, implementation, teaching, learning, and so on. In particular, the fact that a given query can be formulated in so many different ways has the serious negative consequence that users will often have to spend time and effort trying to find the ębestę formulation (by which I mean the formulation that performs best), which is exactly one of the things the relational model was trying to avoid in the first place.
Of course, this latter criticism wonęt be valid if all formulations perform equally well, but this possibility seems unlikely. (Itęs doubtful whether the optimizer will be that good.) Indeed, the point is worth stressing that all these redundancies make SQL much harder to implement (especially to optimize), as well as to teach, learn, remember, and apply. And this state of affairs is really rather strange, given that the people responsible for the design of SQL are, first and foremost, representatives of the DBMS vendors, who are precisely the ones who have to do the implementing!
Let me note finally that these redundancy problems are not easy to fix. What Hugh Darwen calls the shackle of compatibility means that, once a feature has been incorporated into a language, it can never be taken out againębecause, of course, existing programs will fail if it is.3 Thatęs why itęs so important to get languages right first time! Itęs also one reason why language design is hard.
References
1. Date, C.J. An Introduction to Database Systems (6th edition). Addison-Wesley 1995.
2. Date, C.J. ęWhatęs Wrong with SQL?ę Relational Database Writings 1985-1989. Addison-Wesley 1990.
3. Darwen, H. ęThe Askew Wall: A Personal Perspective.ę Relational Database Writings 1989-1991. Addison-Wesley, 1992.
C. J. Date is an independent author, lecturer, researcher, and consultant specializing in relational database systems. His most recent books, all published by Addison-Wesley, are An Introduction to Database Systems (6th edition, 1995); Relational Database Writings 1991-1994 (1995); and (with Hugh Darwen) A Guide to the SQL Standard (4th edition, 1997). Correspondence may be sent to him in care of Database Programming & Design, 411 Borel Ave., Ste. 100, San Mateo, CA 94402.
| |