
According to DateC.J. DateA discussion of some redundancies in SQLGrievous Bodily Harm
| |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| FIGURE 1. Suppliers and parts:sample data values | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
S
|
SP
|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
P
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
The result of this query, given the sample data of Figure 1, is shown in Figure 2.
Here by contrast is another formulation of the same query that makes no use of GROUP BY:
SELECT DISTINCT SP.P#, ( SELECT MAX ( SPX.QTY ) FROM SP AS SPX WHERE SPX.P# = SP.P# ) AS MXQ, ( SELECT MIN ( SPX.QTY ) FROM SP AS SPX WHERE SPX.P# = SP.P# ) AS MNQ FROM SP ;
| FIGURE 2. Result of query Q1 | |||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
Of course, it's true that this formulation is a little longer than the previous one, but logically the two are equivalent. And it's easy to generalize from this example, as follows: Given R { A, B, ... }, let agg be an aggregate function (such as SUM, MAX, or MIN) that's applicable to R.B. Then the expression:
SELECT R.A, agg ( R.B ) AS C FROM R GROUP BY R.A ;
can be logically transformed into the equivalent expression:
SELECT DISTINCT R.A, ( SELECT agg ( RX.B ) FROM R AS RX WHERE RX.A = R.A ) AS C FROM R ;
In what follows I'll refer to such a transformation as a Type 1 transformation for brevity.
Next we need to consider what happens if the original (GROUP BY) formulation includes a WHERE clause. Let's extend Q1 as follows:
Q2: For each part supplied, get the part number, maximum quantity, and minimum quantity supplied of that part; however, ignore shipments from supplier S1 throughout.
Here first is a GROUP BY formulation:
SELECT SP.P#, MAX ( SP.QTY ) AS MXQ, MIN ( SP.QTY ) AS MNQ FROM SP WHERE SP.S# <> 'S1' GROUP BY SP.P# ;
A formulation without GROUP BY--not the only one possible--is just a little bit more tricky:
SELECT DISTINCT SP.P#, ( SELECT MAX ( SPX.QTY ) FROM SP AS SPX WHERE SPX.P# = SP.P# AND SPX.S# <> 'S1' ) AS MXQ, ( SELECT MIN ( SPX.QTY ) AS MNQ FROM SP AS SPX WHERE SPX.P# = SP.P# AND SPX.S# <> 'S1' ) AS MNQ FROM SP WHERE SP.S# <> 'S1' ;
As you can see, the WHERE clause from the original GROUP BY formulation has to be replicated inside the two nested expressions in the SELECT clause (loosely speaking). The reason is that the WHERE clause in the original formulation governs both the SELECT clause and the GROUP BY clause in that formulation. And the reason for this state of affairs is that SQL requires the clauses to be written in a slightly illogical sequence! In general, the clauses of a given SELECT - FROM - WHERE - GROUP BY expression are evaluated in the sequence FROM - WHERE - GROUP BY - SELECT, and it might thus have made more sense if SQL had required them to be written in that sequence. But it didn't.
Be that as it may, the fact is that (as you can see) the Type 1 transformation rule needs to be extended slightly to take care of WHERE clauses. I'll skip the details here, because they're essentially straightforward.
Now let's modify the example again. Suppose the original query had been:
Q3: For each part supplied, get the maximum quantity supplied of that part, but not the part number.
GROUP BY formulation a): SELECT MAX ( SP.QTY ) AS MXQ FROM SP GROUP BY SP.P# ; Applying the Type 1 transformation rule, we obtain b): SELECT DISTINCT ( SELECT MAX ( SPX.QTY ) FROM SP AS SPX WHERE SPX.P# = SP.P# ) AS MXQ FROM SP ;
The result of evaluating these two SQL expressions is shown in Figure 3.
| FIGURE 3. Result of query Q3 (a) using GROUP BY;(b) avoiding GROUP BY | |||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
(a)
|
(b)
| ||||||||||||
SELECT CITY FROM S ;
produces a result involving "essential duplicates" (in general).
Let's think a little more carefully about Q3. What does that query really mean? I suggest that what it's really asking for is simply the set of maximum shipment quantities in SP. The formulation without GROUP BY correctly produces this information. It's true that it doesn't show which particular parts have which maximum quantities, but that information wasn't requested.
The GROUP BY formulation also produces the same information, of course (that is, the maximum shipment quantities, again without showing which parts correspond to which maximum quantities). To that extent, we might agree that the formulation is acceptable. However, some people--not me!--might go on to argue that it also produces the information that there are six different parts, and hence that it's preferable to the formulation without GROUP BY. Note very carefully, however, that it provides that additional information in a nonrelational form. (It does it by means of "essential duplicates," as already noted.) In my opinion, if we actually wanted to obtain that additional information, then we should have asked a different query, one that would have produced that information in proper relational form (Q1 would have sufficed). I'll come back to this point in a moment.
Note: Of course, we might try to preserve the equivalence of the two formulations by dropping the DISTINCT from the second one (the one without the GROUP BY). But this ploy, fairly obviously, doesn't solve anything. First, the result will now contain 12 rows, not three and not six (so the revised formulation isn't logically equivalent to either of the other two). Second, those 12 rows will again include "essential duplicates," and the formulation (like the original GROUP BY formulation) will thus again not be truly relational.
Finally, I feel bound to point out that (in my opinion) queries such as Q3, though legal, aren't very sensible. I say this because--almost by definition--such queries effectively ask for information to be thrown away that we probably don't want to be thrown away. (In the example, what's being thrown away is the information as to which parts have which maximum quantities.) In conventional SQL terms, such queries typically lead to GROUP BY formulations in which some column is mentioned in the GROUP BY clause and not in the corresponding SELECT clause. And such formulations are precisely the ones for which our Type 1 transformation rule doesn't quite "work right." The rule does work right for "sensible" queries.
Now let's take a look at the HAVING clause. Here's a query for which most people would "naturally" use a HAVING clause:
Q4: For each part supplied by more than one supplier, get the part number.
A GBH formulation of this query might look like this (recall that GBH stands for GROUP BY with HAVING):
SELECT SP.P# FROM SP GROUP BY SP.P# HAVING COUNT(*) > 1 ;
The result of this query, given the sample data of Figure 1, is shown in Figure 4.
| FIGURE 4. Result of Query Q4 | ||||||
|---|---|---|---|---|---|---|
| ||||||
SELECT DISTINCT SP.P# FROM SP WHERE ( SELECT COUNT(*) FROM SP AS SPX WHERE SPX.P# = SP.P# ) > 1 ;
Once again it's true that this formulation is longer than the GBH version, but logically the two are equivalent. And once again it's easy to generalize from this example. Using the same notation as in the previous section, the expression:
SELECT R.A FROM R GROUP BY R.A HAVING agg ( R.B ) comp scalar ;
(where comp is some scalar comparison operator and scalar is some scalar-valued expression) can be logically transformed into the equivalent expression:
SELECT DISTINCT R.A FROM R WHERE ( SELECT agg ( R.B ) FROM R AS RX WHERE RX.A = R.A ) comp scalar ;
In what follows I'll refer to such a transformation as a Type 2 transformation.
Again we need to consider what happens if the original (HAVING) formulation includes a WHERE clause. Let's extend Q4 as follows:
Q5: For each part supplied by more than one supplier (not counting supplier S1), get the part number.
Here's a GBH formulation:
SELECT SP.P# FROM SP WHERE SP.S# <> 'S1' GROUP BY SP.P# HAVING COUNT(*) > 1 ;
Again the non-GBH equivalent is a just a little tricky:
SELECT DISTINCT SP.P# FROM SP WHERE SP.S# <> 'S1' AND ( SELECT COUNT(*) FROM SP AS SPX WHERE SPX.P# = SP.P# AND SPX.S# <> 'S1' ) > 1 ;
Observe that the WHERE clause from the original GBH formulation has to be replicated inside the nested expression in the WHERE clause in the non-GBH formulation (again, loosely speaking). The reason is that the WHERE clause in the GBH formulation governs both the GROUP BY clause and the HAVING clause in that formulation. Thus the Type 2 transformation rule (like the Type 1 transformation rule) needs to be extended to take care of WHERE clauses. Again I'll skip the details here.
Here now is a slightly more complicated example:
Q6: For each part supplied by more than one supplier, get the part number and the total quantity supplied of that part.
Here is the GBH formulation:
SELECT SP.P#, SUM ( SP.QTY ) AS TQY FROM SP GROUP BY SP.P# HAVING COUNT(*) > 1 ;
Applying both the Type 1 and Type 2 transformation rules, we obtain:
SELECT DISTINCT SP.P#, ( SELECT SUM ( SPX.QTY ) FROM SP AS SPX WHERE SPX.P# = SP.P# ) AS TQY FROM SP WHERE ( SELECT COUNT(*) FROM SP AS SPX WHERE SPX.P# = SP.P# ) > 1 ;
And one last example:
Q7: For each part supplied by more than one supplier, get the total quantity supplied of that part, but not the part number.
GBH formulation a):
SELECT SUM ( SP.QTY ) AS TQY FROM SP GROUP BY SP.P# HAVING COUNT(*) > 1 ;
Transformed version b):
SELECT DISTINCT ( SELECT SUM ( SPX.QTY ) FROM SP AS SPX WHERE SPX.P# = SP.P# ) AS TQY FROM SP WHERE ( SELECT COUNT(*) FROM SP AS SPX WHERE SPX.P# = SP.P# ) > 1 ;
The result of evaluating each of these expressions is shown in Figure 5.
| FIGURE 5. Result of query Q7 (a) using GROUP BY and HAVING;(b) not using GROUP BY or HAVING. | |||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|
(a)
|
(b)
| ||||||||||
REFERENCES
C. J. Date is an independent author, lecturer, researcher, and consultant, specializing in relational database systems. His most recent book, with Hugh Darwen, is A Guide to the SQL Standard (4th edition, Addison-Wesley, 1997). Correspondence may be sent to him in care of Database Programming & Design, 411 Borel Ave., Ste. 100, San Mateo, CA 94402.