According to Date

C.J. Date


A discussion of some redundancies in SQL

Grievous Bodily Harm
(Part 1 of 2)

Redundant: de trop, diffuse, excessive, extra, inessential, inordinate, padded, periphrastic, pleonastical, prolix, repetitious, supererogatory, superfluous, supernumerary, surplus, tautological, unemployed, unnecessary, unneeded, unwanted, verbose, wordy--from Chambers Twentieth Century Thesaurus (which also gives the following nice list of antonyms: concise, essential, necessary).
      Did you know that the GROUP BY and HAVING clauses (hereafter generically abbreviated to GBH) are redundant in SQL? That is, any sensible query that can be expressed in SQL using either or both of these clauses can also be expressed without them. (I'll explain later what I mean by the qualifier "sensible" here!) This month and next, I want to illustrate this point and discuss some of its implications.
 

THE GROUP BY CLAUSE

As usual, I'll use the familiar suppliers-and-parts database as the basis for my examples:

S { S#, SNAME, STATUS, CITY }
PRIMARY KEY { S# }
P { P#, PNAME, COLOR, WEIGHT, CITY }
PRIMARY KEY { P# }
SP { S#, P#, QTY }
PRIMARY KEY { S#, P# }
FOREIGN KEY { S# } REFERENCES S
FOREIGN KEY { P# } REFERENCES P


      Here is a query against this database for which most people would "naturally" use a GROUP BY clause:
      Q1: For each part supplied, get the part number, maximum quantity, and minimum quantity supplied of that part.
      The "natural" (GROUP BY) formulation of this query might look like this:

SELECT SP.P#, 
MAX ( SP.QTY ) AS MXQ, 
MIN ( SP.QTY ) AS MNQ
FROM SP
GROUP BY SP.P# ;

 
FIGURE 1. Suppliers and parts:sample data values
S
S#
SNAME

STATUS

CITY

S1SMITH20LONDON
S2JONES10PARIS
S3BLAKE30PARIS
S4CLARK20LONDON
S5ADAMS30ATHENS
SP
S#
P#
QTY
S1P1300
S1P2200
S1P3400
S1P4200
S1P5100
S1P6100
S2P1300
S2P2400
S3P2200
S4P2200
S4P4300
S4P5400
P
P#
PNAME
COLOR
WEIGHT
CITY
P1NutRed12London
P2BoltGreen17Paris
P3ScrewBlue17Rome
P4ScrewRed14London
P5CamBlue12Paris
P6CogRed19Rome


      The result of this query, given the sample data of Figure 1, is shown in Figure 2.
      Here by contrast is another formulation of the same query that makes no use of GROUP BY:

SELECT DISTINCT SP.P#, 
( SELECT MAX ( SPX.QTY ) 
FROM SP AS SPX
WHERE SPX.P# = SP.P# ) AS MXQ,
( SELECT MIN ( SPX.QTY ) 
FROM SP AS SPX
WHERE SPX.P# = SP.P# ) AS MNQ
FROM SP ;
FIGURE 2. Result of query Q1
P#
MXQ
MNQ
P1300300
P2400200
P3400400
P4300200
P5400100
P6100100


      Of course, it's true that this formulation is a little longer than the previous one, but logically the two are equivalent. And it's easy to generalize from this example, as follows: Given R { A, B, ... }, let agg be an aggregate function (such as SUM, MAX, or MIN) that's applicable to R.B. Then the expression:

SELECT R.A, 
agg ( R.B ) AS C
FROM R
GROUP BY R.A ;

can be logically transformed into the equivalent expression:

SELECT DISTINCT R.A, 
( SELECT agg ( RX.B ) 
FROM R AS RX
WHERE RX.A = R.A ) AS C
FROM R ;

In what follows I'll refer to such a transformation as a Type 1 transformation for brevity.
      Next we need to consider what happens if the original (GROUP BY) formulation includes a WHERE clause. Let's extend Q1 as follows:
      Q2: For each part supplied, get the part number, maximum quantity, and minimum quantity supplied of that part; however, ignore shipments from supplier S1 throughout.
      Here first is a GROUP BY formulation:

SELECT SP.P#, 
MAX ( SP.QTY ) AS MXQ, 
MIN ( SP.QTY ) AS MNQ
FROM SP
WHERE SP.S# <> 'S1'
GROUP BY SP.P# ;


      A formulation without GROUP BY--not the only one possible--is just a little bit more tricky:

SELECT DISTINCT SP.P#, 
( SELECT MAX ( SPX.QTY ) 
FROM SP AS SPX
WHERE SPX.P# = SP.P# 
AND SPX.S# <> 'S1' ) AS MXQ,
( SELECT MIN ( SPX.QTY ) AS MNQ
FROM SP AS SPX
WHERE SPX.P# = SP.P# 
AND SPX.S# <> 'S1' ) AS MNQ
FROM SP 
WHERE SP.S# <> 'S1' ;


      As you can see, the WHERE clause from the original GROUP BY formulation has to be replicated inside the two nested expressions in the SELECT clause (loosely speaking). The reason is that the WHERE clause in the original formulation governs both the SELECT clause and the GROUP BY clause in that formulation. And the reason for this state of affairs is that SQL requires the clauses to be written in a slightly illogical sequence! In general, the clauses of a given SELECT - FROM - WHERE - GROUP BY expression are evaluated in the sequence FROM - WHERE - GROUP BY - SELECT, and it might thus have made more sense if SQL had required them to be written in that sequence. But it didn't.
      Be that as it may, the fact is that (as you can see) the Type 1 transformation rule needs to be extended slightly to take care of WHERE clauses. I'll skip the details here, because they're essentially straightforward.
      Now let's modify the example again. Suppose the original query had been:
      Q3: For each part supplied, get the maximum quantity supplied of that part, but not the part number.

GROUP BY formulation a):
SELECT MAX ( SP.QTY ) AS MXQ 
FROM SP
GROUP BY SP.P# ;
Applying the Type 1 transformation rule, we obtain b):
SELECT DISTINCT ( SELECT MAX ( SPX.QTY )
FROM SP AS SPX
WHERE SPX.P# = SP.P# ) AS MXQ 
FROM SP ;


      The result of evaluating these two SQL expressions is shown in Figure 3.
 

FIGURE 3. Result of query Q3 (a) using GROUP BY;(b) avoiding GROUP BY
(a)
MXQ

300
400
400
300
400
100
(b)
MXQ
300
400
100

 

      As you can see, the two expressions don't quite yield the same result, so they aren't quite equivalent; in other words, the Type 1 transformation is invalid in this particular case. But the real reason for this lack of equivalence is that the GROUP BY formulation produces a result that's not a relation! It's not a relation because it contains duplicate rows. What's more, those duplicates are essential, in the sense that, for example, the two "300" rows have different meanings:1 One means some part is supplied in a maximum quantity of 300; the other means some other part is also supplied in that same maximum quantity. And "essential duplicates" constitute a very major departure from the precepts of the relational model! In fact, they constitute a violation of the Information Principle, which states that all information must be cast explicitly in terms of values in relations and in no other way.2 Indeed, the fact that SQL includes the notion of "essential duplicates" constitutes very strong evidence that SQL is not now, and never was, truly relational.
      Note: I should point out in passing that "essential duplicates" arise in SQL in many other contexts in addition to the one under discussion. For example, even such a simple expression as:
SELECT CITY
FROM S ;

produces a result involving "essential duplicates" (in general).
      Let's think a little more carefully about Q3. What does that query really mean? I suggest that what it's really asking for is simply the set of maximum shipment quantities in SP. The formulation without GROUP BY correctly produces this information. It's true that it doesn't show which particular parts have which maximum quantities, but that information wasn't requested.
      The GROUP BY formulation also produces the same information, of course (that is, the maximum shipment quantities, again without showing which parts correspond to which maximum quantities). To that extent, we might agree that the formulation is acceptable. However, some people--not me!--might go on to argue that it also produces the information that there are six different parts, and hence that it's preferable to the formulation without GROUP BY. Note very carefully, however, that it provides that additional information in a nonrelational form. (It does it by means of "essential duplicates," as already noted.) In my opinion, if we actually wanted to obtain that additional information, then we should have asked a different query, one that would have produced that information in proper relational form (Q1 would have sufficed). I'll come back to this point in a moment.
      Note: Of course, we might try to preserve the equivalence of the two formulations by dropping the DISTINCT from the second one (the one without the GROUP BY). But this ploy, fairly obviously, doesn't solve anything. First, the result will now contain 12 rows, not three and not six (so the revised formulation isn't logically equivalent to either of the other two). Second, those 12 rows will again include "essential duplicates," and the formulation (like the original GROUP BY formulation) will thus again not be truly relational.
      Finally, I feel bound to point out that (in my opinion) queries such as Q3, though legal, aren't very sensible. I say this because--almost by definition--such queries effectively ask for information to be thrown away that we probably don't want to be thrown away. (In the example, what's being thrown away is the information as to which parts have which maximum quantities.) In conventional SQL terms, such queries typically lead to GROUP BY formulations in which some column is mentioned in the GROUP BY clause and not in the corresponding SELECT clause. And such formulations are precisely the ones for which our Type 1 transformation rule doesn't quite "work right." The rule does work right for "sensible" queries.
 

THE HAVING CLAUSE

Now let's take a look at the HAVING clause. Here's a query for which most people would "naturally" use a HAVING clause:
      Q4: For each part supplied by more than one supplier, get the part number.
      A GBH formulation of this query might look like this (recall that GBH stands for GROUP BY with HAVING):

SELECT SP.P# 
FROM SP
GROUP BY SP.P# 
HAVING COUNT(*) > 1 ;


      The result of this query, given the sample data of Figure 1, is shown in Figure 4.
 

FIGURE 4. Result of Query Q4
P#
P1
P2
P4
P5

 

      Here by contrast is a non-GBH formulation of the same query (that is, one that makes no use of GROUP BY or HAVING):
SELECT DISTINCT SP.P# 
FROM SP
WHERE ( SELECT COUNT(*)
FROM SP AS SPX
WHERE SPX.P# = SP.P# ) > 1 ;


      Once again it's true that this formulation is longer than the GBH version, but logically the two are equivalent. And once again it's easy to generalize from this example. Using the same notation as in the previous section, the expression:

SELECT R.A 
FROM R
GROUP BY R.A 
HAVING agg ( R.B ) comp scalar ;

(where comp is some scalar comparison operator and scalar is some scalar-valued expression) can be logically transformed into the equivalent expression:

SELECT DISTINCT R.A 
FROM R
WHERE ( SELECT agg ( R.B )
FROM R AS RX
WHERE RX.A = R.A ) comp scalar ;


      In what follows I'll refer to such a transformation as a Type 2 transformation.
      Again we need to consider what happens if the original (HAVING) formulation includes a WHERE clause. Let's extend Q4 as follows:
      Q5: For each part supplied by more than one supplier (not counting supplier S1), get the part number.
      Here's a GBH formulation:

SELECT SP.P# 
FROM SP
WHERE SP.S# <> 'S1'
GROUP BY SP.P# 
HAVING COUNT(*) > 1 ;


      Again the non-GBH equivalent is a just a little tricky:

SELECT DISTINCT SP.P# 
FROM SP
WHERE SP.S# <> 'S1' 
AND ( SELECT COUNT(*)
FROM SP AS SPX
WHERE SPX.P# = SP.P# 
AND SPX.S# <> 'S1' ) > 1 ;


      Observe that the WHERE clause from the original GBH formulation has to be replicated inside the nested expression in the WHERE clause in the non-GBH formulation (again, loosely speaking). The reason is that the WHERE clause in the GBH formulation governs both the GROUP BY clause and the HAVING clause in that formulation. Thus the Type 2 transformation rule (like the Type 1 transformation rule) needs to be extended to take care of WHERE clauses. Again I'll skip the details here.
      Here now is a slightly more complicated example:
      Q6: For each part supplied by more than one supplier, get the part number and the total quantity supplied of that part.
      Here is the GBH formulation:

SELECT SP.P#, 
SUM ( SP.QTY ) AS TQY 
FROM SP
GROUP BY SP.P# 
HAVING COUNT(*) > 1 ;


      Applying both the Type 1 and Type 2 transformation rules, we obtain:

SELECT DISTINCT SP.P#, 
( SELECT SUM ( SPX.QTY ) 
FROM SP AS SPX
WHERE SPX.P# = SP.P# ) AS TQY
FROM SP 
WHERE ( SELECT COUNT(*)
FROM SP AS SPX
WHERE SPX.P# = SP.P# ) > 1 ;


      And one last example:
      Q7: For each part supplied by more than one supplier, get the total quantity supplied of that part, but not the part number.
      GBH formulation a):

SELECT SUM ( SP.QTY ) AS TQY 
FROM SP
GROUP BY SP.P# 
HAVING COUNT(*) > 1 ;


      Transformed version b):

SELECT DISTINCT ( SELECT SUM ( SPX.QTY ) 
FROM SP AS SPX
WHERE SPX.P# = SP.P# ) AS TQY
FROM SP 
WHERE ( SELECT COUNT(*)
FROM SP AS SPX
WHERE SPX.P# = SP.P# ) > 1 ;


      The result of evaluating each of these expressions is shown in Figure 5.
 

FIGURE 5. Result of query Q7 (a) using GROUP BY and HAVING;(b) not using GROUP BY or HAVING.
(a)
TQY

600
1000
500
500
(b)
TQY
600
1000
500

 

      Once again the two expressions don't quite yield the same result, and therefore aren't quite equivalent. But--as I'm sure you realize--the situation here is precisely analogous to the one already discussed under Q3. In other words, Q7 suffers from the same kinds of problems as Q3 does; our discussions of Q3 in the previous section are therefore applicable here too, and no further commentary seems necessary.

REFERENCES

  1. Date, C. J. "Essentiality," Relational Database Writings 1991-1994. Addison-Wesley, 1995. This paper includes a definition and an extended discussion of the concept of essentiality. (It doesn't limit itself to just the "essential duplicates" discussed in the body of this month's column.)
  2. Codd, E.F. The Relational Model for Database Management Version 2. Addison-Wesley, 1990.

C. J. Date is an independent author, lecturer, researcher, and consultant, specializing in relational database systems. His most recent book, with Hugh Darwen, is A Guide to the SQL Standard (4th edition, Addison-Wesley, 1997). Correspondence may be sent to him in care of Database Programming & Design, 411 Borel Ave., Ste. 100, San Mateo, CA 94402.
 

 

 
search - home - archives - contacts - site index
 

Copyright 1997 Miller Freeman Inc. All Rights Reserved
Redistribution without permission is prohibited.

Questions? Comments? We would love to hear from you!