Tables containing time-varying information often become quite large. You can sometimes attribute the excess volume to the fragmentation of the entities' histories; each history might be partially represented in each of several rows. In such situations you may want to merge these fragmented histories into maximal periods, to reduce the number of rows necessary to represent the same information. Such an operation is termed "temporal coalescing."
Be aware that the SQL-92 COALESCE operator, a shorthand of CASE that replaces NULL values with other values, is unrelated to temporal coalescing. The term "coalescing" in this column refers solely to temporal coalescing.
Consider Table 1 of an INCUMBENTS table, first introduced in last month's column. All five rows have the same values for the employee's social security number (SSN) and position code number (PCN) columns. In the terminology I introduced in the June 1998 issue of DBPD , these five rows are all value-equivalent. The second and third rows also have the same values for the START_DATE and END_DATE columns; these rows classify as non-sequenced duplicates.
Now consider the third and fourth rows. As just observed, these rows are value-equivalent, yet are not non-sequenced duplicates, because their periods of validity are not identical. Neither are they sequenced duplicates, because their periods of validity do not overlap. You can merge these rows because their periods of validity meet; that is, the END_DATE of row 3 equals the START_DATE of row 4.
The operation of temporal coalescing combines value-equivalent rows into a single row with a maximal period of validity. Coalescing would merge these two rows into a single row valid from April 1, 1996, to December 31, 1997 (note that we use a closed-open representation, where the END_DATE is the day after the last day of the period).
Temporal coalescing and duplicate elimination are somewhat orthogonal operations, because in most cases eliminating duplicates (of whatever variant) does not result in a coalesced table, and coalescing can be done without removing duplicates. This column considers the combined procedure of coalescing with sequenced duplicate elimination. To illustrate, coalescing Table 1 while removing duplicates results in Table 2, merging five rows into one.
Consider a timeslice at January 3, 1996, of Table 1. The snapshot contains a single row, (111223333, 120033). The timeslice at that day on Table 2 also contains that single row. Now consider May 25, 1996. The timeslice at that date on Table 1 contains three rows (the reader should verify this), but the timeslice of Table 2 has a single row. The duplicates present on that day have been eliminated.
When you coalesce and remove sequenced duplicates, only one resulting table is possible. Each row will have a maximal period. Although Table 2 above happens to contain no duplicates of any kind, value-equivalent rows are possible as long as they don't overlap or meet.
I'll explain the intuition behind this first solution before presenting it. The idea is to identify overlapping or adjacent time periods associated with the same SSN and PCN values and then to merge those periods by extending the one with the earliest start date. You need to repeat this process until you've constructed a fully inclusive period for each set of eligible value-equivalent rows. Last, you must remove the rows that contain only part of the maximal period.
This approach first expresses coalescing using the LOOP construct of PSM.
PROCEDURE Do_Coalesce () LANGUAGE SQL BEGIN CREATE TABLE Temp(SSN CHAR(9), PCN INT, START_DATE DATE, END_DATE DATE); INSERT INTO Temp SELECT * FROM INCUMBENTS;
-- Extend each eligible row's end date.BEGIN DECLARE EXIT HANDLER FOR NOT FOUND BEGIN END;
LOOP UPDATE Temp AS T1 SET (T1.END_DATE) = (SELECT MAX(T2.END_DATE) FROM Temp AS T2 WHERE T1.SSN = T2.SSN AND T1.PCN = T2.PCN AND T1.START_DATE < T2.START_DATE AND T1.END_DATE >= T2.START_DATE AND T1.END_DATE < T2.END_DATE)
-- Make sure there is at least one end date to extend the row.WHERE EXISTS (SELECT * FROM Temp AS T2 WHERE T1.SSN = T2.SSN AND T1.PCN = T2.PCN AND T1.START_DATE < T2.START_DATE AND T1.END_DATE >= T2.START_DATE AND T1.END_DATE < T2.END_DATE) END LOOP; END;
DELETE FROM Temp AS T1 WHERE EXISTS (SELECT * FROM Temp AS T2 WHERE T1.SSN = T2.SSN AND T1.PCN = T2.PCN AND ((T1.START_DATE > T2.START_DATE AND T1.END_DATE <= T2.END_DATE) OR (T1.START_DATE >= T2.START_DATE AND T1.END_DATE < T2.END_DATE))
CREATE TABLE Temp2 (SSN CHAR(9), PCN INT, START_DATE DATE, END_DATE DATE);
INSERT INTO Temp2 SELECT DISTINCT * FROM Temp;
The loop maximally extends a row's end date. The exit handler terminates the loop when no rows are updated. At that point the algorithm removes those rows that have non-maximal periods of validity - in other words, those with a validity period that is entirely contained in a value-equivalent row. Finally, we remove non-sequenced duplicate rows.
To illustrate, apply this algorithm to Table 1. The Temp table will initially contain the value-equivalent rows as shown in Figure 1.
After the first iteration of the loop, the Temp table would contain the rows shown in Figure 2.
Note how the procedure extends a row's end time when a value-equivalent row meets or overlaps it.
After the second iteration, some periods are further extended, as shown in Figure 3.
The next iteration does not change any end time, so the repeat-until loop terminates. The DELETE statement removes value-equivalent rows of non-maximal periods, retaining only the first one shown in Figure 3.
One problem with this initial approach is that it uses a PSM procedure. For a time, nobody thought it possible to express this query completely in SQL. Then, Michael Boehlen and, independently, the team of Rozenshtein, Abramovich, and Birger, discovered a solution. You can find the Rozenshtein team's presentation in chapter 22 of Joe Celko's book, SQL for Smarties. This solution involves complex, multiply nested NOT EXISTS subclauses.
CREATE TABLE Temp(SSN CHAR(9), PCN INT, START_DATE DATE, END_DATE DATE) INSERT INTO Temp SELECT * FROM INCUMBENTS;
SELECT DISTINCT F.SSN, F.PCN, F.START_DATE, L.END_DATE FROM Temp AS F, Temp AS L WHERE F.START_DATE < L.END_DATE AND F.SSN = L.SSN AND F.PCN = L.PCN
--There are no holes between F.END_DATE and L.START_DATE.AND NOT EXISTS (SELECT * FROM Temp AS M WHERE M.SSN = F.SSN AND M.PCN = F.PCN AND F.END_DATE < M.START_DATE AND M.START_DATE < L.START_DATE AND NOT EXISTS (SELECT * FROM Temp AS T1 WHERE T1.SSN = F.SSN AND T1.PCN = F.PCN AND T1.START_DATE < M.START_DATE AND M.START_DATE <= T1.END_DATE))
-- No date range can be extended further.AND NOT EXISTS (SELECT * FROM Temp AS T2 WHERE T2.SSN = F.SSN AND T2.PCN = F.PCN AND ((T2.START_DATE < F.START_DATE AND F.START_DATE <= T2.END_DATE) OR (T2.START_DATE <= L.END_DATE AND L.END_DATE < T2.END_DATE)))
In this query, we search for the value-equivalent rows - represented by the correlation names F, for first, and L, for last - that define start point F.START_DATE and end point L.END_DATE. It is possible that one row will satisfy both conditions. The first NOT EXISTS ensures that no gaps exist between F.END_DATE and L.START_DATE. This first NOT EXISTS guarantees that all start points M.START_DATE between F.END_DATE and L.START_DATE of value-equivalent rows are extended (towards F.END_DATE) by a value-equivalent row, T1. (See Figure 4.)
In this subclause, T1might in fact be F. Also, F might overlap L, in which case the NOT EXISTS is certainly true. Finally, M might not overlap L, but the decision to coalesce requires that another M span the gap between M and L.
The second NOT EXISTS ensures that only maximal periods result; that is, F and L cannot be part of a larger value-equivalent row T2.
I recommend that you compare this code fragment with the one for sequenced referential integrity that appeared in my October 1998 DBPD Online column. Both procedures use a nested NOT EXISTS to detect a "hole" in time ranges.
Yet another approach is possible, using a COUNT aggregate in place of the nested NOT EXISTS. This variant uses an idea that Romley proposed and that Celko explicated in his May 1998 DBMS column.
CREATE VIEW V1 (SSN, PCN, START_DATE, END_DATE)
AS SELECT F.SSN, F.PCN, F.START_DATE, L.END_DATE
FROM INCUMBENTS AS F, INCUMBENTS AS L, INCUMBENTS AS E
WHERE F.END_DATE <= L.END_DATE
AND F.SSN = L.SSN AND F.SSN = E.SSN
AND F.PCN = L.PCN AND F.PCN = E.PCN
GROUP BY F.SSN, F.PCN, F.START_DATE, L.END_DATE
HAVING COUNT(CASE
WHEN (E.START_DATE < F.START_DATE
AND F.START_DATE <= E.END_DATE)
OR (E.START_DATE <= L.END_DATE
AND L.END_DATE < E.END_DATE)
THEN 1 END) = 0
CREATE TABLE Temp(SSN CHAR(9), PCN INT,
START_DATE DATE, END_DATE DATE)
INSERT INTO Temp
SELECT SSN, PCN, START_DATE, MIN(END_DATE)
FROM V1
GROUP BY SSN, PCN, START_DATE
The correlation names in the view denote "first" (F), "last" (L), and "extends" (E). The view collects periods that are maximal, in that there is no row E that extends it either to the left (the first part of the WHEN predicate) or to the right (the second part of the WHEN predicate). COUNT = 0 is equivalent to NOT EXISTS. The WHERE predicate ensures that the period is a correct one, and that it considers only those rows with the appropriate SSN and PCN. The NSERT command then ensures that there are no gaps within the period, by selecting the minimum end date.
Although this solution, with 19 lines, is somewhat shorter than the NOT EXISTS version with 27 lines, the three-way join and the grouped aggregate might dampen performance.
A fourth alternative is to use SQL only to open a sorted cursor on the table. This example uses Oracle's PL/SQL to express the cursor and the while loop.
CREATE TABLE Temp(SSN, PCN, START_DATE, END_DATE);
DECLARE CURSOR INC_CURSOR IS
SELECT *
FROM INCUMBENTS
ORDER BY SSN, PCN, START_DATE;
StartRow INC_CURSOR%ROWTYPE;
PrevRow INC_CURSOR%ROWTYPE;
CurrRow INC_CURSOR%ROWTYPE;
BEGIN
IF NOT INC_CURSOR%ISOPEN
THEN
OPEN INC_CURSOR;
END IF;
FETCH INC_CURSOR INTO PrevRow;
StartRow := PrevRow;
FETCH INC_CURSOR INTO CurrRow;
WHILE INC_CURSOR%FOUND
LOOP
IF (StartRow.SSN <> CurrRow.SSN AND StartRow.PCN <> CurrRow.PCN)
OR (PrevRow.END_DATE < CurrRow.START_DATE)
THEN
INSERT INTO Temp
VALUES (StartRow.SSN, StartRow.PCN,
StartRow.START_DATE, PrevRow.END_DATE);
StartRow := CurrRow;
END IF;
PrevRow := CurrRow;
FETCH INC_CURSOR INTO CurrRow;
END LOOP;
INSERT INTO Temp
VALUES (StartRow.SSN, StartRow.PCN,
StartRow.START_DATE, PrevRow.END_DATE);
CLOSE INC_CURSOR;
END;
The loop extracts a row from the sorted INCUMBENTS table. If the current row is value-equivalent with the start row and if the current row overlaps or meets the previous row, then the procedure will coalesce it with the previous row. Otherwise, the procedure will add the coalesced row (from StartRow.START_DATE to PrevRow.END_DATE) to Temp. Because the cursor is sorted, simultaneously coalescing and removing duplicates requires only a single scan of the underlying table. The disadvantage of this approach is that it must pull rows into the application via the cursors, manipulate them, and then push them back into the DBMS.
Deciding which of the four variants presented here is best depends in part upon the degree of reduction you hope to obtain: If the table is already coalesced, the coalescing algorithms go through a great deal of trouble to accomplish nothing. Other considerations are the specific DBMS you use and the size of the underlying table.
Rick Snodgrass is a professor of computer science at the University of Arizona. He chairs ACM SIGMOD, has written many papers and several books on temporal databases, and consults on designing and implementing time-varying databases. He is working with the ANSI and ISO SQL3 committees to add temporal support to that language. You can email Rick at rts@cs.arizona.edu or visit his Web page at http://www.cs.arizona.edu/people/rts.
This material is based on the forthcoming book, Developing Time-Oriented Applications in SQL, by Richard Thomas Snodgrass, ý 1999 Morgan Kaufmann Publishers Inc., http://www.mkp.com.
Copyright © 1998 Miller Freeman Inc. All Rights Reserved
Redistribution without permission is prohibited.