According to Date, by C. J. Date

Persistence Not Orthogonal to Type

The Third Manifesto disagrees with the object world

 

In the September issue, I explained why I felt the focus on encapsulation in the object world was a little off base. This month, I want to turn my attention to another well-known object dictum--namely, the dictum that persistence should be orthogonal to type (which I'll refer to as POTT for short). POTT means that (a) any data structure that can be created in a conventional application program--for example, an array, a linked list, or a stack--can be stored as an object in an object database, and (b) the structure of such objects is visible to the user. For example, consider the object EX, say, that denotes the collection of employees in a given department. EX could be implemented as a linked list or as an array, and users will have to know which it is (because the access operators will differ accordingly).

One of the earliest papers, if not the earliest, to articulate the POTT position is "Types and Persistence in Database Programming Languages," by Malcolm Atkinson and Peter Buneman (ACM Computing Surveys 19(2), June 1987). Atkinson was also one of the authors of "The Object-Oriented Database System Manifesto," [1] which proposed a set of features that a DBMS must support, in the opinion of those authors, if it's to qualify for the label "object-oriented." POTT, of course, was included among those features. Subsequently, "The Third Generation Database System Manifesto" also endorsed POTT as an objective for future database systems ("Persistent X for a variety of Xs is a good idea"). [2] And the authors of The Object Database Standard: ODMG 2.0 also agree:

"[We] define an object DBMS to be a DBMS that integrates database capabilities with object-oriented programming language capabilities. An object DBMS makes database objects appear as programming language objects ... [it] extends the language with transparently persistent data ... and other database capabilities." [italics added] [3]

The position Hugh Darwen and I take in The Third Manifesto is very different, however:

"Databases (and nothing else) are defined to be persistent.... [Because] the only kind of variable we permit within a database is, very specifically, the [relation variable or] relvar, the only kind of variable that might possess the property of persistence is the relvar."

POTT Violates Data Independence

One reason we reject POTT is that it can lead to a loss of data independence. As I already noted, POTT means that any data structure that can be created in a conventional application program can be stored as an object in an object database and, further, that the structure of such objects is visible to the user. Now, this "anything goes" approach to what can be in the database is, of course, a major point of difference between the object and relational models, so let's take a closer look at it. Note: I assume for the sake of the discussion that the term object model is well defined and well understood, though such an assumption is--to say the least--a little charitable to objects!

Be that as it may, we can characterize the difference between the two approaches as follows:

  • The object model says we can put anything we like in the database (any data structure we can create with the usual programming language mechanisms).
  • The relational model effectively says the same thing--but then goes on to insist that whatever we do put there be presented to the user in pure relational form.

More precisely, the relational model, quite rightly, says nothing about what can be physically stored. It therefore imposes no limits on what data structures are allowed at the physical level; the only requirement is that whatever structures are physically stored must be mapped to relations at the logical level and be hidden from the user. Relational systems make a clear distinction between logical and physical (that is, between the model and its implementation), while object systems don't.

One consequence of this state of affairs is that, as already claimed--but contrary to conventional wisdom--object systems might very well provide less data independence than relational systems do. For example, suppose the implementation in some object database of the object EX mentioned earlier (denoting the collection of employees in a given department) is changed from an array to a linked list. What are the implications for existing code that accesses that object EX? It breaks.

I should perhaps ask the further question: Why would we want to change the implementation of EX in such a manner? The answer is surely performance. Ideally, the change should not affect anything except performance; in practice, however, that is not the case.
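The EX scenario can be made concrete with a small Python sketch. It is only an illustration under stated assumptions: the names ArrayEX, LinkedEX, Node, first_employee_array_style, and employees_as_relation are hypothetical and belong to no real object DBMS, and the employee names are made-up sample data. The sketch shows application code written against the array structure failing once EX becomes a linked list, while a relation-style view of the same employees is unaffected by the change.

```python
# Hypothetical sketch: all class and function names are illustrative only.

class ArrayEX:
    """EX stored as an array: users reach employees by index."""
    def __init__(self, names):
        self.items = list(names)


class Node:
    """One cell of a singly linked list."""
    def __init__(self, name, next_node=None):
        self.name = name
        self.next = next_node


class LinkedEX:
    """EX stored as a linked list: users reach employees by following links."""
    def __init__(self, names):
        self.head = None
        for name in reversed(list(names)):
            self.head = Node(name, self.head)


def first_employee_array_style(ex):
    # Application code written against the array structure.
    return ex.items[0]


def employees_as_relation(ex):
    """Logical view: a set of one-attribute rows, whatever the storage is."""
    if isinstance(ex, ArrayEX):
        return {(name,) for name in ex.items}
    rows = set()
    node = ex.head
    while node is not None:
        rows.add((node.name,))
        node = node.next
    return rows


names = ["Lopez", "Cheng", "Finzi"]   # made-up sample data
array_ex = ArrayEX(names)
linked_ex = LinkedEX(names)

array_result = first_employee_array_style(array_ex)   # works on the array

try:
    first_employee_array_style(linked_ex)
    broke = False
except AttributeError:
    broke = True   # LinkedEX has no .items attribute, so the old code fails

# The relation-style view is the same for both storage structures.
same_view = employees_as_relation(array_ex) == employees_as_relation(linked_ex)
```

In this sketch only employees_as_relation needs to know about the storage structure, which is the kind of mapping the relational model expects the system, not the user, to carry out.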

It seems to me, in fact, that the ability to have all these different ways of representing data at the logical level is an example of what I've referred to elsewhere as spurious generality. I would argue further that the whole idea stems from a failure to make a clean separation between model and implementation (we might need lots of different representations at the physical level, but we don't need them at the logical level). Indeed, I remember E. F. Codd once saying (in response to a question during a conference panel discussion): "If you tell me that you have 50 different ways of representing data in your system [at the logical level, that is], then I'll tell you that you have 49 too many."

POTT Causes Additional Complexity

It should be obvious that POTT does lead to additional complexity--and by "complexity" here I mean, primarily, complexity for the user, although life does get more complex for the system too. For example, the relational model supports just one "collection type generator," RELATION, together with a set of operators--join, project, and so forth--that apply to all "collections" of that type (in other words, to all relations). In contrast, the ODMG proposals support four collection type generators, SET, BAG, LIST, and ARRAY, each with a set of operators that apply to all collections of the type in question. And I would argue that the ODMG operators are simultaneously more complicated and less powerful than the analogous relational ones. Here, for example, are the ODMG operators for lists:

IS_EMPTY 
IS_ORDERED
ALLOWS_DUPLICATES
CONTAINS_ELEMENT
INSERT_ELEMENT
REMOVE_ELEMENT
CREATE_ITERATOR
CREATE_BIDIRECTIONAL_ITERATOR
REMOVE_ELEMENT_AT
RETRIEVE_ELEMENT_AT
REPLACE_ELEMENT_AT
INSERT_ELEMENT_AFTER
INSERT_ELEMENT_BEFORE
INSERT_ELEMENT_FIRST
INSERT_ELEMENT_LAST
REMOVE_FIRST_ELEMENT
REMOVE_LAST_ELEMENT
RETRIEVE_FIRST_ELEMENT
RETRIEVE_LAST_ELEMENT
CONCAT 
APPEND 
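For contrast with the operator list above, here is a minimal Python sketch of two relational operators, project and natural join, written so that they apply to any relation at all. It is an illustration only: the row encoding (a frozenset of attribute/value pairs), the helper names relation, project, and join, and the sample data are all assumptions, and headings, types, and the remaining operators of the algebra are left out.

```python
# Hypothetical sketch: a relation body is modeled as a set of rows,
# each row being a frozenset of (attribute, value) pairs.

def relation(*rows):
    """Build a relation body from dictionaries; duplicate rows collapse."""
    return {frozenset(row.items()) for row in rows}


def project(rel, attrs):
    """Keep only the named attributes; the result is again a relation."""
    return {frozenset((a, v) for a, v in row if a in attrs) for row in rel}


def join(r, s):
    """Natural join: combine rows that agree on all common attributes."""
    result = set()
    for x in r:
        dx = dict(x)
        for y in s:
            dy = dict(y)
            common = dx.keys() & dy.keys()
            if all(dx[a] == dy[a] for a in common):
                result.add(frozenset({**dx, **dy}.items()))
    return result


# Made-up sample data.
emp = relation(
    {"emp": "Lopez", "dept": "D1"},
    {"emp": "Cheng", "dept": "D1"},
    {"emp": "Finzi", "dept": "D2"},
)
dept = relation(
    {"dept": "D1", "city": "Oslo"},
    {"dept": "D2", "city": "Rome"},
)

# Because every operator returns a relation, operators nest freely.
emp_city = project(join(emp, dept), {"emp", "city"})
dept_ids = project(emp, {"dept"})
```

The same two functions work unchanged on emp, dept, and on their own results, which is the sense in which one set of operators covers every "collection" of the type.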

Incidentally, it's worth pointing out in passing that ODMG does not support a RELATION type generator. The authors of The Object Database Standard: ODMG 2.0 claim that "the ODMG data model encompasses the relational data model by defining a TABLE type," but that TABLE type is severely deficient in many respects; in particular, many of the crucial relational operators--join, for example--are missing. There are many additional problems with claims to the effect that ODMG "encompasses" or "is more powerful than" the relational model, but space precludes detailed examination of those additional problems here.

Now, ODMG supports a query language called OQL, a retrieval-only language (update operators are omitted) that's loosely patterned after SQL. To be more specific, OQL:

  • Provides SQL-style SELECT-FROM-WHERE queries against sets, bags, lists, and arrays
  • Provides analogs of the SQL GROUP BY, HAVING, and ORDER BY constructs
  • Supports unions, intersections, and differences, and special operations for lists and arrays (for example, "get the first element")
  • Supports "path expressions" for traversing relationships among objects.

And The Object Database Standard: ODMG 2.0 makes a number of claims regarding OQL. Here are a couple of them (italics added in both cases):

  • "We have used the relational standard SQL as the basis for OQL, where possible, though OQL supports more powerful capabilities."
  • "[OQL] is more powerful [than a relational query language]."

In my opinion, by contrast, OQL illustrates my point very well that POTT leads to additional complexity! That is, I would argue that OQL is more complicated, not more powerful (computer people often seem to confuse these two notions). And the extra complication derives from the fact that so many different data structures are exposed to the user. And that state of affairs is a direct consequence, it seems to me, of a failure to appreciate the advantages of keeping model and implementation rigidly apart.

Let's take a moment to investigate this issue of increased complexity a little more closely. First of all, note that when we talk of lists in the database, arrays in the database, and so on, what we're really talking about is list variables, array variables, and so on--just as, when we talk of relations in the database, we really mean relation variables (relvars). Now, the only kinds of variables we find in the relational model are, of course, relation variables specifically (that is, variables whose values are relations); the relational model doesn't deal with list or array variables or any other kinds of variables. It follows that to introduce list variables, for example, would constitute a major departure from the classical relational model.

Why exactly would that departure be so major? Well, orthogonality would dictate that we'd have to define a whole new query language for lists--that is, a set of list operators (a "list algebra"?), analogous to the operators already defined for relations (the relational algebra). Of course, we'd also have to worry about closure in connection with that language. And we'd have to define a set of list update operators, analogous to the existing relational ones. We'd have to be able to define list integrity and security constraints, and list views. The catalog would have to describe list variables as well as relation variables. (And what would the catalog itself consist of? List variables? Relation variables? A mixture of both?) We'd need a list design theory, analogous to the existing body of relational design theory. We'd also need guidelines as to when to use list variables and when relation variables. And so on, as I'm sure this list of issues isn't exhaustive.

Assuming that such a "list algebra" can be defined, and all the questions raised in the previous paragraph can be answered satisfactorily, we now would have two ways of doing things where one was sufficient before. In other words, as already noted, adding a new kind of variable certainly adds complexity, but it doesn't add any power; there's nothing (at least, nothing useful) that can be done with a mixture of list and relation variables that can't be done with relation variables alone. Thus, the user interface will now be more complex and involve more choices, most likely without good guidelines as to how to make such choices.

As a direct consequence of the foregoing, database applications--including general-purpose applications or "front ends"--will become more difficult to write and more difficult to maintain. Those applications will also become more vulnerable to changes in the database structure; some degree of data independence will be lost. Consider what happens, for example, if the representation of some piece of information is changed from relation variables to list variables, or the other way around.

All these problems are in direct conflict with Codd's Information Principle. I believe Codd has referred to it on occasion as "the fundamental principle of the relational model." It can be stated as follows: All information in the database must be cast explicitly in terms of values in relations and in no other way. In his book The Relational Model for Database Management: Version 2 (Addison-Wesley, 1990), Codd gives a number of arguments in support of this principle (arguments with which I concur, of course). The real point is this: As we've argued in The Third Manifesto, relations are both necessary and sufficient for representing any data we like (at the logical level). In other words, we must have relations, and we don't need anything else.

So where did POTT come from? It seems to me that what we have here is (as so often) a fundamental confusion between model and implementation. To be specific, it has been observed that certain SQL products don't perform very well on certain operations (especially joins); it has further been conjectured that performance would improve if we could use, say, lists or arrays instead of relations. But such thinking is seriously confused; it mixes logical and physical levels. Nobody is arguing that, for example, lists might not be useful at the physical level; the question is whether lists and so forth should be exposed at the logical level. And it's the very strong position of relational advocates in general, and the authors of The Third Manifesto in particular, that the answer to that question is no.

References
1. Atkinson, Malcolm et al. "The Object-Oriented Database System Manifesto." Proc. First International Conference on Deductive and Object-Oriented Databases, Kyoto, Japan (1989). Elsevier Science, 1990.
2. Stonebraker, Michael et al. "Third Generation Database System Manifesto." ACM SIGMOD Record 19, No. 3 (September 1990).
3. Cattell, R. G. G. and Douglas K. Barry (eds.). The Object Database Standard: ODMG 2.0. Morgan Kaufmann, 1997.

C. J. Date is an independent author, lecturer, researcher, and consultant, specializing in relational database systems. His most recent books are Foundation for Object/Relational Databases: The Third Manifesto, coauthored with Hugh Darwen, and Relational Database Writings 1994-1997, both published by Addison-Wesley in 1998. You may send correspondence to him in care of Database Programming & Design Online.


 
 

Copyright © 1998 Miller Freeman Inc. All Rights Reserved
Redistribution without permission is prohibited.
