E-Database

XML: Extensible Markup Language

Pratik Patel

Much more than HTML on steroids

Web developers have lamented the inadequacy of HTML for years. Whether the problem was a lack of controlling style elements or limited flexibility for marking text, HTML slingers had to resort to tricks and fancy images to achieve the desired look and feel for their Web sites. Application developers were forced to display complex data in tables or write code to take alphanumeric data and build an image of a chart. Earlier this year, however, Web developers were handed a new standard for developing content: extensible markup language (XML). XML is poised to revolutionize the way Web developers--from designers to application developers--work. For example, Microsoft is using XML in its push software; the CDF (Channel Definition Format) is an XML-based encoding used in the Active Channel product. In this column, we'll see how you can use XML to organize data in the form of documents. XML is only part of the solution to Web developers' woes, however. In my next column, we'll look at Dynamic HTML, which lets developers create Web pages that have dynamic content and dynamic positioning of content.

WHY XML?

XML is, simply put, standard generalized markup language (SGML) for the Web. For those of you already familiar with the complexities of SGML, don't worry; XML is much less complicated. XML lets you develop structured documents similar to SGML documents but removes many of the complex, less-used features of SGML, such as complex entity references. XML also makes it easier to create document type definitions (DTDs) and share them over the Web. For those of you not already familiar with SGML, it is an international standard for defining the structure and content of documents. More specifically, you use it to define the markup that describes a class of documents.

If you're still confused about how SGML, XML, and HTML relate, don't panic. SGML and XML are known as metalanguages, which means that they define other languages--specifically, they define document types and the elemental structure of documents. HTML is one of the document types defined by SGML; it defines a specific class of documents used primarily for the Web and intranets, though others are using it for help files and other nonnetworked documents.

XML is a developer-friendly version of SGML that makes it easier for content developers to create DTDs. XML provides rules to define a set of tags that can be used in a document, rather than providing the tags themselves. Remember that XML is a metalanguage: It doesn't define a specific document type; it provides a set of rules for defining documents. For example, a basic rule of XML is that data elements need to have a start and end tag to be considered XML-compliant. The tags delimit XML entities, attributes, and content, as well as the syntax of the entity. An XML parser reads these tags and passes the elements to an application, which can do what it wants with the data. See Listing 1 for an example XML document. In it, I have declared that the DTD is located at a specific URL and told XML parsers to take it into account with the standalone="no" directive. I've also specified the encoding format.

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE book SYSTEM "http://www.books.com/books.dtd">
<book>
<title>Java Database Programming with JDBC</title>
<author>Pratik Patel</author>
</book>
LISTING 1. An example XML document. 
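
As a rough illustration of a parser handing elements to an application, the sketch below feeds the Listing 1 document to the XML parser in Python's standard library. The declaration is written in the lowercase form, with the encoding before the standalone setting, as the final XML 1.0 Recommendation requires. The function name read_book is invented for this example, and this parser checks only well-formedness; it does not fetch or apply the external DTD.

```python
import xml.etree.ElementTree as ET

# The Listing 1 document, held as bytes so the encoding declaration is honored.
LISTING_1 = b"""<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE book SYSTEM "http://www.books.com/books.dtd">
<book>
<title>Java Database Programming with JDBC</title>
<author>Pratik Patel</author>
</book>"""


def read_book(xml_bytes):
    """Parse the document and return its child elements as a dict (hypothetical helper)."""
    root = ET.fromstring(xml_bytes)
    # Each direct child of <book> becomes a tag -> text entry.
    return {child.tag: child.text for child in root}


book = read_book(LISTING_1)
print(book["title"], "by", book["author"])
# prints "Java Database Programming with JDBC by Pratik Patel"
```

What the application does with the resulting dictionary is entirely its own business, which is the point of the division of labor described above.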

One important detail to note about XML is that an XML document doesn't necessarily need a DTD, as an SGML document does. This is one of the shortcuts in the specification (as compared to SGML) that gives developers some flexibility. When a DTD is present, you use it to validate the document strictly against the definition. When a DTD is not present, the document need only be well-formed XML. Developers can then add tags or take other shortcuts as they see fit without having to worry about adherence to a DTD. Now that you have an idea of where XML fits into the scheme of things, let's examine the possible impact of XML.
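
To make the DTD-free case concrete, here is a minimal sketch of a well-formedness check using the parser in Python's standard library. The helper name is_well_formed and the sample documents (including the isbn tag) are made up for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the text parses as XML; no DTD is consulted."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


# An ad hoc <isbn> tag is fine: with no DTD, any properly nested tags are accepted.
print(is_well_formed("<book><title>XML</title><isbn>unknown</isbn></book>"))  # True

# A start tag without its matching end tag is rejected.
print(is_well_formed("<book><title>XML</book>"))  # False
```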

SQL AND XML

XML and SQL are similar in purpose. SQL is a standardized language divided into three broad sections: data definition language (DDL), data manipulation language (DML), and data query language (DQL). You use DDL to define your data, DQL to query your defined data, and DML to manipulate your data. Data storage is not important to the SQL user because the database engine deals with the actual physical storage of the data. There are certain rules to define the data, and the data contained within the defined structure is readily accessible.

XML also lets you define the members of a dataset, but the definition is called a DTD rather than a schema. As in SQL, these members have specific attributes, and the DTD sets rules for the content they contain. Of course, the difference is that SQL relies on a database engine for physical storage, while XML stores its data in documents.

The close conceptual similarities between XML and SQL bring about interesting possibilities. You can generate XML documents from a database to make data interchange possible between different data schemas on different databases. Additionally, XML can be the intermediary between distributed systems; it lets you create a class of documents which have a specific purpose, like HTML does. If the auto parts industry agreed on an XML-based DTD for saving parts data, programmers could write a whole series of applications to let customers browse and search for parts across all vendors. This auto parts DTD could also be applied to existing databases so current systems could have an XML gateway (basically, some type of middleware). Ordering systems could then use the DTD for developing client-side browsers of parts databases. As you can see, you can build an entire architecture for distributed systems around XML.
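
The idea of generating XML documents from a database can be sketched in a few lines. The example below is only an illustration under stated assumptions: the parts table, its columns, the sample rows, and the part, name, and price tags are all invented here, and an in-memory SQLite database stands in for a vendor's real system.

```python
import sqlite3
import xml.etree.ElementTree as ET


def parts_to_xml(conn):
    """Export every row of the (hypothetical) parts table as one XML document."""
    root = ET.Element("parts")
    rows = conn.execute("SELECT part_no, name, price FROM parts ORDER BY part_no")
    for part_no, name, price in rows:
        part = ET.SubElement(root, "part", {"number": part_no})
        ET.SubElement(part, "name").text = name
        ET.SubElement(part, "price").text = f"{price:.2f}"
    # Return the document as a string rather than bytes.
    return ET.tostring(root, encoding="unicode")


# Invented sample data standing in for an existing parts database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE parts (part_no TEXT PRIMARY KEY, name TEXT, price REAL)")
conn.executemany(
    "INSERT INTO parts VALUES (?, ?, ?)",
    [("A100", "Oil filter", 7.5), ("B200", "Brake pad", 24.0)],
)
print(parts_to_xml(conn))
```

A second vendor with a completely different table layout could emit the same tags, and a client that understands the shared document type would not need to know about either schema.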

IF IT AIN'T BROKE...

Before we go any deeper into XML, let's examine exactly why we need XML to continue the growth of Web technologies. XML allows great flexibility because you're not restricted to using the tags in a specific DTD, as you are with HTML; designers can create a DTD that best fits the application they are developing, instead of trying to make their applications fit HTML (the modern example of the problem of trying to shove a square peg into a round hole). XML doesn't have this dependence on a single, inflexible document type.

Also, because you can create new DTDs with specialized tags, you can extend the XML parser to give these tags special meaning. The XML parser can also do precision layout on the basis of new tags. For example, a new tag such as "vertical text" would prompt an XML parser to lay out the text vertically, instead of horizontally.

Another key problem with HTML is the searchability of HTML documents. With XML you can formulate content-based searching. In a DTD for a library, which we'll call "books-DTD," we can use a generic XML search engine to do searches across documents of this DTD. The search engine would index on the custom tags in the documents, such as "author" or "publisher." You can, theoretically, feed the search engine any DTD to use as the template for developing search screens.
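
A small sketch shows why indexing on tags beats plain text matching. The second document, its author, and its publisher are invented for the example, as is the search function; the publisher tag follows the custom tags mentioned above.

```python
import xml.etree.ElementTree as ET

# Two documents of a "books" type; the second one is invented for illustration.
LIBRARY = [
    "<book><title>Java Database Programming with JDBC</title>"
    "<author>Pratik Patel</author>"
    "<publisher>The Coriolis Group</publisher></book>",
    "<book><title>A History of the Coriolis Effect</title>"
    "<author>A. N. Author</author>"
    "<publisher>Example Press</publisher></book>",
]


def search(documents, tag, term):
    """Return titles of documents whose <tag> element text contains term."""
    hits = []
    for text in documents:
        root = ET.fromstring(text)
        for element in root.iter(tag):
            if element.text and term.lower() in element.text.lower():
                hits.append(root.findtext("title"))
                break  # one hit per document is enough
    return hits


print(search(LIBRARY, "publisher", "coriolis"))
# prints ['Java Database Programming with JDBC']
```

A plain text search for "coriolis" would match both documents; restricting the match to the publisher tag returns only the book that the publisher actually produced.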

This also brings about another important distinction between HTML and XML, one that rests squarely on the fact that XML is a metalanguage. XML separates the presentation of a document from its encoding, so the way an XML document is displayed is up to the developer of the application. This means our "books-DTD" could have specialized viewers that take documents of the "books-DTD" type and display them in a special format, perhaps resembling a card catalog. If we had a DTD for financial data, it might have a tag called "chart" containing the data for a chart and the chart type. You can already see this kind of customization on some Web sites, where a Java applet charting package is embedded in a Web page and the chart data is encoded in the HTML "applet" tag. Using XML, you could build an XML parser and corresponding application to compose a chart whenever this "chart" tag is encountered in an XML document.
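
The separation of encoding from presentation can be illustrated with two tiny "viewers" over the same document. Both function names and both output layouts are invented for this sketch; a real card-catalog viewer would obviously do more.

```python
import xml.etree.ElementTree as ET

BOOK = (
    "<book><title>Java Database Programming with JDBC</title>"
    "<author>Pratik Patel</author></book>"
)


def as_catalog_card(text):
    """Render a book document in a card-catalog style (hypothetical viewer)."""
    book = ET.fromstring(text)
    return f"AUTHOR: {book.findtext('author')}\nTITLE:  {book.findtext('title')}"


def as_citation(text):
    """Render the same document as a one-line citation (hypothetical viewer)."""
    book = ET.fromstring(text)
    return f"{book.findtext('author')}, {book.findtext('title')}."


print(as_catalog_card(BOOK))
print(as_citation(BOOK))
```

Neither viewer changes the document itself; each one simply decides how to present the same tagged data.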

WHAT EXACTLY IS A DTD?

A DTD is a definition of the type of document. It defines what names can be used for elements, where they can be located in the document (relative to other elements, of course), and how the elements fit together. The XML processor uses the DTD to verify that a document matches the definition. A DTD is also a high-level syntax-checking rule book. While there are many DTDs available for SGML, they are not directly usable with XML. Look for these SGML DTDs to be converted to XML DTDs soon by the people who maintain them.
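
To give a feel for the kind of checking a DTD enables, the sketch below applies a drastically simplified stand-in for one: a table that lists, for each element name, the exact sequence of child elements it must contain. This is not DTD syntax and not a validating parser (real content models also allow optional and repeated elements); the CONTENT_MODEL table and the conforms function are assumptions made purely for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for a DTD: element name -> required child sequence.
CONTENT_MODEL = {
    "book": ["title", "author"],
    "title": [],
    "author": [],
}


def conforms(element, model=CONTENT_MODEL):
    """Check names, placement, and nesting of elements against the model."""
    if element.tag not in model:
        return False  # element name is not defined
    if [child.tag for child in element] != model[element.tag]:
        return False  # children are missing, extra, or out of order
    return all(conforms(child, model) for child in element)


good = ET.fromstring("<book><title>T</title><author>A</author></book>")
swapped = ET.fromstring("<book><author>A</author><title>T</title></book>")
print(conforms(good))     # True
print(conforms(swapped))  # False
```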

XML DRAWBACKS

Many people cite the simplicity of HTML as one of the reasons why the Web has been so successful. Anyone with some patience and a small tutorial on HTML can throw together a page. You don't need a fancy document editor, just some examples and some drive. Is XML forsaking this simplicity, thereby undermining the basic appeal of the Web? Absolutely not--XML will not replace HTML. For most Web developers, HTML is inflexible, but still usable. HTML is by no means going away.

The other major problem with XML is the need to build applications to handle XML documents in Web browsers. While this doesn't seem like a big deal, think about this: Do different Web browsers support a specific XML DTD? Will you need a plug-in to view someone's XML documents? The issue of XML viewing clearly hasn't been resolved yet. There are many ways to implement an application (like the aforementioned viewers) that uses parsed XML: client-side scripting (VBScript, JavaScript), plug-ins, or Java. If you use JavaScript to present your XML data, for example, not everyone visiting your site will be able to view your XML documents, because their browsers may not support the latest version of JavaScript. These implementation and deployment issues may be resolved in time, but fragmentation among the implementations that handle parsed XML data appears inevitable.

Even with these drawbacks, XML has already attracted attention. Microsoft and Marimba have proposed Open Software Description (OSD), a technology that will let you control software downloads and installations over the network. This OSD technology is based on XML. Expect to hear about XML quite a bit in the near future. XML is full of promise, but only time will tell if it will live up to its potential.

Pratik R. Patel is a researcher in the UNC-Chapel Hill Medical Informatics research program and is developing Web-based systems for healthcare. He is also coauthor of Visual Developer Java Database Programming with JDBC (The Coriolis Group, 1997). You can reach him via email at prpatel@sunsite.unc.edu.
 


 
 

Copyright © 1998 Miller Freeman Inc. All Rights Reserved
Redistribution without permission is prohibited.
