
Bigger Than a Database
By Fernando Martinez-Campos
The Internet's wide-open network is fostering a dynamic convergence of what were once exclusive technologies. As companies race to engage in electronic commerce, database architectures expand beyond traditional boundaries.
The dramatic success of the Internet and the World Wide Web has brought virtually every segment of society into the era of open networking. Exciting new applications in the entertainment, publishing, education, and travel industries are using access statistics to customize their data for individual users. Through collaborative filtering, servers can then send masses of data condensed according to individual preferences.[1]
Mainstays of information dissemination that relied on "old-fashioned" paper are rapidly converting to this new electronic medium on a global basis. For example, the amount of Yellow Pages and classified advertising in the United States is exploding because almost all data about individual preferences is already captured in machine-readable form and can be reformatted on Web sites and searched at a very low cost. In addition, novel applications are now being deployed with computer-telephony integration, kiosks, and devices operated remotely from browsers (see Table 1).
Web Approaches
The simplest form of Web access brings in HTML pages that statically reference
text, images, and audio. This approach is easily implemented but somewhat
inflexible, because the information displayed on the page doesn't change.
The next level of sophistication consists of page-embedded scripts, which
can format pages differently depending on input parameters.
These two methods can be developed and tested with minimal expertise, which is one of the reasons for the massive amount of content on the Web today. But new business applications must go beyond this "document-centric" approach toward a more dynamic method of displaying data.
For more complex requirements, the Common Gateway Interface (CGI) has been the most prevalent form of application execution on Web servers. With CGI, pages can be dynamically built depending on data contents. CGI programs can be written in interpreted languages (such as Perl) or compiled from C into executable code. Pages can then be dynamically formatted using application logic. As the number of requested objects grows and directory maintenance becomes more cumbersome, a database helps manage the Web site data. Most major relational database management system (RDBMS) products include HTML object data types, as well as automatic recovery features for updates.
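As a minimal sketch of the dynamic page building described above, the following Python function plays the role of a CGI program: it reads the request's query string and formats an HTML page from data. The PARTS table, the "part" parameter, and the build_page name are illustrative assumptions, not taken from any product named in this article.

```python
import html
import os
from urllib.parse import parse_qs

# Hypothetical catalog data; a real site would read this from a DBMS.
PARTS = {"1001": ("Bracket", 4.50), "1002": ("Hinge", 2.25)}


def build_page(query_string: str) -> str:
    """Return a complete CGI response: headers, a blank line, then HTML."""
    params = parse_qs(query_string)
    part_id = params.get("part", [""])[0]
    if part_id in PARTS:
        name, price = PARTS[part_id]
        body = f"<h1>{html.escape(name)}</h1><p>Price: ${price:.2f}</p>"
    else:
        # Escape user input before echoing it back into the page.
        body = f"<p>No part found for id '{html.escape(part_id)}'.</p>"
    return "Content-Type: text/html\r\n\r\n<html><body>" + body + "</body></html>"


if __name__ == "__main__":
    # A Web server passes the query string to a CGI program in QUERY_STRING.
    print(build_page(os.environ.get("QUERY_STRING", "")))
```

Because the server starts a fresh process for every request handled this way, the startup cost is paid each time, which is the scaling limit discussed next.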
For lower request volumes, a single server is usually satisfactory; more processors and memory can be added as the workload increases. But when the workload grows, these approaches inhibit performance because a new process must be spawned for each request. The newer Web applications interface with middleware, which uses threads to minimize startup overhead and provides ways to retain context state for complex transactions.
Transaction-processing (TP) monitors are a major category of middleware that load balance and group transactions over distributed servers. They can start multiple replicas of an application on several nodes to attain higher scalability. Because the HTTP protocol is stateless and servers don't track the client execution context, different solutions are required for complex transactions with multiple interactions. TP monitors solve this problem by maintaining the current state of each transaction as it progresses through different parts of the application logic.
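The idea of keeping transaction state on the server while HTTP itself stays stateless can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the ContextStore class and its methods are hypothetical names, and a real TP monitor would also handle timeouts, persistence, and concurrency.

```python
import itertools


class ContextStore:
    """Keeps per-transaction state on the server between stateless requests."""

    def __init__(self):
        self._next_id = itertools.count(1)
        self._contexts = {}

    def begin(self) -> int:
        # The client receives only an identifier to send with later requests.
        txn_id = next(self._next_id)
        self._contexts[txn_id] = {"step": 0, "items": []}
        return txn_id

    def handle(self, txn_id: int, item: str) -> dict:
        # Each request carries the id; the saved context is restored here.
        ctx = self._contexts[txn_id]
        ctx["step"] += 1
        ctx["items"].append(item)
        return ctx

    def commit(self, txn_id: int) -> list:
        # Completing the transaction releases its saved context.
        return self._contexts.pop(txn_id)["items"]


store = ContextStore()
txn = store.begin()
store.handle(txn, "select flight")
store.handle(txn, "choose seat")
order = store.commit(txn)
```

Each call to handle stands for one independent HTTP request; only the identifier travels with the client.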
TP monitors have many other facilities that help distributed systems. For example, they can route requests around failed nodes to other application replicas. Message-routing can even follow different schemes: balanced (uses round-robin scheduling), message-sensitive (based on codes inside the message), enhanced (with weights for each node), and recoverable (uses persistent queues that store and forward messages after node failures). Messages can be prioritized based on message content/context, time, or user-defined priorities. TP monitors can also funnel many clients into fewer sessions, thereby reducing DBMS memory requirements.
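Three of the routing schemes above (balanced, message-sensitive, and enhanced), along with routing around failed nodes, can be sketched as follows. The Router class, its method names, and the node names are hypothetical; this is not the API of any TP monitor. The recoverable scheme, which needs persistent queues, is omitted.

```python
import itertools
import random


class Router:
    """Illustrative message router over a set of named nodes."""

    def __init__(self, nodes, weights=None):
        self.nodes = list(nodes)
        self.weights = weights or {n: 1 for n in self.nodes}
        self.failed = set()
        self._cycle = itertools.cycle(self.nodes)

    def _alive(self):
        alive = [n for n in self.nodes if n not in self.failed]
        if not alive:
            raise RuntimeError("no nodes available")
        return alive

    def balanced(self):
        # Round-robin: take the next node in turn, skipping failed ones.
        self._alive()
        while True:
            node = next(self._cycle)
            if node not in self.failed:
                return node

    def message_sensitive(self, message, table):
        # Route on a code inside the message; fall back to round-robin
        # when the code is unknown or its node has failed.
        node = table.get(message.get("code"))
        return node if node in self._alive() else self.balanced()

    def enhanced(self, rng=random):
        # Weighted choice: nodes with larger weights receive more messages.
        alive = self._alive()
        return rng.choices(alive, weights=[self.weights[n] for n in alive], k=1)[0]


router = Router(["a", "b", "c"], weights={"a": 5, "b": 1, "c": 1})
first_three = [router.balanced() for _ in range(3)]
router.failed.add("b")  # simulate a node failure
after_failure = [router.balanced() for _ in range(4)]
```

After node "b" is marked as failed, round-robin scheduling continues over the surviving nodes only.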
You can also use TP monitors for distributed transaction processing (DTP) across several nodes running different DBMS products. Two-phase commit protocols are also supported across these heterogeneous databases to ensure transaction integrity.
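The two-phase commit protocol mentioned above can be outlined in a short sketch. The Participant class stands in for one DBMS, and all names here are assumptions made for illustration. A production coordinator would also log its decisions and handle timeouts and coordinator failure, which this sketch leaves out.

```python
class Participant:
    """Stand-in for one DBMS taking part in a distributed transaction."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "idle"

    def prepare(self) -> bool:
        # Phase 1: vote yes only if the local work can be made durable.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def rollback(self):
        self.state = "aborted"


def two_phase_commit(participants) -> bool:
    # Phase 1: collect a vote from every participant.
    votes = [p.prepare() for p in participants]
    # Phase 2: commit only on a unanimous yes; otherwise roll everyone back.
    if all(votes):
        for p in participants:
            p.commit()
        return True
    for p in participants:
        p.rollback()
    return False


ok = two_phase_commit([Participant("orders"), Participant("billing")])
failed = two_phase_commit([Participant("orders"), Participant("billing", can_commit=False)])
```

The second call returns False because one participant votes no, so every participant is rolled back and the databases stay consistent.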
Client/Server Evolution
A decade before the Web became popular, the client/server concept fueled
the growth of graphical user interface (GUI) tools and applications. The
era of PCs and LANs fostered the two-tier model, with client tools making
requests to a DBMS residing in the second tier for database access. With
the client/server approach, network traffic can be reduced via stored procedures
that execute common business logic closer to the DBMS. This approach shifts
common application logic even further away from the client, where it can
be shared by all users. It also thins the resource requirements of the client
PC, which is, ironically, the latest trend in networking devices.
As the network speed and the cost-effectiveness of servers improved, a "middle tier" emerged to run major portions of the application using TP monitors and batch and other analysis tools. Three-tier architectures soon became common in higher-volume systems; they also became popular because maintaining each PC software tool from a single tier was so difficult. This development has led an increasingly mobile workforce to demand a "universal access socket" that can launch applications from any device using a thin client with a presentation layer. The remaining user tools and personal files can then safely reside on mid-tier servers.
We had no way of knowing several years ago that the three-tier concept would fit so well with today's Web technology, and that it would be only a matter of time before the visually oriented Web world merged with GUIs, middleware, and DBMS products.
OLTP and Data Warehousing
One of the first considerations when deploying an application over the Internet
is the variability of response times. Applications with longer-running queries
and a low number of interactions are better suited for Internet use. Data
warehouses are a good fit with the Internet because users tend to analyze
large amounts of data and are willing to wait longer for results. An emerging
set of applications is being implemented with operational data stores that
let outside customers use the Internet to tap into internal data for status
information. For example, the Federal Express package-tracking application
was one of the earliest successes to prove that operational data stores
save time and effort when the customer queries them directly.
For another example of a network-enabled data warehouse, consider a parts database from a manufacturer that can be viewed by corporate customers via the Internet. A major retailer provides its suppliers with internal data on their product performance, sales, and stocking information. More companies are likely to implement Internet data warehouses for customer access, which helps streamline and strengthen corporate relationships.
Online transaction processing (OLTP) applications present greater challenges. Such applications are usually mission-critical and therefore require high availability, performance, and reliability. Users of these applications require less "think-time" to generate transactions, and they will not tolerate the long delays typical of the Internet. Short queries can run over the Internet as long as each user submits only a few requests per hour. Understanding user needs and the volume of individual transactions is critical when long response times are expected. OLTP applications are best deployed when network bandwidth is predictable and controlled. Intranets and virtual private networks are better environments for online applications that require short response times and high reliability.
The Internet will likely evolve to provide varying levels of service to users who are willing to pay accordingly. The current Internet distance-vector and path-vector routing, which aims for "best effort" transmission, will have to accommodate policy-routing schemes. This development may lead to stratified pricing for different levels of bandwidth guarantees, which will enable low-latency electronic commerce traffic across the Internet.
Java and TP Monitors
The Java language vindicated the concept of sending executable code along
with the HTML page to the client. It proved that good performance and portability
across many platforms can be reasonably achieved. Java has been quickly
accepted because of its simplicity and power; it eliminates the more troublesome
object operations such as multiple inheritance, operator overloading, and
pointers.
Java's platform-independent orientation relies on the concept of creating machine-neutral code applets that are executed at the client by the Java Virtual Machine (JVM) environment. The JVM enforces the executable memory boundaries, performs automatic garbage collection, and prevents the applet from writing on the local file system. Initial versions have been slow because of their interpretive nature, but several just-in-time compilers that create a much faster executable are appearing in the marketplace.
In addition to making screens more lively, Java is popular because it can also perform field edits locally on the client, thereby reducing the number of network interactions. In addition, with Java a client PC can access a database using a two-tier approach: databases can be accessed from Java using JDBC drivers that implement the specification and API for making SQL requests. These drivers are downloaded as Java class files over the network in a manner similar to that of applets. The JDBC Driver Manager chooses which low-level driver will handle the target DBMS request.
Another method, Intersolv's JDBC-ODBC bridge, communicates with existing ODBC drivers to make database calls. This method requires bridge installation on each client, making it better suited for intranet environments. For object-oriented applications, Java Remote Method Invocation (RMI) is available for object request broker communication.[2]
On a three-tier architecture, Java establishes a connection with the server that sent the applet. A TP monitor and application can be positioned at this server to provide application and system services. The Java applet becomes the conduit in which all communication flows between client and servers. This integration has at least one major advantage: the elimination of the need to install software on client PCs to run applications.
Current TP monitor implementations include Tuxedo, in which BEA Jolt Java applets work with the ATMI API to send messages to a Jolt server. This server performs any necessary data marshaling, assembles the request, and interfaces with the Tuxedo TP monitor as a regular transaction. The Jolt server preserves transaction state across requests when a transaction involves multiple interactions. With Transarc's Encina, the DE Light applet packages parameters to be executed as an RPC by a gateway residing on a mid-tier server. The gateway dynamically builds an Encina T-RPC or DCE RPC, which then executes as a normal transaction.
Top End from NCR uses Java Remote Client applets that are loaded along with the requested page and connect to the Top End server directly. Top End's ActiveX Controls interface extends the concept further by building OCX Module applications using Visual Basic, C++, PowerBuilder, and many others.
ActiveX
The ActiveX approach integrates component libraries at the desktop with
new controls that are loaded dynamically. This technique improves run-time
performance because most components already reside on the desktop, and bringing
in new ones becomes an additive process. The disadvantage derives from a
reliance on Microsoft environments and the need for additional disk space
for the component libraries. To address this concern, Microsoft has released
a software development kit that helps in porting the 2,000-plus ActiveX controls
to Macintosh and Unix.
ActiveX components plug into the container that controls them (a browser, for example). Most of ActiveX has evolved from OLE with lightweight components and distributed mechanisms to execute modules locally or globally. ActiveX is different from Java in that the loaded controls are stored in regular directories at the client for subsequent use. If the controls are missing from the desktop, they are located using the <OBJECT> tag, which points to the servers that store them.
Microsoft has developed several mechanisms to authenticate that source servers are virus-free. Distinct digital signatures within each downloaded component cross-check for any code tampering. Once loaded, these components can read and write files on the desktop. The control container can have multiple ActiveX controls that are activated on certain events, such as clicking on a button.[3]
Aside from controls, other major components of ActiveX include scripting (VBScript or JScript), ActiveX documents, and ISAPI for server-side support. Java can also interact with the ActiveX environment using Microsoft's Visual J++ development tool. Additional products are expected soon from other vendors.
Other Uses for the Mid Tier
Now that we've explored how TP monitors work with Web technologies, what
are some other major uses for mid-tier servers? Many statistical and analytical
packages are available for mid-tier servers, including relational OLAP tools
(ROLAP). In this case, the presentation layer in the three-tier architecture
runs on client workstations and passes requests to the ROLAP tools on the
mid tier, which perform substantial data pivoting, ranking, trending, drilling
up/down, aggregation, and cross-tabulation. The third tier uses a DBMS for
relational data access.
DSS Web from MicroStrategy provides ROLAP functionality with user-oriented features such as Autoprompt, which guides users during each analytical step, and international language customization. It caches reports for later retrieval, compresses data to save on transmission volumes, and incrementally displays reports as the data is retrieved in the background. WebOLAP from Information Advantage also runs on mid-tier servers and converts data warehouse data to HTML format, which is then sent to browsers. The company's DecisionSuite Server acts as an intelligent agent that monitors the data warehouse and forwards reports to users when exceptional conditions arise.
Intelligent Agents
As we enter the era of software that interacts with many source devices
and roams across the network, the need is emerging for intelligent agent
software that can be personalized for each user. As the number of Web sites
and amount of content continue to explode, intelligent agents will help
find, organize, and catalog data to map data sources to servers. Some agents
regularly schedule "spiders" (also called software robots or "softbots")
that crawl across multiple Web servers, searching and cross-indexing keywords
for subsequent retrieval.
Agent technology is either static or mobile. The static agent implementation stays at its home server and uses RPC mechanisms to make requests to other servers. For example, the MetaCrawler agent resides on a server and launches requests to a dozen search engines concurrently to find pages that match a keyword. The result sets are merged and returned to the client browser.
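The static-agent pattern of querying several search engines concurrently and merging the results might look like the sketch below. The two engine functions are hypothetical stand-ins that return canned URLs instead of calling a real service, and metasearch is an illustrative name, not the agent's actual implementation.

```python
from concurrent.futures import ThreadPoolExecutor


# Hypothetical stand-ins for remote search engines; no network access occurs.
def engine_a(keyword):
    return [f"http://a.example/{keyword}", f"http://shared.example/{keyword}"]


def engine_b(keyword):
    return [f"http://shared.example/{keyword}", f"http://b.example/{keyword}"]


def metasearch(keyword, engines):
    """Query every engine concurrently, then merge results without duplicates."""
    with ThreadPoolExecutor(max_workers=len(engines)) as pool:
        # pool.map returns results in the same order as the engines list.
        result_lists = list(pool.map(lambda engine: engine(keyword), engines))
    merged, seen = [], set()
    for results in result_lists:
        for url in results:
            if url not in seen:
                seen.add(url)
                merged.append(url)
    return merged


hits = metasearch("olap", [engine_a, engine_b])
```

The merge step here only removes duplicate URLs; ranking the combined results would be a further design decision.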
Another agent approach follows a mobile remote programming strategy (Telescript or Java applets, for example) that sends code to execute at remote sites. Agents arrive uninvited, so stricter security precautions and mechanisms for limiting the consumption of resources during agent execution are required.
Agents can also be classified according to where they functionally execute. Agents can run at the source, intermediate, and destination points. Source agents run where the database resides (a financial services site, for example) and notify individual users of preset exception conditions. This approach is the standard push model of information distribution, which is increasingly being employed in "micro-marketing" efforts to individuals.[4]
Intermediate agents run on mid-tier servers and filter information coming from news groups and other sources. They work even when the clients are logged off, and they can span an ever-growing number of information sources. One example is the ActiveWeb Information Broker from Active Software, which performs filtering and content routing. Messages are queued when the destination is not available and are sent to the client once it comes back online. The Information Broker can also extract data from legacy and data warehouse sources and forward messages to interested users. An active adapter converts source formats to a neutral format for the Information Broker to route. As the message arrives at the client, another active adapter converts the message to a native format for subsequent processing.
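The queue-until-online behavior described above can be illustrated with a small in-memory sketch. The Broker class and its methods are hypothetical and are not the Information Broker's API; a real store-and-forward broker would keep its queues on persistent storage.

```python
from collections import defaultdict, deque


class Broker:
    """Delivers messages at once to online clients and queues the rest."""

    def __init__(self):
        self.online = set()
        self.pending = defaultdict(deque)   # client -> queued messages
        self.delivered = defaultdict(list)  # client -> messages received

    def publish(self, client, message):
        if client in self.online:
            self.delivered[client].append(message)
        else:
            self.pending[client].append(message)

    def connect(self, client):
        # On reconnect, forward everything queued, oldest first.
        self.online.add(client)
        queue = self.pending[client]
        while queue:
            self.delivered[client].append(queue.popleft())

    def disconnect(self, client):
        self.online.discard(client)


broker = Broker()
broker.publish("ana", "rates changed")   # ana is offline: queued
broker.publish("ana", "order shipped")   # queued behind the first message
broker.connect("ana")                    # both queued messages are forwarded
broker.publish("ana", "new report")      # delivered immediately
```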
Destination agents, the final type, reside on the client and retrieve data on demand based on a pull model. With destination agents, the user can customize agents to control the arrival of an increasing amount of available data (for example, by filtering incoming e-mail).
An increasing number of agents are also being developed to participate in commodity trading and negotiation. Users even dispatch special-purpose agents to roam the Web for comparison shopping. For example, the BargainFinder agent searches Web sites and returns lists of stores, items, and prices.
As more agents participate in the many layers of networks, n-tier architectures will evolve in many interesting ways. In the future, agents will cooperate with each other to accomplish business transactions, such as searching bank Web sites for the best loan rates. Agents can also compete with each other: A push-based agent can send e-mail to a community of prospective buyers, and a specific client agent may block messages using individualized filtering rules.[4]
Clusters
As a Web site grows and applications become more mission-critical, different
types of architectures, such as clusters, will have to be implemented. Clusters
offer both scalability and availability.
The original single-server solutions achieve vertical scalability by adding processors, memory, channels, controllers, and disks. Clusters, by contrast, offer horizontal scalability: they expand capacity by adding nodes and distributing the workload, increase availability by eliminating single points of failure, and allow flexibility in the geographical dispersion of servers.
Many Internet users are willing to return to a target server if it's temporarily offline. But in electronic commerce, new customers attempting to access an unavailable Web site may not return, so high availability takes on new importance when applications have critical business uses. To provide a highly available environment, clusters work with TP monitors to enhance failover capabilities and workload balancing. All nodes in a cluster maintain communication mechanisms to validate one another's operation. Cluster software automatically switches over permissions for files, database locks (such as Oracle's), and IP addresses. Most major hardware vendors offer cluster management capabilities, but they vary in how many nodes can be handled and which recovery methods are used during node failures.
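One common way for cluster nodes to validate each other's operation is a periodic heartbeat; a node that stays silent past a timeout is treated as failed and its resources are switched to a survivor. The sketch below assumes that mechanism. The Cluster class, the timeout value, and the node and resource names are all illustrative, and timestamps are passed in explicitly so the example is deterministic.

```python
class Cluster:
    """Tracks node heartbeats and reassigns resources from silent nodes."""

    def __init__(self, nodes, timeout=3.0):
        self.timeout = timeout
        self.last_seen = {node: 0.0 for node in nodes}
        self.owner = {}  # resource (for example, an IP address) -> node

    def heartbeat(self, node, now):
        self.last_seen[node] = now

    def alive(self, now):
        return [n for n, seen in self.last_seen.items() if now - seen <= self.timeout]

    def assign(self, resource, node):
        self.owner[resource] = node

    def failover(self, now):
        """Move resources held by silent nodes to the first surviving node."""
        survivors = self.alive(now)
        if not survivors:
            raise RuntimeError("no surviving nodes")
        moved = []
        for resource, node in self.owner.items():
            if node not in survivors:
                self.owner[resource] = survivors[0]
                moved.append(resource)
        return moved


cluster = Cluster(["n1", "n2"], timeout=3.0)
cluster.assign("10.0.0.5", "n1")
cluster.heartbeat("n1", now=0.0)
cluster.heartbeat("n2", now=0.0)
cluster.heartbeat("n2", now=5.0)   # n1 has gone silent
moved = cluster.failover(now=6.0)  # n1 exceeded the timeout; n2 takes over
```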
When viewing a cluster from a data perspective, two basic topologies emerge. As illustrated in Figure 1, a shared data cluster enables any node to access a common pool of data. For example, Oracle clusters are frequently configured with each node sharing connectivity with all disks in the system. This approach works well in many Web applications, most of which minimize lock-management overhead by being read-only.
The other approach, that of shared-nothing clusters, provides high scalability and geographical placement of nodes across longer distances. Data replication may be used to spread requests further to alternate sites when daily requests reach high volumes. Although several major DBMS products support shared-nothing arrangements, each product may take a different middleware approach.
Most cluster implementations employ Unix, MVS Sysplex, and proprietary solutions. The number of nodes varies from two to 32; for example, the Wolfpack API standard from Microsoft, which brings clustering to Windows NT, is expected to group up to eight nodes. Over the next few years, clusters for all operating systems are expected to increase in robustness and performance as interconnect latencies and speeds improve.
Internet Challenges
The blending of all these technologies demands new solutions for data warehousing
and online applications. Although the casual Internet user might be willing
to "forgive" a server that doesn't respond, users in business-oriented,
mission-critical environments demand high system availability.
Although implementing OLTP systems over the Internet and intranets is possible, several challenges that vary with the nature of the network remain unresolved. An intranet has a controlled population of user accounts, but the Internet is completely open to anyone wanting access to exposed servers. Therefore, a company can capacity-plan and tune the former, but on the latter, network traffic is highly unpredictable; response times can range from a few seconds to minutes. As a result, users accustomed to subsecond response times over a private network may experience long response times when running Internet-based transactional applications. Mixing long messages (video, audio, long text pages, and so on) with short ones makes traffic unpredictable, and the presence of millions of users and servers performing different types of work online makes control very difficult.
Security of Transmitted Data
Securing transmitted data over public lines is a major concern. Data encryption,
the most common transmission security solution, comprises two main schemes:
symmetric and asymmetric encryption. The oldest form of transmission
security uses symmetric encryption based on the DES standard, with the same
56-bit encryption key used on both ends of the transmission. (Some DES hardware
implementations speed up the process.) In the case of "triple DES,"
three encryption passes with different keys are made on the message
to reduce the threat of brute-force decryption attacks. The goal is
to make the computational cost of breaking the message higher
than the value of obtaining the data.
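A quick calculation shows how key length drives the cost of a brute-force attack. The search rate of one billion keys per second is an arbitrary assumption chosen only for illustration, and the 112-bit figure corresponds to triple DES used with two independent 56-bit keys.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600


def average_years_to_break(key_bits: int, keys_per_second: float) -> float:
    """Expected brute-force time, assuming half the keyspace is searched."""
    expected_trials = 2 ** key_bits / 2
    return expected_trials / keys_per_second / SECONDS_PER_YEAR


ASSUMED_RATE = 1e9  # keys tested per second; a hypothetical attacker

des_years = average_years_to_break(56, ASSUMED_RATE)       # single DES
two_key_years = average_years_to_break(112, ASSUMED_RATE)  # two-key triple DES

print(f"56-bit key:  about {des_years:.2f} years")
print(f"112-bit key: about {two_key_years:.2e} years")
```

Under this assumed rate a 56-bit key falls in a little over a year on average, and every added key bit doubles the work.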
In the last decade, the use of asymmetric encryption has become widespread through the use of public/private key pairs, the most prevalent being that of the RSA algorithm. Key lengths range from 48 to 1,024 bits; however, as key lengths grow, so does the computational overhead for encryption. In fact, to retain the ability to decode messages abroad, the U.S. government has restricted the export of any scheme with the higher key lengths. (Recent legal rulings have questioned the constitutionality of these restrictions.)
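The public/private key idea behind RSA can be shown with deliberately tiny numbers. This is a toy sketch for illustration only: the primes, exponent, and message are arbitrary small values, real keys use primes hundreds of digits long, and practical systems add padding that is omitted here.

```python
# Toy RSA with tiny primes (requires Python 3.8+ for the modular inverse).
p, q = 61, 53
n = p * q                 # public modulus
phi = (p - 1) * (q - 1)   # used only while generating the key pair
e = 17                    # public exponent, coprime with phi
d = pow(e, -1, phi)       # private exponent: (e * d) % phi == 1


def encrypt(message: int) -> int:
    # Anyone holding the public key (e, n) can encrypt.
    return pow(message, e, n)


def decrypt(ciphertext: int) -> int:
    # Only the holder of the private exponent d can decrypt.
    return pow(ciphertext, d, n)


message = 65
ciphertext = encrypt(message)
recovered = decrypt(ciphertext)
```

The modular exponentiation works on numbers as large as the modulus, which is why longer keys raise the computational overhead noted above.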
The Secure Sockets Layer protocol (SSL) pioneered by Netscape combines encryption and authentication with message integrity-checking to prevent tampering during transmission. SSL supports secure and nonsecure messages in alternate transmissions, reducing overhead when sending the non-private portions of the transaction. Another scheme, Secure HTTP (S-HTTP) from EIT Inc., also provides application-level encryption and offers a choice of algorithms (DES or RSA) with different digital signatures.
The Secure Electronic Transaction standard (SET), sponsored by Visa, MasterCard, and a variety of major hardware/software vendors, supports many strategies to maintain confidentiality, user identity, and message integrity over the Internet. SET will soon be implemented on all layers of processing, including the browser, application, authentication, and transmission mechanisms. Other building blocks include the Europay, MasterCard, and Visa (EMV) smart-card standard, which is designed to reduce counterfeiting risks and validate transactions.
User authentication is another major security issue. The most basic method consists of the traditional "something you know" scheme: user name and password. A "something you have" layer (for example, magnetic cards that are read electronically) adds another level of security. The final method is authentication of "who you are" by using biometric physical characteristics (such as voice prints or retinal scans) or a third-party certificate authority such as VeriSign. These third parties provide services to validate the identity of the user and issue a certificate used to sign messages electronically. The certificate authority involves itself in the processing of each transaction by issuing a challenge to users to determine their validity. This particular method has a major advantage: non-repudiation. Like a signed, legally binding contract, third-party certification proves that the transaction originated from the user.
One of the main challenges of elaborate authentication schemes is the infrastructure cost of transaction processing. Currently, authenticating "micro" transactions (those of less than $25) is too expensive. In fact, a whole industry is emerging based on the need for electronic cash technology that can cost-efficiently process such transactions. Such a capability would enable new marketing tactics for Internet products, such as pay-per-execution of tools, informational browsing, and downloading of low-cost items. But before this can occur, electronic cash technology will have to reach critical mass in the marketplace, and several legal issues (including the guarantee of value over time, fund insurance, and traceability) must also be resolved.
Security is a complex topic that spans many dimensions beyond that of secure transmission, such as security of the operating system, database, e-mail, firewalls, and user permissions. Personnel security, including separation/rotation of duties and constant system monitoring, is also important, as most major lapses originate on the inside. Even the most elaborate security mechanisms are vulnerable, so combining security measures and continuously plugging newly discovered "holes" is an ongoing process.[5]
Some of these issues will be overcome in the near future. For example, the current 32-bit IP addressing scheme is rapidly running out of addresses, as they're assigned in ranges to each organization. The major problem of migrating to a larger addressing structure with minimum turmoil may be resolved by IPv6, which is designed to expand addresses from 32 to 128 bits while coexisting with the current IP structure.
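The difference between 32-bit and 128-bit addressing is easy to quantify with Python's standard ipaddress module. The IPv4-mapped address shown is one coexistence mechanism, used here only as an illustration; the sample address comes from a range reserved for documentation.

```python
import ipaddress

# Size of each address space.
ipv4_total = ipaddress.ip_network("0.0.0.0/0").num_addresses   # 2 ** 32
ipv6_total = ipaddress.ip_network("::/0").num_addresses        # 2 ** 128

# An IPv4 address can be embedded in an IPv6 address, so both forms can
# refer to the same host during a transition.
v4 = ipaddress.IPv4Address("192.0.2.1")
mapped = ipaddress.IPv6Address("::ffff:192.0.2.1")

print(f"IPv4 addresses: {ipv4_total:,}")
print(f"IPv6 addresses: {ipv6_total:.3e}")
print(f"{mapped} embeds {mapped.ipv4_mapped}")
```

The IPv6 space is 2 to the 96th power times larger than the IPv4 space, which removes the pressure to ration address ranges among organizations.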
Bandwidth
Last but not least, Internet bandwidth issues will impact the global network
as well as individual users. All of the major telecommunication carriers
are massively upgrading the infrastructure bandwidth for the many backbones
in place today. But the biggest challenge is the limited bandwidth on the
last "hop to the home." Intermediate solutions (ISDN, for example) triple
the capacity of this last hop, but their cost has limited widespread acceptance.
Promising solutions include ADSL and cable connections. Cable has traditionally been used as a broadcast medium, so several challenges (particularly that of uploads) will have to be overcome. Despite these challenges, the fact that cable bandwidth capacity is hundreds of times greater than that of existing modem solutions makes it the most attractive option as more and more interactive multimedia data comes onto the Internet.
Looking Toward the Future
The availability of PDAs, Web TVs, and network computers should bring even
more users into this era of global networking. Tighter integration between
supplier and consumer companies will shorten the ordering, shipping, and
billing cycles and will disseminate changing operational conditions more
quickly. The areas of forecasting, tracking, logistics, and replenishment
should also benefit from direct two-way interchange with partner companies.
The whole science of advertising, from mass marketing to interactive marketing, will find new avenues to reach consumers. For example, push-based models such as broadcasting may incorporate pull-based mechanisms so that clients can interact on the screen. These new techniques will provide excellent feedback loops to narrow individual consumer preferences and enable the tailoring of advertising.
The number of pages on the Web will also grow exponentially because a Web page provides identity in a variety of contexts. In this increasingly complex world, identity is the means by which organizations, products, services, people, ideas, and events rise above the "clutter." This need should lead to billions of useful combinations of access patterns and requirements, each with its own Web page.
The technology to provide services for transactions and data warehouses is rapidly falling into place. Just about every GUI, data-mining, and OLAP product is being converted to work with browsers. Major industry groups are converging on functionality for Web access. The DCE Web project by the Open Group, CORBAnet using the Internet Inter-ORB Protocol (IIOP) (see the sidebar "The New IIOP Protocol"), and DCOM by Microsoft are major initiatives that aim to make thousands of tools "Internet-friendly."
Many first-wave Web-enabled implementations are low-risk and have a quick payoff. They provide the first proof of concept for these new technologies:
Consumer-Oriented Applications
Employee/Corporate Applications
Figure 1. Three-tier Web architecture.

Copyright 1997 Miller Freeman Inc. All Rights Reserved
Redistribution without permission is prohibited.