Weaving Data into Information

Gio Wiederhold

Two-layer client/server architectures can introduce huge maintenance problems. In contrast, mediator architectures link data resources and application programs so that end-user applications can tap services and information across many domains.

As information systems grow, complexity becomes a major concern. Although applications can share many common functions such as data transformation and integration, in the two-tier client/server model, you can't conveniently assign them to either layer. An unsuitable architecture creates complexity because of the replication of similar functions. In turn, maintenance costs--already 80 percent of most IS department budgets--will grow even faster than the number of components because all the included elements must interact.

A new solution is emerging, however. By assigning sharable functions to value-adding middleware services called mediators, you can reduce the high cost of adapting servers to changing demands without making client applications too complex.

Mediators provide intermediary services in information systems by linking data resources with application programs. A mediated architecture promotes reuse and scalability so that end-user applications can tap services and information from sources across many domains.

Mediator management is the responsibility of a human owner. A mediator program, directed by its owner, ensures stable service delivery even when resources change; provides improvements to serve clients better; assesses the disparity of concepts in sources and clients and maintains tools to resolve them; and invokes tools to resolve differences in format, representation, and scope.

Initially, innovative companies introduced the mediation concept by developing solutions that supported specific applications or by implementing mediators as extended services from databases. Today, many large companies have developed reusable mediator tools for their own internal use, and vendors such as Junglee, Persistence Software, and IBrain Software provide off-the-shelf solutions for various domains; other vendors build mediators "to order." In this article, I'll explain how this intermediate, value-added middleware layer isolates maintenance tasks, allows rapid upgrades, and maintains high application performance.

SHARABLE FUNCTIONS

As I explained before, in a two-layer client/server architecture you must assign all functions to either the server or client. For many functions the decision is an awkward one (see Table 1). The current debates about thin clients vs. fat clients illustrate that the alternatives are unclear, even if some function assignments are obvious.


 
Problem: Integration
  At servers: A single server cannot effectively perform integration with data from other servers. The maintenance required to remain consistent with many other servers requires a level of knowledge and concern distinct from that required to do one's own job well.
  At clients: Only retrieving raw data and transforming it in the client again requires the acquisition of redundant data, and makes direct sharing of the functions with other clients impossible.

Problem: Transformation
  At servers: A server would have to understand and adapt rapidly to the needs of various clients. Different client groups will probably require different object configurations.
  At clients: Each client program must maintain knowledge about multiple servers for integration. Sharing this knowledge effectively with other clients is awkward.

Table 1. Function assignment problems in a two-layer system.

 

For example, data selection is a function best performed at the server because you don't want to ship large amounts of unneeded data to the client. The SQL SELECT statement's effectiveness is evidence of that assignment; not many languages can encapsulate most of their functionality in one verb. Similarly, user interaction is an obvious client function: Local response must be rapid and reliable, and display, keyboard, and voice input require local feedback.

However, the assignment of functions such as integrating data from multiple sources and transforming server data into useful information for the client program is ambiguous. Furthermore, most clients are best served by object-oriented information, although integration of multiple, heterogeneous sources is best understood in the relational model.

ARCHITECTURE

As Figure 1 shows, the mediated architecture comprises three layers: server resources, client applications, and value-adding mediators. I'll start by discussing the roles of servers and clients.


 

 

In large systems, channeling data from local and external resources to applications requires many servers. The motivating applications for databases tend to be transactional operations such as inventory control, payroll, production control, and so on. Eventually, this data becomes information for high-level client applications that support planning and decision making. The number of these client applications increases more rapidly than OLTP applications because management needs tend to change. Client/server architectures are attractive in this arena because you can design and deploy applications rapidly with minimal impact on the server. However, maintaining a large number of them is complex. And, although important, these information applications shouldn't constrain day-to-day operations.

Information-centric client applications are typically designed independently and later than the base OLTP applications; for instance, planning support must be synchronized with frequently changing management objectives. Implementing these applications requires using existing sources because sufficient time is rarely available for building planning systems and their data collection from scratch. Complementary sources of information for decision making are also often obtained externally from financial information systems, digital libraries, geographic information systems, and simulations. Clearly, managing many, diverse, and heterogeneous sources soon overwhelms high-level client applications with a large number of irrelevant but crucial details; Web access is especially difficult to manage.

Mediators add value by converting data to information--accessing and retrieving relevant data from multiple heterogeneous resources, integrating the homogenized data according to matching descriptors and keys, and reducing the integrated data to increase information relevance and density.
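The three steps named above (access, integrate, reduce) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two source functions, their field names, and the `mediate` function are hypothetical stand-ins for real servers and do not come from any actual product.

```python
from collections import defaultdict

# Hypothetical sources; each returns rows in its own format.
def inventory_source():
    return [{"part_no": "P1", "on_hand": 40}, {"part_no": "P2", "on_hand": 0}]

def purchasing_source():
    return [{"PART": "p1", "ordered": 100}, {"PART": "p2", "ordered": 25},
            {"PART": "p3", "ordered": 10}]

def mediate():
    """Access both sources, integrate on a matching key, and reduce the result."""
    merged = defaultdict(dict)
    # Access and homogenize: normalize key names and letter case.
    for row in inventory_source():
        merged[row["part_no"].upper()]["on_hand"] = row["on_hand"]
    for row in purchasing_source():
        merged[row["PART"].upper()]["ordered"] = row["ordered"]
    # Reduce: keep only out-of-stock parts that already have an order placed.
    return {part: info["ordered"] for part, info in merged.items()
            if info.get("on_hand") == 0 and "ordered" in info}

print(mediate())
```

The client receives one small dictionary instead of two raw tables, which is the sense in which the mediator raises information density.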

Figure 2 shows how some of these functions work. Note that in value-added mediation, more effort is devoted to processing the results of the retrievals than to accessing them. Locating and gathering the data are prerequisites but do not provide the desired result: information delivery in a form that client applications can use directly and effectively.


 

 

ACCESS TO REMOTE SERVICES

Accessing remote servers, especially those developed autonomously, is a long-standing problem. You must address differences in hardware, operating systems, database systems, database schemas, and scope. Table 2 lists some partial solutions. Mediation includes concepts from these solutions but has the explicit goal of maintaining server autonomy.


 
Multidatabase systems: Allow queries to address more than one independent source database
Federated databases: Integrated schemas that support joins over multiple, consistent databases
Wrappers: Server front-end software that provides SQL or OQL access to nondatabase files or legacy databases
Knobots: Software agents that search for relevant data through multiple databases or the Web
Webcrawlers: Software that retrieves data from the Web for incorporation into local databases

Table 2. Precursors to mediated systems.

 

Autonomously developed data can rarely be integrated by simply executing joins over their attributes. More often, integration requires:

• Resolution of scope mismatches. Scope mismatch occurs when records kept in a table from source A do not cover the same set of items in source B. (For example, purchased items are not always inventoried.)

• Abstraction to bring material to matching levels of granularity for integration (for example, employee hours vs. project labor budgets).

• Omission of replicated information (for example, when employee addresses exist in both sources but in different forms).

• Interpolation or extrapolation to match differences in temporal data (for example, when labor records are weekly but budgets are monthly).

Although you can develop rules and write programs to address these issues, placing the responsibility for coherence on client programs is a heavy burden best shared through intermediate services.
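As a rough sketch of two of the issues listed above, the following Python rolls weekly labor records up to monthly granularity and reports scope mismatches instead of silently dropping them in a join. The record layouts and the `integrate` function are hypothetical. Assigning each week to the month of its week-ending date is a simplifying assumption; a real mediator would interpolate weeks that straddle a month boundary.

```python
from collections import defaultdict
from datetime import date

# Hypothetical weekly labor records: (project, week_ending, hours).
weekly_hours = [
    ("alpha", date(1998, 1, 9), 30.0),
    ("alpha", date(1998, 1, 16), 42.0),
    ("alpha", date(1998, 2, 6), 20.0),
    ("beta", date(1998, 1, 9), 15.0),
]
# Hypothetical monthly budgets: (project, year, month) -> budgeted hours.
monthly_budget = {
    ("alpha", 1998, 1): 80.0,
    ("alpha", 1998, 2): 40.0,
    ("gamma", 1998, 1): 10.0,
}

def integrate(weekly, budget):
    # Abstraction: aggregate weekly hours to the month of the week-ending date.
    actual = defaultdict(float)
    for project, week_ending, hours in weekly:
        actual[(project, week_ending.year, week_ending.month)] += hours
    # Scope mismatch: keys found in only one source are reported separately.
    matched = {k: (actual[k], budget[k]) for k in actual.keys() & budget.keys()}
    only_actual = sorted(actual.keys() - budget.keys())
    only_budget = sorted(budget.keys() - actual.keys())
    return matched, only_actual, only_budget

matched, only_actual, only_budget = integrate(weekly_hours, monthly_budget)
```

Returning the unmatched keys alongside the matched pairs lets the mediator's owner decide how each mismatch should be resolved.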

INCREASING INFORMATION CONTENT

Information system customers become less effective as the volume of data increases. They suffer from information overload as access and integration improve. Having good information, however, should reduce uncertainty in decision making. Mediator processing tasks help reduce information overload by:

• Reducing historical data to limited snapshots

• Assessing quality of material from diverse sources

• Pruning data ranked low in quality or relevance

• Omitting information already known according to the customer model

• Statistically summarizing data into higher-level categories as relevant to the customer

• Generalizing and broadening searches to satisfy query expectations

• Reporting exceptions from expected values or trends

• Triggering actions due to exceptions from expected values or trends

• Adapting to the customer's bandwidth and media capabilities

• Sending information and metainformation to the customer application.

These services are not independent; the precise combination required depends on the domain the mediator serves. For example, in financial services, matching currencies and addressing temporal differences are often important. In this case, the mediator processing steps would be to convert currency according to rates prevailing in the period; adjust for inflation using cost-of-living indexes and standard projections; match weekly, monthly, and quarterly reporting periods; and adjust for differences in corporate financial years.
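Two of these financial steps can be sketched as follows. The rate table holds made-up illustrative numbers, not actual historical exchange rates, and the function names are hypothetical. The sketch assumes quarterly average rates and fiscal years that start on the first day of a calendar month.

```python
# Hypothetical average rates to USD per (currency, year, calendar quarter).
# The numbers are illustrative only, not real market data.
RATES_TO_USD = {
    ("DEM", 1997, 4): 0.57,
    ("DEM", 1998, 1): 0.55,
    ("USD", 1997, 4): 1.0,
    ("USD", 1998, 1): 1.0,
}

def calendar_quarter(month):
    return (month - 1) // 3 + 1

def fiscal_quarter(month, fy_start_month):
    """Quarter within a fiscal year that begins in fy_start_month (1 to 12)."""
    return ((month - fy_start_month) % 12) // 3 + 1

def to_usd(amount, currency, year, month):
    """Convert using the rate that prevailed in the reporting quarter."""
    rate = RATES_TO_USD[(currency, year, calendar_quarter(month))]
    return round(amount * rate, 2)

print(to_usd(1000, "DEM", 1998, 2))   # uses the 1998 Q1 rate
print(fiscal_quarter(1, 7))           # January in a July-to-June fiscal year
```

Keeping the rate table inside the mediator means every client analysis converts with the same figures.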

A mediator can provide such services for many clients. In fact, placing these functions in a mediator increases the consistency of analytical results. For example, if distinct analyses differ in their expectation of inflation rates, comparisons become futile.

In a two-layer client/server architecture these tasks are usually performed redundantly. If they're built into customer applications, keeping up with the variety of resources and the changes in them becomes difficult. If servers are to provide such computations for many clients, system complexity ensues whenever related servers must be accessed for complementary information (say, expenses based on a different reporting period).

One service a mediator shouldn't have to provide is that of actual presentation. Desktop computers are sufficiently powerful to convert information into a useful form, and GUI-related code is often particular to the interface itself and not to information content. If the client contains a Web browser, which provides a basic set of representation primitives, the mediator may create and ship information formatted in HTML. However, receiving Web pages will not be helpful if client applications need to process local data further or combine information from more than one mediator. Ideally, XML will provide a more computation-friendly standard.

Mediation is hence simplified by delegating the complexities of the customer interface to the application program. Generalizing such services often occupies more than 70 percent of server processing time. In contrast, the mediators and invoking applications need only an API.

INTERFACES

The precise implementation of mediation is less important than its ability to perform its functions in the context of the overall system. However, some basic architectural elements are required.

As Figure 3 shows, because the mediator architecture conceptually comprises three layers, two major interfaces are involved: mediators to applications and base resources to mediators. Large-scale systems may require intermediate interfaces as well because several sublayers can exist inside the mediation layer.


 

 

Much of the effort in moving to sharable mediator architecture involves recognizing interface standards so you can rapidly assemble different configurations. Mediation provides the open architecture required for such "virtual" enterprises.

For the mediators-to-applications interface, many tools designed for the two-layer client/server model are appropriate, such as distributed and augmented SQL, object query language (OQL), ODBC, and other interfaces for object-oriented access such as CORBA. When using these tools, you may have to wrap legacy applications. You can also transfer complex data structures using the ASN.1 (Abstract Syntax Notation One) standard.

The interfaces need greater capabilities, however, at the application layer. HTML works here for thin clients, and if data is sufficiently reduced, Java can also be effective. CORBA has a role here, as do the other object model standards. For specific domains, specialized standards--such as product data exchange standard (PDES) objects for engineering--may be appropriate. The requirement, of course, is that the sender and receiver agree on the chosen representation. They also should agree on the vocabulary and its structure (the ontology). For this purpose, a project based at the University of Maryland, Baltimore County, with support from the U.S. Defense Advanced Research Projects Agency (DARPA), has developed the Knowledge Query and Manipulation Language (KQML), which includes these features.

REDUCING INFORMATION OVERLOAD

The most important value-added mediation service--reducing information overload--requires substantial processing as well as a customer requirement model. Because an object configuration provides such a model, transforming the model into an object format is often part of the processing at this point.

Summarization aggregates data following the hierarchy established by the object model. In systems without mediators, summarization is specified by the client--and, to the extent that SQL functions are available, executed by the server. However, SQL does not provide aggregate functions for variance or standard deviation, which are needed to check if averages are based on simple distributions. As a result, the client must then perform even moderately complex summarizations. Often you must filter source data to delete anomalies, perform conversions if data comes from multiple countries, and so on.
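A mediator could offer the missing statistics once for all clients. The sketch below groups rows by category and reports count, mean, and population standard deviation using Python's standard `statistics` module. The `summarize` function and the row layout are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev

def summarize(rows):
    """Group (category, value) rows; report count, mean, and population std. dev."""
    groups = defaultdict(list)
    for category, value in rows:
        groups[category].append(value)
    return {
        cat: {"n": len(vals), "mean": mean(vals), "stdev": pstdev(vals)}
        for cat, vals in groups.items()
    }

summary = summarize([("east", 2.0), ("east", 4.0), ("west", 5.0)])
```

A client can then check the spread before trusting an average, without pulling the detail rows to its workstation.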

In current practice, the IT staff--not database services--performs much summarization computation by moving data from databases into spreadsheets. Thus these staff members explicitly program their own views of the customer's model. For example, cost data collected in a factory exists at a detail level that records the activity of every worker and machine with respect to every task. For the payroll domain, the worker's efforts are aggregated to daily hours, processed with data that determines overtime rates, and further aggregated to weekly totals that determine paychecks. At the pay level, benefits are added, taxes computed, and contributions withheld. But the same source data from the factory floor must be aggregated according to a different customer model to arrive at costs per product. You can add product development cost allocation to this aggregation to arrive at the base costs that eventually determine prices and profits.

You can create a more effective abstraction of source data by seeking exceptions. In this mode, only results that differ significantly from the customer's expectation are presented--for example, any abnormal clinical findings or an unexpected drop in sales for a product line. The need for a customer model in the former case is obvious: A 10 percent change in a patient's weight over a short time is typically cause for concern, so putting absolute limits on weights would lead to useless exceptions even if patients were categorized by age, height, and gender.
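The weight example can be sketched as a relative-change filter. The function name, the reading format, and the default threshold are assumptions made for illustration. For brevity the sketch compares consecutive readings and ignores how much time separates them, which a clinical system would need to take into account.

```python
def weight_exceptions(readings, threshold=0.10):
    """Return (patient, previous, current) for each reading that differs from
    the patient's previous reading by more than the relative threshold."""
    last = {}
    flagged = []
    for patient, weight in readings:
        prev = last.get(patient)
        if prev is not None and abs(weight - prev) / prev > threshold:
            flagged.append((patient, prev, weight))
        last[patient] = weight
    return flagged

readings = [("a", 80.0), ("b", 20.0), ("a", 70.0), ("b", 21.0)]
exceptions = weight_exceptions(readings)
```

Because the test is relative to each patient's own history, an adult and a child are judged by the same rule without any per-category absolute limits.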

Many business decisions are motivated by changes in customer demand, but simple tabulations do not tell the full story. Sale amounts are affected by exchange rates and promotions. Factory sales are buffered by inventories. Many products are affected by weather and regional preferences. Only when specialists have considered these factors can the information be used for production planning and investment decisions.

Systems also increasingly collect historical data. Such data is initially reduced by aggregating it to intervals that produce an adequate overview--say, by months or quarters. For sales data, further corrections can be made by normalizing data to expected annual cycles. Data can be reduced further by limiting the data points to a slope or to a mean combined with a variance over the preceding period.
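A minimal sketch of that last reduction step, assuming equally spaced observations (for example, one value per month) and at least two of them. The `reduce_history` function is hypothetical; it fits an ordinary least-squares slope and also returns the mean and population variance.

```python
from statistics import mean, pvariance

def reduce_history(values):
    """Reduce equally spaced observations to (slope per period, mean, variance).
    Requires at least two observations."""
    xs = range(len(values))
    x_bar, y_bar = mean(xs), mean(values)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, values))
    return sxy / sxx, y_bar, pvariance(values)

slope, average, variance = reduce_history([10.0, 12.0, 14.0, 16.0])
```

Three numbers replace the whole series, and the client can still tell a steady trend from a noisy one.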

In summary, using mediators for summarization services and exception handling offers several benefits: Computation can be shared by multiple clients, which can be assured consistent information; less data must be moved into client workstations; and results are produced at a higher conceptual level--closer to the decision makers in an enterprise.

DOMAIN-SPECIFIC MEDIATORS

Maintenance concerns alone make it impractical to concentrate integration and abstraction functions for every source and application in a single piece of software managed by a single organization. Partitioning mediation tasks by domain is a reasonable guideline, and a mediator should be maintained by a single coherent group rather than by committee. Such a group will use terms and structure object models consistently.

Having multiple object models causes an updating problem when the object contains only a portion, namely a database view, of the source relations. This view lacks the broader context provided by human experts such as the DBA. However, you can integrate such expertise into the mediator. In the approach we've developed at Stanford, all possible ambiguities caused by view updates are enumerated and ranked when the mediating transformation is defined. Hence, mediation delivers the byproduct of a practical solution to the view/update problem without involving customers in issues beyond their concern.

Mediation is the principal means of resolving problems of semantic interoperation but is needed for many other challenges as well. However, a single mediator can't address every issue relating to every application. A single group of people could never develop and maintain such a general mediator. Many client applications require more than one type of processing and hence may need support from multiple mediators (see Figure 4). Furthermore, different applications will require different mediator configurations. For example, a production planner needs production cost estimates and product demand information, whereas the sales manager needs demand information--perhaps with finer granularity--as well as inventory data.


 

 

USING MULTIPLE MEDIATORS

As I discussed earlier, you can keep mediators simple by restricting each of them to a single, coherent domain. However, advanced applications, specifically decision-support ones, must often resolve conflicts among disparate domains. For instance, investment decisions involve financial and production information, which is produced using different metrics. Incommensurate information is best integrated at a higher level than that occupied by domain specialists. This computation is best left to a client or a higher-level mediator.

Having a hierarchical customer model driving a mediation process does not inhibit client applications from integrating results from multiple mediators and accommodating dissimilar domains. Such a high-level integration will be pragmatic because applying comparable metrics in dissimilar domains is difficult. For example, while you can combine employee competence and cost for the purposes of personnel productivity analysis, the same comparisons may not hold for other applications.

ADDING VALUE

To warrant implementation of a mediating service as a distinct module, enough added value must exist to overcome the cost of adding a layer and its interfaces to the information processing flow. But the costs and benefits to be considered are only partially related to performance; having identifiable and maintainable service modules provides significant long-term management benefits. Independent enterprises can provide some of these services over proprietary networks or by leasing programs to customer sites.

Increasing information density. A major task for an effective mediation service is to reduce the data volume to be shipped to user applications while maintaining its information content. Embedding more information in less data increases information density. A high information density addresses the information overload problem; reducing transmission to the customer's workstation also cuts communication delays and costs. Abstraction is the principal tool for data reduction, either by summarization or by exception seeking. Both functions depend on a simple, hierarchical model of the customer's needs.

Transforming data into object structures. Making information relevant to clients often means transforming it into an object-oriented format. Object technology lets applications use an infrastructure that aggregates detail into meaningful units in many important domains. Internally, objects have hierarchical linkages because the class definitions that control them are based on hierarchies. Objects provide a valid customer model, even when the real world is more complex.

A customer model focuses on a task set and domain of interest. Different customer roles are represented by different hierarchical models. Consider the fact that people, when faced with complex tasks, categorize processes and objects according to a divide-and-conquer paradigm. Good categorizations are taxonomies with two attributes: disjointness (no object belongs to more than one category) and completeness (all objects can be classified).
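The two taxonomy attributes are easy to test mechanically. In this sketch, `check_taxonomy` is a hypothetical helper that takes the set of objects to classify and a mapping from category name to its members.

```python
def check_taxonomy(objects, categories):
    """categories maps a category name to the set of objects assigned to it."""
    seen = set()
    overlapping = set()
    for members in categories.values():
        overlapping |= seen & members   # objects already placed elsewhere
        seen |= members
    unclassified = set(objects) - seen
    return {
        "disjoint": not overlapping,
        "complete": not unclassified,
        "overlapping": overlapping,
        "unclassified": unclassified,
    }

parts = {"bolt", "nut", "gear"}
good = check_taxonomy(parts, {"fasteners": {"bolt", "nut"}, "drives": {"gear"}})
bad = check_taxonomy(parts, {"fasteners": {"bolt", "nut"}, "small": {"nut"}})
```

Here `good` passes both checks, while `bad` reports "nut" as overlapping and "gear" as unclassified.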

In database technology, a view relation, defined by a single SQL view expression, also creates a hierarchy. A view relation is no longer in normalized form. Each join in the view expression defines a relationship. The attribute named by a WHERE clause along a relationship defines the higher level. SQL views have been adequate for applications, so using hierarchical models in mediation follows a well-accepted path.

Customer acceptance of object models, which are also hierarchical in nature, supports the hypothesis that customer models can be hierarchical and hence manageable within this paradigm. Thus creating object structures from the complex and interconnected world of real data is one of mediation's major value-added contributions. A domain expert must manage the mediator transformation programs, of course.

However, defining fixed, large-scale object structures is difficult. The assumption that one hierarchical viewpoint is suitable for all occasions is demonstrably false: The object model of an inventory of assemblies differs for the purchasing agent acquiring the parts from the suppliers and for the factory assembling them. Very large hierarchical objects--say, having 200 elements--nearly always create viewpoint conflicts. Forcing unsuitable and overloaded representations onto the client processing programs increases the cost of finding and executing solutions; in mathematics, finding the right representation for a problem is 80 percent of the effort, and the same holds true in computing. Thus mediators work best when they create specific, well-defined object representations.

Furthermore, multiple mediators can generate alternate object configurations from the same base data without creating redundant persistent data and its related inconsistency problems.
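As an illustration, two small functions can derive different hierarchies from one base relation on demand, so nothing beyond the base rows is stored. The relation layout, the sample rows, and both function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical base relation: (assembly, part, supplier, quantity per assembly).
BASE = [
    ("gearbox", "bearing", "Acme", 4),
    ("gearbox", "shaft", "Bolt Co", 1),
    ("motor", "bearing", "Acme", 2),
]

def by_supplier(rows):
    """Purchasing view: supplier -> part -> total quantity across assemblies."""
    view = defaultdict(lambda: defaultdict(int))
    for _assembly, part, supplier, qty in rows:
        view[supplier][part] += qty
    return {supplier: dict(parts) for supplier, parts in view.items()}

def by_assembly(rows):
    """Factory view: assembly -> part -> quantity needed for that assembly."""
    view = defaultdict(dict)
    for assembly, part, _supplier, qty in rows:
        view[assembly][part] = qty
    return dict(view)

purchasing_view = by_supplier(BASE)
factory_view = by_assembly(BASE)
```

Both views are recomputed from the same rows each time, so they cannot drift out of step with each other or with the source.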

Improving maintenance. Mediation adds value to data by applying the knowledge of the experts who create the mediator. Mediators should also be maintained by those experts so the quality of the functions remains effective in a constantly changing world. When an improved mediator is developed it can be advertised over the network to existing subscribers as well as to potential new clients; a poorly maintained mediator will lose value over time and be a candidate for replacement by a competitor. Existing customers can continue to use the old mediator version undisturbed until they decide their application needs the upgrade. The maintainer will, of course, try to keep the number of mediator versions modest. The charges for old mediators may also increase to encourage applications that depend on old versions to upgrade.

IMPLEMENTATION

Many applications take advantage of mediator technology. Early applications were in military intelligence, a domain where customers can impose no control over sources. Subsequent applications have focused on manufacturing--combining design and prototype production data--and aerospace. (One application at Lockheed Space Systems selects and validates gimbals for spacecraft antenna positioning; an interesting spin-off project collects and integrates satellite data for land-use planning.) Other mediators are being developed in healthcare, plant safety, and environmental cleanup.

Implementations vary greatly. Workstations are the favored platform, often using Unix. Many current mediators have been coded in C and C++; where knowledge-based processing is crucial, mediators have also been programmed in languages such as LISP or CLIPS, a C-compatible rule language. If optimization is important, the mediators may depend on packages written in Fortran. For the customer, the internal implementation should not be the issue, but for maintenance purposes, making a wise choice is crucial.

Some companies now focus on simply providing the framework for mediators. For instance, IBrain Software's core technology is a single framework for querying and analyzing information of multiple types, from multiple places, using multiple analytical methodologies. IBrain's domain focus has been finance, but its technology will probably also be applicable in industries such as healthcare, pharmaceuticals, manufacturing, and enterprise management.

MEDIATION IS HERE

No current mediated system performs the full set of tasks described here, although partial examples do exist. Several commercial solutions, such as that from Persistence Software, already support the creation of objects from relations. If that technology can be applied to mediator generation, the scale and significance of that technology may increase considerably. (Some projects are already underway.)

Some integrators have the capability to build the required application interfaces and implement this architecture. The platforms and languages vary and there is some discussion on style (such as fat vs. thin mediators). But as these integrators interact with their customers to acquire domain knowledge and deploy more implementations, mediators will be installed more rapidly and be better maintained by their owners.

Visit www-db.stanford.edu/LIC/DBPDrefs.html for references.


Gio Wiederhold is a professor of computer science at Stanford University. He has developed databases and advanced application technology there since 1969. Gio also managed mediation technology support during an assignment at DARPA between 1991 and 1994. You can reach him via his Web page at www-db.stanford.edu/people/gio.html.
 


 
 

Copyright © 1998 Miller Freeman Inc. All Rights Reserved
Redistribution without permission is prohibited.
