WebScripter 2003 Intent of Work (IOW)

Robert MacGregor, Robert Neches

Information Sciences Institute, University of Southern California

4676 Admiralty Way, Marina del Rey, CA 90292

(310) 822-1511, [email protected], [email protected]

1 Problem Statement

DARPA’s DAML program faces two key challenges. First, we need much more content; there is a paucity of DAML knowledge available on the Web, and there are few tools available to generate more content. Second, we need to demonstrate that the emerging DAML knowledge sources and tools enable solutions to previously intractable real-world problems.

The WebScripter project has produced a tool that enables easy construction of reports derived from multiple, heterogeneous DAML knowledge sources. A WebScripter report can be regenerated automatically when the source data changes, enabling up-to-date viewing of dynamically-changing data. More than 200 sites have downloaded the WebScripter source code. However, there is still not a lot of DAML content available as grist for new WebScripter reports. Hence, we are shifting our focus and are now building a new generation of WebScripter tools capable of generating as well as consuming DAML knowledge sources.

We are focusing on two key problems. The first one we will dub the “Collaborative Markup” problem. In the paper world, people working on a common project often collaborate by scribbling notes on paper documents, and passing around the marked-up documents. This approach is very effective for small projects, but becomes progressively less effective as the size of a project (number of documents) grows. Our objective is to provide a markup capability for Web documents (and ultimately, all electronic documents) that is similarly useful, and that can scale. For people to find our markup system usable, markups must be easy to read and modify. However, scale-up becomes possible only if the computer can assist in organizing and retrieving the marked-up documents, and this in turn is possible only if the markups are machine interpretable. With machine-interpretable markup (e.g., DAML markup), the potential exists to support collaborative activities that scale orders of magnitude beyond what is possible in the paper world or the current computer world.

Our second objective is to solve the “Document Classification” problem. Currently, there are vast quantities of useful documents on the Web, and relatively few tools for retrieving and organizing them. Keyword-based search is still the standard for document retrieval, but it fails when a search is based on the meaning of words (rather than the words themselves), or if the document has no words (e.g., a satellite image). Semantic markup provides a means for relieving both types of failure: retrieval of documents based on associated “semantic annotations” enables significantly more precise document retrieval, and it allows for the automated compilation of document collections categorized according to the semantics of the annotations.

2 Technical Approach

For 2003, we are addressing the Collaborative Markup problem by developing a Collaborative Semantic Annotation (CSA) system that enables users to mark up arbitrary Web documents, and to view, edit, and comment upon each other’s annotations. This work will be complemented by two concurrent projects, CHIME (sponsored by ARDA) and SIM-TBASSCO (funded under DARPA’s DASADA program).

The CHIME system is designed to support semantic markup (automated and manual) of imagery documents (maps). The goals of CHIME partially overlap those of CSA, but it uses geo-temporally situated maps as a medium for collaborative markup. CHIME’s potential users are intelligence analysts who currently lack an appropriate collaboration environment. The CHIME project is also DAML-based, and will rely heavily on CSA’s annotation management capability.

USC/ISI’s GeoWorlds and SIM-TBASSCO projects have made significant progress on our second problem, automated document classification. Their GeoTopics system can be used to generate Web portals that dynamically maintain specialized document collections. However, this system is hampered by the lack of semantic precision possible using only keyword-based classification. In parallel with the WebScripter work, our SIM-TBASSCO project will enhance GeoTopics with annotation management technology, enabling it to generate document portals containing much more precisely-targeted document collections.

2.1 Collaborative Semantic Annotation

We are pioneering a new approach to collaboration technology that features (i) semantic attachment of annotations to documents; (ii) monotonic updating; (iii) use of a triple store as a shared communication medium; and (iv) new semantic collaboration structures that formalize the virtual relationships and aggregates created in the annotation universe. Users of our Collaborative Semantic Annotation (CSA) tool will view and edit semantically-structured annotations that can be superimposed on any Web document. The DAML statements that represent the meaning of annotations, and that provide the “semantic glue” that interrelates annotations and documents, will not be visible to users.

2.1.1 Semantic Attachment

Rather than attaching an annotation to a document using a fixed offset (e.g., an XPointer), CSA annotations will be matched to text or data within a Web document. This means that links between an annotation and a document can survive changes to that document. This is particularly useful when the document is dynamically generated. The pages in a GeoTopics portal are new each day, but CSA annotations can be attached to those portions of the pages that survive from one day to the next (even when those portions relocate within a page). Annotating imagery (maps) is done by associating an annotation with a latitude/longitude coordinate. This effectively allows an annotation to attach to any number of maps that share the same geographic location.
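The difference between offset-based and content-based attachment can be illustrated with a minimal Python sketch. It is not the CSA matching algorithm, which this document does not specify; the `TextAnchor` class, its `quote` and `prefix` fields, and the `locate` function are all hypothetical names chosen for illustration. The sketch assumes an annotation remembers the text it was attached to plus a short leading context, and re-finds that text in each new version of the page.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TextAnchor:
    """Hypothetical anchor: the annotated text plus a little leading context."""
    quote: str        # the text the annotation is attached to
    prefix: str = ""  # text expected just before the quote (disambiguates repeats)


def locate(anchor: TextAnchor, document: str) -> Optional[int]:
    """Return the offset of the anchored text in `document`, or None if it is gone.

    If the quote occurs several times, prefer the occurrence that is
    immediately preceded by the prefix; otherwise fall back to the first one.
    """
    start = document.find(anchor.quote)
    first = start if start != -1 else None
    while start != -1:
        if anchor.prefix and document[:start].endswith(anchor.prefix):
            return start
        start = document.find(anchor.quote, start + 1)
    return first


anchor = TextAnchor(quote="port closed", prefix="Status: ")
monday = "Headline. Status: port closed. Weather fair."
tuesday = "New headline today. More news. Status: port closed."

# The anchored text has moved within the page, but the annotation still attaches.
print(locate(anchor, monday))
print(locate(anchor, tuesday))
print(locate(anchor, "Nothing relevant here."))  # the anchored text did not survive
```

A fixed offset recorded against Monday's page would point at unrelated text on Tuesday, whereas the content-based anchor follows the text to its new position.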

2.1.2 Monotonic Updates

Our collaboration tools are being designed to permit unfettered updating (including modification and retraction) of data within a WebScripter report or within an annotation template. Like the data in a database log, nothing is actually deleted. For example, changing a value in a cell from ‘2002’ to ‘2003’ will be accomplished by asserting a small set of DAML statements that record what the change was (“‘2003’ supersedes ‘2002’ in a particular cell”), when it occurred (a timestamp), and the author of the change. The WebScripter tools will be equipped with an automatic filtering capability that allows users to define rules that control how updates should be displayed. A simple rule for a single-valued cell might be “only show the data value having the most recent timestamp”. A more complex rule might add the restriction “ignore all updates except those made by authors A, B, or C”. Thus, users can be shielded from updates made by other users, if they so choose.
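The two display rules described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the WebScripter implementation: the `Update` record, the `UpdateLog` class, and the `visible_value` function are hypothetical names, and plain Python objects stand in for the DAML statements that would actually record each change.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass(frozen=True)
class Update:
    """One logged change to a cell (field names are illustrative)."""
    cell: str
    value: str
    author: str
    timestamp: int  # e.g. seconds since some epoch


class UpdateLog:
    """Append-only log: updates are added, never modified or deleted."""

    def __init__(self) -> None:
        self._updates: List[Update] = []

    def assert_update(self, update: Update) -> None:
        self._updates.append(update)

    def history(self, cell: str) -> List[Update]:
        return [u for u in self._updates if u.cell == cell]


def visible_value(log: UpdateLog, cell: str,
                  trusted_authors: Optional[Set[str]] = None) -> Optional[str]:
    """Show the most recent value, optionally ignoring untrusted authors."""
    candidates = [
        u for u in log.history(cell)
        if trusted_authors is None or u.author in trusted_authors
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda u: u.timestamp).value


log = UpdateLog()
log.assert_update(Update(cell="report1/r1c1", value="2002", author="A", timestamp=100))
log.assert_update(Update(cell="report1/r1c1", value="2003", author="D", timestamp=200))

# Rule 1: only show the value having the most recent timestamp.
latest = visible_value(log, "report1/r1c1")
# Rule 2: ignore all updates except those made by authors A, B, or C.
filtered = visible_value(log, "report1/r1c1", trusted_authors={"A", "B", "C"})
print(latest, filtered)
```

Because the log only grows, the two users see different values for the same cell without either one overwriting the other's data; the superseded value remains available in the history.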

2.1.3 Collaboration Structures

Users of CSA annotations will inhabit a universe of their own making. This universe will require new access control structures, new aggregation structures (as data is automatically classified in accordance with the annotations), and an increased reliance on metadata that defines levels of trust and security. For 2003, we plan on implementing skeletal access control and trust ontologies that will enable us to experiment within this new universe.

2.1.4 Asynchronous Collaboration

Combining semantic attachment, monotonic updating, and a shared annotation store yields the ability to support “asynchronous collaboration”. CSA users can make arbitrary updates to multi-media and database data and selectively view updates made by other users (collaborators), without explicitly establishing lines of communication between each other, and without the need for them to be on-line at the same time. We believe that this medium will evolve into a major new paradigm for collaborative activities.

2.2 Technical Infrastructure

To construct our CSA system, we have imported two major pieces of software, a knowledge acquisition tool and a “triple store”. Stanford’s Protégé system provides a powerful means for defining the ontology portion of a DAML knowledge base, and it offers a very flexible, forms-based GUI that we are using within CSA and CHIME as a user-friendly tool for constructing and editing annotations. Hewlett-Packard’s Jena (triple storage) system is designed to manage RDF and DAML knowledge bases. A triple store is the analog, for RDF statements, of an RDBMS for storing relational data, or an XML database manager for XML storage. Our integration of these tools merits additional discussion.

2.2.1 Protégé Knowledge Acquisition Tool

Our SIM-TBASSCO project has interfaced Stanford’s Protégé system to the Jena triple store. This provides a convenient means for defining new DAML ontologies (i.e., it provides a means for generating DAML content). It also enables DAML data to be viewed/edited from an object-oriented perspective. In contrast, WebScripter supports viewing and editing of DAML data using a tabular format. Both of these formats are equally important. For CSA we will enable users to compose and view annotations using either of these tools, so that they can choose whichever is most appropriate for their task.

2.2.2 Triple Store-based Management of DAML

Storing DAML data in a triple store offers a complementary approach (i.e., an alternative to Web pages) for storing and managing DAML data. Advantages include (i) scalability (storing data in databases rather than files enables access to much larger DAML sources), (ii) enhanced data integration (the triple store provides a focal point for integrating data from multiple sources), (iii) queriability (data in a triple store can be queried/filtered using database query and indexing tools), (iv) fine-grained updates (a user can update a single triple, without needing to load and store an entire DAML file), and (v) shared access (data stored in a triple store supports concurrent access by multiple users).
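Two of the advantages listed above, queriability and fine-grained updates, can be made concrete with a toy in-memory triple store. The sketch below is a stand-in written for illustration only; it does not use or imitate the Jena API, and the `TripleStore` class, its methods, and the `ann:`/`doc:` identifiers are all hypothetical.

```python
from typing import List, Optional, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


class TripleStore:
    """Toy in-memory triple store; a stand-in for illustration, not the Jena API."""

    def __init__(self) -> None:
        self._triples: Set[Triple] = set()

    def add(self, s: str, p: str, o: str) -> None:
        self._triples.add((s, p, o))

    def remove(self, s: str, p: str, o: str) -> None:
        self._triples.discard((s, p, o))

    def match(self, s: Optional[str] = None, p: Optional[str] = None,
              o: Optional[str] = None) -> List[Triple]:
        """Return the triples matching the pattern; None acts as a wildcard."""
        return sorted(
            t for t in self._triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)
        )


store = TripleStore()
store.add("ann:1", "annotates", "doc:report.html")
store.add("ann:1", "author", "alice")
store.add("ann:2", "annotates", "doc:report.html")
store.add("ann:2", "author", "bob")

# Query: which annotations are attached to the report?
on_report = [s for s, _, _ in store.match(p="annotates", o="doc:report.html")]
print(on_report)

# Fine-grained update: replace a single triple without reloading anything else.
store.remove("ann:2", "author", "bob")
store.add("ann:2", "author", "carol")
print(store.match(s="ann:2", p="author"))
```

A production store would add persistence, indexing, and concurrency control, which is what makes advantages (i) and (v) possible; the toy version only shows the data model and the pattern-matching style of access.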

The triple store now serves as a point of integration between all of our DAML tools. DAML data in the store can be viewed and edited using Protégé. WebScripter can compose reports derived both from (DAMLized) Web sources and from a triple store, and it can store its reports and associated articulation axioms into the triple store. All of the annotations produced by the planned Collaborative Semantic Annotation tool and by CHIME will be stored in a triple store.

We have implemented a layer of software between Jena and the rest of our tools, so a port to another triple store (e.g., PARKA) should be relatively easy. Note, however, that we have not abandoned Web publishing of DAML generated by WebScripter tools—we maintain a server that publishes dynamic HTML reports derived from DAML stored in the triple store.

3 Technology Transfer/Target Applications

Here we summarize planned and potential uses of the WebScripter tools.

3.1 GeoTopics/Asia-Pacific Area Network (APAN)

Our SIM-TBASSCO project is upgrading GeoTopics to support semantically-defined document collections: the DAML intersection, union, and set difference operators will be utilized to define an algebra of document categories. This will enable users to define categories of documents with a precision unattainable using ordinary keyword-based retrieval. This capability has been requested by users at Asia-Pacific Area Network (APAN), located at the US Pacific Area Command (USPACOM), who today use a GeoTopics portal to help them monitor reports of terrorist activities in the Asia-Pacific region. The DAML produced by the upgraded geospatial reasoning tools will provide additional DAML content for WebScripter and other DAML tools.
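The idea of an algebra of document categories can be sketched in Python by treating each category as a predicate over a document's semantic annotations and composing predicates with intersection, union, and set difference. This is an illustration only: the function names, the tag vocabulary, and the sample documents are invented here, and the real system would express the categories in DAML rather than in Python.

```python
from typing import Callable, Dict, Set

Tags = Set[str]                    # the semantic annotations attached to one document
Category = Callable[[Tags], bool]  # a category is a membership test over those tags


def has(tag: str) -> Category:
    """Primitive category: documents annotated with `tag`."""
    return lambda tags: tag in tags


def intersection(a: Category, b: Category) -> Category:
    return lambda tags: a(tags) and b(tags)


def union(a: Category, b: Category) -> Category:
    return lambda tags: a(tags) or b(tags)


def difference(a: Category, b: Category) -> Category:
    return lambda tags: a(tags) and not b(tags)


def members(category: Category, docs: Dict[str, Tags]) -> Set[str]:
    """Return the ids of the documents that fall in `category`."""
    return {doc_id for doc_id, tags in docs.items() if category(tags)}


# Hypothetical annotated documents.
docs = {
    "doc1": {"Terrorism", "AsiaPacific"},
    "doc2": {"Terrorism", "Europe"},
    "doc3": {"Piracy", "AsiaPacific"},
    "doc4": {"Terrorism", "AsiaPacific", "Archived"},
}

# (Terrorism OR Piracy) AND AsiaPacific, minus anything already archived.
threat = union(has("Terrorism"), has("Piracy"))
regional = intersection(threat, has("AsiaPacific"))
current = difference(regional, has("Archived"))
print(sorted(members(current, docs)))
```

Because each operator returns another category, definitions nest to any depth, which is what gives a category algebra more precision than a flat keyword query.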

3.2 Collaborative Heterogeneous Information Mosaiking Environment (CHIME)

CHIME will enable data (annotations) to be displayed on maps, organized according to an N-dimensional classification (based on geolocation, temporal location, and additional semantic classifications). WebScripter’s Annotation Manager will provide the engine that enables dynamic display and on-line editing of the geo-temporally-located annotations.
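A minimal Python sketch of selecting annotations along the three kinds of dimension named above (geolocation, time, and a semantic class) follows. It is not the CHIME design; the `GeoAnnotation` and `MapView` classes, the `select` function, and the sample data are hypothetical, and the bounding-box test deliberately ignores map views that cross the antimeridian.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class GeoAnnotation:
    """Hypothetical annotation tied to a coordinate, a time, and a topic."""
    text: str
    lat: float
    lon: float
    time: int   # e.g. day number
    topic: str  # stands in for a semantic classification


@dataclass(frozen=True)
class MapView:
    """Rectangular map extent in degrees (no antimeridian wraparound handling)."""
    south: float
    north: float
    west: float
    east: float

    def contains(self, lat: float, lon: float) -> bool:
        return self.south <= lat <= self.north and self.west <= lon <= self.east


def select(annotations: List[GeoAnnotation], view: MapView,
           start: int, end: int, topic: Optional[str] = None) -> List[GeoAnnotation]:
    """Keep annotations inside the view, inside the time window, and on topic."""
    return [
        a for a in annotations
        if view.contains(a.lat, a.lon)
        and start <= a.time <= end
        and (topic is None or a.topic == topic)
    ]


annotations = [
    GeoAnnotation("harbor report", lat=14.6, lon=121.0, time=10, topic="shipping"),
    GeoAnnotation("market note", lat=14.6, lon=121.0, time=50, topic="economy"),
    GeoAnnotation("port delay", lat=-6.2, lon=106.8, time=12, topic="shipping"),
]

# Any map whose extent covers the coordinate can display the annotation,
# because the annotation is tied to the location rather than to one map.
philippines = MapView(south=5.0, north=20.0, west=115.0, east=130.0)
region = MapView(south=-10.0, north=20.0, west=100.0, east=130.0)

print([a.text for a in select(annotations, philippines, start=0, end=20)])
print([a.text for a in select(annotations, region, start=0, end=20, topic="shipping")])
```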

3.3 SONAT and Horus/Intelink Portal Automation

Currently, WebScripter is being used by the SONAT project to generate DAML reports. The GeoWorlds scripting capability that underlies our GeoTopics portal makes it easy to create custom portals that dynamically generate hierarchically-organized document collections for specialized information topics. Our Collaborative Semantic Annotation capability can be employed independently, or in combination with a portal application. We will negotiate with the SONAT and Horus developers to introduce one or both of these technologies into their systems.

4 Metrics

Here we offer some technical benchmarks for scalability, ease of use, and interoperability that can be employed to gauge whether or not we are advancing the ultimate metric, which measures the number of users and applications that employ WebScripter tools and technology.

Scalability: we aim to demonstrate that the Annotation Server that underlies our tools will remain responsive as the DAML knowledge base grows beyond 10,000 statements. Perhaps the main obstacle here is finding a means of generating that many annotations. The CHIME project, which will include the ability to automatically generate annotations, may yield the most realistic test for this benchmark.

Ease of use: the first-generation WebScripter supports a paradigm whereby naive users can achieve semantic integration of data from disparate sources. We will be using Protégé, a “power user” tool, to construct the classes and templates used in our annotations. Our goal here is to invent a simplified user interface, and demonstrate the routine creation of annotation templates by “ordinary” users.

Interoperability: the goal here is to demonstrate that our two distinct view/edit capabilities (WebScripter and Protégé) operate successfully (i.e., usefully) in multiple annotation environments. We will demonstrate their utility both for helping users to manage the CHIME annotation store and for a GeoTopics-based annotation application. We hope as well to demonstrate our annotation capability executing within SONAT and/or Horus.

5 Conclusions

DARPA’s DAML program is confronted with a classic bootstrapping problem, wherein the value of the DAML tools manifests only when there is a sufficiently large base of knowledge on the Web in DAML form. The WebScripter project’s new direction (Collaborative Semantic Annotation) circumvents that problem by using DAML as a “semantic glue” that increases the utility of existing multi-media documents and database data. We believe that this approach has the potential to hasten the date when the Semantic Web becomes a reality. We see this as a step towards the “semantic desktop”, an environment where all of the virtual objects in a user’s computing environment will be enhanced with DAML (or “son of DAML”) annotations that define the structure, relationships, and behavior of that universe.