An Architecture for Web Visualisation Systems

Petros A. Demetriades and Alexandra Poulovassilis
Department of Computer Science
King's College London,
Strand, London WC2R 2LS, UK

e-mail: {petros, alex}@dcs.kcl.ac.uk

Abstract

We describe an architecture for developing World Wide Web (web) visualisation systems. Our approach differs from that used in other such systems in that it is not an implementation of a particular visualisation technique but rather a framework which allows the development of extensible visualisation systems, and which facilitates experimentation and re-usability of components. This is achieved by providing a mechanism for interconnecting the three basic types of entities involved in visualising the web: Collectors, which are agents that gather web-related data to be visualised; DataStores, which are repositories into which the collected data is stored; and Views, which perform the actual interpretation of the data into some visual form. It allows many different Collectors and Views to be connected to the same DataStore and to simultaneously update and visualise it. In addition to facilitating experimentation and side-by-side comparison of different collection, storage and visualisation methods and techniques, this architecture also promotes re-usability of existing components, such as visualisation engines, web-spiders and data repositories.

1. Introduction

The popularity and widespread use of the web has caused a number of usability problems to surface. The two most troublesome ones have come to be known as "information overload" and the "lost in hyperspace syndrome". The former refers to the difficulty in discovering relevant information and resources on the web and the latter to the difficulties associated with orientation within, and navigation of, the web. These problems are closely related to the fact that the web and the Internet are very large and continue to expand at an exponential rate [ 12 ]. Unless these problems are resolved, the usefulness of the web as an information resource will decrease as its size increases.

One approach to a solution is the development of effective techniques for visualising web-space. We define "web-space" as a collection of HTML documents (web-pages) that are somehow connected. Examples of web-spaces are arbitrary sub-sets of the entire web, users' browsing histories or the results of a web search. Visual interpretations of web-space can aid navigation and orientation in cyber-space in much the same way that conventional road maps aid navigation and orientation in real-space. Similarly, they can improve information discovery by facilitating exploration and navigation of the results of web-queries. We define a "web-query" as the process of searching for information on the web using any of the available methods. We term the set of URLs returned by a web-query the "hits" of the web-query.

The process of visualising web-space involves 3 activities:
  1. collection of the web-space,
  2. temporary or persistent storage, and
  3. interpretation of the web-space into some visual form.

How we collect, store and subsequently visualise web-space, depends on its exact type. The three types of web-space identified above, for example, could be collected, stored and visualised as follows:

Firstly, an arbitrary subset of the web can be collected using a web-spider [ 1011 ] and stored in either a relational database using two relations (one for the web-pages and one for the links) or in a database which supports set attributes in one relation <U,S> where S contains the URLs of all the pages to which the page U links to. This information can be visualised as a directed graph where nodes are web-pages and edges are links between web-pages so that an edge (V1,V2) represents a physical link from page V1 to page V2 and indicates that the web-page V1 contains a hyperlink to the web-page V2. For example, Figure 5 shows how a subset of "http://www.dcs.kcl.ac.uk" comprising 5 pages could be stored and visualised. The example assumes that the index page A links to pages B, C, D and E, that pages C, D and E contain no links, and that page B contains a link back to the index page A.

Secondly, the results of web-queries can be collected either by a search client which interrogates Internet Index Servers such as AltaVista [3], Infoseek [4] and Excite [5] or by some other means such as WebSQL [6]. The results can be stored in both a relational DBMS and a DBMS which supports set attributes in one relation containing the URLs returned. Alternatively, if the approach we chose for our example system described in Section 4 is used then the results can be stored in an RDBMS as two relations, one containing the host part of the URLs returned and the other containing the relative path to each particular hit within the host , or in a DBMS that supports set attributes as one relation <H,I> where I is the set of hits that reside on host H. Figure 6 demonstrates this for a web-query that results in the 3 hits: http://www.dcs.kcl.ac.uk/petros/WWW7Paper.html, http://www.dcs.kcl.ac.uk/alex/WebPaper.html and http://www.w3c.org/Visual.html. Such web-space can be visualised as a list of hits, grouped by host as shown in Figure 4.

Lastly, browsing history is not generally collected in a user-initiated manner, as for the above two types of web-spaces, but by a process continuously running along-side a browser or by a proxy server. It can be stored in an RDBMS in two relations one for the URLs of visited pages and one for the details of each visit. It can be visualised as a labelled, directed graph where nodes represent web-pages and labelled edges represent the order in which pages are visited. For example, if a user browsed the web-subset described in Figure 5 in the order A, C, A, E, A, D, A, B and then jumped to the web site of the WWW Consortium (http://www.w3c.org) using a predefined bookmark then this session could be stored and visualised as in Figure 7.

This need for different collection, storage and visualisation requirements for each type of web-space has led us to the development of an architecture to facilitate the experimentation and exploration of different collection, storage and visualisation techniques with minimal re-development and re-design effort.

We give a description of the conceptual structure of our architecture in Section 2 of the paper and proceed to give implementation details of one instance of this architecture in Section 3. To demonstrate the possible visualisation systems that could be built within we give details in Section 4 of one specific implementation that we have recently undertaken. We end in Section 5 by summarising our conclusions and future research directions.

2. The Conceptual Architecture

Conceptually the architecture (Figure 1) consists of three independent, stand-alone entities:

A core layer binds these entities together and provides a mechanism for communication and reliable message passing between them. This further reduces the complexity that needs to be built into each entity by transparently performing certain operations.


Figure 1: Block Diagram of our Architecture

Figure 1: Block Diagram of our Architecture

The core exposes a set of interfaces to all entities to abstract the operations of performing general querying and updating of the web-space irrespective of the actual data model used to store it, and message passing between the entities. In addition to its role as an "abstraction layer" between the entities the core also transparently performs some behind-the-scenes operations which further reduce the complexity of each entity. Firstly, whenever a view or collector performs a change to a DataStore the core immediately notifies all other entities of the change performed (in as precise terms as possible) so that they can alter their states to reflect the new state of the system. Secondly, since DataStores are essentially shared resources, it provides concurrency control and prevents access to them by more than one entity at the same time (this facility is only used for DataStores that do not provide their own concurrency control).

3. Physical Implementation of the Architecture

We have implemented a prototype of our architecture for the Windows 95 and Windows NT Operating Systems (collectively known as Win32). The prototype is written in C++ and uses the MFC application framework [15]. The main characteristics of this implementation are:

Figure 2 shows the physical architecture and how it relates to the conceptual architecture of Section 2.


Figure 2: Physical Architecture and relation to Conceptual Architecture

Figure 2: Physical Architecture and relation to Conceptual Architecture

Although this prototype is implemented on Win32 it is possible to bind together entities running on different platforms (e.g. UNIX) as long as there is some way of "talking to them". All that is required is to code a small stub of the required type of entity on the Win32 platform to act as the go-between the two systems.

This implementation suffers from two limitations. Firstly the core and at least a small part of each entity must reside and execute on the Win32 platform even if the actual entity resides and executes on a different system or platform. This reduces flexibility and slightly increases development time. Our architecture can be greatly enhanced if it is implemented in a distributed computing model such as that specified by CORBA [18] or DCOM [19] as this will allow much more flexibility in the kind of entities that could attach to an existing system and would enhance entity location independence and transparency.

Secondly, the update and query operations that the views and collectors can perform are limited to insertion, sorting, filtering and counting. More complex operations must be performed using a direct interface to the physical database system used in the DataStore. This is not a limitation of the architecture but rather a shortcut we have taken to reduce the development time of this first prototype. What is required in order to perform more complex queries is the implementation of an additional interface function to allow sending arbitrarily complex queries in the form of some DML, e.g. SQL, to the DataStore.

4. An example system: Visualising the results of web-queries

In order to demonstrate our architecture and test our prototype we have developed a system for visualising the results of web-queries. The objective of the system is to facilitate information discovery in three ways: Firstly, by storing the hits in a database and allowing the user to interactively explore them by performing operations such as sorting, searching and filtering. Secondly by having multiple collectors (two, in this simple demonstration system) that query index servers simultaneously. Thirdly by clustering hits by host and assigning a "hit-count" attribute to each host i.e. the number of hits that reside on it 1.

The system comprises the following entities:

  1. Three Views: The HostsView displays host information, the HitsView displays hit information and the User Interface View serves as a container for the other two views and provides a unified interface to a user.
  2. Two Collectors: These collect web-space by querying Internet index servers. They submit a query to their assigned index server, receive the HTML document returned, parse it, extract the hit URLs and descriptions from it and send this information to the DataStore for persistent storage. These two particular collectors query the Infoseek and Excite index servers.
  3. Two Linked Lists as the DataStore: For this demonstration system we used two linked lists as our DataStore. One list stores the hosts and the other the hits.
Each of these entities is implemented as a different class as described in Section 3. The system operates as follows:

The user enters keywords describing the information sought, using the interface provided by the User Interface (UI) view (Figure 3). This view creates the two collectors and attaches them to the core using the AttachEntity function provided by the core and passes the query term to them. The collectors submit the query to the two Internet index servers simultaneously and wait for the servers to begin relaying the results. Upon receipt of the HTML documents, the collectors begin a cycle of parse, extract hit, send to DataStore until either the index servers report that there are no more results, or the required number of hits has been reached. They use the core function InsertNode to send each hit to the DataStore. Whenever the core receives an InsertNode request it forwards it to the DataStore and upon successful insertion, sends a NodeInserted message to all the views. The UI view does not act upon this message but the other two do. The HostsView displays any new hosts using the GetNode function and updates the count of hits per publisher using the GetCount function. The HitsView displays the new hit again using the GetNode function.


Figure 3: The query entry dialogue box

Figure 3: The query entry dialogue box

When the required number of hits has been reached or the index servers report no more hits, the collectors terminate and send a Collection Termination message. This is received by all other entities which act as follows:

When the core receives the SortData request it sends it to the DataStore. Upon successful completion it sends a DataChange message (along with the information that the change is one of sort order) to the views which act as follows:

Now that the simple interactivity features of the views have been enabled, the user can interact with them. For example clicking on a host causes the hosts view to send a SetFilter request to the DataStore via the core. This request also states what the filter is.

Upon successful completion the core sends a DataChange message to the views along with details of what the change was. The views act as follows:

Figure 4 shows a screen shot of the three views. The enclosing frame with the menu bar and tool bar is the UI view. The left-hand side pane is the HostsView and the right hand side is the HitsView. The query term entered was "Windows".


Figure 4: Our demonstration system - A Search Client

Figure 4: Our demonstration  system - A Search Client

5. Conclusions and Further Work

We have described an architecture for developing web visualisation systems. Our architecture is novel in that it facilitates:

  1. Re-usability of existing DBMS, visualisation and web-space collection systems by providing a simple mechanism for interfacing to them, even if no source code is available and even if they are running on different platforms.
  2. Comparison of visualisation techniques as applied to the web by facilitating the application of different visualisations to the same web-space.
  3. Experimentation of storing web information using different data models.

We demonstrated the feasibility of this architecture by implementing a simplified version for Windows 95 and NT and proceeded to show how visualisation systems could be built within this by utilising it to build a simple search client application.

We are currently investigating the effectiveness of the Hypernode Model [17] for storing web-spaces. This model is based on nested, directed graphs termed hypernodes. We believe that the model's inherent support for nesting of graphs will provide a powerful tool for structuring and visualising large web-spaces. We are also investigating the use of Hyperlog [16] for browsing, querying and updating such web-spaces. Hyperlog is a graph-based database language that visualises schema information, data, and query output as sets of hypernodes. This information is stored, browsed and queried in a uniform way. In the context of visualising and manipulating web-spaces we wish to explore using Hyperlog for structuring and restructuring web-spaces and for authoring of customised documents and links.

References

[1] Network Wizards.Internet Domain Survey, January 1998.http://www.nw.com/zone/WWW/report.html
[2] Matthew Gray of MIT. Internet Statistics: Growth and Usage of the Web and the Internet, June 20, 1996.http://www.mit.edu/people/mkgray/net/
[3] Digital Equipment Corporation, AltaVista Internet Index Server. http://www.altavista.digital.com/
[4] Infoseek Corporation, Infoseek Internet Index Server.http://www.infoseek.com/
[5] Excite Incorporated, Excite Internet Index Server. http://www.excite.com/
[6] Alberto O. Mendelzon, George A. Mihaila, and Tova Milo. Querying the World Wide Web, Journal of Digital Libraries 1(1), pp. 68-88, 1997.
[7] Tim Berners-Lee. Universal Resource Locators. Internet working draft, 1 Jan 1994 (now expired).http://info.cern.ch/hypertext/WWW/Addressing/URL/Overview.html
[8] Martin Charles Golumbic. Algorithmic Graph Theory and Perfect Graphs, Academic Press, Inc., 1980.
[9] David Eichmann. The RBSE Spider -Balancing Effective Search Against Web Load. In proceedings of the First International World Wide Web Conference, CERN Geneva, May 1994.http://www.cern.ch/PapersWWW94/spider.ps
[10] Roy T. Fielding. Maintaining Distributed Hypertext Infostructures: Welcome to MOMspider's Web. In proceedings of the First International World Wide Web Conference, CERN Geneva, May 1994.http://www.cern.ch/PapersWWW94/fielding.ps
[11] Charles Petzold. Programming Windows 95. Microsoft Press 1996.
[12] Microsoft Corporation. Programmers Guide to Microsoft Windows 95. Microsoft Press 1995.
[13] Robert Lafore. C++ Interactive Course. Waite Group Press 1996.
[14] Microsoft Corporation. Microsoft Foundation Class Library.http://premium.eu.microsoft.com/msdn/library/devprods/vc++/vcmfc/D2/vcmfchm.htm
[15] Microsoft Corporation. MFC Document/View Architecture Topics.http://premium.eu.microsoft.com/msdn/library/devprods/VC++/Progsguide/F1F/D24/S51EE.HTM
[16] Stefan Hild and Alexandra Poulovassilis. Implementing Hyperlog, a Graph-Based Database Language. In Journal of Visual Languages and Computing (1996) 7, pp. 267-289.
[17] Alexandra Poulovassilis and Mark Levene. A Nested-Graph Model for the Representation and Manipulation of Complex Objects. In ACM Transactions on Information Systems, Vol. 12, No. 1, January 1994, pp. 35-68.
[18] Object Management Group. CORBA for beginners.http://www.omg.org/news/begin.htm
[19] Microsoft Corporation. COM Technologies (COM,DCOM, COM+, MTS, and ActiveX).http://www.eu.microsoft.com/cominfo/. February 1998.
[20] Jeromy Carriere and Rick Kazman. WebQuery: Searching and Visualizing the Web through Connectivity. In Hyperproceedings of the Sixth International World Wide Web Conference, Santa Clara, California USA, April 7-11, 1997. http://proceedings.www6conf.org/HyperNews/get/PAPER96.html
[21] Gustavo O. Arocena, Alberto O. Mendelzon, George A. Mihaila. Applications of a Web Query Language. In Hyperproceedings of the Sixth International World Wide Web Conference, Santa Clara, California USA, April 7-11, 1997. http://proceedings.www6conf.org/HyperNews/get/PAPER267.html

1The rationale behind the hit-count attribute stems from the fact that current index servers are keyword-based and therefore not very accurate in selecting documents that are relevant to a web-query expression. We envisaged that if a host contains a large number of documents that the index servers "think" are relevant to a query (i.e. has a high hit-count) , then it is more likely that they are all truly relevant to the query than the single document on some other host with a hit-count of one. The hit-count attribute can thus be considered a measure of the probability that documents residing on a host are relevant to a web-query. Figure 4 shows the results for the query "Windows" sorted by decreasing order of hit-count. As can be seen the first host listed (with by far the highest hit-count) is that of the web site of Microsoft Corporation. The use of "connectivity" and "structure" in such a manner is not new and has been proposed as a solution to this problem of imprecision of keyword-based searching in several different forms [ 2021 ]. Our experience with our prototype system supports this view, as a definite improvement over the standard method of searching the web (i.e. using an index server directly), was perceived. We however postpone a final judgement to a future publication, pending a thorough, larger-scale investigation.
Figure 5: Storage and visualisation of an arbitrary subset of the web

StorageVisualisation

1. Relational DBMS:  One relation for the web-pages (nodes) and one for the links (edges)

(Note that the host part "http://www.dcs.kcl.ac.uk/" of the URL
for records A to E is not shown here to save space.)
N O D E S
Page IDPage URL
Aindex.html
Bteaching/yellow-book/index.html
Cteaching/msc/index.html
Dresearch/phd-CS.html
Eteaching/index.html
E D G E S
Page IDLinks To
AB
AC
AD
AE
BA

2. DBMS that supports set attributes:  One relation for both nodes and edges

Page IDPage URLLinks To
Aindex.html{B,C,D,E}
Bteaching/yellow-book/index.html{A}
Cteaching/msc/index.htmlø
Dresearch/phd-CS.htmlø
Eteaching/index.htmlø

1. Directed Graph:

Example Visualisation of an Arbitrary Web-Subset as a Directed Graph


Figure 6: Storage of the results of web-queries

1. Relational DBMS:  One relation for the hosts and one for the hits

H O S T S
Host IDHost Part of URLHit-Count
1http://www.dcs.kcl.ac.uk2
2http://www.w3c.org1
H I T S
Hit Part of URLHost ID
/petros/WWW7Paper.html1
/alex/WebPaper.html1
/Visual.html2

2. DBMS that supports set attributes:  One relation for both hosts and hits

HostHit-CountHits On Host
http://www.dcs.kcl.ac.uk2{"/petros/WWW7Paper.html",
"/alex/WebPaper.html"}
http://www.w3c.org1{"/Visual.html"}


Figure 7: Storage and visualisation of browsing history

StorageVisualisation

1. Relational DBMS:

(Note that the host part "http://www.dcs.kcl.ac.uk/" of the URL
for records A to E is not shown here to save space.)
IDPage URL
Aindex.html
Bteaching/yellow-book/index.html
Cteaching/msc/index.html
Dresearch/phd-CS.html
Eteaching/index.html
Fhttp://www.w3c.org/
OrderDateTimeID
120/03/199814:00:40A
220/03/199814:02:00C
320/03/199814:03:00A
420/03/199814:03:02E
520/03/199814:05:12A
620/03/199814:05:15D
720/03/199814:07:30A
820/03/199814:07:35B
920/03/199814:09:40F

1. Labelled Directed Graph:

Example Visualisation of Browsing History as a Labelled, Directed Graph