The development of domain-specific search engines as an aid to Internet portals

1995-present

Project Team

Dr Keith Mannock; (past members to be listed)

Background

General-purpose search engines exhibit a number of problems: 1) they have reasonably useful coverage but are not specialised, which means that 2) it is difficult to obtain high precision with them, and 3) query formulation is not clear to the naïve user. For a user of a portal these problems need to be addressed, in particular the precision of the information being presented. In our work a portal is taken as being:

Most current portals offer a "my" option, personalised news coverage, etc. in an attempt to make the portal a sticky site. Portals typically command the highest pricing for banner ad placement. There are two main problems with creating and maintaining a portal: 1) building a portal is a labour-intensive process, and 2) keeping it up to date requires a significant ongoing effort.

Main Objectives

In this research we are building a prototype system which has the following objectives:

To this end we have developed an architecture which draws upon techniques from the following domains: Information Retrieval, Database Management, Machine Learning and Distributed Systems. The system is based upon a search engine architecture which has

Specific Features

The novel features of our architecture fall into three main areas:

  1. Reinforcement learning
    used for exploration in a directed fashion
    used in the Spider and Indexer modules
  2. Text classification
    used to form a browseable topic hierarchy
    used in the Indexer module
  3. Information extraction
    used to find specific search features
    used in the Indexer module
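
As an illustration of the first feature, the following is a minimal sketch of directed exploration, not the DoSE implementation. It assumes a value estimate for each link, of the kind a reinforcement learner might produce from anchor text, and uses it to order the spider's frontier; the learning step itself is not shown. The toy web graph, the word values and the names TOY_WEB, WORD_VALUE, link_value and directed_crawl are all hypothetical.

```python
import heapq

# Hypothetical toy web graph: page -> list of (anchor_text, target_page).
TOY_WEB = {
    "home": [("research projects", "research"), ("campus map", "map")],
    "research": [("search engine paper", "paper1"), ("staff list", "staff")],
    "map": [("parking", "parking")],
    "staff": [("machine learning paper", "paper2")],
    "paper1": [],
    "paper2": [],
    "parking": [],
}

# Hypothetical learned values: expected future reward of following a link
# whose anchor text contains the given word (assumed, not taken from DoSE).
WORD_VALUE = {"paper": 1.0, "research": 0.5, "staff": 0.25}


def link_value(anchor_text):
    """Estimate the value of a link as the best value among its anchor words."""
    return max((WORD_VALUE.get(w, 0.0) for w in anchor_text.split()), default=0.0)


def directed_crawl(start, budget):
    """Visit up to `budget` pages, always following the highest-valued link."""
    # Heap entries are (negated value, insertion order, page); heapq is a
    # min-heap, so negating the value pops the most promising link first.
    frontier = [(0.0, 0, start)]
    counter = 1
    seen = {start}
    visited = []
    while frontier and len(visited) < budget:
        _, _, page = heapq.heappop(frontier)
        visited.append(page)
        for anchor, target in TOY_WEB[page]:
            if target not in seen:
                seen.add(target)
                heapq.heappush(frontier, (-link_value(anchor), counter, target))
                counter += 1
    return visited


if __name__ == "__main__":
    print(directed_crawl("home", 5))
```

On this toy graph the directed spider reaches both "paper" pages within five page visits, whereas a breadth-first traversal from "home" would also visit the "map" and "parking" pages before reaching the second paper.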

Experimental Study

DoSE has been used on a number of real-world portals to determine the functionality and efficiency of the architecture. Specifically, it has been tested on:

The resulting study found that DoSE was twice as efficient as a topic-focussed spider and three times more efficient than breadth-first search. DoSE extracts ten fields from spidered documents with 80% accuracy (including extraction of multimedia content). It places each URI into a fifty-leaf hierarchy with 75% accuracy, which compares favourably with human levels of agreement.


Last updated: Sunday, November 17, 2002