How Panoptic Works

Panoptic provides the ability to search over collections of documents. The collection is defined by an administrator and then updated by the Panoptic search engine. The update process normally involves gathering data, converting some files to text and then indexing the data to allow the query processor to answer queries.
[Figure: Panoptic architecture diagram]
 
The diagram above shows the main components of the system, while the different phases of a collection update are described in more detail below.
 
Gathering Data
 
The first step in updating a collection is to gather data. The default way to do this is to use a web crawler (spider) to crawl a set of web servers and download a snapshot of the documents to be indexed. The task of the crawler is to discover and fetch as many web pages as possible within the scope of the search service. Your Panoptic administrator defines the scope by means of a seed list of URLs and a list of internet domains to be included. For example, the initial seed list may be as simple as http://www.panopticsearch.com/index.html and the list of domains may be as simple as panopticsearch.com/.

In this example, the spider would fetch the seed page and extract the URLs of all the pages to which it links. Each of these URLs is checked to see whether it has already been processed and whether it lies outside the panopticsearch.com/ domain. If neither is the case, the URL is added to the list to be processed. The content of the original page is then saved for indexing.

The process of taking a URL off the list, fetching the page, extracting its links and saving its content for later indexing continues until the list is empty. It follows that a web page which cannot be reached by following links from a seed page will not be discovered. Your Panoptic administrator can specify that certain websites or directories should be excluded from spidering and can also control what types of files are spidered.
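
The crawl loop described above can be sketched in a few lines of Python. This is only an illustration of the general technique, not Panoptic's actual spider: the names crawl, in_scope and LinkExtractor are hypothetical, the domain is written without its trailing slash, and a small in-memory dictionary stands in for real web requests.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def in_scope(url, domains):
    """True if the URL's host is an included domain or one of its subdomains."""
    host = urlparse(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in domains)


def crawl(seeds, domains, fetch):
    """Breadth-first crawl; fetch(url) returns HTML text, or None on failure."""
    queue = deque(seeds)   # URLs still to be processed
    seen = set(seeds)      # URLs already queued or processed
    saved = {}             # url -> page content kept for the indexing phase
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        saved[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop any #fragment.
            link, _fragment = urldefrag(urljoin(url, href))
            if link not in seen and in_scope(link, domains):
                seen.add(link)
                queue.append(link)
    return saved


# A tiny in-memory "web" standing in for real HTTP requests (assumed data).
SITE = {
    "http://www.panopticsearch.com/index.html":
        '<a href="about.html">About</a> <a href="http://example.org/">Ext</a>',
    "http://www.panopticsearch.com/about.html":
        '<a href="index.html">Home</a>',
    "http://www.panopticsearch.com/orphan.html": "no links point here",
}

pages = crawl(["http://www.panopticsearch.com/index.html"],
              ["panopticsearch.com"], SITE.get)
print(sorted(pages))
```

In this toy site, orphan.html is never fetched because no page links to it, and the example.org link is skipped because it lies outside the included domain.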

It is also possible to gather data by copying files from a network-mounted filesystem (for example, an NT fileserver).
 
Text Extraction (Format Conversion)
 
If the gathering phase brings back files which are not HTML or plain text files (for example, Word, RTF, PDF and PostScript), it is necessary to extract the text from them before indexing. Panoptic provides a framework which scans all the gathered files, determines the type of the file and calls the appropriate third-party filter.
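
As a rough illustration of such a framework, the sketch below detects a file's type and dispatches to a registered converter. Everything here is assumed for the example: detect_type, extract_text and the FILTERS table are hypothetical names, the type detection is deliberately simplistic, and stub_filter merely stands in for a real third-party filter.

```python
def detect_type(name, data):
    """Guess a file type from leading magic bytes, else from the extension."""
    if data.startswith(b"%PDF-"):
        return "pdf"
    if data.startswith(b"%!PS"):
        return "postscript"
    if data.startswith(b"{\\rtf"):
        return "rtf"
    suffix = name.rsplit(".", 1)[-1].lower() if "." in name else ""
    by_suffix = {"html": "html", "htm": "html", "txt": "text", "doc": "word"}
    return by_suffix.get(suffix, "unknown")


def passthrough(data):
    """HTML and plain text need no conversion before indexing."""
    return data.decode("utf-8", errors="replace")


def stub_filter(label):
    """Hypothetical stand-in for a third-party converter."""
    def convert(data):
        return f"[text extracted by {label} filter from {len(data)} bytes]"
    return convert


# Registry mapping each detected type to the filter that handles it.
FILTERS = {
    "html": passthrough,
    "text": passthrough,
    "pdf": stub_filter("pdf"),
    "postscript": stub_filter("postscript"),
    "rtf": stub_filter("rtf"),
    "word": stub_filter("word"),
}


def extract_text(name, data):
    """Return (detected type, extracted text), or (type, None) if no filter."""
    kind = detect_type(name, data)
    convert = FILTERS.get(kind)
    if convert is None:
        return kind, None  # unrecognised type: nothing to index
    return kind, convert(data)


print(extract_text("notes.txt", b"plain words"))
print(extract_text("report.pdf", b"%PDF-1.4 ..."))
```

Keeping the filters in a table means that support for a new format can be added by registering one more entry, without touching the scanning logic.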
 
Indexing
 
The Panoptic indexer first builds a vocabulary of every distinct word which occurs anywhere in the pages brought back by the spider. See the simple search page for a definition of what constitutes a word.
For each vocabulary entry, the indexer builds a list (kept in very compact format) of every occurrence of that word in every page.
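
A minimal sketch of this kind of structure (an inverted index with word positions) is shown below. It makes no attempt at the compact storage format mentioned above, and its definition of a word (a run of letters or digits, lower-cased) is an assumption made for the example rather than Panoptic's own rule. The names tokenize and build_index are hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lower-case words (assumed definition of a word)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Map each word to {page id: [positions at which the word occurs]}."""
    index = defaultdict(lambda: defaultdict(list))
    for page_id, text in pages.items():
        for position, word in enumerate(tokenize(text)):
            index[word][page_id].append(position)
    # Convert to plain dicts so missing words raise KeyError instead of
    # silently creating empty entries.
    return {word: dict(postings) for word, postings in index.items()}


pages = {
    "a": "Pizza, pizza and pasta",
    "b": "Pasta bar",
}
index = build_index(pages)

print(sorted(index))    # the vocabulary
print(index["pizza"])   # occurrences of one word
print(index["pasta"])
```

Here the vocabulary is and, bar, pasta, pizza; the entry for pizza records positions 0 and 1 in page a, and the entry for pasta records one occurrence in each page.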
 
Query Processing
 
When you type a query such as medium napoletana pizza, the relevance scores for all the documents are first set to zero. Each query term is then looked up in the vocabulary and its associated list of occurrences is accessed. For every document which contains one or more occurrences of the term, a term-present flag is set and the relevance score is incremented by an amount which depends upon:
- How many times this word occurred
- How many other documents it occurred in
- The length of this document
The particular relevance formula used is a slightly modified form of the Okapi BM25 function developed by Steve Robertson and Steve Walker of City University, London.
After all the query terms have been processed, the documents are sorted, first on the basis of how many term-present flags were set and then on the basis of relevance score.
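
The scoring and sorting steps can be sketched as follows. This uses a common textbook form of BM25, not the modified version used by Panoptic, whose details are not given here. The parameter values K1 and B, the smoothed idf formula, the tokenize rule and the function name search are all assumptions made for the example.

```python
import math
import re
from collections import Counter

K1, B = 1.2, 0.75  # assumed, commonly quoted BM25 parameter values


def tokenize(text):
    """Split text into lower-case words (assumed definition of a word)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def search(query, docs):
    """Rank documents by number of query terms present, then by BM25 score."""
    if not docs:
        return []
    counts = {d: Counter(tokenize(text)) for d, text in docs.items()}
    lengths = {d: sum(c.values()) for d, c in counts.items()}
    avg_len = sum(lengths.values()) / len(lengths)
    n_docs = len(docs)

    scores = {d: 0.0 for d in docs}   # relevance scores start at zero
    present = {d: 0 for d in docs}    # number of term-present flags set

    for term in dict.fromkeys(tokenize(query)):  # each distinct term once
        df = sum(1 for c in counts.values() if term in c)
        if df == 0:
            continue
        # Rarer terms (found in fewer documents) get a larger weight.
        idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
        for d, c in counts.items():
            tf = c[term]
            if tf == 0:
                continue
            present[d] += 1
            # Longer-than-average documents are penalised via this norm.
            norm = K1 * (1 - B + B * lengths[d] / avg_len)
            scores[d] += idf * tf * (K1 + 1) / (tf + norm)

    # Only documents matching at least one term are returned in this sketch.
    hits = [d for d in docs if present[d] > 0]
    return sorted(hits, key=lambda d: (-present[d], -scores[d]))


docs = {
    "menu": "medium napoletana pizza and large napoletana pizza",
    "blog": "pizza pizza pizza pizza",
    "news": "medium rare steak",
    "faq": "opening hours and parking",
}
results = search("medium napoletana pizza", docs)
print(results)
```

In this toy collection the menu document is ranked first because it is the only one containing all three query terms, regardless of how often pizza is repeated elsewhere. The faq document matches no term and is not returned.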


Panoptic Search Engine

© Copyright CSIRO Australia, 1997-2004.