|
Panoptic provides the ability to search over collections of documents.
Each collection is defined by an administrator and then updated by the
Panoptic search engine. The update process normally involves gathering data,
converting some files to text and then indexing the data to allow the query
processor to answer queries.
|
|
The diagram above shows the main components of the system, while the different phases of a collection update are described in more detail below.
|
| |
|
| |
|
The first step in updating a collection is to gather data.
The default way to do this is to use a web crawler (spider) to crawl
a set of web servers and download a snapshot of the documents
to be indexed.
The task of the crawler is to discover and fetch as many web pages
as possible within the scope of the search service.
Your Panoptic administrator defines the scope by means of a
seed list of URLs and a list of internet domains to be included.
For example, the initial seed list may be as simple as
http://www.panopticsearch.com/index.html and the list of domains
may be as simple as panopticsearch.com/.
In this example, the spider would fetch the seed page and extract
the URLs of all the pages to which it links. Each of these URLs
is checked to see whether it has already been processed and whether
it lies outside the panopticsearch.com/ domain. If neither is the case,
it is added to the list to be processed. The content of the original
page is then saved for indexing.
The process of taking URLs off the list, fetching the pages, extracting
their links and saving their content for later indexing continues until the list is
empty. You can see that if a web page cannot be reached by following
links from the seed page, it will not be discovered.
Your Panoptic administrator can specify that certain websites or
directories should be excluded from spidering and can also control
what types of files are spidered.
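
The crawl loop described above can be sketched in a few lines of Python.
This is only an illustration of the general technique, not Panoptic's
actual spider: the FAKE_WEB dictionary, the in_scope and crawl functions
and the example.org and orphan.html URLs are all invented for the example,
and an in-memory table of links stands in for real HTTP fetching.

```python
from collections import deque
from urllib.parse import urlparse

# Hypothetical in-memory "web": each URL maps to the URLs its page links to.
FAKE_WEB = {
    "http://www.panopticsearch.com/index.html": [
        "http://www.panopticsearch.com/products.html",
        "http://www.example.org/partner.html",        # outside the allowed domain
    ],
    "http://www.panopticsearch.com/products.html": [
        "http://www.panopticsearch.com/index.html",   # already processed
        "http://www.panopticsearch.com/contact.html",
    ],
    "http://www.panopticsearch.com/contact.html": [],
    "http://www.panopticsearch.com/orphan.html": [],  # nothing links to this page
}


def in_scope(url, domains):
    """Return True if the URL's host is one of the domains or a subdomain of one."""
    host = urlparse(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in domains)


def crawl(seeds, domains, fetch_links):
    """Breadth-first crawl; returns the URLs saved for indexing, in fetch order."""
    to_process = deque(seeds)
    seen = set(seeds)
    saved = []
    while to_process:
        url = to_process.popleft()
        for link in fetch_links(url):
            # Queue a link only if it is new and inside the allowed domains.
            if link not in seen and in_scope(link, domains):
                seen.add(link)
                to_process.append(link)
        saved.append(url)  # stand-in for saving the page content
    return saved


pages = crawl(
    ["http://www.panopticsearch.com/index.html"],
    ["panopticsearch.com"],
    lambda url: FAKE_WEB.get(url, []),
)
print(pages)
```

In this sketch orphan.html is never fetched, because no page reachable
from the seed links to it, and the example.org page is skipped because
it lies outside the listed domain.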
|
|
It is also possible to gather data by copying files from a
network-mounted filesystem (e.g. copying files from an NT fileserver).
|
| |
|
| |
|
If the gathering phase brings back files which are not HTML
or plain text files (for example, Word, RTF, PDF and PostScript),
it is necessary to extract the text from them before indexing.
Panoptic provides a framework which scans all the gathered files,
determines the type of the file and calls the appropriate
third-party filter.
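
As a rough illustration of this kind of dispatch, the Python sketch below
chooses a converter from the file extension. It is not Panoptic's
framework: the FILTERS table, the to_text function and the placeholder
converters are assumptions made for the example, and a real system might
inspect file contents rather than names and would call genuine external
conversion tools.

```python
import os


def passthrough(data):
    """HTML and plain text need no conversion; just decode the bytes."""
    return data.decode("utf-8", errors="replace")


def placeholder_filter(kind):
    """Build a stand-in for a third-party converter (hypothetical)."""
    def convert(data):
        return "[text extracted from %s, %d bytes]" % (kind, len(data))
    return convert


# Hypothetical mapping from file extension to the filter that handles it.
FILTERS = {
    ".html": passthrough,
    ".txt": passthrough,
    ".doc": placeholder_filter("Word"),
    ".rtf": placeholder_filter("RTF"),
    ".pdf": placeholder_filter("PDF"),
    ".ps": placeholder_filter("PostScript"),
}


def to_text(filename, data):
    """Return indexable text for a gathered file, or None if no filter applies."""
    extension = os.path.splitext(filename)[1].lower()
    converter = FILTERS.get(extension)
    if converter is None:
        return None  # unknown type: leave the file out of the index
    return converter(data)


print(to_text("notes.TXT", b"hello"))
print(to_text("report.pdf", b"1234"))
print(to_text("image.bin", b""))
```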
|
| |
|
| |
|
The Panoptic indexer first builds a vocabulary of every distinct
word which occurs in any of the pages brought back by the spider.
See the simple search page for
a definition of what constitutes a word.
|
|
For each vocabulary entry, the indexer builds a list (kept in very
compact format) of every occurrence of that word in every page.
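
The vocabulary and occurrence lists together form what is generally
called an inverted index. The Python sketch below shows the idea only:
the tokenize and build_index names are invented for the example, the
definition of a word (a lowercased run of letters or digits) is an
assumption, and the lists are held as ordinary Python lists rather than
in the compact format the indexer uses.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Assumed word definition: a lowercased run of ASCII letters or digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Map each distinct word to a list of (page id, word position) pairs."""
    index = defaultdict(list)
    for page_id, text in pages.items():
        for position, word in enumerate(tokenize(text)):
            index[word].append((page_id, position))
    return dict(index)


pages = {
    "doc1": "Medium pizza, large pizza",
    "doc2": "Napoletana pizza",
}
index = build_index(pages)

print(sorted(index))    # the vocabulary
print(index["pizza"])   # every occurrence of "pizza"
```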
|
| |
|
| |
|
When you type a query such as medium napoletana pizza,
relevance scores for all the documents are set to zero. Then each
query term is looked up in the vocabulary list and the associated list
of occurrences is accessed. For every document which contains one
or more occurrences, a word-present flag is set and the relevance
score is incremented by an amount which depends upon:
|
|
How many times this word occurred
|
|
How many other documents it occurred in
|
|
The length of this document
|
|
|
The particular relevance formula used is a slightly modified form
of the Okapi BM25 function developed by Steve Robertson and Steve
Walker of City University, London.
|
|
After all the query terms have been processed, the documents are
sorted, first on the basis of how many word-present flags were set
and then on the basis of relevance score.
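
The scoring and sorting steps can be sketched in Python as follows. The
sketch uses a commonly published textbook form of BM25 with typical
parameter values (k1 = 1.2, b = 0.75), not the modified formula that
Panoptic uses. The search function, the sample documents, whitespace
tokenization and the decision to omit documents that match no query word
are all assumptions made for the example.

```python
import math


def search(query, docs, k1=1.2, b=0.75):
    """Rank docs by (word-present flags, BM25 score), highest first."""
    words = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n_docs = len(words)
    avg_len = sum(len(w) for w in words.values()) / n_docs

    scores = {doc_id: 0.0 for doc_id in words}   # relevance scores start at zero
    flags = {doc_id: 0 for doc_id in words}      # count of word-present flags

    # dict.fromkeys drops repeated query terms while keeping their order.
    for term in dict.fromkeys(query.lower().split()):
        containing = [d for d, w in words.items() if term in w]
        if not containing:
            continue
        # Rarer words (found in fewer documents) contribute more.
        idf = math.log(1 + (n_docs - len(containing) + 0.5) / (len(containing) + 0.5))
        for d in containing:
            tf = words[d].count(term)             # occurrences in this document
            length_norm = k1 * (1 - b + b * len(words[d]) / avg_len)
            scores[d] += idf * tf * (k1 + 1) / (tf + length_norm)
            flags[d] += 1

    matched = [d for d in words if flags[d] > 0]
    matched.sort(key=lambda d: (flags[d], scores[d]), reverse=True)
    return [(d, flags[d], scores[d]) for d in matched]


docs = {
    "a": "medium napoletana pizza with basil",
    "b": "pizza pizza pizza pizza",
    "c": "medium sized coffee",
    "d": "garden furniture",
}
results = search("medium napoletana pizza", docs)
for doc_id, n_flags, score in results:
    print(doc_id, n_flags, round(score, 3))
```

Document a is ranked first because it contains all three query words,
even though document b mentions pizza more often: the number of
word-present flags is compared before the relevance score.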
|
|