Anders Ardö,
Allan Arvidson, Sebastian Hammer,
Kenneth Holmlund and Sigfrid Lundberg
The Nordic Web Index (NWI) project is a collaborative effort across the Nordic countries, aiming to provide a free World Wide Web search service to the general public in the countries involved. NWI has been fruitful for several reasons. First and foremost, we today provide access to databases covering the WWW in four of the Nordic countries through, as of September 1998, five service points in six languages: Denmark [1], Finland [2], Sweden [3], Norway [4] and Iceland [5]. New service points and databases include swemeta, danmeta and Other Nordic.
NWI technology has been used to create several databases besides the main NWI ones mentioned above; examples include Øresundsuniversitetet [6], the EUN Multimedia Schoolnet [7], a demonstration database for DTU [8] and SveSÖK [10].
We have been able to contribute this technology to other initiatives because the NWI architecture builds on open standards and because the software is freely distributable [11]. Dissemination of the experiences gained from the NWI has taken place through several channels, e.g. the WWW7 conference [12] and the 1997 NORDUnet conference [13]. Also, in connection with the latter conference, the National Library of Iceland hosted a workshop on indexing collaboration arranged by the NWI [14].
A module for extracting metadata from the harvested pages has been implemented. With this module we built two new databases, swemeta and danmeta, covering Sweden and Denmark, respectively. During the extraction process, metadata tags were converted to Dublin Core equivalents, or discarded if no meaningful conversion could be done. It turns out that approximately 10% of all pages contain meaningful metadata. Detailed information on the conversion, and other statistics on metadata usage, can be found at [15].
The extracted records are indexed into a separate database whose fields correspond to Dublin Core fields. In the Nordic countries we have experimented with this for a year and a half and, from early this year, have made more serious implementations of these search systems based on Z39.50 and Dublin Core. For the time being, we are using two kinds of systems in parallel:
We do mappings between proprietary metadata systems and DC. For instance, AUTHOR fields as entered by many HTML-authoring tools are mapped to DC.Creator, as are old Dublin Core records still containing the element DC.Author. Having scanned the Nordic Web for metadata, we now maintain dozens of such mappings.
We also do "Dublin Core-ish" metadata down-grading, to some extent the way it is supposed to operate when applying the Canberra Qualifier philosophy: when qualified metadata is encountered, it is merged into a higher (coarser) level of granularity of description.
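As a sketch of these two operations, the following Python fragment (illustrative only; the mapping table is a hypothetical excerpt of the dozens of rules we maintain, not the actual NWI extraction module) maps harvested meta tags to unqualified Dublin Core and down-grades qualified elements:

```python
# Illustrative sketch: normalizing harvested <meta> tags into
# unqualified Dublin Core. DC_MAP is a hypothetical excerpt.

DC_MAP = {
    "author": "DC.Creator",      # common HTML-authoring-tool tag
    "dc.author": "DC.Creator",   # pre-1.0 Dublin Core element name
    "keywords": "DC.Subject",
    "description": "DC.Description",
}

def normalize(name: str, value: str):
    """Return a (DC element, value) pair, or None if no meaningful mapping."""
    name = name.lower()
    # Canberra-style down-grading: strip qualifiers, e.g.
    # "DC.Date.Created" is merged into the coarser "DC.Date".
    if name.startswith("dc.") and name.count(".") > 1:
        name = ".".join(name.split(".")[:2])
    if name in DC_MAP:
        return DC_MAP[name], value
    if name.startswith("dc."):
        return "DC." + name[3:].capitalize(), value
    return None  # discard: no meaningful conversion

print(normalize("AUTHOR", "S. Lundberg"))       # -> ('DC.Creator', 'S. Lundberg')
print(normalize("DC.Date.Created", "1998-09"))  # -> ('DC.Date', '1998-09')
print(normalize("generator", "SomeTool"))       # -> None (discarded)
```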
Cooperative harvesting has been implemented and utilized for a variety of purposes. Here we give a couple of examples.
The role of Umeå University in this project has been to utilize collaborative harvesting and other collaborative methods in order to optimize administration and performance. Umeå University is involved in several initiatives involving indexing and WWW-based information systems, such as the local course directory [16] and the ASKen project [17].
All these projects involve different people and organizations for maintenance, development and editorial work, and they put pressure on scientists and lecturers to produce suitable, well-structured material of high quality. Efficient coordination is necessary to reduce costs and improve performance. The various projects utilize different sets of metadata standards (mainly based on Dublin Core), and in many cases we foresee that information must otherwise be tagged several times with nearly identical content.
One of our main objectives is therefore to design a cooperative, flexible and cost-effective indexing system for Umeå University that allows for coordination of these various projects and also fits into a national and international perspective.
The universities of Umeå and Lund are working on projects related to harvesting of metadata and localized WWW indexing, notably SAFARI, the Swedish national research information system [9].
We have chosen to focus on a solution that coordinates Safari and the local indexes of the two universities. Thus, research information for Safari is available in the main index for each university and the same system is also used to feed the national Safari system. A straightforward extension would be to use this system as a distributed collaborative harvester for NWI, but that also requires further analysis of performance and cost effectiveness.
Collaborative harvesting (i.e. where two or more sites contribute to one database) has been implemented using the following method. For simplicity, we describe an existing application, where DTV builds a local database over all its pages (using NWI technology) while at the same time contributing to the main Danish database. Two machines, Hera and Venus, each have the NWI software installed. Hera is the Danish service point and Venus is the WWW server for DTV. Venus is configured to harvest only pages in the *.dtv.dk domain (i.e. all pages at DTV). When a page is harvested it is normally sent to a database location, where it is parsed and the resulting record is inserted into the database. On Venus, a harvested page is sent to two database locations: one is the local database, the other is the Danish service point (Hera). Technically this is done by specifying several database locations in the Job Control Format (JCF), as sketched below. The only special configuration on Hera is that the domain *.dtv.dk is excluded from harvesting (since Venus is doing this); this is not strictly necessary, but we avoid double harvesting this way. The result is that Venus can have its own scheduling policy and database while still keeping the main Danish database on Hera up to date with regard to the *.dtv.dk domain.
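The following Python fragment is a conceptual sketch of this fan-out; it is not actual Combine or JCF syntax, and the database location strings are hypothetical, taken from the example above for illustration only:

```python
# Conceptual sketch of the fan-out described above: every page harvested
# on Venus is parsed once, and the resulting record is delivered to each
# configured database location.

from fnmatch import fnmatch

HARVEST_SCOPE = ["*.dtv.dk"]            # Venus harvests only the DTV domain
DB_LOCATIONS = [
    "venus.dtv.dk/local",               # local DTV database (hypothetical name)
    "hera.dtv.dk/denmark",              # main Danish database on Hera (hypothetical)
]

def in_scope(host: str) -> bool:
    return any(fnmatch(host, pattern) for pattern in HARVEST_SCOPE)

def deliver(host: str, page: str) -> None:
    if not in_scope(host):
        return
    record = {"host": host, "text": page}   # stand-in for the real parser
    for location in DB_LOCATIONS:           # insert into every configured DB
        print(f"insert into {location}: {record['host']}")

deliver("www.dtv.dk", "<html>...</html>")
```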
Most robots on the Internet are constructed to be used as part of an indexing service, which is mostly concerned with textual information. When the aim is to archive entire documents, all inline images, sounds, etc. are needed as well. The aim here is to reconstruct not only the words, but also the "look and feel" of a document. This places somewhat different demands on the gathering software. In order to tailor the software for archiving purposes [18], several modifications have been made.
Since the aim is to archive everything, the robot has been changed so that it acquires all objects regardless of MIME type. This means that images, sounds, etc. are gathered as well. Of course, pictures, sounds and similar documents are not parsed for HTML links.
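A minimal sketch of this behaviour, built on Python's standard library (the function name and the naive link extraction are ours, not the robot's):

```python
# Sketch: fetch every object regardless of MIME type, but extract
# links only from HTML bodies.

import re
import urllib.request

def archive_fetch(url: str):
    with urllib.request.urlopen(url) as resp:
        body = resp.read()                        # keep bytes verbatim
        ctype = resp.headers.get("Content-Type", "")
    links = []
    if ctype.startswith("text/html"):             # only HTML is link-parsed
        links = re.findall(rb'href="([^"]+)"', body)
    return body, ctype, links                     # store the body whatever it is
```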
In order to ensure the long-term integrity of items in the archive, it was decided to store items as multipart MIME objects, where each individual object is stored together with whatever miscellaneous data is available, including the HTTP response header. For this purpose the robot was extended with a module saving items in that format. Other software can access these objects when archived items are requested from the archive.
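As an illustration, the following sketch uses Python's standard email library to wrap one harvested object, together with its HTTP response header, as a multipart MIME message; the field layout is assumed, not the archive's actual format:

```python
# Sketch: one archived item as multipart MIME, carrying both the raw
# HTTP response header and the object itself.

from email.message import EmailMessage

def wrap_item(url: str, http_header: str, body: bytes, ctype: str) -> bytes:
    msg = EmailMessage()
    msg["Subject"] = url                        # identify the archived item
    msg.make_mixed()                            # multipart/mixed container
    msg.add_attachment(http_header.encode(),    # part 1: HTTP response header
                       maintype="text", subtype="plain")
    main, sub = ctype.split(";")[0].strip().split("/", 1)
    msg.add_attachment(body,                    # part 2: the object, verbatim
                       maintype=main, subtype=sub)
    return bytes(msg)                           # serialized MIME object

item = wrap_item("http://www.kb.se/", "HTTP/1.0 200 OK\r\n",
                 b"<html>...</html>", "text/html")
```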
Frequently, documents of interest reside on servers not registered under the .se domain. Therefore a feature has been added which allows extra domains, outside .se, to be added to the robot's search space. In this way it is possible to also acquire documents under, e.g., the .com domain which are regarded as being of interest when preserving the Swedish Web.
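A sketch of the resulting search-space test, with made-up extra domains:

```python
# Sketch: the robot accepts .se hosts plus an explicit list of extra,
# non-.se domains judged to belong to the Swedish Web. The extra
# domains below are invented examples.

EXTRA_DOMAINS = ["swedish-site.com", "example.org"]

def in_search_space(host: str) -> bool:
    return host.endswith(".se") or any(
        host == d or host.endswith("." + d) for d in EXTRA_DOMAINS)

print(in_search_space("www.kb.se"))             # True: .se domain
print(in_search_space("www.swedish-site.com"))  # True: explicitly added
print(in_search_space("www.cnn.com"))           # False: outside the search space
```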
One planned new capability of the robot is the implementation of collections in the harvesting database. By collections we refer to a substructure that could be implemented in the parser, such that an external API could return the name of the collection to which a given record should belong.
The decision of membership in a collection could be based on any criteria that are meaningful for a given service, including subject classification derived from metadata or through automated classification software, server domain, etc. One obvious basis for membership could be the presence of metadata specific to given communities or projects.
We foresee that introducing the collection concept will also facilitate collaborative harvesting, since each collection can be placed in a separate information retrieval system.
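The following sketch illustrates what such an external collection API could look like; all names and criteria here are hypothetical examples:

```python
# Sketch: a service-defined hook that tells the parser which collection
# a given record belongs to, based on metadata or server domain.

def collection_of(record: dict) -> str:
    # Membership by community-specific metadata ...
    if "physics" in record.get("DC.Subject", "").lower():
        return "physics"
    # ... or by server domain
    if record.get("host", "").endswith(".dtv.dk"):
        return "dtv"
    return "default"

record = {"host": "www.dtv.dk", "DC.Subject": "Physics; Optics"}
print(collection_of(record))   # -> "physics" (metadata checked before domain)
```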
Currently the harvesting robot supports only a very limited range of Internet Media Types (text/html and text/plain). However, we have recently implemented a parser for a common graphics format, namely GIF. The GIF specification permits embedding a limited amount of metadata, such as a description of an image. The amount of such descriptions on the Internet is depressingly low, and the time is not yet ripe for launching services built on such metadata.
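For illustration, a comment-extension extractor along these lines could look as follows; the sketch follows the GIF89a block structure but is written for clarity rather than production robustness, and is not our actual parser:

```python
# Sketch: extract GIF Comment Extension blocks (label 0xFE), the kind of
# embedded image description mentioned above.

def gif_comments(data: bytes) -> list:
    assert data[:6] in (b"GIF87a", b"GIF89a")
    packed = data[10]
    pos = 13                                   # header + logical screen descriptor
    if packed & 0x80:                          # global color table present
        pos += 3 * (2 << (packed & 0x07))
    comments = []
    while pos < len(data):
        block = data[pos]; pos += 1
        if block == 0x3B:                      # trailer: end of stream
            break
        elif block == 0x21:                    # extension block
            label = data[pos]; pos += 1
            chunks = []
            while data[pos] != 0:              # data sub-blocks
                size = data[pos]
                chunks.append(data[pos + 1:pos + 1 + size])
                pos += 1 + size
            pos += 1                           # block terminator
            if label == 0xFE:                  # comment extension
                comments.append(b"".join(chunks).decode("latin-1"))
        elif block == 0x2C:                    # image descriptor
            flags = data[pos + 8]; pos += 9
            if flags & 0x80:                   # local color table
                pos += 3 * (2 << (flags & 0x07))
            pos += 1                           # LZW minimum code size
            while data[pos] != 0:              # image data sub-blocks
                pos += 1 + data[pos]
            pos += 1
    return comments
```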
NetLab and DTV will eventually merge the current metadata services into the regular services. In doing so, we will ensure that all metadata is interoperable with GILS.
Within the scope of the NWI II project, Index Data's [20] effort has been primarily focused on maintaining and extending the Zebra information retrieval package, which is currently employed by each of the national NWI databases. Much of this work has in turn focused on the special requirements of the NWI project, posed both by the size of the databases and by the heterogeneous nature of the information content.
As the size of the larger national databases increased, it became apparent that the built-in relevance ranking algorithms did not provide sufficient recall when faced with the extremely heterogeneous contents of the free-text data harvested from the web. An additional challenge has been the differences in ranking results between the national databases, caused by their different contents (i.e. the same document would not necessarily be given the same absolute rank for the same query by the different databases). This is a natural property of statistical "natural-language" search engines, but it nevertheless complicated the cross-national search functionality of the NWI access points. Experiments with database-independent ranking schemes, and eventually a conference of information retrieval experts from each of the participating academic institutions, led to the development by ID of a modular framework for the relevance calculator component of Zebra. This allows individual partners to experiment with the development of different relevance ranking metrics and, eventually, enables the project to draw on the considerable experience in natural-language searching that is available to the project, particularly at DTV and NetLab, in addition to ID itself.
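To illustrate the idea of a pluggable relevance calculator (Zebra itself is implemented in C; the shape below is purely illustrative and not its actual interface), consider a registry of interchangeable scoring functions:

```python
# Sketch: a registry of relevance calculators, where each partner can plug
# in its own metric. Classic tf-idf is shown as one possible plug-in.

import math

def tf_idf(tf: int, df: int, n_docs: int, doc_len: int) -> float:
    """One possible pluggable metric: length-normalized tf-idf."""
    if tf == 0 or df == 0:
        return 0.0
    return (tf / doc_len) * math.log(n_docs / df)

RANKERS = {"tfidf": tf_idf}            # registry of relevance calculators

def score(ranker: str, postings, n_docs: int) -> list:
    rank = RANKERS[ranker]
    return sorted(
        ((doc, rank(tf, df, n_docs, dl)) for doc, tf, df, dl in postings),
        key=lambda pair: pair[1], reverse=True)

# Hypothetical postings: (doc id, term freq, document freq, doc length)
print(score("tfidf", [("a", 3, 10, 100), ("b", 1, 10, 50)], n_docs=1000))
```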
The project has seen a steady increase in the quality and recall of the Zebra search engine, although it is clear that there is still some work to do before the NWI service can match the sophisticated, proprietary approaches employed by dedicated, commercial services such as Alta Vista.
There is a page at the Danish service point where users can add a new or changed WWW page to NWI. If the page is in an NWI domain, the URL is automatically sent by mail to the appropriate service point. At the Danish service point such mail is automatically processed further and the URL is scheduled for harvesting. This normally results in the added WWW page being harvested and inserted into the database within a day; it usually becomes visible to the public within a week (i.e. the next time the database is indexed).
If the site has an address in a domain outside the Nordic domains (.dk, .fi, .fo, .gl, .is, .no, .se), the user is asked to send a mail explaining why the site should be included in a Nordic database. These mails are processed manually, and such sites are inserted into the "Other Nordic" database.
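A sketch of the routing decision described in the last two paragraphs; the mapping of top-level domains to service points is partly assumed (in particular for .fo and .gl):

```python
# Sketch: route an added URL either to a national service point for
# scheduled harvesting or to manual review for "Other Nordic".

from urllib.parse import urlparse

NORDIC = {".dk": "nwi.dtv.dk", ".fi": "nwi.funet.fi", ".se": "nwi.lub.lu.se",
          ".no": "nwi.bibsys.no", ".is": "nwi.bok.hi.is",
          ".fo": "nwi.dtv.dk", ".gl": "nwi.dtv.dk"}   # assumed routing

def route(url: str) -> str:
    host = urlparse(url).hostname or ""
    for tld, service_point in NORDIC.items():
        if host.endswith(tld):
            return f"mail URL to {service_point} for scheduled harvesting"
    return "ask user to justify inclusion; manual review for 'Other Nordic'"

print(route("http://www.dtu.dk/"))     # routed to the Danish service point
print(route("http://example.com/"))    # manual review
```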
In addition to the add-URL service, there is a direct connection between the Nordic Metadata tool and the Combine robot running at NetLab. A URL included as a Dublin Core identifier will be queued for indexing the same day. Records with metadata are entered into swemeta on a daily basis.
The user interface software package has been much simplified. The configuration parameters are now table driven, and the look and feel of the user interface is specified with screen templates.
Context-sensitive online help has been included at the Danish service point [1]. The help includes general help text, hints and examples, and is available from all pages.
Another form of online help is an active page which investigates, in real time, several different searches using various combinations of the user's search terms, database fields and boolean operators, in terms of the expected number of hits. The user interface allows entering a number of search terms and selecting a database; a matrix then shows estimates of the number of hits for a range of search combinations. All the user has to do is select a query with an acceptable result set and click on it to have it executed. This help service is available from the Danish NWI home page and from the advanced search page.
Each column corresponds to the part of the document searched:

| Searching for   | authors | titles | descriptions | headings | anchor texts | free text |
|-----------------|---------|--------|--------------|----------|--------------|-----------|
| "nordisk web"   | 0       | 1      | 0            | 3        | 140          | 165       |
| nordisk and web | 0       | 1      | 5            | 54       | 399          | 1616      |
| nordisk         | 2       | 1109   | 107          | 652      | 2803         | 13282     |
| web             | 509     | 7103   | 4334         | 9224     | 26895        | 108011    |
| nordisk or web  | 511     | 8211   | 4436         | 9822     | 29299        | 119677    |
Clicking on the number 54, for instance, executes a search for 'nordisk and web' in the headings field, which gives 54 hits.
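The matrix itself can be generated by issuing one probe query per combination and field; the sketch below illustrates this, with a toy hit estimator standing in for the real search-engine calls:

```python
# Sketch: build the hit-estimate matrix above, one probe query per
# (query combination, document field) pair. The query syntax and the
# estimate_hits helper are hypothetical stand-ins.

FIELDS = ["authors", "titles", "descriptions", "headings",
          "anchor texts", "free text"]

def combinations(terms):
    yield '"' + " ".join(terms) + '"'          # phrase search
    yield " and ".join(terms)
    yield from terms                           # each term alone
    yield " or ".join(terms)

def hit_matrix(terms, estimate_hits):
    """Return {query: {field: estimated hits}} for the preview matrix."""
    return {q: {f: estimate_hits(q, f) for f in FIELDS}
            for q in combinations(terms)}

# Toy estimator so the sketch runs; the real page asks the search engine.
demo = hit_matrix(["nordisk", "web"], lambda q, f: len(q) * len(f))
for query, row in demo.items():
    print(query, row)
```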
Using this new user interface software package, ZGATE, a number of advanced search user interfaces were easily developed; see the advanced search page at the Danish NWI node [1] for examples.
In starting to use the DESIRE project's Combine harvesting robot, we have significantly improved the user interface for robot administration. It is still, however, based on a large number of scripts, and the everyday monitoring of a Combine installation could be simplified. The current robot is configured by editing three configuration files.
Administrative and management scripts have to be developed to suit local policies and needs. These may include removal of log files, feeding newly found URLs into the harvester, checking and updating database records, reindexing the database, making backups, etc. Examples of such scripts can be provided on request, and a collection of such tools is planned to become available from the Combine home page [19]. Other tools have been developed for metadata-based projects, where the harvester checks for the existence of specific metadata fields.
A number of WWW-based system administration tools have been specified but are not yet implemented [22].
The Combine harvesting robot [19] is now distributed as free software [11]. Extensive documentation is available, covering most details of the software architecture and its use. In particular, there are documents covering installation, along with cookbook examples [23], and a more detailed User's Guide [24]. Additional information exists as well, although some of it is still in preparation.
The NWI II has proved to be a successful Nordic cooperation. Both as a development project and as a service it is still unique in the tools and methodologies it provides.
The major drawback of the NWI initiative has been the lack of a solid organization for the service itself. The original partners in the project have not had the resources to upgrade their software to the current level, and some sites suffer from an acute shortage of hardware.
[1] NWI Danish service point <http://nwi.dtv.dk/>
[2] NWI Finland <http://nwi.funet.fi/>
[3] NWI Sweden <http://nwi.lub.lu.se/>
[4] NWI Norway <http://nwi.bibsys.no/>
[5] NWI Iceland <http://nwi.bok.hi.is/>
[6] Øresundsuniversitetet <http://www.uni.oresund.org/studie/owiindex.htm>
[7] EUN Multimedia Schoolnet <http://www.eun.org>
[8] DTU demonstration page for evaluation <http://nwi.dtv.dk/dtudtv.html>
[9] SAFARI, the Swedish national research information system <http://safari.hsv.se/>
[10] SveSÖK <http://www.svesok.kb.se/>
[11] General Public License <http://www.gnu.org/copyleft/gpl.html>
[12] A. Ardö and S. Lundberg, 1998. A regional distributed WWW search and indexing service -- the DESIRE way. Computer Networks and ISDN Systems 30:173-183 <http://nwi.dtv.dk/www7/>
[13] NWI: Presentation at the NORDUnet '97 Conference in Reykjavik, Iceland, June 29th - July 1st 1997 <http://nwi.dtv.dk/nordunet/>
[14] S. Lundberg, 1997. The cooperative harvesting and indexing workshop in Reykjavik, 29 June 1997 <http://www.ub.lu.se/ind_work_shop.html>
[15] Metadata mappings and statistics <http://nwi.dtv.dk/MD/>
[16] The local course directory <http://info.adm.umu.se/utbkat/>
[17] The ASKen project <http://asken.hsv.se/>
[18] Kulturarw3 <http://kulturarw3.kb.se/>
[19] The Combine Harvesting Robot <http://www.lub.lu.se/combine/>
[20] Index Data ApS <http://www.indexdata.dk/>
[21] Cooperative Hierarchical Indexing Coordination <http://www.terena.nl/task-forces/tf-chic/>
[22] Unesco Regional Web Index Config Tool <http://nwi.dtv.dk/UNESCO/Configtool.html>
[23] The Combine README <http://www.lub.lu.se/combine/dist/README-v1.1.txt>
[24] Combine User's Guide <http://www.lub.lu.se/combine/docs/uguide.html>