Anders Ardö,
Allan Arvidson, Sebastian Hammer,
Kenneth Holmlund and Sigfrid Lundberg
The Nordic Web Index (NWI) project is a collaborative effort across the Nordic countries, aiming to provide a free World Wide Web search service to the general public in the countries involved. NWI has been fruitful for several reasons. First and foremost, we today provide access to databases covering the WWW in four of the Nordic countries through, as of September 1998, five service points in six languages: Denmark [1], Finland [2], Sweden [3], Norway [4] and Iceland [5]. New service points and databases include swemeta, danmeta and Other Nordic.
NWI technology has been used to create several databases besides the main NWI ones mentioned above; examples include Øresundsuniversitetet [6], the EUN Multimedia Schoolnet [7], a demonstration database for DTU [8] and SveSÖK [10].
We have been able to contribute this technology to other initiatives because the NWI architecture builds on open standards and because the software is freely distributable [11]. Dissemination of the experiences gained from the NWI has taken place through several channels, e.g. the WWW7 conference [12] and the 1997 NORDUnet conference [13]. Also, in connection with the latter conference, the National Library of Iceland hosted a workshop on indexing collaboration arranged by the NWI [14].
A module for extracting metadata from the harvested pages has been implemented. With this module we built two new databases, swemeta and danmeta, covering Sweden and Denmark, respectively. During the extraction process, metadata tags were converted to Dublin Core equivalents, or discarded if no meaningful conversion could be done. It turns out that approximately 10% of all pages contain meaningful metadata. Detailed information on the conversion, and other statistics on metadata usage, can be found at [15].
The extracted records are indexed into a separate database whose fields correspond to Dublin Core fields. In the Nordic countries we have experimented with this for a year and a half and, from early this year, have made more serious implementations of these search systems based on Z39.50 and Dublin Core. For the time being, we are using two kinds of systems in parallel:
We do mappings between proprietary metadata systems and DC. For instance, AUTHOR fields as entered by many HTML-authoring tools are mapped to DC.Creator, as are old Dublin Core records still containing the element DC.Author. Having scanned the Nordic Web for metadata, we now maintain dozens of such mappings.
We also do "Dublin Core-ish" metadata down-grading, to some extent the way it is supposed to operate when applying the Canberra Qualifier philosophy: when qualified metadata is encountered, it is merged into a higher (coarser) level of granularity of description.
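As a sketch of these two operations, the following Python fragment (illustrative only; the mapping table is a hypothetical excerpt of the dozens of rules we maintain, not the actual NWI extraction module) maps harvested meta tags to unqualified Dublin Core and down-grades qualified elements:

```python
# Illustrative sketch: normalizing harvested <meta> tags into
# unqualified Dublin Core. DC_MAP is a hypothetical excerpt.

DC_MAP = {
    "author": "DC.Creator",      # common HTML-authoring-tool tag
    "dc.author": "DC.Creator",   # pre-1.0 Dublin Core element name
    "keywords": "DC.Subject",
    "description": "DC.Description",
}

def normalize(name: str, value: str):
    """Return a (DC element, value) pair, or None if no meaningful mapping."""
    name = name.lower()
    # Canberra-style down-grading: strip qualifiers, e.g.
    # "DC.Date.Created" is merged into the coarser "DC.Date".
    if name.startswith("dc.") and name.count(".") > 1:
        name = ".".join(name.split(".")[:2])
    if name in DC_MAP:
        return DC_MAP[name], value
    if name.startswith("dc."):
        return "DC." + name[3:].capitalize(), value
    return None  # discard: no meaningful conversion

print(normalize("AUTHOR", "S. Lundberg"))       # -> ('DC.Creator', 'S. Lundberg')
print(normalize("DC.Date.Created", "1998-09"))  # -> ('DC.Date', '1998-09')
print(normalize("generator", "SomeTool"))       # -> None (discarded)
```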
Cooperative harvesting has been implemented and utilized for a variety of purposes. Here we give a couple of examples.
The role of Umeå University in this project has been to utilize collaborative harvesting and other collaborative methods in order to optimize administration and performance. Umeå University is involved in several initiatives involving indexing and WWW-based information systems, such as the local course directory [16] and the ASKen project [17].
All these projects involve different people and organizations for maintenance, development and editorial work, and they put pressure on scientists and lecturers to produce suitable, well-structured material of high quality. Efficient coordination is necessary to reduce costs and improve performance. The various projects utilize different sets of metadata standards (mainly based on Dublin Core), and in many cases we foresee that information must otherwise be tagged several times with nearly identical content.
One of our main objectives is therefore to design a cooperative, flexible and cost-effective indexing system for Umeå University that allows for coordination of these various projects and also fits into a national and international perspective.
The universities of Umeå and Lund are working on projects related to harvesting of metadata and localized WWW indexing, notably SAFARI, the Swedish national research information system [9].
We have chosen to focus on a solution that coordinates Safari and the local indexes of the two universities. Thus, research information for Safari is available in the main index for each university and the same system is also used to feed the national Safari system. A straightforward extension would be to use this system as a distributed collaborative harvester for NWI, but that also requires further analysis of performance and cost effectiveness.
Collaborative harvesting (i.e. where two or more sites contribute to one database) has been implemented using the following method. For simplicity, we describe an existing application, where DTV builds a local database over all its pages (using NWI technology) while at the same time contributing to the main Danish database. Two machines, Hera and Venus, each have the NWI software installed. Hera is the Danish service point and Venus is the WWW server for DTV. Venus is configured to harvest only pages in the *.dtv.dk domain (i.e. all pages at DTV). When a page is harvested it is normally sent to a database location, where it is parsed and the resulting record is inserted into the database. On Venus, a harvested page is sent to two database locations: one is the local database, the other is the Danish service point (Hera). Technically this is done by specifying several database locations in the Job Control Format (JCF), as sketched below. The only special configuration on Hera is that the domain *.dtv.dk is excluded from harvesting (since Venus is doing this); this is not strictly necessary, but we avoid double harvesting this way. The result is that Venus can have its own scheduling policy and database while still keeping the main Danish database on Hera up to date with regard to the *.dtv.dk domain.
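The following Python fragment is a conceptual sketch of this fan-out; it is not actual Combine or JCF syntax, and the database location strings are hypothetical, taken from the example above for illustration only:

```python
# Conceptual sketch of the fan-out described above: every page harvested
# on Venus is parsed once, and the resulting record is delivered to each
# configured database location.

from fnmatch import fnmatch

HARVEST_SCOPE = ["*.dtv.dk"]            # Venus harvests only the DTV domain
DB_LOCATIONS = [
    "venus.dtv.dk/local",               # local DTV database (hypothetical name)
    "hera.dtv.dk/denmark",              # main Danish database on Hera (hypothetical)
]

def in_scope(host: str) -> bool:
    return any(fnmatch(host, pattern) for pattern in HARVEST_SCOPE)

def deliver(host: str, page: str) -> None:
    if not in_scope(host):
        return
    record = {"host": host, "text": page}   # stand-in for the real parser
    for location in DB_LOCATIONS:           # insert into every configured DB
        print(f"insert into {location}: {record['host']}")

deliver("www.dtv.dk", "<html>...</html>")
```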
Most robots on the Internet are constructed to be used as part of an indexing service, which is mostly concerned with textual information. When the aim is to archive entire documents, all inline images, sounds, etc. are needed as well. The aim here is to reconstruct not only the words, but also the "look and feel" of a document. This places somewhat different demands on the gathering software. In order to tailor the software for archiving purposes [18], several modifications have been made.
Since the aim is to archive everything, the robot has been changed so that it acquires all objects regardless of MIME type. This means that images, sounds, etc. are gathered as well. Of course, pictures, sounds and similar documents are not parsed for HTML links.
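A minimal sketch of this behaviour, built on Python's standard library (the function name and the naive link extraction are ours, not the robot's):

```python
# Sketch: fetch every object regardless of MIME type, but extract
# links only from HTML bodies.

import re
import urllib.request

def archive_fetch(url: str):
    with urllib.request.urlopen(url) as resp:
        body = resp.read()                        # keep bytes verbatim
        ctype = resp.headers.get("Content-Type", "")
    links = []
    if ctype.startswith("text/html"):             # only HTML is link-parsed
        links = re.findall(rb'href="([^"]+)"', body)
    return body, ctype, links                     # store the body whatever it is
```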
In order to ensure the long-term integrity of items in the archive, it was decided to store items as multipart MIME objects, where each individual object is stored together with whatever miscellaneous data is available, including the HTTP response header. For this purpose the robot was extended with a module saving items in that format. Other software can access these objects when archived items are requested from the archive.
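As an illustration, the following sketch uses Python's standard email library to wrap one harvested object, together with its HTTP response header, as a multipart MIME message; the field layout is assumed, not the archive's actual format:

```python
# Sketch: one archived item as multipart MIME, carrying both the raw
# HTTP response header and the object itself.

from email.message import EmailMessage

def wrap_item(url: str, http_header: str, body: bytes, ctype: str) -> bytes:
    msg = EmailMessage()
    msg["Subject"] = url                        # identify the archived item
    msg.make_mixed()                            # multipart/mixed container
    msg.add_attachment(http_header.encode(),    # part 1: HTTP response header
                       maintype="text", subtype="plain")
    main, sub = ctype.split(";")[0].strip().split("/", 1)
    msg.add_attachment(body,                    # part 2: the object, verbatim
                       maintype=main, subtype=sub)
    return bytes(msg)                           # serialized MIME object

item = wrap_item("http://www.kb.se/", "HTTP/1.0 200 OK\r\n",
                 b"<html>...</html>", "text/html")
```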
Frequently, documents of interest reside on servers not registered under the .se domain. Therefore a feature has been added which allows extra domains, outside .se, to be added to the robot's search space. In this way it is possible to also acquire documents under, e.g., the .com domain which are regarded as being of interest when preserving the Swedish Web.
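A sketch of the resulting search-space test, with made-up extra domains:

```python
# Sketch: the robot accepts .se hosts plus an explicit list of extra,
# non-.se domains judged to belong to the Swedish Web. The extra
# domains below are invented examples.

EXTRA_DOMAINS = ["swedish-site.com", "example.org"]

def in_search_space(host: str) -> bool:
    return host.endswith(".se") or any(
        host == d or host.endswith("." + d) for d in EXTRA_DOMAINS)

print(in_search_space("www.kb.se"))             # True: .se domain
print(in_search_space("www.swedish-site.com"))  # True: explicitly added
print(in_search_space("www.cnn.com"))           # False: outside the search space
```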
One planned new capability of the robot is the implementation of collections in the harvesting database. By collections we refer to a substructure that could be implemented in the parser, such that an external API could return the name of the collection to which a given record should belong.
The decision of membership in a collection could be based on any criteria that are meaningful for a given service, including subject classification derived from metadata or through automated classification software, server domain, etc. One obvious basis for membership could be the presence of metadata specific to given communities or projects.
We foresee that introducing the collection concept will also facilitate collaborative harvesting, since each collection can be placed in a separate information retrieval system.
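The following sketch illustrates what such an external collection API could look like; all names and criteria here are hypothetical examples:

```python
# Sketch: a service-defined hook that tells the parser which collection
# a given record belongs to, based on metadata or server domain.

def collection_of(record: dict) -> str:
    # Membership by community-specific metadata ...
    if "physics" in record.get("DC.Subject", "").lower():
        return "physics"
    # ... or by server domain
    if record.get("host", "").endswith(".dtv.dk"):
        return "dtv"
    return "default"

record = {"host": "www.dtv.dk", "DC.Subject": "Physics; Optics"}
print(collection_of(record))   # -> "physics" (metadata checked before domain)
```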
Currently the harvesting robot supports only a very limited range of Internet Media Types (text/html and text/plain). However, we have recently implemented a parser for a common graphics format, namely GIF. The GIF specification permits embedding a limited amount of metadata, such as a description of an image. The amount of such descriptions on the Internet is depressingly low, and the time is not yet ripe for launching services built on such metadata.
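For illustration, a comment-extension extractor along these lines could look as follows; the sketch follows the GIF89a block structure but is written for clarity rather than production robustness, and is not our actual parser:

```python
# Sketch: extract GIF Comment Extension blocks (label 0xFE), the kind of
# embedded image description mentioned above.

def gif_comments(data: bytes) -> list:
    assert data[:6] in (b"GIF87a", b"GIF89a")
    packed = data[10]
    pos = 13                                   # header + logical screen descriptor
    if packed & 0x80:                          # global color table present
        pos += 3 * (2 << (packed & 0x07))
    comments = []
    while pos < len(data):
        block = data[pos]; pos += 1
        if block == 0x3B:                      # trailer: end of stream
            break
        elif block == 0x21:                    # extension block
            label = data[pos]; pos += 1
            chunks = []
            while data[pos] != 0:              # data sub-blocks
                size = data[pos]
                chunks.append(data[pos + 1:pos + 1 + size])
                pos += 1 + size
            pos += 1                           # block terminator
            if label == 0xFE:                  # comment extension
                comments.append(b"".join(chunks).decode("latin-1"))
        elif block == 0x2C:                    # image descriptor
            flags = data[pos + 8]; pos += 9
            if flags & 0x80:                   # local color table
                pos += 3 * (2 << (flags & 0x07))
            pos += 1                           # LZW minimum code size
            while data[pos] != 0:              # image data sub-blocks
                pos += 1 + data[pos]
            pos += 1
    return comments
```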
NetLab and DTV will eventually merge the current metadata services into the regular services. In doing so, we will ensure that all metadata is interoperable with GILS.
Within the scope of the NWI II project, Index Data's [20] effort has been primarily focused on maintaining and extending the Zebra information retrieval package, which is currently employed by each of the national NWI databases. Much of this work has in turn focused on the special requirements of the NWI project, posed both by the size of the databases and by the heterogeneous nature of the information content.
As the size of the larger national databases increased, it became apparent that the built-in relevance ranking algorithms did not provide sufficient recall when faced with the extremely heterogeneous contents of the free-text data harvested from the web. An additional challenge has been the differences in ranking results between the national databases, caused by their different contents (i.e. the same document would not necessarily be given the same absolute rank for the same query by the different databases). This is a natural property of statistical "natural-language" search engines, but it nevertheless complicated the cross-national search functionality of the NWI access points. Experiments with database-independent ranking schemes, and eventually a conference of information retrieval experts from each of the participating academic institutions, led to the development by ID of a modular framework for the relevance calculator component of Zebra. This allows individual partners to experiment with the development of different relevance ranking metrics and, eventually, enables the project to draw on the considerable experience in natural-language searching that is available to the project, particularly at DTV and NetLab, in addition to ID itself.
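To illustrate the idea of a pluggable relevance calculator (Zebra itself is implemented in C; the shape below is purely illustrative and not its actual interface), consider a registry of interchangeable scoring functions:

```python
# Sketch: a registry of relevance calculators, where each partner can plug
# in its own metric. Classic tf-idf is shown as one possible plug-in.

import math

def tf_idf(tf: int, df: int, n_docs: int, doc_len: int) -> float:
    """One possible pluggable metric: length-normalized tf-idf."""
    if tf == 0 or df == 0:
        return 0.0
    return (tf / doc_len) * math.log(n_docs / df)

RANKERS = {"tfidf": tf_idf}            # registry of relevance calculators

def score(ranker: str, postings, n_docs: int) -> list:
    rank = RANKERS[ranker]
    return sorted(
        ((doc, rank(tf, df, n_docs, dl)) for doc, tf, df, dl in postings),
        key=lambda pair: pair[1], reverse=True)

# Hypothetical postings: (doc id, term freq, document freq, doc length)
print(score("tfidf", [("a", 3, 10, 100), ("b", 1, 10, 50)], n_docs=1000))
```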
The project has seen a steady increase in the quality and recall of the Zebra search engine, although it is clear that there is still some work to do before the NWI service can match the sophisticated, proprietary approaches employed by dedicated, commercial services such as Alta Vista.
There is a page at the Danish service point where users can add a new or changed WWW page to NWI. If the page is in an NWI domain, the URL is automatically sent by mail to the appropriate service point. At the Danish service point such mail is automatically processed further and the URL is scheduled for harvesting. This normally results in the added WWW page being harvested and inserted into the database within a day; it usually becomes visible to the public within a week (i.e. the next time the database is indexed).
If the site has an address in a domain outside the Nordic domains (.dk, .fi, .fo, .gl, .is, .no, .se), the user is asked to send a mail explaining why the site should be included in a Nordic database. These mails are processed manually, and such sites are inserted into the "Other Nordic" database.
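A sketch of the routing decision described in the last two paragraphs; the mapping of top-level domains to service points is partly assumed (in particular for .fo and .gl):

```python
# Sketch: route an added URL either to a national service point for
# scheduled harvesting or to manual review for "Other Nordic".

from urllib.parse import urlparse

NORDIC = {".dk": "nwi.dtv.dk", ".fi": "nwi.funet.fi", ".se": "nwi.lub.lu.se",
          ".no": "nwi.bibsys.no", ".is": "nwi.bok.hi.is",
          ".fo": "nwi.dtv.dk", ".gl": "nwi.dtv.dk"}   # assumed routing

def route(url: str) -> str:
    host = urlparse(url).hostname or ""
    for tld, service_point in NORDIC.items():
        if host.endswith(tld):
            return f"mail URL to {service_point} for scheduled harvesting"
    return "ask user to justify inclusion; manual review for 'Other Nordic'"

print(route("http://www.dtu.dk/"))     # routed to the Danish service point
print(route("http://example.com/"))    # manual review
```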
In addition to the add-URL service, there is a direct connection between the Nordic Metadata tool and the Combine robot running at NetLab. A URL included as a Dublin Core identifier will be queued for indexing the same day. Records with metadata are entered into swemeta on a daily basis.
The user interface software package has been much simplified. The configuration parameters are now table driven, and the look and feel of the user interface is specified with screen templates.
Context-sensitive online help has been included at the Danish service point [1]. The help includes general help text, hints and examples, and is available from all pages.
Another form of online help is an active page which investigates, in real time, several different searches using various combinations of the user's search terms, database fields and boolean operators, in terms of the expected number of hits. The user interface allows entering a number of search terms and selecting a database; a matrix then shows estimates of the number of hits for a range of search combinations. All the user has to do is select a query with an acceptable result set and click on it to have it executed. This help service is available from the Danish NWI home page and from the advanced search page.
Each column corresponds to the part of the document searched:

| Searching for   | authors | titles | descriptions | headings | anchor texts | free text |
|-----------------|---------|--------|--------------|----------|--------------|-----------|
| "nordisk web"   | 0       | 1      | 0            | 3        | 140          | 165       |
| nordisk and web | 0       | 1      | 5            | 54       | 399          | 1616      |
| nordisk         | 2       | 1109   | 107          | 652      | 2803         | 13282     |
| web             | 509     | 7103   | 4334         | 9224     | 26895        | 108011    |
| nordisk or web  | 511     | 8211   | 4436         | 9822     | 29299        | 119677    |
Clicking on the number 54, for instance, executes a search for 'nordisk and web' in the headings field, which gives 54 hits.
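The matrix itself can be generated by issuing one probe query per combination and field; the sketch below illustrates this, with a toy hit estimator standing in for the real search-engine calls:

```python
# Sketch: build the hit-estimate matrix above, one probe query per
# (query combination, document field) pair. The query syntax and the
# estimate_hits helper are hypothetical stand-ins.

FIELDS = ["authors", "titles", "descriptions", "headings",
          "anchor texts", "free text"]

def combinations(terms):
    yield '"' + " ".join(terms) + '"'          # phrase search
    yield " and ".join(terms)
    yield from terms                           # each term alone
    yield " or ".join(terms)

def hit_matrix(terms, estimate_hits):
    """Return {query: {field: estimated hits}} for the preview matrix."""
    return {q: {f: estimate_hits(q, f) for f in FIELDS}
            for q in combinations(terms)}

# Toy estimator so the sketch runs; the real page asks the search engine.
demo = hit_matrix(["nordisk", "web"], lambda q, f: len(q) * len(f))
for query, row in demo.items():
    print(query, row)
```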
Using this new user interface software package, ZGATE, a number of advanced search user interfaces were easily developed; see the advanced search page at the Danish NWI node [1] for examples.
In starting to use the DESIRE project's Combine harvesting robot, we have significantly improved the user interface for robot administration. It is still, however, based on a large number of scripts, and the everyday monitoring of a Combine installation could be simplified. The current robot is configured by editing three configuration files.
Administrative and management scripts have to be developed to suit local policies and needs. These may include removal of log files, feeding newly found URLs into the harvester, checking and updating database records, reindexing the database, making backups, etc. Examples of such scripts can be provided on request, and a collection of such tools is planned to become available from the Combine home page [19]. Other tools have been developed for metadata-based projects, where the harvester checks for the existence of specific metadata fields.
A number of WWW-based system administration tools have been specified but are not yet implemented [22].
The Combine harvesting robot [19] is now distributed as free software [11]. Extensive documentation is available, covering most details of the software architecture and its use. In particular, there are documents covering installation, along with cookbook examples [23], and a more detailed User's Guide [24]. Additional information exists as well, although some of it is still in preparation.
The NWI II has proved to be a successful Nordic cooperation. Both as a development project and as a service it is still unique in the tools and methodologies it provides.
The major drawback of the NWI initiative has been the lack of a solid organization for the service itself. The original partners in the project have not had the resources to upgrade their software to the current level, and some sites suffer from an acute shortage of hardware.
[1] NWI Danish service point <http://nwi.dtv.dk/>
[2] NWI Finland <http://nwi.funet.fi/>
[3] NWI Sweden <http://nwi.lub.lu.se/>
[4] NWI Norway <http://nwi.bibsys.no/>
[5] NWI Iceland <http://nwi.bok.hi.is/>
[6] Øresundsuniversitetet <http://www.uni.oresund.org/studie/owiindex.htm>
[7] EUN Multimedia Schoolnet <http://www.eun.org>
[8] DTU demonstration page for evaluation <http://nwi.dtv.dk/dtudtv.html>
[9] SAFARI, the Swedish national research information system <http://safari.hsv.se/>
[10] SveSÖK <http://www.svesok.kb.se/>
[11] General Public License <http://www.gnu.org/copyleft/gpl.html>
[12] A. Ardö and S. Lundberg, 1998. A regional distributed WWW search and indexing service -- the DESIRE way. Computer Networks and ISDN Systems 30:173-183 <http://nwi.dtv.dk/www7/>
[13] NWI: Presentation at the NORDUnet '97 Conference in Reykjavik, Iceland, June 29th - July 1st 1997 <http://nwi.dtv.dk/nordunet/>
[14] S. Lundberg, 1997. The cooperative harvesting and indexing workshop in Reykjavik, 29 June 1997 <http://www.ub.lu.se/ind_work_shop.html>
[15] Metadata mappings and statistics <http://nwi.dtv.dk/MD/>
[16] The local course directory <http://info.adm.umu.se/utbkat/>
[17] The ASKen project <http://asken.hsv.se/>
[18] Kulturarw3 <http://kulturarw3.kb.se/>
[19] The Combine Harvesting Robot <http://www.lub.lu.se/combine/>
[20] Index Data ApS <http://www.indexdata.dk/>
[21] Cooperative Hierarchical Indexing Coordination <http://www.terena.nl/task-forces/tf-chic/>
[22] Unesco Regional Web Index Config Tool <http://nwi.dtv.dk/UNESCO/Configtool.html>
[23] The Combine README <http://www.lub.lu.se/combine/dist/README-v1.1.txt>
[24] Combine User's Guide <http://www.lub.lu.se/combine/docs/uguide.html>