xslt_indexer -- a full text indexing tool extracting text and XML fragments using XSLT

Sigfrid Lundberg (slu@kb.dk)
The Royal Library
Copenhagen

Introduction

xslt_indexer is a Java application and an API developed as a part of the migration initiative, which aimed to migrate a number of old database applications to a common XML-based platform.

Within this project, also known as SJAUIGA, xslt_indexer was developed as a general method of loading complex XML objects, such as METS and TEI documents, into a fielded, searchable index.

The machinery is built on the Lucene Java search API. On top of that we have built an application, xsl_index, that can index arbitrary XML documents.

The application

xsl_index uses the ability of the Xalan Java XSLT processor to call XPath extension functions written in Java. xsl_index is a command line tool. Running xsl_index --help reports the following:

Usage:
-h or --help:                       Print this message
-x or --xslt     <xsl transform>:   A XSLT script performing the indexing
-c or --create   <index directory>: The directory where Lucene should
                                    create a new index
-u or --update   <index directory>: The directory where Lucene should
                                    update an existing index
-d or --datafile <file name>:       The name of an XML file to be indexed.
                                    If a - is given as argument, the
                                    program assumes that a list of file
                                    names can be read from STDIN.
-s or --source <file name>:         Similar to -d but instead of containing a
                                    single XML object file is supposed to
                                    contain one single XML document per line
                                    as a single string. If the argument is -,
                                    these XML objects are read from STDIN.
-v or --verbose <verbosity>:        verbosity (debug level) 1, 2 or 3
                                    The log is written to STDERR. You will
                                    redirect to file yourself

All files and directory should be given with complete path
	

There is no need to program in order to index a new kind of XML; however, one has to be able to write fairly advanced XSLT scripts. The script we wrote to index our METS files, mets2lucene.xsl, is given as an example below.

A few worked examples

Assuming we have a couple of hundred METS files in a file system, all of which are called 'metsfile.xml', you may index them all with the following command:

find /var/www/html/metsnavigator/manus/ -name metsfile.xml -print | \
    ./xsl_index --xslt mets-data/scripts/mets2lucene.xsl \
    --create ./index  --datafile - --verbose 3 > index.log 2>&1
	

We use find to search through our file system. xsl_index will read its style sheet from mets-data/scripts/mets2lucene.xsl and create a new Lucene index in the directory ./index. xsl_index will read the names of the files to index from STDIN, hence we use the option --datafile -, and it will write its very verbose (--verbose 3) log to the file index.log.

An example using the brief options:

./xsl_index -x ./xslt_indexer/xslt/tei2lucene.xsl -u poma_index/ \
            -d ./poma/tei/Poma-parsed.xml -v 0
	

This means that we read an indexing style sheet from ./xslt_indexer/xslt/tei2lucene.xsl, update an existing index in poma_index/ and index a single XML file, ./poma/tei/Poma-parsed.xml. With verbosity 0 the indexer will hardly log anything.

The -s (or --source) option is useful, for instance, when multiple documents are stored in a single file. Assume that we have a file consisting of a lot of XHTML, structured such that fragments of XML can be retrieved from it just by "grepping" for a string (in our case 'literary fragments'). We can then index these XML doclets by passing the matching lines, one XML document per line, to xsl_index through the -s option.
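The one-document-per-line convention of the -s option can be illustrated with a small sketch. This is not the Driver's actual code; the class name LineSource and the method readDocuments are hypothetical, and only standard JDK XML parsing is used. It shows the assumption the option relies on: every non-empty line must be a complete, well-formed XML document.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical illustration, not part of xsl_index.
public class LineSource {

    /** Parses every non-empty line of the input as a separate XML document. */
    public static List<Document> readDocuments(BufferedReader in) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        List<Document> docs = new ArrayList<>();
        String line;
        while ((line = in.readLine()) != null) {
            if (line.trim().isEmpty()) {
                continue; // skip blank lines
            }
            // A line that is not well-formed XML makes parse() throw.
            docs.add(builder.parse(new InputSource(new StringReader(line))));
        }
        return docs;
    }

    public static void main(String[] args) throws Exception {
        BufferedReader stdin = new BufferedReader(
                new InputStreamReader(System.in, StandardCharsets.UTF_8));
        for (Document doc : readDocuments(stdin)) {
            // Print the root element name of each document read from STDIN.
            System.out.println(doc.getDocumentElement().getNodeName());
        }
    }
}
```

An XML fragment that spans several lines in the source file would have to be joined into one line before it is handed over in this way.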

About the software

xsl_index consists of two Java classes:

dk.kb.dup.xsltIndexer.Driver
dk.kb.dup.xsltIndexer.IndexLoader
	

Driver is the class containing the main method, and it is this class which "transforms" the documents. It communicates with the rest of the system by sending parameters to Xalan, and the style sheet must pass some of them on to the IndexLoader. The parameters used are listed below.

Parameters that are given to the Driver on the command line or created as needed. They are passed on to the style sheet via Xalan. The style sheet must pass them on to the IndexLoader.

parameter         meaning
mode              create or update. Tells Lucene whether it should create
                  a new index or update an existing one. From the command
                  line.
index_directory   Where the index should be built. From the command line.
reader            An object (a Lucene IndexReader) passed from the Driver.
                  The style sheet must pass this on to the IndexLoader.
writer            An object (a Lucene IndexWriter) passed from the Driver.
                  The style sheet must pass this on to the IndexLoader.
analyzer          An object (a Lucene SimpleAnalyzer) passed from the
                  Driver. The style sheet must pass this on to the
                  IndexLoader.
debug_level       The verbosity parameter from the command line.
datasource        Used for incremental update of indexes. Since one XML
                  file can lead to hundreds of records, we embed its name
                  in each record it leads to. We can then identify (by a
                  Lucene search) and delete those Lucene documents as
                  needed if the XML file has been updated.
sourcefield       The field in the Lucene documents in which we store the
                  datasource.
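How a driver can hand such parameters to a style sheet may be sketched with the standard JAXP API. The class ParamDemo and its tiny stand-in style sheet are hypothetical and are not the real Driver; the sketch only assumes that parameters are set with Transformer.setParameter and received by matching xsl:param declarations.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical illustration, not the real dk.kb.dup.xsltIndexer.Driver.
public class ParamDemo {

    // A stand-in style sheet that only echoes two of the parameters.
    static final String XSL =
          "<xsl:transform version='1.0'"
        + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:output method='text'/>"
        + "<xsl:param name='mode' select=\"'create'\"/>"
        + "<xsl:param name='index_directory' select=\"'/dev/null'\"/>"
        + "<xsl:template match='/'>"
        + "<xsl:value-of select=\"concat($mode,' ',$index_directory)\"/>"
        + "</xsl:template>"
        + "</xsl:transform>";

    /** Runs the style sheet with the given parameter values and returns its text output. */
    public static String run(String mode, String indexDirectory) throws Exception {
        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSL)));
        // Values set here override the defaults declared in xsl:param.
        transformer.setParameter("mode", mode);
        transformer.setParameter("index_directory", indexDirectory);
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader("<doc/>")),
                              new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(run("update", "/tmp/index"));
    }
}
```

setParameter is declared to take a java.lang.Object as value, which is presumably how non-string values such as the reader, writer and analyzer can travel through the style sheet to the IndexLoader.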

We give an example style sheet, designed for indexing METS files. The example may not be up to date, but shows the principles.

Please note the extension namespaces, and in particular xmlns:IndexLoader="xalan://dk.kb.dup.xsltIndexer.IndexLoader", which is where we declare our IndexLoader class. The methods in that class which we use are

Java methods used from inside XSLT for indexing text in the XML document

open_index        Opens the index for reading and writing. May be
                  deprecated in future releases.
open_document     Called at the beginning of a Lucene document.
setAnalyzer       Method for passing the analyzer object from the Driver
                  to the IndexLoader.
setIndexReader    Method for passing the index reader object from the
                  Driver to the IndexLoader.
setIndexWriter    Method for passing the index writer object from the
                  Driver to the IndexLoader.
add_field         Adds a field and its content to the index.
add_xml_field     Adds an XSLT node set, serialized as an XML document, as
                  a field in the index.
close_document    Closes a Lucene document.
close_index       Called after the indexing of an XML document is
                  complete. May be deprecated in future releases.
delete_documents  Deletes the Lucene documents connected to the current
                  data source.
encode_uri        A utility function. It turned out that calling
                  java.net.URI.Encode was better than using the URI
                  encoder available inside Xalan.
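The order in which these methods are meant to be called, and the way datasource and sourcefield make incremental update possible, can be sketched with a toy in-memory model. ToyIndex is hypothetical and does not use Lucene; its method names merely echo the ones listed above, and the field name "source" and the file names are made up for the illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory stand-in for an index; not the real IndexLoader.
public class ToyIndex {
    private final List<Map<String, String>> documents = new ArrayList<>();
    private Map<String, String> current;

    public void openDocument() {
        current = new LinkedHashMap<>();
    }

    public void addField(String name, String value) {
        if (current == null) {
            throw new IllegalStateException("no open document");
        }
        current.put(name, value);
    }

    public void closeDocument() {
        if (current == null) {
            throw new IllegalStateException("no open document");
        }
        documents.add(current);
        current = null;
    }

    /** Removes every record whose source field equals the given data source. */
    public int deleteDocuments(String sourceField, String dataSource) {
        int before = documents.size();
        documents.removeIf(d -> dataSource.equals(d.get(sourceField)));
        return before - documents.size();
    }

    public int size() {
        return documents.size();
    }

    /** Indexes one record per title, tagging each with the file it came from. */
    public void indexFile(String dataSource, String... titles) {
        deleteDocuments("source", dataSource); // drop stale records first
        for (String title : titles) {
            openDocument();
            addField("source", dataSource);
            addField("title", title);
            closeDocument();
        }
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        index.indexFile("/data/a.xml", "page 1", "page 2");
        index.indexFile("/data/b.xml", "cover");
        index.indexFile("/data/a.xml", "page 1 revised"); // a.xml was updated
        System.out.println(index.size()); // prints 2
    }
}
```

Because every record carries the name of the file it came from, re-indexing a changed file replaces exactly the records that file produced and leaves the rest of the index alone.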

An example stylesheet

The functions are declared in a <xalan:component> ... </xalan:component> element near the beginning of the style sheet.

In the example below, all calls to the IndexLoader can be recognized by the IndexLoader: prefix.

<xsl:transform version="1.0"
	       xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
	       xmlns:m="http://www.loc.gov/METS/"
	       xmlns:xlink="http://www.w3.org/1999/xlink" 
	       xmlns:md="http://www.loc.gov/mods/v3"
	       xmlns:xalan="http://xml.apache.org/xslt"
	       xmlns:java="http://xml.apache.org/xslt/java"
	       xmlns:str="http://exslt.org/strings"
	       xmlns:IndexLoader="xalan://dk.kb.dup.xsltIndexer.IndexLoader">

  <xsl:output encoding="UTF-8"
	      method="text"/>

  <xsl:param name="mode"            select="'create'"   />
  <xsl:param name="index_directory" select="'/dev/null'"/>
  <xsl:param name="debug_level"     select="'0'"        />
  <xsl:param name="datasource"      select="''"         />
  <xsl:param name="sourcefield"     select="''"         />
  <xsl:param name="reader"          select="''"         />
  <xsl:param name="writer"          select="''"         />
  <xsl:param name="analyzer"        select="''"         />

 <xalan:component
      prefix="IndexLoader"
      functions="open_index open_document setAnalyzer setIndexReader setIndexWriter add_field add_xml_field close_document close_index delete_documents encode_uri">
    <xalan:script lang="javaclass" src="xalan://dk.kb.dup.xsltIndexer.IndexLoader"/>
  </xalan:component>

  <xsl:template match="/">

    <!-- here we pass objects created in the Driver to the index loader.
    These function calls should be somewhere where they are executed
            once per indexed XML file -->

    <xsl:value-of 
	select="IndexLoader:setAnalyzer($analyzer)"/>
    <xsl:value-of 
	select="IndexLoader:setIndexReader($reader)"/>
    <xsl:value-of 
	select="IndexLoader:setIndexWriter($writer)"/>

    <xsl:value-of 
	select="IndexLoader:open_index($mode,$index_directory,$debug_level)"/>

    <!-- Both the datasource and the corresponding source field come from
	 the indexing driver.
         This function call ensures that the lucene documents from a given
	 XML document are deleted prior to indexing. This is our way to
	 implement incremental update -->

    <xsl:if test="$mode='update' and $datasource and $sourcefield">
       <xsl:value-of 
	   select="IndexLoader:delete_documents($sourcefield,$datasource)"/>
    </xsl:if>
    <xsl:apply-templates select="m:mets"/>
    <xsl:value-of select="IndexLoader:close_index()"/>
  </xsl:template>

  <xsl:template match="m:mets">
    <xsl:apply-templates select="m:structMap[@type='physical']"/>
  </xsl:template>

  <xsl:template match="m:structMap">
    <xsl:for-each
	select="/m:mets/
		 m:dmdSec[@ID='md-root']/
		 m:mdWrap/
		 m:xmlData/
		 md:mods/
		 md:titleInfo">
      <xsl:variable name="lang">
	<xsl:value-of select="@xml:lang"/>
      </xsl:variable>
      <xsl:apply-templates select="/m:mets/m:structMap[@type='physical']/m:div">
	<xsl:with-param name="lang">
	  <xsl:value-of select="$lang"/>
	</xsl:with-param>
      </xsl:apply-templates>
    </xsl:for-each>
  </xsl:template>

  <xsl:template match="m:div">
    <xsl:param name="lang" select="'dan'"/>

    <xsl:variable name="language">
      <xsl:choose>
	<xsl:when test="@xml:lang">
	  <xsl:value-of select="@xml:lang"/>
	</xsl:when>
	<xsl:otherwise>
	  <xsl:value-of select="$lang"/>
	</xsl:otherwise>
      </xsl:choose>
    </xsl:variable>

    <xsl:value-of select="IndexLoader:open_document()"/>
    <!-- Both the field and the data comes from the indexing driver.
         The function call must be present in order to permit incremental
         update -->
    <xsl:if test="$datasource and $sourcefield">
      <xsl:value-of 
	  select="IndexLoader:add_field($sourcefield,$datasource,'store.yes','un_tokenized')"/>
    </xsl:if>

    <xsl:if test="@ID">
      <xsl:value-of select="IndexLoader:add_field('divid',@ID,'store.yes','un_tokenized')"/>
    </xsl:if>
    <xsl:if test="@ORDERLABEL">
      <xsl:value-of select="IndexLoader:add_field('orderlabel',@ORDERLABEL,'store.yes','un_tokenized')"/>
    </xsl:if>

    <xsl:value-of select="IndexLoader:add_field('record_lang',$language,'store.yes','tokenized')"/>
    <xsl:call-template name="generateURI">
      <xsl:with-param name="lang">
	<xsl:value-of select="$language"/>
      </xsl:with-param>
    </xsl:call-template>

    <xsl:if test="@DMDID">
      <xsl:variable name="goto_id" select="@DMDID"/>
      <xsl:apply-templates
	  select="//m:dmdSec[@ID=$goto_id]/m:mdWrap/m:xmlData/md:mods">
	<xsl:with-param name="lang">
	  <xsl:value-of select="$language"/>
	</xsl:with-param>
      </xsl:apply-templates>
    </xsl:if>
    <xsl:value-of select="IndexLoader:close_document()"/>
    <xsl:apply-templates select="m:div[@xml:lang=$language]"/>
  </xsl:template>

  <xsl:template match="md:mods">
    <xsl:param name="lang" select="'dan'"/>
    <xsl:apply-templates select="md:titleInfo[@xml:lang=$lang]"/>
    <xsl:apply-templates select="md:note[@xml:lang=$lang]"/>
    <xsl:apply-templates select="md:name[@xml:lang=$lang]"/>
    <xsl:apply-templates select="md:identifier"/>
  </xsl:template>

  <xsl:template match="md:titleInfo">
    <xsl:apply-templates select="md:title"/>
  </xsl:template>

  <xsl:template match="md:title">
    <xsl:value-of select="IndexLoader:add_field('title',.,'store.yes','tokenized')"/>
  </xsl:template>

  <xsl:template match="md:note">
    <xsl:variable name="description">
      <xsl:apply-templates/>
    </xsl:variable>
    <xsl:value-of select="IndexLoader:add_field('description',$description,'store.yes','tokenized')"/>
  </xsl:template>

  <xsl:template match="md:name">
    <xsl:variable name="description">
      <xsl:apply-templates/>
    </xsl:variable>
    <xsl:value-of select="IndexLoader:add_field('creator',$description,'store.yes','tokenized')"/>
  </xsl:template>


  <xsl:template match="md:identifier[@type='signature']">
    <xsl:value-of select="IndexLoader:add_field('signature',.,'store.yes','tokenized')"/>
  </xsl:template>

  <xsl:template match="md:relatedItem"/>

  <xsl:template name="generateURI">
    <xsl:param name="lang" select="'dan'"/>
    <xsl:variable name="application">
      <xsl:value-of select="substring-before(/m:mets/@OBJID,':')"/>
    </xsl:variable>
    <xsl:variable name="document">
      <xsl:value-of select="substring-after(/m:mets/@OBJID,':')"/>
    </xsl:variable>
    <xsl:variable name="orderlabel">
      <xsl:if test="@ORDERLABEL">
	<xsl:value-of select="concat(IndexLoader:encode_uri(@ORDERLABEL),'/')"/>
      </xsl:if>
    </xsl:variable>
    <xsl:value-of select="IndexLoader:add_field('identifier',concat('http://www.kb.dk/permalink/2006/',$application,'/',$document,'/',$lang,'/',$orderlabel),'store.yes','un_tokenized')"/>
  </xsl:template>

</xsl:transform>
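The string handling in the generateURI template can be mirrored in plain Java, which may make it easier to see what identifier each record receives. The class Permalink, the OBJID value "manus:mmdi42" and the order label "12 recto" are hypothetical. The encoding step assumes the multi-argument java.net.URI constructor followed by toASCIIString; how the real encode_uri method is implemented is not shown in this document.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical illustration of the generateURI template; not part of xsl_index.
public class Permalink {

    /**
     * Percent-encodes characters that are illegal in a URI path, for example
     * a space as %20. A '/' inside the value is left as it is.
     */
    public static String encodeSegment(String segment) throws URISyntaxException {
        // The leading slash keeps a ':' in the value from being read as a scheme.
        String encoded = new URI(null, null, "/" + segment, null).toASCIIString();
        return encoded.substring(1);
    }

    /** Mirrors XPath substring-before: empty string when the separator is absent. */
    static String before(String s, String sep) {
        int i = s.indexOf(sep);
        return i < 0 ? "" : s.substring(0, i);
    }

    /** Mirrors XPath substring-after: empty string when the separator is absent. */
    static String after(String s, String sep) {
        int i = s.indexOf(sep);
        return i < 0 ? "" : s.substring(i + sep.length());
    }

    /** Builds the identifier; orderLabel may be null when the attribute is absent. */
    public static String build(String objId, String lang, String orderLabel)
            throws URISyntaxException {
        String label = (orderLabel == null) ? "" : encodeSegment(orderLabel) + "/";
        return "http://www.kb.dk/permalink/2006/"
                + before(objId, ":") + "/" + after(objId, ":") + "/"
                + lang + "/" + label;
    }

    public static void main(String[] args) throws URISyntaxException {
        System.out.println(build("manus:mmdi42", "dan", "12 recto"));
    }
}
```

With these made-up values the part of OBJID before the colon becomes the application and the part after it the document, followed by the language and the encoded order label.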
    
