XML on the web, Client Side XSLT and Google

Sigfrid Lundberg's Stuff 2009-07-11


The market has forced the major browser manufacturers to converge on standards. Microsoft cannot afford to produce a browser that won't work with Google Apps, and the Linux-based netbooks and netboxes make a difference. If software is a service, what is the client? Obviously the web client, the browser. But why are the search engines lagging behind? Browsers are capable of AJAX and advanced XML processing, but the search engines are still basically just removing tags and presenting raw text extracts.

I have quite a few documents on this site that are written in raw XML, such as the Digital Humanities Infrastructures position paper from last year. Since it is about digital humanities and text encoding, I wrote it in TEI XML. The paper is presented using client side XSLT. Just view source on that note, and you'll see my markup. The document is a blatant show-off; it is written in nerdcore vanity. At the top you'll see

	<?xml version="1.0" encoding="UTF-8" ?>
	<?xml-stylesheet href="render.xsl" type="text/xsl" ?>

The xml-stylesheet processing instruction is read by your browser, which then retrieves the stylesheet render.xsl. Without further ado the browser applies that stylesheet to transform the document into HTML, which is then rendered for you to read using the CSS indicated by my XSLT script. You can even inspect the resulting HTML using Firebug, if you've got a modern installation.
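To make the first step concrete, here is a minimal sketch of how a client (or a robot) might discover the stylesheet. It uses only the Python standard library. The sample document body and the function name `stylesheet_hrefs` are made up for illustration; only the two declarations at the top are taken from my paper. It assumes double-quoted pseudo-attributes, as in my markup.

```python
import re
from xml.dom import minidom

# Hypothetical sample: only the first two lines mirror the real document.
SAMPLE = """<?xml version="1.0" encoding="UTF-8" ?>
<?xml-stylesheet href="render.xsl" type="text/xsl" ?>
<TEI><text><body><p>Hello</p></body></text></TEI>
"""

def stylesheet_hrefs(xml_text):
    """Return the href of every text/xsl xml-stylesheet processing instruction."""
    doc = minidom.parseString(xml_text.encode("utf-8"))
    hrefs = []
    for node in doc.childNodes:
        is_pi = node.nodeType == node.PROCESSING_INSTRUCTION_NODE
        if is_pi and node.target == "xml-stylesheet":
            # The PI data is a string of pseudo-attributes such as
            # href="render.xsl" type="text/xsl"; pick them apart with a regex.
            attrs = dict(re.findall(r'([\w-]+)\s*=\s*"([^"]*)"', node.data))
            if attrs.get("type") == "text/xsl" and "href" in attrs:
                hrefs.append(attrs["href"])
    return hrefs

print(stylesheet_hrefs(SAMPLE))
```

A real client would then resolve the href against the document URL, fetch the stylesheet and run the transform, which this sketch does not attempt.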

What does Google do when indexing XML?

There is almost certainly a range of XML formats that Google actually interprets. Which formats we don't know, but in practice we can hardly expect anything more exotic than common syndication formats (such as RSS and Atom), microformats and RDFa. I have noticed that Google has been doing OAI harvesting for many years, most likely for the benefit of Scholar.

One could envisage a number of viable indexing strategies for a general search engine for a document like my Digital Infrastructure Nerdcore Vanity paper. One would be to execute a global regex search and replace "<[^>]+>" with "" (which basically means: remove anything between angle brackets, or remove all tags).

This would yield a "detagged incipit", which in this case is "Digital Humanities Infrastructures 2008-06-12 Digital Humanities Infrastructures Sigfrid Lundberg"
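The detagging strategy can be sketched in a few lines of Python. The sample markup below is a made-up stand-in, not my actual TEI, and `detagged_incipit` and `max_words` are names I invented for the illustration. One deliberate deviation from the regex described above: tags are replaced with a space rather than the empty string, so that words in adjacent elements do not run together.

```python
import re

# Hypothetical stand-in for the start of a TEI document.
SAMPLE = """<?xml version="1.0" encoding="UTF-8" ?>
<TEI><teiHeader><title>Digital Humanities Infrastructures</title>
<date>2008-06-12</date></teiHeader>
<text><head>Digital Humanities Infrastructures</head>
<byline>Sigfrid Lundberg</byline></text></TEI>"""

def detagged_incipit(markup, max_words=20):
    """Strip everything between angle brackets and return the first words."""
    # "<[^>]+>" matches any tag, including the XML declaration.
    text = re.sub(r"<[^>]+>", " ", markup)
    # split() collapses runs of whitespace left behind by the removed tags.
    words = text.split()
    return " ".join(words[:max_words])

print(detagged_incipit(SAMPLE))
```

Note that the date from the header ends up in the middle of the extract, because the regex has no idea which elements are metadata and which are running text.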

Another strategy would be to use the style sheet provided by the nerd publishing this paper. That would yield the "transformed incipit", which for this document is (using the HTML body only) "Digital Humanities Infrastructures Sigfrid Lundberg". By using the transform, Google would then be able to apply its usual HTML indexing techniques. For instance, it would understand the title and be able to parse all hypertext links.
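What an indexer gains from the transformed output can be illustrated with the standard library's HTML parser. This is only a sketch: the HTML below is a hypothetical result of a transform (not the real output of render.xsl), the link is a placeholder, and the class name `TitleAndLinks` is my own. The XSLT step itself is left out, since the Python standard library has no XSLT processor.

```python
from html.parser import HTMLParser

class TitleAndLinks(HTMLParser):
    """Collect the <title> text and every <a href> from an HTML page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data

# Hypothetical transform output; the URL is a placeholder.
TRANSFORMED = """<html><head><title>Digital Humanities Infrastructures</title></head>
<body><h1>Digital Humanities Infrastructures</h1><p>Sigfrid Lundberg</p>
<p>See <a href="http://example.org/ref">a reference</a>.</p></body></html>"""

parser = TitleAndLinks()
parser.feed(TRANSFORMED)
parser.close()
print(parser.title)
print(parser.links)
```

The point is that the title and the links come out as structured data, whereas plain detagging throws that structure away.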

I searched for digital humanities infrastructure +lundberg and the result looks like "Digital Humanities Infrastructures 2008-06-12 Digital Humanities ... - Jun 5 ... Digital Humanities Infrastructures Sigfrid Lundberg Digital .... Also McCarty's humanities computing research infrastructure (or rather ..."

The machine is actually using a detagged incipit as the title, which I think is also what it does for PDF. This is very far from the ideas in Gleaning Resource Descriptions from Dialects of Languages (GRDDL), where providers of documents actually supply XSLT for the benefit of robots, which could use it to extract descriptions in RDF.





My name is Sigfrid Lundberg. The stuff I publish here may, or may not, be of interest for anyone else.

On this site there is material on photography, music, literature and other stuff I enjoy in life. However, most of it is related to my profession as an Internet programmer and software developer within the area of digital libraries at the Royal Library, Copenhagen (Denmark) and, before that, Lund university (Sweden).

The content here does not reflect the views of my past or present employers

Creative Commons License
This entry (XML on the web, Client Side XSLT and Google) within Sigfrid Lundberg's Stuff, by Sigfrid Lundberg is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.