Digital Representation and the Text Model

Sigfrid Lundberg's Stuff 2010-04-11

Dino Buzzetti, 2002. Digital Representation and the Text Model. New Literary History, 33(1), pp. 61-88.

I found this article by searching for Ordered Hierarchy of Content Objects (OHCO) in Google Scholar.

The study and use of the theory and methodologies of text tends to be either very practical or extremely theoretical. In the practical case, the theory is deployed by people who use it for producing digital editions or text collections. In the theoretical case, it is used by people proving that it is impossible to represent text using technologies as crude as strongly embedded markup languages, such as Text Encoding Initiative (TEI) XML.

It should not surprise you that the people discussing OHCO belong to the latter category. It is those who perceive problems that need to name them. And regardless of how many elements and attributes are added to the TEI, it is limited by the fact that it is XML.

Bumblebees and the theory of text

I write code for a living. I spend most of my time on practical problems. I'm about as interested in the intricacies of the theory of text as I am in flight mechanics theories proving that bumblebees are unable to fly.

However, if you want a really good discussion about text, you have to turn to these contributions. They are written by people who are just as smart as the students of flight mechanics. Recently the latter managed to prove that bumblebees can fly, which required more sophisticated mathematics and data. That is, it required new knowledge.

Buzzetti's contribution is a classic in the area. It was published in 2002, but that version is a translation. The original was published in Italian in 1999. That means that it was written somewhere in the interval 1997-1999. That is, he might have started his research before XML was even a recommendation. Most of the tools we use today were not even on the drawing board. When he offers criticism against markup languages, he talks exclusively about SGML, the precursor of XML (which is itself a much extended subset of SGML).

The Map isn't the Landscape

Anything which somehow depicts something else can be said to be a model, just as a globe is a model of the earth. Buzzetti focuses on two aspects of any model of a text.

  1. the information represented
  2. the representation of the information

The first aspect is called the content of the text, whereas the second is referred to as an expression of it. For example, assume that you want to preserve a MS Word document. One way would be to (i) print it to a file and transform that to a high resolution bitmap, as (say) a TIFF or JPEG 2000 image. You could also (ii) export it as ASCII text. The former would be an expression of the text and the latter would be its content.

The conventional wisdom says that we have two aspects of a text, its form and its content. Buzzetti doesn't mention this dichotomy at all. Rather, he says that an expression as well as the content may have a form and a substance. In his parlance, an edition is the set of the various contents and expressions available that can be linked to a work. The edition may then be represented by interpretations.

Embedded vs. external encoding

Buzzetti is very much against the use of SGML. I will not go into his discussion of the lack of data models connected to the markup, since the explosive development of XML technologies makes that part of his criticism obsolete. The Document Object Model API, supported by all major programming languages, is just one of many answers to that critique.
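To make the point about data models concrete, here is a minimal sketch using the DOM implementation in Python's standard library (xml.dom.minidom). The fragment and the helper text_of are invented for this illustration; they do not come from Buzzetti's article or from any real TEI document.

```python
from xml.dom.minidom import parseString

# A tiny TEI-like fragment, made up for this example.
SAMPLE = "<text><p>Bumblebees <hi>can</hi> fly.</p></text>"

def text_of(node):
    """Concatenate the data of all text nodes below a DOM node."""
    if node.nodeType == node.TEXT_NODE:
        return node.data
    return "".join(text_of(child) for child in node.childNodes)

doc = parseString(SAMPLE)

# The markup is no longer just characters in a stream: it is a tree
# of typed nodes that we can query and traverse.
paragraph = doc.getElementsByTagName("p")[0]
emphasised = [text_of(hi) for hi in paragraph.getElementsByTagName("hi")]

print(text_of(paragraph))
print(emphasised)
```

The tree is still a single hierarchy, of course, so this answers the critique about missing data models, not the one about overlapping structures.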

Buzzetti makes a distinction between strong and weak embedded markup. Languages that embed marks both at the beginning and at the end of character sequences inside a text are strongly embedded. Those that mark only the onsets are weakly embedded. Buzzetti is against strong embedding in particular, but claims that text encoding ideally should be done without any embedded markup at all. By embedding markup in the text you blur the distinction between form and substance.

The ideal form of text encoding is to store the offsets, string lengths and corresponding semantics externally. Encoding this way, you may support as many overlapping hierarchies of content objects as your heart may desire. No hierarchy needs to be given more importance than any other; all of them are equal.
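A minimal sketch of what such external encoding could look like follows. It is my own illustration, not Buzzetti's notation: the Annotation class, its field names and the helper functions are all hypothetical. The text itself stays untouched, and two hierarchies (sentences and physical lines) are recorded beside it as offsets and lengths.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Annotation:
    offset: int   # index of the first character covered
    length: int   # number of characters covered
    label: str    # the semantics attached to the span

def covered(text, ann):
    """Return the characters an annotation points at."""
    return text[ann.offset:ann.offset + ann.length]

def overlaps(a, b):
    """True if two spans intersect without one nesting inside the other."""
    a_end, b_end = a.offset + a.length, b.offset + b.length
    intersect = a.offset < b_end and b.offset < a_end
    a_in_b = b.offset <= a.offset and a_end <= b_end
    b_in_a = a.offset <= b.offset and b_end <= a_end
    return intersect and not (a_in_b or b_in_a)

# The plain character stream, with no markup embedded in it.
TEXT = "Bumblebees can fly. We know that now."

# One hierarchy for the content ...
sentences = [Annotation(0, 19, "sentence"), Annotation(20, 17, "sentence")]
# ... and another one for the expression (where the lines happen to break).
lines = [Annotation(0, 22, "line"), Annotation(23, 14, "line")]

print(covered(TEXT, sentences[1]))
print(covered(TEXT, lines[0]))
print(overlaps(sentences[1], lines[0]))
```

The second sentence starts on the first line and ends on the second, so the two spans overlap. A single XML tree cannot hold both as elements, whereas here neither hierarchy has to give way to the other.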

Buzzetti is right, but...

TEI doesn't do things this way. It isn't because Buzzetti isn't right. I'm sure he is. Those who did text encoding thought that SGML was good enough. Now we think that XML is OK and practical to use.

The designers of TEI chose to give priority to the logical structure of the texts we are encoding. That is, since we are allowed to have one single hierarchy, we use it for encoding the content, i.e., the chapters, sections, paragraphs and phrases. We don't use it for pages, lines and characters, which belong to the expression. We regard the page and line breaks as points in the character stream and add empty tags for them. These empty tags are called milestones. If we accept some computational difficulties we can encode a lot using this machinery, including insertions and deletions of text across (for example) paragraphs.
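The computational difficulty can be illustrated with a small sketch. TEI marks line beginnings with the empty element lb, but the fragment below is invented and, unlike real TEI, not in the TEI namespace. The function lines_from_milestones is a hypothetical helper of mine that recovers the physical lines by walking the tree in document order and splitting at each milestone.

```python
import xml.etree.ElementTree as ET

# An invented paragraph with two line-break milestones.
SAMPLE = (
    "<p>Bumblebees can fly. We<lb/>know that <hi>now</hi>."
    "<lb/>That required new knowledge.</p>"
)

def lines_from_milestones(elem, milestone="lb"):
    """Split the text below elem into lines at each milestone element."""
    lines = [""]

    def walk(node):
        if node.tag == milestone:
            # A milestone has no content; it only marks a point in the stream.
            lines.append("")
        else:
            lines[-1] += node.text or ""
            for child in node:
                walk(child)
                # Text following a child belongs to the current line.
                lines[-1] += child.tail or ""

    walk(elem)
    return lines

paragraph = ET.fromstring(SAMPLE)
result = lines_from_milestones(paragraph)
for line in result:
    print(line)
```

The lines are not nodes in the tree. We have to reconstruct them by traversal, which is the price paid for giving the single hierarchy to the logical structure.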

But, yes, our encoded text is an expression of the content, and we have limited the range of possible interpretations of the text.

This entry is part of my series Readings on digital objects


My name is Sigfrid Lundberg. The stuff I publish here may, or may not, be of interest for anyone else.

On this site there is material on photography, music, literature and other stuff I enjoy in life. However, most of it is related to my profession as an Internet programmer and software developer within the area of digital libraries. I have worked in that role at the Royal Danish Library, Copenhagen (Denmark) and, before that, at Lund university library (Sweden).

The content here does not reflect the views of my employers. They are now all past employers, since I retired 1 May 2023.

Creative Commons License
This entry (Digital Representation and the Text Model) within Sigfrid Lundberg's Stuff, by Sigfrid Lundberg is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.