Saturday, May 31, 2008

New Transcoder release

This is something I've been meaning to wrap up and write up for a while now: thanks to the Duke Integrating Digital Papyrology grant from the Andrew W. Mellon Foundation, I've been able to make a bunch of updates to the Transcoder, a piece of software I originally wrote for the EpiDoc project. Transcoder is a Java program that converts Greek text between encodings, for example from Beta Code to Unicode (or back again). It's used in initiatives like Perseus and Demos. I've been modifying it to work with Duke Databank of Documentary Papyri XML files (which are TEI-based). Besides a variety of bug fixes, Transcoder now also includes a fully functional SAX ContentHandler that transcodes the Greek text inside XML files as they are parsed.

There are a lot of complex edge cases in this sort of work. For example, Beta Code (or at least the DDbDP's Beta) doesn't distinguish between medial (σ) and final (ς) sigmas. That's an easy conversion in the abstract (just look for 's' at the end of a word, and it's final), but when your text is embedded in XML, and there may be an expansion (<expan>) tag in the middle of a word, for example, it becomes a lot harder. You can't just convert the contents of a particular element--you have to be able to look ahead. The problem with SAX, of course, is that it's stream-based, so no lookahead is possible unless you do some buffering. In the end what I did was buffer SAX events when an element (say a paragraph) marked as Greek begins, and keep track of all the text therein. That let me do the lookahead I needed to do, since I have a buffer containing the whole textual content of the <p> tag. When the end of the element comes, I then flush the buffer, and all the queued-up SAX events fire, with the transcoded text in them.
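The buffering approach described above can be sketched roughly as follows. This is a hypothetical illustration, not the actual Transcoder handler: the class name `SigmaBufferSketch`, the use of a `lang="grc"` attribute to mark Greek, and the string-based output are all assumptions, and only the letter 's' is transcoded so that the lookahead logic stays visible. It also skips XML escaping and ignorable whitespace.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: buffers SAX events inside an element marked as Greek,
// so the sigma decision can look ahead across nested tags.
public class SigmaBufferSketch extends DefaultHandler {
    static final char MEDIAL = '\u03c3'; // Greek small letter sigma
    static final char FINAL = '\u03c2';  // Greek small letter final sigma

    private final StringBuilder out = new StringBuilder();  // serialized result
    private final List<Object> queue = new ArrayList<>();   // String = markup, Integer = text chunk length
    private final StringBuilder text = new StringBuilder(); // all text seen inside the Greek element
    private int depth = 0;                                   // > 0 while inside a Greek element

    // Simplification: only 's' is converted, and any non-letter ends a word.
    public static String convertSigmas(String beta) {
        StringBuilder sb = new StringBuilder(beta);
        for (int i = 0; i < sb.length(); i++) {
            if (sb.charAt(i) != 's') continue;
            boolean wordEnds = i + 1 == sb.length() || !Character.isLetter(sb.charAt(i + 1));
            sb.setCharAt(i, wordEnds ? FINAL : MEDIAL);
        }
        return sb.toString();
    }

    // Markup is queued while buffering, written straight through otherwise.
    private void emit(String markup) {
        if (depth > 0) queue.add(markup); else out.append(markup);
    }

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        StringBuilder tag = new StringBuilder("<").append(qName);
        for (int i = 0; i < atts.getLength(); i++) {
            tag.append(' ').append(atts.getQName(i)).append("=\"").append(atts.getValue(i)).append('"');
        }
        tag.append('>');
        // Assumption: Greek is marked with lang="grc"; nested elements stay buffered.
        if (depth > 0 || "grc".equals(atts.getValue("lang"))) depth++;
        emit(tag.toString());
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (depth > 0) {
            text.append(ch, start, length); // keep the text for lookahead
            queue.add(length);              // remember where this chunk sits
        } else {
            out.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        emit("</" + qName + ">");
        if (depth > 0 && --depth == 0) flush();
    }

    // Convert the whole buffered text at once, then replay the queued events.
    private void flush() {
        String converted = convertSigmas(text.toString());
        int pos = 0;
        for (Object event : queue) {
            if (event instanceof Integer) {
                int len = (Integer) event;
                out.append(converted, pos, pos + len);
                pos += len;
            } else {
                out.append((String) event);
            }
        }
        queue.clear();
        text.setLength(0);
    }

    public static String process(String xml) {
        try {
            SigmaBufferSketch handler = new SigmaBufferSketch();
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
            return handler.out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(process("<p lang=\"grc\">pros<expan>ferw</expan> tois qeois</p>"));
    }
}
```

In the sample input, the 's' in "pros" sits at the end of its text chunk but not at the end of its word, because the word continues inside <expan>. Converting each chunk on its own would make it a final sigma; with the buffered text, it stays medial, while the sigmas in "tois" and "qeois" become final.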

That's a lot of work for one letter, but I'm happy to say that it functions well now, and is being used to process the whole DDbDP. Another edge case, one that I chose not to solve in the Transcoder program itself, is the problem of apparatus entries and their contents in TEI. An <app> element can contain a <lem> (lemma) and one or more <rdg> (readings). The problem is that the lemma and the readings are conceptually parallel: they are alternatives for the same spot in the text. For example:

The quick brown <lem>fox</lem> jumped over the lazy dog.

The TEI would be:

The quick brown <app><lem>fox</lem><rdg>cat</rdg></app> jumped over the lazy dog.

So "cat" follows immediately after "fox" in the text stream, but both words occupy the same space as far as the markup is concerned. In other words, I couldn't rely only on my fancy new lookahead scheme, because it broke down in edge cases like this: with nothing between them, the lemma and the reading look like one continuous word. The solution I went with is dumb, but effective: format the apparatus so that there is a newline after the lemma (and after each reading, if there are multiple readings). That way my code will still be able to figure out where each word ends. That whitespace really needs to be flagged as significant, though, so that it doesn't get clobbered by other XML processes. That has already happened to us once. It caused a bug for me too, because I wasn't buffering ignorable whitespace.
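To show why the newline matters, here is a small sketch under stated assumptions. The class name `ApparatusNewlineSketch` and the Beta Code words "logos" and "logon" are made up for illustration, and the end-of-word test (any non-letter ends a word) is a simplification rather than the Transcoder's actual rule.

```java
// Hypothetical sketch: the effect of a newline between <lem> and <rdg>
// on the final-sigma decision over buffered text.
public class ApparatusNewlineSketch {
    static final char MEDIAL = '\u03c3'; // Greek small letter sigma
    static final char FINAL = '\u03c2';  // Greek small letter final sigma

    // Simplification: only 's' is converted, and any non-letter ends a word.
    public static String convertSigmas(String beta) {
        StringBuilder sb = new StringBuilder(beta);
        for (int i = 0; i < sb.length(); i++) {
            if (sb.charAt(i) != 's') continue;
            boolean wordEnds = i + 1 == sb.length() || !Character.isLetter(sb.charAt(i + 1));
            sb.setCharAt(i, wordEnds ? FINAL : MEDIAL);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // <app><lem>logos</lem><rdg>logon</rdg></app>: the text chunks run together.
        String runTogether = convertSigmas("logos" + "logon");
        // The same apparatus with a newline after </lem>.
        String separated = convertSigmas("logos" + "\n" + "logon");

        System.out.println(runTogether.indexOf(FINAL) >= 0); // false: the lemma's sigma was treated as medial
        System.out.println(separated.indexOf(FINAL) >= 0);   // true: the newline marks the end of the word
    }
}
```

If another tool strips that newline, the two cases become indistinguishable again, which is why the whitespace has to survive every step of the pipeline.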

All that trouble over one little letter. Lunate sigmas would have made life so much easier...