Monday, November 14, 2011

TEI in other formats; part the second: Theory

In my first post on this subject, I poked a bit at how one might represent TEI in HTML without discarding the text model from the TEI document. Now I want to talk a bit more about that model, and the theory behind it. I may say irritable things about Theory as well, if you can stand to read to the end. I humbly beg the reader's pardon.

Let's look at the same document I talked about last time: http://papyri.info/ddbdp/p.ryl;2;74/source (see also http://papyri.info/ddbdp/p.ryl;2;74/). We can visualize the document structure using Graphviz and a spot of XSLT:
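
The XSLT is a throwaway; a minimal sketch of the idea, walking the element tree and emitting Graphviz's DOT language, might look like this (not the exact stylesheet I used, which also charts text nodes):

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:template match="/">
    <xsl:text>digraph tei {&#10;</xsl:text>
    <xsl:apply-templates select="//*"/>
    <xsl:text>}&#10;</xsl:text>
  </xsl:template>
  <!-- one DOT node per element, plus an edge from its parent -->
  <xsl:template match="*">
    <xsl:value-of select="generate-id()"/>
    <xsl:text> [label="</xsl:text>
    <xsl:value-of select="local-name()"/>
    <xsl:text>"];&#10;</xsl:text>
    <xsl:if test="parent::*">
      <xsl:value-of select="generate-id(parent::*)"/>
      <xsl:text> -> </xsl:text>
      <xsl:value-of select="generate-id()"/>
      <xsl:text>;&#10;</xsl:text>
    </xsl:if>
  </xsl:template>
</xsl:stylesheet>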

[Figure: tree structure of a TEI document]

It's a fairly flat tree. As an XML document, it has to be a tree, of course, and TEI leverages this built-in "tree-ness" to express concepts like "this text is part of a paragraph" (i.e. it has a tei:p element as its ancestor). In line 1, for example, we find

<supplied reason="lost">Μάρ</supplied>κος
meaning the first three letters of the name Markos have been lost due to damage suffered by the papyrus the text was written on, and the editor of the text has supplied them. The fact that the letters "Μάρ" are contained by the supplied element, or, more properly, that the text node containing those letters is a child of the supplied element, means that those letters have been supplied. In other words, the parent-child relationship is given additional semantics by TEI. We already have some problems here: the child of supplied is itself part of a word, "Markos", and that word is broken up by the supplied element. Only the fact that no white space intervenes between the end of the supplied element and the following text lets us know that this is a word. It's even worse if you look at the tree version, which is, incidentally, how the document will be interpreted by a computer after it has been parsed:

[Figure: tree diagram of the snippet above, in which "Μάρ" and "κος" sit in unconnected text nodes]
There's no obvious connection here between the first and second halves of the name. And in fact, if we hadn't taken steps to prevent it, any program that processed the document might reformat it so that "Mar" and "kos" were no longer connected. We could solve this problem by adding more elements. As the joke goes, "XML is like violence. If it isn't working, you're not using it enough." We could explicitly mark all the words, using a "w" element, thus: 
<w><supplied reason="lost">Μάρ</supplied>κος</w>
or, in tree form:

[Figure: tree diagram of the same snippet, with w as the parent of both the supplied element and the following text node]
which would solve any potential problems with words getting split up, because we could always fix the problem—we would know what all the words are. We could even attach useful metadata, like the lemma (the dictionary headword) of the word in question. We don't do this for a few reasons. First, because we don't need to: we can work around words being split up by markup. Second, because it complicates the document and makes it harder for human editors to deal with, and third, because it introduces new chances for overlap. Overlap is the Enemy, as far as XML is concerned. The more containers you have, the greater the chances one container will need to start outside another, but finish inside (or vice versa). Consider that there's no reason at all a region of supplied text shouldn't start in the middle of one word and end in the middle of another. Look at lines 5-6 for example:
                           ... οὐ̣[χ ἱκανὸν εἶ-]
[ναι εἰ]ς
A supplied section begins in the middle of the third word from the end of line five, and continues for the rest of the line. The last word is itself broken and continues on the following line, the beginning of which is also supplied, that section ending in the middle of the second word on line six. This is a mess that would only be compounded if we wanted to mark off words.
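
For the record, the standard EpiDoc workaround is to split the supplied region at the line break and signal with @break that no word boundary intervenes. Something like this (a simplified approximation of the source, with the unclear-letter markup omitted):

... οὐ<supplied reason="lost">χ ἱκανὸν εἶ</supplied>
<lb n="6" break="no"/><supplied reason="lost">ναι εἰ</supplied>ς

One editorial act—the supplying of a continuous stretch of lost text—becomes two elements, purely to keep the tree well-formed.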

This may all seem like a numbing level of detail, but it is on these details that theories of text are tested. The text model here cares about editorial observations on and interventions in the text, and those are what it attempts to capture. It cares much less about the structure of the text itself—note that the text is contained in a tei:ab, an element designed for delineating a block of text without saying anything about its nature as a block (unlike tei:p, for example). Visible features like columns, or text continued on multiple papyri, or multiple texts on the same papyrus, would be marked with tei:divs. This is in keeping with papyrology's focus on the materiality of the text. What the editor sees, and what they make of it, is more important than the construction of a coherent narrative from the text—something that is often impossible in any case. Making that set of tasks as easy as possible is therefore the focus of the text model we use.

What I'm trying to get at here is that there is Theory at work here (a host of theories in fact), having to do with a way to model texts, and that that set of theories is mapped onto data structures (TEI, XML, the tree) using a set of conventions, and taking advantage of some of the strengths of the data structures available. Those data structures have weaknesses too, and where we hit those, we have to make choices about how to serve our theories best with the tools we have. There is no end of work to be done at this level, of joining theory to practice, and a great deal of that work involves hacking, experimenting with code and data. It is from this realization, I think, that the "more hack, less yack" ethic of THATCamp emerged. And it is at this level, this intersection, this interface, that scholar-programmer types (like me) spend a lot of our time. And we do get a bit impatient with people who don't, or can't, or won't engage at the same level, especially if they then produce critiques of what we're doing.

As it happens, I do think that DH tends to be under-theorized, but by that I don't mean it needs more Foucault. Because it is largely project-driven, and the people who are able to reason about the lower-level modeling and interface questions are mostly paid only to get code into production, important decisions and theories are left implicit in code and in the shapes of the data, and aren't brought out into the light of day and yacked about as they should be.

Thursday, November 10, 2011

TEI in other formats; part the first: HTML

There has been a fair amount of discussion of late about TEI either acquiring a standard HTML5 representation or moving entirely to an HTML5 format. I want to do a little thinking "out loud" about how that might work.


Let's start with a fairly standard EpiDoc document (EpiDoc being a set of guidelines for using TEI to mark up ancient documents). http://papyri.info/ddbdp/p.ryl;2;74/source (see http://papyri.info/ddbdp/p.ryl;2;74/ for an HTML version with more info) is a fairly typical example of EpiDoc used to mark up a papyrus document. The document structure is fairly flat, but with a number of editorial interventions, all marked up. Line 12, below, shows supplied, unclear, and gap tags:


<lb n="12"/><gap reason="lost" quantity="16" unit="character"/> <supplied reason="lost"> Π</supplied><unclear>ε</unclear>ρὶ Θή<supplied reason="lost">βας καὶ Ἑ</supplied><unclear>ρ</unclear>μωνθ<supplied reason="lost">ίτ </supplied><gap reason="lost" extent="unknown" unit="character"/>

So how might we take this line and translate it to HTML? First, we have an <lb> tag, which at first glance would seem to map quite readily onto the HTML <br> tag, but if we look at the TEI Guidelines page for lb (http://www.tei-c.org/release/doc/tei-p5-doc/en/html/ref-lb.html), we see a large number of possible attributes that don't necessarily convert well. In practice, all I usually see on a line break tag in TEI is an @n and maybe an @xml:id attribute. HTML doesn't really have a general-purpose attribute like @n, but @class or @title might serve. On <lb>, @n is often used to provide line numbers, so @title seems logical.
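
So the lb above might come across as something like the following (my guess at a reasonable convention, using @class to record the original element name in the same way I'll do below):

<br class="tei-lb" title="12"/>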


Now <gap reason="lost" quantity="16" unit="character"/> is a bit more of a puzzler. First, HTML's semantics don't extend at all to the recording of attributes of a text being transcribed, so nothing like the gap element exists. We'll have to use a general-purpose inline element (span seems obvious) and figure out how to represent the attribute values. TEI has no lack of attributes, and these don't naturally map to HTML at all in most cases. If we're going to keep TEI's attributes, we'll have to represent them as child elements. We'll want to identify the original TEI element, and to wrap its attributes and maybe its content too, so let's assume we'll use the @class attribute with a few fake "namespaces": "tei-" for TEI element names, "teia-" for attribute names, and "teig-" to identify attributes and wrap element contents (the latter might be overkill, but seems sensible as a way to control whitespace). We can assume a stylesheet with a span.teig-attribute selector that sets display:none.


<span class="tei-gap">
  <span class="teig-attribute teia-reason">lost</span>
  <span class="teig-attribute teia-quantity">16</span>
  <span class="teig-attribute teia-unit">character</span>
</span>
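
The stylesheet's contribution is trivial; assuming the classes above, something like:

/* hide the spans that merely carry TEI attribute values */
span.teig-attribute {
  display: none;
}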


Like HTML, TEI has three structural models for elements: block, inline, and milestone. Block elements assume a "block" of text; that is, they create a visually distinct chunk of text. Divs, paragraphs, tables, and similar elements are block level. Inline elements contain content, but don't create a separate block. Examples are span in HTML, or hi in TEI. Milestones are empty elements like lb or br. TEI has several of these, and HTML, which has "generic" elements of the block and inline varieties (div and span), lacks a generic empty element. Hence the need to represent tei:gap as a span.


tei:supplied is clearly an inline element, and we can do something similar to the example above, using span:


<span class="tei-supplied">
  <span class="teig-attribute teia-reason">lost</span>
  <span class="teig-content">Π</span>
</span>


and likewise with unclear:


<span class="tei-unclear">
  <span class="teig-content">ε</span>
</span>


Now, doing this using generic HTML elements and styling/hiding them with CSS could be considered bad behavior. It's certainly frowned upon in the CSS 2.1 spec (see the note at http://www.w3.org/TR/CSS2/selector.html#class-html). I don't honestly see another way to do it though, because, although RDFa has been suggested as a vehicle for porting TEI to HTML, there is no ontology for TEI, so no good way to say "HTML element p is the same as TEI element p, here". Even granting the possibility of saying that, it doesn't help with the attribute problem. And we're still left with the problem of presentation: what will my HTML look like in a browser? It must be said that my messing about above won't produce anything like the desired effect, which for line 12 is something like:
[- ca.16 - Π]ε̣ρὶ Θή[βας καὶ Ἑ]ρ̣μωνθ[ίτ -ca.?- ] 
I could certainly make it so, probably with a combination of CSS and JavaScript, but what have I gained by doing so? I'll have traded one paradigm, XML + XSLT, for another, HTML + CSS + JavaScript. I'll have lost the ability to validate my markup, though I'll still be able to transform it to other formats. I should be able to round-trip it to TEI and back, so perhaps I could solve the validation problem that way. But is anything about this better than TEI XML? I don't think so…
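
For what it's worth, the TEI-ward leg of that round trip looks tractable, given the class conventions above. The core of it might be a template like this (an untested sketch that ignores whitespace and other niceties):

<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:html="http://www.w3.org/1999/xhtml"
    xmlns="http://www.tei-c.org/ns/1.0">
  <!-- a span.tei-* becomes a TEI element of the same name -->
  <xsl:template match="html:span[starts-with(@class, 'tei-')]">
    <xsl:element name="{substring-after(@class, 'tei-')}">
      <!-- each teig-attribute span becomes an attribute -->
      <xsl:for-each select="html:span[contains(@class, 'teig-attribute')]">
        <xsl:attribute name="{substring-after(@class, 'teia-')}">
          <xsl:value-of select="."/>
        </xsl:attribute>
      </xsl:for-each>
      <!-- teig-content spans contribute their content -->
      <xsl:apply-templates select="html:span[contains(@class, 'teig-content')]/node()"/>
    </xsl:element>
  </xsl:template>
</xsl:stylesheet>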

I suspect I'm missing the point here, and that what the proponents of TEI in HTML are really after is a radically curtailed (or re-thought) version of TEI that does map more comfortably to HTML. The somewhat Baroque complexity of TEI leads the casual observer to wish for something simpler immediately, and can provoke occasional dismay even in experienced users. I certainly sympathize with the wish for a simpler architecture, but text modeling is a complex problem, and simple solutions to complex problems are hard to engineer.

Tuesday, June 28, 2011

Humanities Data Curation

Last Thursday, I attended the excellent Humanities Data Curation Summit, organized by Allen Renear, Trevor Muñoz, Katherine L. Walter, and Julia Flanders. I'm still processing the day, which included a breakout session with Allen, Elli Mylonas, and Michael Sperberg-McQueen, who are some of my favorite people in DH.

What I started thinking about today was that we'd skipped definitions at the beginning—there was a joke that Allen, as a philosopher, could have spent all day on that task. But in doing so, we elided the question of what data is in the humanities, and of how it differs from data in the sciences or social sciences.

Humanities data are not usually static, collected data like instrument readings or survey results. They are things like marked up texts, uncorrected OCR, images in need of annotation, etc. Humanities datasets can almost always be improved upon. "Curation" for them is not simply preservation, access, and forward migration. It means enabling interested communities to work with the data and make it better. Community interaction needs to be factored into the data's curation lifecycle.

I feel a blog post coming on about how the Integrating Digital Papyrology / papyri.info project does this...

Thursday, January 20, 2011

Interfaces and Models

In my last post, I argued that TEI is a text modelling language, and in the prior post, I discussed a frequently-expressed request for TEI editors that hide the tags. Here, I'm going to assert that your editing interface (implicitly) expresses a model too, and because it does, generic, tag-hiding editors are a losing proposition.

Everything to do with human-computer interfaces uses models, abstractions, and metaphors. Your computer "desktop" is a metaphor that treats the primary, default interface like the surface of a desk, where you can leave stuff lying around that you want to have close at hand. "Folders" are like physical file folders. Word processors make it look like you're editing a printed page; HTML editors can make it look as though you're directly editing the page as it appears in a browser. These metaphors work by projecting an image that looks like something you (probably) already have a mental model of. The underlying model used by the program or operating system is something else again. Folders don't actually represent any physical containment on the system's local storage, for example. The WYSIWYG text you edit might be a stream of text and formatting instructions, or a Document Object Model (DOM) consisting of Nodes that model HTML elements and text.

If you're lucky, there isn't a big mismatch between your mental model and the computer's. But sometimes there is: we've all seen weirdly mis-formatted documents, where instead of using a header style for header text, the writer just made it bold, with a bigger font, and maybe put a couple of newlines after it. Maybe you've done this yourself, when you couldn't figure out the "right" way to do it. This kind of thing only bites you, after all, when you want to do something like change the font for all headers in a document.

And how do we cope if there's a mismatch between the human interface and the underlying model? If the interface is much simpler than the model, then you will only be able to create simple instances with it; you won't be able to use the model to its full capabilities. We see this with word processor-to-TEI converters, for example. The word processor can do structural markup, like headers and paragraphs, but it can't so easily do more complex markup. You could, in theory, have a tagless TEI editor capable of expressing the full range of TEI, but it would have to be as complex as the TEI is. You could hide the angle brackets, but you'd have to replace them with something else.

Because TEI is a language for producing models of texts, it is probably impossible to build a generic tagless TEI editor. In order for the metaphor to work, there must be a mapping from each TEI structure to a visual feature in the editor. But in TEI, there are always multiple ways of expressing the same information. The one you choose is dictated by your goals, by what you want to model, and by what you'll want the model to do. There's nothing to map to on the TEI side until you've chosen your model. Thus, while it's perfectly possible (and is useful,* and has been done, repeatedly) to come up with a "tagless" interface that works well for a particular model of text, I will assert that developing a generic TEI editor that hides the markup would be a hard task.

This doesn't mean you couldn't build a tool to generate model-specific TEI editors, or build a highly-customizable tagless editor. But the customization will be a fairly hefty intellectual effort. And there's a potential disadvantage here too: creating such a customization implies that you know exactly how you want your model to work, and at the start of a project, you probably don't. You might find, for example, that for 1% of your texts, your initial assumptions about your text model are completely inadequate, and so it has to be refined to account for them. This sort of thing happens all the time.

My advice is to think hard before deciding to "protect" people from the markup. Text modeling is a skill that any scholar of literature could stand to learn.

UPDATE: a comment on another site by Joe Wicentowski makes me think I wasn't completely clear above. There's NOTHING wrong with building "padded cell" editors that allow users to make only limited changes to data. But you need to be clear about what you want to accomplish with them before you implement one.

*Michael C. M. Sperberg-McQueen has a nice bit on "padded cell editors" at http://www.blackmesatech.com/view/?p=11

Tuesday, January 11, 2011

TEI is a text modelling language

I'm teaching a TEI class this weekend, so I've been pondering it a bit. I've come to the conclusion that calling what we do with TEI "text encoding" is misleading. I think what we're really doing is text modeling.

TEI provides an XML vocabulary that lets you produce models of texts that can be used for a variety of purposes. Not a Model of Text, mind you, but models (lowercase) of texts (also lowercase).

TEI has made the (interesting, significant) decision to piggyback its semantics on the structure of XML, which is tree-based. So XML structure implies semantics for a lot of TEI. For example, paragraph text appears inside <p> tags; to mark a personal name, I surround the name with a <persName> tag, and so on. This arrangement is extremely convenient for processing purposes: it is trivial to transform the TEI <p> into an HTML <p>*, for example, or the <persName> into an HTML hyperlink, which points to more information about the person. It means, however, that TEI's modeling capabilities are to a large extent XML's own. This approach has opened TEI up to criticism. Buzzetti (2002) has argued that its tree structure simply isn't expressive enough to represent the complexities of text, and Schmidt (2010) criticizes TEI for (among other problems) being a bad model of text, because it imposes editorial interpretation on the text itself.
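
(Trivial in the sense that the heart of such a transformation is a template like this, assuming the usual binding of the tei prefix:)

<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:tei="http://www.tei-c.org/ns/1.0">
  <!-- a TEI paragraph becomes an HTML paragraph -->
  <xsl:template match="tei:p">
    <p><xsl:apply-templates/></p>
  </xsl:template>
</xsl:stylesheet>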

The main disagreement I have with Schmidt's argument is the assumption that there is a text independent of the editorial apparatus. Maybe there is sometimes, but I can point at many examples where there is no text, as such, only readings. And a reading is, must be, an interpretive exercise. So I'd argue that TEI is at least honest in that it puts the editorial interventions front and center where they are obvious.

As for the argument that TEI's structure is inadequate to model certain aspects of text, I can only agree. But TEI has proved good enough to do a lot of serious scholarly work. That, and the fact that its choice of structure means it can bring powerful XML tools to bear on the problems it confronts, means that TEI represents a "worse is better" solution. It works a lot of the time, doesn't claim to be perfect, and incrementally improves. Where TEI isn't adequate to model a text in the way you want to use it, then you either shouldn't use it, or should figure out how to extend it.

One should bear in mind that any digital representation of a text is ipso facto a model. It's impossible to do anything digital without a model (whether you realize it's there or not). Even if you're just transcribing text from a printed page to a text editor, you're making editorial decisions, like what character encoding to use, how to represent typographic features in that encoding, how to represent whitespace, and what to do with things you can't easily type (inline figures or symbols without a Unicode representation, for example).

So why argue that TEI is a language for modeling texts, rather than a language for "encoding" texts? The simple answer is that this is a better way of explaining what people use TEI for. TEI provides a lot of tags to choose from. No-one uses them all. Some are arguably incompatible with one another. We tag the things in a text that we care about and want to use. In other words, we build models of the source text, models that reflect what we think is going on structurally, semantically, or linguistically in the text, and/or models that we hope to exploit in some way.

For example, EpiDoc is designed to produce critical editions of inscribed or handwritten ancient texts. It is concerned with producing an edition (a reading) of the source text that records the editor's observations of and ideas about that text. It does not at this point concern itself with marking personal or geographic names in the text. An EpiDoc document is a particular model of the text that focuses on the editor's reading of that text. As a counterexample, I might want to use TEI to produce a graph of the interactions of characters in Hamlet. If I wanted to do that, I would produce a TEI document that marked people and whom they were addressing when they spoke, as in the sketch below. This would be a completely different model of the text from the one a critical edition of Hamlet embodies. I could even try to do both at the same time, but that might be a mess—models are easier to deal with when they focus on one thing.
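
The markup for the Hamlet exercise might look something like this. Note that @who is standard TEI drama markup, while @toWhom is precisely the sort of attribute I'd have to add via a customization (it's hypothetical here, in other words):

<sp who="#Hamlet" toWhom="#Horatio">
  <speaker>Hamlet</speaker>
  <l>There are more things in heaven and earth, Horatio,</l>
  <l>Than are dreamt of in your philosophy.</l>
</sp>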

This way of understanding TEI makes clear a problem that arises whenever one tries to merge collections of TEI documents: that of compatibility. Just because two documents are marked up in TEI, that does not mean they are interoperable. This is because each document represents the editor's model of that text. Compatibility is certainly achievable if both documents follow the same set of conventions, but we shouldn't expect it any more than we'd expect to be able to merge any two models that follow different ground rules.

Notes
* with the caveat that the semantics of TEI <p> and HTML <p> are different, and there may be problems. TEI's <p> can contain lists, for example, whereas HTML's cannot.


Yes, I wrote a blog post with endnotes and bibliography. Sue me.
  1. Buzzetti, D. "Digital Representation and the Text Model." New Literary History 33.1 (2002): 61-88.
  2. Schmidt, D. "The Inadequacy of Embedded Markup for Cultural Heritage Texts." Literary and Linguistic Computing 25.3 (2010): 337-356.

Thursday, January 06, 2011

I Will Never NOT EVER Type an Angle Bracket (or IWNNETAAB for short)

From time to time, I hear an argument that goes something like this: "Our users won't deal with angle brackets, therefore we can't use TEI, or if we do, it has to be hidden from them." It's an assumption I've encountered again quite recently. Since it's such a common trope, I wonder how true it is. Of course, I can't speak for anyone's user communities other than the ones I serve. And mine are perhaps not the usual run of scholars. But they haven't tended to throw their hands up in horror at the sight of angle brackets. Indeed, some of them have become quite expert at editing documents in TEI.

The problems with TEI (and XML in general) are manifold, but they often center on its not being expressive *enough* to deal easily with certain classes of problem. And the TEI evolves. You can get involved and change it for the better.

The IWNNETAAB objection seems grounded in fear. But fear of what? As I mentioned at the start, IWNNETAAB isn't usually an expression of personal revulsion; it's not mere Luddism; it's IWNNETAAB by proxy: my users/clients/stakeholders won't stand for it. Or they'll mess it up. TEI is hard. It has *hundreds* of elements. How can they/why should they learn something so complex just to be able to digitize texts?! What we want to do is simple, can't we have something simple that produces TEI in the end?

The problem with simplified editing interfaces is easy to understand: they are simple. Complexities have been removed, and along with them, the ability to express complex things. To put it another way, if you aren't dealing with the tags, you're dealing with something in which a bunch of decisions have already been made for you. My argument in the recent discussion was that in fact, these decisions tend to be extremely project-specific. You can't set it up once and expect it to work again in different circumstances; you (or someone) will have to do it over and over again. So, for a single project, the cost/benefit equation may look like it leans toward the "simpler" option. But taken over many projects, you're looking either at learning a reasonably complex thing or at building a long series of tools that each produce a different version of that thing. Seen in this light, I think learning TEI makes a lot of sense. On the learning-TEI side, the costs go down over time; on the GUI-interface side, they keep going up.

Moreover, knowing TEI means that you (or your stakeholders) aren't shackled to an interface that imposes decisions made before you ever looked at the text you're encoding; instead, you are actually engaging with the text, in the form in which it will be used. You're seeing behind the curtain. I can't really fathom why that would be a bad thing.

(Inspiration for the title comes from a book my 2-year-old is very fond of)