Thursday, December 31, 2009


In between shortening my lifespan by doing a crazy yardwork project this week, I've been following with interest the tweets from #MLA09. A few items of interest: Digital Humanities has become an overnight success (only decades in the making), the job market (still) reeks, and there are serious inequities in the status of non-faculty collaborators in DH projects. None of this is new, of course, but it's good to see it so well stated in a highly visible venue.

I'm more than ever convinced that, despite the occasional feelings of regret, I made the right decision to stop seeking faculty employment after I got my Ph.D. DH was not then, and perhaps still isn't now, a hot topic in Classics. It is odd, because some of the most innovative DH work comes out of Classics, but, as I've said on a number of occasions, DH pickup in the field is concentrated in a few folks who are 20 years ahead of everyone else. It's interesting to speculate why this may be so. Classics is hard: you have to master at least a couple of ancient languages (Latin and Greek, at minimum), plus a couple of modern ones (French and German are the most likely suspects, but maybe Italian, Spanish, Modern Greek, etc. also, depending on your specialization), then a body of literature, history, and art before you can do serious work. Ph.D.s from other disciplines sometimes quail when I describe the comps we had to go through (two 3-hour translation exams, two 4-hour written exams, and an oral, and that's before you got to do your proposal defense). It may be that there's no room for anything else in this mix, and it's something you have to add later on. Virtually all the "digital classicists" I know are either tenured or are not faculty (and aren't going to be, at least not in Classics). It's all a bit grim really. A decade ago, if you were a grad student in Classics with an interest in DH, you were doomed unless you were willing to suppress that interest until you had tenure. I don't know whether that's changed at all. I hope it has.

The good news, of course, is that digital skills are highly portable (and better paid). The one on-campus interview I had (for which I wasn't offered the job) would have paid several thousand less (for a tenure-track job!) than the (academic!) programming job I ended up taking. And as fate would have it, I ended up doing digital classics anyway, at least until the grant money runs out.

So I wonder what the Twitter traffic from APA10 will be like next week. Maybe DH will be the next big thing there too, but a scan of the program doesn't leave me optimistic.

Wednesday, December 16, 2009

Converting APIS

On Monday, I finished converting the APIS (Advanced Papyrological Information System) intake files to EpiDoc XML. I thought I'd write it up, since I tried some new things to do it. The APIS intake files employ a MARC-inspired text format that looks like:

cu001 | 1 | duke.apis.31254916
cu035 | 1 | (NcD)31254916
cu965 | 1 | APIS

status | 1 | 1
cu300 | 1 | 1 item : papyrus, two joining fragments mounted in
glass, incomplete ; 19 x 8 cm
cuDateSchema | 1 | b
cuDateType | 1 | o
cuDateRange | 1 | b
cuDateValue | 1 | 199
cuDateRange | 2 | e
cuDateSchema | 2 | b
cuDateType | 2 | o
cuDateValue | 2 | 100
cuLCODE | 1 | egy
cu090 | 1 | P.Duk.inv. 723 R
cu500 | 1 | Actual dimensions of item are 18.5 x 7.7 cm
cu500 | 2 | 12 lines
cu500 | 3 | Written along the fibers on the recto; written
across the fibers on the verso in a different hand and
inverse to the text on the recto
cu500 | 4 | P.Duk.inv. 723 R was formerly P.Duk.inv. MF79 69 R
cu510_m | 5 |
cu520 | 6 | Papyrus account of wheat from the Arsinoites (modern
name: Fayyum), Egypt. Mentions the bank of Pakrouris(?)
cu546 | 7 | In Demotic
cu655 | 1 | Documentary papyri Egypt Fayyum 332-30 B.C
cu655 | 2 | Accounts Egypt Fayyum 332-30 B.C
cu655 | 3 | Papyri

cu653 | 1 | Accounting -- Egypt -- Fayyum -- 332-30 B.C.
cu653 | 2 | Banks and banking -- Egypt -- Fayyum -- 332-30 B.C.
cu653 | 3 | Wheat -- Egypt -- Fayyum -- 332-30 B.C.
cu245ab | 1 | Account of wheat [2nd cent. B.C.]
cuPart_no | 1 | 1
cuPart_caption | 1 | Recto
cuPresentation_no | 1 | 1 | 1
cuPresentation_display_res | 1 | 1 | thumbnail
cuPresentation_url | 1 | 1 |
cuPresentation_format | 1 | 1 | image/gif
cuPresentation_no | 1 | 2 | 2
cuPresentation_display_res | 1 | 2 | 72dpi
cuPresentation_url | 1 | 2 |
cuPresentation_format | 1 | 2 | image/gif
cuPresentation_no | 1 | 3 | 3
cuPresentation_display_res | 1 | 3 | 150dpi
cuPresentation_url | 1 | 3 |
cuPresentation_format | 1 | 3 | image/gif
perm_group | 1 | w

cu090_orgcode | 1 | NcD
cuOrgcode | 1 | NcD

Some of the element names come from, and have the semantics of, MARC, while others don't. Fields are delimited with pipe characters '|' and have sometimes 3 columns, sometimes 4. The second column is meant to express order, e.g. cu500 (general note) 1, 2, 3, and 4. If there are 4 columns, the third is used to link related fields, e.g. an image with its label. The last column is the field data, which can wrap to multiple lines. This has to be converted to EpiDoc like:

<?xml version="1.0" encoding="UTF-8"?>
<TEI xmlns="">
<title>Account of wheat [2nd cent. B.C.]</title>
<idno type="apisid">duke.apis.31254916</idno>
<idno type="controlno">(NcD)31254916</idno>
<idno type="invno">P.Duk.inv. 723 R</idno>
<summary>Papyrus account of wheat from the Arsinoites (modern name: Fayyum), Egypt.
Mentions the bank of Pakrouris(?)</summary>
<note type="general">Actual dimensions of item are 18.5 x 7.7 cm</note>
<note type="general">12 lines</note>
<note type="general">Written along the fibers on the recto; written across the fibers on
the verso in a different hand and inverse to the text on the recto</note>
<note type="general">P.Duk.inv. 723 R was formerly P.Duk.inv. MF79 69 R</note>
<textLang mainLang="egy">In Demotic</textLang>
<p>1 item : papyrus, two joining fragments mounted in glass, incomplete ; 19 x 8 cm</p>
<origDate notBefore="-0199" notAfter="-0100"/>
<language ident="en">English</language>
<language ident="egy-Egyd">In Demotic</language>
<keywords scheme="#apis">
<term>Accounting -- Egypt -- Fayyum -- 332-30 B.C.</term>
<term>Banks and banking -- Egypt -- Fayyum -- 332-30 B.C.</term>
<term>Wheat -- Egypt -- Fayyum -- 332-30 B.C.</term>
<rs type="genre_form">Documentary papyri Egypt Fayyum 332-30 B.C</rs>
<rs type="genre_form">Accounts Egypt Fayyum 332-30 B.C</rs>
<rs type="genre_form">Papyri</rs>
<div type="bibliography" subtype="citations">
<ref target="">Original record</ref>.</p>
<div type="figure">
<figDesc> thumbnail</figDesc>
<graphic url=""/>
<figDesc> 72dpi</figDesc>
<graphic url=""/>
<figDesc> 150dpi</figDesc>
<graphic url=""/>

I started learning Clojure this summer. Clojure is a Lisp implementation on top of the Java Virtual Machine. So I thought I'd have a go at writing an APIS converter in it. The result is probably thoroughly un-idiomatic Clojure, but it converts the 30,000-plus APIS records to EpiDoc in about 2.5 minutes, so I'm fairly happy with it as a baby step. The script works by reading the intake file line by line and issuing SAX events that are handled by a Saxon XSLT TransformerHandler, which in turn converts to EpiDoc. So in effect, the intake file is treated as though it were an XML file and transformed with a stylesheet.
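To make the "treat it as XML" idea concrete, here is a minimal sketch of pushing SAX events into a TransformerHandler. It is not the converter's code: it assumes the JDK's built-in identity handler in place of Saxon and a stylesheet, so the events are simply serialized as XML, and the names emit-field and fields->xml are made up for illustration.

```clojure
(import '(javax.xml.transform TransformerFactory)
        '(javax.xml.transform.sax SAXTransformerFactory TransformerHandler)
        '(javax.xml.transform.stream StreamResult)
        '(org.xml.sax.helpers AttributesImpl)
        '(java.io StringWriter))

(defn emit-field
  "Sends one field to the handler as start-element, characters, end-element.
  (Hypothetical helper, not part of the converter.)"
  [^TransformerHandler handler field-name n content]
  (let [atts (doto (AttributesImpl.)
               (.addAttribute "" "n" "n" "CDATA" n))
        cs (.toCharArray ^String content)]
    (.startElement handler "" field-name field-name atts)
    (.characters handler cs 0 (alength cs))
    (.endElement handler "" field-name field-name)))

(defn fields->xml
  "Wraps a sequence of [field-name n content] triples in an apis element and
  returns the serialized XML. Assumption: an identity handler stands in for
  the Saxon handler that would apply the stylesheet."
  [fields]
  (let [out (StringWriter.)
        ^SAXTransformerFactory factory (TransformerFactory/newInstance)
        ^TransformerHandler handler (.newTransformerHandler factory)]
    (.setResult handler (StreamResult. out))
    (.startDocument handler)
    (.startElement handler "" "apis" "apis" (AttributesImpl.))
    (doseq [[field-name n content] fields]
      (emit-field handler field-name n content))
    (.endElement handler "" "apis" "apis")
    (.endDocument handler)
    (str out)))

(println (fields->xml [["cu500" "2" "12 lines"]]))
```

Swapping the identity handler for one built from a compiled stylesheet is what turns this from serialization into transformation; the events being fed in stay the same.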

Most of the processing is done with three functions:

generate-xml takes a File, instantiates a transforming SAX handler using a stylesheet borrowed from a pool of Templates objects, starts emitting SAX events, and then hands off to the process-file function.

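(defn generate-xml
  [file-var]
  (let [xslt (.poll @templates) ; borrow a compiled stylesheet from the pool
        handler (.newTransformerHandler (TransformerFactoryImpl.) xslt)]
    (try
      (doto handler
        ;; write to the matching path under xml/, with .if replaced by .xml
        (.setResult (StreamResult. (File. (.replace
          (.replace (str file-var) "intake_files" "xml") ".if" ".xml"))))
        (.startDocument) ; the SAX contract brackets all events with start/endDocument
        (.startElement "" "apis" "apis" (AttributesImpl.)))
      (process-file (read-file file-var) "" handler)
      (doto handler
        (.endElement "" "apis" "apis")
        (.endDocument))
      (catch Exception e
        (.println *err* (str (.getMessage e) " processing file " file-var))))
    (.add @templates xslt))) ; return the stylesheet to the pool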

process-file recursively processes a sequence of lines from the file. If lines is empty, we're at the end of the file, so we can end the last element and exit. Otherwise, it splits the current line on pipe characters, calls handle-line, then calls itself on the remainder of the line sequence.

(defn process-file
  [lines, elt-name, handler]
  (cond
    ;; no lines left: close the last open element and stop
    (empty? lines)
    (.endElement handler "" elt-name elt-name)
    ;; comments start with '#'; skip them and carry on with the rest
    (.startsWith (first lines) "#")
    (recur (rest lines) elt-name handler)
    :else
    (let [line (.split (first lines) "\\s+\\|\\s+")
          ;; a line containing a pipe starts a new field; otherwise it continues the current one
          ename (if (.contains (first lines) "|") (aget line 0) elt-name)]
      (handle-line line elt-name handler)
      (recur (rest lines) ename handler))))

handle-line does most of the XML-producing work. The field name is emitted as an element, column 2 (and column 3, if it's a 4-column field) is emitted as the @n (and @m) attribute, and the last column is emitted as character content. If the line is a continuation of the preceding line, then it will be emitted as character data.

(defn handle-line
  [line, elt-name, handler]
  (if (> (alength line) 2) ; lines with fewer than 3 columns are either continuations or empty fields
    (do
      (let [atts (AttributesImpl.)]
        (.addAttribute atts "" "n" "n" "CDATA" (.trim (aget line 1)))
        (when (> (alength line) 3) ; 4-column field: the third column links related fields
          (.addAttribute atts "" "m" "m" "CDATA" (.trim (aget line 2))))
        (when-not (= elt-name "") ; close the previous field's element, if there was one
          (.endElement handler "" elt-name elt-name))
        (.startElement handler "" (aget line 0) (aget line 0) atts))
      (let [content (.trim (aget line (dec (alength line))))]
        (.characters handler (.toCharArray content) 0 (.length content))))
    (when (== (alength line) 1) ; continuation of the previous field's data
      (.characters handler (.toCharArray (aget line 0)) 0 (.length (aget line 0))))))
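The column-counting rules are easier to see without the SAX plumbing. Here is a small sketch that applies the same split and the same column tests but returns plain maps instead of emitting events. The names split-fields and parse-line and the map keys are my own inventions for illustration, not part of the converter.

```clojure
(defn split-fields
  "Splits an intake line on pipes surrounded by whitespace, using the same
  regular expression as process-file."
  [^String line]
  (vec (.split line "\\s+\\|\\s+")))

(defn parse-line
  "Classifies a line the way handle-line does: one column is a continuation,
  three or more is a field (with a linking column m when there are four),
  and anything else is an empty field to be ignored."
  [line]
  (let [cols (split-fields line)
        n (count cols)]
    (cond
      (= n 1) {:kind :continuation :text (cols 0)}
      (> n 3) {:kind :field :name (cols 0) :n (cols 1) :m (cols 2)
               :value (.trim ^String (peek cols))}
      (> n 2) {:kind :field :name (cols 0) :n (cols 1)
               :value (.trim ^String (peek cols))}
      :else   {:kind :ignored})))

(parse-line "cu500 | 2 | 12 lines")
;; => {:kind :field, :name "cu500", :n "2", :value "12 lines"}
(parse-line "cuPresentation_format | 1 | 3 | image/gif")
;; => {:kind :field, :name "cuPresentation_format", :n "1", :m "3", :value "image/gif"}
(parse-line "glass, incomplete ; 19 x 8 cm")
;; => {:kind :continuation, :text "glass, incomplete ; 19 x 8 cm"}
```

Note that a field with nothing after its last pipe, such as "cu510_m | 5 |", splits into only two columns, which is why two-column lines fall through as empty fields.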

The -main function kicks everything off by calling init-templates to load up a ConcurrentLinkedQueue with new Templates objects, each capable of generating an XSLT handler, and then starting a thread pool and mapping the generate-xml function over the sequence of files with the ".if" suffix. -main takes 3 arguments: the directory to look for intake files in, the XSLT to use for transformation, and the number of worker threads to use. I've been kicking it off with 20 threads. Speed depends on how much work my machine (a 3 GHz Intel Core 2 Duo MacBook Pro) is doing at the moment, but it is quite zippy.

(defn init-templates
  [xslt, nthreads]
  (dosync (ref-set templates (ConcurrentLinkedQueue.)))
  (dotimes [n nthreads]
    (let [xsl-src (StreamSource. (FileInputStream. xslt))
          configuration (Configuration.)
          compiler-info (CompilerInfo.)]
      (doto xsl-src
        (.setSystemId xslt))
      (doto compiler-info
        (.setErrorListener (StandardErrorListener.))
        (.setURIResolver (StandardURIResolver. configuration)))
      ;; compile the stylesheet once per worker thread and add it to the pool
      (dosync (.add @templates (.newTemplates (TransformerFactoryImpl.) xsl-src compiler-info))))))

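(defn -main
  [dir-name, xsl, nthreads]
  (let [;; command-line arguments arrive as strings, so coerce the thread count
        nthreads (if (string? nthreads) (Integer/parseInt nthreads) nthreads)
        files (file-seq (File. dir-name))]
    (init-templates xsl nthreads)
    (let [pool (Executors/newFixedThreadPool nthreads)
          ;; one task per intake file; Clojure functions are Callables
          tasks (map (fn [x]
                       (fn []
                         (generate-xml x)))
                     (filter #(.endsWith (.getName %) ".if") files))]
      ;; invokeAll blocks until every task has finished
      (doseq [result (.invokeAll pool tasks)]
        (.get result))
      (.shutdown pool))))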

I had some conceptual difficulties figuring out how best to associate Templates objects with the threads that execute them. The easy thing to do would be to put the Templates creation in the function that is mapped to the file sequence, but that bogs down fairly quickly, presumably because a new Templates object is being created for each file and memory usage balloons. So that doesn't work. In Java, I'd either a) write a custom thread that spun up its own Templates object or b) create a pool of them. After some messing around, I went with b) because I couldn't see how to do such an object-oriented thing in a functional way. b) was a bit hard too, because I couldn't see how to store Templates objects in a Clojure collection, access them, and use them without wrapping the whole process in a transaction, which seems like it would lock the collection much too much. So I used a threadsafe Java collection, ConcurrentLinkedQueue, which manages concurrent access to its members on its own.
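Stripped of the XSLT specifics, the pooling pattern described above looks something like the following sketch. The names make-pool, with-pooled, and run-all are hypothetical, and the pooled resource can be any value; nothing here touches Saxon. It assumes there are at least as many pooled resources as worker threads, so a borrow never finds the queue empty.

```clojure
(import '(java.util.concurrent ConcurrentLinkedQueue Executors ExecutorService Future))

(defn make-pool
  "Fills a thread-safe queue with n resources built by (make-resource i)."
  [n make-resource]
  (let [queue (ConcurrentLinkedQueue.)]
    (dotimes [i n]
      (.add queue (make-resource i)))
    queue))

(defn with-pooled
  "Borrows a resource, calls (f resource), and always hands the resource back.
  Assumes the queue is never empty when a worker asks for a resource."
  [^ConcurrentLinkedQueue queue f]
  (let [resource (.poll queue)]
    (try
      (f resource)
      (finally
        (.add queue resource)))))

(defn run-all
  "Runs (work resource item) for every item on nthreads threads and returns
  the results in item order."
  [nthreads make-resource work items]
  (let [queue (make-pool nthreads make-resource)
        ^ExecutorService executor (Executors/newFixedThreadPool nthreads)
        tasks (mapv (fn [item]
                      (fn [] (with-pooled queue (fn [resource] (work resource item)))))
                    items)]
    (try
      ;; invokeAll returns futures in task order and waits for completion
      (mapv (fn [^Future f] (.get f)) (.invokeAll executor tasks))
      (finally
        (.shutdown executor)))))

(run-all 2 identity (fn [_ x] (* 2 x)) [1 2 3])
;; => [2 4 6]
```

Putting the hand-back in a finally clause means a failed task can't leak a resource out of the pool, which matters when the pool is exactly as big as the thread count.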

I've no doubt there are better ways to do this, and I expect I'll learn them in time, but for now, I'm quite pleased with my first effort. Next step will probably be to add some Schematron validation for the APIS files. My impression of Clojure is that it's really powerful, and a good way to write concurrent programs. To do it really well, I think you'd need a fairly deep knowledge of both Lisp-style functional programming and the underlying Java/JVM aspects, but that seems doable.

Tuesday, October 27, 2009

Object Artefact Script

A couple of weeks ago, I attended a workshop at the Edinburgh eScience Institute on the relation of text in ancient (and other) documents to its context, on the problems of reading difficult texts on difficult objects, and on ways in which technology can aid the process of interpretation and dissemination without getting in the way of it. The meeting was well summarized by Alejandro Giacometti in his blog, and the presentations are posted on the eSI wiki.

Kathryn Piquette discussed what would be required to digitally represent Egyptian hieroglyphic texts without divorcing them from their contexts as an integral part of monumental architecture. For example, the interpretation of the meaning of texts should be able to take into account the times of day (and/or year) when they would have been able to be read, their relationship to their surroundings, and so on. The established epigraphical practice of divorcing the transcribed text from its context, while often necessary, does some violence to its meaning, and this must be recognized and accounted for. At the same time, digital 3D reconstructions are themselves an interpretation, and it is important to disclose the evidence on which that interpretation is based.

Ségolène Tarte talked about the process of scholarly interpretation in reading the Vindolanda tablets and similar texts. As part of analysing the scholarly reading process, the eSAD project observed two experts reading a previously published tablet. During the course of their work, they came up with a new reading that completely changed their understanding of the text. The previous reading hinged on the identification of a single word, which led to the (mistaken) recognition of the document as recording the sale of an ox. The new reading hinged on the recognition of a particular letterform as an 'a'. The ways in which readings of difficult texts are produced (skipping around looking for recognizable pieces of text, on which multiple partial mental models of the text are constructed and must then somehow be resolved into a reading) mean that an Interpretation Support System (such as the one eSAD proposes to develop) must be sensitive to the different ways of reading scholars use and must be careful not to impose "spurious exactitude" on them.

Dot Porter gave an overview of a variety of projects that focus on representing text, transcription, and annotation alongside one another as a way into discussing the relationship between digital text and physical text. She cautioned against attempts to digitally replicate the experience of the codex, since there is a great deal of (necessary) data interpolation that goes on in any detailed digital reconstruction, and this elides the physical reality of the text. Digital representations may improve (or even make possible) the reading of difficult texts, such as the Vindolanda tablets or the Archimedes Palimpsest, so for purposes of interpretation, they may be superior to the physical reality. They can combine data, metadata, and other contextual information in ways that help a reader to work with documents. But they cannot satisfactorily replicate the physicality of the document, and it may be a bit dishonest to try.

I talked about the img2xml project I'm working on with colleagues from UNC Chapel Hill. I've got a post or two about that in the pipeline, so I won't say much here. It involves the generation of SVG tracings of text in manuscript documents as a foundation for linking and annotation. Since the technique involves linking to an XML-based representation of the text, it may prove superior to methods that rely simply on pointing at pixel coordinates in images of text.

Ryan Bauman talked about the use of digital images as scholarly evidence. He gave a fascinating overview of sophisticated techniques for imaging very difficult documents (e.g. carbonized, rolled-up scrolls from Herculaneum) and talked about the need for documentation of the techniques used in generating the images. This is especially important because the images produced will not resemble the way the document looks in visible light. Ryan also talked about the difficulties involved in linking views of the document that may have been produced at different times, when the document was in different states, or may have used different techniques. The Archimedes Palimpsest project is a good example of what's involved in referencing all of the images so that they can be linked to the transcription.

Finally, Leif Isaksen talked about how some of the techniques discussed in the earlier presentations might be used in crowdsourcing the gathering of data about inscriptions. Inscriptions (both published and unpublished) are frequently encountered (both in museums and out in the open) by tourists who may be curious about their meaning, but lack the ability to interpret them. They may well, however, have sophisticated tools available for image capture, geo-referencing, and internet access (via digital cameras, smartphones, etc.). Can they be employed, in exchange for information about the texts they encounter, as data gatherers?

Some themes that emerged from the discussion included:

  • the importance of communicating the processes involved in generating digital representations of texts and their contexts (i.e. showing your work)

  • the need for standard ways of linking together image and textual data

  • the importance of disseminating data and code, not just results

This was a terrific workshop, and I hope to see follow-up on it. eSAD is holding a workshop next month on "Understanding image-based evidence," which I'm sorry I can't attend, and I look forward to seeing its output.

Friday, October 16, 2009

Stomping on Innovation Killers

@foundhistory has a nice post on objections one might hear on a grant review panel that would unjustly torpedo an innovative proposal. I thought it might be a good idea to take a sideways look at these as advice to grant writers.

  • Haven’t X, Y, and Z already done this? We shouldn’t be supporting duplication of effort.

  • Are all of the stakeholders on board? (Hat tip to @patrickgmj for this gem.)

  • What about sustainability?

So, some ideas for countering these when you're working on your proposal:

  1. Have you looked at work that's been done in this area (this might entail some real digging)? If there are projects and/or literature that deal with the same areas as your proposal, then you should take them into account. You need to be able to show you've done your homework and that your project is different from what's come before.

  2. Who is your audience? Have you talked to them? If you can get letters of support from one or more of them, that will help silence the stakeholders objection.

  3. You ought to have some sort of story about sustainability and/or the future beyond the project, to show that you've thought about what comes next. Even if your project is an experiment, you should talk about how you're going to disseminate the results so that those who come after will be able to build on your work.

I agree with Tom that these criticisms can be deployed to stifle creative work. In technology, sometimes wheels need to be reinvented, sometimes the conventional wisdom is flat wrong, and sometimes worrying overmuch about the future paralyses you. But if you're writing a proposal, assume these objections will be thrown at it, and do some prior thinking so you can spike them before they kill your innovative idea.

Monday, August 10, 2009

Upgrade Notes

During my recent work on moving the Papyrological Navigator from Columbia to NYU, I ran into some issues that bear noting. It's a bit hard to know whether these are generalizable, but they seem to me to be good examples of the kinds of things that can happen when you're upgrading a complex system, and I don't want to forget about them.

Issue #1

Search results in the PN are supposed to return with KWIC snippets, highlighting the search terms. As part of the move, I upgraded Lucene to the latest release (2.4.1). The Lucene in the PN was 2.3.x, but the developer at Columbia had worked hard to eke as much indexing speed out of it as possible, and had imported code from the 2.4 branch, with some modifications. Since this code was really close to 2.4, I'd had reason to hope the upgrade would be smooth, and it mostly was. Highlighting wasn't working for Greek though, even though the search itself was...

Debugging this was really hard, because as it turned out, there was no failure in any of the running code. It just wasn't running the right code. A couple of the slightly modified Lucene classes in the PN codebase were being stepped on by the new Lucene because, instead of a jar named "ddbdp.jar", the new PN jars were named after the project in which they resided (so, "pn-ddbdp-indexers.jar"), and they were getting loaded after Lucene instead of before. Not the first time I'd seen this kind of problem, but always a bit baffling. In the end I moved the PN Lucene classes out of the way by changing their names and how they were called.
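For what it's worth, the JVM can be asked directly where a class came from, which is a quicker way to spot this kind of shadowing than reading code. The following is a generic sketch, not something the PN does: class-origin and class-copies are hypothetical helper names, and it uses only standard java.lang and java.security calls.

```clojure
(defn class-origin
  "The jar or directory a class was actually loaded from, as a string.
  Returns nil for classes that come from the JDK itself."
  [^Class c]
  (some-> c .getProtectionDomain .getCodeSource .getLocation str))

(defn class-copies
  "Every location visible to the given class loader that offers a class with
  this name. More than one entry means load order decides which copy wins."
  [^ClassLoader loader class-name]
  (let [resource (str (.replace ^String class-name "." "/") ".class")]
    (mapv str (enumeration-seq (.getResources loader resource)))))

;; Which jar did Clojure's own runtime class come from?
(println (class-origin clojure.lang.RT))
;; Is there more than one copy of it on the classpath?
(println (class-copies (.getClassLoader clojure.lang.RT) "clojure.lang.RT"))
```

Run against one of the duplicated Lucene class names, a check like class-copies would presumably have listed both the Lucene jar and the PN jar, and class-origin would have shown which one won.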

Issue #2

This one was utterly baffling as well. Lemmatized search (that is, searching for dictionary headwords and getting hits on all the forms of the word—very useful for inflected languages, like Greek) was working at Columbia, and not at NYU. Bizarre. I hadn't done anything to the code. Of course, it was my fault. It almost always is the programmer's fault. A few months before, in response to a bug report (and before I started working for NYU), I had updated the transcoder software (which converts between various encodings for Ancient Greek) to conform to the recommended practice for choosing which precomposed (letter + accent) character to use when the same one (e.g. alpha + acute accent) occurs in both the Greek (Modern) and Greek Extended (Ancient) blocks in Unicode. Best practice is to choose the character from the Greek block, so \u03AC instead of \u1F71 for ά. Transcoder used to use the Greek Extended character, but since late 2008 it has followed the new recommendation and used characters from the Greek block, where available. Unfortunately this change happened after transcoder had been used to build the lemma database that the PN uses to expand lemmatized queries. So it had the wrong characters in it, and a search for any lemma containing an acute accent would fail. Again, all the code was executing perfectly; some of the data was bad. It didn't help that when I pasted lemmas into Oxygen, it normalized the encoding, or I might have realized sooner that there were differences.
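The two characters really are different code points that render identically, which is easy to confirm at a REPL. This is a small sketch using the JDK's java.text.Normalizer rather than transcoder; the helper name nfc is mine. Unicode normalization (NFC) maps the Greek Extended character to the one in the Greek block, which is the kind of silent cleanup an editor can do on paste.

```clojure
(import 'java.text.Normalizer 'java.text.Normalizer$Form)

(defn nfc
  "Returns s in Unicode Normalization Form C."
  [s]
  (Normalizer/normalize s Normalizer$Form/NFC))

(def alpha-oxia  "\u1F71") ; Greek Extended block: alpha with oxia
(def alpha-tonos "\u03AC") ; Greek block: alpha with tonos

(= alpha-oxia alpha-tonos)
;; => false, so an exact-match lookup on one will miss the other
(= (nfc alpha-oxia) alpha-tonos)
;; => true, normalizing folds the Greek Extended form into the Greek block
```

Normalizing both the lemma database and the incoming query the same way would make a mismatch like this one harmless, whichever block the data was originally built with.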

Issue #3

Last, but not least, was a bug which manifested as a failure in certain types of search. "A followed by B within n places" searches worked, but "A and B (not in order) within n places" and "A but not B within n places" both failed. Again, no apparent errors in the PN code. The NullPointerException that was being thrown came from within the Lucene code! After a lot of messing about, I was able to determine that the failure was due to a Lucene change that the PN code wasn't implementing against. Once I'd found that, all it took to fix it was to override a method from the Lucene code. This was actually a Lucene bug, which I reported. In trying to maintain backward compatibility, they had kept compile-time compatibility with pre-2.4 code, but broken it in execution. I have to say, I was really impressed with how fast the Lucene team, particularly Mark Miller, responded. The bug is already fixed.

So, lessons learned:

  1. Tests are good. I didn't have any available for the project that contained all of the bugs listed here. They exist (though coverage is spotty), but there are dependencies that are tricky to resolve, and I had decided to defer getting the tests to work in favor of getting the PN online. Not having tests ate into the time I'd saved by deferring them.

  2. In both cases #1 and #3, I had to find the problem by reading the code and stepping through it in my head. Practice this basic skill.

  3. Look for ways your architecture may have changed during the upgrade. Anything may be significant, including filenames.

  4. Greek character encoding is the Devil (but I already knew that).

  5. It's probably your fault, but it might not be. Look closely at API changes in libraries you upgrade. Go look at the source if anything looks fishy. I didn't expect to find anything wrong with something as robust as Lucene, but I did.

Friday, January 23, 2009

Endings and Beginnings

It's been that sort of a week. Great beginning with the inauguration on Tuesday and the start of a new Obama presidency. My wife was in tears. Growing up in a small southern town, she never imagined she'd see a black president, and now our youngest daughter will never know a world in which there hasn't been one. Sometimes things do change for the better.

On a personal note, I gave my notice to UNC on Tuesday. My position was partially funded with soft money, and one-time money is one of the primary ways they're trying to address the budget crisis, in order not to lay off permanent employees (as is right and proper). I'm rather sad about leaving, but I will be starting a job with the NYU digital library team in February, working on digital papyrology. This has the look of a job where I can unite both the Classics geek and the tech geek sides of my personality. I may become unbearable.