Tuesday, December 28, 2010

DH Tea Leaves

From reading my (possibly) representative sample of DH proposals, I'd say the main theme of the conference will not be "Big Tent Digital Humanities" but "data integration". Of the 8 proposals I read, more than half were concerned with problems of connecting data across projects, disciplines, and systems. Mine was too (making 9), so perhaps I did have a representative sample.

Data integration is a meaty problem, resistant to generalized solutions. To my mind the answers, such as they are, will rely on the same practices that good data curation relies on: open formats, open source code, and documentation that covers the "why" of decisions made for a project as well as the "how." Data integration is a process that involves understanding the sources and the semantics of their structures before you can connect them. So, while there are tools out there that can enable successful data integration, there are (as usual) no silver bullets. Grasping the meanings and assumptions embodied in each project's data structures has to be the first step, and this is only possible when those structures have been explained.
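To make that concrete, here's the sort of documented crosswalk I have in mind. It's a sketch in Ruby with invented field names on both sides; the comments carrying the "why" are the real point:

```ruby
# A hypothetical crosswalk between two projects' record formats. Every name
# here is invented; the comments are the part that makes integration possible.
FIELD_MAP = {
  # Project A stores dates as ISO 8601 strings; Project B keeps separate
  # year/month/day fields because its source catalogs often lack full dates.
  'date'     => :date_iso,
  # A's "findspot" is the modern excavation site; B's "provenance" is the
  # ancient place name. The two are NOT equivalent, so B's value must not
  # be mapped onto this field.
  'findspot' => :excavation_site,
}
```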

Sunday, December 05, 2010

That Bug Bit Me

"I had this problem and I fixed it" stories are boring to anyone except those intimately concerned with the problem, so I'm not going to tell that story. Instead, I'm going to talk about projects in the Digital Humanities that rely on 3rd party software, and talk about the value of expertise in programming and software architecture. From the outside, modern software development can look like building a castle out of Lego pieces: you take existing components and pop them together. Need search? Grab Apache Solr and plug it in. Need a data store? Grab a (No)SQL database and put your data in it. Need to do web development fast? Grab a framework, like Rails or Django. Doesn't sound that hard.

This is, more or less, what papyri.info looks like internally: a Solr install that handles search, a Mulgara triple store that keeps track of document relationships, small bits of code that populate those two and display the web interface, and a JRuby on Rails application that provides crowdsourced editing capabilities.
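For a flavor of the triple store side, a relationship query looks something like the sketch below. The endpoint path, document URI, and predicate are illustrative assumptions rather than our actual vocabulary:

```ruby
require 'net/http'
require 'uri'

# Ask the triple store which documents a given text is related to.
# The endpoint URL, the document URI, and the predicate are all
# illustrative, not papyri.info's actual RDF vocabulary.
sparql = <<-QUERY
  PREFIX dcterms: <http://purl.org/dc/terms/>
  SELECT ?related WHERE {
    <http://papyri.info/ddbdp/p.fay;;110/source> dcterms:relation ?related .
  }
QUERY

uri = URI('http://localhost:8080/sparql/')
uri.query = URI.encode_www_form(query: sparql)
puts Net::HTTP.get_response(uri).body
```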

Upgrading components in this architecture should range from trivially easy to moderately complex (the latter only if, for example, some interface has changed between versions).

So why did I find myself sitting in a hotel lobby in Rome a few weeks ago, rolling back to an older version of Mulgara so the editor application would work for a presentation the next day? A bunch of our queries had stopped working, meaning the editor couldn't load texts to edit. Oops.

And why did I spend the last week fighting to keep the application on its feet after a new release of the editor was deployed?

The answer to both questions was that our Lego blocks didn't function the way they were supposed to. They aren't Lego blocks after all; they're complex pieces of software that may have bugs. The fact that our components are open source and have responsive developers behind them is a help, but we can't expect those developers to jump to solve our problems. After all, the project's tests must have passed for the new release to be pushed, and unless there's a chorus of complaint, our problem isn't necessarily going to be high on their list of things to fix.

No, the whole point of using open source components is that you don't have to depend solely on other people to fix your problems. In the case of Mulgara, I was able to track down and fix the bug myself, with some pointers from the lead developer. The fix (or a better version of it) will go into the next release, and in the meantime we can use my patched version. In the case of the Rails issue, there seems to be a bug in the ActiveSupport file caching under JRuby that causes it to go nuts: the request never returns, and something continually creates resources that have to be garbage collected. The symptom I was seeing was constant GC and a gradual ramp-up in CPU usage to the point where the app became unstable. Tracing back from that symptom took a lot of work, but once I'd identified the cause, we were able to switch away from file-store caching, and so far things look good.
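The switch itself is a one-line configuration change. Here's a sketch using Rails's standard cache_store setting; memcached and the file path stand in for whatever stores a given app actually uses:

```ruby
# config/environments/production.rb
# Before: the ActiveSupport file store that misbehaved for us under JRuby.
# config.cache_store = :file_store, '/var/app/cache'

# After: any non-file store sidesteps the problem; memcached is one option.
config.cache_store = :mem_cache_store, 'localhost:11211'
```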

My takeaway from this is that even when you're constructing your application from prebuilt blocks, it really helps to have the expertise to dig into the architecture of the blocks themselves. Software components aren't Lego blocks, and although you'll want to use them (because you don't have the time or money to write your own search engine from scratch), you do need to be able to understand them in a pinch. It also really pays to work with open source components. I didn't have to spend weeks feeding bug reports to a vendor to help them fix our Mulgara problem. A handful of emails and about a day's worth of work (spread over the course of a week) were enough to get me to the source of the problem and a fix for it (a one-liner, incidentally).