Tuesday, December 28, 2010

DH Tea Leaves

From reading my (possibly) representative sample of DH proposals, I'd say the main theme of the conference will not be "Big Tent Digital Humanities" but "data integration". Of the eight proposals I read, more than half were concerned with problems of connecting data across projects, disciplines, and systems. My own proposal was, too (making nine), so perhaps I did have a representative sample.

Data integration is a meaty problem, resistant to generalized solutions. To my mind the answers, such as they are, will rely on the same practices that good data curation relies on: open formats, open source code, and documentation that covers the "why" of a project's decisions as well as the "how." Data integration requires understanding the sources and the semantics of their structures before you can connect them. So, while there are tools that can enable successful data integration, there are (as usual) no silver bullets. Grasping the meanings and assumptions embodied in each project's data structures has to be the first step, and that is only possible when those structures have been explained.

Sunday, December 05, 2010

That Bug Bit Me

"I had this problem and I fixed it" stories are boring to anyone except those intimately concerned with the problem, so I'm not going to tell that story. Instead, I'm going to talk about projects in the Digital Humanities that rely on 3rd party software, and talk about the value of expertise in programming and software architecture. From the outside, modern software development can look like building a castle out of Lego pieces: you take existing components and pop them together. Need search? Grab Apache Solr and plug it in. Need a data store? Grab a (No)SQL database and put your data in it. Need to do web development fast? Grab a framework, like Rails or Django. Doesn't sound that hard.

This is, more or less, what papyri.info looks like internally. There's a Solr install that handles search, a Mulgara triple store that keeps track of document relationships, small bits of code that handle populating the former two and displaying the web interface, and a JRuby on Rails application that provides crowdsourced editing capabilities.
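For a sense of what those "small bits of code" look like from the outside, here is a minimal sketch of querying a Solr install over its standard HTTP interface. The host, core path, and field name are made up for illustration; this is not papyri.info's actual code.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;
    import java.net.URLEncoder;

    public class SolrSearchExample {
      public static void main(String[] args) throws Exception {
        // Hypothetical Solr location and field name; the real index differs.
        String base = "http://localhost:8983/solr/select";
        String q = URLEncoder.encode("transcription:sitologos", "UTF-8");
        URL url = new URL(base + "?q=" + q + "&rows=10&wt=json");

        // Solr answers a plain HTTP GET with (here) JSON-formatted results.
        BufferedReader in = new BufferedReader(
            new InputStreamReader(url.openStream(), "UTF-8"));
        String line;
        while ((line = in.readLine()) != null) {
          System.out.println(line);
        }
        in.close();
      }
    }

From this distance, plugging in a search engine really does look like snapping on a Lego block, which is exactly why the rest of this post exists.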

Upgrading components in this architecture should range from trivially easy to moderately complex (the latter only if, say, some interface has changed between versions).

So why did I find myself sitting in a hotel lobby in Rome a few weeks ago, rolling back to an older version of Mulgara so the editor application would work for a presentation the next day? A bunch of our queries had stopped working, meaning the editor couldn't load texts to edit. Oops.

And why did I spend the last week fighting to keep the application standing up after a new release of the editor was deployed?

The answer to both questions was that our Lego blocks didn't function the way they were supposed to. They aren't Lego blocks after all—they're complex pieces of software that may have bugs. The fact that our components are open source, and have responsive developers behind them is a help, but we can't necessarily expect those developers to jump to solve our problems. After all, the project's tests must have passed in order for the new release to be pushed, and unless there's a chorus of complaint, our problem isn't necessarily going to be high on their list of things to fix.

No, the whole point of using open source components is that you don't have to depend solely on other people to fix your problems. In the case of Mulgara, I was able to track down and fix the bug myself, with some pointers from the lead developer. The fix (or a better version of it) will go into the next release, and in the meantime we can use my patched version. In the case of the Rails issue, there seems to be a bug in the ActiveSupport file caching under JRuby that causes it to go nuts: the request never returns, and something continually creates objects that have to be garbage collected. The symptom I was seeing was constant GC and a gradual ramp-up of CPU usage to the point where the app became unstable. Tracing back from that symptom to its cause took a lot of work, but once I had identified it, we were able to switch away from file-store caching, and so far things look good.

My takeaway from this is that even when you're constructing your application from prebuilt blocks, it really helps to have the expertise to dig into the architecture of the blocks themselves. Software components aren't Lego blocks, and although you'll want to use them (because you don't have the time or money to write your own search engine from scratch), you do need to be able to understand them in a pinch. It also really pays to work with open source components. I didn't have to spend weeks feeding bug reports to a vendor to help them fix our Mulgara problem. A handful of emails and about a day's worth of work (spread over the course of a week) were enough to get me to the source of the problem and a fix for it (a one-liner, incidentally).

Monday, May 10, 2010

#alt-ac Careers: Digital Humanities Developer

(part 1 of a series)

I've been a digital humanities developer (that is, someone who writes code, does interface and system design, and rescues data from archaic formats in support of DH projects) in a few contexts over the course of my career. I'll be writing a piece for Bethany Nowviskie's #alt-ac (http://nowviskie.org/2010/alt-ac/) volume this year. This is (mostly) not that piece, though portions of it may appear there, so it's maybe part of a draft. This is an attempt to dredge up bits of my perspective as someone who has had an alternate-academic career (on and off) for the last decade. It's fairly narrowly aimed at people like me: people who've done advanced graduate work in the Humanities, have an interest in Digital Humanities, and who hold, or are thinking about, jobs that employ their technical skills rather than pursuing a traditional academic career.

A couple of future installments I have in mind are "What Skills Does a DH Developer Need?" and "What's Up with Digital Classics?"

In this installment, I'm going to talk about some of the environments I've worked in. I'm not going to pull punches, and this might get me into trouble, but I think that if there is to be a future for people like me, these things deserve an airing. If your institution is more enlightened than what I describe, please accept my congratulations and don't take offense.

Working for Libraries


In general, libraries are a really good place to work as a programmer, especially doing DH projects. I've spent the last three years working in digital library programming groups. There are some downsides to be aware of: Libraries are very hierarchical organizations, and if you are not a librarian then you are probably in a lower "caste". You will likely not get consistent (or perhaps any) support for professional development, conference attendance, etc. Librarians, as faculty, have professional development requirements as part of their jobs. You, whose professional development is not mandated by the organization (merely something you have to do if you want to stay current and advance your career), will not get the same level of support and probably won't get any credit for publishing articles, giving papers, etc. This is infuriating, and in my opinion self-defeating on the part of the institution, but it is an unfortunate fact.

[Note: Bethany Nowviskie informs me that this is not the case at UVA, where librarians and staff are funded at the same level for professional development. I hope that signals a trend. And by the way, I do realize I'm being inflammatory, talking of castes. This should make you uncomfortable.]

Another downside is that as a member of a lower caste, you may not be able to initiate projects on your own. At many institutions, only faculty (including librarians) and higher-level administrators can make grant proposals, so if you come up with a grant-worthy project idea, someone will have to front for you (and get the credit).

There do exist librarian/developer jobs, which are a substantially better situation from a professional standpoint, but since librarian positions typically require a Master's degree in Library and/or Information Science, libraries may calculate that putting that sort of requirement on a developer job would exclude perfectly good programmers from the pool. MLIS programs are not terribly onerous on the whole, should you want to get the degree, but it does mean obtaining another credential. For what it's worth, I have one, but have never held a librarian position.

It's not all bad, though: you will typically have a lot of freedom, loose deadlines, shorter-than-average work weeks, and the opportunity to apply your skills to really interesting, hard problems. If you want to continue to pursue your academic interests, however, you'll be doing it as a hobby; they don't want your research agenda unless you're a librarian. In a lot of ways, a library is a DH developer's nirvana. I rant because it's so close to ideal.

Working for a .edu IT Organization


My first full-time, permanent position post-Ph.D. was working for an IT organization that supports the College of Arts and Sciences at UNC Chapel Hill. I was one of a handful of programmers who did various kinds of administrative and faculty project support. It was a really good environment to work in. I got to try out new technologies, learned Java, really understood XSLT for the first time, got good at web development, and had a lot of fun. I also learned to fear unfunded mandates, that projects without institutional support are doomed, and that if you're the last line of support for a web application, you'd better get good at making it scale.

IT organizations typically pay a bit better than, say, libraries, and since they are IT organizations, they actually understand technology and what it takes to build systems. There's less of a sense of being the odd man out in the organization. That said, if you're the academic/DH applications developer, it's really easy to get overextended, and I did a bad job of avoiding that fate, "learning by suffering," as Aeschylus said.

Working in Industry


Working outside academia as a developer is a whole other world. Again, DH work is likely to have to be a hobby, but depending on where you work, it may be a relevant hobby. You will be paid (much) more, will probably have a budget for professional development, and may be able to use it for things such as attending DH conferences. Downsides are that you'll probably work longer hours and you'll have less freedom to choose what you do and how you do it, because you're working for an organization that has to make money. The capitalist imperative may strike you as distasteful if you've spent years in academia, but in fact it can be a wonderful feedback mechanism. Doing things the right way (in general) makes the organization money, and doing them wrong (again, in general) doesn't. It can make decision-making wonderfully straightforward. Companies, particularly small ones, can make decisions with a speed that seems bewilderingly quick when compared to libraries, which thrive on committees and meetings and change direction with all the flexibility of a supertanker.

Another advantage of working in industry is that you are more likely to be part of a team working on the same things you are. In DH we tend to be able to assign only one or two developers to a job, and you will likely be the lone wolf on a project at some point in your career. Companies have money, and they want to get stuff done, so they hire teams of developers. Being on a team like that is nice, and I often miss it.

There are lots of companies that work in areas you may be interested in as someone with a DH background, including the semantic web, text mining, linked data, and digital publishing. In my opinion, working on DH projects is great preparation for a career outside academia.

Funding


As a DH developer, you will more likely than not end up working on grant-funded projects, where your salary is paid with "soft money". What this means in practical terms is that your funding will expire at a certain date. This can be good: it's not uncommon for programmers to change jobs every couple of years anyway, so a time-limited position gives you a free pass at job-switching without being accused of job-hopping. If you work for an organization that's good at attracting funding, then it's quite possible to string projects together and/or combine them. There can be institutional impedance-mismatch problems here, though, in that it might be hard to renew a time-limited position, to convert it to a permanent job without re-opening it to new applicants, or to fill in the gaps between funding cycles. In other words, some institutions have a hard time mapping funding streams onto people efficiently. These aren't too hard to spot, because they go through "boom and bust" cycles, staffing up to meet demand and then losing everybody when the funding is gone. This doesn't mean "don't apply for this job"; just do it with your eyes open. Don't go in with the expectation (or even much hope) that it will turn into a permanent position. The upside is that these are often great learning opportunities: learn what you can and move on.

In sum, being a DH developer is very rewarding. But I'm not sure it's a stable career path in most cases, which is a shame for DH as a "discipline" if nothing else. It would be nice if there were more senior positions for DH "makers" as well as "thinkers" (not that those categories are mutually exclusive). I suspect that the institutions that have figured this out will win the lion's share of DH funding in the future, because their brain trusts will just get better and better. The ideal situation (and what you should look for when you're ready to settle down) is a place

  • that has a good track record of getting funded,

  • where developers are first-class members of the organization (i.e. have "researcher" or similar status),

  • where there's a team in place and it's not just you, and

  • where there's some evidence of long-range planning.


For the most part, though, DH development may be the kind of thing you do for a few years while you're young, before you go and do something else. I often wonder whether my own DH developer expiration date is approaching. Grant funding often won't pay enough for an experienced programmer unless those who wrote the budget knew what they were doing [Note: I've read too many grant proposals where the developer salary is < $50K (entry-level) but the title is "Lead Developer" vel sim. For what it's worth, this positively screams "We don't know what we're doing!"]. It may soon be time to go back to working for lots more money in industry, or to try to get another administrative DH job. For now, I still have about a year's worth of grant funding left. Better get back to work.

Tuesday, May 04, 2010

Addenda et Corrigenda

The proceedings of the 2009 Lawrence J. Schoenberg Symposium on Manuscript Studies in the Digital Age, at which I was a panelist, were recently published at http://repository.upenn.edu/ljsproceedings/. I contributed a short piece arguing for the open licensing of content related to the study of medieval manuscripts.

Peter Hirtle, Senior Policy Advisor at the Cornell University Library, wrote me a message commenting on the piece and raising a point that I had elided (reproduced with permission):

Dear Dr. Cayless:

I read with great interest your article on “Digitized Manuscripts and Open Licensing.” Your arguments in favor of a CC-BY license for medieval scholarship are unusual, important, and convincing.

I was troubled, however, to see your comments on reproductions of medieval manuscripts. For example, you note that if you use a CC-BY license, “an entrepreneur can print t-shirts using your digital photograph of a nice initial from a manuscript page.” Later you add:

"Should reproductions of cultural objects that have never been subject to copyright (and that would no longer be, even if they once had) themselves be subject to copyright? The fact is that they are, and some uses of the copyright on photographs may be laudable, for example a museum or library funding its ongoing maintenance costs by selling digital or physical images of objects in its collection, but the existence of such examples does not provide an answer to the question: as an individual copyright owner, do you wish to exert control how other people use a photograph of something hundreds or thousands or years old?"

There is a fundamental mistaken concept here. While some reproductions of cultural objects are subject to copyright, most aren’t. Ever since the Bridgeman decision, the law in the New York circuit at least (and we believe in most other US courts) is that a “slavish” reproduction does not have enough originality to warrant its own copyright protection. If it is an image of a three-dimensional object, there would be copyright, but if it is just a reproduction of a manuscript page, there would be no copyright. It may take great skill to reproduce well a medieval manuscript, but it does not take originality. To claim that it does encourages what has been labeled as “copyfraud.”

You can read more about Bridgeman on pps. 34-35 in my book on Copyright and Cultural Institutions: Guidelines for Digitization for U.S. Libraries, Archives, and Museums, available for sale from Amazon and as a free download from SSRN at http://papers.ssrn.com/sol3/papers.cfm?abstract_id=1495365.

Sincerely,
Peter Hirtle


Peter is quite right that the copyright situation in the US, at least insofar as faithful digital reproductions of manuscript pages are concerned, is (with high probability) governed by the 1999 Bridgeman vs. Corel decision. So it is arguably best practice for a scholar who has photographed manuscripts to publish them and assert that they are in the public domain.

I skimmed over this in my article because, for one thing, much of the scholarly content of the symposium dealt with European institutions and their online publication (some behind a paywall, and sometimes a crazily expensive one) of manuscripts, so US copyright law doesn't necessarily pertain. For another, I had in mind not just manuscripts but inscriptions, to which, to the extent that they are three-dimensional objects, Bridgeman doesn't pertain. Finally, while it is common practice to produce faithful digital reproductions of manuscript texts, it is also common to enhance those images for the sake of readability, at which point they are (or may be) no longer "slavish reproductions" and thus become subject to copyright. The problem, of course, as so often in copyright, is that there's no case law to back this up. If I crank up the contrast or run a Photoshop (or Gimp) filter on an image, have I altered it enough to make it copyrightable? I don't know for certain, and I'm not sure anyone does. So on balance I'd still argue for doing the simple thing and putting a Creative Commons license (CC-BY is my recommendation) on everything. This is what the Archimedes Palimpsest project does, for example, and its team is arguably in just this situation with its publication of multispectral imaging of the palimpsest. They are to be commended for doing so.

Anyway, I'd like to thank Peter, first for reading my article and second for prodding me into usefully complicating something I had oversimplified for the sake of argument.

Tuesday, March 02, 2010

Making a new Numbers Server for papyri.info

[UPDATE: added relationships diagram]

One of the components of papyri.info (a fairly well-hidden one) is a service that provides lookups of identifiers and correlates them with related records in other collections. Over the last few weeks, I've been working on replacing the old, Lucene-based numbers server with a new, triplestore-based one. One of the problems with the old version (though not the one that initially sent me on this quest, which was that I hated its identifiers) was that its structure didn't match the multidimensional nature of the data.

Dimensions:

  • Collections in the PN (there are four, so far) are hierarchical: for the Duke Databank of Documentary Papyri (DDbDP) and the Heidelberger Gesamtverzeichnis der griechischen Papyrusurkunden Ägyptens (HGV—which has two collections, of metadata and translations), there are series, volumes, and items, and for the Advanced Papyrological Information System (APIS) there are institutions and items.
  • FRBR: there's a Work (the ancient document itself), which has expression in a scholarly publication, from which the DDbDP transcription, HGV records and translations, and APIS records and translations are derived; these may be made manifest in a variety of ways, including EpiDoc XML, an HTML view, etc. The scholarly work has bibliography, which is surfaced in the HGV records. There is the possibility of attaching bibliography at the volume level as well (since these are actual books, sitting in libraries). Libraries may have series-level catalog records too.
  • There are relationships between items that describe the same thing. DDbDP and HGV usually have a 1::1 relationship (but not always). APIS has some overlap with both.
  • There are internal relationships as well. HGV has the idea of a "principal edition," the canonical publication of a document (there are also "andere Publikationen"—other publications). DDbDP does as well, but expresses it slightly differently: older versions that have been superseded have stub records with a pointer to the replacement. The replacements point backward as well, and these can sometimes form complex chains (imagine two fragments published separately, but later recognized as belonging to the same document and republished together). A sketch of what some of these relationships look like as RDF follows this list.
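To make the shape of the graph concrete, here is a small sketch of the kinds of triples involved. It uses Jena purely for illustration (the numbers server itself sits on Mulgara), Dublin Core terms as stand-in predicates, and identifiers echoing the examples further down; the particular combinations, and the vocabulary, are assumptions rather than papyri.info's actual data.

    import com.hp.hpl.jena.rdf.model.Model;
    import com.hp.hpl.jena.rdf.model.ModelFactory;
    import com.hp.hpl.jena.rdf.model.Property;
    import com.hp.hpl.jena.rdf.model.Resource;

    public class RelationSketch {
      public static void main(String[] args) {
        Model m = ModelFactory.createDefaultModel();
        String dct = "http://purl.org/dc/terms/";
        m.setNsPrefix("dcterms", dct);

        // Stand-in predicates; not necessarily the ones papyri.info uses.
        Property relation     = m.createProperty(dct, "relation");
        Property isReplacedBy = m.createProperty(dct, "isReplacedBy");
        Property replaces     = m.createProperty(dct, "replaces");

        // One document as seen by three collections (illustrative pairing).
        Resource ddb  = m.createResource("http://papyri.info/ddbdp/o.berenike;1;17");
        Resource hgv  = m.createResource("http://papyri.info/hgv/8875");
        Resource apis = m.createResource("http://papyri.info/apis/berenike.apis.17");
        ddb.addProperty(relation, hgv);
        ddb.addProperty(relation, apis);

        // A superseded edition (invented identifier) and its replacement,
        // pointing at each other; chains of these can get complicated.
        Resource old = m.createResource("http://papyri.info/ddbdp/p.fake;1;1");
        old.addProperty(isReplacedBy, ddb);
        ddb.addProperty(replaces, old);

        m.write(System.out, "N3");
      }
    }

Even this toy graph is awkward to squeeze into tables or documents, but it is trivial to state as triples and to query from any direction.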

Relationships:

All of this is really hard to represent in a relational or document-oriented fashion. It turns out, though, that a graph database handles it really well. I experimented with Mulgara and found that it does the job perfectly. I can write SPARQL queries that retrieve the data I need from the point of view of any component, and then map those queries to nice URLs, so that they are easy to retrieve, using a servlet that does some URL rewriting. Some examples (a sketch of the kind of query behind them follows the list):

An HGV record:
http://papyri.info/hgv/8875/rdf
An HGV translation:
http://papyri.info/hgvtrans/8875/rdf
An HGV record's principal edition and andere pub.:
http://papyri.info/hgv/249/frbr:Work/rdf
An APIS record:
http://papyri.info/apis/berenike.apis.17/rdf
A (corresponding) DDb record:
http://papyri.info/ddbdp/o.berenike;1;17/rdf
A DDb series listing (with "human-readable" citation):
http://papyri.info/ddbdp/chla/rdf
A DDb volume listing:
http://papyri.info/ddbdp/chla;1/rdf
The DDb collection listing:
http://papyri.info/ddbdp/rdf
The HGV collection listing:
http://papyri.info/hgv/rdf
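Behind each of these URLs is, in effect, a SPARQL query against the triplestore, with the subject URI taken from the request. Here is a rough sketch of the kinds of queries involved; the dcterms:relation predicate is an assumption carried over from the sketch above, not necessarily the real vocabulary.

    public class NumbersQueries {
      public static void main(String[] args) {
        // The subject URI comes straight from the request URL.
        String hgvRecord = "http://papyri.info/hgv/8875";

        // Everything the triplestore asserts about that record...
        String describeRecord =
            "CONSTRUCT { <" + hgvRecord + "> ?p ?o } " +
            "WHERE { <" + hgvRecord + "> ?p ?o }";

        // ...or just the other records it is related to.
        String related =
            "PREFIX dcterms: <http://purl.org/dc/terms/> " +
            "SELECT ?related " +
            "WHERE { <" + hgvRecord + "> dcterms:relation ?related }";

        System.out.println(describeRecord);
        System.out.println(related);
      }
    }

The rewriting servlet's job is then mostly bookkeeping: turn http://papyri.info/hgv/8875/rdf into a query like the first one and serialize the results in the requested format.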

Results are also available in Notation3 or JSON (substitute "n3" or "json" for "rdf" in the URLs above). All of this makes for a nice machine interface to the relationships in the papyri.info data, one that can be generated purely from the data files themselves, plus an RDF file containing the abbreviated series citations from which DDbDP derives its identifiers. The new Papyrological Editor, which will allow scholars to propose emendations to existing documents and to add new ones, will use it to determine which files to pull for editing. I also plan to drive the new Solr-based search indexing (which is necessarily document-oriented) from it, since it provides a clear view of which documents should be aggregated.

The URL schemes above illustrate what I plan to do with the new version of papyri.info. Content URLs will be of the form http://papyri.info/<collection name>/<collection-specific identifier>[/<format>]. Leaving off a format will give you a standard HTML view of the document plus associated documents; /source will give you the EpiDoc source document by itself; /atom will give you an Atom-based representation. I'm also thinking of /rdfa for an HTML-based view of the numbers server data, with embedded RDFa.
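As a rough illustration of how that scheme decomposes, here is a sketch of splitting a request path into collection, identifier, and optional format. The list of recognized format suffixes and the defaulting behavior are assumptions for the sake of the example, not the actual rewriting code.

    import java.util.Arrays;
    import java.util.List;

    public class PathScheme {
      // Assumed format suffixes; anything else is part of the identifier.
      private static final List<String> FORMATS =
          Arrays.asList("rdf", "n3", "json", "source", "atom");

      public static void main(String[] args) {
        String path = "/ddbdp/o.berenike;1;17/source";
        String[] parts = path.substring(1).split("/");

        String collection = parts[0];                 // "ddbdp"
        String last = parts[parts.length - 1];
        boolean hasFormat = FORMATS.contains(last);
        String format = hasFormat ? last : "html";    // no suffix: HTML view

        // Everything between the collection and the format is the identifier.
        StringBuilder id = new StringBuilder();
        int end = hasFormat ? parts.length - 1 : parts.length;
        for (int i = 1; i < end; i++) {
          if (id.length() > 0) id.append('/');
          id.append(parts[i]);
        }

        System.out.println(collection + " | " + id + " | " + format);
        // prints: ddbdp | o.berenike;1;17 | source
      }
    }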

What's Next

I haven't done anything really sophisticated with this yet. I'd like to experiment with extending the DCTERMS vocabulary to deal with (e.g.) typed identifiers. Importing other vocabularies (like FRBR or BIBO) may make sense as well. We're talking about hooking this up to bibliography (via records in Zotero) and ancient places (via Pleiades). It all fits well with my design philosophy for papyri.info, which is that it should consist of data (in the form of EpiDoc source files and representations of those files), retrievable via sensible URLs, with modular services surrounding the data to make it discoverable and usable.

I made a couple of changes to Mulgara during the course of this:
  1. turned off its strange and repugnant habit of representing namespaces and other URIs by declaring them as entities in an internal DTD in returned RDF results. Please don't do this; it's 2010. For another thing, it breaks if you have any URL-encoded characters (i.e., %something) in a URI, because your XML parser will think they are malformed parameter entities.
  2. made the servlet return a 404 Not Found for queries with no hits, which seems more RESTfully correct (roughly as sketched below).
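The second change amounts to something like the following. This is a schematic sketch of the behavior, not Mulgara's actual servlet code; the lookup method here is a placeholder for the real query.

    import java.io.IOException;
    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;

    public class NumbersServlet extends HttpServlet {
      @Override
      protected void doGet(HttpServletRequest req, HttpServletResponse resp)
          throws IOException {
        // Placeholder for the real triplestore query, keyed off the request path.
        String rdf = lookup(req.getPathInfo());
        if (rdf == null || rdf.length() == 0) {
          // No matching triples: answer 404 rather than an empty 200 response.
          resp.sendError(HttpServletResponse.SC_NOT_FOUND);
          return;
        }
        resp.setContentType("application/rdf+xml");
        resp.getWriter().write(rdf);
      }

      private String lookup(String path) {
        return null; // stand-in; the real version runs a SPARQL query
      }
    }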
Anyway, I need to revisit the Mulgara changes I made and try to either get them committed to the Mulgara codebase, or refactor them so that I'm not actually messing with Mulgara's internals. I guess trying another triplestore is a third option. Mulgara is fast, easy to use, and it solved my problem, so I went with it. But there still might be better alternatives out there.