Tuesday, March 02, 2010

Making a new Numbers Server for papyri.info

[UPDATE: added relationships diagram]

One of the components of papyri.info (a fairly well-hidden one) is a service that provides lookups of identifiers and correlates them with related records in other collections. Over the last few weeks, I've been working on replacing the old, Lucene-based numbers server with a new, triplestore-based one. One of the problems with the old version (though not the one that initially sent me on this quest; that was my hatred of its identifiers) was that its structure didn't match the multidimensional nature of the data.

Dimensions:

  • Collections in the PN (there are four, so far) are hierarchical: in the Duke Databank of Documentary Papyri (DDbDP) and the Heidelberger Gesamtverzeichnis der griechischen Papyrusurkunden Ägyptens (HGV, which comprises two collections, one of metadata and one of translations), there are series, volumes, and items; in the Advanced Papyrological Information System (APIS), there are institutions and items.
  • FRBR: there's a Work (the ancient document itself), which finds Expression in a scholarly publication, from which the DDbDP transcription, the HGV records and translations, and the APIS records and translations are derived; these may have Manifestations in a variety of forms, including EpiDoc XML, an HTML view, etc. The scholarly work has bibliography, which is surfaced in the HGV records. There is also the possibility of attaching bibliography at the volume level (since these are actual books, sitting in libraries), and libraries may have series-level catalog records too.
  • There are relationships between items that describe the same thing. DDbDP and HGV usually have a one-to-one relationship (but not always), and APIS has some overlap with both.
  • There are internal relationships as well. HGV has the idea of a "principal edition," the canonical publication of a document (there are also "andere Publikationen", other publications). DDbDP has the same idea, but expresses it slightly differently: older versions that have been superseded get stub records with a pointer to the replacement. The replacements point backward as well, and these can sometimes form complex chains (imagine two fragments published separately, but later recognized as belonging to the same document and republished together). A SPARQL sketch of how these relationships might be modeled follows this list.
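
To make the graph concrete, here is a minimal SPARQL sketch of what these relationships might look like from a single DDbDP item's point of view. The predicate choices (dcterms:isPartOf for the hierarchy, dcterms:relation for cross-collection links, dcterms:isReplacedBy for superseded editions) are illustrative assumptions, not necessarily the exact terms the numbers server uses:

  PREFIX dcterms: <http://purl.org/dc/terms/>
  SELECT ?volume ?series ?related ?replacement
  WHERE {
    # the item sits in a volume, which sits in a series
    <http://papyri.info/ddbdp/o.berenike;1;17> dcterms:isPartOf ?volume .
    ?volume dcterms:isPartOf ?series .
    # records in other collections that describe the same document
    <http://papyri.info/ddbdp/o.berenike;1;17> dcterms:relation ?related .
    # a newer edition, if this one has been superseded
    OPTIONAL {
      <http://papyri.info/ddbdp/o.berenike;1;17> dcterms:isReplacedBy ?replacement .
    }
  }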

Relationships:

All of this is really hard to represent in a relational or document-oriented fashion. It turns out, though, that a graph database does really well. I experimented with Mulgara and found that it does the job perfectly. I can write SPARQL queries that retrieve the data I need from the point of view of any component. I can then map these to nice URLs, using a servlet that does some URL rewriting, so that they are easy to retrieve. Some examples (a sketch of the kind of query behind them follows the list):

An HGV record:
http://papyri.info/hgv/8875/rdf
An HGV translation:
http://papyri.info/hgvtrans/8875/rdf
An HGV record's principal edition and andere Publikationen:
http://papyri.info/hgv/249/frbr:Work/rdf
An APIS record:
http://papyri.info/apis/berenike.apis.17/rdf
A (corresponding) DDb record:
http://papyri.info/ddbdp/o.berenike;1;17/rdf
A DDb series listing (with "human-readable" citation):
http://papyri.info/ddbdp/chla/rdf
A DDb volume listing:
http://papyri.info/ddbdp/chla;1/rdf
The DDb collection listing:
http://papyri.info/ddbdp/rdf
The HGV collection listing:
http://papyri.info/hgv/rdf
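
Behind each of these URLs sits a SPARQL query against the triplestore. As a rough sketch (the real queries are more targeted than this), a record lookup could be little more than a CONSTRUCT that hands back every statement about the record, ready to be serialized in whatever format was requested:

  # everything the numbers server knows about HGV 8875 (a simplification)
  CONSTRUCT { <http://papyri.info/hgv/8875> ?p ?o }
  WHERE { <http://papyri.info/hgv/8875> ?p ?o }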

Results are also available in Notation3 or JSON (substitute "n3" or "json" for "rdf" in the URLs above). All of this makes for a nice machine interface to the relationships between papyri.info data, one that can be generated purely from the data files themselves, plus an RDF file that contains the abbreviated series citations from which DDbDP derives its identifiers. The new Papyrological Editor, which will allow scholars to propose emendations to existing documents and to add new ones, will use it to determine what files to pull for editing. I also plan to drive the new Solr-based search indexing (which is necessarily document-oriented) from it, since it provides a clear view of which documents should be aggregated.

The URL schemes above illustrate what I plan to do with the new version of papyri.info. Content URLs will be of the form http://papyri.info/ <collection name> / <collection-specific identifier> [/ <format>]. Leaving off the format will give you a standard HTML view of the document plus its associated documents; /source will give you the EpiDoc source document by itself; /atom will give you an Atom-based representation. I'm also thinking of /rdfa for an HTML-based view of the numbers server data, with embedded RDFa.
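
To illustrate with an HGV record from the examples above (the scheme, not these exact URLs, is the commitment here):

The HTML view (document plus associated documents):
http://papyri.info/hgv/8875
The EpiDoc source by itself:
http://papyri.info/hgv/8875/source
The Atom representation:
http://papyri.info/hgv/8875/atom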

What's Next

I haven't done anything really sophisticated with this yet. I'd like to experiment with extending the DCTERMS vocabulary to deal with (e.g.) typed identifiers. Importing other vocabularies (like FRBR or BIBO) may make sense as well. We're talking about hooking this up to bibliography (via records in Zotero) and to ancient places (via Pleiades). It all fits my design philosophy for papyri.info: the site should consist of data (in the form of EpiDoc source files and representations of those files), retrievable via sensible URLs, with modular services surrounding the data to make it discoverable and usable.
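
As a thought experiment, a bibliography lookup against such an enriched store might look something like the query below. The frbr: and bibo: terms, and the way they attach to an HGV record, are pure speculation on my part at this point:

  PREFIX dcterms: <http://purl.org/dc/terms/>
  PREFIX frbr: <http://purl.org/vocab/frbr/core#>
  PREFIX bibo: <http://purl.org/ontology/bibo/>
  SELECT ?bib ?title
  WHERE {
    # the Work behind this HGV record
    ?work a frbr:Work ;
          dcterms:relation <http://papyri.info/hgv/249> ;
          dcterms:references ?bib .
    # bibliographic records typed and titled with BIBO/DCTERMS
    ?bib a bibo:Article ;
         dcterms:title ?title .
  }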

I made a couple of changes to Mulgara during the course of this:
  1. Turned off its strange and repugnant habit of representing namespaces and other URIs by declaring them as entities in an internal DTD in returned RDF results. Please don't do this. For one thing, it's 2010. For another, it breaks if you have any URL-encoded characters (i.e. %something) in a URI, because your XML parser will think they are malformed parameter entities.
  2. Made the servlet return a 404 Not Found for queries with no hits (which seems more RESTfully correct).
Anyway, I need to revisit the Mulgara changes I made and try to either get them committed to the Mulgara codebase, or refactor them so that I'm not actually messing with Mulgara's internals. I guess trying another triplestore is a third option. Mulgara is fast, easy to use, and it solved my problem, so I went with it. But there still might be better alternatives out there.