Wednesday, December 31, 2008

OpenLayers and Djatoka

For the last few weeks, I've been playing around with the new JPEG2000 image server released by the Los Alamos National Labs (http://african.lanl.gov/aDORe/projects/djatoka/). I never could get the image viewer released along with it to work, and I immediately thought of OpenLayers (http://openlayers.org/), a JavaScript API for embedding maps. OpenLayers is like Google Maps in many ways, but Free. Besides maps, it works very well with any large image, and many of the tools it provides for mapping are just as useful for displaying and working with images of other kinds. I wanted to use OpenLayers's support for tiled images in conjunction with Djatoka's ability to render arbitrary sections of an image at a number of zoom levels (the number of levels available depends on how the image was compressed).

After a lot of messing around and some false starts, I've developed a JavaScript class that supports Djatoka's OpenURL API. I've been testing it on JPEG2000 images created with ContentDM in the UNC Library's digital collections, with a good deal of success. The results are not yet available online, because I don't have a public-facing server to host them on, but the source code is up on GitHub here.
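For the curious, every tile the OpenLayers layer asks for boils down to a Djatoka OpenURL getRegion request. Here's a rough Python sketch of what one of those requests looks like; this is not code from OpenURL.js, the parameter names are taken from the Djatoka documentation, and the identifier and region arithmetic are only illustrative, so check them against your own installation:

    # Sketch: build a Djatoka OpenURL getRegion request for a single tile.
    # Assumes a local adore-djatoka install; the identifier is a placeholder.
    from urllib.parse import urlencode

    RESOLVER = "http://localhost:8080/adore-djatoka/resolver"

    def tile_url(rft_id, level, col, row, tile_size=256):
        """Return a getRegion URL for one tile_size x tile_size tile."""
        params = {
            "url_ver": "Z39.88-2004",
            "rft_id": rft_id,                          # id/URL of the JP2 image
            "svc_id": "info:lanl-repo/svc/getRegion",
            "svc_val_fmt": "info:ofi/fmt:kev:mtx:jpeg2000",
            "svc.format": "image/jpeg",
            "svc.level": level,                        # resolution level to render
            # region is (I believe) y,x,height,width at the requested level
            "svc.region": "%d,%d,%d,%d" % (row * tile_size, col * tile_size,
                                           tile_size, tile_size),
        }
        return RESOLVER + "?" + urlencode(params)

    # e.g. the top-left tile at level 3 of a hypothetical image:
    print(tile_url("info:example/some-image.jp2", 3, 0, 0))

Generating one of these URLs per visible tile is essentially what the OpenURL.js layer class takes care of.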

Instructions:

  1. Install Djatoka. Incidentally, in order to get this into the queue for installation on our systems, I had to make Djatoka work on Tomcat 6. The binary doesn't work out of the box, but when I rebuilt it on my system (RHEL 5), it worked fine.

  2. Copy the adore-djatoka WAR into your Tomcat webapps directory, and follow the instructions on the Djatoka site to start the webapp.

  3. Grab a copy of OpenLayers. Put the OpenURL.js file in lib/OpenLayers/Layer/ and run the build.py script.

  4. To run the demo, copy djatoka.html, the OpenLayers.js you just built, the .css files from OpenLayers/theme/ and from the examples/ directory, and the OpenLayers control images from OpenLayers/img into the adore-djatoka directory in webapps. You should then be able to load djatoka.html and see the demo.

This all comes with no guarantees, of course. It seems to work quite well with the JPEG2000 images I've tested, and the tiling means that each request to Djatoka consumes a roughly constant amount of resources. I've run into OutOfMemoryErrors when requesting full-size images in a single call, but this method loads them without any problem.

Update (2009-01-05 14:37): I've posted a fix to the OpenURL.js script for a bug pointed out to me by John Fereira on the djatoka-devel list. If you grabbed a copy before now, you should update.

Update: screenshots --

Wednesday, October 29, 2008

Thoughts on crosswalking

For the second Integrating Digital Papyrology project, we need to develop a method for crosswalking between EpiDoc (which is a dialect of TEI) and various database formats. We've thought about this quite a bit in the past, and we don't just want to write a one-off conversion, because (a) there will be more than one such conversion and (b) we want to be able to document the mappings between data sources in a stable format that isn't just code (a script, XSLT, etc.).

Some of the requirements for this notional tool are:


  • It should document mappings between data formats in a declarative fashion.

  • Certain fields will require complex transformations. For example, the document text will likely be encoded in some variant of Leiden in the database and will need to be converted to EpiDoc XML. This is currently accomplished by a fairly complex Python script, so it should be possible to define categories of transformation that signal a call to an external process.

  • Some mappings will involve combining several database fields into a single EpiDoc element, and others will involve dividing a single field into multiple EpiDoc elements.

  • Context-specific information (not included in the database) will need to be inserted into the EpiDoc document, so some sort of templating mechanism should be supported.

  • The mapping should be bidirectional. We aren't just talking about exporting from a database to EpiDoc, but also about importing from EpiDoc, which is envisioned as an interchange format as well as a publication format. This is why a single mapping document, rather than a set of instructions on how to get from one format to the other, would be nice.


So far, my questions to various lists have turned up favorable responses (i.e. "yes, that would be a good thing") but no existing standards....
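By way of illustration, here is a very rough sketch of the kind of mapping document I have in mind, written as Python data only because our existing conversion code is Python; the field names, paths, and transform names are all invented:

    # A sketch only: a declarative, bidirectional mapping expressed as data.
    # Every name below (fields, paths, transforms) is hypothetical.
    MAPPINGS = [
        # a simple one-to-one mapping
        {"db_field": "inventory_no",
         "epidoc": "//msIdentifier/idno",
         "direction": "both"},

        # a complex transformation delegated to an external process
        {"db_field": "text_leiden",
         "epidoc": "//div[@type='edition']",
         "transform": "leiden2epidoc",          # e.g. the existing Python script
         "reverse_transform": "epidoc2leiden",
         "direction": "both"},

        # several database fields combined into one EpiDoc element,
        # using a template for the context-specific wrapping
        {"db_field": ["provenance_place", "provenance_date"],
         "epidoc": "//history/origin",
         "template": "<origin><origPlace>{provenance_place}</origPlace>"
                     "<origDate>{provenance_date}</origDate></origin>",
         "direction": "both"},
    ]

The point is that a single set of entries like this could drive both export and import, with the genuinely complex conversions (like Leiden to EpiDoc) handed off to named external transformations.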

Monday, October 20, 2008

On Bamboo the 2nd

I spent Thursday through Saturday of last week at the second Bamboo workshop in San Francisco. So, some reactions:

1) The organizers are well-intentioned and are sincerely trying to wrestle with the problem of cyberinfrastructure for Digital Humanities.

2) That said, it isn't clear that the Bamboo approach is workable. The team is very IT-focused, and while they seem to have a solid grasp of large-scale software architecture, the ways in which that might be applied to the Humanities with any success aren't obvious. There was a lot of misdirected effort between B1 and B2 by some very smart people, who I must say had the good grace to admit it was a nonstarter. Their attempt to factor the practices of scholars into implementable activities resulted in something that lacked enough context and specificity to be useful. A refocusing on context, and on the processes that contain and help define the activities, happened at the workshop and seems likely to go forward.

3) The workshops themselves seem to have been quite useful. I wasn't at any of the round one workshops, and I doubt I'll be at any of the others (I represented the UNC Library because the usual candidates weren't available), but everyone I talked to was very engaged (if often skeptical). The connections and discussion that seem to have emerged so far probably make the investment worthwhile, even if "Bamboo" as conceived doesn't work.

4) The best idea I heard came (not surprisingly) from Martin Mueller, who suggested Bamboo become a way to focus Mellon funding on projects that conform to certain criteria (such as reusable components and standards) for a defined period (say five years). The actual outcome of the current Bamboo would be the criteria for the RFP. Simple, encourages institutions to think along the right lines, might actually do some good, and might allow participation by smaller groups as well.

5) There was a lot of talk about the people who are both researchers and technologists (guilty). These were variously defined as "hybrids," "translators," and, most offensively, "the white stuff inside the Oreo." None of this was meant to be offensive, but in the end, it is. People who can operate comfortably in both the worlds of scholarship and IT can certainly be useful go-betweens for those who can't, but that is not our sole raison d'être. Until recently there haven't been many jobs for us, but that seems to be changing, and I hope it continues to. See Lisa Spiro's excellent recent post on Digital Humanities Jobs, and Sean Gillies, who, without having been there, manages to capture some of the reservations I feel about the current enterprise and pick up on the educational aspect. One possible useful future for Bamboo would be simply to foster the development of more "hybrids."

6) The Bamboo folks have set themselves a truly difficult task. They are making a real effort to tackle it in an open way, and should be commended for it. But it is a very hard problem, and one for which there is still not a clear definition. The software engineer part of my hybrid brain wants problems defined before it will even consider solutions. The classicist part believes some things are just hard, and you can't expect technology to make them easy for you.

Sunday, September 28, 2008

Go Zotero!

The Thomson Reuters lawsuit against the developers of Zotero is getting a lot of notice, which is good.

I've noticed that in the library world, when people mention getting sued, it's with fear and the implication that this represents the end of the world. It's an interesting contrast coming from working for a startup (albeit a pretty well-funded one), where lawsuits == a) publicity, and so are not to be feared (perhaps even to be provoked), and/or b) a signal that you've scared your competitors enough to make them go running to Daddy, thus unequivocally validating your business model.

This is an act of sheer desperation on the part of Thomson Reuters. They're hoping GMU will crumble and shut the project down. I do hope Dan has contacted the EFF (donate!) and that the GMU administration will take this for what it is: fantastic publicity for one of their most important departments and an indicator that they are doing something truly great.

Friday, August 15, 2008

Back from Balisage

I never made it to Extreme, Balisage's predecessor, despite wanting to very badly, so I'm very glad I did go to its new incarnation. I'm still processing the week's very rich diet of information, but it was very, very cool.

Simon St. Laurent, who wrote one of the first XML books I bought back in 1999, Inside XML DTDs, has a photo of one of the slides from my presentation in his Balisage roundup post. This is the kind of κλέος I can appreciate!

Thursday, August 14, 2008

Balisage Presentation online

I just rsynced up the presentation I gave this morning at Balisage, on linking manuscript images to transcriptions using SVG. It's at http://www.unc.edu/~hcayless/img2xml/presentation.html. The image viewer embedded in the presentation is at http://www.unc.edu/~hcayless/img2xml/viewer.html. Text paths are still busted at the highest resolution, as you'll see if you zoom all the way in, but apart from that it seems to work.

Balisage has been a really great conference so far. I highly recommend it.

Saturday, May 31, 2008

New TransCoder release

This is something I've been meaning to wrap up and write up for a while now: thanks to the Duke Integrating Digital Papyrology grant from the Andrew W. Mellon Foundation, I've been able to make a bunch of updates to the Transcoder, a piece of software I originally wrote for the EpiDoc project. Transcoder is a Java program that handles switching the encodings of Greek text, for example from Beta Code to Unicode (or back again). It's used in initiatives like Perseus and Demos. I've been modifying it to work with the Duke Databank of Documentary Papyri XML files (which are TEI-based). Besides a variety of bug fixes, Transcoder now also includes a fully functional SAX ContentHandler that allows the Greek text in XML files to be transcoded as the files are processed.

There are a lot of complex edge cases in this sort of work. For example, Beta Code (or at least the DDbDP's Beta) doesn't distinguish between medial (σ) and final (ς) sigmas. That's an easy conversion in the abstract (just look for 's' at the end of a word, and it's final), but when your text is embedded in XML, and there may be, for example, an expansion (<expan>) tag in the middle of a word, it becomes a lot harder. You can't just convert the contents of a particular element--you have to be able to look ahead. The problem with SAX, of course, is that it's stream-based, so no lookahead is possible unless you do some buffering. In the end, what I did was buffer SAX events when an element (say a paragraph) marked as Greek begins, and keep track of all the text therein. That gives me a buffer containing the whole textual content of the <p> tag, which is all the lookahead I need. When the end of the element comes, I flush the buffer, and all the queued-up SAX events fire, with the transcoded text in them.
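The real Transcoder does this in Java, as part of the SAX ContentHandler mentioned above, but the buffer-and-flush idea is easy to sketch. Here is a toy Python version that only fixes medial/final sigmas in text that is already Unicode, and that treats anything with xml:lang="grc" as Greek; both of those are simplifications for the sake of the example:

    # Toy sketch of buffering SAX events inside a Greek element so that
    # word-final sigmas can be detected across child-element boundaries.
    # (The serializer below ignores attributes and escaping for brevity.)
    import re
    import sys
    import xml.sax

    def fix_sigmas(text):
        # a medial sigma not followed by a letter is really a final sigma
        return re.sub(r"σ(?!\w)", "ς", text)

    class BufferingHandler(xml.sax.ContentHandler):
        def __init__(self, out=sys.stdout):
            super().__init__()
            self.out = out
            self.depth = 0      # element nesting depth inside a Greek region
            self.events = []    # buffered SAX events: (kind, payload)

        def startElement(self, name, attrs):
            if self.depth or attrs.get("xml:lang") == "grc":
                self.depth += 1
                self.events.append(("start", name))
            else:
                self.out.write("<%s>" % name)

        def characters(self, content):
            if self.depth:
                self.events.append(("chars", content))
            else:
                self.out.write(content)

        def endElement(self, name):
            if self.depth:
                self.events.append(("end", name))
                self.depth -= 1
                if self.depth == 0:
                    self.flush()
            else:
                self.out.write("</%s>" % name)

        def flush(self):
            # Now the element's whole text is available, so look ahead,
            # transcode, and replay the queued events with the fixed text.
            text = "".join(c for kind, c in self.events if kind == "chars")
            fixed = fix_sigmas(text)   # same length, so it can be re-sliced
            pos = 0
            for kind, payload in self.events:
                if kind == "start":
                    self.out.write("<%s>" % payload)
                elif kind == "end":
                    self.out.write("</%s>" % payload)
                else:
                    self.out.write(fixed[pos:pos + len(payload)])
                    pos += len(payload)
            self.events = []

    xml.sax.parseString(
        '<p xml:lang="grc">Καί<expan>σ</expan>αροσ</p>'.encode("utf-8"),
        BufferingHandler())

Run on that input, it prints <p>Καί<expan>σ</expan>αρος</p>: the sigma inside <expan> stays medial because the lookahead can see the rest of the word outside the tag, while the genuinely word-final sigma becomes ς.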

That's a lot of work for one letter, but I'm happy to say that it functions well now, and is being used to process the whole DDbDP. Another edge case, which I chose not to solve in the Transcoder program itself, is the problem of the critical apparatus and its contents in TEI. An <app> element can contain a <lem> (lemma) and one or more <rdg> (readings). The problem is that the lemma and the readings are conceptually parallel in the text. For example:

The quick brown <lem>fox</lem> jumped over the lazy dog.
                <rdg>cat</rdg>


The TEI would be:

The quick brown <app><lem>fox</lem><rdg>cat</rdg></app> jumped over the lazy dog

So "cat" follows immediately after "fox" in the text stream, but both words occupy the same space as far as the markup is concerned. In other words, I couldn't rely only on my fancy new lookahead scheme, because it broke down in edge cases like this. The solution I went with is dumb, but effective: format the apparatus so that there is a newline after the lemma (and the reading, if there are multiple readings). That way my code will still be able to figure out what's going on. The whitespace so introduced really needs to be flagged as significant, so that it doesn't get clobbered by other XML processes though. That has already happened to us once. It caused a bug for me too, because I wasn't buffering ignorable whitespace.

All that trouble over one little letter. Lunate sigmas would have made life so much easier...

Sunday, March 16, 2008

D·M·S· Allen Ross Scaife 1960-2008

On Saturday afternoon, March 15th, I learned that my friend Ross had died that morning after a long and hard-fought struggle with cancer. He was at his home in Lexington, Kentucky, surrounded by his family.

Ross was one of the giants of the Digital Classics community. He was the guiding force behind the Stoa, and the founder of many of its projects. Ross was always generous with his time and resources and has been responsible for incubating many fledgling Digital Humanities initiatives. His loss leaves a gap that will be impossible to fill.

Ross was also a good friend, easy to talk to, and always ready to encourage me to experiment with new ideas. I miss him very much.

What he began will continue without him, and though we cannot ever replace Ross, we can honour his memory by carrying on his good work.

update (March 21, 21:04)

Dot posted a lovely obituary of Ross at the Stoa. Tom and several others have posted nice memorials as well.

On a happier note: my daughter, Caroline Emma Ross Cayless was born at 11:52 pm, March 19th.

Wednesday, January 23, 2008

Catching up

My New Year's resolution was to write more, and specifically to blog more, but so far all of my writing has been internal stuff for my job. So I shall have another go...

Speaking of New Year's, I spent a chunk of New Year's Eve getting The Colonial and State Records of North Carolina off the ground.  It's driven by the eXist XML database, of which I've grown rather fond.  XQuery has a lot of promise as a tool for digital humanists with large collections of XML.