Open Library Catalog Data
The Open Libraries blog reported earlier this month that Casey Bisson, information architect at Plymouth State University, was presented with a $50,000 Mellon award for Technology Collaboration by Tim Berners-Lee for his creation of WP-OPAC, a mash-up of a library catalog with the tagging functionality of blogging software.
The revolutionary part of the announcement, however, was that Plymouth State University would use the $50,000 to purchase Library of Congress catalog records and redistribute them free under a Creative Commons Share-Alike license or GNU. OCLC has been the source for catalog records for libraries, and its license restrictions do not permit reuse or distribution. However, catalog records have been shared via Z39.50 for several years without incident.
"Libraries' online presence is broken. We are more than study halls in the digital age. For too long, libraries have been coming up with unique solutions for common problems," Bisson said. "Users are looking for an online presence that serves them in the way they expect." He said, "The intention is to bring together the free or nearly-free services available to the user."
Bisson said Plymouth State University is committed to supporting it, and will be offering it as a free download from its site, likely in the form of sample records plus WordPress with WP-OPAC included. "With nearly 140,000 registered users of Amazon Web Services, it's time to use common solutions for our unique problems," Bisson said.
Following the announcement, there was some low-key discussion about it on the Next Generation Catalog listserv and other library discussion forums congratulating Casey Bisson, but the remarkable nature of the proposal for an open data release of the cataloging records went largely unremarked.
I find it reassuring, therefore, that last week, Ross Singer, Tim Spalding, Rob Styles, Richard Wallis, and Paul Miller recorded a podcast about this award which did focus on the open data aspects of the announcement. It's worth a listen.
Much of the podcast focused on the apparent ambiguity about the legality of Bisson's intention. While LC's data is, at least in the US, free of copyright restrictions, it is not clear that this is true outside the US. Purchase of the data does not in itself change that situation. Certainly, Talis in the UK was mentioned as paying quite a lot (unspecified) to use LC's MARC records. And OCLC, the primary distributor of MARC records to libraries, has, as mentioned above, a set of restrictions that governs the use and transfer of these records by member libraries.
Ultimately, I am left unsure how to sort out the contradictions made apparent by this announcement.
Most importantly, LC's MARC cataloging data is already freely accessible, both by Z39.50 and by their cool SRU gateway. There was a passing mention of SRU in the podcast but they didn't dwell on it.
Construct a valid query to the SRU gateway and you get back a complete cataloging record in MARCXML. Both the MARCXML schema and the SRW schema are fully documented and available for use, and LC's explanation of the MARCXML schema is sufficiently complete that, even without a cataloguer's or programmer's understanding of MARC format, I am confident that MARCXML delivers something that could be imported as is into a library's ILS for cataloguing purposes.
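To make "construct a valid query" a little more concrete, here is a minimal Python sketch of building an SRU searchRetrieve URL. The parameter names (operation, version, query, maximumRecords, recordSchema) come from the SRU 1.1 specification. Everything else is my assumption: the base URL is a placeholder rather than LC's actual gateway address, build_sru_url is a name I made up for illustration, and the "marcxml" schema name and the dc.title index depend on what a given server supports.

```python
from urllib.parse import urlencode

# Placeholder base URL (an assumption): substitute the real address
# of whichever SRU gateway you are querying.
SRU_BASE_URL = "http://sru.example.org/catalog"


def build_sru_url(cql_query, base_url=SRU_BASE_URL, max_records=1):
    """Return a searchRetrieve URL for the given CQL query string."""
    params = {
        "operation": "searchRetrieve",   # standard SRU operation name
        "version": "1.1",                # SRU protocol version
        "query": cql_query,              # the search itself, written in CQL
        "maximumRecords": str(max_records),
        # Short schema name is server-defined; "marcxml" is assumed here.
        "recordSchema": "marcxml",
    }
    # urlencode percent-escapes spaces, quotes, and "=" inside the CQL query.
    return base_url + "?" + urlencode(params)


# Hypothetical title search; the index name depends on the server's setup.
url = build_sru_url('dc.title = "moby dick"')
print(url)
```

Fetching that URL with any HTTP client should return an XML response wrapping the matching record or records, assuming the server accepts the index and schema names used.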
My question, therefore, is why is having this data made available in one huge chunk being hailed as a godsend for open library data, when I can already retrieve the exact record I need at the time I need it via SRU?
Along the same lines, Ross Singer mentioned the bittorrent release of MIT's Barton catalog data, which was followed by an almost immediate retraction of that data over ambiguities about whether OCLC would, in fact, allow this. Again, why would I torrent this massive file when I can go "just in time" via SRU to get the MARC record I actually need?
Furthermore, Ed Summers and others have written code libraries, first in Perl and more recently in Python and Ruby, for manipulating MARCXML. The Ruby libraries, especially, seem designed to ease integrating MARCXML data into web-based applications.
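Even without those libraries, MARCXML is approachable with ordinary XML tooling. The following is a minimal sketch using only Python's standard library, not the libraries mentioned above. The namespace URI is the one defined by the MARCXML schema; the sample record is a hand-written, heavily trimmed illustration rather than a real LC record, and get_subfield is a helper name I invented for the example.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the MARCXML ("MARC21 slim") schema.
MARC_NS = {"marc": "http://www.loc.gov/MARC21/slim"}

# Hand-written illustrative record, NOT real LC data.
SAMPLE_RECORD = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <leader>00000nam a2200000 a 4500</leader>
  <controlfield tag="001">example-0001</controlfield>
  <datafield tag="100" ind1="1" ind2=" ">
    <subfield code="a">Melville, Herman,</subfield>
  </datafield>
  <datafield tag="245" ind1="1" ind2="0">
    <subfield code="a">Moby Dick /</subfield>
    <subfield code="c">Herman Melville.</subfield>
  </datafield>
</record>"""


def get_subfield(record, tag, code):
    """Return the text of the first matching subfield, or None if absent."""
    path = f"marc:datafield[@tag='{tag}']/marc:subfield[@code='{code}']"
    node = record.find(path, MARC_NS)
    return node.text if node is not None else None


record = ET.fromstring(SAMPLE_RECORD)
title = get_subfield(record, "245", "a")   # MARC 245 $a: title proper
author = get_subfield(record, "100", "a")  # MARC 100 $a: main entry, personal name
print(title, author)
```

A dedicated MARC library would handle repeated fields, indicators, and character encoding far more carefully; this only shows that a record is reachable with a few lines of code.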
I'm also surprised that Tim Spalding, who most certainly has amassed a lot of expertise with using LC MARC records in the context of building LibraryThing and who must have had occasion to use the SRU interface that LC provides, didn't talk more about that in the context of Bisson's announcement.
Is the significance of the announcement then, assuming this goes through unchallenged by OCLC, that the ambiguity of ownership of this data will be finally resolved? There was a (again only passing) mention that, since OCLC is a member organization, the members themselves -- y'know, libraries -- can change the rules of distribution for the data.
Lastly, any cataloguer will tell you that having a bibliographic record is, in itself, insufficient and requires local holdings enhancements specific to a given library. And, as Ross Singer wondered, don't we also want the LC authority records released as open data? (Later on that last one, I guess.)
One idea that does linger in my head is the idea of creating a bittorrent distribution channel for library cataloging data. In the podcast, a concern was raised about whether a single server such as Library of Congress's might be severely impacted if a lot of requests were made against its SRU server. If all the hopes of Casey Bisson's gift to the library community are realized, what if libraries were to contribute their individual cataloging and authority records to a global torrent? Again, I don't see the value of a single large file, like MIT's Barton data, over distribution of individual records. In the real world, torrent sharing is mostly at the work-level and that would seem to be the logical way to handle library records.
Posted by Tom on December 18, 2006