When exposing sets of MARC records as linked data, do you think it is better to 
expose them in batch (collection) files or as individual RDF serializations? To 
bastardize the Bard — “To batch or not to batch? That is the question.”

Suppose I am a medium-sized academic research library. Suppose my collection is 
comprises approximately 3.5 million bibliographic records. Suppose I want to 
expose those records via linked data. Suppose further that this will be done by 
“simply” making RDF serialization files (XML, Turtle, etc.) accessible via an 
HTTP file system. No scripts. No programs. No triple stores. Just files on an 
HTTP file system coupled with content negotiation. Given these assumptions, 
would you:

  1. create batches of MARC records, convert them to MARCXML
     and then to RDF, and save these files to disk, or

  2. parse the batches of MARC record sets into individual
     records, convert them into MARCXML and then RDF, and
     save these files to disk?
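
For the "parse the batches into individual records" step of Option #2, here 
is a minimal Python sketch. It assumes the batches are binary MARC (ISO 2709) 
files, in which every record ends with the 0x1D record terminator. The function 
name is hypothetical, and a real workflow would more likely use an established 
MARC library rather than raw byte handling.

```python
RECORD_TERMINATOR = b"\x1d"  # ISO 2709 end-of-record byte


def split_marc_batch(data: bytes) -> list[bytes]:
    """Split the bytes of a binary MARC batch into individual records."""
    records = []
    for chunk in data.split(RECORD_TERMINATOR):
        # Skip the empty remainder after the final terminator,
        # and any stray whitespace between records.
        if chunk.strip():
            records.append(chunk + RECORD_TERMINATOR)
    return records
```

Each returned item is one complete record, terminator included, ready to be 
converted to MARCXML and then to RDF on its own.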

Option #1 would require heavy lifting against large files, but the number of 
resulting files to save to disk would be relatively few, and they could be 
reasonably managed in a single directory. On the other hand, URIs for 
individual records could not be resolved directly to individual 
serializations. A record would only be accessible by retrieving the collection 
file in which it resides. Moreover, a mapping of individual URIs to collection 
files would need to be maintained. 
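
The URI-to-collection-file mapping for Option #1 might look something like the 
sketch below. The base URI, the batch file names, and the function name are all 
placeholders invented for illustration. It also assumes the record identifiers 
in each batch are already known.

```python
def build_collection_index(batches: dict[str, list[str]], base_uri: str) -> dict[str, str]:
    """Map each record URI to the collection file that contains it."""
    index = {}
    for collection_file, record_ids in batches.items():
        for record_id in record_ids:
            # One entry per record; this table has to be rebuilt
            # whenever the batches are regenerated.
            index[base_uri + record_id] = collection_file
    return index
```

With 3.5 million records this table has 3.5 million entries, which is the 
maintenance burden described above.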

Option #2 would be easier on the computing resources because processing little 
files is generally easier than processing bigger ones. On the other hand, the 
number of files generated by this option is not easily managed without the 
use of a sophisticated directory structure. (It is not feasible to put 3.5 
million files in a single directory.) I would also still need to create a 
mapping from URI to directory.
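
One possible directory structure for Option #2 is to derive the path from the 
record identifier itself, so the URI-to-directory mapping becomes a rule rather 
than a table. The sketch below assumes numeric record identifiers, a root 
directory named "records", and an ".rdf" extension. None of these come from an 
actual system.

```python
from pathlib import PurePosixPath


def sharded_path(record_id: str, root: str = "records",
                 width: int = 9, extension: str = ".rdf") -> PurePosixPath:
    """Derive a two-level directory path from a record identifier."""
    # Zero-pad so every identifier has the same length, then use the
    # first two groups of three digits as directory names.
    padded = record_id.zfill(width)
    return PurePosixPath(root, padded[0:3], padded[3:6], padded + extension)
```

For example, sharded_path("1234567") gives records/001/234/001234567.rdf. If 
the identifiers are roughly sequential, no directory holds more than 1,000 
entries.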

In either case, I would probably create a bunch of site map files denoting the 
locations of my serializations — YAP (Yet Another Mapping).
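
As a rough sketch of the site map step: the sitemaps.org protocol caps each 
site map file at 50,000 URLs, so 3.5 million records would need at least 70 
files. The function name below is hypothetical, and the list of URIs is assumed 
to exist already.

```python
from xml.sax.saxutils import escape

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS = 50_000  # per-file limit in the sitemaps.org protocol


def build_sitemaps(uris: list[str], max_urls: int = MAX_URLS) -> list[str]:
    """Return one sitemap XML document per chunk of at most max_urls URIs."""
    sitemaps = []
    for start in range(0, len(uris), max_urls):
        chunk = uris[start:start + max_urls]
        lines = ['<?xml version="1.0" encoding="UTF-8"?>',
                 f'<urlset xmlns="{SITEMAP_NS}">']
        # Escape &, < and > so each URI is valid XML text.
        lines += [f"  <url><loc>{escape(uri)}</loc></url>" for uri in chunk]
        lines.append("</urlset>")
        sitemaps.append("\n".join(lines))
    return sitemaps
```

Each string in the result would be written to its own file, and a site map 
index file could then list those files.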

I’m leaning towards Option #2 because individual URIs could be resolved more 
easily with “simple” content negotiation.

(Given my particular use case — archival MARC records — I don’t think I’d 
really have more than a few thousand items, but I’m asking the question on a 
large scale anyway.)

—
Eric Morgan
