Re: [CODE4LIB] transforming marc to rdf
I have created an initial pile of RDF, mostly. I am in the process of experimenting with linked data for archives. My goal is to use existing (EAD and MARC) metadata to create RDF/XML, and then to expose this RDF/XML using linked data principles. Once I get that far I hope to slurp up the RDF/XML into a triple store, analyse the data, and learn how the whole process could be improved. This is what I have done to date:

* accumulated sets of EAD files and MARC records
* identified and cached a few XSL stylesheets transforming EAD and MARCXML into RDF/XML
* wrote a couple of Perl scripts that combine Bullet #1 and Bullet #2 to create HTML and RDF/XML
* wrote a mod_perl module implementing rudimentary content negotiation
* made the whole thing (scripts, sets of data, HTML, RDF/XML, etc.) available on the Web

You can see the fruits of these labors at http://infomotions.com/sandbox/liam/, and there you will find a few directories:

* bin - my Perl scripts live here as well as a couple of support files
* data - full of RDF/XML files -- about 4,000 of them
* etc - mostly stylesheets
* id - a placeholder for the URIs and content negotiation
* lib - where the actual content negotiation script lives
* pages - HTML versions of the original metadata
* src - a cache for my original metadata
* tmp - things of brief importance; mostly trash

My Perl scripts read the metadata, create HTML and RDF/XML, and save the results in the pages and data directories, respectively. A person can browse these directories, but browsing will be difficult because there is nothing there except cryptic file names. Selecting any of the files should return valid HTML or RDF/XML. Each cryptic name is the leaf of a URI prefixed with http://infomotions.com/sandbox/liam/id/. For example, if the leaf is mshm510, then the combined prefix and leaf form a resolvable URI -- http://infomotions.com/sandbox/liam/id/mshm510.
When the user-agent says it can accept text/html, the HTTP server redirects the user-agent to http://infomotions.com/sandbox/liam/pages/mshm510.html. If the user-agent does not request a text/html representation, then the RDF/XML version is returned -- http://infomotions.com/sandbox/liam/data/mshm510.rdf. This is rudimentary content negotiation. For a good time, here are a few actionable URIs:

* http://infomotions.com/sandbox/liam/id/4042gwbo
* http://infomotions.com/sandbox/liam/id/httphdllocgovlocmusiceadmusmu004002
* http://infomotions.com/sandbox/liam/id/ma117
* http://infomotions.com/sandbox/liam/id/mshm509
* http://infomotions.com/sandbox/liam/id/stcmarcocm11422551
* http://infomotions.com/sandbox/liam/id/vilmarcvil_155543

Feed them to the W3C RDF Validator. The next step is to figure out how to handle file-not-found errors when a URI does not exist. Another thing to figure out is how to make potential robots aware of the data set. The bigger problem is simply to make the dataset more meaningful through the inclusion of more URIs in the RDF/XML as well as the use of a more consistent and standardized set of ontologies. Fun with linked data? — Eric Morgan
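Eric's mod_perl module is not shown in the message, but the redirect logic he describes can be sketched in a few lines of Python. This is a hypothetical re-implementation, not his actual code:

```python
# A rough sketch of the rudimentary content negotiation described above:
# redirect text/html requests to the HTML page, and everything else to
# the RDF/XML serialization.

BASE = "http://infomotions.com/sandbox/liam"

def negotiate(leaf, accept_header):
    """Given a URI leaf (e.g. 'mshm510') and an HTTP Accept header,
    return the URL of the representation to redirect to."""
    if "text/html" in accept_header:
        return f"{BASE}/pages/{leaf}.html"
    # anything else gets the RDF/XML serialization
    return f"{BASE}/data/{leaf}.rdf"
```

A browser sending Accept: text/html,application/xhtml+xml would be redirected to the HTML page, while a linked data agent asking for application/rdf+xml would receive the RDF/XML file.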
Re: [CODE4LIB] transforming marc to rdf
On 12/5/13 8:11 AM, Eric Lease Morgan wrote: Where will I get the URIs from? I will get them by combining some sort of unique code (like an OCLC symbol) or namespace with the value of the MARC records' 001 fields. You actually need 3 URIs per triple:

* subject URI (which is what I believe you are creating, above)
* predicate URI (the data element URI, like http://purl.org/dc/terms/title)
* object URI (the URI for the data you are providing, like http://id.loc.gov/authorities/names/n94036700)

The first two MUST be URIs. The third SHOULD be a URI but can also be a string. However, strings, in the linked data space, do NOT LINK. If you only have strings in the object/value space then you can run searches against your data, but your data cannot link to other data. Creating linked data that doesn't link isn't terribly useful. (In case this doesn't make sense to anyone reading, I have a slide deck that illustrates this. I've uploaded it to: http://kcoyle.net/presentations/3webIntro.pptx ) A key first step for all of us is to start getting identifiers into our data, even before we start thinking about linked data. MARC records in systems that recognize authority control should be able to store or provide on output the URI of every authority-controlled entity. This should not be terribly difficult (ok, famous last words, I know). But if your vendor system can flip headings then it should also be able to provide a URI (especially since LC has conveniently made their URIs derivable from the LC record numbers). With identifiers for things, THEN you are really linking. kc -- Karen Coyle kco...@kcoyle.net http://kcoyle.net m: 1-510-435-8234 skype: kcoylenet
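The string-versus-URI distinction can be made concrete with two hand-built statements in N-Triples, the simplest RDF serialization. The subject URI and the title string below are invented for illustration; the name-authority URI is the one from the message above, and dcterms:creator is my choice of predicate for it:

```python
# Two triples built by hand as N-Triples strings.
subject = "<http://example.org/record/001234>"  # hypothetical subject URI

# Object as a quoted literal: searchable, but it links to nothing.
literal_triple = f'{subject} <http://purl.org/dc/terms/title> "Walden" .'

# Object as a URI: this statement actually links to another node
# in the web of data (here, an id.loc.gov name authority).
uri_triple = (f"{subject} <http://purl.org/dc/terms/creator> "
              f"<http://id.loc.gov/authorities/names/n94036700> .")
```

Only the second statement can be followed by a linked data agent; the first can merely be searched.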
Re: [CODE4LIB] transforming marc to rdf
I have successfully been able to begin the systematic transformation of EAD and MARC to RDF/XML, and consequently been able to literally illustrate the resulting triples. [1, 2] From the blog posting [3]: The resulting images are huge, and the astute/diligent reader will see a preponderance of literals in the results. This is not a good thing, but it is all that is available right now. On the other hand, the same astute/diligent reader will see the root of the RDF/XML pointing to a meaningful URI. This URI will be resolvable in the near future via content negotiation. This is a simple first step. The next step will be to apply this process to an entire collection of EAD files and MARC records. After that, two other things can happen: 1) the original metadata files can begin to include URIs, and 2) the XSL used to process the metadata can employ a more standardized ontology. It is not an easy process, but it is a beginning. Right now, something is better than nothing. [1] EAD illustration - http://sites.tufts.edu/liam/files/2013/12/hou00096.png [2] MARC illustration - http://sites.tufts.edu/liam/files/2013/12/003078076.png [3] blog posting - http://sites.tufts.edu/liam/2013/12/06/illustrating-rdf/ — Eric Morgan
Re: [CODE4LIB] transforming marc to rdf
On Dec 5, 2013, at 12:35 PM, Ross Singer rossfsin...@gmail.com wrote: You still haven't really answered my question about what you're hoping to achieve and who stands to benefit from it. I don't see how assigning a bunch of arbitrary identifiers, properties, and values to a description of a collection of archival materials benefits anyone (especially since you're talking about doing this in XSLT, so your archival collections can't even really be related to /each other/, much less anything else). Who is going to use this data? What are they supposed to do with it? What will libraries and archives get from it? My goal is three-fold:

* to describe to the neophyte what linked data is and why they should care
* to describe to the archivist who appreciates the value of linked data but does not know how to achieve its goals some possible approaches to improving their metadata, specifically the robust inclusion of URIs
* to describe to the technologist the principles of archival practice, to make them understand that things like EAD files describe “collections” and not necessarily individual things, and moreover to demonstrate the utter simplicity of linked data principles

Yes, the EAD files and thus the RDF/XML, etc. will not necessarily be linked to other things. That’s the point. By implementing my recipe, I will demonstrate to both the archivist and the technologist the need to work differently in order to achieve the linked data goal. My goal is not necessarily to provide a robust information system. While the information system I create will be useful, it is not intended to be the be-all end-all of linked data for archivists. In fact, it will painfully illustrate the deficiencies in our existing practices. Linked data suffers from a chicken-and-egg problem. By implementing my simple recipe, I believe I will be making it easier for the community to lay an egg. — Eric Lease Morgan
Re: [CODE4LIB] transforming marc to rdf
On Dec 5, 2013, at 1:17 PM, Kevin Ford k...@3windmills.com wrote: Frankly, I don't see how you can generate RDF that anybody would want to use from XSLT: where would your URIs come from? What, exactly, are you modeling? -- Our experience getting to good, URI rich RDF has been basically a two-step process. First there is the raw conversion, which certainly results in verbose blank-node-rich RDF, but we follow that pass with a second one during which blank nodes are replaced with URIs. The posting above is exactly the approach I am advocating. As long as the linked data is not incorrect but merely not best practice, then implement linked data with what one has in hand. This will accomplish two goals: 1) make cultural heritage institution metadata more widely available, and 2) provide practice for the technologist for implementation. Once the data is available, then enhance it and repeat the process. It is a never-ending thing. —Eric Morgan
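The two-pass approach described above (a raw conversion producing blank-node-rich RDF, then a second pass replacing blank nodes with URIs) might be sketched like this. It is a toy illustration, not the actual code: the URI prefix and the counter-based minting are my assumptions, and a real second pass would derive URIs from identifiers found in the data:

```python
# A toy second pass: walk a set of triples and replace every blank node
# (here, any term starting with "_:") with a freshly minted URI.
# Hypothetical: real code would mint URIs from record identifiers,
# not from a running counter.

BASE = "http://example.org/resource/"  # assumed URI prefix

def replace_blank_nodes(triples):
    mapping = {}  # blank node label -> minted URI

    def resolve(term):
        if term.startswith("_:"):
            if term not in mapping:
                mapping[term] = f"<{BASE}{len(mapping) + 1}>"
            return mapping[term]
        return term

    return [(resolve(s), p, resolve(o)) for s, p, o in triples]
```

The essential property is that every occurrence of the same blank node maps to the same URI, so the shape of the graph is preserved.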
Re: [CODE4LIB] transforming marc to rdf
Hi Eric, you seem to have missed the Catmandu tutorial at SWIB13. Luckily there is a basic tutorial and a demo online: http://librecat.org/ The demo happens to be about transforming MARC to RDF using the Catmandu Perl framework. It gives you full flexibility by separating the importer from the exporter and providing a domain specific language for “fixing” the data in between. Catmandu also has easy-to-use wrappers for popular search engines and databases (both SQL and NoSQL), making it a complete ETL (extract, transform, load) toolkit. Disclosure: I am a Catmandu contributor. It's free and open source software. Cheers, Christian On Wed, Dec 04, 2013 at 09:59:46PM -0500, Eric Lease Morgan wrote: Converting MARC to RDF has been more problematic. There are various tools enabling me to convert my original MARC into MARCXML and/or MODS. After that I can reportedly use a few tools to convert to RDF:

* MARC21slim2RDFDC.xsl [3] - functions, but even for my tastes the resulting RDF is too vanilla. [4]
* modsrdf.xsl [5] - optimal, but when I use my transformation engine (Saxon), I do not get XML but rather plain text
* BIBFRAME Tools [6] - sports nice ontologies, but the online tools won’t scale for large operations

-- Christian Pietsch · http://www.ub.uni-bielefeld.de/~cpietsch/ LibTec · Library Technology and Knowledge Management Bielefeld University Library, Bielefeld, Germany
Re: [CODE4LIB] transforming marc to rdf [comet]
On Dec 4, 2013, at 10:29 PM, Corey A Harper corey.har...@nyu.edu wrote: Have you had a look at Ed Chamberlain's work on COMET: https://github.com/edchamberlain/COMET It's been a while since I've run this, but if I remember correctly, it was fairly easy-to-use. Thank you for the pointer. I downloaded the COMET “suite”, and got good output, but only after I enhanced/tweaked the source code to require the Perl Encode module: ./marc2rdf_batch.pl pamphlets.marc The result was a huge set of triples saved as RDF/Turtle. I then used a Java archive (RDF2RDF [1]) to painlessly convert the Turtle to RDF/XML. The process worked. It was “easy” for me, sort of, but it employs quite a number of sophisticated underlying technologies. I could integrate everything into a whole, but… On to explore other options. [1] RDF2RDF - http://www.l3s.de/~minack/rdf2rdf/ — Sleepless In South Bend
Re: [CODE4LIB] transforming marc to rdf [mods_rdfizer]
On Dec 4, 2013, at 10:29 PM, Corey A Harper corey.har...@nyu.edu wrote: Also, though much older, I seem to remember the Simile MARC RDFizer being a pretty straightforward one to run: http://simile.mit.edu/wiki/MARC/MODS_RDFizer MODS aficionados will point to some problems with some of its choices for representing that data, but still a good starting point (IMO). Again, thanks for the pointer. I downloaded MODS_RDFizer and got it to run, but it was a good thing that I already had mvn installed. The output did create an RDF/XML file, and I concur, the implemented ontology is “interesting”. The distribution includes a possibly cool stylesheet — mods2rdf.xslt. Maybe I can use this. Hmm… —Still Sleepless
Re: [CODE4LIB] transforming marc to rdf [mods_rdfizer]
On Dec 5, 2013, at 6:54 AM, Eric Lease Morgan emor...@nd.edu wrote: http://simile.mit.edu/wiki/MARC/MODS_RDFizer ...The distribution includes a possibly cool stylesheet — mods2rdf.xslt. Ah ha! The MODS_RDFizer’s mods2rdf.xslt file functioned very well against one of my MODS files: $ xsltproc mods2rdf.xslt pamphlets.mods > pamphlets.rdf Mods2rdf.xslt could very easily be configured at the beginning of the file to suit the needs of a local “cultural heritage institution”. I like the use of XSL to create serialized RDF as opposed to the use of an application because less infrastructure is needed to make things happen. — Too Much Coffee?
Re: [CODE4LIB] transforming marc to rdf [catmandu]
On Dec 5, 2013, at 3:07 AM, Christian Pietsch chr.pietsch+web4...@googlemail.com wrote: you seem to have missed the Catmandu tutorial at SWIB13. Luckily there is a basic tutorial and a demo online: http://librecat.org/ I did attend SWIB13, and I really wanted to go to the Catmandu workshop, but since I’m a Perl “aficionado” I figured I could play with it later on my own. Instead I attended the workshop on provenance. (Travelogue is pending.) In any event, playing with the Catmandu demo was insightful. [1] I see and understand the workflow: import data, fix it, store it, fix it, export it. I see how it is designed to use many import and export formats. The key to the software seems to be two-fold: 1) the ability to read and write Perl programs, and 2) understanding Catmandu’s “fix” language. There are great possibilities here for us Perl folks. Thank you for re-bringing it to my attention. [1] demo - http://demo.librecat.org — Eric Lease Morgan
Re: [CODE4LIB] transforming marc to rdf
Eric, I'm having a hard time figuring out exactly what you're hoping to get. Going from MARC to RDF was my great white whale for years while Talis' main business interests involved both of those (although not archival collections). Anything that will remodel MARC to (decent) RDF is going to be:

- Non-trivial to install
- Non-trivial to use
- Slow
- Require massive amounts of memory/disk space

Choose any two. Frankly, I don't see how you can generate RDF that anybody would want to use from XSLT: where would your URIs come from? What, exactly, are you modeling? I guess, to me, it would be a lot more helpful for you to take an archival MARC record, and, by hand, build an RDF graph from it, then figure out your mappings. I just don't see any way to make it easy-to-use, at least, not until you have an agreed upon model to map to. -Ross. On Thu, Dec 5, 2013 at 3:07 AM, Christian Pietsch chr.pietsch+web4...@googlemail.com wrote: Hi Eric, you seem to have missed the Catmandu tutorial at SWIB13. Luckily there is a basic tutorial and a demo online: http://librecat.org/ The demo happens to be about transforming MARC to RDF using the Catmandu Perl framework. It gives you full flexibility by separating the importer from the exporter and providing a domain specific language for “fixing” the data in between. Catmandu also has easy to use wrappers for popular search engines and databases (both SQL and NoSQL), making it a complete ETL (extract, transform, load) toolkit. Disclosure: I am a Catmandu contributor. It's free and open source software. Cheers, Christian On Wed, Dec 04, 2013 at 09:59:46PM -0500, Eric Lease Morgan wrote: Converting MARC to RDF has been more problematic. There are various tools enabling me to convert my original MARC into MARCXML and/or MODS. After that I can reportedly use a few tools to convert to RDF: * MARC21slim2RDFDC.xsl [3] - functions, but even for my tastes the resulting RDF is too vanilla.
[4] * modsrdf.xsl [5] - optimal, but when I use my transformation engine (Saxon), I do not get XML but rather plain text * BIBFRAME Tools [6] - sports nice ontologies, but the online tools won’t scale for large operations -- Christian Pietsch · http://www.ub.uni-bielefeld.de/~cpietsch/ LibTec · Library Technology and Knowledge Management Bielefeld University Library, Bielefeld, Germany
Re: [CODE4LIB] transforming marc to rdf [to batch or not to batch]
When exposing sets of MARC records as linked data, do you think it is better to expose them in batch (collection) files or as individual RDF serializations? To bastardize the Bard — “To batch or not to batch? That is the question.” Suppose I am a medium-sized academic research library. Suppose my collection is comprised of approximately 3.5 million bibliographic records. Suppose I want to expose those records via linked data. Suppose further that this will be done by “simply” making RDF serialization files (XML, Turtle, etc.) accessible via an HTTP filesystem. No scripts. No programs. No triple stores. Just files on an HTTP file system coupled with content negotiation. Given these assumptions, would you:

1. create batches of MARC records, convert them to MARCXML and then to RDF, and save these files to disc, or
2. parse the batches of MARC record sets into individual records, convert them into MARCXML and then RDF, and save these files to disc

Option #1 would require heavy lifting against large files, but the number of resulting files to save to disc would be relatively few — reasonably managed in a single directory on disc. On the other hand, individual URIs pointing to individual serializations would not be accessible. They would only be accessible by retrieving the collection file in which they reside. Moreover, a mapping of individual URIs to collection files would need to be maintained. Option #2 would be easier on the computing resources because processing little files is generally easier than processing bigger ones. On the other hand, the number of files generated by this option cannot easily be managed without the use of a sophisticated directory structure. (It is not feasible to put 3.5 million files in a single directory.) But I would still need to create a mapping from URI to directory. In either case, I would probably create a bunch of site map files denoting the locations of my serializations — YAP (Yet Another Mapping).
I’m leaning towards Option #2 because individual URIs could be resolved more easily with “simple” content negotiation. (Given my particular use case — archival MARC records — I don’t think I’d really have more than a few thousand items, but I’m asking the question on a large scale anyway.) — Eric Morgan
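The 3.5-million-file problem in Option #2 is commonly tamed by sharding the files into a directory tree derived from each record's identifier. A minimal sketch of one such scheme (the hash-prefix layout is my assumption, not anything proposed in the thread):

```python
import hashlib
from pathlib import PurePosixPath

def shard_path(record_id, depth=2, width=2):
    """Map a record identifier to a sharded file path, e.g.
    'stcmarcocm11422551' -> 'data/xx/yy/stcmarcocm11422551.rdf',
    so that no single directory ever holds millions of files."""
    digest = hashlib.md5(record_id.encode("utf-8")).hexdigest()
    parts = [digest[i * width:(i + 1) * width] for i in range(depth)]
    return str(PurePosixPath("data", *parts, f"{record_id}.rdf"))
```

Because the path is computed from the identifier, the URI-to-file mapping is a pure function: the content negotiation script can derive the location of any serialization without consulting a separate mapping file.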
Re: [CODE4LIB] transforming marc to rdf
On Dec 5, 2013, at 8:55 AM, Ross Singer rossfsin...@gmail.com wrote: Eric, I'm having a hard time figuring out exactly what you're hoping to get. Going from MARC to RDF was my great white whale for years while Talis' main business interests involved both of those (although not archival collections). Anything that will remodel MARC to (decent) RDF is going to be: - Non-trivial to install - Non-trivial to use - Slow - Require massive amounts of memory/disk space Choose any two. Frankly, I don't see how you can generate RDF that anybody would want to use from XSLT: where would your URIs come from? What, exactly, are you modeling? I guess, to me, it would be a lot more helpful for you to take an archival MARC record, and, by hand, build an RDF graph from it, then figure out your mappings. I just don't see any way to make it easy-to-use, at least, not until you have an agreed upon model to map to. Ross, good questions. I’m hoping to articulate and implement a simple and functional method for exposing EAD and MARC metadata as linked data. “Simple and functional” are the operative words; I’m not necessarily looking for “fast”, “best”, nor “perfect”. I am trying to articulate something that requires the least amount of infrastructure and technical expertise. Reasonable RDF through XSLT? Good point. I like the use of XSLT because it does not require very much technical infrastructure — just ubiquitous XSLT processors like Saxon or xsltproc. I have identified two or three stylesheets transforming MARCXML/MODS into RDF/XML.

1. The first comes from the Library of Congress and uses Dublin Core as its ontology, but the resulting RDF has no URIs and the Dublin Core is not good enough, even for my tastes. [1]
2. The second also comes from the Library of Congress, and it uses a richer, more standard ontology, but I can’t get it to work. All I get as output is a plain text file. I must be doing something wrong. [2]
3. I found the third stylesheet buried in the MARC/MODS RDFizer.
The sheet uses XSLT 1.0, which is good for my xsltproc-like tools. I get output, which is better than Sheet #2. The ontology is a bit MIT-specific, but it is one heck of a lot richer than Sheet #1. Moreover, the RDF includes URIs. [3, 4] In none of these cases will the ontology be best nor perfect, but for right now I don’t care. The ontology is good enough. Heck, the ontologies don’t even come close to the ontology I get when transforming my EAD to RDF using the Archives Hub stylesheet. [5] I just want to expose the content as linked data. Somebody else — the community — can come behind to improve the stylesheets and their ontologies. Where will I get the URIs from? I will get them by combining some sort of unique code (like an OCLC symbol) or namespace with the value of the MARC records' 001 fields. Here is an elaboration of my original recipe for making MARC metadata accessible via linked data:

1. obtain a set of MARC records
2. parse out a record from the set
3. convert it to MARCXML
4. transform MARCXML into HTML
5. transform MARCXML into RDF (probably through MODS first)
6. save HTML and RDF to disc
7. update a mapping file / data structure denoting where things are located
8. go to Step #2 for each record in the set
9. use the mapping to create a set of site map files
10. use the mapping to support HTTP content negotiation
11. create an index.html file allowing humans to browse the collection as well as point robots to the RDF
12. for extra credit, import all the RDF into a triple store and provide access via SPARQL

I think I can do the same thing with EAD files. Moreover, I think I can do this with a small number of (Perl) scripts easily readable by others, enabling them to implement the scripts in a programming language of their choice. Once I get this far, metadata experts can improve the ontologies, and computer scientists can improve the infrastructure. In the meantime the linked data can be harvested for the good purposes for which linked data was articulated.
It is in my head. It really is. All I need is the time, focus, and energy to implement it. On my mark. Get set. Go. [1] MARC21slim2RDFDC.xsl - http://www.loc.gov/standards/marcxml/xslt/MARC21slim2RDFDC.xsl [2] modsrdf.xsl - http://www.loc.gov/standards/mods/modsrdf/xsl-files/modsrdf.xsl [3] mods2rdf.xslt - http://infomotions.com/tmp/mods2rdf.xslt [4] MARC/MODS RDFizer - http://simile.mit.edu/wiki/MARC/MODS_RDFizer [5] ead2rdf.xsl - http://data.archiveshub.ac.uk/xslt/ead2rdf.xsl — Eric Lease Morgan
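The per-record loop in the recipe above might be skeletonized as follows. Everything here is hypothetical scaffolding: the transform functions are stubs standing in for the real MARC and XSLT tools named in the footnotes, and the URI prefix is invented; only the URI-minting and bookkeeping pattern is the point:

```python
# A skeletal version of the per-record loop in the recipe above.
# The transforms are stubs (real code would shell out to MARC tools
# and stylesheets like mods2rdf.xslt); only the bookkeeping is shown.

def to_marcxml(record):   # stub for MARC -> MARCXML conversion
    return f"<record>{record}</record>"

def to_html(marcxml):     # stub for an XSLT pass producing HTML
    return f"<html><body>{marcxml}</body></html>"

def to_rdf(marcxml):      # stub for an XSLT pass producing RDF/XML
    return f"<rdf:RDF>{marcxml}</rdf:RDF>"

def mint_uri(symbol, field_001):
    """Combine a unique code (like an OCLC symbol) with the record's
    001 field, as described in the message. The prefix is invented."""
    return f"http://example.org/id/{symbol}{field_001}"

def process(records, symbol="stc"):
    mapping = {}  # URI -> (html location, rdf location)
    for field_001, record in records:
        marcxml = to_marcxml(record)
        html, rdf = to_html(marcxml), to_rdf(marcxml)
        uri = mint_uri(symbol, field_001)
        # real code would write html and rdf to the paths below, then
        # use the mapping for site maps and content negotiation
        mapping[uri] = (f"pages/{symbol}{field_001}.html",
                        f"data/{symbol}{field_001}.rdf")
    return mapping
```

The returned mapping is the "YAP" of the earlier message: it drives both the site map files and the content negotiation.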
Re: [CODE4LIB] transforming marc to rdf
On Thu, Dec 5, 2013 at 11:11 AM, Eric Lease Morgan emor...@nd.edu wrote: I’m hoping to articulate and implement a simple and functional method for exposing EAD and MARC metadata as linked data. Isn't the point of this to expose archival description as linked data? What about description maintained in applications like a collection management system, say, ArchivesSpace or Archivists' Toolkit? Mark -- Mark A. Matienzo m...@matienzo.org Director of Technology, Digital Public Library of America
Re: [CODE4LIB] transforming marc to rdf
On Dec 5, 2013, at 11:17 AM, Mark A. Matienzo mark.matie...@gmail.com wrote: I’m hoping to articulate and implement a simple and functional method for exposing EAD and MARC metadata as linked data. Isn't the point of this to expose archival description as linked data? What about description maintained in applications like a collection management system, say, ArchivesSpace or Archivists' Toolkit? Good question! At the very least, these applications (ArchivesSpace, Archivists’ Toolkit, etc.) can regularly and systematically export their data as EAD, and the EAD can be made available as linked data. It would be ideal if the applications were to natively make their metadata available as linked data, but exporting their content as EAD is a functional stopgap solution. —Eric Morgan
Re: [CODE4LIB] transforming marc to rdf
On Thu, Dec 5, 2013 at 11:26 AM, Eric Lease Morgan emor...@nd.edu wrote: Good question! At the very least, these applications (ArchivesSpace, Archivists’ Toolkit, etc.) can regularly and systematically export their data as EAD, and the EAD can be made available as linked data. It would be ideal if the applications were to natively make their metadata available as linked data, but exporting their content as EAD is a functional stopgap solution. —Eric Morgan Wouldn't it make more sense, especially with a system like ArchivesSpace, which provides a backend HTTP API and a public UI, to publish linked data directly instead of adding yet another stopgap? Mark -- Mark A. Matienzo m...@matienzo.org Director of Technology, Digital Public Library of America
Re: [CODE4LIB] transforming marc to rdf
On Dec 5, 2013, at 11:33 AM, Mark A. Matienzo mark.matie...@gmail.com wrote: At the very least, these applications (ArchivesSpace, Archivists’ Toolkit, etc.) can regularly and systematically export their data as EAD, and the EAD can be made available as linked data. Wouldn't it make more sense, especially with a system like ArchivesSpace, which provides a backend HTTP API and a public UI, to publish linked data directly instead of adding yet another stopgap? Publishing via a content management system would make more sense if: 1. the archivist uses the specific content management system, and 2. the content management system supports the functionality. “There is more than one way to skin a cat.” There are advantages and disadvantages to every software solution. — Eric
Re: [CODE4LIB] transforming marc to rdf
With apologies to Eric and others from the LiAM project, I feel like I want to jump in here with a little more context. Eric, or Aaron, or Anne, please feel free to correct any of what I say below. I agree with the points made and concerns raised by both Ross and Mark -- most significantly, that a sustainable infrastructure for linked archival metadata is not going to come from an XSLT stylesheet. However, I also see tremendous value in what Eric is putting together here. The prospectus for the LiAM project, which is the context for Eric's questions, is about developing guiding principles and educational tools for the archival community to better understand, prepare for, and contribute to the kind of infrastructure both Ross and Mark are talking about: http://sites.tufts.edu/liam/deliverables/prospectus-for-linked-archival-metadata-a-guidebook/ While I agree that converting legacy data in EAD and MARC formats to RDF is not the approach this work will take in the future, I also believe that these are formats that the archival community is very familiar with, and XSLT is a tool that many archivists work with regularly. A workflow for that community to experiment with is a laudable goal. In short, I think we need approaches that illustrate the potential of linked data in archives, to highlight some of the shortcomings in our current metadata management frameworks, to help archivists be in a position to get their metadata ready for what Mark is describing in the context of ArchivesSpace (e.g. please use id attributes in c tags!!), and to have a more complete picture of why doing so is of some value. Sorry for the long message, and I hope that the context is helpful. Regards, -Corey On Thu, Dec 5, 2013 at 11:33 AM, Mark A. Matienzo mark.matie...@gmail.comwrote: On Thu, Dec 5, 2013 at 11:26 AM, Eric Lease Morgan emor...@nd.edu wrote: Good question! At the very least, these applications (ArchivesSpace, Archivists’ Toolkit, etc.)
can regularly and systematically export their data as EAD, and the EAD can be made available as linked data. It would be ideal if the applications where to natively make their metadata available as linked data, but exporting their content as EAD is a functional stopgap solution. —Eric Morgan Wouldn't it make more sense, especially with a system like ArchivesSpace, which provides a backend HTTP API and a public UI, to publish linked data directly instead of adding yet another stopgap? Mark -- Mark A. Matienzo m...@matienzo.org Director of Technology, Digital Public Library of America -- Corey A Harper Metadata Services Librarian New York University Libraries 20 Cooper Square, 3rd Floor New York, NY 10003-7112 212.998.2479 corey.har...@nyu.edu
Re: [CODE4LIB] transforming marc to rdf
I've been following this conversation as a non-coder. I'm really interested in getting a better understanding of linked data and how to use existing metadata for proof of concept linked data outputs. So, I totally think Eric's approaches are valuable and would be something I would use. I also understand there are many ways to do something better and more in the flow. So, just encouraging you all to keep posting thoughts in both directions! Best, Lisa - Elizabeth Lisa McAulay Librarian for Digital Collection Development UCLA Digital Library Program http://digital.library.ucla.edu/ email: emcaulay [at] library.ucla.edu From: Code for Libraries [CODE4LIB@LISTSERV.ND.EDU] on behalf of Eric Lease Morgan [emor...@nd.edu] Sent: Thursday, December 05, 2013 8:57 AM To: CODE4LIB@LISTSERV.ND.EDU Subject: Re: [CODE4LIB] transforming marc to rdf On Dec 5, 2013, at 11:33 AM, Mark A. Matienzo mark.matie...@gmail.com wrote: At the very least, these applications (ArchivesSpace, Archivists’ Toolkit, etc.) can regularly and systematically export their data as EAD, and the EAD can be made available as linked data. Wouldn't it make more sense, especially with a system like ArchivesSpace, which provides a backend HTTP API and a public UI, to publish linked data directly instead of adding yet another stopgap? Publishing via a content management system would make more sense, if: 1. the archivist uses the specific content management system 2. the content management system supported the functionality “There is more than one way to skin a cat.” There are advantages and disadvantages to every software solution. — Eric
Re: [CODE4LIB] transforming marc to rdf
On Thu, Dec 5, 2013 at 11:57 AM, Eric Lease Morgan emor...@nd.edu wrote: On Dec 5, 2013, at 11:33 AM, Mark A. Matienzo mark.matie...@gmail.com wrote: Wouldn't it make more sense, especially with a system like ArchivesSpace, which provides a backend HTTP API and a public UI, to publish linked data directly instead of adding yet another stopgap? Publishing via a content management system would make more sense, if: 1. the archivist uses the specific content management system 2. the content management system supported the functionality “There is more than one way to skin a cat.” There are advantages and disadvantages to every software solution. I recognize that not everyone uses a collection management system and instead may author description using EAD or something else directly, but I think we really need to acknowledge the affordances of that kind of software here. I can tell you for certain there are certain aspects of the ArchivesSpace data model that are not serializable in any good way - or at all - using EAD or MARC. Per Corey's message: I have no objection in principle to using XSLT to provide examples of ways to do this transformation (I know lots of people have piles of existing EAD) as long as the resulting data is acknowledged to be less than ideal. EAD is also not a data model; it's a document model for a finding aid. EAD3 will improve this somewhat, but it's still not a representation of a conceptual model of archival entities. My concern about using something like XSLT *specifically* to transform archival description stored in MARC is that the existing stylesheets assume that the MARC description is bibliographic description. Archival description is not bibliographic description. Mark
Re: [CODE4LIB] transforming marc to rdf
On Thu, Dec 5, 2013 at 11:57 AM, Eric Lease Morgan emor...@nd.edu wrote:

“There is more than one way to skin a cat.” There are advantages and disadvantages to every software solution.

I think what Mark and I are trying to say is that the first step to this solution is not applying software to existing data, but trying to figure out the problem you're actually trying to solve. Any linked data future cannot be as simple as a technologist giving some magic tool to archivists and librarians.

You still haven't really answered my question about what you're hoping to achieve and who stands to benefit from it. I don't see how assigning a bunch of arbitrary identifiers, properties, and values to a description of a collection of archival materials accomplishes that (especially since you're talking about doing this in XSLT, so your archival collections can't even really be related to /each other/, much less anything else). Who is going to use this data? What are they supposed to do with it? What will libraries and archives get from it?

I am certainly not above academic exercises (or without my own), but I can see absolutely *no* beneficial archival linked data created simply by pointing an XSLT at a bunch of EAD and MARCXML, and I certainly can't without a clear vision of the model that said XSLT is supposed to generate. The key part here is the data model, and taking a 'software solution'-first approach does nothing to address that.

-Ross.
Re: [CODE4LIB] transforming marc to rdf
* BIBFRAME Tools [6] - sports nice ontologies, but the online tools won’t scale for large operations

--

The code running the transformation at [6] is available here: https://github.com/lcnetdev/marc2bibframe

We've run several million records through it at one time. As with everything, the data needs to be properly prepared, and we have a script that processes those millions in smaller (but still sizeable) batches.

Yours,
Kevin

On 12/04/2013 09:59 PM, Eric Lease Morgan wrote:

I have to eat some crow, and I hope somebody here can give me some advice for transforming MARC to RDF. I am in the midst of writing a book describing the benefits of linked data for archives. Archival metadata usually comes in two flavors: EAD and MARC.

I found a nifty XSL stylesheet from the Archives Hub (that’s in the United Kingdom) transforming EAD to RDF/XML. [1] With a bit of customization I think it could be used quite well by just about anybody with EAD files. I have retained a resulting RDF/XML file online. [2]

Converting MARC to RDF has been more problematic. There are various tools enabling me to convert my original MARC into MARCXML and/or MODS. After that I can reportedly use a few tools to convert to RDF:

* MARC21slim2RDFDC.xsl [3] - functions, but even for my tastes the resulting RDF is too vanilla. [4]
* modsrdf.xsl [5] - optimal, but when I use my transformation engine (Saxon), I do not get XML but rather plain text
* BIBFRAME Tools [6] - sports nice ontologies, but the online tools won’t scale for large operations

In short, I have discovered nothing that is “easy-to-use”. Can you provide me with any other links allowing me to convert MARC to serialized RDF?
[1] ead2rdf.xsl - http://data.archiveshub.ac.uk/xslt/ead2rdf.xsl
[2] transformed EAD file - http://infomotions.com/tmp/una-ano.rdf
[3] MARC21slim2RDFDC.xsl - http://www.loc.gov/standards/marcxml/xslt/MARC21slim2RDFDC.xsl
[4] vanilla RDF - http://infomotions.com/tmp/pamphlets.rdf
[5] modsrdf.xsl - http://www.loc.gov/standards/mods/modsrdf/xsl-files/modsrdf.xsl
[6] BIBFRAME Tools - http://bibframe.org/tools/transform/start

— Eric Lease Morgan
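[As an aside, it may help to see why a Dublin-Core-level crosswalk like MARC21slim2RDFDC.xsl feels "vanilla". The following is a minimal sketch of the same idea, not the LC stylesheet itself: it maps MARCXML 245$a and 100$a to dc:title and dc:creator. The sample record and the field choices are invented for illustration; note that nothing in the output carries a URI, which is exactly the weakness discussed in this thread.]

```python
# Sketch of a Dublin-Core-level MARCXML crosswalk -- the kind of "vanilla"
# RDF a stylesheet like MARC21slim2RDFDC.xsl produces. Sample record invented.
import xml.etree.ElementTree as ET

MARC_NS = "http://www.loc.gov/MARC21/slim"
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC_NS = "http://purl.org/dc/elements/1.1/"

marcxml = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <datafield tag="100" ind1="1" ind2=" ">
    <subfield code="a">Morgan, Eric Lease.</subfield>
  </datafield>
  <datafield tag="245" ind1="1" ind2="0">
    <subfield code="a">Fun with linked data.</subfield>
  </datafield>
</record>"""

def subfield(record, tag, code):
    """Return the first matching subfield value, or None."""
    for df in record.findall(f"{{{MARC_NS}}}datafield[@tag='{tag}']"):
        for sf in df.findall(f"{{{MARC_NS}}}subfield[@code='{code}']"):
            return sf.text
    return None

def marc_to_dc_rdf(marcxml_string):
    record = ET.fromstring(marcxml_string)
    ET.register_namespace("rdf", RDF_NS)
    ET.register_namespace("dc", DC_NS)
    rdf = ET.Element(f"{{{RDF_NS}}}RDF")
    desc = ET.SubElement(rdf, f"{{{RDF_NS}}}Description")
    # The whole crosswalk: 245$a -> dc:title, 100$a -> dc:creator.
    # There is no rdf:about and no URI anywhere -- the "vanilla" problem.
    for tag, code, dc_elem in [("245", "a", "title"), ("100", "a", "creator")]:
        value = subfield(record, tag, code)
        if value:
            ET.SubElement(desc, f"{{{DC_NS}}}{dc_elem}").text = value
    return ET.tostring(rdf, encoding="unicode")

print(marc_to_dc_rdf(marcxml))
```

[Flat literals like these are easy to generate but, as Ross and Kevin point out below in the thread, they name nothing that other datasets can link to.]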
Re: [CODE4LIB] transforming marc to rdf
Anything that will remodel MARC to (decent) RDF is going to be:

- Non-trivial to install
- Non-trivial to use
- Slow
- Require massive amounts of memory/disk space

Choose any two.

--

I'll second this.

Frankly, I don't see how you can generate RDF that anybody would want to use from XSLT: where would your URIs come from? What, exactly, are you modeling?

--

Our experience getting to good, URI-rich RDF has been basically a two-step process. First there is the raw conversion, which certainly results in verbose, blank-node-rich RDF, but we follow that pass with a second one during which blank nodes are replaced with URIs. This has most certainly been the case with BIBFRAME because X number of MARC records may represent varying manifestations of a single work. We don't want X number of instances (manifestations, basically) referencing X number of works in the end, but X number of instances referencing 1 work (all other things being equal). We consolidate - for lack of a better word - X number of works created in the first pass into 1 work (identified by an HTTP URI) and then we make sure X number of instances point to that one work, removing all the duplicate blank-node-identified resources created during the first pass.

Granted, this consolidation scenario is not scalable without a fairly robust backend solution, but the process at bibframe.org (the code on github) nevertheless does the type of consolidation described above in memory with small MARC collections.

Yours,
Kevin

On 12/05/2013 08:55 AM, Ross Singer wrote:

Eric,

I'm having a hard time figuring out exactly what you're hoping to get. Going from MARC to RDF was my great white whale for years while Talis' main business interests involved both of those (although not archival collections). Anything that will remodel MARC to (decent) RDF is going to be:

- Non-trivial to install
- Non-trivial to use
- Slow
- Require massive amounts of memory/disk space

Choose any two.
--

Frankly, I don't see how you can generate RDF that anybody would want to use from XSLT: where would your URIs come from? What, exactly, are you modeling?

I guess, to me, it would be a lot more helpful for you to take an archival MARC record and, by hand, build an RDF graph from it, then figure out your mappings. I just don't see any way to make it easy-to-use, at least not until you have an agreed-upon model to map to.

-Ross.

On Thu, Dec 5, 2013 at 3:07 AM, Christian Pietsch chr.pietsch+web4...@googlemail.com wrote:

Hi Eric,

you seem to have missed the Catmandu tutorial at SWIB13. Luckily there is a basic tutorial and a demo online: http://librecat.org/

The demo happens to be about transforming MARC to RDF using the Catmandu Perl framework. It gives you full flexibility by separating the importer from the exporter and providing a domain-specific language for “fixing” the data in between. Catmandu also has easy-to-use wrappers for popular search engines and databases (both SQL and NoSQL), making it a complete ETL (extract, transform, load) toolkit.

Disclosure: I am a Catmandu contributor. It's free and open source software.

Cheers,
Christian

On Wed, Dec 04, 2013 at 09:59:46PM -0500, Eric Lease Morgan wrote:

Converting MARC to RDF has been more problematic. There are various tools enabling me to convert my original MARC into MARCXML and/or MODS. After that I can reportedly use a few tools to convert to RDF:

* MARC21slim2RDFDC.xsl [3] - functions, but even for my tastes the resulting RDF is too vanilla. [4]
* modsrdf.xsl [5] - optimal, but when I use my transformation engine (Saxon), I do not get XML but rather plain text
* BIBFRAME Tools [6] - sports nice ontologies, but the online tools won’t scale for large operations

--
Christian Pietsch · http://www.ub.uni-bielefeld.de/~cpietsch/
LibTec · Library Technology and Knowledge Management
Bielefeld University Library, Bielefeld, Germany
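[For readers unfamiliar with the Catmandu pattern Christian describes, the import/fix/export separation centers on a small "Fix" file. The fragment below is a rough sketch based on the librecat.org tutorial, not something tested here; the fix functions come from the Catmandu::MARC and Catmandu::RDF add-ons, and the field names and file names are my own invention.]

```
# myfixes.fix -- Catmandu Fix DSL (sketch; field choices are hypothetical)
marc_map('245a', 'dc_title')     # copy MARC 245$a into a dc_title field
marc_map('100a', 'dc_creator')   # copy MARC 100$a into a dc_creator field
remove_field('record')           # drop the raw MARC structure afterwards

# Then, on the command line, something along the lines of:
#   catmandu convert MARC to RDF --fix myfixes.fix < records.mrc
```

[The point of the design is that the same fix file works regardless of importer and exporter, so swapping MARC for MARCXML input, or RDF for JSON output, does not require rewriting the mapping.]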
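[Kevin's second-pass "consolidation" above is worth sketching concretely. The following is a toy illustration of the idea, not the actual marc2bibframe code: several instances each arrive pointing at their own blank-node work, and equivalent works are collapsed into one URI-identified work. The matching key (lower-cased title plus creator) and the example.org URI pattern are assumptions made up for the sketch; BIBFRAME's real matching rules are more involved.]

```python
# Sketch of blank-node consolidation: X instances referencing X
# blank-node works become X instances referencing 1 URI-identified work
# (all other things being equal). Toy data and matching key invented.

def consolidate(instances):
    """instances: list of dicts like
       {"instance": "_:i1", "work": {"title": ..., "creator": ...}}.
       Returns (works, links): minted work URIs keyed by matching key,
       and (instance id, work URI) pairs replacing the blank nodes."""
    works = {}   # matching key -> minted HTTP URI for the consolidated work
    links = []   # instance -> work URI, one per input instance
    for inst in instances:
        key = (inst["work"]["title"].lower(), inst["work"]["creator"].lower())
        if key not in works:
            # First time we see this work: mint one URI for it.
            works[key] = f"http://example.org/works/{len(works) + 1}"
        # Every equivalent blank-node work now points at the same URI.
        links.append((inst["instance"], works[key]))
    return works, links

records = [
    {"instance": "_:i1", "work": {"title": "Walden", "creator": "Thoreau"}},
    {"instance": "_:i2", "work": {"title": "walden", "creator": "Thoreau"}},
    {"instance": "_:i3", "work": {"title": "Cape Cod", "creator": "Thoreau"}},
]
works, links = consolidate(records)
# Three instances went in, but only two distinct works survive.
print(len(works), links)
```

[As Kevin notes, doing this at scale means the lookup table cannot live in memory; a real implementation needs a backend store for the key-to-URI map.]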
Re: [CODE4LIB] transforming marc to rdf
Eric,

Have you had a look at Ed Chamberlain's work on COMET: https://github.com/edchamberlain/COMET

It's been a while since I've run this, but if I remember correctly, it was fairly easy to use. Also, though much older, I seem to remember the Simile MARC RDFizer being a pretty straightforward one to run: http://simile.mit.edu/wiki/MARC/MODS_RDFizer

MODS aficionados will point to some problems with some of its choices for representing that data, but it is still a good starting point (IMO).

Hope that helps,
-Corey

On Wed, Dec 4, 2013 at 9:59 PM, Eric Lease Morgan emor...@nd.edu wrote:

I have to eat some crow, and I hope somebody here can give me some advice for transforming MARC to RDF. I am in the midst of writing a book describing the benefits of linked data for archives. Archival metadata usually comes in two flavors: EAD and MARC.

I found a nifty XSL stylesheet from the Archives Hub (that’s in the United Kingdom) transforming EAD to RDF/XML. [1] With a bit of customization I think it could be used quite well by just about anybody with EAD files. I have retained a resulting RDF/XML file online. [2]

Converting MARC to RDF has been more problematic. There are various tools enabling me to convert my original MARC into MARCXML and/or MODS. After that I can reportedly use a few tools to convert to RDF:

* MARC21slim2RDFDC.xsl [3] - functions, but even for my tastes the resulting RDF is too vanilla. [4]
* modsrdf.xsl [5] - optimal, but when I use my transformation engine (Saxon), I do not get XML but rather plain text
* BIBFRAME Tools [6] - sports nice ontologies, but the online tools won’t scale for large operations

In short, I have discovered nothing that is “easy-to-use”. Can you provide me with any other links allowing me to convert MARC to serialized RDF?
[1] ead2rdf.xsl - http://data.archiveshub.ac.uk/xslt/ead2rdf.xsl
[2] transformed EAD file - http://infomotions.com/tmp/una-ano.rdf
[3] MARC21slim2RDFDC.xsl - http://www.loc.gov/standards/marcxml/xslt/MARC21slim2RDFDC.xsl
[4] vanilla RDF - http://infomotions.com/tmp/pamphlets.rdf
[5] modsrdf.xsl - http://www.loc.gov/standards/mods/modsrdf/xsl-files/modsrdf.xsl
[6] BIBFRAME Tools - http://bibframe.org/tools/transform/start

— Eric Lease Morgan

--
Corey A Harper
Metadata Services Librarian
New York University Libraries
20 Cooper Square, 3rd Floor
New York, NY 10003-7112
212.998.2479
corey.har...@nyu.edu