Re: [CODE4LIB] transforming marc to rdf

2013-12-09 Thread Eric Lease Morgan
I have created an initial pile of RDF, mostly.

I am in the process of experimenting with linked data for archives. My goal is 
to use existing (EAD and MARC) metadata to create RDF/XML, and then to expose 
this RDF/XML using linked data principles. Once I get that far I hope to slurp 
up the RDF/XML into a triple store, analyse the data, and learn how the whole 
process could be improved. 

This is what I have done to date:

  * accumulated sets of EAD files and MARC
records

  * identified and cached a few XSL stylesheets
transforming EAD and MARCXML into RDF/XML

  * wrote a couple of Perl scripts that combine
Bullet #1 and Bullet #2 to create HTML and
RDF/XML

  * wrote a mod_perl module implementing
rudimentary content negotiation

  * made the whole thing (scripts, sets of data,
HTML, RDF/XML, etc.) available on the Web

You can see the fruits of these labors at http://infomotions.com/sandbox/liam/, 
and there you will find a few directories:

  * bin - my Perl scripts live here as well as
a couple of support files

  * data - full of RDF/XML files -- about 4,000
of them

  * etc - mostly stylesheets

  * id - a placeholder for the URIs and content
negotiation

  * lib - where the actual content negotiation
script lives

  * pages - HTML versions of the original metadata

  * src - a cache for my original metadata

  * tmp - things of brief importance; mostly trash

My Perl scripts read the metadata, create HTML and RDF/XML, and save the result 
in the pages and data directories, respectively. A person can browse these 
directories, but browsing will be difficult because there is nothing there 
except cryptic file names. Selecting any of the files should return valid HTML 
or RDF/XML. 

Each cryptic name is the leaf of a URI prefixed with 
http://infomotions.com/sandbox/liam/id/. For example, if the leaf is 
mshm510, then the combined leaf and prefix form a resolvable URI -- 
http://infomotions.com/sandbox/liam/id/mshm510. When the user-agent says it can 
accept text/html, then the HTTP server redirects the user-agent to 
http://infomotions.com/sandbox/liam/pages/mshm510.html. If the user agent does 
not request a text/html representation, then the RDF/XML version is returned -- 
http://infomotions.com/sandbox/liam/data/mshm510.rdf. This is rudimentary 
content-negotiation. For a good time, here are a few actionable URIs:

  * http://infomotions.com/sandbox/liam/id/4042gwbo
  * http://infomotions.com/sandbox/liam/id/httphdllocgovlocmusiceadmusmu004002
  * http://infomotions.com/sandbox/liam/id/ma117
  * http://infomotions.com/sandbox/liam/id/mshm509
  * http://infomotions.com/sandbox/liam/id/stcmarcocm11422551
  * http://infomotions.com/sandbox/liam/id/vilmarcvil_155543

For a good time, feed them to the W3C RDF Validator. 
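The redirect logic described above can be sketched in a few lines. This is an illustrative Python sketch of the Accept-header check, not the author's mod_perl module; the base URL follows the examples above.

```python
# Sketch of rudimentary content negotiation: user-agents accepting
# text/html are redirected to the HTML page, all others get RDF/XML.
BASE = "http://infomotions.com/sandbox/liam"

def negotiate(identifier, accept_header):
    """Return the URL a user-agent should be redirected to."""
    if "text/html" in accept_header:
        return f"{BASE}/pages/{identifier}.html"
    return f"{BASE}/data/{identifier}.rdf"

print(negotiate("mshm510", "text/html,application/xhtml+xml"))
print(negotiate("mshm510", "application/rdf+xml"))
```

A production server would parse quality values ("q=") in the Accept header rather than doing a substring test, but the principle is the same.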

The next step is to figure out how to handle file not found errors when a URI 
does not exist. Another thing to figure out is how to make potential robots 
aware of the data set. The bigger problem is simply to make the dataset more 
meaningful through the inclusion of more URIs in the RDF/XML as well as the use of 
a more consistent and standardized set of ontologies. 

Fun with linked data?

— 
Eric Morgan


Re: [CODE4LIB] transforming marc to rdf

2013-12-06 Thread Karen Coyle

On 12/5/13 8:11 AM, Eric Lease Morgan wrote:
Where will I get the URIs from? I will get them by combining some sort 
of unique code (like an OCLC symbol) or namespace with the value of 
the MARC records' 001 fields.


You actually need 3 URIs per triple:

subject URI (which is what I believe you are creating, above)
predicate URI (the data element URI, like 
http://purl.org/dc/terms/title)
object URI (the URI for the data you are providing, like 
http://id.loc.gov/authorities/names/n94036700)


The first two MUST be URIs. The third SHOULD be a URI but can also be a 
string. However, strings, in the linked data space, do NOT LINK. If you 
only have strings in the object/value space then you can run searches 
against your data, but your data cannot link to other data. Creating 
linked data that doesn't link isn't terribly useful.
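The distinction above can be made concrete by hand-serializing two N-Triples statements. This is an illustrative sketch; the subject URI and the name string are hypothetical, while the predicate and object URIs are the ones cited above.

```python
# Two triples with the same subject and predicate: one object is a URI
# (it links), the other is a string literal (it does not link).
subject = "<http://example.org/record/12345>"  # hypothetical subject URI
predicate = "<http://purl.org/dc/terms/title>"

uri_object = "<http://id.loc.gov/authorities/names/n94036700>"
literal_object = '"Twain, Mark, 1835-1910"'  # hypothetical string value

linking = f"{subject} {predicate} {uri_object} ."
non_linking = f"{subject} {predicate} {literal_object} ."

print(linking)
print(non_linking)
```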


(In case this doesn't make sense to anyone reading, I have a slide deck 
that illustrates this. I've uploaded it to: 
http://kcoyle.net/presentations/3webIntro.pptx )


A key first step for all of us is to start getting identifiers into our 
data, even before we start thinking about linked data. MARC records in 
systems that recognize authority control should be able to store or 
provide on output the URI of every authority-controlled entity. This 
should not be terribly difficult (ok, famous last words, I know). But if 
your vendor system can flip headings then it should also be able to 
provide a URI (especially since LC has conveniently made their URIs 
derivable from the LC record numbers).


With identifiers for things, THEN you are really linking.

kc


--
Karen Coyle
kco...@kcoyle.net http://kcoyle.net
m: 1-510-435-8234
skype: kcoylenet


Re: [CODE4LIB] transforming marc to rdf

2013-12-06 Thread Eric Lease Morgan
I have successfully been able to begin the systematic transformation process of 
EAD and MARC to RDF/XML, and consequently been able to literally illustrate the 
resulting triples. [1, 2] From the blog posting [3]:

  The resulting images are huge, and the astute/diligent reader
  will see a preponderance of literals in the results. This is not
  a good thing, but it is all that is available right now.
 
  On the other hand the same astute/diligent reader will see the
  root of the RDF/XML pointing to a meaningful URI. This URI will
  be resolvable in the near future via content negotiation. This is
  a simple first step. The next steps will be to apply this process
  to an entire collection of EAD files and MARC records. After that
  the two other things can happen: 1) the original metadata files
  can begin to include URIs, and 2) the XSL used to process the
  metadata can employ a more standardized ontology. It is not an
  easy process, but it is a beginning.

  Right now, something is better than nothing.

[1] EAD illustration- http://sites.tufts.edu/liam/files/2013/12/hou00096.png
[2] MARC illustration - http://sites.tufts.edu/liam/files/2013/12/003078076.png
[3] blog posting - http://sites.tufts.edu/liam/2013/12/06/illustrating-rdf/

—
Eric Morgan


Re: [CODE4LIB] transforming marc to rdf

2013-12-06 Thread Eric Lease Morgan
On Dec 5, 2013, at 12:35 PM, Ross Singer rossfsin...@gmail.com wrote:

 You still haven't really answered my question about what you're hoping to
 achieve and who stands to benefit from it.  I don't see how assigning a
 bunch of arbitrary identifiers, properties, and values to a description of
 a collection of archival materials (especially since you're talking about
 doing this in XSLT, so your archival collections can't even really be
 related to /each other/ much less anything else).
 
 Who is going to use going to use this data?  What are they supposed to do
 with it?  What will libraries and archives get from it?


My goal is three-fold:

  * to describe to the neophyte what linked data is and why they should care

  * to describe to the archivist who appreciates the value of linked data 
but does not know how to achieve its goals, possible approaches to
improving their metadata, specifically, the robust inclusion of URIs

  * to describe to the technologist the principles of archival practice,
to make them understand that things like EAD files describe
“collections” and not necessarily individual things, moreover to
demonstrate the utter simplicity of linked data principles

Yes, the EAD files, and thus the RDF/XML, etc., will not necessarily be linked to 
other things. That’s the point. By implementing my recipe, I will demonstrate 
to both the archivist and the technologist the need to work differently 
in order to achieve the linked data goal.

My goal is not necessarily to provide a robust information system. While the 
information system I create will be useful, it is not intended to be the be-all 
end-all of linked data for archivists. In fact, it will painfully illustrate 
the deficiencies in our existing practices.

Linked data suffers from a chicken-and-egg problem. By implementing my simple 
recipe, I believe I will be making it easier for the community to lay an egg. 

— 
Eric Lease Morgan


Re: [CODE4LIB] transforming marc to rdf

2013-12-06 Thread Eric Lease Morgan
On Dec 5, 2013, at 1:17 PM, Kevin Ford k...@3windmills.com wrote:

 Frankly, I don't see how you can generate RDF that anybody would want to
 use from XSLT: where would your URIs come from?  What, exactly, are you
 modeling?
 
 -- Our experience getting to good, URI rich RDF has been basically a 
 two-step process.  First there is the raw conversion, which certainly 
 results in verbose blank-node-rich RDF, but we follow that pass with a 
 second one during which blank nodes are replaced with URIs.


The posting above is exactly the approach I am advocating. As long as the 
linked data is not incorrect but merely not best practice, implement 
linked data with what one has in hand. This will accomplish two goals: 1) make 
cultural heritage institution metadata more widely available, and 2) provide 
practice for the technologist for implementation. Once the data is available, 
then enhance it and repeat the process. It is a never-ending thing. —Eric Morgan


Re: [CODE4LIB] transforming marc to rdf

2013-12-05 Thread Christian Pietsch
Hi Eric,

you seem to have missed the Catmandu tutorial at SWIB13. Luckily there
is a basic tutorial and a demo online: http://librecat.org/

The demo happens to be about transforming MARC to RDF using the
Catmandu Perl framework. It gives you full flexibility by separating
the importer from the exporter and providing a domain specific
language for “fixing” the data in between. Catmandu also has easy
to use wrappers for popular search engines and databases (both SQL and
NoSQL), making it a complete ETL (extract, transform, load) toolkit.

Disclosure: I am a Catmandu contributor. It's free and open source
software.

Cheers,
Christian


On Wed, Dec 04, 2013 at 09:59:46PM -0500, Eric Lease Morgan wrote:
 Converting MARC to RDF has been more problematic. There are various
 tools enabling me to convert my original MARC into MARCXML and/or
 MODS. After that I can reportedly use a few tools to convert to RDF:
 
   * MARC21slim2RDFDC.xsl [3] - functions, but even for
 my tastes the resulting RDF is too vanilla. [4]
 
   * modsrdf.xsl [5] - optimal, but when I use my
 transformation engine (Saxon), I do not get XML
 but rather plain text
 
   * BIBFRAME Tools [6] - sports nice ontologies, but
 the online tools won’t scale for large operations

-- 
  Christian Pietsch · http://www.ub.uni-bielefeld.de/~cpietsch/
  LibTec · Library Technology and Knowledge Management
  Bielefeld University Library, Bielefeld, Germany


Re: [CODE4LIB] transforming marc to rdf [comet]

2013-12-05 Thread Eric Lease Morgan
On Dec 4, 2013, at 10:29 PM, Corey A Harper corey.har...@nyu.edu wrote:

 Have you had a look at Ed Chamberlain's work on COMET:
 https://github.com/edchamberlain/COMET
 
 It's been a while since I've run this, but if I remember correctly, it was
 fairly easy-to-use.


Thank you for the pointer. I downloaded the COMET “suite”, and got good output, 
but only after I enhanced/tweaked the source code to require the Perl Encode 
module:

  ./marc2rdf_batch.pl pamphlets.marc

The result was a huge set of triples saved as RDF/Turtle. I then used a Java 
archive (RDF2RDF [1]) to painlessly convert the Turtle to RDF/XML. The process 
worked. It was “easy” for me, sort of, but it employs quite a number of 
sophisticated underlying technologies. I could integrate everything into a 
whole, but… On to explore other options.

[1] RDF2RDF - http://www.l3s.de/~minack/rdf2rdf/

—
Sleepless In South Bend


Re: [CODE4LIB] transforming marc to rdf [mods_rdfizer]

2013-12-05 Thread Eric Lease Morgan
On Dec 4, 2013, at 10:29 PM, Corey A Harper corey.har...@nyu.edu wrote:

 Also, though much older, I seem to remember the Simile MARC RDFizer being
 a pretty straightforward one to run:
 http://simile.mit.edu/wiki/MARC/MODS_RDFizer
 
 MODS aficionados will point to some problems with some of it's choices for
 representing that data, but still a good starting point (IMO).


Again, thanks for the pointer. I downloaded MODS_RDFizer and got it to run, but 
it was a good thing that I already had mvn installed. The output created an 
RDF/XML file, and I concur, the implemented ontology is “interesting”. The 
distribution includes a possibly cool stylesheet — mods2rdf.xslt. Maybe I can 
use this. Hmm…  —Still Sleepless


Re: [CODE4LIB] transforming marc to rdf [mods_rdfizer]

2013-12-05 Thread Eric Lease Morgan
On Dec 5, 2013, at 6:54 AM, Eric Lease Morgan emor...@nd.edu wrote:

 http://simile.mit.edu/wiki/MARC/MODS_RDFizer
 
 ...The distribution includes a possibly cool stylesheet — mods2rdf.xslt.


Ah ha! The MODS_RDFizer’s mods2rdf.xslt file functioned very well against one 
of my MODS files:

  $ xsltproc mods2rdf.xslt pamphlets.mods > pamphlets.rdf

Mods2rdf.xslt could very easily be configured at the beginning of the file to 
suit the needs of a local “cultural heritage institution”. I like the use of 
XSL to create a serialized RDF as opposed to the use of an application because 
less infrastructure is needed to make things happen. 

—
Too Much Coffee?


Re: [CODE4LIB] transforming marc to rdf [catmandu]

2013-12-05 Thread Eric Lease Morgan
On Dec 5, 2013, at 3:07 AM, Christian Pietsch 
chr.pietsch+web4...@googlemail.com wrote:

 you seem to have missed the Catmandu tutorial at SWIB13. Luckily there
 is a basic tutorial and a demo online: http://librecat.org/


I did attend SWIB13, and I really wanted to go to the Catmandu workshop, but 
since I’m a Perl “aficionado” I figured I could play with it later on my own. 
Instead I attended the workshop on provenance. (Travelogue is pending.)

In any event, playing with the Catmandu demo was insightful. [1] I see and 
understand the workflow: import data, fix it, store it, fix it, export it. I 
see how it is designed to use many import and export formats. The key to the 
software seems to be two-fold: 1) the ability to read and write Perl programs, 
and 2) understanding Catmandu’s “fix” language. There are great possibilities 
here for us Perl folks. Thank you for bringing it back to my attention.

[1] demo - http://demo.librecat.org

— 
Eric Lease Morgan


Re: [CODE4LIB] transforming marc to rdf

2013-12-05 Thread Ross Singer
Eric, I'm having a hard time figuring out exactly what you're hoping to get.

Going from MARC to RDF was my great white whale for years while Talis' main
business interests involved both of those (although not archival
collections).  Anything that will remodel MARC to (decent) RDF is going to be:

   - Non-trivial to install
   - Non-trivial to use
   - Slow
   - Require massive amounts of memory/disk space

Choose any two.

Frankly, I don't see how you can generate RDF that anybody would want to
use from XSLT: where would your URIs come from?  What, exactly, are you
modeling?

I guess, to me, it would be a lot more helpful for you to take an archival
MARC record, and, by hand, build an RDF graph from it, then figure out your
mappings.  I just don't see any way to make it easy-to-use, at least, not
until you have an agreed upon model to map to.

-Ross.


On Thu, Dec 5, 2013 at 3:07 AM, Christian Pietsch 
chr.pietsch+web4...@googlemail.com wrote:

 Hi Eric,

 you seem to have missed the Catmandu tutorial at SWIB13. Luckily there
 is a basic tutorial and a demo online: http://librecat.org/

 The demo happens to be about transforming MARC to RDF using the
 Catmandu Perl framework. It gives you full flexibility by separating
 the importer from the exporter and providing a domain specific
 language for “fixing” the data in between. Catmandu also has easy
 to use wrappers for popular search engines and databases (both SQL and
 NoSQL), making it a complete ETL (extract, transform, load) toolkit.

 Disclosure: I am a Catmandu contributor. It's free and open source
 software.

 Cheers,
 Christian


 On Wed, Dec 04, 2013 at 09:59:46PM -0500, Eric Lease Morgan wrote:
  Converting MARC to RDF has been more problematic. There are various
  tools enabling me to convert my original MARC into MARCXML and/or
  MODS. After that I can reportedly use a few tools to convert to RDF:
 
* MARC21slim2RDFDC.xsl [3] - functions, but even for
  my tastes the resulting RDF is too vanilla. [4]
 
* modsrdf.xsl [5] - optimal, but when I use my
  transformation engine (Saxon), I do not get XML
  but rather plain text
 
* BIBFRAME Tools [6] - sports nice ontologies, but
  the online tools won’t scale for large operations

 --
   Christian Pietsch · http://www.ub.uni-bielefeld.de/~cpietsch/
   LibTec · Library Technology and Knowledge Management
   Bielefeld University Library, Bielefeld, Germany



Re: [CODE4LIB] transforming marc to rdf [to batch or not to batch]

2013-12-05 Thread Eric Lease Morgan
When exposing sets of MARC records as linked data, do you think it is better to 
expose them in batch (collection) files or as individual RDF serializations? To 
bastardize the Bard — “To batch or not to batch? That is the question.”

Suppose I am a medium-sized academic research library. Suppose my collection is 
comprised of approximately 3.5 million bibliographic records. Suppose I want to 
expose those records via linked data. Suppose further that this will be done by 
“simply” making RDF serialization files (XML, Turtle, etc.) accessible via an 
HTTP filesystem. No scripts. No programs. No triple stores. Just files on an 
HTTP file system coupled with content negotiation. Given these assumptions, 
would you:

  1. create batches of MARC records, convert them to MARCXML
 and then to RDF, and save these files to disc, or

  2. parse the batches of MARC record sets into individual
 records, convert them into MARCXML and then RDF, and
 save these files to disc

Option #1 would require heavy lifting against large files, but the number of 
resulting files to save to disc would be relatively few — reasonably managed in 
a single directory on disc. On the other hand, individual URIs pointing to 
individual serializations would not be accessible. They would only be 
accessible by retrieving the collection file in which they reside. Moreover, a 
mapping of individual URIs to collection files would need to be maintained. 

Option #2 would be easier on the computing resources because processing little 
files is generally easier than processing bigger ones. On the other hand, the 
number of files generated by this option is not easily managed without the 
use of a sophisticated directory structure. (It is not feasible to put 3.5 
million files in a single directory.) But I would still need to create a 
mapping from URI to directory.
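A common remedy for Option #2’s file-count problem is to shard the directory tree with a short prefix derived from a digest of each identifier, so that no single directory ever holds millions of files. The sketch below is purely illustrative; the identifier is taken from the example URIs earlier in the thread, and the layout (two levels of two hex characters) is an assumption.

```python
import hashlib

def shard_path(identifier, depth=2, width=2):
    """Return a stable, sharded relative path for a record's RDF file."""
    # MD5 is used only to spread names evenly, not for security.
    digest = hashlib.md5(identifier.encode("utf-8")).hexdigest()
    shards = [digest[i * width:(i + 1) * width] for i in range(depth)]
    return "/".join(shards + [identifier + ".rdf"])

print(shard_path("stcmarcocm11422551"))
```

Because the path is a pure function of the identifier, the URI-to-directory mapping mentioned above can be recomputed on the fly rather than maintained as a separate file.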

In either case, I would probably create a bunch of site map files denoting the 
locations of my serializations — YAP (Yet Another Mapping).

I’m leaning towards Option #2 because individual URIs could be resolved more 
easily with “simple” content negotiation.

(Given my particular use case — archival MARC records — I don’t think I’d 
really have more than a few thousand items, but I’m asking the question on a 
large scale anyway.)

—
Eric Morgan


Re: [CODE4LIB] transforming marc to rdf

2013-12-05 Thread Eric Lease Morgan
On Dec 5, 2013, at 8:55 AM, Ross Singer rossfsin...@gmail.com wrote:

 Eric, I'm having a hard time figuring out exactly what you're hoping to get.
 
 Going from MARC to RDF was my great white whale for years while Talis' main
 business interests involved both of those (although not archival
 collections).  Anything that will remodel MARC to (decent) RDF is going to be:
 
   - Non-trivial to install
   - Non-trivial to use
   - Slow
   - Require massive amounts of memory/disk space
 
 Choose any two.
 
 Frankly, I don't see how you can generate RDF that anybody would want to
 use from XSLT: where would your URIs come from?  What, exactly, are you
 modeling?
 
 I guess, to me, it would be a lot more helpful for you to take an archival
 MARC record, and, by hand, build an RDF graph from it, then figure out your
 mappings.  I just don't see any way to make it easy-to-use, at least, not
 until you have an agreed upon model to map to.


Ross, good questions. I’m hoping to articulate and implement a simple and 
functional method for exposing EAD and MARC metadata as linked data. “Simple 
and functional” are the operative words; I’m not necessarily looking for 
“fast”, “best” nor “perfect”. I am trying to articulate something that requires 
the least amount of infrastructure and technical expertise.

Reasonable RDF through XSLT? Good point. I like the use of XSLT because it does 
not require very much technical infrastructure — just ubiquitous XSLT 
processors like Saxon or xsltproc. I have identified two or three stylesheets 
transforming MARCXML/MODS into RDF/XML.

  1. The first comes from the Library of Congress and uses Dublin
 Core as its ontology, but the resulting RDF has no URIs and
 the Dublin Core is not good enough, even for my tastes. [1]

  2. The second also comes from the Library of Congress, and it
 uses a richer, more standard ontology, but I can’t get it to
 work. All I get as output is a plain text file. I must be
 doing something wrong. [2]

  3. I found the third stylesheet buried in the MARC/MODS RDFizer.
 The sheet uses XSLT 1.0 which is good for my xsltproc-like
 tools. I get output, which is better than Sheet #2. The
 ontology is a bit MIT-specific, but it is one heck of a lot
 richer than Sheet #1. Moreover, the RDF includes URIs. [3, 4]

In none of these cases will the ontology be best or perfect, but for right now 
I don’t care. The ontology is good enough. Heck, the ontologies don’t even come 
close to the ontology I get when transforming my EAD to RDF using the Archives 
Hub stylesheet. [5] I just want to expose the content as linked data. Somebody 
else — the community — can come behind to improve the stylesheets and their 
ontologies. 

Where will I get the URIs from? I will get them by combining some sort of 
unique code (like an OCLC symbol) or namespace with the value of the MARC 
records' 001 fields.
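The URI-minting idea above can be sketched as a one-line function: join a namespace (an OCLC-style symbol here, as an assumption) with the value of the MARC 001 control field. The base URL and symbol are hypothetical; the actual scheme may differ.

```python
def mint_uri(base, symbol, field_001):
    """Mint a subject URI from a namespace code and a MARC 001 value."""
    # Control fields often carry stray whitespace; normalize both parts.
    return f"{base}/{symbol.lower()}{field_001.strip()}"

print(mint_uri("http://infomotions.com/sandbox/liam/id", "STC", " ocm11422551 "))
```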

Here is an elaboration of my original recipe for making MARC metadata 
accessible via linked data:

  1. obtain a set of MARC records
  2. parse out a record from the set
  3. convert it to MARCXML
  4. transform MARCXML into HTML
  5. transform MARCXML into RDF (probably through MODS first)
  6. save HTML and RDF to disc
  7. update a mapping file / data structure denoting where things are located
  8. go to Step #2 for each record in the set
  9. use the mapping to create a set of site map files
 10. use the mapping to support HTTP content negotiation
 11. create an index.html file allowing humans to browse the collection as well 
as point robots to the RDF
 12. for extra credit, import all the RDF into a triple store and provide 
access via SPARQL
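The looping and mapping steps of the recipe can be sketched as follows. This is an illustrative Python sketch, not the author's Perl scripts; the transform steps are stubbed out, and all directory names and the sitemap base URL are assumptions.

```python
def build_mapping(identifiers, pages_dir="pages", data_dir="data"):
    """Record where each record's HTML and RDF serializations live."""
    mapping = {}
    for rid in identifiers:
        # The transform steps (MARC -> MARCXML -> HTML and RDF, via XSLT)
        # would run here before the paths are recorded.
        mapping[rid] = {
            "html": f"{pages_dir}/{rid}.html",
            "rdf": f"{data_dir}/{rid}.rdf",
        }
    return mapping

def sitemap(mapping, base="http://example.org/id"):
    """A simple site map: one resolvable URI per record."""
    return "\n".join(f"{base}/{rid}" for rid in sorted(mapping))

records = build_mapping(["ma117", "mshm510"])
print(sitemap(records))
```

The same mapping structure can drive the content negotiation step, since it records both representations for every identifier.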

I think I can do the same thing with EAD files. Moreover, I think I can do this 
with a small number of (Perl) scripts easily readable by others, enabling them 
to implement the scripts in a programming language of their choice. Once I get 
this far, metadata experts can improve the ontologies, and computer scientists 
can improve the infrastructure. In the meantime the linked data can be 
harvested for the good purposes for which linked data was articulated.

It is in my head. It really is. All I need is the time, focus, and energy to 
implement it. On my mark. Get set. Go.


[1] MARC21slim2RDFDC.xsl - 
http://www.loc.gov/standards/marcxml/xslt/MARC21slim2RDFDC.xsl
[2] modsrdf.xsl - 
http://www.loc.gov/standards/mods/modsrdf/xsl-files/modsrdf.xsl
[3] mods2rdf.xslt - http://infomotions.com/tmp/mods2rdf.xslt
[4] MARC/MODS RDFizer - http://simile.mit.edu/wiki/MARC/MODS_RDFizer
[5] ead2rdf.xsl - http://data.archiveshub.ac.uk/xslt/ead2rdf.xsl

— 
Eric Lease Morgan


Re: [CODE4LIB] transforming marc to rdf

2013-12-05 Thread Mark A. Matienzo
On Thu, Dec 5, 2013 at 11:11 AM, Eric Lease Morgan emor...@nd.edu wrote:


 I’m hoping to articulate and implement a simple and functional method for
 exposing EAD and MARC metadata as linked data.


Isn't the point of this to expose archival description as linked data? What
about description maintained in applications like a collection management
system, say, ArchivesSpace or Archivists' Toolkit?

Mark

--
Mark A. Matienzo m...@matienzo.org
Director of Technology, Digital Public Library of America


Re: [CODE4LIB] transforming marc to rdf

2013-12-05 Thread Eric Lease Morgan
On Dec 5, 2013, at 11:17 AM, Mark A. Matienzo mark.matie...@gmail.com wrote:

 I’m hoping to articulate and implement a simple and functional method for
 exposing EAD and MARC metadata as linked data.
 
 Isn't the point of this to expose archival description as linked data? What
 about description maintained in applications like a collection management
 system, say, ArchivesSpace or Archivists' Toolkit?


Good question! At the very least, these applications (ArchivesSpace, 
Archivists’ Toolkit, etc.) can regularly and systematically export their data 
as EAD, and the EAD can be made available as linked data. It would be ideal if 
the applications were to natively make their metadata available as linked 
data, but exporting their content as EAD is a functional stopgap solution. 
—Eric Morgan


Re: [CODE4LIB] transforming marc to rdf

2013-12-05 Thread Mark A. Matienzo
On Thu, Dec 5, 2013 at 11:26 AM, Eric Lease Morgan emor...@nd.edu wrote:


 Good question! At the very least, these applications (ArchivesSpace,
 Archivists’ Toolkit, etc.) can regularly and systematically export their
 data as EAD, and the EAD can be made available as linked data. It would be
 ideal if the applications were to natively make their metadata available
 as linked data, but exporting their content as EAD is a functional stopgap
 solution. —Eric Morgan


Wouldn't it make more sense, especially with a system like ArchivesSpace,
which provides a backend HTTP API and a public UI, to publish linked data
directly instead of adding yet another stopgap?

Mark

--
Mark A. Matienzo m...@matienzo.org
Director of Technology, Digital Public Library of America


Re: [CODE4LIB] transforming marc to rdf

2013-12-05 Thread Eric Lease Morgan
On Dec 5, 2013, at 11:33 AM, Mark A. Matienzo mark.matie...@gmail.com wrote:

 At the very least, these applications (ArchivesSpace,
 Archivists’ Toolkit, etc.) can regularly and systematically export their
 data as EAD, and the EAD can be made available as linked data.
 
 Wouldn't it make more sense, especially with a system like ArchivesSpace,
 which provides a backend HTTP API and a public UI, to publish linked data
 directly instead of adding yet another stopgap?


Publishing via a content management system would make more sense, if:

  1. the archivist uses the specific content management system
  2. the content management system supports the functionality

“There is more than one way to skin a cat.” There are advantages and 
disadvantages to every software solution.

—
Eric


Re: [CODE4LIB] transforming marc to rdf

2013-12-05 Thread Corey A Harper
With apologies to Eric & to others from the LiAM project, I feel like I
want to jump in here with a little more context.

Eric, or Aaron, or Anne, please feel free to correct any of what I say
below.

I agree with the points made and concerns raised by both Ross & Mark --
most significantly, that a sustainable infrastructure for linked archival
metadata is not going to come from an XSLT stylesheet. However, I also see
tremendous value in what Eric is putting together here.

The prospectus for the LiAM project, which is the context for Eric's
questions, is about developing guiding principles and educational tools for
the archival community to better understand, prepare for, and contribute to
the kind of infrastructure both Ross & Mark are talking about:
http://sites.tufts.edu/liam/deliverables/prospectus-for-linked-archival-metadata-a-guidebook/

While I agree that converting legacy data in EAD & MARC formats to RDF is
not the approach this work will take in the future, I also believe that
these are formats that the archival community is very familiar with, and
XSLT is a tool that many archivists work with regularly. A workflow with which
that community can experiment is a laudable goal.

In short, I think we need approaches that illustrate the potential of
linked data in archives, to highlight some of the shortcomings in our
current metadata management frameworks, to help archivists be in a position
to get their metadata ready for what Mark is describing in the context of
ArchivesSpace (e.g. please use id attributes in c tags!!), and to have a
more complete picture of why doing so is of some value.

Sorry for the long message, and I hope that the context is helpful.

Regards,
-Corey



On Thu, Dec 5, 2013 at 11:33 AM, Mark A. Matienzo
mark.matie...@gmail.comwrote:

 On Thu, Dec 5, 2013 at 11:26 AM, Eric Lease Morgan emor...@nd.edu wrote:

 
  Good question! At the very least, these applications (ArchivesSpace,
  Archivists’ Toolkit, etc.) can regularly and systematically export their
  data as EAD, and the EAD can be made available as linked data. It would
 be
  ideal if the applications were to natively make their metadata available
  as linked data, but exporting their content as EAD is a functional
 stopgap
  solution. —Eric Morgan
 

 Wouldn't it make more sense, especially with a system like ArchivesSpace,
 which provides a backend HTTP API and a public UI, to publish linked data
 directly instead of adding yet another stopgap?

 Mark

 --
 Mark A. Matienzo m...@matienzo.org
 Director of Technology, Digital Public Library of America




-- 
Corey A Harper
Metadata Services Librarian
New York University Libraries
20 Cooper Square, 3rd Floor
New York, NY 10003-7112
212.998.2479
corey.har...@nyu.edu


Re: [CODE4LIB] transforming marc to rdf

2013-12-05 Thread McAulay, Elizabeth
I've been following this conversation as a non-coder. I'm really interested in 
getting a better understanding of linked data and how to use existing metadata 
for proof of concept linked data outputs. So, I totally think Eric's approaches 
are valuable and would be something I would use. I also understand there are 
many ways to do something better and more in the flow. So, just encouraging 
you all to keep posting thoughts in both directions!

Best,
Lisa 
-
Elizabeth Lisa McAulay
Librarian for Digital Collection Development
UCLA Digital Library Program
http://digital.library.ucla.edu/
email: emcaulay [at] library.ucla.edu

From: Code for Libraries [CODE4LIB@LISTSERV.ND.EDU] on behalf of Eric Lease 
Morgan [emor...@nd.edu]
Sent: Thursday, December 05, 2013 8:57 AM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] transforming marc to rdf

On Dec 5, 2013, at 11:33 AM, Mark A. Matienzo mark.matie...@gmail.com wrote:

 At the very least, these applications (ArchivesSpace,
 Archivists’ Toolkit, etc.) can regularly and systematically export their
 data as EAD, and the EAD can be made available as linked data.

 Wouldn't it make more sense, especially with a system like ArchivesSpace,
 which provides a backend HTTP API and a public UI, to publish linked data
 directly instead of adding yet another stopgap?


Publishing via a content management system would make more sense, if:

  1. the archivist uses the specific content management system
  2. the content management system supported the functionality

“There is more than one way to skin a cat.” There are advantages and 
disadvantages to every software solution.

—
Eric


Re: [CODE4LIB] transforming marc to rdf

2013-12-05 Thread Mark A. Matienzo
On Thu, Dec 5, 2013 at 11:57 AM, Eric Lease Morgan emor...@nd.edu wrote:

 On Dec 5, 2013, at 11:33 AM, Mark A. Matienzo mark.matie...@gmail.com
 wrote:

  Wouldn't it make more sense, especially with a system like ArchivesSpace,
  which provides a backend HTTP API and a public UI, to publish linked data
  directly instead of adding yet another stopgap?


 Publishing via a content management system would make more sense if:

   1. the archivist uses the specific content management system
   2. the content management system supports the functionality

 “There is more than one way to skin a cat.” There are advantages and
 disadvantages to every software solution.


I recognize that not everyone uses a collection management system and that
some may author description directly in EAD or something else, but I think
we really need to acknowledge the affordances of that kind of software
here.

I can tell you for certain that there are aspects of the ArchivesSpace
data model that cannot be serialized in any good way - or at all - using
EAD or MARC.

Per Corey's message:

I have no objection in principle to using XSLT to provide examples of ways
to do this transformation (I know lots of people have piles of existing
EAD) as long as the resulting data is acknowledged to be less than ideal.
EAD is also not a data model; it's a document model for a finding aid. EAD3
will improve this somewhat, but it's still not a representation of a
conceptual model of archival entities.

My concern about using something like XSLT *specifically* to transform
archival description stored in MARC is that the existing stylesheets assume
that the MARC description is bibliographic description. Archival
description is not bibliographic description.

Mark


Re: [CODE4LIB] transforming marc to rdf

2013-12-05 Thread Ross Singer
On Thu, Dec 5, 2013 at 11:57 AM, Eric Lease Morgan emor...@nd.edu wrote:


 “There is more than one way to skin a cat.” There are advantages and
 disadvantages to every software solution.


I think what Mark and I are trying to say is that the first step toward this
solution is not applying software to existing data, but figuring out the
problem you're actually trying to solve.  Any linked data future cannot be
as simple as a technologist handing some magic tool to archivists and
librarians.

You still haven't really answered my question about what you're hoping to
achieve and who stands to benefit from it.  I don't see how it helps to
assign a bunch of arbitrary identifiers, properties, and values to a
description of a collection of archival materials (especially since you're
talking about doing this in XSLT, so your archival collections can't even
really be related to /each other/, much less anything else).

Who is going to use this data?  What are they supposed to do with it?
What will libraries and archives get from it?

I am certainly not above academic exercises (or without my own), but I can
see absolutely *no* beneficial archival linked data coming simply from
pointing an XSLT at a bunch of EAD and MARCXML, and I certainly can't see
any without a clear vision of the model that said XSLT is supposed to
generate.  The key part here is the data model, and taking a 'software
solution'-first approach does nothing to address that.

-Ross.


Re: [CODE4LIB] transforming marc to rdf

2013-12-05 Thread Kevin Ford

* BIBFRAME Tools [6] - sports nice ontologies, but
  the online tools won’t scale for large operations
-- The code running the transformation at [6] is available here:

https://github.com/lcnetdev/marc2bibframe

We've run several million records through it at one time.  As with 
everything, the data needs to be properly prepared and we have a script 
that processes those millions in smaller (but still sizeable) batches.


Yours,
Kevin


On 12/04/2013 09:59 PM, Eric Lease Morgan wrote:

I have to eat some crow, and I hope somebody here can give me some advice for 
transforming MARC to RDF.

I am in the midst of writing a book describing the benefits of linked data for 
archives. Archival metadata usually comes in two flavors: EAD and MARC. I found 
a nifty XSL stylesheet from the Archives Hub (that’s in the United Kingdom) 
transforming EAD to RDF/XML. [1] With a bit of customization I think it could 
be used quite well for just about anybody with EAD files. I have retained a 
resulting RDF/XML file online. [2]

Converting MARC to RDF has been more problematic. There are various tools 
enabling me to convert my original MARC into MARCXML and/or MODS. After that I 
can reportedly use a few tools to convert to RDF:

   * MARC21slim2RDFDC.xsl [3] - functions, but even for
 my tastes the resulting RDF is too vanilla. [4]

   * modsrdf.xsl [5] - optimal, but when I use my
 transformation engine (Saxon), I do not get XML
 but rather plain text

   * BIBFRAME Tools [6] - sports nice ontologies, but
 the online tools won’t scale for large operations

In short, I have discovered nothing that is “easy-to-use”. Can you provide me 
with any other links allowing me to convert MARC to serialized RDF?

[1] ead2rdf.xsl - http://data.archiveshub.ac.uk/xslt/ead2rdf.xsl
[2] transformed EAD file - http://infomotions.com/tmp/una-ano.rdf
[3] MARC21slim2RDFDC.xsl - 
http://www.loc.gov/standards/marcxml/xslt/MARC21slim2RDFDC.xsl
[4] vanilla RDF - http://infomotions.com/tmp/pamphlets.rdf
[5] modsrdf.xsl - 
http://www.loc.gov/standards/mods/modsrdf/xsl-files/modsrdf.xsl
[6] BIBFRAME Tools - http://bibframe.org/tools/transform/start
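For readers who want a concrete feel for the kind of transformation the first bullet performs, here is a minimal Python sketch of a MARCXML-to-Dublin-Core-RDF mapping. The field choices, namespaces, and output shape are my own simplification for illustration; they do not reproduce MARC21slim2RDFDC.xsl's actual logic.

```python
# Minimal sketch of the kind of mapping MARC21slim2RDFDC.xsl performs:
# pick a handful of MARC fields and emit flat Dublin Core RDF/XML.
import xml.etree.ElementTree as ET

MARC = "http://www.loc.gov/MARC21/slim"
DC = "http://purl.org/dc/elements/1.1/"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# (MARC tag, subfield code) -> Dublin Core property (simplified mapping)
FIELD_MAP = {("245", "a"): "title", ("100", "a"): "creator", ("260", "b"): "publisher"}

def marcxml_to_dcrdf(marcxml: str) -> str:
    record = ET.fromstring(marcxml)
    ET.register_namespace("rdf", RDF)
    ET.register_namespace("dc", DC)
    root = ET.Element(f"{{{RDF}}}RDF")
    desc = ET.SubElement(root, f"{{{RDF}}}Description")
    for df in record.iter(f"{{{MARC}}}datafield"):
        tag = df.get("tag")
        for sf in df.iter(f"{{{MARC}}}subfield"):
            prop = FIELD_MAP.get((tag, sf.get("code")))
            if prop is not None:
                ET.SubElement(desc, f"{{{DC}}}{prop}").text = sf.text
    return ET.tostring(root, encoding="unicode")

sample = (
    '<record xmlns="http://www.loc.gov/MARC21/slim">'
    '<datafield tag="245" ind1="1" ind2="0">'
    '<subfield code="a">Walden</subfield></datafield>'
    '<datafield tag="100" ind1="1" ind2=" ">'
    '<subfield code="a">Thoreau, Henry David</subfield></datafield>'
    '</record>'
)
print(marcxml_to_dcrdf(sample))
```

Real conversions also need to handle indicators, repeated fields, and fixed fields, which is exactly where the existing XSLT stylesheets earn their keep.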

—
Eric Lease Morgan



Re: [CODE4LIB] transforming marc to rdf

2013-12-05 Thread Kevin Ford

 Anything that will remodel MARC to (decent) RDF is going to be:

 - Non-trivial to install
 - Non-trivial to use
 - Slow
 - Require massive amounts of memory/disk space

 Choose any two.
-- I'll second this.


 Frankly, I don't see how you can generate RDF that anybody would want to
 use from XSLT: where would your URIs come from?  What, exactly, are you
 modeling?
-- Our experience getting to good, URI-rich RDF has been basically a 
two-step process.  First there is the raw conversion, which certainly 
results in verbose, blank-node-rich RDF, but we follow that pass with a 
second one during which blank nodes are replaced with URIs.


This has most certainly been the case with BIBFRAME because X number of 
MARC records may represent varying manifestations of a single work.  We 
don't want X number of instances (manifestations basically) referencing 
X number of works in the end, but X number of instances referencing 1 
work (all other things being equal).  We consolidate - for the lack of a 
better word - X number of works created in the first pass into 1 work 
(identified by an HTTP URI) and then we make sure X number of instances 
point to that one work, removing all the duplicate blank-node-identified 
resources created during the first pass.


Granted this consolidation scenario is not scalable without a fairly 
robust backend solution, but the process at bibframe.org (the code on 
github) nevertheless does the type of consolidation described above in 
memory with small MARC collections.
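The two-pass consolidation described above can be sketched in miniature. The (title, creator) matching key and the example.org URI scheme below are illustrative assumptions, not BIBFRAME's actual matching logic:

```python
# Hedged sketch of a second-pass consolidation: after a raw conversion,
# each instance points at its own anonymous (blank-node) work; here we
# merge works sharing a simple (title, creator) key into one work
# identified by a minted HTTP URI, and repoint instances at it.
from collections import defaultdict

def consolidate(instances):
    """instances: list of dicts with 'id', 'title', 'creator', each
    initially pointing at its own anonymous work."""
    works = {}                 # (title, creator) -> minted work URI
    links = defaultdict(list)  # work URI -> instance ids pointing at it
    for inst in instances:
        key = (inst["title"], inst["creator"])
        if key not in works:
            works[key] = f"http://example.org/works/{len(works) + 1}"
        links[works[key]].append(inst["id"])
    return dict(links)

records = [
    {"id": "inst1", "title": "Walden", "creator": "Thoreau"},
    {"id": "inst2", "title": "Walden", "creator": "Thoreau"},
    {"id": "inst3", "title": "Cape Cod", "creator": "Thoreau"},
]
# Three instances collapse onto two works.
print(consolidate(records))
```

In practice the matching key would involve much richer work identification (uniform titles, authority-controlled names), which is why this kind of consolidation needs a robust backend at scale.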


Yours,
Kevin





On 12/05/2013 08:55 AM, Ross Singer wrote:

Eric, I'm having a hard time figuring out exactly what you're hoping to get.

Going from MARC to RDF was my great white whale for years while Talis' main
business interests involved both of those (although not archival
collections).  Anything that will remodel MARC to (decent) RDF is going to be:

- Non-trivial to install
- Non-trivial to use
- Slow
- Require massive amounts of memory/disk space

Choose any two.

--



Frankly, I don't see how you can generate RDF that anybody would want to
use from XSLT: where would your URIs come from?  What, exactly, are you
modeling?

I guess, to me, it would be a lot more helpful for you to take an archival
MARC record, and, by hand, build an RDF graph from it, then figure out your
mappings.  I just don't see any way to make it easy-to-use, at least, not
until you have an agreed upon model to map to.

-Ross.


On Thu, Dec 5, 2013 at 3:07 AM, Christian Pietsch 
chr.pietsch+web4...@googlemail.com wrote:


Hi Eric,

you seem to have missed the Catmandu tutorial at SWIB13. Luckily there
is a basic tutorial and a demo online: http://librecat.org/

The demo happens to be about transforming MARC to RDF using the
Catmandu Perl framework. It gives you full flexibility by separating
the importer from the exporter and providing a domain specific
language for “fixing” the data in between. Catmandu also has easy-to-use
wrappers for popular search engines and databases (both SQL and
NoSQL), making it a complete ETL (extract, transform, load) toolkit.
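A hedged sketch of what such a Catmandu pipeline can look like; the fix-language mapping and the command-line invocation below are from memory of the librecat.org documentation and require Catmandu::MARC and Catmandu::RDF to be installed, so verify them against the current docs before relying on them.

```
# dc.fix -- map a few MARC fields to flat keys (illustrative)
marc_map(245a, title)
marc_map(100a, creator)
remove_field(record)

# then chain importer, fixes, and exporter from the shell, e.g.:
#   catmandu convert MARC to RDF --fix dc.fix < records.mrc > records.ttl
```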

Disclosure: I am a Catmandu contributor. It's free and open source
software.

Cheers,
Christian


On Wed, Dec 04, 2013 at 09:59:46PM -0500, Eric Lease Morgan wrote:

Converting MARC to RDF has been more problematic. There are various
tools enabling me to convert my original MARC into MARCXML and/or
MODS. After that I can reportedly use a few tools to convert to RDF:

   * MARC21slim2RDFDC.xsl [3] - functions, but even for
 my tastes the resulting RDF is too vanilla. [4]

   * modsrdf.xsl [5] - optimal, but when I use my
 transformation engine (Saxon), I do not get XML
 but rather plain text

   * BIBFRAME Tools [6] - sports nice ontologies, but
 the online tools won’t scale for large operations


--
   Christian Pietsch · http://www.ub.uni-bielefeld.de/~cpietsch/
   LibTec · Library Technology and Knowledge Management
   Bielefeld University Library, Bielefeld, Germany



Re: [CODE4LIB] transforming marc to rdf

2013-12-04 Thread Corey A Harper
Eric,

Have you had a look at Ed Chamberlain's work on COMET:
https://github.com/edchamberlain/COMET

It's been a while since I've run this, but if I remember correctly, it was
fairly easy-to-use.

Also, though much older, I seem to remember the Simile MARC RDFizer being
a pretty straightforward one to run:
http://simile.mit.edu/wiki/MARC/MODS_RDFizer

MODS aficionados will point to some problems with some of its choices for
representing that data, but it's still a good starting point (IMO).

Hope that helps,
-Corey



On Wed, Dec 4, 2013 at 9:59 PM, Eric Lease Morgan emor...@nd.edu wrote:

 I have to eat some crow, and I hope somebody here can give me some advice
 for transforming MARC to RDF.

 I am in the midst of writing a book describing the benefits of linked data
 for archives. Archival metadata usually comes in two flavors: EAD and MARC.
 I found a nifty XSL stylesheet from the Archives Hub (that’s in the United
 Kingdom) transforming EAD to RDF/XML. [1] With a bit of customization I
 think it could be used quite well for just about anybody with EAD files. I
 have retained a resulting RDF/XML file online. [2]

 Converting MARC to RDF has been more problematic. There are various tools
 enabling me to convert my original MARC into MARCXML and/or MODS. After
 that I can reportedly use a few tools to convert to RDF:

   * MARC21slim2RDFDC.xsl [3] - functions, but even for
 my tastes the resulting RDF is too vanilla. [4]

   * modsrdf.xsl [5] - optimal, but when I use my
 transformation engine (Saxon), I do not get XML
 but rather plain text

   * BIBFRAME Tools [6] - sports nice ontologies, but
 the online tools won’t scale for large operations

 In short, I have discovered nothing that is “easy-to-use”. Can you provide
 me with any other links allowing me to convert MARC to serialized RDF?

 [1] ead2rdf.xsl - http://data.archiveshub.ac.uk/xslt/ead2rdf.xsl
 [2] transformed EAD file - http://infomotions.com/tmp/una-ano.rdf
 [3] MARC21slim2RDFDC.xsl -
 http://www.loc.gov/standards/marcxml/xslt/MARC21slim2RDFDC.xsl
 [4] vanilla RDF - http://infomotions.com/tmp/pamphlets.rdf
 [5] modsrdf.xsl -
 http://www.loc.gov/standards/mods/modsrdf/xsl-files/modsrdf.xsl
 [6] BIBFRAME Tools - http://bibframe.org/tools/transform/start

 —
 Eric Lease Morgan




-- 
Corey A Harper
Metadata Services Librarian
New York University Libraries
20 Cooper Square, 3rd Floor
New York, NY 10003-7112
212.998.2479
corey.har...@nyu.edu