Re: [CODE4LIB] EAD in Blacklight (was: Re: [CODE4LIB] Batch loading in fedora)

2010-08-11 Thread [Your Name]
Tom--

Yes and no. Yes, in the sense that nothing of policy prevents us from sharing 
it, but no, in the sense that it is currently -very- tightly bound up with our 
workflow machinery, so I don't know how useful it could immediately be to you. 
I can put you in touch with the programmer who constructed that workflow, if 
you like. Anyone else interested in that tooling is also welcome to contact me 
off-list.


---
A. Soroka
Digital Research and Scholarship R & D
the University of Virginia Library


On Aug 9, 2010, at 11:00 PM, CODE4LIB automatic digest system wrote:

 From: Tom Cramer tcra...@stanford.edu
 Date: August 9, 2010 11:09:02 AM EDT
 Subject: Re: EAD in Blacklight (was: Re: [CODE4LIB] Batch loading in fedora)
 
 
 Adam,
 
 Is the EAD-to-RDF graphinator code you describe shareable? I'd like to 
 experiment with it for some ongoing work that involves ingesting archival 
 collections into Fedora, and then editing them with Hydra and viewing them 
 via Blacklight. 
 
 - Tom
 
 
 On Aug 8, 2010, at 8:13 AM, [Your Name] wrote:
 
 I'd like to share an alternative approach that we're pursuing here at UVa. 
 It doesn't speak quite directly to operations on finding aids by themselves, 
 with no attention to representing the described collection on-line, but more 
 to those situations where you attempt a full digital surrogate for a 
 collection using repository machinery. I hope, though, that it might be 
 useful to hear about. We started from a few principles, as follows. (All of 
 them have exceptions, of course. {grin})


Re: [CODE4LIB] EAD in Blacklight (was: Re: [CODE4LIB] Batch loading in fedora)

2010-08-10 Thread Eric Lease Morgan
On Aug 10, 2010, at 11:59 AM, Ethan Gruber wrote:

 Sounds like a good plan, but I wanted to throw in my two cents on your
 workflow.  unitid is intended to be an optional element and describe an
 actual unique identifier that the object or collection has been given by the
 hosting institution.  For example, accession number.  unitid isn't
 absolutely intended to be a machine-readable value (for example, xml:id),
 though it could be.  I think what you want to do is populate the id
 attribute for each component with a unique xml:id.  This way you can make
 all your components have a machine-readable identifier while preserving any
 actual unique identifiers that describe the component.


Ethan, thank you for the feedback.

Yes, I have come to learn that unitid is not necessarily intended to be 
machine-readable, but I intended to make it such in my locally cached versions 
of the EAD. This is because of Archon. For better or for worse, it is possible 
to create a URL from the unitid that points directly to an Archon instance and 
the relevant sub-section of the EAD file. Ultimately this solves the problem of 
formatting the EAD for display as well as navigating it. An xml:id would not do 
the same trick, because Archon would not know how to handle it.
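
(For the xml:id route Ethan describes, a minimal hypothetical sketch in Ruby/Nokogiri might look like the following; the id scheme and file names are illustrative assumptions, not anything Archon or our workflow actually does:)

    # Hypothetical sketch: stamp every component in an EAD 2002 file with a
    # machine-readable id attribute, leaving existing unitid values untouched.
    # The id scheme and file names are illustrative only.
    require 'nokogiri'

    EAD_NS = { 'ead' => 'urn:isbn:1-931666-22-9' }  # drop the prefix for non-namespaced EAD

    doc = Nokogiri::XML(File.read('finding_aid.xml'))
    doc.xpath('//ead:dsc//ead:c | //ead:dsc//ead:c01 | //ead:dsc//ead:c02 | //ead:dsc//ead:c03',
              EAD_NS).each_with_index do |component, i|
      component['id'] ||= format('ref%04d', i + 1)  # only add an id where none exists
    end
    File.write('finding_aid_with_ids.xml', doc.to_xml)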

-- 
Eric Morgan


Re: [CODE4LIB] EAD in Blacklight (was: Re: [CODE4LIB] Batch loading in fedora)

2010-08-08 Thread [Your Name]
I'd like to share an alternative approach that we're pursuing here at UVa. It 
doesn't speak quite directly to operations on finding aids by themselves, with 
no attention to representing the described collection on-line, but more to 
those situations where you attempt a full digital surrogate for a collection 
using repository machinery. I hope, though, that it might be useful to hear 
about. We started from a few principles, as follows. (All of them have 
exceptions, of course. {grin})

1) EAD is a wonderful markup language, but not always an optimal metadata 
standard. 

2) XML is for serializing, not for storage.

3) Solr is a fantastic indexing tool, but it's neither a datastore nor a 
database.

4) Collections do not have an absolutely correct structure. Archivists and 
scholars disagree sometimes.

5) The best ways to describe an individual entity are not necessarily the best 
ways to describe the relationships between entities.

We assemble digital surrogates for archival collections as assemblages of 
Fedora objects linked together by RDF. When we start with a finding aid, we 
disassemble the EAD to develop a graph of documents, containers, series, etc. 
in Fedora, with RDF predicates along the lines of isConstituentOf, 
hasCollectionMember, etc. When we haven't got a finding aid, we build up the 
graph from annotations on the physical objects (boxes, folders, etc.) as they 
are processed for scanning. Obviously, we get a much simpler graph that way, 
because no claims have been made by archivists about the structure of the 
collection. Descriptive and other metadata is stored with each object in MODS 
and other good -metadata- formats. A document object has metadata that pertain 
only to the document (along with any data that permits us to represent the 
document on-line, e.g. a scanned image or TEI text), a folder object has 
metadata for that folder, etc. Since we want to offer EAD for a collection (or 
any piece thereof), we supply a Fedora behavior (dissemination) against any 
object, which behavior assembles a collection structure as seen from that 
object (by following the RDF graph), then recursively assembles the appropriate 
metadata and transforms it to produce EAD.
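
(As a purely hypothetical sketch of that disassembly step -- Ruby/Nokogiri with made-up PIDs and a made-up predicate URI; this is not the actual UVa workflow code:)

    # Hypothetical sketch: walk the <dsc> of an EAD 2002 file and emit one RDF
    # statement per component, linking each component to its nearest enclosing
    # component (or to the collection). A real ingest would create Fedora
    # objects and RELS-EXT datastreams rather than printing triples.
    require 'nokogiri'

    EAD_NS    = { 'ead' => 'urn:isbn:1-931666-22-9' }
    PREDICATE = '<http://example.org/relations#isConstituentOf>'  # illustrative URI

    doc        = Nokogiri::XML(File.read('finding_aid.xml'))
    archdesc   = doc.at_xpath('//ead:archdesc', EAD_NS)
    components = doc.xpath('//ead:dsc//ead:c | //ead:dsc//ead:c01 | //ead:dsc//ead:c02',
                           EAD_NS)

    # Assign an illustrative PID to the collection and to every component,
    # keyed by each node's unique XPath.
    pids = { archdesc.path => 'demo:collection-1' }
    components.each_with_index { |c, i| pids[c.path] = format('demo:component-%d', i + 1) }

    components.each do |c|
      parent = c.ancestors.find { |a| pids.key?(a.path) }  # nearest component, else archdesc
      puts "<info:fedora/#{pids[c.path]}> #{PREDICATE} <info:fedora/#{pids[parent.path]}> ."
    end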

We like this approach because it offers a great deal of extensibility (we could 
imagine using more sophisticated RDF to account for different opinions about a 
collection, or offering a METS or other structured view as well) and it keeps 
the repository contents idiomatic. We haven't yet figured out entirely how we 
bring this kind of content to Blacklight, but we'll be aided by the fact that 
we have appropriately-attached metadata for anything that should appear as a 
record in our indexes.

We're bringing the first part of this scheme (the assembly of object graphs) to 
production in the next fortnight or so. We've got the code ready and tested and 
are now enjoying the really fun stuff-- moving servers around and tinkering 
with clustering and the like. The second part (producing EAD live) is waiting 
to go to production on some work from our cataloging dep't, who have assigned 
some staff to polish up the mappings involved. We have very simple mappings in 
place now, but not ones good enough to publish publicly. They're working away, 
and we hope to see something in production later this fall. As for how we 
provide discoverability, we'll start simply by indexing all these objects into 
our local Blacklight instance. There's no need to consider how to index 
highly-structured XML because we're not storing it. We can move on to providing 
special views for records with awareness of the relationships that Fedora has 
recorded on those objects and tools for discovering, visualizing, and 
following them. Unfortunately, our one Blacklight developer 
has plenty on her plate already, so I don't know how quickly we'll be able to 
look at that. In the meanwhile, we can simply style out the 
dynamically-constructed EAD as part of a Blacklight view for a given record, 
which isn't particularly exciting, but is useful.

---
A. Soroka
Digital Research and Scholarship R & D
the University of Virginia Library


Re: [CODE4LIB] EAD in Blacklight (was: Re: [CODE4LIB] Batch loading in fedora)

2010-08-07 Thread Mark A. Matienzo
On Fri, Aug 6, 2010 at 2:17 PM, Bess Sadler bess.sad...@gmail.com wrote:
 +1. Potential options could include using an XML database like eXist,
 or using our approach at Yale (where EAD finding aids are stored as
 datastreams in Fedora objects). I've been eager to look at rethinking
 our approach, especially given the availability of the Hydra codebase.

 Absolutely. Also, this is one example I can think of where fedora 
 disseminators make perfect sense. Fedora can serve as your repository, and 
 then each guide can be accessed as

 http://your.repository.edu/fedora/get/YOUR_EAD_IDENTIFIER

 and each section can be grabbed via

 http://your.repository.edu/fedora/get/YOUR_EAD_IDENTIFIER/bioghist (or 
 whatever naming scheme makes sense to those with stronger opinions about EAD 
 than I do)

That's interesting, although I'm not sure how those disseminators
would operate. Are you predicting some sort of XSL transform that just
extracts that element?

 What I'd love to see is each item represented and described independently in 
 the repository, and then a full XML serialization of the EAD would just be 
 constructed on the fly, bringing in at serialization time any objects that 
 belong in a given section of the document.

 Institutionally, the biggest problem with EAD is version control and workflow 
 for keeping the documents up to date. I think splitting things up into 
 separate objects and only constructing the full EAD document as needed is a 
 good potential solution to this.

In theory this is a good idea, but Manuscripts and Archives at Yale
(along with a sizable number of other institutions) use a
database-backed collection management system like Archivists' Toolkit
or Archon to create archival description. The resulting EAD is then an
export from these systems. For us, we're not doing any major massaging
of the data or contents other than an XSL transform to make it conform
to a University-wide best practice schema.

I'd be interested to see how this would work though - I think it's a
potentially great implementation.

On Fri, Aug 6, 2010 at 2:23 PM, Adam Wead aw...@rockhall.org wrote:
 Mark,

 How are you creating the EAD docs in Fedora?  At present, we're using 
 archivist's toolkit to dump out ead xml files and then I index them in solr, 
 with blacklight displaying the entire document as well.  It's messy and it 
 would be nice to make a more efficient connection between the three (BL, 
 Fedora and Solr).  I'd love to show everyone what I have, but they keep us on 
 a private network here.

Adam - we're doing the same, essentially, except Fedora serves as the
data store for the exported EAD. I've been considering potential
implementations that would allow uploads of EAD datastreams using a
Hydra-based web application, and then using Solrizer to manipulate the
EAD into an update document or set thereof. I just haven't had time to
commit to looking into this.

Mark A. Matienzo
Digital Archivist, Manuscripts and Archives
Yale University Library


Re: [CODE4LIB] EAD in Blacklight (was: Re: [CODE4LIB] Batch loading in fedora)

2010-08-06 Thread Ethan Gruber
Jason,

Thanks for the info.  Nokogiri is alright, but I've found that, as far as
XML processing goes, Saxon is above and beyond the best.  Is it possible to
fire off a Java call from Ruby to have Saxon handle it, or not?  Are you
using Nokogiri to call an XSLT process or using Ruby to generate the view?
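
(For what it's worth, one minimal way to drive Saxon from Ruby is simply to shell out to the Saxon-HE command line; the jar path and file names below are illustrative assumptions, not anything from Jason's plugin:)

    # Hypothetical sketch: run an XSLT 2.0 transform with Saxon-HE by shelling
    # out to Java. Jar location, stylesheet, and file names are illustrative.
    saxon_jar  = '/opt/saxon/saxon9he.jar'
    stylesheet = 'ead_to_html.xsl'

    ok = system('java', '-jar', saxon_jar,
                '-s:finding_aid.xml',    # source document
                "-xsl:#{stylesheet}",    # stylesheet
                '-o:finding_aid.html')   # output file
    raise 'Saxon transform failed' unless ok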

I've also heard about scalability issues with Solr and large XML documents,
but I've never seen benchmarks.

Ethan

On Mon, Aug 2, 2010 at 8:46 AM, Jason Ronallo jrona...@gmail.com wrote:

 Ethan,
 The plugin I wrote for Blacklight is just a start and was a proof of
 concept/template. Having said that, this is basically code I extracted
 from another application I have in production. In that case it wasn't
 necessary to display every detail in the EAD, so it is really just a
 short view.

 The plugin does some very basic indexing of the EAD to conform to the
 default Blacklight Solr schema. It could certainly be expanded to get
 better faceting and fielded search in a customized Blacklight. Lots of
 possibilities for expansion. The indexing also takes the simple
 approach of one EAD XML document being one Solr document. Other folks
 have played around with splitting an EAD doc into different Solr
 documents, but I haven't been satisfied with either the display of the
 search results or show views, which have seemed too fragmented to me.
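
 (As a rough, hypothetical illustration of that one-EAD-document-to-one-Solr-document approach -- the field names here are illustrative, not the plugin's actual schema:)

     # Hypothetical sketch: index a whole EAD file as a single Solr document.
     # Field names are illustrative, not the default Blacklight schema.
     require 'nokogiri'
     require 'rsolr'

     EAD_NS = { 'ead' => 'urn:isbn:1-931666-22-9' }

     ead   = Nokogiri::XML(File.read('finding_aid.xml'))
     id    = ead.at_xpath('//ead:eadid', EAD_NS)&.text&.strip
     title = ead.at_xpath('//ead:titleproper', EAD_NS)&.text&.strip

     solr = RSolr.connect(url: 'http://localhost:8983/solr')
     solr.add(
       'id'            => id,
       'format'        => 'Archival Collection Guide',
       'title_display' => title,
       'text'          => ead.text.gsub(/\s+/, ' ')  # full text for keyword search
     )
     solr.commit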

 The display in the plugin is one page for the whole finding aid. The
 display is concise, but that's not the biggest problem with it. The
 EAD XML is stored as a Solr field. I've heard conflicting information
 about this, but it may be slow to retrieve large fields from Solr.
 (Anyone want to put that idea to rest?) The biggest problem with this
 implementation, though, is that the XML parsing is done using the
 Nokogiri DOM parser. Nokogiri is fast enough, but still loading up the
 whole DOM into memory and looping through a long container list can
 take a very long time. I've worked around that with partial caching in
 my applications.

 If you want to see it in action, it is very easy to set up if you
 already have Ruby installed. Just one template command to build the
 Rails app and then answer yes to all the questions. Remember to start
 jetty before trying to index.
 http://github.com/jronallo/blacklight_ext_ead_simple

 I have been fooling around with creating a new library that uses
 Nokogiri's SAX parser. This makes parsing on the fly much faster. I'm
 also attempting to deal with more of the content as found in a basic
 Archivists' Toolkit EAD XML doc. The problem with the SAX parsing is
 that you have to deal with all the craziness of EAD as it is streaming
 at you. I have something basically working, if messy, which I hope to
 have up on github soon.
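
 (A minimal, hypothetical sketch of the SAX approach -- not the library described above, and deliberately ignoring most of EAD's structure:)

     # Hypothetical sketch: stream an EAD file with Nokogiri's SAX parser and
     # print component titles, without ever building a full DOM in memory.
     require 'nokogiri'

     class UnittitleLister < Nokogiri::XML::SAX::Document
       def initialize
         @in_unittitle = false
         @buffer = +''
       end

       def start_element(name, _attrs = [])
         return unless name == 'unittitle'
         @in_unittitle = true
         @buffer = +''
       end

       def characters(string)
         @buffer << string if @in_unittitle
       end

       def end_element(name)
         return unless name == 'unittitle' && @in_unittitle
         puts @buffer.strip
         @in_unittitle = false
       end
     end

     Nokogiri::XML::SAX::Parser.new(UnittitleLister.new).parse(File.open('finding_aid.xml'))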

 Please let me know if you have any other questions about this.

 Jason

 On Fri, Jul 30, 2010 at 11:17 AM, Ethan Gruber ewg4x...@gmail.com wrote:
  By displays it, do you mean there is a view for displaying some
 metadata
  about the EAD guide in the blacklight search results or that the entire
  guide is rendered out in blacklight somehow?  Hopefully Jason is on the
  list.  I'm curious about this.
 
  Thanks,
  Ethan
 
  On Fri, Jul 30, 2010 at 11:06 AM, Adam Wead aw...@rockhall.org wrote:
 
  Takes an ead doc, indexes it in solr, and displays it via blacklight.  I
 think
  Jason's on this list, so he could tell you more about it.  I took it and
  modified the display a bit.  It's available via git:
 
  http://github.com/jronallo/blacklight_ext_ead_simple
 
 
 
  -Original Message-
  From: Code for Libraries on behalf of Ethan Gruber
  Sent: Fri 7/30/2010 10:06 AM
  To: CODE4LIB@LISTSERV.ND.EDU
  Subject: Re: [CODE4LIB] Batch loading in fedora
 
  What does the EAD plugin do?  I haven't heard much about it.
 
  Ethan
 
  On Fri, Jul 30, 2010 at 10:03 AM, Adam Wead aw...@rockhall.org wrote:
 
   Hardy,
  
   Here's the task:
  
   http://github.com/awead/rocklight/blob/master/lib/tasks/fedora.rake
  
   I just threw up the project on git, so there's not much explanation of
   anything.  It's very much a work-in-progress.  It uses blacklight, an
 ead
   plugin that Jason Ronallo wrote, and a bunch of
 active-fedora/hydrangea
   code.  The image ingest process is designed to attach an image pid to
 an
   existing pid in fedora that is the archival collection.  I've been
 only
   testing this, so right now it ingests some jpg files and uses image
  magick
   to resize them into a thumbnail and access version.  In real life
 the
   preservation stream would be tiff and the thumbnail and access version
  would
   be jpegs.  I also threw in a jhove datastream for fun, but I'm not
 doing
   anything with it at this point other than just storing it.
  
    The three descriptive metadata streams are from the active-fedora model.
Ideally, we'd use a mods schema for all the descriptive data instead
 of
   these three different ones, but that'll be the next step.
  
   let me know if you have comments or questions.  

Re: [CODE4LIB] EAD in Blacklight (was: Re: [CODE4LIB] Batch loading in fedora)

2010-08-06 Thread Bess Sadler
Hi, Ethan.

You can see another example of blacklight being used to search and display EAD 
guides at 

http://nwda.projectblacklight.org/?f%5Bformat_facet%5D%5B%5D=Archival+Collection+Guide

I've used solr and/or lucene for EAD documents a few times, and here are some 
observations: 

 I've also heard about scalability issues with Solr and large XML documents,
 but I've never seen benchmarks.

Solr is incredibly scalable, so describing this as a solr scalability issue 
isn't really accurate. What might be more accurate would be to say that Solr is 
designed for searching, while most people looking for an EAD solution are 
trying to get it to do a lot more than that. The problem is that you want to be 
able to discover and view an EAD guide at several levels, right? You want to be 
able to discover at the collection level, and at the item level, and presumably 
at the level of some section of the EAD document (e.g., biographical history or 
whatever). Solr and lucene really just know how to tell you whether a given 
document in the index matches a query you've entered, though, so if you want to 
be able to discover on each of those levels, you have to index your document 
once to represent the collection, then again for each section you want to be 
independently discoverable, then again for each item you want to be 
discoverable. Creating a UI that is going to represent a single EAD, which has 
now been transformed into potentially hundreds or thousands of independently 
discoverable items and EAD sections, is quite challenging. I 
liked what Matt Mitchell and I did for the Northwest Digital Archives, but I'm 
always interested in other ways one might approach this. 

We indexed each EAD guide into separate lucene documents for each EAD section, 
then collapsed them under the main EAD title in the search results, so that 
when you search for an archival collection you only see the EAD guide 
represented once, but each section of it is still independently viewable and 
bookmarkable:

Here is the guide for the Bing Crosby Historical Society in a search result:

http://nwda.projectblacklight.org/catalog?q=crosbyqt=searchper_page=10f%5Bformat_facet%5D%5B%5D=Archival+Collection+Guidecommit=search

But in order to look at the guide, you have to look at a specific part of it: 
http://nwda.projectblacklight.org/catalog/bcc_1-summary

Additionally, we treated each item as a first class independently discoverable 
object, but still linked them all to the section of the EAD document where they 
came from:

http://nwda.projectblacklight.org/catalog/bcc_1-v
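
(A hypothetical sketch of that per-section indexing -- the field names, the collapse field, and the section list are all illustrative, not the NWDA code:)

    # Hypothetical sketch: one Solr document per EAD section, each sharing a
    # collapse field that names its parent guide. Field names are illustrative.
    require 'nokogiri'
    require 'rsolr'

    EAD_NS = { 'ead' => 'urn:isbn:1-931666-22-9' }

    ead      = Nokogiri::XML(File.read('finding_aid.xml'))
    guide_id = ead.at_xpath('//ead:eadid', EAD_NS).text.strip
    title    = ead.at_xpath('//ead:titleproper', EAD_NS).text.strip

    docs = %w[bioghist scopecontent arrangement].flat_map do |section|
      ead.xpath("//ead:archdesc/ead:#{section}", EAD_NS).map.with_index do |node, i|
        {
          'id'            => "#{guide_id}-#{section}#{i + 1}",
          'collection_s'  => guide_id,                   # field the results collapse on
          'title_display' => "#{title} (#{section})",
          'text'          => node.text.gsub(/\s+/, ' ')
        }
      end
    end

    solr = RSolr.connect(url: 'http://localhost:8983/solr')
    solr.add(docs)
    solr.commit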

Matt and I were thinking it would be nice to allow blacklight to handle all of 
the display of the EAD too, which is why we stored a lot of EAD markup in the 
solr document, and that can potentially have scalability problems, because 
lucene is not a database but we were treating it like one. This works, but it's 
a bit of a hack. 

Bess


Re: [CODE4LIB] EAD in Blacklight (was: Re: [CODE4LIB] Batch loading in fedora)

2010-08-06 Thread Bess Sadler
On Aug 6, 2010, at 9:10 AM, Jonathan Rochkind wrote:

 We indexed each EAD guide into separate lucene documents for each EAD 
 section, then collapsed them under the main EAD title in the search results,
 
 Curious how you implemented that: Did you use the Solr field collapsing 
 patch that's not yet part of a standard distro?

Yes, exactly. 

 
 Matt and I were thinking it would be nice to allow blacklight to handle all 
 of the display of the EAD too, 
 which is why we stored a lot of EAD markup in the solr document, and that 
 can potentially have scalability
 problems, because lucene is not a database but we were treating it like one. 
 This works, but it's a bit of a hack.
 
 You can definitely have Blacklight handle the display while still keeping the 
 EAD out of solr stored fields. There's no reason Blacklight can't fetch the 
 EAD from some external store, keyed by Solr document ID (or by some other 
 value in a solr document stored field).  That's my current thinking (informed 
 by y'alls experience) of how I'm going to handle future large object stuff in 
 BL, if/when I get around to developing it. 

Yeah, that's a good point. We were trying to self-contain the whole thing for 
ease of deployment, but I'm not sure that's a good approach. It's better if 
your EAD is in a real repository and Blacklight just presents it.
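
(A hypothetical sketch of that pattern -- the store URL template and the Solr field name are assumptions, not anything in a Blacklight release:)

    # Hypothetical sketch: keep only an identifier in Solr and fetch the EAD
    # XML from an external store at display time. The URL template and field
    # name are illustrative assumptions.
    require 'open-uri'

    EAD_STORE = 'http://repository.example.edu/fedora/get'

    # e.g. called from a show-view helper with the current Solr document
    def ead_xml_for(solr_document)
      ead_id = solr_document['ead_id_s']            # stored identifier, not the EAD itself
      URI.parse("#{EAD_STORE}/#{ead_id}/EAD").read  # fetch the datastream on demand
    end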

Bess


Re: [CODE4LIB] EAD in Blacklight (was: Re: [CODE4LIB] Batch loading in fedora)

2010-08-06 Thread Mark A. Matienzo
On Fri, Aug 6, 2010 at 1:09 PM, Bess Sadler bess.sad...@gmail.com wrote:
 On Aug 6, 2010, at 9:10 AM, Jonathan Rochkind wrote:

 We indexed each EAD guide into separate lucene documents for each EAD 
 section, then collapsed them under the main EAD title in the search results,

 Curious how you implemented that: Did you use the Solr field collapsing 
 patch that's not yet part of a standard distro?

 Yes, exactly.

Bess - would you be willing to share code or brief notes about how to
set this up?

 Yeah, that's a good point. We were trying to self-contain the whole thing for 
 ease of deployment, but I'm not sure that's a good approach. It's better if 
 your EAD is in a real repository and Blacklight just presents it.

+1. Potential options could include using an XML database like eXist,
or using our approach at Yale (where EAD finding aids are stored as
datastreams in Fedora objects). I've been eager to look at rethinking
our approach, especially given the availability of the Hydra codebase.

Mark A. Matienzo
Digital Archivist, Manuscripts and Archives
Yale University Library


Re: [CODE4LIB] EAD in Blacklight (was: Re: [CODE4LIB] Batch loading in fedora)

2010-08-06 Thread Ethan Gruber
I also think it's better to store EAD in a separate system rather than in
the Solr index; that way you can use blacklight to serialize it or store a
reference to a separate delivery system.  Bess's and Matt's approach to
storing the whole collection (EAD file) as a solr document in addition to
making each item accessible in blacklight is a good one.  Hopefully this
will encourage institutions to use EAD in its fullest and encode items at a
very detailed level.  I'd say at least 95% of finding aids I have seen have
little more than title, date, and container for items, if items are
enumerated at all.  The thing about most EAD delivery systems is that they
assume you wish to use EAD in its traditional form, as an electronic finding
aid that you wish to render in full on the screen.  Most systems don't
accommodate an emphasis on item level encoding, display, and findability.

Ethan

On Fri, Aug 6, 2010 at 1:53 PM, Mark A. Matienzo m...@matienzo.org wrote:

 On Fri, Aug 6, 2010 at 1:09 PM, Bess Sadler bess.sad...@gmail.com wrote:
  On Aug 6, 2010, at 9:10 AM, Jonathan Rochkind wrote:
 
  We indexed each EAD guide into separate lucene documents for each EAD
 section, then collapsed them under the main EAD title in the search results,
 
  Curious how you implemented that: Did you use the Solr field
 collapsing patch that's not yet part of a standard distro?
 
  Yes, exactly.

 Bess - would you be willing to share code or brief notes about how to
 set this up?

  Yeah, that's a good point. We were trying to self-contain the whole thing
 for ease of deployment, but I'm not sure that's a good approach. It's better
 if your EAD is in a real repository and Blacklight just presents it.

 +1. Potential options could include using an XML database like eXist,
 or using our approach at Yale (where EAD finding aids are stored as
 datastreams in Fedora objects). I've been eager to look at rethinking
 our approach, especially given the availability of the Hydra codebase.

 Mark A. Matienzo
 Digital Archivist, Manuscripts and Archives
 Yale University Library



Re: [CODE4LIB] EAD in Blacklight (was: Re: [CODE4LIB] Batch loading in fedora)

2010-08-06 Thread Bess Sadler
On Aug 6, 2010, at 10:53 AM, Mark A. Matienzo wrote:

 On Fri, Aug 6, 2010 at 1:09 PM, Bess Sadler bess.sad...@gmail.com wrote:
 On Aug 6, 2010, at 9:10 AM, Jonathan Rochkind wrote:
 
 We indexed each EAD guide into separate lucene documents for each EAD 
 section, then collapsed them under the main EAD title in the search 
 results,
 
  Curious how you implemented that: Did you use the Solr field collapsing 
 patch that's not yet part of a standard distro?
 
 Yes, exactly.
 
 Bess - would you be willing to share code or brief notes about how to
 set this up?

Gladly. I will write it as a separate message though, for ease of future 
reference. 

 
 Yeah, that's a good point. We were trying to self-contain the whole thing 
 for ease of deployment, but I'm not sure that's a good approach. It's better 
 if your EAD is in a real repository and Blacklight just presents it.
 
 +1. Potential options could include using an XML database like eXist,
 or using our approach at Yale (where EAD finding aids are stored as
 datastreams in Fedora objects). I've been eager to look at rethinking
 our approach, especially given the availability of the Hydra codebase.

Absolutely. Also, this is one example I can think of where fedora disseminators 
make perfect sense. Fedora can serve as your repository, and then each guide 
can be accessed as 

http://your.repository.edu/fedora/get/YOUR_EAD_IDENTIFIER

and each section can be grabbed via 

http://your.repository.edu/fedora/get/YOUR_EAD_IDENTIFIER/bioghist (or whatever 
naming scheme makes sense to those with stronger opinions about EAD than I do) 

What I'd love to see is each item represented and described independently in 
the repository, and then a full XML serialization of the EAD would just be 
constructed on the fly, bringing in at serialization time any objects that 
belong in a given section of the document. 

Institutionally, the biggest problem with EAD is version control and workflow 
for keeping the documents up to date. I think splitting things up into separate 
objects and only constructing the full EAD document as needed is a good 
potential solution to this. 
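
(A hypothetical sketch of that on-the-fly serialization -- the item lookup and the record fields are entirely made up:)

    # Hypothetical sketch: build the EAD serialization on demand from
    # independently stored item records. `items_for` and the record fields are
    # stand-ins for whatever repository API actually holds the items.
    require 'nokogiri'

    def items_for(series_id)
      # stand-in for a repository query;
      # returns e.g. [{ 'title' => 'Letter, 1936', 'container' => '1' }, ...]
      []
    end

    def ead_for(collection)
      Nokogiri::XML::Builder.new do |xml|
        xml.ead do
          xml.archdesc(level: 'collection') do
            xml.did { xml.unittitle collection['title'] }
            xml.dsc do
              collection['series_ids'].each do |sid|
                xml.c01(level: 'series') do
                  items_for(sid).each do |item|
                    xml.c02(level: 'item') do
                      xml.did do
                        xml.unittitle item['title']
                        xml.container item['container'], type: 'box'
                      end
                    end
                  end
                end
              end
            end
          end
        end
      end.to_xml
    end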

Bess


Re: [CODE4LIB] EAD in Blacklight (was: Re: [CODE4LIB] Batch loading in fedora)

2010-08-06 Thread Adam Wead
Mark,

How are you creating the EAD docs in Fedora?  At present, we're using 
archivist's toolkit to dump out ead xml files and then I index them in solr, 
with blacklight displaying the entire document as well.  It's messy and it 
would be nice to make a more efficient connection between the three (BL, Fedora 
and Solr).  I'd love to show everyone what I have, but they keep us on a 
private network here.

...adam

-Original Message-
From: Code for Libraries on behalf of Mark A. Matienzo
Sent: Fri 8/6/2010 1:53 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] EAD in Blacklight (was: Re: [CODE4LIB] Batch loading in 
fedora)
 
+1. Potential options could include using an XML database like eXist,
or using our approach at Yale (where EAD finding aids are stored as
datastreams in Fedora objects). I've been eager to look at rethinking
our approach, especially given the availability of the Hydra codebase.

Mark A. Matienzo
Digital Archivist, Manuscripts and Archives
Yale University Library


 


Re: [CODE4LIB] EAD in Blacklight (was: Re: [CODE4LIB] Batch loading in fedora)

2010-08-06 Thread Ethan Gruber
Hi Adam,

I posted an update last Friday on a project I have been working on since
last fall called EADitor, an XForms application for creating, managing, and
publishing EAD collections.  I'm using an eXist datastore, but one could
adapt the XForms application to load and save data to and from another REST
service, like Fedora.  Plus I have XForm submissions to transform the EAD
document to a Solr doc and post it to the index (as well as deleting).  That
could be adapted to post to a blacklight index instead of the one I packaged
internally to the application as part of its own publication mechanism.

Here's a link to the code page: http://code.google.com/p/eaditor/

I'm presenting it at the EAD roundtable at SAA next week.

Ethan

On Fri, Aug 6, 2010 at 2:23 PM, Adam Wead aw...@rockhall.org wrote:

 Mark,

 How are you creating the EAD docs in Fedora?  At present, we're using
 archivist's toolkit to dump out ead xml files and then I index them in solr,
 with blacklight displaying the entire document as well.  It's messy and it
 would be nice to make a more efficient connection between the three (BL,
 Fedora and Solr).  I'd love to show everyone what I have, but they keep us
 on a private network here.

 ...adam

 -Original Message-
 From: Code for Libraries on behalf of Mark A. Matienzo
 Sent: Fri 8/6/2010 1:53 PM
 To: CODE4LIB@LISTSERV.ND.EDU
 Subject: Re: [CODE4LIB] EAD in Blacklight (was: Re: [CODE4LIB] Batch
 loading in fedora)

 +1. Potential options could include using an XML database like eXist,
 or using our approach at Yale (where EAD finding aids are stored as
 datastreams in Fedora objects). I've been eager to look at rethinking
 our approach, especially given the availability of the Hydra codebase.

 Mark A. Matienzo
 Digital Archivist, Manuscripts and Archives
 Yale University Library






Re: [CODE4LIB] EAD in Blacklight (was: Re: [CODE4LIB] Batch loading in fedora)

2010-08-06 Thread Jason Ronallo
On Fri, Aug 6, 2010 at 12:10 PM, Jonathan Rochkind rochk...@jhu.edu wrote:
 You can definitely have Blacklight handle the display while still keeping the 
 EAD out of solr stored fields. There's no reason Blacklight can't fetch the 
 EAD from some external store, keyed by Solr document ID (or by some other 
 value in a solr document stored field).  That's my current thinking (informed 
 by y'alls experience) of how I'm going to handle future large object stuff in 
 BL, if/when I get around to developing it.

I've wanted to try an approach similar to how paperclip [1] works
where a filesystem storage location is chosen that survives between
deployments when using something like Capistrano. Using the filesystem
would allow the EAD XML to be easily served up directly from the
public directory.
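
(A hypothetical sketch of that, Capistrano 2 style, with illustrative paths:)

    # Hypothetical sketch (config/deploy.rb, Capistrano 2 style): keep EAD XML
    # in the shared directory so it survives deploys, and symlink it into
    # public/ so the web server can serve the files directly.
    after 'deploy:update_code', 'ead:symlink'

    namespace :ead do
      task :symlink, roles: :app do
        run "mkdir -p #{shared_path}/ead"
        run "ln -nfs #{shared_path}/ead #{release_path}/public/ead"
      end
    end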

Jason

[1] http://github.com/thoughtbot/paperclip


Re: [CODE4LIB] EAD in Blacklight (was: Re: [CODE4LIB] Batch loading in fedora)

2010-08-06 Thread Ranti Junus
On Fri, Aug 6, 2010 at 2:17 PM, Bess Sadler bess.sad...@gmail.com wrote:
 On Aug 6, 2010, at 10:53 AM, Mark A. Matienzo wrote:


 We indexed each EAD guide into separate lucene documents for each EAD 
 section, then collapsed them under the main EAD title in the search 
 results,

  Curious how you implemented that: Did you use the Solr field collapsing 
 patch that's not yet part of a standard distro?

 Yes, exactly.

 Bess - would you be willing to share code or brief notes about how to set 
 this up?

 Gladly. I will write it as a separate message though, for ease of future 
 reference.

Bess, is this something you could put on the blacklight project website or
on the code4lib wiki?


thanks,
ranti.


-- 
Bulk mail.  Postage paid.


Re: [CODE4LIB] EAD in Blacklight (was: Re: [CODE4LIB] Batch loading in fedora)

2010-08-06 Thread Jonathan Rochkind
I wonder how the field collapsing patch holds up on an index that contains 3 
million documents, probably larger than your EAD-only one, but thinking about 
combining EAD in an index with many many other documents (like with a library 
catalog).  Might be fine, might not.

(Even without field collapsing, my solr index is really straining against the 
numerous facets I'm making it calculate and the dismax queries involving a 
dozen or more fields -- I plan to reduce my fields, reduce my facets if 
possible, and most importantly give my Solr a LOT more RAM than it has now. 
Complex queries with complex faceting on a several-million-doc index require 
giving Solr a LOT more RAM for caches etc. than we initially expected; I throw 
this in as a note to anyone else in the planning stages). 

I've been brainstorming other weird ways to do this. This one is totally wacky 
and possibly a bad idea, but I'll throw it out there anyway. What if you only 
indexed the entire EAD as one document, BUT threw the entire EAD in a stored 
field, and used solr highlighting on that field.  NOT to show the highlighter 
results to the user, but to sort of trick the highlighter, using 
hl.fragmenter/fragmentsBuilder (possibly with a custom component in a jar) into 
telling you _which_ sub-sections of the EAD matched, and your software could 
then display the matching sub-sections (possibly with direct links to display) 
in the search results, under the actual document hit. 
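
(A hypothetical sketch of the query side of that trick, via RSolr -- the field name and parameter values are illustrative:)

    # Hypothetical sketch: ask Solr which fragments of the stored EAD field
    # matched, and use the fragments only to decide which sub-sections to
    # surface. Field name and parameter values are illustrative.
    require 'rsolr'

    solr     = RSolr.connect(url: 'http://localhost:8983/solr')
    response = solr.get('select', params: {
      q:            'crosby',
      fl:           'id,title_display',
      hl:           true,
      'hl.fl'       => 'ead_stored',   # stored field holding the whole EAD
      'hl.snippets' => 20,
      'hl.fragsize' => 200
    })

    response['highlighting'].each do |doc_id, fields|
      fragments = fields['ead_stored'] || []
      # a real implementation would map fragments back to EAD sub-sections here
      puts "#{doc_id}: #{fragments.size} matching fragment(s)"
    end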

Possibly a really screwy idea, just throwing it out there. Solr highlighting 
can be a performance problem on very large stored documents too, not sure if 
typical EAD is 'very large' for these purposes, or if it's something that can 
be solved by throwing enough RAM at caches. But I guess something about the 
field collapsing patch makes me nervous, comments about its performance being 
uncertain on very large result sets, or just nervousness about applying a patch 
to solr and counting on someone else to keep it working against solr master as 
it develops. 

Jonathan


Re: [CODE4LIB] EAD in Blacklight (was: Re: [CODE4LIB] Batch loading in fedora)

2010-08-06 Thread Bess Sadler
On Aug 6, 2010, at 8:07 PM, Jonathan Rochkind wrote:

 I've been brainstorming other weird ways to do this. This one is totally 
 wacky and possibly a bad idea, but I'll throw it out there anyway. What if 
 you only indexed the entire EAD as one document, BUT threw the entire EAD in 
 a stored field, and used solr highlighting on that field.  NOT to show the 
 highlighter results to the user, but to sort of trick the highlighter, using 
 hl.fragmenter/fragmentsBuilder (possibly with a custom component in a jar) into 
 telling you _which_ sub-sections of the EAD matched, and your software could 
 then display the matching sub-sections (possibly with direct links to 
 display) in the search results, under the actual document hit. 

Hi, Jonathan. I don't think this is a crazy idea, and in fact it is one of the 
approaches that Matt M. and I tried during our NWDA project. However, we found 
that it wasn't scalable. The highlighter was way too slow with the number of 
documents and fragments we were throwing at it. It wasn't even a huge number of 
documents, so we abandoned that idea. However, it's still a really elegant 
solution if only it were performant. Let me know if you decide to give it a 
try. 

Bess


Re: [CODE4LIB] EAD in Blacklight (was: Re: [CODE4LIB] Batch loading in fedora)

2010-08-06 Thread Jonathan Rochkind
Huh, since the highlighter only needs to run on the documents in the actual 
returned section of the result set (10-50?), I wouldn't think total number of 
documents would matter much (I certainly could be wrong), but total size of 
each document's stored field definitely has a known performance impact on the 
highlighter. Maybe some time I'll have the time, or the local requirement, to 
investigate; wonder if there'd be a way to write a custom highlighting 
component optimized for the EAD use case, or for the general case of identifying 
matching section(s) in XML, that would do better. 

I'm less nervous about custom components that do not require patches to Solr 
than I am about patches to Solr core that are not (yet?) included in a tagged 
solr release or trunk.  

With some of the stuff I'm working with, RAM seems to sometimes have unexpected 
impacts on performance too. From thinking about what it does, and from looking 
at my cache hit/miss/eviction statistics, I didn't really have reason to think 
that lack of RAM was what was slowing down my StatsComponent use, but adding 
RAM seems to help a lot. I need a hardware upgrade to be able to add enough RAM 
and avoid swap, to be sure that what I think I'm seeing about RAM effects on 
performance is what I'm seeing, but I think so.   Wonder if throwing monster 
amounts of RAM at Solr and increasing certain relevant caches a lot would have 
an impact on highlighter performance. 

I've  thought about using the highlighter in that way on Marc documents to 
provide matching snippets ala google in hits page -- the fact that Marc 
documents aren't full text', but are lists of structured (well, you know, they 
try :) ) fields, means that you can't just use the highlighter out of the box 
and get a reasonable snippet to show the user, but if you could use it to 
identify which _fields_ matched the query, and then throw each matching field 
(or the first N) through a display mapper that labels it and formats it 
appropriately (my as-of-yet not publically released marc mapping ruby framework 
could handle that nicely), that could provide a nice hit snippet perhaps. A 
large marc document is probably still smaller than a typical EAD document, so 
might have greater chance of success. 

From: Code for Libraries [code4...@listserv.nd.edu] On Behalf Of Bess Sadler 
[bess.sad...@gmail.com]
Sent: Saturday, August 07, 2010 12:41 AM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] EAD in Blacklight (was: Re: [CODE4LIB] Batch loading in 
fedora)

On Aug 6, 2010, at 8:07 PM, Jonathan Rochkind wrote:

 I've been brainstorming other weird ways to do this. This one is totally 
 wacky and possibly a bad idea, but I'll throw it out there anyway. What if 
 you only indexed the entire EAD as one document, BUT threw the entire EAD in 
 a stored field, and used solr highlighting on that field.  NOT to show the 
 highlighter results to the user, but to sort of trick the highlighter, using 
 hl.fragmenter/fragmentsBuilder (possibly with a custom component in a jar) into 
 telling you _which_ sub-sections of the EAD matched, and your software could 
 then display the matching sub-sections (possibly with direct links to 
 display) in the search results, under the actual document hit.

Hi, Jonathan. I don't think this is a crazy idea, and in fact it is one of the 
approaches that Matt M. and I tried during our NWDA project. However, we found 
that it wasn't scalable. The highlighter was way too slow with the number of 
documents and fragments we were throwing at it. It wasn't even a huge number of 
documents, so we abandoned that idea. However, it's still a really elegant 
solution if only it were performant. Let me know if you decide to give it a try.

Bess