Re: [CODE4LIB] EAD in Blacklight (was: Re: [CODE4LIB] Batch loading in fedora)
Tom--

Yes and no. Yes, in the sense that nothing of policy prevents us from sharing it, but no, in the sense that it is currently -very- tightly bound up with our workflow machinery, so I don't know how immediately useful it would be to you. I can put you in touch with the programmer who constructed that workflow, if you like. Anyone else interested in that tooling is also welcome to contact me off-list.

--- A. Soroka
Digital Research and Scholarship R&D, the University of Virginia Library

On Aug 9, 2010, at 11:00 PM, CODE4LIB automatic digest system wrote:

From: Tom Cramer tcra...@stanford.edu
Date: August 9, 2010 11:09:02 AM EDT
Subject: Re: EAD in Blacklight (was: Re: [CODE4LIB] Batch loading in fedora)

Adam,

Is the EAD-to-RDF graphinator code you describe shareable? I'd like to experiment with it for some ongoing work that involves ingesting archival collections into Fedora, and then editing them with Hydra and viewing them via Blacklight.

- Tom
Re: [CODE4LIB] EAD in Blacklight (was: Re: [CODE4LIB] Batch loading in fedora)
On Aug 10, 2010, at 11:59 AM, Ethan Gruber wrote:

Sounds like a good plan, but I wanted to throw in my two cents on your workflow. unitid is intended to be an optional element describing an actual unique identifier that the object or collection has been given by the hosting institution -- for example, an accession number. unitid isn't necessarily intended to be a machine-readable value (unlike, for example, xml:id), though it could be. I think what you want to do is populate the id attribute for each component with a unique xml:id. This way all your components have a machine-readable identifier while any actual unique identifiers that describe the component are preserved.

Ethan, thank you for the feedback. Yes, I have come to learn that unitid is not necessarily intended to be machine-readable, but I intend to make it so in my locally cached versions of the EAD. This is because of Archon. For better or for worse, it is possible to create a URL from the unitid that points directly into Archon at the corresponding sub-section of the EAD file. Ultimately this solves the problem of formatting the EAD for display as well as navigating it. An xml:id would not do the same trick because Archon would not know how to handle it.

--
Eric Morgan
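Eric's unitid-to-URL idea can be sketched in a few lines of Ruby. The base URL and the query-parameter names below are assumptions for illustration -- check your own Archon installation for the real parameter layout.

```ruby
require "erb"

# Build a deep link into an Archon-hosted finding aid from a unitid.
# ARCHON_BASE and the parameter names are hypothetical.
ARCHON_BASE = "http://archives.example.edu/index.php"

def archon_url(collection_id, unitid)
  params = {
    "p"  => "collections/findingaid",
    "id" => collection_id,
    "q"  => unitid, # jump to the sub-section named by the unitid
  }
  query = params.map { |k, v| "#{k}=#{ERB::Util.url_encode(v.to_s)}" }.join("&")
  "#{ARCHON_BASE}?#{query}"
end
```

This is the sense in which a machine-readable unitid pays off: the same value that labels the component in the cached EAD also addresses it in the delivery system.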
Re: [CODE4LIB] EAD in Blacklight (was: Re: [CODE4LIB] Batch loading in fedora)
I'd like to share an alternative approach that we're pursuing here at UVa. It doesn't speak directly to operations on finding aids by themselves, with no attention to representing on-line the collection so described, but rather to those situations where you attempt a full digital surrogate for a collection, using repository machinery. I hope, though, that it might be useful to hear about. We started from a few principles. (All of them have exceptions, of course. {grin})

1) EAD is a wonderful markup language, but not always an optimal metadata standard.
2) XML is for serializing, not for storage.
3) Solr is a fantastic indexing tool, but it's neither a datastore nor a database.
4) Collections do not have an absolutely correct structure. Archivists and scholars sometimes disagree.
5) The best ways to describe an individual entity are not necessarily the best ways to describe the relationships between entities.

We assemble digital surrogates for archival collections as assemblages of Fedora objects linked together by RDF. When we start with a finding aid, we disassemble the EAD to develop a graph of documents, containers, series, etc. in Fedora, with RDF predicates along the lines of isConstituentOf, hasCollectionMember, etc. When we haven't got a finding aid, we build up the graph from annotations on the physical objects (boxes, folders, etc.) as they are processed for scanning. Obviously, we get a much simpler graph that way, because no claims have been made by archivists about the structure of the collection.

Descriptive and other metadata are stored with each object in MODS and other good -metadata- formats. A document object has metadata that pertain only to the document (along with any data that permits us to represent the document on-line, e.g. a scanned image or TEI text), a folder object has metadata for that folder, etc. Since we want to offer EAD for a collection (or any piece thereof), we supply a Fedora behavior (dissemination) against any object, which assembles a collection structure as seen from that object (by following the RDF graph), then recursively assembles the appropriate metadata and transforms it to produce EAD.

We like this approach because it offers a great deal of extensibility (we could imagine using more sophisticated RDF to account for different opinions about a collection, or offering a METS or other structured view as well) and it keeps the repository contents idiomatic. We haven't yet entirely figured out how we'll bring this kind of content to Blacklight, but we'll be aided by the fact that we have appropriately-attached metadata for anything that should appear as a record in our indexes.

We're bringing the first part of this scheme (the assembly of object graphs) to production in the next fortnight or so. We've got the code ready and tested and are now enjoying the really fun stuff -- moving servers around and tinkering with clustering and the like. The second part (producing EAD live) is waiting on some work from our cataloging dep't, who have assigned some staff to polish up the mappings involved. We have very simple mappings in place now, but not ones good enough to publish publicly. They're working away, and we hope to see something in production later this fall.

As for discoverability, we'll start simply by indexing all these objects into our local Blacklight instance. There's no need to consider how to index highly-structured XML because we're not storing it. We can then move on to providing special views for records with awareness of the relationships that Fedora has recorded on those objects, and tools for discovering, visualizing, and following them. Unfortunately, our one Blacklight developer has plenty on her plate already, so I don't know how quickly we'll be able to look at that.
In the meanwhile, we can simply style the dynamically-constructed EAD as part of a Blacklight view for a given record, which isn't particularly exciting, but is useful.

--- A. Soroka
Digital Research and Scholarship R&D, the University of Virginia Library
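The UVa scheme above -- Fedora objects linked by RDF, each carrying its own metadata, with a dissemination that assembles a collection view by walking the graph -- can be sketched with a toy in-memory model. All names here (the predicate, the pids, the metadata) are illustrative, not UVa's actual vocabulary or code.

```ruby
# Toy triple store: [subject, predicate, object], after the thread's
# hasCollectionMember example. A real system would query Fedora's
# resource index instead.
TRIPLES = [
  ["uva:coll-1",   "hasCollectionMember", "uva:series-1"],
  ["uva:series-1", "hasCollectionMember", "uva:folder-1"],
  ["uva:folder-1", "hasCollectionMember", "uva:doc-1"],
]

# Per-object descriptive metadata, standing in for MODS datastreams.
METADATA = {
  "uva:coll-1"   => { title: "Papers of an Example Family" },
  "uva:series-1" => { title: "Correspondence" },
  "uva:folder-1" => { title: "Letters, 1850-1855" },
  "uva:doc-1"    => { title: "Letter to J. Smith" },
}

# Assemble the collection structure as seen from one object by
# recursively following the graph -- the shape a disseminator would
# then transform into EAD (or METS, or anything else).
def assemble(pid)
  children = TRIPLES.select { |s, p, _| s == pid && p == "hasCollectionMember" }
                    .map { |_, _, o| assemble(o) }
  { pid: pid, metadata: METADATA[pid], children: children }
end
```

Calling `assemble("uva:folder-1")` rather than the collection pid yields the same kind of tree rooted at the folder, which is the "EAD for any piece thereof" property the post describes.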
Re: [CODE4LIB] EAD in Blacklight (was: Re: [CODE4LIB] Batch loading in fedora)
On Fri, Aug 6, 2010 at 2:17 PM, Bess Sadler bess.sad...@gmail.com wrote:

+1. Potential options could include using an XML database like eXist, or using our approach at Yale (where EAD finding aids are stored as datastreams in Fedora objects). I've been eager to look at rethinking our approach, especially given the availability of the Hydra codebase.

Absolutely. Also, this is one example I can think of where Fedora disseminators make perfect sense. Fedora can serve as your repository, and then each guide can be accessed as http://your.repository.edu/fedora/get/YOUR_EAD_IDENTIFIER and each section can be grabbed via http://your.repository.edu/fedora/get/YOUR_EAD_IDENTIFIER/bioghist (or whatever naming scheme makes sense to those with stronger opinions about EAD than I have).

That's interesting, although I'm not sure how those disseminators would operate. Are you predicting some sort of XSL transform that just extracts that element?

What I'd love to see is each item represented and described independently in the repository, with a full XML serialization of the EAD constructed on the fly, bringing in at serialization time any objects that belong in a given section of the document. Institutionally, the biggest problem with EAD is version control and workflow for keeping the documents up to date. I think splitting things up into separate objects and only constructing the full EAD document as needed is a good potential solution to this.

In theory this is a good idea, but Manuscripts and Archives at Yale (along with a sizable number of other institutions) uses a database-backed collection management system like Archivists' Toolkit or Archon to create archival description. The resulting EAD is then an export from these systems. For us, we're not doing any major massaging of the data or contents other than an XSL transform to make it conform to a University-wide best practice schema.
I'd be interested to see how this would work, though -- I think it's a potentially great implementation.

On Fri, Aug 6, 2010 at 2:23 PM, Adam Wead aw...@rockhall.org wrote:

Mark, How are you creating the EAD docs in Fedora? At present, we're using Archivists' Toolkit to dump out EAD XML files and then I index them in Solr, with Blacklight displaying the entire document as well. It's messy and it would be nice to make a more efficient connection between the three (BL, Fedora and Solr). I'd love to show everyone what I have, but they keep us on a private network here.

Adam - we're doing the same, essentially, except Fedora serves as the data store for the exported EAD. I've been considering potential implementations that would allow uploads of EAD datastreams using a Hydra-based web application, and then using Solrizer to manipulate the EAD into an update document or set thereof. I just haven't had time to commit to looking into this.

Mark A. Matienzo
Digital Archivist, Manuscripts and Archives
Yale University Library
Re: [CODE4LIB] EAD in Blacklight (was: Re: [CODE4LIB] Batch loading in fedora)
Jason, Thanks for the info. Nokogiri is alright, but I've found that, as far as XML processing goes, Saxon is above and beyond the best. Is it possible to fire off a Java call from Ruby to have Saxon handle it, or not? Are you using Nokogiri to call an XSLT process or using Ruby to generate the view? I've also heard about scalability issues with Solr and large XML documents, but I've never seen benchmarks.

Ethan

On Mon, Aug 2, 2010 at 8:46 AM, Jason Ronallo jrona...@gmail.com wrote:

Ethan, The plugin I wrote for Blacklight is just a start and was a proof of concept/template. Having said that, this is basically code I extracted from another application I have in production. In that case it wasn't necessary to display every detail in the EAD, so it is really just a short view. The plugin does some very basic indexing of the EAD to conform to the default Blacklight Solr schema. It could certainly be expanded to get better faceting and fielded search in a customized Blacklight. Lots of possibilities for expansion.

The indexing also takes the simple approach of one EAD XML document being one Solr document. Other folks have played around with splitting an EAD doc into different Solr documents, but I haven't been satisfied with either the display of the search results or the show views, which have seemed too fragmented to me. The display in the plugin is one page for the whole finding aid. The display is concise, but that's not the biggest problem with it. The EAD XML is stored as a Solr field. I've heard conflicting information about this, but it may be slow to retrieve large fields from Solr. (Anyone want to put that idea to rest?) The biggest problem with this implementation, though, is that the XML parsing is done using the Nokogiri DOM parser. Nokogiri is fast enough, but loading the whole DOM into memory and looping through a long container list can still take a very long time. I've worked around that with partial caching in my applications.
If you want to see it in action, it is very easy to set up if you already have Ruby installed. Just run one template command to build the Rails app and then answer yes to all the questions. Remember to start jetty before trying to index. http://github.com/jronallo/blacklight_ext_ead_simple

I have been fooling around with creating a new library that uses Nokogiri's SAX parser. This makes parsing on the fly much faster. I'm also attempting to deal with more of the content found in a basic Archivists' Toolkit EAD XML doc. The problem with the SAX parsing is that you have to deal with all the craziness of EAD as it is streaming at you. I have something basically working, if messy, which I hope to have up on github soon. Please let me know if you have any other questions about this.

Jason

On Fri, Jul 30, 2010 at 11:17 AM, Ethan Gruber ewg4x...@gmail.com wrote:

By "displays it", do you mean there is a view for displaying some metadata about the EAD guide in the Blacklight search results, or that the entire guide is rendered out in Blacklight somehow? Hopefully Jason is on the list. I'm curious about this. Thanks, Ethan

On Fri, Jul 30, 2010 at 11:06 AM, Adam Wead aw...@rockhall.org wrote:

Takes an EAD doc, indexes it into Solr, and displays it via Blacklight. I think Jason's on this list, so he could tell you more about it. I took it and modified the display a bit. It's available via git: http://github.com/jronallo/blacklight_ext_ead_simple

-----Original Message-----
From: Code for Libraries on behalf of Ethan Gruber
Sent: Fri 7/30/2010 10:06 AM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] Batch loading in fedora

What does the EAD plugin do? I haven't heard much about it. Ethan

On Fri, Jul 30, 2010 at 10:03 AM, Adam Wead aw...@rockhall.org wrote:

Hardy, Here's the task: http://github.com/awead/rocklight/blob/master/lib/tasks/fedora.rake I just threw the project up on git, so there's not much explanation of anything. It's very much a work-in-progress.
It uses Blacklight, an EAD plugin that Jason Ronallo wrote, and a bunch of active-fedora/hydrangea code. The image ingest process is designed to attach an image pid to an existing pid in Fedora that represents the archival collection. I've only been testing this, so right now it ingests some jpg files and uses ImageMagick to resize them into a thumbnail and an access version. In real life the preservation stream would be a tiff, and the thumbnail and access version would be jpegs. I also threw in a JHOVE datastream for fun, but I'm not doing anything with it at this point other than storing it. The three descriptive metadata streams are from the active-fedora model. Ideally, we'd use a MODS schema for all the descriptive data instead of these three different ones, but that'll be the next step. Let me know if you have comments or questions.
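Jason's point about streaming rather than DOM-loading EAD can be illustrated with Ruby's stdlib REXML stream parser (Jason uses Nokogiri's SAX parser, but the event-driven shape is the same): handle start/end/text events as they arrive instead of holding the whole container list in memory. Element names are from EAD 2002; the handling is deliberately simplistic, since real EAD nesting is much messier.

```ruby
require "rexml/parsers/streamparser"
require "rexml/streamlistener"

# A minimal streaming pass over EAD: count components and collect
# unittitles without building a DOM.
class EadListener
  include REXML::StreamListener

  attr_reader :component_count, :titles

  def initialize
    @component_count = 0
    @titles = []
    @in_unittitle = false
  end

  def tag_start(name, _attrs)
    # c, c01 .. c12 are the EAD component elements
    @component_count += 1 if name =~ /\Ac(0\d|1[012])?\z/
    @in_unittitle = true if name == "unittitle"
  end

  def tag_end(name)
    @in_unittitle = false if name == "unittitle"
  end

  def text(value)
    @titles << value.strip if @in_unittitle && !value.strip.empty?
  end
end

ead = <<~XML
  <ead><archdesc><dsc>
    <c01><did><unittitle>Series I</unittitle></did>
      <c02><did><unittitle>Folder 1</unittitle></did></c02>
    </c01>
  </dsc></archdesc></ead>
XML

listener = EadListener.new
REXML::Parsers::StreamParser.new(ead, listener).parse
```

The trade-off Jason names is visible even here: the listener has to track its own state (`@in_unittitle`) because, unlike a DOM, the stream gives you no context for free.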
Re: [CODE4LIB] EAD in Blacklight (was: Re: [CODE4LIB] Batch loading in fedora)
Hi, Ethan. You can see another example of Blacklight being used to search and display EAD guides at http://nwda.projectblacklight.org/?f%5Bformat_facet%5D%5B%5D=Archival+Collection+Guide

I've used Solr and/or Lucene for EAD documents a few times, and here are some observations:

I've also heard about scalability issues with Solr and large XML documents, but I've never seen benchmarks.

Solr is incredibly scalable, so describing this as a Solr scalability issue isn't really accurate. What might be more accurate would be to say that Solr is designed for searching, while most people looking for an EAD solution are trying to get it to do a lot more than that. The problem is that you want to be able to discover and view an EAD guide at several levels, right? You want to be able to discover at the collection level, and at the item level, and presumably at the level of some section of the EAD document (e.g., biographical history or whatever). Solr and Lucene really just know how to tell you whether a given document in the index matches a query you've entered, though, so if you want to be able to discover on each of those levels, you have to index your document once to represent the collection, then again for each section you want to be independently discoverable, then again for each item you want to be discoverable. Creating a UI that represents a single EAD, which has now been transformed into potentially hundreds or thousands of independently discoverable items and EAD sections, is quite challenging. I liked what Matt Mitchell and I did for the Northwest Digital Archives, but I'm always interested in other ways one might approach this.
We indexed each EAD guide into separate Lucene documents for each EAD section, then collapsed them under the main EAD title in the search results, so that when you search for an archival collection you only see the EAD guide represented once, but each section of it is still independently viewable and bookmarkable. Here is the guide for the Bing Crosby Historical Society in a search result: http://nwda.projectblacklight.org/catalog?q=crosby&qt=search&per_page=10&f%5Bformat_facet%5D%5B%5D=Archival+Collection+Guide&commit=search But in order to look at the guide, you have to look at a specific part of it: http://nwda.projectblacklight.org/catalog/bcc_1-summary Additionally, we treated each item as a first-class, independently discoverable object, but still linked them all to the section of the EAD document where they came from: http://nwda.projectblacklight.org/catalog/bcc_1-v

Matt and I were thinking it would be nice to allow Blacklight to handle all of the display of the EAD too, which is why we stored a lot of EAD markup in the Solr document, and that can potentially have scalability problems, because Lucene is not a database but we were treating it like one. This works, but it's a bit of a hack.

Bess
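The indexing shape Bess describes -- one Solr document per EAD section, with a shared field carrying the parent finding aid's identity so results can be grouped back under one entry -- might be sketched like this. The field names are made up for illustration; the NWDA schema surely differs.

```ruby
# Split one finding aid into per-section Solr documents. The shared
# collection_id field is what the field-collapsing patch would group on;
# the id values mirror the thread's bcc_1-summary style of section IDs.
def section_docs(ead_id, collection_title, sections)
  sections.map do |section|
    {
      id:               "#{ead_id}-#{section[:slug]}",
      collection_id:    ead_id, # field to collapse/group on
      collection_title: collection_title,
      section_title:    section[:title],
      text:             section[:text],
      format:           "Archival Collection Guide",
    }
  end
end

docs = section_docs("bcc_1", "Bing Crosby Historical Society",
  [{ slug: "summary",  title: "Summary",           text: "Collection summary" },
   { slug: "bioghist", title: "Biographical Note", text: "Biographical note" }])
```

Each hash would be posted to Solr as its own document; at query time, collapsing on `collection_id` shows the guide once while every section stays individually addressable.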
Re: [CODE4LIB] EAD in Blacklight (was: Re: [CODE4LIB] Batch loading in fedora)
On Aug 6, 2010, at 9:10 AM, Jonathan Rochkind wrote:

We indexed each EAD guide into separate lucene documents for each EAD section, then collapsed them under the main EAD title in the search results,

Curious how you implemented that: did you use the Solr field collapsing patch that's not yet part of a standard distro?

Yes, exactly.

Matt and I were thinking it would be nice to allow blacklight to handle all of the display of the EAD too, which is why we stored a lot of EAD markup in the solr document, and that can potentially have scalability problems, because lucene is not a database but we were treating it like one. This works, but it's a bit of a hack.

You can definitely have Blacklight handle the display while still keeping the EAD out of Solr stored fields. There's no reason Blacklight can't fetch the EAD from some external store, keyed by Solr document ID (or by some other value in a Solr document stored field). That's my current thinking (informed by y'all's experience) of how I'm going to handle future large-object stuff in BL, if/when I get around to developing it.

Yeah, that's a good point. We were trying to self-contain the whole thing for ease of deployment, but I'm not sure that's a good approach. It's better if your EAD is in a real repository and Blacklight just presents it.

Bess
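Jonathan's suggestion -- keep only an identifier in the Solr stored fields and have Blacklight fetch the full EAD from an external store at display time -- is simple to sketch. The store URL pattern below is hypothetical.

```ruby
require "net/http"
require "uri"

# Hypothetical external EAD store, keyed by Solr document ID.
EAD_STORE = "http://repository.example.edu/ead"

def ead_url_for(solr_doc)
  URI("#{EAD_STORE}/#{solr_doc["id"]}.xml")
end

# The actual fetch (not executed here) would just be a GET at view time:
def fetch_ead(solr_doc)
  Net::HTTP.get(ead_url_for(solr_doc))
end
```

The Solr index stays small (no giant stored XML field), and swapping the store (eXist, Fedora, a plain web server) only changes the URL-building method.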
Re: [CODE4LIB] EAD in Blacklight (was: Re: [CODE4LIB] Batch loading in fedora)
On Fri, Aug 6, 2010 at 1:09 PM, Bess Sadler bess.sad...@gmail.com wrote:

On Aug 6, 2010, at 9:10 AM, Jonathan Rochkind wrote:

We indexed each EAD guide into separate lucene documents for each EAD section, then collapsed them under the main EAD title in the search results,

Curious how you implemented that: did you use the Solr field collapsing patch that's not yet part of a standard distro?

Yes, exactly.

Bess - would you be willing to share code or brief notes about how to set this up?

Yeah, that's a good point. We were trying to self-contain the whole thing for ease of deployment, but I'm not sure that's a good approach. It's better if your EAD is in a real repository and Blacklight just presents it.

+1. Potential options could include using an XML database like eXist, or using our approach at Yale (where EAD finding aids are stored as datastreams in Fedora objects). I've been eager to look at rethinking our approach, especially given the availability of the Hydra codebase.

Mark A. Matienzo
Digital Archivist, Manuscripts and Archives
Yale University Library
Re: [CODE4LIB] EAD in Blacklight (was: Re: [CODE4LIB] Batch loading in fedora)
I also think it's better to store EAD in a separate system rather than in the Solr index; that way you can use Blacklight to serialize it, or store a reference to a separate delivery system. Bess's and Matt's approach of storing the whole collection (EAD file) as a Solr document in addition to making each item accessible in Blacklight is a good one. Hopefully this will encourage institutions to use EAD to its fullest and encode items at a very detailed level. I'd say at least 95% of the finding aids I have seen have little more than title, date, and container for items, if items are enumerated at all. The thing about most EAD delivery systems is that they assume you wish to use EAD in its traditional form, as an electronic finding aid that you wish to render in full on the screen. Most systems don't accommodate an emphasis on item-level encoding, display, and findability.

Ethan

On Fri, Aug 6, 2010 at 1:53 PM, Mark A. Matienzo m...@matienzo.org wrote:

Bess - would you be willing to share code or brief notes about how to set this up?

It's better if your EAD is in a real repository and Blacklight just presents it.

+1. Potential options could include using an XML database like eXist, or using our approach at Yale (where EAD finding aids are stored as datastreams in Fedora objects). I've been eager to look at rethinking our approach, especially given the availability of the Hydra codebase.

Mark A. Matienzo
Digital Archivist, Manuscripts and Archives
Yale University Library
Re: [CODE4LIB] EAD in Blacklight (was: Re: [CODE4LIB] Batch loading in fedora)
On Aug 6, 2010, at 10:53 AM, Mark A. Matienzo wrote:

Bess - would you be willing to share code or brief notes about how to set this up?

Gladly. I will write it as a separate message, though, for ease of future reference.

+1. Potential options could include using an XML database like eXist, or using our approach at Yale (where EAD finding aids are stored as datastreams in Fedora objects). I've been eager to look at rethinking our approach, especially given the availability of the Hydra codebase.

Absolutely. Also, this is one example I can think of where Fedora disseminators make perfect sense. Fedora can serve as your repository, and then each guide can be accessed as http://your.repository.edu/fedora/get/YOUR_EAD_IDENTIFIER and each section can be grabbed via http://your.repository.edu/fedora/get/YOUR_EAD_IDENTIFIER/bioghist (or whatever naming scheme makes sense to those with stronger opinions about EAD than I have).

What I'd love to see is each item represented and described independently in the repository, with a full XML serialization of the EAD constructed on the fly, bringing in at serialization time any objects that belong in a given section of the document. Institutionally, the biggest problem with EAD is version control and workflow for keeping the documents up to date. I think splitting things up into separate objects and only constructing the full EAD document as needed is a good potential solution to this.

Bess
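The dissemination URLs Bess sketches can be reduced to a one-method helper. The /fedora/get path follows the thread's own examples; the section names are whatever naming scheme your disseminators define.

```ruby
# Build Fedora dissemination URLs for a whole guide or one section,
# matching the URL pattern quoted in the thread.
FEDORA_BASE = "http://your.repository.edu/fedora/get"

def guide_url(ead_id, section = nil)
  section ? "#{FEDORA_BASE}/#{ead_id}/#{section}" : "#{FEDORA_BASE}/#{ead_id}"
end
```

A front end like Blacklight can then link straight to `guide_url(id, "bioghist")` without knowing anything about how the disseminator assembles that section behind the scenes.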
Re: [CODE4LIB] EAD in Blacklight (was: Re: [CODE4LIB] Batch loading in fedora)
Mark, How are you creating the EAD docs in Fedora? At present, we're using Archivists' Toolkit to dump out EAD XML files and then I index them in Solr, with Blacklight displaying the entire document as well. It's messy and it would be nice to make a more efficient connection between the three (BL, Fedora and Solr). I'd love to show everyone what I have, but they keep us on a private network here.

...adam

-----Original Message-----
From: Code for Libraries on behalf of Mark A. Matienzo
Sent: Fri 8/6/2010 1:53 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] EAD in Blacklight (was: Re: [CODE4LIB] Batch loading in fedora)

+1. Potential options could include using an XML database like eXist, or using our approach at Yale (where EAD finding aids are stored as datastreams in Fedora objects). I've been eager to look at rethinking our approach, especially given the availability of the Hydra codebase.

Mark A. Matienzo
Digital Archivist, Manuscripts and Archives
Yale University Library

Join us on Friday, September 3, at the 15th Anniversary Celebration at the Rock and Roll Hall of Fame and Museum: http://rockhall.com/event/rock-hall-ball/

Rock & Roll: (noun) African American slang dating back to the early 20th century. In the early 1950s, the term came to be used to describe a new form of music, steeped in the blues, rhythm & blues, country and gospel. Today, it refers to a wide variety of popular music -- frequently music with an edge and attitude, music with a good beat and -- often -- loud guitars. © 2005 Rock and Roll Hall of Fame and Museum. This communication is a confidential and proprietary business communication. It is intended solely for the use of the designated recipient(s). If this communication is received in error, please contact the sender and delete this communication.
Re: [CODE4LIB] EAD in Blacklight (was: Re: [CODE4LIB] Batch loading in fedora)
Hi Adam, I posted an update last Friday on a project I have been working on since last fall called EADitor, an XForms application for creating, managing, and publishing EAD collections. I'm using an eXist datastore, but one could adapt the XForms application to load and save data to and from another REST service, like Fedora. I also have XForms submissions that transform the EAD document into a Solr doc and post it to the index (as well as delete it). That could be adapted to post to a Blacklight index instead of the one I packaged internally as part of the application's own publication mechanism. Here's a link to the code page: http://code.google.com/p/eaditor/ I'm presenting it at the EAD roundtable at SAA next week.

Ethan

On Fri, Aug 6, 2010 at 2:23 PM, Adam Wead aw...@rockhall.org wrote:

Mark, How are you creating the EAD docs in Fedora? At present, we're using Archivists' Toolkit to dump out EAD XML files and then I index them in Solr, with Blacklight displaying the entire document as well. It's messy and it would be nice to make a more efficient connection between the three (BL, Fedora and Solr). I'd love to show everyone what I have, but they keep us on a private network here.

...adam

-----Original Message-----
From: Code for Libraries on behalf of Mark A. Matienzo
Sent: Fri 8/6/2010 1:53 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] EAD in Blacklight (was: Re: [CODE4LIB] Batch loading in fedora)

+1. Potential options could include using an XML database like eXist, or using our approach at Yale (where EAD finding aids are stored as datastreams in Fedora objects). I've been eager to look at rethinking our approach, especially given the availability of the Hydra codebase.

Mark A. Matienzo
Digital Archivist, Manuscripts and Archives
Yale University Library
Re: [CODE4LIB] EAD in Blacklight (was: Re: [CODE4LIB] Batch loading in fedora)
On Fri, Aug 6, 2010 at 12:10 PM, Jonathan Rochkind rochk...@jhu.edu wrote:

You can definitely have Blacklight handle the display while still keeping the EAD out of solr stored fields. There's no reason Blacklight can't fetch the EAD from some external store, keyed by Solr document ID (or by some other value in a solr document stored field). That's my current thinking (informed by y'alls experience) of how I'm going to handle future large object stuff in BL, if/when I get around to developing it.

I've wanted to try an approach similar to how paperclip [1] works, where a filesystem storage location is chosen that survives between deployments when using something like Capistrano. Using the filesystem would allow the EAD XML to be easily served up directly from the public directory.

Jason

[1] http://github.com/thoughtbot/paperclip
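Jason's paperclip-inspired idea could be sketched as a path-interpolation helper: derive a public filesystem path for each EAD from a pattern, so the XML can be served as a static file and survive Capistrano deploys (by living under a shared, symlinked directory). The pattern and its keys below are assumptions, loosely modeled on paperclip's interpolation style rather than its actual API.

```ruby
# Hypothetical interpolation pattern; ":collection" and ":id" are
# placeholders filled in per document. "shared" is the Capistrano
# directory that persists across deploys.
PATTERN = "/var/www/app/shared/public/ead/:collection/:id.xml"

def ead_path(collection, id)
  PATTERN.sub(":collection", collection).sub(":id", id)
end
```

With the shared directory symlinked into each release's public/, the web server can deliver the EAD XML directly, with no Rails request at all.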
Re: [CODE4LIB] EAD in Blacklight (was: Re: [CODE4LIB] Batch loading in fedora)
On Fri, Aug 6, 2010 at 2:17 PM, Bess Sadler bess.sad...@gmail.com wrote:

Bess - would you be willing to share code or brief notes about how to set this up?

Gladly. I will write it as a separate message, though, for ease of future reference.

Bess, is this something you could put on the Blacklight project website or on the code4lib wiki?

thanks, ranti.
--
Bulk mail. Postage paid.
Re: [CODE4LIB] EAD in Blacklight (was: Re: [CODE4LIB] Batch loading in fedora)
I wonder how the field collapsing patch holds up on an index that contains 3 million documents, probably larger than your EAD-only one, but thinking about combining EAD in an index with many many other documents (like with a library catalog). Might be fine, might not. (Even without field collapsing, my solr index is really straining against the numerous facets I'm making it calculate and the dismax queries involving a dozen or more fields -- I plan to reduce my fields, reduce my facets if possible, and most importantly give my Solr a LOT more RAM than it has now. Complex queries with complex facetting on a several-million-doc index requires giving Solr a LOT more RAM for caches etc than we initially expected, I throw this in as a note to anyone else in the planning stages). I've been brainstorming other weird ways to do this. This one is totally wacky and possibly a bad idea, but I'll throw it out there anyway. What if you only indexed the entire EAD as one document, BUT threw the entire EAD in a stored field, and used solr highlightning on that field. NOT to show the highlighter results to the user, but to sort of trick the highlighter, using hl.fragmenter/fragmentsBuilder (possibly with a custom component in a jar) to telling you _which_ sub-sections of the EAD matched, and your software could then display the matching sub-sections (possibly with direct links to display) in the search results, under the actual document hit. Possibly a really screwy idea, just throwing it out there. Solr highlightning can be a performance problem on very large stored documents too, not sure if typical EAD is 'very large' for these purposes, or if it's something that can be solved by throwing enough RAM at caches. 
But I guess something about the field collapsing patch makes me nervous: comments about its performance being uncertain on very large result sets, or just nervousness about applying a patch to Solr and counting on someone else to keep it working against Solr master as it develops. Jonathan
Re: [CODE4LIB] EAD in Blacklight (was: Re: [CODE4LIB] Batch loading in fedora)
On Aug 6, 2010, at 8:07 PM, Jonathan Rochkind wrote: I've been brainstorming other weird ways to do this. This one is totally wacky and possibly a bad idea, but I'll throw it out there anyway. What if you indexed the entire EAD as just one document, BUT threw the entire EAD in a stored field, and used Solr highlighting on that field? NOT to show the highlighter results to the user, but to sort of trick the highlighter, using hl.fragmenter/fragmentsBuilder (possibly with a custom component in a jar), into telling you _which_ sub-sections of the EAD matched; your software could then display the matching sub-sections (possibly with direct links) in the search results, under the actual document hit. Hi, Jonathan. I don't think this is a crazy idea, and in fact it is one of the approaches that Matt M. and I tried during our NWDA project. However, we found that it wasn't scalable. The highlighter was way too slow with the number of documents and fragments we were throwing at it. It wasn't even a huge number of documents, so we abandoned that idea. However, it's still a really elegant solution, if only it were performant. Let me know if you decide to give it a try. Bess
Re: [CODE4LIB] EAD in Blacklight (was: Re: [CODE4LIB] Batch loading in fedora)
Huh, since the highlighter only needs to run on the documents in the actual returned page of the result set (10-50?), I wouldn't think total number of documents would matter much (I certainly could be wrong), but total size of each document's stored field definitely has a known performance impact on the highlighter. Maybe some time I'll have the time, or a local requirement, to investigate; I wonder if there'd be a way to write a custom highlighting component, optimized for the EAD use case or for the general case of identifying matching section(s) in XML, that would do better. I'm less nervous about custom components that do not require patches to Solr than I am about patches to Solr core that are not (yet?) included in a tagged Solr release or trunk. With some of the stuff I'm working with, RAM seems to have sometimes unexpected impacts on performance too. From thinking about what it does, and from looking at my cache hit/miss/eviction statistics, I didn't really have reason to think that lack of RAM was what was slowing down my StatsComponent use, but adding RAM seems to help a lot. I need a hardware upgrade to be able to add enough RAM and avoid swap, to be sure that what I think I'm seeing about RAM effects on performance is real, but I think so. I wonder if throwing monster amounts of RAM at Solr and increasing certain relevant caches a lot would have an impact on highlighter performance.
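The client-side half of "identify matching section(s) in XML" could be sketched along these lines: find the highlight fragment's text in the stored EAD and report the nearest enclosing component id. Everything here (field content, id attributes on components, the regex heuristics) is a simplifying assumption, not a production parser.

```python
import re

# Hypothetical sketch: map a Solr highlight fragment back to the EAD
# component it came from. Assumes components carry id attributes and
# the fragment text appears verbatim in the stored EAD source.

def component_for_fragment(ead_xml, fragment):
    """Return the id of the nearest <cNN id="..."> opened before the
    fragment's position (a rough stand-in for the enclosing component;
    it ignores components that were already closed, good enough for a sketch)."""
    plain = re.sub(r"</?em>", "", fragment)  # strip highlighter markers
    pos = ead_xml.find(plain)
    if pos == -1:
        return None
    opens = [(m.start(), m.group(1))
             for m in re.finditer(r'<c\d{2}[^>]*\bid="([^"]+)"', ead_xml)
             if m.start() < pos]
    return opens[-1][1] if opens else None
```

With the component id in hand, the display layer could link each snippet straight to the matching sub-section of the finding aid.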
I've thought about using the highlighter in that way on MARC documents to provide matching snippets, a la Google, in the hits page. The fact that MARC documents aren't 'full text', but are lists of structured (well, you know, they try :) ) fields, means that you can't just use the highlighter out of the box and get a reasonable snippet to show the user. But if you could use it to identify which _fields_ matched the query, and then throw each matching field (or the first N) through a display mapper that labels it and formats it appropriately (my as-yet not publicly released MARC mapping Ruby framework could handle that nicely), that could provide a nice hit snippet perhaps. A large MARC document is probably still smaller than a typical EAD document, so this might have a greater chance of success.
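The MARC snippet idea could be sketched like this: Solr's highlighting response already reports fragments per field, so a display mapper only needs to label each matching field. The field names and labels below are invented for illustration, not from any real schema.

```python
# Sketch of a display mapper for the MARC idea: take Solr's per-field
# highlight fragments for one document and turn them into labeled
# snippets. Field names and labels are hypothetical.

FIELD_LABELS = {"title_t": "Title", "author_t": "Author", "subject_t": "Subject"}

def matched_snippets(highlighting_for_doc, limit=3):
    """Turn {field: [fragments]} from Solr highlighting into labeled
    display snippets, showing at most `limit` of them."""
    snippets = []
    for field, fragments in highlighting_for_doc.items():
        label = FIELD_LABELS.get(field, field)  # fall back to the raw field name
        for frag in fragments:
            snippets.append(f"{label}: {frag}")
    return snippets[:limit]
```

Since MARC fields are short, the stored-field size problem that hurt the EAD experiment should bite much less here.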