[CODE4LIB] Indexing MARC(-JSON) with MongoDB?
There's been some talk in code4lib about using MongoDB to store MARC records in some kind of JSON format. I'd like to know if you have experimented with indexing those documents in MongoDB. From my limited exposure to MongoDB, it seems difficult, unless MongoDB supports some kind of custom indexing functionality.

According to the MongoDB docs [1], you can create an index by calling the ensureIndex() function and providing a document that specifies one or more keys to index. For example:

    db.things.ensureIndex({city: 1})
    db.things.ensureIndex({"address.city": 1})

That is, you specify the keys by giving a path from the root of the document to the data element you are interested in. Such a path acts both as the index's name and as a specification of how to get the keys' values.

In the case of the two proposed MARC-JSON formats [2, 3], I can't see such a path. For example, say you want an index on field 001. Simplifying, the JSON docs would look like this:

    { fields : [ ["001", "001 value"], ... ] }

or this:

    { controlfield : [ { tag : "001", data : "fst01312614" }, ... ] }

How would you specify field 001 to MongoDB?

It would be nice to have some kind of custom indexing, where one could provide an index name and, separately, a JavaScript function specifying how to obtain the keys' values for that index.

Any suggestions? Do other document-oriented databases offer a better solution for this?

BTW, I fed MongoDB the example MARC records in [2] and [3], and it choked on them. Both are missing some commas :-)

[1] http://www.mongodb.org/display/DOCS/Indexes
[2] http://robotlibrarian.billdueber.com/new-interest-in-marc-hash-json/
[3] http://worldcat.org/devnet/wiki/MARC-JSON_Draft_2010-03-11

--
Fernando Gómez
Biblioteca Antonio Monteiro
INMABB (Conicet / Universidad Nacional del Sur)
Av. Alem 1253
B8000CPB Bahía Blanca, Argentina
Tel. +54 (291) 459 5116
http://inmabb.criba.edu.ar/
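To make the question concrete, here is a sketch of the kind of key-extraction functions such custom indexing would need, written against the two simplified record shapes above. The function names are mine, and MongoDB's ensureIndex() does not actually accept functions like these; this only illustrates what the hypothetical "JavaScript function per index" would look like.

```javascript
// Hypothetical key-extraction functions for the two simplified
// MARC-JSON shapes above. Names are mine, not part of any spec
// or of MongoDB's API.

// Shape [2]: { fields: [ ["001", "001 value"], ... ] }
function extract001FromMarcHash(doc) {
  return doc.fields
    .filter(function (f) { return f[0] === "001"; })
    .map(function (f) { return f[1]; });
}

// Shape [3]: { controlfield: [ { tag: "001", data: "..." }, ... ] }
function extract001FromMarcJson(doc) {
  return doc.controlfield
    .filter(function (f) { return f.tag === "001"; })
    .map(function (f) { return f.data; });
}
```

Either function walks the record and returns the values that an index on field 001 would need as its keys; the point is that a fixed root-to-leaf path can't express this for the array-of-pairs shape.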
Re: [CODE4LIB] audio transcription software
Eric,

I tried Docsoft:AV (http://www.docsoft.com/Products/AV/), a server-based solution, about a year ago to see whether we could use it to automatically transcribe and timestamp our oral history recordings. It might work nicely if you had multiple recordings with the same speakers, where it would be feasible to train the software by setting up speaker profiles for each individual speaker's voice. The software can output the results in a variety of formats, and it handles both audio and video recordings.

However, we only had one recording per interviewee (intermixed with the interviewer) and thus would have had to spend way more time and money on training the software (and cleaning up the results, which were hardly comprehensible) than if we had an actual person listen to the recordings and transcribe them. To be fair to Docsoft, some speakers had strong accents and the audio quality was not ideal, but that's what we needed it for. So it did not seem to be a feasible solution for this particular problem, and we stuck with a wetware-based approach.

Markus

Markus Wust
Digital Collections and Preservation Librarian
Digital Scholarship and Publishing Center
North Carolina State University Libraries
Re: [CODE4LIB] Indexing MARC(-JSON) with MongoDB?
On 5/13/10 8:59 AM, Fernando Gómez wrote:
> Any suggestions? Do other document oriented databases offer a better solution for this?

Hey Fernando,

I'd suggest you check out CouchDB. CouchDB uses JSON as its document format and provides advanced indexing (anywhere in the JSON docs) via map/reduce queries that are typically written in JavaScript. The map/reduce queries are simple lambda-style JavaScript functions that are part of a design document (also a simple JSON object) in CouchDB. Check out the following two links for more info:

http://books.couchdb.org/relax/design-documents/design-documents
http://books.couchdb.org/relax/design-documents/views

A simple map/reduce query using your city and address.city keys would look something like this:

    function (doc) {
      if (doc.city) {
        emit(doc.city, doc);
      } else if (doc.address.city) {
        emit(doc.address.city, doc);
      }
    }

That function would return the full document representations keyed by their cities (which is handy for sorting and later reducing by counting unique cities). CouchDB lets you focus on pulling out the data you want, and it handles the indexing. Pretty handy. :)

Let me know if you have other questions about CouchDB.

Take care,
Benjamin
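Applied to the { controlfield: [...] } MARC-JSON shape from the original question, a CouchDB map function could emit field 001 values directly. Inside CouchDB, emit() is supplied by the view engine; the stand-in below just collects rows so the sketch can run anywhere (the record shape and _id are illustrative assumptions).

```javascript
// CouchDB-style map function over the { controlfield: [...] } shape.
// In CouchDB, emit() is provided by the view engine; this stand-in
// collects rows so the sketch is self-contained.
var rows = [];
function emit(key, value) { rows.push({ key: key, value: value }); }

function map(doc) {
  if (doc.controlfield) {
    doc.controlfield.forEach(function (f) {
      // Index records by their 001 value
      if (f.tag === "001") emit(f.data, doc._id);
    });
  }
}

map({ _id: "rec1", controlfield: [{ tag: "001", data: "fst01312614" }] });
```

CouchDB would run a function like this over every document and maintain the resulting key/value rows as the view's index, which is exactly the "index name plus JavaScript function" arrangement Fernando asked for.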
Re: [CODE4LIB] Indexing MARC(-JSON) with MongoDB?
Hi Fernando,

Yesterday I switched my Ubuntu to the 64-bit version, because I'd like to try out MongoDB for indexing library records, and the 32-bit version has a limitation (the database cannot exceed 2 GB). I haven't tried MARC yet, only XC records, which are a derivative of MARC, but from the documentation I read that the idea is absolutely possible. This is an example from Mongo's documentation [1]:

    doc = {
      author: 'joe',
      created: new Date('03-28-2009'),
      title: 'Yet another blog post',
      text: 'Here is the text...',
      tags: ['example', 'joe'],
      comments: [
        { author: 'jim', comment: 'I disagree' },
        { author: 'nancy', comment: 'Good post' }
      ]
    }
    db.posts.insert(doc)
    db.posts.find( { "comments.author" : "jim" } )

The most exciting part here, for me, is that it is not just a simple key-value store (like Lucene/Solr), but supports embedded documents, so you can happily insert subfields, indicators, etc. They will remain compact and findable. So you can combine the relations known from traditional relational databases with the flexibility and speed known from Solr. I will let you know as soon as I can insert the first MARC records into Mongo.

[1] http://www.mongodb.org/display/DOCS/Inserting

regards,
Péter
eXtensible Catalog

- Original Message -
From: Fernando Gómez fjgo...@gmail.com
To: CODE4LIB@LISTSERV.ND.EDU
Sent: Thursday, May 13, 2010 2:59 PM
Subject: [CODE4LIB] Indexing MARC(-JSON) with MongoDB?

> There's been some talk in code4lib about using MongoDB to store MARC records in some kind of JSON format. [...]
Re: [CODE4LIB] Indexing MARC(-JSON) with MongoDB?
> There's been some talk in code4lib about using MongoDB to store MARC records in some kind of JSON format. I'd like to know if you have experimented with indexing those documents in MongoDB. From my limited exposure to MongoDB, it seems difficult, unless MongoDB supports some kind of custom indexing functionality.

First things first: it depends on what kind of indexing you're looking to do. I haven't worked with CouchDB (yet), but I have with MongoDB, and although it's a great (and fast) data store, it has the same basic style of indexing as SQL databases. That is, you can do exact-match, some simple regex (usually left-anchored), and then of course all the power of map/reduce (Mongo does map/reduce as well as Couch). Doing funkier full-text indexing is one of the priorities for upcoming MongoDB development, as I understand.

In the interim, it might be worth having a look at ElasticSearch: http://www.elasticsearch.com/ -- it's based on Lucene and has its own DSL to support fuzzy querying. I've been playing with it and it seems like a smart NoSQL implementation, albeit subtly different from Mongo or Couch.

> { fields : [ ["001", "001 value"], ... ] } or this { controlfield : [ { tag : "001", data : "fst01312614" }, ... ] }
> How would you specify field 001 to MongoDB?

I think you would do this using dot notation, e.g.

    db.records.find( { "controlfield.tag" : "001" } )

But I don't know enough about MARC-in-JSON to say exactly. Have a look at:

http://www.mongodb.org/display/DOCS/Dot+Notation+%28Reaching+into+Objects%29

> It would be nice to have some kind of custom indexing, where one could provide an index name and separately a JavaScript function specifying how to obtain the keys' values for that index. Any suggestions? Do other document oriented databases offer a better solution for this?

My understanding is that indexes, in MongoDB at least, operate much like they do in a SQL RDBMS -- that is, they exist to speed up lookups on field values, rather than having to be defined before a field can be queried. I.e., I *believe* that if you don't explicitly do an ensureIndex() on a field, you can still query it, but it'll be slower. But I may be wrong.

> BTW, I fed MongoDB with the example MARC records in [2] and [3], and it choked on them. Both are missing some commas :-)
> [1] http://www.mongodb.org/display/DOCS/Indexes
> [2] http://robotlibrarian.billdueber.com/new-interest-in-marc-hash-json/
> [3] http://worldcat.org/devnet/wiki/MARC-JSON_Draft_2010-03-11

Not to start a flame war, but from my point of view it seems rather strange for us to go through all this learning of new technology only to stuff MARC into it. That's not to say it can't be done, or that there aren't valid use cases for doing such a thing, but it seems like an odd juxtaposition. I realize this is a bit at odds with my evangelizing at C4LN on merging old and new, but really, being limited to the MARC data model with all the flexibility of NoSQL seems kind of like having a Ferrari and setting the speed limiter at 50 km/h. Fun to drive, I _suppose_.

MJ
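The reason dot notation works here is that MongoDB treats a dotted path into an array of subdocuments as matching if *any* element matches ("multikey" behavior). A rough plain-JavaScript emulation of that matching rule, limited to two-level paths (the function name and shape are mine, for illustration only):

```javascript
// Rough emulation of MongoDB's dot-notation matching for a query
// like { "controlfield.tag": "001" }: the document matches if ANY
// element of the doc.controlfield array has tag === "001".
// Handles only two-level paths; real MongoDB is far more general.
function matchesDotPath(doc, path, value) {
  var parts = path.split(".");
  var field = doc[parts[0]];
  if (Array.isArray(field)) {
    return field.some(function (sub) { return sub[parts[1]] === value; });
  }
  return Boolean(field) && field[parts[1]] === value;
}

var record = {
  controlfield: [
    { tag: "001", data: "fst01312614" },
    { tag: "003", data: "OCoLC" }
  ]
};
```

So db.records.find({ "controlfield.tag": "001" }) finds every record with an 001 field, though note it keys the query on the tag, not on the 001 *value*, which is why the indexing question is still awkward for this shape.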
Re: [CODE4LIB] audio transcription software
I saw a reference to some software called IBM ViaScribe when reading about a project that converts lectures to text (http://www.liberatedlearning.com/technology/index.shtml) a while back.

Kristina

--
Kristina Long
Programmer
reSearcher Software Suite -- researcher.sfu.ca
Simon Fraser University Library
kl...@sfu.ca
Re: [CODE4LIB] Indexing MARC(-JSON) with MongoDB?
On 05/13/2010 09:59 AM, MJ Suhonos wrote:
> First things first: it depends on what kind of indexing you're looking to do. I haven't worked with CouchDB (yet), but I have with MongoDB, and although it's a great (and fast) data store, it has the same basic style of indexing as SQL databases. That is, you can do exact-match, some simple regex (usually left-anchored) and then of course all the power of map/reduce (Mongo does map/reduce as well as Couch).

Out of curiosity, are there libraries to export records from MongoDB into Solr?

--
Thomas Dowling
tdowl...@ohiolink.edu
Re: [CODE4LIB] Indexing MARC(-JSON) with MongoDB?
Sorry, meant to include this link, which compares Elastic Search and Solr: http://blog.sematext.com/2010/05/03/elastic-search-distributed-lucene/ MJ
[CODE4LIB] Job Posting: Director of Digital Technologies at Brown University
*DIRECTOR OF DIGITAL TECHNOLOGIES*

The Brown University Library invites applications for a dynamic and innovative Director of Digital Technologies to provide leadership, vision, and strategic direction for the Brown University Library in the development, delivery, and integration of new and existing systems, technology services, and digital initiatives across the libraries. S/he will oversee the management of the department's three units: Integrated Technology Services, Systems and Technical Support, and the Center for Digital Scholarship.

As a member of the Library's senior management team, the Director of Digital Technologies will serve as the Library's chief liaison with the University's Office of Computing and Information Services and related technology units on campus. S/he will actively seek partnerships with other Library departments and organizations external to the Library, solicit input from and manage collaborations with a broad spectrum of partners, and ensure that the Library's digital services support a wide array of user needs within the teaching, learning, and research mission of the University. This includes developing strategies to assess the effectiveness of digital services and operations. The incumbent will stay abreast of emerging developments, issues, and trends in the use of digital technologies in higher education, and will contribute to and be active in local, regional, and national projects and developments. S/he will be a leading force in the introduction and application of new technologies that improve, enhance, and extend Library services.

Qualifications: Bachelor's degree required. A graduate degree is preferred, such as an MLS or MIS from an ALA-accredited program, or an MS/MA or PhD in a relevant subject. The successful candidate must have at least 5 years of progressively responsible management experience in information technology in an academic library, with substantial technical knowledge of systems and digital technologies, including significant experience in developing and managing technical projects. S/he will have prior experience in some or all of the following: digital repository development (Fedora), digital libraries, data curation, and digital scholarship. The candidate will have an excellent grasp of advanced information technologies and their applications, detailed knowledge of project management, and a demonstrated ability to estimate the scope of a project and bring it to completion on time and within budget. The successful candidate will have a flexible approach to problem-solving and the ability to facilitate change while working within a collegial framework. S/he will demonstrate a record of excellent oral, written, and interpersonal communication skills, along with strong analytical and decision-making skills. Experience with obtaining grant funding and managing grant-funded projects is preferred.

To apply for this position (JOB#B01159), please visit Brown's Online Employment website (https://careers.brown.edu), complete an application online, attach documents, and submit for immediate consideration. Documents should include a cover letter, resume, and the names and e-mail addresses of three references. Review of applications will continue until the position is filled; applications received by June 18, 2010 will receive first consideration. Brown University is an Equal Opportunity / Affirmative Action Employer.

Jean Rainwater
Head, Integrated Technology Services
Brown University Library
Box A / 10 Prospect Street
Providence, Rhode Island 02912
jean_rainwa...@brown.edu
Re: [CODE4LIB] Indexing MARC(-JSON) with MongoDB?
From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of Kyle Banerjee
Sent: Thursday, May 13, 2010 11:51 AM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] Indexing MARC(-JSON) with MongoDB?

> JSON may be a great data exchange format, but it's not a markup language like XML, so doing things like preserving field order or just getting a bird's-eye view of content across multiple fields or subfields becomes more complex.

Huh? JSON arrays preserve element order just like XML preserves element order. Combining JSON labeled arrays and objects provides you with the same mechanisms available in markup languages such as XML.

Andy.
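Andy's point is easy to check: when fields are encoded as an array (as in the marc-hash proposal), their order survives a JSON round trip, because JSON arrays are ordered by definition. A two-line demonstration (the sample record is illustrative):

```javascript
// Field order survives serialization/parsing when fields are an
// array, as in the marc-hash proposal: JSON arrays are ordered.
var record = { fields: [["001", "a"], ["245", "b"], ["100", "c"]] };
var roundTripped = JSON.parse(JSON.stringify(record));
var tags = roundTripped.fields.map(function (f) { return f[0]; });
// tags comes back as ["001", "245", "100"], the order as written
```

The order question only becomes murky if fields were keyed by tag in a plain object instead of listed in an array.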
Re: [CODE4LIB] Call for comments: OPDS Catalogs 0.9 draft, an Atom-based standard for ebook distribution
Quoting Ed Summers e...@pobox.com:

> Folks involved in the Open Publication Distribution System (OPDS) effort are seeking feedback on the latest version of the spec [1] from the publishing and library communities--and specifically from the library-tech oriented code4lib subscribers. The goal is to gather enough feedback for a v1.0 release mid-2010.

Ed, it could just be that I missed it in the document, but I don't see a full(-ish) list of data elements. There is the statement that partial entries should have at least these data elements:

* atom:category
* atom:rights
* dc:extent
* dc:identifier
* dc:issued
* dc:language
* dc:publisher
* opds:price
* prism:issue
* prism:volume

but I would have expected to see a title data element listed somewhere. Is dc assumed? Or is the bibliographic description scheme an open question?

kc

--
Karen Coyle
kco...@kcoyle.net http://kcoyle.net
ph: 1-510-540-7596
m: 1-510-435-8234
skype: kcoylenet
Re: [CODE4LIB] Query LCSH terms at id.loc.gov by modification date
The short answer to your question is no, there's no way to query terms based on last modification date. However -- and this feature still needs to be publicized on the website -- there is an Atom feed that exposes the change activities for the subject headings:

http://id.loc.gov/authorities/feed/

You can page through it (feed/page/1, feed/page/2). There is also a page that shows when each load was performed:

http://id.loc.gov/authorities/loads/

It too has an Atom feed (http://id.loc.gov/authorities/loads/feed).

HTH,
Kevin

Ethan Gruber ewg4x...@gmail.com 05/13/10 3:14 PM:

> Does anyone know if it's possible to query terms at id.loc.gov by their last modification date? Here's an example of the current search results: http://id.loc.gov/authorities/search/?q=egypt
>
> The scenario is that I have a Solr service with all of the subject terms in it that I use for feeding autosuggest in an XForms application. My application comes packaged with the terms from an official RDF release from a few months ago. I'd like to improve my service by enabling a user to update their index with terms newly created or modified by the LOC after the date on which they ran the previous update. I could do this pretty easily if there were a way to query the authority list on id.loc.gov for terms modified after Feb. 15, 2010, for example. It seems there's no advanced search functionality as part of that interface, but maybe there's a sly way of passing additional query parameters.
>
> Thanks,
> Ethan
Re: [CODE4LIB] Call for comments: OPDS Catalogs 0.9 draft, an Atom-based standard for ebook distribution
Hi Karen,

On Thu, May 13, 2010 at 1:19 PM, Karen Coyle li...@kcoyle.net wrote:
> but I would have expected to see a title data element listed somewhere. Is dc assumed? Or is the bibliographic description scheme an open question?

The nice thing about Atom is that it allows you to layer in whatever you want, namespace-wise, into the atom:entry. So yes, you could put MARCXML, MODS, DCTERMS, RDF/XML, etc. into an atom:entry with no problem. This flexibility of Atom allows OPDS to be used in multiple bibliographic metadata environments, and allows it to accommodate change.

That being said, since OPDS uses Atom, the rules for atom:entry elements still apply. In section 4.1.2 of RFC 4287 [1] you'll see this requirement about titles:

    atom:entry elements MUST contain exactly one atom:title element.

There are other requirements from Atom that are also relevant. Unfortunately it's not really practical for the OPDS spec to repeat all the details of RFC 4287. However, it is more than likely that there will be an OPDS Primer or Cookbook that provides practical guidance on using OPDS. If you feel strongly that there should be more guidance on the bibliographic metadata in the spec, I encourage you to open an issue ticket [2].

Thanks for the feedback!

//Ed

[1] http://www.ietf.org/rfc/rfc4287.txt
[2] http://code.google.com/p/openpub/issues/list
Re: [CODE4LIB] Indexing MARC(-JSON) with MongoDB?
Airtran??? Newark, not JFK.

Allen Jones
Director - Digital Library Programs
The New School Libraries

On May 13, 2010, at 2:00 PM, Jonathan Rochkind rochk...@jhu.edu wrote:

> JSON and XML as structures have 'order' in exactly analogous ways. In the case of JSON, if you want to encode order you should use an array, not a dictionary, of course. Whether the particular software _parsing_ or _translating_ either JSON or XML will go through it in order and preserve the order when translating to another format... is another question. Is there reason to think that software dealing with JSON will be more likely to do this wrong than software dealing with XML? I don't get it.
>
> Kyle Banerjee wrote:
>> Huh? JSON arrays preserve element order just like XML preserves element order. Combining JSON labeled arrays and objects provides you with the same mechanisms available in markup languages such as XML.
>>
>> Maybe I'm getting mixed up, but is it not unsafe to assume that element order will be preserved in all environments in for/foreach loops where the JSON might be interpreted, unless you specifically iterate through elements in order? If I'm wrong, this is a total nonissue. Otherwise, there could be side effects.
>>
>> Don't get me wrong. JSON's a better way to go in general, and I think that too much focus on lossless preservation of the MARC record has really held us back. Given that significant portions of the MARC record are not used for search, retrieval, or display, and many useful elements consist of free text, faithfully preserving each field as an object to encode elements such as extent of item or notes strikes me as using a chain saw to cut butter.
>>
>> kyle
Re: [CODE4LIB] Indexing MARC(-JSON) with MongoDB?
Please disregard the last email.

Allen Jones
Director - Digital Library Programs
The New School Libraries
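On the iteration-order worry raised in this thread: in JavaScript, iterating an array by index (or with forEach) is guaranteed to visit elements in order in every environment; the historical uncertainty is only around enumeration order of *object* properties. A small sketch of the distinction (the sample fields are illustrative):

```javascript
// Array iteration is ordered in every JavaScript environment:
// for-by-index and forEach visit elements in array order.
var fields = [["001", "a"], ["245", "b"], ["100", "c"]];
var order = [];
for (var i = 0; i < fields.length; i++) {
  order.push(fields[i][0]);
}
// order is ["001", "245", "100"], guaranteed.
// By contrast, { "001": "a", "245": "b" } would leave key enumeration
// order up to the implementation (the spec did not guarantee it at the
// time), which is why both MARC-JSON proposals encode fields as arrays.
```

So Kyle's concern is real for object-keyed encodings, and moot for the array-based ones under discussion.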
Re: [CODE4LIB] Query LCSH terms at id.loc.gov by modification date
As Kevin said, I think you can use the Atom feed to page backwards through time. Basically this amounts to programmatically following the link rel=next links in the feed, applying creates, updates, and deletes as you go until you make it back to Feb. 15, 2010. Currently this would involve walking from:

http://id.loc.gov/authorities/feed/

to:

http://id.loc.gov/authorities/feed/page/2/

all the way to:

http://id.loc.gov/authorities/feed/page/96/

Then in a month's time or whatever you can run the same process again. I think you can either walk through the feed pages until a known last-harvest date, or until you see a record with an atom:id and atom:updated you already know about. I think the latter could be a bit simpler, assuming you are keeping track of what you have.

Ever since reading the OAI-ORE specs on Atom [1] I've become a bit taken with the idea of using Atom syndication as a drop-in replacement for OAI-PMH--which is the spec that most people in the library community reach for when they want to do metadata synchronization. The advantage of Atom is that it fits so nicely into the syndication world and its ecosystem of tools and services.

//Ed

[1] http://www.openarchives.org/ore/1.0/atom

On Thu, May 13, 2010 at 4:53 PM, Kevin Ford k...@loc.gov wrote:
> The short answer to your question is no, there's no way to query terms based on last modification date. [...]
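The two stop conditions Ed describes (a known last-harvest date, or a previously seen id/updated pair) can be sketched as a small pure function. The entry shape here is a simplified stand-in for parsed atom:entry elements, not what a feed parser would actually hand back; it relies on ISO-style date strings comparing correctly as strings.

```javascript
// Decide which feed entries (newest first) still need applying.
// Stop at the first entry at or before the last harvest date, or at
// the first (id, updated) pair we have already applied. Entry shape
// { id, updated } is a simplified stand-in for a parsed atom:entry;
// ISO-style date strings compare correctly lexicographically.
function entriesToApply(entries, lastHarvest, seen) {
  var out = [];
  for (var i = 0; i < entries.length; i++) {
    var e = entries[i];
    if (e.updated <= lastHarvest) break;  // harvested last time
    if (seen[e.id] === e.updated) break;  // already applied
    out.push(e);
  }
  return out;
}
```

A harvester would call this on each page of the feed and stop following rel=next links as soon as it returns fewer entries than the page holds.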
[CODE4LIB] FW: Webcast on UC3's microservices and the new DPR: Thurs May 20, 2-3pm
Hi - of interest to some of y'all?

D

From: UC3-L [mailto:uc...@listserv.ucop.edu] On Behalf Of Perry Willett
Sent: Thursday, May 13, 2010 3:34 PM
To: uc...@listserv.ucop.edu
Subject: Webcast on UC3's microservices and the new DPR: Thurs May 20, 2-3pm

Here's the information on our webcast next Thursday. Sorry for the late notice--hope you can make it:

Webcast on CDL UC3's microservices and the new DPR
When: Thursday May 20, 2-3pm
How to participate:
Telephone number: +1 (866) 740-1260
Access code: 7082332
Web conference: https://www.callinfo.com/prt?ac=7082332an=8667401260host=readytalk

UC3 is developing a new DPR based on a micro-services approach [1], and we've reached a development milestone. We'd like to hold a webcast for our users on May 20 from 2-3pm. In this webcast, we'll explain our new approach and demo the new repository, showing the ingest process and highlighting the new features. Our goal is to show you our development and to hear your thoughts on how we can improve our new system. Please plan on joining us! (We'll record the webcast in case you can't.)

[1] See the UC3 Curation wiki for more documentation and details: https://confluence.ucop.edu/display/Curation/Home

Perry Willett
Digital Preservation Services Manager
California Digital Library
415 20th St., 4th Floor
Oakland CA 94612-2901
Ph: 510-987-0078
Fax: 510-893-5212
Email: perry.will...@ucop.edu
Re: [CODE4LIB] FW: Webcast on UC3's microservices and the new DPR: Thurs May 20, 2-3pm
Well, Declan's let the cat out of the bag. We're having a webinar on our microservices, to demo the repository and ingest services we've developed so far. The webinar is primarily meant for our partners within the University of California, but all are welcome. Thurs May 20, 2-3pm (PDT). More information here: https://confluence.ucop.edu/display/Curation/UC3+Microservices+Webcast+2010-05-20 or more succinctly: http://bit.ly/9APZDk Please join us! Perry Willett California Digital Library