Re: [CODE4LIB] Indexing MARC(-JSON) with MongoDB?
Hi Fernando,

I have started experimenting with MARC in Mongo. I have imported ~6 million MARC records (authority and bibliographic) into MongoDB. The steps I took:

1) The source was MARCXML I created with the XC OAI Toolkit.
2) I created an XSLT file which creates MARC-JSON from MARCXML. I followed the MARC-JSON draft (http://worldcat.org/devnet/wiki/MARC-JSON_Draft_2010-03-11) and not Bill Dueber's MARC-HASH. The conversion is not 100% perfect, but of the 6 million records only 20 were converted with errors, which is a good enough error rate for a home-made project.
3) Imported the files.
4) Indexed the files.

Lessons learned:

- The import is much quicker than any other part of the workflow: the 6 million records were imported in about 30 minutes, while indexing took 3 hours.
- count() is a very slow method for complex queries, even after intensive indexing, but iterating over the results is quicker.
- There is no way to index parts of strings (e.g. splitting the leader or the 006/007/008 fields).
- Full-text search is not too quick.
- Before indexing, the size of the database was 9 GB; after full indexing it was 28 GB (I should note that on a 32-bit operating system the maximum size Mongo can handle is 2 GB).

Conclusions:

- The MARC-JSON format is good for data exchange, but it is not precise enough for searching, since (a MARC heritage) distinct pieces of information are combined into single fields (Leader, 008 etc.). We should split them into smaller chunks of information before indexing.
- I should learn more about the possibilities of MongoDB.

I can give you more technical details if you are interested.

Péter
eXtensible Catalog

----- Original Message -----
From: Fernando Gómez fjgo...@gmail.com
To: CODE4LIB@LISTSERV.ND.EDU
Sent: Thursday, May 13, 2010 2:59 PM
Subject: [CODE4LIB] Indexing MARC(-JSON) with MongoDB?

> There's been some talk in code4lib about using MongoDB to store MARC records in some kind of JSON format. [...]
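The conclusion above, that combined fields such as the Leader and 008 should be split into smaller chunks before indexing, could be sketched roughly as follows. This is plain JavaScript and not part of the workflow described in the message: the function names and the output key names (recordStatus, date1 and so on) are made up for illustration, the sample values are invented, and only a few positions of each fixed field are extracted.

```javascript
// Hypothetical helper: split a MARC 21 leader (24 characters) into named
// parts so that each part can be stored, and later indexed, as its own key.
function splitLeader(leader) {
  return {
    recordStatus: leader.charAt(5),        // position 05
    typeOfRecord: leader.charAt(6),        // position 06
    bibliographicLevel: leader.charAt(7),  // position 07
    encodingLevel: leader.charAt(17)       // position 17
  };
}

// Hypothetical helper: split the 008 positions that are common to all
// material types (the field is 40 characters long).
function split008(value) {
  return {
    dateEntered: value.substring(0, 6),      // positions 00-05 (yymmdd)
    typeOfDate: value.charAt(6),             // position 06
    date1: value.substring(7, 11),           // positions 07-10
    date2: value.substring(11, 15),          // positions 11-14
    place: value.substring(15, 18).trim(),   // positions 15-17
    language: value.substring(35, 38)        // positions 35-37
  };
}

// Sample values invented for this sketch.
var sampleLeader = "00714cam a2200205 a 4500";
var sample008 = "100513" + "s" + "2010" + " ".repeat(4) + "nyu" +
                " ".repeat(17) + "eng" + " " + "d";

var leaderParts = splitLeader(sampleLeader);
var fixedParts = split008(sample008);
```

Stored next to the original field, keys like these could then be indexed individually with ensureIndex(), which is not possible for a substring of the raw 008 value.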
[CODE4LIB] Indexing MARC(-JSON) with MongoDB?
There's been some talk in code4lib about using MongoDB to store MARC records in some kind of JSON format. I'd like to know if you have experimented with indexing those documents in MongoDB. From my limited exposure to MongoDB, it seems difficult, unless MongoDB supports some kind of custom indexing functionality.

According to the MongoDB docs [1], you can create an index by calling the ensureIndex() function and providing a document that specifies one or more keys to index. Examples of this are:

    db.things.ensureIndex({city: 1})
    db.things.ensureIndex({"address.city": 1})

That is, you specify the keys by giving a path from the root of the document to the data element you are interested in. Such a path acts both as the index's name and as a specification of how to get the keys' values.

In the case of the two proposed MARC-JSON formats [2, 3], I can't see such a path. For example, say you want an index on field 001. Simplifying, the JSON docs would look like this

    { "fields" : [ ["001", "001 value"], ... ] }

or this

    { "controlfield" : [ { "tag" : "001", "data" : "fst01312614" }, ... ] }

How would you specify field 001 to MongoDB?

It would be nice to have some kind of custom indexing, where one could provide an index name and, separately, a JavaScript function specifying how to obtain the keys' values for that index.

Any suggestions? Do other document-oriented databases offer a better solution for this?

BTW, I fed MongoDB with the example MARC records in [2] and [3], and it choked on them. Both are missing some commas :-)

[1] http://www.mongodb.org/display/DOCS/Indexes
[2] http://robotlibrarian.billdueber.com/new-interest-in-marc-hash-json/
[3] http://worldcat.org/devnet/wiki/MARC-JSON_Draft_2010-03-11

--
Fernando Gómez
Biblioteca Antonio Monteiro
INMABB (Conicet / Universidad Nacional del Sur)
Av. Alem 1253
B8000CPB Bahía Blanca, Argentina
Tel. +54 (291) 459 5116
http://inmabb.criba.edu.ar/
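One way to get the kind of path asked about above, sketched here under assumptions, is to reshape each record before storing it so that the tag itself becomes a key. The function below is plain JavaScript; its name and the `control` key are hypothetical, the second sample field is invented, and it handles only the second (controlfield) layout shown in the message. With documents shaped this way, a dotted path such as "control.001" would point at the value of field 001.

```javascript
// Hypothetical helper: turn the array of { tag, data } control fields into
// an object keyed by tag, so that a path from the document root to the
// value of field 001 exists ("control.001").
function keyControlFieldsByTag(record) {
  var control = {};
  (record.controlfield || []).forEach(function (field) {
    control[field.tag] = field.data;
  });
  return { control: control };
}

var reshaped = keyControlFieldsByTag({
  controlfield: [
    { tag: "001", data: "fst01312614" },
    { tag: "003", data: "OCoLC" }
  ]
});
// reshaped.control["001"] is now "fst01312614"
```

This only works cleanly for non-repeatable fields such as 001; a repeatable field would overwrite its earlier occurrences unless the values were collected into arrays, and the original field order is no longer recorded.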
Re: [CODE4LIB] Indexing MARC(-JSON) with MongoDB?
On 5/13/10 8:59 AM, Fernando Gómez wrote:
> Any suggestions? Do other document oriented databases offer a better
> solution for this?

Hey Fernando,

I'd suggest you check out CouchDB. CouchDB uses JSON as its document format and provides advanced indexing (anywhere in the JSON docs) via map/reduce queries that are typically written in JavaScript. The map/reduce queries are simple lambda JavaScript functions that are part of a design document (also a simple JSON object) in CouchDB. Check out the following two links for more info:

http://books.couchdb.org/relax/design-documents/design-documents
http://books.couchdb.org/relax/design-documents/views

A simple map/reduce query using your city and address.city keys would look something like this:

    function (doc) {
      if (doc.city) {
        emit(doc.city, doc);
      } else if (doc.address && doc.address.city) {
        // guard against documents that have no address at all
        emit(doc.address.city, doc);
      }
    }

That function would return the full document representation keyed by city (which is handy for sorting and for later reducing by counting unique cities). CouchDB lets you focus on pulling out the data you want, and it handles the indexing. Pretty handy. :)

Let me know if you have other questions about CouchDB.

Take care,
Benjamin
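Applied to the MARC-JSON question, a map function along the same lines might look like the sketch below. It assumes the controlfield layout from Fernando's message. Inside a real view, `emit` is supplied by CouchDB, so the stand-in defined here exists only to make the sketch runnable on its own; the function name `mapBy001` and the sample document are made up.

```javascript
// Stand-in for CouchDB's emit(), which the view server provides in a real
// design document. Here it just collects the emitted rows.
var rows = [];
function emit(key, value) {
  rows.push({ key: key, value: value });
}

// Map function: emit one row per 001 control field, keyed by its value.
function mapBy001(doc) {
  (doc.controlfield || []).forEach(function (field) {
    if (field.tag === "001") {
      // null value: a view row still carries the id of its document
      emit(field.data, null);
    }
  });
}

mapBy001({ _id: "rec1", controlfield: [{ tag: "001", data: "fst01312614" }] });
mapBy001({ _id: "rec2", title: "a document without control fields" });
```

Emitting null instead of the whole document keeps the view small; the full record can still be fetched by its id when needed.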
Re: [CODE4LIB] Indexing MARC(-JSON) with MongoDB?
Hi Fernando,

Yesterday I switched Ubuntu to the 64-bit version, because I'd like to try out indexing library records with MongoDB, and the 32-bit version has some limitations (the database cannot exceed 2 GB). I haven't tried MARC yet, only XC records, which are a derivative of MARC, but from the documentation I gather that the idea is absolutely feasible. This is an example from Mongo's documentation [1]:

    doc = { author: 'joe',
            created: new Date('03-28-2009'),
            title: 'Yet another blog post',
            text: 'Here is the text...',
            tags: [ 'example', 'joe' ],
            comments: [ { author: 'jim', comment: 'I disagree' },
                        { author: 'nancy', comment: 'Good post' } ] }
    db.posts.insert(doc)
    db.posts.find( { "comments.author" : "jim" } )

The most exciting thing here, for me, is that it is not just a simple key-value store (like Lucene/Solr), but provides embedded fields, so you can safely insert subfields, indicators etc. They will remain compact and findable. So you can combine the relations known from traditional relational databases with the flexibility and speed known from Solr.

I will let you know as soon as I have inserted the first MARC records into Mongo.

[1] http://www.mongodb.org/display/DOCS/Inserting

regards,
Péter
eXtensible Catalog
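As a sketch of what embedding subfields and indicators could look like for a MARC record, the example below stores one data field as an embedded document and pulls out its subfield values in plain JavaScript. The key names (datafield, ind1, ind2, subfield, code, data), the helper function and the sample title are assumptions, loosely following the controlfield layout mentioned in the thread; they are not a confirmed MARC-JSON schema.

```javascript
// One record with a single 245 field, its indicators and two subfields.
var record = {
  datafield: [
    {
      tag: "245", ind1: "1", ind2: "0",
      subfield: [
        { code: "a", data: "Yet another title :" },
        { code: "b", data: "a subtitle" }
      ]
    }
  ]
};

// Hypothetical helper: collect the values of one subfield code across all
// occurrences of a tag.
function subfieldValues(rec, tag, code) {
  var values = [];
  (rec.datafield || []).forEach(function (field) {
    if (field.tag !== tag) { return; }
    (field.subfield || []).forEach(function (sub) {
      if (sub.code === code) { values.push(sub.data); }
    });
  });
  return values;
}

var titles = subfieldValues(record, "245", "a");
```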
Re: [CODE4LIB] Indexing MARC(-JSON) with MongoDB?
> There's been some talk in code4lib about using MongoDB to store MARC records in some kind of JSON format. I'd like to know if you have experimented with indexing those documents in MongoDB. From my limited exposure to MongoDB, it seems difficult, unless MongoDB supports some kind of custom indexing functionality.

First things first: it depends on what kind of indexing you're looking to do. I haven't worked with CouchDB (yet), but I have with MongoDB, and although it's a great (and fast) data store, it has the same basic style of indexing as SQL databases. That is, you can do exact-match, some simple regex (usually left-anchored), and then of course all the power of map/reduce (Mongo does map/reduce as well as Couch).

Doing funkier full-text indexing is one of the priorities for upcoming MongoDB development, as I understand it. In the interim, it might be worth having a look at ElasticSearch (http://www.elasticsearch.com/). It's based on Lucene and has its own DSL to support fuzzy querying. I've been playing with it and it seems like a smart NoSQL implementation, albeit subtly different from Mongo or Couch.

> { "fields" : [ ["001", "001 value"], ... ] }
> or this
> { "controlfield" : [ { "tag" : "001", "data" : "fst01312614" }, ... ] }
> How would you specify field 001 to MongoDB?

I think you would do this using dot notation, e.g.

    db.records.find( { "controlfield.tag" : "001" } )

But I don't know enough about MARC-in-JSON to say exactly. Have a look at: http://www.mongodb.org/display/DOCS/Dot+Notation+%28Reaching+into+Objects%29

> It would be nice to have some kind of custom indexing, where one could provide an index name and separately a JavaScript function specifying how to obtain the keys' values for that index. Any suggestions? Do other document oriented databases offer a better solution for this?

My understanding is that indexes, in MongoDB at least, operate much like they do in an SQL RDBMS. That is, they are used to pre-hash field values for performance, rather than having to be explicitly defined. I.e. I *believe* if you don't explicitly do an ensureIndex() on a field, you can still query it, but it'll be slower. But I may be wrong.

> BTW, I fed MongoDB with the example MARC records in [2] and [3], and it choked on them. Both are missing some commas :-)
> [1] http://www.mongodb.org/display/DOCS/Indexes
> [2] http://robotlibrarian.billdueber.com/new-interest-in-marc-hash-json/
> [3] http://worldcat.org/devnet/wiki/MARC-JSON_Draft_2010-03-11

Not to start a flame war, but from my point of view, it seems rather strange for us to go through all this learning of new technology only to stuff MARC into it. That's not to say it can't be done, or that there aren't valid use cases for doing such a thing, but just that it seems like an odd juxtaposition.

I realize this is a bit at odds with my evangelizing at C4LN on merging old and new, but really, being limited to the MARC data model with all the flexibility of NoSQL seems kind of like having a Ferrari and then setting the speed limiter at 50 km/h. Fun to drive, I _suppose_.

MJ
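One caveat with the dot-notation query above: a condition on "controlfield.tag" alone matches every record that has a 001 field at all, whatever its value. To find a record by the value of its 001, the tag and the data have to match within the same array element. The plain JavaScript below emulates that check on an in-memory array rather than talking to a database; the sample records are invented. In the mongo shell, the `$elemMatch` operator expresses this kind of same-element condition, as noted in the comment.

```javascript
// Two invented records in the controlfield layout discussed in the thread.
var records = [
  { controlfield: [{ tag: "001", data: "fst01312614" }, { tag: "003", data: "OCoLC" }] },
  { controlfield: [{ tag: "001", data: "fst00000001" }, { tag: "003", data: "OCoLC" }] }
];

// Matches any record that has a 001 field, i.e. both samples.
var hasAny001 = records.filter(function (rec) {
  return rec.controlfield.some(function (f) { return f.tag === "001"; });
});

// Matches only records where a single element has both the tag and the value.
// Shell counterpart (not run here):
//   db.records.find({ controlfield: { $elemMatch: { tag: "001", data: "fst01312614" } } })
var by001 = records.filter(function (rec) {
  return rec.controlfield.some(function (f) {
    return f.tag === "001" && f.data === "fst01312614";
  });
});
```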
Re: [CODE4LIB] Indexing MARC(-JSON) with MongoDB?
On 05/13/2010 09:59 AM, MJ Suhonos wrote:
> First things first: it depends on what kind of indexing you're looking to do. I haven't worked with CouchDB (yet), but I have with MongoDB, and although it's a great (and fast) data store, it has the same basic style of indexing as SQL databases. That is, you can do exact-match, some simple regex (usually left-anchored) and then of course all the power of map/reduce (Mongo does map/reduce as well as Couch).

Out of curiosity, are there libraries to export records from MongoDB into Solr?

--
Thomas Dowling
tdowl...@ohiolink.edu
Re: [CODE4LIB] Indexing MARC(-JSON) with MongoDB?
Sorry, meant to include this link, which compares Elastic Search and Solr:

http://blog.sematext.com/2010/05/03/elastic-search-distributed-lucene/

MJ
Re: [CODE4LIB] Indexing MARC(-JSON) with MongoDB?
> From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of Kyle Banerjee
> Sent: Thursday, May 13, 2010 11:51 AM
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: Re: [CODE4LIB] Indexing MARC(-JSON) with MongoDB?
>
> JSON may be a great data exchange format, but it's not a markup language like XML, so doing things like preserving field order or just getting a bird's eye view of content across multiple fields or subfields becomes more complex.

Huh? JSON arrays preserve element order just like XML preserves element order. Combining JSON labeled arrays and objects provides you with the same mechanisms available in markup languages such as XML.

Andy.
Re: [CODE4LIB] Indexing MARC(-JSON) with MongoDB?
Airtran???Newark not jfk.

Allen Jones
Director - Digital Library Programs
The New School Libraries

On May 13, 2010, at 2:00 PM, Jonathan Rochkind rochk...@jhu.edu wrote:

> JSON and XML as structures have 'order' in exactly analogous ways. In the case of JSON, if you want to encode order you should use an array, not a dictionary, of course.
>
> Whether the particular software _parsing_ or _translating_ either JSON or XML will go through it in order and preserve the order when translating to another format... is another question. Is there reason to think that software dealing with JSON will be more likely to do this wrong than software dealing with XML? I don't get it.
>
> Kyle Banerjee wrote:
>>> Huh? JSON arrays preserve element order just like XML preserves element order. Combining JSON labeled arrays and objects provides you with the same mechanisms available in markup languages such as XML.
>>
>> Maybe I'm getting mixed up, but is it not unsafe to assume that element order will be preserved in all environments in for/foreach loops where the JSON might be interpreted, unless you specifically iterate through elements in order? If I'm wrong, this is a total nonissue. Otherwise, there could be side effects.
>>
>> Don't get me wrong. JSON's a better way to go in general, and I think that too much focus on lossless preservation of the MARC record has really held us back. Given that significant portions of the MARC record are not used for search, retrieval, or display, and many useful elements consist of free text, faithfully preserving each field as an object to encode elements such as extent of item or notes strikes me as using a chainsaw to cut butter.
>>
>> kyle
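The distinction drawn above between arrays and dictionaries can be checked directly in JavaScript. In this sketch (sample data invented), the array keeps the order in which the fields were written, while the object does not: current JavaScript engines list integer-like keys such as "245" in ascending numeric order before other string keys such as "001", regardless of how the JSON text ordered them.

```javascript
// Fields encoded as an array: order is part of the data.
var asArray = JSON.parse('[["650", "Subject"], ["245", "Title"], ["001", "id"]]');
var arrayOrder = asArray.map(function (pair) { return pair[0]; });

// The same fields encoded as an object (dictionary) keyed by tag.
var asObject = JSON.parse('{"650": "Subject", "245": "Title", "001": "id"}');
var objectOrder = Object.keys(asObject);

console.log(arrayOrder.join(","));  // 650,245,001
console.log(objectOrder.join(",")); // 245,650,001
```

So a MARC-in-JSON encoding that needs to preserve field order would keep fields in an array, as both proposed formats in this thread do, and consumers would iterate that array by position.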
Re: [CODE4LIB] Indexing MARC(-JSON) with MongoDB?
Please disregard last email.

Allen Jones
Director - Digital Library Programs
The New School Libraries