Re: [CODE4LIB] Indexing MARC(-JSON) with MongoDB?

2010-05-14 Thread Király Péter

Hi Fernando,

I have started experimenting with MARC in Mongo. I have imported ~6 million
MARC records (auth and bib) into MongoDB. The steps I took:

1) the source was MARCXML I created with the XC OAI Toolkit.
2) I created an XSLT file which creates MARC-JSON from MARCXML.
I followed the MARC-JSON draft
(http://worldcat.org/devnet/wiki/MARC-JSON_Draft_2010-03-11) and not Bill
Dueber's MARC HASH. The conversion is not 100% perfect, but of the 6 million
records only 20 were converted with errors, which is a low enough error rate
for a home-made project.
3) imported the files
4) indexed the files

Lessons learned:
- the import process is much quicker than any other part of the workflow.
The 6 million records were imported in about 30 minutes, while indexing took
3 hours.
- count() is a very slow method for complex queries, even after intensive
indexing, but iterating over the results is much quicker.
- there is no way to index part of a string (e.g. splitting the Leader or the
006/007/008 fields)
- full text search is not too quick
- before indexing the size of the database was 9 GB; after full indexing it
was 28 GB (I should note that on a 32-bit operating system the maximum size
of a MongoDB database is about 2 GB).

Conclusions:
- the MARC-JSON format is good for data exchange, but it is not precise enough
for searching, since, as part of the MARC heritage, distinct pieces of
information are combined into single fields (Leader, 008 etc.). We should
split them into smaller chunks of information before indexing, as sketched
below.
- I should learn more about the possibilities of MongoDB
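
To illustrate what such splitting could look like, here is a minimal mongo
shell sketch (the key names f008_date1 and f008_lang and the record shape are
only hypothetical, not taken from any of the drafts):

// hypothetical record with some 008 positions pre-split into separate keys
doc = { leader : '00925njm  22002777a 4500',
        f008_date1 : '2010',   // 008/07-10 (Date 1)
        f008_lang  : 'eng'     // 008/35-37 (language)
      }
db.records.insert(doc)
db.records.ensureIndex({ f008_lang : 1 })
db.records.find({ f008_lang : 'eng' })   // equality match on an ordinary index

This way a query on language hits a normal index instead of having to match a
substring of the raw 008.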

I can give you more technical details if you are interested.

Péter
eXtensible Catalog


- Original Message - 
From: Fernando Gómez fjgo...@gmail.com

To: CODE4LIB@LISTSERV.ND.EDU
Sent: Thursday, May 13, 2010 2:59 PM
Subject: [CODE4LIB] Indexing MARC(-JSON) with MongoDB?






[CODE4LIB] Indexing MARC(-JSON) with MongoDB?

2010-05-13 Thread Fernando Gómez
There's been some talk in code4lib about using MongoDB to store MARC
records in some kind of JSON format. I'd like to know if you have
experimented with indexing those documents in MongoDB. From my limited
exposure to MongoDB, it seems difficult, unless MongoDB supports some
kind of custom indexing functionality.

According to the MongoDB docs [1], you can create an index by calling
the ensureIndex() function, and providing a document that specifies
one or more keys to index. Examples of this are:

db.things.ensureIndex({city: 1})
db.things.ensureIndex({"address.city": 1})

That is, you specify the keys by giving a path from the root of the
document to the data element you are interested in. Such a path acts
both as the index's name and as a specification of how to get the
keys' values.
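
For a plain document that works fine; for example (collection and field names
invented for illustration):

db.things.insert({ name : 'branch 1', address : { city : 'Bahía Blanca' } })
db.things.ensureIndex({ "address.city" : 1 })        // index is named address.city_1
db.things.find({ "address.city" : 'Bahía Blanca' })  // this query can use it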

In the case of the two proposed MARC-JSON formats [2, 3], I can't see such a
path. For example, say you want an index on field 001. Simplifying, the JSON
docs would look like this

{ "fields" : [ ["001", "001 value"], ... ] }

or this

{ "controlfield" : [ { "tag" : "001", "data" : "fst01312614" }, ... ] }

How would you specify field 001 to MongoDB?

It would be nice to have some kind of custom indexing, where one could
provide an index name and, separately, a JavaScript function specifying
how to obtain the keys' values for that index.

Any suggestions? Do other document oriented databases offer a better
solution for this?


BTW, I fed MongoDB with the example MARC records in [2] and [3], and
it choked on them. Both are missing some commas :-)


[1] http://www.mongodb.org/display/DOCS/Indexes
[2] http://robotlibrarian.billdueber.com/new-interest-in-marc-hash-json/
[3] http://worldcat.org/devnet/wiki/MARC-JSON_Draft_2010-03-11


-- 
Fernando Gómez
Biblioteca Antonio Monteiro
INMABB (Conicet / Universidad Nacional del Sur)
Av. Alem 1253
B8000CPB Bahía Blanca, Argentina
Tel. +54 (291) 459 5116
http://inmabb.criba.edu.ar/


Re: [CODE4LIB] Indexing MARC(-JSON) with MongoDB?

2010-05-13 Thread Benjamin Young

On 5/13/10 8:59 AM, Fernando Gómez wrote:

Any suggestions? Do other document oriented databases offer a better
solution for this?

Hey Fernando,

I'd suggest you check out CouchDB. CouchDB uses JSON as its document
format and provides advanced indexing (anywhere in the JSON docs) via
map/reduce queries that are typically written in JavaScript. The
map/reduce queries are simple lambda-style JavaScript functions that are
part of a design document (also a simple JSON object) in CouchDB. Check
out the following two links for more info:

http://books.couchdb.org/relax/design-documents/design-documents
http://books.couchdb.org/relax/design-documents/views

A simple map/reduce query using your city and address.city keys would
look something like this:


function (doc) {
  if (doc.city) {
    emit(doc.city, doc);
  } else if (doc.address && doc.address.city) {  // guard: not every doc has an address
    emit(doc.address.city, doc);
  }
}

That function would return the full document representations keyed by their
cities (which is handy for sorting, and later for reducing by counting unique
cities).
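
To show where that function lives, here's a rough sketch of a design document
(the names _design/places and by_city are made up for this example; the map
emits 1 instead of the whole doc so the reduce can simply sum the values):

{
  "_id": "_design/places",
  "language": "javascript",
  "views": {
    "by_city": {
      "map": "function (doc) { if (doc.city) { emit(doc.city, 1); } else if (doc.address && doc.address.city) { emit(doc.address.city, 1); } }",
      "reduce": "function (keys, values, rereduce) { return sum(values); }"
    }
  }
}

Querying the view with ?group=true then gives one row per city with its count.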


CouchDB lets you focus on pulling out the data you want, and it handles 
the indexing. Pretty handy. :)


Let me know if you have other questions about CouchDB.

Take care,
Benjamin


Re: [CODE4LIB] Indexing MARC(-JSON) with MongoDB?

2010-05-13 Thread Király Péter

Hi Fernando,

Yesterday I switched Ubuntu to the 64-bit version, because I'd like to try out
indexing library records with MongoDB, and the 32-bit version has a limitation
(the database cannot exceed about 2 GB). I haven't tried MARC yet, only XC
records, which are a derivative of MARC, but from the documentation I read
that the idea is absolutely possible.

This is an example from Mongo's documentation [1]:

doc = { author : 'joe',
        created : new Date('03-28-2009'),
        title : 'Yet another blog post',
        text : 'Here is the text...',
        tags : [ 'example', 'joe' ],
        comments : [ { author : 'jim', comment : 'I disagree' },
                     { author : 'nancy', comment : 'Good post' } ]
      }
db.posts.insert(doc)
db.posts.find( { "comments.author" : "jim" } )

The most exciting thing here, for me, is that it is not just a simple
key-value store (like Lucene/Solr), but provides embedded fields, so you can
bravely insert subfields, indicators etc., and they will remain compact and
findable. So you can combine the relations known from traditional relational
databases with the flexibility and speed known from Solr.
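
As a rough sketch of what I have in mind (the key names tag, ind1, ind2, code
and the collection name marc are my own invention, not from any draft):

rec = { leader : '00925njm  22002777a 4500',
        fields : [ { tag : '245', ind1 : '1', ind2 : '0',
                     subfields : [ { code : 'a', value : 'Yet another title' },
                                   { code : 'c', value : 'by somebody.' } ] } ]
      }
db.marc.insert(rec)
// dot notation reaches into the embedded subfields
db.marc.find( { "fields.subfields.code" : 'a' } )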

I will let you know as soon as I have inserted the first MARC records into Mongo.

[1] http://www.mongodb.org/display/DOCS/Inserting

regards,
Péter
eXtensible Catalog

- Original Message - 
From: Fernando Gómez fjgo...@gmail.com

To: CODE4LIB@LISTSERV.ND.EDU
Sent: Thursday, May 13, 2010 2:59 PM
Subject: [CODE4LIB] Indexing MARC(-JSON) with MongoDB?






Re: [CODE4LIB] Indexing MARC(-JSON) with MongoDB?

2010-05-13 Thread MJ Suhonos
 There's been some talk in code4lib about using MongoDB to store MARC
 records in some kind of JSON format. I'd like to know if you have
 experimented with indexing those documents in MongoDB. From my limited
 exposure to MongoDB, it seems difficult, unless MongoDB supports some
 kind of custom indexing functionality.

First things first: it depends on what kind of indexing you're looking to do.
I haven't worked with CouchDB (yet), but I have with MongoDB, and although
it's a great (and fast) data store, it has the same basic style of indexing as
SQL databases.  That is, you can do exact matches, some simple regexes
(usually left-anchored) and then of course all the power of map/reduce (Mongo
does map/reduce as well as Couch).
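
For instance, a left-anchored, case-sensitive regex can still use an index,
while an unanchored one forces a scan (collection and field names below are
made up):

db.records.ensureIndex({ title : 1 })
db.records.find({ title : /^Moby/ })   // prefix regex, can use the index
db.records.find({ title : /whale/ })   // unanchored, scans every document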

Doing funkier full-text indexing is one of the priorities for upcoming MongoDB
development, as I understand it.  In the interim, it might be worth having a
look at ElasticSearch (http://www.elasticsearch.com/), which is based on
Lucene and has its own DSL to support fuzzy querying.  I've been playing with
it and it seems like a smart NoSQL implementation, albeit subtly different
from Mongo or Couch.

 { "fields" : [ ["001", "001 value"], ... ] }
 
 or this
 
 { "controlfield" : [ { "tag" : "001", "data" : "fst01312614" }, ... ] }
 
 How would you specify field 001 to MongoDB?

I think you would do this using dot notation, e.g.
db.records.find( { "controlfield.tag" : "001" } )

But I don't know enough about MARC-in-JSON to say exactly.  Have a look at:

http://www.mongodb.org/display/DOCS/Dot+Notation+%28Reaching+into+Objects%29
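
Something like this might work for actually indexing and fetching records by
their 001 (untested against real MARC data, just to illustrate the idea):

db.records.ensureIndex({ "controlfield.tag" : 1, "controlfield.data" : 1 })
// $elemMatch keeps tag and data matched within the *same* embedded controlfield
db.records.find({ controlfield : { $elemMatch : { tag : "001", data : "fst01312614" } } })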

 It would be nice to have some kind of custom indexing, where one could
 provide an index name and separately a JavaScript function specifying
 how to obtain the keys' values for that index.
 
 Any suggestions? Do other document oriented databases offer a better
 solution for this?

My understanding is that indexes, in MongoDB at least, operate much like they
do in a SQL RDBMS: they are used to pre-hash field values for performance,
rather than having to be explicitly defined before a field can be queried.
I.e. I *believe* that if you don't explicitly do an ensureIndex() on a field,
you can still query it, but it'll be slower.  But I may be wrong.

 BTW, I fed MongoDB with the example MARC records in [2] and [3], and
 it choked on them. Both are missing some commas :-)
 
 [1] http://www.mongodb.org/display/DOCS/Indexes
 [2] http://robotlibrarian.billdueber.com/new-interest-in-marc-hash-json/
 [3] http://worldcat.org/devnet/wiki/MARC-JSON_Draft_2010-03-11

Not to start a flame war, but from my point of view it seems rather strange
for us to go through all this learning of new technology only to stuff MARC
into it.  That's not to say it can't be done, or that there aren't valid use
cases for doing such a thing, just that it seems like an odd juxtaposition.

I realize this is a bit at odds with my evangelizing at C4LN about merging old
and new, but really, being limited to the MARC data model with all the
flexibility of NoSQL seems kind of like having a Ferrari and then setting the
speed limiter at 50 km/h.  Fun to drive, I _suppose_.

MJ


Re: [CODE4LIB] Indexing MARC(-JSON) with MongoDB?

2010-05-13 Thread Thomas Dowling
On 05/13/2010 09:59 AM, MJ Suhonos wrote:

 
 First things first: it depends on what kind of indexing you're looking to do.
 I haven't worked with CouchDB (yet), but I have with MongoDB, and although
 it's a great (and fast) data store, it has the same basic style of indexing
 as SQL databases.  That is, you can do exact matches, some simple regexes
 (usually left-anchored) and then of course all the power of map/reduce (Mongo
 does map/reduce as well as Couch).
 

Out of curiosity, are there libraries to export records from MongoDB into
Solr?

-- 
Thomas Dowling
tdowl...@ohiolink.edu


Re: [CODE4LIB] Indexing MARC(-JSON) with MongoDB?

2010-05-13 Thread MJ Suhonos
Sorry, meant to include this link, which compares Elastic Search and Solr:

http://blog.sematext.com/2010/05/03/elastic-search-distributed-lucene/

MJ


Re: [CODE4LIB] Indexing MARC(-JSON) with MongoDB?

2010-05-13 Thread Houghton,Andrew
 From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of
 Kyle Banerjee
 Sent: Thursday, May 13, 2010 11:51 AM
 To: CODE4LIB@LISTSERV.ND.EDU
 Subject: Re: [CODE4LIB] Indexing MARC(-JSON) with MongoDB?
 
 JSON may be a great data exchange format, but it's not a markup language
 like XML, so doing things like preserving field order or just getting a
 bird's-eye view of content across multiple fields or subfields becomes
 more complex.

Huh? JSON arrays preserve element order just like XML preserves element
order.  Combining JSON labeled arrays and objects provides you with the
same mechanisms available in markup languages such as XML.
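
For example, a datafield keeps its subfield order when the subfields are
expressed as an array (these key names are just illustrative, not a proposal):

{ "tag" : "245", "ind1" : "1", "ind2" : "0",
  "subfields" : [ { "a" : "Moby Dick :" },
                  { "b" : "or, the whale /" },
                  { "c" : "Herman Melville." } ] }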


Andy.


Re: [CODE4LIB] Indexing MARC(-JSON) with MongoDB?

2010-05-13 Thread Allen Jones

Airtran???Newark not jfk.

Allen Jones
Director - Digital Library Programs
The New School Libraries


On May 13, 2010, at 2:00 PM, Jonathan Rochkind rochk...@jhu.edu wrote:

JSON and XML as structures have 'order' in exactly analogous ways. In the
case of JSON, if you want to encode order you should use an array, not a
dictionary, of course. Whether the particular software _parsing_ or
_translating_ either JSON or XML will go through it in order and preserve the
order when translating to another format... is another question. Is there
reason to think that software dealing with JSON will be more likely to do
this wrong than software dealing with XML? I don't get it.


Kyle Banerjee wrote:

Huh? JSON arrays preserve element order just like XML preserves element
order.  Combining JSON labeled arrays and objects provides you with the
same mechanisms available in markup languages such as XML.


Maybe I'm getting mixed up, but isn't it unsafe to assume that element order
will be preserved in all environments in for/foreach loops where the JSON
might be interpreted, unless you specifically iterate through elements in
order? If I'm wrong, this is a total nonissue. Otherwise, there could be
side effects.

Don't get me wrong. JSON's a better way to go in general, and I think that
too much focus on lossless preservation of the MARC record has really held
us back. Given that significant portions of the MARC record are not used for
search, retrieval, or display, and many useful elements consist of free
text, faithfully preserving each field as an object to encode elements such
as extent of item or notes strikes me as using a chain saw to cut butter.

kyle




Re: [CODE4LIB] Indexing MARC(-JSON) with MongoDB?

2010-05-13 Thread Allen Jones

Please disregard last email.

Allen Jones
Director - Digital Library Programs
The New School Libraries

