Hi,
We've done things at this scale with CouchDB. The key thing is to do bulk inserts, and to trigger view indexing as you go. For instance, our code by default will bulk insert 5000 records, hit a view, do the next 5000, hit the view again, and so on. The batch size is something you'd want to tune, of course, since it depends on your documents and views. It's much quicker to build the view index incrementally than to index all N million records at once. You might also want to trigger view and database compaction occasionally, especially if you're also doing bulk deletes.
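A rough sketch of that loop in Erlang, just to show the shape rather than our actual code: it assumes a local CouchDB on 5984, a database called "articles" and a view at _design/dates/_view/by_date (all placeholder names to swap for your own), and it uses inets/httpc with mochijson2 for the JSON encoding, so the docs are expected in mochijson2's {struct, Props} form.

-module(batch_load).
-export([load/2]).

-define(DB,   "http://127.0.0.1:5984/articles").
-define(VIEW, ?DB "/_design/dates/_view/by_date?limit=0").

%% Post Docs in batches via _bulk_docs, hitting the view after each batch
%% so the index is brought up to date incrementally.
load(Docs, BatchSize) ->
    inets:start(),                     %% no-op if inets is already running
    load_batches(Docs, BatchSize).

load_batches([], _BatchSize) ->
    ok;
load_batches(Docs, BatchSize) ->
    {Batch, Rest} = split(BatchSize, Docs),
    Body = iolist_to_binary(mochijson2:encode({struct, [{<<"docs">>, Batch}]})),
    {ok, _} = httpc:request(post, {?DB "/_bulk_docs", [],
                                   "application/json", Body}, [], []),
    %% querying the view (limit=0) makes CouchDB index just this batch
    {ok, _} = httpc:request(get, {?VIEW, []}, [], []),
    load_batches(Rest, BatchSize).

split(N, List) when length(List) =< N -> {List, []};
split(N, List) -> lists:split(N, List).

If I remember right, compaction can be triggered over HTTP in the same way (POST /dbname/_compact for the database and POST /dbname/_compact/designdocname for the view group).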
Cheers
Simon
On 26 Jul 2010, at 18:00, Norman Barker wrote:
Hi,
I have sampled the wikipedia tsv collection from freebase
(http://wiki.freebase.com/wiki/WEX/Documentation#articles): I ran it through awk to drop the XML field and then did a simple conversion to JSON. I then call _bulk_docs with 150 docs at a time against CouchDB 0.11.
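(For reference, the _bulk_docs body I post is just a JSON object with a "docs" array; the field names below are only an illustration of what my converted records look like:)

{"docs": [
  {"_id": "Some_Article", "date": "2010-07-26"},
  {"_id": "Another_Article", "date": "2010-07-25"}
]}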
I wrote a simple view in Erlang that emits the date as a key (I am actually using this to test free-text search with couchdb-clucene); the views are fast once computed.
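Roughly, the map function is along these lines (a simplified sketch rather than my exact code; it assumes "language": "erlang" in the design document, the native Erlang query server enabled in the config, and a "date" field on each document, which is a stand-in for my real field name):

fun({Doc}) ->
    %% emit the document's date as the key; no value is needed here
    case proplists:get_value(<<"date">>, Doc) of
        undefined -> ok;
        Date     -> Emit(Date, null)
    end
end.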
The amount of disk storage used by CouchDB is an issue, and the write times are slow. I changed my view, and the view computation over 2.3 million documents is still running!
"request_time": {
"description": "length of a request inside CouchDB without
MochiWeb",
"current": 2253451.122,
"sum": 2253451.122,
"mean": 501.212,
"stddev": 12275.385,
"min": 0.5,
"max": 798124.0
},
For my use case, once the system is up there are only a few updates per hour, but doing the initial harvest takes a long time.
Does 1.0 make substantial gains on this, and if so, how? Are there any other areas I should be looking at to improve this? I am happy writing Erlang code.
thanks,
Norman