On Jul 26, 2010, at 10:41 AM, Simon Metson wrote:

> Hi,
> We've done things at this scale with CouchDB. The key thing is to do
> bulk inserts, and to trigger view indexing as you go. For instance our code
> by default will bulk insert 5000 records, then hit a view, then do the next
> 5000, then hit the view, etc. Of course the batch size is something you'd want
> to tune, since it'll depend on your documents and views. It's much quicker to
> do the view index incrementally than hit all N million records at once. You
> might also want to hit view and db compaction occasionally, especially if
> you're also doing bulk deletes.
> Cheers
> Simon
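For anyone following along, here is a rough sketch of that load pattern in Erlang, using the stock inets httpc client. The database URL, the design document and view names, and the 5000-document batch size are placeholders to tune, and the documents are assumed to be pre-encoded JSON binaries; this is only an illustration of the batch-then-touch-the-view loop Simon describes, not code from his system.

%% Sketch: POST a batch to _bulk_docs, then read the view with limit=0 so the
%% indexer only has to catch up on that batch. DbUrl is something like
%% "http://127.0.0.1:5984/wikipedia" (placeholder).
-module(bulk_load).
-export([load/2]).

-define(BATCH_SIZE, 5000).
-define(VIEW_PATH, "/_design/articles/_view/by_date?limit=0").

%% Docs is a list of already-encoded JSON document binaries.
load(DbUrl, Docs) ->
    inets:start(),
    load_batches(DbUrl, Docs).

load_batches(_DbUrl, []) ->
    ok;
load_batches(DbUrl, Docs) ->
    {Batch, Rest} =
        case length(Docs) > ?BATCH_SIZE of
            true  -> lists:split(?BATCH_SIZE, Docs);
            false -> {Docs, []}
        end,
    %% Build the {"docs": [...]} body by hand to avoid a JSON-library dependency.
    Body = iolist_to_binary([<<"{\"docs\":[">>, join(Batch, <<",">>), <<"]}">>]),
    {ok, _} = httpc:request(post, {DbUrl ++ "/_bulk_docs", [],
                                   "application/json", Body}, [], []),
    %% Touch the view so indexing happens incrementally, batch by batch.
    {ok, _} = httpc:request(get, {DbUrl ++ ?VIEW_PATH, []}, [], []),
    %% Occasionally POSTing to DbUrl ++ "/_compact" (and to
    %% DbUrl ++ "/_compact/articles" for the view group) keeps the files
    %% small, per Simon's note about compaction.
    load_batches(DbUrl, Rest).

join([], _Sep) -> [];
join([D], _Sep) -> [D];
join([D | Rest], Sep) -> [D, Sep | join(Rest, Sep)].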
Also, 1.0 should be significantly faster for your use case.

Chris

> On 26 Jul 2010, at 18:00, Norman Barker wrote:
>
>> Hi,
>>
>> I have sampled the wikipedia tsv collection from freebase
>> (http://wiki.freebase.com/wiki/WEX/Documentation#articles); I ran this
>> through awk to drop the xml field and then did a simple conversion to
>> JSON. I then call _bulk_docs 150 docs at a time into couch 0.11.
>>
>> I wrote a simple view in erlang that emits the date as a key (I am
>> actually using this to test the free text search couchdb-clucene); the
>> views are fast once computed.
>>
>> The amount of disk storage used by couchdb is an issue, and the write
>> times are slow; I changed my view and my 2.3 million document view
>> computation is still running!
>>
>> "request_time": {
>>     "description": "length of a request inside CouchDB without MochiWeb",
>>     "current": 2253451.122,
>>     "sum": 2253451.122,
>>     "mean": 501.212,
>>     "stddev": 12275.385,
>>     "min": 0.5,
>>     "max": 798124.0
>> },
>>
>> For my use case, once the system is up there are only a few updates per
>> hour, but doing the initial harvest takes a long time.
>>
>> Does 1.0 make substantial gains on this, and if so, how? Are there any
>> other areas I should be looking at to improve this? I am happy writing
>> erlang code.
>>
>> thanks,
>>
>> Norman
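For reference, a native Erlang map function that emits the date as a key, as Norman describes, might look like the sketch below; the "date" field name is a guess at his document schema, and the Erlang query server has to be enabled in local.ini before CouchDB will accept it.

%% Minimal sketch of a map function emitting the article date as the key.
%% Enable native Erlang views first:
%%   [native_query_servers]
%%   erlang = {couch_native_process, start_link, []}
fun({Doc}) ->
    case proplists:get_value(<<"date">>, Doc) of
        undefined -> ok;
        Date      -> Emit(Date, null)
    end
end.

Since the index is only brought up to date when the view is read, interleaving view reads with the bulk inserts (as in the loop above) spreads that 2.3-million-document build across the load instead of paying for it all in one long request, which is what the 798124 ms max in request_time suggests is happening now.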
