Hi,
We've done things at this scale with CouchDB. The key thing is to do bulk inserts, and to trigger view indexing as you go. For instance, our code by default will bulk insert 5000 records, hit a view, do the next 5000, hit the view again, and so on. The batch size is something you'd want to tune, of course, since it depends on your documents and views. It's much quicker to build the view index incrementally than to index all N million records at once. You might also want to trigger view and database compaction occasionally, especially if you're also doing bulk deletes.
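A rough sketch of that loop in Erlang, just to show the shape rather than our actual code: it assumes a local CouchDB on 5984, a database called "articles" and a view at _design/dates/_view/by_date (all placeholder names to swap for your own), and it uses inets/httpc with mochijson2 for the JSON encoding, so the docs are expected in mochijson2's {struct, Props} form.

-module(batch_load).
-export([load/2]).

-define(DB,   "http://127.0.0.1:5984/articles").
-define(VIEW, ?DB "/_design/dates/_view/by_date?limit=0").

%% Post Docs in batches via _bulk_docs, hitting the view after each batch
%% so the index is brought up to date incrementally.
load(Docs, BatchSize) ->
    inets:start(),                     %% no-op if inets is already running
    load_batches(Docs, BatchSize).

load_batches([], _BatchSize) ->
    ok;
load_batches(Docs, BatchSize) ->
    {Batch, Rest} = split(BatchSize, Docs),
    Body = iolist_to_binary(mochijson2:encode({struct, [{<<"docs">>, Batch}]})),
    {ok, _} = httpc:request(post, {?DB "/_bulk_docs", [],
                                   "application/json", Body}, [], []),
    %% querying the view (limit=0) makes CouchDB index just this batch
    {ok, _} = httpc:request(get, {?VIEW, []}, [], []),
    load_batches(Rest, BatchSize).

split(N, List) when length(List) =< N -> {List, []};
split(N, List) -> lists:split(N, List).

If I remember right, compaction can be triggered over HTTP in the same way (POST /dbname/_compact for the database and POST /dbname/_compact/designdocname for the view group).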
Cheers
Simon
On 26 Jul 2010, at 18:00, Norman Barker wrote:
Hi,
I have sampled the wikipedia tsv collection from freebase
(http://wiki.freebase.com/wiki/WEX/Documentation#articles): I ran it through awk to drop the XML field and then did a simple conversion to JSON. I then call _bulk_docs with 150 docs at a time against CouchDB 0.11.
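(For reference, the _bulk_docs body I post is just a JSON object with a "docs" array; the field names below are only an illustration of what my converted records look like:)

{"docs": [
  {"_id": "Some_Article", "date": "2010-07-26"},
  {"_id": "Another_Article", "date": "2010-07-25"}
]}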
I wrote a simple view in Erlang that emits the date as a key (I am actually using this to test free-text search with couchdb-clucene); the views are fast once computed.
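Roughly, the map function is along these lines (a simplified sketch rather than my exact code; it assumes "language": "erlang" in the design document, the native Erlang query server enabled in the config, and a "date" field on each document, which is a stand-in for my real field name):

fun({Doc}) ->
    %% emit the document's date as the key; no value is needed here
    case proplists:get_value(<<"date">>, Doc) of
        undefined -> ok;
        Date     -> Emit(Date, null)
    end
end.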
The amount of disk storage used by CouchDB is an issue, and the write times are slow. I changed my view, and the view computation over 2.3 million documents is still running!
"request_time": {
"description": "length of a request inside CouchDB without
MochiWeb",
"current": 2253451.122,
"sum": 2253451.122,
"mean": 501.212,
"stddev": 12275.385,
"min": 0.5,
"max": 798124.0
},
For my use case, once the system is up there are only a few updates per hour, but doing the initial harvest takes a long time.
Does 1.0 make substantial gains on this, and if so, how? Are there any other areas I should be looking at to improve this? I am happy writing Erlang code.
thanks,
Norman