Hi,
I sampled the Wikipedia TSV collection from Freebase
(http://wiki.freebase.com/wiki/WEX/Documentation#articles), ran it
through awk to drop the XML field, then did a simple conversion to
JSON. I then called _bulk_docs 150 docs at a time into CouchDB 0.11.
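In case it helps, the load pipeline is roughly the following sketch (the database name `wex`, the local server URL, and the two-column awk program are illustrative assumptions, not my exact script — the real WEX rows have many more fields):

```shell
#!/bin/sh
# Turn a tiny two-column TSV sample (id, title) into a _bulk_docs
# payload of the shape {"docs":[{...},{...}]}.
printf 'a1\tAnarchism\na2\tAlbedo\n' |
awk -F'\t' '
  BEGIN { print "{\"docs\":[" }
  { printf "%s{\"_id\":\"%s\",\"title\":\"%s\"}\n", (NR > 1 ? "," : ""), $1, $2 }
  END { print "]}" }
' > batch.json

cat batch.json
# Then POST each batch of ~150 docs (server URL is an assumption):
# curl -X POST http://localhost:5984/wex/_bulk_docs \
#      -H 'Content-Type: application/json' -d @batch.json
```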
I wrote a simple view in Erlang that emits the date as a key (I am
actually using this to test free text search with couchdb-clucene);
the views are fast once computed.
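For reference, the view is roughly the design document below (the design doc name, view name, and `date` field are from memory and may differ from my actual code). Native Erlang views also need the Erlang query server enabled in local.ini, e.g. `[native_query_servers]` / `erlang = {couch_native_process, start_link, []}`:

```
{
  "_id": "_design/dates",
  "language": "erlang",
  "views": {
    "by_date": {
      "map": "fun({Doc}) -> case proplists:get_value(<<\"date\">>, Doc) of undefined -> ok; Date -> Emit(Date, null) end end."
    }
  }
}
```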
The amount of disk storage used by CouchDB is an issue, and the write
times are slow. I changed my view, and the recomputation over 2.3
million docs is still running!
"request_time": {
    "description": "length of a request inside CouchDB without MochiWeb",
    "current": 2253451.122,
    "sum": 2253451.122,
    "mean": 501.212,
    "stddev": 12275.385,
    "min": 0.5,
    "max": 798124.0
},
For my use case there are only a few updates per hour once the system
is up, but the initial harvest takes a long time.
Does 1.0 make substantial gains on this, and if so how? Are there any
other areas I should be looking at to improve this? I am happy
writing Erlang code.
thanks,
Norman