Hi Dave,

Special thanks for your suggestion on the initial bulk upload. Point [2] explains why I always had to compact immediately afterwards; following it reduced my disk space usage ten-fold.

(And the subject change is so that I and others can maybe find this advice again in the future.)

    Kevin

On 11/6/2012 2:15 AM, Dave Cottlehuber wrote:
On 5 November 2012 19:22, Kevin Burton <[email protected]> wrote:
[SNIP]

Hi Kevin,

[SNIP]
If you're initially bulk uploading data, I would do three things
differently from what you're currently doing.

1. assign UUIDs myself
This is the only enforced, unique, indexed attribute in a DB, so use it
well. Put something meaningful in it; it's basically free text ** within
reason.
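As a sketch of that idea (not from Dave's mail; the ID scheme and labels here are made up), one way to pack something readable and sortable into the document ID is a zero-padded sequence number plus a label:

```shell
# Hypothetical sketch: build self-assigned, lexically sortable _id values
# from a zero-padded sequence number plus a human-readable label.
make_id() {
  printf '%010d-%s\n' "$1" "$2"
}

make_id 1 "order-widget"    # -> 0000000001-order-widget
make_id 42 "order-gadget"   # -> 0000000042-order-gadget
```

The zero padding matters: it keeps numeric order and lexical order the same, which is what point 2 below relies on.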

2. insert them in sorted UUID order
CouchDB is a database, and sort order matters. Couch uses a B+tree **,
so if you insert in random order you spend a lot of time forcing
re-writes of intermediate nodes for no gain. As Couch is an append-only
datastore, this means several things:
- wasted space until you compact
- slower insert performance, as you make multiple writes instead of one
http://horicky.blogspot.co.at/2008/10/couchdb-implementation.html
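A minimal sketch of the pre-sort step (the filenames and IDs are invented for illustration): sort the IDs once before uploading, so the tree is filled left to right instead of being rewritten at random.

```shell
# Hypothetical sketch: sort self-assigned IDs before inserting them,
# so CouchDB's B+tree is filled in order rather than rewritten randomly.
printf '%s\n' 0000000042-b 0000000001-a 0000000007-c > ids.txt
sort ids.txt > ids.sorted.txt   # lexicographic order == insert order
head -n 1 ids.sorted.txt        # -> 0000000001-a
```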

3. try inserting the first few docs by hand with curl, and read up on
the _bulk_docs API; it is much, much faster than one request per doc.
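A hedged sketch of what that looks like (the database name "mydb", the host, and the doc contents are assumptions, not from Dave's mail): _bulk_docs takes a single JSON body with a "docs" array, so many documents land in one request.

```shell
# Hypothetical sketch: a _bulk_docs payload with self-assigned,
# pre-sorted _id values. Database name "mydb" is made up.
cat > bulk.json <<'EOF'
{"docs": [
  {"_id": "0000000001-a", "value": 1},
  {"_id": "0000000002-b", "value": 2}
]}
EOF
# One request inserts both docs (not executed here):
# curl -X POST http://127.0.0.1:5984/mydb/_bulk_docs \
#      -H 'Content-Type: application/json' -d @bulk.json
grep -c '"_id"' bulk.json   # -> 2
```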

Re your drivers: there are several, but I personally don't use any of
them. The more popular ones (based on my dodgy recollection) are listed
here: http://wiki.apache.org/couchdb/Related_Projects. Hopefully some of
the other Windows folk will pipe up.

A+
Dave

** handwavey
