Re: couchdb disk storage format - why so large overhead?

Alexey Loshkarev Wed, 28 Dec 2011 10:18:17 -0800

2011/12/28 Randall Leeds <[email protected]>:
> On Wed, Dec 28, 2011 at 11:11, Alexey Loshkarev <[email protected]> wrote:
>> Hello.
>>
>> I've been using CouchDB for two+ years for our company's internal
>> projects. It's a very good and reliable database; I'm almost
>> satisfied with it.
>>
>> But CouchDB's disk usage makes me cry. Let me explain.
>>
>> My new project must store and manipulate simple documents (15-20
>> integer/float/string fields, without attachments).
>> The target document count may vary between 50M and 500M. We are using
>> SSDs for the database now and need to count every gigabyte.
>> Currently, the project data is stored in MySQL.
>> I know why MySQL data is so compact: the data file consists only of
>> data, not types and column names.
>> But a CouchDB database has a lot of overhead on disk.
>>
>> Some examples:
>>
>> I have a sample of the data (900K rows). Average row length is 200
>> bytes. Total data size (disk size) is about 190MB.
>>
>> I imported all of this data into CouchDB and found that it occupies
>> 800MB (4x more than MySQL). It was a bulk insert with incrementing
>> keys, and the database was compacted after the import.
>> I tried shortening field names from 8-10 characters to 1-2, with
>> almost no result.
>> My data consists of Unicode strings. I realized that the Erlang
>> external term format takes 5 bytes for every Unicode character
>> (instead of 1-... for UTF-8). So I converted my Unicode characters to
>> ASCII (just transliterating Cyrillic symbols to ASCII, one Unicode
>> symbol to its ASCII equivalent).
>> Result: almost no change.
>>
>> Then I tried to calculate the sum of the document sizes.
>> I wrote an Erlang view:
>>
>> fun({Doc}) ->
>>    Emit(<<"raw">>, size(term_to_binary(Doc))),
>>    Emit(<<"compressed">>, size(term_to_binary(Doc, [{compressed, 9}])))
>> end.
>>
>> According to this, the raw documents sum to about 725MB. So the
>> id/rev index adds about 10% overhead. That's almost OK, but.. so much!
>> Compressed data takes 435MB. That's much better than 725, but still
>> 2x more than MySQL. I can live with 2x overhead, but 4x makes me cry.
>>
>> Which serialization format is used by the CouchDB storage engine?
>> If it uses term_to_binary, is it possible to enable data compression,
>> via a config file or HTTP headers?
>>
>> Also, term_to_binary seems to carry a lot of overhead by itself. Any
>> Unicode character is encoded with 4 bytes, when UTF-8 uses only 2
>> bytes for Cyrillic chars.
>>
>> So, the questions are:
>>
>> 1. What can I do now to use less space for my data?
>> 2. Can I add a compression option to term_to_binary (if CouchDB uses
>> it, of course)?
>> 3. Is it possible to provide charset information for the data, to make
>> Unicode-to-binary conversion more efficient?
>> 4. Is there any progress in CouchDB development toward a data
>> storage format with less overhead?
>>
>>
>> Also, I just noticed this in the Erlang docs
>> (http://www.erlang.org/doc/apps/erts/erl_ext_dist.html), quote:
>> ===============
>> A float is stored in string format. The format used in sprintf to
>> format the float is "%.20e" (there are more bytes allocated than
>> necessary)
>> ===============
>> So every float requires 33 bytes of disk space. Not so efficient.
>>
>>
>> --
>> ----------------
>> Best regards
>> Alexey Loshkarev
>> mailto:[email protected]
>
> Future releases of CouchDB, starting with the 1.2 release, will allow
> for compression using Google's Snappy library, which should greatly
> reduce the overhead you experience.


Cool!


> Also be sure to compact if the
> ratio of disk usage to dataset size starts to grow too far. An
> automatic compaction daemon is also coming.

Will wait for it!
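
For anyone who wants to check the per-character and per-float figures
quoted above, here is a minimal sketch. It assumes the string is held as
a list of code points, and shows the same text as a UTF-8 binary for
comparison. The module and function names (term_size, string_sizes,
float_sizes) are made up for illustration and are not part of CouchDB.
It only measures what term_to_binary produces for a standalone term, not
what CouchDB actually writes to disk.

```erlang
-module(term_size).
-export([string_sizes/1, float_sizes/1]).

%% For a string given as a list of Unicode code points, return
%% {AsList, AsUtf8Binary}: the external-term size in bytes of the list
%% itself and of the same text converted to a UTF-8 binary.
string_sizes(CodePoints) ->
    Utf8 = unicode:characters_to_binary(CodePoints),
    {byte_size(term_to_binary(CodePoints)),
     byte_size(term_to_binary(Utf8))}.

%% For a float, return {TextFormat, BinaryFormat}: the size in bytes
%% with the "%.20e" string encoding (minor_version 0) and with the
%% 8-byte IEEE 754 encoding (minor_version 1).
float_sizes(F) when is_float(F) ->
    {byte_size(term_to_binary(F, [{minor_version, 0}])),
     byte_size(term_to_binary(F, [{minor_version, 1}]))}.
```

With six Cyrillic code points, term_size:string_sizes([1055, 1088,
1080, 1074, 1077, 1090]) should come out as {37, 18}: 5 bytes per
character in the list form against 2 bytes per character in the binary
form, plus a few header bytes each. term_size:float_sizes(1.5) should
come out as {33, 10}, which matches the 33 bytes mentioned above for the
string format.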



-- 
----------------
Best regards
Alexey Loshkarev
mailto:[email protected]
