2011/12/28 Randall Leeds <[email protected]>:
> On Wed, Dec 28, 2011 at 11:11, Alexey Loshkarev <[email protected]> wrote:
>> Hello.
>>
>> I'm using CouchDB for two+ years for our company's internal projects.
>> It's a very good and reliable database, I'm almost satisfied with it.
>>
>> But, couchdb disk size makes me cry. I'll describe this.
>>
>> My new project must store and manipulate simple documents (15-20
>> integer/float/string fields, without attachments).
>> Target documents count may vary between 50M-500M. We are using SSD for
>> database now and need to count every gigabyte.
>> Currently, project data is stored in MySQL.
>> I know why mysql data is so compact - data file consists of only data, not
>> types and row names.
>> But CouchDB database disk size is very overheaded.
>>
>> Some examples:
>>
>> I have a snippet of data (900K rows). Average row length is 200 bytes.
>> Total data size (disk size) is about 190MB.
>>
>> I imported all of this data to CouchDB and realized, it occupies 800MB
>> (4x more than mysql). It was bulk insert with incrementing keys and
>> after import database was compacted.
>> I tried to reduce field names from 8-10 characters to 1-2 with almost no
>> result.
>> My data consists of strings in unicode. I realized, erlang external
>> term format takes 5 bytes for every unicode character (instead of
>> 1-... for utf-8). So I converted my unicode characters to ascii (just
>> transliterating cyrillic symbols to ascii, one unicode symbol to ascii
>> equivalent).
>> Result - almost no.
>>
>> Then I tried to calc sum of document sizes.
>> I wrote an erlang view:
>>
>> fun({Doc}) ->
>>     Emit(<<"raw">>, size(term_to_binary(Doc))),
>>     Emit(<<"compressed">>, size(term_to_binary(Doc, [{compressed, 9}])))
>> end.
>>
>> According to this,
>> raw document sum is about 725MB. So, about 10% overhead to id/rev
>> index. It's almost ok, but.. So much!
>> compressed data takes 435MB. It's much better than 725, but still
>> 2x more than mysql. I can live with 2x overhead, but 4x makes me cry.
>>
>> Which serialization format is used by couchdb storage engine?
>> If it uses term_to_binary, is it possible to enable data compression?
>> Via config-file or by http-headers.
>>
>> Also, term_to_binary seems very overheaded by itself. Any unicode
>> character is encoded with 4 bytes, when utf-8 uses only 2 bytes for
>> cyrillic chars.
>>
>> So, the questions are:
>>
>> 1. What can I do now, to use less space for my data?
>> 2. Can I add compression option to term_to_binary (if it is used by couchdb,
>> sure)?
>> 3. Possibilities to provide charset information for data, to make
>> unicode to binary conversion more efficient?
>> 4. Is there any progress in CouchDB development to change data
>> storage format to less overheaded?
>>
>>
>> Also, I just realized here
>> (http://www.erlang.org/doc/apps/erts/erl_ext_dist.html), cite:
>> ===============
>> A float is stored in string format. the format used in sprintf to
>> format the float is "%.20e" (there are more bytes allocated than
>> necessary)
>> ===============
>> So, every float requires 33 bytes of disk space. Not so efficient.
>>
>>
>> --
>> ----------------
>> Best regards
>> Alexey Loshkarev
>> mailto:[email protected]
>
> Future releases of CouchDB, starting with the 1.2 release, will allow
> for compression using google's snappy library which should greatly
> reduce the overhead you experience.

Cool!

> Also be sure to compact if the
> ratio of disk usage to dataset size starts to grow too far. An
> automatic compaction daemon is also coming.

Will wait for it!

--
----------------
Best regards
Alexey Loshkarev
mailto:[email protected]
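
For anyone who wants to check the per-character and per-float numbers from this thread, here is a minimal sketch. The module name ext_size and both function names are made up for illustration. It only measures what term_to_binary produces and says nothing about how CouchDB itself lays out documents on disk. It compares a string held as a list of code points with the same string as a UTF-8 binary, and a float in the old text encoding ({minor_version, 0}) with the 8-byte IEEE 754 encoding ({minor_version, 1}), which newer Erlang/OTP releases use by default.

```erlang
-module(ext_size).
-export([sizes/1, float_sizes/1]).

%% Size in bytes of the external term format for the same text held
%% two ways. Text must be a list of Unicode code points.
sizes(Text) when is_list(Text) ->
    %% Same characters, but as one UTF-8 encoded binary.
    Utf8 = unicode:characters_to_binary(Text),
    [{code_point_list, byte_size(term_to_binary(Text))},
     {utf8_binary, byte_size(term_to_binary(Utf8))}].

%% Size in bytes of one float in the two float encodings.
float_sizes(F) when is_float(F) ->
    %% minor_version 0: float written as text (the "%.20e" form).
    %% minor_version 1: float written as 8 bytes of IEEE 754.
    [{text_format, byte_size(term_to_binary(F, [{minor_version, 0}]))},
     {ieee_754, byte_size(term_to_binary(F, [{minor_version, 1}]))}].
```

In the shell, after c(ext_size)., the six Cyrillic code points of "Привет" give:

    ext_size:sizes([1055,1088,1080,1074,1077,1090]).
    %% [{code_point_list,37},{utf8_binary,18}]
    ext_size:float_sizes(1.5).
    %% [{text_format,33},{ieee_754,10}]

So the 5 bytes per character applies to strings kept as lists of integers above 255, while a UTF-8 binary costs 2 bytes per Cyrillic character plus a fixed header, and the 33 bytes per float applies only to the text float encoding.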
