On Mar 15, 2012, at 7:55 PM, Jason Smith wrote:

> On Thu, Mar 15, 2012 at 10:14 PM, Daniel Gonzalez <[email protected]>
> wrote:
>> Hi Matthieu,
>>
>> This really seems to help. I am now using a base62-encoded, monotonically
>> increasing integer, which means my doc_id goes from "0" onwards, using the
>> alphabet:
>>
>> ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789abcdefghijklmnopqrstuvwxyz
>>
>> I am now getting 3000 docs/s, more or less stable, and the size of my
>> documents has decreased from 3 KB to 0.4 KB.
>> I am not sure whether these metrics will worsen when the database grows, but
>> my feeling is that the situation has improved a lot just by changing the
>> doc_id.
>
> Hi, Daniel. That's great news! Also, I have an update from a CouchDB 1.2.0
> test.
>
> I have a database here with 10 million documents, most several KB of
> English text. The upgrade to version 1.2 changed the database size from
> 38 GB to 9.2 GB, or now 0.94 KB per document.
>
> So you should see an even greater improvement when 1.2.0 comes out
> Real Soon Now.
>
>> I have one more question. Is the alphabet I have shown above "ordered" for
>> couchdb?
>
> The sort order may not be quite what you expect, especially if you
> work with Unix or servers a lot.
>
> It is described here:
> http://wiki.apache.org/couchdb/View_collation#Collation_Specification
>
> Basically CouchDB follows (uses!) ICU. The major point is that
> different letter sequences are compared case-insensitively, but
> same-letter strings are case-sensitive (lower case first). To me, it
> more or less follows how an English dictionary would do it.
>
> --
> Iris Couch

If memory serves, the database's by_id tree uses Erlang term sorting for
collation instead of ICU. ICU is, of course, the default collation option
for MR views.

Regards,
Adam
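
To make the ordering question concrete, here is a minimal Python sketch. It assumes the recollection above is right and that document IDs in the by_id tree are compared byte by byte, which is how Erlang orders binaries. Under that assumption, the alphabet quoted in the thread (uppercase, then digits, then lowercase) would not keep IDs in insertion order, whereas an ASCII-ordered alphabet (digits, uppercase, lowercase) with fixed-width padding would. The names `encode_base62`, `sorts_bytewise`, and the `width` parameter are hypothetical helpers for illustration, not part of CouchDB.

```python
import string

# ASCII order: digits (0x30..), uppercase (0x41..), lowercase (0x61..).
ASCII_ORDERED = string.digits + string.ascii_uppercase + string.ascii_lowercase
# The alphabet quoted in the thread: uppercase, digits, lowercase.
THREAD_ALPHABET = string.ascii_uppercase + string.digits + string.ascii_lowercase


def encode_base62(n, alphabet=ASCII_ORDERED, width=8):
    """Encode a non-negative integer as a fixed-width base62 string."""
    if n < 0:
        raise ValueError("n must be non-negative")
    digits = []
    while n:
        n, rem = divmod(n, 62)
        digits.append(alphabet[rem])
    encoded = "".join(reversed(digits))
    if len(encoded) > width:
        raise ValueError("value does not fit in the requested width")
    # Left-pad with the alphabet's zero symbol so all IDs have equal length.
    return encoded.rjust(width, alphabet[0])


def sorts_bytewise(ids):
    """Return True if the IDs are already in ascending byte order."""
    raw = [doc_id.encode("ascii") for doc_id in ids]
    return raw == sorted(raw)


if __name__ == "__main__":
    counter = range(200)
    print(sorts_bytewise([encode_base62(i) for i in counter]))
    print(sorts_bytewise([encode_base62(i, THREAD_ALPHABET) for i in counter]))
```

The padding matters as much as the alphabet: without a fixed width, "10" would sort before "9" in a bytewise comparison even though it encodes a larger number. Whether ICU-collated views order these IDs the same way is a separate question, since ICU does not compare raw bytes.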
