On Apr 20, 2010, at 3:51 PM, Chris Stockton wrote:

> Hello,
>
> On Tue, Apr 20, 2010 at 10:58 AM, Adam Kocoloski <[email protected]> wrote:
>> Hi Chris, for the type of access pattern in your benchmark I generally
>> recommend using emit(doc.model, doc) and avoiding include_docs=true.
>> include_docs introduces an extra lookup back in the DB for every row of
>> your view. If you emit the document into the view index, the index will
>> get large, but streaming requests such as yours can be accomplished with
>> a minimum of disk IO.
>
> We have tried this approach and it was indeed faster; however, we wound
> up with what I remember to be over a 19 GB view file. For a 300 MB
> database this trade-off did not seem reasonable; although disk is cheap
> in many cases, we found the bloat to be unacceptable. Do you know of a
> way to limit the size of the view when including the doc? Additionally,
> may I ask if include_docs=true has potential room for optimization?
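For readers following along, the two indexing strategies quoted above can be sketched as CouchDB map functions. This is a minimal illustration only; the function names, the stub emit(), and the sample document are invented for the sketch, not taken from the thread.

```javascript
// Minimal stub of the emit() that CouchDB normally provides to map
// functions, so this sketch can run outside the server.
const rows = [];
function emit(key, value) { rows.push({ key: key, value: value }); }

// Strategy A: emit the whole document as the value. Queries on
// doc.model stream straight out of the view index with minimal disk IO,
// but the index roughly duplicates the database on disk.
function mapEmbeddedDoc(doc) {
  emit(doc.model, doc);
}

// Strategy B: emit a null value and query with ?include_docs=true.
// The index stays small, but CouchDB does an extra lookup back into the
// database file for every row it returns.
function mapSlim(doc) {
  emit(doc.model, null);
}

// Exercise both variants against one sample document.
const doc = { _id: "a1", model: "widget", payload: "..." };
mapEmbeddedDoc(doc);
mapSlim(doc);
```

In practice only one of these would live in a design document; the trade-off is index size versus an extra b-tree lookup per row at query time.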
You should make sure to compact the view index; it doesn't take too long and can offer some huge space savings (as well as better query-time performance). You should expect the view index to be comparable in size to the DB in that case.

include_docs=true uses the same code path as a single-document GET request. I'm not aware of any extreme hot spots there. Of course, we're always keeping an eye on performance and looking out for optimizations, especially as CouchDB stabilizes and heads towards 1.0.

>> On the other hand, your sar report shows negligible iowait, so perhaps
>> that's not your immediate problem. It may be the case that you're
>> CPU-limited in the (pure Erlang) JSON encoder, although I would've
>> expected JSON encoding CPU usage to scale with network traffic.
>
> It would surprise me if 13 MB of JSON encoding could cause such spikes
> in CPU. I also expected network traffic to scale with our CPU usage.
> Have you seen issues in this area before? At first thought I would
> think of the encoding stage as being one of the lighter areas in the
> request, given the simple nature of JSON.

A thought -- do your documents have a very large number of edits? I _have_ seen heavy CPU utilization when dealing with documents containing 100+ revisions. Even after those revisions have been compacted away, the revision tree hangs around and is processed for every single-document (and include_docs) request. I've profiled couch_key_tree as a significant bottleneck in that case.

If you do have a large number of revisions and you don't worry too much about spurious conflicts on replication, you can lower the _revs_limit setting for your DB to trim that history down a bit. The default value is 1000 revisions.

>> You might try running eprof while you do this test. It's quite
>> heavyweight and will slow your system down.
>> If you start couchdb with the -i flag you can get an Erlang shell and
>> execute
>> <snip>
>
> This was good information and I will look into profiling with erlang.
> May I ask if any effort is currently being put into performance and
> optimization for couchdb?

Yes, all the time, particularly as the codebase stabilizes. I've submitted performance-related patches for DB compaction and view key collation in the past week, for instance.

> I am also very interested in any reads on large-scale couchdb
> deployments that are not so high-level (i.e., hardware specs, use
> cases, etc.).

Ah, I'm not aware of many low-level case studies like that. We should get around to writing up some of our accumulated experience at cloudant.com.

Cheers, Adam

>
> Kind Regards,
>
> -Chris
