On Thu, May 28, 2009 at 12:37 PM, Brian Candler <[email protected]> wrote: > On Tue, May 26, 2009 at 12:21:10PM +0100, Michael Stillwell wrote: >> On Sat, Apr 25, 2009 at 10:39 PM, Brian Candler <[email protected]> wrote: >> > Has anyone found a clear definition of the limitations of reduce functions, >> > in terms of how large the reduced data may be? >> >> In >> http://mail-archives.apache.org/mod_mbox/couchdb-user/200808.mbox/%[email protected]%3e, >> Damian says: >> >> "the size of the reduce value can be logarithmic with respect to the rows" > > Which doesn't give any guidance as to the absolute maximum size of the > reduce value, only how big it can get in relation to the number of rows, > e.g. > > max_reduce_object_size = k . ln(number of rows reduced) > > for some unknown k. Or is it ln(total size of rows reduced)? >
The deal is that if your reduce function's output is the same size as its input, the final reduce value will end up being as large as all the map rows put together. If your reduce function's output is 1/2 the size of it's input, you'll also end up with quite a large amount of data in the final reduce. In these cases each reduction stage actually accumulates more data, as it is based on ever increasing numbers of map rows. If the function reduces data fast enough, the intermediate reduction values will stay relatively constant, even as each reduce stage reflects logarithmically more map rows. This is the kind of reduce function you want. Theoretically, there are no hard limits, and theoretically, even the first kind of function (identity on values, which leads to logarithmic growth of intermediate values) could eventually complete even on a large data set. Practically, the first limit you'll hit is that all the input values for the function will not fit in the JavaScript interpreter's memory space. Even if that were not an issue, the function computation time will likely go up logarithmically; similarly there will be slowdowns in index processing as the reduction values are stored in the btree inner-nodes. Shuffling around multi-gigabyte inner nodes is not optimal and should be avoided. I hope that's clear, let me know if I can make it clearer. Chris -- Chris Anderson http://jchrisa.net http://couch.io
