Yes, any individual scan should be able to calculate an accurate average based on the entries present at the time of the scan. You just can't pre-compute the average itself; you can, however, pre-compute the sum and count and do the division on the fly. For averaging, finishing up the calculation is trivial, but averaging is a simple example of a reducer that loses information when calculating its result: there is no function f(avg(v_0, ..., v_N), v_new) that equals avg(v_0, ..., v_N, v_new) when you don't know N. You would not want a combiner that loses information to run during the major or minor compaction scopes.
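
To make that concrete, here is a minimal sketch of the idea (not the StatsCombiner from the examples; it assumes every value, including a raw data point, is encoded as a comma-separated "sum,count" pair, with a single point written as "v,1"):

import java.nio.charset.StandardCharsets;
import java.util.Iterator;

import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.iterators.Combiner;

// Hypothetical combiner that keeps enough state (sum and count) to
// finish an average at scan time. It never emits the average itself.
public class SumCountCombiner extends Combiner {
  @Override
  public Value reduce(Key key, Iterator<Value> iter) {
    long sum = 0;
    long count = 0;
    while (iter.hasNext()) {
      // Each value may itself be a partial aggregate from an earlier
      // compaction, so merge sums and counts rather than raw points.
      String[] parts =
          new String(iter.next().get(), StandardCharsets.UTF_8).split(",");
      sum += Long.parseLong(parts[0]);
      count += Long.parseLong(parts[1]);
    }
    // Emitting sum/count here would lose the information needed to
    // merge with values inserted later; keep both pieces instead.
    return new Value((sum + "," + count).getBytes(StandardCharsets.UTF_8));
  }
}

The scan client then splits the returned value and divides sum by count; that final division is the "finishing up" step that should stay outside the iterator.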
On Fri, Jul 11, 2014 at 12:38 AM, Russ Weeks <[email protected]> wrote:
> Hi,
>
> I'd like to understand this paragraph in the Accumulo manual a little
> better:
>
> "The only restriction on an combining iterator is that the combiner
> developer should not assume that all values for a given key have been seen,
> since new mutations can be inserted at anytime. This precludes using the
> total number of values in the aggregation such as when calculating an
> average, for example."
>
> By "using the total number of values in the aggregation", I presume that
> it means inside the combiner's reduce method? Because it seems like if I'm
> using the example StatsCombiner registered on all 3 scopes, after the scan
> completes the count and the sum fields should be consistent (w.r.t each
> other, of course new mutations could have been added since the scan
> started) and if I divide the two I'll get an accurate average, right?
>
> Thanks,
> -Russ
>
