Hadoop has something like this: http://hadoop.apache.org/mapreduce/docs/r0.21.0/api/org/apache/hadoop/mapreduce/lib/aggregate/package-summary.html

I find there's a very strong and unfortunate tension between reusability and performance in some cases. Having a discrete stage to compute something like this is good for reuse, but if it can be computed inline in a prior stage and output on the side, that's a big performance saving.

I also find myself tempted to construct a bunch of M/R primitives. For now I am trying to restrict my thinking to refactoring pieces that can come out easily, and that are already used in at least one place.

I suppose I mean: if you want to write primitive X and can't find one good use for it yet in Mahout, I'd hold off, but otherwise I would surely add it and use it.

On Fri, May 6, 2011 at 2:49 PM, Grant Ingersoll <[email protected]> wrote:
> MAHOUT-688 has a M/R job to calculate std. deviation for document frequencies
> so that it can prune noisy words. I'm thinking of making it a bit more
> generic and adding a stats package to org.apache.mahout.math.hadoop that
> contains this and other basic stats calculations (mean, variance, sum of
> squares, etc.) that operate in M/R.
>
> Is that useful or am I re-inventing the wheel here or wasting time? Seems
> like such a beast should already exist, but a quick search didn't turn up
> much.
>
> -Grant
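
To make the "compute it inline and output on the side" idea concrete, here is a rough sketch of why the basic stats in the quoted message (mean, variance, sum of squares) suit M/R: each can be derived from three partial sums (count, sum, sum of squares) that merge by simple addition, so any stage could accumulate them as a side output and a later step could combine them. This is plain Java with no Hadoop dependency. `PartialStats` and its methods are hypothetical names for illustration only; they are not part of Mahout, MAHOUT-688, or the Hadoop aggregate package.

```java
// Hypothetical sketch: PartialStats is NOT a Mahout or Hadoop class.
// It holds the three sums a mapper or combiner could emit per key.
final class PartialStats {
    final long count;
    final double sum;
    final double sumOfSquares;

    PartialStats(long count, double sum, double sumOfSquares) {
        this.count = count;
        this.sum = sum;
        this.sumOfSquares = sumOfSquares;
    }

    // Accumulate the partial sums for one split of the data.
    static PartialStats of(double... values) {
        double s = 0.0;
        double sq = 0.0;
        for (double v : values) {
            s += v;
            sq += v * v;
        }
        return new PartialStats(values.length, s, sq);
    }

    // Merging is associative and commutative, so it is safe to apply
    // in a combiner as well as in the reducer.
    PartialStats merge(PartialStats other) {
        return new PartialStats(count + other.count,
                                sum + other.sum,
                                sumOfSquares + other.sumOfSquares);
    }

    double mean() {
        return sum / count;
    }

    // Population variance: E[x^2] - (E[x])^2, clamped at zero to guard
    // against small negative values caused by rounding.
    double variance() {
        double m = mean();
        return Math.max(0.0, sumOfSquares / count - m * m);
    }

    double stdDev() {
        return Math.sqrt(variance());
    }

    public static void main(String[] args) {
        PartialStats a = PartialStats.of(2, 4, 4, 4); // e.g. one mapper's split
        PartialStats b = PartialStats.of(5, 5, 7, 9); // e.g. another split
        PartialStats all = a.merge(b);
        // prints "5.0 4.0 2.0"
        System.out.println(all.mean() + " " + all.variance() + " " + all.stdDev());
    }
}
```

One caveat on the design: the sum-of-squares formula is simple to merge but can lose precision when the mean is large relative to the spread, so a real implementation might prefer a more numerically stable merge.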
