MAHOUT-688 has an M/R job that calculates the standard deviation of document frequencies so that it can prune noisy words. I'm thinking of making it more generic by adding a stats package under org.apache.mahout.math.hadoop that contains this and other basic stats calculations (mean, variance, sum of squares, etc.) that run in M/R.
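For illustration only, here is a rough sketch of the kind of building block such a package might rest on. It is not the MAHOUT-688 code and not an existing Mahout class: the name PartialStats and all of its methods are hypothetical, and the Hadoop plumbing (Writable, Mapper, Reducer) is left out so the idea stands alone. Each mapper or combiner would keep a count, a sum, and a sum of squares; the reducer merges the partials by addition and derives mean, variance, and standard deviation.

```java
// Hypothetical sketch, not existing Mahout code: partial statistics that a
// mapper/combiner could accumulate and a reducer could merge.
final class PartialStats {
    private long count;
    private double sum;
    private double sumOfSquares;

    // Map side: fold one observation into the running totals.
    void add(double x) {
        count++;
        sum += x;
        sumOfSquares += x * x;
    }

    // Combine/reduce side: merging is plain addition, which is associative
    // and commutative, so partials can be combined in any order.
    void merge(PartialStats other) {
        count += other.count;
        sum += other.sum;
        sumOfSquares += other.sumOfSquares;
    }

    long count() { return count; }
    double sum() { return sum; }
    double sumOfSquares() { return sumOfSquares; }

    // Mean of all observations seen so far; NaN if there are none.
    double mean() {
        return count == 0 ? Double.NaN : sum / count;
    }

    // Population variance as E[x^2] - (E[x])^2, clamped at zero to absorb
    // small negative values caused by floating-point rounding.
    double variance() {
        if (count == 0) {
            return Double.NaN;
        }
        double m = mean();
        return Math.max(0.0, sumOfSquares / count - m * m);
    }

    double stdDev() {
        return Math.sqrt(variance());
    }
}
```

The sum-of-squares form is the simplest thing that merges cleanly, but it can lose precision when the mean is large relative to the spread; a pairwise-merge variant of Welford's algorithm would be the more robust choice if that matters.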
Is that useful, or am I reinventing the wheel and wasting time? It seems like such a beast should already exist, but a quick search didn't turn up much. -Grant
