I'm a bit concerned by the implementation of FullRunningAverage. It's probably not a big deal, but because it performs a division each time a new datum is added, floating-point error will accumulate over time, and if you average together millions of data points, that error may become significant. I know Java doubles have a lot of precision, but still.
In addition, the implementation seems rather inefficient in that it performs six math operations every time a datum is added. Since adding to a running average is likely to happen much more often than reading it, wouldn't it be more efficient to maintain the *total* and the count internally, rather than the current average and the count, and compute the average on the fly by dividing the total by the count when someone requests it? Adding a datum would then take just two addition operations, and the per-datum division error would no longer accumulate.

If people agree, I'd be happy to fix this and submit a patch. It's a simple thing, but it's sort of bothering me :) Plus it will let me get familiar with the process.

--
View this message in context: http://lucene.472066.n3.nabble.com/FullRunningAverage-possibly-inefficient-and-very-slightly-inaccurate-tp1744425p1744425.html
Sent from the Mahout User List mailing list archive at Nabble.com.
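For concreteness, here is a minimal sketch of what I have in mind. The class and method names below are hypothetical and only loosely mirror Mahout's running-average idea; this is not the project's actual API, just an illustration of keeping the total and count and dividing lazily:

```java
// Sketch of the proposed change: maintain the running *total* and count,
// and divide only when the average is actually requested.
// Names here (SumBasedRunningAverage, addDatum, getAverage) are
// illustrative, not Mahout's real interface.
public class SumBasedRunningAverage {

  private int count;
  private double total;

  // Two additions per datum; no division on the hot path.
  public void addDatum(double datum) {
    count++;
    total += datum;
  }

  // The single division happens here, on demand.
  public double getAverage() {
    return total / count; // NaN if no data has been added yet
  }

  public int getCount() {
    return count;
  }

  public static void main(String[] args) {
    SumBasedRunningAverage avg = new SumBasedRunningAverage();
    avg.addDatum(1.0);
    avg.addDatum(2.0);
    avg.addDatum(3.0);
    System.out.println(avg.getAverage()); // prints 2.0
  }
}
```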
