I'm a bit concerned by the implementation of FullRunningAverage.  It's
probably not a big deal, but because it performs a division each time a new
datum is added, a floating-point error accumulates over time, and if you
average together millions of data points, this error may become
significant.  I know that Java doubles carry a lot of precision, but still.
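For illustration, here is a sketch of the style of update being described; this is a hypothetical class, not Mahout's actual code, but it shows how each added datum triggers a rescale-and-divide step:

```java
// Hypothetical sketch of an incremental running average that divides on
// every addDatum() call (the pattern being criticized, not Mahout's code).
public class IncrementalAverage {

  private int count = 0;
  private double average = 0.0;

  public void addDatum(double datum) {
    count++;
    // Rescale the old average and fold in the new datum: several
    // multiplications and divisions per call, each rounding slightly.
    average = average * (count - 1) / count + datum / count;
  }

  public double getAverage() {
    return average;
  }
}
```

Each `addDatum` call rounds the stored average again, which is where the error can build up over millions of additions.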

In addition, the implementation seems rather inefficient in that it performs
about six arithmetic operations every time a datum is added.  Since adding
to a running average is likely to happen much more often than reading the
average, wouldn't it be more efficient to maintain the *total* and the count
internally, rather than the current average and the count, and then compute
the average on demand by dividing the total by the count?  Adding a datum
would then take just two additions, and it would also avoid the repeated
per-datum rounding from division.
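A minimal sketch of the proposed alternative (hypothetical class name, not an actual Mahout class) might look like this:

```java
// Hypothetical sketch of the proposed change: store the running total and
// the count, and divide only when the average is actually requested.
public class TotalBasedAverage {

  private int count = 0;
  private double total = 0.0;

  public void addDatum(double datum) {
    count++;        // one addition
    total += datum; // one addition
  }

  public double getAverage() {
    return total / count; // single division, performed on demand
  }
}
```

The hot path (`addDatum`) does only two additions, and the one division happens only when someone asks for the average.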

If people agree, I'd be happy to fix this and submit a patch.  It's a simple
thing, but it's sort of bothering me :)  Plus it will allow me to get
familiar with the process.
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/FullRunningAverage-possibly-inefficient-and-very-slightly-inaccurate-tp1744425p1744425.html
Sent from the Mahout User List mailing list archive at Nabble.com.