This class is used mostly in slope-one. The priorities are memory usage, then speed, then accuracy in that context.
Because of memory concerns, I don't want to store more than an int and double. So yes you can store a total and count. So we're left trading speed for accuracy. Your change is a bit more accurate but slower at reads (must divide to get the average). For slope one, which is read-heavy, and where accuracy is not a big deal, this isn't a worthwhile tradeoff I think. What's the use case you have in mind? On Thu, Oct 21, 2010 at 10:03 AM, gabeweb <[email protected]> wrote: > > I'm a bit concerned by the implementation of FullRunningAverage. It's > probably not a big deal, but because it performs division each time a new > datum is added, there will be a floating point error that will increase > over > time, and if you average together millions of data points, this error may > become significant. I know that Java doubles have really huge precision, > but still. > > In addition, the implementation seems rather inefficient in that it > performs > 6 math operations every time a datum is added. Since adding to a running > average is likely to happen much more often than accessing the average, > wouldn't it be a lot more efficient to maintain the *total* and count > internally, rather than the current average and the count, and then when > someone requests the average, calculate the actual average on the fly by > dividing the total by the count? Then it would just be two addition > operations to add a datum. This would simultaneously eliminate the > accumulation of floating point errors. > > If people agree, I'd be happy to fix this and submit a patch. It's a > simple > thing, but it's sort of bothering me :) Plus it will allow me to get > familiar with the process. > -- > View this message in context: > http://lucene.472066.n3.nabble.com/FullRunningAverage-possibly-inefficient-and-very-slightly-inaccurate-tp1744425p1744425.html > Sent from the Mahout User List mailing list archive at Nabble.com. >
