Re: FullRunningAverage possibly inefficient and (very slightly) inaccurate?

Gabriel Webster Thu, 21 Oct 2010 23:45:48 -0700

OK no worries then, thanks for explaining. I was thinking of using theclass as part of transforming ratings to z-scores as a preprocessingstep (or a wrapping recommender), and although the accuracy issuewouldn't come up (since there are only up to few thousand ratings peruser), the possible speed-up over millions of users might make itworthwhile from a value point of view (value = amount of improvement /amount of work to implement). But since it's just averaging, it'strivial to do it myself in a different class.

A documenting note at the top of the class explaining the motivationbehind the implementation might be useful to future code explorers, ifit's not too much bother. You could probably largely just copy andpaste your post.


On 10/21/10 5:26 PM, Sean Owen wrote:

This class is used mostly in slope-one. The priorities are memory usage,
then speed, then accuracy in that context.

Because of memory concerns, I don't want to store more than an int and
double. So yes you can store a total and count. So we're left trading speed
for accuracy. Your change is a bit more accurate but slower at reads (must
divide to get the average).

For slope one, which is read-heavy, and where accuracy is not a big deal,
this isn't a worthwhile tradeoff I think.

What's the use case you have in mind?

On Thu, Oct 21, 2010 at 10:03 AM, gabeweb<[email protected]>  wrote:


I'm a bit concerned by the implementation of FullRunningAverage.  It's
probably not a big deal, but because it performs division each time a new
datum is added, there will be a floating point error that will increase
over
time, and if you average together millions of data points, this error may
become significant.  I know that Java doubles have really huge precision,
but still.

In addition, the implementation seems rather inefficient in that it
performs
6 math operations every time a datum is added.  Since adding to a running
average is likely to happen much more often than accessing the average,
wouldn't it be a lot more efficient to maintain the *total* and count
internally, rather than the current average and the count, and then when
someone requests the average, calculate the actual average on the fly by
dividing the total by the count?  Then it would just be two addition
operations to add a datum.  This would simultaneously eliminate the
accumulation of floating point errors.

If people agree, I'd be happy to fix this and submit a patch.  It's a
simple
thing, but it's sort of bothering me :)  Plus it will allow me to get
familiar with the process.
--
View this message in context:
http://lucene.472066.n3.nabble.com/FullRunningAverage-possibly-inefficient-and-very-slightly-inaccurate-tp1744425p1744425.html
Sent from the Mahout User List mailing list archive at Nabble.com.

Re: FullRunningAverage possibly inefficient and (very slightly) inaccurate?

Reply via email to