OK no worries then, thanks for explaining. I was thinking of using the
class as part of transforming ratings to z-scores as a preprocessing
step (or a wrapping recommender), and although the accuracy issue
wouldn't come up (since there are only up to few thousand ratings per
user), the possible speed-up over millions of users might make it
worthwhile from a value point of view (value = amount of improvement /
amount of work to implement). But since it's just averaging, it's
trivial to do it myself in a different class.
A documenting note at the top of the class explaining the motivation
behind the implementation might be useful to future code explorers, if
it's not too much bother. You could probably largely just copy and
paste your post.
On 10/21/10 5:26 PM, Sean Owen wrote:
This class is used mostly in slope-one. The priorities are memory usage,
then speed, then accuracy in that context.
Because of memory concerns, I don't want to store more than an int and
double. So yes you can store a total and count. So we're left trading speed
for accuracy. Your change is a bit more accurate but slower at reads (must
divide to get the average).
For slope one, which is read-heavy, and where accuracy is not a big deal,
this isn't a worthwhile tradeoff I think.
What's the use case you have in mind?
On Thu, Oct 21, 2010 at 10:03 AM, gabeweb<[email protected]> wrote:
I'm a bit concerned by the implementation of FullRunningAverage. It's
probably not a big deal, but because it performs division each time a new
datum is added, there will be a floating point error that will increase
over
time, and if you average together millions of data points, this error may
become significant. I know that Java doubles have really huge precision,
but still.
In addition, the implementation seems rather inefficient in that it
performs
6 math operations every time a datum is added. Since adding to a running
average is likely to happen much more often than accessing the average,
wouldn't it be a lot more efficient to maintain the *total* and count
internally, rather than the current average and the count, and then when
someone requests the average, calculate the actual average on the fly by
dividing the total by the count? Then it would just be two addition
operations to add a datum. This would simultaneously eliminate the
accumulation of floating point errors.
If people agree, I'd be happy to fix this and submit a patch. It's a
simple
thing, but it's sort of bothering me :) Plus it will allow me to get
familiar with the process.
--
View this message in context:
http://lucene.472066.n3.nabble.com/FullRunningAverage-possibly-inefficient-and-very-slightly-inaccurate-tp1744425p1744425.html
Sent from the Mahout User List mailing list archive at Nabble.com.