Is this Welford's method?
http://www.johndcook.com/blog/2008/09/26/comparing-three-methods-of-computing-standard-deviation/
http://www.johndcook.com/standard_deviation.html

In the second one, you can ditch the standard deviation part. As you can see,
this has more computation than FullRunningAverage. It will be hard to make a
benchmark that elicits performance differences.

On Fri, Oct 22, 2010 at 10:55 AM, Ted Dunning <[email protected]> wrote:

> If you care about accuracy, then switching to Welford's method (which is
> different from keeping a total) would help as much or more. It would also
> avoid the conditional for the case of no prior data.
>
> The speedup is, in any case, not likely to exist because this code is
> probably memory bound anyway. This is typical for modern architectures,
> where floating point math is so much faster than memory access that you
> can only saturate the ALU with carefully designed codes. Virtually no 1-d
> vector-oriented code can do this because there are as many memory ops as
> flops; it takes a matrix operation to do it, because those have n^2
> memory ops and n^3 flops.
>
> On Thu, Oct 21, 2010 at 11:45 PM, Gabriel Webster
> <[email protected]> wrote:
>
>> OK, no worries then, thanks for explaining. I was thinking of using the
>> class as part of transforming ratings to z-scores as a preprocessing step
>> (or a wrapping recommender), and although the accuracy issue wouldn't
>> come up (since there are only up to a few thousand ratings per user), the
>> possible speed-up over millions of users might make it worthwhile from a
>> value point of view (value = amount of improvement / amount of work to
>> implement). But since it's just averaging, it's trivial to do it myself
>> in a different class.
>>
>> A documenting note at the top of the class explaining the motivation
>> behind the implementation might be useful to future code explorers, if
>> it's not too much bother. You could probably largely just copy and paste
>> your post.
>>
>> On 10/21/10 5:26 PM, Sean Owen wrote:
>>
>>> This class is used mostly in slope-one.
>>> The priorities are memory usage, then speed, then accuracy in that
>>> context.
>>>
>>> Because of memory concerns, I don't want to store more than an int and
>>> double. So yes, you can store a total and count. So we're left trading
>>> speed for accuracy. Your change is a bit more accurate but slower at
>>> reads (must divide to get the average).
>>>
>>> For slope-one, which is read-heavy, and where accuracy is not a big
>>> deal, this isn't a worthwhile tradeoff, I think.
>>>
>>> What's the use case you have in mind?
>>>
>>> On Thu, Oct 21, 2010 at 10:03 AM, gabeweb <[email protected]>
>>> wrote:
>>>
>>>> I'm a bit concerned by the implementation of FullRunningAverage. It's
>>>> probably not a big deal, but because it performs division each time a
>>>> new datum is added, there will be a floating point error that will
>>>> increase over time, and if you average together millions of data
>>>> points, this error may become significant. I know that Java doubles
>>>> have really huge precision, but still.
>>>>
>>>> In addition, the implementation seems rather inefficient in that it
>>>> performs 6 math operations every time a datum is added. Since adding to
>>>> a running average is likely to happen much more often than accessing
>>>> the average, wouldn't it be a lot more efficient to maintain the
>>>> *total* and count internally, rather than the current average and the
>>>> count, and then when someone requests the average, calculate the actual
>>>> average on the fly by dividing the total by the count? Then it would
>>>> just be two addition operations to add a datum. This would
>>>> simultaneously eliminate the accumulation of floating point errors.
>>>>
>>>> If people agree, I'd be happy to fix this and submit a patch. It's a
>>>> simple thing, but it's sort of bothering me :) Plus it will allow me to
>>>> get familiar with the process.
>>>> --
>>>> View this message in context:
>>>> http://lucene.472066.n3.nabble.com/FullRunningAverage-possibly-inefficient-and-very-slightly-inaccurate-tp1744425p1744425.html
>>>> Sent from the Mahout User List mailing list archive at Nabble.com.

--
Lance Norskog
[email protected]
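[For reference, the mean-only update Ted is pointing at could be sketched
roughly like this. This is a hypothetical illustration with made-up class
and method names, not the actual Mahout FullRunningAverage code.]

```java
// Sketch of a Welford-style running mean (hypothetical class, not Mahout
// code). Updating via avg += (x - avg) / n is numerically stable and needs
// no special case for the first datum, since avg starts at 0 and n at 1.
class WelfordRunningAverage {
  private int count = 0;
  private double average = 0.0;

  // Each add costs one subtraction, one division, one addition, and the
  // count increment.
  public void addDatum(double datum) {
    count++;
    average += (datum - average) / count;
  }

  public double getAverage() {
    return average;
  }

  public int getCount() {
    return count;
  }
}
```

Adding 1.0, 2.0, 3.0 in turn yields a running mean of 2.0, with no
conditional for the no-prior-data case.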
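[gabeweb's total-and-count proposal, and the read-time division Sean
objects to, could be sketched like this. Again a hypothetical illustration,
not the actual implementation.]

```java
// Sketch of the total-and-count alternative (hypothetical class, not
// Mahout code). Adds are two cheap operations with no rounding from
// repeated division; the tradeoff is that every read of the average
// pays one division, which matters for read-heavy uses like slope-one.
class TotalCountRunningAverage {
  private int count = 0;
  private double total = 0.0;

  public void addDatum(double datum) {
    total += datum; // one addition
    count++;        // one increment
  }

  // Division deferred to read time.
  public double getAverage() {
    return total / count;
  }

  public int getCount() {
    return count;
  }
}
```

Memory use matches the constraint Sean states: still just one int and one
double per average.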
