Is this Welford's method?
http://www.johndcook.com/blog/2008/09/26/comparing-three-methods-of-computing-standard-deviation/
http://www.johndcook.com/standard_deviation.html

In the second one, you can ditch the standard deviation part. As you can see,
this has more computation than FullRunningAverage. It will be hard to make a
benchmark that elicits performance differences.

On Fri, Oct 22, 2010 at 10:55 AM, Ted Dunning <[email protected]> wrote:

> If you care about accuracy, then switching to Welford's method (which is
> different from keeping a total) would help as much or more. It would also
> avoid the conditional for the case of no prior data.
>
> The speedup is, in any case, not likely to exist because this code is
> probably memory bound anyway. This is typical for modern architectures,
> where floating point math is so much faster than memory access that you
> can only saturate the ALU with carefully designed codes. Virtually no 1-d
> vector-oriented code can do this because there are as many memory ops as
> flops; it takes a matrix operation to do it, because those have n^2
> memory ops and n^3 flops.
>
> On Thu, Oct 21, 2010 at 11:45 PM, Gabriel Webster
> <[email protected]> wrote:
>
>> OK, no worries then, thanks for explaining. I was thinking of using the
>> class as part of transforming ratings to z-scores as a preprocessing step
>> (or a wrapping recommender), and although the accuracy issue wouldn't
>> come up (since there are only up to a few thousand ratings per user), the
>> possible speed-up over millions of users might make it worthwhile from a
>> value point of view (value = amount of improvement / amount of work to
>> implement). But since it's just averaging, it's trivial to do it myself
>> in a different class.
>>
>> A documenting note at the top of the class explaining the motivation
>> behind the implementation might be useful to future code explorers, if
>> it's not too much bother. You could probably largely just copy and paste
>> your post.
>>
>> On 10/21/10 5:26 PM, Sean Owen wrote:
>>
>>> This class is used mostly in slope-one.
>>> The priorities are memory usage, then speed, then accuracy in that
>>> context.
>>>
>>> Because of memory concerns, I don't want to store more than an int and
>>> double. So yes, you can store a total and count. So we're left trading
>>> speed for accuracy. Your change is a bit more accurate but slower at
>>> reads (must divide to get the average).
>>>
>>> For slope-one, which is read-heavy, and where accuracy is not a big
>>> deal, this isn't a worthwhile tradeoff, I think.
>>>
>>> What's the use case you have in mind?
>>>
>>> On Thu, Oct 21, 2010 at 10:03 AM, gabeweb <[email protected]>
>>> wrote:
>>>
>>>> I'm a bit concerned by the implementation of FullRunningAverage. It's
>>>> probably not a big deal, but because it performs division each time a
>>>> new datum is added, there will be a floating point error that will
>>>> increase over time, and if you average together millions of data
>>>> points, this error may become significant. I know that Java doubles
>>>> have really huge precision, but still.
>>>>
>>>> In addition, the implementation seems rather inefficient in that it
>>>> performs 6 math operations every time a datum is added. Since adding to
>>>> a running average is likely to happen much more often than accessing
>>>> the average, wouldn't it be a lot more efficient to maintain the
>>>> *total* and count internally, rather than the current average and the
>>>> count, and then when someone requests the average, calculate the actual
>>>> average on the fly by dividing the total by the count? Then it would
>>>> just be two addition operations to add a datum. This would
>>>> simultaneously eliminate the accumulation of floating point errors.
>>>>
>>>> If people agree, I'd be happy to fix this and submit a patch. It's a
>>>> simple thing, but it's sort of bothering me :) Plus it will allow me to
>>>> get familiar with the process.
>>>> --
>>>> View this message in context:
>>>> http://lucene.472066.n3.nabble.com/FullRunningAverage-possibly-inefficient-and-very-slightly-inaccurate-tp1744425p1744425.html
>>>> Sent from the Mahout User List mailing list archive at Nabble.com.

--
Lance Norskog
[email protected]
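[For reference, the mean-only update Ted is pointing at could be sketched
roughly like this. This is a hypothetical illustration with made-up class
and method names, not the actual Mahout FullRunningAverage code.]

```java
// Sketch of a Welford-style running mean (hypothetical class, not Mahout
// code). Updating via avg += (x - avg) / n is numerically stable and needs
// no special case for the first datum, since avg starts at 0 and n at 1.
class WelfordRunningAverage {
  private int count = 0;
  private double average = 0.0;

  // Each add costs one subtraction, one division, one addition, and the
  // count increment.
  public void addDatum(double datum) {
    count++;
    average += (datum - average) / count;
  }

  public double getAverage() {
    return average;
  }

  public int getCount() {
    return count;
  }
}
```

Adding 1.0, 2.0, 3.0 in turn yields a running mean of 2.0, with no
conditional for the no-prior-data case.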
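[gabeweb's total-and-count proposal, and the read-time division Sean
objects to, could be sketched like this. Again a hypothetical illustration,
not the actual implementation.]

```java
// Sketch of the total-and-count alternative (hypothetical class, not
// Mahout code). Adds are two cheap operations with no rounding from
// repeated division; the tradeoff is that every read of the average
// pays one division, which matters for read-heavy uses like slope-one.
class TotalCountRunningAverage {
  private int count = 0;
  private double total = 0.0;

  public void addDatum(double datum) {
    total += datum; // one addition
    count++;        // one increment
  }

  // Division deferred to read time.
  public double getAverage() {
    return total / count;
  }

  public int getCount() {
    return count;
  }
}
```

Memory use matches the constraint Sean states: still just one int and one
double per average.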
