This is not the same algorithm at all. In the first place, there is a requirement that you know the size of your data-set in advance so that you can set \tau. It might be possible to adaptively set \tau as you go, but that isn't obvious to me in the few minutes I have looked at this. My worry is that if you do this, you may have already thrown away important data by the time that you realize that \tau needs to be larger than you expected.
Secondly, this appears to be most useful for computing moderately extreme quantiles. That may be what you intend to do. Or not. I think that the other algorithm is better in that it requires much less memory and is completely on-line. The basic idea in the on-line algorithm is that you maintain an estimate of the quantile that you want and an estimate of the PDF at that point. You estimate the PDF by keeping a count of the points that you have seen within bounds that you tighten over time. The current estimate is adjust up or down by an amount based on the quantile that you are estimating times the current PDF estimate. For each quantile that you are estimating, you only need to keep a few values. In the case of the current Mahout implementation, I re-use data stored for one quantile for a different purpose in the estimation of other quantiles so that we pretty much just need to keep the 5 current quartile estimates and possibly a total count. This compares to keeping a thousand values plus counts per quantile with the IBM paper you reference. On Sat, May 21, 2011 at 4:33 PM, Lance Norskog <[email protected]> wrote: > This seems to be the same thing, from '95. And simple enough that even > I can follow it. > > > http://www.almaden.ibm.com/cs/projects/iis/hdb/Publications/papers/comad95.pdf > > On Sat, May 21, 2011 at 3:21 PM, Ted Dunning <[email protected]> > wrote: > > By definition, quaRtiles are quartiles. > > > > You can adapt the code to generate other quaNtiles by adjusting the > > parameters that adjust the estimates up and down. Right now, I use the > > inter-quartile range to estimate the probability density. You would need > to > > do something else. > > > > The original paper has some help in this regard. See the javadoc for the > > link. > > > > On Sat, May 21, 2011 at 3:03 PM, Lance Norskog <[email protected]> > wrote: > > > >> How would one change OnlineSummarizer to allow setting the 1 and 3 > >> quartiles? They are hard-coded at 25% and 75%. > >> > >> Q1 and Q3 at 02%/98% gives a way to ignore outliers. > >> > >> -- > >> Lance Norskog > >> [email protected] > >> > > > > > > -- > Lance Norskog > [email protected] >
