Re: OnlineSummarizer mods

Ted Dunning Sat, 21 May 2011 22:15:34 -0700

This is not the same algorithm at all.  In the first place, there is a
requirement that you know the size of your data-set in advance so that you
can set \tau.  It might be possible to adaptively set \tau as you go, but
that isn't obvious to me in the few minutes I have looked at this.  My worry
is that if you do this, you may have already thrown away important data by
the time that you realize that \tau needs to be larger than you expected.

Secondly, this appears to be most useful for computing moderately extreme
quantiles.  That may be what you intend to do.  Or not.

I think that the other algorithm is better in that it requires much less
memory and is completely on-line.

The basic idea in the on-line algorithm is that you maintain an estimate of
the quantile that you want and an estimate of the PDF at that point.  You
estimate the PDF by keeping a count of the points that you have seen within
bounds that you tighten over time.  The current estimate is adjust up or
down by an amount based on the quantile that you are estimating times the
current PDF estimate.  For each quantile that you are estimating, you only
need to keep a few values.  In the case of the current Mahout
implementation, I re-use data stored for one quantile for a different
purpose in the estimation of other quantiles so that we pretty much just
need to keep the 5 current quartile estimates and possibly a total count.
 This compares to keeping a thousand values plus counts per quantile with
the IBM paper you reference.

On Sat, May 21, 2011 at 4:33 PM, Lance Norskog <[email protected]> wrote:

> This seems to be the same thing, from '95. And simple enough that even
> I can follow it.
>
>
> http://www.almaden.ibm.com/cs/projects/iis/hdb/Publications/papers/comad95.pdf
>
> On Sat, May 21, 2011 at 3:21 PM, Ted Dunning <[email protected]>
> wrote:
> > By definition, quaRtiles are quartiles.
> >
> > You can adapt the code to generate other quaNtiles by adjusting the
> > parameters that adjust the estimates up and down.  Right now, I use the
> > inter-quartile range to estimate the probability density.  You would need
> to
> > do something else.
> >
> > The original paper has some help in this regard.  See the javadoc for the
> > link.
> >
> > On Sat, May 21, 2011 at 3:03 PM, Lance Norskog <[email protected]>
> wrote:
> >
> >> How would one change OnlineSummarizer to allow setting the 1 and 3
> >> quartiles? They are hard-coded at 25% and 75%.
> >>
> >> Q1 and Q3 at 02%/98% gives a way to ignore outliers.
> >>
> >> --
> >> Lance Norskog
> >> [email protected]
> >>
> >
>
>
>
> --
> Lance Norskog
> [email protected]
>

Re: OnlineSummarizer mods

Reply via email to