Thanks!

On Sun, May 22, 2011 at 12:45 AM, Ted Dunning <[email protected]> wrote:

> The simplest implementation for this would be to keep a tree of values. While you have fewer than N (about 1000) values in the tree, you add each incoming value to the tree. When you have more than N values after adding a new sample, pick values at random to delete until you are down to N values in the tree. Any time you have a value that is in the top or bottom 2% you can log it. This will get very nearly the result you want at the cost of a bit of work for the insertion and deletion.
>
> The data structure to use is a little tricky. You want to be able to efficiently insert an element, of course, but you also want to find the n-th largest element quickly and to determine the rank of a newly inserted element reasonably quickly. For most applications with time scales on the order of milliseconds, simply keeping a sorted array will be a fine solution. This obviously costs a fair bit to shift things when you do an insertion. One fun trick to improve this by a substantial factor is to size the array larger than necessary and keep "tombstone" markers in the array wherever you have deleted an element. Then, when shifting to make room for a new element, you only have to shift data until you find a tombstone or the end of the array, and you don't have to shift at all on deletion. Finding the current rank or finding the n-th largest element will require a scan, but because of caching and fast processors, this will probably be nearly as fast as any fancy data structure you can build, especially since you will only be scanning from one end or the other.
>
> This won't be terribly accurate, nor will it be as tight on memory, but it will be very simple to get right, and for the slow-query logging problem it should be plenty good enough.
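To make that concrete, a rough Java sketch of the bounded sorted-sample idea above could look like the class below. The class name, the 1000-value cap, and the 2% tails are just illustrative choices, not Mahout code, and it skips the tombstone trick (with tombstones, removal would only mark a slot and the insertion shift would stop at the first marker):

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Rough sketch only: a bounded sample kept in sorted order, with random
// eviction once it is full and a check for values landing in either tail.
// Class and method names are invented for illustration; not Mahout code.
public class TailSampler {
  private final List<Double> sample = new ArrayList<Double>(); // kept sorted ascending
  private final int maxSize;                                   // e.g. 1000, as suggested above
  private final double tail;                                   // e.g. 0.02 for the 2%/98% cutoffs
  private final Random random = new Random();

  public TailSampler(int maxSize, double tail) {
    this.maxSize = maxSize;
    this.tail = tail;
  }

  // Adds a value; returns true if it falls in the top or bottom tail of the current sample.
  public boolean addAndCheck(double value) {
    int idx = Collections.binarySearch(sample, value);
    int insertAt = idx >= 0 ? idx : -(idx + 1);
    sample.add(insertAt, value);                               // O(n) shift; fine for n around 1000

    boolean extreme = insertAt < tail * sample.size()
        || insertAt >= (1.0 - tail) * sample.size();

    if (sample.size() > maxSize) {
      sample.remove(random.nextInt(sample.size()));            // random eviction keeps the size bounded
    }
    return extreme;
  }

  // Crude quantile read-out from the current (non-empty) sample, 0 < q < 1.
  public double quantile(double q) {
    int i = (int) Math.min(sample.size() - 1, Math.floor(q * sample.size()));
    return sample.get(i);
  }
}

Feeding every query time through addAndCheck() with maxSize = 1000 and tail = 0.02 would flag roughly the fastest and slowest 2% of recently seen queries for logging.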
> On Sun, May 22, 2011 at 12:08 AM, Lance Norskog <[email protected]> wrote:
>
>> Here is the use case: quantiles for search query times. 2% gives the fastest possible queries, and 98% gives the slowest normal queries while ignoring a garbage collection colliding with a cosmic ray.
>>
>> The next step is to maintain a reservoir sample of search queries and calculate the quantiles on demand. This allows drilling down into "what queries are consistently slow?". This definitely wants the online algorithm, because it allows two-pass classification: the first pass through the queries establishes the quantiles, and the second pass classifies each query time among the quantiles. This would be very useful.
>>
>> Lance
>>
>> On Sat, May 21, 2011 at 10:14 PM, Ted Dunning <[email protected]> wrote:
>>
>> > This is not the same algorithm at all. In the first place, there is a requirement that you know the size of your data-set in advance so that you can set \tau. It might be possible to adaptively set \tau as you go, but that isn't obvious to me in the few minutes I have looked at this. My worry is that if you do this, you may have already thrown away important data by the time you realize that \tau needs to be larger than you expected.
>> >
>> > Secondly, this appears to be most useful for computing moderately extreme quantiles. That may be what you intend to do. Or not.
>> >
>> > I think that the other algorithm is better in that it requires much less memory and is completely on-line.
>> >
>> > The basic idea in the on-line algorithm is that you maintain an estimate of the quantile that you want and an estimate of the PDF at that point. You estimate the PDF by keeping a count of the points that you have seen within bounds that you tighten over time. The current estimate is adjusted up or down by an amount based on the quantile that you are estimating times the current PDF estimate. For each quantile that you are estimating, you only need to keep a few values. In the case of the current Mahout implementation, I re-use data stored for one quantile for a different purpose in the estimation of other quantiles, so that we pretty much just need to keep the 5 current quartile estimates and possibly a total count. This compares to keeping a thousand values plus counts per quantile with the IBM paper you reference.
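For reference, here is a rough single-quantile sketch of that style of update: nudge a running estimate up or down on each sample, with the step scaled by the target quantile. It is not the Mahout OnlineSummarizer; in particular, it uses a simple decaying step where the real code would use a density estimate (for example based on the inter-quartile range):

// Illustrative sketch of the incremental update described above; not the
// Mahout OnlineSummarizer. One instance tracks one target quantile.
public class IncrementalQuantile {
  private final double q;     // target quantile, e.g. 0.98
  private double estimate;    // current estimate of that quantile
  private double step;        // adjustment scale; a real version would derive this from a PDF estimate
  private boolean seeded;

  public IncrementalQuantile(double q) {
    this.q = q;
  }

  public void add(double x) {
    if (!seeded) {
      estimate = x;
      step = Math.max(Math.abs(x), 1.0); // crude initial scale; an assumption of this sketch
      seeded = true;
      return;
    }
    if (x > estimate) {
      estimate += step * q;              // sample above the estimate: move up, by more for high q
    } else if (x < estimate) {
      estimate -= step * (1.0 - q);      // sample below the estimate: move down, by more for low q
    }
    step = Math.max(step * 0.999, 1e-9); // tighten over time, standing in for a real PDF estimate
  }

  public double estimate() {
    return estimate;
  }
}

For the use case above you would keep two instances, new IncrementalQuantile(0.02) and new IncrementalQuantile(0.98), feed every query time to both, and compare each new time against the two estimates.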
>> > On Sat, May 21, 2011 at 4:33 PM, Lance Norskog <[email protected]> wrote:
>> >
>> >> This seems to be the same thing, from '95. And simple enough that even I can follow it.
>> >>
>> >> http://www.almaden.ibm.com/cs/projects/iis/hdb/Publications/papers/comad95.pdf
>> >>
>> >> On Sat, May 21, 2011 at 3:21 PM, Ted Dunning <[email protected]> wrote:
>> >>
>> >> > By definition, quaRtiles are quartiles.
>> >> >
>> >> > You can adapt the code to generate other quaNtiles by adjusting the parameters that adjust the estimates up and down. Right now, I use the inter-quartile range to estimate the probability density. You would need to do something else.
>> >> >
>> >> > The original paper has some help in this regard. See the javadoc for the link.
>> >> >
>> >> > On Sat, May 21, 2011 at 3:03 PM, Lance Norskog <[email protected]> wrote:
>> >> >
>> >> >> How would one change OnlineSummarizer to allow setting the first and third quartiles? They are hard-coded at 25% and 75%.
>> >> >>
>> >> >> Q1 and Q3 at 2%/98% gives a way to ignore outliers.

--
Lance Norskog
[email protected]
