Thanks Lee. This has been very helpful.
I am curious about the confidence number, though. Can you elaborate on how you computed it to be 99%? I was under the impression that it is a fixed value given by the paper.

Thank You,
Gourav

On Tue, 23 Jun, 2020, 10:50 AM leerho <lee...@gmail.com> wrote:

> Gourav,
>
> I will defer the answer to Q1 to either Justin or Edo, but the rest I can answer.
>
> Q2: 99% was somewhat arbitrary and a compromise. It took a huge amount of carefully constructed characterization studies to establish the 99% confidence level, and refining it to a much finer level was left as a future task.
>
> Q3: The minimum width of 8 is also a speed/space compromise. The paper suggests a minimum of 2, but with only 2 values per level the process resembles random sampling, which the paper also discusses. However, installing a sampler at the tail was going to add quite a bit of complexity to the code, especially for merging. And frankly, at the time, we were not confident that merging with the sampler at the tail would be as robust. Again, this was left to a future algorithm that could be proven to work in practical situations. Also, a long tail of compactors of size 2 is computationally very wasteful and slow. By widening the tail to 8 we were able to recover nearly all the speed of the older quantiles algorithm with most of the benefit of the size reduction of KLL. This is another example of an engineering design decision where we were particularly concerned about speed performance. It was not arbitrary: we did extensive characterization studies of the widths 2, 4, 6, 8, 10, and 12, and 8 seemed to be the happy medium between speed, space, and simplicity of implementation.
>
> Q4. This is an easy question, but clearly not obvious. The whole idea of lazy promotion of the compactors is to keep as much data as possible at the low levels, as they have the most accurate representation of the input stream. If you focus on promoting the higher levels first, you end up degrading your accuracy faster.
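>
> In code, that selection rule looks roughly like this (a simplified Java sketch, not the actual library code; levelSize() and levelCapacity() are hypothetical stand-ins for the real internals):
>
>     // Lazy promotion: always compact the LOWEST level that exceeds its
>     // capacity, so the most accurate (low-level) data is disturbed as
>     // late as possible.
>     int findLevelToCompact(final int numLevels) {
>       for (int level = 0; level < numLevels; level++) {
>         if (levelSize(level) > levelCapacity(level)) {
>           return level; // the lowest offending level wins
>         }
>       }
>       return -1; // nothing needs to be compacted
>     }
>
> A rule that instead preferred the level with the largest overshoot could end up compacting higher levels first, and, per the reasoning above, that degrades accuracy faster.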
>
> Cheers,
>
> Lee.
>
> On Mon, Jun 22, 2020 at 9:43 PM Gourav Kumar <gourav1...@gmail.com> wrote:
>
>> Thanks Lee,
>>
>> This clears up a lot of things, although I have a few more questions.
>>
>> Q1. From a user's perspective, can you tell a priori whether the constructed sketch falls in the 1% category (the bad case), or, when the user asks a get_quantile query, give some indication of the confidence or error bound for that rank query? As a user relying on this algorithm, how can I be sure that I am not in the 1% case? If I can't, then since we don't have any bounds on rank in the bad case, won't users be skeptical about using it?
>>
>> Q2. How did you arrive at the confidence value of 99%? The paper mentions some formula for delta, but it's not clear to me how it translates to 99%.
>>
>> Q3. Why is the min width's (m_) default value set to 8? For compaction to be meaningful, in my opinion a min width greater than 1 would have been sufficient. The other reason that comes to mind is faster convergence and fewer compactions. In that case, why 8 and not some even larger value?
>>
>> Q4. Also, in compaction you always pick the first level that stores more elements than its capacity. Would the guarantees or the speed of the algorithm be affected if this strategy were changed to, say, picking the level that overshoots its capacity by the largest amount? In my opinion, this strategy would free up more space and should lead to fewer compactions, and hence lower empirical error bounds.
>>
>> Thank You,
>> Gourav
>>
>> On Tue, 23 Jun 2020 at 07:56, Alexander Saydakov <sayda...@verizonmedia.com> wrote:
>>
>>> Adding the original poster just in case he is not subscribed to the list.
>>>
>>> On Mon, Jun 22, 2020 at 7:18 PM leerho <lee...@gmail.com> wrote:
>>>
>>>> I see a typo: what I called the Omega relation is actually Omicron (big "Oh") :)
>>>>
>>>> On Mon, Jun 22, 2020 at 6:22 PM Justin Thaler <justin.r.tha...@gmail.com> wrote:
>>>>
>>>>> Your response looks great to me.
>>>>>
>>>>> I suppose Gourav will respond if he is still confused about any issues. And hopefully Edo will read this over at some point and double-check both of our work :-)
>>>>>
>>>>> On Mon, Jun 22, 2020 at 9:14 PM leerho <lee...@gmail.com> wrote:
>>>>>
>>>>>> Hello Gourav, welcome to this forum!
>>>>>>
>>>>>> I want to make sure you have access to and have read the code documentation for the KLL sketch in addition to the papers. Although the code documentation exists for both Java and C++, it is a little easier to access the Javadocs, as they are accessible from the website. There is a long section about the KLL sketch <https://datasketches.apache.org/api/java/snapshot/apidocs/index.html> and its implementation that may answer some of your questions.
>>>>>>
>>>>>> It is important to emphasize that the error of the quantile sketch is relevant only to the rank and not the value. It is not clear from your question which one you are referring to.
>>>>>>
>>>>>> 1. I asked Justin Thaler, one of our scientists, how to better explain the error "guarantee" of the sketch. Hopefully, he can respond here if what I write is not totally correct :)
>>>>>>
>>>>>> When you feed a stream of values into a KLL (or Quantiles) sketch, the result is actually a set of values (kept internally by the sketch) that have been "sampled" from the stream by a random process. Let's call this set a *result set*. When we issue a query against the sketch, it analyzes this *result set* and returns a value (or a rank, depending on the query).
>>>>>>
>>>>>> The guarantee goes something like this:
>>>>>>
>>>>>> With 99% probability, the *result set* of a sketch will be a set such that a *getQuantile(rank)* query against that *result set* will produce an *estimated quantile (or value)* such that the *true rank of the true value* will be within +/- the absolute fractional rank error returned by the function getNormalizedRankError(k, pmf). If the *result set* is a good one, i.e., one of the good 99%-of-the-time sketches, then the guarantee is that *all possible queries* will fall within the +/- rank error interval defined above.
>>>>>>
>>>>>> Because of the random sampling process used by the sketch, there is a 1% chance that the *result set* is bad. We don't know how bad: it could mess up only one query, or it could mess up all queries from that bad *result set*.
>>>>>>
>>>>>> This applies to both of your Q1 scenarios, given the caveat above, and is actually a stronger guarantee.
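>>>>>>
>>>>>> To make this concrete, here is roughly what the guarantee looks like in code (a sketch against the Java KllFloatsSketch API; streamValues is a placeholder, and exact method names may differ between releases):
>>>>>>
>>>>>>     import org.apache.datasketches.kll.KllFloatsSketch;
>>>>>>
>>>>>>     // Build a sketch over a stream; the default k = 200.
>>>>>>     final KllFloatsSketch sketch = new KllFloatsSketch();
>>>>>>     for (final float v : streamValues) { sketch.update(v); }
>>>>>>
>>>>>>     // Single-quantile normalized rank error for k = 200 (pmf = false):
>>>>>>     // 2.296 / 200^0.9723, about 0.0133, i.e., +/- 1.33% of rank.
>>>>>>     final double eps = KllFloatsSketch.getNormalizedRankError(200, false);
>>>>>>
>>>>>>     // With 99% probability the result set is good, and then for EVERY
>>>>>>     // rank r the true rank of getQuantile(r) lies in [r - eps, r + eps].
>>>>>>     final float estMedian = sketch.getQuantile(0.5);
>>>>>>
>>>>>> If the *result set* happens to be in the bad 1%, no such interval is promised; that is exactly the caveat above.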
>>>>>>
>>>>>> There is no guarantee on the accuracy of values; however, you can get a sense of the range of values along the value axis that corresponds to the confidence interval range along the rank axis. For example: suppose your rank error is 1% and getQuantile(0.5) returns a value. Now query getQuantile(0.5 - 0.01) and getQuantile(0.5 + 0.01), and that will give the corresponding range of values.
>>>>>>
>>>>>> Q2. Your Q2 is confusing, because I think you are conflating the accuracy (1/epsilon) with the confidence (1/delta). The Omega equation is O(sqrt(log(1/delta))/epsilon). There is actually a code comment in the Java version that states:
>>>>>>
>>>>>>> // constants were derived as the best fit to 99 percentile
>>>>>>> // empirically measured max error in thousands of trials
>>>>>>> public static double getNormalizedRankError(final int k, final boolean pmf) {
>>>>>>>   return pmf
>>>>>>>       ? 2.446 / pow(k, 0.9433)
>>>>>>>       : 2.296 / pow(k, 0.9723);
>>>>>>> }
>>>>>>
>>>>>> The hidden constant values behind the Omega equation above are not discussed in the paper. So we ran many thousands of trials to empirically determine what those constants are, and the result of that study is the equations given.
>>>>>>
>>>>>> Q3. Space is initially limited to 3K, but for very long streams the space will grow very slowly ... something along the lines of *log_base8(N/k)*. We need to put growth tables on the web site like we did for the older QuantilesSketch; we just haven't gotten around to it.
>>>>>>
>>>>>> You are correct that to quickly answer forward queries (like Q(r)) we build a temporary, internal aux table. If you have multiple queries against the same sketch, this speeds up the query process considerably. For all the other queries (R(q), CDF, PDF), this aux table is not needed. Also, when the sketch is serialized, this aux table is not needed, as it can be reconstructed at any time. You are correct that we could document this temporary size increase better. However, the speed gain is considerable, and most of our users prefer to have the speed gain at query time and don't worry so much about the temporary allocation of the aux table. This is clearly a speed/space optimization we made as part of the engineering of the design.
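>>>>>>
>>>>>> As a usage illustration (again a sketch, not a prescription, reusing the hypothetical sketch object from the earlier snippet), batching forward queries lets them all share that one aux table:
>>>>>>
>>>>>>     // The first forward query pays the cost of building the internal
>>>>>>     // aux table; later quantile queries on the same sketch reuse it.
>>>>>>     final double[] ranks = {0.01, 0.25, 0.5, 0.75, 0.99};
>>>>>>     final float[] values = sketch.getQuantiles(ranks);
>>>>>>
>>>>>>     // Reverse queries (rank, CDF, PMF) do not need the aux table at all.
>>>>>>     final double rankOfValue = sketch.getRank(values[2]);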
>>>>>>
>>>>>> I hope this helps!
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>> Lee.
>>>>>>
>>>>>> On Sun, Jun 21, 2020 at 11:13 AM Gourav Kumar <gourav1...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi All,
>>>>>>> Hope you are well in these COVID times.
>>>>>>>
>>>>>>> I was going through the implementation of the KLL algorithm (the cpp version on the Apache GitHub), based on the paper "Streaming Quantiles Algorithms with Small Space and Update Time", and wanted to understand it better with a view to applications in databases. I had a few questions related to the paper that I couldn't get my head around, and I would appreciate some help with them:
>>>>>>>
>>>>>>> Q1. What is the meaning of confidence in the KLL algorithm? Does it mean that once a sketch has been constructed and we start asking quantile queries, confidence tells how many of these queries will fall within the normalized error bound? Or does it mean that if I construct the sketch n times on some data and ask the same query q each time, confidence tells in how many of these n cases q will fall within the normalized error bounds?
>>>>>>>
>>>>>>> I tested the Apache version with the deterministic bits that you have under the KLL_VALIDATION flag. There, for many (different) queries, I got values beyond the error bounds, and the number of such queries was well beyond the 99% confidence. The deterministic bits followed a pattern (010101...), and the random bit generation could at some point also generate such a pattern. What do you do in such a case, where neither the error bounds nor the confidence hold?
>>>>>>>
>>>>>>> Q2. The paper mentions a relation between the error bound and the buffer size k, as O(sqrt(log e)/e). What is the actual relation between these two, constants included, when we try to get to a practical application of the algorithm? In your implementation you set the normalized error equal to 2.296/k^0.9723, and I want to understand how you arrived at this value.
>>>>>>>
>>>>>>> Q3. The space taken by the sketch is bounded by 3K, but the algorithm will take more memory for storing weights to answer quantiles (after sorting across levels at the end of sketch creation; otherwise the bounds might not hold). Why is this not included in the memory taken by the sketch? A practical application of the paper will need this memory, as well as additional temporary storage during CDF construction to answer queries.
>>>>>>>
>>>>>>> Appreciate your time.
>>>>>>> Thank You,
>>>>>>> Gourav Kumar
>>
>> --
>> Regards,
>> Gourav