Thanks Lee.

This has been very helpful.

I am curious about the confidence number, though. Can you elaborate on
how you computed it to be 99%?

I was under the impression that it's a fixed value given by the paper.

Thank You,
Gourav

On Tue, 23 Jun, 2020, 10:50 AM leerho, <lee...@gmail.com> wrote:

> Gourav,
>
> I will defer the answer to Q1 to either Justin or Edo, but the rest I can
> answer.
>
> Q2: The 99% was somewhat arbitrary and a compromise.  It took a huge
> amount of carefully constructed characterization studies to establish the
> 99% confidence level; refining it to a much finer level was left as a
> future task.
>
> Q3: The minimum width of 8 is also a speed / space compromise.  The
> paper suggests a minimum of 2, but with only 2 values per level the process
> resembles random sampling, which the paper also discusses.  However,
> installing a sampler at the tail was going to add quite a bit of complexity
> to the code, especially for merging.  And frankly, at the time, we were not
> confident that merging with a sampler at the tail would be as robust.
> Again, this was left to a future algorithm that could be proven to work in
> practical situations.  Also, a long tail of compactors of size 2 is
> computationally very wasteful and slow.  By widening the tail to 8 we were
> able to recover nearly all the speed of the older quantiles algorithm with
> most of the benefit of the size reduction of KLL.  This is another example
> of an engineering design decision where we were particularly concerned
> about speed performance.  It was not arbitrary: we did extensive
> characterization studies of various widths (2, 4, 6, 8, 10, and 12), and 8
> proved to be the happy medium between speed, space, and simplicity of
> implementation.
>
> Q4.  This is an easy question, but the answer is clearly not obvious.  The
> whole idea of lazy promotion of the compactors is to keep as much data as
> possible at the low levels, as they have the most accurate representation
> of the input stream.  If you promote the higher levels first, you degrade
> your accuracy faster.
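The lazy strategy described here can be sketched in a few lines. This is a toy illustration with made-up level counts and capacities, not the actual DataSketches code:

```java
// Toy illustration of lazy compactor promotion: always compact the
// LOWEST level that exceeds its capacity, keeping the accurate
// low-level data intact as long as possible.
// NOT the actual DataSketches implementation.
public class LazyPromotion {
    // Returns the index of the first (lowest) level whose item count
    // exceeds its capacity, or -1 if no level is over capacity.
    static int levelToCompact(int[] counts, int[] capacities) {
        for (int lvl = 0; lvl < counts.length; lvl++) {
            if (counts[lvl] > capacities[lvl]) {
                return lvl;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        int[] counts     = {9, 12, 5};
        int[] capacities = {8, 10, 16};
        // Levels 0 and 1 are both over capacity; the lazy rule picks
        // level 0, the most accurate one, first.
        System.out.println(levelToCompact(counts, capacities)); // prints 0
    }
}
```

Picking the level that overshoots by the largest amount (the alternative proposed in Q4) would only change the loop's selection criterion, but would drain the accurate low levels faster.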
>
> Cheers,
>
> Lee.
>
> On Mon, Jun 22, 2020 at 9:43 PM Gourav Kumar <gourav1...@gmail.com> wrote:
>
>> Thanks Lee,
>>
>> This clears up a lot of things.
>> I do have a few more questions, though.
>>
>> Q1. From a user's perspective, can you tell a priori whether the sketch
>> constructed falls in the 1% category (the bad case)?
>>       Or, when the user asks a get_quantile query, can you give some sort
>> of indication of the confidence or error bound for that rank query?
>>        As a user relying on this algorithm, how can I be sure
>> that I am not in the 1% case?
>>        If not, then since we don't have any bounds on rank in the bad
>> case, won't users be skeptical about using it?
>>
>> Q2. How did you arrive at the confidence value of 99%?
>>       The paper mentions some formula for delta, but it's not clear to me
>> if it will translate to 99%.
>>
>> Q3. Why is the min width's (m_) default value set to 8?
>>        For compaction to be meaningful, in my opinion a min width
>> greater than 1 would have been sufficient.
>>        The other reason that comes to mind is faster convergence and
>> fewer compactions.
>>        In that case, why 8 and not some even larger value?
>>
>> Q4. Also, in compaction you always pick the first level that stores more
>> elements than its capacity.
>>        Would the guarantees be affected, or would the speed of the
>> algorithm change, if you changed this strategy to,
>>         let's say, picking the level that overshoots its capacity by the
>> largest amount?
>>         In my opinion, that strategy would free up more space and should
>> lead to fewer compactions, and hence lower empirical error bounds.
>>
>> Thank You,
>> Gourav
>>
>> On Tue, 23 Jun 2020 at 07:56, Alexander Saydakov <
>> sayda...@verizonmedia.com> wrote:
>>
>>> Adding the original poster just in case he is not subscribed to the list
>>>
>>> On Mon, Jun 22, 2020 at 7:18 PM leerho <lee...@gmail.com> wrote:
>>>
>>>> I see a typo: What I called the Omega relation is actually Omicron (big
>>>> "Oh")  :)
>>>>
>>>>
>>>> On Mon, Jun 22, 2020 at 6:22 PM Justin Thaler <
>>>> justin.r.tha...@gmail.com> wrote:
>>>>
>>>>> Your response looks great to me.
>>>>>
>>>>> I suppose Gourav will respond if he is still confused on any issues.
>>>>> And hopefully Edo will read this over at some point and double-check
>>>>> our work :-)
>>>>>
>>>>> On Mon, Jun 22, 2020 at 9:14 PM leerho <lee...@gmail.com> wrote:
>>>>>
>>>>>> Hello Gourav, welcome to this forum!
>>>>>>
>>>>>> I want to make sure you have access to and have read the code
>>>>>> documentation for the KLL sketch in addition to the papers.  Although the
>>>>>> code documentation exists for both Java and C++, it is a little easier to
>>>>>> access the Javadocs as they are accessible from the website.  There is a
>>>>>> long section about the KLL sketch
>>>>>> <https://datasketches.apache.org/api/java/snapshot/apidocs/index.html>
>>>>>> and its implementation that may answer some of your questions.
>>>>>>
>>>>>> It is important to emphasize that the error of the quantile sketch
>>>>>> applies only to the rank, not the value. It is not clear from your
>>>>>> question which one you are referring to.
>>>>>>
>>>>>> 1. I asked Justin Thaler, one of our scientists, how to better
>>>>>> explain the error "guarantee" on the sketch.  Hopefully, he can respond
>>>>>> here if what I write here is not totally correct :)
>>>>>>
>>>>>> When you feed a stream of values into a KLL (or Quantiles) sketch the
>>>>>> result is actually a set of values (kept internally by the sketch) that
>>>>>> have been "sampled" from the stream using a random process. Let's call 
>>>>>> this
>>>>>> set a "*result set*".  When we issue a query against the sketch it
>>>>>> analyzes this *result set* and returns a value (or a rank depending
>>>>>> on the query).
>>>>>>
>>>>>> The guarantee goes something like this:
>>>>>>
>>>>>> With 99% probability, the *result set* of a sketch will be a set
>>>>>> such that a *getQuantile(rank)* query from that *result set* will
>>>>>> produce an *estimated quantile (or value)* such that the *true rank
>>>>>> of the true value* will be within +/- the absolute fractional rank
>>>>>> error returned by the function getNormalizedRankError(k, pmf).  If
>>>>>> the *result set* is a good one, i.e., one of the good
>>>>>> 99%-of-the-time sketches, then the guarantee is that *all possible
>>>>>> queries* will fall within the +/- rank error interval defined above.
>>>>>>
>>>>>> Because of the random sampling process used by the sketch, there is a
>>>>>> 1% chance that the *result set* is bad.  We don't know how bad:  it
>>>>>> could mess up only one query, or it could mess up all queries from that 
>>>>>> bad *result
>>>>>> set*.
>>>>>>
>>>>>> This applies to both of your Q1 scenarios, given the caveat above,
>>>>>> and is actually a stronger guarantee.
>>>>>>
>>>>>> There is no guarantee on the accuracy of values; however, you can get
>>>>>> a sense of the range of values along the value axis that corresponds
>>>>>> to the confidence-interval range along the rank axis.  For example:
>>>>>> suppose your rank error is 1% and getQuantile(0.5) returns a value.
>>>>>> Now query getQuantile(0.5 - 0.01) and getQuantile(0.5 + 0.01); that
>>>>>> will give the corresponding range of values.
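The bracketing trick described above can be illustrated with a toy stand-in for the sketch: a plain sorted array plays the role of the result set, and a naive rank-to-index lookup plays the role of getQuantile. This is not the DataSketches API, just an illustration of the idea:

```java
import java.util.Arrays;

// Toy illustration of bracketing a quantile estimate with the rank
// error. A plain sorted array stands in for the sketch's result set;
// this is NOT the DataSketches API.
public class RankErrorBracket {
    // Return the value whose normalized rank is closest to 'rank'.
    static double getQuantile(double[] sorted, double rank) {
        int idx = (int) Math.min(sorted.length - 1,
                                 Math.round(rank * (sorted.length - 1)));
        return sorted[idx];
    }

    public static void main(String[] args) {
        double[] data = new double[101];          // values 0.0 .. 100.0
        for (int i = 0; i < data.length; i++) data[i] = i;
        Arrays.sort(data);

        double eps = 0.01;                        // assume 1% rank error
        double est = getQuantile(data, 0.5);      // estimated median
        double lo  = getQuantile(data, 0.5 - eps);
        double hi  = getQuantile(data, 0.5 + eps);
        // The true median lies in [lo, hi] with the sketch's confidence.
        System.out.println(lo + " <= " + est + " <= " + hi);
    }
}
```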
>>>>>>
>>>>>> Q2. Your Q2 is confusing, because I think you are conflating the
>>>>>> accuracy (1/epsilon) with the confidence (1/delta).  The Omega equation 
>>>>>> is
>>>>>> O(sqrt(log(1/delta))/epsilon).
>>>>>> There is actually a code comment in the Java version that states:
>>>>>>
>>>>>>>   // constants were derived as the best fit to 99 percentile
>>>>>>>   // empirically measured max error in thousands of trials
>>>>>>>   public static double getNormalizedRankError(final int k, final boolean pmf) {
>>>>>>>     return pmf
>>>>>>>         ? 2.446 / pow(k, 0.9433)
>>>>>>>         : 2.296 / pow(k, 0.9723);
>>>>>>>   }
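To make the quoted formula concrete, it can be evaluated stand-alone; the constants below are copied from the snippet above, and the k values are arbitrary examples (larger k buys a smaller normalized rank error):

```java
// Stand-alone evaluation of the empirically fitted rank-error formula
// quoted above. Constants are copied from the Java snippet; the k
// values in main() are arbitrary examples.
public class RankError {
    static double getNormalizedRankError(final int k, final boolean pmf) {
        return pmf
            ? 2.446 / Math.pow(k, 0.9433)   // error for getPMF()-style queries
            : 2.296 / Math.pow(k, 0.9723);  // error for single-rank queries
    }

    public static void main(String[] args) {
        // Larger k means a smaller normalized rank error.
        for (int k : new int[]{100, 200, 400}) {
            System.out.printf("k=%d: single-rank error ~%.4f, pmf error ~%.4f%n",
                k,
                getNormalizedRankError(k, false),
                getNormalizedRankError(k, true));
        }
    }
}
```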
>>>>>>
>>>>>>
>>>>>> The hidden constant values behind the Omega equation above are not
>>>>>> discussed in the paper.  So we ran many thousands of trials to
>>>>>> empirically determine what those constants are, and the equations
>>>>>> given are the result of that study.
>>>>>>
>>>>>> Q3.  Space is initially limited to 3K, but for very long streams the
>>>>>> space will grow very slowly ... something along the lines of
>>>>>> *log_base8(N/k)*.  We need to put growth tables on the web site like
>>>>>> we did for the older QuantilesSketch; we just haven't gotten around
>>>>>> to it.
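A rough back-of-envelope based on the log_base8(N/k) growth mentioned above (the formula here is an assumption for illustration, not one of the measured growth tables):

```java
// Back-of-envelope sketch of the slow space growth described above,
// assuming the number of compactor levels grows roughly as
// log base 8 of N/k. Illustrative only; not measured values.
public class GrowthEstimate {
    static double approxLevels(long n, int k) {
        if (n <= k) return 1.0;                  // everything fits in one level
        return Math.log((double) n / k) / Math.log(8.0);
    }

    public static void main(String[] args) {
        int k = 200;
        // Even at a billion updates, the level count stays small.
        for (long n : new long[]{1_000L, 1_000_000L, 1_000_000_000L}) {
            System.out.printf("N=%d -> ~%.1f levels%n", n, approxLevels(n, k));
        }
    }
}
```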
>>>>>>
>>>>>> You are correct that to quickly answer forward queries (like Q(r))
>>>>>> we build a temporary, internal aux table.  If you issue multiple
>>>>>> queries against the same sketch, this speeds up the query process
>>>>>> considerably.  For all the other queries (R(q), CDF, PDF) this aux
>>>>>> table is not needed.  Also, when the sketch is serialized, this aux
>>>>>> table is not needed, as it can be reconstructed at any time.  You are
>>>>>> correct that we could document this temporary size increase better.
>>>>>> However, the speed gain is considerable, and most of our users prefer
>>>>>> to have the speed gain at query time and don't worry much about the
>>>>>> temporary allocation of the aux table.  This is clearly a speed/space
>>>>>> optimization we made as part of the engineering of the design.
>>>>>>
>>>>>> I hope this helps!
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>> Lee.
>>>>>>
>>>>>> On Sun, Jun 21, 2020 at 11:13 AM Gourav Kumar <gourav1...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi All,
>>>>>>> Hope you are well in these COVID times.
>>>>>>>
>>>>>>> I was going through the implementation of the KLL algorithm, based
>>>>>>> on the paper "Streaming Quantiles Algorithms with Small Space and
>>>>>>> Update Time", and wanted to understand it better, with an eye toward
>>>>>>> applications in databases.
>>>>>>> (The C++ version on the Apache GitHub.)
>>>>>>> I had a few questions related to the paper that I couldn't get my
>>>>>>> head around; I would appreciate some help with these:
>>>>>>>
>>>>>>> Q1. What is the meaning of confidence in the KLL algorithm?
>>>>>>>        Does it mean that once a sketch has been constructed and we
>>>>>>> start asking quantile queries, confidence tells how many of those
>>>>>>> queries will fall under the normalized error bound?
>>>>>>>         OR
>>>>>>>        If I construct the sketch n times on some data and ask the
>>>>>>> same query q each time, confidence tells in how many cases out of
>>>>>>> those n times q will fall under the normalized error bounds?
>>>>>>>
>>>>>>>        I tested the Apache version with the deterministic bits that
>>>>>>> you have under the KLL_VALIDATION flag.
>>>>>>>        There, for many (different) queries I got values beyond the
>>>>>>> error bounds, and the number of such queries was well beyond the 99%
>>>>>>> confidence.
>>>>>>>        The deterministic bits follow a pattern (0101...), and the
>>>>>>> random bit generation could also at some point generate such a
>>>>>>> pattern.
>>>>>>>        What do you do in such a case, where neither the error bounds
>>>>>>> nor the confidence will hold?
>>>>>>>
>>>>>>> Q2. The paper mentions a relation between the error bound and the
>>>>>>> buffer size k, as O(sqrt(log(1/delta))/epsilon).
>>>>>>>         What is the actual relation between these two, constants
>>>>>>> included, for a practical application of the algorithm?
>>>>>>>         In your implementation you set the normalized error equal to
>>>>>>> 2.296/k^0.9723.
>>>>>>>         I want to understand how you arrived at this value.
>>>>>>>
>>>>>>> Q3. The space taken by the sketch is bounded by 3K, but the
>>>>>>> algorithm will take more memory for storing weights to answer
>>>>>>> quantiles (after sorting across levels at the end of sketch
>>>>>>> creation; otherwise the bounds might not hold). Why is this not
>>>>>>> included in the memory taken by the sketch?
>>>>>>> A practical application of the paper will need this memory, plus
>>>>>>> additional temporary storage during CDF construction to answer
>>>>>>> queries.
>>>>>>>
>>>>>>> Appreciate your time.
>>>>>>> Thank You,
>>>>>>> Gourav Kumar
>>>>>>>
>>>>>>
>>
>> --
>> Regards,
>> Gourav
>>
>
