Thanks Lee,

This clears up a lot of things. However, I have a few more questions.

Q1. From a user's perspective, can you tell a priori whether the constructed
sketch falls in the 1% category (the bad case)?
      Or, when a user asks a get_quantile query, can the sketch give some
indication of the confidence or error bound for that rank query?
       As a user relying on this algorithm, how can I be sure that I am not
in the 1% case?
       If I can't, then since there are no bounds on rank in the bad case,
won't users be skeptical about using it?

Q2. How did you arrive at the confidence value of 99%?
      The paper mentions a formula for delta, but it's not clear to me how
that translates to 99%.

Q3. Why is the min width's (m_) default value set to 8?
       For compaction to be meaningful, in my opinion any min width greater
than 1 would have been sufficient.
       The other reason that comes to mind is faster convergence and fewer
compactions.
       In that case, why 8 and not an even larger value?

Q4. Also, in compaction you always pick the first level that stores more
elements than its capacity.
       Would the guarantees, or the speed of the algorithm, be affected if
you changed this strategy to,
        let's say, picking the level that overshoots its capacity by the
largest amount?
        In my opinion, that strategy would free up more space and should
lead to fewer compactions, and hence lower empirical error bounds.
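
For concreteness, here is a toy illustration of the two selection strategies
I mean (this is not the real KLL code; the level sizes and the
`(size, capacity)` representation are made up for the example):

```python
# Toy illustration of two compaction-level selection strategies.
# Not the real KLL code: 'levels' is a hypothetical list of
# (current_size, capacity) pairs, one per level.

def first_over_capacity(levels):
    """Current strategy: pick the first level whose size exceeds capacity."""
    for i, (size, cap) in enumerate(levels):
        if size > cap:
            return i
    return None

def most_over_capacity(levels):
    """Proposed strategy: pick the level that overshoots capacity the most."""
    overshoots = [(size - cap, i) for i, (size, cap) in enumerate(levels)
                  if size > cap]
    return max(overshoots)[1] if overshoots else None

levels = [(9, 8), (14, 8), (10, 8)]   # level 1 overshoots by the most
print(first_over_capacity(levels))    # prints 0
print(most_over_capacity(levels))     # prints 1
```

The question is whether switching to the second selector would change the
error guarantees or the per-update cost.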

Thank You,
Gourav

On Tue, 23 Jun 2020 at 07:56, Alexander Saydakov <sayda...@verizonmedia.com>
wrote:

> Adding the original poster just in case he is not subscribed to the list
>
> On Mon, Jun 22, 2020 at 7:18 PM leerho <lee...@gmail.com> wrote:
>
>> I see a typo: What I called the Omega relation is actually Omicron (big
>> "Oh")  :)
>>
>>
>> On Mon, Jun 22, 2020 at 6:22 PM Justin Thaler <justin.r.tha...@gmail.com>
>> wrote:
>>
>>> Your response looks great to me.
>>>
>>> I suppose Gourav will respond if he is still confused on any issues. And
>>> hopefully Edo will read this over at some point and double-check both of
>>> our work :-)
>>>
>>> On Mon, Jun 22, 2020 at 9:14 PM leerho <lee...@gmail.com> wrote:
>>>
>>>> Hello Gourav, welcome to this forum!
>>>>
>>>> I want to make sure you have access to and have read the code
>>>> documentation for the KLL sketch in addition to the papers.  Although the
>>>> code documentation exists for both Java and C++, it is a little easier to
>>>> access the Javadocs as they are accessible from the website.  There is a
>>>> long section about the KLL sketch
>>>> <https://datasketches.apache.org/api/java/snapshot/apidocs/index.html>
>>>> and its implementation that may answer some of your questions.
>>>>
>>>> It is important to emphasize that the error of the quantile sketch is
>>>> relevant only to the rank, not the value. It is not clear from your
>>>> question which one you are referring to.
>>>>
>>>> 1. I asked Justin Thaler, one of our scientists, how to better explain
>>>> the error "guarantee" on the sketch.  Hopefully, he can respond here if
>>>> what I write here is not totally correct :)
>>>>
>>>> When you feed a stream of values into a KLL (or Quantiles) sketch the
>>>> result is actually a set of values (kept internally by the sketch) that
>>>> have been "sampled" from the stream using a random process. Let's call this
>>>> set a "*result set*".  When we issue a query against the sketch it
>>>> analyzes this *result set* and returns a value (or a rank depending on
>>>> the query).
>>>>
>>>> The guarantee goes something like this:
>>>>
>>>> With 99% probability, the *result set* of a sketch will be a set such
>>>> that a *getQuantile(rank)* query from that *result set* will produce
>>>> an *estimated quantile (or value)* such that the *true rank of the
>>>> true value* will be within +/- the absolute fractional rank error
>>>> returned by the function getNormalizedRankError(k, pmf).  If the *result
>>>> set* is a good one, i.e., one of the good 99%-of-the-time sketches,
>>>> then the guarantee is that *all possible queries* will fall within
>>>> the +/- rank error interval defined above.
>>>>
>>>> Because of the random sampling process used by the sketch, there is a
>>>> 1% chance that the *result set* is bad.  We don't know how bad: it
>>>> could mess up only one query, or it could mess up all queries from
>>>> that bad *result set*.
>>>>
>>>> This applies to both of your Q1 scenarios, given the caveat above, and
>>>> is actually a stronger guarantee.
>>>>
>>>> There is no guarantee on the accuracy of values; however, you can get a
>>>> sense of the range of values along the value axis that corresponds to
>>>> the confidence interval along the rank axis.  For example: suppose your
>>>> rank error is 1% and getQuantile(0.5) returns a value.  Now query
>>>> getQuantile(0.5 - 0.01) and getQuantile(0.5 + 0.01); those two results
>>>> give the corresponding range of values.
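
That rank-to-value interval can be sketched as follows. This is illustrative
only: a plain sorted list stands in for the sketch, so `get_quantile` here is
exact, whereas the real sketch answers within its rank error.

```python
# Illustrative only: a sorted list stands in for a KLL sketch, so this
# get_quantile is exact; the real sketch answers within its rank error.

def get_quantile(sorted_values, rank):
    """Value at a normalized rank in [0, 1] over an exact sorted sample."""
    idx = min(int(rank * len(sorted_values)), len(sorted_values) - 1)
    return sorted_values[idx]

stream = list(range(1000))   # hypothetical stream of values 0..999
eps = 0.01                   # assumed normalized rank error of 1%

median_est = get_quantile(stream, 0.5)
lo = get_quantile(stream, 0.5 - eps)   # lower end of the value interval
hi = get_quantile(stream, 0.5 + eps)   # upper end of the value interval
print(lo, median_est, hi)
```

The pair (lo, hi) brackets the values whose true ranks could plausibly map
to the queried rank, which is the "range of values" described above.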
>>>>
>>>> Q2. Your Q2 is confusing, because I think you are conflating the
>>>> accuracy (1/epsilon) with the confidence (1/delta).  The Omega equation is
>>>> O(sqrt(log(1/delta))/epsilon).
>>>> There is actually a code comment in the Java version that states:
>>>>
>>>>   // constants were derived as the best fit to 99 percentile
>>>>> empirically measured max error in
>>>>>   // thousands of trials
>>>>>   public static double getNormalizedRankError(final int k, final
>>>>> boolean pmf) {
>>>>>     return pmf
>>>>>         ? 2.446 / pow(k, 0.9433)
>>>>>         : 2.296 / pow(k, 0.9723);
>>>>>   }
>>>>
>>>>
>>>> The hidden constant values behind the Omega equation above are not
>>>> discussed in the paper.  So we ran many thousands of trials to
>>>> empirically determine what those constants are; the equations given are
>>>> the result of that study.
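
For reference, the Java method quoted above ports directly; for the default
k = 200, the single-quantile formula gives roughly a 1.3% normalized rank
error. (This port is an illustration, not the library's API.)

```python
# Direct port of the Java getNormalizedRankError() quoted above
# (constants were fit empirically by the DataSketches team).

def get_normalized_rank_error(k: int, pmf: bool) -> float:
    if pmf:
        return 2.446 / k ** 0.9433   # error bound for getPMF/getCDF queries
    return 2.296 / k ** 0.9723       # error bound for single-rank queries

err = get_normalized_rank_error(200, False)   # default k = 200
print(f"{err:.4f}")                           # roughly 0.013, i.e. ~1.3%
```

Note that the error shrinks as k grows, and the PMF/CDF bound is slightly
looser than the single-rank bound for the same k.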
>>>>
>>>> Q3.  Space is initially limited to 3K, but for very long streams the
>>>> space will grow very slowly ... something along the lines of
>>>> *log_base8(N/k)*.  We need to put growth tables on the website like we
>>>> did for the older QuantilesSketch; we just haven't gotten around to it.
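
Taking the log_base8(N/k) growth at face value, a back-of-the-envelope
estimate shows how slowly the level count grows with stream length N. The
formula below is an assumption for illustration, not the library's actual
sizing code:

```python
import math

# Back-of-the-envelope level-count estimate based on the log_base8(N/k)
# growth mentioned above; assumed formula, not the library's sizing code.

def approx_levels(n: int, k: int = 200) -> int:
    if n <= k:
        return 1
    return 1 + math.ceil(math.log(n / k, 8))

for n in (10**3, 10**6, 10**9, 10**12):
    print(n, approx_levels(n))
```

Even at a trillion updates the estimate stays around a dozen levels, which
is why the space growth is negligible in practice.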
>>>>
>>>> You are correct that to quickly answer forward queries (like Q(r)) we
>>>> build a temporary, internal aux table.  If you have multiple queries
>>>> against the same sketch, this speeds up the query process considerably.
>>>> For all the other queries (R(q), CDF, PDF), this aux table is not
>>>> needed.  It is also not needed when the sketch is serialized, as it can
>>>> be reconstructed at any time.  You are correct that we could document
>>>> this temporary size increase better.  However, the speed gain is
>>>> considerable, and most of our users prefer the speed gain at query time
>>>> and don't worry much about the temporary allocation of the aux table.
>>>> This is clearly a speed/space optimization we made as part of the
>>>> engineering of the design.
>>>>
>>>> I hope this helps!
>>>>
>>>> Cheers,
>>>>
>>>> Lee.
>>>>
>>>>
>>>> On Sun, Jun 21, 2020 at 11:13 AM Gourav Kumar <gourav1...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi All,
>>>>> Hope you are well in these COVID times.
>>>>>
>>>>> I was going through the implementation of the KLL algorithm based on
>>>>> the paper "Streaming Quantiles Algorithms with Small Space and Update
>>>>> Time" and wanted to understand it better, with an eye toward
>>>>> applications in databases. (cpp version on apache github.)
>>>>> I had a few questions on the paper that I couldn't get my head around;
>>>>> I would appreciate some help with these:
>>>>>
>>>>> Q1. What is the meaning of confidence in the KLL algorithm?
>>>>>        Does it mean that, once a sketch has been constructed and we
>>>>> start asking quantile queries, confidence tells how many of those
>>>>> queries will fall under the normalized error bound?
>>>>>         OR
>>>>>        If I construct the sketch n times on some data and ask the same
>>>>> query q each time, confidence tells in how many cases out of these n
>>>>> times q will fall under the normalized error bounds?
>>>>>
>>>>>        I tested the apache version with the deterministic bits that
>>>>> you have under the KLL_VALIDATION flag.
>>>>>        There, for many (different) queries I got values beyond the
>>>>> error bounds, and the number of such queries was way beyond a
>>>>> confidence of 99%.
>>>>>        The deterministic bits follow a pattern (010101...), and the
>>>>> random bit generation could also at some point generate such a pattern.
>>>>>        What do you do in such a case, where neither the error bounds
>>>>> nor the confidence will hold?
>>>>>
>>>>> Q2. The paper mentions a relation between the error bound and the
>>>>> buffer size k, as O(sqrt(log(1/delta)) / epsilon).
>>>>>         What is the actual relation between these two, constants
>>>>> included, for a practical application of the algorithm?
>>>>>         In your implementation you set the normalized error equal to
>>>>> 2.296/k^0.9723.
>>>>>         I want to understand how you got to this value.
>>>>>
>>>>> Q3. The space taken by the sketch is bounded by 3K, but the algorithm
>>>>> takes more memory for storing weights to answer quantiles (after
>>>>> sorting across levels at the end of sketch creation, otherwise the
>>>>> bounds might not hold). Why is this not included in the memory taken
>>>>> by the sketch?
>>>>> A practical application of the paper will need this memory, plus
>>>>> additional temporary storage during CDF construction to answer queries.
>>>>>
>>>>> Appreciate your time.
>>>>> Thank You,
>>>>> Gourav Kumar
>>>>>
>>>>

-- 
Regards,
Gourav
