Hi All,
Hope you are well in these COVID times.

I was going through implementation of the KLL algorithm based on the paper
"Streaming Quantiles Algorithms with Small Space and Update Time" and
wanted to understand it more to understand applications in databases.
(cpp version on apache github.)
I had a few questions related to the paper on which I couldn't get my head
around, I would appreciate if I could get some help on these:

Q1. What is the meaning of confidence in the KLL algorithm?
       Does it mean that once a sketch has been constructed, we start
asking quantile queries. Confidence will tell how many of these queries
will fall under the normalized error bound.
        OR
       If I construct the sketch n times on any data and ask the same query
q each time, confidence will tell in how many cases out of these n times q
will fall under the normalized error bounds.

       I tested the apache version with deterministic bits, that you have
under KLL_VALIDATION flag.
       There for many queries(different ones) I got values beyond error
bounds, and the number of such queries was way beyond confidence of 99%.
       The deterministic bits followed a pattern (010101..), the random bit
generation could also at some time generate such a pattern.
       What do you do in such a case, where error bounds and confidence
both will not hold?

Q2. Though the paper mentions a relation between error bound and buffer
size k, as O(sqrt(log e) / e))
        What will be the actual relation between these two wrt constants as
well, when we try to get a practical application of the algorithm.
        In your implementation you have added normalized error to be equal
to 2.296/k^0.9723.
        I want to understand how you got to this value.

Q3. Space taken by sketch is bounded by 3K, but the algorithm will take
more memory for storing weights to answer quantiles (after sorting across
levels at the end of sketch creation, otherwise bounds might not hold). Why
are these not included in memory taken by sketch?
As a practical application of the paper will need these memories and an
additional tmp storage space as well during cdf construction to answer
queries.

Appreciate your time.
Thank You,
Gourav Kumar

Reply via email to