Hi All, Hope you are well in these COVID times. I was going through implementation of the KLL algorithm based on the paper "Streaming Quantiles Algorithms with Small Space and Update Time" and wanted to understand it more to understand applications in databases. (cpp version on apache github.) I had a few questions related to the paper on which I couldn't get my head around, I would appreciate if I could get some help on these:
Q1. What is the meaning of confidence in the KLL algorithm? Does it mean that once a sketch has been constructed, we start asking quantile queries. Confidence will tell how many of these queries will fall under the normalized error bound. OR If I construct the sketch n times on any data and ask the same query q each time, confidence will tell in how many cases out of these n times q will fall under the normalized error bounds. I tested the apache version with deterministic bits, that you have under KLL_VALIDATION flag. There for many queries(different ones) I got values beyond error bounds, and the number of such queries was way beyond confidence of 99%. The deterministic bits followed a pattern (010101..), the random bit generation could also at some time generate such a pattern. What do you do in such a case, where error bounds and confidence both will not hold? Q2. Though the paper mentions a relation between error bound and buffer size k, as O(sqrt(log e) / e)) What will be the actual relation between these two wrt constants as well, when we try to get a practical application of the algorithm. In your implementation you have added normalized error to be equal to 2.296/k^0.9723. I want to understand how you got to this value. Q3. Space taken by sketch is bounded by 3K, but the algorithm will take more memory for storing weights to answer quantiles (after sorting across levels at the end of sketch creation, otherwise bounds might not hold). Why are these not included in memory taken by sketch? As a practical application of the paper will need these memories and an additional tmp storage space as well during cdf construction to answer queries. Appreciate your time. Thank You, Gourav Kumar