Thanks a lot for your answers to my first question, Lee and Justin. Justin, regarding this observation: "*All of that said, the library will not be able to say anything about what errors the user should expect if the data is pre-sampled, because in such a situation there are many factors that are out of the library's control.*"

I am trying to alleviate this problem. I know I can tune the DataSketches computation by trading off memory against accuracy. So, in the scenario where I am constrained to pre-sample the data, is it correct that I should configure the sketch for the best possible accuracy, even if that requires more memory, in order to reduce the impact of my double-sampling problem (the pre-sampling I am constrained to do beforehand plus the sampling performed by DataSketches itself)? And in the scenarios where I am not constrained to pre-sample, could I keep the default DataSketches configuration, with its more balanced trade-off between accuracy and memory requirements?
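To make the question concrete, the kind of tuning I have in mind looks roughly like this (a minimal sketch against the Java `KllFloatsSketch` API; the factory-method names follow a recent datasketches-java release and may differ in older versions, the `k` values are only illustrative, and `loadPreSample()` is a hypothetical stand-in for the 10,000-row sample I am constrained to):

```java
import org.apache.datasketches.kll.KllFloatsSketch;

public class PreSampledQuantiles {

  public static void main(String[] args) {
    // Stand-in for the pre-sampled data I am constrained to work with,
    // e.g. at most 10,000 rows drawn at random from the original table.
    float[] preSample = loadPreSample();

    // Default configuration: the library's default k, balanced memory vs. accuracy.
    KllFloatsSketch balanced = KllFloatsSketch.newHeapInstance();

    // "Accuracy-first" configuration: a larger k gives a smaller normalized
    // rank error, at the cost of more memory. The value 800 is just an example.
    KllFloatsSketch accurate = KllFloatsSketch.newHeapInstance(800);

    for (float v : preSample) {
      balanced.update(v);
      accurate.update(v);
    }

    // The a-priori rank error reported here covers only the sketch itself,
    // not the error introduced by my pre-sampling step.
    System.out.printf("default k: +/- %.4f rank error, median estimate %.3f%n",
        balanced.getNormalizedRankError(false), balanced.getQuantile(0.5));
    System.out.printf("larger  k: +/- %.4f rank error, median estimate %.3f%n",
        accurate.getNormalizedRankError(false), accurate.getQuantile(0.5));
  }

  // Hypothetical helper: in reality this would read the 10,000 sampled rows.
  private static float[] loadPreSample() {
    java.util.Random rnd = new java.util.Random(42);
    float[] a = new float[10_000];
    for (int i = 0; i < a.length; i++) { a[i] = (float) rnd.nextGaussian(); }
    return a;
  }
}
```

My understanding is that the rank error reported by the sketch only covers the sketch's own approximation, not my pre-sampling step, which is why I am wondering whether compensating with a larger k is a sensible best effort.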
Would you say this is a good best-effort strategy? Or would you recommend the same configuration in both cases?

Thanks for your time and feedback,
Sergio

On Thu, Nov 19, 2020 at 1:24 AM Justin Thaler <justin.r.tha...@gmail.com> wrote:

> Lee's response is correct, but I'll elaborate slightly (hopefully this is helpful instead of confusing).
>
> There are some queries for which the following is true: if the data sample is uniform from the original (unsampled) data, then accurate answers with respect to the sample are also accurate with respect to the original (unsampled) data.
>
> As one example, consider quantile queries:
>
> If you have n original data points from an ordered domain and you sample at least t ~= log(n)/epsilon^2 of the data points at random, it is known that, with high probability over the sample, for each domain item i, the fractional rank of i in the sample (i.e., the number of sampled points less than or equal to i, divided by the sample size t) will match the fractional rank of i in the original unsampled data (i.e., the number of data points less than or equal to i, divided by n) up to additive error at most epsilon.
>
> In fact, at a conceptual level, the KLL quantiles algorithm that's implemented in the library is implicitly performing a type of downsampling internally and then summarizing the sample (this is a little bit of a simplification).
>
> Something similar is true for frequent items. However, it is not true for "non-additive" queries such as unique counts.
>
> All of that said, the library will not be able to say anything about what errors the user should expect if the data is pre-sampled, because in such a situation there are many factors that are out of the library's control.
>
> On Wed, Nov 18, 2020 at 3:08 PM leerho <lee...@gmail.com> wrote:
>
>> Sorry, if you pre-sample your data, all bets are off in terms of accuracy.
>>
>> On Wed, Nov 18, 2020 at 10:55 AM Sergio Castro <sergio...@gmail.com> wrote:
>>
>>> Hi, I am new to DataSketches.
>>>
>>> I know DataSketches provides an *approximate* calculation of statistics with *mathematically proven error bounds*.
>>>
>>> My question is:
>>> Say that I am constrained to take a sample of the original data set before handing it to DataSketches (for example, I cannot take more than 10,000 random rows from a table).
>>> What would be the consequence of this prior sampling on the "mathematically proven error bounds" of the DataSketches statistics, relative to the original data set?
>>>
>>> Best,
>>>
>>> Sergio
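P.S. To check my own understanding of the quoted point that a uniform sample of roughly log(n)/epsilon^2 points preserves fractional ranks up to an additive epsilon, I put together this toy experiment (a rough, self-contained sketch with made-up data and parameters, independent of the DataSketches library):

```java
import java.util.Arrays;
import java.util.Random;

public class RankSamplingCheck {
  public static void main(String[] args) {
    Random rnd = new Random(7);
    int n = 1_000_000;        // size of the synthetic "original" data set
    double eps = 0.01;        // target additive rank error
    int t = (int) Math.ceil(Math.log(n) / (eps * eps)); // sample size ~ log(n)/eps^2

    // Synthetic original data: n Gaussian values.
    double[] data = new double[n];
    for (int i = 0; i < n; i++) { data[i] = rnd.nextGaussian(); }

    // Uniform random sample (with replacement) of size t.
    double[] sample = new double[t];
    for (int i = 0; i < t; i++) { sample[i] = data[rnd.nextInt(n)]; }

    Arrays.sort(data);
    Arrays.sort(sample);

    // Compare fractional ranks of a few probe items in the full data vs. the sample.
    double maxGap = 0.0;
    for (double item : new double[] {-2.0, -1.0, 0.0, 0.5, 1.0, 2.0}) {
      double rankFull = fractionalRank(data, item);
      double rankSample = fractionalRank(sample, item);
      maxGap = Math.max(maxGap, Math.abs(rankFull - rankSample));
      System.out.printf("item %5.2f  full %.4f  sample %.4f%n", item, rankFull, rankSample);
    }
    System.out.printf("max observed gap = %.4f (eps = %.4f)%n", maxGap, eps);
  }

  // Fraction of sorted values that fall below x.
  private static double fractionalRank(double[] sorted, double x) {
    int idx = Arrays.binarySearch(sorted, x);
    if (idx < 0) { idx = -idx - 1; } // when x is absent, the insertion point equals the count of values < x
    return (double) idx / sorted.length;
  }
}
```

With these illustrative numbers the observed gaps come out well below epsilon, which matches the quoted bound; I understand this says nothing about non-additive queries such as unique counts.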