I think the way to think about this is the following. If you downsample and then sketch, there are two sources of error: sampling error and sketching error. The former refers to how much the answer to your query over the sample deviates from the answer over the original data, while the latter refers to how much the estimate returned by the sketch deviates from the exact answer on the sample.
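If it helps to see that decomposition concretely, here is a rough Python sketch of what I mean. It assumes the Python datasketches bindings (kll_floats_sketch) and uses a made-up lognormal data set with an arbitrary 10,000-row pre-sample; the specific numbers don't matter, only how the two error terms are separated:

# Rough illustration only: separating sampling error from sketching error
# for a single rank query. Assumes the Python datasketches bindings
# (kll_floats_sketch); data set, sample size, and k are arbitrary choices.
import numpy as np
from datasketches import kll_floats_sketch

rng = np.random.default_rng(42)
data = rng.lognormal(size=1_000_000).astype(np.float32)   # the "original" data
sample = rng.choice(data, size=10_000, replace=False)     # the pre-sample you are constrained to

sk = kll_floats_sketch(200)        # sketch built on the sample only
for x in sample:
    sk.update(x)

q = float(np.quantile(data, 0.99))                 # query point: the true p99 value

true_rank   = float(np.mean(data <= q))            # exact fractional rank in the original data
sample_rank = float(np.mean(sample <= q))          # exact fractional rank in the sample
sketch_rank = sk.get_rank(q)                       # sketch's estimate of the sample's rank

print("sampling error :", abs(sample_rank - true_rank))
print("sketching error:", abs(sketch_rank - sample_rank))
print("total error    :", abs(sketch_rank - true_rank))

The first number is governed by the sample size; the second by the sketch's accuracy parameter k.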
If the sampling error is very large, then no matter how accurate your sketch is, your total error will be large, so you won't gain anything by throwing resources into minimizing sketching error. If the sampling error is very small, then there's not really a need to drive the sketching error any lower than you would otherwise choose it to be. So as a practical matter, my personal recommendation would be to make sure your sample is big enough that the sampling error is very small, and then set the sketching error as you normally would, ignoring the subsampling.

In case it's helpful, I should mention that there has been (at least) one academic paper devoted to precisely this question, namely what the best approach to sketching is for various query classes if the data must first be subsampled, in case you'd like to check it out: https://core.ac.uk/download/pdf/212809966.pdf

I should reiterate that there are certain types of queries that inherently don't play well with random sampling (i.e., it's basically impossible to give a meaningful bound on the sampling error, at least without making assumptions about the data, which is something the error guarantees provided by the library assiduously avoid).

On Thu, Nov 19, 2020 at 7:20 AM Sergio Castro <sergio...@gmail.com> wrote:

> Thanks a lot for your answers to my first question, Lee and Justin.
>
> Justin, regarding this observation: "*All of that said, the library will not be able to say anything about what errors the user should expect if the data is pre-sampled, because in such a situation there are many factors that are out of the library's control.*"
> Trying to alleviate this problem, I know I can tune the DataSketches computation by trading off memory vs. accuracy.
> So is it correct that in the scenario where I am constrained to pre-sample the data, I should aim for the best optimization for accuracy, even if this will require more memory, with the objective of alleviating the impact of my double sampling problem (meaning the pre-sampling I am constrained to do before, plus the sampling performed by DataSketches itself)? While in the scenarios where I am not constrained to use pre-sampling, I could still use the default DataSketches configuration with a more balanced trade-off between accuracy and memory requirements?
>
> Would you say this is a good best-effort strategy? Or would you recommend that I use the same configuration in both cases?
>
> Thanks for your time and feedback,
>
> Sergio
>
>
> On Thu, Nov 19, 2020 at 1:24 AM Justin Thaler <justin.r.tha...@gmail.com> wrote:
>
>> Lee's response is correct, but I'll elaborate slightly (hopefully this is helpful instead of confusing).
>>
>> There are some queries for which the following is true: if the data sample is uniform from the original (unsampled) data, then accurate answers with respect to the sample are also accurate with respect to the original (unsampled) data.
>>
>> As one example, consider quantile queries:
>>
>> If you have n original data points from an ordered domain and you sample at least t ~= log(n)/epsilon^2 of the data points at random, it is known that, with high probability over the sample, for each domain item i, the fractional rank of i in the sample (i.e., the number of sampled points less than or equal to i, divided by the sample size t) will match the fractional rank of i in the original unsampled data (i.e., the number of data points less than or equal to i, divided by n) up to additive error at most epsilon.
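To see that bound in action, here is a quick numeric check in plain NumPy. No sketch is involved, and n, epsilon, and the normal data are arbitrary choices for illustration:

# Quick numeric check of the bound above (plain NumPy, no sketch involved).
# n, epsilon, and the normal data are arbitrary choices for illustration.
import numpy as np

rng = np.random.default_rng(7)
n, eps = 1_000_000, 0.01
t = int(np.log(n) / eps**2)                    # t ~= log(n)/epsilon^2, about 138,000 here

data = np.sort(rng.normal(size=n))
sample = np.sort(rng.choice(data, size=t, replace=False))

# Fractional ranks of a grid of query points, in the full data and in the sample.
queries = np.quantile(data, np.linspace(0.001, 0.999, 999))
true_ranks   = np.searchsorted(data, queries, side="right") / n
sample_ranks = np.searchsorted(sample, queries, side="right") / t

print("max rank deviation:", np.abs(sample_ranks - true_ranks).max(), "(epsilon =", eps, ")")

With t around 138,000 the worst-case rank deviation over the query grid typically comes out well under epsilon = 0.01, which is the kind of guarantee described above.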
>> In fact, at a conceptual level, the KLL quantiles algorithm that's implemented in the library is implicitly performing a type of downsampling internally and then summarizing the sample (this is a little bit of a simplification).
>>
>> Something similar is true for frequent items. However, it is not true for "non-additive" queries such as unique counts.
>>
>> All of that said, the library will not be able to say anything about what errors the user should expect if the data is pre-sampled, because in such a situation there are many factors that are out of the library's control.
>>
>> On Wed, Nov 18, 2020 at 3:08 PM leerho <lee...@gmail.com> wrote:
>>
>>> Sorry, if you pre-sample your data, all bets are off in terms of accuracy.
>>>
>>> On Wed, Nov 18, 2020 at 10:55 AM Sergio Castro <sergio...@gmail.com> wrote:
>>>
>>>> Hi, I am new to DataSketches.
>>>>
>>>> I know DataSketches provides an *approximate* calculation of statistics with *mathematically proven error bounds*.
>>>>
>>>> My question is:
>>>> Say that I am constrained to take a sample of the original data set before handing it to DataSketches (for example, I cannot take more than 10,000 random rows from a table).
>>>> What would be the consequence of this prior sampling on the "mathematically proven error bounds" of the DataSketches statistics, relative to the original data set?
>>>>
>>>> Best,
>>>>
>>>> Sergio