Hi Justin, the site you referenced returns an error 500 (internal server error). It might be down or out of service. You might also check that it is the correct URL.
Thanks! Lee.

On Thu, Nov 19, 2020 at 6:05 AM Justin Thaler <justin.r.tha...@gmail.com> wrote:

> I think the way to think about this is the following. If you downsample and then sketch, there are two sources of error: sampling error and sketching error. The former refers to how much the answer to your query over the sample deviates from the answer over the original data, while the latter refers to how much the estimate returned by the sketch deviates from the exact answer on the sample.
>
> If the sampling error is very large, then no matter how accurate your sketch is, your total error will be large, so you won't gain anything by throwing resources into minimizing sketching error.
>
> If the sampling error is very small, then there's not really a need to drive sketching error any lower than you would otherwise choose it to be.
>
> So as a practical matter, my personal recommendation would be to make sure your sample is big enough that the sampling error is very small, and then set the sketching error as you normally would, ignoring the subsampling.
>
> In case it's helpful, I should mention that there has been (at least) one academic paper devoted to precisely the question of what the best approach to sketching is for various query classes when the data must first be subsampled, in case you'd like to check it out: https://core.ac.uk/download/pdf/212809966.pdf
>
> I should reiterate that there are certain types of queries that inherently don't play well with random sampling (i.e., it's basically impossible to give a meaningful bound on the sampling error, at least without making assumptions about the data, which is something the error guarantees provided by the library assiduously avoid).
>
> On Thu, Nov 19, 2020 at 7:20 AM Sergio Castro <sergio...@gmail.com> wrote:
>
>> Thanks a lot for your answers to my first question, Lee and Justin.
>>
>> Justin, regarding this observation: "*All of that said, the library will not be able to say anything about what errors the user should expect if the data is pre-sampled, because in such a situation there are many factors that are out of the library's control.*"
>>
>> Trying to alleviate this problem: I know I can tune the DataSketches computation by trading off memory against accuracy. So is it correct that, in the scenario where I am constrained to pre-sample the data, I should optimize for accuracy even if this requires more memory, with the objective of reducing the impact of my double-sampling problem (meaning the pre-sampling I am constrained to do, plus the sampling performed by DataSketches itself)? And that in the scenarios where I am not constrained to pre-sample, I could still use the default DataSketches configuration with a more balanced trade-off between accuracy and memory requirements?
>>
>> Would you say this is a good best-effort strategy? Or would you recommend the same configuration in both cases?
>>
>> Thanks for your time and feedback,
>>
>> Sergio
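To put rough numbers on Justin's two-error-sources point, and on the 10,000-row cap in Sergio's question, here is a back-of-the-envelope sketch in plain Python. It is not DataSketches API, and it simply plugs into the t ~= log(n)/epsilon^2 sample-size bound Justin mentions below for quantile/rank queries; the constant c is an assumption (it depends on the desired failure probability), so treat the outputs as orders of magnitude only.

# Back-of-the-envelope only: the constant c in the t ~= c*log(n)/epsilon^2
# rule of thumb is an assumption (it depends on the desired failure
# probability), so treat these numbers as orders of magnitude.
import math

def sample_size_for_rank_error(n, eps, c=1.0):
    # Uniform-sample size t so that fractional ranks on the sample stay
    # within ~eps of fractional ranks on the full data, with high probability.
    return math.ceil(c * math.log(n) / eps ** 2)

def rank_error_for_sample_size(n, t, c=1.0):
    # Inverted: roughly the sampling error a sample of size t buys you.
    return math.sqrt(c * math.log(n) / t)

n = 1_000_000_000                                         # original data size
print(sample_size_for_rank_error(n, 0.005))               # ~830,000 rows for ~0.5% rank error
print(round(rank_error_for_sample_size(n, 10_000), 3))    # 10,000-row cap -> ~0.046 (4.6%)

Read this way, a 10,000-row cap already pins the sampling error at a few percent of rank, so pushing the sketch far beyond its default accuracy mostly spends memory on the smaller of the two error terms; when the sample can be made large enough that the sampling error is negligible, the default configuration is the natural choice. This is just a restatement of the recommendation above, not a library guarantee.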
>> On Thu, Nov 19, 2020 at 1:24 AM Justin Thaler <justin.r.tha...@gmail.com> wrote:
>>
>>> Lee's response is correct, but I'll elaborate slightly (hopefully this is helpful rather than confusing).
>>>
>>> There are some queries for which the following is true: if the sample is uniform from the original (unsampled) data, then accurate answers with respect to the sample are also accurate with respect to the original (unsampled) data.
>>>
>>> As one example, consider quantile queries:
>>>
>>> If you have n original data points from an ordered domain and you sample at least t ~= log(n)/epsilon^2 of the data points at random, it is known that, with high probability over the sample, for each domain item i, the fractional rank of i in the sample (i.e., the number of sampled points less than or equal to i, divided by the sample size t) will match the fractional rank of i in the original unsampled data (i.e., the number of data points less than or equal to i, divided by n) up to an additive error of at most epsilon.
>>>
>>> In fact, at a conceptual level, the KLL quantiles algorithm that's implemented in the library is implicitly performing a type of downsampling internally and then summarizing the sample (this is a bit of a simplification).
>>>
>>> Something similar is true for frequent items. However, it is not true for "non-additive" queries such as unique counts.
>>>
>>> All of that said, the library will not be able to say anything about what errors the user should expect if the data is pre-sampled, because in such a situation there are many factors that are out of the library's control.
>>>
>>> On Wed, Nov 18, 2020 at 3:08 PM leerho <lee...@gmail.com> wrote:
>>>
>>>> Sorry, if you pre-sample your data, all bets are off in terms of accuracy.
>>>>
>>>> On Wed, Nov 18, 2020 at 10:55 AM Sergio Castro <sergio...@gmail.com> wrote:
>>>>
>>>>> Hi, I am new to DataSketches.
>>>>>
>>>>> I know DataSketches provides an *approximate* calculation of statistics with *mathematically proven error bounds*.
>>>>>
>>>>> My question is: say that I am constrained to take a sample of the original data set before handing it to DataSketches (for example, I cannot take more than 10,000 random rows from a table). What would be the consequence of this prior sampling on the "mathematically proven error bounds" of the DataSketches statistics, relative to the original data set?
>>>>>
>>>>> Best,
>>>>>
>>>>> Sergio
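To make the quantiles-versus-unique-counts contrast above concrete, here is a small self-contained simulation in plain Python. It makes no DataSketches calls, and the data distribution and sizes are arbitrary choices for illustration: fractional ranks survive uniform sampling with a small additive error, while distinct counts cannot be extrapolated from a sample without assumptions about the frequency distribution.

import random
from bisect import bisect_right

random.seed(1)
n, t = 1_000_000, 10_000

# Toy data with heavy duplication: 1,000 distinct values, each repeated 1,000 times.
data = [v for v in range(1_000) for _ in range(1_000)]
sample = random.sample(data, t)          # uniform sample without replacement

data_sorted, sample_sorted = sorted(data), sorted(sample)

def frac_rank(sorted_vals, x):
    # Fraction of values less than or equal to x.
    return bisect_right(sorted_vals, x) / len(sorted_vals)

# Quantile-style queries: fractional ranks on the sample track the full data
# to within a small additive error.
worst = max(abs(frac_rank(data_sorted, x) - frac_rank(sample_sorted, x))
            for x in range(0, 1_000, 50))
print(f"max fractional-rank deviation: {worst:.4f}")   # roughly 0.01 here

# Unique-count queries: there is no safe way to extrapolate from the sample.
true_distinct = len(set(data))           # 1,000
sample_distinct = len(set(sample))       # ~1,000 (nearly every value is hit)
naive_scaled = sample_distinct * n // t  # ~100,000: about 100x too high
print(true_distinct, sample_distinct, naive_scaled)
# If every value occurred exactly once instead, the unscaled count would be
# ~100x too low and the scaled one roughly right, so no fixed rule works.

The point of the last few lines is that neither the raw sample count nor the n/t scale-up is reliable without knowing the frequency distribution, which is exactly the kind of assumption about the data that the library's error guarantees avoid making.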