Thanks a lot for your answers to my first question, Lee and Justin. Justin, regarding this observation: "*All of that said, the library will not be able to say anything about what errors the user should expect if the data is pre-sampled, because in such a situation there are many factors that are out of the library's control.*"

I am trying to alleviate this problem. I know I can tune the DataSketches computation by trading off memory against accuracy. So, in the scenario where I am constrained to pre-sample the data, is it correct that I should configure the sketch for the best possible accuracy, even if that requires more memory, in order to reduce the impact of my double-sampling problem (the pre-sampling I am constrained to do beforehand plus the sampling performed by DataSketches itself)? And in the scenarios where I am not constrained to pre-sample, could I keep the default DataSketches configuration, with its more balanced trade-off between accuracy and memory requirements?
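To make the question concrete, the kind of tuning I have in mind looks roughly like this (a minimal sketch against the Java `KllFloatsSketch` API; the factory-method names follow a recent datasketches-java release and may differ in older versions, the `k` values are only illustrative, and `loadPreSample()` is a hypothetical stand-in for the 10,000-row sample I am constrained to):

```java
import org.apache.datasketches.kll.KllFloatsSketch;

public class PreSampledQuantiles {

  public static void main(String[] args) {
    // Stand-in for the pre-sampled data I am constrained to work with,
    // e.g. at most 10,000 rows drawn at random from the original table.
    float[] preSample = loadPreSample();

    // Default configuration: the library's default k, balanced memory vs. accuracy.
    KllFloatsSketch balanced = KllFloatsSketch.newHeapInstance();

    // "Accuracy-first" configuration: a larger k gives a smaller normalized
    // rank error, at the cost of more memory. The value 800 is just an example.
    KllFloatsSketch accurate = KllFloatsSketch.newHeapInstance(800);

    for (float v : preSample) {
      balanced.update(v);
      accurate.update(v);
    }

    // The a-priori rank error reported here covers only the sketch itself,
    // not the error introduced by my pre-sampling step.
    System.out.printf("default k: +/- %.4f rank error, median estimate %.3f%n",
        balanced.getNormalizedRankError(false), balanced.getQuantile(0.5));
    System.out.printf("larger  k: +/- %.4f rank error, median estimate %.3f%n",
        accurate.getNormalizedRankError(false), accurate.getQuantile(0.5));
  }

  // Hypothetical helper: in reality this would read the 10,000 sampled rows.
  private static float[] loadPreSample() {
    java.util.Random rnd = new java.util.Random(42);
    float[] a = new float[10_000];
    for (int i = 0; i < a.length; i++) { a[i] = (float) rnd.nextGaussian(); }
    return a;
  }
}
```

My understanding is that the rank error reported by the sketch only covers the sketch's own approximation, not my pre-sampling step, which is why I am wondering whether compensating with a larger k is a sensible best effort.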
Would you say this is a good best-effort strategy? Or would you recommend the same configuration in both cases?

Thanks for your time and feedback,
Sergio

On Thu, Nov 19, 2020 at 1:24 AM Justin Thaler <justin.r.tha...@gmail.com> wrote:

> Lee's response is correct, but I'll elaborate slightly (hopefully this is helpful instead of confusing).
>
> There are some queries for which the following is true: if the data sample is uniform from the original (unsampled) data, then accurate answers with respect to the sample are also accurate with respect to the original (unsampled) data.
>
> As one example, consider quantile queries:
>
> If you have n original data points from an ordered domain and you sample at least t ~= log(n)/epsilon^2 of the data points at random, it is known that, with high probability over the sample, for each domain item i, the fractional rank of i in the sample (i.e., the number of sampled points less than or equal to i, divided by the sample size t) will match the fractional rank of i in the original unsampled data (i.e., the number of data points less than or equal to i, divided by n) up to additive error at most epsilon.
>
> In fact, at a conceptual level, the KLL quantiles algorithm that's implemented in the library is implicitly performing a type of downsampling internally and then summarizing the sample (this is a little bit of a simplification).
>
> Something similar is true for frequent items. However, it is not true for "non-additive" queries such as unique counts.
>
> All of that said, the library will not be able to say anything about what errors the user should expect if the data is pre-sampled, because in such a situation there are many factors that are out of the library's control.
>
> On Wed, Nov 18, 2020 at 3:08 PM leerho <lee...@gmail.com> wrote:
>
>> Sorry, if you pre-sample your data, all bets are off in terms of accuracy.
>>
>> On Wed, Nov 18, 2020 at 10:55 AM Sergio Castro <sergio...@gmail.com> wrote:
>>
>>> Hi, I am new to DataSketches.
>>>
>>> I know DataSketches provides an *approximate* calculation of statistics with *mathematically proven error bounds*.
>>>
>>> My question is:
>>> Say that I am constrained to take a sample of the original data set before handing it to DataSketches (for example, I cannot take more than 10,000 random rows from a table).
>>> What would be the consequence of this prior sampling on the "mathematically proven error bounds" of the DataSketches statistics, relative to the original data set?
>>>
>>> Best,
>>>
>>> Sergio
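P.S. To check my own understanding of the quoted point that a uniform sample of roughly log(n)/epsilon^2 points preserves fractional ranks up to an additive epsilon, I put together this toy experiment (a rough, self-contained sketch with made-up data and parameters, independent of the DataSketches library):

```java
import java.util.Arrays;
import java.util.Random;

public class RankSamplingCheck {
  public static void main(String[] args) {
    Random rnd = new Random(7);
    int n = 1_000_000;        // size of the synthetic "original" data set
    double eps = 0.01;        // target additive rank error
    int t = (int) Math.ceil(Math.log(n) / (eps * eps)); // sample size ~ log(n)/eps^2

    // Synthetic original data: n Gaussian values.
    double[] data = new double[n];
    for (int i = 0; i < n; i++) { data[i] = rnd.nextGaussian(); }

    // Uniform random sample (with replacement) of size t.
    double[] sample = new double[t];
    for (int i = 0; i < t; i++) { sample[i] = data[rnd.nextInt(n)]; }

    Arrays.sort(data);
    Arrays.sort(sample);

    // Compare fractional ranks of a few probe items in the full data vs. the sample.
    double maxGap = 0.0;
    for (double item : new double[] {-2.0, -1.0, 0.0, 0.5, 1.0, 2.0}) {
      double rankFull = fractionalRank(data, item);
      double rankSample = fractionalRank(sample, item);
      maxGap = Math.max(maxGap, Math.abs(rankFull - rankSample));
      System.out.printf("item %5.2f  full %.4f  sample %.4f%n", item, rankFull, rankSample);
    }
    System.out.printf("max observed gap = %.4f (eps = %.4f)%n", maxGap, eps);
  }

  // Fraction of sorted values that fall below x.
  private static double fractionalRank(double[] sorted, double x) {
    int idx = Arrays.binarySearch(sorted, x);
    if (idx < 0) { idx = -idx - 1; } // when x is absent, the insertion point equals the count of values < x
    return (double) idx / sorted.length;
  }
}
```

With these illustrative numbers the observed gaps come out well below epsilon, which matches the quoted bound; I understand this says nothing about non-additive queries such as unique counts.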