Hi Justin, the site you referenced returns an error 500 (internal server error). It might be down or out of service. You might also check that it is the correct URL.
Thanks! Lee.

On Thu, Nov 19, 2020 at 6:05 AM Justin Thaler <justin.r.tha...@gmail.com> wrote:

> I think the way to think about this is the following. If you downsample and then sketch, there are two sources of error: sampling error and sketching error. The former refers to how much the answer to your query over the sample deviates from the answer over the original data, while the latter refers to how much the estimate returned by the sketch deviates from the exact answer on the sample.
>
> If the sampling error is very large, then no matter how accurate your sketch is, your total error will be large, so you won't gain anything by throwing resources into minimizing sketching error.
>
> If the sampling error is very small, then there's not really a need to drive sketching error any lower than you would otherwise choose it to be.
>
> So as a practical matter, my personal recommendation would be to make sure your sample is big enough that the sampling error is very small, and then set the sketching error as you normally would, ignoring the subsampling.
>
> In case it's helpful, I should mention that there has been (at least) one academic paper devoted to precisely the question of what the best approach to sketching is for various query classes when the data must first be subsampled, in case you'd like to check it out: https://core.ac.uk/download/pdf/212809966.pdf
>
> I should reiterate that there are certain types of queries that inherently don't play well with random sampling (i.e., it's basically impossible to give a meaningful bound on the sampling error, at least without making assumptions about the data, which is something the error guarantees provided by the library assiduously avoid).
>
> On Thu, Nov 19, 2020 at 7:20 AM Sergio Castro <sergio...@gmail.com> wrote:
>
>> Thanks a lot for your answers to my first question, Lee and Justin.
>>
>> Justin, regarding this observation: "*All of that said, the library will not be able to say anything about what errors the user should expect if the data is pre-sampled, because in such a situation there are many factors that are out of the library's control.*"
>>
>> Trying to alleviate this problem: I know I can tune the DataSketches computation by trading off memory against accuracy. So is it correct that, in the scenario where I am constrained to pre-sample the data, I should optimize for accuracy even if this requires more memory, with the objective of reducing the impact of my double-sampling problem (meaning the pre-sampling I am constrained to do, plus the sampling performed by DataSketches itself)? And that in the scenarios where I am not constrained to pre-sample, I could still use the default DataSketches configuration with a more balanced trade-off between accuracy and memory requirements?
>>
>> Would you say this is a good best-effort strategy? Or would you recommend the same configuration in both cases?
>>
>> Thanks for your time and feedback,
>>
>> Sergio
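To put rough numbers on Justin's two-error-sources point, and on the 10,000-row cap in Sergio's question, here is a back-of-the-envelope sketch in plain Python. It is not DataSketches API, and it simply plugs into the t ~= log(n)/epsilon^2 sample-size bound Justin mentions below for quantile/rank queries; the constant c is an assumption (it depends on the desired failure probability), so treat the outputs as orders of magnitude only.

# Back-of-the-envelope only: the constant c in the t ~= c*log(n)/epsilon^2
# rule of thumb is an assumption (it depends on the desired failure
# probability), so treat these numbers as orders of magnitude.
import math

def sample_size_for_rank_error(n, eps, c=1.0):
    # Uniform-sample size t so that fractional ranks on the sample stay
    # within ~eps of fractional ranks on the full data, with high probability.
    return math.ceil(c * math.log(n) / eps ** 2)

def rank_error_for_sample_size(n, t, c=1.0):
    # Inverted: roughly the sampling error a sample of size t buys you.
    return math.sqrt(c * math.log(n) / t)

n = 1_000_000_000                                         # original data size
print(sample_size_for_rank_error(n, 0.005))               # ~830,000 rows for ~0.5% rank error
print(round(rank_error_for_sample_size(n, 10_000), 3))    # 10,000-row cap -> ~0.046 (4.6%)

Read this way, a 10,000-row cap already pins the sampling error at a few percent of rank, so pushing the sketch far beyond its default accuracy mostly spends memory on the smaller of the two error terms; when the sample can be made large enough that the sampling error is negligible, the default configuration is the natural choice. This is just a restatement of the recommendation above, not a library guarantee.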
>> On Thu, Nov 19, 2020 at 1:24 AM Justin Thaler <justin.r.tha...@gmail.com> wrote:
>>
>>> Lee's response is correct, but I'll elaborate slightly (hopefully this is helpful rather than confusing).
>>>
>>> There are some queries for which the following is true: if the sample is uniform from the original (unsampled) data, then accurate answers with respect to the sample are also accurate with respect to the original (unsampled) data.
>>>
>>> As one example, consider quantile queries:
>>>
>>> If you have n original data points from an ordered domain and you sample at least t ~= log(n)/epsilon^2 of the data points at random, it is known that, with high probability over the sample, for each domain item i, the fractional rank of i in the sample (i.e., the number of sampled points less than or equal to i, divided by the sample size t) will match the fractional rank of i in the original unsampled data (i.e., the number of data points less than or equal to i, divided by n) up to an additive error of at most epsilon.
>>>
>>> In fact, at a conceptual level, the KLL quantiles algorithm that's implemented in the library is implicitly performing a type of downsampling internally and then summarizing the sample (this is a bit of a simplification).
>>>
>>> Something similar is true for frequent items. However, it is not true for "non-additive" queries such as unique counts.
>>>
>>> All of that said, the library will not be able to say anything about what errors the user should expect if the data is pre-sampled, because in such a situation there are many factors that are out of the library's control.
>>>
>>> On Wed, Nov 18, 2020 at 3:08 PM leerho <lee...@gmail.com> wrote:
>>>
>>>> Sorry, if you pre-sample your data, all bets are off in terms of accuracy.
>>>>
>>>> On Wed, Nov 18, 2020 at 10:55 AM Sergio Castro <sergio...@gmail.com> wrote:
>>>>
>>>>> Hi, I am new to DataSketches.
>>>>>
>>>>> I know DataSketches provides an *approximate* calculation of statistics with *mathematically proven error bounds*.
>>>>>
>>>>> My question is: say that I am constrained to take a sample of the original data set before handing it to DataSketches (for example, I cannot take more than 10,000 random rows from a table). What would be the consequence of this prior sampling on the "mathematically proven error bounds" of the DataSketches statistics, relative to the original data set?
>>>>>
>>>>> Best,
>>>>>
>>>>> Sergio
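To make the quantiles-versus-unique-counts contrast above concrete, here is a small self-contained simulation in plain Python. It makes no DataSketches calls, and the data distribution and sizes are arbitrary choices for illustration: fractional ranks survive uniform sampling with a small additive error, while distinct counts cannot be extrapolated from a sample without assumptions about the frequency distribution.

import random
from bisect import bisect_right

random.seed(1)
n, t = 1_000_000, 10_000

# Toy data with heavy duplication: 1,000 distinct values, each repeated 1,000 times.
data = [v for v in range(1_000) for _ in range(1_000)]
sample = random.sample(data, t)          # uniform sample without replacement

data_sorted, sample_sorted = sorted(data), sorted(sample)

def frac_rank(sorted_vals, x):
    # Fraction of values less than or equal to x.
    return bisect_right(sorted_vals, x) / len(sorted_vals)

# Quantile-style queries: fractional ranks on the sample track the full data
# to within a small additive error.
worst = max(abs(frac_rank(data_sorted, x) - frac_rank(sample_sorted, x))
            for x in range(0, 1_000, 50))
print(f"max fractional-rank deviation: {worst:.4f}")   # roughly 0.01 here

# Unique-count queries: there is no safe way to extrapolate from the sample.
true_distinct = len(set(data))           # 1,000
sample_distinct = len(set(sample))       # ~1,000 (nearly every value is hit)
naive_scaled = sample_distinct * n // t  # ~100,000: about 100x too high
print(true_distinct, sample_distinct, naive_scaled)
# If every value occurred exactly once instead, the unscaled count would be
# ~100x too low and the scaled one roughly right, so no fixed rule works.

The point of the last few lines is that neither the raw sample count nor the n/t scale-up is reliable without knowing the frequency distribution, which is exactly the kind of assumption about the data that the library's error guarantees avoid making.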