Yes, I saw this issue with SAMPLE in multiple runs. The strangest thing
about it is that the sample size approaches the expected value as the
sample fraction approaches 100% (e.g. 0.99), but the error gets worse as
the sample fraction gets smaller.
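
For reference, here is a minimal sketch of the counts one would expect if
SAMPLE really behaves like a filter on "Math.random() < x", as described
further down the thread. It is plain Java rather than Pig, and it is not
Pig's actual implementation: the class name SampleCheck and the method
sampleCount are hypothetical, and java.util.Random with a fixed seed stands
in for Math.random() so that a run is repeatable.

```java
import java.util.Random;

public class SampleCheck {
    // Count how many of n records pass the filter "random < fraction".
    // Hypothetical helper for illustration only; not part of Pig.
    static long sampleCount(long n, double fraction, Random rng) {
        long kept = 0;
        for (long i = 0; i < n; i++) {
            // nextDouble() is uniform in [0.0, 1.0), like Math.random()
            if (rng.nextDouble() < fraction) {
                kept++;
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        long n = 1_000_000L;
        double[] fractions = {0.00001, 0.0001, 0.99};
        for (double f : fractions) {
            // Fixed seed (an arbitrary choice) so the output is repeatable
            long kept = sampleCount(n, f, new Random(42L));
            System.out.printf("fraction=%s expected~%.0f kept=%d%n", f, n * f, kept);
        }
    }
}
```

Under this model the counts should land near n * x with binomial scatter of
roughly sqrt(n * x), so about 10, 100 and 990,000 here. A consistent
100-fold excess at small fractions would point at something other than the
random filter itself.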

       Brian


On Thu, Sep 13, 2012 at 6:15 PM, Thejas Nair wrote:

> On 9/12/12 11:12 PM, Cheolsoo Park wrote:
>
>> I am puzzled about this. If I am not mistaken, the SAMPLE operator is
>> nothing but "Math.random() < x" where "x" is a double.
>>
> You are right. The SAMPLE operator translates into a filter operator with
> the condition "Math.random() < x".
>
>
>> In my test, SAMPLE A 0.00001 returns about 10 records from a million
>> records when running in local mode. I am curious whether something can go
>> wrong when running it in MR mode.
>>
>
> I wouldn't expect different behavior in case of MR mode.
>
> Brian,
> Do you see this behavior across multiple runs?
>
> -Thejas
>
>
>
>> On Tue, Sep 11, 2012 at 3:37 PM, Brian Choi wrote:
>>
>>> Hello everyone,
>>>
>>> I am wondering if anyone has run into an issue that I am having
>>> using SAMPLE in a Pig script to create a subsample of 0.001% of the
>>> original relation.
>>>
>>> Assume the relation "A" contains a single column of data (int type) with
>>> 1,000,000 records.
>>>
>>> Asamp = SAMPLE A 0.00001;
>>> Asamp2 = SAMPLE A 0.0001;
>>>
>>> Asamp and Asamp2 should produce subsampled relations with about 10 and
>>> 100 records, respectively. However, what I find is that Asamp and Asamp2
>>> are closer to 1000 and 10000 records, which looks like a 100-fold error
>>> in sample size. Interestingly, in the limiting case of:
>>>
>>> Asamp3 = SAMPLE A 0.99;
>>>
>>> The actual subsampled size is VERY close to the expected 99% of the
>>> full relation size. Can anyone shed light on what I may be doing wrong,
>>> or share their experiences if they have also seen issues with using
>>> SAMPLE in Pig? Thank you.
>>>
>>>             Brian
>>>
>>>
>>
>
