As of Pig 0.10 there are UDFs for building Bloom filters.  Those could be used to
construct a Bloom join.

Alan.
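
For illustration, here is a rough sketch of what such a Bloom join might look like for the A/C example from the original question. It assumes the BuildBloom and Bloom built-ins behave as I recall them from the Pig documentation (BuildBloom taking a hash type, an expected element count, and a false-positive rate; Bloom taking the location of the stored filter), so check the constructor arguments against your Pig version. The load paths, schemas, sizes, and the aliases bb, in_c, A_maybe, and C_keys are all made up for the example.

```pig
-- Hypothetical inputs: paths and schemas are placeholders.
A = load 'A' as (A1:chararray, A2:chararray);
C = load 'C' as (A1:chararray);

-- Build a Bloom filter over C's keys.
-- Assumed arguments: hash type, expected number of distinct keys,
-- acceptable false-positive rate (values are illustrative only).
define bb BuildBloom('jenkins', '100000', '0.01');
Cg = group C all;
Cbloom = foreach Cg generate bb(C.A1);
store Cbloom into 'c_bloom';
exec;  -- run the statements above so the filter exists before it is read

-- Drop rows of A whose key is definitely not in C.
define in_c Bloom('c_bloom');
A_maybe = filter A by in_c(A1);

-- A Bloom filter can report false positives, so an exact join on the
-- distinct keys of C is still needed to get true semi-join semantics.
C_keys = distinct C;
J = join A_maybe by A1, C_keys by A1;
B = foreach J generate A_maybe::A1 as A1, A_maybe::A2 as A2;
store B into 'B';
```

The gain, if any, comes from the filter step shrinking A before the shuffle; when most of A's keys do appear in C, the extra pass to build the filter is unlikely to pay off.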

On Jun 25, 2012, at 10:56 PM, Gianmarco De Francisci Morales wrote:

> Bloom filters would help efficiency here.
> A bloom join or semi-join would be a nice addition to Pig.
> 
> Cheers,
> --
> Gianmarco
> 
> 
> 
> 
> On Mon, Jun 25, 2012 at 7:50 PM, Alan Gates <[email protected]> wrote:
> 
>> Agreed.  And with some optimization we could make semi-join more efficient
>> than this since it only needs to keep one record per key per map instead of
>> all the records for a key.
>> 
>> Alan.
>> 
>> On Jun 25, 2012, at 10:17 AM, Russell Jurney wrote:
>> 
>>> This could be a cool rewrite feature like CUBE/SAMPLE.
>>> 
>>> Russell Jurney http://datasyndrome.com
>>> 
>>> On Jun 25, 2012, at 9:39 AM, Alan Gates <[email protected]> wrote:
>>> 
>>>> This type of IN is really a semi-join.  So you could rewrite this as:
>>>> 
>>>> B1 = cogroup A by A1, C by A1;
>>>> B2 = filter B1 by SIZE(C) > 0;
>>>> B = foreach B2 generate flatten(A);
>>>> 
>>>> Alan.
>>>> 
>>>> On Jun 25, 2012, at 2:50 AM, yonghu wrote:
>>>> 
>>>>> Dear all,
>>>>> 
>>>>> In SQL, there is an IN clause which is used to check whether a value
>>>>> is in a set. Does Pig have a similar IN clause? For example:
>>>>> 
>>>>> B = filter A by A1 in C;
>>>>> 
>>>>> A, B, and C are relation names and A1 is a column name of A.
>>>>> 
>>>>> Thanks!
>>>>> 
>>>>> Yong
>>>> 
>> 
>> 
