As of Pig 0.10 there are built-in UDFs (BuildBloom and Bloom) for building and applying bloom filters. Those could be used to construct a bloom join.
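
A rough, untested sketch of how that might look for the "A1 in C" case from the original question. The input paths 'A' and 'C', the output path 'c_bloom', the second column A2, and the sizing arguments ('1000' expected keys, '0.01' false-positive rate) are all placeholders I made up for illustration. Since a bloom filter can return false positives, the sketch still does an exact semi-join afterwards, using cogroup so that C shows up as a bag:

```pig
-- Step 1: build a bloom filter over the keys of C.
-- Arguments: hash type, expected number of distinct keys, acceptable
-- false-positive rate (the last two are placeholder values).
define bb BuildBloom('jenkins', '1000', '0.01');
C      = load 'C' as (A1);
Cgrp   = group C all;
Cbloom = foreach Cgrp generate bb(C.A1);
store Cbloom into 'c_bloom';

-- Intended to force the store above to finish before the filter file is read.
exec;

-- Step 2: drop rows of A whose key is definitely not in C.
define c_filter Bloom('c_bloom');
A       = load 'A' as (A1, A2);
A_maybe = filter A by c_filter(A1);

-- Step 3: exact semi-join to remove the bloom filter's false positives.
B1 = cogroup A_maybe by A1, C by A1;
B2 = filter B1 by not IsEmpty(C);
B  = foreach B2 generate flatten(A_maybe);
```

The win, if any, comes from step 2 shrinking A on the map side before the cogroup shuffles it, so this only pays off when most keys in A do not appear in C.
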

Alan.

On Jun 25, 2012, at 10:56 PM, Gianmarco De Francisci Morales wrote:

> Bloom filters would help efficiency here.
> A bloom join or semi-join would be a nice addition to Pig.
>
> Cheers,
> --
> Gianmarco
>
>
> On Mon, Jun 25, 2012 at 7:50 PM, Alan Gates <[email protected]> wrote:
>
>> Agreed. And with some optimization we could make semi-join more efficient
>> than this since it only needs to keep one record per key per map instead of
>> all the records for a key.
>>
>> Alan.
>>
>> On Jun 25, 2012, at 10:17 AM, Russell Jurney wrote:
>>
>>> This could be a cool rewrite feature like CUBE/SAMPLE.
>>>
>>> Russell Jurney http://datasyndrome.com
>>>
>>> On Jun 25, 2012, at 9:39 AM, Alan Gates <[email protected]> wrote:
>>>
>>>> This type of in is really a semi-join. So you could rewrite this as:
>>>>
>>>> B1 = join A by A1, C by A1;
>>>> B2 = filter B1 by SIZE(C) > 0;
>>>> B = foreach B2 flatten(A);
>>>>
>>>> Alan.
>>>>
>>>> On Jun 25, 2012, at 2:50 AM, yonghu wrote:
>>>>
>>>>> Dear all,
>>>>>
>>>>> in the sql, there is a in clause which is used to check if the value
>>>>> is in a set or not? Does pig also have the same in clause? Such as:
>>>>>
>>>>> B = filter A by A1 in C;
>>>>>
>>>>> A,B,C are relation names and A1 is a column_name of A.
>>>>>
>>>>> Thanks!
>>>>>
>>>>> Yong
