Bloom filters would improve efficiency here. A Bloom join or semi-join would be a nice addition to Pig.

Cheers,
--
Gianmarco

On Mon, Jun 25, 2012 at 7:50 PM, Alan Gates wrote:

> Agreed. And with some optimization we could make semi-join more efficient
> than this, since it only needs to keep one record per key per map instead of
> all the records for a key.
>
> Alan.
>
> On Jun 25, 2012, at 10:17 AM, Russell Jurney wrote:
>
> > This could be a cool rewrite feature like CUBE/SAMPLE.
> >
> > Russell Jurney http://datasyndrome.com
> >
> > On Jun 25, 2012, at 9:39 AM, Alan Gates wrote:
> >
> >> This type of IN is really a semi-join. So you could rewrite this as:
> >>
> >> B1 = cogroup A by A1, C by A1;
> >> B2 = filter B1 by SIZE(C) > 0;
> >> B = foreach B2 generate flatten(A);
> >>
> >> Alan.
> >>
> >> On Jun 25, 2012, at 2:50 AM, yonghu wrote:
> >>
> >>> Dear all,
> >>>
> >>> In SQL, there is an IN clause that checks whether a value is in a
> >>> set. Does Pig have an equivalent IN clause? For example:
> >>>
> >>> B = filter A by A1 in C;
> >>>
> >>> A, B, and C are relation names, and A1 is a column of A.
> >>>
> >>> Thanks!
> >>>
> >>> Yong
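
For readers following the thread, here is a minimal sketch of the idea discussed above: a semi-join that keeps only one entry per key from C and uses a Bloom filter to reject non-matching rows of A cheaply. It is written in Python rather than Pig Latin so that it can be run on its own, and it is not how Pig implements anything. The names BloomFilter, semi_join, and key_of, as well as the filter size and hash count, are assumptions made for illustration.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: no false negatives, some false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        # Sizes are illustrative assumptions, not tuned values.
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, key):
        # Derive num_hashes bit positions by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key!r}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def __contains__(self, key):
        return all(self.bits[pos] for pos in self._positions(key))


def semi_join(left, right_keys, key_of):
    """Return the rows of `left` whose key occurs in `right_keys`.

    Each row of `left` is emitted at most once, however many times its
    key appears on the right.
    """
    distinct_keys = set(right_keys)  # one entry per key, not every record
    bloom = BloomFilter()
    for key in distinct_keys:
        bloom.add(key)

    result = []
    for row in left:
        key = key_of(row)
        if key not in bloom:          # cheap reject; never drops a real match
            continue
        if key in distinct_keys:      # exact check removes false positives
            result.append(row)
    return result


# Rough analogue of "B = filter A by A1 in C", with A1 as the first field.
A = [(1, "a"), (2, "b"), (3, "c"), (2, "d")]
C = [2, 2, 3, 5]
B = semi_join(A, C, key_of=lambda row: row[0])
```

In a single process the exact set makes the Bloom filter redundant. The sketch keeps both only to show the two roles: in a distributed job the compact filter would be the part shipped to every map task, with the exact match done later in the join.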
