Agreed. And with some optimization we could make a semi-join more efficient than this rewrite, since it only needs to keep one record per key per map instead of all the records for a key.
Alan.

On Jun 25, 2012, at 10:17 AM, Russell Jurney wrote:

> This could be a cool rewrite feature like CUBE/SAMPLE.
>
> Russell Jurney http://datasyndrome.com
>
> On Jun 25, 2012, at 9:39 AM, Alan Gates <[email protected]> wrote:
>
>> This type of IN is really a semi-join. So you could rewrite this as:
>>
>> B1 = cogroup A by A1, C by A1;
>> B2 = filter B1 by SIZE(C) > 0;
>> B = foreach B2 generate flatten(A);
>>
>> Alan.
>>
>> On Jun 25, 2012, at 2:50 AM, yonghu wrote:
>>
>>> Dear all,
>>>
>>> In SQL, there is an IN clause which is used to check whether a value
>>> is in a set. Does Pig also have the same IN clause? Such as:
>>>
>>> B = filter A by A1 in C;
>>>
>>> A, B, C are relation names and A1 is a column name of A.
>>>
>>> Thanks!
>>>
>>> Yong
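
For illustration only, here is a minimal Python sketch of the semi-join idea discussed in this thread. It is not how Pig implements anything; the function name `semi_join`, the `a_key`/`c_key` parameters, and the sample rows are all hypothetical. It shows why a semi-join can be cheaper than the cogroup rewrite: only the distinct keys of C are kept, not every C record for a key.

```python
def semi_join(a_rows, c_rows, a_key, c_key):
    """Return the rows of a_rows whose key also appears in c_rows.

    Hypothetical helper for illustration; not part of Pig.
    """
    # Keep one entry per distinct key of C, rather than all C records per key.
    c_keys = {c_key(row) for row in c_rows}
    # Each A row is emitted at most once, however many C rows share its key.
    return [row for row in a_rows if a_key(row) in c_keys]


# Made-up sample data: A has (A1, payload) rows, C has single-column rows.
A = [(1, "x"), (2, "y"), (3, "z"), (2, "w")]
C = [(2,), (2,), (3,), (5,)]

B = semi_join(A, C, a_key=lambda r: r[0], c_key=lambda r: r[0])
print(B)  # [(2, 'y'), (3, 'z'), (2, 'w')]
```

Note that the duplicate key 2 in C does not duplicate rows of A in the result, which is the behaviour expected from `filter A by A1 in C`.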
