You can do better than joining on a constant. Pig does support CROSS, and it does it in parallel. (Joining on a constant would single-thread the process.) It still creates a massive volume of data and is slow, but it will work.
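
A minimal sketch of the CROSS approach, assuming the 'yee' and 'yer' inputs and the (loc, stuff) schema from Jacob's example quoted below. The PARALLEL value is a placeholder, not a recommendation:

```pig
-- Inputs and schema are taken from Jacob's example; adjust to the real data.
a = LOAD 'yee' AS (loc:chararray, stuff:int);
b = LOAD 'yer' AS (loc:chararray, stuff:int);

-- CROSS pairs every row of a with every row of b.
-- PARALLEL 10 is an arbitrary, hypothetical reducer count.
c = CROSS a, b PARALLEL 10;

-- Keep only the pairs whose loc values differ.
-- After CROSS, the fields are disambiguated as a::loc and b::loc.
d = FILTER c BY a::loc != b::loc;

DUMP d;
```

Unlike the SQL left join, this drops any row of a that has no row in b with a different loc.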
Alan.

On Jan 3, 2012, at 1:44 PM, Dmitriy Ryaboy wrote:

> No Michael essentially wants a cross-product. It's a terrible thing and
> should be avoided :).
>
> T1:
>
> a 1
> b 2
> c 3
> d 4
>
> T2:
>
> a x
> b y
> c z
>
> joined this way on the first column becomes:
>
> a 1 b y
> a 1 c z
> b 2 a x
> b 2 c z
> c 3 a x
> c 3 b y
> d 4 a x
> d 4 b y
> d 4 c z
>
> Note the cardinality explosion. Now assume that you are doing this in Pig /
> Hadoop because one of the relations is TB-sized, or at least
> multi-gigabyte.
>
> And this is why Pig doesn't support it.
>
> But if you really want to, join on a constant (so all rows in T1 will match
> all rows in T2) and filter out those for which T1.loc == T2.loc
>
> And don't say I didn't warn you :).
>
> D
>
> On Tue, Jan 3, 2012 at 5:34 AM, Jacob Perkins
> <[email protected]>wrote:
>
>> If I understand correctly, this is nothing more than an anti-join which
>> can be done with pig using a cogroup.
>>
>> So your SQL below:
>>
>>> select * from yee a left join yer b on a.loc != b.loc;
>>
>> becomes something like:
>>
>> a = load 'yee' as (loc:chararray, stuff:int);
>> b = load 'yer' as (loc:chararray, stuff:int);
>>
>> c = cogroup a by loc, b by loc;
>> d = foreach (filter c by IsEmpty(b)) generate FLATTEN(a);
>>
>> which will result in d containing only the records from a where the
>> 'loc' field doesn't match with the 'loc' field in b.
>>
>> --jacob
>> @thedatachef
