You can do better than joining on a constant.  Pig does support CROSS, and it 
does it in parallel.  (Joining on a constant would single thread the process.)  
It still creates a massive volume of data and is slow, but it will work.
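
A minimal sketch of the CROSS approach, assuming the 'yee' and 'yer' inputs and the (loc, stuff) schema from Jacob's example quoted below. The relation names are illustrative only, and this is untested against the real data:

```pig
-- Hypothetical inputs; paths and schema borrowed from Jacob's example.
a = LOAD 'yee' AS (loc:chararray, stuff:int);
b = LOAD 'yer' AS (loc:chararray, stuff:int);

-- CROSS pairs every row of a with every row of b.
c = CROSS a, b;

-- Keep only the pairs whose loc values differ.
-- After CROSS, fields are disambiguated as a::loc and b::loc.
d = FILTER c BY a::loc != b::loc;

DUMP d;
```

Unlike a true SQL left join, this drops any row of a that has no differing row in b (for example, when b is empty).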

Alan.

On Jan 3, 2012, at 1:44 PM, Dmitriy Ryaboy wrote:

> No, Michael essentially wants a cross-product. It's a terrible thing and
> should be avoided :).
> 
> T1:
> 
> a  1
> b  2
> c  3
> d  4
> 
> T2:
> 
> a x
> b y
> c z
> 
> joined this way on the first column becomes:
> 
> a 1 b y
> a 1 c z
> b 2 a x
> b 2 c z
> c 3 a x
> c 3 b y
> d 4 a x
> d 4 b y
> d 4 c z
> 
> Note the cardinality explosion. Now assume that you are doing this in Pig /
> Hadoop because one of the relations is TB-sized, or at least
> multi-gigabyte.
> 
> And this is why Pig doesn't support it.
> 
> But if you really want to, join on a constant (so all rows in T1 will match
> all rows in T2) and filter out those for which T1.loc == T2.loc
> 
> And don't say I didn't warn you :).
> 
> D
> 
> On Tue, Jan 3, 2012 at 5:34 AM, Jacob Perkins 
> <[email protected]> wrote:
> 
>> If I understand correctly, this is nothing more than an anti-join, which
>> can be done in Pig using a cogroup.
>> 
>> So your SQL below:
>> 
>>> select * from yee a left join yer b on a.loc != b.loc;
>> 
>> becomes something like:
>> 
>> a = load 'yee' as (loc:chararray, stuff:int);
>> b = load 'yer' as (loc:chararray, stuff:int);
>> 
>> c = cogroup a by loc, b by loc;
>> d = foreach (filter c by IsEmpty(b)) generate FLATTEN(a);
>> 
>> which will result in d containing only the records from a where the
>> 'loc' field doesn't match with the 'loc' field in b.
>> 
>> --jacob
>> @thedatachef
>> 
>> 
