Re: Left joins with != condition

Dmitriy Ryaboy Tue, 03 Jan 2012 13:45:17 -0800

No, Michael essentially wants a cross-product. It's a terrible thing and
should be avoided :).


T1:

a  1
b  2
c  3
d  4

T2:

a x
b y
c z

Joined this way on the first column (T1.col1 != T2.col1), this becomes:

a 1 b y
a 1 c z
b 2 a x
b 2 c z
c 3 a x
c 3 b y
d 4 a x
d 4 b y
d 4 c z

Note the cardinality explosion: 4 rows and 3 rows turn into 9. Now assume
that you are doing this in Pig / Hadoop because one of the relations is
TB-sized, or at least multi-gigabyte.

And this is why Pig doesn't support it.

But if you really want to, join on a constant (so all rows in T1 will match
all rows in T2) and filter out those for which T1.loc == T2.loc.
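
Roughly, and untested, that would look something like the sketch below. It
reuses the 'yee' / 'yer' paths and the schema from Jacob's message; the
constant field k and the aliases a1, b1, j, f and result are just names made
up for illustration.

```pig
-- Sketch only: emulates a join on a != condition via a constant join key.
a = load 'yee' as (loc:chararray, stuff:int);
b = load 'yer' as (loc:chararray, stuff:int);

-- Add the same constant key to every row of both relations.
a1 = foreach a generate 1 as k, loc, stuff;
b1 = foreach b generate 1 as k, loc, stuff;

-- Every row of a1 matches every row of b1: a full cross-product.
j = join a1 by k, b1 by k;

-- Keep only the pairs whose loc values differ.
f = filter j by a1::loc != b1::loc;

-- Drop the constant key from the output.
result = foreach f generate a1::loc, a1::stuff, b1::loc, b1::stuff;
```

Keep in mind that every row ends up under the same join key, so the whole
cross-product goes through a single reducer. Also, unlike a true left join,
a row of a that has no differing partner in b is dropped rather than padded
with nulls.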

And don't say I didn't warn you :).

D

On Tue, Jan 3, 2012 at 5:34 AM, Jacob Perkins <[email protected]> wrote:

> If I understand correctly, this is nothing more than an anti-join which
> can be done with pig using a cogroup.
>
> So your SQL below:
>
> > select * from yee a left join yer b on a.loc != b.loc;
>
> becomes something like:
>
> a = load 'yee' as (loc:chararray, stuff:int);
> b = load 'yer' as (loc:chararray, stuff:int);
>
> c = cogroup a by loc, b by loc;
> d = foreach (filter c by IsEmpty(b)) generate FLATTEN(a);
>
> which will result in d containing only the records from a where the
> 'loc' field doesn't match with the 'loc' field in b.
>
> --jacob
> @thedatachef
>
>
