Do CROSS like Alan suggested, and then filter for non-equality, that's
better.
What Jacob proposed won't work, I think he misunderstood your question (or
I did...)

D

On Tue, Jan 3, 2012 at 3:49 PM, Michael <[email protected]> wrote:

> Hi folks,
>
> Thanks for the input. As terrible as the cross join is, I need the result
> to perform matching, ie. For each record in T1, match with all records in
> T2.
>
> My initial thought is that this can be easily done in the db and have the
> results exported from there, but due to the sheer amount of output, it
> makes more sense to have the results in hdfs.
>
> Not sure what Jacob proposed will generate, but will try that first as I'm
> unsure how I can perform a join on a constant :)
>
>
> Thanks again
>
>
>
> On Jan 4, 2012, at 5:44, Dmitriy Ryaboy <[email protected]> wrote:
>
> > No Michael essentially wants a cross-product. It's a terrible thing and
> > should be avoided :).
> >
> > T1:
> >
> > a  1
> > b  2
> > c  3
> > d  4
> >
> > T2:
> >
> > a x
> > b y
> > c z
> >
> > joined this way on the first column becomes:
> >
> > a 1 b y
> > a 1 c z
> > b 2 a x
> > b 2 c z
> > c 3 a x
> > c 3 b y
> > d 4 a x
> > d 4 b y
> > d 4 c z
> >
> > Note the cardinality explosion. Now assume that you are doing this in
> Pig /
> > Hadoop because one of the relations is  TB-sized, or at least
> > multi-gigabyte.
> >
> > And this is why Pig doesn't support it.
> >
> > But if you really want to, join on a constant (so all rows in T1 will
> match
> > all rows in T2) and filter out those for which T1.loc == T2.loc
> >
> > And don't say I didn't warn you :).
> >
> > D
> >
> > On Tue, Jan 3, 2012 at 5:34 AM, Jacob Perkins <[email protected]
> >wrote:
> >
> >> If I understand correctly, this is nothing more than an anti-join which
> >> can be done with pig using a cogroup.
> >>
> >> So your SQL below:
> >>
> >>> select * from yee a left join yer b on a.loc != b.loc;
> >>
> >> becomes something like:
> >>
> >> a = load 'yee' as (loc:chararray, stuff:int);
> >> b = load 'yer' as (loc:chararray, stuff:int);
> >>
> >> c = cogroup a by loc, b by loc;
> >> d = foreach (filter c by IsEmpty(b)) generate FLATTEN(a);
> >>
> >> which will result in d containing only the records from a where the
> >> 'loc' field doesn't match with the 'loc' field in b.
> >>
> >> --jacob
> >> @thedatachef
> >>
> >>
>

Reply via email to