Do CROSS like Alan suggested, and then filter for non-equality, that's better. What Jacob proposed won't work, I think he misunderstood your question (or I did...)
D On Tue, Jan 3, 2012 at 3:49 PM, Michael <[email protected]> wrote: > Hi folks, > > Thanks for the input. As terrible as the cross join is, I need the result > to perform matching, ie. For each record in T1, match with all records in > T2. > > My initial thought is that this can be easily done in the db and have the > results exported from there, but due to the sheer amount of output, it > makes more sense to have the results in hdfs. > > Not sure what Jacob proposed will generate, but will try that first as I'm > unsure how I can perform a join on a constant :) > > > Thanks again > > > > On Jan 4, 2012, at 5:44, Dmitriy Ryaboy <[email protected]> wrote: > > > No Michael essentially wants a cross-product. It's a terrible thing and > > should be avoided :). > > > > T1: > > > > a 1 > > b 2 > > c 3 > > d 4 > > > > T2: > > > > a x > > b y > > c z > > > > joined this way on the first column becomes: > > > > a 1 b y > > a 1 c z > > b 2 a x > > b 2 c z > > c 3 a x > > c 3 b y > > d 4 a x > > d 4 b y > > d 4 c z > > > > Note the cardinality explosion. Now assume that you are doing this in > Pig / > > Hadoop because one of the relations is TB-sized, or at least > > multi-gigabyte. > > > > And this is why Pig doesn't support it. > > > > But if you really want to, join on a constant (so all rows in T1 will > match > > all rows in T2) and filter out those for which T1.loc == T2.loc > > > > And don't say I didn't warn you :). > > > > D > > > > On Tue, Jan 3, 2012 at 5:34 AM, Jacob Perkins <[email protected] > >wrote: > > > >> If I understand correctly, this is nothing more than an anti-join which > >> can be done with pig using a cogroup. > >> > >> So your SQL below: > >> > >>> select * from yee a left join yer b on a.loc != b.loc; > >> > >> becomes something like: > >> > >> a = load 'yee' as (loc:chararray, stuff:int); > >> b = load 'yer' as (loc:chararray, stuff:int); > >> > >> c = cogroup a by loc, b by loc; > >> d = foreach (filter c by IsEmpty(b)) generate FLATTEN(a); > >> > >> which will result in d containing only the records from a where the > >> 'loc' field doesn't match with the 'loc' field in b. > >> > >> --jacob > >> @thedatachef > >> > >> >
