Hi folks, Thanks for the input. As terrible as the cross join is, I need the result to perform matching, ie. For each record in T1, match with all records in T2.
My initial thought is that this can be easily done in the db and have the results exported from there, but due to the sheer amount of output, it makes more sense to have the results in hdfs. Not sure what Jacob proposed will generate, but will try that first as I'm unsure how I can perform a join on a constant :) Thanks again On Jan 4, 2012, at 5:44, Dmitriy Ryaboy <[email protected]> wrote: > No Michael essentially wants a cross-product. It's a terrible thing and > should be avoided :). > > T1: > > a 1 > b 2 > c 3 > d 4 > > T2: > > a x > b y > c z > > joined this way on the first column becomes: > > a 1 b y > a 1 c z > b 2 a x > b 2 c z > c 3 a x > c 3 b y > d 4 a x > d 4 b y > d 4 c z > > Note the cardinality explosion. Now assume that you are doing this in Pig / > Hadoop because one of the relations is TB-sized, or at least > multi-gigabyte. > > And this is why Pig doesn't support it. > > But if you really want to, join on a constant (so all rows in T1 will match > all rows in T2) and filter out those for which T1.loc == T2.loc > > And don't say I didn't warn you :). > > D > > On Tue, Jan 3, 2012 at 5:34 AM, Jacob Perkins > <[email protected]>wrote: > >> If I understand correctly, this is nothing more than an anti-join which >> can be done with pig using a cogroup. >> >> So your SQL below: >> >>> select * from yee a left join yer b on a.loc != b.loc; >> >> becomes something like: >> >> a = load 'yee' as (loc:chararray, stuff:int); >> b = load 'yer' as (loc:chararray, stuff:int); >> >> c = cogroup a by loc, b by loc; >> d = foreach (filter c by IsEmpty(b)) generate FLATTEN(a); >> >> which will result in d containing only the records from a where the >> 'loc' field doesn't match with the 'loc' field in b. >> >> --jacob >> @thedatachef >> >>
