Hi Dimitry, CROSS worked like a charm :)
Thanks! On Wed, Jan 4, 2012 at 10:15 AM, Dmitriy Ryaboy <[email protected]> wrote: > Do CROSS like Alan suggested, and then filter for non-equality, that's > better. > What Jacob proposed won't work, I think he misunderstood your question (or > I did...) > > D > > On Tue, Jan 3, 2012 at 3:49 PM, Michael <[email protected]> wrote: > >> Hi folks, >> >> Thanks for the input. As terrible as the cross join is, I need the result >> to perform matching, ie. For each record in T1, match with all records in >> T2. >> >> My initial thought is that this can be easily done in the db and have the >> results exported from there, but due to the sheer amount of output, it >> makes more sense to have the results in hdfs. >> >> Not sure what Jacob proposed will generate, but will try that first as I'm >> unsure how I can perform a join on a constant :) >> >> >> Thanks again >> >> >> >> On Jan 4, 2012, at 5:44, Dmitriy Ryaboy <[email protected]> wrote: >> >> > No Michael essentially wants a cross-product. It's a terrible thing and >> > should be avoided :). >> > >> > T1: >> > >> > a 1 >> > b 2 >> > c 3 >> > d 4 >> > >> > T2: >> > >> > a x >> > b y >> > c z >> > >> > joined this way on the first column becomes: >> > >> > a 1 b y >> > a 1 c z >> > b 2 a x >> > b 2 c z >> > c 3 a x >> > c 3 b y >> > d 4 a x >> > d 4 b y >> > d 4 c z >> > >> > Note the cardinality explosion. Now assume that you are doing this in >> Pig / >> > Hadoop because one of the relations is TB-sized, or at least >> > multi-gigabyte. >> > >> > And this is why Pig doesn't support it. >> > >> > But if you really want to, join on a constant (so all rows in T1 will >> match >> > all rows in T2) and filter out those for which T1.loc == T2.loc >> > >> > And don't say I didn't warn you :). >> > >> > D >> > >> > On Tue, Jan 3, 2012 at 5:34 AM, Jacob Perkins <[email protected] >> >wrote: >> > >> >> If I understand correctly, this is nothing more than an anti-join which >> >> can be done with pig using a cogroup. >> >> >> >> So your SQL below: >> >> >> >>> select * from yee a left join yer b on a.loc != b.loc; >> >> >> >> becomes something like: >> >> >> >> a = load 'yee' as (loc:chararray, stuff:int); >> >> b = load 'yer' as (loc:chararray, stuff:int); >> >> >> >> c = cogroup a by loc, b by loc; >> >> d = foreach (filter c by IsEmpty(b)) generate FLATTEN(a); >> >> >> >> which will result in d containing only the records from a where the >> >> 'loc' field doesn't match with the 'loc' field in b. >> >> >> >> --jacob >> >> @thedatachef >> >> >> >> >>
