Hi Dimitry,

CROSS worked like a charm :)


Thanks!


On Wed, Jan 4, 2012 at 10:15 AM, Dmitriy Ryaboy <[email protected]> wrote:
> Do CROSS like Alan suggested, and then filter for non-equality, that's
> better.
> What Jacob proposed won't work, I think he misunderstood your question (or
> I did...)
>
> D
>
> On Tue, Jan 3, 2012 at 3:49 PM, Michael <[email protected]> wrote:
>
>> Hi folks,
>>
>> Thanks for the input. As terrible as the cross join is, I need the result
>> to perform matching, ie. For each record in T1, match with all records in
>> T2.
>>
>> My initial thought is that this can be easily done in the db and have the
>> results exported from there, but due to the sheer amount of output, it
>> makes more sense to have the results in hdfs.
>>
>> Not sure what Jacob proposed will generate, but will try that first as I'm
>> unsure how I can perform a join on a constant :)
>>
>>
>> Thanks again
>>
>>
>>
>> On Jan 4, 2012, at 5:44, Dmitriy Ryaboy <[email protected]> wrote:
>>
>> > No Michael essentially wants a cross-product. It's a terrible thing and
>> > should be avoided :).
>> >
>> > T1:
>> >
>> > a  1
>> > b  2
>> > c  3
>> > d  4
>> >
>> > T2:
>> >
>> > a x
>> > b y
>> > c z
>> >
>> > joined this way on the first column becomes:
>> >
>> > a 1 b y
>> > a 1 c z
>> > b 2 a x
>> > b 2 c z
>> > c 3 a x
>> > c 3 b y
>> > d 4 a x
>> > d 4 b y
>> > d 4 c z
>> >
>> > Note the cardinality explosion. Now assume that you are doing this in
>> Pig /
>> > Hadoop because one of the relations is  TB-sized, or at least
>> > multi-gigabyte.
>> >
>> > And this is why Pig doesn't support it.
>> >
>> > But if you really want to, join on a constant (so all rows in T1 will
>> match
>> > all rows in T2) and filter out those for which T1.loc == T2.loc
>> >
>> > And don't say I didn't warn you :).
>> >
>> > D
>> >
>> > On Tue, Jan 3, 2012 at 5:34 AM, Jacob Perkins <[email protected]
>> >wrote:
>> >
>> >> If I understand correctly, this is nothing more than an anti-join which
>> >> can be done with pig using a cogroup.
>> >>
>> >> So your SQL below:
>> >>
>> >>> select * from yee a left join yer b on a.loc != b.loc;
>> >>
>> >> becomes something like:
>> >>
>> >> a = load 'yee' as (loc:chararray, stuff:int);
>> >> b = load 'yer' as (loc:chararray, stuff:int);
>> >>
>> >> c = cogroup a by loc, b by loc;
>> >> d = foreach (filter c by IsEmpty(b)) generate FLATTEN(a);
>> >>
>> >> which will result in d containing only the records from a where the
>> >> 'loc' field doesn't match with the 'loc' field in b.
>> >>
>> >> --jacob
>> >> @thedatachef
>> >>
>> >>
>>

Reply via email to