Re: An explanation of LEFT OUTER JOIN and NULL values

David Morel Thu, 24 Jan 2013 22:35:02 -0800

On 24 Jan 2013, at 20:39, bejoy...@yahoo.com wrote:

> Hi David,
>
> The default partitioner used in map reduce is the hash partitioner. So
> based on your keys they are send to a particular reducer.
>
> May be in your current data set, the keys that have no values in table
> are all falling in the same hash bucket and hence being processed by
> the same reducer.


Really not the case, no, so it doesn't make any sort of sense to me :\
At this point I am starting to think the only way to figure it out is to
set or add debugging at the reducer level. I hoped I could avoid it,
alas...

> If you are noticing a skew on a particular reducer, sometimes  a
> simple work around like increasing the no of reducers explicitly might
> help you get pass the hurdle.

Yes, that seemed to work in some cases, but is not very handy: it's
hard to tell my users 'try setting a different number of reducers when
your query is stuck' ;-)

> Also please ensure you have enabled skew join optimization.

Didn't have much success with this, but maybe my version of hive is a
bit old.

So to emulate a LEFT OUTER JOIN I had to do something really horrible:

JOIN (
    -- TABLE B is the joined table
        SELECT key, value FROM TABLE B
        UNION ALL
    -- TABLE A has all the unique ids 
        SELECT key, '' FROM TABLE A 
) AS SUB

and then use a count -1 to get a count of non-null rows, etc. At least 
it does work with no slowdowns, but I mean, yuk!

It's the best I could come up with so far, so if I could fully understand
the root cause of the problem, that would be much better. I guess I'll 
dig in a bit deeper then.

Thanks a lot!

David

>
> Regards
> Bejoy KS
>
> Sent from remote device, Please excuse typos
>
> -----Original Message-----
> From: "David Morel" <dmore...@gmail.com>
> Date: Thu, 24 Jan 2013 18:39:56
> To: <user@hive.apache.org>; <bejoy...@yahoo.com>
> Reply-To: user@hive.apache.org
> Subject: Re: An explanation of LEFT OUTER JOIN and NULL values
>
> On 24 Jan 2013, at 18:16, bejoy...@yahoo.com wrote:
>
>> Hi David
>>
>> An explain extended would give you the exact pointer.
>>
>> From my understanding, this is how it could work.
>>
>> You have two tables then two different map reduce job would be
>> processing those. Based on the join keys, combination of corresponding
>> columns would be chosen as key from mapper1 and mapper2. So if the
>> combination of columns having the same value those records from two
>> set of mappers would go into the same reducer.
>>
>> On the reducer if there is a corresponding value for a key from table
>> 1 to  table 2/mapper 2 that value would be populated. If no val for
>> mapper 2 then those columns from table 2 are made null.
>>
>> If there is a key-value just from table 2/mapper 2 and no
>> corresponding value from mapper 1. That value is just discarded.
>
> Hi Bejoy,
>
> Thanks! So schematically, something like this, right?
>
> mapper1 (bigger table):
> K1-A, V1A
> K2-A, V2A
> K3-A, V3A
>
> mapper2 (joined, smaller table):
> K1-B, V1B
>
> reducer1:
> K1-A, V1A
> K1-B, V1B
>
> returns:
> K1, V1A, V1B etc
>
> reducer2:
> K2-A, V2A
> *no* K2-B, V so: K2-B, NULL is created, same for next row.
> K3-A, V3A
>
> returns:
> K2, V2A, NULL etc
> K3, V3A, NULL etc
>
> I still don't understand why my reducer2 (and only this one, which
> apparently gets all the keys for which we don't have a row on table B)
> would become overloaded. Am I completely misunderstanding the whole
> thing?
>
> David
>
>> Regards
>> Bejoy KS
>>
>> Sent from remote device, Please excuse typos
>>
>> -----Original Message-----
>> From: "David Morel" <dmore...@gmail.com>
>> Date: Thu, 24 Jan 2013 18:03:40
>> To: user@hive.apache.org<user@hive.apache.org>
>> Reply-To: user@hive.apache.org
>> Subject: An explanation of LEFT OUTER JOIN and NULL values
>>
>> Hi!
>>
>> After hitting the "curse of the last reducer" many times on LEFT OUTER
>> JOIN queries, and trying to think about it, I came to the conclusion
>> there's something I am missing regarding how keys are handled in
>> mapred jobs.
>>
>> The problem shows when I have table A containing billions of rows with
>> distinctive keys, that I need to join to table B that has a much lower
>> number of rows.
>>
>> I need to keep all the A rows, populated with NULL values from the B
>> side, so that's what a LEFT OUTER is for.
>>
>> Now, when transforming that into a mapred job, my -naive-
>> understanding would be that for every key on the A table, a missing
>> key on the B table would be generated with a NULL value. If that were
>> the case, I fail to understand why all NULL valued B keys would end up
>> on the same reducer, since the key defines which reducer is used, not
>> the value.
>>
>> So, obviously, this is not how it works.
>>
>> So my question is: how is this construct handled?
>>
>> Thanks a lot!
>>
>> D.Morel

Re: An explanation of LEFT OUTER JOIN and NULL values

Reply via email to