On 24 Jan 2013, at 20:39, bejoy...@yahoo.com wrote: > Hi David, > > The default partitioner used in map reduce is the hash partitioner. So > based on your keys they are send to a particular reducer. > > May be in your current data set, the keys that have no values in table > are all falling in the same hash bucket and hence being processed by > the same reducer.
Really not the case, no, so it doesn't make any sort of sense to me :\ At this point I am starting to think the only way to figure it out is to set or add debugging at the reducer level. I hoped I could avoid it, alas... > If you are noticing a skew on a particular reducer, sometimes a > simple work around like increasing the no of reducers explicitly might > help you get pass the hurdle. Yes, that seemed to work in some cases, but is not very handy: it's hard to tell my users 'try setting a different number of reducers when your query is stuck' ;-) > Also please ensure you have enabled skew join optimization. Didn't have much success with this, but maybe my version of hive is a bit old. So to emulate a LEFT OUTER JOIN I had to do something really horrible: JOIN ( -- TABLE B is the joined table SELECT key, value FROM TABLE B UNION ALL -- TABLE A has all the unique ids SELECT key, '' FROM TABLE A ) AS SUB and then use a count -1 to get a count of non-null rows, etc. At least it does work with no slowdowns, but I mean, yuk! It's the best I could come up with so far, so if I could fully understand the root cause of the problem, that would be much better. I guess I'll dig in a bit deeper then. Thanks a lot! David > > Regards > Bejoy KS > > Sent from remote device, Please excuse typos > > -----Original Message----- > From: "David Morel" <dmore...@gmail.com> > Date: Thu, 24 Jan 2013 18:39:56 > To: <user@hive.apache.org>; <bejoy...@yahoo.com> > Reply-To: user@hive.apache.org > Subject: Re: An explanation of LEFT OUTER JOIN and NULL values > > On 24 Jan 2013, at 18:16, bejoy...@yahoo.com wrote: > >> Hi David >> >> An explain extended would give you the exact pointer. >> >> From my understanding, this is how it could work. >> >> You have two tables then two different map reduce job would be >> processing those. Based on the join keys, combination of corresponding >> columns would be chosen as key from mapper1 and mapper2. So if the >> combination of columns having the same value those records from two >> set of mappers would go into the same reducer. >> >> On the reducer if there is a corresponding value for a key from table >> 1 to table 2/mapper 2 that value would be populated. If no val for >> mapper 2 then those columns from table 2 are made null. >> >> If there is a key-value just from table 2/mapper 2 and no >> corresponding value from mapper 1. That value is just discarded. > > Hi Bejoy, > > Thanks! So schematically, something like this, right? > > mapper1 (bigger table): > K1-A, V1A > K2-A, V2A > K3-A, V3A > > mapper2 (joined, smaller table): > K1-B, V1B > > reducer1: > K1-A, V1A > K1-B, V1B > > returns: > K1, V1A, V1B etc > > reducer2: > K2-A, V2A > *no* K2-B, V so: K2-B, NULL is created, same for next row. > K3-A, V3A > > returns: > K2, V2A, NULL etc > K3, V3A, NULL etc > > I still don't understand why my reducer2 (and only this one, which > apparently gets all the keys for which we don't have a row on table B) > would become overloaded. Am I completely misunderstanding the whole > thing? > > David > >> Regards >> Bejoy KS >> >> Sent from remote device, Please excuse typos >> >> -----Original Message----- >> From: "David Morel" <dmore...@gmail.com> >> Date: Thu, 24 Jan 2013 18:03:40 >> To: user@hive.apache.org<user@hive.apache.org> >> Reply-To: user@hive.apache.org >> Subject: An explanation of LEFT OUTER JOIN and NULL values >> >> Hi! >> >> After hitting the "curse of the last reducer" many times on LEFT OUTER >> JOIN queries, and trying to think about it, I came to the conclusion >> there's something I am missing regarding how keys are handled in >> mapred jobs. >> >> The problem shows when I have table A containing billions of rows with >> distinctive keys, that I need to join to table B that has a much lower >> number of rows. >> >> I need to keep all the A rows, populated with NULL values from the B >> side, so that's what a LEFT OUTER is for. >> >> Now, when transforming that into a mapred job, my -naive- >> understanding would be that for every key on the A table, a missing >> key on the B table would be generated with a NULL value. If that were >> the case, I fail to understand why all NULL valued B keys would end up >> on the same reducer, since the key defines which reducer is used, not >> the value. >> >> So, obviously, this is not how it works. >> >> So my question is: how is this construct handled? >> >> Thanks a lot! >> >> D.Morel