On Thu, Mar 17, 2011 at 10:34 AM,  <[email protected]> wrote:
> Try out CDH3b4; it has Hive 0.7 and the latest of the other Hadoop tools. When 
> you work with open source, it is definitely good practice to upgrade to the 
> latest versions. With newer versions bugs would be minimal, performance would 
> be better, and you get more functionality. Your query looks fine; an upgrade 
> of Hive could sort things out.
> Regards
> Bejoy K S
>
> -----Original Message-----
> From: Edward Capriolo <[email protected]>
> Date: Thu, 17 Mar 2011 08:51:05
> To: [email protected]<[email protected]>
> Reply-To: [email protected]
> Subject: Re: Hadoop error 2 while joining two large tables
>
> I am pretty sure the cloudera distro has an upgrade path to a more recent 
> hive.
>
> On Thursday, March 17, 2011, hadoop n00b <[email protected]> wrote:
>> Hello All,
>>
>> Thanks a lot for your response. To clarify a few points -
>>
>> I am on CDH2 with Hive 0.4 (I think). We cannot move to a higher version of 
>> Hive as we have to use Cloudera distro only.
>>
>> All records in the smaller table have at least one matching record in the 
>> larger table (there could be a few exceptions, but only a few).
>>
>> The join uses an ON clause. The query is something like -
>>
>> select ...
>> from
>> (
>>   (select ... from smaller_table) s
>>   join
>>   (select ... from larger_table) l
>>   on (s.col = l.col)
>> )
>>
>> I will try out setting mapred.child.java.opts -Xmx to a higher value and let 
>> you know.
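>> A bump like the following can be tried per-session from the Hive CLI before
>> running the query (the 2048m figure below is only an illustration; tune it
>> to what your task tracker nodes can actually spare):

```sql
-- Per-session override of the child JVM heap used by map/reduce tasks.
-- 2048m is an illustrative value, not a recommendation.
SET mapred.child.java.opts=-Xmx2048m;
```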
>>
>> Is there a pattern or rule of thumb to follow on when to add more nodes?
>>
>> Thanks again!
>>
>> On Thu, Mar 17, 2011 at 1:08 AM, Steven Wong <[email protected]> wrote:
>>
>> In addition, put the smaller table on the left-hand side of a JOIN:
>>
>> SELECT ... FROM small_table JOIN large_table ON ...
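>> On Hive versions that support it, the same idea can be stated explicitly
>> with the STREAMTABLE hint, which marks the table to be streamed through the
>> join rather than buffered in memory (table and column names below are
>> placeholders, not from your schema):

```sql
-- Hint Hive to buffer small_table and stream large_table through the join.
-- Names here are placeholders; requires a Hive version with hint support.
SELECT /*+ STREAMTABLE(l) */ s.col, l.othercol
FROM small_table s JOIN large_table l ON (s.col = l.col);
```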
>>
>>
>> From: Bejoy Ks [mailto:[email protected]]
>> Sent: Wednesday, March 16, 2011 11:43 AM
>>
>> To: [email protected]
>> Subject: Re: Hadoop error 2 while joining two large tables
>>
>>
>> Hey hadoop n00b
>>     I second Mark's thought, but you can definitely try reframing your query 
>> to get things rolling. I'm not sure about your Hive query, but from my 
>> experience with joins on huge tables (record counts in the range of hundreds 
>> of millions), you should give the join conditions in a JOIN ON clause rather 
>> than specifying all conditions in WHERE.
>>
>> Say you have a query like this:
>> SELECT a.Column1, a.Column2, b.Column1 FROM Table1 a JOIN Table2 b WHERE
>> a.Column4=b.Column1 AND a.Column2=b.Column4 AND a.Column3 > b.Column2;
>>
>> You can definitely reframe it as:
>> SELECT a.Column1, a.Column2, b.Column1 FROM Table1 a JOIN Table2 b
>> ON (a.Column4=b.Column1 AND a.Column2=b.Column4) WHERE a.Column3 > 
>> b.Column2;
>>
>> From my understanding Hive supports only equijoins, so you can't put the 
>> inequality conditions inside JOIN ON; the inequality has to go in WHERE. 
>> This approach worked for me when I ran into a similar situation some time 
>> ago. Try it out, hope it helps.
>>
>> Regards
>> Bejoy.K.S
>>
>>
>> From: "Sunderlin, Mark" <[email protected]>
>> To: "[email protected]" <[email protected]>
>> Sent: Wed, March 16, 2011 11:22:09 PM
>> Subject: RE: Hadoop error 2 while joining two large tables
>>
>>
>> hadoop n00b asks, “Is adding more nodes the solution to such a problem?”
>>
>> Whatever other answers you get, you should append “ … and add more nodes.” 
>> More nodes is never a bad thing ;-)
>>
>> ---
>> Mark E. Sunderlin
>> Solutions Architect |AOL Data Warehouse
>>
>> P: 703-256-6935 | C: 540-327-6222
>>
>> AIM: MESunderlin
>> 22000 AOL Way
>

In particular, I remember that back in the Hive 0.3 and 0.4 days many
troublesome issues like this were encountered. A lot of engineering went
into 0.5.0 to make joins work with less memory and to avoid many
OutOfMemory conditions.

You can see some of my older struggles:

http://mail-archives.apache.org/mod_mbox/hadoop-hive-user/201002.mbox/%[email protected]%3E

Upgrading is HIGHLY suggested. I remember 0.4.0 -> 0.5.0 made these problems
vanish. (That was a happy day.)
