Re: Hadoop error 2 while joining two large tables

bejoy_ks Thu, 17 Mar 2011 07:34:26 -0700

Try out CDH3b4 it has hive 0.7 and the latest of other hadoop tools. When you 
work with open source it is definitely a good practice to upgrade those with 
latest versions. With newer versions bugs would be minimal , performance would 
be better and you get more functionalities. Your query looks fine an upgrade of 
hive could sort things out. 
Regards
Bejoy K S


-----Original Message-----
From: Edward Capriolo <[email protected]>
Date: Thu, 17 Mar 2011 08:51:05 
To: [email protected]<[email protected]>
Reply-To: [email protected]
Subject: Re: Hadoop error 2 while joining two large tables

I am pretty sure the cloudera distro has an upgrade path to a more recent hive.

On Thursday, March 17, 2011, hadoop n00b <[email protected]> wrote:
> Hello All,
>
> Thanks a lot for your response. To clarify a few points -
>
> I am on CDH2 with Hive 0.4 (I think). We cannot move to a higher version of 
> Hive as we have to use Cloudera distro only.
>
> All records in the smaller table have at least one record in the larger table 
> (of course a few exceptions could be there but only a few).
>
> The join is using ON clause. The query is something like -
>
> select ...
> from
> (
>   (select ... from smaller_table)
>   join
>   (select from larger_table)
>   on (smaller_table.col = larger_table.col)
> )
>
> I will try out setting mapred.child.java.opts -Xmx to a higher value and let 
> you know.
>
> Is there a pattern or rule of thumb to follow on when to add more nodes?
>
> Thanks again!
>
> On Thu, Mar 17, 2011 at 1:08 AM, Steven Wong <[email protected]> wrote:
>
>
>
> In addition, put the smaller table on the left-hand side of a JOIN:
>
> SELECT ... FROM small_table JOIN large_table ON ...
>
>
>
>
> From: Bejoy Ks [mailto:[email protected]]
> Sent: Wednesday, March 16, 2011 11:43 AM
>
> To: [email protected]
> Subject: Re: Hadoop error 2 while joining two large tables
>
>
>
>
>
>
> Hey hadoop n00b
>     I second Mark's thought. But definitely you can try out re framing your 
> query to get things rolling. I'm not sure on your hive Query.But still, from 
> my experience with joins on huge tables (record counts in the range of 
> hundreds of millions) you should give join conditions with JOIN ON clause 
> rather than specifying all conditions in WHERE.
>
> Say if you have a query this way
> SELECT a.Column1,a.Column2,b.Column1 FROM Table1 a JOIN Table2 b WHERE
> a.Column4=b.Column1 AND a.Column2=b.Column4 AND a.Column3 > b.Column2;
>
> You can definitely re frame this query as
> SELECT a.Column1,a.Column2,b.Column1 FROM Table1 a JOIN Table2 b
> ON (a.Column4=b.Column1 AND a.Column2=b.Column4)  WHERE a.Column3 > b.Column2;
>
> From my understanding Hive supports equijoins so you can't have the 
> inequality conditions there within JOIN ON, inequality should come to WHERE. 
> This approach has worked for me when I encountered a similar situation as 
> yours some time ago. Try this out,hope it helps.
>
> Regards
> Bejoy.K.S
>
>
>
>
>
>
>
>
> From: "Sunderlin, Mark" <[email protected]>
> To: "[email protected]" <[email protected]>
> Sent: Wed, March 16, 2011 11:22:09 PM
> Subject: RE: Hadoop error 2 while joining two large tables
>
>
>
>
> hadoop n00b asks, “Is adding more nodes the solution to such problem?”
>
> Whatever else answers you get, you should append “ … and add more nodes.” 
> More nodes is never a bad thing ;-)
>
>
> ---
> Mark E. Sunderlin
> Solutions Architect |AOL Data Warehouse
>
> P: 703-256-6935 | C: 540-327-6222
>
> AIM: MESunderlin
> 22000 AOL Way

Re: Hadoop error 2 while joining two large tables

Reply via email to