Try out CDH3b4 it has hive 0.7 and the latest of other hadoop tools. When you work with open source it is definitely a good practice to upgrade those with latest versions. With newer versions bugs would be minimal , performance would be better and you get more functionalities. Your query looks fine an upgrade of hive could sort things out. Regards Bejoy K S
-----Original Message----- From: Edward Capriolo <[email protected]> Date: Thu, 17 Mar 2011 08:51:05 To: [email protected]<[email protected]> Reply-To: [email protected] Subject: Re: Hadoop error 2 while joining two large tables I am pretty sure the cloudera distro has an upgrade path to a more recent hive. On Thursday, March 17, 2011, hadoop n00b <[email protected]> wrote: > Hello All, > > Thanks a lot for your response. To clarify a few points - > > I am on CDH2 with Hive 0.4 (I think). We cannot move to a higher version of > Hive as we have to use Cloudera distro only. > > All records in the smaller table have at least one record in the larger table > (of course a few exceptions could be there but only a few). > > The join is using ON clause. The query is something like - > > select ... > from > ( > (select ... from smaller_table) > join > (select from larger_table) > on (smaller_table.col = larger_table.col) > ) > > I will try out setting mapred.child.java.opts -Xmx to a higher value and let > you know. > > Is there a pattern or rule of thumb to follow on when to add more nodes? > > Thanks again! > > On Thu, Mar 17, 2011 at 1:08 AM, Steven Wong <[email protected]> wrote: > > > > In addition, put the smaller table on the left-hand side of a JOIN: > > SELECT ... FROM small_table JOIN large_table ON ... > > > > > From: Bejoy Ks [mailto:[email protected]] > Sent: Wednesday, March 16, 2011 11:43 AM > > To: [email protected] > Subject: Re: Hadoop error 2 while joining two large tables > > > > > > > Hey hadoop n00b > I second Mark's thought. But definitely you can try out re framing your > query to get things rolling. I'm not sure on your hive Query.But still, from > my experience with joins on huge tables (record counts in the range of > hundreds of millions) you should give join conditions with JOIN ON clause > rather than specifying all conditions in WHERE. > > Say if you have a query this way > SELECT a.Column1,a.Column2,b.Column1 FROM Table1 a JOIN Table2 b WHERE > a.Column4=b.Column1 AND a.Column2=b.Column4 AND a.Column3 > b.Column2; > > You can definitely re frame this query as > SELECT a.Column1,a.Column2,b.Column1 FROM Table1 a JOIN Table2 b > ON (a.Column4=b.Column1 AND a.Column2=b.Column4) WHERE a.Column3 > b.Column2; > > From my understanding Hive supports equijoins so you can't have the > inequality conditions there within JOIN ON, inequality should come to WHERE. > This approach has worked for me when I encountered a similar situation as > yours some time ago. Try this out,hope it helps. > > Regards > Bejoy.K.S > > > > > > > > > From: "Sunderlin, Mark" <[email protected]> > To: "[email protected]" <[email protected]> > Sent: Wed, March 16, 2011 11:22:09 PM > Subject: RE: Hadoop error 2 while joining two large tables > > > > > hadoop n00b asks, “Is adding more nodes the solution to such problem?” > > Whatever else answers you get, you should append “ … and add more nodes.” > More nodes is never a bad thing ;-) > > > --- > Mark E. Sunderlin > Solutions Architect |AOL Data Warehouse > > P: 703-256-6935 | C: 540-327-6222 > > AIM: MESunderlin > 22000 AOL Way
