On Thu, Mar 17, 2011 at 10:34 AM, <[email protected]> wrote:
> Try out CDH3b4 it has hive 0.7 and the latest of other hadoop tools. When you
> work with open source it is definitely a good practice to upgrade those with
> latest versions. With newer versions bugs would be minimal, performance
> would be better and you get more functionalities. Your query looks fine; an
> upgrade of hive could sort things out.
> Regards
> Bejoy K S
>
> -----Original Message-----
> From: Edward Capriolo <[email protected]>
> Date: Thu, 17 Mar 2011 08:51:05
> To: [email protected]<[email protected]>
> Reply-To: [email protected]
> Subject: Re: Hadoop error 2 while joining two large tables
>
> I am pretty sure the cloudera distro has an upgrade path to a more recent
> hive.
>
> On Thursday, March 17, 2011, hadoop n00b <[email protected]> wrote:
>> Hello All,
>>
>> Thanks a lot for your response. To clarify a few points -
>>
>> I am on CDH2 with Hive 0.4 (I think). We cannot move to a higher version of
>> Hive as we have to use the Cloudera distro only.
>>
>> All records in the smaller table have at least one record in the larger
>> table (of course a few exceptions could be there but only a few).
>>
>> The join is using the ON clause. The query is something like -
>>
>> select ...
>> from
>> (
>> (select ... from smaller_table)
>> join
>> (select from larger_table)
>> on (smaller_table.col = larger_table.col)
>> )
>>
>> I will try out setting mapred.child.java.opts -Xmx to a higher value and let
>> you know.
>>
>> Is there a pattern or rule of thumb to follow on when to add more nodes?
>>
>> Thanks again!
>>
>> On Thu, Mar 17, 2011 at 1:08 AM, Steven Wong <[email protected]> wrote:
>>
>> In addition, put the smaller table on the left-hand side of a JOIN:
>>
>> SELECT ... FROM small_table JOIN large_table ON ...
>>
>> From: Bejoy Ks [mailto:[email protected]]
>> Sent: Wednesday, March 16, 2011 11:43 AM
>> To: [email protected]
>> Subject: Re: Hadoop error 2 while joining two large tables
>>
>> Hey hadoop n00b
>> I second Mark's thought. But you can definitely try reframing your
>> query to get things rolling. I'm not sure about your hive query. But still, from
>> my experience with joins on huge tables (record counts in the range of
>> hundreds of millions) you should give join conditions with the JOIN ON clause
>> rather than specifying all conditions in WHERE.
>>
>> Say you have a query this way
>> SELECT a.Column1,a.Column2,b.Column1 FROM Table1 a JOIN Table2 b WHERE
>> a.Column4=b.Column1 AND a.Column2=b.Column4 AND a.Column3 > b.Column2;
>>
>> You can definitely reframe this query as
>> SELECT a.Column1,a.Column2,b.Column1 FROM Table1 a JOIN Table2 b
>> ON (a.Column4=b.Column1 AND a.Column2=b.Column4) WHERE a.Column3 >
>> b.Column2;
>>
>> From my understanding Hive supports only equijoins, so you can't have the
>> inequality conditions within JOIN ON; the inequality should go to WHERE.
>> This approach has worked for me when I encountered a similar situation as
>> yours some time ago. Try this out, hope it helps.
>>
>> Regards
>> Bejoy.K.S
>>
>> From: "Sunderlin, Mark" <[email protected]>
>> To: "[email protected]" <[email protected]>
>> Sent: Wed, March 16, 2011 11:22:09 PM
>> Subject: RE: Hadoop error 2 while joining two large tables
>>
>> hadoop n00b asks, “Is adding more nodes the solution to such problem?”
>>
>> Whatever else answers you get, you should append “ … and add more nodes.”
>> More nodes is never a bad thing ;-)
>>
>> ---
>> Mark E. Sunderlin
>> Solutions Architect | AOL Data Warehouse
>> P: 703-256-6935 | C: 540-327-6222
>> AIM: MESunderlin
>> 22000 AOL Way
>
In particular, I remember that back in the Hive 0.3 and 0.4 days many troublesome issues like this came up. A lot of engineering went into 0.5.0 to make joins work better with less memory and to avoid many OutOfMemory conditions. You can see some of my older struggles: http://mail-archives.apache.org/mod_mbox/hadoop-hive-user/201002.mbox/%[email protected]%3E Upgrading is HIGHLY suggested. I remember the 0.4.0 -> 0.5.0 upgrade made problems like this vanish. (That was a happy day.)
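
For reference, here is a minimal HiveQL sketch pulling together the suggestions from this thread (the table and column names are placeholders echoing the earlier messages, and the -Xmx value is only an example, not a recommendation):

-- Raise the child task heap; size the value for what your nodes can actually afford.
SET mapred.child.java.opts=-Xmx1024m;

-- Smaller table on the left-hand side of the JOIN, equality conditions in ON,
-- and the non-equality predicate left in WHERE.
SELECT s.Column1, s.Column2, l.Column1
FROM smaller_table s
JOIN larger_table l
ON (s.Column4 = l.Column1 AND s.Column2 = l.Column4)
WHERE s.Column3 > l.Column2;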
