Hey hadoop n00b
I second Mark's thought. But you can definitely try reframing your query to
get things rolling. I'm not sure about your Hive query, but from my experience
with joins on huge tables (record counts in the range of hundreds of millions),
you should put join conditions in the JOIN ON clause rather than specifying
all conditions in WHERE.
Say you have a query like this:
SELECT a.Column1, a.Column2, b.Column1 FROM Table1 a JOIN Table2 b
WHERE a.Column4 = b.Column1 AND a.Column2 = b.Column4 AND a.Column3 > b.Column2;
You can reframe it as:
SELECT a.Column1, a.Column2, b.Column1 FROM Table1 a JOIN Table2 b
ON (a.Column4 = b.Column1 AND a.Column2 = b.Column4) WHERE a.Column3 > b.Column2;
From my understanding, Hive supports only equi-joins, so you can't have
inequality conditions within JOIN ON; inequalities should go in WHERE. This
approach worked for me when I encountered a similar situation to yours some
time ago.
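As a side note on the heap space and GC errors you mentioned: this is only an untested sketch, and the exact values are assumptions that depend on your cluster's free RAM and data sizes, but increasing the child JVM heap and lowering the data handled per reducer are the first knobs I'd try from the Hive CLI:

```sql
-- Sketch only; the -Xmx value and byte threshold below are guesses,
-- tune them for your own nodes.

-- Give each map/reduce child JVM a larger heap (needs spare RAM per node).
SET mapred.child.java.opts=-Xmx2048m;

-- Lower the bytes each reducer handles so Hive launches more reducers,
-- each holding less join data in memory.
SET hive.exec.reducers.bytes.per.reducer=256000000;
```

Both settings apply per session, so you can experiment without touching the cluster-wide configuration.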
Try this out, hope it helps.
Regards
Bejoy.K.S
________________________________
From: "Sunderlin, Mark" <[email protected]>
To: "[email protected]" <[email protected]>
Sent: Wed, March 16, 2011 11:22:09 PM
Subject: RE: Hadoop error 2 while joining two large tables
hadoop n00b asks, “Is adding more nodes the solution to such a problem?”
Whatever other answers you get, you should append “… and add more nodes.”
More nodes is never a bad thing ;-)
---
Mark E. Sunderlin
Solutions Architect |AOL Data Warehouse
P: 703-256-6935 | C: 540-327-6222
AIM: MESunderlin
22000 AOL Way | Dulles, VA | 20166
From:hadoop n00b [mailto:[email protected]]
Sent: Wednesday, March 16, 2011 3:33 AM
To: [email protected]
Subject: Fwd: Hadoop error 2 while joining two large tables
Hello,
I am trying to execute a query that joins two large tables (3 million and 20
million records). I am getting Hadoop error code 2 during execution. This
happens mainly while the reducers are running. Sometimes the reducers complete
100% and then the error occurs. The logs mention running out of heap space
and the GC overhead limit being exceeded.
I am running a 6-node cluster with a child JVM memory of 1GB.
Are there any parameters I could tweak to make the jobs run? Is adding more
nodes the solution to such a problem?
Thanks!