Thanks for your reply, Viral. However, in later versions of Hive you don't have to tell Hive anything (i.e. which is the smaller table). At runtime Hive itself identifies the smaller table and runs the local map task on it, irrespective of whether it comes on the left or the right side of the join. There is a Facebook post on such join optimizations within Hive; you can get a better picture from that.

Regards
Bejoy K S
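P.S. A minimal sketch of what I mean (the table and column names below are just placeholders, not from any real schema): with automatic conversion switched on, Hive works out at runtime which side is small enough to be loaded into a hashtable by a local task, and the other side is streamed, whichever order they appear in.

set hive.auto.convert.join=true;
-- No STREAMTABLE hint and no particular join order is needed here; if either
-- table is under the small-table threshold, Hive converts this common join
-- into a map join, keeping the common join as a backup task.
SELECT b.col1, s.col2
FROM t_big b
JOIN t_small s
  ON b.col1 = s.col1;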
-----Original Message-----
From: Viral Bajaria <viral.baja...@gmail.com>
Date: Fri, 1 Apr 2011 01:25:41
To: <user@hive.apache.org>; <bejoy...@yahoo.com>
Reply-To: user@hive.apache.org
Subject: Re: Hive map join - process a little larger tables with moderate number of rows

Bejoy,

We still use an older version of Hive (0.5). In that version the join order used to matter: you needed to keep the largest table as the rightmost in your JOIN sequence to make sure that it is streamed, and thus avoid the OOM exceptions caused by mappers that load an entire table into memory and exceed the JVM -Xmx limit. If you cannot do that, then you can use the STREAMTABLE hint as follows:

SELECT /*+ STREAMTABLE(t1) */ * FROM t1 JOIN t2 ON t1.col1 = t2.col1 <.....>

Thanks,
Viral

On Thu, Mar 31, 2011 at 10:15 PM, <bejoy...@yahoo.com> wrote:
> Thanks Yongqiang for your reply. I'm running a Hive script which has nearly
> 10 joins in it. Of those, all the map joins involving smaller tables (9 of them
> involve one small table) are running fine. Just one join is on two larger
> tables, and this map join fails; however, since the backup task (common join)
> executes successfully, the whole Hive job still runs to completion.
> In brief, my Hive job is running successfully now, but I want to get the
> failed map join running as well, instead of the common join being executed.
> I'm curious to see what the performance improvement would be with this
> difference in execution.
> To get a map join executed on larger tables, do I have to tune memory
> parameters in Hadoop? Since my entire job already runs to completion and I
> just want to get one map join working, shouldn't altering some Hive map join
> parameters do the job?
> Please advise.
>
> Regards
> Bejoy K S
>
> -----Original Message-----
> From: yongqiang he <heyongqiang...@gmail.com>
> Date: Thu, 31 Mar 2011 16:25:03
> To: <user@hive.apache.org>
> Reply-To: user@hive.apache.org
> Subject: Re: Hive map join - process a little larger tables with moderate number of rows
>
> You possibly got an OOM error when processing the small tables. OOM is a
> fatal error that cannot be controlled by the Hive configs. So can you try
> to increase your memory setting?
>
> thanks
> yongqiang
>
> On Thu, Mar 31, 2011 at 7:25 AM, Bejoy Ks <bejoy...@yahoo.com> wrote:
> > Hi Experts
> > I'm currently working with Hive 0.7, mostly with JOINs. In all permissible
> > cases I'm using map joins by setting the hive.auto.convert.join=true
> > parameter. Usage of local map joins has made a considerable performance
> > improvement in Hive queries. So far I have used this local map join only
> > with the default set of Hive configuration parameters; now I'd like to dig
> > deeper into this and try out the local map join on slightly bigger tables
> > with more rows.
> > Given below is the failure log from one of my local map tasks, which then
> > fell back to executing its backup common join task:
> >
> > 2011-03-31 09:56:54   Starting to launch local task to process map join;  maximum memory = 932118528
> > 2011-03-31 09:56:57   Processing rows:  200000   Hashtable size:  199999   Memory usage:  115481024   rate:  0.124
> > 2011-03-31 09:57:00   Processing rows:  300000   Hashtable size:  299999   Memory usage:  169344064   rate:  0.182
> > 2011-03-31 09:57:03   Processing rows:  400000   Hashtable size:  399999   Memory usage:  232132792   rate:  0.249
> > 2011-03-31 09:57:06   Processing rows:  500000   Hashtable size:  499999   Memory usage:  282338544   rate:  0.303
> > 2011-03-31 09:57:10   Processing rows:  600000   Hashtable size:  599999   Memory usage:  336738640   rate:  0.361
> > 2011-03-31 09:57:14   Processing rows:  700000   Hashtable size:  699999   Memory usage:  391117888   rate:  0.42
> > 2011-03-31 09:57:22   Processing rows:  800000   Hashtable size:  799999   Memory usage:  453906496   rate:  0.487
> > 2011-03-31 09:57:27   Processing rows:  900000   Hashtable size:  899999   Memory usage:  508306552   rate:  0.545
> > 2011-03-31 09:57:34   Processing rows:  1000000  Hashtable size:  999999   Memory usage:  562706496   rate:  0.604
> > FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapredLocalTask
> > ATTEMPT: Execute BackupTask: org.apache.hadoop.hive.ql.exec.MapRedTask
> > Launching Job 4 out of 6
> >
> > Here I'd like to get this local map task running, and for that I tried
> > setting the following Hive parameters:
> >
> > hive -f HiveJob.txt -hiveconf hive.mapjoin.maxsize=1000000 -hiveconf
> > hive.mapjoin.smalltable.filesize=40000000 -hiveconf hive.auto.convert.join=true
> >
> > But setting these two config parameters doesn't make my local map task
> > proceed beyond this stage. I didn't try overriding
> > hive.mapjoin.localtask.max.memory.usage=0.90, because my task log shows
> > that the memory usage rate is just 0.604, so I assume setting it to a
> > larger value won't be the solution in my case. Could someone please guide
> > me on the actual parameters, and the values, I should set to get things
> > rolling.
> >
> > Thank You
> >
> > Regards
> > Bejoy.K.S
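Following Yongqiang's suggestion above to increase the memory setting rather than only the Hive-side thresholds, the sort of thing I'm planning to try next is sketched below. The values are only starting guesses for my cluster, not tested recommendations, and HiveJob.txt is my own script:

set hive.auto.convert.join=true;
-- Small-table threshold in bytes (same value I passed via -hiveconf earlier).
set hive.mapjoin.smalltable.filesize=40000000;
-- Larger child JVM heap for the map tasks that load the hashtable, per Viral's
-- note about mappers running out of -Xmx; 1 GB is only a starting guess.
set mapred.child.java.opts=-Xmx1024m;

Since the hashtable itself is built by the local task on the client side (the log shows "maximum memory = 932118528", roughly 890 MB), I'd also look at the heap given to the Hive client JVM (e.g. HADOOP_HEAPSIZE) before re-running hive -f HiveJob.txt; whether that is the right knob for the local task is an assumption I still need to verify.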