Thanks for your reply, Viral. However, in later versions of Hive you don't have to tell Hive anything (i.e. which is the smaller table). At runtime Hive itself identifies the smaller table and runs the local map task on it, irrespective of whether it comes on the left or the right side of the join. There is a Facebook post on such join optimizations within Hive; you can get a better picture from that.

Regards
Bejoy K S
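P.S. A minimal sketch of what I mean (the table and column names below are just placeholders, not from any real schema): with automatic conversion switched on, Hive works out at runtime which side is small enough to be loaded into a hashtable by a local task, and the other side is streamed, whichever order they appear in.

set hive.auto.convert.join=true;
-- No STREAMTABLE hint and no particular join order is needed here; if either
-- table is under the small-table threshold, Hive converts this common join
-- into a map join, keeping the common join as a backup task.
SELECT b.col1, s.col2
FROM t_big b
JOIN t_small s
  ON b.col1 = s.col1;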
-----Original Message-----
From: Viral Bajaria <viral.baja...@gmail.com>
Date: Fri, 1 Apr 2011 01:25:41
To: <user@hive.apache.org>; <bejoy...@yahoo.com>
Reply-To: user@hive.apache.org
Subject: Re: Hive map join - process a little larger tables with moderate number of rows

Bejoy,

We still use an older version of Hive (0.5). In that version the join order used to matter: you needed to keep the largest table as the rightmost in your JOIN sequence to make sure that it is streamed, and thus avoid the OOM exceptions caused by mappers that load an entire table into memory and exceed the JVM -Xmx limit. If you cannot do that, then you can use the STREAMTABLE hint as follows:

SELECT /*+ STREAMTABLE(t1) */ * FROM t1 JOIN t2 ON t1.col1 = t2.col1 <.....>

Thanks,
Viral

On Thu, Mar 31, 2011 at 10:15 PM, <bejoy...@yahoo.com> wrote:
> Thanks Yongqiang for your reply. I'm running a Hive script which has nearly
> 10 joins in it. Of those, all the map joins involving smaller tables (9 of them
> involve one small table) are running fine. Just one join is on two larger
> tables, and this map join fails; however, since the backup task (common join)
> executes successfully, the whole Hive job still runs to completion.
> In brief, my Hive job is running successfully now, but I want to get the
> failed map join running as well, instead of the common join being executed.
> I'm curious to see what the performance improvement would be with this
> difference in execution.
> To get a map join executed on larger tables, do I have to tune memory
> parameters in Hadoop? Since my entire job already runs to completion and I
> just want to get one map join working, shouldn't altering some Hive map join
> parameters do the job?
> Please advise.
>
> Regards
> Bejoy K S
>
> -----Original Message-----
> From: yongqiang he <heyongqiang...@gmail.com>
> Date: Thu, 31 Mar 2011 16:25:03
> To: <user@hive.apache.org>
> Reply-To: user@hive.apache.org
> Subject: Re: Hive map join - process a little larger tables with moderate number of rows
>
> You possibly got an OOM error when processing the small tables. OOM is a
> fatal error that cannot be controlled by the Hive configs. So can you try
> to increase your memory setting?
>
> thanks
> yongqiang
>
> On Thu, Mar 31, 2011 at 7:25 AM, Bejoy Ks <bejoy...@yahoo.com> wrote:
> > Hi Experts
> > I'm currently working with Hive 0.7, mostly with JOINs. In all permissible
> > cases I'm using map joins by setting the hive.auto.convert.join=true
> > parameter. Usage of local map joins has made a considerable performance
> > improvement in Hive queries. So far I have used this local map join only
> > with the default set of Hive configuration parameters; now I'd like to dig
> > deeper into this and try out the local map join on slightly bigger tables
> > with more rows.
> > Given below is the failure log from one of my local map tasks, which then
> > fell back to executing its backup common join task:
> >
> > 2011-03-31 09:56:54   Starting to launch local task to process map join;  maximum memory = 932118528
> > 2011-03-31 09:56:57   Processing rows:  200000   Hashtable size:  199999   Memory usage:  115481024   rate:  0.124
> > 2011-03-31 09:57:00   Processing rows:  300000   Hashtable size:  299999   Memory usage:  169344064   rate:  0.182
> > 2011-03-31 09:57:03   Processing rows:  400000   Hashtable size:  399999   Memory usage:  232132792   rate:  0.249
> > 2011-03-31 09:57:06   Processing rows:  500000   Hashtable size:  499999   Memory usage:  282338544   rate:  0.303
> > 2011-03-31 09:57:10   Processing rows:  600000   Hashtable size:  599999   Memory usage:  336738640   rate:  0.361
> > 2011-03-31 09:57:14   Processing rows:  700000   Hashtable size:  699999   Memory usage:  391117888   rate:  0.42
> > 2011-03-31 09:57:22   Processing rows:  800000   Hashtable size:  799999   Memory usage:  453906496   rate:  0.487
> > 2011-03-31 09:57:27   Processing rows:  900000   Hashtable size:  899999   Memory usage:  508306552   rate:  0.545
> > 2011-03-31 09:57:34   Processing rows:  1000000  Hashtable size:  999999   Memory usage:  562706496   rate:  0.604
> > FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapredLocalTask
> > ATTEMPT: Execute BackupTask: org.apache.hadoop.hive.ql.exec.MapRedTask
> > Launching Job 4 out of 6
> >
> > Here I'd like to get this local map task running, and for that I tried
> > setting the following Hive parameters:
> >
> > hive -f HiveJob.txt -hiveconf hive.mapjoin.maxsize=1000000 -hiveconf
> > hive.mapjoin.smalltable.filesize=40000000 -hiveconf hive.auto.convert.join=true
> >
> > But setting these two config parameters doesn't make my local map task
> > proceed beyond this stage. I didn't try overriding
> > hive.mapjoin.localtask.max.memory.usage=0.90, because my task log shows
> > that the memory usage rate is just 0.604, so I assume setting it to a
> > larger value won't be the solution in my case. Could someone please guide
> > me on the actual parameters, and the values, I should set to get things
> > rolling.
> >
> > Thank You
> >
> > Regards
> > Bejoy.K.S
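Following Yongqiang's suggestion above to increase the memory setting rather than only the Hive-side thresholds, the sort of thing I'm planning to try next is sketched below. The values are only starting guesses for my cluster, not tested recommendations, and HiveJob.txt is my own script:

set hive.auto.convert.join=true;
-- Small-table threshold in bytes (same value I passed via -hiveconf earlier).
set hive.mapjoin.smalltable.filesize=40000000;
-- Larger child JVM heap for the map tasks that load the hashtable, per Viral's
-- note about mappers running out of -Xmx; 1 GB is only a starting guess.
set mapred.child.java.opts=-Xmx1024m;

Since the hashtable itself is built by the local task on the client side (the log shows "maximum memory = 932118528", roughly 890 MB), I'd also look at the heap given to the Hive client JVM (e.g. HADOOP_HEAPSIZE) before re-running hive -f HiveJob.txt; whether that is the right knob for the local task is an assumption I still need to verify.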