Hi Abhishek

A map join doesn't need any common columns to work. It simply moves the join
processing into the mappers rather than doing it in a reducer: the small table
is loaded into memory and joined against each split of the large table.
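
As a sketch, a map-side Cartesian join would look something like this (table
and column names are placeholders; the small-table threshold property is
hive.smalltable.filesize on older Hive releases and
hive.mapjoin.smalltable.filesize on later ones):

    -- raise the threshold above the small table's size on disk
    set hive.mapjoin.smalltable.filesize=250000000;

    -- JOIN with no ON clause gives the Cartesian product; the MAPJOIN
    -- hint asks Hive to load small_table into each mapper's memory
    SELECT /*+ MAPJOIN(s) */ b.col1, s.col2
    FROM big_table b
    JOIN small_table s;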

The actual bottleneck is the single reducer. We need to figure out why only one
reducer is fired rather than the set value of 17. Are you using ORDER BY in
your query? If so, it forces the number of reducers to 1, because a total
ordering has to be produced by a single reducer.
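
If a total ordering isn't required, SORT BY honours mapred.reduce.tasks and
sorts within each reducer instead of forcing everything through one. A rough
sketch with placeholder names:

    set mapred.reduce.tasks=17;

    -- ORDER BY would force a single reducer for a global sort;
    -- SORT BY keeps 17 reducers, each producing sorted output
    SELECT col1, col2
    FROM your_table
    SORT BY col1;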

Can you provide the full console output here, starting from the properties you
set, then your query, and the error, so that we can understand your issue and
help you better? Also, can you get the exact data sizes of the two tables?
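
(If you are unsure where to look, running something like
hadoop fs -dus /user/hive/warehouse/<table_name> should give each table's size
on HDFS, assuming the default warehouse location.)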

Regards
Bejoy KS

> From: abhishek.dod...@gmail.com
> Date: Sat, 29 Sep 2012 07:44:06 -0700
> Subject: Re: Cartesian Product in HIVE
> To: user@hive.apache.org; bejoy...@yahoo.com
> 
> Thanks for the reply, Bejoy.
> 
> I tried the map join by setting the property you mentioned, and even
> increased the small table file size threshold. The 20k-row table should be
> no more than 200 MB, but it does not work.
> 
> For a Cartesian product of two tables that don't share any columns, does a
> map join still work?
> 
> By applying the settings below with the STREAMTABLE hint, it was processing
> around 5 billion rows per hour, so the process took around 4 hrs.
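> 
> The hint was of this form (table and column names are placeholders):
> 
>   -- stream the big table through the reducers; buffer the small one
>   SELECT /*+ STREAMTABLE(b) */ b.col1, s.col2
>   FROM big_table b
>   JOIN small_table s;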
> 
>  set io.sort.mb=512
>  set mapred.reduce.tasks=17
>  set io.sort.factor=256
>  set mapred.child.java.opts=-Xmx2048m
>  set hive.map.aggr=true
>  set hive.exec.parallel=true
>  set mapred.job.reuse.jvm.num.tasks=-1
>  set mapred.map.tasks.speculative.execution=false
>  set mapred.reduce.tasks.speculative.execution=false
> 
> By setting hive.auto.convert.join=true for the map join and increasing the
> small table file size threshold, the job initiated, but the progress stayed at:
> 
> map 0 % -- reduce 0%
> map 0 % -- reduce 0%
> map 0 % -- reduce 0%
> map 0 % -- reduce 0%
> map 0 % -- reduce 0%
> 
> It stayed like this for 30 minutes, so I killed the job.
> 
> My doubts are:
> 
> -- I increased mapred.reduce.tasks to 17, but the Hive query engine
> fired only one reducer for the job.
> -- Each slave node has around 96 GB of memory; can I override some
> parameters other than those mentioned above to make efficient use of it?
> -- How can I increase the number of rows a reducer processes at a time,
> or per second?
> -- Are there any other techniques to optimize the query?
> 
> Thanks for your response and your time, Bejoy.
> 
> Regards
> abhi
> 
> On Fri, Sep 28, 2012 at 10:15 PM, Bejoy KS <bejoy...@yahoo.com> wrote:
> > Hi Abhishek
> >
> > What is the data size of the 20k rows? If it is small enough, you can go
> > for a map join, which will give you a performance boost.
> >
> > set hive.auto.convert.join = true;
> > Regards
> > Bejoy KS
> >
> > Sent from handheld, please excuse typos.
> > ________________________________
> > From: Abhishek <abhishek.dod...@gmail.com>
> > Date: Fri, 28 Sep 2012 23:56:16 -0400
> > To: Hive<user@hive.apache.org>
> > ReplyTo: user@hive.apache.org
> > Cc: Bejoy Ks<bejoy...@yahoo.com>
> > Subject: Cartesian Product in HIVE
> >
> > Hi all,
> >
> > I have a use case where we are doing a Cartesian product of two tables:
> > one table with 990k rows,
> > and a second table with 20k rows.
> >
> > The query is a Cartesian product of just two columns.
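> >
> > In form it is something like this (table and column names are placeholders):
> >
> >   SELECT a.col1, b.col2
> >   FROM table_990k a
> >   JOIN table_20k b;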
> >
> > So it comes to around 20 billion rows (990k x 20k = 19.8 billion).
> >
> > It processes around 5 billion rows per hour.
> >
> > So the process takes around 4 hrs.
> >
> > I have overridden some of the properties in Hive:
> >
> > set io.sort.mb=512
> > set mapred.reduce.tasks=17
> > set io.sort.factor=256
> > set mapred.child.java.opts=-Xmx2048m
> > set hive.map.aggr=true
> > set hive.exec.parallel=true
> > set mapred.job.reuse.jvm.num.tasks=-1
> > set mapred.map.tasks.speculative.execution=false
> > set mapred.reduce.tasks.speculative.execution=false
> >
> >
> > How can I optimize it to get better results?
> >
> > Even though I have set reduce tasks to 17, only one reducer is spawned for
> > the query. Did I do something wrong?
> >
> > My cluster configuration:
> > 20 slave nodes running cdh3u5,
> > with 240 map slots and 120 reduce slots.
> > Block size is 128 MB.
> > Memory on each slave node is 96 GB.
> >
> > How can I make the query perform better?
> >
> > How can I increase the number of rows processed by a reducer at a time,
> > or per second?
> >
> > Any help is greatly appreciated.
> >
> > Regards
> > Abhi