Re: Cartesian Product in HIVE

Abhishek Sun, 30 Sep 2012 21:02:58 -0700

Thanks for the reply Bejoy.

I did not use any ORDER BY in the query.


Here are the properties I used, the query, and the table sizes:

set mapred.reduce.tasks=17;
set mapred.child.java.opts=xmx2073741824;
set io.sort.mb=512;
set io.sort.factor=250;
set mapred.reduce.parallel.copies=true;
set mapred.job.reuse.jvm.num.tasks=1;
set hive.mapred.reduce.tasks.speculative.execution=false;
set hive.mapred.map.tasks.speculative.execution=false;

CREATE TABLE t1 AS
SELECT /*+ STREAMTABLE(t2) */
t2.col1,
t3.col1
FROM table2 t2
JOIN table3 t3

table2 : 997406 rows
total bytes: 20848934 -- 19.88 mb

table3 : 20773 rows
total bytes: 353127 -- 0.33 mb
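
As a rough check on the scale of this join, the row counts above can be multiplied out. This is only a sketch in Python, not part of the Hive job; the rate of 5 billion rows per hour is the figure reported for the earlier run quoted further down, and the constant names are made up for illustration.

```python
# Row counts reported above for the two tables.
TABLE2_ROWS = 997_406
TABLE3_ROWS = 20_773

# A join with no ON condition pairs every row of table2 with every row of table3.
output_rows = TABLE2_ROWS * TABLE3_ROWS

# Throughput reported for the earlier run (about 5 billion rows per hour).
ROWS_PER_HOUR = 5_000_000_000
estimated_hours = output_rows / ROWS_PER_HOUR

print(f"output rows: {output_rows:,}")
print(f"estimated hours at 5 billion rows/hour: {estimated_hours:.1f}")
```

The product comes to roughly 20.7 billion rows, which matches the "around 20 billion rows" and "around 4 hrs" figures mentioned in the thread.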

# of mappers: 4
# of reducers: 1
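
One possible explanation for the single reducer, offered as an assumption rather than something confirmed in this thread: a reduce-side join routes rows to reducers by hashing the join key, and a join with no ON condition has no key to hash, so every row lands on the same reducer regardless of mapred.reduce.tasks. The Python sketch below only simulates that routing; `reducer_for` is a hypothetical helper, not Hive code.

```python
from collections import Counter

NUM_REDUCERS = 17  # the value set via mapred.reduce.tasks


def reducer_for(join_key: tuple, num_reducers: int) -> int:
    """Hypothetical stand-in for hash partitioning on the join key."""
    return hash(join_key) % num_reducers


rows = range(1000)

# Join with a key column: different keys hash to different reducers.
keyed = Counter(reducer_for((row,), NUM_REDUCERS) for row in rows)

# Cartesian product: no join columns, so every row carries the same empty key.
keyless = Counter(reducer_for((), NUM_REDUCERS) for _ in rows)

print("reducers used with a join key:", len(keyed))
print("reducers used with no join key:", len(keyless))
```

In this simulation the keyless case always uses exactly one reducer, however large NUM_REDUCERS is set.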

Regards
Abhi

On Sep 30, 2012, at 9:35 AM, Bejoy KS <bejo...@outlook.com> wrote:

> Hi Abhishek
> 
> 
> A map join does not need any common columns to work. It just moves the
> join processing to the mappers rather than doing it in a reducer.
> 
> The actual bottleneck is the single reducer. We need to figure out why only one
> reducer is launched rather than the configured 17. Are you using ORDER BY in
> your query? If so, it sets the number of reducers to 1.
> 
> Can you provide the full console output here so that we'll be able to
> understand your issue and help you better (starting from the properties you
> set, your query, and the error)? Also, can you get the exact data sizes for
> the two tables?
> 
> Regards
> Bejoy KS
> 
> > From: abhishek.dod...@gmail.com
> > Date: Sat, 29 Sep 2012 07:44:06 -0700
> > Subject: Re: Cartesian Product in HIVE
> > To: user@hive.apache.org; bejoy...@yahoo.com
> > 
> > Thanks for the reply Bejoy.
> > 
> > I tried the map join by setting the property you mentioned, and even
> > increased the small table file size. The 20k-row table should be
> > no more than 200 MB, but it does not work.
> > 
> > This is a Cartesian product of tables that don't have any columns in common.
> > Does a map join work here?
> > 
> > With the settings below and the STREAMTABLE hint, it was processing
> > around 5 billion rows per hour, so the process took around 4 hours.
> > 
> > Set io.sort.mb=512
> > Set mapred.reduce.tasks=17
> > Set io.sort.factor=256
> > Set mapred.child.jvm.opts=-Xmx2048mb
> > Set hive.map.aggr=true
> > Set hive.exec.parallel=true
> > Set mapred.tasks.reuse.num.tasks=-1
> > Set hive.mapred.map.speculative.execution=false
> > Set hive.mapred.reduce.speculative.execution=false
> > 
> > With the map join setting (set hive.auto.convert.join = true;) and an
> > increased small table file size, the job started but stayed at:
> > 
> > map 0 % -- reduce 0%
> > map 0 % -- reduce 0%
> > map 0 % -- reduce 0%
> > map 0 % -- reduce 0%
> > map 0 % -- reduce 0%
> > 
> > It stayed like this for 30 minutes, so I killed the task.
> > 
> > My questions are:
> > 
> > -- I have increased the number of reducers (mapred.reduce.tasks) to 17, but
> > the Hive query engine launched only one reducer for the job.
> > -- My slave nodes have around 96 GB of memory. Can I override some
> > parameters, other than those mentioned above, to make efficient use of
> > it?
> > -- How can I increase the number of rows processed by a reducer at a single
> > moment, or per second?
> > -- Are there any other techniques to optimize the query?
> > 
> > Thanks for your response and your time, Bejoy.
> > 
> > Regards
> > abhi
> > 
> > On Fri, Sep 28, 2012 at 10:15 PM, Bejoy KS <bejoy...@yahoo.com> wrote:
> > > Hi Abhishek
> > >
> > > What is the data size of the 20k rows? If it is small, you can go
> > > for a map join, which will give you a performance boost.
> > >
> > > set hive.auto.convert.join = true;
> > > Regards
> > > Bejoy KS
> > >
> > > Sent from handheld, please excuse typos.
> > > ________________________________
> > > From: Abhishek <abhishek.dod...@gmail.com>
> > > Date: Fri, 28 Sep 2012 23:56:16 -0400
> > > To: Hive<user@hive.apache.org>
> > > ReplyTo: user@hive.apache.org
> > > Cc: Bejoy Ks<bejoy...@yahoo.com>
> > > Subject: Cartesian Product in HIVE
> > >
> > > Hi all,
> > >
> > > I have a use case where we are doing a Cartesian product of two tables:
> > > one table with 990k rows and a second table with 20k rows.
> > >
> > > The query is a Cartesian product of just two columns,
> > > so it comes to around 20 billion rows.
> > >
> > > It is processing around 5 billion rows per hour,
> > > so the process takes around 4 hours.
> > >
> > > I have overridden some of the properties in Hive:
> > >
> > > Set io.sort.mb=512
> > >
> > > Set mapred.reduce.tasks=17
> > >
> > > Set io.sort.factor=256
> > > Set mapred.child.jvm.opts=-Xmx2048mb
> > > Set hive.map.aggr=true
> > > Set hive.exec.parallel=true
> > > Set mapred.tasks.reuse.num.tasks=-1
> > > Set hive.mapred.map.speculative.execution=false
> > > Set hive.mapred.reduce.speculative.execution=false
> > >
> > >
> > > How can I optimize it to get better results?
> > >
> > > Even though I have set reduce tasks to 17, only one reducer is spawned for
> > > the query. Did I do something wrong?
> > >
> > > My cluster configuration:
> > > 20 slave nodes running cdh3u5
> > > 240 map slots
> > > 120 reduce slots
> > > Block size: 128 MB
> > > Memory on the slave node: 96 GB
> > >
> > > How can the query perform better?
> > >
> > > How can I increase the number of rows processed by a reducer at a single moment,
> > > or per second?
> > >
> > > Any help is greatly appreciated.
> > >
> > > Regards
> > > Abhi
