Thanks for the reply Bejoy.

On Sun, Sep 30, 2012 at 6:35 AM, Bejoy KS <bejo...@outlook.com> wrote:
> Hi Abhishek,
>
>
> No need of any common columns for a map join to work. It simply moves the
> join processing into the mapper rather than doing it in a reducer.
>
> The actual bottleneck is the single reducer. We need to figure out why only
> one reducer is fired rather than the set value of 17. Are you using ORDER BY
> in your query? If so, it forces the number of reducers to 1.

-----  I did not use ORDER BY in the query.
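
Side note on the ORDER BY point: ORDER BY gives a total order and therefore runs through a single reducer no matter what mapred.reduce.tasks says, while SORT BY only orders rows within each reducer and keeps the configured count. A minimal sketch for illustration; some_table and col1 are made-up names:

set mapred.reduce.tasks=17;

-- ORDER BY: global order, Hive falls back to a single reducer
-- SELECT col1 FROM some_table ORDER BY col1;

-- DISTRIBUTE BY + SORT BY: per-reducer order, all 17 reducers are used
SELECT col1
FROM some_table
DISTRIBUTE BY col1
SORT BY col1;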

>
> Can you provide the full console output here so that we'll be able to
> understand your issue and help you better (starting from the properties you
> set, your query, and the error)? Also, can you get the exact data sizes of
> the two tables?

-----  set mapred.reduce.tasks=17;
set mapred.child.java.opts=-Xmx2073741824;
set io.sort.mb=512;
set io.sort.factor=250;
set mapred.reduce.parallel.copies=true;
set mapred.job.reuse.jvm.num.tasks=1;
set hive.mapred.reduce.tasks.speculative.execution=false;
set hive.mapred.map.tasks.speculative.execution=false;

CREATE TABLE t1 AS
SELECT /*+ STREAMTABLE(t2) */
  t2.col1 AS t2_col1,  -- aliased so the CTAS output columns do not collide
  t3.col1 AS t3_col1
FROM table2 t2
JOIN table3 t3;

table2 : 997406 rows
total bytes: 20848934 (~19.88 MB)

table3 : 20773 rows
total bytes: 353127 (~0.33 MB)

# of mappers: 4
# of reducers: 1
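
Given that table3 is only ~0.33 MB, the cross join itself looks like a natural candidate for a map-side join, which would drop the reduce stage entirely: each mapper loads table3 into memory and emits the product for its split of table2, so the single-reducer bottleneck disappears. A minimal sketch using the MAPJOIN hint in place of STREAMTABLE (whether the conversion kicks in for a join with no ON clause depends on the Hive version, so treat this as something to test; the column aliases are the same made-up ones as above):

set hive.auto.convert.join=true;
set hive.mapjoin.smalltable.filesize=25000000;  -- bytes; table3 (~353 KB) is far below this

CREATE TABLE t1 AS
SELECT /*+ MAPJOIN(t3) */
  t2.col1 AS t2_col1,
  t3.col1 AS t3_col1
FROM table2 t2
JOIN table3 t3;

If it converts, the job runs map-only (reduce stays at 0% by design) and throughput scales with the map slots rather than with one reducer.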



>
> Regards
> Bejoy KS
>
>> From: abhishek.dod...@gmail.com
>> Date: Sat, 29 Sep 2012 07:44:06 -0700
>> Subject: Re: Cartesian Product in HIVE
>> To: user@hive.apache.org; bejoy...@yahoo.com
>
>>
>> Thanks for the reply Bejoy.
>>
>> I tried the map join by setting the property you mentioned, and I even
>> increased the small table file size limit; the 20k-row table should be
>> well under 200 MB, but it does not work.
>>
>> The two tables in the Cartesian product don't have any common columns --
>> does map join work in that case?
>>
>> With the settings below and the STREAMTABLE hint, it was processing
>> around 5 billion rows per hour, so the process took around 4 hours.
>>
>> set io.sort.mb=512;
>> set mapred.reduce.tasks=17;
>> set io.sort.factor=256;
>> set mapred.child.java.opts=-Xmx2048m;
>> set hive.map.aggr=true;
>> set hive.exec.parallel=true;
>> set mapred.job.reuse.jvm.num.tasks=-1;
>> set mapred.map.tasks.speculative.execution=false;
>> set mapred.reduce.tasks.speculative.execution=false;
>>
>> With the auto map join enabled (set hive.auto.convert.join=true;) and the
>> small table file size limit increased, the job started, but it just sat at:
>>
>> map 0 % -- reduce 0%
>> map 0 % -- reduce 0%
>> map 0 % -- reduce 0%
>> map 0 % -- reduce 0%
>> map 0 % -- reduce 0%
>>
>> It stayed like that for 30 minutes, so I killed the job.
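
One thing worth checking when the auto-converted map join appears to hang: with hive.auto.convert.join=true, Hive first runs a local task that loads the small table into an in-memory hash table before the actual MapReduce job starts, and that step has its own limits, separate from the join hint. A minimal sketch of the knobs involved, as I understand them for this Hive generation (values are illustrative, not recommendations):

set hive.auto.convert.join=true;
set hive.mapjoin.smalltable.filesize=25000000;     -- max small-table size in bytes for auto conversion
set hive.mapjoin.localtask.max.memory.usage=0.90;  -- fraction of the local task's heap the hash table may use

The local task normally prints its own progress to the console before the map/reduce percentages start moving, so the full console output Bejoy asked for should show whether that step ever completed.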
>>
>> My doubts are:
>>
>> -- I increased mapred.reduce.tasks to 17, but Hive still fired only one
>> reducer for the job (see the sketch after this list).
>> -- Each slave node has around 96 GB of memory. Can I override some
>> parameters, other than the ones above, to make efficient use of it?
>> -- How can I increase the number of rows a reducer processes at any
>> given moment, or per second?
>> -- Are there any other techniques to optimize the query?
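
On the single-reducer doubt: as far as I know, a join with no join key (which is what a Cartesian product is) gives Hive nothing to partition on, so all rows end up on one reducer and mapred.reduce.tasks is effectively ignored; converting the join to a map-side join, or restructuring the query so it has a distribution key, is what actually removes that bottleneck. For queries that do have keys, a minimal sketch of the settings that control reducer parallelism (values are illustrative only):

set mapred.reduce.tasks=17;                          -- hard override of Hive's estimate
set hive.exec.reducers.bytes.per.reducer=268435456;  -- ~256 MB of input per reducer when Hive estimates
set hive.exec.reducers.max=120;                      -- cap at the cluster's 120 reduce slots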
>>
>> Thanks for response and your time Bejoy.
>>
>> Regards
>> abhi
>>
>>
>>
>>
>>
>>
>>
>> On Fri, Sep 28, 2012 at 10:15 PM, Bejoy KS <bejoy...@yahoo.com> wrote:
>> > Hi Abhishek,
>> >
>> > What is the data size of the 20k rows? If it is small enough, you can
>> > go in for a map join, which will give you a performance boost.
>> >
>> > set hive.auto.convert.join = true;
>> > Regards
>> > Bejoy KS
>> >
>> > Sent from handheld, please excuse typos.
>> > ________________________________
>> > From: Abhishek <abhishek.dod...@gmail.com>
>> > Date: Fri, 28 Sep 2012 23:56:16 -0400
>> > To: Hive<user@hive.apache.org>
>> > ReplyTo: user@hive.apache.org
>> > Cc: Bejoy Ks<bejoy...@yahoo.com>
>> > Subject: Cartesian Product in HIVE
>> >
>> > Hi all,
>> >
>> > I have a use case where we are doing a Cartesian product of two tables:
>> > one table with
>> > 990k rows
>> > and a second table with
>> > 20k rows.
>> >
>> > The query is a Cartesian product of just two columns,
>> >
>> > so it comes to around 20 billion rows (990,000 x 20,000 is roughly 19.8 billion).
>> >
>> > It is processing around 5 billion rows per hour,
>> >
>> > so the process takes around 4 hours.
>> >
>> > I have overridden some of the properties in Hive:
>> >
>> > set io.sort.mb=512;
>> > set mapred.reduce.tasks=17;
>> > set io.sort.factor=256;
>> > set mapred.child.java.opts=-Xmx2048m;
>> > set hive.map.aggr=true;
>> > set hive.exec.parallel=true;
>> > set mapred.job.reuse.jvm.num.tasks=-1;
>> > set mapred.map.tasks.speculative.execution=false;
>> > set mapred.reduce.tasks.speculative.execution=false;
>> >
>> >
>> > How can I optimize it to get better results?
>> >
>> > Even though I have set the reduce tasks to 17, only one reducer is
>> > spawned for the query. Did I do something wrong?
>> >
>> > My cluster configuration:
>> > 20 slave nodes running CDH3u5,
>> > with 240 map slots and
>> > 120 reduce slots.
>> > Block size is 128 MB.
>> > Memory on each slave node is 96 GB.
>> >
>> > How can the query perform better?
>> >
>> > How can I increase the number of rows a reducer processes at any given
>> > moment, or per second?
>> >
>> > Any help is greatly appreciated.
>> >
>> > Regards
>> > Abhi
