Hi Abhishek

What is the data size of the 20k-row table? If it is small enough, you can go for a 
map join, which will give you a performance boost.

set hive.auto.convert.join = true;
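
A rough sketch of what this could look like for the query in question. The table and column names (big_table, small_table, col_a, col_b) are hypothetical placeholders, since the original query was not posted. Whether the automatic conversion applies to a join with no keys may depend on the Hive version, so the explicit MAPJOIN hint is shown as well. A map-side join skips the reduce phase entirely, which should matter here: a join with no join keys is typically sent to a single reducer, and that is likely why only one reducer was spawned.

```sql
-- Assumed names: big_table (~990k rows), small_table (~20k rows).
-- Let Hive convert joins to map joins when the small side fits in memory.
set hive.auto.convert.join = true;

-- Size threshold (in bytes) below which a table counts as "small";
-- 25000000 is the usual default, raise it only if the 20k-row table is larger.
set hive.mapjoin.smalltable.filesize = 25000000;

-- Explicit hint: load small_table into memory on every mapper and
-- stream big_table past it, producing the Cartesian product map-side.
SELECT /*+ MAPJOIN(s) */ b.col_a, s.col_b
FROM big_table b JOIN small_table s;
```

With this plan the work is spread over the map tasks that read big_table, so the number of input splits, not mapred.reduce.tasks, determines the parallelism.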

Regards
Bejoy KS

Sent from handheld, please excuse typos.

-----Original Message-----
From: Abhishek <abhishek.dod...@gmail.com>
Date: Fri, 28 Sep 2012 23:56:16 
To: Hive<user@hive.apache.org>
Reply-To: user@hive.apache.org
Cc: Bejoy Ks<bejoy...@yahoo.com>
Subject: Cartesian Product in HIVE

Hi all,

I have a use case where we are doing a Cartesian product of two tables:
one table with 990k rows and a second table with 20k rows.

The query is a Cartesian product of just two columns, so the result comes to
around 20 billion rows (990k x 20k = 19.8 billion).

In one hour it processes around 5 billion rows, so the whole process takes
around 4 hrs.

I have overridden some of the properties in Hive:

>> Set io.sort.mb=512
>> Set mapred.reduce.tasks=17
>> Set io.sort.factor=256
>> Set mapred.child.jvm.opts=-Xmx2048mb
>> Set hive.map.aggr=true
>> Set hive.exec.parallel=true
>> Set mapred.tasks.reuse.num.tasks=-1
>> Set hive.mapred.map.speculative.execution=false
>> Set hive.mapred.reduce.speculative.execution=false

How can I optimize it to get better results?

Even though I have set reduce tasks to 17, only one reducer is spawned for the 
query. Did I do something wrong?

My cluster configuration:
20 slave nodes running cdh3u5
240 map slots
120 reduce slots
Block size is 128 MB
Memory per slave node is 96 GB

How can the query perform better?

How can I increase the number of rows processed by the reducer at a single moment, or 
per second?

Any help is greatly appreciated.

Regards
Abhi
