Hi Abhishek,

What is the data size of the 20k rows? If it is small enough, you can use a map join, which should give you a performance boost.
set hive.auto.convert.join = true;

Regards
Bejoy KS

Sent from handheld, please excuse typos.

-----Original Message-----
From: Abhishek <abhishek.dod...@gmail.com>
Date: Fri, 28 Sep 2012 23:56:16
To: Hive<user@hive.apache.org>
Reply-To: user@hive.apache.org
Cc: Bejoy Ks<bejoy...@yahoo.com>
Subject: Cartesian Product in HIVE

Hi all,

I have a use case where we are doing a Cartesian product of two tables:

One table with 990k rows
Second table with 20k rows

The query is a Cartesian product of just two columns, so it comes to around 20 billion rows. It processes around 5 billion rows per hour, so the whole job takes around 4 hrs.

I have overridden some of the properties in Hive:

Set io.sort.mb=512
Set mapred.reduce.tasks=17
Set io.sort.factor=256
Set mapred.child.jvm.opts=-Xmx2048mb
Set hive.map.aggr=true
Set hive.exec.parallel=true
Set mapred.tasks.reuse.num.tasks=-1
Set hive.mapred.map.speculative.execution=false
Set hive.mapred.reduce.speculative.execution=false

How can I optimize it to get better results? Even though I have set reduce tasks to 17, only one reducer is spawned for the query. Did I do something wrong?

My cluster configuration has 20 slave nodes running cdh3u5, with:

240 map slots
120 reduce slots
Block size is 128 MB
Memory on the slave node is 96 GB

How can the query perform better? How can I increase the number of rows processed by the reducer at a single moment, or per second?

Any help is greatly appreciated.

Regards
Abhi
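
For reference, the map-join suggestion from the reply could be tried roughly as follows. This is only a sketch: big_table, small_table, col_a and col_b are placeholder names, not the tables from the thread, and whether Hive actually turns a join with no join keys into a map join may depend on the Hive version, so it is worth checking the plan with EXPLAIN before running the full job.

```sql
-- Let Hive try to convert the join into a map-side join automatically
-- (the setting suggested in the reply).
set hive.auto.convert.join = true;

-- Alternatively, request the map join explicitly with a hint so the
-- 20k-row table is loaded into memory on each mapper.
-- big_table, small_table, col_a and col_b are placeholder names.
SELECT /*+ MAPJOIN(s) */ b.col_a, s.col_b
FROM big_table b
JOIN small_table s;
```

If the map join takes effect, the cross product is computed in the map tasks and no reduce phase is needed for the join, so the single-reducer bottleneck described in the original message should no longer apply.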