Thanks for the reply Bejoy. I did not any order by in the query.
Here are the properities I have used and query, table sizes ----- set mapred.reduce.tasks=17; set mapred.child.java.opts=xmx2073741824; set io.sort.mb=512; set io.sort.factor=250; set mapred.reduce.parallel.copies=true; set mapred.job.reuse.jvm.num.tasks=1; set hive.mapred.reduce.tasks.speculative.execution=false; set hive.mapred.map.tasks.speculative.execution=false; CREATE TABLE t1 AS SELECT /*+ STREAMTABLE(t2) */ t2.col1, t3.col1 FROM table2 t2 JOIN table3 t3 table2 : 997406 rows total bytes: 20848934 -- 19.88 mb table3 : 20773 rows total bytes: 353127 -- 0.33 mb #of Mappers: 4 #of reducers: 1 Regards Abhi On Sep 30, 2012, at 9:35 AM, Bejoy KS <bejo...@outlook.com> wrote: > Hi Abshiek > > > No need of any similar columns for map join to work. It is just taking the > join process to mapper rather then doing the same in a reducer. > > The actual bottle neck is the single reducer. Need to figure out why only one > reducer is fired rather than the set value of 17. Are you using ORDER BY in > your query? If so, it sets the number of reducers to 1. > > Can you provide the full console stack here so that we'll be able to > understand your issue and help you better? (starting from the properties you > set, your query and the error ). Also can you get the exact data sizes for > two tables. > > Regards > Bejoy KS > > > From: abhishek.dod...@gmail.com > > Date: Sat, 29 Sep 2012 07:44:06 -0700 > > Subject: Re: Cartesian Product in HIVE > > To: user@hive.apache.org; bejoy...@yahoo.com > > > > Thanks for the reply Bejoy. > > > > I tried to map join, by setting the property mentioned by you and Even > > increased the small table file size > > 20k table size would be not more than 200 mb but it doesnot work. > > > > Cartesian product of tables, they dont have any similar columns does > > map join work here?? > > > > By applying below setting with STREAM TABLE HINT it was processing > > around 5 Billion rows per hour,so process took around 4 hrs. > > > > Set io.sort.mb=512 > > Set mapred.reduce.tasks=17 > > Set io.sort.factor=256 > > Set mapred.child.jvm.opts=-Xmx2048mb > > Set hive.map.aggr=true > > Set hive.exec.parallel=true > > Set mapred.tasks.reuse.num.tasks=-1 > > Set hive.mapred.map.speculative.execution=false > > Set hive.mapred.reduce.speculative.execution=false > > > > By using this map join hint set hive.auto.convert.join = true; and > > increasing the small table file size the job initiated but it was > > > > map 0 % -- reduce 0% > > map 0 % -- reduce 0% > > map 0 % -- reduce 0% > > map 0 % -- reduce 0% > > map 0 % -- reduce 0% > > > > Till 30 min it was like this, so i killed the task. > > > > My doubts are: > > > > -- I have increased the reducer number mapred.reduce.tasks to 17, but > > the hive query engine fired only one reducer for the job. > > -- I have slave node memory around 96 GB can i over ride some > > parameters, other than the above mentioned and make efficient use of > > it. > > -- How can I increase number of rows processed by reducer at a single > > moment or per second > > -- Any other techniques to optimize the query > > > > Thanks for response and your time Bejoy. > > > > Regards > > abhi > > > > > > > > > > > > > > > > On Fri, Sep 28, 2012 at 10:15 PM, Bejoy KS <bejoy...@yahoo.com> wrote: > > > Hi Abshiek > > > > > > What is the data size of the 20k rows? If it is lesser then you can go in > > > for map join, which will give you a performance boost. > > > > > > set hive.auto.convert.join = true; > > > Regards > > > Bejoy KS > > > > > > Sent from handheld, please excuse typos. > > > ________________________________ > > > From: Abhishek <abhishek.dod...@gmail.com> > > > Date: Fri, 28 Sep 2012 23:56:16 -0400 > > > To: Hive<user@hive.apache.org> > > > ReplyTo: user@hive.apache.org > > > Cc: Bejoy Ks<bejoy...@yahoo.com> > > > Subject: Cartesian Product in HIVE > > > > > > Hi all, > > > > > > I have use case where we are doing Cartesian product of two tables with > > > One table with > > > 990k rows > > > Second table > > > 20k rows > > > > > > Query is Cartesian product of just two columns. > > > > > > So it comes around 20 billion rows > > > > > > For one hour it is processing like around 5 billion rows. > > > > > > So the process takes around 4 hrs. > > > > > > I have over riden some of the properties in hive > > > > > > Set io.sort.mb=512 > > > > > > Set mapred.reduce.tasks=17 > > > > > > Set io.sort.factor=256 > > > Set mapred.child.jvm.opts=-Xmx2048mb > > > Set hive.map.aggr=true > > > Set hive.exec.parallel=true > > > Set mapred.tasks.reuse.num.tasks=-1 > > > Set hive.mapred.map.speculative.execution=false > > > Set hive.mapred.reduce.speculative.execution=false > > > > > > > > > How can optimize it to get better results. > > > > > > Even though I have set reduce tasks to 17, only one reduce is spawned for > > > the query . Did I do some thing wrong ?? > > > > > > My cluster configuration is having > > > 20 slave nodes running cdh3u5. > > > With 240 map slots > > > 120 reduce slots > > > Block size is 128 mb > > > Memory on the slave node is 96GB > > > > > > How can the query perform better?? > > > > > > How can I increase number of rows processed by reducer at a single moment, > > > or per second > > > > > > Can help is greatly appreciated. > > > > > > Regards > > > Abhi