Thanks for the reply Bejoy.

On Sun, Sep 30, 2012 at 6:35 AM, Bejoy KS <bejo...@outlook.com> wrote:
> Hi Abhishek
>
> No need of any similar columns for a map join to work. It just takes the
> join process to the mapper rather than doing the same in a reducer.
>
> The actual bottleneck is the single reducer. We need to figure out why only
> one reducer is fired rather than the set value of 17. Are you using ORDER BY
> in your query? If so, it sets the number of reducers to 1.
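(A quick illustration of the ORDER BY point above, for reference -- the
table and column names here are only placeholders:)

  -- ORDER BY performs a total ordering, so Hive forces a single reducer
  -- no matter what mapred.reduce.tasks is set to:
  SELECT key, val FROM some_table ORDER BY key;

  -- SORT BY only sorts within each reducer, so the configured reducer
  -- count is honoured:
  SET mapred.reduce.tasks=17;
  SELECT key, val FROM some_table SORT BY key;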
----- I did not use ORDER BY in the query.
>
> Can you provide the full console stack here so that we'll be able to
> understand your issue and help you better? (starting from the properties you
> set, your query and the error). Also, can you get the exact data sizes for
> the two tables?
-----
set mapred.reduce.tasks=17;
set mapred.child.java.opts=xmx2073741824;
set io.sort.mb=512;
set io.sort.factor=250;
set mapred.reduce.parallel.copies=true;
set mapred.job.reuse.jvm.num.tasks=1;
set hive.mapred.reduce.tasks.speculative.execution=false;
set hive.mapred.map.tasks.speculative.execution=false;

CREATE TABLE t1 AS
SELECT /*+ STREAMTABLE(t2) */ t2.col1, t3.col1
FROM table2 t2 JOIN table3 t3

table2 : 997406 rows, total bytes: 20848934 -- 19.88 MB
table3 : 20773 rows, total bytes: 353127 -- 0.33 MB

# of mappers: 4
# of reducers: 1

(A sketch of the map-join form of this query is at the end of this mail.)
>
> Regards
> Bejoy KS
>
>> From: abhishek.dod...@gmail.com
>> Date: Sat, 29 Sep 2012 07:44:06 -0700
>> Subject: Re: Cartesian Product in HIVE
>> To: user@hive.apache.org; bejoy...@yahoo.com
>>
>> Thanks for the reply Bejoy.
>>
>> I tried the map join by setting the property you mentioned, and even
>> increased the small table file size.
>> The 20k-row table should not be more than 200 MB, but it does not work.
>>
>> It is a Cartesian product of tables that don't have any similar columns --
>> does a map join work here?
>>
>> By applying the settings below with the STREAMTABLE hint it was processing
>> around 5 billion rows per hour, so the process took around 4 hrs.
>>
>> Set io.sort.mb=512
>> Set mapred.reduce.tasks=17
>> Set io.sort.factor=256
>> Set mapred.child.jvm.opts=-Xmx2048mb
>> Set hive.map.aggr=true
>> Set hive.exec.parallel=true
>> Set mapred.tasks.reuse.num.tasks=-1
>> Set hive.mapred.map.speculative.execution=false
>> Set hive.mapred.reduce.speculative.execution=false
>>
>> By using the map join setting (set hive.auto.convert.join = true;) and
>> increasing the small table file size, the job initiated but it stayed at
>>
>> map 0% -- reduce 0%
>> map 0% -- reduce 0%
>> map 0% -- reduce 0%
>> map 0% -- reduce 0%
>> map 0% -- reduce 0%
>>
>> It was like this for 30 minutes, so I killed the task.
>>
>> My doubts are:
>>
>> -- I have increased the reducer number mapred.reduce.tasks to 17, but
>> the Hive query engine fired only one reducer for the job.
>> -- I have around 96 GB of memory on each slave node. Can I override some
>> parameters other than the ones mentioned above and make efficient use of it?
>> -- How can I increase the number of rows processed by a reducer at a single
>> moment, or per second?
>> -- Any other techniques to optimize the query?
>>
>> Thanks for your response and your time, Bejoy.
>>
>> Regards
>> abhi
>>
>> On Fri, Sep 28, 2012 at 10:15 PM, Bejoy KS <bejoy...@yahoo.com> wrote:
>> > Hi Abhishek
>> >
>> > What is the data size of the 20k rows? If it is small enough, you can go
>> > in for a map join, which will give you a performance boost.
>> >
>> > set hive.auto.convert.join = true;
>> >
>> > Regards
>> > Bejoy KS
>> >
>> > Sent from handheld, please excuse typos.
>> > ________________________________
>> > From: Abhishek <abhishek.dod...@gmail.com>
>> > Date: Fri, 28 Sep 2012 23:56:16 -0400
>> > To: Hive <user@hive.apache.org>
>> > ReplyTo: user@hive.apache.org
>> > Cc: Bejoy Ks <bejoy...@yahoo.com>
>> > Subject: Cartesian Product in HIVE
>> >
>> > Hi all,
>> >
>> > I have a use case where we are doing a Cartesian product of two tables:
>> > one table with 990k rows,
>> > and a second table with 20k rows.
>> >
>> > The query is a Cartesian product of just two columns.
>> >
>> > So it comes to around 20 billion rows.
>> >
>> > In one hour it processes around 5 billion rows,
>> > so the process takes around 4 hrs.
>> >
>> > I have overridden some of the properties in Hive:
>> >
>> > Set io.sort.mb=512
>> > Set mapred.reduce.tasks=17
>> > Set io.sort.factor=256
>> > Set mapred.child.jvm.opts=-Xmx2048mb
>> > Set hive.map.aggr=true
>> > Set hive.exec.parallel=true
>> > Set mapred.tasks.reuse.num.tasks=-1
>> > Set hive.mapred.map.speculative.execution=false
>> > Set hive.mapred.reduce.speculative.execution=false
>> >
>> > How can I optimize it to get better results?
>> >
>> > Even though I have set reduce tasks to 17, only one reducer is spawned for
>> > the query. Did I do something wrong?
>> >
>> > My cluster configuration: 20 slave nodes running cdh3u5,
>> > with 240 map slots and 120 reduce slots.
>> > The block size is 128 MB,
>> > and each slave node has 96 GB of memory.
>> >
>> > How can the query perform better?
>> >
>> > How can I increase the number of rows processed by a reducer at a single
>> > moment, or per second?
>> >
>> > Any help is greatly appreciated.
>> >
>> > Regards
>> > Abhi
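For reference, a minimal sketch of the map-join variant discussed above,
using the table2/table3 names from my query. The hive.mapjoin.smalltable.filesize
property and its value are only my assumption for the "small table file size"
knob mentioned earlier, and the explicit MAPJOIN hint in the second form is
not something from the original query -- it is the standard way to ask Hive
to load the named (small) table into memory on the mappers.

  -- Form 1: let Hive convert the join to a map join automatically.
  -- 25000000 (25 MB) is just an example threshold, well above table3's
  -- ~0.33 MB:
  set hive.auto.convert.join=true;
  set hive.mapjoin.smalltable.filesize=25000000;

  CREATE TABLE t1 AS
  SELECT t2.col1, t3.col1 AS col2   -- aliased so the CTAS column names are unique
  FROM table2 t2 JOIN table3 t3;

  -- Form 2: force the map join with an explicit hint, keeping the small
  -- table (t3) as the one held in memory. t1_hint is just a scratch name:
  CREATE TABLE t1_hint AS
  SELECT /*+ MAPJOIN(t3) */ t2.col1, t3.col1 AS col2
  FROM table2 t2 JOIN table3 t3;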