Thanks, Bejoy KS, for your reply. One thing I want to ask: if I want to set these parameters on Amazon Elastic MapReduce, how can I set these variables? e.g.:

SET mapred.min.split.size=m;
SET mapred.max.split.size=m+n;
SET dfs.block.size=134217728;   (128 MB, in bytes)
SET mapred.compress.map.output=true;
SET io.sort.mb=400;

etc.
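[Editor's note: for per-session settings, one option is to collect the SET statements in a file and pass it to the Hive CLI with -i (init file) or -f. A minimal sketch — the file path and all values below are illustrative only, not recommendations:]

```shell
# Write the tuning settings to a file (path and values are examples only).
cat > /tmp/hive-tuning.hql <<'EOF'
SET mapred.min.split.size=134217728;
SET mapred.max.split.size=268435456;
SET dfs.block.size=134217728;
SET mapred.compress.map.output=true;
SET io.sort.mb=400;
EOF

# The file could then be applied to a query along the lines of:
#   /home/hadoop/hive/bin/hive -i /tmp/hive-tuning.hql -e 'SELECT ...'
grep -c '^SET' /tmp/hive-tuning.hql   # number of settings written
```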
For all this, do I need to write a shell script that sets these variables (e.g. /home/hadoop/hive/bin/hive -e 'set ...'), or should I pass all these steps as bootstrap actions? I found this link on predefined bootstrap actions: http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/Bootstrap.html#BootstrapPredefined What should I do in such a case?

On Tue, May 8, 2012 at 2:55 PM, Bejoy Ks <bejoy...@yahoo.com> wrote:
> Hi Bhavesh
>
> In Sqoop you can optimize performance by using --direct mode for import
> and by increasing the number of mappers used for import. When you increase
> the number of mappers, you need to ensure that the RDBMS connection pool
> can handle that many connections gracefully. Also use an evenly
> distributed column as --split-by; that ensures all mappers are roughly
> equally loaded.
>
> The min split size and max split size can be set at the job level. But
> there is a chance of a slight loss of data locality if you increase these
> values. By increasing them you increase the data volume processed per
> mapper, so fewer mappers are launched; you then need to see whether that
> gets you substantial performance gains. I haven't seen much gain when I
> tried this on some of my workflows in the past. A better approach would be
> to increase the HDFS block size itself, if your cluster deals with
> relatively large files. If you change the HDFS block size, then change the
> min split and max split values accordingly.
>
> You can set the min and max split sizes using the SET command in the Hive
> CLI itself:
> hive> SET mapred.min.split.size=m;
> hive> SET mapred.max.split.size=m+n;
>
> Regards
> Bejoy KS
>
> ------------------------------
> *From:* Bhavesh Shah <bhavesh25s...@gmail.com>
> *To:* user@hive.apache.org
> *Sent:* Tuesday, May 8, 2012 11:35 AM
> *Subject:* Re: Want to improve the performance for execution of Hive Jobs.
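[Editor's note: for cluster-wide settings, the predefined configure-hadoop bootstrap action linked above accepts key=value arguments. The sketch below only shows how such an argument string could be assembled; it assumes a `-m` flag marks a mapred-site key/value pair (check the linked EMR docs for the exact flags), and all values are placeholder examples:]

```shell
# Assemble "-m key=value" arguments for the configure-hadoop bootstrap
# action. The -m flag and all values here are illustrative assumptions.
settings="mapred.min.split.size=134217728 mapred.max.split.size=268435456 mapred.compress.map.output=true"
args=""
for kv in $settings; do
  args="$args -m $kv"
done
echo "$args"

# The resulting string would then be passed along the lines of:
#   --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
#   --args "<the string above>"
```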
> Thanks to both of you for your replies.
> If I decide to deploy my JAR on Amazon Elastic MapReduce, then:
>
> 1) The default block size is 64 MB, so in such a case I have to set it to
> 128 MB... is that right?
> 2) Amazon EMR already has values for mapred.min.split.size and
> mapred.max.split.size, and for mappers and reducers too. So is there any
> need to set the values there? If yes, then how do I set them for all
> clusters? Is it possible to apply all the above parameters to all nodes
> by setting them in --bootstrap-actions while submitting jobs to Amazon
> EMR?
>
> Thanks to both of you very much.
>
> --
> Regards,
> Bhavesh Shah
>
> On Tue, May 8, 2012 at 11:19 AM, Mapred Learn <mapred.le...@gmail.com> wrote:
>
> Try setting this value to your block size; for a 128 MB block size:
>
> *set mapred.min.split.size=134217728*
>
> Sent from my iPhone
>
> On May 7, 2012, at 10:11 PM, Bhavesh Shah <bhavesh25s...@gmail.com> wrote:
>
> Thanks Nitin for your reply.
>
> In short, my task is:
> 1) Initially I import the data from MS SQL Server into HDFS using Sqoop.
> 2) Through Hive I process the data and generate the result in one table.
> 3) That result table is then exported from Hive back to MS SQL Server.
>
> The data I am importing from MS SQL Server is very large (about 500,000
> rows in one table, and likewise I have 30 tables). For this I have
> written a task in Hive that contains only queries (and each query uses a
> lot of joins). Due to this, the performance is very poor on my single
> local machine (it takes about 3 hours to execute completely). I have
> observed that when I submit a single query to the Hive CLI, it takes
> 10-11 jobs to execute completely.
>
> *set mapred.min.split.size
> set mapred.max.split.size*
> Should these values be set in a bootstrap action while submitting jobs to
> Amazon EMR? What values should be set, as I don't know?
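[Editor's note: as background on what values these settings take, in Hadoop 1.x FileInputFormat chooses the split size as max(minSplitSize, min(maxSplitSize, blockSize)) — so with the defaults, the split size simply tracks the block size. A quick arithmetic sketch with example values:]

```shell
# splitSize = max(minSplitSize, min(maxSplitSize, blockSize))  (Hadoop 1.x)
block=$((128 * 1024 * 1024))   # dfs.block.size: 128 MB
minsz=$((64 * 1024 * 1024))    # mapred.min.split.size (example value)
maxsz=$((256 * 1024 * 1024))   # mapred.max.split.size (example value)

lesser=$(( maxsz < block ? maxsz : block ))
split=$(( minsz > lesser ? minsz : lesser ))
echo "$split"   # 134217728: with these values the 128 MB block size wins
```

Raising mapred.min.split.size above the block size is what forces larger splits (and thus fewer mappers), at the cost of the data-locality loss Bejoy mentions.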
>
> --
> Regards,
> Bhavesh Shah
>
> On Tue, May 8, 2012 at 10:31 AM, Nitin Pawar <nitinpawar...@gmail.com> wrote:
>
> 1) Check the JobTracker URL to see how many maps/reducers have been
> launched.
> 2) If you have a large dataset and want it to execute fast, set
> mapred.min.split.size and mapred.max.split.size to optimal values so
> that more mappers will be launched and will finish sooner.
> 3) If you are doing joins, there are different ways to go depending on
> the data you have and its size.
>
> It will be helpful if you can let us know your data sizes and query
> details.
>
> On Tue, May 8, 2012 at 10:07 AM, Bhavesh Shah <bhavesh25s...@gmail.com> wrote:
>
> Hello all,
> I have written a Hive JDBC program and created a JAR of it. I am running
> that JAR on a 10-node cluster, but the performance is the same as on a
> single node.
>
> What can I do to improve the performance of Hive jobs? Is there any
> configuration setting to set before submitting Hive jobs to the cluster?
> One more thing I want to know: how can I tell whether a job is running on
> all nodes?
>
> Please let me know if anyone knows about it.
>
> --
> Regards,
> Bhavesh Shah
>
> --
> Nitin Pawar

--
Regards,
Bhavesh Shah
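[Editor's note: putting Bejoy's Sqoop advice together, a hypothetical import command might look like the following. The JDBC URL, table, and column names are placeholders; --direct mode is omitted because it only applies where the connector supports it (e.g. MySQL/PostgreSQL), which is not the case for SQL Server:]

```shell
# Hypothetical Sqoop import from MS SQL Server; all names are placeholders.
# --split-by should be an evenly distributed column so mappers are equally
# loaded; --num-mappers should stay within what the RDBMS connection pool
# can handle gracefully.
sqoop import \
  --connect 'jdbc:sqlserver://dbhost:1433;databaseName=mydb' \
  --username myuser -P \
  --table source_table \
  --split-by row_id \
  --num-mappers 8 \
  --target-dir /user/hadoop/source_table
```

A matching `sqoop export` with the same connection arguments would cover step 3 (writing the Hive result table back to SQL Server).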