Thanks Bejoy for your reply. Yes, I saw that a new job.xml is created for every job, and the values in it differ from what I set. For example, I set mapred.map.tasks=10 and mapred.reduce.tasks=2, yet in every job.xml the value shown for maps is 1 and for reduces is 0. The same thing happens with the other parameters too. Why is that?
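(For reference, a quick way to see what the Hive session itself holds before the job is submitted; this is a minimal sketch, and in the Hive CLI a SET with a property name but no value simply prints the current value:)

    # Print the values the Hive session holds client-side before submission.
    hive -e 'SET mapred.map.tasks; SET mapred.reduce.tasks;'

(If these already print 1 and 0, the overrides never took effect in the session; if they print 10 and 2, something rewrites them at submission time. Also, as far as I know, mapred.map.tasks is only a hint to the InputFormat, since the actual map count comes from the number of input splits, and a map-only stage will always show 0 reduces in its job.xml.)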
On Tue, May 8, 2012 at 5:32 PM, Bejoy KS <bejoy...@yahoo.com> wrote:

> Hi Bhavesh
>
> On a job level, if you set/override some properties they won't go into
> mapred-site.xml. Check the corresponding job.xml to get the values. Also
> confirm from the task logs that there are no warnings about overriding
> those properties. If these two are good, then you can confirm that the
> properties you supplied are actually being used for the job.
>
> Disclaimer: I'm not an AWS guy, so I can't comment on the specifics
> there. My responses relate to generic Hadoop behavior. :)
>
> Regards
> Bejoy KS
>
> Sent from handheld, please excuse typos.
> ------------------------------
> From: Bhavesh Shah <bhavesh25s...@gmail.com>
> Date: Tue, 8 May 2012 17:15:44 +0530
> To: <user@hive.apache.org>; Bejoy Ks <bejoy...@yahoo.com>
> Reply-To: user@hive.apache.org
> Subject: Re: Want to improve the performance for execution of Hive Jobs.
>
> Hello Bejoy KS,
> I did it that way, executing "hive -f <filename>" on Amazon EMR, and
> when I looked at mapred-site.xml, all the variables I had set in the
> file were still at their default values. I didn't see my set values.
>
> And the performance is slow too. I have tried this on my local cluster
> by setting these values, and I saw some boost in performance there.
>
> On Tue, May 8, 2012 at 4:23 PM, Bejoy Ks <bejoy...@yahoo.com> wrote:
>
>> Hi Bhavesh
>>
>> I'm not sure about AWS, but from a quick reading, cluster-wide settings
>> like the HDFS block size can be set in hdfs-site.xml through bootstrap
>> actions. Since you are changing the HDFS block size, set the min and
>> max split sizes across the cluster using bootstrap actions as well. The
>> rest of the properties can be set on a per-job level.
>>
>> Doesn't AWS provide an option to use "hive -f"? If so, just put all the
>> properties required for tuning, followed by the queries (in order), in
>> a file and simply execute it with "hive -f <file name>".
>>
>> Regards
>> Bejoy KS
>> ------------------------------
>> From: Bhavesh Shah <bhavesh25s...@gmail.com>
>> To: user@hive.apache.org; Bejoy Ks <bejoy...@yahoo.com>
>> Sent: Tuesday, May 8, 2012 3:33 PM
>> Subject: Re: Want to improve the performance for execution of Hive Jobs.
>>
>> Thanks Bejoy KS for your reply.
>> One thing I want to ask: if I want to set these parameters on Amazon
>> Elastic MapReduce, how can I set variables like the following?
>>
>> SET mapred.min.split.size=m;
>> SET mapred.max.split.size=m+n;
>> SET dfs.block.size=128;
>> SET mapred.compress.map.output=true;
>> SET io.sort.mb=400;
>> etc.
>>
>> For all this, do I need to write a shell script that sets these
>> variables via /home/hadoop/hive/bin/hive -e 'set ...', or should I pass
>> all these steps as bootstrap actions?
>>
>> I found this link for passing bootstrap actions:
>> http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/Bootstrap.html#BootstrapPredefined
>>
>> What should I do in such a case?
>>
>> On Tue, May 8, 2012 at 2:55 PM, Bejoy Ks <bejoy...@yahoo.com> wrote:
>>
>> Hi Bhavesh
>>
>> In Sqoop you can improve import performance by using --direct mode and
>> by increasing the number of mappers used for the import. When you
>> increase the number of mappers, you need to ensure that the RDBMS
>> connection pool can handle that many connections gracefully. Also use
>> an evenly distributed column as --split-by; that ensures all mappers
>> are roughly equally loaded.
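(For illustration only: a Sqoop invocation along the lines of the paragraph above. The connection string, table name, and split column are made-up placeholders, and --direct is left out because direct mode only applies where a direct connector exists, e.g. MySQL or PostgreSQL, not MS SQL Server.)

    # Hypothetical import: 8 mappers, split on an evenly distributed key.
    sqoop import \
      --connect 'jdbc:sqlserver://dbhost:1433;databaseName=mydb' \
      --username hadoop -P \
      --table patient_records \
      --split-by patient_id \
      --num-mappers 8 \
      --target-dir /user/hadoop/patient_records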
>> The min split size and max split size can be set on a job level. But
>> there is a chance of a slight loss of data locality if you increase
>> these values. By increasing them you increase the data volume processed
>> per mapper, so fewer mappers are launched; you then need to see whether
>> that actually gets you substantial performance gains. I haven't seen
>> much gain when I tried those on some of my workflows in the past. A
>> better approach would be to increase the HDFS block size itself, if
>> your cluster deals with relatively large files. If you change the HDFS
>> block size, then adjust the min split and max split values accordingly.
>>
>> You can set the min and max split sizes with the SET command in the
>> Hive CLI itself:
>> hive> SET mapred.min.split.size=m;
>> hive> SET mapred.max.split.size=m+n;
>>
>> Regards
>> Bejoy KS
>>
>> ------------------------------
>> From: Bhavesh Shah <bhavesh25s...@gmail.com>
>> To: user@hive.apache.org
>> Sent: Tuesday, May 8, 2012 11:35 AM
>> Subject: Re: Want to improve the performance for execution of Hive Jobs.
>>
>> Thanks to both of you for your replies.
>> If I decide to deploy my JAR on Amazon Elastic MapReduce, then:
>>
>> 1) The default block size is 64 MB, so in that case I have to set it to
>> 128 MB... is that right?
>> 2) Amazon EMR already has values for mapred.min.split.size and
>> mapred.max.split.size, and for mappers and reducers too. So is there
>> any need to set the values there? If yes, how do I set them for the
>> whole cluster? Is it possible to apply all the above parameters to all
>> nodes via --bootstrap-actions while submitting jobs to Amazon EMR?
>>
>> Thanks both of you very much.
>>
>> --
>> Regards,
>> Bhavesh Shah
>>
>> On Tue, May 8, 2012 at 11:19 AM, Mapred Learn <mapred.le...@gmail.com> wrote:
>>
>> Try setting this value to your block size; for a 128 MB block size:
>>
>> set mapred.min.split.size=134217728
>>
>> Sent from my iPhone
>>
>> On May 7, 2012, at 10:11 PM, Bhavesh Shah <bhavesh25s...@gmail.com> wrote:
>>
>> Thanks Nitin for your reply.
>>
>> In short, my task is:
>> 1) Initially, import the data from MS SQL Server into HDFS using Sqoop.
>> 2) Process the data through Hive and generate the result in one table.
>> 3) Export that result table from Hive back to MS SQL Server.
>>
>> The data I am importing from MS SQL Server is very large (about
>> 500,000 rows in one table, and I have 30 such tables). For this I have
>> written a task in Hive which contains only queries (and each query uses
>> a lot of joins). Because of this the performance is very poor on my
>> single local machine (it takes about 3 hours to execute completely). I
>> have observed that a single query submitted to the Hive CLI took 10-11
>> jobs to execute completely.
>>
>> set mapred.min.split.size
>> set mapred.max.split.size
>>
>> Should these values be set in a bootstrap action while submitting jobs
>> to Amazon EMR? And what values should be set? I don't know.
>>
>> --
>> Regards,
>> Bhavesh Shah
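(Purely a sketch of the predefined Configure Hadoop bootstrap action from the link above, which can push these values into hdfs-site.xml and mapred-site.xml on every node. The flag syntax should be double-checked against that page, and the byte values simply assume a 128 MB block with 128/256 MB splits.)

    # Hypothetical cluster launch with the old elastic-mapreduce CLI;
    # -h writes to hdfs-site.xml, -m to mapred-site.xml (per the EMR docs).
    elastic-mapreduce --create --alive --num-instances 10 \
      --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
      --args "-h,dfs.block.size=134217728,-m,mapred.min.split.size=134217728,-m,mapred.max.split.size=268435456"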
>> On Tue, May 8, 2012 at 10:31 AM, Nitin Pawar <nitinpawar...@gmail.com> wrote:
>>
>> 1) Check the JobTracker URL to see how many maps/reducers have been
>> launched.
>> 2) If you have a large dataset and want to execute it fast, set
>> mapred.min.split.size and mapred.max.split.size to optimal values so
>> that more mappers are launched and finish faster.
>> 3) If you are doing joins, there are different ways to go depending on
>> the data you have and its size.
>>
>> It would be helpful if you could let us know your data sizes and query
>> details.
>>
>> On Tue, May 8, 2012 at 10:07 AM, Bhavesh Shah <bhavesh25s...@gmail.com> wrote:
>>
>> Hello all,
>> I have written Hive JDBC code and created a JAR of it. I am running
>> that JAR on a 10-node cluster, but even though I am using the 10-node
>> cluster, the performance is the same as on a single node.
>>
>> What can I do to improve the performance of Hive jobs? Is there any
>> configuration setting to apply before submitting Hive jobs to the
>> cluster? One more thing I want to know: how can we tell whether a job
>> is running on all the nodes of the cluster?
>>
>> Please let me know if anyone knows about it.
>>
>> --
>> Regards,
>> Bhavesh Shah
>>
>> --
>> Nitin Pawar

--
Regards,
Bhavesh Shah
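(To tie the thread's suggestions together, a consolidated sketch of the "hive -f" approach; the file name, property values, and placeholder query are illustrative only.)

    # Tuning SETs go ahead of the queries, in order, in one file:
    cat > tuned_job.hql <<'EOF'
    SET mapred.min.split.size=134217728;
    SET mapred.max.split.size=268435456;
    SET mapred.compress.map.output=true;
    SET io.sort.mb=400;
    -- queries follow, in order
    SELECT ...;
    EOF

    hive -f tuned_job.hql

    # While it runs, the JobTracker web UI (http://<master>:50030 by
    # default) shows how many map and reduce tasks each job launched and
    # where; "hadoop job -list" gives a quick CLI view of running jobs.
    hadoop job -list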