Hi Bhavesh

For the two properties you mentioned:

mapred.map.tasks
The number of map tasks is determined by the input splits and the input format.

mapred.reduce.tasks
Your Hive job may not require a reduce task, hence Hive sets the number of reducers to zero.

As for the other parameters, I'm not sure why they are not even reflected in job.xml.

Regards
Bejoy KS

________________________________
From: Bhavesh Shah <[email protected]>
To: [email protected]; [email protected]
Sent: Tuesday, May 8, 2012 6:16 PM
Subject: Re: Want to improve the performance for execution of Hive Jobs.

Thanks Bejoy for your reply.
Yes, I saw that a new XML is created for every job. In it, the values are different from whatever I set.
For example, I have set mapred.map.tasks=10 and mapred.reduce.tasks=2, but every job XML shows the value for map as 1 and for reduce as 0.
The same is true of the other parameters too. Why is that?

On Tue, May 8, 2012 at 5:32 PM, Bejoy KS <[email protected]> wrote:

>Hi Bhavesh
>
>On a job level, if you set/override some properties it won't go into
>mapred-site.xml. Check your corresponding Job.xml to get the values. Also
>confirm from task logs that there are no warnings with respect to overriding
>those properties. If these two are good then you can confirm that the
>properties supplied by you are actually utilized for the job.
>
>Disclaimer: I'm not an AWS guy to comment on some specifics in there. My
>responses are related to generic hadoop behavior. :)
>
>Regards
>Bejoy KS
>
>Sent from handheld, please excuse typos.
>
>________________________________
>From: Bhavesh Shah <[email protected]>
>Date: Tue, 8 May 2012 17:15:44 +0530
>To: <[email protected]>; Bejoy Ks<[email protected]>
>ReplyTo: [email protected]
>Subject: Re: Want to improve the performance for execution of Hive Jobs.
>
>Hello Bejoy KS,
>I did it the same way, by executing "hive -f <filename>" on Amazon EMR,
>and when I looked at mapred-site.xml, all the variables that I set in the
>above file still had their default values. I didn't see the values I set.
>
>And the performance is slow too.
>I have tried this on my local cluster by setting these values and I saw some
>boost in the performance.
>
>On Tue, May 8, 2012 at 4:23 PM, Bejoy Ks <[email protected]> wrote:
>
>>Hi Bhavesh
>>
>>I'm not sure of AWS, but from a quick reading, cluster-wide settings
>>like hdfs block size can be set in hdfs-site.xml through bootstrap actions.
>>Since you are changing hdfs block size, set min and max split size across the
>>cluster using bootstrap actions itself. The rest of the properties can be set
>>on a per job level.
>>
>>Doesn't AWS provide an option to use "hive -f"? If so, just provide all the
>>properties required for tuning the query followed by the queries (in order) in a
>>file and simply execute it using "hive -f <file name>".
>>
>>Regards
>>Bejoy KS
>>
>>________________________________
>>From: Bhavesh Shah <[email protected]>
>>To: [email protected]; Bejoy Ks <[email protected]>
>>Sent: Tuesday, May 8, 2012 3:33 PM
>>Subject: Re: Want to improve the performance for execution of Hive Jobs.
>>
>>Thanks Bejoy KS for your reply,
>>I want to ask one thing: if I want to set these parameters on Amazon
>>Elastic Mapreduce, then how can I set variables like:
>>e.g. SET mapred.min.split.size=m;
>>     SET mapred.max.split.size=m+n;
>>     set dfs.block.size=128
>>     set mapred.compress.map.output=true
>>     set io.sort.mb=400 etc....
>>
>>For all this, do I need to write a shell script for setting these variables on
>>the particular path /home/hadoop/hive/bin/hive -e 'set .....'
>>or pass all these steps in bootstrap actions???
>>
>>I found this link to pass the bootstrap actions
>>http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/Bootstrap.html#BootstrapPredefined
>>
>>What should I do in such a case??
>>
>>On Tue, May 8, 2012 at 2:55 PM, Bejoy Ks <[email protected]> wrote:
>>
>>>Hi Bhavesh
>>>
>>>In sqoop you can optimize the performance by using --direct mode for
>>>import and increasing the number of mappers used for import. When you
>>>increase the number of mappers you need to ensure that the RDBMS connection
>>>pool will handle that number of connections gracefully. Also use an evenly
>>>distributed column as --split-by, that'll ensure that all mappers are kind
>>>of equally loaded.
>>>
>>>Min split size and max split size can be set on a job level. But there
>>>are chances of slight loss in data locality if you increase these values. By
>>>increasing these values you are increasing the data volume processed per
>>>mapper, so fewer mappers; now you need to see whether this will
>>>get you substantial performance gains. I haven't seen much gain there
>>>when I tried those out on some of my workflows in the past. A better
>>>approach than this would be increasing the hdfs block size itself if your
>>>cluster deals with relatively larger files. If you change the hdfs block
>>>size then make the changes accordingly on min split and max split values.
>>>
>>>You can set all min and max split sizes using the SET command in hive CLI
>>>itself.
>>>hive> SET mapred.min.split.size=m;
>>>hive> SET mapred.max.split.size=m+n;
>>>
>>>Regards
>>>Bejoy KS
>>>
>>>________________________________
>>>From: Bhavesh Shah <[email protected]>
>>>To: [email protected]
>>>Sent: Tuesday, May 8, 2012 11:35 AM
>>>Subject: Re: Want to improve the performance for execution of Hive Jobs.
>>>
>>>Thanks to both of you for your replies,
>>>If I decide to deploy my JAR on Amazon Elastic Mapreduce then,
>>>
>>>1) Default block size is 64 MB, so in such a case I have to set it to 128
>>>MB..... is it right???
>>>2) Amazon EMR already has values for mapred.min.split.size
>>>and mapred.max.split.size, and mapper and reducer too.
>>>So is there any need
>>>to set the values there? If yes then how to set them for all clusters? Is it
>>>possible by setting all these above parameters in --bootstrap-actions.... to
>>>apply this for all nodes while submitting jobs to Amazon EMR??
>>>
>>>Thanks both of you very much
>>>
>>>--
>>>Regards,
>>>Bhavesh Shah
>>>
>>>On Tue, May 8, 2012 at 11:19 AM, Mapred Learn <[email protected]> wrote:
>>>
>>>>Try setting this value to your block
>>>>size, for 128 mb block size,
>>>>
>>>>set mapred.min.split.size=128000
>>>>Sent from my iPhone
>>>>
>>>>On May 7, 2012, at 10:11 PM, Bhavesh Shah <[email protected]> wrote:
>>>>
>>>>>Thanks Nitin for your reply.
>>>>>
>>>>>In short my task is
>>>>>1) Initially I want to import the data from MS SQL Server into HDFS using
>>>>>SQOOP.
>>>>>2) Through Hive I am processing the data and generating the result in one
>>>>>table
>>>>>3) That result table from Hive is exported back to MS SQL
>>>>>SERVER.
>>>>>
>>>>>Actually the data which I am importing from MS SQL Server is very large (about 5,00,000 entries in one table, and likewise I have 30 tables). For this I have written a task in Hive which contains only queries (and each query uses a lot of joins). Due to this the performance is very poor on my single local machine (it takes about 3 hrs to execute completely). I have observed that when I submitted a single query to Hive CLI it took 10-11 jobs to execute completely.
>>>>>
>>>>>set mapred.min.split.size
>>>>>set mapred.max.split.size
>>>>>Should these values be set in a bootstrap action while submitting jobs to
>>>>>Amazon EMR? What value should be set, as I don't know?
>>>>>
>>>>>--
>>>>>Regards,
>>>>>Bhavesh Shah
>>>>>
>>>>>On Tue, May 8, 2012 at 10:31 AM, Nitin Pawar <[email protected]>
>>>>>wrote:
>>>>>
>>>>>>1) check the jobtracker url to see how many maps/reducers have been
>>>>>>launched
>>>>>>2) if you have a large dataset and want to execute it fast, you can
>>>>>>set mapred.min.split.size and mapred.max.split.size to an optimal value
>>>>>>so that more mappers will be launched and will finish
>>>>>>3) if you are doing joins, there are different ways to go according to
>>>>>>the data you have and size of data
>>>>>>
>>>>>>it will be helpful if you can let us know your data sizes and query
>>>>>>details
>>>>>>
>>>>>>On Tue, May 8, 2012 at 10:07 AM, Bhavesh Shah <[email protected]>
>>>>>>wrote:
>>>>>>
>>>>>>>Hello all,
>>>>>>>I have written Hive JDBC code and created a JAR of it. I am running
>>>>>>>that JAR on a 10-node cluster.
>>>>>>>But the problem is that even though I am using the 10-node cluster, the
>>>>>>>performance is still the same as on a single node.
>>>>>>>
>>>>>>>What can I do to improve the performance of Hive jobs? Is there any
>>>>>>>configuration setting to set before submitting Hive jobs to the cluster?
>>>>>>>One more thing I want to know is how we can find out whether the
>>>>>>>job is running on the whole cluster.
>>>>>>>
>>>>>>>Please let me know if anyone knows about it.
>>>>>>>
>>>>>>>--
>>>>>>>Regards,
>>>>>>>Bhavesh Shah
>>>>>>>
>>>>>>
>>>>>>--
>>>>>>Nitin Pawar
>>>>>>
>>
>>--
>>Regards,
>>Bhavesh Shah
>>
>
>--
>Regards,
>Bhavesh Shah
>
--
Regards,
Bhavesh Shah
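
A rough sketch of the split-size arithmetic discussed in this thread, in Python. It assumes the commonly documented FileInputFormat rule, where the split size is the block size clamped between the minimum and maximum split size. Hive's own input formats may group or split files differently, so treat the numbers as estimates only. The function names and the 1 GB file size are made up for illustration. The split-size properties take values in bytes, so 128 MB is 134217728, not 128000.

```python
MB = 1024 * 1024
NO_MAX = 2**63 - 1  # stand-in for "no maximum split size configured"


def split_size(block_size, min_size, max_size):
    """Assumed rule: clamp the block size between min and max split size (bytes)."""
    return max(min_size, min(max_size, block_size))


def estimated_map_tasks(file_size, block_size, min_size, max_size):
    """Rough number of splits (one map task each) for a single splittable file."""
    size = split_size(block_size, min_size, max_size)
    return -(-file_size // size)  # integer ceiling division


if __name__ == "__main__":
    file_size = 1024 * MB  # hypothetical 1 GB input file
    block = 64 * MB        # default block size mentioned in the thread

    # A 128 MB minimum split halves the estimated number of mappers.
    print(estimated_map_tasks(file_size, block, 128 * MB, NO_MAX))

    # A minimum of 128000 bytes is below the block size, so under this rule
    # it leaves the split size (and the mapper count) unchanged.
    print(estimated_map_tasks(file_size, block, 128000, NO_MAX))

    # The byte value to use in "SET mapred.min.split.size=...;" for 128 MB.
    print(128 * MB)
```

Under this rule a requested mapred.map.tasks value does not fix the mapper count; the input size and the resulting split size do, which is consistent with job.xml showing a different value than the one set.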
