Re: Want to improve the performance for execution of Hive Jobs.

Bejoy Ks Tue, 08 May 2012 23:45:19 -0700

Hi Bhavesh

     For the two properties you mentioned:

mapred.map.tasks
This value is only a hint. The number of map tasks is determined by the input 
splits, which in turn depend on the input format.

mapred.reduce.tasks
Your Hive job may not require a reduce phase, in which case Hive sets the 
number of reducers to zero.

As for the other parameters, I'm not sure why they are not even reflected in 
job.xml.
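
To see which value a session is actually using, one option (a minimal sketch, 
assuming the Hive CLI) is to issue SET with only the property name, which 
prints the current value instead of changing it:

```sql
-- Print the value currently in effect for each property
SET mapred.map.tasks;
SET mapred.reduce.tasks;
-- List every Hadoop and Hive configuration variable visible to the session
SET -v;
```

Comparing this output with job.xml may help show whether an override was lost 
before the job was submitted or replaced afterwards.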

Regards
Bejoy KS



________________________________
 From: Bhavesh Shah <[email protected]>
To: [email protected]; [email protected] 
Sent: Tuesday, May 8, 2012 6:16 PM
Subject: Re: Want to improve the performance for execution of Hive Jobs.
 

Thanks Bejoy for your reply.
Yes, I saw that a new XML is created for every job. In it, the values differ 
from the ones I set.
For example, I set mapred.map.tasks=10 and mapred.reduce.tasks=2, 
but every job XML shows 1 for map and 0 for reduce.
The same is true of the other parameters.
Why is that?




On Tue, May 8, 2012 at 5:32 PM, Bejoy KS <[email protected]> wrote:

Hi Bhavesh
>Properties that you set or override at the job level do not go into 
>mapred-site.xml. Check the corresponding job.xml to see the values. Also 
>confirm from the task logs that there are no warnings about overriding 
>those properties. If both checks pass, you can be confident that the 
>properties you supplied are actually being used for the job.
>
>Disclaimer: I'm not an AWS guy, so I can't comment on the specifics there. My 
>responses relate to generic Hadoop behavior. :)
>
>
>Regards
>Bejoy KS
>
>Sent from handheld, please excuse typos.
>
>________________________________
>
>From:  Bhavesh Shah <[email protected]> 
>Date: Tue, 8 May 2012 17:15:44 +0530
>To: <[email protected]>; Bejoy Ks<[email protected]>
>ReplyTo:  [email protected] 
>Subject: Re: Want to improve the performance for execution of Hive Jobs.
>
>Hello Bejoy KS,
>I did it the same way, by executing "hive -f  <filename>" on Amazon EMR.
>When I looked at mapred-site.xml, all the variables that I set in the 
>above file still had their default values. I didn't see the values I set.
>
>The performance is slow too.
>I tried this on my local cluster with these values set and saw some 
>boost in performance.
>
>
>
>On Tue, May 8, 2012 at 4:23 PM, Bejoy Ks <[email protected]> wrote:
>
>Hi Bhavesh
>>
>>
>>      I'm not sure about AWS, but from a quick reading, cluster-wide settings 
>>like the HDFS block size can be set in hdfs-site.xml through bootstrap 
>>actions. Since you are changing the HDFS block size, set the min and max 
>>split sizes across the cluster using bootstrap actions as well. The rest of 
>>the properties can be set at the per-job level. 
>>
>>
>>Doesn't AWS provide an option to use "hive -f"? If so, just put all the 
>>properties required for tuning the query, followed by the queries (in order), 
>>in a file and execute it using "hive -f <file name>".
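>>
>>A minimal sketch of such a file, assuming a hypothetical file name tuning.q 
>>and hypothetical tables orders, customers and order_counts. The property 
>>values are illustrative, not recommendations; the split sizes are in bytes:
>>
>>```sql
>>-- tuning.q (hypothetical file name): settings first, then queries in order
>>SET mapred.min.split.size=134217728;
>>SET mapred.max.split.size=268435456;
>>SET mapred.compress.map.output=true;
>>
>>-- The settings above stay in effect for the queries that follow
>>INSERT OVERWRITE TABLE order_counts
>>SELECT c.id, COUNT(*)
>>FROM orders o JOIN customers c ON (o.customer_id = c.id)
>>GROUP BY c.id;
>>```
>>
>>It would then be run as "hive -f tuning.q".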
>>
>>
>>Regards
>>Bejoy KS
>>
>>________________________________
>> From: Bhavesh Shah <[email protected]>
>>To: [email protected]; Bejoy Ks <[email protected]> 
>>Sent: Tuesday, May 8, 2012 3:33 PM
>>
>>Subject: Re: Want to improve the performance for execution of Hive Jobs.
>> 
>>
>>
>>Thanks Bejoy KS for your reply,
>>I want to ask one thing: if I want to set these parameters on Amazon 
>>Elastic MapReduce, how can I set variables like:
>>e.g. SET mapred.min.split.size=m;
>>      SET mapred.max.split.size=m+n;
>>      set dfs.block.size=128
>>      set mapred.compress.map.output=true
>>      set io.sort.mb=400  etc....
>>
>>For all this, do I need to write a shell script that sets these variables via 
>>the path /home/hadoop/hive/bin/hive -e 'set .....',
>>or should I pass all these steps in bootstrap actions? 
>>
>>I found this link to pass the bootstrap actions
>>http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/Bootstrap.html#BootstrapPredefined
>>
>>What should I do in such a case?
>>
>>
>>
>>
>>On Tue, May 8, 2012 at 2:55 PM, Bejoy Ks <[email protected]> wrote:
>>
>>Hi Bhavesh
>>>
>>>
>>>     In Sqoop you can improve performance by using --direct mode for 
>>>import and by increasing the number of mappers used for the import. When you 
>>>increase the number of mappers, you need to ensure that the RDBMS connection 
>>>pool can handle that many connections gracefully. Also use an evenly 
>>>distributed column as --split-by; that will ensure that all mappers are 
>>>roughly equally loaded.
>>>   The min split size and max split size can be set at the job level. But 
>>>there is a chance of a slight loss in data locality if you increase these 
>>>values. By increasing them you increase the data volume processed per 
>>>mapper, and so get fewer mappers; you then need to see whether this 
>>>gets you substantial performance gains. I haven't seen much gain 
>>>when I tried this on some of my workflows in the past. A better 
>>>approach would be to increase the HDFS block size itself if your 
>>>cluster deals with relatively large files. If you change the HDFS block 
>>>size, then adjust the min split and max split values accordingly.
>>>    You can set the min and max split sizes using the SET command in the 
>>>Hive CLI itself.
>>>hive> SET mapred.min.split.size=m;
>>>hive> SET mapred.max.split.size=m+n;
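>>>
>>>As a concrete illustration (a sketch with assumed values, not tuned 
>>>recommendations): both properties take sizes in bytes, so with m = 128 MB 
>>>and n = 128 MB the two commands become
>>>
>>>```sql
>>>-- m = 128 MB = 128 * 1024 * 1024 bytes
>>>SET mapred.min.split.size=134217728;
>>>-- m + n = 256 MB = 256 * 1024 * 1024 bytes
>>>SET mapred.max.split.size=268435456;
>>>```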
>>>
>>>
>>>Regards
>>>Bejoy KS
>>>     
>>>
>>>
>>>
>>>________________________________
>>> From: Bhavesh Shah <[email protected]>
>>>To: [email protected] 
>>>Sent: Tuesday, May 8, 2012 11:35 AM
>>>Subject: Re: Want to improve the performance for execution of Hive Jobs.
>>> 
>>>
>>>
>>>Thanks to both of you for your replies.
>>>If I decide to deploy my JAR on Amazon Elastic MapReduce, then:
>>>
>>>1) The default block size is 64 MB, so in that case I have to set it to 128 
>>>MB. Is that right?
>>>2) Amazon EMR already has values for mapred.min.split.size 
>>>and mapred.max.split.size, and for mappers and reducers too. So is there any 
>>>need to set the values there? If yes, how do I set them for the whole 
>>>cluster? Is it possible to set all the above parameters in 
>>>--bootstrap-actions so that they apply to all nodes when submitting jobs to 
>>>Amazon EMR?
>>>
>>>Thanks to both of you very much
>>>
>>>-- 
>>>Regards,
>>>Bhavesh Shah
>>>
>>>
>>>On Tue, May 8, 2012 at 11:19 AM, Mapred Learn <[email protected]> wrote:
>>>
>>>>Try setting this value to your block size. For a 128 MB block size (the 
>>>>value is in bytes):
>>>>
>>>>
>>>>set mapred.min.split.size=134217728
>>>>
>>>>Sent from my iPhone
>>>>
>>>>On May 7, 2012, at 10:11 PM, Bhavesh Shah <[email protected]> wrote:
>>>>
>>>>
>>>>Thanks Nitin for your reply.
>>>>>
>>>>>In short, my task is: 
>>>>>1) First, import the data from MS SQL Server into HDFS using 
>>>>>Sqoop.
>>>>>2) Process the data through Hive and generate the result in one 
>>>>>table.
>>>>>3) Export that result table from Hive back to MS SQL 
>>>>>Server.
>>>>>
>>>>>The data I am importing from MS SQL Server is very large 
>>>>>(about 500,000 entries in one table, and I have 30 such tables). 
>>>>>For this I have written a task in Hive that contains only queries (and 
>>>>>each query uses a lot of joins). Because of this, the 
>>>>>performance is very poor on my single local machine (it takes about 
>>>>>3 hrs to execute completely). I have observed that when I submit 
>>>>>a single query to the Hive CLI, it takes 10-11 jobs to execute completely.
>>>>>
>>>>>set mapred.min.split.size 
>>>>>set mapred.max.split.size
>>>>>Should these values be set in a bootstrap action while submitting jobs to 
>>>>>Amazon EMR? I don't know what values to set for them.
>>>>>
>>>>>
>>>>>-- 
>>>>>Regards,
>>>>>Bhavesh Shah
>>>>>
>>>>>
>>>>>On Tue, May 8, 2012 at 10:31 AM, Nitin Pawar <[email protected]> 
>>>>>wrote:
>>>>>
>>>>>1) Check the JobTracker URL to see how many maps/reducers have been 
>>>>>launched.
>>>>>>2) If you have a large dataset and want to process it fast, 
>>>>>>set mapred.min.split.size and mapred.max.split.size to optimal values 
>>>>>>so that more mappers are launched and the work finishes sooner. 
>>>>>>3) If you are doing joins, there are different ways to go depending on 
>>>>>>the data you have and its size. 
>>>>>>
>>>>>>
>>>>>>It will be helpful if you can let us know your data sizes and query 
>>>>>>details. 
>>>>>>
>>>>>>
>>>>>>
>>>>>>On Tue, May 8, 2012 at 10:07 AM, Bhavesh Shah <[email protected]> 
>>>>>>wrote:
>>>>>>
>>>>>>Hello all,
>>>>>>>I have written Hive JDBC code and built a JAR from it. I am running 
>>>>>>>that JAR on a 10-node cluster.
>>>>>>>But the problem is that even with the 10-node cluster, the performance is 
>>>>>>>the same as on a single node.
>>>>>>>
>>>>>>>What can I do to improve the performance of Hive jobs? Is there any 
>>>>>>>configuration setting to apply before submitting Hive jobs to the cluster?
>>>>>>>One more thing I want to know: how can we tell whether the 
>>>>>>>job is running on all nodes of the cluster?
>>>>>>>
>>>>>>>Please let me know if anyone knows about this.
>>>>>>>
>>>>>>>-- 
>>>>>>>Regards,
>>>>>>>Bhavesh Shah
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>-- 
>>>>>>Nitin Pawar
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>
>>>
>>>
>>>
>>>
>>
>>
>>-- 
>>Regards,
>>Bhavesh Shah
>>
>>
>>
>
>
>-- 
>Regards,
>Bhavesh Shah
>


-- 
Regards,
Bhavesh Shah
