Hi Bhavesh

     In Sqoop you can optimize import performance by using --direct mode and by 
increasing the number of mappers used for the import. When you increase the 
number of mappers, make sure the RDBMS connection pool can handle that many 
connections gracefully. Also use an evenly distributed column as --split-by; 
that ensures all the mappers are roughly equally loaded.
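As a sketch, an import along those lines might look like the following (the connection string, table, and column names are placeholders of mine, not from this thread; note that --direct only works with connectors that support it, such as MySQL or PostgreSQL):

```shell
# Hypothetical example: direct-mode import with 8 mappers,
# split on an evenly distributed numeric column.
sqoop import \
  --connect jdbc:mysql://dbhost/sales \
  --username hadoop -P \
  --table orders \
  --direct \
  --num-mappers 8 \
  --split-by order_id \
  --target-dir /user/hive/warehouse/orders
```

If --split-by pointed at a skewed column instead, some mappers would get most of the rows and the job would be only as fast as the slowest one.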
   The min split size and max split size can be set at the job level. But there 
is a chance of a slight loss of data locality if you increase these values. By 
increasing them you increase the data volume processed per mapper, and hence 
get fewer mappers; you need to check whether that actually gets you substantial 
performance gains. I haven't seen much gain there when I tried it on some of my 
workflows in the past. A better approach would be increasing the HDFS block 
size itself, if your cluster deals with relatively large files. If you change 
the HDFS block size, then adjust the min split and max split values 
accordingly.
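To see how these values interact, here is a small sketch (my own illustration, not code from this thread) of roughly how Hadoop's FileInputFormat bounds the split size by the block size, and the mapper count that falls out of it:

```python
# Rough sketch of the split-size rule:
#   splitSize = max(minSplit, min(maxSplit, blockSize))
import math

MB = 1024 * 1024

def split_size(min_split, max_split, block_size):
    # Effective split size after applying the min/max bounds to the block size.
    return max(min_split, min(max_split, block_size))

def approx_mappers(total_bytes, min_split, max_split, block_size):
    # Rough number of map tasks for one large, splittable file.
    return math.ceil(total_bytes / split_size(min_split, max_split, block_size))

# 1 GB file, 64 MB blocks, default splits -> ~16 mappers
print(approx_mappers(1024 * MB, 1, 64 * MB, 64 * MB))          # 16
# Raising min split to 128 MB doubles the work per mapper -> ~8 mappers
print(approx_mappers(1024 * MB, 128 * MB, 256 * MB, 64 * MB))  # 8
```

Fewer mappers means less task-scheduling overhead per byte, but as noted above, once a split spans more than one block some of its data is no longer local to the mapper.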
    You can set the min and max split sizes using the SET command in the Hive 
CLI itself (values are in bytes; m and n below are placeholders):
hive> SET mapred.min.split.size=m;
hive> SET mapred.max.split.size=m+n;

Regards
Bejoy KS
     


________________________________
 From: Bhavesh Shah <bhavesh25s...@gmail.com>
To: user@hive.apache.org 
Sent: Tuesday, May 8, 2012 11:35 AM
Subject: Re: Want to improve the performance for execution of Hive Jobs.
 

Thanks to both of you for your replies.
If I decide to deploy my JAR on Amazon Elastic Mapreduce then,

1) The default block size is 64 MB, so in such a case I have to set it to 
128 MB..... is that right???
2) Amazon EMR already has values for mapred.min.split.size 
and mapred.max.split.size, and for mappers and reducers too. So is there any 
need to set the values there? If yes, then how do I set them for the whole 
cluster? Is it possible to apply this to all nodes by setting all the above 
parameters in --bootstrap-actions.... while submitting jobs to Amazon EMR??
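For reference, on the EMR ruby CLI of that era, the stock configure-hadoop bootstrap action was the usual way to push such site settings to every node at cluster creation; the command and values below are an illustrative sketch (134217728 bytes = 128 MB), not a tested recommendation:

```shell
# Hypothetical sketch: set split sizes cluster-wide via a bootstrap action.
# -m writes the key=value pair into mapred-site.xml on each node.
elastic-mapreduce --create --alive \
  --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
  --args "-m,mapred.min.split.size=134217728,-m,mapred.max.split.size=268435456"
```

Because bootstrap actions run on every node before the daemons start, settings applied this way take effect for all jobs submitted to that cluster.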

Thanks to both of you very much.

-- 
Regards,
Bhavesh Shah


On Tue, May 8, 2012 at 11:19 AM, Mapred Learn <mapred.le...@gmail.com> wrote:

Try setting this value to your block
>size. For a 128 MB block size (in bytes):
>
>
>set mapred.min.split.size=134217728
>Sent from my iPhone
>
>On May 7, 2012, at 10:11 PM, Bhavesh Shah <bhavesh25s...@gmail.com> wrote:
>
>
>Thanks Nitin for your reply.
>>
>>In short my Task is 
>>1) Initially I want to import the data from MS SQL Server into HDFS using 
>>SQOOP.
>>2) Through Hive I am processing the data and generating the result in one 
>>table
>>3) That result containing table from Hive is again exported to MS SQL SERVER 
>>back.
>>
>>Actually the data which I am importing from MS SQL Server is very large 
>>(about 5,00,000 entries in one table, and likewise I have 30 tables). For 
>>this I have written a task in Hive which contains only queries (and each 
>>query uses a lot of joins). Due to this the performance is very poor on my 
>>single local machine (it takes about 3 hrs to execute completely). I have 
>>observed that when I submitted a single query to the Hive CLI it took 10-11 
>>jobs to execute completely.
>>
>>set mapred.min.split.size 
>>set mapred.max.split.size
>>Should these values be set in a bootstrap action while submitting jobs to 
>>Amazon EMR? What values should be set, as I don't know?
>>
>>
>>-- 
>>Regards,
>>Bhavesh Shah
>>
>>
>>On Tue, May 8, 2012 at 10:31 AM, Nitin Pawar <nitinpawar...@gmail.com> wrote:
>>
1) check the jobtracker URL to see how many maps/reducers have been launched
>>>2) if you have a large dataset and want to execute it fast, set 
>>>mapred.min.split.size and mapred.max.split.size to optimal values so that 
>>>more mappers will be launched and will finish sooner
>>>3) if you are doing joins, there are different ways to go depending on the 
>>>data you have and the size of the data
>>>
>>>
>>>it will be helpful if you can let us know your data sizes and query details 
>>>
>>>
>>>
>>>On Tue, May 8, 2012 at 10:07 AM, Bhavesh Shah <bhavesh25s...@gmail.com> 
>>>wrote:
>>>
Hello all,
>>>>I have written Hive JDBC code and created a JAR of it. I am running that 
>>>>JAR on a 10-node cluster.
>>>>But the problem is that even though I am using a 10-node cluster, the 
>>>>performance is the same as on a single node.
>>>>
>>>>What can I do to improve the performance of Hive jobs? Is there any 
>>>>configuration setting to set before submitting Hive jobs to the cluster?
>>>>One more thing I want to know: how can we tell whether the job is 
>>>>running on all the nodes of the cluster?
>>>>
>>>>Please let me know if anyone knows about it.
>>>>
>>>>-- 
>>>>Regards,
>>>>Bhavesh Shah
>>>>
>>>
>>>
>>>
>>>-- 
>>>Nitin Pawar
>>>
>>>
>>
>>
