Re: Want to improve the performance for execution of Hive Jobs.

Mapred Learn Mon, 07 May 2012 22:50:10 -0700

Try setting this value to your block size. The property is specified in
bytes, so for a 128 MB block size:

> set mapred.min.split.size=134217728;
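
For reference, here is a rough sketch of the arithmetic behind that value. It assumes the split-size properties are given in bytes and that the split size follows the usual FileInputFormat rule, max(min_size, min(max_size, block_size)). The helper names are made up for illustration, and Hive's own input format may combine splits differently, so treat the mapper count as an estimate only.

```python
import math

MB = 1024 * 1024  # bytes in one megabyte


def mb_to_bytes(mb: int) -> int:
    """Convert megabytes to bytes, the unit used by the split-size properties."""
    return mb * MB


def split_size(min_size: int, max_size: int, block_size: int) -> int:
    """Assumed rule: max(min_size, min(max_size, block_size))."""
    return max(min_size, min(max_size, block_size))


def estimated_splits(file_bytes: int, min_size: int, max_size: int, block_size: int) -> int:
    """Approximate number of map tasks for one file (ignores Hadoop's slop factor)."""
    return math.ceil(file_bytes / split_size(min_size, max_size, block_size))


if __name__ == "__main__":
    block = mb_to_bytes(128)
    # Emit the setting for a 128 MB block size.
    print(f"set mapred.min.split.size={block};")
    # A 1 GB file with the max split size capped at 64 MB yields more, smaller splits.
    print(estimated_splits(mb_to_bytes(1024), 1, mb_to_bytes(64), block))
```

Under this rule, lowering mapred.max.split.size below the block size is what increases the number of mappers; raising mapred.min.split.size above it reduces them.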


On May 7, 2012, at 10:11 PM, Bhavesh Shah <[email protected]> wrote:

> Thanks Nitin for your reply.
> 
> In short, my task is:
> 1) Import the data from MS SQL Server into HDFS using Sqoop.
> 2) Process the data in Hive and generate the result in one table.
> 3) Export that result table from Hive back to MS SQL Server.
> 
> The data I am importing from MS SQL Server is very large (about 500,000 
> entries in one table, and I have 30 tables like that). For this I have 
> written a task in Hive that contains only queries, and each query uses a lot 
> of joins. Because of this, the performance is very poor on my single local 
> machine (it takes about 3 hours to execute completely). I have observed that 
> when I submit a single query to the Hive CLI, it takes 10-11 jobs to execute 
> completely.
> 
> set mapred.min.split.size 
> set mapred.max.split.size
> Should these values be set in a bootstrap action while submitting jobs to 
> Amazon EMR? I don't know what values to set for them.
> 
> 
> -- 
> Regards,
> Bhavesh Shah
> 
> 
> On Tue, May 8, 2012 at 10:31 AM, Nitin Pawar <[email protected]> wrote:
> 1) Check the JobTracker URL to see how many maps/reducers have been launched.
> 2) If you have a large dataset and want to process it fast, set 
> mapred.min.split.size and mapred.max.split.size to an optimal value so that 
> more mappers are launched and the work finishes faster.
> 3) If you are doing joins, there are different ways to go depending on the 
> data you have and its size.
> 
> It would be helpful if you could let us know your data sizes and query details.
> 
> 
> On Tue, May 8, 2012 at 10:07 AM, Bhavesh Shah <[email protected]> wrote:
> Hello all,
> I have written Hive JDBC code and built a JAR from it. I am running that 
> JAR on a 10-node cluster.
> The problem is that even with the 10-node cluster, the performance is the 
> same as on a single node.
> 
> What can I do to improve the performance of Hive jobs? Is there any 
> configuration setting to apply before submitting Hive jobs to the cluster?
> One more thing I want to know: how can we tell whether the job is running 
> on the whole cluster?
> 
> Please let me know if anyone knows about it.
> 
> -- 
> Regards,
> Bhavesh Shah
> 
> 
> 
> 
> -- 
> Nitin Pawar
> 
> 
>
