Thanks, Bejoy KS, for your reply. One thing I want to ask: if I want to set these parameters on Amazon Elastic MapReduce, how can I set these variables? e.g.:

SET mapred.min.split.size=m;
SET mapred.max.split.size=m+n;
SET dfs.block.size=134217728;   (128 MB, in bytes)
SET mapred.compress.map.output=true;
SET io.sort.mb=400;

etc.
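[Editor's note: for per-session settings, one option is to collect the SET statements in a file and pass it to the Hive CLI with -i (init file) or -f. A minimal sketch — the file path and all values below are illustrative only, not recommendations:]

```shell
# Write the tuning settings to a file (path and values are examples only).
cat > /tmp/hive-tuning.hql <<'EOF'
SET mapred.min.split.size=134217728;
SET mapred.max.split.size=268435456;
SET dfs.block.size=134217728;
SET mapred.compress.map.output=true;
SET io.sort.mb=400;
EOF

# The file could then be applied to a query along the lines of:
#   /home/hadoop/hive/bin/hive -i /tmp/hive-tuning.hql -e 'SELECT ...'
grep -c '^SET' /tmp/hive-tuning.hql   # number of settings written
```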
For all this, do I need to write a shell script that sets these variables (e.g. /home/hadoop/hive/bin/hive -e 'set ...'), or should I pass all these steps as bootstrap actions? I found this link on predefined bootstrap actions: http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/Bootstrap.html#BootstrapPredefined What should I do in such a case?

On Tue, May 8, 2012 at 2:55 PM, Bejoy Ks <bejoy...@yahoo.com> wrote:
> Hi Bhavesh
>
> In Sqoop you can optimize performance by using --direct mode for import
> and by increasing the number of mappers used for import. When you increase
> the number of mappers, you need to ensure that the RDBMS connection pool
> can handle that many connections gracefully. Also use an evenly
> distributed column as --split-by; that ensures all mappers are roughly
> equally loaded.
>
> The min split size and max split size can be set at the job level. But
> there is a chance of a slight loss of data locality if you increase these
> values. By increasing them you increase the data volume processed per
> mapper, so fewer mappers are launched; you then need to see whether that
> gets you substantial performance gains. I haven't seen much gain when I
> tried this on some of my workflows in the past. A better approach would be
> to increase the HDFS block size itself, if your cluster deals with
> relatively large files. If you change the HDFS block size, then change the
> min split and max split values accordingly.
>
> You can set the min and max split sizes using the SET command in the Hive
> CLI itself:
> hive> SET mapred.min.split.size=m;
> hive> SET mapred.max.split.size=m+n;
>
> Regards
> Bejoy KS
>
> ------------------------------
> *From:* Bhavesh Shah <bhavesh25s...@gmail.com>
> *To:* user@hive.apache.org
> *Sent:* Tuesday, May 8, 2012 11:35 AM
> *Subject:* Re: Want to improve the performance for execution of Hive Jobs.
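[Editor's note: for cluster-wide settings, the predefined configure-hadoop bootstrap action linked above accepts key=value arguments. The sketch below only shows how such an argument string could be assembled; it assumes a `-m` flag marks a mapred-site key/value pair (check the linked EMR docs for the exact flags), and all values are placeholder examples:]

```shell
# Assemble "-m key=value" arguments for the configure-hadoop bootstrap
# action. The -m flag and all values here are illustrative assumptions.
settings="mapred.min.split.size=134217728 mapred.max.split.size=268435456 mapred.compress.map.output=true"
args=""
for kv in $settings; do
  args="$args -m $kv"
done
echo "$args"

# The resulting string would then be passed along the lines of:
#   --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
#   --args "<the string above>"
```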
> Thanks to both of you for your replies.
> If I decide to deploy my JAR on Amazon Elastic MapReduce, then:
>
> 1) The default block size is 64 MB, so in such a case I have to set it to
> 128 MB... is that right?
> 2) Amazon EMR already has values for mapred.min.split.size and
> mapred.max.split.size, and for mappers and reducers too. So is there any
> need to set the values there? If yes, then how do I set them for all
> clusters? Is it possible to apply all the above parameters to all nodes
> by setting them in --bootstrap-actions while submitting jobs to Amazon
> EMR?
>
> Thanks to both of you very much.
>
> --
> Regards,
> Bhavesh Shah
>
> On Tue, May 8, 2012 at 11:19 AM, Mapred Learn <mapred.le...@gmail.com> wrote:
>
> Try setting this value to your block size; for a 128 MB block size:
>
> *set mapred.min.split.size=134217728*
>
> Sent from my iPhone
>
> On May 7, 2012, at 10:11 PM, Bhavesh Shah <bhavesh25s...@gmail.com> wrote:
>
> Thanks Nitin for your reply.
>
> In short, my task is:
> 1) Initially I import the data from MS SQL Server into HDFS using Sqoop.
> 2) Through Hive I process the data and generate the result in one table.
> 3) That result table is then exported from Hive back to MS SQL Server.
>
> The data I am importing from MS SQL Server is very large (about 500,000
> rows in one table, and likewise I have 30 tables). For this I have
> written a task in Hive that contains only queries (and each query uses a
> lot of joins). Due to this, the performance is very poor on my single
> local machine (it takes about 3 hours to execute completely). I have
> observed that when I submit a single query to the Hive CLI, it takes
> 10-11 jobs to execute completely.
>
> *set mapred.min.split.size
> set mapred.max.split.size*
> Should these values be set in a bootstrap action while submitting jobs to
> Amazon EMR? What values should be set, as I don't know?
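[Editor's note: as background on what values these settings take, in Hadoop 1.x FileInputFormat chooses the split size as max(minSplitSize, min(maxSplitSize, blockSize)) — so with the defaults, the split size simply tracks the block size. A quick arithmetic sketch with example values:]

```shell
# splitSize = max(minSplitSize, min(maxSplitSize, blockSize))  (Hadoop 1.x)
block=$((128 * 1024 * 1024))   # dfs.block.size: 128 MB
minsz=$((64 * 1024 * 1024))    # mapred.min.split.size (example value)
maxsz=$((256 * 1024 * 1024))   # mapred.max.split.size (example value)

lesser=$(( maxsz < block ? maxsz : block ))
split=$(( minsz > lesser ? minsz : lesser ))
echo "$split"   # 134217728: with these values the 128 MB block size wins
```

Raising mapred.min.split.size above the block size is what forces larger splits (and thus fewer mappers), at the cost of the data-locality loss Bejoy mentions.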
>
> --
> Regards,
> Bhavesh Shah
>
> On Tue, May 8, 2012 at 10:31 AM, Nitin Pawar <nitinpawar...@gmail.com> wrote:
>
> 1) Check the JobTracker URL to see how many maps/reducers have been
> launched.
> 2) If you have a large dataset and want it to execute fast, set
> mapred.min.split.size and mapred.max.split.size to optimal values so
> that more mappers will be launched and will finish sooner.
> 3) If you are doing joins, there are different ways to go depending on
> the data you have and its size.
>
> It will be helpful if you can let us know your data sizes and query
> details.
>
> On Tue, May 8, 2012 at 10:07 AM, Bhavesh Shah <bhavesh25s...@gmail.com> wrote:
>
> Hello all,
> I have written a Hive JDBC program and created a JAR of it. I am running
> that JAR on a 10-node cluster, but the performance is the same as on a
> single node.
>
> What can I do to improve the performance of Hive jobs? Is there any
> configuration setting to set before submitting Hive jobs to the cluster?
> One more thing I want to know: how can I tell whether a job is running on
> all nodes?
>
> Please let me know if anyone knows about it.
>
> --
> Regards,
> Bhavesh Shah
>
> --
> Nitin Pawar

--
Regards,
Bhavesh Shah
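[Editor's note: putting Bejoy's Sqoop advice together, a hypothetical import command might look like the following. The JDBC URL, table, and column names are placeholders; --direct mode is omitted because it only applies where the connector supports it (e.g. MySQL/PostgreSQL), which is not the case for SQL Server:]

```shell
# Hypothetical Sqoop import from MS SQL Server; all names are placeholders.
# --split-by should be an evenly distributed column so mappers are equally
# loaded; --num-mappers should stay within what the RDBMS connection pool
# can handle gracefully.
sqoop import \
  --connect 'jdbc:sqlserver://dbhost:1433;databaseName=mydb' \
  --username myuser -P \
  --table source_table \
  --split-by row_id \
  --num-mappers 8 \
  --target-dir /user/hadoop/source_table
```

A matching `sqoop export` with the same connection arguments would cover step 3 (writing the Hive result table back to SQL Server).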