[jira] [Commented] (HIVE-15313) Add export spark.yarn.archive or spark.yarn.jars variable in Hive on Spark document

2017-01-02 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15794215#comment-15794215
 ] 

Rui Li commented on HIVE-15313:
---

[~Ferd], yeah I'm OK to update the wiki for performance first. I'll update 
again once the minimum set is determined.

> Add export spark.yarn.archive or spark.yarn.jars variable in Hive on Spark 
> document
> ---
>
> Key: HIVE-15313
> URL: https://issues.apache.org/jira/browse/HIVE-15313
> Project: Hive
>  Issue Type: Bug
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
>Priority: Minor
> Attachments: performance.improvement.after.set.spark.yarn.archive.PNG
>
>
> According to 
> [wiki|https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started],
>  run queries in HOS16 and HOS20 in yarn mode.
> Following table shows the difference in query time between HOS16 and HOS20.
> ||Version||Total time||Time for Jobs||Time for preparing jobs||
> |Spark16|51|39|12|
> |Spark20|54|40|14| 
>  HOS20 spends more time(2 secs) on preparing jobs than HOS16. After reviewing 
> the source code of spark, found that following point causes this:
> code:[Client#distribute|https://github.com/apache/spark/blob/master/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala#L546],
>  In spark20, if spark cannot find spark.yarn.archive and spark.yarn.jars in 
> spark configuration file, it will first copy all jars in $SPARK_HOME/jars to 
> a tmp directory and upload the tmp directory to distribute cache. Comparing 
> [spark16|https://github.com/apache/spark/blob/branch-1.6/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala#L1145],
>  
> In spark16, it searches spark-assembly*.jar and upload it to distribute cache.
> In spark20, it spends 2 more seconds to copy all jars in $SPARK_HOME/jar to a 
> tmp directory if we don't set "spark.yarn.archive" or "spark.yarn.jars".
> We can accelerate the startup of hive on spark 20 by settintg 
> "spark.yarn.archive" or "spark.yarn.jars":
> set "spark.yarn.archive":
> {code}
> cd $SPARK_HOME/jars
> zip spark-archive.zip ./*.jar # this is important, enter the jars folder then 
> zip
> $ hadoop fs -copyFromLocal spark-archive.zip 
> $ echo "spark.yarn.archive=hdfs:///xxx:8020/spark-archive.zip" >> 
> conf/spark-defaults.conf
> {code}
> set "spark.yarn.jars":
> {code}
> $ hadoop fs mkdir spark-2.0.0-bin-hadoop 
> $hadoop fs -copyFromLocal $SPARK_HOME/jars/* spark-2.0.0-bin-hadoop 
> $ echo "spark.yarn.jars=hdfs:///xxx:8020/spark-2.0.0-bin-hadoop/*" >> 
> conf/spark-defaults.conf
> {code}
> Suggest to add this part in wiki.
> performance.improvement.after.set.spark.yarn.archive.PNG shows the detail 
> performance impovement after setting spark.yarn.archive in small queries.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-15313) Add export spark.yarn.archive or spark.yarn.jars variable in Hive on Spark document

2017-01-02 Thread Ferdinand Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15794156#comment-15794156
 ] 

Ferdinand Xu commented on HIVE-15313:
-

[~lirui], any progress or plan about HIVE-15302? I think we can resolve this 
ticket firstly since HIVE-15302 doesn't block HIVE-15313? Any suggestions? 
[~xuefuz] [~lirui]

> Add export spark.yarn.archive or spark.yarn.jars variable in Hive on Spark 
> document
> ---
>
> Key: HIVE-15313
> URL: https://issues.apache.org/jira/browse/HIVE-15313
> Project: Hive
>  Issue Type: Bug
>Reporter: liyunzhang_intel
>Priority: Minor
> Attachments: performance.improvement.after.set.spark.yarn.archive.PNG
>
>
> According to 
> [wiki|https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started],
>  run queries in HOS16 and HOS20 in yarn mode.
> Following table shows the difference in query time between HOS16 and HOS20.
> ||Version||Total time||Time for Jobs||Time for preparing jobs||
> |Spark16|51|39|12|
> |Spark20|54|40|14| 
>  HOS20 spends more time(2 secs) on preparing jobs than HOS16. After reviewing 
> the source code of spark, found that following point causes this:
> code:[Client#distribute|https://github.com/apache/spark/blob/master/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala#L546],
>  In spark20, if spark cannot find spark.yarn.archive and spark.yarn.jars in 
> spark configuration file, it will first copy all jars in $SPARK_HOME/jars to 
> a tmp directory and upload the tmp directory to distribute cache. Comparing 
> [spark16|https://github.com/apache/spark/blob/branch-1.6/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala#L1145],
>  
> In spark16, it searches spark-assembly*.jar and upload it to distribute cache.
> In spark20, it spends 2 more seconds to copy all jars in $SPARK_HOME/jar to a 
> tmp directory if we don't set "spark.yarn.archive" or "spark.yarn.jars".
> We can accelerate the startup of hive on spark 20 by settintg 
> "spark.yarn.archive" or "spark.yarn.jars":
> set "spark.yarn.archive":
> {code}
> cd $SPARK_HOME/jars
> zip spark-archive.zip ./*.jar # this is important, enter the jars folder then 
> zip
> $ hadoop fs -copyFromLocal spark-archive.zip 
> $ echo "spark.yarn.archive=hdfs:///xxx:8020/spark-archive.zip" >> 
> conf/spark-defaults.conf
> {code}
> set "spark.yarn.jars":
> {code}
> $ hadoop fs mkdir spark-2.0.0-bin-hadoop 
> $hadoop fs -copyFromLocal $SPARK_HOME/jars/* spark-2.0.0-bin-hadoop 
> $ echo "spark.yarn.jars=hdfs:///xxx:8020/spark-2.0.0-bin-hadoop/*" >> 
> conf/spark-defaults.conf
> {code}
> Suggest to add this part in wiki.
> performance.improvement.after.set.spark.yarn.archive.PNG shows the detail 
> performance impovement after setting spark.yarn.archive in small queries.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-15313) Add export spark.yarn.archive or spark.yarn.jars variable in Hive on Spark document

2016-11-30 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15710935#comment-15710935
 ] 

Xuefu Zhang commented on HIVE-15313:


+1 from me as well. Let's first flush out everything around spark dependency 
and then provide wiki documentation.

> Add export spark.yarn.archive or spark.yarn.jars variable in Hive on Spark 
> document
> ---
>
> Key: HIVE-15313
> URL: https://issues.apache.org/jira/browse/HIVE-15313
> Project: Hive
>  Issue Type: Bug
>Reporter: liyunzhang_intel
>Priority: Minor
> Attachments: performance.improvement.after.set.spark.yarn.archive.PNG
>
>
> According to 
> [wiki|https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started],
>  run queries in HOS16 and HOS20 in yarn mode.
> Following table shows the difference in query time between HOS16 and HOS20.
> ||Version||Total time||Time for Jobs||Time for preparing jobs||
> |Spark16|51|39|12|
> |Spark20|54|40|14| 
>  HOS20 spends more time(2 secs) on preparing jobs than HOS16. After reviewing 
> the source code of spark, found that following point causes this:
> code:[Client#distribute|https://github.com/apache/spark/blob/master/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala#L546],
>  In spark20, if spark cannot find spark.yarn.archive and spark.yarn.jars in 
> spark configuration file, it will first copy all jars in $SPARK_HOME/jars to 
> a tmp directory and upload the tmp directory to distribute cache. Comparing 
> [spark16|https://github.com/apache/spark/blob/branch-1.6/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala#L1145],
>  
> In spark16, it searches spark-assembly*.jar and upload it to distribute cache.
> In spark20, it spends 2 more seconds to copy all jars in $SPARK_HOME/jar to a 
> tmp directory if we don't set "spark.yarn.archive" or "spark.yarn.jars".
> We can accelerate the startup of hive on spark 20 by settintg 
> "spark.yarn.archive" or "spark.yarn.jars":
> set "spark.yarn.archive":
> {code}
>  zip spark-archive.zip $SPARK_HOME/jars/*
> $ hadoop fs -copyFromLocal spark-archive.zip 
> $ echo "spark.yarn.archive=hdfs:///xxx:8020/spark-archive.zip" >> 
> conf/spark-defaults.conf
> {code}
> set "spark.yarn.jars":
> {code}
> $ hadoop fs mkdir spark-2.0.0-bin-hadoop 
> $hadoop fs -copyFromLocal $SPARK_HOME/jars/* spark-2.0.0-bin-hadoop 
> $ echo "spark.yarn.jars=hdfs:///xxx:8020/spark-2.0.0-bin-hadoop/*" >> 
> conf/spark-defaults.conf
> {code}
> Suggest to add this part in wiki.
> performance.improvement.after.set.spark.yarn.archive.PNG shows the detail 
> performance impovement after setting spark.yarn.archive in small queries.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-15313) Add export spark.yarn.archive or spark.yarn.jars variable in Hive on Spark document

2016-11-30 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15710505#comment-15710505
 ] 

Rui Li commented on HIVE-15313:
---

Seems these two configs are useful in several ways :) I'm also looking at them 
in HIVE-15302.
My plan is to identify the minimum set of needed jars (good for performance the 
avoid conflicts) and update the wiki.
We can also make code change to automatically set it if user hasn't.

> Add export spark.yarn.archive or spark.yarn.jars variable in Hive on Spark 
> document
> ---
>
> Key: HIVE-15313
> URL: https://issues.apache.org/jira/browse/HIVE-15313
> Project: Hive
>  Issue Type: Bug
>Reporter: liyunzhang_intel
>Priority: Minor
> Attachments: performance.improvement.after.set.spark.yarn.archive.PNG
>
>
> According to 
> [wiki|https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started],
>  run queries in HOS16 and HOS20 in yarn mode.
> Following table shows the difference in query time between HOS16 and HOS20.
> ||Version||Total time||Time for Jobs||Time for preparing jobs||
> |Spark16|51|39|12|
> |Spark20|54|40|14| 
>  HOS20 spends more time(2 secs) on preparing jobs than HOS16. After reviewing 
> the source code of spark, found that following point causes this:
> code:[Client#distribute|https://github.com/apache/spark/blob/master/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala#L546],
>  In spark20, if spark cannot find spark.yarn.archive and spark.yarn.jars in 
> spark configuration file, it will first copy all jars in $SPARK_HOME/jars to 
> a tmp directory and upload the tmp directory to distribute cache. Comparing 
> [spark16|https://github.com/apache/spark/blob/branch-1.6/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala#L1145],
>  
> In spark16, it searches spark-assembly*.jar and upload it to distribute cache.
> In spark20, it spends 2 more seconds to copy all jars in $SPARK_HOME/jar to a 
> tmp directory if we don't set "spark.yarn.archive" or "spark.yarn.jars".
> We can accelerate the startup of hive on spark 20 by settintg 
> "spark.yarn.archive" or "spark.yarn.jars":
> set "spark.yarn.archive":
> {code}
>  zip spark-archive.zip $SPARK_HOME/jars/*
> $ hadoop fs -copyFromLocal spark-archive.zip 
> $ echo "spark.yarn.archive=hdfs:///xxx:8020/spark-archive.zip" >> 
> conf/spark-defaults.conf
> {code}
> set "spark.yarn.jars":
> {code}
> $ hadoop fs mkdir spark-2.0.0-bin-hadoop 
> $hadoop fs -copyFromLocal $SPARK_HOME/jars/* spark-2.0.0-bin-hadoop 
> $ echo "spark.yarn.jars=hdfs:///xxx:8020/spark-2.0.0-bin-hadoop/*" >> 
> conf/spark-defaults.conf
> {code}
> Suggest to add this part in wiki.
> performance.improvement.after.set.spark.yarn.archive.PNG shows the detail 
> performance impovement after setting spark.yarn.archive in small queries.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-15313) Add export spark.yarn.archive or spark.yarn.jars variable in Hive on Spark document

2016-11-30 Thread Ferdinand Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15710449#comment-15710449
 ] 

Ferdinand Xu commented on HIVE-15313:
-

 I'd like to update Hive WIKI about configuring spark.yarn.jars configuration 
for better performance for queries at small data scale. [~xuefuz] [~lirui], any 
suggestions about this?

> Add export spark.yarn.archive or spark.yarn.jars variable in Hive on Spark 
> document
> ---
>
> Key: HIVE-15313
> URL: https://issues.apache.org/jira/browse/HIVE-15313
> Project: Hive
>  Issue Type: Bug
>Reporter: liyunzhang_intel
>Priority: Minor
> Attachments: performance.improvement.after.set.spark.yarn.archive.PNG
>
>
> According to 
> [wiki|https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started],
>  run queries in HOS16 and HOS20 in yarn mode.
> Following table shows the difference in query time between HOS16 and HOS20.
> ||Version||Total time||Time for Jobs||Time for preparing jobs||
> |Spark16|51|39|12|
> |Spark20|54|40|14| 
>  HOS20 spends more time(2 secs) on preparing jobs than HOS16. After reviewing 
> the source code of spark, found that following point causes this:
> code:[Client#distribute|https://github.com/apache/spark/blob/master/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala#L546],
>  In spark20, if spark cannot find spark.yarn.archive and spark.yarn.jars in 
> spark configuration file, it will first copy all jars in $SPARK_HOME/jars to 
> a tmp directory and upload the tmp directory to distribute cache. Comparing 
> [spark16|https://github.com/apache/spark/blob/branch-1.6/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala#L1145],
>  
> In spark16, it searches spark-assembly*.jar and upload it to distribute cache.
> In spark20, it spends 2 more seconds to copy all jars in $SPARK_HOME/jar to a 
> tmp directory if we don't set "spark.yarn.archive" or "spark.yarn.jars".
> We can accelerate the startup of hive on spark 20 by settintg 
> "spark.yarn.archive" or "spark.yarn.jars":
> set "spark.yarn.archive":
> {code}
>  zip spark-archive.zip $SPARK_HOME/jars/*
> $ hadoop fs -copyFromLocal spark-archive.zip 
> $ echo "spark.yarn.archive=hdfs:///xxx:8020/spark-archive.zip" >> 
> conf/spark-defaults.conf
> {code}
> set "spark.yarn.jars":
> {code}
> $ hadoop fs mkdir spark-2.0.0-bin-hadoop 
> $hadoop fs -copyFromLocal $SPARK_HOME/jars/* spark-2.0.0-bin-hadoop 
> $ echo "spark.yarn.jars=hdfs:///xxx:8020/spark-2.0.0-bin-hadoop/*" >> 
> conf/spark-defaults.conf
> {code}
> Suggest to add this part in wiki.
> performance.improvement.after.set.spark.yarn.archive.PNG shows the detail 
> performance impovement after setting spark.yarn.archive in small queries.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)