[jira] [Commented] (HIVE-14240) HoS itests shouldn't depend on a Spark distribution

2016-09-21 Thread Sahil Takiar (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15510628#comment-15510628
 ] 

Sahil Takiar commented on HIVE-14240:
-------------------------------------

[~Ferd], [~lirui] yes, I forgot that there are two ways the qtests get run on
Spark: one is local-cluster mode and the other is yarn-client mode. I believe
the dependency on a SPARK_HOME directory is present in both modes, so unless
we can figure out a way to change this in Spark, I think we still need the
dependency on the Spark distribution.
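
For context, a minimal sketch of the kind of lookup involved (illustrative
only, not the actual Hive Spark client code; the names and fallback order are
assumptions):

{code:java}
import java.io.File;

// Illustrative sketch: resolve a Spark home from configuration or the
// environment, then point at the launch script inside the distribution.
public class SparkHomeResolver {
  public static File resolveSparkSubmit(String configuredSparkHome) {
    String sparkHome = (configuredSparkHome != null)
        ? configuredSparkHome
        : System.getenv("SPARK_HOME");
    if (sparkHome == null) {
      throw new IllegalStateException("SPARK_HOME is not set");
    }
    // Both test modes ultimately shell out to this script today.
    return new File(sparkHome, "bin" + File.separator + "spark-submit");
  }
}
{code}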

> HoS itests shouldn't depend on a Spark distribution
> ---------------------------------------------------
>
> Key: HIVE-14240
> URL: https://issues.apache.org/jira/browse/HIVE-14240
> Project: Hive
>  Issue Type: Improvement
>  Components: Spark
>Affects Versions: 2.0.0, 2.1.0, 2.0.1
>Reporter: Sahil Takiar
>Assignee: Sahil Takiar
>
> The HoS integration tests download a full Spark distribution (a tar-ball)
> from CloudFront and use this distribution to run Spark locally. They run a
> few tests with Spark in embedded mode, and some tests against a local Spark
> on YARN cluster. The {{itests/pom.xml}} actually contains scripts to download
> the tar-ball from a pre-defined location.
>
> This is problematic because the Spark distribution shades all of its
> dependencies, including the Hadoop dependencies, which can cause problems
> when upgrading the Hadoop version for Hive (ref: HIVE-13930).
>
> Removing it will also avoid having to download the tar-ball during every
> build, and will simplify the build process for the itests module.
>
> The Hive itests should instead depend directly on the Spark artifacts
> published in Maven Central. It will require some effort to get this working.
> The current Hive Spark client uses a launch script in the Spark installation
> to run Spark jobs. The script basically does some setup work and invokes
> org.apache.spark.deploy.SparkSubmit. It is possible to invoke this class
> directly, which avoids the need to have a full Spark distribution available
> locally (in fact this option already exists, but isn't tested).
>
> There may be other issues around classpath conflicts between Hive and Spark.
> For example, Hive and Spark require different versions of Kryo. One solution
> to this would be to take the Spark artifacts and shade Kryo inside them.
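
As a rough illustration of the direct-invocation option mentioned in the
description: {{org.apache.spark.deploy.SparkSubmit}} exposes a {{main}} method
that can be called in-process. In the sketch below, the driver class and jar
path are hypothetical placeholders, not Hive's actual submission arguments:

{code:java}
import org.apache.spark.deploy.SparkSubmit;

// Sketch: invoke SparkSubmit directly instead of shelling out to the
// bin/spark-submit launch script in a distribution.
public class DirectSubmit {
  public static void main(String[] args) {
    String[] submitArgs = {
        "--master", "local[2]",
        "--class", "com.example.SomeDriver", // hypothetical driver class
        "/path/to/driver.jar"                // placeholder application jar
    };
    // SparkSubmit is a Scala object; its main() is reachable from Java
    // through the generated static forwarder.
    SparkSubmit.main(submitArgs);
  }
}
{code}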





[jira] [Commented] (HIVE-14240) HoS itests shouldn't depend on a Spark distribution

2016-09-20 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15508403#comment-15508403
 ] 

Rui Li commented on HIVE-14240:
-------------------------------

We have two kinds of tests for HoS: TestSparkCliDriver runs on local-cluster,
and TestMiniSparkOnYarnCliDriver runs on a mini YARN cluster. I know
local-cluster is not intended to be used outside Spark, so if local-cluster
causes trouble for this task, I think it's acceptable to migrate the qtests in
TestSparkCliDriver to TestMiniSparkOnYarnCliDriver.



[jira] [Commented] (HIVE-14240) HoS itests shouldn't depend on a Spark distribution

2016-09-20 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15508363#comment-15508363
 ] 

liyunzhang_intel commented on HIVE-14240:
-----------------------------------------

[~Ferd]: 
bq. In Pig, they don't require Spark distribution since they only test Spark 
standalone mode in their integration test.

In Pig on Spark, we don't need to download a Spark distribution to run the
unit tests because we currently only enable "local" (SPARK_MASTER) mode; we
don't support the standalone, yarn-client, or yarn-cluster modes yet. We just
[copy all the Spark dependency jars published in the mvn repository to the
run-time classpath|https://github.com/apache/pig/blob/spark/bin/pig#L399] when
running the unit tests.
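
(The linked {{bin/pig}} is a shell script; the sketch below is a hypothetical
Java rendering of the same idea, assuming the dependency jars have already
been downloaded into some local directory:)

{code:java}
import java.io.File;
import java.util.StringJoiner;

// Hypothetical sketch: build a runtime classpath from a directory of
// dependency jars instead of relying on a full distribution.
public class ClasspathBuilder {
  public static String buildClasspath(File libDir) {
    StringJoiner cp = new StringJoiner(File.pathSeparator);
    File[] jars = libDir.listFiles((dir, name) -> name.endsWith(".jar"));
    if (jars != null) {
      for (File jar : jars) {
        cp.add(jar.getAbsolutePath());
      }
    }
    return cp.toString();
  }
}
{code}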



[jira] [Commented] (HIVE-14240) HoS itests shouldn't depend on a Spark distribution

2016-09-20 Thread Ferdinand Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15508335#comment-15508335
 ] 

Ferdinand Xu commented on HIVE-14240:
-------------------------------------

Thanks [~stakiar] for your input. 
AFAIK, TestSparkCliDriver needs SparkSubmit to submit a job, which requires
SPARK_HOME to point to a Spark distribution because it tests Spark on YARN.
[~kellyzly] [~mohitsabharwal], please correct me if any of the following
statements are wrong. In Pig, they don't require a Spark distribution since
they only test Spark standalone mode in their integration tests.




[jira] [Commented] (HIVE-14240) HoS itests shouldn't depend on a Spark distribution

2016-09-20 Thread Sahil Takiar (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15508218#comment-15508218
 ] 

Sahil Takiar commented on HIVE-14240:
-------------------------------------

I looked into this today and tried to get something working, but I don't think 
it's possible without making some modifications to Spark.

* The HoS integration tests run with {{spark.master=local-cluster[2,2,1024]}}
** Basically, the {{TestSparkCliDriver}} JVM runs the SparkSubmit command
(which spawns a new process); the SparkSubmit process then creates 2 more
processes (the Spark Executors, which do the actual work) with 2 cores and
1024 MB of memory each
** The {{local-cluster}} option is not present in the Spark docs because it is 
mainly used for integration testing within the Spark project itself; it 
basically provides a way of deploying a mini cluster locally
** The advantage of {{local-cluster}} is that it does not require Spark
Masters or Workers to be running
*** Spark Workers are basically like NodeManagers, and a Spark Master is
basically like HS2
* I looked through the Spark code that launches the actual Spark Executors,
and it more or less requires a {{SPARK_HOME}} directory to be present (ref:
https://github.com/apache/spark/blob/branch-2.0/launcher/src/main/java/org/apache/spark/launcher/AbstractCommandBuilder.java),
as sketched below
** {{SPARK_HOME}} is supposed to point to a directory containing a Spark
distribution
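
From memory, the relevant check in {{AbstractCommandBuilder}} looks roughly
like this (paraphrased and simplified, not the verbatim Spark source):

{code:java}
// Paraphrased sketch of Spark's launcher behavior: the command builder
// resolves SPARK_HOME and fails when it is absent, which is why launching
// executors currently needs a distribution on disk.
public class CommandBuilderSketch {
  String getSparkHome() {
    String path = System.getenv("SPARK_HOME");
    if (path == null) {
      throw new IllegalStateException(
          "Spark home not found; set it explicitly or use the SPARK_HOME "
              + "environment variable.");
    }
    return path;
  }
}
{code}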

Thus, we would need to modify the {{AbstractCommandBuilder}} class in Spark so
that it doesn't require {{SPARK_HOME}} to be set. However, I'm not sure how
difficult this will be to do in Spark.

We could change {{spark.master}} from {{local-cluster}} to {{local}}, in which
case everything would run locally. However, I think this removes some
functionality from the HoS tests, since running locally isn't the same as
running against a real mini-cluster.
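
For reference, the difference between the two masters amounts to a one-line
configuration change (the app name below is a placeholder):

{code:java}
import org.apache.spark.SparkConf;

public class MasterConfigs {
  public static void main(String[] args) {
    // local-cluster[N,C,M]: N executor JVMs, each with C cores and M MB of
    // memory, spawned as separate processes -- what the qtests use today.
    SparkConf miniCluster = new SparkConf()
        .setAppName("ExampleApp")
        .setMaster("local-cluster[2,2,1024]");

    // local[N]: everything runs inside the single test JVM, with no
    // separate executor processes.
    SparkConf local = new SparkConf()
        .setAppName("ExampleApp")
        .setMaster("local[2]");
  }
}
{code}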



[jira] [Commented] (HIVE-14240) HoS itests shouldn't depend on a Spark distribution

2016-09-20 Thread Sahil Takiar (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15507644#comment-15507644
 ] 

Sahil Takiar commented on HIVE-14240:
-------------------------------------

Hey [~Ferd], I haven't had time to look into this, although it shouldn't be
particularly difficult (I would hope). I don't think this blocks HIVE-14029,
but I'm trying to talk to some Spark committers to see what they think.



[jira] [Commented] (HIVE-14240) HoS itests shouldn't depend on a Spark distribution

2016-09-20 Thread Ferdinand Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15506745#comment-15506745
 ] 

Ferdinand Xu commented on HIVE-14240:
-------------------------------------

Hi [~stakiar], do you have any updates on this ticket? I am trying to move
HIVE-14029 forward.

Thanks,
Ferd
