[ https://issues.apache.org/jira/browse/HIVE-14240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15508218#comment-15508218 ]
Sahil Takiar edited comment on HIVE-14240 at 9/21/16 12:16 AM:
---------------------------------------------------------------
I looked into this today and tried to get something working, but I don't think it's possible without making some modifications to Spark.
* The HoS integration tests run with {{spark.master=local-cluster[2,2,1024]}}
** Basically, the {{TestSparkCliDriver}} JVM runs the SparkSubmit command (which spawns a new process); the SparkSubmit process then creates 2 more processes (the Spark Executors, which do the actual work) with 2 cores and 1024 MB of memory each
** The {{local-cluster}} option is not present in the Spark docs because it is mainly used for integration testing within the Spark project itself; it provides a way of deploying a mini cluster locally
** The advantage of {{local-cluster}} is that it does not require Spark Masters or Workers to be running
*** Spark Workers are roughly analogous to NodeManagers, and a Spark Master is roughly analogous to HS2
* I looked through the Spark code that launches the actual Spark Executors, and it more or less requires a {{SPARK_HOME}} directory to be present (ref: https://github.com/apache/spark/blob/branch-2.0/launcher/src/main/java/org/apache/spark/launcher/AbstractCommandBuilder.java)
** {{SPARK_HOME}} is supposed to point to a directory containing a Spark distribution

Thus, we would need to modify the {{AbstractCommandBuilder.java}} class in Spark so that it doesn't require {{SPARK_HOME}} to be set. However, I'm not sure how difficult this will be to do in Spark. We could change {{spark.master}} from {{local-cluster}} to {{local}}, in which case everything would run locally. However, I think this removes some functionality from the HoS tests, since running locally isn't the same as running against a real mini-cluster.
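To make the {{SPARK_HOME}} issue concrete, here is a minimal sketch of the kind of fail-fast resolution that Spark's {{AbstractCommandBuilder}} performs. The class and method names below are illustrative simplifications, not Spark's actual API; the point is only that executor launching refuses to proceed when no Spark home can be resolved, which is what forces the itests to carry a full distribution.

```java
import java.util.Map;

public class SparkHomeCheck {
    // Hypothetical simplification of the SPARK_HOME lookup in Spark's
    // launcher/AbstractCommandBuilder: resolve the variable from the
    // environment and fail fast when it is unset.
    static String getSparkHome(Map<String, String> env) {
        String home = env.get("SPARK_HOME");
        if (home == null) {
            throw new IllegalStateException(
                "Spark home not found; SPARK_HOME must point to a Spark distribution.");
        }
        return home;
    }

    public static void main(String[] args) {
        // Without SPARK_HOME, launching fails before any executor is spawned.
        try {
            getSparkHome(Map.of());
        } catch (IllegalStateException e) {
            System.out.println("failed as expected: " + e.getMessage());
        }
        // With SPARK_HOME set, the resolved directory is used as-is.
        System.out.println(getSparkHome(Map.of("SPARK_HOME", "/opt/spark")));
    }
}
```

Removing the dependency on a distribution would mean changing this resolution to tolerate a missing {{SPARK_HOME}} (e.g. building the executor classpath from Maven artifacts instead), which is the modification to Spark referred to above.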
> HoS itests shouldn't depend on a Spark distribution
> ---------------------------------------------------
>
> Key: HIVE-14240
> URL: https://issues.apache.org/jira/browse/HIVE-14240
> Project: Hive
> Issue Type: Improvement
> Components: Spark
> Affects Versions: 2.0.0, 2.1.0, 2.0.1
> Reporter: Sahil Takiar
> Assignee: Sahil Takiar
>
> The HoS integration tests download a full Spark distribution (a tar-ball) from CloudFront.
> It uses this distribution to run Spark locally. It runs a few tests with Spark in embedded mode, and some tests against a local Spark-on-YARN cluster. The {{itests/pom.xml}} actually contains scripts to download the tar-ball from a pre-defined location.
> This is problematic because the Spark distribution shades all its dependencies, including Hadoop dependencies. This can cause problems when upgrading the Hadoop version for Hive (ref: HIVE-13930).
> Removing it will also avoid having to download the tar-ball during every build, and simplify the build process for the itests module.
> The Hive itests should instead depend directly on Spark artifacts published in Maven Central. It will require some effort to get this working. The current Hive Spark client uses a launch script in the Spark installation to run Spark jobs. The script basically does some setup work and invokes org.apache.spark.deploy.SparkSubmit. It is possible to invoke this class directly, which avoids the need to have a full Spark distribution available locally (in fact, this option already exists, but isn't tested).
> There may be other issues around classpath conflicts between Hive and Spark. For example, Hive and Spark require different versions of Kryo. One solution to this would be to take the Spark artifacts and shade Kryo inside them.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
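The "invoke SparkSubmit directly" idea from the description can be sketched as follows. This is a hypothetical illustration, not Hive's actual implementation: it only assembles the argv that {{bin/spark-submit}} would have produced, and the driver class and jar names are placeholders. With the Spark artifacts from Maven Central on the classpath, that argv could then be handed to {{org.apache.spark.deploy.SparkSubmit.main(...)}} in-process, with no distribution on disk.

```java
import java.util.ArrayList;
import java.util.List;

public class DirectSubmit {
    // Assembles spark-submit style arguments without going through the
    // launch script shipped in a Spark distribution. Argument values here
    // (driver class, jar name) are illustrative placeholders.
    static String[] buildArgs(String master, String mainClass, String appJar) {
        List<String> args = new ArrayList<>();
        args.add("--master");
        args.add(master);      // e.g. "local-cluster[2,2,1024]" or "local"
        args.add("--class");
        args.add(mainClass);   // the driver class to run
        args.add(appJar);      // jar resolved from Maven artifacts, not SPARK_HOME
        return args.toArray(new String[0]);
    }

    public static void main(String[] argv) {
        String[] args = buildArgs("local", "com.example.RemoteDriver", "app.jar");
        System.out.println(String.join(" ", args));
        // With Spark on the classpath, one could then invoke (not run here):
        // org.apache.spark.deploy.SparkSubmit.main(args);
    }
}
```

The remaining work described above (shading Kryo, resolving other classpath conflicts) is about making that in-process invocation safe, not about constructing the arguments.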