[jira] [Commented] (SPARK-8646) PySpark does not run on YARN if master not provided in command line

2015-07-31 Thread Min Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14649062#comment-14649062
 ] 

Min Wu commented on SPARK-8646:
---

Hi, I hit the same issue when running a pyspark program in yarn-client mode 
with Spark 1.4.1 from BigInsights 4.1 (Ambari). Because the assembly jar no longer 
contains the python scripts for pyspark and py4j, I set the Spark home via 
SparkContext.setSparkHome() to the spark-client location (this is an Ambari-managed 
Hadoop install, so spark-client contains the python folder, including the 
py4j and pyspark scripts). The API documentation says this value will be applied on 
the slave nodes, and I assumed it would apply to "Spark on YARN" as well, but it does 
not work: the worker nodes always get their PYTHONPATH from the cached assembly jar. 
After checking the SparkContext code, it seems sparkHome is stored in the 
SparkConf as "spark.home", so maybe it should be distributed to all 
executors so that pyspark can use this parameter to locate the PYTHONPATH as well.
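For reference, a minimal sketch of what I tried (the spark-client path is 
illustrative for an Ambari layout, not exact):

{code}
from pyspark import SparkConf, SparkContext

# Illustrative Ambari spark-client location; adjust to your install.
conf = (SparkConf()
        .setAppName("spark-home-test")
        .setSparkHome("/usr/iop/current/spark-client"))
sc = SparkContext(conf=conf)

# setSparkHome() just records the value as "spark.home" in the conf;
# on YARN the executors still take PYTHONPATH from the cached assembly jar.
print(conf.get("spark.home"))
{code}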

> PySpark does not run on YARN if master not provided in command line
> ---
>
> Key: SPARK-8646
> URL: https://issues.apache.org/jira/browse/SPARK-8646
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, YARN
>Affects Versions: 1.4.0
> Environment: SPARK_HOME=local/path/to/spark1.4install/dir
> also with
> SPARK_HOME=local/path/to/spark1.4install/dir
> PYTHONPATH=$SPARK_HOME/python/lib
> Spark apps are submitted with the command:
> $SPARK_HOME/bin/spark-submit outofstock/data_transform.py 
> hdfs://foe-dev/DEMO_DATA/FACT_POS hdfs:/user/juliet/ex/ yarn-client
> data_transform contains a main method, and the rest of the args are parsed in 
> my own code.
>Reporter: Juliet Hougland
>Assignee: Lianhui Wang
> Fix For: 1.5.0
>
> Attachments: executor.log, pi-test.log, 
> spark1.4-SPARK_HOME-set-PYTHONPATH-set.log, 
> spark1.4-SPARK_HOME-set-inline-HADOOP_CONF_DIR.log, 
> spark1.4-SPARK_HOME-set.log, spark1.4-verbose.log, verbose-executor.log
>
>
> Running pyspark jobs results in a "no module named pyspark" error when run in 
> yarn-client mode in Spark 1.4.
> [I believe this JIRA represents the change that introduced this error|https://issues.apache.org/jira/browse/SPARK-6869].
> This does not represent a binary compatible change to Spark. Scripts that 
> worked on previous Spark versions (i.e. commands that use spark-submit) should 
> continue to work without modification between minor versions.






[jira] [Commented] (SPARK-8646) PySpark does not run on YARN if master not provided in command line

2015-07-16 Thread Lianhui Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14629409#comment-14629409
 ] 

Lianhui Wang commented on SPARK-8646:
-

Yes, when I use this command: ./bin/spark-submit ./pi.py yarn-client 10, the 
YARN client does not upload pyspark.zip, so it cannot work. I have submitted a 
PR that resolves this problem, based on the master branch.








[jira] [Commented] (SPARK-8646) PySpark does not run on YARN if master not provided in command line

2015-07-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14629407#comment-14629407
 ] 

Apache Spark commented on SPARK-8646:
-

User 'lianhuiwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/7438







[jira] [Commented] (SPARK-8646) PySpark does not run on YARN if master not provided in command line

2015-07-15 Thread Lianhui Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14629103#comment-14629103
 ] 

Lianhui Wang commented on SPARK-8646:
-

Yes, when we set master=yarn-client in pyspark/SparkContext.py, it does not take 
effect, so spark-submit assumes master=local. I will look into it.
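In other words, a minimal illustration of the pattern that fails (the app name 
is arbitrary):

{code}
from pyspark import SparkConf, SparkContext

# The master is set programmatically, but by this point spark-submit has
# already decided it is running locally, so pyspark.zip was never uploaded
# to the YARN staging directory.
conf = SparkConf().setMaster("yarn-client").setAppName("pi")
sc = SparkContext(conf=conf)
{code}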







[jira] [Commented] (SPARK-8646) PySpark does not run on YARN

2015-07-15 Thread Juliet Hougland (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14628971#comment-14628971
 ] 

Juliet Hougland commented on SPARK-8646:


Yea, it works fine if I add that arg. There are two reasons I think this should 
be fixed in Spark, despite there being a workaround. First, I think API 
compatibility should include scripts that 







[jira] [Commented] (SPARK-8646) PySpark does not run on YARN

2015-07-15 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14628959#comment-14628959
 ] 

Marcelo Vanzin commented on SPARK-8646:
---

I think I know what's going on. Since you're not passing the "--master" command 
line argument to spark-submit, SparkSubmit does not know you'll be running the 
app in yarn mode, so it does not collect information about the pyspark archives 
to upload. So pyspark modules are not available in the cluster when you run 
your app.

If you just add "--master yarn-client" to your command line, it should work, 
even if it is redundant. Nevertheless, it would be nice to fix this in the 
Spark code too.







[jira] [Commented] (SPARK-8646) PySpark does not run on YARN

2015-07-15 Thread Juliet Hougland (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14628947#comment-14628947
 ] 

Juliet Hougland commented on SPARK-8646:


The failure happens at the point that I need to write out a file on the cluster 
and pyspark facilities need to be available to executors, not just the driver 
program. I can parse args and start a spark context fine, it fails at the point 
that I call sc.saveAsTextFile. Relevant lines:

{code}
import argparse

def analyze(data_io):
    sc = data_io.sc()
    sc.addPyFile("file:/home/juliet/src/out-of-stock/outofstock/GeometricModel.py")
    keyed_ts_rdd = to_keyed_ts(sc.textFile(data_io.input_path)).cache()

    # Compute days between sales on a store-item basis
    keyed_days_btwn_sales = keyed_ts_rdd.mapValues(days_between_sales).cache()

    # Identify days with sales numbers that are outliers, using Tukey's criterion
    keyed_outliers = keyed_ts_rdd.mapValues(flag_outliers)
    to_csv_lines(keyed_outliers).saveAsTextFile(data_io.sales_outliers_path)  # Point of failure


if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description='Analyze store-item sales history for anomalies.')
    parser.add_argument('input_path')
    parser.add_argument('output_dir')
    parser.add_argument('mode')
    args = parser.parse_args()

    dataIO = DataIO(args.input_path, args.output_dir, mode=args.mode)
    analyze(dataIO)
{code}

This runs fine on Spark 1.3 and produces reasonable results that get written 
to files in HDFS. I'm pretty confident that my use of argparse and the other logic 
in my code works fine. 







[jira] [Commented] (SPARK-8646) PySpark does not run on YARN

2015-07-15 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14628927#comment-14628927
 ] 

Marcelo Vanzin commented on SPARK-8646:
---

[~j_houg], could you share the exact code you're using to instantiate the 
context?

Here's a script I wrote:

{code}
import sys
from pyspark import SparkContext
SparkContext(master=sys.argv[1]).stop()
{code}

And invoking spark-submit yields the expected results.

{{./bin/spark-submit /tmp/script.py local}} works. 

{{./bin/spark-submit /tmp/script.py foo}} fails because "foo" is not a valid 
master.

So everything seems to be working as expected, which makes me suspicious of 
your code.







[jira] [Commented] (SPARK-8646) PySpark does not run on YARN

2015-07-15 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14628822#comment-14628822
 ] 

Marcelo Vanzin commented on SPARK-8646:
---

Hmmm, the command output looks fine, so it seems this was not a regression 
caused by the launcher library. But let me try it locally and see what I get.







[jira] [Commented] (SPARK-8646) PySpark does not run on YARN

2015-07-13 Thread Lianhui Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14625725#comment-14625725
 ] 

Lianhui Wang commented on SPARK-8646:
-

[~juliet] Can you provide your spark-submit command? 
I think the correct command in Spark 1.4 is: $SPARK_HOME/bin/spark-submit 
--master yarn-client outofstock/data_transform.py 
hdfs://foe-dev/DEMO_DATA/FACT_POS hdfs:/user/juliet/ex4/
Is it the same as your command?







[jira] [Commented] (SPARK-8646) PySpark does not run on YARN

2015-07-13 Thread Juliet Hougland (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14625245#comment-14625245
 ] 

Juliet Hougland commented on SPARK-8646:


[~lianhuiwang] in $SPARK_HOME/conf I only have the spark-defaults.conf.template 
file, not a non-template version. I also do not set the spark master to local 
programmatically.

[~vanzin] The command logged to stderr is:

{noformat}
Spark Command: /usr/lib/jvm/java-1.7.0-openjdk-1.7.0.65.x86_64/jre/bin/java -cp /home/juliet/bin/spark-1.4.0-bin-hadoop2.6/conf/:/home/juliet/bin/spark-1.4.0-bin-hadoop2.6/lib/spark-assembly-1.4.0-hadoop2.6.0.jar:/home/juliet/bin/spark-1.4.0-bin-hadoop2.6/lib/datanucleus-core-3.2.10.jar:/home/juliet/bin/spark-1.4.0-bin-hadoop2.6/lib/datanucleus-rdbms-3.2.9.jar:/home/juliet/bin/spark-1.4.0-bin-hadoop2.6/lib/datanucleus-api-jdo-3.2.6.jar:/etc/hadoop/conf/ -Xms512m -Xmx512m -XX:MaxPermSize=128m org.apache.spark.deploy.SparkSubmit --verbose outofstock/data_transform.py hdfs://foe-dev/DEMO_DATA/FACT_POS hdfs:/user/juliet/ex7/ yarn-client
{noformat}

yarn-client is being passed as an argument to my code, but because I am not 
specifying the master via the --master CLI flag or via spark-defaults.conf, it 
does not affect how the job initially starts up.







[jira] [Commented] (SPARK-8646) PySpark does not run on YARN

2015-07-13 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14625087#comment-14625087
 ] 

Marcelo Vanzin commented on SPARK-8646:
---

[~j_houg] could you also run the command with the SPARK_PRINT_LAUNCH_COMMAND=1 
env variable set, and post the command logged to stderr?







[jira] [Commented] (SPARK-8646) PySpark does not run on YARN

2015-07-13 Thread Lianhui Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14624521#comment-14624521
 ] 

Lianhui Wang commented on SPARK-8646:
-

[~juliet] From your spark1.4-verbose.log, I see that master=local[*]. So maybe 
you have spark.master=local configured in spark-defaults.conf? The other 
possibility is that your data_transform.py uses sparkConf.set("spark.master", "local"). 
Can you check whether either of these is the case?







[jira] [Commented] (SPARK-8646) PySpark does not run on YARN

2015-07-10 Thread Juliet Hougland (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14623008#comment-14623008
 ] 

Juliet Hougland commented on SPARK-8646:


[~lianhuiwang] I just uploaded the log files from using --verbose. I think I 
may have important clues as to where the problem lies. Instead of using 
'--master yarn-client' as part of my spark-submit command, I parse my own CLI 
arg in my main class to get the Spark master and initialize a configuration 
with it. If I add --master yarn-client in addition to my normal master 
specification, the job runs fine.

The following command works in Spark 1.3 but not in 1.4:
$SPARK_HOME/bin/spark-submit --verbose outofstock/data_transform.py \
hdfs://foe-dev/DEMO_DATA/FACT_POS hdfs:/user/juliet/ex4/ yarn-client

If I add the --master yarn-client parameter to the command, it works. 
Specifically:
$SPARK_HOME/bin/spark-submit --verbose --master yarn-client 
outofstock/data_transform.py \
hdfs://foe-dev/DEMO_DATA/FACT_POS hdfs:/user/juliet/ex4/ yarn-client







[jira] [Commented] (SPARK-8646) PySpark does not run on YARN

2015-07-09 Thread Lianhui Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14620466#comment-14620466
 ] 

Lianhui Wang commented on SPARK-8646:
-

[~j_houg] Can you add --verbose to your spark-submit command and look at what 
your spark.submit.pyArchives is? From your logs, I see that it does not upload 
the pyArchive files (pyspark.zip and py4j-0.8.2.1-src.zip). You can also check 
whether those two zips are present under SPARK_HOME/python/lib.
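For example, a quick check (a sketch; it assumes SPARK_HOME is set in your 
environment):

{code}
import os

lib = os.path.join(os.environ["SPARK_HOME"], "python", "lib")
for archive in ("pyspark.zip", "py4j-0.8.2.1-src.zip"):
    path = os.path.join(lib, archive)
    print(path + (" exists" if os.path.exists(path) else " MISSING"))
{code}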







[jira] [Commented] (SPARK-8646) PySpark does not run on YARN

2015-07-06 Thread Juliet Hougland (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14615809#comment-14615809
 ] 

Juliet Hougland commented on SPARK-8646:


[~davies] Please look at the logs I have attached. The pandas.algo import error 
only appears in the pi-test.log file. I ran the pi test as a way to help debug 
this problem at the request of [~vanzin]. If you look at the three other log files 
(with env differences in the file names), those are from running my out-of-stock 
job. That job does have quite a few dependencies, but I make sure those are 
available to the driver and workers. 

The real (first) issue this ticket is about is that pyspark isn't 
available on worker nodes. The same command I can use to run my job on Spark 
1.3 does not work with Spark 1.4.







[jira] [Commented] (SPARK-8646) PySpark does not run on YARN

2015-07-06 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14615293#comment-14615293
 ] 

Davies Liu commented on SPARK-8646:
---

To be clear, PySpark does NOT depend on pandas. In dataframe.py, it works with 
pandas DataFrames only when you have pandas installed.

[~juliet] example/pi.py should run fine on YARN (it does not need pandas at 
all). Is it possible that `outofstock/data_transform.py` depends on 
`pandas.algos` (pandas.algos is used by a closure from the driver), and you 
uploaded the wrong log file?








[jira] [Commented] (SPARK-8646) PySpark does not run on YARN

2015-07-06 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14614885#comment-14614885
 ] 

Sean Owen commented on SPARK-8646:
--

Right, none of this uses pandas directly. As [~vanzin] says, the code appears to 
be careful about only calling "import pandas" when needed ({{toPandas()}}) or 
catching the error when it's not available. My guess is that {{has_pandas}} 
is true on the driver, but that then causes it to do things that the executors 
can't honor since they don't have pandas.

It does sound like a docs issue. Some PySpark operations need pandas, and you 
need a uniform Python installation across driver and executors -- either both 
have pandas or both don't. I suppose that's always good practice, but it's not 
obvious that it could manifest like this.

How about adding some docs?

Or [~davies] et al., is there a better way to guard this? Rather than checking once 
whether pandas can be imported, check at "runtime" in the createDataFrame 
method, kind of like {{toPandas}} does? 
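A sketch of that runtime-guard idea (the helper name is hypothetical, not the 
actual dataframe.py code):

{code}
def _rows_from_maybe_pandas(data):
    """Convert a pandas DataFrame to plain row tuples, touching pandas
    only if it is importable right here; anything else passes through."""
    try:
        import pandas  # deferred import: nodes without pandas never hit it
    except ImportError:
        return data
    if isinstance(data, pandas.DataFrame):
        return [tuple(row) for row in data.itertuples(index=False)]
    return data
{code}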







[jira] [Commented] (SPARK-8646) PySpark does not run on YARN

2015-07-06 Thread Juliet Hougland (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14614856#comment-14614856
 ] 

Juliet Hougland commented on SPARK-8646:


[~sowen] The pandas error came when I tried to run the pi job, which doesn't 
import pandas at all. The only imports in 
$SPARK_1.4_HOME/examples/src/main/python/pi.py are as follows:

{code}
import sys
from random import random
from operator import add
from pyspark import SparkContext
{code}

PySpark itself doesn't require pandas (if it does, that should be documented), 
so having the pi job (which doesn't require pandas) fail with a pandas-not-found 
error is wrong; at no point should the pi job or pyspark itself require 
pandas. The pandas error is very, very weird, but not obviously directly related 
to this ticket. The problem I reported here has to do with pyspark itself not 
being shipped, or perhaps not made available, to the worker nodes when I run a 
pyspark app from Spark 1.4 using YARN.







[jira] [Commented] (SPARK-8646) PySpark does not run on YARN

2015-07-06 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14614673#comment-14614673
 ] 

Sean Owen commented on SPARK-8646:
--

[~j_houg] is the resolution here just that pandas has to be installed if pandas 
is used?







[jira] [Commented] (SPARK-8646) PySpark does not run on YARN

2015-06-27 Thread Lianhui Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14604226#comment-14604226
 ] 

Lianhui Wang commented on SPARK-8646:
-

I just used spark-1.5.0-SNAPSHOT to run pi.py without pandas installed, and it 
works. I find that sql/dataframe.py does need to import pandas, but if you do 
not use sql/dataframe.py, I think pandas is not needed. [~juliet] can you 
provide the executor logs so we can get more details?







[jira] [Commented] (SPARK-8646) PySpark does not run on YARN

2015-06-26 Thread Juliet Hougland (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14604014#comment-14604014
 ] 

Juliet Hougland commented on SPARK-8646:


When I configure Spark to use my virtualenv, which is on every node of the 
cluster and includes pandas, the pi job works fine. This makes sense to me 
because in the job of mine that fails, a Spark context can be created 
without a module import error. The part that doesn't make sense to me is why 
pandas.algo would be needed at all: looking at the code for the pi job, it is 
not part of any import declared in the file. This is orthogonal to the 
point of this ticket, but is very, very strange to me.

The module import error that is the core of this JIRA occurs when I need to 
write out the results of a computation (i.e. calling saveAsTextFile), which requires 
the pyspark module to be available on the worker nodes.
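A minimal sketch of that failing shape (the paths are illustrative):

{code}
from pyspark import SparkContext

sc = SparkContext()
rdd = sc.textFile("hdfs:/user/juliet/ex/input").map(lambda line: line.upper())
# The lambda runs on the executors, which need the pyspark module on
# their PYTHONPATH; this is where "no module named pyspark" surfaces.
rdd.saveAsTextFile("hdfs:/user/juliet/ex/output")
{code}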







[jira] [Commented] (SPARK-8646) PySpark does not run on YARN

2015-06-26 Thread Lianhui Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14603973#comment-14603973
 ] 

Lianhui Wang commented on SPARK-8646:
-

From [~juliet]'s logs, I think you are missing the Python 'pandas.algos' module, 
which pyspark does not provide. I think you need to install it on the nodes.







[jira] [Commented] (SPARK-8646) PySpark does not run on YARN

2015-06-26 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14603523#comment-14603523
 ] 

Marcelo Vanzin commented on SPARK-8646:
---

bq. Still a module missing error, this time it is pandas.algo.

Seems like you may have pandas installed on your driver node but not on cluster 
nodes. Could you check that? The code that uses pandas (in sql/context.py) 
seems to be careful about only using it when available.
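One way to check is to probe the executors directly from a small job, submitted 
the same way as the failing app (a sketch):

{code}
from pyspark import SparkContext

def probe(_):
    try:
        import pandas
        return pandas.__version__
    except ImportError:
        return "no pandas"

sc = SparkContext()
# One small task per partition; distinct() collapses identical answers.
print(sc.parallelize(range(8), 8).map(probe).distinct().collect())
sc.stop()
{code}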







[jira] [Commented] (SPARK-8646) PySpark does not run on YARN

2015-06-26 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14603228#comment-14603228
 ] 

Marcelo Vanzin commented on SPARK-8646:
---

Hi [~j_houg],

Seems there's something weird going on in your setup. I downloaded the 1.4 
hadoop 2.6 archive you're using, and ran this command line, without setting any 
extra env variables:

{code}
HADOOP_CONF_DIR=/etc/hadoop/conf ./bin/spark-submit --master yarn-client 
examples/src/main/python/pi.py
{code}

And it works. Notably, I see these two lines that seem to be missing from your 
logs:

{noformat}
15/06/26 10:14:28 INFO yarn.Client: Uploading resource 
file:/tmp/spark-1.4.0-bin-hadoop2.6/python/lib/pyspark.zip -> 
hdfs://vanzin-st1-1.vpc.cloudera.com:8020/user/systest/.sparkStaging/application_143540717_0002/pyspark.zip
15/06/26 10:14:28 INFO yarn.Client: Uploading resource 
file:/tmp/spark-1.4.0-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip -> 
hdfs://vanzin-st1-1.vpc.cloudera.com:8020/user/systest/.sparkStaging/application_143540717_0002/py4j-0.8.2.1-src.zip
{noformat}

That's the code added in the change you mention; it's actually what allows 
pyspark to run with that large assembly (which python cannot read).

Can you double check the command line you're running (or try the simple example 
above)? Also, make sure your {{$SPARK_HOME/conf}} directory is not pointing at 
some other Spark configuration, or that you don't have any other env variables 
that may be affecting Spark configuration.







[jira] [Commented] (SPARK-8646) PySpark does not run on YARN

2015-06-26 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14602585#comment-14602585
 ] 

Sean Owen commented on SPARK-8646:
--

You're saying it doesn't work at all on YARN? I'd hope there are some unit 
tests for this, but I am not sure they cover this case. Do we know more about 
the likely issue here -- something isn't packaging pyspark, or not unpacking 
it? CC [~lianhuiwang]



