Re: spark-submit --py-files remote: Only local additional python files are supported

2015-01-21 Thread Vladimir Grigor
Thank you, Andrew, for your reply!

I am very interested in having this feature. It is possible to run PySpark on
AWS EMR in client mode (https://aws.amazon.com/articles/4926593393724923),
but that defeats the whole point of running batch jobs on EMR with PySpark.

Could you please (help to) create a task (with some details of a possible
implementation) for this feature? I'd like to implement it, but I'm too
new to Spark to know how to do it well...

-Vladimir

On Tue, Jan 20, 2015 at 8:40 PM, Andrew Or and...@databricks.com wrote:

 Hi Vladimir,

 Yes, as the error message suggests, PySpark currently only supports local
 files. This does not mean it only runs in local mode, however; you can
 still run PySpark on any cluster manager (though only in client mode). All
 this means is that your Python files must be on your local file system.
 Until this is supported, the straightforward workaround is to just
 copy the files to your local machine.
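
 A minimal sketch of that workaround (untested; it assumes the aws CLI is
 configured on the machine running spark-submit, and the S3 path is only
 illustrative):

 # pull the dependency out of S3 onto the submitting machine first
 aws s3 cp s3://pathtomybucket/mylibrary.py ./mylibrary.py
 # then pass the now-local copy to --py-files
 bin/spark-submit --py-files ./mylibrary.py main.py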

 -Andrew

 2015-01-20 7:38 GMT-08:00 Vladimir Grigor vladi...@kiosked.com:

 Hi all!

 I found this problem when I tried running a Python application on Amazon's
 EMR YARN cluster.

 It is possible to run the bundled example applications on EMR, but I
 cannot figure out how to run a slightly more complex Python application
 that depends on some other Python scripts. I tried adding those files with
 '--py-files': it works fine in local mode, but when run on EMR it fails
 with the following message:
 Error: Only local python files are supported:
 s3://pathtomybucket/mylibrary.py.

 The simplest way to reproduce locally:
 bin/spark-submit --py-files s3://whatever.path.com/library.py main.py

 Actual commands to run it on EMR:
 #launch cluster
 aws emr create-cluster --name SparkCluster --ami-version 3.3.1
 --instance-type m1.medium --instance-count 2  --ec2-attributes
 KeyName=key20141114 --log-uri s3://pathtomybucket/cluster_logs
 --enable-debugging --use-default-roles  --bootstrap-action
 Name=Spark,Path=s3://pathtomybucket/bootstrap-actions/spark/install-spark,Args=[-s,http://pathtomybucket/bootstrap-actions/spark,-l,WARN,-v,1.2,-b,2014121700,-x]
 #{
 #   ClusterId: j-2Y58DME79MPQJ
 #}

 #run application
 aws emr add-steps --cluster-id j-2Y58DME79MPQJ --steps
 ActionOnFailure=CONTINUE,Name=SparkPy,Jar=s3://eu-west-1.elasticmapreduce/libs/script-runner/script-runner.jar,Args=[/home/hadoop/spark/bin/spark-submit,--deploy-mode,cluster,--master,yarn-cluster,--py-files,s3://pathtomybucket/tasks/demo/main.py,main.py]
 #{
 #StepIds: [
 #s-2UP4PP75YX0KU
 #]
 #}
 And in the stderr of that step I get: Error: Only local python files are
 supported: s3://pathtomybucket/tasks/demo/main.py.

 What is the workaround or correct way to do this? Using Hadoop's distcp to
 copy the dependency files from S3 to the nodes as another pre-step?
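
 One pre-step sketch along those lines (untested; the wrapper script name
 and all paths are hypothetical) would be to keep a small shell script in
 S3, run it through script-runner.jar, and let it copy the dependencies to
 the master node before calling spark-submit in client mode:

 #!/bin/bash
 # run_job.sh -- hypothetical wrapper run by script-runner.jar on the master
 set -e
 # copy the application and its python dependency from S3 to local disk
 aws s3 cp s3://pathtomybucket/tasks/demo/main.py /home/hadoop/main.py
 aws s3 cp s3://pathtomybucket/mylibrary.py /home/hadoop/mylibrary.py
 # submit with local paths, which --py-files accepts; client mode only,
 # since cluster deploy mode is not yet supported for Python applications
 /home/hadoop/spark/bin/spark-submit --master yarn-client \
   --py-files /home/hadoop/mylibrary.py /home/hadoop/main.py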

 Regards, Vladimir
