Thank you, Andrew, for your reply!
I am very interested in having this feature. It is possible to run PySpark on
AWS EMR in client mode (https://aws.amazon.com/articles/4926593393724923),
but that defeats the whole point of running PySpark batch jobs on EMR.
Could you please (help to) create a task (with some details on a possible
implementation) for this feature? I'd like to implement it, but I'm too
new to Spark to know how to do it in a good way...
-Vladimir
On Tue, Jan 20, 2015 at 8:40 PM, Andrew Or and...@databricks.com wrote:
Hi Vladimir,
Yes, as the error message suggests, PySpark currently only supports local
files. This does not mean it only runs in local mode, however; you can
still run PySpark on any cluster manager (though only in client mode). All
this means is that your Python files must be on your local file system.
Until this is supported, the straightforward workaround then is to just
copy the files to your local machine.
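For instance, a rough sketch of that workaround (the bucket path is the one from your error message; the local paths here are just placeholders):

```shell
# Pull the dependency down from S3 to the machine that runs spark-submit,
# then pass the local copy to --py-files instead of the s3:// URL.
aws s3 cp s3://pathtomybucket/mylibrary.py /tmp/mylibrary.py

# Submit in client mode with the now-local dependency.
bin/spark-submit --master yarn --deploy-mode client \
  --py-files /tmp/mylibrary.py main.py
```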
-Andrew
2015-01-20 7:38 GMT-08:00 Vladimir Grigor vladi...@kiosked.com:
Hi all!
I found this problem when I tried running python application on Amazon's
EMR yarn cluster.
It is possible to run the bundled example applications on EMR, but I cannot
figure out how to run a slightly more complex Python application which
depends on some other Python scripts. I tried adding those files with
'--py-files', and it works fine in local mode, but it fails with the
following message when run on EMR:
Error: Only local python files are supported:
s3://pathtomybucket/mylibrary.py.
Simplest way to reproduce locally:
bin/spark-submit --py-files s3://whatever.path.com/library.py main.py
Actual commands to run it on EMR:
#launch cluster
aws emr create-cluster --name SparkCluster --ami-version 3.3.1 \
  --instance-type m1.medium --instance-count 2 \
  --ec2-attributes KeyName=key20141114 \
  --log-uri s3://pathtomybucket/cluster_logs \
  --enable-debugging --use-default-roles \
  --bootstrap-action Name=Spark,Path=s3://pathtomybucket/bootstrap-actions/spark/install-spark,Args=[-s,http://pathtomybucket/bootstrap-actions/spark,-l,WARN,-v,1.2,-b,2014121700,-x]
#{
# ClusterId: j-2Y58DME79MPQJ
#}
#run application
aws emr add-steps --cluster-id j-2Y58DME79MPQJ --steps \
  ActionOnFailure=CONTINUE,Name=SparkPy,Jar=s3://eu-west-1.elasticmapreduce/libs/script-runner/script-runner.jar,Args=[/home/hadoop/spark/bin/spark-submit,--deploy-mode,cluster,--master,yarn-cluster,--py-files,s3://pathtomybucket/tasks/demo/main.py,main.py]
#{
#StepIds: [
#s-2UP4PP75YX0KU
#]
#}
And in the stderr of that step I get: Error: Only local python files are
supported: s3://pathtomybucket/tasks/demo/main.py.
What is the workaround or correct way to do this? Using Hadoop's distcp to
copy the dependency files from S3 to the nodes as another pre-step?
Regards, Vladimir