Hi Vladimir,

Yes, as the error message suggests, PySpark currently only supports local
files. This does not mean it only runs in local mode, however; you can
still run PySpark on any cluster manager (though only in client mode). All
this means is that your python files must be on your local file system.
Until this is supported, the straightforward workaround is to copy the
files from S3 down to your local machine before submitting.
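
For example, a minimal sketch of that workaround (the bucket and paths
below are placeholders taken from your message, so adjust as needed):

# Pull the application and its dependencies down from S3 onto the
# machine that runs spark-submit:
aws s3 cp s3://pathtomybucket/mylibrary.py /home/hadoop/demo/mylibrary.py
aws s3 cp s3://pathtomybucket/tasks/demo/main.py /home/hadoop/demo/main.py

# Then submit in client mode, pointing --py-files at the local copies:
/home/hadoop/spark/bin/spark-submit \
  --master yarn-client \
  --py-files /home/hadoop/demo/mylibrary.py \
  /home/hadoop/demo/main.py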

-Andrew

2015-01-20 7:38 GMT-08:00 Vladimir Grigor <vladi...@kiosked.com>:

> Hi all!
>
> I ran into this problem when I tried running a python application on
> Amazon's EMR YARN cluster.
>
> It is possible to run the bundled example applications on EMR, but I
> cannot figure out how to run a slightly more complex python application
> that depends on other python scripts. I tried adding those files with
> '--py-files': it works fine in local mode, but when run on EMR it fails
> with the following message:
> "Error: Only local python files are supported:
> s3://pathtomybucket/mylibrary.py".
>
> Simplest way to reproduce locally:
> bin/spark-submit --py-files s3://whatever.path.com/library.py main.py
>
> Actual commands to run it on EMR:
> #launch cluster
> aws emr create-cluster --name SparkCluster --ami-version 3.3.1
> --instance-type m1.medium --instance-count 2  --ec2-attributes
> KeyName=key20141114 --log-uri s3://pathtomybucket/cluster_logs
> --enable-debugging --use-default-roles  --bootstrap-action
> Name=Spark,Path=s3://pathtomybucket/bootstrap-actions/spark/install-spark,Args=["-s","
> http://pathtomybucket/bootstrap-actions/spark
> ","-l","WARN","-v","1.2","-b","2014121700","-x"]
> #{
> #   "ClusterId": "j-2Y58DME79MPQJ"
> #}
>
> #run application
> aws emr add-steps --cluster-id "j-2Y58DME79MPQJ" --steps
> ActionOnFailure=CONTINUE,Name=SparkPy,Jar=s3://eu-west-1.elasticmapreduce/libs/script-runner/script-runner.jar,Args=[/home/hadoop/spark/bin/spark-submit,--deploy-mode,cluster,--master,yarn-cluster,--py-files,s3://pathtomybucket/tasks/demo/main.py,main.py]
> #{
> #    "StepIds": [
> #        "s-2UP4PP75YX0KU"
> #    ]
> #}
> In the stderr of that step I get "Error: Only local python files are
> supported: s3://pathtomybucket/tasks/demo/main.py".
>
> What is the workaround or correct way to do this? Using Hadoop's distcp
> to copy the dependency files from S3 to the nodes as another pre-step?
>
> Regards, Vladimir
>
