Re: spark-submit --py-files remote: Only local additional python files are supported
Thank you, Andrew, for your reply! I am very interested in having this feature. It is possible to run PySpark on AWS EMR in client mode (https://aws.amazon.com/articles/4926593393724923), but that kills the whole idea of running batch jobs on EMR with PySpark. Could you please (help to) create a task (with some details of a possible implementation) for this feature? I'd like to implement it, but I'm too new to Spark to know how to do it in a good way...

-Vladimir

On Tue, Jan 20, 2015 at 8:40 PM, Andrew Or <and...@databricks.com> wrote:

> Hi Vladimir,
>
> Yes, as the error message suggests, PySpark currently only supports local
> files. This does not mean it only runs in local mode, however; you can
> still run PySpark on any cluster manager (though only in client mode). All
> this means is that your python files must be on your local file system.
> Until this is supported, the straightforward workaround is to copy the
> files to your local machine.
>
> -Andrew
>
> [...]
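Until remote --py-files are supported, one workaround that stays within the script-runner approach used in this thread is to make the EMR step run a small wrapper script instead of spark-submit itself: the wrapper copies the python dependencies from S3 onto the node, then submits with local paths in client mode (the only mode PySpark supports on YARN at this point, per Andrew's reply). A minimal sketch — the bucket paths, script name, and scratch directory are all placeholders:

  #!/bin/bash
  # run-pyspark-step.sh -- upload to S3 and invoke via script-runner.jar
  # as the EMR step, instead of invoking spark-submit directly.
  set -e

  WORKDIR=/home/hadoop/app   # scratch dir on the node (placeholder)
  mkdir -p "$WORKDIR"

  # Copy the app and its python dependencies to the local filesystem first;
  # --py-files accepts them once they are local. On EMR, 'hadoop fs' can
  # read s3:// paths; 'aws s3 cp' should also work if the CLI is present.
  hadoop fs -copyToLocal s3://pathtomybucket/mylibrary.py "$WORKDIR/"
  hadoop fs -copyToLocal s3://pathtomybucket/tasks/demo/main.py "$WORKDIR/"

  # PySpark on YARN is client-mode only here, so use yarn-client,
  # not yarn-cluster.
  /home/hadoop/spark/bin/spark-submit \
    --master yarn-client \
    --py-files "$WORKDIR/mylibrary.py" \
    "$WORKDIR/main.py"

The step then points script-runner at the wrapper rather than at spark-submit:

  aws emr add-steps --cluster-id j-2Y58DME79MPQJ --steps \
    ActionOnFailure=CONTINUE,Name=SparkPy,Jar=s3://eu-west-1.elasticmapreduce/libs/script-runner/script-runner.jar,Args=[s3://pathtomybucket/scripts/run-pyspark-step.sh]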
Re: spark-submit --py-files remote: Only local additional python files are supported
Hi Vladimir,

Yes, as the error message suggests, PySpark currently only supports local files. This does not mean it only runs in local mode, however; you can still run PySpark on any cluster manager (though only in client mode). All this means is that your python files must be on your local file system. Until this is supported, the straightforward workaround is to copy the files to your local machine.

-Andrew

2015-01-20 7:38 GMT-08:00 Vladimir Grigor <vladi...@kiosked.com>:

> Hi all! I found this problem when I tried running a python application on
> Amazon's EMR YARN cluster. [...]
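In its simplest form, the copy-first workaround is just a download before the submit, run from whichever machine spark-submit is invoked on. A sketch, with illustrative bucket and file names:

  # fetch the dependency from S3 to the local filesystem
  aws s3 cp s3://pathtomybucket/mylibrary.py .

  # submit with a local --py-files path, in client mode
  bin/spark-submit --master yarn-client --py-files mylibrary.py main.py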
spark-submit --py-files remote: Only local additional python files are supported
Hi all!

I found this problem when I tried running a python application on Amazon's EMR YARN cluster. It is possible to run the bundled example applications on EMR, but I cannot figure out how to run a slightly more complex python application that depends on other python scripts. I tried adding those files with '--py-files': it works fine in local mode, but when run in EMR it fails with the following message:

  Error: Only local python files are supported: s3://pathtomybucket/mylibrary.py

Simplest way to reproduce locally:

  bin/spark-submit --py-files s3://whatever.path.com/library.py main.py

Actual commands to run it in EMR:

  # launch cluster
  aws emr create-cluster --name SparkCluster --ami-version 3.3.1 \
    --instance-type m1.medium --instance-count 2 \
    --ec2-attributes KeyName=key20141114 \
    --log-uri s3://pathtomybucket/cluster_logs --enable-debugging \
    --use-default-roles \
    --bootstrap-action Name=Spark,Path=s3://pathtomybucket/bootstrap-actions/spark/install-spark,Args=[-s,http://pathtomybucket/bootstrap-actions/spark,-l,WARN,-v,1.2,-b,2014121700,-x]
  # {
  #     "ClusterId": "j-2Y58DME79MPQJ"
  # }

  # run application
  aws emr add-steps --cluster-id j-2Y58DME79MPQJ --steps \
    ActionOnFailure=CONTINUE,Name=SparkPy,Jar=s3://eu-west-1.elasticmapreduce/libs/script-runner/script-runner.jar,Args=[/home/hadoop/spark/bin/spark-submit,--deploy-mode,cluster,--master,yarn-cluster,--py-files,s3://pathtomybucket/tasks/demo/main.py,main.py]
  # {
  #     "StepIds": [
  #         "s-2UP4PP75YX0KU"
  #     ]
  # }

And in the stderr of that step I get:

  Error: Only local python files are supported: s3://pathtomybucket/tasks/demo/main.py

What is the workaround or correct way to do it? Using hadoop's distcp to copy dependency files from S3 to the nodes as another pre-step?

Regards, Vladimir
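On the closing distcp question: hadoop distcp copies between distributed filesystems (e.g. S3 to HDFS), which still would not satisfy --py-files, since that flag wants paths on the local filesystem of the submitting machine. A plain copy-to-local as the pre-step is closer to what is needed; a sketch with illustrative paths:

  # pre-step on the node that will run spark-submit:
  hadoop fs -copyToLocal s3://pathtomybucket/mylibrary.py /home/hadoop/
  hadoop fs -copyToLocal s3://pathtomybucket/tasks/demo/main.py /home/hadoop/

  # then submit with local paths (client mode, per the reply above):
  /home/hadoop/spark/bin/spark-submit --master yarn-client \
    --py-files /home/hadoop/mylibrary.py /home/hadoop/main.py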