Hi,
We are evaluating PySpark and have successfully run the PySpark examples on
YARN.
The next step we want to take:
we have a Python project (a bunch of Python scripts using Anaconda
packages).
Question:
what is the right way to run PySpark on YARN when the job consists of many
Python files (~50)?
Hi Oleg,
To simplify packaging and distributing your code, you could set up
shared storage (such as NFS), put your project on it, and mount it on all
the slaves as /projects.
In your Spark job scripts, you can then access the project by adding its
path to sys.path, such as:
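For example, a minimal sketch (the mount point and package name below are illustrative, not part of any standard layout):

```python
import sys

# Assuming the shared project directory is mounted at /projects on every
# node; "myproject" and "mypackage" are placeholder names.
PROJECT_ROOT = "/projects/myproject"
if PROJECT_ROOT not in sys.path:
    sys.path.insert(0, PROJECT_ROOT)

# Modules under /projects/myproject can now be imported as usual, e.g.:
# from mypackage import analytics
```

Because the same path exists on every slave, the driver and the executors resolve the imports identically.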
Hi:
Curious... is there any reason not to use one of the pyspark options below?
Assuming each file is, say, 10 KB in size, are 50 files too many?
Does that touch on some practical limitation?
Usage: ./bin/pyspark [options]
Options:
--master MASTER_URL spark://host:port,
OK, I didn't explain myself correctly:
in the Java case, when there are a lot of classes, a jar should be used.
All the PySpark examples I found are single .py scripts (Pi, wordcount, ...),
but in a real environment the analytics has more than one .py file.
My question is how to use PySpark on YARN analytics in
Hi Oleg,
We do support serving Python files in zips. If you use --py-files, you can
provide a comma-delimited list of zips instead of individual Python files.
This will automatically add these files to the Python path on the executors,
without you having to manually copy them to every single
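A hedged sketch of that workflow, using only the standard library (the directory layout and `spark-submit` invocation in the comment are illustrative):

```python
import os
import zipfile

def zip_project(src_dir, out_zip):
    """Bundle every .py file under src_dir into out_zip, preserving the
    package structure relative to src_dir."""
    with zipfile.ZipFile(out_zip, "w", zipfile.ZIP_DEFLATED) as zf:
        for root, _dirs, files in os.walk(src_dir):
            for name in files:
                if name.endswith(".py"):
                    path = os.path.join(root, name)
                    # Store paths relative to the project root so that
                    # packages import cleanly from the zip.
                    zf.write(path, os.path.relpath(path, src_dir))
    return out_zip

# The resulting archive would then be shipped with something like:
#   spark-submit --master yarn --py-files project.zip main.py
```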
On Fri, Sep 5, 2014 at 10:50 AM, Davies Liu dav...@databricks.com wrote:
In daily development, it's common to modify your project and re-run
the jobs. If you use a zip or egg to package your code, you have to
repackage it after every modification, which I think gets tedious.
That's why shell
Here is a story about how shared storage simplifies all of this:
In Douban, we use Moose FS [1] instead of HDFS as the distributed file system;
it's POSIX compatible and can be mounted just like NFS.
We put all the data, tools, and code in it, so we can access them easily on
all the machines,
Hi Davies,
On Fri, Sep 5, 2014 at 1:04 PM, Davies Liu dav...@databricks.com wrote:
In Douban, we use Moose FS [1] instead of HDFS as the distributed file system;
it's POSIX compatible and can be mounted just like NFS.
Sure, if you already have the infrastructure in place, it might be
worthwhile