PySpark on YARN: a project with a lot of Python scripts

2014-09-05 Thread Oleg Ruchovets
Hi, We are evaluating PySpark and have successfully executed the PySpark examples on YARN. The next step is what we actually want to do: we have a Python project (a bunch of Python scripts using Anaconda packages). Question: what is the way to execute PySpark on YARN when the project has a lot of Python files (~50)?

Re: PySpark on YARN: a project with a lot of Python scripts

2014-09-05 Thread Davies Liu
Hi Oleg, In order to simplify packaging and distributing your code, you could deploy shared storage (such as NFS), put your project on it, and mount it on all the slaves as /projects. In the Spark job scripts, you can then access your project by putting its path onto sys.path, such as:
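
A minimal sketch of that approach (the mount point /projects/myproject and the module analytics are hypothetical, assuming the same share is mounted on every node). The path insert has to happen on the executors as well, since the functions shipped to them do their imports there:

    import sys
    sys.path.insert(0, "/projects/myproject")  # shared mount, visible on the driver

    from pyspark import SparkContext
    sc = SparkContext(appName="shared-storage-demo")

    def process(record):
        # Executors mount the same share, so make the project importable there too.
        import sys
        if "/projects/myproject" not in sys.path:
            sys.path.insert(0, "/projects/myproject")
        from analytics import transform  # hypothetical module inside the project
        return transform(record)

    print(sc.parallelize(range(100)).map(process).collect())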

Re: PySpark on YARN: a project with a lot of Python scripts

2014-09-05 Thread Dimension Data, LLC.
Hi: Curious... is there any reason not to use one of the pyspark options below? Assuming each file is, say, 10k in size, is 50 files too much? Does that touch some practical limitation?

Usage: ./bin/pyspark [options]
Options:
  --master MASTER_URL   spark://host:port,

Re: PySpark on YARN: a project with a lot of Python scripts

2014-09-05 Thread Oleg Ruchovets
Ok, I didn't explain myself correctly: in the Java case, when you have a lot of classes, a jar should be used. All the PySpark examples I found are a single .py script (Pi, wordcount ...), but in a real environment analytics has more than one .py file. My question is how to use PySpark on YARN analytics in

Re: PySpark on YARN: a project with a lot of Python scripts

2014-09-05 Thread Andrew Or
Hi Oleg, We do support serving Python files in zips. If you use --py-files, you can provide a comma-delimited list of zips instead of Python files. This will allow you to automatically add these files to the Python path on the executors without having to manually copy them to every single
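
A hedged sketch of that workflow (analytics.zip and the transform function are made-up names): zip the project and pass it either via --py-files on the command line or via the pyFiles argument of SparkContext, which ships the archive to every executor and puts it on the Python path:

    # Equivalent CLI: spark-submit --master yarn --py-files analytics.zip main.py
    from pyspark import SparkContext

    sc = SparkContext(appName="zip-deps-demo",
                      pyFiles=["analytics.zip"])  # distributed to all executors

    from analytics import transform  # importable from the zip on driver and executors
    print(sc.parallelize(range(10)).map(transform).collect())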

Re: PySpark on YARN: a project with a lot of Python scripts

2014-09-05 Thread Marcelo Vanzin
On Fri, Sep 5, 2014 at 10:50 AM, Davies Liu dav...@databricks.com wrote: In daily development, it's common to modify your projects and re-run the jobs. If you use a zip or egg to package your code, you need to do this every time after a modification; I think that gets tedious. That's why shell
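
That repackaging step is easy to script away, though. A minimal sketch (the analytics package directory and analytics.zip are hypothetical names) that rebuilds the archive before each submit:

    import os
    import zipfile

    def build_zip(pkg_dir, zip_path):
        """Pack every .py file under pkg_dir into zip_path, keeping the package prefix."""
        base = os.path.dirname(os.path.abspath(pkg_dir))
        with zipfile.ZipFile(zip_path, "w") as zf:
            for root, _, files in os.walk(pkg_dir):
                for name in files:
                    if name.endswith(".py"):
                        full = os.path.join(root, name)
                        zf.write(full, os.path.relpath(full, base))  # e.g. analytics/job.py

    build_zip("analytics", "analytics.zip")
    # then: spark-submit --master yarn --py-files analytics.zip main.py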

Re: PySpark on YARN: a project with a lot of Python scripts

2014-09-05 Thread Davies Liu
Here is a story about how shared storage simplifies all of this: at Douban, we use MooseFS [1] instead of HDFS as the distributed file system; it's POSIX compatible and can be mounted just like NFS. We put all the data, tools, and code in it, so we can access them easily on all the machines,

Re: PySpark on YARN: a project with a lot of Python scripts

2014-09-05 Thread Marcelo Vanzin
Hi Davies, On Fri, Sep 5, 2014 at 1:04 PM, Davies Liu dav...@databricks.com wrote: In Douban, we use Moose FS[1] instead of HDFS as the distributed file system, it's POSIX compatible and can be mounted just as NFS. Sure, if you already have the infrastructure in place, it might be worthwhile