Hi Oleg,

We do support shipping Python files in zips. If you use --py-files, you can
provide a comma-delimited list of zips instead of individual Python files.
This will automatically add those files to the Python path on the executors,
without you having to copy them to every single slave node by hand.

Andrew


2014-09-05 10:50 GMT-07:00 Davies Liu <dav...@databricks.com>:

> On Fri, Sep 5, 2014 at 10:21 AM, Oleg Ruchovets <oruchov...@gmail.com>
> wrote:
> > Ok, I didn't explain myself correctly:
> >    In Java, when there are a lot of classes, a jar is used.
> >    All the PySpark examples I found are single py scripts (Pi, wordcount,
> > ...), but in a real environment an analytics job has more than one py
> > file.
> >    My question is how to run PySpark analytics on Yarn with multiple
> > python files.
> >
> > I am not so sure that a comma-separated list of python files is a good
> > option in my case (we have quite a lot of files).
> >   In case of using the zip option:
> >      Do I just zip all the python files, like a jar in Java?
> >      In Java there is a Manifest file which points to the main method;
> > is there an equivalent?
> >      Is the zip option best practice, or are there other techniques?
>
> In daily development, it's common to modify your project and re-run the
> jobs. If you use a zip or egg to package your code, you have to repackage
> it after every modification, which gets tedious.
>
> If the code is stored on a shared file system and mounted on the slaves,
> then it's easy to modify and re-run your job, just like on a local
> machine.
>
> > Thanks
> > Oleg.
> >
> >
> > On Sat, Sep 6, 2014 at 1:01 AM, Dimension Data, LLC.
> > <subscripti...@didata.us> wrote:
> >>
> >> Hi:
> >>
> >> Curious... is there any reason not to use one of the pyspark options
> >> below (in red)? Assuming each file is, say, 10k in size, are 50 files
> >> too many? Does that touch on some practical limitation?
> >>
> >>
> >> Usage: ./bin/pyspark [options]
> >> Options:
> >>   --master MASTER_URL         spark://host:port, mesos://host:port,
> >>                               yarn, or local.
> >>   --deploy-mode DEPLOY_MODE   Where to run the driver program: either
> >>                               "client" to run on the local machine, or
> >>                               "cluster" to run inside cluster.
> >>   --class CLASS_NAME          Your application's main class (for Java /
> >>                               Scala apps).
> >>   --name NAME                 A name of your application.
> >>   --jars JARS                 Comma-separated list of local jars to
> >>                               include on the driver and executor
> >>                               classpaths.
> >>
> >>   --py-files PY_FILES         Comma-separated list of .zip, .egg, or
> >>                               .py files to place on the PYTHONPATH for
> >>                               Python apps.
> >>
> >>   --files FILES               Comma-separated list of files to be
> >>                               placed in the working directory of each
> >>                               executor.
> >> [ ... snip ... ]
> >>
> >>
> >>
> >>
> >>
> >> On 09/05/2014 12:00 PM, Davies Liu wrote:
> >> > Hi Oleg,
> >> >
> >> > To simplify packaging and distributing your code, you could deploy
> >> > shared storage (such as NFS), put your project on it, and mount it
> >> > on all the slaves as "/projects".
> >> >
> >> > In the spark job scripts, you can access your project by adding that
> >> > path to sys.path, like this:
> >> >
> >> > import sys
> >> > sys.path.append("/projects")
> >> > import myproject
> >> >
> >> > Davies
> >> >
> >> > On Fri, Sep 5, 2014 at 1:28 AM, Oleg Ruchovets <oruchov...@gmail.com>
> >> > wrote:
> >> >> Hi, we are evaluating PySpark and have successfully executed the
> >> >> PySpark examples on Yarn.
> >> >>
> >> >> The next step we want to take: we have a python project (a bunch of
> >> >> python scripts using Anaconda packages). Question: what is the way
> >> >> to execute PySpark on Yarn with a lot of python files (~50)? Should
> >> >> they be packaged in an archive? What would the command to execute
> >> >> PySpark on Yarn with a lot of files look like? Currently the
> >> >> command looks like:
> >> >>
> >> >> ./bin/spark-submit --master yarn  --num-executors 3
> >> >> --driver-memory 4g --executor-memory 2g --executor-cores 1
> >> >> examples/src/main/python/wordcount.py   1000
> >> >>
> >> >> Thanks Oleg.
> >> >
> >> > ---------------------------------------------------------------------
> >> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> >> > For additional commands, e-mail: user-h...@spark.apache.org
> >> >
> >>
> >>
> >
>
>
>
