Ok, I didn't explain myself correctly:
   In the case of Java with a lot of classes, a jar would be used.
   All the PySpark examples I found are single .py scripts (Pi, wordcount, ...),
but in a real environment an analytics project has more than one .py file.
   My question is: how do we run PySpark analytics on Yarn when there are
multiple python files?
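
   From the --py-files description in the usage text quoted below, my
understanding is that the plain comma-separated form would look roughly
like this (the script names here are just placeholders):

      ./bin/spark-submit --master yarn \
          --py-files utils.py,etl.py,model.py \
          main.py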

   I am not so sure that using comma-separated python files is a good option
in my case (we have quite a lot of files).
   In case of using the zip option:
      Is it just zipping all the python files, like building a jar in Java?
      In Java there is a Manifest file which points to the main method; what
plays that role here?
      Is the zip option best practice, or are there other techniques?
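
   To make the question concrete, here is my current guess at the zip
approach (the project layout and names are only placeholders):

      # zip the package directory so it can be shipped to the cluster
      zip -r myproject.zip myproject/

      # main.py stays outside the zip and acts as the entry point
      # (the analogue of a manifest's main class), importing the rest
      # of the code from the zipped package
      ./bin/spark-submit --master yarn \
          --py-files myproject.zip \
          main.py

   Is that roughly right, or is there a better way?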

Thanks
Oleg.


On Sat, Sep 6, 2014 at 1:01 AM, Dimension Data, LLC. <subscripti...@didata.us>
wrote:

>  Hi:
>
> Curious... is there any reason not to use one of the pyspark options below
> (in red)? Assuming each file is, say, 10k in size, are 50 files too many?
> Does that touch on some practical limitation?
>
>
> Usage: ./bin/pyspark [options]
> Options:
>   --master MASTER_URL         spark://host:port, mesos://host:port, yarn, or local.
>   --deploy-mode DEPLOY_MODE   Where to run the driver program: either "client" to run
>                               on the local machine, or "cluster" to run inside cluster.
>   --class CLASS_NAME          Your application's main class (for Java / Scala apps).
>   --name NAME                 A name of your application.
>   --jars JARS                 Comma-separated list of local jars to include on the driver
>                               and executor classpaths.
>
>   --py-files PY_FILES         Comma-separated list of .zip, .egg, or .py files to place
>                               on the PYTHONPATH for Python apps.
>
>   --files FILES               Comma-separated list of files to be placed in the working
>                               directory of each executor.
> [ ... snip ... ]
>
>
>
>
>
> On 09/05/2014 12:00 PM, Davies Liu wrote:
> > Hi Oleg,
> >
> > In order to simplify packaging and distributing your code, you could
> > deploy shared storage (such as NFS), put your project on it, and
> > mount it on all the slaves as "/projects".
> >
> > In the spark job scripts, you can then access your project by putting
> > that path into sys.path, such as:
> >
> > import sys
> > sys.path.append("/projects")
> > import myproject
> >
> > Davies
> >
> > On Fri, Sep 5, 2014 at 1:28 AM, Oleg Ruchovets <oruchov...@gmail.com>
> > wrote:
> >> Hi, we are evaluating PySpark and have successfully executed the
> >> PySpark examples on Yarn.
> >>
> >> The next step we want to take: we have a python project (a bunch of
> >> python scripts using Anaconda packages). Question: what is the way to
> >> execute PySpark on Yarn when there are a lot of python files (~50)?
> >> Should they be packaged in an archive? What would the command to
> >> execute PySpark on Yarn with a lot of files look like? Currently the
> >> command looks like:
> >>
> >> ./bin/spark-submit --master yarn --num-executors 3 \
> >>     --driver-memory 4g --executor-memory 2g --executor-cores 1 \
> >>     examples/src/main/python/wordcount.py 1000
> >>
> >> Thanks, Oleg.
> >
