OK, I didn't explain myself correctly: in the case of Java having a lot of classes, a jar should be used. All the examples for PySpark I found are single py scripts (Pi, wordcount ...), but in a real environment an analytics application has more than one py file. My question is how to use PySpark on Yarn for analytics in the case of multiple python files.
I am not so sure that using comma-separated python files is a good option in my case (we have quite a lot of files). In the case of using the zip option: is it just a zip of all the python files, like a jar in Java? In Java there is a manifest file which points to the main method. Is the zip option best practice, or are there other techniques?

Thanks,
Oleg.

On Sat, Sep 6, 2014 at 1:01 AM, Dimension Data, LLC. <subscripti...@didata.us> wrote:

> Hi:
>
> Curious... is there any reason not to use one of the below pyspark options
> (in red)? Assuming each file is, say, 10k in size, is 50 files too much?
> Does that touch on some practical limitation?
>
> Usage: ./bin/pyspark [options]
> Options:
>   --master MASTER_URL         spark://host:port, mesos://host:port, yarn,
>                               or local.
>   --deploy-mode DEPLOY_MODE   Where to run the driver program: either
>                               "client" to run on the local machine, or
>                               "cluster" to run inside cluster.
>   --class CLASS_NAME          Your application's main class (for Java /
>                               Scala apps).
>   --name NAME                 A name of your application.
>   --jars JARS                 Comma-separated list of local jars to include
>                               on the driver and executor classpaths.
>
>   --py-files PY_FILES         Comma-separated list of .zip, .egg, or .py
>                               files to place on the PYTHONPATH for Python
>                               apps.
>
>   --files FILES               Comma-separated list of files to be placed
>                               in the working directory of each executor.
>   [ ... snip ... ]
>
> On 09/05/2014 12:00 PM, Davies Liu wrote:
> > Hi Oleg,
> >
> > In order to simplify the process of packaging and distributing your
> > code, you could deploy a shared storage (such as NFS), put your
> > project on it, and mount it on all the slaves as "/projects".
> >
> > In the Spark job scripts, you can access your project by putting the
> > path into sys.path, such as:
> >
> >     import sys
> >     sys.path.append("/projects")
> >     import myproject
> >
> > Davies
> >
> > On Fri, Sep 5, 2014 at 1:28 AM, Oleg Ruchovets <oruchov...@gmail.com> wrote:
> >> Hi,
> >> We are evaluating PySpark and successfully executed examples of
> >> PySpark on Yarn.
> >>
> >> Next step, what we want to do: we have a python project (a bunch of
> >> python scripts using Anaconda packages). Question: what is the way
> >> to execute PySpark on Yarn having a lot of python files (~50)?
> >> Should they be packaged in an archive? What will the command to
> >> execute PySpark on Yarn with a lot of files look like? Currently
> >> the command looks like:
> >>
> >>     ./bin/spark-submit --master yarn --num-executors 3 \
> >>         --driver-memory 4g --executor-memory 2g --executor-cores 1 \
> >>         examples/src/main/python/wordcount.py 1000
> >>
> >> Thanks,
> >> Oleg.
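For illustration, a minimal sketch of the zip option, assuming a hypothetical layout (the names myproject/, analytics/, main.py, and analytics.zip are invented for the example). PySpark needs no manifest: the .py script passed to spark-submit plays the role of the main class, and --py-files ships the zipped package onto the PYTHONPATH of the driver and the executors:

    # Hypothetical project layout:
    #   myproject/
    #     main.py          <- entry point, passed directly to spark-submit
    #     analytics/
    #       __init__.py
    #       jobs.py
    #       utils.py

    # Zip the package directory (a plain zip; no manifest is needed):
    cd myproject
    zip -r analytics.zip analytics/

    # main.py acts as the "main method"; the zip lands on the PYTHONPATH:
    ./bin/spark-submit --master yarn --num-executors 3 \
        --driver-memory 4g --executor-memory 2g --executor-cores 1 \
        --py-files analytics.zip main.py

Inside main.py the zipped package then imports normally (from analytics import jobs), both on the driver and inside executor tasks.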
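Similarly, a minimal sketch of Davies's shared-mount suggestion, assuming /projects is NFS-mounted at the same path on the driver and on every slave (myproject.transform is a hypothetical function; note that each executor's worker process also needs /projects on its own sys.path, hence the append inside the task function):

    import sys
    sys.path.append("/projects")      # driver side: make myproject importable
    import myproject

    from pyspark import SparkContext
    sc = SparkContext(appName="myproject-on-yarn")

    def apply_transform(record):
        # executor side: same NFS path, but the worker process needs it
        # on its own sys.path before the import succeeds there too
        import sys
        if "/projects" not in sys.path:
            sys.path.append("/projects")
        import myproject
        return myproject.transform(record)   # hypothetical project function

    result = sc.parallelize(range(1000)).map(apply_transform).collect()
    sc.stop()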