On Fri, Sep 5, 2014 at 10:21 AM, Oleg Ruchovets <oruchov...@gmail.com> wrote:
> OK, I didn't explain myself correctly:
> In the case of Java with a lot of classes, a jar should be used.
> All the PySpark examples I found are single .py scripts (Pi, wordcount, ...),
> but in a real environment the analytics code has more than one .py file.
> My question is how to run PySpark analytics on YARN with multiple
> Python files.
>
> I am not so sure that using comma-separated Python files is a good option in
> my case (we have quite a lot of files).
> In case of using the zip option:
> Is it just a zip of all the Python files, like a jar in Java?
> In Java there is a manifest file which points to the main method.
> Is the zip option best practice, or are there other techniques?
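To answer the zip question concretely: there is no manifest as in a jar. The main .py script is passed directly to spark-submit (it does not go inside the zip), and the zip only needs to contain the modules that the main script imports; --py-files places it on the PYTHONPATH of the driver and executors. A rough sketch, with a made-up project layout and module names:

    # Hypothetical layout:
    #   main.py             <- driver script, passed directly to spark-submit
    #   mylib/__init__.py   <- the rest of the project, zipped into mylib.zip
    #   mylib/analytics.py
    #
    # Packaging and submission (adjust paths to your setup):
    #   zip -r mylib.zip mylib/
    #   ./bin/spark-submit --master yarn --py-files mylib.zip main.py

    # main.py
    from pyspark import SparkContext
    from mylib import analytics          # resolved from mylib.zip via --py-files

    if __name__ == "__main__":
        sc = SparkContext(appName="multi-file-example")
        rdd = sc.parallelize(range(100))
        print(analytics.summarize(rdd))   # summarize() is a hypothetical helper
        sc.stop()
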
In daily development, it's common to modify your project and re-run the jobs. If you use a zip or egg to package your code, you need to repackage it after every modification, which I think will get tedious. If the code is stored on a shared file system that all the slaves can see, then it's easy to modify and re-run your job, just like on a local machine (see the sketch at the end of this message).

> Thanks
> Oleg.
>
> On Sat, Sep 6, 2014 at 1:01 AM, Dimension Data, LLC.
> <subscripti...@didata.us> wrote:
>>
>> Hi:
>>
>> Curious... is there any reason not to use one of the below pyspark options
>> (in red)? Assuming each file is, say, 10k in size, is 50 files too much?
>> Does that touch on some practical limitation?
>>
>> Usage: ./bin/pyspark [options]
>> Options:
>>   --master MASTER_URL         spark://host:port, mesos://host:port, yarn,
>>                               or local.
>>   --deploy-mode DEPLOY_MODE   Where to run the driver program: either
>>                               "client" to run on the local machine, or
>>                               "cluster" to run inside cluster.
>>   --class CLASS_NAME          Your application's main class (for Java /
>>                               Scala apps).
>>   --name NAME                 A name of your application.
>>   --jars JARS                 Comma-separated list of local jars to
>>                               include on the driver and executor classpaths.
>>
>>   --py-files PY_FILES         Comma-separated list of .zip, .egg, or .py
>>                               files to place on the PYTHONPATH for Python apps.
>>
>>   --files FILES               Comma-separated list of files to be placed
>>                               in the working directory of each executor.
>>   [ ... snip ... ]
>>
>> On 09/05/2014 12:00 PM, Davies Liu wrote:
>> > Hi Oleg,
>> >
>> > In order to simplify the process of packaging and distributing your
>> > code, you could deploy shared storage (such as NFS), put your
>> > project on it, and mount it on all the slaves as "/projects".
>> >
>> > In the Spark job scripts, you can access your project by putting the
>> > path into sys.path, such as:
>> >
>> >     import sys
>> >     sys.path.append("/projects")
>> >     import myproject
>> >
>> > Davies
>> >
>> > On Fri, Sep 5, 2014 at 1:28 AM, Oleg Ruchovets <oruchov...@gmail.com>
>> > wrote:
>> >> Hi, we are evaluating PySpark and have successfully executed the
>> >> PySpark examples on YARN.
>> >>
>> >> Next step, what we want to do: we have a Python project (a bunch of
>> >> Python scripts using Anaconda packages). Question: what is the way
>> >> to execute PySpark on YARN with a lot of Python files (~50)?
>> >> Should they be packaged into an archive? What will the command to
>> >> execute PySpark on YARN with a lot of files look like? Currently the
>> >> command looks like:
>> >>
>> >>   ./bin/spark-submit --master yarn --num-executors 3
>> >>     --driver-memory 4g --executor-memory 2g --executor-cores 1
>> >>     examples/src/main/python/wordcount.py 1000
>> >>
>> >> Thanks, Oleg.
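As a rough sketch of the shared-file-system approach Davies describes above (the "/projects" mount point comes from his suggestion; the package and function names here are made up), the job script might look like this:

    # job.py -- assumes /projects is a shared (e.g. NFS) mount visible at the
    # same path on the driver and on every slave
    import sys
    sys.path.append("/projects")          # as suggested above
    import myproject                      # hypothetical package under /projects/myproject/

    from pyspark import SparkContext

    def apply_transform(x):
        # Runs on the executors. The driver's sys.path change is not
        # guaranteed to carry over to the worker processes, so append the
        # shared path there as well before importing.
        import sys
        if "/projects" not in sys.path:
            sys.path.append("/projects")
        import myproject
        return myproject.transform(x)     # hypothetical function

    if __name__ == "__main__":
        sc = SparkContext(appName="shared-fs-example")
        print(sc.parallelize(range(100)).map(apply_transform).count())
        sc.stop()

The job is still submitted the same way (./bin/spark-submit --master yarn job.py); after that, editing the files under /projects and re-submitting is enough, with no repackaging step.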