Hi Gourav,

If your question is how to distribute Python package dependencies across the Spark cluster programmatically, here is an example:
$ export PYTHONPATH='path/to/thrift.zip:path/to/happybase.zip:path/to/your/py/application'

And in code:

sc.addPyFile('/path/to/thrift.zip')
sc.addPyFile('/path/to/happybase.zip')

Regards,
Ram

On 15 February 2016 at 15:16, Gourav Sengupta <gourav.sengu...@gmail.com> wrote:
> Hi,
>
> So far no one has got my question at all. I know what it takes to
> load packages via the SPARK shell or SPARK submit.
>
> How do I load packages when starting a SPARK cluster, as mentioned here:
> http://spark.apache.org/docs/latest/spark-standalone.html ?
>
> Regards,
> Gourav Sengupta
>
> On Mon, Feb 15, 2016 at 3:25 AM, Divya Gehlot <divya.htco...@gmail.com> wrote:
>
>> With the conf option:
>>
>> spark-submit --conf 'key=value'
>>
>> Hope that helps you.
>>
>> On 15 February 2016 at 11:21, Divya Gehlot <divya.htco...@gmail.com> wrote:
>>
>>> Hi Gourav,
>>> You can use the following to load packages at the start of the Spark shell:
>>>
>>> spark-shell --packages com.databricks:spark-csv_2.10:1.1.0
>>>
>>> On 14 February 2016 at 03:34, Gourav Sengupta <gourav.sengu...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> I was interested in knowing how to load packages into a SPARK cluster
>>>> started locally. Can someone pass me the links for setting the conf file
>>>> so that the packages can be loaded?
>>>>
>>>> Regards,
>>>> Gourav
>>>>
>>>> On Fri, Feb 12, 2016 at 6:52 PM, Burak Yavuz <brk...@gmail.com> wrote:
>>>>
>>>>> Hello Gourav,
>>>>>
>>>>> The packages need to be loaded BEFORE you start the JVM, therefore you
>>>>> won't be able to add packages dynamically in code. You should use
>>>>> --packages with pyspark before you start your application.
>>>>> One option is to add a `conf` that will load some packages if you are
>>>>> constantly going to use them.
>>>>>
>>>>> Best,
>>>>> Burak
>>>>>
>>>>> On Fri, Feb 12, 2016 at 4:22 AM, Gourav Sengupta <gourav.sengu...@gmail.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I am creating a SparkContext in a SPARK standalone cluster as mentioned here:
>>>>>> http://spark.apache.org/docs/latest/spark-standalone.html, using the
>>>>>> following code:
>>>>>>
>>>>>> --------------------------------------------------------------------------------------------------------------------------
>>>>>> sc.stop()
>>>>>> conf = SparkConf().set('spark.driver.allowMultipleContexts', False) \
>>>>>>     .setMaster("spark://hostname:7077") \
>>>>>>     .set('spark.shuffle.service.enabled', True) \
>>>>>>     .set('spark.dynamicAllocation.enabled', 'true') \
>>>>>>     .set('spark.executor.memory', '20g') \
>>>>>>     .set('spark.driver.memory', '4g') \
>>>>>>     .set('spark.default.parallelism', (multiprocessing.cpu_count() - 1))
>>>>>> conf.getAll()
>>>>>> sc = SparkContext(conf=conf)
>>>>>> --------------------------------------------------------------------------------------------------------------------------
>>>>>> (we should definitely be able to optimise the configuration, but that is not the point here)
>>>>>>
>>>>>> I am not able to use packages (a list of which is mentioned here:
>>>>>> http://spark-packages.org) using this method.
>>>>>>
>>>>>> Whereas if I use the standard "pyspark --packages" option, the
>>>>>> packages load just fine.
>>>>>>
>>>>>> I will be grateful if someone could kindly let me know how to load
>>>>>> packages when starting a cluster as mentioned above.
>>>>>>
>>>>>> Regards,
>>>>>> Gourav Sengupta
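For Gourav's underlying question (loading packages from http://spark-packages.org when the SparkContext is created from plain Python code rather than via "pyspark --packages"), one option that may be worth trying is to set the PYSPARK_SUBMIT_ARGS environment variable before the context is created, since that is what pyspark hands to spark-submit when it launches the JVM gateway, and (as Burak notes above) the packages have to be resolved before that JVM starts. The snippet below is only an untested sketch along those lines, assuming a Spark 1.x standalone setup; the master URL and the spark-csv coordinates are simply taken from the thread above.

import os
import multiprocessing
from pyspark import SparkConf, SparkContext

# Sketch only: PYSPARK_SUBMIT_ARGS must be set before the first SparkContext
# is created, because pyspark reads it when it launches the JVM gateway.
# The trailing 'pyspark-shell' token is required for recent 1.x releases.
os.environ['PYSPARK_SUBMIT_ARGS'] = (
    '--packages com.databricks:spark-csv_2.10:1.1.0 pyspark-shell'
)

conf = (SparkConf()
        .setMaster("spark://hostname:7077")  # master URL from the thread above
        .set('spark.executor.memory', '20g')
        .set('spark.default.parallelism', multiprocessing.cpu_count() - 1))

sc = SparkContext(conf=conf)

Another option, if the packages are always needed, is to add a line such as "spark.jars.packages  com.databricks:spark-csv_2.10:1.1.0" to conf/spark-defaults.conf, which should make the same Maven coordinates available to every application without any code changes.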