So I'm a little biased - I think the best bridge between the two is using DataFrames. I've got some examples in my talk and on the high performance spark GitHub (https://github.com/high-performance-spark/high-performance-spark-examples/blob/master/high_performance_pyspark/simple_perf_test.py), which calls some custom Scala code.
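The general shape of that DataFrame bridge looks something like the following - just a sketch, where com.example.MyTransforms.doStuff stands in for whatever Scala entry point you expose, not the API used in the linked example:

from pyspark.sql import DataFrame

def call_scala_transform(df):
    # The Python DataFrame already carries the SQLContext and the py4j gateway.
    sql_ctx = df.sql_ctx
    jvm = sql_ctx._sc._jvm
    # Hand the underlying Java DataFrame to the Scala side...
    jdf = jvm.com.example.MyTransforms.doStuff(df._jdf)
    # ...and wrap the Java DataFrame that comes back so Python can keep using it.
    return DataFrame(jdf, sql_ctx)

Because the data stays in the JVM as a DataFrame the whole way through, you avoid the Python serialization round trip, which is where most of the performance win comes from.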
Using a custom context is a bit tricky though because of how the launching is done. As Jeff Zhang points out below, you would need to wrap it in a JavaSparkContext, and then you could override the _initialize_context function in context.py. (A rough sketch of the simpler route - calling your class through sc._jvm and wrapping the result back into a Python RDD - is at the bottom of this mail.)

On Thu, Jun 30, 2016 at 11:06 AM, Jeff Zhang <zjf...@gmail.com> wrote:
> Hi Pedro,
>
> Your use case is interesting. I think launching the java gateway is the same
> as for the native SparkContext, the only difference is in creating your custom
> SparkContext instead of the native SparkContext. You might also need to wrap it
> using java.
>
> https://github.com/apache/spark/blob/v1.6.2/python/pyspark/context.py#L172
>
> On Thu, Jun 30, 2016 at 9:53 AM, Pedro Rodriguez <ski.rodrig...@gmail.com>
> wrote:
>
>> Hi All,
>>
>> I have written a Scala package which essentially wraps the SparkContext
>> in a custom class that adds some functionality specific to our internal
>> use case. I am trying to figure out the best way to call this from PySpark.
>>
>> I would like to do this similarly to how Spark itself calls the JVM
>> SparkContext, as in:
>> https://github.com/apache/spark/blob/v1.6.2/python/pyspark/context.py
>>
>> My goal would be something like this:
>>
>> Scala code (this is done):
>> >>> import com.company.mylibrary.CustomContext
>> >>> val myContext = CustomContext(sc)
>> >>> val rdd: RDD[String] = myContext.customTextFile("path")
>>
>> Python code (I want to be able to do this):
>> >>> from company.mylibrary import CustomContext
>> >>> myContext = CustomContext(sc)
>> >>> rdd = myContext.customTextFile("path")
>>
>> At the end of each snippet, I should be working with an ordinary RDD[String].
>>
>> I am trying to access my Scala class through sc._jvm as below, but am not
>> having any luck so far.
>>
>> My attempts:
>> >>> a = sc._jvm.com.company.mylibrary.CustomContext
>> >>> dir(a)
>> ['<package or class name>']
>>
>> Example of what I want:
>> >>> a = sc._jvm.PythonRDD
>> >>> dir(a)
>> ['anonfun$6', 'anonfun$8', 'collectAndServe',
>> 'doubleRDDToDoubleRDDFunctions', 'getWorkerBroadcasts', 'hadoopFile',
>> 'hadoopRDD', 'newAPIHadoopFile', 'newAPIHadoopRDD',
>> 'numericRDDToDoubleRDDFunctions', 'rddToAsyncRDDActions',
>> 'rddToOrderedRDDFunctions', 'rddToPairRDDFunctions',
>> 'rddToPairRDDFunctions$default$4', 'rddToSequenceFileRDDFunctions',
>> 'readBroadcastFromFile', 'readRDDFromFile', 'runJob',
>> 'saveAsHadoopDataset', 'saveAsHadoopFile', 'saveAsNewAPIHadoopFile',
>> 'saveAsSequenceFile', 'sequenceFile', 'serveIterator', 'valueOfPair',
>> 'writeIteratorToStream', 'writeUTF']
>>
>> The next thing I would run into is converting the JVM RDD[String] back to
>> a Python RDD; what is the easiest way to do this?
>>
>> Overall, is this a good approach to calling the same API from Scala and
>> Python?
>>
>> --
>> Pedro Rodriguez
>> PhD Student in Distributed Machine Learning | CU Boulder
>> UC Berkeley AMPLab Alumni
>>
>> ski.rodrig...@gmail.com | pedrorodriguez.io | 909-353-4423
>> Github: github.com/EntilZha | LinkedIn:
>> https://www.linkedin.com/in/pedrorodriguezscience
>
> --
> Best Regards
>
> Jeff Zhang

--
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau
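P.S. Since Pedro asked about going through sc._jvm directly and about getting the JVM RDD[String] back into Python, here is a rough, untested sketch. It assumes com.company.mylibrary.CustomContext has a constructor taking a Scala SparkContext and that customTextFile returns an RDD[String] - adjust to whatever your library actually exposes (if CustomContext(sc) is a companion object's apply, call .apply(...) explicitly), and make sure the jar is on the classpath (e.g. passed with --jars).

from pyspark.rdd import RDD
from pyspark.serializers import UTF8Deserializer

def custom_text_file(sc, path):
    # sc._jsc is the JavaSparkContext; .sc() gives the underlying Scala
    # SparkContext, which is what the custom context wraps.
    custom_ctx = sc._jvm.com.company.mylibrary.CustomContext(sc._jsc.sc())
    # customTextFile returns a Scala RDD[String]; convert it to a JavaRDD
    # so py4j can hand it back across the gateway.
    jrdd = custom_ctx.customTextFile(path).toJavaRDD()
    # Strings coming from the JVM arrive as UTF-8 bytes; this is the same
    # wrapping that SparkContext.textFile does in context.py.
    return RDD(jrdd, sc, UTF8Deserializer())

After that, rdd = custom_text_file(sc, "path") should behave like any other Python RDD of strings.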