So I'm a little biased - I think the best bridge between the two is using DataFrames. I've got some examples in my talk and on the high performance spark GitHub (https://github.com/high-performance-spark/high-performance-spark-examples/blob/master/high_performance_pyspark/simple_perf_test.py), which calls some custom Scala code.
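The general shape of that DataFrame bridge looks something like the following - just a sketch, where com.example.MyTransforms.doStuff stands in for whatever Scala entry point you expose, not the API used in the linked example:

from pyspark.sql import DataFrame

def call_scala_transform(df):
    # The Python DataFrame already carries the SQLContext and the py4j gateway.
    sql_ctx = df.sql_ctx
    jvm = sql_ctx._sc._jvm
    # Hand the underlying Java DataFrame to the Scala side...
    jdf = jvm.com.example.MyTransforms.doStuff(df._jdf)
    # ...and wrap the Java DataFrame that comes back so Python can keep using it.
    return DataFrame(jdf, sql_ctx)

Because the data stays in the JVM as a DataFrame the whole way through, you avoid the Python serialization round trip, which is where most of the performance win comes from.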
Using a custom context is a bit tricky though because of how the launching is done. As Jeff Zhang points out below, you would need to wrap it in a JavaSparkContext, and then you could override the _initialize_context function in context.py. (A rough sketch of the simpler route - calling your class through sc._jvm and wrapping the result back into a Python RDD - is at the bottom of this mail.)

On Thu, Jun 30, 2016 at 11:06 AM, Jeff Zhang <zjf...@gmail.com> wrote:
> Hi Pedro,
>
> Your use case is interesting. I think launching the java gateway is the same
> as for the native SparkContext, the only difference is in creating your custom
> SparkContext instead of the native SparkContext. You might also need to wrap it
> using java.
>
> https://github.com/apache/spark/blob/v1.6.2/python/pyspark/context.py#L172
>
> On Thu, Jun 30, 2016 at 9:53 AM, Pedro Rodriguez <ski.rodrig...@gmail.com>
> wrote:
>
>> Hi All,
>>
>> I have written a Scala package which essentially wraps the SparkContext
>> in a custom class that adds some functionality specific to our internal
>> use case. I am trying to figure out the best way to call this from PySpark.
>>
>> I would like to do this similarly to how Spark itself calls the JVM
>> SparkContext, as in:
>> https://github.com/apache/spark/blob/v1.6.2/python/pyspark/context.py
>>
>> My goal would be something like this:
>>
>> Scala code (this is done):
>> >>> import com.company.mylibrary.CustomContext
>> >>> val myContext = CustomContext(sc)
>> >>> val rdd: RDD[String] = myContext.customTextFile("path")
>>
>> Python code (I want to be able to do this):
>> >>> from company.mylibrary import CustomContext
>> >>> myContext = CustomContext(sc)
>> >>> rdd = myContext.customTextFile("path")
>>
>> At the end of each snippet, I should be working with an ordinary RDD[String].
>>
>> I am trying to access my Scala class through sc._jvm as below, but am not
>> having any luck so far.
>>
>> My attempts:
>> >>> a = sc._jvm.com.company.mylibrary.CustomContext
>> >>> dir(a)
>> ['<package or class name>']
>>
>> Example of what I want:
>> >>> a = sc._jvm.PythonRDD
>> >>> dir(a)
>> ['anonfun$6', 'anonfun$8', 'collectAndServe',
>> 'doubleRDDToDoubleRDDFunctions', 'getWorkerBroadcasts', 'hadoopFile',
>> 'hadoopRDD', 'newAPIHadoopFile', 'newAPIHadoopRDD',
>> 'numericRDDToDoubleRDDFunctions', 'rddToAsyncRDDActions',
>> 'rddToOrderedRDDFunctions', 'rddToPairRDDFunctions',
>> 'rddToPairRDDFunctions$default$4', 'rddToSequenceFileRDDFunctions',
>> 'readBroadcastFromFile', 'readRDDFromFile', 'runJob',
>> 'saveAsHadoopDataset', 'saveAsHadoopFile', 'saveAsNewAPIHadoopFile',
>> 'saveAsSequenceFile', 'sequenceFile', 'serveIterator', 'valueOfPair',
>> 'writeIteratorToStream', 'writeUTF']
>>
>> The next thing I would run into is converting the JVM RDD[String] back to
>> a Python RDD; what is the easiest way to do this?
>>
>> Overall, is this a good approach to calling the same API from Scala and
>> Python?
>>
>> --
>> Pedro Rodriguez
>> PhD Student in Distributed Machine Learning | CU Boulder
>> UC Berkeley AMPLab Alumni
>>
>> ski.rodrig...@gmail.com | pedrorodriguez.io | 909-353-4423
>> Github: github.com/EntilZha | LinkedIn:
>> https://www.linkedin.com/in/pedrorodriguezscience
>
> --
> Best Regards
>
> Jeff Zhang

--
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau
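P.S. Since Pedro asked about going through sc._jvm directly and about getting the JVM RDD[String] back into Python, here is a rough, untested sketch. It assumes com.company.mylibrary.CustomContext has a constructor taking a Scala SparkContext and that customTextFile returns an RDD[String] - adjust to whatever your library actually exposes (if CustomContext(sc) is a companion object's apply, call .apply(...) explicitly), and make sure the jar is on the classpath (e.g. passed with --jars).

from pyspark.rdd import RDD
from pyspark.serializers import UTF8Deserializer

def custom_text_file(sc, path):
    # sc._jsc is the JavaSparkContext; .sc() gives the underlying Scala
    # SparkContext, which is what the custom context wraps.
    custom_ctx = sc._jvm.com.company.mylibrary.CustomContext(sc._jsc.sc())
    # customTextFile returns a Scala RDD[String]; convert it to a JavaRDD
    # so py4j can hand it back across the gateway.
    jrdd = custom_ctx.customTextFile(path).toJavaRDD()
    # Strings coming from the JVM arrive as UTF-8 bytes; this is the same
    # wrapping that SparkContext.textFile does in context.py.
    return RDD(jrdd, sc, UTF8Deserializer())

After that, rdd = custom_text_file(sc, "path") should behave like any other Python RDD of strings.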