It obviously depends on what's missing, but if I were you, I'd first try monkey-patching pyspark with the functionality you need (along with submitting a pull request, of course). The pyspark code is very readable, and a lot of the functionality just builds on a few primitives, much as in the Scala Spark code, so in many cases you can use the Scala version as a reference. For example, compare RDD.distinct() in Scala (https://github.com/apache/incubator-spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L263) with the Python version (https://github.com/apache/incubator-spark/blob/master/python/pyspark/rdd.py#L175): the Python one is missing numPartitions, but that looks like a trivial fix in this case.
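
To make that concrete, here's a rough, untested sketch of what such a monkey patch could look like. The body just mirrors the existing Python implementation, with numPartitions threaded through to reduceByKey the way the Scala version does it:

    from pyspark.rdd import RDD

    def distinct(self, numPartitions=None):
        """Like RDD.distinct(), but with control over the partition
        count, matching the Scala signature."""
        return (self.map(lambda x: (x, None))
                    .reduceByKey(lambda x, _: x, numPartitions)
                    .map(lambda kv: kv[0]))

    # Patch it onto the class so every RDD instance picks it up.
    RDD.distinct = distinct

And once that works for you, the same change is basically your pull request.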

-Ewen

December 11, 2013 8:57 PM
Hi all,

I've mostly been using Spark from Python, and it's been a great experience (thanks for the earlier help with GPUs, btw). But I recently dug into the Scala API and found it incredibly rich, with some options that would be pretty helpful for us but are missing from the Python API. Is it straightforward to write the driver in Scala but have the workers written in Python? Alternatively, can I (easily) use Py4J to access these Scala methods from Python? I imagine I'll be playing around with it over the next few days, but I was wondering if anyone had tried this. Sorry if it's a stupid question...
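
For the Py4J route, something like this is what I had in mind. I'm assuming the internal SparkContext._jvm and RDD._jrdd handles are fair game here; they're undocumented, so I have no idea how stable they are:

    from pyspark import SparkContext

    sc = SparkContext("local", "py4j-test")

    # pyspark already runs a Py4J gateway to the driver JVM, and _jvm is
    # the (internal) handle to it, so arbitrary JVM classes are reachable:
    print(sc._jvm.java.lang.System.currentTimeMillis())

    # The JVM-side RDD behind a Python RDD is also exposed:
    rdd = sc.parallelize(range(10))
    print(rdd._jrdd.count())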

Thanks for your time and attention.

Patrick Grinaway
