Although Python can launch a subprocess to run Java code, in PySpark the processing code that needs to run in parallel on the cluster has to be written in Python. For example, in PySpark:
def f(x): ...
rdd.map(f)  # The function `f` must be pure Python code.

If you try to launch a subprocess to run Java code inside the function `f`, it will add large overhead and cause many other issues.

On Thu, Sep 28, 2017 at 5:36 PM, Giuseppe Celano <cel...@informatik.uni-leipzig.de> wrote:

> Hi,
>
> What I meant is that I could run the Java script using the subprocess
> module in Python. In that case, is any difference in performance expected
> (compared to coding directly against the Java API)? Thanks.
>
>
> On Sep 28, 2017, at 3:32 AM, Weichen Xu <weichen...@databricks.com> wrote:
>
> I think you have to use the Spark Java API. In PySpark, functions running
> on Spark executors (such as a map function) can only be written in Python.
>
> On Thu, Sep 28, 2017 at 12:48 AM, Giuseppe Celano
> <cel...@informatik.uni-leipzig.de> wrote:
>
>> Hi everyone,
>>
>> I would like to apply a Java script to many files in parallel. I am
>> wondering whether I should definitely use the Spark Java API, or whether
>> I could also run the script using the Python API (with which I am more
>> familiar), without this affecting performance. Thanks.
>>
>> Giuseppe
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
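The overhead point above can be illustrated even without a cluster: spawning one external process per record is far more expensive than calling an in-process Python function, and the two approaches only compute the same values. This is a minimal plain-Python sketch (no Spark involved); it uses `sys.executable` as the child process purely as a stand-in for launching a JVM, which would be heavier still because of JVM start-up cost.

```python
import subprocess
import sys

def square_in_process(x):
    """Pure Python: the shape a PySpark `map` function should have."""
    return x * x

def square_via_subprocess(x):
    """Anti-pattern: launch a child process for every record."""
    out = subprocess.run(
        [sys.executable, "-c", f"print({x} * {x})"],
        capture_output=True, text=True, check=True,
    )
    return int(out.stdout)

data = range(5)
in_process = [square_in_process(x) for x in data]
via_subprocess = [square_via_subprocess(x) for x in data]

# Same results, but the subprocess version pays process start-up
# and pipe I/O cost for every single record.
assert in_process == via_subprocess == [0, 1, 4, 9, 16]
```

Timing the two list comprehensions (e.g. with `time.perf_counter`) makes the per-record process start-up cost obvious even for a handful of elements.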