[jira] [Commented] (SPARK-23842) accessing java from PySpark lambda functions
[ https://issues.apache.org/jira/browse/SPARK-23842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16454585#comment-16454585 ]

holdenk commented on SPARK-23842:
---------------------------------

So the py4j gateway only exists in the driver program; on the workers, Spark uses its own mechanism to communicate between the Python worker and the JVM. You might find [https://github.com/sparklingpandas/sparklingml] helpful: if you factor your code so that the Java function takes in a DataFrame, you can use it that way (or register Java UDFs, as shown there).

> accessing java from PySpark lambda functions
> --------------------------------------------
>
>                 Key: SPARK-23842
>                 URL: https://issues.apache.org/jira/browse/SPARK-23842
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark
>    Affects Versions: 2.2.0, 2.2.1, 2.3.0
>            Reporter: Ruslan Dautkhanov
>            Priority: Major
>              Labels: cloudpickle, py4j, pyspark
>
> Copied from https://github.com/bartdag/py4j/issues/311, but it seems to be
> more of a Spark issue than a py4j one.
>
> We have a third-party Java library that is distributed to Spark executors
> through the {{--jars}} parameter.
> We want to call a static Java method in that library on the executor side
> through Spark's {{map()}}, or create an object of that library's class through
> a {{mapPartitions()}} call.
> None of the approaches has worked so far. Spark seems to try to serialize
> everything it sees in a lambda function, distribute it to executors, etc.
> I am aware of an older py4j issue/question
> [#171|https://github.com/bartdag/py4j/issues/171], but that
> discussion isn't helpful here.
> We thought to create a reference to that "class" through a call like
> {{genmodel = spark._jvm.hex.genmodel}} and then operate through py4j to expose
> functionality of that library in PySpark executors' lambda functions.
> We don't want Spark to try to serialize the spark session variable {{spark}} or
> its reference to the py4j gateway {{spark._jvm}} (because that leads to the expected
> non-serializable exceptions), so we tried to "trick" Spark into not serializing
> those by nesting the above {{genmodel = spark._jvm.hex.genmodel}}
> inside an {{exec()}} call.
> That led to another issue: neither the {{spark}} (spark session) nor the {{sc}} (spark
> context) variable seems to be available in Spark executors' lambda functions.
> So we're stuck and don't know how to call a generic Java class through py4j
> on the executor side (from within {{map}} or {{mapPartitions}} lambda
> functions).
> This would be an easier adventure from Scala/Java, which can call the
> third-party library's methods directly, but our users ask for a
> way to do the same from PySpark.
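For illustration, a minimal sketch of the "register Java UDFs" route mentioned above. The class name {{com.example.MyUpperUDF}} is a hypothetical placeholder for a class in the third-party jar that implements {{org.apache.spark.sql.api.java.UDF1}}; {{spark.udf.registerJavaFunction}} is the Spark 2.3+ PySpark API (earlier versions expose {{sqlContext.registerJavaFunction}} instead).

{code:python}
# Sketch: call Java code from PySpark by registering it as a SQL UDF, so the
# executors run the Java class directly and no py4j handle is ever captured
# inside a Python lambda. Assumes --jars shipped a class roughly like:
#   public class MyUpperUDF implements org.apache.spark.sql.api.java.UDF1<String, String> { ... }
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("java-udf-sketch").getOrCreate()

# Spark 2.3+: register the Java class under a SQL-callable name.
# (On 2.2.x the equivalent call is sqlContext.registerJavaFunction.)
spark.udf.registerJavaFunction("my_upper", "com.example.MyUpperUDF", StringType())

df = spark.createDataFrame([("hello",), ("world",)], ["word"])
df.selectExpr("my_upper(word) AS upper_word").show()
{code}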
[jira] [Commented] (SPARK-23842) accessing java from PySpark lambda functions
[ https://issues.apache.org/jira/browse/SPARK-23842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16423439#comment-16423439 ]

Ruslan Dautkhanov commented on SPARK-23842:
-------------------------------------------

[~hyukjin.kwon] Thanks for the reply. I wasn't sure where this issue (or feature?) belongs; that's why I opened it against both py4j and Spark. Any idea how we can call a third-party Java library's method (distributed to executors through the {{--jars}} argument) from mapPartitions()? I understand that neither the spark session nor the spark context is serializable, but I have found no way to call a Java method from within map() / mapPartitions(). We can normally use {{spark._jvm}} to access py4j on the driver side; how can we do the same on the executor side? Thanks in advance for any leads.
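For contrast, a minimal sketch of the driver-side pattern that does work today: keep all py4j access on the driver and hand the Java code a whole DataFrame, instead of touching {{spark._jvm}} inside map()/mapPartitions(). The class and method {{com.example.Scorer.scoreDataFrame}} are hypothetical placeholders for an entry point in the third-party jar; {{df._jdf}} and {{spark._wrapped}} are internal PySpark attributes in 2.x.

{code:python}
# Sketch: driver-only py4j usage. Nothing below is referenced from an
# executor-side lambda, so Spark never tries to serialize the gateway.
from pyspark.sql import SparkSession, DataFrame

spark = SparkSession.builder.getOrCreate()
df = spark.range(10)

# Hypothetical static Java entry point taking and returning a Dataset<Row>, e.g.
#   public static Dataset<Row> scoreDataFrame(Dataset<Row> input) { ... }
scored_jdf = spark._jvm.com.example.Scorer.scoreDataFrame(df._jdf)

# Wrap the returned Java DataFrame back into a PySpark DataFrame
# (spark._wrapped is the underlying SQLContext in Spark 2.x).
scored = DataFrame(scored_jdf, spark._wrapped)
scored.show()
{code}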
[jira] [Commented] (SPARK-23842) accessing java from PySpark lambda functions
[ https://issues.apache.org/jira/browse/SPARK-23842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16423378#comment-16423378 ]

Hyukjin Kwon commented on SPARK-23842:
--------------------------------------

How is this more of a Spark issue? The spark session and spark context are not meant to be serialized. I haven't tried it yet, but I think you should do this without the session or context, with py4j alone, inside mapPartitions.
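A rough sketch of that suggestion, assuming the third-party jar is already on the executors' local filesystem: launch a standalone py4j gateway inside mapPartitions() and talk to the library through it, without touching the SparkSession or SparkContext. {{launch_gateway}} and {{JavaGateway}} are standard py4j; the {{hex.genmodel}} package and {{someStaticMethod}} call mirror the reporter's description and are assumptions, not verified API.

{code:python}
from py4j.java_gateway import JavaGateway, GatewayParameters, launch_gateway

def score_partition(rows):
    # Start a small, private JVM on the executor; the classpath must point to
    # the jar on the executor's local disk (path here is a placeholder).
    port = launch_gateway(classpath="/path/to/third-party.jar", die_on_exit=True)
    gateway = JavaGateway(gateway_parameters=GatewayParameters(port=port))
    try:
        genmodel = gateway.jvm.hex.genmodel  # package name taken from the issue text
        for row in rows:
            # Hypothetical static call into the library, one record at a time.
            yield genmodel.GenModel.someStaticMethod(row[0])
    finally:
        gateway.shutdown()

# Usage (score_partition captures no SparkSession/SparkContext, so nothing
# non-serializable is shipped to the executors):
# result = df.rdd.mapPartitions(score_partition).collect()
{code}

Note that this launches one extra JVM per partition call, so it is only a sketch of the mechanism; batching per executor or moving the logic into a registered Java UDF (as suggested above) would be cheaper.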