[jira] [Commented] (SPARK-23842) accessing java from PySpark lambda functions

2018-04-26 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16454585#comment-16454585
 ] 

holdenk commented on SPARK-23842:
-

So the py4j gateway only exists in the driver program; in the worker processes 
Spark uses its own mechanism to communicate between the Python worker and the 
JVM. You might find [https://github.com/sparklingpandas/sparklingml] helpful: 
if you factor your code so that the Java function takes in a DataFrame, you can 
use it that way (or register Java UDFs as shown there).
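To illustrate the Java-UDF route, a minimal sketch, assuming the jar passed via 
{{--jars}} contains a class implementing {{org.apache.spark.sql.api.java.UDF1}}; 
the class name {{com.example.GenModelUdf}} and the column name are placeholders, 
not the reporter's actual library:

{code:python}
# Minimal sketch of registering and calling a Java UDF from PySpark.
# "com.example.GenModelUdf" is a hypothetical class shipped via --jars.
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()

# Spark 2.3: spark.udf.registerJavaFunction; on 2.2 and earlier use
# sqlContext.registerJavaFunction instead.
spark.udf.registerJavaFunction("gen_model", "com.example.GenModelUdf", StringType())

df = spark.createDataFrame([("some input",)], ["value"])
df.selectExpr("gen_model(value) AS scored").show()
{code}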

> accessing java from PySpark lambda functions
> 
>
> Key: SPARK-23842
> URL: https://issues.apache.org/jira/browse/SPARK-23842
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.2.0, 2.2.1, 2.3.0
>Reporter: Ruslan Dautkhanov
>Priority: Major
>  Labels: cloudpickle, py4j, pyspark
>
> Copied from https://github.com/bartdag/py4j/issues/311, but it seems to be 
> more of a Spark issue than a py4j one.
> |We have a third-party Java library that is distributed to Spark executors 
> through the {{--jars}} parameter.
> We want to call a static Java method from that library on the executor side 
> through Spark's {{map()}}, or create an object of one of that library's 
> classes in a {{mapPartitions()}} call. 
> None of the approaches has worked so far. It seems Spark tries to serialize 
> everything it sees in a lambda function and distribute it to the executors.
> I am aware of an older py4j issue/question 
> [#171|https://github.com/bartdag/py4j/issues/171], but that discussion isn't 
> helpful here.
> We thought to create a reference to that "class" through a call like 
> {{genmodel = spark._jvm.hex.genmodel}} and then operate through py4j to expose 
> the library's functionality in PySpark executors' lambda functions.
> We don't want Spark to try to serialize the Spark session variable {{spark}} 
> or its reference to the py4j gateway {{spark._jvm}} (because that leads to the 
> expected non-serializable exceptions), so we tried to "trick" Spark into not 
> serializing them by nesting the above {{genmodel = spark._jvm.hex.genmodel}} 
> inside an {{exec()}} call.
> That led to another issue: neither the {{spark}} (Spark session) nor the 
> {{sc}} (Spark context) variable is available in Spark executors' lambda 
> functions. 
> So we're stuck and don't know how to call a generic Java class through py4j 
> on the executor side (from within {{map}} or {{mapPartitions}} lambda 
> functions).
> This would be easier from Scala or Java, since those can call the third-party 
> library's methods directly, but our users ask for a way to do the same from 
> PySpark.|
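For illustration, a minimal sketch of the attempt described above; 
{{hex.genmodel}} is the reporter's library, but the class and method names are 
hypothetical, and the call is expected to fail because the lambda captures 
{{spark}} (and with it the py4j gateway):

{code:python}
# Sketch of the pattern described above; SomeClass.score is a hypothetical
# method of the hex.genmodel library. This fails on Spark 2.x: the closure
# captures spark (and its SparkContext / py4j gateway), which cannot be
# serialized, and spark is not defined in executor processes anyway.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
rdd = spark.sparkContext.parallelize(["row1", "row2"])

def apply_model(value):
    genmodel = spark._jvm.hex.genmodel       # only valid on the driver
    return genmodel.SomeClass.score(value)   # hypothetical static call

# rdd.map(apply_model).collect()   # raises a serialization error
{code}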






[jira] [Commented] (SPARK-23842) accessing java from PySpark lambda functions

2018-04-02 Thread Ruslan Dautkhanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16423439#comment-16423439
 ] 

Ruslan Dautkhanov commented on SPARK-23842:
---

[~hyukjin.kwon] Thanks for the reply. I wasn't sure where this issue (or 
feature?) belongs, which is why I opened it on both py4j and Spark. Any idea 
how we can call a third-party Java library's method (distributed to executors 
through the `--jars` argument) from mapPartitions()? I understand that neither 
the Spark session nor the Spark context is serializable, but I found no way to 
call a Java method from within map() / mapPartitions(). We can normally use 
`spark._jvm` to access py4j on the driver side; how can we do the same on the 
executor side? Thanks in advance for any leads.







[jira] [Commented] (SPARK-23842) accessing java from PySpark lambda functions

2018-04-02 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16423378#comment-16423378
 ] 

Hyukjin Kwon commented on SPARK-23842:
--

How is this more of a Spark issue? The Spark session and Spark context are not 
meant to be serialized. I haven't tried it yet, but I think you should do this 
without the session or context, using py4j alone within mapPartitions.
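One way to read that suggestion, as a rough sketch: start a separate JVM with 
py4j inside the partition function, with the third-party jar on its classpath 
as it sits on the executor's local disk. The jar path and the {{hex.genmodel}} 
class and method names below are hypothetical, and launching a JVM per 
partition is expensive; this only illustrates the idea:

{code:python}
# Rough sketch of "py4j alone inside mapPartitions": launch a small JVM per
# partition whose classpath includes the third-party jar as laid out on the
# executor's local filesystem. Jar path, class and method names are
# hypothetical placeholders.
from py4j.java_gateway import JavaGateway, GatewayParameters, launch_gateway

def call_java(partition):
    # launch_gateway starts a fresh JVM and returns the port it listens on.
    port = launch_gateway(classpath="/local/path/on/executor/third-party.jar",
                          die_on_exit=True)
    gateway = JavaGateway(gateway_parameters=GatewayParameters(port=port))
    try:
        cls = gateway.jvm.hex.genmodel.SomeClass   # hypothetical class
        for value in partition:
            yield cls.score(value)                 # hypothetical static method
    finally:
        gateway.shutdown()

# result = rdd.mapPartitions(call_java).collect()
{code}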



