[ 
https://issues.apache.org/jira/browse/SPARK-22674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16278121#comment-16278121
 ] 

Hyukjin Kwon edited comment on SPARK-22674 at 12/5/17 8:06 AM:
---------------------------------------------------------------

Oh, sorry, I overlooked that {{regular pickle won't be able to unpickle
namedtuples anymore}}.

I didn't mean to completely remove support for the regular pickle path, but to
deduplicate the serializer logic where possible and pin PySpark's copy to a
specific version of cloudpickle, if possible (not replacing it completely).

I'd like to avoid a separate fix within PySpark if we can.





> PySpark breaks serialization of namedtuple subclasses
> -----------------------------------------------------
>
>                 Key: SPARK-22674
>                 URL: https://issues.apache.org/jira/browse/SPARK-22674
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.2.0
>            Reporter: Jonas Amrich
>
> PySpark monkey-patches the namedtuple class to make it serializable; however, 
> this breaks serialization of its subclasses. With the current implementation, 
> any subclass will be serialized (and deserialized) as its parent namedtuple. 
> Consider this code, which will fail with {{AttributeError: 'Point' object has 
> no attribute 'sum'}}:
> {code}
> from collections import namedtuple
> Point = namedtuple("Point", "x y")
> class PointSubclass(Point):
>     def sum(self):
>         return self.x + self.y
> rdd = spark.sparkContext.parallelize([[PointSubclass(1, 1)]])
> rdd.collect()[0][0].sum()
> {code}
> Moreover, as PySpark hijacks all namedtuples in the main module, importing 
> pyspark breaks serialization of namedtuple subclasses even in code that is 
> not related to Spark / distributed execution. I don't see any clean solution 
> to this; a possible workaround may be to limit serialization hack only to 
> direct namedtuple subclasses like in 
> https://github.com/JonasAmrich/spark/commit/f3efecee28243380ecf6657fe54e1a165c1b7204
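The reported failure mode can be reproduced without Spark at all. Below is a minimal stand-alone sketch: the {{_restore}} helper here is a simplified stand-in for PySpark's actual monkey patch (the real code lives in pyspark/serializers.py), but it shows the same effect — reducing a subclass instance to its parent namedtuple drops the subclass's methods on unpickling.

```python
# Stand-alone sketch of the bug (no Spark required). The _restore helper
# is an illustrative stand-in for PySpark's hijack, which makes every
# namedtuple pickle as a freshly rebuilt parent namedtuple class.
import pickle
from collections import namedtuple


def _restore(name, fields, value):
    # Rebuilds a plain namedtuple class from its name and fields --
    # any subclass identity (and its methods) is lost here.
    return namedtuple(name, fields)(*value)


Point = namedtuple("Point", "x y")


class PointSubclass(Point):
    def sum(self):
        return self.x + self.y

    # Mimic the monkey patch: reduce to the *parent* namedtuple.
    def __reduce__(self):
        return (_restore, (Point.__name__, Point._fields, tuple(self)))


p = pickle.loads(pickle.dumps(PointSubclass(1, 1)))
print(type(p).__name__)   # Point, not PointSubclass
print(hasattr(p, "sum"))  # False -> the AttributeError in the report
```

This is why limiting the hack to direct namedtuple instances (as in the linked commit) helps: subclasses would then fall back to regular pickling, which preserves their class.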



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org