In PySpark Streaming, when checkpointing is enabled and a stream.transform operator is used to join with another RDD, a "PicklingError: Could not serialize object" is thrown. I have asked the same question on Stack Overflow: https://stackoverflow.com/questions/56267591/pyspark-streaming-picklingerror-could-not-serialize-object-when-checkpoint-an
After some investigation, I found that the problem is that checkpointing serializes the lambda, and with it the RDD referenced inside the lambda. So I changed the code to something like the following, the intent being to hold the RDD in a static, effectively transient variable so that it would not be serialized:

class DocInfoHolder:
    doc_info = None

line.transform(lambda rdd: rdd.join(DocInfoHolder.doc_info)).pprint(10)

But the problem still exists. I then found that PySpark uses a special pickler, cloudpickle.py, which looks like it serializes every referenced class, function, and lambda by value, and there is no documentation on how to skip serializing a reference. Could anyone help with how to work around this issue?

--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/