In PySpark Streaming, if checkpointing is enabled and a stream.transform
operator is used to join with another RDD, a "PicklingError: Could not
serialize object" is thrown. I have asked the same question on Stack Overflow:
https://stackoverflow.com/questions/56267591/pyspark-streaming-picklingerror-could-not-serialize-object-when-checkpoint-an
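
For reference, the failing pattern looks roughly like the sketch below
(the socket source, paths, and names are illustrative, not my actual job):

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="repro")
ssc = StreamingContext(sc, batchDuration=5)
ssc.checkpoint("/tmp/checkpoint")  # checkpointing enabled

# A static pair RDD to join against every batch.
doc_info = sc.parallelize([("k1", "v1"), ("k2", "v2")])

lines = ssc.socketTextStream("localhost", 9999).map(lambda l: (l, 1))

# With checkpointing on, joining against the static RDD inside
# transform() raises "PicklingError: Could not serialize object".
lines.transform(lambda rdd: rdd.join(doc_info)).pprint(10)

ssc.start()
ssc.awaitTermination()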

After some investigation, I found that the problem is that checkpointing
serializes the lambda and, along with it, the RDD referenced inside the
lambda. So I changed the code to something like the snippet below; the idea
is to use a static, transient-style variable so that the RDD itself is not
serialized.

class DocInfoHolder:
    # Class-level attribute used as a "static transient" holder, hoping
    # the lambda captures only the class reference, not the RDD itself.
    doc_info = None

line.transform(lambda rdd: rdd.join(DocInfoHolder.doc_info)).pprint(10)

But the problem still exists. I then found that PySpark uses a special
pickler, cloudpickle.py, which looks like it serializes any referenced
class, function, or lambda by value, and there is no documentation on how
to skip serialization. Could anyone help with how to work around this issue?
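
For what it's worth, the class-attribute capture seems reproducible locally
without Spark at all. A minimal sketch, assuming the standalone cloudpickle
package (the same library PySpark bundles) and a class defined in the driver
script (i.e. in __main__, which cloudpickle pickles by value):

import pickle
import cloudpickle

class DocInfoHolder:
    doc_info = "pretend this is a large RDD"

f = lambda x: (x, DocInfoHolder.doc_info)

# cloudpickle serializes the lambda *and* the __main__ class it
# references by value, class attributes included.
payload = cloudpickle.dumps(f)
g = pickle.loads(payload)
print(g(1))  # (1, 'pretend this is a large RDD')

So even though doc_info lives on the class rather than in a local variable,
its value still travels with the serialized lambda, which would explain why
the workaround above does not help.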


