In PySpark Streaming, when checkpointing is enabled and a stream.transform operator is used to join with another RDD, a "PicklingError: Could not serialize object" is thrown. I have asked the same question on Stack Overflow: https://stackoverflow.com/questions/56267591/pyspark-streaming-picklingerror-could-not-serialize-object-when-checkpoint-an
After some investigation, I found that the problem is that checkpointing serializes the lambda, and with it the RDD referenced inside the lambda. So I changed the code to something like the following, the intent being to hold the RDD in a static, effectively transient variable so that it would not be serialized:

class DocInfoHolder:
    doc_info = None

line.transform(lambda rdd: rdd.join(DocInfoHolder.doc_info)).pprint(10)

But the problem still exists. I then found that PySpark uses a special pickler, cloudpickle.py, which looks like it serializes every referenced class, function, and lambda by value, and there is no documentation on how to skip serializing a reference. Could anyone help with how to work around this issue?

--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/