Based on Ben's helpful error description, I managed to reproduce this bug
and found the root cause:
There's a bug in MemoryStore's PartiallySerializedBlock class: it doesn't
close the serialization stream before attempting to deserialize its
serialized values, causing it to miss any data still sitting in the
stream's internal buffers.
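To illustrate the general failure mode (this is a minimal standard-library sketch using pickle and a buffered stream, not Spark's actual serializer or the MemoryStore code path): if you read the underlying bytes before the writer has been flushed or closed, data still sitting in the writer's buffer is silently missing, and a later deserialization pass sees an incomplete stream.

```python
import io
import pickle

# Write 100 pickled values through a buffered writer on top of a byte sink.
raw = io.BytesIO()
buffered = io.BufferedWriter(raw, buffer_size=1 << 16)
pickler = pickle.Pickler(buffered)
for value in range(100):
    pickler.dump(value)

# Reading the sink *before* flushing: the pickled data is still sitting in
# the writer's internal buffer, so the bytes captured here are incomplete.
premature = raw.getvalue()

# Flushing (as closing would) pushes the buffered bytes through to the sink.
buffered.flush()
complete = raw.getvalue()

# Only the flushed byte stream deserializes back to the full data set.
unpickler = pickle.Unpickler(io.BytesIO(complete))
values = [unpickler.load() for _ in range(100)]
```

The analogous fix in the block-serialization path is to close (and thereby flush) the stream before handing the bytes to the deserializer.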

cache() / persist() is definitely *not* supposed to affect the result of a
program, so the behavior that you're seeing is unexpected.
I'll try to reproduce this myself by caching in PySpark under heavy memory
pressure, but in the meantime the following questions will help me to debug:
- Does

Hi,
I'm trying to understand if there is any difference in correctness
between rdd.persist(pyspark.StorageLevel.MEMORY_ONLY) and
rdd.persist(pyspark.StorageLevel.MEMORY_AND_DISK).
I can see that there may be differences in performance, but my
expectation was that using either would produce the same results.