I am trying to understand the process of caching and specifically what the
behavior is when the cache is full. Please excuse me if this question is a
little vague; I am trying to build my understanding of this process.

I have an RDD that I perform several computations with; I persist it with
MEMORY_ONLY_SER before performing the computations.
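
For context, here is roughly what I am doing. The data source and
transformations below are just placeholders, run in spark-shell where sc is
the SparkContext:

import org.apache.spark.storage.StorageLevel

// Illustrative only: the real data source and transformations are my own.
val rdd = sc.parallelize(1 to 1000000).map(_.toString)
rdd.persist(StorageLevel.MEMORY_ONLY_SER)

// Several actions reuse the same persisted RDD; ideally only the first
// action should trigger a full computation.
val total = rdd.count()
val longest = rdd.map(_.length).max()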

I believe that, due to insufficient memory, it is recomputing (at least
part of) the RDD each time.

Logging shows that the RDD was not cached previously, and therefore needs
to be computed.

I looked at the BlockManager code in Spark and see that getOrCompute first
attempts to retrieve the block from the cache; if it is not available, it
computes the partition.
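
To check that I am reading that code correctly, here is a stripped-down
sketch of the get-or-compute pattern as I understand it. This is only my own
illustration, not the actual Spark implementation, and all the names here
are made up:

import scala.collection.mutable

// Simplified sketch of the get-or-compute idea; NOT Spark's BlockManager code.
def getOrCompute[T](cache: mutable.Map[Int, T],
                    partitionId: Int,
                    enoughMemory: Boolean)(compute: => T): T =
  cache.get(partitionId) match {
    case Some(value) =>
      value                            // block found in cache, reuse it
    case None =>
      val value = compute              // cache miss: recompute the partition
      if (enoughMemory)                // if the store fails for lack of memory,
        cache.put(partitionId, value)  // the value is returned but never cached
      value
  }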

Can I assume that when Spark attempts to cache an RDD but runs out of
memory, it recomputes the partitions that did not fit each time they are
read?

I think I might be incorrect in this assumption, because I would have
expected a warning message if the cache ran out of memory.

Thanks,
Jeff
