So even on RDDs, cache/persist mutate the RDD object. The important thing
for Spark is that the data represented by the RDD/DataFrame isn’t mutated.
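
A toy sketch of that distinction, in plain Python rather than Spark itself.
ToyDataset and its storage_level field are hypothetical names made up for
illustration; this is not how Spark implements caching. It only shows a
handle whose bookkeeping changes while the rows it represents stay fixed.
In PySpark, comparing df.storageLevel before and after df.cache() should
show the same kind of change on a real DataFrame.

```python
class ToyDataset:
    """Hypothetical stand-in for an RDD/DataFrame handle; not Spark code."""

    def __init__(self, rows):
        self._rows = tuple(rows)      # the data: fixed at construction
        self.storage_level = "NONE"   # bookkeeping: allowed to change

    def cache(self):
        # Mutates only the handle's metadata and returns the same object,
        # so a bare `ds.cache()` statement still has an effect.
        self.storage_level = "MEMORY_ONLY"
        return self

    def map(self, f):
        # A transformation builds a new dataset instead of editing this one.
        return ToyDataset(f(r) for r in self._rows)

    def collect(self):
        return list(self._rows)


ds = ToyDataset([1, 2, 3])
returned = ds.cache()               # result may be ignored; ds itself is marked
doubled = ds.map(lambda x: x * 2)   # new object with its own storage level
```

Here ds.collect() is the same before and after cache(), and doubled starts
out uncached, so the only thing that moved is the marker on the handle.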

On Mon, May 25, 2020 at 10:56 AM Chris Thomas <heathstudios...@gmail.com>
wrote:

>
> The cache() method on the DataFrame API caught me out.
>
> Having learnt that DataFrames are built on RDDs and that RDDs are
> immutable, when I saw the statement df.cache() in our codebase I thought
> ‘This must be a bug, the result is not assigned, the statement will have no
> effect.’
>
> However, I’ve since learnt that the cache method actually mutates the
> DataFrame object*. The statement was valid after all.
>
> I understand that the underlying user data is immutable, but doesn’t
> mutating the DataFrame object make the API a little inconsistent and harder
> to reason about?
>
> Regards
>
> Chris
>
>
> * (as do the persist and rdd.setName methods. I expect there are others)
>
-- 
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9
YouTube Live Streams: https://www.youtube.com/user/holdenkarau
