So cache() and persist() mutate the object even on plain RDDs. What matters for Spark is that the data an RDD/DataFrame represents is never mutated; only the handle's bookkeeping (storage level, name, and so on) changes.
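
To make that distinction concrete, here is a toy Python sketch. It is not Spark code: ToyDataset and its methods are made-up names that only mimic the behaviour described here, on the assumption that "mutable handle, immutable data" is the right mental model. In real Spark the corresponding calls are df.cache(), df.persist() and rdd.setName(...).

```python
# Toy model only: ToyDataset is a hypothetical stand-in, not a Spark class.
class ToyDataset:
    def __init__(self, rows):
        self._rows = tuple(rows)   # the data: fixed once created
        self.is_cached = False     # bookkeeping on the handle: may change
        self.name = None

    def cache(self):
        # Flip a flag on this object and return the same object,
        # so the caller does not need to assign the result.
        self.is_cached = True
        return self

    def set_name(self, name):
        # Also bookkeeping only: relabels the handle, leaves the rows alone.
        self.name = name
        return self

    def map(self, fn):
        # A transformation never touches self; it builds a new dataset.
        return ToyDataset(fn(r) for r in self._rows)

    def collect(self):
        return list(self._rows)


ds = ToyDataset([1, 2, 3])
before = ds.collect()
ds.cache()                          # result deliberately not assigned
doubled = ds.map(lambda x: x * 2)   # new object; ds is unchanged
```

After this runs, ds is flagged as cached even though the return value of cache() was thrown away, ds.collect() still returns the same rows as before, and doubled is a separate object with its own (uncached) state. That is the sense in which an unassigned df.cache() is not a bug.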

On Mon, May 25, 2020 at 10:56 AM Chris Thomas <heathstudios...@gmail.com> wrote:

> The cache() method on the DataFrame API caught me out.
>
> Having learnt that DataFrames are built on RDDs and that RDDs are
> immutable, when I saw the statement df.cache() in our codebase I thought
> ‘This must be a bug, the result is not assigned, the statement will have no
> effect.’
>
> However, I’ve since learnt that the cache method actually mutates the
> DataFrame object*. The statement was valid after all.
>
> I understand that the underlying user data is immutable, but doesn’t
> mutating the DataFrame object make the API a little inconsistent and harder
> to reason about?
>
> Regards
>
> Chris
>
>
> * (as do the persist and rdd.setName methods. I expect there are others)

--
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
YouTube Live Streams: https://www.youtube.com/user/holdenkarau