When I first learnt Spark, I was told that *cache()* is desirable anytime one performs more than one Action on an RDD or DataFrame. For example, consider the PySpark toy example below; it shows two approaches to doing the same thing.
```python
# Approach 1 (bad?)
df2 = someTransformation(df1)
a = df2.count()
b = df2.first()  # This step could take long, because df2 has to be created all over again

# Approach 2 (good?)
df2 = someTransformation(df1)
df2.cache()
a = df2.count()
b = df2.first()  # Because df2 is already cached, this action is quick
df2.unpersist()
```

The second approach shown above is somewhat clunky: it requires caching any DataFrame that will be acted on more than once, and then calling *unpersist()* later to free up the memory. *So my question is: is the second approach still necessary/desirable when operating on DataFrames in newer versions of Spark (>= 1.6)?*

Thanks!!
Apu
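For context on why the uncached version repeats work: the sketch below is a plain-Python analogy of lazy evaluation, not Spark's actual implementation. The `LazyResult` class, its `computations` counter, and the toy transformation are all made up for illustration; only the `cache()`/`unpersist()` names mirror the Spark API.

```python
# Plain-Python analogy (NOT Spark internals): a lazily computed result
# is recomputed on every "action" unless it has been cached.

class LazyResult:
    """A value computed on demand, loosely analogous to an uncached DataFrame."""
    def __init__(self, compute):
        self._compute = compute   # the "transformation"
        self._cached = None       # filled in only after .cache()
        self.computations = 0     # how many times the transformation actually ran

    def _materialize(self):
        if self._cached is not None:
            return self._cached
        self.computations += 1
        return self._compute()

    def cache(self):
        self._cached = self._compute()
        self.computations += 1
        return self

    def unpersist(self):
        self._cached = None
        return self

    # Two toy "actions"
    def count(self):
        return len(self._materialize())

    def first(self):
        return self._materialize()[0]

# Approach 1: no cache -> the transformation runs once per action
df_a = LazyResult(lambda: [x * 2 for x in range(5)])
df_a.count()
df_a.first()
print(df_a.computations)  # 2

# Approach 2: cache first -> the transformation runs only once
df_b = LazyResult(lambda: [x * 2 for x in range(5)]).cache()
df_b.count()
df_b.first()
print(df_b.computations)  # 1
df_b.unpersist()
```

In real Spark the trade-off is less clear-cut than in this toy: caching itself costs memory and time, so it pays off only when recomputing the lineage is more expensive than storing the result.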
