Hi,

Let's say I have the following code (it's just an example):
    df_a = spark.read.json("path/to/input.json")  # placeholder path
    df_b = df_a.sample(False, 0.5, 10)
    df_c = df_a.sample(False, 0.5, 10)
    df_d = df_b.union(df_c)
    df_d.count()

Do we have to cache df_a, since it is used by both df_b and df_c, or will Spark notice that df_a appears twice in the same DAG and avoid computing it twice?

Thanks,
Maxime
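P.S. To make the question concrete, here is a minimal sketch of what explicitly caching df_a would look like (the input path and app name are placeholders, not from the real job):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("cache-example").getOrCreate()

    df_a = spark.read.json("path/to/input.json")  # placeholder path
    df_a.cache()  # mark df_a for caching; it is materialized on the first action

    df_b = df_a.sample(False, 0.5, 10)
    df_c = df_a.sample(False, 0.5, 10)
    df_d = df_b.union(df_c)

    # Without cache(), each branch of the union recomputes df_a's lineage
    # (re-reading the JSON); with it, the second sample can reuse the
    # cached partitions.
    df_d.count()

    df_a.unpersist()  # free the cached blocks once df_a is no longer needed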