Hi,

Let's say I have the following code (it's just an example):
    df_a = spark.read.json("path/to/input.json")  # placeholder path
    df_b = df_a.sample(False, 0.5, 10)
    df_c = df_a.sample(False, 0.5, 10)
    df_d = df_b.union(df_c)
    df_d.count()

Do we have to cache df_a, since it is used by both df_b and df_c, or will Spark notice that df_a appears twice in the same DAG and avoid computing it twice?

Thanks,
Maxime
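P.S. To make the question concrete, here is a minimal sketch of what explicitly caching df_a would look like (the input path and app name are placeholders, not from the real job):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("cache-example").getOrCreate()

    df_a = spark.read.json("path/to/input.json")  # placeholder path
    df_a.cache()  # mark df_a for caching; it is materialized on the first action

    df_b = df_a.sample(False, 0.5, 10)
    df_c = df_a.sample(False, 0.5, 10)
    df_d = df_b.union(df_c)

    # Without cache(), each branch of the union recomputes df_a's lineage
    # (re-reading the JSON); with it, the second sample can reuse the
    # cached partitions.
    df_d.count()

    df_a.unpersist()  # free the cached blocks once df_a is no longer needed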