I'm talking about RDD1 (not persisted or checkpointed) in this situation:

...(somewhere) -> RDD1 -> RDD2
                   |        |
                   V        V
                  RDD3 -> RDD4 -> Action!

In my experience, the number of times RDD1 gets recalculated is unpredictable: sometimes once, sometimes twice. When calculating this RDD is expensive (e.g. it involves calling a RESTful service that charges me money), this compels me to persist RDD1, which takes extra memory. And in case the Action! doesn't always happen, I don't know when to unpersist it to free that memory.

A related problem might be in SQLContext.jsonRDD(), since the source JSON RDD is used twice (once for schema inference, once for reading the data). This almost guarantees that the source RDD is calculated twice.

Has this problem been addressed so far?

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/If-an-RDD-appeared-twice-in-a-DAG-of-which-calculation-is-triggered-by-a-single-action-will-this-RDD-tp21192.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
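
To make the diamond concrete, here is a toy model in plain Python. It is not Spark code and not a description of Spark's scheduler. `LazyNode`, `persisted`, and `build_diamond` are hypothetical names invented for this sketch. It assumes each node simply re-runs its parents on demand unless caching is switched on, and it wraps the caching in a context manager so the cache is released whether or not the action runs. The rough Spark analogue would be calling `persist()` on RDD1 before the action and `unpersist()` in a `finally` block.

```python
from contextlib import contextmanager


class LazyNode:
    """Toy stand-in for an RDD: a function plus parents, computed on demand."""

    def __init__(self, name, fn, *parents):
        self.name = name
        self.fn = fn
        self.parents = parents
        self.persist_enabled = False  # analogous to having called persist()
        self.cache = None
        self.compute_count = 0        # how many times fn actually ran

    def collect(self):
        # Serve from the cache only when persistence is on and a result exists.
        if self.persist_enabled and self.cache is not None:
            return self.cache
        # Otherwise walk the lineage: every parent is evaluated again.
        inputs = [parent.collect() for parent in self.parents]
        self.compute_count += 1
        result = self.fn(*inputs)
        if self.persist_enabled:
            self.cache = result
        return result


@contextmanager
def persisted(node):
    """Enable caching for the duration of a block, then always release it."""
    node.persist_enabled = True
    try:
        yield node
    finally:
        # Runs even if the action raised or was never triggered.
        node.persist_enabled = False
        node.cache = None


def build_diamond():
    """Build the RDD1 -> (RDD2, RDD3) -> RDD4 shape from the question."""
    rdd1 = LazyNode("RDD1", lambda: [1, 2, 3, 4])  # the "expensive" source
    rdd2 = LazyNode("RDD2", lambda xs: [x + 1 for x in xs], rdd1)
    rdd3 = LazyNode("RDD3", lambda xs: [x * 10 for x in xs], rdd1)
    rdd4 = LazyNode("RDD4", lambda a, b: a + b, rdd2, rdd3)
    return rdd1, rdd4


# Without caching, RDD1 is evaluated once per downstream path.
rdd1, rdd4 = build_diamond()
rdd4.collect()
uncached_count = rdd1.compute_count

# With caching scoped to the block, RDD1 is evaluated once and then released.
rdd1, rdd4 = build_diamond()
with persisted(rdd1):
    rdd4.collect()
cached_count = rdd1.compute_count
cache_released = rdd1.cache is None
```

In this toy model `uncached_count` is 2 and `cached_count` is 1, and the cache is dropped on leaving the `with` block even when `collect()` is never called inside it. Whether real Spark evaluates RDD1 once or twice in a given job depends on details this sketch does not model.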