I asked this question too soon. I am caching a bunch of RDDs in a TrieMap so that our framework can wire them together, and the locking was not completely correct: at times it created multiple new RDDs instead of returning the cached versions, and those duplicates had completely separate lineages.
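For anyone hitting the same thing, here is a minimal sketch of the kind of atomic get-or-create we needed (RddRegistry, getOrCreate, and the key are hypothetical names for illustration, not our actual framework code). The point is that every caller asking for a given key must receive the same RDD instance; otherwise each caller builds its own lineage and the .cache() is never shared:

import scala.collection.concurrent.TrieMap
import org.apache.spark.rdd.RDD

class RddRegistry {
  private val registry = TrieMap.empty[String, RDD[_]]

  def getOrCreate[T](key: String)(build: => RDD[T]): RDD[T] = {
    registry.get(key) match {
      case Some(existing) => existing.asInstanceOf[RDD[T]]
      case None =>
        // Two threads may race here and both build a candidate RDD...
        val candidate = build
        registry.putIfAbsent(key, candidate) match {
          // ...but putIfAbsent guarantees only one instance is ever handed out,
          // so all downstream RDDs hang off the same cached lineage.
          case Some(winner) => winner.asInstanceOf[RDD[T]]
          case None         => candidate
        }
    }
  }
}

A losing candidate RDD is simply discarded before any action runs on it, so no duplicate work is scheduled.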
What's strange is that this bug only surfaced when I updated Spark.

On Wed, Jan 7, 2015 at 9:12 AM, Corey Nolet <cjno...@gmail.com> wrote:

> We just updated to Spark 1.2.0 from Spark 1.1.0. We have a small framework
> that we've been developing that connects various RDDs together based on
> some predefined business cases. After updating to 1.2.0, some of the
> concurrency expectations about how the stages within jobs are executed
> have changed quite significantly.
>
> Given 3 RDDs:
>
> RDD1 = inputFromHDFS().groupBy().sortBy().etc().cache()
> RDD2 = RDD1.outputToFile
> RDD3 = RDD1.groupBy().outputToFile
>
> In Spark 1.1.0, we expected RDD1 to be scheduled based on the first stage
> encountered (RDD2's outputToFile or RDD3's groupBy()), and for RDD2 and
> RDD3 to both block waiting for RDD1 to complete and cache, at which point
> RDD2 and RDD3 would both use the cached version to complete their work.
>
> Spark 1.2.0 seems to schedule two (albeit concurrently running) stages for
> each of RDD1's stages (inputFromHDFS, groupBy(), sortBy(), etc() each get
> run twice). It does not look like there is any sharing of results between
> these jobs.
>
> Are we doing something wrong? Is there a setting that I'm not
> understanding somewhere?
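For context, here is the pattern from the original mail in runnable form, with the actions run one after the other for simplicity. inputFromHDFS/outputToFile/etc() are placeholders in the original; I'm assuming textFile/saveAsTextFile and a simple map stand in for them, and the paths are made up. Once RDD1 resolves to a single cached instance, the second action reuses the cached partitions instead of recomputing the lineage:

import org.apache.spark.{SparkConf, SparkContext}

object SharedLineageExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("shared-lineage"))

    // RDD1: a single instance, cached, shared by both downstream actions.
    val rdd1 = sc.textFile("hdfs:///input")        // stand-in for inputFromHDFS()
      .groupBy(line => line.take(1))
      .sortBy(_._1)
      .cache()

    // RDD2: the first action computes RDD1 and populates the cache.
    rdd1.saveAsTextFile("hdfs:///out/rdd2")        // stand-in for outputToFile

    // RDD3: the second action reads RDD1 from the cache rather than re-running
    // groupBy/sortBy, because it references the same RDD instance.
    rdd1.map { case (key, values) => (key, values.size) }
      .saveAsTextFile("hdfs:///out/rdd3")

    sc.stop()
  }
}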