Hi,

I have a Spark app which internally splits into 2 jobs because we write to 2 different Cassandra tables. The input data comes from the same Cassandra table, so after reading the data from Cassandra and applying a few transformations, I cache one of the RDDs and fork the program to compute both metrics. What I would expect is that the 2nd job skips all the transformations prior to the cached RDD and starts running from there. Instead, the entire stage in which the caching was done gets re-executed up to the caching transformation, and only then does the job continue the way it is supposed to.
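For reference, the code is roughly shaped like the sketch below. The keyspace, table, column names and the actual transformations are just placeholders to illustrate the structure (read from Cassandra, join + map, cache, then two separate actions that write to two tables and so trigger two jobs):

import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._

object MetricsApp {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("metrics-app"))

    // Shared input read from Cassandra (names are placeholders).
    val events = sc.cassandraTable("my_keyspace", "events")
      .map(row => (row.getString("key"), row.getLong("value")))

    val lookup = sc.cassandraTable("my_keyspace", "lookup")
      .map(row => (row.getString("key"), row.getString("category")))

    // Join + map, then cache the RDD that both downstream jobs depend on.
    val joined = events.join(lookup)
      .map { case (_, (value, category)) => (category, value) }
      .cache()

    // Fork: each action below triggers its own job, writing to a different table.
    joined.reduceByKey(_ + _)
      .saveToCassandra("my_keyspace", "metric_totals", SomeColumns("category", "total"))

    joined.mapValues(_ => 1L).reduceByKey(_ + _)
      .saveToCassandra("my_keyspace", "metric_counts", SomeColumns("category", "count"))
  }
}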
Here are the DAGs from the two jobs:

Job 0: [image: Inline image 1]
Job 1: [image: Inline image 2]

The RDD in Job 0, Stage 4 produced after the join and map is cached, as indicated by the green dot. The corresponding stage in Job 1 is Stage 9, which re-executes the join and map before continuing along the regular path.

Another interesting observation: the final RDD in Stage 0 is also a cached RDD and is used in Stage 2. When we kick off the job, Stage 0 and Stage 2 are scheduled in parallel, but Stage 2 waits for a long time and only does some computation after 70-80% of Stage 0 is done. I assume it is waiting for the cache to be built by Stage 0, and the computation it then does is the last map transformation in Stage 2, which suggests caching works as expected within a single job.

I want to understand this in depth, because re-computing an entire stage can mean re-reading data from Cassandra or shuffling a lot of data.

Thanks,
Sourabh