Hi. I have an RDD that I reuse across many iterations of an algorithm. To prevent recomputation, I persist the RDD (and, incidentally, I also persist and checkpoint its parents):
val consCostConstraintMap = consCost.join(constraintMap).map {
  case (cid, (costs, (mid1, _, mid2, _, _))) => (cid, (costs, mid1, mid2))
}
consCostConstraintMap.setName("consCostConstraintMap")
consCostConstraintMap.persist(MEMORY_AND_DISK_SER)

... later on, in an iterative loop:

val update = updatedTrips.join(consCostConstraintMap).flatMap { ... }.treeReduce()

---------

I can see from the UI that consCostConstraintMap is in storage:

RDD Name: consCostConstraintMap <http://ec2-54-151-185-196.ap-southeast-1.compute.amazonaws.com:4040/storage/rdd?id=113>
Storage Level: Memory Serialized 1x Replicated
Cached Partitions: 600
Fraction Cached: 100%
Size in Memory: 15.2 GB
Size in Tachyon: 0.0 B
Size on Disk: 0.0 B

---------

In the Jobs list I see the following pattern, where each treeReduce line corresponds to one iteration of the loop:

Job Id | Description | Submitted | Duration | Stages: Succeeded/Total | Tasks (for all stages): Succeeded/Total
13 | treeReduce at reconstruct.scala:243 <http://ec2-54-151-185-196.ap-southeast-1.compute.amazonaws.com:8080/history/app-20150522160613-0001/jobs/job?id=13> | 2015/05/22 16:27:11 | 2.9 min | 16/16 (194 skipped) | 9024/9024 (109225 skipped)
12 | treeReduce at reconstruct.scala:243 <http://ec2-54-151-185-196.ap-southeast-1.compute.amazonaws.com:8080/history/app-20150522160613-0001/jobs/job?id=12> | 2015/05/22 16:24:16 | 2.9 min | 16/16 (148 skipped) | 9024/9024 (82725 skipped)
11 | treeReduce at reconstruct.scala:243 <http://ec2-54-151-185-196.ap-southeast-1.compute.amazonaws.com:8080/history/app-20150522160613-0001/jobs/job?id=11> | 2015/05/22 16:21:21 | 2.9 min | 16/16 (103 skipped) | 9024/9024 (56280 skipped)
10 | treeReduce at reconstruct.scala:243 <http://ec2-54-151-185-196.ap-southeast-1.compute.amazonaws.com:8080/history/app-20150522160613-0001/jobs/job?id=10> | 2015/05/22 16:18:28 | 2.9 min | 16/16 (69 skipped) | 9024/9024 (36980 skipped)

--------------

If I push into one Job, I see:
Completed Stages: 16 <http://ec2-54-151-185-196.ap-southeast-1.compute.amazonaws.com:8080/history/app-20150522160613-0001/jobs/job/?id=12#completed> - Skipped Stages: 148 <http://ec2-54-151-185-196.ap-southeast-1.compute.amazonaws.com:8080/history/app-20150522160613-0001/jobs/job/?id=12#skipped>

Completed Stages (16)

Stage Id | Description | Submitted | Duration | Tasks: Succeeded/Total | Input | Output | Shuffle Read | Shuffle Write
525 | treeReduce at reconstruct.scala:243 <http://ec2-54-151-185-196.ap-southeast-1.compute.amazonaws.com:8080/history/app-20150522160613-0001/stages/stage?id=525&attempt=0> +details | 2015/05/22 16:27:09 | 42 ms | 24/24 | | | 21.7 KB |
524 | .......
519 | map at reconstruct.scala:153 <http://ec2-54-151-185-196.ap-southeast-1.compute.amazonaws.com:8080/history/app-20150522160613-0001/stages/stage?id=519&attempt=0> +details | 2015/05/22 16:24:16 | 1.2 min | 600/600 | | | 14.8 GB | 8.4 GB

The last line, map at reconstruct.scala:153 <http://ec2-54-151-185-196.ap-southeast-1.compute.amazonaws.com:8080/history/app-20150522160613-0001/stages/stage?id=519&attempt=0>, corresponds to

val consCostConstraintMap = consCost.join(constraintMap).map {

which I expected to have been cached. Is there some way I can find out what it is spending 1.2 min doing? I presume it is reading and writing GB of data, but why? Everything should be in memory. Any clues on where I should start?

tks
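[Editor's note: one thing worth ruling out first is when the RDD actually gets materialized. persist() is lazy, so the cache is only populated by the first action that touches the RDD; forcing it with a cheap action before the loop, then checking the storage level from the driver, makes it easier to tell whether later jobs are genuinely recomputing the map stage. A minimal sketch, reusing the names from the post above (the count() and the driver-side checks are additions, not from the original code):]

    import org.apache.spark.storage.StorageLevel

    val consCostConstraintMap = consCost.join(constraintMap).map {
      case (cid, (costs, (mid1, _, mid2, _, _))) => (cid, (costs, mid1, mid2))
    }
    consCostConstraintMap.setName("consCostConstraintMap")
    consCostConstraintMap.persist(StorageLevel.MEMORY_AND_DISK_SER)

    // Force materialization now, before entering the iterative loop, so the
    // very first treeReduce job should already be able to skip the map stage.
    consCostConstraintMap.count()

    // Driver-side sanity checks: confirm the storage level took effect and
    // that the RDD is registered as persistent with the SparkContext.
    println(consCostConstraintMap.getStorageLevel)
    println(sc.getPersistentRDDs.mapValues(_.name))

If the map stage still runs for 1.2 min after an explicit materialization, the UI's Storage tab numbers are worth re-checking at that moment: cached partitions can be evicted under memory pressure between iterations, which would force recomputation even though the page showed 100% cached earlier.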