Hello All, I am new to Spark and I am trying to understand how iterative application of operations are handled in Spark. Consider the following program in Scala.
var u = sc.textFile(args(0)+"s1.txt").map(line => { line.split("\\|") match { case Array(x,y) => (y.toInt,x.toInt)}}) u.cache() println("Before iteration u count: "+u.count()) val type1 = sc.textFile(args(0)+"Type1.txt").map(line => { line.split("\\|") match { case Array(x,y) => (x.toInt,y.toInt)}}) type1.cache() println("Type1 count: " + type1.count()) var counter=0 while(counter < 20) { val r1Join = type1.join(u).map( { case (k,v) => v}).cache() u = u.union(r1Join).distinct.cache() //testing checkpoint if(counter == 4) u.checkpoint() println("u count: "+u.count()) counter += 1 } >From the UI, I have attached the DAG visualizations at various iterations. I have the following questions. It would be of great help if someone can answer them. 1) When we cache a RDD, is it safe to say that it will not be recomputed? For example in dag1.png, all the green map dots will not be recomputed. 2) In dag1.png, for stage4 join, we expected one input to be the output of stage3 (this is as per our expectation) and the other input to be the output of stage2. The latter does not happen. Why is this the case? 3) In dag1.png, why is stage5 not part of stage4? Why is distinct and distinct + cache separated? Will distinct be run twice? 4) In dag4.png, we expected the input of join in stage21 would come from the output of stage19 but instead, it gets recomputed at the beginning of stage21. Why would distinct gets recomputed at the beginning of each iteration? 5) In dag2.png, join operation is represented by 3 boxes. What does this mean? 6) In dag4.png, there are several "skipped" stages. Is it safe to assume that the skipped stages not recomputed again? Thanks in advance. <http://apache-spark-user-list.1001560.n3.nabble.com/file/n26064/dag1.png> <http://apache-spark-user-list.1001560.n3.nabble.com/file/n26064/dag2.png> <http://apache-spark-user-list.1001560.n3.nabble.com/file/n26064/dag3.png> <http://apache-spark-user-list.1001560.n3.nabble.com/file/n26064/dag4.png> -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/understanding-iterative-algorithms-in-Spark-tp26064.html Sent from the Apache Spark User List mailing list archive at Nabble.com. --------------------------------------------------------------------- To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org