understanding iterative algorithms in Spark

Raghava Mon, 25 Jan 2016 14:30:12 -0800

Hello All,

I am new to Spark and I am trying to understand how the iterative application
of operations is handled in Spark. Consider the following program in Scala.


var u = sc.textFile(args(0)+"s1.txt").map(line => {
        line.split("\\|") match { case Array(x,y) => (y.toInt,x.toInt)}})
u.cache()
println("Before iteration u count: "+u.count())

val type1 = sc.textFile(args(0)+"Type1.txt").map(line => {
        line.split("\\|") match { case Array(x,y) => (x.toInt,y.toInt)}}) 
type1.cache()
println("Type1 count: " + type1.count())

var counter=0
while(counter < 20) {
        val r1Join = type1.join(u).map( { case (k,v) => v}).cache()        
        u = u.union(r1Join).distinct.cache()                
        //testing checkpoint
        if(counter == 4)
                u.checkpoint()
        println("u count: "+u.count())
        counter += 1
}

I have attached the DAG visualizations from the UI at various iterations.

I have the following questions. It would be of great help if someone can
answer them.

1) When we cache an RDD, is it safe to say that it will not be recomputed?
For example, in dag1.png, can we assume that none of the green map dots will
be recomputed?

2) In dag1.png, for the join in stage4, we expected one input to be the output
of stage3 (which is what happens) and the other input to be the output of
stage2. The latter does not happen. Why is this the case?

3) In dag1.png, why is stage5 not part of stage4? Why are distinct and
distinct + cache separated? Will distinct be run twice?

4) In dag4.png, we expected the input of the join in stage21 to come from the
output of stage19, but instead it gets recomputed at the beginning of
stage21. Why would distinct get recomputed at the beginning of each
iteration?

5) In dag2.png, the join operation is represented by 3 boxes. What does this
mean?

6) In dag4.png, there are several "skipped" stages. Is it safe to assume
that the skipped stages are not recomputed?
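
In case it helps anyone reproduce this without the input files, here is a
minimal sketch of the same loop that prints the storage level, checkpoint flag,
and lineage of u after every iteration, which may be easier to compare against
the DAG pictures. The object name LineageInspection, the helper iterate, the
local[2] master, and the tiny in-memory sample data are placeholders made up
for this sketch; they are not part of the program above.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical helper object, introduced only for this illustration.
object LineageInspection {

  // Same join/union/distinct step as the loop above, repeated `iterations` times.
  // After each iteration the storage level, checkpoint flag, and lineage are printed.
  def iterate(type1: RDD[(Int, Int)],
              start: RDD[(Int, Int)],
              iterations: Int): RDD[(Int, Int)] = {
    var u = start
    for (i <- 0 until iterations) {
      val r1Join = type1.join(u).map { case (_, v) => v }
      u = u.union(r1Join).distinct().cache()
      // count() is an action, so it materializes u for this iteration.
      println(s"iteration $i, u count: ${u.count()}")
      println(s"storage level: ${u.getStorageLevel.description}")
      println(s"checkpointed: ${u.isCheckpointed}")
      // toDebugString prints the lineage of u, including storage
      // information for RDDs that have been persisted.
      println(u.toDebugString)
    }
    u
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local master, so the sketch runs without a cluster.
    val conf = new SparkConf().setAppName("lineage-inspection").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Assumption: small made-up data standing in for Type1.txt and s1.txt.
      val type1 = sc.parallelize(Seq((1, 2), (2, 3))).cache()
      val u0 = sc.parallelize(Seq((1, 10))).cache()
      iterate(type1, u0, 3)
    } finally {
      sc.stop()
    }
  }
}
```

The checkpoint call is left out of the sketch on purpose, so the printed
lineage shows only the effect of cache().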

Thanks in advance.

<http://apache-spark-user-list.1001560.n3.nabble.com/file/n26064/dag1.png> 
<http://apache-spark-user-list.1001560.n3.nabble.com/file/n26064/dag2.png> 
<http://apache-spark-user-list.1001560.n3.nabble.com/file/n26064/dag3.png> 
<http://apache-spark-user-list.1001560.n3.nabble.com/file/n26064/dag4.png> 





