Hi Randy,

In Spark 1.0 a lot of work went into making it possible to unpersist data
that's no longer needed.  See the pull request below.

Try calling kvGlobal.unpersist() on line 11, before re-broadcasting the next
variable, to see if you can cut the dependency there.

https://github.com/apache/spark/pull/126
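
Concretely, your loop would look roughly like this (just a sketch; I've
guessed a reduce function for reduceByKey since your snippet elides it):

  for (i <- 0 until n) {
    rdd1 = rdd2.map { case t => doSomething(t, kvGlobal.value) }.cache()
    val tmp = rdd1.reduceByKey(_ + _).collect()  // forces rdd1 into the cache
    kv = updateKV(tmp)
    kvGlobal.unpersist()           // release the old broadcast's blocks
    kvGlobal = sc.broadcast(kv)    // then broadcast the updated kv
    rdd2 = rdd1
  }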

Alternatively, it sounds like your algorithm needs some additional state to
join against in order to produce each successive iteration of the RDD.  Have
you considered storing that state in an RDD rather than in a broadcast
variable?
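
A rough sketch of that shape (hypothetical types and helper logic, not your
actual algorithm):

  import org.apache.spark.SparkContext
  import org.apache.spark.SparkContext._   // pair-RDD functions (join, etc.)
  import org.apache.spark.rdd.RDD

  def iterate(sc: SparkContext, n: Int): Unit = {
    var data: RDD[(Long, Double)] = sc.parallelize(Seq((1L, 1.0), (2L, 2.0)))
    // the per-key state lives in an RDD instead of a broadcast variable
    var state: RDD[(Long, Double)] = data.mapValues(_ => 0.0)

    for (i <- 0 until n) {
      // the join replaces kvGlobal.value: each record sees its own state
      data = data.join(state).mapValues { case (v, s) => v + s }.cache()
      // derive the next iteration's state from the new data
      state = data.reduceByKey(_ + _).mapValues(_ / 2.0)
    }
    data.saveAsTextFile("output")  // placeholder path
  }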

Andrew


On Wed, May 7, 2014 at 10:02 PM, randylu <randyl...@gmail.com> wrote:

> But when I put the broadcast variable outside the for loop, it works well
> (if we're not concerned about the memory issue you pointed out):
>  1  var rdd1 = ...
>  2  var rdd2 = ...
>  3  var kv = ...
>  4  var kvGlobal = sc.broadcast(kv)               // broadcast kv
>  5  for (i <- 0 until n) {
>  6    rdd1 = rdd2.map {
>  7      case t => doSomething(t, kvGlobal.value)
>  8    }.cache()
>  9    var tmp = rdd1.reduceByKey().collect()
> 10    kv = updateKV(tmp)                     // update kv for each iteration
> 11    kvGlobal = sc.broadcast(kv)               // broadcast kv
> 12    rdd2 = rdd1
> 13 }
> 14 rdd2.saveAsTextFile()
