In general, you shouldn’t be mutating data in RDDs. That will make it 
impossible to recover from faults.

In this particular case, you got 1 and 2 because the RDD isn’t cached. You just 
get the same list you called parallelize() with each time you iterate through 
it. But caching it and modifying it in place would not be a good idea — use a 
map() to create a new RDD instead.
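
For example, here is a minimal sketch of the map()-based approach (reusing the toy A from the message below, but as an immutable case class):

case class A(a: Int)

val as = sc.parallelize(List(A(1), A(2)))

// Without caching, each action recomputes the partitions from the original
// parallelized list, so any mutation done inside a foreach is simply lost.
// Instead, build a new RDD that holds the updated values.
val updated = as.map(x => x.copy(a = 100))

updated.collect()  // Array(A(100), A(100))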

Matei

On Nov 6, 2013, at 5:41 PM, Hao REN <julien19890...@gmail.com> wrote:

> 'map' works as expected. The object is mutable here because of the use case: 
> the data needs to be updated every day.
> Wondering what the best way to do that is. Not sure that Spark supports 
> updates well.
> 
> 
> 2013/11/6 Mohit Jaggi <mohit.ja...@ayasdi.com>
> my guess is you need to use a map for this. foreach is for side effects and I 
> am not sure if changing the object itself is an expected use. also, the 
> objects are supposed to be immutable; yours isn't.
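> 
> A quick sketch of the distinction, with made-up values just for illustration:
> 
> val nums = sc.parallelize(List(1, 2, 3))
> 
> // foreach is an action run only for its side effects; it returns Unit
> // and does not produce a new RDD.
> nums.foreach(n => println(n))
> 
> // map is a transformation: it builds a new RDD from the existing
> // (immutable) values without modifying anything in place.
> val doubled = nums.map(_ * 2)
> doubled.collect()  // Array(2, 4, 6)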
> 
> 
> On Tue, Nov 5, 2013 at 4:40 PM, Hao REN <julien19890...@gmail.com> wrote:
> Hi,
> 
> Just a quick question:
> 
> When playing with Spark using the toy code below, I get some unexpected results.
> 
> 
> case class A(var a: Int) {
>     def setA() = { a = 100 }
> }
> 
> val as = sc.parallelize(List(A(1), A(2)))   // this is an RDD[A]
> 
> as.foreach(_.setA())
> 
> as.collect  // it gives Array[this.A] = Array(A(1), A(2))
> 
> 
> The expected result is Array(A(100), A(100)). I am just trying to update the 
> contents of the A objects that reside in the RDD.
> 
> 1) Does foreach do the right thing here?
> 2) What is the best way to update the objects in an RDD? Should I use 'map' instead?
> 
> Thank you.
> 
> Hao
> 
> -- 
> REN Hao
> 
> Data Engineer @ ClaraVista
> 
> Paris, France
> 
> Tel:  +33 06 14 54 57 24
