Spark doesn't perform defensive copying before passing cached objects to your user-defined functions, so if you're caching mutable Java objects and mutating them in a transformation, the effects of that mutation might change the cached data and affect other RDDs. Is func2 mutating your cached objects?
On Mon, Dec 9, 2013 at 11:25 AM, Yadid Ayzenberg <[email protected]>wrote: > yes, I understand that I lost the original reference to my RDD. > my questions is regarding the new reference (made by tempRDD). I should > have 2 references to two different points in the lineage chain. > One is myRDD and the other is tempRDD. it seems that after running the > transformation referred to by tempRDD, I have altered myRDD as well. > (or at least that's what I think is happening). When I try to run an > additional transformation, it seems that that myRDD is now pointing to a > point in the lineage which is after the 2nd transformation. > > Yadid > > > > On 12/9/13 2:17 PM, Mark Hamstra wrote: > > Neither map nor mapPartitions mutates an RDD -- if you establish an > immutable reference to an rdd (e.g., in Scala, val rdd = ...), the data > contained within that RDD will be the same regardless of any map or > mapPartition transformations. However, if you re-assign the reference to > point to the transformed RDD (as you do with myRDD = > myRDD.mapPartitions(...)), then you've lost the reference to the original > (un-mutated) state of the RDD and have only a reference to the > RDD-with-tranformation-applied. That doesn't make the RDD mutable nor does > it make either map or mapPartitions a side-effecting mutator -- you've just > changed where in a lineage of transformations you are pointing to with your > mutable myRDD reference. > > > On Mon, Dec 9, 2013 at 11:06 AM, Yadid Ayzenberg <[email protected]>wrote: > >> >> Hi all, >> >> Im noticing some strange behavior when running mapPartitions. Pseudo code: >> >> JavaPairRDD<Object, Tuple2<Object, BSONObject>> myRDD = >> myRDD.mapPartitions( func ) >> >> myRDD.count() >> >> ArrayList<Tuple2<Integer, Tuple2<List<Tuple2<Double, Double>>, >> List<Tuple2<Double, Double>>>>>tempRDD = myRDD.mapPartitions(func2 ) >> >> tempRDD.count() >> >> >> JavaPairRDD<Object, Tuple2<Object, BSONObject>> myRDD = >> myRDD.mapPartitions( func ) >> >> >> It seems that mapPartitions has side-effects. When I try running the last >> line - its seems that contents of myRDD have been changed by the previous >> map. I thought the RDD were immutable and that It was only possible to >> generate new RDDs using map. Is this incorrect? >> >> >> Thanks, >> Yadid >> >> > >
