I guess that's the answer, then: I am mutating my cached objects.
I will probably need to create a new copy of those objects within the
transformation to avoid this.
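Something like this is what I have in mind -- just a sketch (the exact
mapPartitions signature differs across Spark versions, and copying the
value via BasicBSONObject.putAll is an assumption about my value type):

    // Sketch: copy each value before mutating it, so the cached objects
    // stay untouched. Uses scala.Tuple2, java.util.*, org.bson.*, and
    // Spark's PairFlatMapFunction.
    JavaPairRDD<Object, Tuple2<Object, BSONObject>> copied =
        myRDD.mapPartitions(
            new PairFlatMapFunction<Iterator<Tuple2<Object, Tuple2<Object, BSONObject>>>,
                                    Object, Tuple2<Object, BSONObject>>() {
                public Iterable<Tuple2<Object, Tuple2<Object, BSONObject>>> call(
                        Iterator<Tuple2<Object, Tuple2<Object, BSONObject>>> it) {
                    List<Tuple2<Object, Tuple2<Object, BSONObject>>> out =
                        new ArrayList<Tuple2<Object, Tuple2<Object, BSONObject>>>();
                    while (it.hasNext()) {
                        Tuple2<Object, Tuple2<Object, BSONObject>> t = it.next();
                        BSONObject copy = new BasicBSONObject();
                        copy.putAll(t._2()._2());   // defensive copy of the cached value
                        // ... mutate 'copy' here instead of the cached object ...
                        out.add(new Tuple2<Object, Tuple2<Object, BSONObject>>(
                            t._1(),
                            new Tuple2<Object, BSONObject>(t._2()._1(), copy)));
                    }
                    return out;
                }
            });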
On 12/9/13 2:33 PM, Josh Rosen wrote:
Spark doesn't perform defensive copying before passing cached objects
to your user-defined functions, so if you're caching mutable Java
objects and mutating them in a transformation, that mutation can
change the cached data and affect other RDDs. Is func2 mutating your
cached objects?
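For illustration, a pattern like this would do it (a contrived sketch,
not your code -- someRDD, the element type, and the "flag" field are
made up):

    // With the default deserialized in-memory cache, 'doc' below is the
    // very object held in the cache, so put() rewrites the cached data
    // in place.
    JavaRDD<BSONObject> cached = someRDD.cache();
    cached.count();  // materializes the cache

    JavaRDD<BSONObject> flagged = cached.map(
        new Function<BSONObject, BSONObject>() {
            public BSONObject call(BSONObject doc) {
                doc.put("flag", true);  // mutates the cached object!
                return doc;
            }
        });
    flagged.count();

    // Any later action on 'cached' now sees the extra "flag" field.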
On Mon, Dec 9, 2013 at 11:25 AM, Yadid Ayzenberg <[email protected]> wrote:
Yes, I understand that I lost the original reference to my RDD.
My question is regarding the new reference (made by tempRDD). I
should have two references to two different points in the lineage
chain: one is myRDD and the other is tempRDD. It seems that after
running the transformation referred to by tempRDD, I have altered
myRDD as well (or at least that's what I think is happening). When
I try to run an additional transformation, it seems that myRDD is
now pointing to a point in the lineage which is after the second
transformation.
Yadid
On 12/9/13 2:17 PM, Mark Hamstra wrote:
Neither map nor mapPartitions mutates an RDD -- if you establish
an immutable reference to an RDD (e.g., in Scala, val rdd =
...), the data contained within that RDD will be the same
regardless of any map or mapPartitions transformations. However,
if you re-assign the reference to point to the transformed RDD
(as you do with myRDD = myRDD.mapPartitions(...)), then you've
lost the reference to the original (un-mutated) state of the RDD
and have only a reference to the RDD-with-transformation-applied.
That doesn't make the RDD mutable, nor does it make either map or
mapPartitions a side-effecting mutator -- you've just changed
where in a lineage of transformations you are pointing with
your mutable myRDD reference.
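To illustrate with a minimal Java sketch ('sc' is a JavaSparkContext
and the input path is a placeholder):

    JavaRDD<String> original = sc.textFile("input.txt");
    JavaRDD<String> upper = original.map(
        new Function<String, String>() {
            public String call(String line) { return line.toUpperCase(); }
        });
    // 'original' still yields the raw lines and 'upper' is a distinct RDD
    // further down the lineage. Writing original = original.map(...) would
    // only re-point the variable, not alter any RDD's data.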
On Mon, Dec 9, 2013 at 11:06 AM, Yadid Ayzenberg <[email protected]> wrote:
Hi all,
I'm noticing some strange behavior when running mapPartitions.
Pseudo code:
JavaPairRDD<Object, Tuple2<Object, BSONObject>> myRDD =
    myRDD.mapPartitions(func);
myRDD.count();

JavaRDD<Tuple2<Integer, Tuple2<List<Tuple2<Double, Double>>,
    List<Tuple2<Double, Double>>>>> tempRDD = myRDD.mapPartitions(func2);
tempRDD.count();

myRDD = myRDD.mapPartitions(func);
It seems that mapPartitions has side effects. When I try
running the last line, it seems that the contents of myRDD have
been changed by the previous map. I thought RDDs were
immutable and that it was only possible to generate new RDDs
using map. Is this incorrect?
Thanks,
Yadid