I guess that's the answer then , I am mutating my cached objects.
I will probably need to create a new copy of those objects within the transformation to avoid this.


On 12/9/13 2:33 PM, Josh Rosen wrote:
Spark doesn't perform defensive copying before passing cached objects to your user-defined functions, so if you're caching mutable Java objects and mutating them in a transformation, the effects of that mutation might change the cached data and affect other RDDs. Is func2 mutating your cached objects?


On Mon, Dec 9, 2013 at 11:25 AM, Yadid Ayzenberg <[email protected] <mailto:[email protected]>> wrote:

    yes, I understand that I lost the original reference to my RDD.
    my questions is regarding the new reference (made by tempRDD). I
    should have 2 references to two different points in the lineage chain.
    One is myRDD and the other is tempRDD. it seems that after running
    the transformation referred to by tempRDD, I have altered myRDD as
    well.
    (or at least that's what I think is happening). When I try to run
    an additional transformation, it seems that that myRDD is now
    pointing to a point in the lineage which is after the 2nd
    transformation.

    Yadid



    On 12/9/13 2:17 PM, Mark Hamstra wrote:
    Neither map nor mapPartitions mutates an RDD -- if you establish
    an immutable reference to an rdd (e.g., in Scala,  val rdd =
    ...), the data contained within that RDD will be the same
    regardless of any map or mapPartition transformations.  However,
    if you re-assign the reference to point to the transformed RDD
    (as you do with myRDD = myRDD.mapPartitions(...)), then you've
    lost the reference to the original (un-mutated) state of the RDD
    and have only a reference to the RDD-with-tranformation-applied.
     That doesn't make the RDD mutable nor does it make either map or
    mapPartitions a side-effecting mutator -- you've just changed
    where in a lineage of transformations you are pointing to with
    your mutable myRDD reference.


    On Mon, Dec 9, 2013 at 11:06 AM, Yadid Ayzenberg
    <[email protected] <mailto:[email protected]>> wrote:


        Hi all,

        Im noticing some strange behavior when running mapPartitions.
        Pseudo code:

        JavaPairRDD<Object, Tuple2<Object, BSONObject>> myRDD =
        myRDD.mapPartitions( func )

        myRDD.count()

        ArrayList<Tuple2<Integer, Tuple2<List<Tuple2<Double,
        Double>>, List<Tuple2<Double, Double>>>>>tempRDD =
        myRDD.mapPartitions(func2 )

        tempRDD.count()


        JavaPairRDD<Object, Tuple2<Object, BSONObject>> myRDD =
        myRDD.mapPartitions( func )


        It seems that mapPartitions has side-effects. When I try
        running the last line - its seems that contents of myRDD have
        been changed by the previous map. I thought the RDD were
        immutable and that It was only possible to generate new RDDs
        using map. Is this incorrect?


        Thanks,
        Yadid





Reply via email to