It sounds a lot like your values are instances of a mutable class that is
being mutated or reused somewhere. That can appear to work fine until you
actually materialize all of the elements (e.g. with collect()) and discover
that many of them point to the same object.
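
One quick way to test that theory (a rough sketch, assuming your Thing class
is mutable and copyable; the copy constructor below is hypothetical, so use
whatever clone/copy mechanism the class actually has): map each element to a
fresh copy before collecting and see whether the duplicates go away.

import org.apache.spark.api.java.JavaRDD;

JavaRDD<Thing> rdd = pairRDD.values();
// Copy each element into a new, independent object so collect() cannot end
// up holding many references to one shared, mutated instance.
JavaRDD<Thing> copied = rdd.map(t -> new Thing(t)); // hypothetical copy constructor
copied.collect().forEach(e -> System.out.println("Collected Foreach: " + e));

If the copied RDD collects correctly, then the original objects are being
reused upstream; a common culprit is an InputFormat or iterator that recycles
the same instance for every record, in which case you need to copy the values
before holding onto them.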

On Thu, Dec 18, 2014 at 10:06 AM, Tristan Blakers <tris...@blackfrog.org> wrote:
> Hi,
>
> I’m getting some seemingly invalid results when I collect an RDD. This is
> happening in both Spark 1.1.0 and 1.2.0, using Java 8 on Mac.
>
> See the following code snippet:
>
> JavaRDD<Thing> rdd = pairRDD.values();
> rdd.foreach( e -> System.out.println( "RDD Foreach: " + e ) );
> rdd.collect().forEach( e -> System.out.println( "Collected Foreach: " + e ) );
>
> I would expect the output of the two loops to be identical, but instead I
> see:
>
> RDD Foreach: Thing1
> RDD Foreach: Thing2
> RDD Foreach: Thing3
> RDD Foreach: Thing4
> (…snip…)
> Collected Foreach: Thing1
> Collected Foreach: Thing1
> Collected Foreach: Thing1
> Collected Foreach: Thing2
>
> So essentially all of the valid entries except one are replaced by an
> equivalent number of duplicates of a single object. I’ve tried various map
> and filter operations, but the results in the RDD always appear correct
> until I try to collect() them. I’ve also found that calling cache() on the
> RDD triggers the duplication earlier, so that the RDD foreach displays the
> duplicates too...
>
> Any suggestions for how I can go about debugging this would be massively
> appreciated.
>
> Cheers
> Tristan
