Hi,

Assume my Spark context is pointing to local[N]. If I create an RDD with
sparkContext.sequenceFile(…) and call .collect() on it immediately (assume it's
small), I sometimes get duplicate rows back. Similarly, if I create the RDD with
sparkContext.sequenceFile(…) and immediately run a transformation or action on
it, I get incorrect results; debugging the code in Eclipse shows that some of
the objects in the RDD are duplicates, even though there are no duplicates in
the original sequence file.
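
For reference, here's a minimal sketch of what I'm running; the path, the
Text/IntWritable key/value types, and the local[4] master are placeholders for
my actual setup:

import org.apache.hadoop.io.{IntWritable, Text}
import org.apache.spark.{SparkConf, SparkContext}

object ReuseSymptom {
  def main(args: Array[String]): Unit = {
    val sparkContext = new SparkContext(
      new SparkConf().setAppName("reuse-symptom").setMaster("local[4]"))

    // Placeholder path and key/value types for my actual data.
    val rdd = sparkContext.sequenceFile(
      "/path/to/data.seq", classOf[Text], classOf[IntWritable])

    // Hadoop's RecordReader re-uses the same Writable instances for every record,
    // so collect() can hand back many tuples referencing the same (mutated) objects,
    // which is where the apparent duplicates come from.
    rdd.collect().foreach { case (k, v) => println(s"$k -> $v") }

    sparkContext.stop()
  }
}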

I know this is related to how Hadoop re-uses Writable objects when reading. I
wrote a workaround that immediately copies the data from the sequence file into
new objects. More specifically:

sparkContext.sequenceFile(…).map(x => new MyClass(x))
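
Spelled out a bit more, assuming a SparkContext named sparkContext is in scope
and using the same placeholder path and Text/IntWritable types as above
(MyClass and its fields are stand-ins for my real class):

import org.apache.hadoop.io.{IntWritable, Text}

// MyClass here is just a stand-in for my real class; the important part is that
// the map copies each record out of the re-used Writables into a fresh object.
case class MyClass(key: String, value: Int)

val copied = sparkContext
  .sequenceFile("/path/to/data.seq", classOf[Text], classOf[IntWritable])
  .map { case (k, v) => MyClass(k.toString, v.get()) }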

Creating the RDD with the above code fixes that problem. However, I now get
out-of-memory errors when I try to do something like sort this data set,
running my code against a 10-node cluster.

My question (assuming you've gotten past my vague explanation of the situation)
is: how does Hadoop's optimization of re-using objects affect the contents of
RDDs when they are collected and/or transformed? And is this something I need
to worry about when RDDs are retrieved with Spark running against a cluster, or
will I only see these anomalies when running Spark on local[N]?

Thanks! Hope that wasn't too confusing,

-Matt Cheah
