Hi,

We tried to use *JavaPairRDD.sortByKey()* and were not able to. I'm not fully sure whether this is a bug or we are using the API incorrectly, so I would like to crosscheck on the mailing list first. The unit test is attached.

Essentially, we create a Hadoop sequence file and write different key/value pairs to it. Then we use *JavaSparkContext.sequenceFile().collect()* to read the same pairs back. The data we get, however, is not the data we wrote - we get the same row over and over again. This appears to be caused by the code in *HadoopRDD.compute()*, which creates the mutable key and value objects once and reuses them for each iterated tuple. While this works fine if we just need to compute something based on the data, it does not work if we need to collect some of that data.

The problem occurs both with Java serialization (*org.apache.hadoop.io.serializer.JavaSerialization*) and with the default Hadoop serialization (*org.apache.hadoop.io.serializer.WritableSerialization*), as demonstrated by the corresponding test methods. For the same reason, *JavaPairRDD.sortByKey()* does not work either, which is actually our main problem; this is also demonstrated in a separate test method.
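To illustrate inline (the attached test is the complete reproduction), here is a minimal sketch of what we are seeing and of a copy-before-collect workaround. The path "/tmp/pairs.seq" and the class name are placeholders, and the sketch assumes the Spark Java API with Java 8 lambdas:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

import java.util.List;

public class SequenceFileRepro {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext("local", "repro");

        // Read back a sequence file of distinct IntWritable/Text pairs.
        JavaPairRDD<IntWritable, Text> pairs =
            sc.sequenceFile("/tmp/pairs.seq", IntWritable.class, Text.class);

        // Because HadoopRDD.compute() reuses a single mutable key/value
        // instance, every Tuple2 below references the same two objects:
        // the collected list shows the last record repeated.
        List<Tuple2<IntWritable, Text>> collected = pairs.collect();
        for (Tuple2<IntWritable, Text> t : collected) {
            System.out.println(t._1() + " -> " + t._2());
        }

        // Workaround: copy each record into fresh immutable values before
        // collecting, so the reused Writables are no longer shared. With
        // copied keys, sortByKey() also behaves as expected.
        List<Tuple2<Integer, String>> copied = pairs
            .mapToPair(t -> new Tuple2<>(t._1().get(), t._2().toString()))
            .sortByKey()
            .collect();

        sc.stop();
    }
}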
If this is indeed a bug we can raise an issue in JIRA.

--
Best regards,
Sergey Parhomenko
[Attachment: SparkBugTest.java]
