Hi,

We tried to use *JavaPairRDD.sortByKey()* and were not able to. I'm not fully sure whether this is a bug or we are using the API incorrectly, so I would like to crosscheck on the mailing list first. The unit test is attached.

Essentially, we create a Hadoop sequence file and write different key/value pairs to it. Then we use *JavaSparkContext.sequenceFile().collect()* to read the same pairs back. The data we get, however, is not the data we wrote - we get the same row over and over again. This appears to be caused by the code in *HadoopRDD.compute()*, which creates the mutable key and value objects once and reuses them for each iterated tuple. While this works fine if we just need to compute something based on the data, it does not work if we need to collect some of that data.

The problem occurs both with Java serialization (*org.apache.hadoop.io.serializer.JavaSerialization*) and with the default Hadoop serialization (*org.apache.hadoop.io.serializer.WritableSerialization*), as demonstrated by the corresponding test methods. For the same reason, *JavaPairRDD.sortByKey()* does not work either, which is actually our main problem; this is also demonstrated in a separate test method.
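To illustrate inline (the attached test is the complete reproduction), here is a minimal sketch of what we are seeing and of a copy-before-collect workaround. The path "/tmp/pairs.seq" and the class name are placeholders, and the sketch assumes the Spark Java API with Java 8 lambdas:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

import java.util.List;

public class SequenceFileRepro {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext("local", "repro");

        // Read back a sequence file of distinct IntWritable/Text pairs.
        JavaPairRDD<IntWritable, Text> pairs =
            sc.sequenceFile("/tmp/pairs.seq", IntWritable.class, Text.class);

        // Because HadoopRDD.compute() reuses a single mutable key/value
        // instance, every Tuple2 below references the same two objects:
        // the collected list shows the last record repeated.
        List<Tuple2<IntWritable, Text>> collected = pairs.collect();
        for (Tuple2<IntWritable, Text> t : collected) {
            System.out.println(t._1() + " -> " + t._2());
        }

        // Workaround: copy each record into fresh immutable values before
        // collecting, so the reused Writables are no longer shared. With
        // copied keys, sortByKey() also behaves as expected.
        List<Tuple2<Integer, String>> copied = pairs
            .mapToPair(t -> new Tuple2<>(t._1().get(), t._2().toString()))
            .sortByKey()
            .collect();

        sc.stop();
    }
}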
If this is indeed a bug we can raise an issue in JIRA.

--
Best regards,
Sergey Parhomenko
[Attachment: SparkBugTest.java]
