https://groups.google.com/forum/?fromgroups=#!searchin/spark-users/reuse/spark-users/ztODmLlUlwc/Se2MTK5IU3EJ
On Mon, Sep 30, 2013 at 3:06 PM, Sergey Parhomenko <[email protected]> wrote:
> Hi,
>
> We tried to use *JavaPairRDD.sortByKey()* and were not able to. I'm not
> fully sure if that's a bug or we are using the APIs incorrectly, so I would
> like to crosscheck on the mailing list first. The unit test is attached.
> Essentially, we create a Hadoop sequence file and write different key/value
> pairs to it. Then we use *JavaSparkContext.sequenceFile().collect()* to
> read the same pairs back. The data we get, however, is not the data we
> sent - we get the same row over and over again. That seems to be caused by
> the code in *HadoopRDD.compute()*, which creates a mutable key and value
> once and reuses them for each iterated tuple. While this works fine if we
> just need to calculate something based on the data, it does not work if we
> need to collect some of that data. It works neither with Java serialization
> (*org.apache.hadoop.io.serializer.JavaSerialization*) nor with the default
> Hadoop serialization (*org.apache.hadoop.io.serializer.WritableSerialization*),
> as demonstrated by the corresponding test methods. For the same reason
> *JavaPairRDD.sortByKey()* does not work, which is actually our main
> problem, also demonstrated in a separate method.
>
> If this is indeed a bug we can raise an issue in JIRA.
>
> --
> Best regards,
> Sergey Parhomenko
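The pitfall described above can be reproduced without Spark or Hadoop at all. The sketch below is a hypothetical stand-in (the `MutableRecord` class and `reusingIterator` helper are invented for illustration, not Spark API): an iterator hands out the same mutable object on every `next()` call, the way a Hadoop `RecordReader` reuses one Writable key/value pair. Collecting the returned references then yields a list where every element shows the last value read, while copying the payload out of each record before storing it preserves the distinct values.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class ReuseDemo {
    // Minimal mutable holder, standing in for a reused Hadoop Writable.
    static class MutableRecord {
        int value;
    }

    // Iterator that reuses a single MutableRecord instance, mimicking how a
    // RecordReader (and hence HadoopRDD.compute()) reuses its key/value pair.
    static Iterator<MutableRecord> reusingIterator(int[] data) {
        final MutableRecord shared = new MutableRecord();
        return new Iterator<MutableRecord>() {
            int i = 0;
            public boolean hasNext() { return i < data.length; }
            public MutableRecord next() {
                shared.value = data[i++]; // overwrite in place
                return shared;            // same object every time
            }
        };
    }

    public static void main(String[] args) {
        int[] data = {1, 2, 3};

        // Collecting the shared references: all elements alias one object,
        // so every "row" ends up holding the last value read (3, 3, 3).
        List<MutableRecord> broken = new ArrayList<>();
        reusingIterator(data).forEachRemaining(broken::add);
        System.out.println(broken.get(0).value + " " + broken.get(1).value);

        // Copying the payload before storing it keeps the distinct values.
        List<Integer> fixed = new ArrayList<>();
        reusingIterator(data).forEachRemaining(r -> fixed.add(r.value));
        System.out.println(fixed); // [1, 2, 3]
    }
}
```

This is why a map step that clones or extracts each record (rather than keeping the Writable itself) sidesteps the problem before a `collect()` or `sortByKey()`.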
