To be fair, this is a long-standing issue caused by object-reuse optimizations in the Hadoop API, and isn't necessarily a failing in Spark; see this blog post from 2011 documenting a similar issue: https://cornercases.wordpress.com/2011/08/18/hadoop-object-reuse-pitfall-all-my-reducer-values-are-the-same/
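To make the pitfall concrete, here is a minimal sketch against the old mapred RecordReader API; the open reader and the surrounding setup are assumed for illustration, not taken from the thread:

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.RecordReader;

    // Hadoop creates the key/value objects once and mutates them in place
    // on every next() call, so every saved reference aliases one buffer.
    RecordReader<LongWritable, Text> reader = ...; // assumed: an open reader
    LongWritable key = reader.createKey();
    Text value = reader.createValue();
    List<Text> held = new ArrayList<Text>();
    while (reader.next(key, value)) {
        held.add(value);              // bug: stores the same object every time
        // held.add(new Text(value)); // fix: defensive copy per record
    }
    // Without the copy, every element of held shows the last record read.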
On June 11, 2015, at 3:17 PM, Sean Owen <so...@cloudera.com> wrote:

Yep, you need to use a transformation of the raw value; use toString, for example.

On Thu, Jun 11, 2015, 8:54 PM Crystal Xing <crystalxin...@gmail.com> wrote:

That is a little scary. So you mean that, in general, we shouldn't use Hadoop's Writables as keys in an RDD?

Zheng zheng

On Thu, Jun 11, 2015 at 6:44 PM, Sean Owen <so...@cloudera.com> wrote:

Guess: it has something to do with the Text object being reused by Hadoop? You can't, in general, keep references to them, since they change. So you may end up with a bunch of copies of one object that collapse to just one in each partition.

On Thu, Jun 11, 2015, 8:36 PM Crystal Xing <crystalxin...@gmail.com> wrote:

I load a list of ids from a text file as NLineInputFormat, and when I call distinct(), it returns an incorrect number:

    JavaRDD<Text> idListData = jvc
        .hadoopFile(idList, NLineInputFormat.class, LongWritable.class, Text.class)
        .values()
        .distinct();

I should have 7000K distinct values; however, it only returns 7000 values, which is the same as the number of tasks. The type I am using is

    import org.apache.hadoop.io.Text;

However, if I switch to String instead of Text, it works correctly. I thought the Text class would have correct implementations of equals() and hashCode(), since it is a Hadoop class.

Does anyone have a clue what is going on? I am using Spark 1.2.

Zheng zheng
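For the archives, here is a minimal sketch of the fix Sean suggests, reusing the names from the snippet above (jvc is assumed to be a JavaSparkContext and idList a path string):

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.lib.NLineInputFormat;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.function.Function;

    // Materialize each reused Text into an immutable String before
    // distinct(), so the shuffle compares stable values instead of
    // many references to one mutating buffer.
    JavaRDD<String> idListData = jvc
        .hadoopFile(idList, NLineInputFormat.class, LongWritable.class, Text.class)
        .values()
        .map(new Function<Text, String>() {
            public String call(Text t) {
                return t.toString(); // copies the bytes out of the buffer
            }
        })
        .distinct();

With Java 8 the map step shortens to .map(Text::toString); either way, the copy must happen before any operation that holds on to the values, such as distinct(), groupByKey(), or collect().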