Interesting, my gut instinct is the same as Sean's. I'd suggest debugging
this in plain old Scala first, without involving Spark. Even just in the
Scala shell, create one of your Array[T], try calling .toSet and calling
.distinct. If those aren't the same, then it's got nothing to do with
Spark.
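A minimal sketch of that experiment for the plain Scala REPL (the class name Item and its id field are invented for illustration). With equals and hashCode overridden consistently, .distinct and .toSet should agree on what counts as a duplicate:

```scala
// Hypothetical stand-in for the user's class T: equals and hashCode
// are both overridden, consistently with each other.
class Item(val id: Int) {
  override def equals(other: Any): Boolean = other match {
    case that: Item => this.id == that.id
    case _          => false
  }
  override def hashCode: Int = id.hashCode
}

val data = Array(new Item(1), new Item(1), new Item(2))

// With a correct equals/hashCode pair, both calls deduplicate:
println(data.distinct.length) // 2
println(data.toSet.size)      // 2
```

If the two numbers disagree for your class in the plain REPL, the bug is in the class itself, not in Spark.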
Hi,
I have a question about Array[T].distinct on a customized class T. My data
is an RDD[(String, Array[T])] in which T is a class I wrote myself. There
are some duplicates in each Array[T] that I want to remove. I overrode the
equals() method in T and use
val dataNoDuplicates =
Hi Sean,
I didn't override hashCode. But the problem is that Array[T].toSet works
while Array[T].distinct doesn't. If it were because I didn't override
hashCode, then toSet shouldn't work either, right? I also tried this
Array[T].distinct outside the RDD, and there it works fine as well.
I suppose it depends a lot on the implementations. In general,
distinct and toSet work when hashCode and equals are defined
correctly. When that isn't the case, the result isn't defined; it
might happen to work in some cases, which could well explain why you
see different results. Why not implement hashCode and see if the
difference goes away?
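To illustrate the point, a sketch with invented class names: both distinct and toSet use hash-based lookups internally, so with equals overridden but hashCode left as the default identity hash, deduplication is undefined (equal objects usually land in different hash buckets); defining the two together, e.g. via a case class, restores it:

```scala
// Broken: equals overridden, hashCode left as the identity hash.
class BadItem(val id: Int) {
  override def equals(other: Any): Boolean = other match {
    case that: BadItem => this.id == that.id
    case _             => false
  }
  // No hashCode override: two equal BadItems usually hash differently,
  // so hash-based dedup (distinct, toSet) may silently miss duplicates.
}

// Fixed: a case class derives equals and hashCode consistently.
case class GoodItem(id: Int)

val bad  = Array(new BadItem(1), new BadItem(1))
val good = Array(GoodItem(1), GoodItem(1))

println(bad.distinct.length)  // usually 2: the "duplicate" survives
println(good.distinct.length) // 1
println(good.toSet.size)      // 1
```

The general contract is that any two objects that are equal must return the same hashCode; once that holds, distinct and toSet should behave consistently.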