Re: Array[T].distinct doesn't work inside RDD

2015-04-14 Thread Imran Rashid
Interesting, my gut instinct is the same as Sean's. I'd suggest debugging this in plain old Scala first, without involving Spark. Even just in the Scala shell, create one of your Array[T] and try calling .toSet and .distinct on it. If those don't give the same result, then it's got nothing to do with Spark.
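The check suggested above can be sketched roughly as follows, pasted into the Scala shell. The case class Tag and the sample values are hypothetical stand-ins for the poster's own T and data, which are not shown in the thread; a case class is used here only because it gets consistent equals and hashCode for free.

```scala
// Hypothetical element type standing in for the poster's T.
case class Tag(label: String)

// A small array with one duplicate element.
val sample = Array(Tag("a"), Tag("b"), Tag("a"))

// Deduplicate both ways, with no Spark involved.
val viaDistinct = sample.distinct
val viaSet = sample.toSet

// If the two disagree for your own class, the problem is in the class, not in Spark.
val agrees = viaDistinct.length == viaSet.size && viaDistinct.toSet == viaSet
```

Swapping Tag for the real class and sample for one of the real arrays should show whether the two methods disagree outside an RDD.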

Array[T].distinct doesn't work inside RDD

2015-04-07 Thread anny9699
Hi, I have a question about Array[T].distinct on a custom class T. My data is like RDD[(String, Array[T])], in which T is a class I wrote myself. There are some duplicates in each Array[T], so I want to remove them. I override the equals() method in T and use val dataNoDuplicates =

Re: Array[T].distinct doesn't work inside RDD

2015-04-07 Thread Anny Chen
Hi Sean, I didn't override hashCode. But the problem is that Array[T].toSet works while Array[T].distinct doesn't. If it is because I didn't override hashCode, then toSet shouldn't work either, right? I also tried using Array[T].distinct outside the RDD, and it works fine there too,

Re: Array[T].distinct doesn't work inside RDD

2015-04-07 Thread Sean Owen
I suppose it depends a lot on the implementations. In general, distinct and toSet work when hashCode and equals are defined correctly. When that isn't the case, the result isn't defined; it might happen to work in some cases. This could well explain why you see different results. Why not implement
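As a minimal sketch of what "defined correctly" means here: the class Item and its fields below are hypothetical, since the poster's T is not shown in the thread. The point is only that equals and hashCode are overridden together and are derived from the same fields.

```scala
// Hypothetical stand-in for the poster's class T; the fields are illustrative only.
class Item(val id: Int, val name: String) {
  // Two Items are equal when both fields match.
  override def equals(other: Any): Boolean = other match {
    case that: Item => id == that.id && name == that.name
    case _          => false
  }

  // Must agree with equals: equal objects have to return equal hash codes.
  override def hashCode: Int = (id, name).hashCode
}

// With both methods defined consistently, distinct and toSet agree.
val items = Array(new Item(1, "a"), new Item(1, "a"), new Item(2, "b"))
val dedupedByDistinct = items.distinct
val dedupedBySet = items.toSet
```

Declaring the type as a case class would generate an equivalent pair of methods automatically, which may be simpler if the class has no other reason to be a plain class.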