Hi, I have a question about Array[T].distinct on a custom class T. My data looks like RDD[(String, Array[T])], in which T is a class I wrote myself. There are some duplicates in each Array[T], so I want to remove them. I overrode the equals() method in T and used

    val dataNoDuplicates = dataDuplicates.map { case (id, arr) => (id, arr.distinct) }

to remove the duplicates inside the RDD. However, this doesn't work. I ran a further test using

    val dataNoDuplicates = dataDuplicates.map { case (id, arr) =>
      val uniqArr = arr.distinct
      if (uniqArr.length > 1) println(uniqArr.head == uniqArr.last)
      (id, uniqArr)
    }

and from the worker stdout I could see that it always prints "TRUE", meaning the first and last elements of the supposedly deduplicated array are still equal.

I then tried removing duplicates with Array[T].toSet instead of Array[T].distinct, and that works!

Could anybody explain why Array[T].toSet and Array[T].distinct behave differently here? And why is Array[T].distinct not working?

Thanks a lot!
Anny

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Array-T-distinct-doesn-t-work-inside-RDD-tp22412.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
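
A likely cause, assuming T overrides equals() but not hashCode(): distinct tracks the elements it has already seen in a hash-based set, so two objects that are equal but have different hash codes can both be kept. Small immutable Sets (up to four elements) compare elements with == only, which may be why toSet appeared to work. The sketch below is not the poster's code; Item, its fields, and DistinctDemo are hypothetical stand-ins for T, used only to show equals() and hashCode() overridden together.

```scala
// Hypothetical stand-in for the poster's class T; the fields are assumptions.
class Item(val name: String, val score: Int) {
  // The parameter must be Any, otherwise this overloads instead of overriding.
  override def equals(other: Any): Boolean = other match {
    case that: Item => name == that.name && score == that.score
    case _          => false
  }

  // Must agree with equals: equal objects need equal hash codes,
  // because hash-based collections look elements up by hash first.
  override def hashCode: Int = (name, score).hashCode
}

object DistinctDemo {
  // Same call as in the question, applied to a plain local array.
  def dedupe(arr: Array[Item]): Array[Item] = arr.distinct

  def main(args: Array[String]): Unit = {
    val arr = Array(new Item("a", 1), new Item("a", 1), new Item("b", 2))
    println(dedupe(arr).length) // prints 2
  }
}
```

Declaring T as a case class would generate a consistent equals() and hashCode() pair automatically, if that fits the data model.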