Hi, 

I have a question about Array[T].distinct on a custom class T. My data looks
like RDD[(String, Array[T])], where T is a class I wrote myself. There are
some duplicates in each Array[T], so I want to remove them. I overrode the
equals() method in T and use

val dataNoDuplicates = dataDuplicates.map { case (id, arr) => (id, arr.distinct) }

to remove the duplicates inside the RDD. However, this doesn't seem to work.
I ran a further test using

val dataNoDuplicates = dataDuplicates.map { case (id, arr) =>
  val uniqArr = arr.distinct
  if (uniqArr.length > 1) println(uniqArr.head == uniqArr.last)
  (id, uniqArr)
}

From the worker stdout I could see that it always prints "true", which means
equal elements are still present after distinct. I then tried removing the
duplicates with Array[T].toSet instead of Array[T].distinct, and that works!

Could anybody explain why Array[T].toSet and Array[T].distinct behave
differently here? And why is Array[T].distinct not working?
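
One possible factor, not confirmed by anything shown above: as far as I know,
distinct is backed by a hash set, so it relies on hashCode() as well as
equals(), while a small Set (up to four elements) compares elements with ==
only. The sketch below is a local check without Spark. The class Item and its
field key are hypothetical stand-ins for T, which is not shown in this post,
and the hashCode() override is an assumption about what may be missing.

```scala
// Hypothetical stand-in for the class T from the post; the field name is made up.
class Item(val key: String) {
  // Two items are considered equal when their keys match.
  override def equals(other: Any): Boolean = other match {
    case that: Item => this.key == that.key
    case _          => false
  }

  // Kept consistent with equals: equal items must return equal hash codes.
  override def hashCode(): Int = key.hashCode
}

object DistinctCheck {
  def main(args: Array[String]): Unit = {
    val arr = Array(new Item("a"), new Item("a"), new Item("b"))

    val uniqArr = arr.distinct
    println(uniqArr.length) // prints 2
    println(arr.toSet.size) // prints 2
  }
}
```

With both methods overridden consistently, distinct and toSet agree on this
input. If removing the hashCode() override from a class like this brings the
duplicates back under distinct, that would point to the same cause in T.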

Thanks a lot!
Anny




