[ https://issues.apache.org/jira/browse/SPARK-5917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14328769#comment-14328769 ]
Sean Owen commented on SPARK-5917:
----------------------------------

Isn't this just another symptom of the known issues with S3 and inconsistent reads?
https://issues.apache.org/jira/browse/SPARK-2579
Did you try the items in that JIRA?

> Distinct is broken
> ------------------
>
>                 Key: SPARK-5917
>                 URL: https://issues.apache.org/jira/browse/SPARK-5917
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib
>    Affects Versions: 1.1.1
>         Environment: Spark 1.1.1 running on YARN 2.4 via Amazon EMR.
>            Reporter: Derrick Burns
>            Priority: Critical
>
> I hate to file bugs that are hard to reproduce (by other people), but after
> spending a full week trying to debug my code, I constructed a scenario where
> the following assertion FAILS.
>
>     val x : RDD[T] = ....
>     val y = x.distinct()
>     assert( y.count() <= x.count() )
>
> I am at a complete loss as to how this can occur under ANY definition of
> equality/order unless the RDD underlying x is mutable. Since none of my RDD
> transforms mutate any existing RDD data and I am reading from immutable
> sources (data on S3), I conclude that there must be a bug in Spark or I am
> mutating my data unknowingly.
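
For illustration only: the failing assertion in the quoted report does not require a mutable RDD if x is uncached and its source returns different data on successive reads, because each action recomputes x from that source. The sketch below models this in plain Scala with no Spark dependency. LazyDataset and inconsistentSource are hypothetical stand-ins invented for this example, not Spark APIs, and nothing here confirms that this is what happened in the reporter's job.

```scala
// Toy model, NOT Spark: every action re-runs the source, as an uncached RDD would.
object RecomputeSketch {

  // Hypothetical stand-in for an uncached RDD[T].
  final class LazyDataset[T](source: () => Seq[T]) {
    // An "action": re-reads the source each time it is called.
    def count(): Long = source().size.toLong

    // A "transformation": lazily wraps the same source.
    def distinct(): LazyDataset[T] =
      new LazyDataset(() => source().distinct)

    // Reads the source once, on first use, and reuses that snapshot afterwards.
    def cache(): LazyDataset[T] = {
      lazy val snapshot = source()
      new LazyDataset(() => snapshot)
    }
  }

  // Hypothetical source whose successive reads disagree
  // (an assumption standing in for an inconsistent read).
  def inconsistentSource(): () => Seq[Int] = {
    var reads = 0
    () => {
      reads += 1
      if (reads == 1) Seq(1, 2, 3, 4, 5) else Seq(1, 2, 3)
    }
  }

  def main(args: Array[String]): Unit = {
    // Uncached: y.count() and x.count() each trigger a separate read.
    val x = new LazyDataset(inconsistentSource())
    val y = x.distinct()
    println(s"uncached holds: ${y.count() <= x.count()}")

    // Snapshot taken once: both counts see the same data.
    val cx = new LazyDataset(inconsistentSource()).cache()
    val cy = cx.distinct()
    println(s"cached holds:   ${cy.count() <= cx.count()}")
  }
}
```

In this model the uncached check fails because the two counts are computed from two different reads, while the snapshot version holds. In Spark itself, persisting x, or first copying the input to a store with consistent reads, would be the analogous experiment to try.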