Hi, I am using the filter() method to separate RDDs based on a predicate, as follows:
    val rdd1 = data.filter(t => t._2 > 0.0 && t._2 <= 1.0)  // t._2 is a Double
    val rdd2 = data.filter(t => t._2 > 1.0 && t._2 <= 4.0)
    val rdd3 = data.filter(t => t._2 > 0.0 && t._2 <= 4.0)  // this should be the union of rdd1 and rdd2

When I print the counts of all three RDDs, I find that rdd1.count() + rdd2.count() > rdd3.count(). Here are the three counts:

    rdd1.count() = 22,088,757
    rdd2.count() = 37,436,993
    rdd3.count() = 39,096,164

rdd1 and rdd2 should be mutually exclusive, so the sum of their counts should equal rdd3.count(). Any idea why I am seeing this discrepancy? Is the distributed computation causing incorrect counts?

thanks

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/rdd-filter-tp21565.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
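For reference, here is a minimal self-contained sketch of what I expect to hold (hypothetical data, assuming a local SparkContext `sc`; the input is cached so every count sees the same values):

```scala
// Hypothetical sample data; cache() pins the values so repeated
// actions (the three counts) all see the same input.
val data = sc.parallelize(Seq(("a", 0.5), ("b", 1.0), ("c", 2.5), ("d", 4.0), ("e", 5.0))).cache()

val r1 = data.filter(t => t._2 > 0.0 && t._2 <= 1.0)  // matches 0.5, 1.0
val r2 = data.filter(t => t._2 > 1.0 && t._2 <= 4.0)  // matches 2.5, 4.0
val r3 = data.filter(t => t._2 > 0.0 && t._2 <= 4.0)  // matches all four of the above

// The two ranges are disjoint and their union is r3's range,
// so the counts should add up exactly:
assert(r1.count() + r2.count() == r3.count())
```

On this small cached example the counts do add up, which is what I expected on my real data as well.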