Hi,

I am using the filter() method to split an RDD based on a predicate, as
follows:

val rdd1 = data.filter(t => t._2 > 0.0 && t._2 <= 1.0)  // t._2 is a Double
val rdd2 = data.filter(t => t._2 > 1.0 && t._2 <= 4.0)
val rdd3 = data.filter(t => t._2 > 0.0 && t._2 <= 4.0)  // should equal the union of rdd1 and rdd2


When I print the counts of the three RDDs, I find that rdd1.count() +
rdd2.count() > rdd3.count(). Here are the three counts:
rdd1.count() = 22,088,757
rdd2.count() = 37,436,993
rdd3.count() = 39,096,164

Since rdd1 and rdd2 cover disjoint ranges, they should be mutually
exclusive, and the sum of their counts should equal rdd3.count(). Yet
22,088,757 + 37,436,993 = 59,525,750, far more than 39,096,164. Any idea
what is causing this discrepancy? Could the distributed computation be
producing incorrect counts?
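
For what it's worth, on a static in-memory collection the same three
predicates do partition cleanly, so the expectation itself seems sound. A
minimal sketch in plain Scala (no Spark; the sample data here is made up
for illustration):

```scala
// Hypothetical sample data standing in for the real RDD's (key, Double) pairs.
val data = Seq(("a", 0.5), ("b", 1.0), ("c", 2.5), ("d", 4.0), ("e", 5.0))

// The same half-open range predicates as in the Spark job.
val part1 = data.filter(t => t._2 > 0.0 && t._2 <= 1.0)  // (0.0, 1.0]
val part2 = data.filter(t => t._2 > 1.0 && t._2 <= 4.0)  // (1.0, 4.0]
val whole = data.filter(t => t._2 > 0.0 && t._2 <= 4.0)  // (0.0, 4.0]

// The two sub-ranges are disjoint and together cover (0.0, 4.0],
// so on fixed data the counts must add up exactly.
assert(part1.size + part2.size == whole.size)
println(s"${part1.size} + ${part2.size} == ${whole.size}")
```

Since the predicates are provably a partition on fixed data, the only way I
can see the counts diverging is if `data` itself is not the same on each
pass over it.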

Thanks
 



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/rdd-filter-tp21565.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
