How was this RDD generated? Any randomness involved?

-Xiangrui

On Mon, Feb 9, 2015 at 10:47 AM, SK <skrishna...@gmail.com> wrote:
> Hi,
>
> I am using the filter() method to separate the RDDs based on a predicate, as
> follows:
>
>     val rdd1 = data.filter(t => { t._2 > 0.0 && t._2 <= 1.0 }) // t._2 is a Double
>     val rdd2 = data.filter(t => { t._2 > 1.0 && t._2 <= 4.0 })
>     val rdd3 = data.filter(t => { t._2 > 0.0 && t._2 <= 4.0 }) // this should be a union of rdd1 and rdd2
>
> When I print the counts of all three RDDs, I find that rdd1.count() +
> rdd2.count() > rdd3.count(). Here are the three counts:
>
>     rdd1.count() = 22,088,757
>     rdd2.count() = 37,436,993
>     rdd3.count() = 39,096,164
>
> rdd1 and rdd2 should be mutually exclusive, and the sum of their counts
> should equal rdd3.count(). Any idea why I am seeing this discrepancy?
> Is the distributed computation causing incorrect counts?
>
> thanks
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/rdd-filter-tp21565.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
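[Editor's note] Xiangrui's question points at the likely cause: the predicates (0, 1] and (1, 4] exactly partition (0, 4], so for any single fixed dataset the counts must add up. They can only disagree if `data` is re-evaluated between actions, e.g. an uncached RDD whose lineage involves randomness, so each `count()` sees different values. A minimal plain-Scala sketch (no Spark; `countsOn` and `cached` are hypothetical names, not from the thread) showing that the ranges partition exactly on a materialized dataset:

```scala
import scala.util.Random

// Apply the three predicates from the thread to one fixed dataset.
def countsOn(data: Seq[(Int, Double)]): (Int, Int, Int) = (
  data.count(t => t._2 > 0.0 && t._2 <= 1.0), // like rdd1
  data.count(t => t._2 > 1.0 && t._2 <= 4.0), // like rdd2
  data.count(t => t._2 > 0.0 && t._2 <= 4.0)  // like rdd3
)

val rng = new Random(42)
// Materialized once, like calling data.cache() and forcing it before filtering.
val cached: Seq[(Int, Double)] = Seq.fill(100000)((0, rng.nextDouble() * 4.0))

val (c1, c2, c3) = countsOn(cached)
// (0, 1] and (1, 4] are disjoint and cover (0, 4], so equality must hold.
assert(c1 + c2 == c3)
println(s"$c1 + $c2 == $c3")
```

If the discrepancy comes from a nondeterministic source being recomputed, calling `data.cache()` (and forcing it with an action) before the three filters should make the counts consistent.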