How was this RDD generated? Any randomness involved? -Xiangrui
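
If the lineage of `data` involves randomness (for example a `sample` or a randomly generated field) and the RDD is not cached, each action recomputes the lineage, so the three filter/count pairs can each see a different dataset. Here is a minimal plain-Scala sketch of that effect (no Spark; the hypothetical `makeData()` stands in for re-evaluating an uncached, nondeterministic lineage):

```scala
import scala.util.Random

object FilterDemo {
  // Hypothetical stand-in for an uncached RDD whose lineage involves
  // randomness: every call re-evaluates and yields a fresh dataset,
  // just as each Spark action recomputes an uncached lineage.
  def makeData(): Seq[(String, Double)] =
    Seq.fill(100000)(("key", Random.nextDouble() * 5.0))

  // "Uncached" behaviour: each count sees a *different* dataset,
  // so the three counts need not be consistent with one another.
  val c1 = makeData().count(t => t._2 > 0.0 && t._2 <= 1.0)
  val c2 = makeData().count(t => t._2 > 1.0 && t._2 <= 4.0)
  val c3 = makeData().count(t => t._2 > 0.0 && t._2 <= 4.0)

  // "Cached" behaviour: materialize once, then the filters always
  // agree, because (0, 1] and (1, 4] partition (0, 4].
  val data = makeData()
  val k1 = data.count(t => t._2 > 0.0 && t._2 <= 1.0)
  val k2 = data.count(t => t._2 > 1.0 && t._2 <= 4.0)
  val k3 = data.count(t => t._2 > 0.0 && t._2 <= 4.0)

  def main(args: Array[String]): Unit =
    println(s"uncached: $c1 + $c2 vs $c3; cached: $k1 + $k2 == $k3 is ${k1 + k2 == k3}")
}
```

In Spark the analogous check would be to call `data.cache()` (or checkpoint) before the three filters, so all three counts run against the same materialized data.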

On Mon, Feb 9, 2015 at 10:47 AM, SK <skrishna...@gmail.com> wrote:
> Hi,
>
> I am using the filter() method to separate the RDDs based on a predicate, as
> follows:
>
> val rdd1 = data.filter(t => t._2 > 0.0 && t._2 <= 1.0)  // t._2 is a Double
> val rdd2 = data.filter(t => t._2 > 1.0 && t._2 <= 4.0)
> val rdd3 = data.filter(t => t._2 > 0.0 && t._2 <= 4.0)  // this should be
> the union of rdd1 and rdd2
>
>
> When I print the counts of all three RDDs, I find that rdd1.count() +
> rdd2.count() > rdd3.count(). Here are the 3 counts:
> rdd1.count() = 22,088,757
> rdd2.count() = 37,436,993
> rdd3.count() = 39,096,164
>
> rdd1 and rdd2 should be mutually exclusive, and the sum of their counts
> should equal rdd3.count(). Any idea why I am seeing this discrepancy?
> Is the distributed computation producing incorrect counts?
>
> thanks
>
>
>
>
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/rdd-filter-tp21565.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
