Hi all,
I noticed that RDD.cartesian behaves strangely with cached and uncached
data. More precisely, I have a data set that I load with objectFile:
*val data: RDD[(Int,String,Array[Double])] = sc.objectFile("data")*
Then I split it into two sets depending on some criterion:
*val part1 = data.filter(_._2 matches "view1")*
*val part2 = data.filter(_._2 matches "view2")*
Finally, I compute the Cartesian product of part1 and part2:
*val pair = part1.cartesian(part2)*
If everything goes well, I should have
*pair.count == part1.count * part2.count*
But this is not the case if I don't cache part1 and part2.
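For reference, the comparison can be sketched as the small standalone program below. It is only a sketch under assumptions: it runs Spark in local mode, reads the same "data" path as above, and the object name CartesianCheck and the app name are illustrative names, not part of my actual job.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical reproduction harness; the name CartesianCheck is illustrative.
object CartesianCheck {
  def main(args: Array[String]): Unit = {
    // Assumption: local mode is enough to reproduce the comparison.
    val conf = new SparkConf().setAppName("CartesianCheck").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // Same loading and splitting steps as described above.
      val data: RDD[(Int, String, Array[Double])] =
        sc.objectFile[(Int, String, Array[Double])]("data")
      val part1 = data.filter(_._2 matches "view1")
      val part2 = data.filter(_._2 matches "view2")

      // Expected size of the product, computed from the two parts.
      val expected = part1.count() * part2.count()

      // Count of the product before anything is cached.
      val uncachedCount = part1.cartesian(part2).count()

      // Mark both parts for caching, then count the product again.
      part1.cache()
      part2.cache()
      val cachedCount = part1.cartesian(part2).count()

      println(s"expected = $expected")
      println(s"uncached = $uncachedCount")
      println(s"cached   = $cachedCount")
    } finally {
      sc.stop()
    }
  }
}
```

Printing the three numbers side by side should make it easy to see in which case the count of the product differs from part1.count * part2.count.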
What am I missing? Is caching data mandatory in Spark?
Cheers,
Jaonary