Hi Spark community,

I have a problem zipping two RDDs that have the same size and the same
number of partitions.
The error message says that zipping is only allowed for RDDs whose
corresponding partitions contain exactly the same number of elements.
How can I ensure this? My current workaround is to repartition both
RDDs into a single partition, but that obviously does not scale.
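
One possible way around the per-partition requirement, as a minimal
sketch rather than a definitive answer: give every element a global
index with zipWithIndex and join the two RDDs on that index. The helper
name zipByIndex is hypothetical, and the sketch assumes a Spark version
in which the pair-RDD implicits are in scope without importing
org.apache.spark.SparkContext._ (older versions need that import).

```scala
import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

// Hypothetical helper: pairs elements of two RDDs by their global position.
// Unlike RDD.zip, it does not require equal element counts per partition.
def zipByIndex[A: ClassTag, B: ClassTag](left: RDD[A], right: RDD[B]): RDD[(A, B)] = {
  // zipWithIndex yields (element, index); swap so the index becomes the key.
  val leftByIdx  = left.zipWithIndex().map { case (v, i) => (i, v) }
  val rightByIdx = right.zipWithIndex().map { case (v, i) => (i, v) }
  // Inner join on the index; indices present in only one RDD are dropped.
  leftByIdx.join(rightByIdx).values
}
```

The join costs a shuffle and does not preserve the original order of
the elements, so this trades some performance for not having to
collapse everything into one partition.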

This problem arises from my attempt to draw n random pairs
(Tuple, Tuple) from an RDD[Tuple].
What I do is sample 2 * n tuples, split them into two parts, balance
the sizes of the two parts by filtering out some tuples, and then zip
the parts together.
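
The pairing step can also be sketched without zip at all, assuming that
2 * n elements fit into driver memory: takeSample returns an exact
number of elements as a local array, which can then be paired up
locally. The function name randomPairs and the default seed are
hypothetical choices for illustration.

```scala
import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

// Hypothetical helper: draws up to n random pairs from an RDD.
// Assumes 2 * n elements fit comfortably in driver memory.
def randomPairs[T: ClassTag](data: RDD[T], n: Int, seed: Long = 42L): Array[(T, T)] = {
  // Sample without replacement; the result is a local array on the driver.
  val sample = data.takeSample(withReplacement = false, num = 2 * n, seed = seed)
  // Pair consecutive elements; a trailing single element is discarded.
  sample.grouped(2).collect { case Array(a, b) => (a, b) }.toArray
}
```

If the RDD holds fewer than 2 * n elements, this returns fewer than n
pairs. For n too large for the driver, the same idea might be kept
distributed by keying each sampled element with its zipWithIndex index
divided by 2 and grouping by that key, although sample with a fraction
only yields an approximate count.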

I would appreciate suggestions for better approaches to both problems.

Thanks in advance,
Niklas
