"sample 2 * n tuples, split them into two parts, balance the sizes of these parts by filtering some tuples out"
How do you guarantee that the two RDDs have the same size?

-Xiangrui

On Fri, Jan 9, 2015 at 3:40 AM, Niklas Wilcke <[email protected]> wrote:
> Hi Spark community,
>
> I have a problem with zipping two RDDs of the same size and same number of
> partitions.
> The error message says that zipping is only allowed on RDDs which are
> partitioned into chunks of exactly the same sizes.
> How can I assure this? My workaround at the moment is to repartition both
> RDDs to only one partition but that obviously does not scale.
>
> This problem originates from my problem to draw n random tuple pairs
> (Tuple, Tuple) from an RDD[Tuple].
> What I do is to sample 2 * n tuples, split them into two parts, balance the
> sizes of these parts by filtering some tuples out and zipping them together.
>
> I would appreciate to read better approaches for both problems.
>
> Thanks in advance,
> Niklas
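To make the two problems in the thread concrete, here is a minimal sketch in plain Python, not Spark code. It models an RDD as a list of partitions (an assumption made only for illustration), and the helper names `zip_partitions`, `zip_by_index` and `random_pairs` are hypothetical. It shows why equal total size and equal partition count are not enough for a partition-wise zip, how pairing by a global index avoids the constraint, and how n random pairs can be drawn without zipping at all. In Spark the index-based variant would roughly correspond to calling `zipWithIndex` on both RDDs, keying by the index and using `join`; for small n, `takeSample` on the driver followed by a local split should do the same job as `random_pairs`.

```python
import random
from typing import Iterable, List, Optional, Tuple, TypeVar

T = TypeVar("T")
U = TypeVar("U")


def zip_partitions(left: List[List[T]], right: List[List[U]]) -> List[Tuple[T, U]]:
    """Model of a partition-wise zip: partition i of left is paired with
    partition i of right, so every pair of partitions must have equal length."""
    if len(left) != len(right):
        raise ValueError("different number of partitions")
    pairs: List[Tuple[T, U]] = []
    for lpart, rpart in zip(left, right):
        if len(lpart) != len(rpart):
            raise ValueError("partitions differ in size")
        pairs.extend(zip(lpart, rpart))
    return pairs


def zip_by_index(left: List[List[T]], right: List[List[U]]) -> List[Tuple[T, U]]:
    """Pair elements by global position, ignoring the partition layout.
    This is an inner join on the index, so surplus elements on the longer
    side are dropped instead of causing an error."""
    by_index_left = dict(enumerate(x for part in left for x in part))
    by_index_right = dict(enumerate(y for part in right for y in part))
    shared = sorted(by_index_left.keys() & by_index_right.keys())
    return [(by_index_left[i], by_index_right[i]) for i in shared]


def random_pairs(items: Iterable[T], n: int, seed: Optional[int] = None) -> List[Tuple[T, T]]:
    """Draw 2 * n distinct elements in random order and split them into two
    halves of exactly n elements, so no balancing by filtering is needed."""
    rng = random.Random(seed)
    picked = rng.sample(list(items), 2 * n)  # raises ValueError if fewer than 2 * n items
    return list(zip(picked[:n], picked[n:]))


# Same total size (4) and same number of partitions (2), but a different layout.
left = [[1, 2, 3], [4]]
right = [["a"], ["b", "c", "d"]]

try:
    zip_partitions(left, right)
except ValueError as err:
    print("partition-wise zip failed:", err)

print(zip_by_index(left, right))
print(random_pairs(range(100), 5, seed=0))
```

The index-based pairing costs a shuffle in a real cluster, which is the price for not depending on how the elements happen to be distributed across partitions.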
