Hi Xiangrui,

Thanks for your reply. This makes sense, and I should have looked at the docs, indeed. Zipping before saveAsFile did the trick.
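For anyone hitting the same issue, a minimal sketch of the pattern that ended up working. The names are illustrative, and saveAsTextFile stands in for the "saveAsFile" mentioned above, which is an assumption on my part:

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(
      new SparkConf().setAppName("zip-before-save").setMaster("local[*]"))

    // Hypothetical 10M-entry dataset; `derived` is a mapped RDD of `base`,
    // so both share the same partition count and per-partition element counts.
    val base = sc.parallelize(1 to 10000000, numSlices = 8)
    val derived = base.map(x => x * 2.0)

    // Zip while the partition structure still matches, then save the pairs.
    base.zip(derived).saveAsTextFile("/tmp/zipped-output")

    sc.stop()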
-----Original Message-----
From: Xiangrui Meng [mailto:men...@gmail.com]
Sent: Tuesday, April 01, 2014 11:43 PM
To: user@spark.apache.org
Cc: u...@spark.incubator.apache.org
Subject: Re: Issue with zip and partitions

From the API docs: "Zips this RDD with another one, returning key-value pairs with the first element in each RDD, second element in each RDD, etc. Assumes that the two RDDs have the *same number of partitions* and the *same number of elements in each partition* (e.g. one was made through a map on the other)."

Basically, one RDD should be a mapped RDD of the other, or both RDDs should be mapped RDDs of the same RDD.

Btw, your message says "Dell - Internal Use - Confidential"...

Best,
Xiangrui

On Tue, Apr 1, 2014 at 7:27 PM, <patrick_nico...@dell.com> wrote:
> Dell - Internal Use - Confidential
>
> I got the exception "Can't zip RDDs with unequal numbers of partitions"
> when I apply any action (reduce, collect) to a dataset created by
> zipping two datasets of 10 million entries each. The problem occurs
> regardless of the number of partitions, and also when I let Spark create
> those partitions.
>
> Interestingly enough, I do not have this problem when zipping datasets
> of 1 and 2.5 million entries.
>
> A similar problem was reported on this board with 0.8, but I do not
> remember if the problem was fixed.
>
> Any idea? Any workaround?
>
> I would appreciate it.
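For reference, a minimal spark-shell sketch of the constraint Xiangrui describes (values are illustrative; `sc` is the usual shell SparkContext):

    val rdd = sc.parallelize(1 to 100, 4)

    // OK: a mapped RDD preserves both the partition count and the number
    // of elements in each partition, so zip's assumptions hold.
    rdd.zip(rdd.map(_ * 2)).count()  // 100

    // Fails immediately: the partition counts differ (5 vs 4).
    // sc.parallelize(1 to 100, 5).zip(rdd).count()
    // => SparkException: Can't zip RDDs with unequal numbers of partitions

    // Fails when the action runs: filter keeps the partition count at 4,
    // but changes the number of elements inside each partition.
    // rdd.zip(rdd.filter(_ % 2 == 0)).count()
    // => SparkException: Can only zip RDDs with same number of elements
    //    in each partition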