Good to know it's as simple as that! I wonder why shuffle=true is not the default for coalesce().
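For anyone who finds this thread later, here's a minimal sketch of the fix Matei suggests (Scala API, circa Spark 0.9; the paths and the map body are placeholders, and the app name is made up):

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("coalesce-to-one-csv"))

    sc.textFile("hdfs:///input/huge-dataset")        // placeholder input path
      .map(_.split('\t').take(2).mkString(","))      // placeholder reduction to a small CSV row
      .coalesce(1, shuffle = true)  // the map stage still runs in parallel across all
                                    // partitions; the shuffle then funnels everything
                                    // through a single task that writes one part file
      .saveAsTextFile("hdfs:///output/csv")          // placeholder output path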
On Fri, Mar 21, 2014 at 11:37 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:

> Try passing the shuffle=true parameter to coalesce, then it will do the
> map in parallel but still pass all the data through one reduce node for
> writing it out. That's probably the fastest it will get. No need to cache
> if you do that.
>
> Matei
>
> On Mar 21, 2014, at 4:04 PM, Aureliano Buendia <buendia...@gmail.com> wrote:
>
> > Hi,
> >
> > Our Spark app reduces a few hundred GB of data to a few hundred KB of
> > CSV. We found that a partition count of 1000 is a good number to speed
> > the process up. However, it does not make sense to have 1000 CSV files,
> > each less than 1 KB.
> >
> > We used RDD.coalesce(1) to get only 1 CSV file, but it's extremely
> > slow, and we are not using our resources properly this way. So this is
> > very slow:
> >
> > rdd.map(...).coalesce(1).saveAsTextFile()
> >
> > How is it possible to use coalesce(1) simply to concatenate the
> > materialized output text files? Would something like this make sense?:
> >
> > rdd.map(...).coalesce(100).coalesce(1).saveAsTextFile()
> >
> > Or, would something like this achieve it?:
> >
> > rdd.map(...).cache().coalesce(1).saveAsTextFile()