Good to know it's as simple as that! I wonder why shuffle=true is not the default for coalesce().
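For anyone who finds this thread later, here's a minimal sketch of the fix Matei suggests (Scala API, circa Spark 0.9; the paths and the map body are placeholders, and the app name is made up):

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("coalesce-to-one-csv"))

    sc.textFile("hdfs:///input/huge-dataset")        // placeholder input path
      .map(_.split('\t').take(2).mkString(","))      // placeholder reduction to a small CSV row
      .coalesce(1, shuffle = true)  // the map stage still runs in parallel across all
                                    // partitions; the shuffle then funnels everything
                                    // through a single task that writes one part file
      .saveAsTextFile("hdfs:///output/csv")          // placeholder output path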
On Fri, Mar 21, 2014 at 11:37 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:

> Try passing the shuffle=true parameter to coalesce, then it will do the
> map in parallel but still pass all the data through one reduce node for
> writing it out. That's probably the fastest it will get. No need to cache
> if you do that.
>
> Matei
>
> On Mar 21, 2014, at 4:04 PM, Aureliano Buendia <buendia...@gmail.com> wrote:
>
> > Hi,
> >
> > Our Spark app reduces a few hundred GB of data to a few hundred KB of
> > CSV. We found that a partition count of 1000 is a good number to speed
> > the process up. However, it does not make sense to have 1000 CSV files,
> > each less than 1 KB.
> >
> > We used RDD.coalesce(1) to get only 1 CSV file, but it's extremely
> > slow, and we are not using our resources properly this way. So this is
> > very slow:
> >
> > rdd.map(...).coalesce(1).saveAsTextFile()
> >
> > How is it possible to use coalesce(1) simply to concatenate the
> > materialized output text files? Would something like this make sense?:
> >
> > rdd.map(...).coalesce(100).coalesce(1).saveAsTextFile()
> >
> > Or, would something like this achieve it?:
> >
> > rdd.map(...).cache().coalesce(1).saveAsTextFile()