It seems that coalesce does work; see the following thread:
https://www.mail-archive.com/user%40spark.apache.org/msg00928.html
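
For example, a minimal sketch of the coalesce approach (untested; the input and
output paths below are just placeholders):

    // any RDD of strings read from HDFS (sc is the usual SparkContext)
    val rdd = sc.textFile("hdfs:///input")
    // collapse to one partition, then write a single part file under /output
    // note: that one partition is handled by a single task, so this is only
    // practical when the data fits comfortably on one executor
    rdd.coalesce(1).saveAsTextFile("hdfs:///output")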

On 5 August 2015 at 09:47, Igor Berman <igor.ber...@gmail.com> wrote:

> using coalesce might be dangerous, since a single worker process will need
> to handle the whole file, and if the file is huge you'll get an OOM; however,
> it depends on the implementation, and I'm not sure how it will be done.
> Nevertheless, it's worth trying the coalesce method (please post your results)
>
> another option would be to use FileUtil.copyMerge, which copies each
> partition one after another into a destination stream (file); so as soon as
> you've written your HDFS file with Spark across multiple partitions in
> parallel (as usual), you can then add another step that merges it into any
> destination you want
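>
> a rough, untested sketch of that copyMerge step (rdd is whatever RDD you are
> saving, the paths are placeholders, and this assumes the Hadoop 2.x
> FileUtil.copyMerge signature):
>
>     import org.apache.hadoop.conf.Configuration
>     import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}
>
>     // 1) write in parallel as usual, producing many part-* files
>     rdd.saveAsTextFile("hdfs:///tmp/output-parts")
>
>     // 2) merge the part files, one after another, into a single file
>     val hadoopConf = new Configuration()
>     val fs = FileSystem.get(hadoopConf)
>     FileUtil.copyMerge(
>       fs, new Path("hdfs:///tmp/output-parts"),  // source dir of part files
>       fs, new Path("hdfs:///final/merged.txt"),  // single destination file
>       true,                                      // delete source dir when done
>       hadoopConf, null)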
>
> On 5 August 2015 at 07:43, Mohammed Guller <moham...@glassbeam.com> wrote:
>
>> Just to further clarify, you can first call coalesce with argument 1 and
>> then call saveAsTextFile. For example,
>>
>>
>>
>> rdd.coalesce(1).saveAsTextFile(...)
>>
>>
>> Mohammed
>>
>>
>>
>> *From:* Mohammed Guller
>> *Sent:* Tuesday, August 4, 2015 9:39 PM
>> *To:* 'Brandon White'; user
>> *Subject:* RE: Combining Spark Files with saveAsTextFile
>>
>>
>>
>> One option is to use the coalesce method in the RDD class.
>>
>>
>>
>> Mohammed
>>
>>
>>
>> *From:* Brandon White [mailto:bwwintheho...@gmail.com]
>> *Sent:* Tuesday, August 4, 2015 7:23 PM
>> *To:* user
>> *Subject:* Combining Spark Files with saveAsTextFile
>>
>>
>>
>> What is the best way to make saveAsTextFile save as only a single file?
>>
>
>
