Re: Shuffle file consolidation

Nathan Kronenfeld Thu, 29 May 2014 14:22:35 -0700

Thanks, I missed that.

One thing that's still unclear to me, even looking at that: does this
parameter have to be set on each of the workers when starting up the
cluster, or can it be set by an individual client job?
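
For reference, setting it from a client job would look roughly like the
sketch below. The object and app names are placeholders I made up, the
master is assumed to be supplied by whatever launches the job, and whether
a per-job setting like this is actually honored by the workers is exactly
the part I'm unsure about.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // pair/ordered RDD implicits on older Spark versions

// Hypothetical driver program; only the property name comes from this thread.
object ConsolidateFilesSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("shuffle-consolidation-sketch") // placeholder name
      .set("spark.shuffle.consolidateFiles", "true")

    // Assumes the master URL is provided externally (e.g. by the launcher).
    val sc = new SparkContext(conf)
    try {
      // A small sort, which forces a shuffle like the one in our real job.
      val sorted = sc.parallelize(Seq(3, 1, 2)).map(x => (x, x)).sortByKey()
      println(sorted.keys.collect().mkString(","))
    } finally {
      sc.stop()
    }
  }
}
```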



On Fri, May 23, 2014 at 10:13 AM, Han JU <ju.han.fe...@gmail.com> wrote:

> Hi Nathan,
>
> There's some explanation in the Spark configuration section:
>
> ```
> If set to "true", consolidates intermediate files created during a
> shuffle. Creating fewer files can improve filesystem performance for
> shuffles with large numbers of reduce tasks. It is recommended to set this
> to "true" when using ext4 or xfs filesystems. On ext3, this option might
> degrade performance on machines with many (>8) cores due to filesystem
> limitations.
> ```
>
>
> 2014-05-23 16:00 GMT+02:00 Nathan Kronenfeld <nkronenf...@oculusinfo.com>:
>
>> In trying to sort some largish datasets, we came across the
>> spark.shuffle.consolidateFiles property, and I found in the source code
>> that it is set, by default, to false, with a note to default it to true
>> when the feature is stable.
>>
>> Does anyone know what is unstable about this? If we set it true, what
>> problems should we anticipate?
>>
>> Thanks,
>>             -Nathan Kronenfeld
>>
>>
>> --
>> Nathan Kronenfeld
>> Senior Visualization Developer
>> Oculus Info Inc
>> 2 Berkeley Street, Suite 600,
>> Toronto, Ontario M5A 4J5
>> Phone:  +1-416-203-3003 x 238
>> Email:  nkronenf...@oculusinfo.com
>>
>
>
>
> --
> *JU Han*
>
> Data Engineer @ Botify.com
>
> +33 0619608888
>



-- 
Nathan Kronenfeld
Senior Visualization Developer
Oculus Info Inc
2 Berkeley Street, Suite 600,
Toronto, Ontario M5A 4J5
Phone:  +1-416-203-3003 x 238
Email:  nkronenf...@oculusinfo.com
