Re: groupBy and then coalesce impacts shuffle partitions in unintended way

Vadim Semenov Wed, 08 Aug 2018 13:14:31 -0700

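`coalesce` does not add a shuffle of its own; it sets the number of
partitions for the last stage, so the stage that finishes the groupBy
runs with 100 tasks instead of 2048. To keep 2048 there you have to use
`repartition` instead, which is going to introduce an extra shuffle
stage.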
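
A minimal sketch of the two variants, in case it helps to compare the
plans side by side. The local session, the generated `key` column and
the object name are stand-ins I made up, not the original job, and
adaptive execution is switched off here on the assumption that the
shuffle partition count should stay fixed:

```scala
import org.apache.spark.sql.SparkSession

object CoalesceVsRepartition {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("coalesce-vs-repartition")
      .master("local[*]") // assumption: local run, for illustration only
      .config("spark.sql.shuffle.partitions", "2048")
      // assumption: keep adaptive execution off so the setting above is not adjusted
      .config("spark.sql.adaptive.enabled", "false")
      .getOrCreate()

    // Hypothetical stand-in for spark.read.format(...).load(...)
    val df = spark.range(0L, 1000000L).selectExpr("id % 1000 AS key")

    val counts = df.groupBy("key").count()

    // coalesce adds no exchange, so it is folded into the stage that reads
    // the groupBy shuffle, and that stage runs with 100 tasks.
    val coalesced = counts.coalesce(100)

    // repartition adds its own exchange: the aggregation stage keeps the
    // configured shuffle partitions, and a further stage produces 100.
    val repartitioned = counts.repartition(100)

    // Print both physical plans for comparison.
    coalesced.explain()
    repartitioned.explain()

    spark.stop()
  }
}
```

The `repartition` plan should show one more exchange than the `coalesce`
plan; that extra exchange is the additional shuffle stage, and it is the
price for keeping the aggregation at the configured parallelism.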


On Wed, Aug 8, 2018 at 3:47 PM Koert Kuipers <ko...@tresata.com> wrote:
>
> one small correction: lots of files leads to pressure on the spark driver 
> program when reading this data in spark.
>
> On Wed, Aug 8, 2018 at 3:39 PM, Koert Kuipers <ko...@tresata.com> wrote:
>>
>> hi,
>>
>> i am reading data from files into a dataframe, then doing a groupBy for a 
>> given column with a count, and finally i coalesce to a smaller number of 
>> partitions before writing out to disk. so roughly:
>>
>> spark.read.format(...).load(...).groupBy(column).count().coalesce(100).write.format(...).save(...)
>>
>> i have this setting: spark.sql.shuffle.partitions=2048
>>
>> i expect to see 2048 partitions in shuffle. what i am seeing instead is a 
>> shuffle with only 100 partitions. it's like the coalesce has taken over the 
>> partitioning of the groupBy.
>>
>> any idea why?
>>
>> i am doing coalesce because it is not helpful to write out 2048 files, lots 
>> of files leads to pressure down the line on executors reading this data (i 
>> am writing to just one partition of a larger dataset), and since i have less 
>> than 100 executors i expect it to be efficient. so sounds like a good idea, 
>> no?
>>
>> but i do need 2048 partitions in my shuffle due to the operation i am doing 
>> in the groupBy (in my real problem i am not just doing a count...).
>>
>> thanks!
>> koert
>>
>


