ok thanks. hmmmm, that seems odd. shouldn't coalesce introduce a new map phase with fewer tasks, instead of changing the previous shuffle?
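for concreteness, a minimal sketch of the two behaviors as i understand them (the SparkSession, paths, and column name here are made up, this is not my actual job):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("coalesce-vs-repartition").getOrCreate()
spark.conf.set("spark.sql.shuffle.partitions", "2048")

val counts = spark.read.parquet("/data/in").groupBy("column").count()

// coalesce(100) adds no new stage: it lowers the parallelism of the last
// stage, so the post-shuffle aggregation itself runs in only 100 tasks
counts.coalesce(100).write.parquet("/data/out_coalesce")

// repartition(100) introduces an extra shuffle: the aggregation keeps its
// 2048 tasks, and a second exchange then brings the output down to 100
counts.repartition(100).write.parquet("/data/out_repartition")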
using repartition seems too expensive just to keep the number of files down. so i guess i am back to looking for another solution.

On Wed, Aug 8, 2018 at 4:13 PM, Vadim Semenov <va...@datadoghq.com> wrote:
> `coalesce` sets the number of partitions for the last stage, so you
> have to use `repartition` instead, which is going to introduce an extra
> shuffle stage
>
> On Wed, Aug 8, 2018 at 3:47 PM Koert Kuipers <ko...@tresata.com> wrote:
> >
> > one small correction: lots of files leads to pressure on the spark
> > driver program when reading this data in spark.
> >
> > On Wed, Aug 8, 2018 at 3:39 PM, Koert Kuipers <ko...@tresata.com> wrote:
> >>
> >> hi,
> >>
> >> i am reading data from files into a dataframe, then doing a groupBy on a
> >> given column with a count, and finally i coalesce to a smaller number of
> >> partitions before writing out to disk. so roughly:
> >>
> >> spark.read.format(...).load(...).groupBy(column).count().coalesce(100).write.format(...).save(...)
> >>
> >> i have this setting: spark.sql.shuffle.partitions=2048
> >>
> >> i expect to see 2048 partitions in the shuffle. what i am seeing instead
> >> is a shuffle with only 100 partitions. it's like the coalesce has taken
> >> over the partitioning of the groupBy.
> >>
> >> any idea why?
> >>
> >> i am doing coalesce because it is not helpful to write out 2048 files:
> >> lots of files leads to pressure down the line on executors reading this
> >> data (i am writing to just one partition of a larger dataset), and since
> >> i have fewer than 100 executors i expect it to be efficient. so it sounds
> >> like a good idea, no?
> >>
> >> but i do need 2048 partitions in my shuffle due to the operation i am
> >> doing in the groupBy (in my real problem i am not just doing a count...).
> >>
> >> thanks!
> >> koert
>
> --
> Sent from my iPhone
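ps. one workaround i can think of, just a sketch under the same made-up paths (not something anyone suggested in this thread): pay the full shuffle parallelism in a first pass, then compact the files in a cheap second pass, where the coalesce sits over a plain file scan instead of the groupBy:

// hypothetical two-pass compaction; paths are made up
val agg = spark.read.parquet("/data/in").groupBy("column").count()
agg.write.parquet("/data/tmp")  // aggregation runs with all 2048 shuffle partitions

// coalesce over a plain file scan is a narrow operation, so this pass
// involves no shuffle; it just merges the 2048 files into 100 output files
spark.read.parquet("/data/tmp").coalesce(100).write.parquet("/data/out")

the price is writing the intermediate data twice, but the groupBy keeps its full parallelism.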