`coalesce` sets the number of partitions for the last stage, so you have to use `repartition` instead which is going to introduce an extra shuffle stage
On Wed, Aug 8, 2018 at 3:47 PM Koert Kuipers <ko...@tresata.com> wrote: > > one small correction: lots of files leads to pressure on the spark driver > program when reading this data in spark. > > On Wed, Aug 8, 2018 at 3:39 PM, Koert Kuipers <ko...@tresata.com> wrote: >> >> hi, >> >> i am reading data from files into a dataframe, then doing a groupBy for a >> given column with a count, and finally i coalesce to a smaller number of >> partitions before writing out to disk. so roughly: >> >> spark.read.format(...).load(...).groupBy(column).count().coalesce(100).write.format(...).save(...) >> >> i have this setting: spark.sql.shuffle.partitions=2048 >> >> i expect to see 2048 partitions in shuffle. what i am seeing instead is a >> shuffle with only 100 partitions. it's like the coalesce has taken over the >> partitioning of the groupBy. >> >> any idea why? >> >> i am doing coalesce because it is not helpful to write out 2048 files, lots >> of files leads to pressure down the line on executors reading this data (i >> am writing to just one partition of a larger dataset), and since i have less >> than 100 executors i expect it to be efficient. so sounds like a good idea, >> no? >> >> but i do need 2048 partitions in my shuffle due to the operation i am doing >> in the groupBy (in my real problem i am not just doing a count...). >> >> thanks! >> koert >> > -- Sent from my iPhone --------------------------------------------------------------------- To unsubscribe e-mail: user-unsubscr...@spark.apache.org