Hi Adrian,

What kind of partitioning are you using? Have you already tried coalescing to a prime number of partitions?
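One way to see why this can happen: coalesce() merges parent partitions without a shuffle, and Spark's coalescer balances the *number* of parent partitions per output partition (with locality preferences), not their byte sizes. So if a few oversized inputs land next to each other, one output partition ends up huge. A minimal pure-Python sketch of that effect (the grouping strategy and sizes here are illustrative assumptions, not Spark's actual DefaultPartitionCoalescer):

```python
def coalesce_groups(sizes, num_output):
    """Assign consecutive source partitions to output partitions with
    roughly equal *counts* per group, ignoring byte sizes -- analogous
    to how coalesce() avoids a shuffle."""
    n = len(sizes)
    groups = [[] for _ in range(num_output)]
    for i, size in enumerate(sizes):
        groups[i * num_output // n].append(size)
    return [sum(g) for g in groups]

# 12000 inputs of ~2 MB each, but a clustered batch of oversized files
# (purely hypothetical numbers to mimic the skew you describe).
sizes = [2] * 12000
for i in range(6000, 6050):
    sizes[i] = 120

out = coalesce_groups(sizes, 72)
print(min(out), max(out))  # the group holding the big files dominates
```

A repartition(72) would hash-redistribute rows evenly across the 72 outputs at the cost of the shuffle you mentioned, which is why it fixes the imbalance.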
2016-12-14 11:56 GMT-02:00 Adrian Bridgett <adr...@opensignal.com>:

> I realise that coalesce() isn't guaranteed to be balanced and adding a
> repartition() does indeed fix this (at the cost of a large shuffle).
>
> I'm trying to understand _why_ it's so uneven (hopefully it helps someone
> else too). This is using Spark v2.0.2 (pyspark).
>
> Essentially we're just reading CSVs into a DataFrame (which we persist
> serialised for some calculations), then writing it back out as PRQ. To
> avoid too many PRQ files I've set a coalesce of 72 (9 boxes, 8 CPUs each).
>
> The writers end up with about 700-900MB each (not bad), except for one
> which was at 6GB before I killed it.
>
> Input data is 12000 gzipped CSV files in S3 (approx 30GB), named like
> this, almost all about 2MB each:
> s3://example-rawdata-prod/data/2016-12-13/v3.19.0/1481587209-i-da71c942-389.gz
> s3://example-rawdata-prod/data/2016-12-13/v3.19.0/1481587529-i-01d3dab021b760d29-334.gz
>
> (We're aware that this isn't an ideal naming convention from an S3
> performance PoV.)
>
> The actual CSV file format is (wide rows, about 300 columns):
> UUID\tINT\tINT\t...
>
> e.g.:
> 17f9c2a7-ddf6-42d3-bada-63b845cb33a5  1481587198750  11213 ...
> 1d723493-5341-450d-a506-5c96ce0697f0  1481587198751  11212 ...
> 64cec96f-732c-44b8-a02e-098d5b63ad77  1481587198752  11211 ...
>
> The DataFrame seems to be stored evenly on all the nodes (according to the
> storage tab) and all the blocks are the same size. Most of the tasks are
> executed at NODE_LOCAL locality (although there are a few ANY). The
> oversized task is NODE_LOCAL though.
>
> The reading and calculations all seem evenly spread. I'm confused why the
> writes aren't, as I'd expect the input partitions to be even. What's
> causing this, and what can we do about it? Maybe coalesce() could be a
> bit smarter about which partitions it coalesces, balancing the size of
> the final partitions rather than the number of source partitions in each
> final partition.
> Thanks for any light you can shine!
>
> Adrian
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org