I've been struggling with the reliability of a job that, on the face of it,
should be fairly simple: given a number of events, group them by user, event
type, and the week in which they occurred, and aggregate their counts.
The input data is fairly skewed, so I repartition and add a salt to the key;
the aggregation is then performed in two phases to (hopefully) mitigate the
skew.
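In Spark this is essentially two aggregation passes over a salted key; here is a plain-Python sketch of the idea (the event data, key names, and salt range are made up for illustration):

```python
import random
from collections import defaultdict

SALT_BUCKETS = 16  # hypothetical salt range

# Toy events: (user, event type, ISO week)
events = [
    ("alice", "click", "2016-W38"),
    ("alice", "click", "2016-W38"),
    ("bob", "view", "2016-W38"),
    ("alice", "click", "2016-W38"),
]

# Phase 1: aggregate on the salted key, so a single hot
# (user, type, week) key is spread across SALT_BUCKETS partial sums.
partial = defaultdict(int)
for user, etype, week in events:
    salt = random.randrange(SALT_BUCKETS)
    partial[(user, etype, week, salt)] += 1

# Phase 2: strip the salt and combine the partial sums.
totals = defaultdict(int)
for (user, etype, week, _salt), count in partial.items():
    totals[(user, etype, week)] += count

print(totals[("alice", "click", "2016-W38")])  # → 3
```

In the actual job the two phases are two shuffles, which is why I expected each task's work to stay bounded.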
What I see is that the size of data dealt with by each task is uniform
(170mb) with a parallelism of 5000 (also tried with 2500), however tasks
take increasingly longer, initially they start at a few seconds, then up to
18min by the end of the job.
As I am not caching any RDDs, I have tried the legacy memory management,
giving 70% of the heap to execution, 10% to storage, and 20% for unroll.
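Concretely, the legacy settings I'm using look roughly like this (a sketch; as I understand it, spark.storage.unrollFraction is a fraction of the storage pool rather than of the whole heap):

```properties
spark.memory.useLegacyMode=true
spark.shuffle.memoryFraction=0.7
spark.storage.memoryFraction=0.1
spark.storage.unrollFraction=0.2
```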
Could anyone give me any pointers on what might be causing this? It feels
like a memory leak, but I'm struggling to see where it is coming from.
Here are some details on my setup (stats from the job attached):
- Spark 2.0 (tried with 1.6.2 also)
- Executors: 2 cores, 10 GB RAM each
- Cluster: 20 nodes, each with 16 cores and 100 GB RAM
Stats from the map phase:
Input data size: 668.2 GB
Shuffle write size: 831.9 GB
[image: Screen Shot 2016-09-20 at 17.15.12.png]