On 28 Dec 2017, at 19:25, Patrick Alwell <palw...@hortonworks.com> wrote:

> You are using groupByKey(); have you thought of an alternative like
> aggregateByKey() or combineByKey() to reduce shuffling?
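(As an aside, a minimal pure-Python sketch of why the suggestion helps: aggregateByKey() and combineByKey() fold values into a per-key accumulator on the map side of each partition, so only one partial aggregate per key crosses the shuffle, whereas groupByKey() ships every raw value. The function below simulates partitions with plain lists; the real Spark call is `rdd.aggregateByKey(zeroValue, seqOp, combOp)`.)

```python
def aggregate_by_key(partitions, zero, seq_op, comb_op):
    """Simulate Spark's aggregateByKey on a list of 'partitions'."""
    # Map side: combine values locally within each partition first,
    # so only one partial aggregate per key would need to be shuffled.
    partials = []
    for part in partitions:
        acc = {}
        for k, v in part:
            acc[k] = seq_op(acc.get(k, zero), v)
        partials.append(acc)
    # "Reduce side": merge the small per-partition aggregates per key.
    merged = {}
    for acc in partials:
        for k, pv in acc.items():
            merged[k] = comb_op(merged[k], pv) if k in merged else pv
    return merged

# Example: sum values per key across two simulated partitions.
parts = [[("a", 1), ("b", 2), ("a", 3)], [("b", 4), ("a", 5)]]
print(aggregate_by_key(parts, 0, lambda acc, v: acc + v, lambda x, y: x + y))
# {'a': 9, 'b': 6}
```

With groupByKey(), all five (key, value) pairs above would move across the shuffle; here only the three per-partition partials do.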
I am aware of this, indeed. I do have a groupByKey() that is difficult to avoid, but the problem occurs afterwards.

> Dynamic allocation is great; but sometimes I've found explicitly setting
> the num executors, cores per executor, and memory per executor to be a
> better alternative.

I will try with dynamic allocation off.

> Take a look at the yarn logs as well for the particular executor in
> question. Executors can have multiple tasks; and will often fail if they
> have more tasks than available threads.

The trouble is that there is nothing significant in the logs (read: nothing that is clear enough for me to understand!). Is there a particular message I could grep for?

> [...] https://spark.apache.org/docs/latest/tuning.html#level-of-parallelism
> [...] https://spark.apache.org/docs/latest/hardware-provisioning.html

Thanks for the pointers -- will have a look!

Jeroen

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscr...@spark.apache.org