Hi All,

We've set up our Spark cluster on AWS running on YARN (Hadoop 2.3) with fair scheduling and preemption turned on. The cluster is shared between prod and dev work; prod runs with a higher fair share and can preempt dev jobs when there aren't enough resources available for it.

It appears that dev jobs which get preempted often become unstable after losing some executors: the whole job either gets stuck (making no progress) or ends up getting restarted (losing all the work done so far). Has anyone encountered this before? Is the solution just to set spark.task.maxFailures to a really high value to recover from task failures in such scenarios? Are there other approaches people have taken for Spark multi-tenancy that work better in this scenario?
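For reference, here's a rough sketch of how we have things wired up. The queue names, weights, and timeout value below are illustrative placeholders rather than our exact settings, and the maxFailures override is the change we're considering, not something we run today (apologies if I've misremembered any of the allocation-file element names):

    # yarn-site.xml -- enable the fair scheduler and preemption
    yarn.resourcemanager.scheduler.class =
        org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler
    yarn.scheduler.fair.preemption = true

    # fair-scheduler.xml (allocation file) -- prod gets a higher fair share
    <allocations>
      <queue name="prod">
        <weight>3.0</weight>
      </queue>
      <queue name="dev">
        <weight>1.0</weight>
      </queue>
      <!-- seconds a queue waits below its fair share before preempting -->
      <fairSharePreemptionTimeout>60</fairSharePreemptionTimeout>
    </allocations>

    # submitting a dev job, bumping task retries to try to ride out
    # executors lost to preemption (default spark.task.maxFailures is 4)
    spark-submit --master yarn-cluster --queue dev \
      --conf spark.task.maxFailures=16 \
      <app jar and args>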
Thanks, Sadhan