Hey folks,

I'm wondering what strategies other folks are using for maintaining and
monitoring the stability of stand-alone spark clusters.

Our master very regularly loses workers, and they (as expected) never
rejoin the cluster.  This matches the behavior I've seen
with Akka cluster (if that's what Spark uses in stand-alone mode) --
are there configuration options we could be setting
to make the cluster more robust?

We have a custom script which monitors the number of workers (through the
web interface) and restarts the cluster when
necessary; it also resolves other issues we face (like Spark shells
left open, permanently claiming resources).  It
works, but it's nowhere close to a great solution.
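For context, a minimal sketch of the kind of watchdog we run.  It polls the
master's JSON status page (served at /json alongside the web UI -- the
hostname, port, expected worker count, and exact field names here are
assumptions; check them against your Spark version) and reports whether the
ALIVE worker count has dropped below what we expect:

```python
import json
import urllib.request

MASTER_STATUS_URL = "http://spark-master:8080/json"  # hypothetical host/port
EXPECTED_WORKERS = 8                                 # hypothetical cluster size


def alive_workers(status):
    """Count the workers the master still considers ALIVE."""
    return sum(1 for w in status.get("workers", [])
               if w.get("state") == "ALIVE")


def cluster_needs_restart(status, expected=EXPECTED_WORKERS):
    """True when the master has lost one or more workers."""
    return alive_workers(status) < expected


def check(url=MASTER_STATUS_URL):
    """Fetch the master's JSON status and decide whether to restart."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        status = json.load(resp)
    return cluster_needs_restart(status)
```

In practice we'd run something like this from cron and trigger
stop-all.sh/start-all.sh when it fires, but the restart logic is too
site-specific to show here.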

What are other folks doing?  Do others observe this as
well?  I suspect that the loss of workers is tied to
jobs that run out of memory on the client side, or to our use of very large
broadcast variables, but I don't have an isolated test case.
I'm open to general answers here: for example, perhaps we should simply be
using Mesos or YARN instead of stand-alone mode.

--j
