You are better off using Mesos for a production cluster. Standalone mode will not give you the reliability and availability a production deployment usually needs. That said, it depends on what "production" means: many of my analytics customers do run standalone in production.

Regards,
Mayur
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi <https://twitter.com/mayur_rustagi>

On Fri, May 16, 2014 at 10:23 PM, Josh Marcus <jmar...@meetup.com> wrote:

> Hey folks,
>
> I'm wondering what strategies other folks are using for maintaining and
> monitoring the stability of stand-alone spark clusters.
>
> Our master very regularly loses workers, and they (as expected) never
> rejoin the cluster. This is the same behavior I've seen
> using akka cluster (if that's what spark is using in stand-alone mode) --
> are there configuration options we could be setting
> to make the cluster more robust?
>
> We have a custom script which monitors the number of workers (through the
> web interface) and restarts the cluster when
> necessary, as well as resolving other issues we face (like spark shells
> left open permanently claiming resources), and it
> works, but it's no where close to a great solution.
>
> What are other folks doing? Is this something that other folks observe as
> well? I suspect that the loss of workers is tied to
> jobs that run out of memory on the client side or our use of very large
> broadcast variables, but I don't have an isolated test case.
> I'm open to general answers here: for example, perhaps we should simply be
> using mesos or yarn instead of stand-alone mode.
>
> --j
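
For anyone who wants to try the worker-count check described in the quoted message, here is a rough sketch of what such a script might look like. It is not the script from the thread. It assumes the standalone master's web UI serves a JSON status document at `/json` containing a `workers` list whose entries have a `state` field set to `ALIVE` for healthy workers; please verify that against your Spark version. The function names and the `expected_workers` threshold are hypothetical, and the restart step itself is left out because it depends on how your cluster is launched.

```python
import json
import urllib.request


def count_alive_workers(status: dict) -> int:
    """Count workers reported as ALIVE in a parsed master status document.

    Assumes the document has a "workers" list with a "state" field per
    entry (an assumption about the master's JSON output, not a guarantee).
    """
    return sum(1 for w in status.get("workers", []) if w.get("state") == "ALIVE")


def needs_restart(status: dict, expected_workers: int) -> bool:
    """Return True when fewer workers are alive than the cluster should have."""
    return count_alive_workers(status) < expected_workers


def fetch_status(master_ui_url: str, timeout: float = 10.0) -> dict:
    """Fetch and parse the master's JSON status page.

    The "/json" path is an assumption; adjust it if your master exposes
    its status somewhere else.
    """
    url = master_ui_url.rstrip("/") + "/json"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)
```

A cron job could call `needs_restart(fetch_status("http://<master-host>:8080"), expected_workers)` and trigger whatever restart procedure you already use when it returns True. Keeping the decision logic separate from the HTTP call makes it easy to test against a saved status document.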