Re: advice on maintaining a production spark cluster?

2014-05-21 Thread sagi
If you see an exception message like the one described in JIRA https://issues.apache.org/jira/browse/SPARK-1886 in the worker's log file, you are welcome to try https://github.com/apache/spark/pull/827 On Wed, May 21, 2014 at 11:21 AM, Josh Marcus jmar...@meetup.com wrote: Aaron: I see this

Re: advice on maintaining a production spark cluster?

2014-05-21 Thread Mark Hamstra
After the several fixes that we have made to exception handling in Spark 1.0.0, I expect that this behavior will be quite different from 0.9.1. Executors should be far more likely to shut down cleanly in the event of errors, allowing easier restarts. But I expect that there will be more bugs to

Re: advice on maintaining a production spark cluster?

2014-05-20 Thread Arun Ahuja
Hi Matei, Unfortunately, I don't have more detailed information, but we have seen the loss of workers in standalone mode as well. If a job is killed through CTRL-C we will often see in the Spark Master page the number of workers and cores decrease. They are still alive and well in the Cloudera

Re: advice on maintaining a production spark cluster?

2014-05-20 Thread Aaron Davidson
I'd just like to point out that, along with Matei, I have not seen workers drop even under the most exotic job failures. We're running pretty close to master, though; perhaps it is related to an uncaught exception in the Worker from a prior version of Spark. On Tue, May 20, 2014 at 11:36 AM,

Re: advice on maintaining a production spark cluster?

2014-05-20 Thread Matei Zaharia
Are you guys both using Cloudera Manager? Maybe there’s also an issue with the integration with that. Matei On May 20, 2014, at 11:44 AM, Aaron Davidson ilike...@gmail.com wrote: I'd just like to point out that, along with Matei, I have not seen workers drop even under the most exotic job

Re: advice on maintaining a production spark cluster?

2014-05-20 Thread Josh Marcus
We're using Spark 0.9.0, and we're using it out of the box -- not using Cloudera Manager or anything similar. The master logs warnings that it continues to receive heartbeats from the unregistered workers. I will see if there are particular telltale errors on the worker side. We've had

Re: advice on maintaining a production spark cluster?

2014-05-20 Thread Josh Marcus
So, for example, I have two disassociated worker machines at the moment. The last messages in the Spark logs are Akka association error messages, like the following:
14/05/20 01:22:54 ERROR EndpointWriter: AssociationError [akka.tcp://sparkwor...@hdn3.int.meetup.com:50038] - [akka.tcp://

Re: advice on maintaining a production spark cluster?

2014-05-20 Thread Aaron Davidson
Unfortunately, those errors are actually due to an Executor that exited, such that the connection between the Worker and Executor failed. This is not a fatal issue, unless there are analogous messages from the Worker to the Master (which should be present, if they exist, at around the same point

Re: advice on maintaining a production spark cluster?

2014-05-20 Thread Josh Marcus
Aaron: I see this in the Master's logs:
14/05/20 01:17:37 INFO Master: Attempted to re-register worker at same address: akka.tcp://sparkwor...@hdn3.int.meetup.com:50038
14/05/20 01:17:37 WARN Master: Got heartbeat from unregistered worker worker-20140520011737-hdn3.int.meetup.com-50038
There
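One way to watch for this symptom is to scan the master's log for the "Got heartbeat from unregistered worker" warning quoted in this message. The following is a minimal Python sketch, not anything from the thread: the function name `unregistered_workers` is hypothetical, and the pattern assumes the exact log wording shown here, which may differ in other Spark versions.

```python
import re

# Assumption: the warning text matches the master log line quoted in this
# message; other Spark versions may word it differently.
UNREGISTERED = re.compile(
    r"WARN Master: Got heartbeat from unregistered worker (\S+)"
)


def unregistered_workers(log_text):
    """Return worker IDs (first-seen order, no duplicates) that the master
    reports as still sending heartbeats after being unregistered."""
    seen = []
    for match in UNREGISTERED.finditer(log_text):
        worker_id = match.group(1)
        if worker_id not in seen:
            seen.append(worker_id)
    return seen


# Sample input: the two master log lines as they appear in this message
# (the address is elided in the archive and is left as-is).
sample = (
    "14/05/20 01:17:37 INFO Master: Attempted to re-register worker at same "
    "address: akka.tcp://sparkwor...@hdn3.int.meetup.com:50038\n"
    "14/05/20 01:17:37 WARN Master: Got heartbeat from unregistered worker "
    "worker-20140520011737-hdn3.int.meetup.com-50038\n"
)

if __name__ == "__main__":
    for worker_id in unregistered_workers(sample):
        print(worker_id)
```

A non-empty result suggests a worker process that is still alive but no longer part of the cluster, which matches the situation described here; what to do about it (for example, restarting the worker) is left to the operator.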

Re: advice on maintaining a production spark cluster?

2014-05-19 Thread Matei Zaharia
Which version is this with? I haven’t seen standalone masters lose workers. Is there other stuff on the machines that’s killing them, or what errors do you see? Matei On May 16, 2014, at 9:53 AM, Josh Marcus jmar...@meetup.com wrote: Hey folks, I'm wondering what strategies other folks

advice on maintaining a production spark cluster?

2014-05-16 Thread Josh Marcus
Hey folks, I'm wondering what strategies other folks are using for maintaining and monitoring the stability of standalone Spark clusters. Our master very regularly loses workers, and they (as expected) never rejoin the cluster. This is the same behavior I've seen using Akka cluster (if that's