If you saw an exception message like the one described in
https://issues.apache.org/jira/browse/SPARK-1886 in your worker's log
file, you are welcome to try the fix in https://github.com/apache/spark/pull/827
On Wed, May 21, 2014 at 11:21 AM, Josh Marcus jmar...@meetup.com wrote:
Aaron:
I see this [...]
After the several fixes that we have made to exception handling in Spark
1.0.0, I expect that this behavior will be quite different from 0.9.1.
Executors should be far more likely to shut down cleanly in the event of
errors, allowing easier restarts. But I expect that there will be more
bugs to [...]
Hi Matei,
Unfortunately, I don't have more detailed information, but we have seen the
loss of workers in standalone mode as well. If a job is killed through
CTRL-C, we will often see the number of workers and cores decrease in the
Spark Master page. They are still alive and well in the Cloudera [...]
I'd just like to point out that, along with Matei, I have not seen workers
drop even under the most exotic job failures. We're running pretty close to
master, though; perhaps it is related to an uncaught exception in the
Worker from a prior version of Spark.
On Tue, May 20, 2014 at 11:36 AM,
Are you guys both using Cloudera Manager? Maybe there’s also an issue with the
integration with that.
Matei
On May 20, 2014, at 11:44 AM, Aaron Davidson ilike...@gmail.com wrote:
I'd just like to point out that, along with Matei, I have not seen workers
drop even under the most exotic job [...]
We're using Spark 0.9.0, and we're using it out of the box -- not using
Cloudera Manager or anything similar.
There are warnings from the master that there continue to be heartbeats
from the unregistered workers. I will see if there are particular
telltale errors on the worker side.
We've had [...]
So, for example, I have two disassociated worker machines at the moment.
The last messages in the Spark logs are akka association error messages,
like the following:
14/05/20 01:22:54 ERROR EndpointWriter: AssociationError [akka.tcp://
sparkwor...@hdn3.int.meetup.com:50038] - [akka.tcp://
Unfortunately, those errors are actually due to an Executor that exited,
such that the connection between the Worker and Executor failed. This is
not a fatal issue, unless there are analogous messages from the Worker to
the Master (which should be present, if they exist, at around the same
point [...]
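(If it helps to automate that check, here is a rough sketch in Python along the lines Aaron describes: it walks a Worker log and separates association errors that mention the Master's address from the rest, which would typically be Executors. The log path and the Master's actor address below are placeholders, not taken from this thread, and the script only does plain substring matching on the kinds of lines quoted above.)

# Rough sketch: separate Worker<->Master association errors from
# Worker<->Executor ones in a standalone Worker log.
# WORKER_LOG and MASTER_ADDR are placeholders -- adjust for your setup.
WORKER_LOG = "/var/log/spark/spark-worker.out"   # placeholder path
MASTER_ADDR = "sparkMaster@master-host:7077"     # placeholder Master actor address

master_errors = []
other_errors = []

with open(WORKER_LOG) as log:
    for line in log:
        if "AssociationError" in line or "Disassociated" in line:
            if MASTER_ADDR in line:
                master_errors.append(line.rstrip())
            else:
                other_errors.append(line.rstrip())

print("association errors mentioning the Master: %d" % len(master_errors))
print("association errors mentioning executors/others: %d" % len(other_errors))
for line in master_errors[-5:]:   # show the most recent few, if any
    print(line)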
Aaron:
I see this in the Master's logs:
14/05/20 01:17:37 INFO Master: Attempted to re-register worker at same
address: akka.tcp://sparkwor...@hdn3.int.meetup.com:50038
14/05/20 01:17:37 WARN Master: Got heartbeat from unregistered worker
worker-20140520011737-hdn3.int.meetup.com-50038
There [...]
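(A related quick check on the Master side: count how often that "Got heartbeat from unregistered worker" warning fires for each worker id, to see whether it is one stuck worker or many. A rough Python sketch, with the Master log path as a placeholder and the pattern taken from the lines quoted above:)

# Rough sketch: count "Got heartbeat from unregistered worker" warnings
# per worker id in the standalone Master log.
# MASTER_LOG is a placeholder -- point it at your Master's log file.
import re
from collections import Counter

MASTER_LOG = "/var/log/spark/spark-master.out"   # placeholder path

pattern = re.compile(r"Got heartbeat from unregistered worker (\S+)")
counts = Counter()

with open(MASTER_LOG) as log:
    for line in log:
        match = pattern.search(line)
        if match:
            counts[match.group(1)] += 1

for worker_id, n in counts.most_common():
    print("%s: %d unregistered heartbeats" % (worker_id, n))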
Which version is this with? I haven’t seen standalone masters lose workers. Is
there other stuff on the machines that’s killing them, or what errors do you
see?
Matei
On May 16, 2014, at 9:53 AM, Josh Marcus jmar...@meetup.com wrote:
Hey folks,
I'm wondering what strategies other folks [...]
Hey folks,
I'm wondering what strategies other folks are using for maintaining and
monitoring the stability of standalone Spark clusters.
Our master very regularly loses workers, and they (as expected) never
rejoin the cluster. This is the same behavior I've seen
using akka cluster (if that's [...]
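(On the monitoring side of the original question, one lightweight approach is to poll the standalone Master's JSON status page and alert when the number of ALIVE workers drops below what you expect. A rough Python sketch follows; it assumes the Master web UI is on the default port 8080 and serves its state at /json, and the URL and expected worker count are placeholders to adjust for your cluster.)

# Rough sketch: poll the standalone Master's JSON status page and warn
# when the number of ALIVE workers drops.
# MASTER_UI and EXPECTED_WORKERS are placeholders for your cluster.
import json
import time
from urllib.request import urlopen

MASTER_UI = "http://spark-master:8080/json"   # placeholder URL
EXPECTED_WORKERS = 10                         # placeholder count
POLL_SECONDS = 60

while True:
    state = json.loads(urlopen(MASTER_UI).read().decode("utf-8"))
    alive = [w for w in state.get("workers", []) if w.get("state") == "ALIVE"]
    if len(alive) < EXPECTED_WORKERS:
        print("WARNING: only %d of %d expected workers are ALIVE"
              % (len(alive), EXPECTED_WORKERS))
    time.sleep(POLL_SECONDS)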