Unfortunately, those errors are due to an Executor that exited, which
broke the connection between the Worker and that Executor. That by itself
is not a fatal issue. It only points to a real problem if there are
analogous association errors from the Worker to the Master, which, if
they exist, should show up at around the same point in time.
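
If it helps with checking the Worker log for that, here is a rough sketch
(not anything Spark ships) that groups log lines into entries and pulls
out the AssociationError entries near a given time. It assumes the
"yy/MM/dd HH:mm:ss LEVEL Component: message" layout seen in the quoted
logs; the function names are made up for illustration, and you would
still need to read each entry's target address yourself to see whether it
points at an Executor or at the Master.

```python
import re
from datetime import datetime, timedelta

# Assumed layout, taken from the quoted logs, e.g.
# "14/05/20 01:22:54 ERROR EndpointWriter: AssociationError ..."
LINE_RE = re.compile(r"^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) (\w+) (\S+): (.*)$")


def read_entries(lines):
    """Group raw log lines into entries.

    A line that does not start with a timestamp (such as a "Caused by:"
    line) is appended to the text of the previous entry.
    """
    entries = []
    for raw in lines:
        line = raw.rstrip("\n")
        m = LINE_RE.match(line)
        if m:
            entries.append({
                "time": datetime.strptime(m.group(1), "%y/%m/%d %H:%M:%S"),
                "level": m.group(2),
                "component": m.group(3),
                "text": m.group(4),
            })
        elif entries:
            entries[-1]["text"] += " " + line.strip()
    return entries


def association_errors(entries):
    """Keep only ERROR entries that mention AssociationError."""
    return [e for e in entries
            if e["level"] == "ERROR" and "AssociationError" in e["text"]]


def errors_near(entries, when, window_seconds=60):
    """Return AssociationError entries within window_seconds of `when`."""
    window = timedelta(seconds=window_seconds)
    return [e for e in association_errors(entries)
            if abs(e["time"] - when) <= window]
```

You would feed read_entries() the lines of the Worker log, then call
errors_near() with the time of one of the Executor association errors and
look at the targets of whatever comes back.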

Do you happen to have the logs from the Master that indicate that the
Worker terminated? Is it just an Akka disassociation, or some exception?


On Tue, May 20, 2014 at 12:53 PM, Sean Owen <so...@cloudera.com> wrote:

> This isn't helpful of me to say, but I see the same sorts of problems
> and messages semi-regularly on CDH5 + 0.9.0. I don't have any insight
> into when it happens, but usually after heavy use and after running
> for a long time. I had figured I'd see if the changes since 0.9.0
> addressed it and revisit later.
>
> On Tue, May 20, 2014 at 8:37 PM, Josh Marcus <jmar...@meetup.com> wrote:
> > So, for example, I have two disassociated worker machines at the moment.
> > The last messages in the spark logs are akka association error messages,
> > like the following:
> >
> > 14/05/20 01:22:54 ERROR EndpointWriter: AssociationError
> > [akka.tcp://sparkwor...@hdn3.int.meetup.com:50038] ->
> > [akka.tcp://sparkexecu...@hdn3.int.meetup.com:46288]: Error [Association
> > failed with [akka.tcp://sparkexecu...@hdn3.int.meetup.com:46288]] [
> > akka.remote.EndpointAssociationException: Association failed with
> > [akka.tcp://sparkexecu...@hdn3.int.meetup.com:46288]
> > Caused by:
> > akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
> > Connection refused: hdn3.int.meetup.com/10.3.6.23:46288
> > ]
> >
> > On the master side, there are lots and lots of messages of the form:
> >
> > 14/05/20 15:36:58 WARN Master: Got heartbeat from unregistered worker
> > worker-20140520011737-hdn3.int.meetup.com-50038
> >
> > --j
> >
> >
>
