I've also seen worker loss, and that's why I asked a question about worker re-spawn.
My typical case: some job hits an OOM exception, and then on the master UI a worker's state becomes DEAD. In the master's log there's an error like:

```
14/05/21 15:38:02 ERROR remote.EndpointWriter: AssociationError [akka.tcp://sparkmas...@ec2-23-20-189-111.compute-1.amazonaws.com:7077] -> [akka.tcp://sparkWorker@ip-10-186-156-22.ec2.internal:38572]: Error [Association failed with [akka.tcp://sparkWorker@ip-10-186-156-22.ec2.internal:38572]] [
akka.remote.EndpointAssociationException: Association failed with [akka.tcp://sparkWorker@ip-10-186-156-22.ec2.internal:38572]
Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: Connection refused: ip-10-186-156-22.ec2.internal/10.186.156.22:38572
]
14/05/21 15:38:02 INFO master.Master: akka.tcp://sparkWorker@ip-10-186-156-22.ec2.internal:38572 got disassociated, removing it.
```

On the `DEAD` worker machine, there are two Spark processes, the worker and an executor backend:

```
16280 org.apache.spark.deploy.worker.Worker
25989 org.apache.spark.executor.CoarseGrainedExecutorBackend
```

The bad thing is that in this case a sbin/stop-all.sh and sbin/start-all.sh cannot bring back the DEAD worker, since the worker process cannot be terminated (maybe due to the executor backend). I have to log in and kill -9 both the worker process and the executor backend; a rough sketch of that cleanup is at the end of this mail. I'm on 0.9.1 and using the ec2-script.

2014-05-21 11:42 GMT+02:00 sagi <zhpeng...@gmail.com>:

> If you saw an exception message like the one mentioned in the JIRA
> https://issues.apache.org/jira/browse/SPARK-1886 in the worker's log
> file, you are welcome to have a try: https://github.com/apache/spark/pull/827
>
>
> On Wed, May 21, 2014 at 11:21 AM, Josh Marcus <jmar...@meetup.com> wrote:
>
>> Aaron:
>>
>> I see this in the Master's logs:
>>
>> 14/05/20 01:17:37 INFO Master: Attempted to re-register worker at same address: akka.tcp://sparkwor...@hdn3.int.meetup.com:50038
>> 14/05/20 01:17:37 WARN Master: Got heartbeat from unregistered worker worker-20140520011737-hdn3.int.meetup.com-50038
>>
>> There was an executor that launched and did fail, such as:
>>
>> 14/05/20 01:16:05 INFO Master: Launching executor app-20140520011605-0001/2 on worker worker-20140519155427-hdn3.int.meetup.com-50038
>> 14/05/20 01:17:37 INFO Master: Removing executor app-20140520011605-0001/2 because it is FAILED
>>
>> ... but other executors on other machines also failed without permanently disassociating.
>>
>> There are also these messages, which I don't know whether they are related:
>>
>> 14/05/20 01:17:38 INFO LocalActorRef: Message [akka.remote.transport.AssociationHandle$Disassociated] from Actor[akka://sparkMaster/deadLetters] to Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%4010.3.6.19%3A47252-18#1027788678] was not delivered. [3] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
>> 14/05/20 01:17:38 INFO LocalActorRef: Message [akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from Actor[akka://sparkMaster/deadLetters] to Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%4010.3.6.19%3A47252-18#1027788678] was not delivered. [4] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
>>
>>
>> On Tue, May 20, 2014 at 10:13 PM, Aaron Davidson <ilike...@gmail.com> wrote:
>>
>>> Unfortunately, those errors are actually due to an Executor that exited, such that the connection between the Worker and Executor failed. This is not a fatal issue, unless there are analogous messages from the Worker to the Master (which should be present, if they exist, at around the same point in time).
>>>
>>> Do you happen to have the logs from the Master that indicate that the Worker terminated? Is it just an Akka disassociation, or some exception?
>>>
>>>
>>> On Tue, May 20, 2014 at 12:53 PM, Sean Owen <so...@cloudera.com> wrote:
>>>
>>>> This isn't helpful of me to say, but I see the same sorts of problems and messages semi-regularly on CDH5 + 0.9.0. I don't have any insight into when it happens, but usually after heavy use and after running for a long time. I had figured I'd see if the changes since 0.9.0 addressed it and revisit later.
>>>>
>>>> On Tue, May 20, 2014 at 8:37 PM, Josh Marcus <jmar...@meetup.com> wrote:
>>>> > So, for example, I have two disassociated worker machines at the moment.
>>>> > The last messages in the spark logs are akka association error messages,
>>>> > like the following:
>>>> >
>>>> > 14/05/20 01:22:54 ERROR EndpointWriter: AssociationError [akka.tcp://sparkwor...@hdn3.int.meetup.com:50038] -> [akka.tcp://sparkexecu...@hdn3.int.meetup.com:46288]: Error [Association failed with [akka.tcp://sparkexecu...@hdn3.int.meetup.com:46288]] [
>>>> > akka.remote.EndpointAssociationException: Association failed with [akka.tcp://sparkexecu...@hdn3.int.meetup.com:46288]
>>>> > Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: Connection refused: hdn3.int.meetup.com/10.3.6.23:46288
>>>> > ]
>>>> >
>>>> > On the master side, there are lots and lots of messages of the form:
>>>> >
>>>> > 14/05/20 15:36:58 WARN Master: Got heartbeat from unregistered worker worker-20140520011737-hdn3.int.meetup.com-50038
>>>> >
>>>> > --j
>
> --
> ---------------------------------
> Best Regards

--
*JU Han*

Data Engineer @ Botify.com

+33 0619608888
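For anyone hitting the same zombie-process state, here is roughly the cleanup I end up doing by hand on the affected worker node, as a minimal sketch. It assumes the spark-ec2 default layout (SPARK_HOME under /root/spark) and the 0.9.x sbin/start-slave.sh signature; the class names match the jps output shown above. Adjust to your own layout.

```bash
#!/usr/bin/env bash
# Sketch: force-kill the stale Worker and CoarseGrainedExecutorBackend JVMs
# on a DEAD worker node, then restart the worker daemon. Run this as the
# user that owns the Spark processes (jps only lists that user's JVMs).
set -euo pipefail

SPARK_HOME="${SPARK_HOME:-/root/spark}"            # assumption: spark-ec2 default install path
MASTER_URL="${1:?usage: $0 spark://<master-host>:7077}"

for class in org.apache.spark.deploy.worker.Worker \
             org.apache.spark.executor.CoarseGrainedExecutorBackend; do
  # jps -l prints "<pid> <fully.qualified.main.class>", so match on column 2.
  for pid in $(jps -l | awk -v c="$class" '$2 == c {print $1}'); do
    echo "kill -9 $pid ($class)"
    kill -9 "$pid"
  done
done

# On 0.9.x, start-slave.sh takes the worker instance number and the master URL.
"$SPARK_HOME/sbin/start-slave.sh" 1 "$MASTER_URL"
```

The -9 is the point: as far as I can tell, stop-all.sh goes through spark-daemon.sh, which only sends a plain kill (TERM), and that is exactly what these wedged JVMs ignore.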