[ 
https://issues.apache.org/jira/browse/MESOS-1193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14070592#comment-14070592
 ] 

Steven Schlansker edited comment on MESOS-1193 at 7/22/14 6:08 PM:
-------------------------------------------------------------------

This same issue just took down our entire cluster this morning.  Not cool!
I wish I had more debugging information, but here's the last 10 log lines:
{code}
I0722 17:41:26.189750  1376 slave.cpp:2552] Cleaning up executor 
'chronos.43852277-11c7-11e4-b541-1e5db1e5be60' of framework 
201403072353-2969781770-5050-852-0000
I0722 17:41:26.189893  1381 gc.cpp:56] Scheduling 
'/mnt/mesos/slaves/20140703-000606-1594050058-5050-3987-6/frameworks/201403072353-2969781770-5050-852-0000/executors/chronos.43852277-11c7-11e4-b541-1e5db1e5be60/runs/941f20d9-ba74-4531-b64e-6c2b05b0277f'
 for gc 6.99999780272days in the future
I0722 17:41:26.189980  1381 gc.cpp:56] Scheduling 
'/mnt/mesos/slaves/20140703-000606-1594050058-5050-3987-6/frameworks/201403072353-2969781770-5050-852-0000/executors/chronos.43852277-11c7-11e4-b541-1e5db1e5be60'
 for gc 6.99999780201482days in the future
I0722 17:41:26.737553  1380 slave.cpp:933] Got assigned task 
chronos.67e803db-11c7-11e4-b541-1e5db1e5be60 for framework 
201403072353-2969781770-5050-852-0000
I0722 17:41:26.737844  1380 slave.cpp:1043] Launching task 
chronos.67e803db-11c7-11e4-b541-1e5db1e5be60 for framework 
201403072353-2969781770-5050-852-0000
I0722 17:41:26.739146  1375 mesos_containerizer.cpp:537] Starting container 
'17c2236a-1242-4c36-a6ed-54a31f687e8b' for executor 
'chronos.67e803db-11c7-11e4-b541-1e5db1e5be60' of framework 
'201403072353-2969781770-5050-852-0000'
I0722 17:41:26.739151  1380 slave.cpp:1153] Queuing task 
'chronos.67e803db-11c7-11e4-b541-1e5db1e5be60' for executor 
chronos.67e803db-11c7-11e4-b541-1e5db1e5be60 of framework 
'201403072353-2969781770-5050-852-0000
I0722 17:41:26.748342  1375 launcher.cpp:117] Forked child with pid '12376' for 
container '17c2236a-1242-4c36-a6ed-54a31f687e8b'
I0722 17:41:26.752080  1380 mesos_containerizer.cpp:647] Fetching URIs for 
container '17c2236a-1242-4c36-a6ed-54a31f687e8b' using command 
'/usr/local/libexec/mesos/mesos-fetcher'
F0722 17:41:52.215634  1377 mesos_containerizer.cpp:862] Check failed: 
promises.contains(containerId) 
{code}

{code}
I0722 17:56:35.702491 13428 main.cpp:126] Build: 2014-06-09 21:08:25 by root
I0722 17:56:35.702517 13428 main.cpp:128] Version: 0.19.0
I0722 17:56:35.702535 13428 main.cpp:131] Git tag: 0.19.0
I0722 17:56:35.702553 13428 main.cpp:135] Git SHA: 
51e047524cf744ee257870eb479345646c0428ff
I0722 17:56:35.702590 13428 mesos_containerizer.cpp:124] Using isolation: 
posix/cpu,posix/mem
I0722 17:56:35.702942 13428 main.cpp:149] Starting Mesos slave
I0722 17:56:35.703721 13428 slave.cpp:143] Slave started on 1)@10.70.6.32:5051
I0722 17:56:35.704082 13428 slave.cpp:255] Slave resources: cpus(*):8; 
mem(*):29077; disk(*):70336; ports(*):[31000-32000]
I0722 17:56:35.705883 13428 slave.cpp:283] Slave hostname: 10.70.6.32
I0722 17:56:35.705915 13428 slave.cpp:284] Slave checkpoint: true
{code}


was (Author: stevenschlansker):
This same issue just took down our entire cluster this morning.  Not cool!
I wish I had more debugging information, but here's the last 10 log lines:
{code}
I0722 17:41:26.189750  1376 slave.cpp:2552] Cleaning up executor 
'chronos.43852277-11c7-11e4-b541-1e5db1e5be60' of framework 
201403072353-2969781770-5050-852-0000
I0722 17:41:26.189893  1381 gc.cpp:56] Scheduling 
'/mnt/mesos/slaves/20140703-000606-1594050058-5050-3987-6/frameworks/201403072353-2969781770-5050-852-0000/executors/chronos.43852277-11c7-11e4-b541-1e5db1e5be60/runs/941f20d9-ba74-4531-b64e-6c2b05b0277f'
 for gc 6.99999780272days in the future
I0722 17:41:26.189980  1381 gc.cpp:56] Scheduling 
'/mnt/mesos/slaves/20140703-000606-1594050058-5050-3987-6/frameworks/201403072353-2969781770-5050-852-0000/executors/chronos.43852277-11c7-11e4-b541-1e5db1e5be60'
 for gc 6.99999780201482days in the future
I0722 17:41:26.737553  1380 slave.cpp:933] Got assigned task 
chronos.67e803db-11c7-11e4-b541-1e5db1e5be60 for framework 
201403072353-2969781770-5050-852-0000
I0722 17:41:26.737844  1380 slave.cpp:1043] Launching task 
chronos.67e803db-11c7-11e4-b541-1e5db1e5be60 for framework 
201403072353-2969781770-5050-852-0000
I0722 17:41:26.739146  1375 mesos_containerizer.cpp:537] Starting container 
'17c2236a-1242-4c36-a6ed-54a31f687e8b' for executor 
'chronos.67e803db-11c7-11e4-b541-1e5db1e5be60' of framework 
'201403072353-2969781770-5050-852-0000'
I0722 17:41:26.739151  1380 slave.cpp:1153] Queuing task 
'chronos.67e803db-11c7-11e4-b541-1e5db1e5be60' for executor 
chronos.67e803db-11c7-11e4-b541-1e5db1e5be60 of framework 
'201403072353-2969781770-5050-852-0000
I0722 17:41:26.748342  1375 launcher.cpp:117] Forked child with pid '12376' for 
container '17c2236a-1242-4c36-a6ed-54a31f687e8b'
I0722 17:41:26.752080  1380 mesos_containerizer.cpp:647] Fetching URIs for 
container '17c2236a-1242-4c36-a6ed-54a31f687e8b' using command 
'/usr/local/libexec/mesos/mesos-fetcher'
F0722 17:41:52.215634  1377 mesos_containerizer.cpp:862] Check failed: 
promises.contains(containerId) 
{code}

> Check failed: promises.contains(containerId) crashes slave
> ----------------------------------------------------------
>
>                 Key: MESOS-1193
>                 URL: https://issues.apache.org/jira/browse/MESOS-1193
>             Project: Mesos
>          Issue Type: Bug
>          Components: containerization
>    Affects Versions: 0.18.0
>            Reporter: Tobi Knaup
>
> This was observed with four slaves on one machine, one framework (Marathon) 
> and around 100 tasks per slave.
> I0404 17:58:58.298075  3939 mesos_containerizer.cpp:891] Executor for 
> container '6d4de71c-a491-4544-afe0-afcbfa37094a' has exited
> I0404 17:58:58.298395  3938 slave.cpp:2047] Executor 'web_467-1396634277535' 
> of framework 201404041625-3823062160-55371-22555-0000 has terminated with 
> signal Killed
> E0404 17:58:58.298475  3929 slave.cpp:2320] Failed to unmonitor container for 
> executor web_467-1396634277535 of framework 
> 201404041625-3823062160-55371-22555-0000: Not monitored
> I0404 17:58:58.299075  3938 slave.cpp:1643] Handling status update 
> TASK_FAILED (UUID: c815e057-e7a2-4c26-a382-6796a1585d1d) for task 
> web_467-1396634277535 of framework 201404041625-3823062160-55371-22555-0000 
> from @0.0.0.0:0
> I0404 17:58:58.299232  3932 status_update_manager.cpp:315] Received status 
> update TASK_FAILED (UUID: c815e057-e7a2-4c26-a382-6796a1585d1d) for task 
> web_467-1396634277535 of framework 201404041625-3823062160-55371-22555-0000
> I0404 17:58:58.299360  3932 status_update_manager.cpp:368] Forwarding status 
> update TASK_FAILED (UUID: c815e057-e7a2-4c26-a382-6796a1585d1d) for task 
> web_467-1396634277535 of framework 201404041625-3823062160-55371-22555-0000 
> to master@144.76.223.227:5050
> I0404 17:58:58.306967  3932 status_update_manager.cpp:393] Received status 
> update acknowledgement (UUID: c815e057-e7a2-4c26-a382-6796a1585d1d) for task 
> web_467-1396634277535 of framework 201404041625-3823062160-55371-22555-0000
> I0404 17:58:58.307049  3932 slave.cpp:2186] Cleaning up executor 
> 'web_467-1396634277535' of framework 201404041625-3823062160-55371-22555-0000
> I0404 17:58:58.307122  3932 gc.cpp:56] Scheduling 
> '/tmp/mesos5053/slaves/20140404-164105-3823062160-5050-24762-5/frameworks/201404041625-3823062160-55371-22555-0000/executors/web_467-1396634277535/runs/6d4de71c-a491-4544-afe0-afcbfa37094a'
>  for gc 6.99999644578667days in the future
> I0404 17:58:58.307157  3932 gc.cpp:56] Scheduling 
> '/tmp/mesos5053/slaves/20140404-164105-3823062160-5050-24762-5/frameworks/201404041625-3823062160-55371-22555-0000/executors/web_467-1396634277535'
>  for gc 6.99999644553185days in the future
> F0404 17:58:58.597434  3938 mesos_containerizer.cpp:682] Check failed: 
> promises.contains(containerId)
> *** Check failure stack trace: ***
>     @     0x7f5209da6e5d  google::LogMessage::Fail()
>     @     0x7f5209da8c9d  google::LogMessage::SendToLog()
>     @     0x7f5209da6a4c  google::LogMessage::Flush()
>     @     0x7f5209da9599  google::LogMessageFatal::~LogMessageFatal()
>     @     0x7f5209ad9f88  
> mesos::internal::slave::MesosContainerizerProcess::exec()
>     @     0x7f5209af3b56  
> _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchI7NothingN5mesos8internal5slave25MesosContainerizerProcessERKNS6_11ContainerIDEiSA_iEENS0_6FutureIT_EERKNS0_3PIDIT0_EEMSH_FSF_T1_T2_ET3_T4_EUlS2_E_E9_M_invokeERKSt9_Any_dataS2_
>     @     0x7f5209cd0bf2  process::ProcessManager::resume()
>     @     0x7f5209cd0eec  process::schedule()
>     @     0x7f5208b48f6e  start_thread
>     @     0x7f52088739cd  (unknown)



--
This message was sent by Atlassian JIRA
(v6.2#6252)

Reply via email to