[jira] [Updated] (MESOS-6226) Master crashes while transitioning tasks to 'TASK_UNREACHABLE'
[ https://issues.apache.org/jira/browse/MESOS-6226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Neil Conway updated MESOS-6226:
-------------------------------
    Shepherd: Vinod Kone

> Master crashes while transitioning tasks to 'TASK_UNREACHABLE'
> --------------------------------------------------------------
>
>                 Key: MESOS-6226
>                 URL: https://issues.apache.org/jira/browse/MESOS-6226
>             Project: Mesos
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 1.1.0
>            Reporter: Jan Schlicht
>            Assignee: Neil Conway
>            Priority: Critical
>             Fix For: 1.1.0
>
>
> While marking an agent as unreachable, Mesos master crashes when marking tasks as unreachable.
> A possible reason could be that a framework might not be reregistered yet.
> {noformat}
> Sep 22 06:16:16 ip-10-10-0-212 mesos-master[1036]: I0922 06:16:16.207105 1043 master.cpp:236] Scheduling transition of agent 5f0106ed-1033-4624-ba1b-99f81c7e348a-S1 to UNREACHABLE because of health check timeout
> Sep 22 06:16:16 ip-10-10-0-212 mesos-master[1036]: I0922 06:16:16.217079 1043 master.cpp:5849] Marking agent 5f0106ed-1033-4624-ba1b-99f81c7e348a-S1 at slave(1)@10.10.0.205:5051 (10.10.0.205) unreachable: health check timed out
> Sep 22 06:16:16 ip-10-10-0-212 mesos-master[1036]: I0922 06:16:16.217375 1043 registrar.cpp:461] Applied 1 operations in 97324ns; attempting to update the registry
> Sep 22 06:16:16 ip-10-10-0-212 mesos-master[1036]: I0922 06:16:16.217798 1041 coordinator.cpp:348] Coordinator attempting to write APPEND action at position 2681
> Sep 22 06:16:16 ip-10-10-0-212 mesos-master[1036]: I0922 06:16:16.218132 1041 replica.cpp:537] Replica received write request for position 2681 from __req_res__(35)@10.10.0.212:5050
> Sep 22 06:16:16 ip-10-10-0-212 mesos-master[1036]: E0922 06:16:16.218492 1049 process.cpp:2113] Failed to shutdown socket with fd 74: Transport endpoint is not connected
> Sep 22 06:16:16 ip-10-10-0-212 mesos-master[1036]: I0922 06:16:16.220237 1041 replica.cpp:691] Replica received learned notice for position 2681 from @0.0.0.0:0
> Sep 22 06:16:16 ip-10-10-0-212 mesos-master[1036]: I0922 06:16:16.222190 1041 registrar.cpp:506] Successfully updated the registry in 4.75008ms
> Sep 22 06:16:16 ip-10-10-0-212 mesos-master[1036]: I0922 06:16:16.222304 1043 coordinator.cpp:348] Coordinator attempting to write TRUNCATE action at position 2682
> Sep 22 06:16:16 ip-10-10-0-212 mesos-master[1036]: I0922 06:16:16.222390 1043 master.cpp:5896] Marked agent 5f0106ed-1033-4624-ba1b-99f81c7e348a-S1 (10.10.0.205) unreachable: health check timed out
> Sep 22 06:16:16 ip-10-10-0-212 mesos-master[1036]: I0922 06:16:16.222456 1043 master.cpp:7537] Updating the state of task 8 of framework 2505868e-c750-4aed-8413-99810a59ffd8-0001 (latest state: TASK_LOST, status update state: TASK_LOST)
> Sep 22 06:16:16 ip-10-10-0-212 mesos-master[1036]: I0922 06:16:16.222558 1043 master.cpp:7633] Removing task 8 with resources cpus(*):0.2; mem(*):32 of framework 2505868e-c750-4aed-8413-99810a59ffd8-0001 on agent 5f0106ed-1033-4624-ba1b-99f81c7e348a-S1 at slave(1)@10.10.0.205:5051 (10.10.0.205)
> Sep 22 06:16:16 ip-10-10-0-212 mesos-master[1036]: I0922 06:16:16.222631 1043 master.cpp:5695] Sending status update TASK_LOST for task 8 of framework 2505868e-c750-4aed-8413-99810a59ffd8-0001 'Slave 10.10.0.205 is unreachable'
> Sep 22 06:16:16 ip-10-10-0-212 mesos-master[1036]: I0922 06:16:16.222651 1041 hierarchical.cpp:514] Removed agent 5f0106ed-1033-4624-ba1b-99f81c7e348a-S1
> Sep 22 06:16:16 ip-10-10-0-212 mesos-master[1036]: F0922 06:16:16.222733 1043 master.cpp:5918] Check failed: 'framework' Must be non NULL
> Sep 22 06:16:16 ip-10-10-0-212 mesos-master[1036]: *** Check failure stack trace: ***
> Sep 22 06:16:16 ip-10-10-0-212 mesos-master[1036]: I0922 06:16:16.222829 1041 replica.cpp:537] Replica received write request for position 2682 from __req_res__(38)@10.10.0.212:5050
> Sep 22 06:16:16 ip-10-10-0-212 mesos-master[1036]: @ 0x7f5fc2ba979d google::LogMessage::Fail()
> Sep 22 06:16:16 ip-10-10-0-212 mesos-master[1036]: @ 0x7f5fc2bab5cd google::LogMessage::SendToLog()
> Sep 22 06:16:16 ip-10-10-0-212 mesos-master[1036]: I0922 06:16:16.248742 1041 replica.cpp:691] Replica received learned notice for position 2682 from @0.0.0.0:0
> Sep 22 06:16:16 ip-10-10-0-212 mesos-master[1036]: @ 0x7f5fc2ba938c google::LogMessage::Flush()
> Sep 22 06:16:16 ip-10-10-0-212 mesos-master[1036]: @ 0x7f5fc2babec9 google::LogMessageFatal::~LogMessageFatal()
> Sep 22 06:16:16 ip-10-10-0-212 mesos-master[1036]: @ 0x7f5fc21dd52d google::CheckNotNull<>()
> Sep 22 06:16:16 ip-10-10-0-212 mesos-master[1036]: @ 0x7f5fc21c632c
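
The suspected failure mode in the report (a task whose framework has not reregistered, followed by a fatal non-NULL check) can be illustrated with a minimal sketch. Everything below is hypothetical: `MasterState`, `Framework`, `Task`, `getFramework`, and `markTasksUnreachable` are simplified stand-ins invented for illustration, not the Mesos master's types or its actual fix. The sketch only shows a lookup that can return a null pointer and a caller that skips the missing framework instead of asserting on it.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical, simplified stand-ins; these are NOT the Mesos types.
struct Framework {
  std::string id;
  std::vector<std::string> updates;  // status updates delivered so far
};

struct Task {
  std::string id;
  std::string frameworkId;
};

struct MasterState {
  // Only frameworks that have (re)registered appear in this map.
  std::map<std::string, Framework> frameworks;

  // Returns nullptr when the framework is not currently registered.
  Framework* getFramework(const std::string& id) {
    auto it = frameworks.find(id);
    return it == frameworks.end() ? nullptr : &it->second;
  }
};

// Walks the tasks of an unreachable agent and records a TASK_LOST update
// for each task whose framework is registered. Returns the number of
// updates recorded.
int markTasksUnreachable(MasterState& master, const std::vector<Task>& tasks) {
  int delivered = 0;
  for (const Task& task : tasks) {
    Framework* framework = master.getFramework(task.frameworkId);
    if (framework == nullptr) {
      // Assumption for this sketch: a missing framework is tolerated.
      // A fatal non-NULL check at this point would abort the process.
      continue;
    }
    framework->updates.push_back(task.id + ":TASK_LOST");
    ++delivered;
  }
  return delivered;
}
```

With one registered framework and one task that refers to an unregistered framework, `markTasksUnreachable` records a single update and returns 1 rather than terminating.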
[jira] [Updated] (MESOS-6226) Master crashes while transitioning tasks to 'TASK_UNREACHABLE'
[ https://issues.apache.org/jira/browse/MESOS-6226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jie Yu updated MESOS-6226:
--------------------------
    Fix Version/s: 1.1.0

> Master crashes while transitioning tasks to 'TASK_UNREACHABLE'
> --------------------------------------------------------------
>
>                 Key: MESOS-6226
>                 URL: https://issues.apache.org/jira/browse/MESOS-6226
>
> Sep 22 06:16:16 ip-10-10-0-212 mesos-master[1036]: F0922 06:16:16.222733 1043 master.cpp:5918] Check failed: 'framework' Must be non NULL
> Sep 22 06:16:16 ip-10-10-0-212 mesos-master[1036]: *** Check failure stack trace: ***
> Sep 22 06:16:16 ip-10-10-0-212 mesos-master[1036]: @ 0x7f5fc2ba979d google::LogMessage::Fail()
> Sep 22 06:16:16 ip-10-10-0-212 mesos-master[1036]: @ 0x7f5fc2bab5cd google::LogMessage::SendToLog()
> Sep 22 06:16:16 ip-10-10-0-212 mesos-master[1036]: @ 0x7f5fc2ba938c google::LogMessage::Flush()
> Sep 22 06:16:16 ip-10-10-0-212 mesos-master[1036]: @ 0x7f5fc2babec9 google::LogMessageFatal::~LogMessageFatal()
> Sep 22 06:16:16 ip-10-10-0-212 mesos-master[1036]: @ 0x7f5fc21dd52d google::CheckNotNull<>()
> Sep 22 06:16:16 ip-10-10-0-212 mesos-master[1036]: @ 0x7f5fc21c632c mesos::internal::master::Master::_markUnreachable()
[jira] [Updated] (MESOS-6226) Master crashes while transitioning tasks to 'TASK_UNREACHABLE'
[ https://issues.apache.org/jira/browse/MESOS-6226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jie Yu updated MESOS-6226:
--------------------------
    Priority: Critical  (was: Major)

> Master crashes while transitioning tasks to 'TASK_UNREACHABLE'
> --------------------------------------------------------------
>
>                 Key: MESOS-6226
>                 URL: https://issues.apache.org/jira/browse/MESOS-6226