[jira] [Updated] (MESOS-6226) Master crashes while transitioning tasks to 'TASK_UNREACHABLE'

2016-09-22 Thread Neil Conway (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neil Conway updated MESOS-6226:
---
Shepherd: Vinod Kone

> Master crashes while transitioning tasks to 'TASK_UNREACHABLE'
> --
>
> Key: MESOS-6226
> URL: https://issues.apache.org/jira/browse/MESOS-6226
> Project: Mesos
> Issue Type: Bug
> Components: master
> Affects Versions: 1.1.0
> Reporter: Jan Schlicht
> Assignee: Neil Conway
> Priority: Critical
> Fix For: 1.1.0
>
>
> While marking an agent as unreachable, the Mesos master crashes when it 
> transitions that agent's tasks to unreachable.
> One possible cause is that a framework has not re-registered yet.
> {noformat}
> Sep 22 06:16:16 ip-10-10-0-212 mesos-master[1036]: I0922 06:16:16.207105  1043 master.cpp:236] Scheduling transition of agent 5f0106ed-1033-4624-ba1b-99f81c7e348a-S1 to UNREACHABLE because of health check timeout
> Sep 22 06:16:16 ip-10-10-0-212 mesos-master[1036]: I0922 06:16:16.217079  1043 master.cpp:5849] Marking agent 5f0106ed-1033-4624-ba1b-99f81c7e348a-S1 at slave(1)@10.10.0.205:5051 (10.10.0.205) unreachable: health check timed out
> Sep 22 06:16:16 ip-10-10-0-212 mesos-master[1036]: I0922 06:16:16.217375  1043 registrar.cpp:461] Applied 1 operations in 97324ns; attempting to update the registry
> Sep 22 06:16:16 ip-10-10-0-212 mesos-master[1036]: I0922 06:16:16.217798  1041 coordinator.cpp:348] Coordinator attempting to write APPEND action at position 2681
> Sep 22 06:16:16 ip-10-10-0-212 mesos-master[1036]: I0922 06:16:16.218132  1041 replica.cpp:537] Replica received write request for position 2681 from __req_res__(35)@10.10.0.212:5050
> Sep 22 06:16:16 ip-10-10-0-212 mesos-master[1036]: E0922 06:16:16.218492  1049 process.cpp:2113] Failed to shutdown socket with fd 74: Transport endpoint is not connected
> Sep 22 06:16:16 ip-10-10-0-212 mesos-master[1036]: I0922 06:16:16.220237  1041 replica.cpp:691] Replica received learned notice for position 2681 from @0.0.0.0:0
> Sep 22 06:16:16 ip-10-10-0-212 mesos-master[1036]: I0922 06:16:16.222190  1041 registrar.cpp:506] Successfully updated the registry in 4.75008ms
> Sep 22 06:16:16 ip-10-10-0-212 mesos-master[1036]: I0922 06:16:16.222304  1043 coordinator.cpp:348] Coordinator attempting to write TRUNCATE action at position 2682
> Sep 22 06:16:16 ip-10-10-0-212 mesos-master[1036]: I0922 06:16:16.222390  1043 master.cpp:5896] Marked agent 5f0106ed-1033-4624-ba1b-99f81c7e348a-S1 (10.10.0.205) unreachable: health check timed out
> Sep 22 06:16:16 ip-10-10-0-212 mesos-master[1036]: I0922 06:16:16.222456  1043 master.cpp:7537] Updating the state of task 8 of framework 2505868e-c750-4aed-8413-99810a59ffd8-0001 (latest state: TASK_LOST, status update state: TASK_LOST)
> Sep 22 06:16:16 ip-10-10-0-212 mesos-master[1036]: I0922 06:16:16.222558  1043 master.cpp:7633] Removing task 8 with resources cpus(*):0.2; mem(*):32 of framework 2505868e-c750-4aed-8413-99810a59ffd8-0001 on agent 5f0106ed-1033-4624-ba1b-99f81c7e348a-S1 at slave(1)@10.10.0.205:5051 (10.10.0.205)
> Sep 22 06:16:16 ip-10-10-0-212 mesos-master[1036]: I0922 06:16:16.222631  1043 master.cpp:5695] Sending status update TASK_LOST for task 8 of framework 2505868e-c750-4aed-8413-99810a59ffd8-0001 'Slave 10.10.0.205 is unreachable'
> Sep 22 06:16:16 ip-10-10-0-212 mesos-master[1036]: I0922 06:16:16.222651  1041 hierarchical.cpp:514] Removed agent 5f0106ed-1033-4624-ba1b-99f81c7e348a-S1
> Sep 22 06:16:16 ip-10-10-0-212 mesos-master[1036]: F0922 06:16:16.222733  1043 master.cpp:5918] Check failed: 'framework' Must be non NULL
> Sep 22 06:16:16 ip-10-10-0-212 mesos-master[1036]: *** Check failure stack trace: ***
> Sep 22 06:16:16 ip-10-10-0-212 mesos-master[1036]: I0922 06:16:16.222829  1041 replica.cpp:537] Replica received write request for position 2682 from __req_res__(38)@10.10.0.212:5050
> Sep 22 06:16:16 ip-10-10-0-212 mesos-master[1036]: @ 0x7f5fc2ba979d  google::LogMessage::Fail()
> Sep 22 06:16:16 ip-10-10-0-212 mesos-master[1036]: @ 0x7f5fc2bab5cd  google::LogMessage::SendToLog()
> Sep 22 06:16:16 ip-10-10-0-212 mesos-master[1036]: I0922 06:16:16.248742  1041 replica.cpp:691] Replica received learned notice for position 2682 from @0.0.0.0:0
> Sep 22 06:16:16 ip-10-10-0-212 mesos-master[1036]: @ 0x7f5fc2ba938c  google::LogMessage::Flush()
> Sep 22 06:16:16 ip-10-10-0-212 mesos-master[1036]: @ 0x7f5fc2babec9  google::LogMessageFatal::~LogMessageFatal()
> Sep 22 06:16:16 ip-10-10-0-212 mesos-master[1036]: @ 0x7f5fc21dd52d  google::CheckNotNull<>()
> Sep 22 06:16:16 ip-10-10-0-212 mesos-master[1036]: @ 0x7f5fc21c632c  mesos::internal::master::Master::_markUnreachable()
> 
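The failed check in the log above ('framework' Must be non NULL) matches the
reporter's guess that a task can belong to a framework the master does not
currently know about. The following is only a toy model of that situation,
not Mesos code and not the actual fix: the Master and Task classes, the method
names, and the "skip unknown frameworks" policy are all assumptions made for
illustration. It contrasts a lookup guarded by a hard check with one that
tolerates a missing framework.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Task:
    task_id: str
    framework_id: str


@dataclass
class Master:
    """Hypothetical stand-in for the master's in-memory state."""

    # Frameworks currently registered, keyed by framework ID.
    frameworks: Dict[str, str] = field(default_factory=dict)
    # Status updates that would be forwarded: (framework_id, task_id, state).
    sent: List[Tuple[str, str, str]] = field(default_factory=list)

    def get_framework(self, framework_id: str) -> Optional[str]:
        # Returns None when the framework has not (re-)registered.
        return self.frameworks.get(framework_id)

    def mark_unreachable_strict(self, tasks: List[Task]) -> None:
        """Aborts on an unknown framework, like a hard non-null check."""
        for task in tasks:
            framework = self.get_framework(task.framework_id)
            if framework is None:
                raise RuntimeError("'framework' Must be non NULL")
            self.sent.append((task.framework_id, task.task_id, "TASK_LOST"))

    def mark_unreachable_tolerant(self, tasks: List[Task]) -> List[str]:
        """Skips tasks whose framework is unknown; returns their task IDs."""
        skipped = []
        for task in tasks:
            framework = self.get_framework(task.framework_id)
            if framework is None:
                # Assumed policy: nothing to notify, so move on.
                skipped.append(task.task_id)
                continue
            self.sent.append((task.framework_id, task.task_id, "TASK_LOST"))
        return skipped


if __name__ == "__main__":
    master = Master(frameworks={"fw-registered": "scheduler"})
    tasks = [Task("8", "fw-missing"), Task("9", "fw-registered")]
    print(master.mark_unreachable_tolerant(tasks))
    print(master.sent)
```

With the strict variant, the same task list raises on task 8, which is the
toy equivalent of the fatal check in the log. Whether skipping is the right
policy for the real master is a design question for the issue itself.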

[jira] [Updated] (MESOS-6226) Master crashes while transitioning tasks to 'TASK_UNREACHABLE'

2016-09-22 Thread Jie Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu updated MESOS-6226:
--
Fix Version/s: 1.1.0


[jira] [Updated] (MESOS-6226) Master crashes while transitioning tasks to 'TASK_UNREACHABLE'

2016-09-22 Thread Jie Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu updated MESOS-6226:
--
Priority: Critical  (was: Major)
