[jira] [Commented] (MESOS-3851) Investigate recent crashes in Command Executor
[ https://issues.apache.org/jira/browse/MESOS-3851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15045457#comment-15045457 ] Anand Mazumdar commented on MESOS-3851: --- Patch for bringing back the reverted commit : https://reviews.apache.org/r/40998/ > Investigate recent crashes in Command Executor > -- > > Key: MESOS-3851 > URL: https://issues.apache.org/jira/browse/MESOS-3851 > Project: Mesos > Issue Type: Bug > Components: containerization >Reporter: Anand Mazumdar >Assignee: Anand Mazumdar > Labels: mesosphere > > Post https://reviews.apache.org/r/38900 i.e. updating CommandExecutor to > support rootfs. There seem to be some tests showing frequent crashes due to > assert violations. > {{FetcherCacheTest.SimpleEviction}} failed due to the following log: > {code} > I1107 19:36:46.360908 30657 slave.cpp:1793] Sending queued task '3' to > executor ''3' of framework 7d94c7fb-8950-4bcf-80c1-46112292dcd6- at > executor(1)@172.17.5.200:33871' > I1107 19:36:46.363682 1236 exec.cpp:297] > I1107 19:36:46.373569 1245 exec.cpp:210] Executor registered on slave > 7d94c7fb-8950-4bcf-80c1-46112292dcd6-S0 > @ 0x7f9f5a7db3fa google::LogMessage::Fail() > I1107 19:36:46.394081 1245 exec.cpp:222] Executor::registered took 395411ns > @ 0x7f9f5a7db359 google::LogMessage::SendToLog() > @ 0x7f9f5a7dad6a google::LogMessage::Flush() > @ 0x7f9f5a7dda9e google::LogMessageFatal::~LogMessageFatal() > @ 0x48d00a _CheckFatal::~_CheckFatal() > @ 0x49c99d > mesos::internal::CommandExecutorProcess::launchTask() > @ 0x4b3dd7 > _ZZN7process8dispatchIN5mesos8internal22CommandExecutorProcessEPNS1_14ExecutorDriverERKNS1_8TaskInfoES5_S6_EEvRKNS_3PIDIT_EEMSA_FvT0_T1_ET2_T3_ENKUlPNS_11ProcessBaseEE_clESL_ > @ 0x4c470c > _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchIN5mesos8internal22CommandExecutorProcessEPNS5_14ExecutorDriverERKNS5_8TaskInfoES9_SA_EEvRKNS0_3PIDIT_EEMSE_FvT0_T1_ET2_T3_EUlS2_E_E9_M_invokeERKSt9_Any_dataS2_ > @ 0x7f9f5a761b1b std::function<>::operator()() > @ 0x7f9f5a749935 process::ProcessBase::visit() > @ 0x7f9f5a74d700 process::DispatchEvent::visit() > @ 0x48e004 process::ProcessBase::serve() > @ 0x7f9f5a745d21 process::ProcessManager::resume() > @ 0x7f9f5a742f52 > _ZZN7process14ProcessManager12init_threadsEvENKUlRKSt11atomic_boolE_clES3_ > @ 0x7f9f5a74cf2c > _ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEE6__callIvIEILm0T_OSt5tupleIIDpT0_EESt12_Index_tupleIIXspT1_EEE > @ 0x7f9f5a74cedc > _ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEEclIIEvEET0_DpOT_ > @ 0x7f9f5a74ce6e > _ZNSt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS4_EEEvEE9_M_invokeIIEEEvSt12_Index_tupleIIXspT_EEE > @ 0x7f9f5a74cdc5 > _ZNSt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS4_EEEvEEclEv > @ 0x7f9f5a74cd5e > _ZNSt6thread5_ImplISt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS6_EEEvEEE6_M_runEv > @ 0x7f9f5624f1e0 (unknown) > @ 0x7f9f564a8df5 start_thread > @ 0x7f9f559b71ad __clone > I1107 19:36:46.551370 30656 containerizer.cpp:1257] Executor for container > '6553a617-6b4a-418d-9759-5681f45ff854' has exited > I1107 19:36:46.551429 30656 containerizer.cpp:1074] Destroying container > '6553a617-6b4a-418d-9759-5681f45ff854' > I1107 19:36:46.553869 30656 containerizer.cpp:1257] Executor for container > 'd2c1f924-c92a-453e-82b1-c294d09c4873' has exited > {code} > The reason seems to be a race between the executor receiving a > {{RunTaskMessage}} before {{ExecutorRegisteredMessage}} leading to the > {{CHECK_SOME(executorInfo)}} failure. > Link to complete log: > https://issues.apache.org/jira/browse/MESOS-2831?focusedCommentId=14995535=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14995535 > Another related failure from {{ExamplesTest.PersistentVolumeFramework}} > {code} > @ 0x7f4f71529cbd google::LogMessage::SendToLog() > I1107 13:15:09.949987 31573 slave.cpp:2337] Status update manager > successfully handled status update acknowledgement (UUID: > 721c7316-5580-4636-a83a-098e3bd4ed1f) for task > ad90531f-d3d8-43f6-96f2-c81c4548a12d of framework > ac4ea54a-7d19-4e41-9ee3-1a761f8e5b0f- > @ 0x7f4f715296ce google::LogMessage::Flush() > @ 0x7f4f7152c402 google::LogMessageFatal::~LogMessageFatal() > @
[jira] [Commented] (MESOS-3851) Investigate recent crashes in Command Executor
[ https://issues.apache.org/jira/browse/MESOS-3851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15042325#comment-15042325 ] Benjamin Mahler commented on MESOS-3851: The fix is committed, it would be great to re-enable the CHECKs in order to detect this issue should it still be present: {noformat} commit 4201c2c3e5849a01d0a63769404bad03792ae5de Author: Anand MazumdarDate: Fri Dec 4 14:15:26 2015 -0800 Linked against the executor in the agent to ensure ordered message delivery. Previously, we did not `link` against the executor `PID` while (re)-registering. This might lead to libprocess creating ephemeral sockets everytime a `send` was invoked. This was leading to races where messages might appear on the Executor out of order. This change does a `link` on the executor PID to ensure ordered message delivery. Review: https://reviews.apache.org/r/40660 {noformat} > Investigate recent crashes in Command Executor > -- > > Key: MESOS-3851 > URL: https://issues.apache.org/jira/browse/MESOS-3851 > Project: Mesos > Issue Type: Bug > Components: containerization >Reporter: Anand Mazumdar >Assignee: Anand Mazumdar > Labels: mesosphere > > Post https://reviews.apache.org/r/38900 i.e. updating CommandExecutor to > support rootfs. There seem to be some tests showing frequent crashes due to > assert violations. > {{FetcherCacheTest.SimpleEviction}} failed due to the following log: > {code} > I1107 19:36:46.360908 30657 slave.cpp:1793] Sending queued task '3' to > executor ''3' of framework 7d94c7fb-8950-4bcf-80c1-46112292dcd6- at > executor(1)@172.17.5.200:33871' > I1107 19:36:46.363682 1236 exec.cpp:297] > I1107 19:36:46.373569 1245 exec.cpp:210] Executor registered on slave > 7d94c7fb-8950-4bcf-80c1-46112292dcd6-S0 > @ 0x7f9f5a7db3fa google::LogMessage::Fail() > I1107 19:36:46.394081 1245 exec.cpp:222] Executor::registered took 395411ns > @ 0x7f9f5a7db359 google::LogMessage::SendToLog() > @ 0x7f9f5a7dad6a google::LogMessage::Flush() > @ 0x7f9f5a7dda9e google::LogMessageFatal::~LogMessageFatal() > @ 0x48d00a _CheckFatal::~_CheckFatal() > @ 0x49c99d > mesos::internal::CommandExecutorProcess::launchTask() > @ 0x4b3dd7 > _ZZN7process8dispatchIN5mesos8internal22CommandExecutorProcessEPNS1_14ExecutorDriverERKNS1_8TaskInfoES5_S6_EEvRKNS_3PIDIT_EEMSA_FvT0_T1_ET2_T3_ENKUlPNS_11ProcessBaseEE_clESL_ > @ 0x4c470c > _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchIN5mesos8internal22CommandExecutorProcessEPNS5_14ExecutorDriverERKNS5_8TaskInfoES9_SA_EEvRKNS0_3PIDIT_EEMSE_FvT0_T1_ET2_T3_EUlS2_E_E9_M_invokeERKSt9_Any_dataS2_ > @ 0x7f9f5a761b1b std::function<>::operator()() > @ 0x7f9f5a749935 process::ProcessBase::visit() > @ 0x7f9f5a74d700 process::DispatchEvent::visit() > @ 0x48e004 process::ProcessBase::serve() > @ 0x7f9f5a745d21 process::ProcessManager::resume() > @ 0x7f9f5a742f52 > _ZZN7process14ProcessManager12init_threadsEvENKUlRKSt11atomic_boolE_clES3_ > @ 0x7f9f5a74cf2c > _ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEE6__callIvIEILm0T_OSt5tupleIIDpT0_EESt12_Index_tupleIIXspT1_EEE > @ 0x7f9f5a74cedc > _ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEEclIIEvEET0_DpOT_ > @ 0x7f9f5a74ce6e > _ZNSt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS4_EEEvEE9_M_invokeIIEEEvSt12_Index_tupleIIXspT_EEE > @ 0x7f9f5a74cdc5 > _ZNSt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS4_EEEvEEclEv > @ 0x7f9f5a74cd5e > _ZNSt6thread5_ImplISt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS6_EEEvEEE6_M_runEv > @ 0x7f9f5624f1e0 (unknown) > @ 0x7f9f564a8df5 start_thread > @ 0x7f9f559b71ad __clone > I1107 19:36:46.551370 30656 containerizer.cpp:1257] Executor for container > '6553a617-6b4a-418d-9759-5681f45ff854' has exited > I1107 19:36:46.551429 30656 containerizer.cpp:1074] Destroying container > '6553a617-6b4a-418d-9759-5681f45ff854' > I1107 19:36:46.553869 30656 containerizer.cpp:1257] Executor for container > 'd2c1f924-c92a-453e-82b1-c294d09c4873' has exited > {code} > The reason seems to be a race between the executor receiving a > {{RunTaskMessage}} before {{ExecutorRegisteredMessage}} leading to the > {{CHECK_SOME(executorInfo)}} failure. > Link to complete log: >
[jira] [Commented] (MESOS-3851) Investigate recent crashes in Command Executor
[ https://issues.apache.org/jira/browse/MESOS-3851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15023186#comment-15023186 ] Anand Mazumdar commented on MESOS-3851: --- This should not be a blocker for 0.26. As [~bmahler] pointed out earlier in the thread, this race happens due to us doing a {{send}} without a {{link}}. Hence, this behavior has existed for quite some time. It's just that [~tnachen]'s changes to command executor highlighted this. > Investigate recent crashes in Command Executor > -- > > Key: MESOS-3851 > URL: https://issues.apache.org/jira/browse/MESOS-3851 > Project: Mesos > Issue Type: Bug > Components: containerization >Reporter: Anand Mazumdar >Assignee: Anand Mazumdar >Priority: Blocker > Labels: mesosphere > > Post https://reviews.apache.org/r/38900 i.e. updating CommandExecutor to > support rootfs. There seem to be some tests showing frequent crashes due to > assert violations. > {{FetcherCacheTest.SimpleEviction}} failed due to the following log: > {code} > I1107 19:36:46.360908 30657 slave.cpp:1793] Sending queued task '3' to > executor ''3' of framework 7d94c7fb-8950-4bcf-80c1-46112292dcd6- at > executor(1)@172.17.5.200:33871' > I1107 19:36:46.363682 1236 exec.cpp:297] > I1107 19:36:46.373569 1245 exec.cpp:210] Executor registered on slave > 7d94c7fb-8950-4bcf-80c1-46112292dcd6-S0 > @ 0x7f9f5a7db3fa google::LogMessage::Fail() > I1107 19:36:46.394081 1245 exec.cpp:222] Executor::registered took 395411ns > @ 0x7f9f5a7db359 google::LogMessage::SendToLog() > @ 0x7f9f5a7dad6a google::LogMessage::Flush() > @ 0x7f9f5a7dda9e google::LogMessageFatal::~LogMessageFatal() > @ 0x48d00a _CheckFatal::~_CheckFatal() > @ 0x49c99d > mesos::internal::CommandExecutorProcess::launchTask() > @ 0x4b3dd7 > _ZZN7process8dispatchIN5mesos8internal22CommandExecutorProcessEPNS1_14ExecutorDriverERKNS1_8TaskInfoES5_S6_EEvRKNS_3PIDIT_EEMSA_FvT0_T1_ET2_T3_ENKUlPNS_11ProcessBaseEE_clESL_ > @ 0x4c470c > _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchIN5mesos8internal22CommandExecutorProcessEPNS5_14ExecutorDriverERKNS5_8TaskInfoES9_SA_EEvRKNS0_3PIDIT_EEMSE_FvT0_T1_ET2_T3_EUlS2_E_E9_M_invokeERKSt9_Any_dataS2_ > @ 0x7f9f5a761b1b std::function<>::operator()() > @ 0x7f9f5a749935 process::ProcessBase::visit() > @ 0x7f9f5a74d700 process::DispatchEvent::visit() > @ 0x48e004 process::ProcessBase::serve() > @ 0x7f9f5a745d21 process::ProcessManager::resume() > @ 0x7f9f5a742f52 > _ZZN7process14ProcessManager12init_threadsEvENKUlRKSt11atomic_boolE_clES3_ > @ 0x7f9f5a74cf2c > _ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEE6__callIvIEILm0T_OSt5tupleIIDpT0_EESt12_Index_tupleIIXspT1_EEE > @ 0x7f9f5a74cedc > _ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEEclIIEvEET0_DpOT_ > @ 0x7f9f5a74ce6e > _ZNSt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS4_EEEvEE9_M_invokeIIEEEvSt12_Index_tupleIIXspT_EEE > @ 0x7f9f5a74cdc5 > _ZNSt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS4_EEEvEEclEv > @ 0x7f9f5a74cd5e > _ZNSt6thread5_ImplISt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS6_EEEvEEE6_M_runEv > @ 0x7f9f5624f1e0 (unknown) > @ 0x7f9f564a8df5 start_thread > @ 0x7f9f559b71ad __clone > I1107 19:36:46.551370 30656 containerizer.cpp:1257] Executor for container > '6553a617-6b4a-418d-9759-5681f45ff854' has exited > I1107 19:36:46.551429 30656 containerizer.cpp:1074] Destroying container > '6553a617-6b4a-418d-9759-5681f45ff854' > I1107 19:36:46.553869 30656 containerizer.cpp:1257] Executor for container > 'd2c1f924-c92a-453e-82b1-c294d09c4873' has exited > {code} > The reason seems to be a race between the executor receiving a > {{RunTaskMessage}} before {{ExecutorRegisteredMessage}} leading to the > {{CHECK_SOME(executorInfo)}} failure. > Link to complete log: > https://issues.apache.org/jira/browse/MESOS-2831?focusedCommentId=14995535=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14995535 > Another related failure from {{ExamplesTest.PersistentVolumeFramework}} > {code} > @ 0x7f4f71529cbd google::LogMessage::SendToLog() > I1107 13:15:09.949987 31573 slave.cpp:2337] Status update manager > successfully handled status update acknowledgement (UUID: > 721c7316-5580-4636-a83a-098e3bd4ed1f) for task >
[jira] [Commented] (MESOS-3851) Investigate recent crashes in Command Executor
[ https://issues.apache.org/jira/browse/MESOS-3851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15002589#comment-15002589 ] Benjamin Mahler commented on MESOS-3851: My guess is that this is due to the semantics of {{send}} without a {{link}}. We may use two ephemeral connections to send both of these messages, so it seems possible for the messages to be re-ordered. If so, linking should fix the issue. > Investigate recent crashes in Command Executor > -- > > Key: MESOS-3851 > URL: https://issues.apache.org/jira/browse/MESOS-3851 > Project: Mesos > Issue Type: Bug > Components: containerization >Reporter: Anand Mazumdar >Priority: Blocker > Labels: mesosphere > > Post https://reviews.apache.org/r/38900 i.e. updating CommandExecutor to > support rootfs. There seem to be some tests showing frequent crashes due to > assert violations. > {{FetcherCacheTest.SimpleEviction}} failed due to the following log: > {code} > I1107 19:36:46.360908 30657 slave.cpp:1793] Sending queued task '3' to > executor ''3' of framework 7d94c7fb-8950-4bcf-80c1-46112292dcd6- at > executor(1)@172.17.5.200:33871' > I1107 19:36:46.363682 1236 exec.cpp:297] > I1107 19:36:46.373569 1245 exec.cpp:210] Executor registered on slave > 7d94c7fb-8950-4bcf-80c1-46112292dcd6-S0 > @ 0x7f9f5a7db3fa google::LogMessage::Fail() > I1107 19:36:46.394081 1245 exec.cpp:222] Executor::registered took 395411ns > @ 0x7f9f5a7db359 google::LogMessage::SendToLog() > @ 0x7f9f5a7dad6a google::LogMessage::Flush() > @ 0x7f9f5a7dda9e google::LogMessageFatal::~LogMessageFatal() > @ 0x48d00a _CheckFatal::~_CheckFatal() > @ 0x49c99d > mesos::internal::CommandExecutorProcess::launchTask() > @ 0x4b3dd7 > _ZZN7process8dispatchIN5mesos8internal22CommandExecutorProcessEPNS1_14ExecutorDriverERKNS1_8TaskInfoES5_S6_EEvRKNS_3PIDIT_EEMSA_FvT0_T1_ET2_T3_ENKUlPNS_11ProcessBaseEE_clESL_ > @ 0x4c470c > _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchIN5mesos8internal22CommandExecutorProcessEPNS5_14ExecutorDriverERKNS5_8TaskInfoES9_SA_EEvRKNS0_3PIDIT_EEMSE_FvT0_T1_ET2_T3_EUlS2_E_E9_M_invokeERKSt9_Any_dataS2_ > @ 0x7f9f5a761b1b std::function<>::operator()() > @ 0x7f9f5a749935 process::ProcessBase::visit() > @ 0x7f9f5a74d700 process::DispatchEvent::visit() > @ 0x48e004 process::ProcessBase::serve() > @ 0x7f9f5a745d21 process::ProcessManager::resume() > @ 0x7f9f5a742f52 > _ZZN7process14ProcessManager12init_threadsEvENKUlRKSt11atomic_boolE_clES3_ > @ 0x7f9f5a74cf2c > _ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEE6__callIvIEILm0T_OSt5tupleIIDpT0_EESt12_Index_tupleIIXspT1_EEE > @ 0x7f9f5a74cedc > _ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEEclIIEvEET0_DpOT_ > @ 0x7f9f5a74ce6e > _ZNSt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS4_EEEvEE9_M_invokeIIEEEvSt12_Index_tupleIIXspT_EEE > @ 0x7f9f5a74cdc5 > _ZNSt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS4_EEEvEEclEv > @ 0x7f9f5a74cd5e > _ZNSt6thread5_ImplISt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS6_EEEvEEE6_M_runEv > @ 0x7f9f5624f1e0 (unknown) > @ 0x7f9f564a8df5 start_thread > @ 0x7f9f559b71ad __clone > I1107 19:36:46.551370 30656 containerizer.cpp:1257] Executor for container > '6553a617-6b4a-418d-9759-5681f45ff854' has exited > I1107 19:36:46.551429 30656 containerizer.cpp:1074] Destroying container > '6553a617-6b4a-418d-9759-5681f45ff854' > I1107 19:36:46.553869 30656 containerizer.cpp:1257] Executor for container > 'd2c1f924-c92a-453e-82b1-c294d09c4873' has exited > {code} > The reason seems to be a race between the executor receiving a > {{RunTaskMessage}} before {{ExecutorRegisteredMessage}} leading to the > {{CHECK_SOME(executorInfo)}} failure. > Link to complete log: > https://issues.apache.org/jira/browse/MESOS-2831?focusedCommentId=14995535=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14995535 > Another related failure from {{ExamplesTest.PersistentVolumeFramework}} > {code} > @ 0x7f4f71529cbd google::LogMessage::SendToLog() > I1107 13:15:09.949987 31573 slave.cpp:2337] Status update manager > successfully handled status update acknowledgement (UUID: > 721c7316-5580-4636-a83a-098e3bd4ed1f) for task > ad90531f-d3d8-43f6-96f2-c81c4548a12d of framework >
[jira] [Commented] (MESOS-3851) Investigate recent crashes in Command Executor
[ https://issues.apache.org/jira/browse/MESOS-3851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15002614#comment-15002614 ] Vinod Kone commented on MESOS-3851: --- i looked at the libprocess code and it looked like a single connection should be reused? from: https://github.com/apache/mesos/blob/master/3rdparty/libprocess/src/process.cpp#L1539 {code} void SocketManager::send(Message* message, const Socket::Kind& kind) { CHECK(message != NULL); const Address& address = message->to.address; Option socket = None(); bool connect = false; synchronized (mutex) { // Check if there is already a socket. bool persist = persists.count(address) > 0; bool temp = temps.count(address) > 0; if (persist || temp) { int s = persist ? persists[address] : temps[address]; CHECK(sockets.count(s) > 0); socket = *sockets[s]; // Update whether or not this socket should get disposed after // there is no more data to send. if (!persist) { dispose.insert(socket.get()); } if (outgoing.count(socket.get()) > 0) { outgoing[socket.get()].push(new MessageEncoder(socket.get(), message)); return; } else { // Initialize the outgoing queue. outgoing[socket.get()]; } } else { // No persistent or temporary socket to the socket address // currently exists, so we create a temporary one. // The kind of socket we create is passed in as an argument. // This allows us to support downgrading the connection type // from SSL to POLL if enabled. Try create = Socket::create(kind); if (create.isError()) { VLOG(1) << "Failed to send, create socket: " << create.error(); delete message; return; } socket = create.get(); int s = socket.get(); sockets[s] = new Socket(socket.get()); addresses[s] = address; temps[address] = s; dispose.insert(s); // Initialize the outgoing queue. outgoing[s]; connect = true; } } if (connect) { CHECK_SOME(socket); socket.get().connect(address) .onAny(lambda::bind( ::send_connect, this, lambda::_1, new Socket(socket.get()), message)); } else { // If we're not connecting and we haven't added the encoder to // the 'outgoing' queue then schedule it to be sent. internal::send( new MessageEncoder(socket.get(), message), new Socket(socket.get())); } {code} AFAICT, the race is only possible if 2 sockets are created but even for temp connections (without link) only one socket is created. Well, I guess it is possible for a 2 temp sockets to be created, if the first one gets closed (because of no data) before the 2nd request arrives. > Investigate recent crashes in Command Executor > -- > > Key: MESOS-3851 > URL: https://issues.apache.org/jira/browse/MESOS-3851 > Project: Mesos > Issue Type: Bug > Components: containerization >Reporter: Anand Mazumdar >Priority: Blocker > Labels: mesosphere > > Post https://reviews.apache.org/r/38900 i.e. updating CommandExecutor to > support rootfs. There seem to be some tests showing frequent crashes due to > assert violations. > {{FetcherCacheTest.SimpleEviction}} failed due to the following log: > {code} > I1107 19:36:46.360908 30657 slave.cpp:1793] Sending queued task '3' to > executor ''3' of framework 7d94c7fb-8950-4bcf-80c1-46112292dcd6- at > executor(1)@172.17.5.200:33871' > I1107 19:36:46.363682 1236 exec.cpp:297] > I1107 19:36:46.373569 1245 exec.cpp:210] Executor registered on slave > 7d94c7fb-8950-4bcf-80c1-46112292dcd6-S0 > @ 0x7f9f5a7db3fa google::LogMessage::Fail() > I1107 19:36:46.394081 1245 exec.cpp:222] Executor::registered took 395411ns > @ 0x7f9f5a7db359 google::LogMessage::SendToLog() > @ 0x7f9f5a7dad6a google::LogMessage::Flush() > @ 0x7f9f5a7dda9e google::LogMessageFatal::~LogMessageFatal() > @ 0x48d00a _CheckFatal::~_CheckFatal() > @ 0x49c99d > mesos::internal::CommandExecutorProcess::launchTask() > @ 0x4b3dd7 > _ZZN7process8dispatchIN5mesos8internal22CommandExecutorProcessEPNS1_14ExecutorDriverERKNS1_8TaskInfoES5_S6_EEvRKNS_3PIDIT_EEMSA_FvT0_T1_ET2_T3_ENKUlPNS_11ProcessBaseEE_clESL_ > @ 0x4c470c > _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchIN5mesos8internal22CommandExecutorProcessEPNS5_14ExecutorDriverERKNS5_8TaskInfoES9_SA_EEvRKNS0_3PIDIT_EEMSE_FvT0_T1_ET2_T3_EUlS2_E_E9_M_invokeERKSt9_Any_dataS2_ > @ 0x7f9f5a761b1b std::function<>::operator()() > @ 0x7f9f5a749935 process::ProcessBase::visit() > @ 0x7f9f5a74d700 process::DispatchEvent::visit() >
[jira] [Commented] (MESOS-3851) Investigate recent crashes in Command Executor
[ https://issues.apache.org/jira/browse/MESOS-3851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14998699#comment-14998699 ] Till Toenshoff commented on MESOS-3851: --- I will be committing the workaround patch Tim has provided https://reviews.apache.org/r/40107/ (thanks a bunch [~tnachen]!) shortly after running a final check on it. > Investigate recent crashes in Command Executor > -- > > Key: MESOS-3851 > URL: https://issues.apache.org/jira/browse/MESOS-3851 > Project: Mesos > Issue Type: Bug > Components: containerization >Reporter: Anand Mazumdar >Priority: Blocker > Labels: mesosphere > > Post https://reviews.apache.org/r/38900 i.e. updating CommandExecutor to > support rootfs. There seem to be some tests showing frequent crashes due to > assert violations. > {{FetcherCacheTest.SimpleEviction}} failed due to the following log: > {code} > I1107 19:36:46.360908 30657 slave.cpp:1793] Sending queued task '3' to > executor ''3' of framework 7d94c7fb-8950-4bcf-80c1-46112292dcd6- at > executor(1)@172.17.5.200:33871' > I1107 19:36:46.363682 1236 exec.cpp:297] > I1107 19:36:46.373569 1245 exec.cpp:210] Executor registered on slave > 7d94c7fb-8950-4bcf-80c1-46112292dcd6-S0 > @ 0x7f9f5a7db3fa google::LogMessage::Fail() > I1107 19:36:46.394081 1245 exec.cpp:222] Executor::registered took 395411ns > @ 0x7f9f5a7db359 google::LogMessage::SendToLog() > @ 0x7f9f5a7dad6a google::LogMessage::Flush() > @ 0x7f9f5a7dda9e google::LogMessageFatal::~LogMessageFatal() > @ 0x48d00a _CheckFatal::~_CheckFatal() > @ 0x49c99d > mesos::internal::CommandExecutorProcess::launchTask() > @ 0x4b3dd7 > _ZZN7process8dispatchIN5mesos8internal22CommandExecutorProcessEPNS1_14ExecutorDriverERKNS1_8TaskInfoES5_S6_EEvRKNS_3PIDIT_EEMSA_FvT0_T1_ET2_T3_ENKUlPNS_11ProcessBaseEE_clESL_ > @ 0x4c470c > _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchIN5mesos8internal22CommandExecutorProcessEPNS5_14ExecutorDriverERKNS5_8TaskInfoES9_SA_EEvRKNS0_3PIDIT_EEMSE_FvT0_T1_ET2_T3_EUlS2_E_E9_M_invokeERKSt9_Any_dataS2_ > @ 0x7f9f5a761b1b std::function<>::operator()() > @ 0x7f9f5a749935 process::ProcessBase::visit() > @ 0x7f9f5a74d700 process::DispatchEvent::visit() > @ 0x48e004 process::ProcessBase::serve() > @ 0x7f9f5a745d21 process::ProcessManager::resume() > @ 0x7f9f5a742f52 > _ZZN7process14ProcessManager12init_threadsEvENKUlRKSt11atomic_boolE_clES3_ > @ 0x7f9f5a74cf2c > _ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEE6__callIvIEILm0T_OSt5tupleIIDpT0_EESt12_Index_tupleIIXspT1_EEE > @ 0x7f9f5a74cedc > _ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEEclIIEvEET0_DpOT_ > @ 0x7f9f5a74ce6e > _ZNSt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS4_EEEvEE9_M_invokeIIEEEvSt12_Index_tupleIIXspT_EEE > @ 0x7f9f5a74cdc5 > _ZNSt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS4_EEEvEEclEv > @ 0x7f9f5a74cd5e > _ZNSt6thread5_ImplISt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS6_EEEvEEE6_M_runEv > @ 0x7f9f5624f1e0 (unknown) > @ 0x7f9f564a8df5 start_thread > @ 0x7f9f559b71ad __clone > I1107 19:36:46.551370 30656 containerizer.cpp:1257] Executor for container > '6553a617-6b4a-418d-9759-5681f45ff854' has exited > I1107 19:36:46.551429 30656 containerizer.cpp:1074] Destroying container > '6553a617-6b4a-418d-9759-5681f45ff854' > I1107 19:36:46.553869 30656 containerizer.cpp:1257] Executor for container > 'd2c1f924-c92a-453e-82b1-c294d09c4873' has exited > {code} > The reason seems to be a race between the executor receiving a > {{RunTaskMessage}} before {{ExecutorRegisteredMessage}} leading to the > {{CHECK_SOME(executorInfo)}} failure. > Link to complete log: > https://issues.apache.org/jira/browse/MESOS-2831?focusedCommentId=14995535=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14995535 > Another related failure from {{ExamplesTest.PersistentVolumeFramework}} > {code} > @ 0x7f4f71529cbd google::LogMessage::SendToLog() > I1107 13:15:09.949987 31573 slave.cpp:2337] Status update manager > successfully handled status update acknowledgement (UUID: > 721c7316-5580-4636-a83a-098e3bd4ed1f) for task > ad90531f-d3d8-43f6-96f2-c81c4548a12d of framework > ac4ea54a-7d19-4e41-9ee3-1a761f8e5b0f- > @ 0x7f4f715296ce google::LogMessage::Flush() > @
[jira] [Commented] (MESOS-3851) Investigate recent crashes in Command Executor
[ https://issues.apache.org/jira/browse/MESOS-3851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14998731#comment-14998731 ] Till Toenshoff commented on MESOS-3851: --- This following commit fixes the crash - we still may want to find the reasoning for the race condition and hence I will not close this ticket but will remove the target version (0.26.0) to unblock 0.26.0. {noformat} commit b6d4b28a4c9ca717ad8be5bbc27e40c005fc51ad Author: Timothy ChenDate: Tue Nov 10 15:46:17 2015 +0100 Removed unused checks in command executor. Review: https://reviews.apache.org/r/40107 {noformat} > Investigate recent crashes in Command Executor > -- > > Key: MESOS-3851 > URL: https://issues.apache.org/jira/browse/MESOS-3851 > Project: Mesos > Issue Type: Bug > Components: containerization >Reporter: Anand Mazumdar >Priority: Blocker > Labels: mesosphere > > Post https://reviews.apache.org/r/38900 i.e. updating CommandExecutor to > support rootfs. There seem to be some tests showing frequent crashes due to > assert violations. > {{FetcherCacheTest.SimpleEviction}} failed due to the following log: > {code} > I1107 19:36:46.360908 30657 slave.cpp:1793] Sending queued task '3' to > executor ''3' of framework 7d94c7fb-8950-4bcf-80c1-46112292dcd6- at > executor(1)@172.17.5.200:33871' > I1107 19:36:46.363682 1236 exec.cpp:297] > I1107 19:36:46.373569 1245 exec.cpp:210] Executor registered on slave > 7d94c7fb-8950-4bcf-80c1-46112292dcd6-S0 > @ 0x7f9f5a7db3fa google::LogMessage::Fail() > I1107 19:36:46.394081 1245 exec.cpp:222] Executor::registered took 395411ns > @ 0x7f9f5a7db359 google::LogMessage::SendToLog() > @ 0x7f9f5a7dad6a google::LogMessage::Flush() > @ 0x7f9f5a7dda9e google::LogMessageFatal::~LogMessageFatal() > @ 0x48d00a _CheckFatal::~_CheckFatal() > @ 0x49c99d > mesos::internal::CommandExecutorProcess::launchTask() > @ 0x4b3dd7 > _ZZN7process8dispatchIN5mesos8internal22CommandExecutorProcessEPNS1_14ExecutorDriverERKNS1_8TaskInfoES5_S6_EEvRKNS_3PIDIT_EEMSA_FvT0_T1_ET2_T3_ENKUlPNS_11ProcessBaseEE_clESL_ > @ 0x4c470c > _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchIN5mesos8internal22CommandExecutorProcessEPNS5_14ExecutorDriverERKNS5_8TaskInfoES9_SA_EEvRKNS0_3PIDIT_EEMSE_FvT0_T1_ET2_T3_EUlS2_E_E9_M_invokeERKSt9_Any_dataS2_ > @ 0x7f9f5a761b1b std::function<>::operator()() > @ 0x7f9f5a749935 process::ProcessBase::visit() > @ 0x7f9f5a74d700 process::DispatchEvent::visit() > @ 0x48e004 process::ProcessBase::serve() > @ 0x7f9f5a745d21 process::ProcessManager::resume() > @ 0x7f9f5a742f52 > _ZZN7process14ProcessManager12init_threadsEvENKUlRKSt11atomic_boolE_clES3_ > @ 0x7f9f5a74cf2c > _ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEE6__callIvIEILm0T_OSt5tupleIIDpT0_EESt12_Index_tupleIIXspT1_EEE > @ 0x7f9f5a74cedc > _ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEEclIIEvEET0_DpOT_ > @ 0x7f9f5a74ce6e > _ZNSt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS4_EEEvEE9_M_invokeIIEEEvSt12_Index_tupleIIXspT_EEE > @ 0x7f9f5a74cdc5 > _ZNSt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS4_EEEvEEclEv > @ 0x7f9f5a74cd5e > _ZNSt6thread5_ImplISt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS6_EEEvEEE6_M_runEv > @ 0x7f9f5624f1e0 (unknown) > @ 0x7f9f564a8df5 start_thread > @ 0x7f9f559b71ad __clone > I1107 19:36:46.551370 30656 containerizer.cpp:1257] Executor for container > '6553a617-6b4a-418d-9759-5681f45ff854' has exited > I1107 19:36:46.551429 30656 containerizer.cpp:1074] Destroying container > '6553a617-6b4a-418d-9759-5681f45ff854' > I1107 19:36:46.553869 30656 containerizer.cpp:1257] Executor for container > 'd2c1f924-c92a-453e-82b1-c294d09c4873' has exited > {code} > The reason seems to be a race between the executor receiving a > {{RunTaskMessage}} before {{ExecutorRegisteredMessage}} leading to the > {{CHECK_SOME(executorInfo)}} failure. > Link to complete log: > https://issues.apache.org/jira/browse/MESOS-2831?focusedCommentId=14995535=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14995535 > Another related failure from {{ExamplesTest.PersistentVolumeFramework}} > {code} > @ 0x7f4f71529cbd google::LogMessage::SendToLog() > I1107 13:15:09.949987 31573 slave.cpp:2337] Status
[jira] [Commented] (MESOS-3851) Investigate recent crashes in Command Executor
[ https://issues.apache.org/jira/browse/MESOS-3851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14999494#comment-14999494 ] Vinod Kone commented on MESOS-3851: --- Doesn't look like this is related to the new HTTP executor logic as this race seem to happen even in non-http-executor based tests. Also the changes in slave doesn't seem related. Either this race has always existed but only now got exposed due to the CHECK in the command executor or there are some recent libprocess related changes that are the cause. > Investigate recent crashes in Command Executor > -- > > Key: MESOS-3851 > URL: https://issues.apache.org/jira/browse/MESOS-3851 > Project: Mesos > Issue Type: Bug > Components: containerization >Reporter: Anand Mazumdar >Priority: Blocker > Labels: mesosphere > > Post https://reviews.apache.org/r/38900 i.e. updating CommandExecutor to > support rootfs. There seem to be some tests showing frequent crashes due to > assert violations. > {{FetcherCacheTest.SimpleEviction}} failed due to the following log: > {code} > I1107 19:36:46.360908 30657 slave.cpp:1793] Sending queued task '3' to > executor ''3' of framework 7d94c7fb-8950-4bcf-80c1-46112292dcd6- at > executor(1)@172.17.5.200:33871' > I1107 19:36:46.363682 1236 exec.cpp:297] > I1107 19:36:46.373569 1245 exec.cpp:210] Executor registered on slave > 7d94c7fb-8950-4bcf-80c1-46112292dcd6-S0 > @ 0x7f9f5a7db3fa google::LogMessage::Fail() > I1107 19:36:46.394081 1245 exec.cpp:222] Executor::registered took 395411ns > @ 0x7f9f5a7db359 google::LogMessage::SendToLog() > @ 0x7f9f5a7dad6a google::LogMessage::Flush() > @ 0x7f9f5a7dda9e google::LogMessageFatal::~LogMessageFatal() > @ 0x48d00a _CheckFatal::~_CheckFatal() > @ 0x49c99d > mesos::internal::CommandExecutorProcess::launchTask() > @ 0x4b3dd7 > _ZZN7process8dispatchIN5mesos8internal22CommandExecutorProcessEPNS1_14ExecutorDriverERKNS1_8TaskInfoES5_S6_EEvRKNS_3PIDIT_EEMSA_FvT0_T1_ET2_T3_ENKUlPNS_11ProcessBaseEE_clESL_ > @ 0x4c470c > _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchIN5mesos8internal22CommandExecutorProcessEPNS5_14ExecutorDriverERKNS5_8TaskInfoES9_SA_EEvRKNS0_3PIDIT_EEMSE_FvT0_T1_ET2_T3_EUlS2_E_E9_M_invokeERKSt9_Any_dataS2_ > @ 0x7f9f5a761b1b std::function<>::operator()() > @ 0x7f9f5a749935 process::ProcessBase::visit() > @ 0x7f9f5a74d700 process::DispatchEvent::visit() > @ 0x48e004 process::ProcessBase::serve() > @ 0x7f9f5a745d21 process::ProcessManager::resume() > @ 0x7f9f5a742f52 > _ZZN7process14ProcessManager12init_threadsEvENKUlRKSt11atomic_boolE_clES3_ > @ 0x7f9f5a74cf2c > _ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEE6__callIvIEILm0T_OSt5tupleIIDpT0_EESt12_Index_tupleIIXspT1_EEE > @ 0x7f9f5a74cedc > _ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEEclIIEvEET0_DpOT_ > @ 0x7f9f5a74ce6e > _ZNSt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS4_EEEvEE9_M_invokeIIEEEvSt12_Index_tupleIIXspT_EEE > @ 0x7f9f5a74cdc5 > _ZNSt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS4_EEEvEEclEv > @ 0x7f9f5a74cd5e > _ZNSt6thread5_ImplISt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS6_EEEvEEE6_M_runEv > @ 0x7f9f5624f1e0 (unknown) > @ 0x7f9f564a8df5 start_thread > @ 0x7f9f559b71ad __clone > I1107 19:36:46.551370 30656 containerizer.cpp:1257] Executor for container > '6553a617-6b4a-418d-9759-5681f45ff854' has exited > I1107 19:36:46.551429 30656 containerizer.cpp:1074] Destroying container > '6553a617-6b4a-418d-9759-5681f45ff854' > I1107 19:36:46.553869 30656 containerizer.cpp:1257] Executor for container > 'd2c1f924-c92a-453e-82b1-c294d09c4873' has exited > {code} > The reason seems to be a race between the executor receiving a > {{RunTaskMessage}} before {{ExecutorRegisteredMessage}} leading to the > {{CHECK_SOME(executorInfo)}} failure. > Link to complete log: > https://issues.apache.org/jira/browse/MESOS-2831?focusedCommentId=14995535=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14995535 > Another related failure from {{ExamplesTest.PersistentVolumeFramework}} > {code} > @ 0x7f4f71529cbd google::LogMessage::SendToLog() > I1107 13:15:09.949987 31573 slave.cpp:2337] Status update manager > successfully handled status update acknowledgement (UUID: >
[jira] [Commented] (MESOS-3851) Investigate recent crashes in Command Executor
[ https://issues.apache.org/jira/browse/MESOS-3851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14998287#comment-14998287 ] Bernd Mathiske commented on MESOS-3851: --- [~marco-mesos]: Guess what I have been looking at as of yesterday :-) [~anandmazumdar] has analyzed this well. There is highly likely be some race of sorts between executor registration and task launching. That would completely explain the CHECK that fails. There is another explanation that is less likely also: faulty data or faulty marshaling or faulty transmission or faulty unmarshaling. My focus is on understanding how the code allows for said race and once I understand it, I will try to cause the race by inserting sleep(someSeconds) somewhere suitable. Without that, there is no reliable way of reproducing the bug. It never happens when I run this, not even on CentOS 7.1. > Investigate recent crashes in Command Executor > -- > > Key: MESOS-3851 > URL: https://issues.apache.org/jira/browse/MESOS-3851 > Project: Mesos > Issue Type: Bug > Components: containerization >Reporter: Anand Mazumdar >Priority: Blocker > Labels: mesosphere > > Post https://reviews.apache.org/r/38900 i.e. updating CommandExecutor to > support rootfs. There seem to be some tests showing frequent crashes due to > assert violations. > {{FetcherCacheTest.SimpleEviction}} failed due to the following log: > {code} > I1107 19:36:46.360908 30657 slave.cpp:1793] Sending queued task '3' to > executor ''3' of framework 7d94c7fb-8950-4bcf-80c1-46112292dcd6- at > executor(1)@172.17.5.200:33871' > I1107 19:36:46.363682 1236 exec.cpp:297] > I1107 19:36:46.373569 1245 exec.cpp:210] Executor registered on slave > 7d94c7fb-8950-4bcf-80c1-46112292dcd6-S0 > @ 0x7f9f5a7db3fa google::LogMessage::Fail() > I1107 19:36:46.394081 1245 exec.cpp:222] Executor::registered took 395411ns > @ 0x7f9f5a7db359 google::LogMessage::SendToLog() > @ 0x7f9f5a7dad6a google::LogMessage::Flush() > @ 0x7f9f5a7dda9e google::LogMessageFatal::~LogMessageFatal() > @ 0x48d00a _CheckFatal::~_CheckFatal() > @ 0x49c99d > mesos::internal::CommandExecutorProcess::launchTask() > @ 0x4b3dd7 > _ZZN7process8dispatchIN5mesos8internal22CommandExecutorProcessEPNS1_14ExecutorDriverERKNS1_8TaskInfoES5_S6_EEvRKNS_3PIDIT_EEMSA_FvT0_T1_ET2_T3_ENKUlPNS_11ProcessBaseEE_clESL_ > @ 0x4c470c > _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchIN5mesos8internal22CommandExecutorProcessEPNS5_14ExecutorDriverERKNS5_8TaskInfoES9_SA_EEvRKNS0_3PIDIT_EEMSE_FvT0_T1_ET2_T3_EUlS2_E_E9_M_invokeERKSt9_Any_dataS2_ > @ 0x7f9f5a761b1b std::function<>::operator()() > @ 0x7f9f5a749935 process::ProcessBase::visit() > @ 0x7f9f5a74d700 process::DispatchEvent::visit() > @ 0x48e004 process::ProcessBase::serve() > @ 0x7f9f5a745d21 process::ProcessManager::resume() > @ 0x7f9f5a742f52 > _ZZN7process14ProcessManager12init_threadsEvENKUlRKSt11atomic_boolE_clES3_ > @ 0x7f9f5a74cf2c > _ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEE6__callIvIEILm0T_OSt5tupleIIDpT0_EESt12_Index_tupleIIXspT1_EEE > @ 0x7f9f5a74cedc > _ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEEclIIEvEET0_DpOT_ > @ 0x7f9f5a74ce6e > _ZNSt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS4_EEEvEE9_M_invokeIIEEEvSt12_Index_tupleIIXspT_EEE > @ 0x7f9f5a74cdc5 > _ZNSt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS4_EEEvEEclEv > @ 0x7f9f5a74cd5e > _ZNSt6thread5_ImplISt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS6_EEEvEEE6_M_runEv > @ 0x7f9f5624f1e0 (unknown) > @ 0x7f9f564a8df5 start_thread > @ 0x7f9f559b71ad __clone > I1107 19:36:46.551370 30656 containerizer.cpp:1257] Executor for container > '6553a617-6b4a-418d-9759-5681f45ff854' has exited > I1107 19:36:46.551429 30656 containerizer.cpp:1074] Destroying container > '6553a617-6b4a-418d-9759-5681f45ff854' > I1107 19:36:46.553869 30656 containerizer.cpp:1257] Executor for container > 'd2c1f924-c92a-453e-82b1-c294d09c4873' has exited > {code} > The reason seems to be a race between the executor receiving a > {{RunTaskMessage}} before {{ExecutorRegisteredMessage}} leading to the > {{CHECK_SOME(executorInfo)}} failure. > Link to complete log: >
[jira] [Commented] (MESOS-3851) Investigate recent crashes in Command Executor
[ https://issues.apache.org/jira/browse/MESOS-3851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14998288#comment-14998288 ] Bernd Mathiske commented on MESOS-3851: --- That said we could tag 0.26.0 without the change in CommandExecutor. This would leave Fetcher tests flaky, but at least CommandExecutor could launch tasks with some probability even if a race occurred. > Investigate recent crashes in Command Executor > -- > > Key: MESOS-3851 > URL: https://issues.apache.org/jira/browse/MESOS-3851 > Project: Mesos > Issue Type: Bug > Components: containerization >Reporter: Anand Mazumdar >Priority: Blocker > Labels: mesosphere > > Post https://reviews.apache.org/r/38900 i.e. updating CommandExecutor to > support rootfs. There seem to be some tests showing frequent crashes due to > assert violations. > {{FetcherCacheTest.SimpleEviction}} failed due to the following log: > {code} > I1107 19:36:46.360908 30657 slave.cpp:1793] Sending queued task '3' to > executor ''3' of framework 7d94c7fb-8950-4bcf-80c1-46112292dcd6- at > executor(1)@172.17.5.200:33871' > I1107 19:36:46.363682 1236 exec.cpp:297] > I1107 19:36:46.373569 1245 exec.cpp:210] Executor registered on slave > 7d94c7fb-8950-4bcf-80c1-46112292dcd6-S0 > @ 0x7f9f5a7db3fa google::LogMessage::Fail() > I1107 19:36:46.394081 1245 exec.cpp:222] Executor::registered took 395411ns > @ 0x7f9f5a7db359 google::LogMessage::SendToLog() > @ 0x7f9f5a7dad6a google::LogMessage::Flush() > @ 0x7f9f5a7dda9e google::LogMessageFatal::~LogMessageFatal() > @ 0x48d00a _CheckFatal::~_CheckFatal() > @ 0x49c99d > mesos::internal::CommandExecutorProcess::launchTask() > @ 0x4b3dd7 > _ZZN7process8dispatchIN5mesos8internal22CommandExecutorProcessEPNS1_14ExecutorDriverERKNS1_8TaskInfoES5_S6_EEvRKNS_3PIDIT_EEMSA_FvT0_T1_ET2_T3_ENKUlPNS_11ProcessBaseEE_clESL_ > @ 0x4c470c > _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchIN5mesos8internal22CommandExecutorProcessEPNS5_14ExecutorDriverERKNS5_8TaskInfoES9_SA_EEvRKNS0_3PIDIT_EEMSE_FvT0_T1_ET2_T3_EUlS2_E_E9_M_invokeERKSt9_Any_dataS2_ > @ 0x7f9f5a761b1b std::function<>::operator()() > @ 0x7f9f5a749935 process::ProcessBase::visit() > @ 0x7f9f5a74d700 process::DispatchEvent::visit() > @ 0x48e004 process::ProcessBase::serve() > @ 0x7f9f5a745d21 process::ProcessManager::resume() > @ 0x7f9f5a742f52 > _ZZN7process14ProcessManager12init_threadsEvENKUlRKSt11atomic_boolE_clES3_ > @ 0x7f9f5a74cf2c > _ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEE6__callIvIEILm0T_OSt5tupleIIDpT0_EESt12_Index_tupleIIXspT1_EEE > @ 0x7f9f5a74cedc > _ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEEclIIEvEET0_DpOT_ > @ 0x7f9f5a74ce6e > _ZNSt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS4_EEEvEE9_M_invokeIIEEEvSt12_Index_tupleIIXspT_EEE > @ 0x7f9f5a74cdc5 > _ZNSt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS4_EEEvEEclEv > @ 0x7f9f5a74cd5e > _ZNSt6thread5_ImplISt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS6_EEEvEEE6_M_runEv > @ 0x7f9f5624f1e0 (unknown) > @ 0x7f9f564a8df5 start_thread > @ 0x7f9f559b71ad __clone > I1107 19:36:46.551370 30656 containerizer.cpp:1257] Executor for container > '6553a617-6b4a-418d-9759-5681f45ff854' has exited > I1107 19:36:46.551429 30656 containerizer.cpp:1074] Destroying container > '6553a617-6b4a-418d-9759-5681f45ff854' > I1107 19:36:46.553869 30656 containerizer.cpp:1257] Executor for container > 'd2c1f924-c92a-453e-82b1-c294d09c4873' has exited > {code} > The reason seems to be a race between the executor receiving a > {{RunTaskMessage}} before {{ExecutorRegisteredMessage}} leading to the > {{CHECK_SOME(executorInfo)}} failure. > Link to complete log: > https://issues.apache.org/jira/browse/MESOS-2831?focusedCommentId=14995535=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14995535 > Another related failure from {{ExamplesTest.PersistentVolumeFramework}} > {code} > @ 0x7f4f71529cbd google::LogMessage::SendToLog() > I1107 13:15:09.949987 31573 slave.cpp:2337] Status update manager > successfully handled status update acknowledgement (UUID: > 721c7316-5580-4636-a83a-098e3bd4ed1f) for task > ad90531f-d3d8-43f6-96f2-c81c4548a12d of framework > ac4ea54a-7d19-4e41-9ee3-1a761f8e5b0f- > @ 0x7f4f715296ce
[jira] [Commented] (MESOS-3851) Investigate recent crashes in Command Executor
[ https://issues.apache.org/jira/browse/MESOS-3851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14998453#comment-14998453 ] haosdent commented on MESOS-3851: - In the error log, registered and launchTaks have different thread id. {noformat} I1110 00:36:30.616987 5169 exec.cpp:306] Executor::launchTask took 160701ns I1110 00:36:30.621285 5163 exec.cpp:222] Executor::registered took 399555ns {noformat} But in local test, these always have same thread id. {noformat} I1110 19:34:46.304114 8953 exec.cpp:222] Executor::registered took 182100ns S1110 19:34:46.304416 8953 exec.cpp:306] Executor::launchTask took 47975ns {noformat} {noformat} R1110 19:34:47.439801 9027 exec.cpp:222] Executor::registered took 257152ns I1110 19:34:47.440234 9027 exec.cpp:306] Executor::launchTask took 111249ns {noformat} {noformat} I1110 19:34:47.943961 9097 exec.cpp:222] Executor::registered took 271225ns I1110 19:34:47.944284 9097 exec.cpp:306] Executor::launchTask took 45141ns {noformat} > Investigate recent crashes in Command Executor > -- > > Key: MESOS-3851 > URL: https://issues.apache.org/jira/browse/MESOS-3851 > Project: Mesos > Issue Type: Bug > Components: containerization >Reporter: Anand Mazumdar >Priority: Blocker > Labels: mesosphere > > Post https://reviews.apache.org/r/38900 i.e. updating CommandExecutor to > support rootfs. There seem to be some tests showing frequent crashes due to > assert violations. > {{FetcherCacheTest.SimpleEviction}} failed due to the following log: > {code} > I1107 19:36:46.360908 30657 slave.cpp:1793] Sending queued task '3' to > executor ''3' of framework 7d94c7fb-8950-4bcf-80c1-46112292dcd6- at > executor(1)@172.17.5.200:33871' > I1107 19:36:46.363682 1236 exec.cpp:297] > I1107 19:36:46.373569 1245 exec.cpp:210] Executor registered on slave > 7d94c7fb-8950-4bcf-80c1-46112292dcd6-S0 > @ 0x7f9f5a7db3fa google::LogMessage::Fail() > I1107 19:36:46.394081 1245 exec.cpp:222] Executor::registered took 395411ns > @ 0x7f9f5a7db359 google::LogMessage::SendToLog() > @ 0x7f9f5a7dad6a google::LogMessage::Flush() > @ 0x7f9f5a7dda9e google::LogMessageFatal::~LogMessageFatal() > @ 0x48d00a _CheckFatal::~_CheckFatal() > @ 0x49c99d > mesos::internal::CommandExecutorProcess::launchTask() > @ 0x4b3dd7 > _ZZN7process8dispatchIN5mesos8internal22CommandExecutorProcessEPNS1_14ExecutorDriverERKNS1_8TaskInfoES5_S6_EEvRKNS_3PIDIT_EEMSA_FvT0_T1_ET2_T3_ENKUlPNS_11ProcessBaseEE_clESL_ > @ 0x4c470c > _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchIN5mesos8internal22CommandExecutorProcessEPNS5_14ExecutorDriverERKNS5_8TaskInfoES9_SA_EEvRKNS0_3PIDIT_EEMSE_FvT0_T1_ET2_T3_EUlS2_E_E9_M_invokeERKSt9_Any_dataS2_ > @ 0x7f9f5a761b1b std::function<>::operator()() > @ 0x7f9f5a749935 process::ProcessBase::visit() > @ 0x7f9f5a74d700 process::DispatchEvent::visit() > @ 0x48e004 process::ProcessBase::serve() > @ 0x7f9f5a745d21 process::ProcessManager::resume() > @ 0x7f9f5a742f52 > _ZZN7process14ProcessManager12init_threadsEvENKUlRKSt11atomic_boolE_clES3_ > @ 0x7f9f5a74cf2c > _ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEE6__callIvIEILm0T_OSt5tupleIIDpT0_EESt12_Index_tupleIIXspT1_EEE > @ 0x7f9f5a74cedc > _ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEEclIIEvEET0_DpOT_ > @ 0x7f9f5a74ce6e > _ZNSt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS4_EEEvEE9_M_invokeIIEEEvSt12_Index_tupleIIXspT_EEE > @ 0x7f9f5a74cdc5 > _ZNSt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS4_EEEvEEclEv > @ 0x7f9f5a74cd5e > _ZNSt6thread5_ImplISt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS6_EEEvEEE6_M_runEv > @ 0x7f9f5624f1e0 (unknown) > @ 0x7f9f564a8df5 start_thread > @ 0x7f9f559b71ad __clone > I1107 19:36:46.551370 30656 containerizer.cpp:1257] Executor for container > '6553a617-6b4a-418d-9759-5681f45ff854' has exited > I1107 19:36:46.551429 30656 containerizer.cpp:1074] Destroying container > '6553a617-6b4a-418d-9759-5681f45ff854' > I1107 19:36:46.553869 30656 containerizer.cpp:1257] Executor for container > 'd2c1f924-c92a-453e-82b1-c294d09c4873' has exited > {code} > The reason seems to be a race between the executor receiving a > {{RunTaskMessage}} before {{ExecutorRegisteredMessage}} leading to the > {{CHECK_SOME(executorInfo)}} failure. > Link to complete log:
[jira] [Commented] (MESOS-3851) Investigate recent crashes in Command Executor
[ https://issues.apache.org/jira/browse/MESOS-3851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14998541#comment-14998541 ] Bernd Mathiske commented on MESOS-3851: --- Interesting. Thanks for spotting this! > Investigate recent crashes in Command Executor > -- > > Key: MESOS-3851 > URL: https://issues.apache.org/jira/browse/MESOS-3851 > Project: Mesos > Issue Type: Bug > Components: containerization >Reporter: Anand Mazumdar >Priority: Blocker > Labels: mesosphere > > Post https://reviews.apache.org/r/38900 i.e. updating CommandExecutor to > support rootfs. There seem to be some tests showing frequent crashes due to > assert violations. > {{FetcherCacheTest.SimpleEviction}} failed due to the following log: > {code} > I1107 19:36:46.360908 30657 slave.cpp:1793] Sending queued task '3' to > executor ''3' of framework 7d94c7fb-8950-4bcf-80c1-46112292dcd6- at > executor(1)@172.17.5.200:33871' > I1107 19:36:46.363682 1236 exec.cpp:297] > I1107 19:36:46.373569 1245 exec.cpp:210] Executor registered on slave > 7d94c7fb-8950-4bcf-80c1-46112292dcd6-S0 > @ 0x7f9f5a7db3fa google::LogMessage::Fail() > I1107 19:36:46.394081 1245 exec.cpp:222] Executor::registered took 395411ns > @ 0x7f9f5a7db359 google::LogMessage::SendToLog() > @ 0x7f9f5a7dad6a google::LogMessage::Flush() > @ 0x7f9f5a7dda9e google::LogMessageFatal::~LogMessageFatal() > @ 0x48d00a _CheckFatal::~_CheckFatal() > @ 0x49c99d > mesos::internal::CommandExecutorProcess::launchTask() > @ 0x4b3dd7 > _ZZN7process8dispatchIN5mesos8internal22CommandExecutorProcessEPNS1_14ExecutorDriverERKNS1_8TaskInfoES5_S6_EEvRKNS_3PIDIT_EEMSA_FvT0_T1_ET2_T3_ENKUlPNS_11ProcessBaseEE_clESL_ > @ 0x4c470c > _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchIN5mesos8internal22CommandExecutorProcessEPNS5_14ExecutorDriverERKNS5_8TaskInfoES9_SA_EEvRKNS0_3PIDIT_EEMSE_FvT0_T1_ET2_T3_EUlS2_E_E9_M_invokeERKSt9_Any_dataS2_ > @ 0x7f9f5a761b1b std::function<>::operator()() > @ 0x7f9f5a749935 process::ProcessBase::visit() > @ 0x7f9f5a74d700 process::DispatchEvent::visit() > @ 0x48e004 process::ProcessBase::serve() > @ 0x7f9f5a745d21 process::ProcessManager::resume() > @ 0x7f9f5a742f52 > _ZZN7process14ProcessManager12init_threadsEvENKUlRKSt11atomic_boolE_clES3_ > @ 0x7f9f5a74cf2c > _ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEE6__callIvIEILm0T_OSt5tupleIIDpT0_EESt12_Index_tupleIIXspT1_EEE > @ 0x7f9f5a74cedc > _ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEEclIIEvEET0_DpOT_ > @ 0x7f9f5a74ce6e > _ZNSt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS4_EEEvEE9_M_invokeIIEEEvSt12_Index_tupleIIXspT_EEE > @ 0x7f9f5a74cdc5 > _ZNSt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS4_EEEvEEclEv > @ 0x7f9f5a74cd5e > _ZNSt6thread5_ImplISt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS6_EEEvEEE6_M_runEv > @ 0x7f9f5624f1e0 (unknown) > @ 0x7f9f564a8df5 start_thread > @ 0x7f9f559b71ad __clone > I1107 19:36:46.551370 30656 containerizer.cpp:1257] Executor for container > '6553a617-6b4a-418d-9759-5681f45ff854' has exited > I1107 19:36:46.551429 30656 containerizer.cpp:1074] Destroying container > '6553a617-6b4a-418d-9759-5681f45ff854' > I1107 19:36:46.553869 30656 containerizer.cpp:1257] Executor for container > 'd2c1f924-c92a-453e-82b1-c294d09c4873' has exited > {code} > The reason seems to be a race between the executor receiving a > {{RunTaskMessage}} before {{ExecutorRegisteredMessage}} leading to the > {{CHECK_SOME(executorInfo)}} failure. > Link to complete log: > https://issues.apache.org/jira/browse/MESOS-2831?focusedCommentId=14995535=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14995535 > Another related failure from {{ExamplesTest.PersistentVolumeFramework}} > {code} > @ 0x7f4f71529cbd google::LogMessage::SendToLog() > I1107 13:15:09.949987 31573 slave.cpp:2337] Status update manager > successfully handled status update acknowledgement (UUID: > 721c7316-5580-4636-a83a-098e3bd4ed1f) for task > ad90531f-d3d8-43f6-96f2-c81c4548a12d of framework > ac4ea54a-7d19-4e41-9ee3-1a761f8e5b0f- > @ 0x7f4f715296ce google::LogMessage::Flush() > @ 0x7f4f7152c402 google::LogMessageFatal::~LogMessageFatal() > @ 0x48d00a _CheckFatal::~_CheckFatal() > @
[jira] [Commented] (MESOS-3851) Investigate recent crashes in Command Executor
[ https://issues.apache.org/jira/browse/MESOS-3851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14997382#comment-14997382 ] Timothy Chen commented on MESOS-3851: - Although it's odd why we would ever run into this problem, technically the changes as part of my patch doesn't need the CHECK and executorInfo anymore as it's a residue left after I changed my implemntation for the rootfs. > Investigate recent crashes in Command Executor > -- > > Key: MESOS-3851 > URL: https://issues.apache.org/jira/browse/MESOS-3851 > Project: Mesos > Issue Type: Bug > Components: containerization >Reporter: Anand Mazumdar >Priority: Blocker > Labels: mesosphere > > Post https://reviews.apache.org/r/38900 i.e. updating CommandExecutor to > support rootfs. There seem to be some tests showing frequent crashes due to > assert violations. > {{FetcherCacheTest.SimpleEviction}} failed due to the following log: > {code} > I1107 19:36:46.360908 30657 slave.cpp:1793] Sending queued task '3' to > executor ''3' of framework 7d94c7fb-8950-4bcf-80c1-46112292dcd6- at > executor(1)@172.17.5.200:33871' > I1107 19:36:46.363682 1236 exec.cpp:297] > I1107 19:36:46.373569 1245 exec.cpp:210] Executor registered on slave > 7d94c7fb-8950-4bcf-80c1-46112292dcd6-S0 > @ 0x7f9f5a7db3fa google::LogMessage::Fail() > I1107 19:36:46.394081 1245 exec.cpp:222] Executor::registered took 395411ns > @ 0x7f9f5a7db359 google::LogMessage::SendToLog() > @ 0x7f9f5a7dad6a google::LogMessage::Flush() > @ 0x7f9f5a7dda9e google::LogMessageFatal::~LogMessageFatal() > @ 0x48d00a _CheckFatal::~_CheckFatal() > @ 0x49c99d > mesos::internal::CommandExecutorProcess::launchTask() > @ 0x4b3dd7 > _ZZN7process8dispatchIN5mesos8internal22CommandExecutorProcessEPNS1_14ExecutorDriverERKNS1_8TaskInfoES5_S6_EEvRKNS_3PIDIT_EEMSA_FvT0_T1_ET2_T3_ENKUlPNS_11ProcessBaseEE_clESL_ > @ 0x4c470c > _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchIN5mesos8internal22CommandExecutorProcessEPNS5_14ExecutorDriverERKNS5_8TaskInfoES9_SA_EEvRKNS0_3PIDIT_EEMSE_FvT0_T1_ET2_T3_EUlS2_E_E9_M_invokeERKSt9_Any_dataS2_ > @ 0x7f9f5a761b1b std::function<>::operator()() > @ 0x7f9f5a749935 process::ProcessBase::visit() > @ 0x7f9f5a74d700 process::DispatchEvent::visit() > @ 0x48e004 process::ProcessBase::serve() > @ 0x7f9f5a745d21 process::ProcessManager::resume() > @ 0x7f9f5a742f52 > _ZZN7process14ProcessManager12init_threadsEvENKUlRKSt11atomic_boolE_clES3_ > @ 0x7f9f5a74cf2c > _ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEE6__callIvIEILm0T_OSt5tupleIIDpT0_EESt12_Index_tupleIIXspT1_EEE > @ 0x7f9f5a74cedc > _ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEEclIIEvEET0_DpOT_ > @ 0x7f9f5a74ce6e > _ZNSt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS4_EEEvEE9_M_invokeIIEEEvSt12_Index_tupleIIXspT_EEE > @ 0x7f9f5a74cdc5 > _ZNSt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS4_EEEvEEclEv > @ 0x7f9f5a74cd5e > _ZNSt6thread5_ImplISt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS6_EEEvEEE6_M_runEv > @ 0x7f9f5624f1e0 (unknown) > @ 0x7f9f564a8df5 start_thread > @ 0x7f9f559b71ad __clone > I1107 19:36:46.551370 30656 containerizer.cpp:1257] Executor for container > '6553a617-6b4a-418d-9759-5681f45ff854' has exited > I1107 19:36:46.551429 30656 containerizer.cpp:1074] Destroying container > '6553a617-6b4a-418d-9759-5681f45ff854' > I1107 19:36:46.553869 30656 containerizer.cpp:1257] Executor for container > 'd2c1f924-c92a-453e-82b1-c294d09c4873' has exited > {code} > The reason seems to be a race between the executor receiving a > {{RunTaskMessage}} before {{ExecutorRegisteredMessage}} leading to the > {{CHECK_SOME(executorInfo)}} failure. > Link to complete log: > https://issues.apache.org/jira/browse/MESOS-2831?focusedCommentId=14995535=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14995535 > Another related failure from {{ExamplesTest.PersistentVolumeFramework}} > {code} > @ 0x7f4f71529cbd google::LogMessage::SendToLog() > I1107 13:15:09.949987 31573 slave.cpp:2337] Status update manager > successfully handled status update acknowledgement (UUID: > 721c7316-5580-4636-a83a-098e3bd4ed1f) for task > ad90531f-d3d8-43f6-96f2-c81c4548a12d of framework > ac4ea54a-7d19-4e41-9ee3-1a761f8e5b0f- > @
[jira] [Commented] (MESOS-3851) Investigate recent crashes in Command Executor
[ https://issues.apache.org/jira/browse/MESOS-3851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14997564#comment-14997564 ] Vinod Kone commented on MESOS-3851: --- {code} The reason seems to be a race between the executor receiving a RunTaskMessage before ExecutorRegisteredMessage leading to the CHECK_SOME(executorInfo) failure. {code} This seems to be really bad if it's the case. This not only affects the command executor but likely all other custom executors as well. [~anandmazumdar] [~bmahler] have you guys had more chance to dig into this? > Investigate recent crashes in Command Executor > -- > > Key: MESOS-3851 > URL: https://issues.apache.org/jira/browse/MESOS-3851 > Project: Mesos > Issue Type: Bug > Components: containerization >Reporter: Anand Mazumdar >Priority: Blocker > Labels: mesosphere > > Post https://reviews.apache.org/r/38900 i.e. updating CommandExecutor to > support rootfs. There seem to be some tests showing frequent crashes due to > assert violations. > {{FetcherCacheTest.SimpleEviction}} failed due to the following log: > {code} > I1107 19:36:46.360908 30657 slave.cpp:1793] Sending queued task '3' to > executor ''3' of framework 7d94c7fb-8950-4bcf-80c1-46112292dcd6- at > executor(1)@172.17.5.200:33871' > I1107 19:36:46.363682 1236 exec.cpp:297] > I1107 19:36:46.373569 1245 exec.cpp:210] Executor registered on slave > 7d94c7fb-8950-4bcf-80c1-46112292dcd6-S0 > @ 0x7f9f5a7db3fa google::LogMessage::Fail() > I1107 19:36:46.394081 1245 exec.cpp:222] Executor::registered took 395411ns > @ 0x7f9f5a7db359 google::LogMessage::SendToLog() > @ 0x7f9f5a7dad6a google::LogMessage::Flush() > @ 0x7f9f5a7dda9e google::LogMessageFatal::~LogMessageFatal() > @ 0x48d00a _CheckFatal::~_CheckFatal() > @ 0x49c99d > mesos::internal::CommandExecutorProcess::launchTask() > @ 0x4b3dd7 > _ZZN7process8dispatchIN5mesos8internal22CommandExecutorProcessEPNS1_14ExecutorDriverERKNS1_8TaskInfoES5_S6_EEvRKNS_3PIDIT_EEMSA_FvT0_T1_ET2_T3_ENKUlPNS_11ProcessBaseEE_clESL_ > @ 0x4c470c > _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchIN5mesos8internal22CommandExecutorProcessEPNS5_14ExecutorDriverERKNS5_8TaskInfoES9_SA_EEvRKNS0_3PIDIT_EEMSE_FvT0_T1_ET2_T3_EUlS2_E_E9_M_invokeERKSt9_Any_dataS2_ > @ 0x7f9f5a761b1b std::function<>::operator()() > @ 0x7f9f5a749935 process::ProcessBase::visit() > @ 0x7f9f5a74d700 process::DispatchEvent::visit() > @ 0x48e004 process::ProcessBase::serve() > @ 0x7f9f5a745d21 process::ProcessManager::resume() > @ 0x7f9f5a742f52 > _ZZN7process14ProcessManager12init_threadsEvENKUlRKSt11atomic_boolE_clES3_ > @ 0x7f9f5a74cf2c > _ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEE6__callIvIEILm0T_OSt5tupleIIDpT0_EESt12_Index_tupleIIXspT1_EEE > @ 0x7f9f5a74cedc > _ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEEclIIEvEET0_DpOT_ > @ 0x7f9f5a74ce6e > _ZNSt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS4_EEEvEE9_M_invokeIIEEEvSt12_Index_tupleIIXspT_EEE > @ 0x7f9f5a74cdc5 > _ZNSt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS4_EEEvEEclEv > @ 0x7f9f5a74cd5e > _ZNSt6thread5_ImplISt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS6_EEEvEEE6_M_runEv > @ 0x7f9f5624f1e0 (unknown) > @ 0x7f9f564a8df5 start_thread > @ 0x7f9f559b71ad __clone > I1107 19:36:46.551370 30656 containerizer.cpp:1257] Executor for container > '6553a617-6b4a-418d-9759-5681f45ff854' has exited > I1107 19:36:46.551429 30656 containerizer.cpp:1074] Destroying container > '6553a617-6b4a-418d-9759-5681f45ff854' > I1107 19:36:46.553869 30656 containerizer.cpp:1257] Executor for container > 'd2c1f924-c92a-453e-82b1-c294d09c4873' has exited > {code} > The reason seems to be a race between the executor receiving a > {{RunTaskMessage}} before {{ExecutorRegisteredMessage}} leading to the > {{CHECK_SOME(executorInfo)}} failure. > Link to complete log: > https://issues.apache.org/jira/browse/MESOS-2831?focusedCommentId=14995535=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14995535 > Another related failure from {{ExamplesTest.PersistentVolumeFramework}} > {code} > @ 0x7f4f71529cbd google::LogMessage::SendToLog() > I1107 13:15:09.949987 31573 slave.cpp:2337] Status update manager > successfully handled status update acknowledgement (UUID:
[jira] [Commented] (MESOS-3851) Investigate recent crashes in Command Executor
[ https://issues.apache.org/jira/browse/MESOS-3851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14997759#comment-14997759 ] Joseph Wu commented on MESOS-3851: -- Here's another one ({{<...>}} indicates stuff I truncated): {code} [ RUN ] SlaveTest.HTTPSchedulerSlaveRestart <...> I1110 00:36:30.616757 5169 exec.cpp:297] Executor asked to run task 'a58d3992-429f-4e26-b60c-72ddcce3b804' I1110 00:36:30.616987 5169 exec.cpp:306] Executor::launchTask took 160701ns F1110 00:36:30.617060 5166 executor.cpp:184] CHECK_SOME(executorInfo): is NONE Starting task a58d3992-429f-4e26-b60c-72ddcce3b804 *** Check failure stack trace: *** I1110 00:36:30.618422 5163 exec.cpp:210] Executor registered on slave cfa4cfcf-5fce-4ce7-89b8-c466ef8c6bc3-S0 @ 0x7f917a1b911e google::LogMessage::Fail() I1110 00:36:30.621285 5163 exec.cpp:222] Executor::registered took 399555ns @ 0x7f917a1b907d google::LogMessage::SendToLog() @ 0x7f917a1b8a8e google::LogMessage::Flush() @ 0x7f917a1bb7c2 google::LogMessageFatal::~LogMessageFatal() @ 0x48d14a _CheckFatal::~_CheckFatal() @ 0x49cae7 mesos::internal::CommandExecutorProcess::launchTask() @ 0x4b43d9 _ZZN7process8dispatchIN5mesos8internal22CommandExecutorProcessEPNS1_14ExecutorDriverERKNS1_8TaskInfoES5_S6_EEvRKNS_3PIDIT_EEMSA_FvT0_T1_ET2_T3_ENKUlPNS_11ProcessBaseEE_clESL_ @ 0x4c4d59 _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchIN5mesos8internal22CommandExecutorProcessEPNS5_14ExecutorDriverERKNS5_8TaskInfoES9_SA_EEvRKNS0_3PIDIT_EEMSE_FvT0_T1_ET2_T3_EUlS2_E_E9_M_invokeERKSt9_Any_dataS2_ @ 0x7f917a13f83f std::function<>::operator()() @ 0x7f917a127659 process::ProcessBase::visit() @ 0x7f917a12b424 process::DispatchEvent::visit() @ 0x48e144 process::ProcessBase::serve() @ 0x7f917a123a45 process::ProcessManager::resume() @ 0x7f917a120c76 _ZZN7process14ProcessManager12init_threadsEvENKUlRKSt11atomic_boolE_clES3_ @ 0x7f917a12ac50 _ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEE6__callIvIEILm0T_OSt5tupleIIDpT0_EESt12_Index_tupleIIXspT1_EEE @ 0x7f917a12ac00 _ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEEclIIEvEET0_DpOT_ @ 0x7f917a12ab92 _ZNSt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS4_EEEvEE9_M_invokeIIEEEvSt12_Index_tupleIIXspT_EEE @ 0x7f917a12aae9 _ZNSt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS4_EEEvEEclEv @ 0x7f917a12aa82 _ZNSt6thread5_ImplISt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS6_EEEvEEE6_M_runEv @ 0x7f9175c261e0 (unknown) @ 0x7f9175e7fdf5 start_thread @ 0x7f917538e1ad __clone <...> ../../src/tests/slave_tests.cpp:2829: Failure Value of: status.get().state() Actual: TASK_FAILED <...> Expected: TASK_RUNNING <...> ../../src/tests/slave_tests.cpp:2855: Failure Failed to wait 15secs for updateFrameworkMessage <...> ../../3rdparty/libprocess/include/process/gmock.hpp:365: Failure Actual function call count doesn't match EXPECT_CALL(filter->mock, filter(testing::A()))... Expected args: message matcher (8-byte object <18-02 04-A0 78-7F 00-00>, 1-byte object <02>, 1-byte object <18>) Expected: to be called once Actual: never called - unsatisfied and active [ FAILED ] SlaveTest.HTTPSchedulerSlaveRestart (15688 ms) {code} Full logs: https://builds.apache.org/job/Mesos/COMPILER=gcc,CONFIGURATION=--verbose,OS=centos:7,label_exp=docker||Hadoop/1206/consoleFull > Investigate recent crashes in Command Executor > -- > > Key: MESOS-3851 > URL: https://issues.apache.org/jira/browse/MESOS-3851 > Project: Mesos > Issue Type: Bug > Components: containerization >Reporter: Anand Mazumdar >Priority: Blocker > Labels: mesosphere > > Post https://reviews.apache.org/r/38900 i.e. updating CommandExecutor to > support rootfs. There seem to be some tests showing frequent crashes due to > assert violations. > {{FetcherCacheTest.SimpleEviction}} failed due to the following log: > {code} > I1107 19:36:46.360908 30657 slave.cpp:1793] Sending queued task '3' to > executor ''3' of framework 7d94c7fb-8950-4bcf-80c1-46112292dcd6- at > executor(1)@172.17.5.200:33871' > I1107 19:36:46.363682 1236 exec.cpp:297] > I1107 19:36:46.373569 1245 exec.cpp:210] Executor registered on slave > 7d94c7fb-8950-4bcf-80c1-46112292dcd6-S0 > @ 0x7f9f5a7db3fa google::LogMessage::Fail() > I1107
[jira] [Commented] (MESOS-3851) Investigate recent crashes in Command Executor
[ https://issues.apache.org/jira/browse/MESOS-3851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14997697#comment-14997697 ] Marco Massenzio commented on MESOS-3851: FYI [~anandmazumdar] is on holiday (and in India) this week, so he won't probably be able to look into this. [~bmahler] is traveling to Europe, so may not be around for a day or two (not sure about travel plans). If this is a blocker issue for 0.26 - who can take this on urgently? I saw the same (flaky) failures on the `FetcherCache` tests, so I'm wondering whether [~bernd-mesos] could look into this? > Investigate recent crashes in Command Executor > -- > > Key: MESOS-3851 > URL: https://issues.apache.org/jira/browse/MESOS-3851 > Project: Mesos > Issue Type: Bug > Components: containerization >Reporter: Anand Mazumdar >Priority: Blocker > Labels: mesosphere > > Post https://reviews.apache.org/r/38900 i.e. updating CommandExecutor to > support rootfs. There seem to be some tests showing frequent crashes due to > assert violations. > {{FetcherCacheTest.SimpleEviction}} failed due to the following log: > {code} > I1107 19:36:46.360908 30657 slave.cpp:1793] Sending queued task '3' to > executor ''3' of framework 7d94c7fb-8950-4bcf-80c1-46112292dcd6- at > executor(1)@172.17.5.200:33871' > I1107 19:36:46.363682 1236 exec.cpp:297] > I1107 19:36:46.373569 1245 exec.cpp:210] Executor registered on slave > 7d94c7fb-8950-4bcf-80c1-46112292dcd6-S0 > @ 0x7f9f5a7db3fa google::LogMessage::Fail() > I1107 19:36:46.394081 1245 exec.cpp:222] Executor::registered took 395411ns > @ 0x7f9f5a7db359 google::LogMessage::SendToLog() > @ 0x7f9f5a7dad6a google::LogMessage::Flush() > @ 0x7f9f5a7dda9e google::LogMessageFatal::~LogMessageFatal() > @ 0x48d00a _CheckFatal::~_CheckFatal() > @ 0x49c99d > mesos::internal::CommandExecutorProcess::launchTask() > @ 0x4b3dd7 > _ZZN7process8dispatchIN5mesos8internal22CommandExecutorProcessEPNS1_14ExecutorDriverERKNS1_8TaskInfoES5_S6_EEvRKNS_3PIDIT_EEMSA_FvT0_T1_ET2_T3_ENKUlPNS_11ProcessBaseEE_clESL_ > @ 0x4c470c > _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchIN5mesos8internal22CommandExecutorProcessEPNS5_14ExecutorDriverERKNS5_8TaskInfoES9_SA_EEvRKNS0_3PIDIT_EEMSE_FvT0_T1_ET2_T3_EUlS2_E_E9_M_invokeERKSt9_Any_dataS2_ > @ 0x7f9f5a761b1b std::function<>::operator()() > @ 0x7f9f5a749935 process::ProcessBase::visit() > @ 0x7f9f5a74d700 process::DispatchEvent::visit() > @ 0x48e004 process::ProcessBase::serve() > @ 0x7f9f5a745d21 process::ProcessManager::resume() > @ 0x7f9f5a742f52 > _ZZN7process14ProcessManager12init_threadsEvENKUlRKSt11atomic_boolE_clES3_ > @ 0x7f9f5a74cf2c > _ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEE6__callIvIEILm0T_OSt5tupleIIDpT0_EESt12_Index_tupleIIXspT1_EEE > @ 0x7f9f5a74cedc > _ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEEclIIEvEET0_DpOT_ > @ 0x7f9f5a74ce6e > _ZNSt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS4_EEEvEE9_M_invokeIIEEEvSt12_Index_tupleIIXspT_EEE > @ 0x7f9f5a74cdc5 > _ZNSt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS4_EEEvEEclEv > @ 0x7f9f5a74cd5e > _ZNSt6thread5_ImplISt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS6_EEEvEEE6_M_runEv > @ 0x7f9f5624f1e0 (unknown) > @ 0x7f9f564a8df5 start_thread > @ 0x7f9f559b71ad __clone > I1107 19:36:46.551370 30656 containerizer.cpp:1257] Executor for container > '6553a617-6b4a-418d-9759-5681f45ff854' has exited > I1107 19:36:46.551429 30656 containerizer.cpp:1074] Destroying container > '6553a617-6b4a-418d-9759-5681f45ff854' > I1107 19:36:46.553869 30656 containerizer.cpp:1257] Executor for container > 'd2c1f924-c92a-453e-82b1-c294d09c4873' has exited > {code} > The reason seems to be a race between the executor receiving a > {{RunTaskMessage}} before {{ExecutorRegisteredMessage}} leading to the > {{CHECK_SOME(executorInfo)}} failure. > Link to complete log: > https://issues.apache.org/jira/browse/MESOS-2831?focusedCommentId=14995535=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14995535 > Another related failure from {{ExamplesTest.PersistentVolumeFramework}} > {code} > @ 0x7f4f71529cbd google::LogMessage::SendToLog() > I1107 13:15:09.949987 31573 slave.cpp:2337] Status update manager > successfully handled status
[jira] [Commented] (MESOS-3851) Investigate recent crashes in Command Executor
[ https://issues.apache.org/jira/browse/MESOS-3851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14997519#comment-14997519 ] Jie Yu commented on MESOS-3851: --- I don't understand the part that the executor can receive RunTaskMessage before ExecutorRegisteredMessage. Do you know why? > Investigate recent crashes in Command Executor > -- > > Key: MESOS-3851 > URL: https://issues.apache.org/jira/browse/MESOS-3851 > Project: Mesos > Issue Type: Bug > Components: containerization >Reporter: Anand Mazumdar >Priority: Blocker > Labels: mesosphere > > Post https://reviews.apache.org/r/38900 i.e. updating CommandExecutor to > support rootfs. There seem to be some tests showing frequent crashes due to > assert violations. > {{FetcherCacheTest.SimpleEviction}} failed due to the following log: > {code} > I1107 19:36:46.360908 30657 slave.cpp:1793] Sending queued task '3' to > executor ''3' of framework 7d94c7fb-8950-4bcf-80c1-46112292dcd6- at > executor(1)@172.17.5.200:33871' > I1107 19:36:46.363682 1236 exec.cpp:297] > I1107 19:36:46.373569 1245 exec.cpp:210] Executor registered on slave > 7d94c7fb-8950-4bcf-80c1-46112292dcd6-S0 > @ 0x7f9f5a7db3fa google::LogMessage::Fail() > I1107 19:36:46.394081 1245 exec.cpp:222] Executor::registered took 395411ns > @ 0x7f9f5a7db359 google::LogMessage::SendToLog() > @ 0x7f9f5a7dad6a google::LogMessage::Flush() > @ 0x7f9f5a7dda9e google::LogMessageFatal::~LogMessageFatal() > @ 0x48d00a _CheckFatal::~_CheckFatal() > @ 0x49c99d > mesos::internal::CommandExecutorProcess::launchTask() > @ 0x4b3dd7 > _ZZN7process8dispatchIN5mesos8internal22CommandExecutorProcessEPNS1_14ExecutorDriverERKNS1_8TaskInfoES5_S6_EEvRKNS_3PIDIT_EEMSA_FvT0_T1_ET2_T3_ENKUlPNS_11ProcessBaseEE_clESL_ > @ 0x4c470c > _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchIN5mesos8internal22CommandExecutorProcessEPNS5_14ExecutorDriverERKNS5_8TaskInfoES9_SA_EEvRKNS0_3PIDIT_EEMSE_FvT0_T1_ET2_T3_EUlS2_E_E9_M_invokeERKSt9_Any_dataS2_ > @ 0x7f9f5a761b1b std::function<>::operator()() > @ 0x7f9f5a749935 process::ProcessBase::visit() > @ 0x7f9f5a74d700 process::DispatchEvent::visit() > @ 0x48e004 process::ProcessBase::serve() > @ 0x7f9f5a745d21 process::ProcessManager::resume() > @ 0x7f9f5a742f52 > _ZZN7process14ProcessManager12init_threadsEvENKUlRKSt11atomic_boolE_clES3_ > @ 0x7f9f5a74cf2c > _ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEE6__callIvIEILm0T_OSt5tupleIIDpT0_EESt12_Index_tupleIIXspT1_EEE > @ 0x7f9f5a74cedc > _ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEEclIIEvEET0_DpOT_ > @ 0x7f9f5a74ce6e > _ZNSt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS4_EEEvEE9_M_invokeIIEEEvSt12_Index_tupleIIXspT_EEE > @ 0x7f9f5a74cdc5 > _ZNSt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS4_EEEvEEclEv > @ 0x7f9f5a74cd5e > _ZNSt6thread5_ImplISt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS6_EEEvEEE6_M_runEv > @ 0x7f9f5624f1e0 (unknown) > @ 0x7f9f564a8df5 start_thread > @ 0x7f9f559b71ad __clone > I1107 19:36:46.551370 30656 containerizer.cpp:1257] Executor for container > '6553a617-6b4a-418d-9759-5681f45ff854' has exited > I1107 19:36:46.551429 30656 containerizer.cpp:1074] Destroying container > '6553a617-6b4a-418d-9759-5681f45ff854' > I1107 19:36:46.553869 30656 containerizer.cpp:1257] Executor for container > 'd2c1f924-c92a-453e-82b1-c294d09c4873' has exited > {code} > The reason seems to be a race between the executor receiving a > {{RunTaskMessage}} before {{ExecutorRegisteredMessage}} leading to the > {{CHECK_SOME(executorInfo)}} failure. > Link to complete log: > https://issues.apache.org/jira/browse/MESOS-2831?focusedCommentId=14995535=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14995535 > Another related failure from {{ExamplesTest.PersistentVolumeFramework}} > {code} > @ 0x7f4f71529cbd google::LogMessage::SendToLog() > I1107 13:15:09.949987 31573 slave.cpp:2337] Status update manager > successfully handled status update acknowledgement (UUID: > 721c7316-5580-4636-a83a-098e3bd4ed1f) for task > ad90531f-d3d8-43f6-96f2-c81c4548a12d of framework > ac4ea54a-7d19-4e41-9ee3-1a761f8e5b0f- > @ 0x7f4f715296ce google::LogMessage::Flush() > @ 0x7f4f7152c402