[jira] [Commented] (MESOS-10116) Attempt to reactivate disconnected agent crashes the master

2020-05-07 Thread Andrei Sekretenko (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17101542#comment-17101542
 ] 

Andrei Sekretenko commented on MESOS-10116:
---

master:
{noformat}
commit a32513a1fc6a149b30f04721f866e3cbb6003661
Author: Andrei Sekretenko 
Date:   Tue Apr 14 18:55:59 2020 +0200

Added test for reactivation of a disconnected drained agent.

Review: https://reviews.apache.org/r/72364
{noformat}

1.9.x:
{noformat}
commit b3b6dbb27a93a9ace4e4d2d1e83b16ea92f1a8e1
Author: Andrei Sekretenko 
Date:   Tue Apr 14 18:55:59 2020 +0200

Added test for reactivation of a disconnected drained agent.

Review: https://reviews.apache.org/r/72364
{noformat}

> Attempt to reactivate disconnected agent crashes the master
> ---
>
> Key: MESOS-10116
> URL: https://issues.apache.org/jira/browse/MESOS-10116
> Project: Mesos
> Issue Type: Bug
> Components: master
> Affects Versions: 1.9.0, 1.10.0
> Reporter: Andrei Sekretenko
> Assignee: Andrei Sekretenko
> Priority: Critical
>
> Observed the following scenario on a production cluster:
>  - operator performs agent draining
>  - draining completes, operator disconnects the agent
>  - operator reactivates agent via REACTIVATE_AGENT call
>  - *master issues an offer for a reactivated disconnected agent*
>  - a framework issues ACCEPT call with this offer
>  - master crashes with the following stack trace:
> {noformat}
> F0311 09:06:18.852365 11289 validation.cpp:2123] Check failed: 
> slave->connected Offer 4067082c-ec7a-4efc-ac2d-c6e7cbc77356-O13981526 
> outlived disconnected agent 968ea9b2-374d-45cb-b5b3-c4ffb45a4a78-S0 at 
> slave(1)@10.50.7.59:5051 (10.50.7.59)
> *** Check failure stack trace: ***
> @ 0x7feac6a1dc6d google::LogMessage::Fail()
> @ 0x7feac6a1fec8 google::LogMessage::SendToLog()
> @ 0x7feac6a1d803 google::LogMessage::Flush()
> @ 0x7feac6a20809 google::LogMessageFatal::~LogMessageFatal()
> @ 0x7feac57cdea0 mesos::internal::master::validation::offer::validateSlave()
> @ 0x7feac57d09c1 std::_Function_handler<>::_M_invoke()
> @ 0x7feac57d0fd1 std::function<>::operator()()
> @ 0x7feac57cea3c mesos::internal::master::validation::offer::validate()
> @ 0x7feac56d5565 mesos::internal::master::Master::accept()
> @ 0x7feac56468f0 mesos::internal::master::Master::Http::scheduler()
> @ 0x7feac5689797 
> _ZNSt17_Function_handlerIFN7process6FutureINS0_4http8ResponseEEERKNS2_7RequestERK6OptionINS2_14authentication9PrincipalEEEZN5mesos8internal6master6Master10initializeEvEUlS7_SD_E1_E9_M_invokeERKSt9_Any_dataS7_SD_
> @ 0x7feac697038c 
> _ZNO6lambda12CallableOnceIFN7process6FutureINS1_4http8ResponseEEEvEE10CallableFnINS_8internal7PartialIZZNS1_11ProcessBase8_consumeERKNSB_12HttpEndpointERKSsRKNS1_5OwnedINS3_7RequestNKUlRK6OptionINS3_14authentication20AuthenticationResultEEE0_clESR_EUlbE0_IbclEv
> @ 0x7feac53f30e7 
> _ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEE10CallableFnINS_8internal7PartialIZNS1_8internal8DispatchINS1_6FutureINS1_4http8ResponseclINS0_IFSE_vESE_RKNS1_4UPIDEOT_EUlSt10unique_ptrINS1_7PromiseISD_EESt14default_deleteISQ_EEOSI_S3_E_IST_SI_St12_PlaceholderILi1EEclEOS3_
> @ 0x7feac6966561 process::ProcessBase::consume()
> @ 0x7feac697db5b process::ProcessManager::resume()
> @ 0x7feac69837f6 
> _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv
> @ 0x7feac262f070 (unknown)
> @ 0x7feac1e4de65 start_thread
> @ 0x7feac1b7688d __clone
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (MESOS-10116) Attempt to reactivate disconnected agent crashes the master

2020-05-04 Thread Andrei Sekretenko (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17099064#comment-17099064
 ] 

Andrei Sekretenko commented on MESOS-10116:
---

1.9 backport:
{noformat}
commit 70291edf09f5b35af2b5389024de84b550ccacf3
Author: Andrei Sekretenko 
Date:   Tue Apr 14 20:05:11 2020 +0200

Fixed handling disconnected agents by REACTIVATE_AGENT call.

This patch fixes MESOS-10116 by preventing REACTIVATE_AGENT from
activating disconnected agents in the allocator and also fixes the
handling of agents that were removed while the reactivation was being
stored into the registry.

Review: https://reviews.apache.org/r/72363
{noformat}
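
The behaviour described in the commit message can be illustrated with a small toy model. This is only a sketch under stated assumptions, not the actual Mesos code: the `Agent`, `Allocator` and `Master` types and all of their member functions below are hypothetical stand-ins. It shows reactivation clearing the deactivated flag without activating a disconnected agent in the allocator, and reactivation of an agent that has been removed in the meantime being reported as a failure rather than acted upon.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

// Hypothetical per-agent state; not the real Mesos Slave struct.
struct Agent {
  bool connected = true;
  bool deactivated = false;
};

// Hypothetical allocator: only agents in `active` can receive offers.
struct Allocator {
  std::set<std::string> active;
  void activate(const std::string& id) { active.insert(id); }
  void deactivate(const std::string& id) { active.erase(id); }
  bool offers(const std::string& id) const { return active.count(id) > 0; }
};

// Hypothetical master holding the agents and the allocator.
struct Master {
  std::map<std::string, Agent> agents;
  Allocator allocator;

  void addAgent(const std::string& id) {
    agents[id] = Agent{};
    allocator.activate(id);
  }

  void deactivate(const std::string& id) {
    agents.at(id).deactivated = true;
    allocator.deactivate(id);
  }

  void disconnect(const std::string& id) {
    agents.at(id).connected = false;
    allocator.deactivate(id);
  }

  void reconnect(const std::string& id) {
    Agent& agent = agents.at(id);
    agent.connected = true;
    // A reconnecting agent gets offers again only if it is not deactivated.
    if (!agent.deactivated) allocator.activate(id);
  }

  // Returns false if the agent is no longer known (for example, it was
  // removed while the reactivation was being persisted).
  bool reactivate(const std::string& id) {
    auto it = agents.find(id);
    if (it == agents.end()) return false;
    it->second.deactivated = false;
    // Key point of the sketch: do not activate a disconnected agent.
    if (it->second.connected) allocator.activate(id);
    return true;
  }

  // Returns true if an offer for this agent could be accepted.
  bool accept(const std::string& id) {
    auto it = agents.find(id);
    if (it == agents.end() || !allocator.offers(id)) return false;
    // Mirrors the invariant behind the failed check in the stack trace:
    // an offer must never exist for a disconnected agent.
    assert(it->second.connected);
    return true;
  }
};
```

In this model, deactivating, disconnecting and then reactivating an agent leaves it out of the allocator until it reconnects, so no offer exists that could trip the connectivity check in `accept`.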






[jira] [Commented] (MESOS-10116) Attempt to reactivate disconnected agent crashes the master

2020-05-04 Thread Andrei Sekretenko (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17098887#comment-17098887
 ] 

Andrei Sekretenko commented on MESOS-10116:
---

Fix in master:
{noformat}
commit ff720df995daae76803f13f54f812913a0d3
Author: Andrei Sekretenko 
Date:   Tue Apr 14 20:05:11 2020 +0200

Fixed handling disconnected agents by REACTIVATE_AGENT call.

This patch fixes MESOS-10116 by preventing REACTIVATE_AGENT from
activating disconnected agents in the allocator and also fixes the
handling of agents that were removed while the reactivation was being
stored into the registry.

Review: https://reviews.apache.org/r/72363
{noformat}






[jira] [Commented] (MESOS-10116) Attempt to reactivate disconnected agent crashes the master

2020-05-04 Thread Andrei Sekretenko (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17098882#comment-17098882
 ] 

Andrei Sekretenko commented on MESOS-10116:
---

Fix: https://reviews.apache.org/r/72363/

Test: https://reviews.apache.org/r/72364/ (a proper implementation of the test
is blocked by https://issues.apache.org/jira/browse/MESOS-10118)



