Jan Schlicht created MESOS-8346:
-----------------------------------

             Summary: Resubscription of a resource provider will crash the 
agent if its HTTP connection isn't closed
                 Key: MESOS-8346
                 URL: https://issues.apache.org/jira/browse/MESOS-8346
             Project: Mesos
          Issue Type: Bug
    Affects Versions: 1.5.0
            Reporter: Jan Schlicht
            Assignee: Jan Schlicht
            Priority: Blocker


A resource provider might resubscribe while its old HTTP connection has not 
been properly closed. In that case the agent will crash with, e.g., the 
following log:
{noformat}
I1219 13:33:51.937295 128610304 manager.cpp:570] Subscribing resource provider 
{"id":{"value":"8e71beef-796e-4bde-9257-952ed0f230a5"},"name":"test","type":"org.apache.mesos.rp.test"}
I1219 13:33:51.937443 128610304 manager.cpp:134] Terminating resource provider 
8e71beef-796e-4bde-9257-952ed0f230a5
I1219 13:33:51.937760 128610304 manager.cpp:134] Terminating resource provider 
8e71beef-796e-4bde-9257-952ed0f230a5
E1219 13:33:51.937851 129683456 http_connection.hpp:445] End-Of-File received
I1219 13:33:51.937865 131293184 slave.cpp:7105] Handling resource provider 
message 'DISCONNECT: resource provider 8e71beef-796e-4bde-9257-952ed0f230a5'
I1219 13:33:51.937968 131293184 slave.cpp:7347] Forwarding new total resources 
cpus:2; mem:1024; disk:1024; ports:[31000-32000]
F1219 13:33:51.938052 132366336 manager.cpp:606] Check failed: 
resourceProviders.subscribed.contains(resourceProviderId) 
*** Check failure stack trace: ***
E1219 13:33:51.938583 130756608 http_connection.hpp:445] End-Of-File received
I1219 13:33:51.938987 129683456 hierarchical.cpp:669] Agent 
0019c3fa-28c5-43a9-88d0-709eee271c62-S0 (172.18.8.13) updated with total 
resources cpus:2; mem:1024; disk:1024; ports:[31000-32000]
    @        0x1125380ef  google::LogMessageFatal::~LogMessageFatal()
    @        0x112534ae9  google::LogMessageFatal::~LogMessageFatal()
I1219 13:33:51.939131 129683456 hierarchical.cpp:1517] Performed allocation for 
1 agents in 61830ns
I1219 13:33:51.945793 2646795072 slave.cpp:927] Agent terminating
I1219 13:33:51.945955 129146880 master.cpp:1305] Agent 
0019c3fa-28c5-43a9-88d0-709eee271c62-S0 at slave(1)@172.18.8.13:64430 
(172.18.8.13) disconnected
I1219 13:33:51.945979 129146880 master.cpp:3364] Disconnecting agent 
0019c3fa-28c5-43a9-88d0-709eee271c62-S0 at slave(1)@172.18.8.13:64430 
(172.18.8.13)
I1219 13:33:51.946022 129146880 master.cpp:3383] Deactivating agent 
0019c3fa-28c5-43a9-88d0-709eee271c62-S0 at slave(1)@172.18.8.13:64430 
(172.18.8.13)
I1219 13:33:51.946081 131293184 hierarchical.cpp:766] Agent 
0019c3fa-28c5-43a9-88d0-709eee271c62-S0 deactivated
    @        0x115f2761d  
mesos::internal::ResourceProviderManagerProcess::subscribe()::$_2::operator()()
    @        0x115f2977d  
_ZN5cpp176invokeIZN5mesos8internal30ResourceProviderManagerProcess9subscribeERKNS2_14HttpConnectionERKNS1_17resource_provider14Call_SubscribeEE3$_2JN7process6FutureI7NothingEEEEEDTclclsr3stdE7forwardIT_Efp_Espclsr3stdE7forwardIT0_Efp0_EEEOSG_DpOSH_
    @        0x115f29740  
_ZN6lambda8internal7PartialIZN5mesos8internal30ResourceProviderManagerProcess9subscribeERKNS3_14HttpConnectionERKNS2_17resource_provider14Call_SubscribeEE3$_2JN7process6FutureI7NothingEEEE13invoke_expandISC_NSt3__15tupleIJSG_EEENSK_IJEEEJLm0EEEEDTclsr5cpp17E6invokeclsr3stdE7forwardIT_Efp_Espcl6expandclsr3stdE3getIXT2_EEclsr3stdE7forwardIT0_Efp0_EEclsr3stdE7forwardIT1_Efp2_EEEEOSN_OSO_N5cpp1416integer_sequenceImJXspT2_EEEEOSP_
    @        0x115f296bb  
_ZNO6lambda8internal7PartialIZN5mesos8internal30ResourceProviderManagerProcess9subscribeERKNS3_14HttpConnectionERKNS2_17resource_provider14Call_SubscribeEE3$_2JN7process6FutureI7NothingEEEEclIJEEEDTcl13invoke_expandclL_ZNSt3__14moveIRSC_EEONSJ_16remove_referenceIT_E4typeEOSN_EdtdefpT1fEclL_ZNSK_IRNSJ_5tupleIJSG_EEEEESQ_SR_EdtdefpT10bound_argsEcvN5cpp1416integer_sequenceImJLm0EEEE_Eclsr3stdE16forward_as_tuplespclsr3stdE7forwardIT_Efp_EEEEDpOSY_
    @        0x115f2965d  
_ZN5cpp176invokeIN6lambda8internal7PartialIZN5mesos8internal30ResourceProviderManagerProcess9subscribeERKNS5_14HttpConnectionERKNS4_17resource_provider14Call_SubscribeEE3$_2JN7process6FutureI7NothingEEEEEJEEEDTclclsr3stdE7forwardIT_Efp_Espclsr3stdE7forwardIT0_Efp0_EEEOSK_DpOSL_
    @        0x115f29631  
_ZN6lambda8internal6InvokeIvEclINS0_7PartialIZN5mesos8internal30ResourceProviderManagerProcess9subscribeERKNS6_14HttpConnectionERKNS5_17resource_provider14Call_SubscribeEE3$_2JN7process6FutureI7NothingEEEEEJEEEvOT_DpOT0_
    @        0x115f29526  
_ZNO6lambda12CallableOnceIFvvEE10CallableFnINS_8internal7PartialIZN5mesos8internal30ResourceProviderManagerProcess9subscribeERKNS7_14HttpConnectionERKNS6_17resource_provider14Call_SubscribeEE3$_2JN7process6FutureI7NothingEEEEEEclEv
    @        0x10b6ca690  _ZNO6lambda12CallableOnceIFvvEEclEv
    @        0x10be09295  
_ZZN7process8internal8DispatchIvEclIN6lambda12CallableOnceIFvvEEEEEvRKNS_4UPIDEOT_ENKUlOS7_PNS_11ProcessBaseEE_clESD_SF_
    @        0x10be09180  
_ZN5cpp176invokeIZN7process8internal8DispatchIvEclIN6lambda12CallableOnceIFvvEEEEEvRKNS1_4UPIDEOT_EUlOS9_PNS1_11ProcessBaseEE_JS9_SH_EEEDTclclsr3stdE7forwardISD_Efp_Espclsr3stdE7forwardIT0_Efp0_EEESE_DpOSJ_
    @        0x10be0912b  
_ZN6lambda8internal7PartialIZN7process8internal8DispatchIvEclINS_12CallableOnceIFvvEEEEEvRKNS2_4UPIDEOT_EUlOS9_PNS2_11ProcessBaseEE_JS9_NSt3__112placeholders4__phILi1EEEEE13invoke_expandISI_NSJ_5tupleIJS9_SM_EEENSP_IJOSH_EEEJLm0ELm1EEEEDTclsr5cpp17E6invokeclsr3stdE7forwardISD_Efp_Espcl6expandclsr3stdE3getIXT2_EEclsr3stdE7forwardIT0_Efp0_EEclsr3stdE7forwardIT1_Efp2_EEEESE_OST_N5cpp1416integer_sequenceImJXspT2_EEEEOSU_
    @        0x10be0905f  
_ZNO6lambda8internal7PartialIZN7process8internal8DispatchIvEclINS_12CallableOnceIFvvEEEEEvRKNS2_4UPIDEOT_EUlOS9_PNS2_11ProcessBaseEE_JS9_NSt3__112placeholders4__phILi1EEEEEclIJSH_EEEDTcl13invoke_expandclL_ZNSJ_4moveIRSI_EEONSJ_16remove_referenceISD_E4typeESE_EdtdefpT1fEclL_ZNSP_IRNSJ_5tupleIJS9_SM_EEEEESU_SE_EdtdefpT10bound_argsEcvN5cpp1416integer_sequenceImJLm0ELm1EEEE_Eclsr3stdE16forward_as_tuplespclsr3stdE7forwardIT_Efp_EEEEDpOS11_
    @        0x10be08f4d  
_ZN5cpp176invokeIN6lambda8internal7PartialIZN7process8internal8DispatchIvEclINS1_12CallableOnceIFvvEEEEEvRKNS4_4UPIDEOT_EUlOSB_PNS4_11ProcessBaseEE_JSB_NSt3__112placeholders4__phILi1EEEEEEJSJ_EEEDTclclsr3stdE7forwardISF_Efp_Espclsr3stdE7forwardIT0_Efp0_EEESG_DpOSQ_
    @        0x10be08f11  
_ZN6lambda8internal6InvokeIvEclINS0_7PartialIZN7process8internal8DispatchIvEclINS_12CallableOnceIFvvEEEEEvRKNS5_4UPIDEOT_EUlOSC_PNS5_11ProcessBaseEE_JSC_NSt3__112placeholders4__phILi1EEEEEEJSK_EEEvSH_DpOT0_
    @        0x10be08d36  
_ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEE10CallableFnINS_8internal7PartialIZNS1_8internal8DispatchIvEclINS0_IFvvEEEEEvRKNS1_4UPIDEOT_EUlOSE_S3_E_JSE_NSt3__112placeholders4__phILi1EEEEEEEclEOS3_
    @        0x11fd64bc9  
_ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEEclES3_
    @        0x11fd64a69  process::ProcessBase::consume()
    @        0x11fe20ac4  
_ZNO7process13DispatchEvent7consumeEPNS_13EventConsumerE
    @        0x113c77819  process::ProcessBase::serve()
    @        0x11fd5b8c9  process::ProcessManager::resume()
    @        0x11fe8260b  
process::ProcessManager::init_threads()::$_1::operator()()
    @        0x11fe82190  
_ZNSt3__114__thread_proxyINS_5tupleIJNS_10unique_ptrINS_15__thread_structENS_14default_deleteIS3_EEEEZN7process14ProcessManager12init_threadsEvE3$_1EEEEEPvSB_
    @     0x7fff64da56c1  _pthread_body
    @     0x7fff64da556d  _pthread_start
    @     0x7fff64da4c5d  thread_start
Abort trap: 6
{noformat}

This is due to a race condition in {{resource_provider/manager.cpp}} in the 
handling of closed HTTP connections of resource providers. If a resource 
provider resubscribes while its old HTTP connection is still open, the resource 
provider manager closes the old connection. The handler for closed connections 
does not expect this and triggers closing the new HTTP connection as well, 
which results in a failed {{CHECK}}.
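
To make the race easier to follow, here is a minimal, self-contained sketch. It is not the Mesos code and not necessarily the fix that will be applied: {{ToyManager}}, {{onClosed}}, {{drain}} and the integer connection ids are hypothetical names invented for this illustration. It only shows one way a "connection closed" callback could check that it still belongs to the current subscription before removing a provider.

```cpp
#include <cstdint>
#include <functional>
#include <map>
#include <queue>
#include <string>
#include <utility>

// Hypothetical stand-in for the manager's bookkeeping; not Mesos code.
struct ToyManager {
  // Provider id -> id of the connection that currently serves it.
  std::map<std::string, std::uint64_t> subscribed;

  // Deferred "connection closed" callbacks, run later by drain().
  std::queue<std::function<void()>> pending;

  std::uint64_t nextConnection = 1;
  int staleCallbacksIgnored = 0;

  // Subscribes a provider on a fresh connection and returns that connection's id.
  std::uint64_t subscribe(const std::string& id) {
    const std::uint64_t connection = nextConnection++;

    auto it = subscribed.find(id);
    if (it != subscribed.end()) {
      // Resubscription while the old connection is still open: the old
      // connection is closed, and its callback only runs afterwards.
      const std::uint64_t old = it->second;
      pending.push([this, id, old] { onClosed(id, old); });
    }

    subscribed[id] = connection;
    return connection;
  }

  // Called when a connection has been closed.
  void onClosed(const std::string& id, std::uint64_t connection) {
    auto it = subscribed.find(id);

    // Guard (an assumption of this sketch): ignore the callback if the
    // provider is gone or has since moved to a newer connection.
    if (it == subscribed.end() || it->second != connection) {
      ++staleCallbacksIgnored;
      return;
    }

    subscribed.erase(it);
  }

  // Runs all deferred callbacks in the order they were queued.
  void drain() {
    while (!pending.empty()) {
      std::function<void()> callback = std::move(pending.front());
      pending.pop();
      callback();
    }
  }
};
```

Calling {{subscribe("rp")}} twice and then {{drain()}} leaves the provider registered on the second connection in this sketch, because the callback of the first connection is recognized as stale. Without the guard, the stale callback would remove the new subscription, and a later callback for the new connection would find no entry, which mirrors the failed {{CHECK}} in the log above.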


