[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Reed resolved ZOOKEEPER-2600.
--------------------------------------
    Resolution: Cannot Reproduce

> dangling ephemerals on overloaded server with local sessions
> ------------------------------------------------------------
>
>                 Key: ZOOKEEPER-2600
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2600
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: quorum
>            Reporter: Benjamin Reed
>
> we had the following strange production bug:
> there was an ephemeral znode for a session that was no longer active.  it 
> happened even in the absence of failures.
> we are running with local sessions enabled and slightly different logic than 
> the open source zookeeper, but code inspection shows that the problem is also 
> in open source.
> the triggering condition was server overload. we had a traffic burst and 
> we were seeing commit latencies of over 30 seconds.
> after digging through logs and code we saw that the create session txn for 
> the ephemeral node started (in the PrepRequestProcessor) at 11:23:04 and 
> committed at 11:23:38 (the "Adding global session" message is output in the 
> commit processor). it took 34 seconds to commit the createSession, and 
> during that time the session expired. due to the delays it appears that the 
> interleaving was as follows:
> 1) create session hits prep request processor and create session txn 
> generated 11:23:04
> 2) time passes as the create session is going through zab
> 3) the session expires, close session is generated, and close session txn 
> generated 11:23:23
> 4) the create session gets committed and the session gets re-added to the 
> sessionTracker 11:23:38
> 5) the create ephemeral node hits prep request processor and a create txn 
> generated 11:23:40
> 6) the close session gets committed (all ephemeral nodes for the session are 
> deleted) and the session is deleted from sessionTracker
> 7) the create ephemeral node gets committed
> the root cause seems to be that the global sessions are managed by both the 
> PrepRequestProcessor and the CommitProcessor. also with the local session 
> upgrading we can have changes in flight before our session commits. i think 
> there are probably two places to fix:
> 1) changes to session tracker should not happen in prep request processor.
> 2) we should not have requests in flight while create session is in progress. 
> there are two options to prevent this:
> a) when a create session is generated in makeUpgradeRequest, we need to start 
> queuing the requests from the clients and only submit them once the create 
> session is committed
> b) the client should explicitly detect that it needs to change from local 
> session to global session and explicitly open a global session and get the 
> commit before it sends an ephemeral create request
> option 2a) is a more transparent fix, but architecturally and in the long 
> term i think 2b) might be better.
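
The seven-step interleave above can be replayed with a small toy model. This is only a sketch under stated assumptions: the class DanglingEphemeralReplay and all of its methods are hypothetical names invented for illustration, not ZooKeeper classes or APIs, and the prep and commit stages are reduced to plain method calls. It assumes, as the report describes, that committing the create session re-adds the session to the tracker and that the prep stage accepts a create for any session the tracker currently holds.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy model only: hypothetical names, not ZooKeeper code.
public class DanglingEphemeralReplay {
    // Sessions the server currently believes are alive.
    private final Set<Long> tracker = new HashSet<>();
    // Ephemeral path -> owning session id.
    private final Map<String, Long> ephemerals = new HashMap<>();

    // Prep stage: a create txn is generated only if the session looks alive.
    boolean prepCreateEphemeral(long session) {
        return tracker.contains(session);
    }

    // Commit stage: the session is (re-)added to the tracker.
    void commitCreateSession(long session) {
        tracker.add(session);
    }

    // Commit stage: delete the session's ephemerals, then forget the session.
    void commitCloseSession(long session) {
        ephemerals.values().removeIf(owner -> owner == session);
        tracker.remove(session);
    }

    // Commit stage: the ephemeral is applied to the data tree.
    void commitCreateEphemeral(String path, long session) {
        ephemerals.put(path, session);
    }

    // Ephemerals whose owner is no longer tracked.
    Set<String> danglingEphemerals() {
        Set<String> result = new TreeSet<>();
        for (Map.Entry<String, Long> e : ephemerals.entrySet()) {
            if (!tracker.contains(e.getValue())) {
                result.add(e.getKey());
            }
        }
        return result;
    }

    static Set<String> replay() {
        DanglingEphemeralReplay server = new DanglingEphemeralReplay();
        long session = 1L;
        // 1-2) create session txn generated; it is slow to commit.
        // 3) session expires; close session txn generated, not yet committed.
        // 4) create session commits and the session is re-added to the tracker.
        server.commitCreateSession(session);
        // 5) create ephemeral hits prep; the session looks alive, so a txn is generated.
        boolean createTxnGenerated = server.prepCreateEphemeral(session);
        // 6) close session commits: no ephemerals exist yet, session is removed.
        server.commitCloseSession(session);
        // 7) the create commits after its owner is already gone.
        if (createTxnGenerated) {
            server.commitCreateEphemeral("/example/lock", session);
        }
        return server.danglingEphemerals();
    }

    public static void main(String[] args) {
        System.out.println("dangling ephemerals: " + replay());
    }
}
```

In this model the replay ends with /example/lock owned by a session the tracker no longer knows about. The model does not attempt to show either proposed fix; it only makes the ordering in steps 4 through 7 concrete.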



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
