[jira] [Updated] (ZOOKEEPER-4715) Verify file size and position in testGetCurrentLogSize.
[ https://issues.apache.org/jira/browse/ZOOKEEPER-4715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ZOOKEEPER-4715: -- Labels: pull-request-available (was: ) > Verify file size and position in testGetCurrentLogSize. > --- > > Key: ZOOKEEPER-4715 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4715 > Project: ZooKeeper > Issue Type: Wish > Components: server >Affects Versions: 3.8.1 >Reporter: Yan Zhao >Priority: Major > Labels: pull-request-available > Fix For: 3.9.0, 3.8.2 > > Time Spent: 10m > Remaining Estimate: 0h > > This is pre-PR for ZOOKEEPER-4714. > In ZOOKEEPER-4714, we maintain fileSize and filePosition ourselves and we > want our values to match the original values. Therefore, we added checks for > fileSize and filePosition in our tests. After adding the checks, we used a > new method to retrieve fileSize and filePosition in ZOOKEEPER-4714 and tested > whether the tests can still pass. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ZOOKEEPER-4715) Verify file size and position in testGetCurrentLogSize.
Yan Zhao created ZOOKEEPER-4715: --- Summary: Verify file size and position in testGetCurrentLogSize. Key: ZOOKEEPER-4715 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4715 Project: ZooKeeper Issue Type: Wish Components: server Affects Versions: 3.8.1 Reporter: Yan Zhao Fix For: 3.9.0, 3.8.2 This is pre-PR for ZOOKEEPER-4714. In ZOOKEEPER-4714, we maintain fileSize and filePosition ourselves and we want our values to match the original values. Therefore, we added checks for fileSize and filePosition in our tests. After adding the checks, we used a new method to retrieve fileSize and filePosition in ZOOKEEPER-4714 and tested whether the tests can still pass. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ZOOKEEPER-4714) Improve syncRequestProcessor performance
[ https://issues.apache.org/jira/browse/ZOOKEEPER-4714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ZOOKEEPER-4714: -- Labels: pull-request-available (was: ) > Improve syncRequestProcessor performance > > > Key: ZOOKEEPER-4714 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4714 > Project: ZooKeeper > Issue Type: Wish > Components: server >Affects Versions: 3.8.1 >Reporter: Yan Zhao >Priority: Major > Labels: pull-request-available > Fix For: 3.9.0, 3.8.2 > > Attachments: 761688051587_.pic.jpg > > Time Spent: 10m > Remaining Estimate: 0h > > In the SyncRequestProcessor, a write operation is performed for each write > request. Two methods are relatively time-consuming. > 1. Within SyncRequestProcessor#shouldSnapshot, the current size of the > current file is retrieved, which involves a system call. > Call stack: > java.io.File.length(File.java) > org.apache.zookeeper.server.persistence.FileTxnLog.getCurrentLogSize(FileTxnLog.java:211) > org.apache.zookeeper.server.persistence.FileTxnLog.getTotalLogSize(FileTxnLog.java:221) > org.apache.zookeeper.server.persistence.FileTxnSnapLog.getTotalLogSize(FileTxnSnapLog.java:671) > org.apache.zookeeper.server.ZKDatabase.getTxnSize(ZKDatabase.java:790) > org.apache.zookeeper.server.SyncRequestProcessor.shouldSnapshot(SyncRequestProcessor.java:145) > org.apache.zookeeper.server.SyncRequestProcessor.run(SyncRequestProcessor.java:182) > 2. Within ZKDatabase#append, the current position of the current file is > retrieved, which also involves a system call. > Call stack: > sun.nio.ch.FileDispatcherImpl.seek(FileDispatcherImpl.java) > sun.nio.ch.FileChannelImpl.position(FileChannelImpl.java:264) > org.apache.zookeeper.server.persistence.FilePadding.padFile(FilePadding.java:76) > org.apache.zookeeper.server.persistence.FileTxnLog.append(FileTxnLog.java:298) > org.apache.zookeeper.server.persistence.FileTxnSnapLog.append(FileTxnSnapLog.java:592) > org.apache.zookeeper.server.ZKDatabase.append(ZKDatabase.java:678) > org.apache.zookeeper.server.SyncRequestProcessor.run(SyncRequestProcessor.java:181) > Therefore, it is best to maintain the current size and position of the > current file ourselves, as this can greatly improve performance. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ZOOKEEPER-4714) Improve syncRequestProcessor performance
Yan Zhao created ZOOKEEPER-4714: --- Summary: Improve syncRequestProcessor performance Key: ZOOKEEPER-4714 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4714 Project: ZooKeeper Issue Type: Wish Components: server Affects Versions: 3.8.1 Reporter: Yan Zhao Fix For: 3.9.0, 3.8.2 Attachments: 761688051587_.pic.jpg In the SyncRequestProcessor, a write operation is performed for each write request. Two methods are relatively time-consuming. 1. Within SyncRequestProcessor#shouldSnapshot, the current size of the current file is retrieved, which involves a system call. Call stack: java.io.File.length(File.java) org.apache.zookeeper.server.persistence.FileTxnLog.getCurrentLogSize(FileTxnLog.java:211) org.apache.zookeeper.server.persistence.FileTxnLog.getTotalLogSize(FileTxnLog.java:221) org.apache.zookeeper.server.persistence.FileTxnSnapLog.getTotalLogSize(FileTxnSnapLog.java:671) org.apache.zookeeper.server.ZKDatabase.getTxnSize(ZKDatabase.java:790) org.apache.zookeeper.server.SyncRequestProcessor.shouldSnapshot(SyncRequestProcessor.java:145) org.apache.zookeeper.server.SyncRequestProcessor.run(SyncRequestProcessor.java:182) 2. Within ZKDatabase#append, the current position of the current file is retrieved, which also involves a system call. Call stack: sun.nio.ch.FileDispatcherImpl.seek(FileDispatcherImpl.java) sun.nio.ch.FileChannelImpl.position(FileChannelImpl.java:264) org.apache.zookeeper.server.persistence.FilePadding.padFile(FilePadding.java:76) org.apache.zookeeper.server.persistence.FileTxnLog.append(FileTxnLog.java:298) org.apache.zookeeper.server.persistence.FileTxnSnapLog.append(FileTxnSnapLog.java:592) org.apache.zookeeper.server.ZKDatabase.append(ZKDatabase.java:678) org.apache.zookeeper.server.SyncRequestProcessor.run(SyncRequestProcessor.java:181) Therefore, it is best to maintain the current size and position of the current file ourselves, as this can greatly improve performance. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ZOOKEEPER-4712) Follower.shutdown() and Observer.shutdown() do not correctly shutdown the syncProcessor, which may lead to data inconsistency
[ https://issues.apache.org/jira/browse/ZOOKEEPER-4712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sirius updated ZOOKEEPER-4712: -- Description: Follower.shutdown() and Observer.shutdown() do not correctly shutdown the syncProcessor. It may lead to potential data inconsistency (see {*}Potential Risk{*}). A follower / observer will invoke syncProcessor.shutdown() in LearnerZooKeeperServer.shutdown() / ObserverZooKeeperServer.shutdown(), respectively. However, after the [FIX|https://github.com/apache/zookeeper/commit/efbd660e1c4b90a8f538f2cccb5dcb7094cf9a22] of ZOOKEEPER-3642, Follower.shutdown() / Observer.shutdown() will not invoke LearnerZooKeeperServer.shutdown() / ObserverZooKeeperServer.shutdown() anymore. h2. Call stack h5. Version 3.8.1 / 3.8.0 / 3.7.1 / 3.7.0 / 3.6.4 / 3.6.3 / 3.5.10 ... * *(Buggy)* Observer.shutdown() -> Learner.shutdown() -> ZooKeeperServer.shutdown(boolean) * *(Buggy)* Follower.shutdown() -> Learner.shutdown() -> ZooKeeperServer.shutdown(boolean) * (For comparison) Leader.shutdown(String) -> LeaderZooKeeper.shutdown() -> ZooKeeperServer.shutdown() -> ZooKeeperServer.shutdown(boolean) h5. For comparison, in version 3.4.X, * Observer.shutdown() -> Learner.shutdown() -> {*}ObserverZooKeeperServer.shutdown() -{*}> ZooKeeperServer.shutdown() -> ZooKeeperServer.shutdown(boolean) * Follower.shutdown() -> Learner.shutdown() -> {*}FollowerZooKeeperServer.shutdown() -{*}> ZooKeeperServer.shutdown() -> ZooKeeperServer.shutdown(boolean) h2. Code Details Take version 3.8.0 as an example. In Follower.shutdown() : {code:java} public void shutdown() { LOG.info("shutdown Follower"); + // invoke Learner.shutdown() super.shutdown(); } {code} In Learner.java: {code:java} public void shutdown() { ... // shutdown previous zookeeper if (zk != null) { // If we haven't finished SNAP sync, force fully shutdown // to avoid potential inconsistency + // This will invoke ZooKeeperServer.shutdown(boolean), + // which will not shutdown syncProcessor + // Before the fix of ZOOLEEPER-3642, + // FollowerZooKeeperServer.shutdown() will be invoked here zk.shutdown(self.getSyncMode().equals(QuorumPeer.SyncMode.SNAP)); } } {code} In ZooKeeperServer.java: {code:java} public synchronized void shutdown(boolean fullyShutDown) { ... if (firstProcessor != null) { + // For a follower, this will not shutdown its syncProcessor. firstProcessor.shutdown(); } ... } {code} In expectation, Follower.shutdown() should invoke LearnerZooKeeperServer.shutdown() to shutdown the syncProcessor: {code:java} public synchronized void shutdown() { ... try { + // shutdown the syncProcessor here if (syncProcessor != null) { syncProcessor.shutdown(); } } ... } {code} Observer.shutdown() has the similar problem. h2. Potential Risk When Follower.shutdown() is called, the follower's QuorumPeer thread may invoke fastForwardDataBase() and update the lastProcessedZxid for the election and recovery phase before its syncThread drains the pending requests and flushes them to disk. In consequence, this lastProcessedZxid is not the latest zxid in its log, leading to log inconsistency after the SYNC phase. (Similar to the symptoms of ZOOKEEPER-2845.) was: Follower.shutdown() and Observer.shutdown() do not correctly shutdown the syncProcessor. It may lead to potential data inconsistency (see {*}Potential Risk{*}). A follower / observer will invoke syncProcessor.shutdown() in LearnerZooKeeperServer.shutdown() / ObserverZooKeeperServer.shutdown(), respectively. However, after the [FIX|https://github.com/apache/zookeeper/commit/efbd660e1c4b90a8f538f2cccb5dcb7094cf9a22] of ZOOKEEPER-3642, Follower.shutdown() / Observer.shutdown() will not invoke LearnerZooKeeperServer.shutdown() / ObserverZooKeeperServer.shutdown() anymore. h2. Call stack h5. Version 3.8.1 / 3.8.0 / 3.7.1 / 3.7.0 / 3.6.4 / 3.6.3 / 3.5.10 ... * *(Buggy)* Observer.shutdown() -> Learner.shutdown() -> ZooKeeperServer.shutdown(boolean) * *(Buggy)* Follower.shutdown() -> Learner.shutdown() -> ZooKeeperServer.shutdown(boolean) * (For comparison) Leader.shutdown(String) -> LeaderZooKeeper.shutdown() -> ZooKeeperServer.shutdown() -> ZooKeeperServer.shutdown(boolean) h5. For comparison, in version 3.4.X, * Observer.shutdown() -> Learner.shutdown() -> {*}ObserverZooKeeperServer.shutdown() -{*}> ZooKeeperServer.shutdown() -> ZooKeeperServer.shutdown(boolean) * Follower.shutdown() -> Learner.shutdown() -> {*}FollowerZooKeeperServer.shutdown() -{*}> ZooKeeperServer.shutdown() -> ZooKeeperServer.shutdown(boolean) h2. Code Details Take version
[jira] [Updated] (ZOOKEEPER-4712) Follower.shutdown() and Observer.shutdown() do not correctly shutdown the syncProcessor, which may lead to data inconsistency
[ https://issues.apache.org/jira/browse/ZOOKEEPER-4712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sirius updated ZOOKEEPER-4712: -- Description: Follower.shutdown() and Observer.shutdown() do not correctly shutdown the syncProcessor. It may lead to potential data inconsistency (see {*}Potential Risk{*}). A follower / observer will invoke syncProcessor.shutdown() in LearnerZooKeeperServer.shutdown() / ObserverZooKeeperServer.shutdown(), respectively. However, after the [FIX|https://github.com/apache/zookeeper/commit/efbd660e1c4b90a8f538f2cccb5dcb7094cf9a22] of ZOOKEEPER-3642, Follower.shutdown() / Observer.shutdown() will not invoke LearnerZooKeeperServer.shutdown() / ObserverZooKeeperServer.shutdown() anymore. h2. Call stack h5. Version 3.8.1 / 3.8.0 / 3.7.1 / 3.7.0 / 3.6.4 / 3.6.3 / 3.5.10 ... * *(Buggy)* Observer.shutdown() -> Learner.shutdown() -> ZooKeeperServer.shutdown(boolean) * *(Buggy)* Follower.shutdown() -> Learner.shutdown() -> ZooKeeperServer.shutdown(boolean) * (For comparison) Leader.shutdown(String) -> LeaderZooKeeper.shutdown() -> ZooKeeperServer.shutdown() -> ZooKeeperServer.shutdown(boolean) h5. For comparison, in version 3.4.X, * Observer.shutdown() -> Learner.shutdown() -> {*}ObserverZooKeeperServer.shutdown() -{*}> ZooKeeperServer.shutdown() -> ZooKeeperServer.shutdown(boolean) * Follower.shutdown() -> Learner.shutdown() -> {*}FollowerZooKeeperServer.shutdown() -{*}> ZooKeeperServer.shutdown() -> ZooKeeperServer.shutdown(boolean) h2. Code Details Take version 3.8.0 as an example. In Follower.shutdown() : {code:java} public void shutdown() { LOG.info("shutdown Follower"); + // invoke Learner.shutdown() super.shutdown(); } {code} In Learner.java: {code:java} public void shutdown() { ... // shutdown previous zookeeper if (zk != null) { // If we haven't finished SNAP sync, force fully shutdown // to avoid potential inconsistency + // This will invoke ZooKeeperServer.shutdown(boolean), + // which will not shutdown syncProcessor + // Before the fix of ZOOLEEPER-3642, + // FollowerZooKeeperServer.shutdown() will be invoked here zk.shutdown(self.getSyncMode().equals(QuorumPeer.SyncMode.SNAP)); } } {code} In ZooKeeperServer.java: {code:java} public synchronized void shutdown(boolean fullyShutDown) { ... if (firstProcessor != null) { + // For a follower, this will not shutdown its syncProcessor. firstProcessor.shutdown(); } ... } {code} In expectation, Follower.shutdown() should invoke LearnerZooKeeperServer.shutdown() to shutdown the syncProcessor. The QuorumPeer thread should wait for the exit of syncThread before back in LOOKING state: {code:java} public synchronized void shutdown() { ... try { + // shutdown the syncProcessor here if (syncProcessor != null) { syncProcessor.shutdown(); } } ... } {code} Observer.shutdown() has the similar problem. h2. Potential Risk When Follower.shutdown() is called, the follower's QuorumPeer thread may invoke fastForwardDataBase() and update the lastProcessedZxid for the election and recovery phase before its syncThread drains the pending requests and flushes them to disk. In consequence, this lastProcessedZxid is not the latest zxid in its log, leading to log inconsistency after the SYNC phase. (Similar to the symptoms of ZOOKEEPER-2845.) was: Follower.shutdown() and Observer.shutdown() do not correctly shutdown the syncProcessor. It may lead to potential data inconsistency (see {*}Potential Risk{*}). A follower / observer will invoke syncProcessor.shutdown() in LearnerZooKeeperServer.shutdown() / ObserverZooKeeperServer.shutdown(), respectively. However, after the [FIX|https://github.com/apache/zookeeper/commit/efbd660e1c4b90a8f538f2cccb5dcb7094cf9a22] of ZOOKEEPER-3642, Follower.shutdown() / Observer.shutdown() will not invoke LearnerZooKeeperServer.shutdown() / ObserverZooKeeperServer.shutdown() anymore. h2. Call stack h5. Version 3.8.1 / 3.8.0 / 3.7.1 / 3.7.0 / 3.6.4 / 3.6.3 / 3.5.10 ... * *(Buggy)* Observer.shutdown() -> Learner.shutdown() -> ZooKeeperServer.shutdown(boolean) * *(Buggy)* Follower.shutdown() -> Learner.shutdown() -> ZooKeeperServer.shutdown(boolean) * (For comparison) Leader.shutdown(String) -> LeaderZooKeeper.shutdown() -> ZooKeeperServer.shutdown() -> ZooKeeperServer.shutdown(boolean) h5. For comparison, in version 3.4.X, * Observer.shutdown() -> Learner.shutdown() -> {*}ObserverZooKeeperServer.shutdown() -{*}> ZooKeeperServer.shutdown() -> ZooKeeperServer.shutdown(boolean) * Follower.shutdown() -> Learner.shutdown() -> {*}FollowerZooKeeperServer.shutdown() -{*}>
[jira] [Updated] (ZOOKEEPER-4712) Follower.shutdown() and Observer.shutdown() do not correctly shutdown the syncProcessor, which may lead to data inconsistency
[ https://issues.apache.org/jira/browse/ZOOKEEPER-4712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sirius updated ZOOKEEPER-4712: -- Description: Follower.shutdown() and Observer.shutdown() do not correctly shutdown the syncProcessor. It may lead to potential data inconsistency (see {*}Potential Risk{*}). A follower / observer will invoke syncProcessor.shutdown() in LearnerZooKeeperServer.shutdown() / ObserverZooKeeperServer.shutdown(), respectively. However, after the [FIX|https://github.com/apache/zookeeper/commit/efbd660e1c4b90a8f538f2cccb5dcb7094cf9a22] of ZOOKEEPER-3642, Follower.shutdown() / Observer.shutdown() will not invoke LearnerZooKeeperServer.shutdown() / ObserverZooKeeperServer.shutdown() anymore. h2. Call stack h5. Version 3.8.1 / 3.8.0 / 3.7.1 / 3.7.0 / 3.6.4 / 3.6.3 / 3.5.10 ... * *(Buggy)* Observer.shutdown() -> Learner.shutdown() -> ZooKeeperServer.shutdown(boolean) * *(Buggy)* Follower.shutdown() -> Learner.shutdown() -> ZooKeeperServer.shutdown(boolean) * (For comparison) Leader.shutdown(String) -> LeaderZooKeeper.shutdown() -> ZooKeeperServer.shutdown() -> ZooKeeperServer.shutdown(boolean) h5. For comparison, in version 3.4.X, * Observer.shutdown() -> Learner.shutdown() -> {*}ObserverZooKeeperServer.shutdown() -{*}> ZooKeeperServer.shutdown() -> ZooKeeperServer.shutdown(boolean) * Follower.shutdown() -> Learner.shutdown() -> {*}FollowerZooKeeperServer.shutdown() -{*}> ZooKeeperServer.shutdown() -> ZooKeeperServer.shutdown(boolean) h2. Code Details Take version 3.8.0 as an example. In Follower.shutdown() : {code:java} public void shutdown() { LOG.info("shutdown Follower"); + // invoke Learner.shutdown() super.shutdown(); } {code} In Learner.java: {code:java} public void shutdown() { ... // shutdown previous zookeeper if (zk != null) { // If we haven't finished SNAP sync, force fully shutdown // to avoid potential inconsistency + // This will invoke ZooKeeperServer.shutdown(boolean), + // which will not shutdown syncProcessor + // Before the fix of ZOOLEEPER-3642, + // FollowerZooKeeperServer.shutdown() will be invoked here zk.shutdown(self.getSyncMode().equals(QuorumPeer.SyncMode.SNAP)); } } {code} In ZooKeeperServer.java: {code:java} public synchronized void shutdown(boolean fullyShutDown) { ... if (firstProcessor != null) { + // For a follower, this will not shutdown its syncProcessor. firstProcessor.shutdown(); } ... } {code} In expectation, Follower.shutdown() should invoke LearnerZooKeeperServer.shutdown() to shutdown the syncProcessor: {code:java} public synchronized void shutdown() { ... try { + // shutdown the syncProcessor here if (syncProcessor != null) { syncProcessor.shutdown(); } } ... } {code} Observer.shutdown() has the similar problem. h2. Potential Risk When Follower.shutdown() is called, the follower's QuorumPeer thread may invoke fastForwardDataBase() and update the lastProcessedZxid for the election and recovery phase before its syncThread drains the pending requests and flushes them to disk. In consequence, this lastProcessedZxid is not the latest zxid in its log, leading to log inconsistency after the SYNC phase. (Similar to the symptoms of ZOOKEEPER-2845.) was: Follower.shutdown() and Observer.shutdown() do not correctly shutdown the syncProcessor. It may lead to potential data inconsistency (see Potential Risk). A follower / observer will invoke syncProcessor.shutdown() in LearnerZooKeeperServer.shutdown() / ObserverZooKeeperServer.shutdown(), respectively. However, after the [FIX|https://github.com/apache/zookeeper/commit/efbd660e1c4b90a8f538f2cccb5dcb7094cf9a22] of ZOOKEEPER-3642, Follower.shutdown() / Observer.shutdown() will not invoke LearnerZooKeeperServer.shutdown() / ObserverZooKeeperServer.shutdown() anymore. h2. Call stack h5. Version 3.8.1 / 3.8.0 / 3.7.1 / 3.7.0 / 3.6.4 / 3.6.3 / 3.5.10 ... * *(Buggy)* Observer.shutdown() -> Learner.shutdown() -> ZooKeeperServer.shutdown(boolean) * *(Buggy)* Follower.shutdown() -> Learner.shutdown() -> ZooKeeperServer.shutdown(boolean) * (For comparison) Leader.shutdown(String) -> LeaderZooKeeper.shutdown() -> ZooKeeperServer.shutdown() -> ZooKeeperServer.shutdown(boolean) h5. For comparison, in version 3.4.X, * Observer.shutdown() -> Learner.shutdown() -> {*}ObserverZooKeeperServer.shutdown() -{*}> ZooKeeperServer.shutdown() -> ZooKeeperServer.shutdown(boolean) * Follower.shutdown() -> Learner.shutdown() -> {*}FollowerZooKeeperServer.shutdown() -{*}> ZooKeeperServer.shutdown() -> ZooKeeperServer.shutdown(boolean) h2. Code Details Take version 3.8.0 as
[jira] [Updated] (ZOOKEEPER-4712) Follower.shutdown() and Observer.shutdown() do not correctly shutdown the syncProcessor, which may lead to data inconsistency
[ https://issues.apache.org/jira/browse/ZOOKEEPER-4712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sirius updated ZOOKEEPER-4712: -- Summary: Follower.shutdown() and Observer.shutdown() do not correctly shutdown the syncProcessor, which may lead to data inconsistency (was: Follower.shutdown() and Observer.shutdown() do not correctly shutdown the syncProcessor, which may lead to potential data inconsistency) > Follower.shutdown() and Observer.shutdown() do not correctly shutdown the > syncProcessor, which may lead to data inconsistency > - > > Key: ZOOKEEPER-4712 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4712 > Project: ZooKeeper > Issue Type: Bug > Components: quorum, server >Affects Versions: 3.5.10, 3.6.3, 3.7.0, 3.8.0, 3.7.1, 3.6.4, 3.8.1 >Reporter: Sirius >Priority: Critical > > Follower.shutdown() and Observer.shutdown() do not correctly shutdown the > syncProcessor. It may lead to potential data inconsistency (see Potential > Risk). > > A follower / observer will invoke syncProcessor.shutdown() in > LearnerZooKeeperServer.shutdown() / ObserverZooKeeperServer.shutdown(), > respectively. > However, after the > [FIX|https://github.com/apache/zookeeper/commit/efbd660e1c4b90a8f538f2cccb5dcb7094cf9a22] > of ZOOKEEPER-3642, Follower.shutdown() / Observer.shutdown() will not invoke > LearnerZooKeeperServer.shutdown() / ObserverZooKeeperServer.shutdown() > anymore. > > h2. Call stack > h5. Version 3.8.1 / 3.8.0 / 3.7.1 / 3.7.0 / 3.6.4 / 3.6.3 / 3.5.10 ... > * *(Buggy)* Observer.shutdown() -> Learner.shutdown() -> > ZooKeeperServer.shutdown(boolean) > * *(Buggy)* Follower.shutdown() -> Learner.shutdown() -> > ZooKeeperServer.shutdown(boolean) > * (For comparison) Leader.shutdown(String) -> LeaderZooKeeper.shutdown() -> > ZooKeeperServer.shutdown() -> ZooKeeperServer.shutdown(boolean) > > h5. For comparison, in version 3.4.X, > * Observer.shutdown() -> Learner.shutdown() -> > {*}ObserverZooKeeperServer.shutdown() -{*}> ZooKeeperServer.shutdown() -> > ZooKeeperServer.shutdown(boolean) > * Follower.shutdown() -> Learner.shutdown() -> > {*}FollowerZooKeeperServer.shutdown() -{*}> ZooKeeperServer.shutdown() -> > ZooKeeperServer.shutdown(boolean) > > h2. Code Details > Take version 3.8.0 as an example. > In Follower.shutdown() : > {code:java} > public void shutdown() { > LOG.info("shutdown Follower"); > + // invoke Learner.shutdown() > super.shutdown(); > } {code} > > In Learner.java: > {code:java} > public void shutdown() { > ... > // shutdown previous zookeeper > if (zk != null) { > // If we haven't finished SNAP sync, force fully shutdown > // to avoid potential inconsistency > + // This will invoke ZooKeeperServer.shutdown(boolean), > + // which will not shutdown syncProcessor > + // Before the fix of ZOOLEEPER-3642, > + // FollowerZooKeeperServer.shutdown() will be invoked here > zk.shutdown(self.getSyncMode().equals(QuorumPeer.SyncMode.SNAP)); > } > } {code} > > In ZooKeeperServer.java: > {code:java} > public synchronized void shutdown(boolean fullyShutDown) { > ... > if (firstProcessor != null) { > + // For a follower, this will not shutdown its syncProcessor. > firstProcessor.shutdown(); > } > ... > } {code} > > In expectation, Follower.shutdown() should invoke > LearnerZooKeeperServer.shutdown() to shutdown the syncProcessor: > {code:java} > public synchronized void shutdown() { > ... > try { > + // shutdown the syncProcessor here > if (syncProcessor != null) { > syncProcessor.shutdown(); > } > } ... > } {code} > Observer.shutdown() has the similar problem. > > h2. Potential Risk > When Follower.shutdown() is called, the follower's QuorumPeer thread may > invoke fastForwardDataBase() and > update the lastProcessedZxid for the election and recovery phase before its > syncThread drains the pending requests and flushes them to disk. > In consequence, this lastProcessedZxid is not the latest zxid in its log, > leading to log inconsistency after the SYNC phase. (Similar to the symptoms > of ZOOKEEPER-2845.) > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ZOOKEEPER-4712) Follower.shutdown() and Observer.shutdown() do not correctly shutdown the syncProcessor, which may lead to potential data inconsistency
[ https://issues.apache.org/jira/browse/ZOOKEEPER-4712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sirius updated ZOOKEEPER-4712: -- Description: Follower.shutdown() and Observer.shutdown() do not correctly shutdown the syncProcessor. It may lead to potential data inconsistency (see Potential Risk). A follower / observer will invoke syncProcessor.shutdown() in LearnerZooKeeperServer.shutdown() / ObserverZooKeeperServer.shutdown(), respectively. However, after the [FIX|https://github.com/apache/zookeeper/commit/efbd660e1c4b90a8f538f2cccb5dcb7094cf9a22] of ZOOKEEPER-3642, Follower.shutdown() / Observer.shutdown() will not invoke LearnerZooKeeperServer.shutdown() / ObserverZooKeeperServer.shutdown() anymore. h2. Call stack h5. Version 3.8.1 / 3.8.0 / 3.7.1 / 3.7.0 / 3.6.4 / 3.6.3 / 3.5.10 ... * *(Buggy)* Observer.shutdown() -> Learner.shutdown() -> ZooKeeperServer.shutdown(boolean) * *(Buggy)* Follower.shutdown() -> Learner.shutdown() -> ZooKeeperServer.shutdown(boolean) * (For comparison) Leader.shutdown(String) -> LeaderZooKeeper.shutdown() -> ZooKeeperServer.shutdown() -> ZooKeeperServer.shutdown(boolean) h5. For comparison, in version 3.4.X, * Observer.shutdown() -> Learner.shutdown() -> {*}ObserverZooKeeperServer.shutdown() -{*}> ZooKeeperServer.shutdown() -> ZooKeeperServer.shutdown(boolean) * Follower.shutdown() -> Learner.shutdown() -> {*}FollowerZooKeeperServer.shutdown() -{*}> ZooKeeperServer.shutdown() -> ZooKeeperServer.shutdown(boolean) h2. Code Details Take version 3.8.0 as an example. In Follower.shutdown() : {code:java} public void shutdown() { LOG.info("shutdown Follower"); + // invoke Learner.shutdown() super.shutdown(); } {code} In Learner.java: {code:java} public void shutdown() { ... // shutdown previous zookeeper if (zk != null) { // If we haven't finished SNAP sync, force fully shutdown // to avoid potential inconsistency + // This will invoke ZooKeeperServer.shutdown(boolean), + // which will not shutdown syncProcessor + // Before the fix of ZOOLEEPER-3642, + // FollowerZooKeeperServer.shutdown() will be invoked here zk.shutdown(self.getSyncMode().equals(QuorumPeer.SyncMode.SNAP)); } } {code} In ZooKeeperServer.java: {code:java} public synchronized void shutdown(boolean fullyShutDown) { ... if (firstProcessor != null) { + // For a follower, this will not shutdown its syncProcessor. firstProcessor.shutdown(); } ... } {code} In expectation, Follower.shutdown() should invoke LearnerZooKeeperServer.shutdown() to shutdown the syncProcessor: {code:java} public synchronized void shutdown() { ... try { + // shutdown the syncProcessor here if (syncProcessor != null) { syncProcessor.shutdown(); } } ... } {code} Observer.shutdown() has the similar problem. h2. Potential Risk When Follower.shutdown() is called, the follower's QuorumPeer thread may invoke fastForwardDataBase() and update the lastProcessedZxid for the election and recovery phase before its syncThread drains the pending requests and flushes them to disk. In consequence, this lastProcessedZxid is not the latest zxid in its log, leading to log inconsistency after the SYNC phase. (Similar to the symptoms of ZOOKEEPER-2845.) was: Follower.shutdown() and Observer.shutdown() do not correctly shutdown the syncProcessor. It may lead to potential data inconsistency (see Potential Risk). A follower / observer will invoke syncProcessor.shutdown() in LearnerZooKeeperServer.shutdown() / ObserverZooKeeperServer.shutdown(), respectively. However, after the [FIX|https://github.com/apache/zookeeper/commit/efbd660e1c4b90a8f538f2cccb5dcb7094cf9a22] of ZOOKEEPER-3642, Follower.shutdown() / Observer.shutdown() will not invoke LearnerZooKeeperServer.shutdown() / ObserverZooKeeperServer.shutdown() anymore. h2. Call stack h5. Version 3.8.1 / 3.8.0 / 3.7.1 / 3.7.0 / 3.6.4 / 3.6.3 / 3.5.10 ... * *(Buggy)* Observer.shutdown() -> Learner.shutdown() -> ZooKeeperServer.shutdown(boolean) * *(Buggy)* Follower.shutdown() -> Learner.shutdown() -> ZooKeeperServer.shutdown(boolean) * (For comparison) Leader.shutdown(String) -> LeaderZooKeeper.shutdown() -> ZooKeeperServer.shutdown() -> ZooKeeperServer.shutdown(boolean) h5. For comparison, in version 3.4.X, * Observer.shutdown() -> Learner.shutdown() -> {*}ObserverZooKeeperServer.shutdown() -{*}> ZooKeeperServer.shutdown() -> ZooKeeperServer.shutdown(boolean) * Follower.shutdown() -> Learner.shutdown() -> {*}FollowerZooKeeperServer.shutdown() -{*}> ZooKeeperServer.shutdown() -> ZooKeeperServer.shutdown(boolean) h2. Code Details Take version 3.8.0 as an
[jira] [Updated] (ZOOKEEPER-4713) ObserverZooKeeperServer.shutdown() is redundant
[ https://issues.apache.org/jira/browse/ZOOKEEPER-4713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sirius updated ZOOKEEPER-4713: -- Description: After the [FIX|https://github.com/apache/zookeeper/commit/66646796c2173423655c7faf2b458b658143e6b5] of ZOOKEEPER-1796, LearnerZooKeeperServer.shutdown() should be responsible for the shutdown logic of both the follower and observer. ObserverZooKeeperServer.shutdown() seems redundant, because it is not in the call stack of Observer.shutdown(). (Note that FollowerZooKeeperServer does not have the shutdown() method.) Related analysis can be found in ZOOKEEPER-4712 was: After the [FIX|https://github.com/apache/zookeeper/commit/66646796c2173423655c7faf2b458b658143e6b5] of ZOOKEEPER-1796, LearnerZooKeeperServer.shutdown() should be responsible for the shutdown logic of both the follower and observer. ObserverZooKeeperServer.shutdown() seems redundant. Related analysis can be found in [ZOOKEEPER-4712|https://issues.apache.org/jira/browse/ZOOKEEPER-4712)ZOOKEEPER-4712] > ObserverZooKeeperServer.shutdown() is redundant > --- > > Key: ZOOKEEPER-4713 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4713 > Project: ZooKeeper > Issue Type: Improvement > Components: quorum, server >Affects Versions: 3.5.10, 3.6.3, 3.7.0, 3.8.0, 3.7.1, 3.6.4, 3.8.1 >Reporter: Sirius >Priority: Minor > > After the > [FIX|https://github.com/apache/zookeeper/commit/66646796c2173423655c7faf2b458b658143e6b5] > of ZOOKEEPER-1796, LearnerZooKeeperServer.shutdown() should be responsible > for the shutdown logic of both the follower and observer. > ObserverZooKeeperServer.shutdown() seems redundant, because it is not in the > call stack of Observer.shutdown(). (Note that FollowerZooKeeperServer does > not have the shutdown() method.) > Related analysis can be found in ZOOKEEPER-4712 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ZOOKEEPER-4713) ObserverZooKeeperServer.shutdown() is redundant
Sirius created ZOOKEEPER-4713: - Summary: ObserverZooKeeperServer.shutdown() is redundant Key: ZOOKEEPER-4713 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4713 Project: ZooKeeper Issue Type: Improvement Components: quorum, server Affects Versions: 3.8.1, 3.7.1, 3.8.0, 3.7.0, 3.6.3, 3.5.10, 3.6.4 Reporter: Sirius After the [FIX|https://github.com/apache/zookeeper/commit/66646796c2173423655c7faf2b458b658143e6b5] of ZOOKEEPER-1796, LearnerZooKeeperServer.shutdown() should be responsible for the shutdown logic of both the follower and observer. ObserverZooKeeperServer.shutdown() seems redundant. Related analysis can be found in [ZOOKEEPER-4712|https://issues.apache.org/jira/browse/ZOOKEEPER-4712)ZOOKEEPER-4712] -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ZOOKEEPER-4712) Follower.shutdown() and Observer.shutdown() do not correctly shutdown the syncProcessor, which may lead to potential data inconsistency
[ https://issues.apache.org/jira/browse/ZOOKEEPER-4712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sirius updated ZOOKEEPER-4712: -- Description: Follower.shutdown() and Observer.shutdown() do not correctly shutdown the syncProcessor. It may lead to potential data inconsistency (see Potential Risk). A follower / observer will invoke syncProcessor.shutdown() in LearnerZooKeeperServer.shutdown() / ObserverZooKeeperServer.shutdown(), respectively. However, after the [FIX|https://github.com/apache/zookeeper/commit/efbd660e1c4b90a8f538f2cccb5dcb7094cf9a22] of ZOOKEEPER-3642, Follower.shutdown() / Observer.shutdown() will not invoke LearnerZooKeeperServer.shutdown() / ObserverZooKeeperServer.shutdown() anymore. h2. Call stack h5. Version 3.8.1 / 3.8.0 / 3.7.1 / 3.7.0 / 3.6.4 / 3.6.3 / 3.5.10 ... * *(Buggy)* Observer.shutdown() -> Learner.shutdown() -> ZooKeeperServer.shutdown(boolean) * *(Buggy)* Follower.shutdown() -> Learner.shutdown() -> ZooKeeperServer.shutdown(boolean) * (For comparison) Leader.shutdown(String) -> LeaderZooKeeper.shutdown() -> ZooKeeperServer.shutdown() -> ZooKeeperServer.shutdown(boolean) h5. For comparison, in version 3.4.X, * Observer.shutdown() -> Learner.shutdown() -> {*}ObserverZooKeeperServer.shutdown() -{*}> ZooKeeperServer.shutdown() -> ZooKeeperServer.shutdown(boolean) * Follower.shutdown() -> Learner.shutdown() -> {*}FollowerZooKeeperServer.shutdown() -{*}> ZooKeeperServer.shutdown() -> ZooKeeperServer.shutdown(boolean) h2. Code Details Take version 3.8.0 as an example. In Follower.shutdown() : {code:java} public void shutdown() { LOG.info("shutdown Follower"); + // invoke Learner.shutdown() super.shutdown(); } {code} In Learner.java: {code:java} public void shutdown() { ... // shutdown previous zookeeper if (zk != null) { // If we haven't finished SNAP sync, force fully shutdown // to avoid potential inconsistency + // This will invoke ZooKeeperServer.shutdown(boolean), + // which will not shutdown syncProcessor + // Before the fix of ZOOLEEPER-3642, + // FollowerZooKeeperServer.shutdown() will be invoked here zk.shutdown(self.getSyncMode().equals(QuorumPeer.SyncMode.SNAP)); } } {code} In ZooKeeperServer.java: {code:java} public synchronized void shutdown(boolean fullyShutDown) { ... if (firstProcessor != null) { + // For a follower, this will not shutdown its syncProcessor. firstProcessor.shutdown(); } ... } {code} In expectation, Follower.shutdown() should invoke LearnerZooKeeperServer.shutdown() to shutdown the syncProcessor: {code:java} public synchronized void shutdown() { ... try { + // shutdown the syncProcessor here if (syncProcessor != null) { syncProcessor.shutdown(); } } ... } {code} Observer.shutdown() has the similar problem. h2. Potential Risk When Follower.shutdown() is called, the follower's QuorumPeer thread may invoke fastForwardDataBase() and update the lastProcessedZxid for the election and recovery phase before its syncThread drains the pending requests and flushes them to disk. In consequence, this lastProcessedZxid is not the latest zxid in its log, leading to log inconsistency after the SYNC phase. (Similar to the symptoms of ZOOKEEPER-2845.) h3. Example trace (TODO) was: Follower.shutdown() and Observer.shutdown() do not correctly shutdown the syncProcessor. It may lead to potential data inconsistency (see Potential Risk). A follower / observer will invoke syncProcessor.shutdown() in LearnerZooKeeperServer.shutdown() / ObserverZooKeeperServer.shutdown(), respectively. However, after the [FIX|https://github.com/apache/zookeeper/commit/efbd660e1c4b90a8f538f2cccb5dcb7094cf9a22] of ZOOKEEPER-3642, Follower.shutdown() / Observer.shutdown() will not invoke LearnerZooKeeperServer.shutdown() / ObserverZooKeeperServer.shutdown() anymore. h4. Call stack h5. Version 3.8.1 / 3.8.0 / 3.7.1 / 3.7.0 / 3.6.4 / 3.6.3 / 3.5.10 ... * *(Buggy)* Observer.shutdown() -> Learner.shutdown() -> ZooKeeperServer.shutdown(boolean) * *(Buggy)* Follower.shutdown() -> Learner.shutdown() -> ZooKeeperServer.shutdown(boolean) * (For comparison) Leader.shutdown(String) -> LeaderZooKeeper.shutdown() -> ZooKeeperServer.shutdown() -> ZooKeeperServer.shutdown(boolean) h5. For comparison, in version 3.4.X, * Observer.shutdown() -> Learner.shutdown() -> {*}ObserverZooKeeperServer.shutdown() -{*}> ZooKeeperServer.shutdown() -> ZooKeeperServer.shutdown(boolean) * Follower.shutdown() -> Learner.shutdown() -> {*}FollowerZooKeeperServer.shutdown() -{*}> ZooKeeperServer.shutdown() -> ZooKeeperServer.shutdown(boolean) h4. Code Details
[jira] [Updated] (ZOOKEEPER-4712) Follower.shutdown() and Observer.shutdown() do not correctly shutdown the syncProcessor, which may lead to potential data inconsistency
[ https://issues.apache.org/jira/browse/ZOOKEEPER-4712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sirius updated ZOOKEEPER-4712: -- Description: Follower.shutdown() and Observer.shutdown() do not correctly shutdown the syncProcessor. It may lead to potential data inconsistency (see Potential Risk). A follower / observer will invoke syncProcessor.shutdown() in LearnerZooKeeperServer.shutdown() / ObserverZooKeeperServer.shutdown(), respectively. However, after the [FIX|https://github.com/apache/zookeeper/commit/efbd660e1c4b90a8f538f2cccb5dcb7094cf9a22] of ZOOKEEPER-3642, Follower.shutdown() / Observer.shutdown() will not invoke LearnerZooKeeperServer.shutdown() / ObserverZooKeeperServer.shutdown() anymore. h4. Call stack h5. Version 3.8.1 / 3.8.0 / 3.7.1 / 3.7.0 / 3.6.4 / 3.6.3 / 3.5.10 ... * *(Buggy)* Observer.shutdown() -> Learner.shutdown() -> ZooKeeperServer.shutdown(boolean) * *(Buggy)* Follower.shutdown() -> Learner.shutdown() -> ZooKeeperServer.shutdown(boolean) * (For comparison) Leader.shutdown(String) -> LeaderZooKeeper.shutdown() -> ZooKeeperServer.shutdown() -> ZooKeeperServer.shutdown(boolean) h5. For comparison, in version 3.4.X, * Observer.shutdown() -> Learner.shutdown() -> {*}ObserverZooKeeperServer.shutdown() -{*}> ZooKeeperServer.shutdown() -> ZooKeeperServer.shutdown(boolean) * Follower.shutdown() -> Learner.shutdown() -> {*}FollowerZooKeeperServer.shutdown() -{*}> ZooKeeperServer.shutdown() -> ZooKeeperServer.shutdown(boolean) h4. Code Details Take version 3.8.0 as an example. In Follower.shutdown() : {code:java} public void shutdown() { LOG.info("shutdown Follower"); + // invoke Learner.shutdown() super.shutdown(); } {code} In Learner.java: {code:java} public void shutdown() { ... // shutdown previous zookeeper if (zk != null) { // If we haven't finished SNAP sync, force fully shutdown // to avoid potential inconsistency + // This will invoke ZooKeeperServer.shutdown(boolean), + // which will not shutdown syncProcessor + // Before the fix of ZOOLEEPER-3642, + // FollowerZooKeeperServer.shutdown() will be invoked here zk.shutdown(self.getSyncMode().equals(QuorumPeer.SyncMode.SNAP)); } } {code} In ZooKeeperServer.java: {code:java} public synchronized void shutdown(boolean fullyShutDown) { ... if (firstProcessor != null) { + // For a follower, this will not shutdown its syncProcessor. firstProcessor.shutdown(); } ... } {code} In expectation, Follower.shutdown() should invoke LearnerZooKeeperServer.shutdown() to shutdown the syncProcessor: {code:java} public synchronized void shutdown() { ... try { + // shutdown the syncProcessor here if (syncProcessor != null) { syncProcessor.shutdown(); } } ... } {code} Observer.shutdown() has the similar problem. h4. Potential Risk When Follower.shutdown() is called, the follower's QuorumPeer thread may invoke fastForwardDataBase() and update the lastProcessedZxid for the election and recovery phase before its syncThread drains the pending requests and flushes them to disk. In consequence, this lastProcessedZxid is not the latest zxid in its log, leading to log inconsistency after the SYNC phase. (Similar to the symptoms of ZOOKEEPER-2845.) was: Follower.shutdown() and Observer.shutdown() do not correctly shutdown the syncProcessor. It may lead to potential data inconsistency (see Potential Risk). A follower / observer will invoke syncProcessor.shutdown() in LearnerZooKeeperServer.shutdown() / ObserverZooKeeperServer.shutdown(), respectively. However, after the [FIX|https://github.com/apache/zookeeper/commit/efbd660e1c4b90a8f538f2cccb5dcb7094cf9a22] of ZOOKEEPER-3642, Follower.shutdown() / Observer.shutdown() will not invoke LearnerZooKeeperServer.shutdown() / ObserverZooKeeperServer.shutdown() anymore. h4. Call stack h5. Version 3.8.1 / 3.8.0 / 3.7.1 / 3.7.0 / 3.6.4 / 3.6.3 / 3.5.10 ... * *(Buggy)* Observer.shutdown() -> Learner.shutdown() -> ZooKeeperServer.shutdown(boolean) * *(Buggy)* Follower.shutdown() -> Learner.shutdown() -> ZooKeeperServer.shutdown(boolean) * (For comparison) Leader.shutdown(String) -> LeaderZooKeeper.shutdown() -> ZooKeeperServer.shutdown() -> ZooKeeperServer.shutdown(boolean) h5. For comparison, in version 3.4.X, * Observer.shutdown() -> Learner.shutdown() -> {*}ObserverZooKeeperServer.shutdown() -{*}> ZooKeeperServer.shutdown() -> ZooKeeperServer.shutdown(boolean) * Follower.shutdown() -> Learner.shutdown() -> {*}FollowerZooKeeperServer.shutdown() -{*}> ZooKeeperServer.shutdown() -> ZooKeeperServer.shutdown(boolean) h4. Code Details Take version 3.8.0 as an
[jira] [Updated] (ZOOKEEPER-4712) Follower.shutdown() and Observer.shutdown() do not correctly shutdown the syncProcessor, which may lead to potential data inconsistency
[ https://issues.apache.org/jira/browse/ZOOKEEPER-4712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sirius updated ZOOKEEPER-4712: -- Description: Follower.shutdown() and Observer.shutdown() do not correctly shutdown the syncProcessor. It may lead to potential data inconsistency (see Potential Risk). A follower / observer will invoke syncProcessor.shutdown() in LearnerZooKeeperServer.shutdown() / ObserverZooKeeperServer.shutdown(), respectively. However, after the [FIX|https://github.com/apache/zookeeper/commit/efbd660e1c4b90a8f538f2cccb5dcb7094cf9a22] of ZOOKEEPER-3642, Follower.shutdown() / Observer.shutdown() will not invoke LearnerZooKeeperServer.shutdown() / ObserverZooKeeperServer.shutdown() anymore. h4. Call stack h5. Version 3.8.1 / 3.8.0 / 3.7.1 / 3.7.0 / 3.6.4 / 3.6.3 / 3.5.10 ... * *(Buggy)* Observer.shutdown() -> Learner.shutdown() -> ZooKeeperServer.shutdown(boolean) * *(Buggy)* Follower.shutdown() -> Learner.shutdown() ->ZooKeeperServer.shutdown(boolean) * (For comparison) Leader.shutdown(String) -> LeaderZooKeeper.shutdown() -> ZooKeeperServer.shutdown() -> ZooKeeperServer.shutdown(boolean) h5. For comparison, in version 3.4.X, * Observer.shutdown() -> Learner.shutdown() -> {*}ObserverZooKeeperServer.shutdown() -{*}> ZooKeeperServer.shutdown() -> ZooKeeperServer.shutdown(boolean) * Follower.shutdown() -> Learner.shutdown() -> {*}FollowerZooKeeperServer.shutdown() -{*}> ZooKeeperServer.shutdown() -> ZooKeeperServer.shutdown(boolean) h4. Code Details Take version 3.8.0 as an example. In Follower.shutdown() : {code:java} public void shutdown() { LOG.info("shutdown Follower"); + // invoke Learner.shutdown() super.shutdown(); } {code} In Learner.java: {code:java} public void shutdown() { ... // shutdown previous zookeeper if (zk != null) { // If we haven't finished SNAP sync, force fully shutdown // to avoid potential inconsistency + // This will invoke ZooKeeperServer.shutdown(boolean), + // which will not shutdown syncProcessor + // Before the fix of ZOOLEEPER-3642, + // FollowerZooKeeperServer.shutdown() will be invoked here zk.shutdown(self.getSyncMode().equals(QuorumPeer.SyncMode.SNAP)); } } {code} In ZooKeeperServer.java: {code:java} public synchronized void shutdown(boolean fullyShutDown) { ... if (firstProcessor != null) { + // For a follower, this will not shutdown its syncProcessor. firstProcessor.shutdown(); } ... } {code} In expectation, Follower.shutdown() should invoke LearnerZooKeeperServer.shutdown() to shutdown the syncProcessor: {code:java} public synchronized void shutdown() { ... try { + // shutdown the syncProcessor here if (syncProcessor != null) { syncProcessor.shutdown(); } } ... } {code} Observer.shutdown() has the similar problem. h4. Potential Risk When Follower.shutdown() is called, the follower's QuorumPeer thread may update its lastProcessedZxid for the election before its syncThread drains the pending requests and flushes them to disk. In consequence, this lastProcessedZxid is not the latest zxid in its log, leading to log inconsistency after the SYNC phase. (Similar to the symptoms of ZOOKEEPER-2845.) was: Follower.shutdown() and Observer.shutdown() do not correctly shutdown the syncProcessor. It may lead to potential data inconsistency (see Potential Risk). A follower / observer will invoke syncProcessor.shutdown() in LearnerZooKeeperServer.shutdown() / ObserverZooKeeperServer.shutdown(), respectively. However, after the [fix|https://github.com/apache/zookeeper/commit/efbd660e1c4b90a8f538f2cccb5dcb7094cf9a22] of [ZOOKEEPER-3642|https://issues.apache.org/jira/browse/ZOOKEEPER-3642], Follower.shutdown() / Observer.shutdown() will not invoke LearnerZooKeeperServer.shutdown() / ObserverZooKeeperServer.shutdown() anymore. h4. Call stack h5. Version 3.8.1 / 3.8.0 / 3.7.1 / 3.7.0 / 3.6.4 / 3.6.3 / 3.5.10 ... * *(Buggy)* Observer.shutdown() -> Learner.shutdown() -> ZooKeeperServer.shutdown(boolean) * *(Buggy)* Follower.shutdown() -> Learner.shutdown() ->ZooKeeperServer.shutdown(boolean) * (For comparison) Leader.shutdown(String) -> LeaderZooKeeper.shutdown() -> ZooKeeperServer.shutdown() -> ZooKeeperServer.shutdown(boolean) h5. For comparison, in version 3.4.X, * Observer.shutdown() -> Learner.shutdown() -> {*}ObserverZooKeeperServer.shutdown() -{*}> ZooKeeperServer.shutdown() -> ZooKeeperServer.shutdown(boolean) * Follower.shutdown() -> Learner.shutdown() -> {*}FollowerZooKeeperServer.shutdown() -{*}> ZooKeeperServer.shutdown() -> ZooKeeperServer.shutdown(boolean) h4. Code Details Take version 3.8.0 as an
[jira] [Updated] (ZOOKEEPER-4712) Follower.shutdown() and Observer.shutdown() do not correctly shutdown the syncProcessor, which may lead to potential data inconsistency
[ https://issues.apache.org/jira/browse/ZOOKEEPER-4712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sirius updated ZOOKEEPER-4712: -- Description: Follower.shutdown() and Observer.shutdown() do not correctly shutdown the syncProcessor. It may lead to potential data inconsistency (see Potential Risk). A follower / observer will invoke syncProcessor.shutdown() in LearnerZooKeeperServer.shutdown() / ObserverZooKeeperServer.shutdown(), respectively. However, after the [FIX|https://github.com/apache/zookeeper/commit/efbd660e1c4b90a8f538f2cccb5dcb7094cf9a22] of ZOOKEEPER-3642, Follower.shutdown() / Observer.shutdown() will not invoke LearnerZooKeeperServer.shutdown() / ObserverZooKeeperServer.shutdown() anymore. h4. Call stack h5. Version 3.8.1 / 3.8.0 / 3.7.1 / 3.7.0 / 3.6.4 / 3.6.3 / 3.5.10 ... * *(Buggy)* Observer.shutdown() -> Learner.shutdown() -> ZooKeeperServer.shutdown(boolean) * *(Buggy)* Follower.shutdown() -> Learner.shutdown() -> ZooKeeperServer.shutdown(boolean) * (For comparison) Leader.shutdown(String) -> LeaderZooKeeper.shutdown() -> ZooKeeperServer.shutdown() -> ZooKeeperServer.shutdown(boolean) h5. For comparison, in version 3.4.X, * Observer.shutdown() -> Learner.shutdown() -> {*}ObserverZooKeeperServer.shutdown() -{*}> ZooKeeperServer.shutdown() -> ZooKeeperServer.shutdown(boolean) * Follower.shutdown() -> Learner.shutdown() -> {*}FollowerZooKeeperServer.shutdown() -{*}> ZooKeeperServer.shutdown() -> ZooKeeperServer.shutdown(boolean) h4. Code Details Take version 3.8.0 as an example. In Follower.shutdown() : {code:java} public void shutdown() { LOG.info("shutdown Follower"); + // invoke Learner.shutdown() super.shutdown(); } {code} In Learner.java: {code:java} public void shutdown() { ... // shutdown previous zookeeper if (zk != null) { // If we haven't finished SNAP sync, force fully shutdown // to avoid potential inconsistency + // This will invoke ZooKeeperServer.shutdown(boolean), + // which will not shutdown syncProcessor + // Before the fix of ZOOLEEPER-3642, + // FollowerZooKeeperServer.shutdown() will be invoked here zk.shutdown(self.getSyncMode().equals(QuorumPeer.SyncMode.SNAP)); } } {code} In ZooKeeperServer.java: {code:java} public synchronized void shutdown(boolean fullyShutDown) { ... if (firstProcessor != null) { + // For a follower, this will not shutdown its syncProcessor. firstProcessor.shutdown(); } ... } {code} In expectation, Follower.shutdown() should invoke LearnerZooKeeperServer.shutdown() to shutdown the syncProcessor: {code:java} public synchronized void shutdown() { ... try { + // shutdown the syncProcessor here if (syncProcessor != null) { syncProcessor.shutdown(); } } ... } {code} Observer.shutdown() has the similar problem. h4. Potential Risk When Follower.shutdown() is called, the follower's QuorumPeer thread may update its lastProcessedZxid for the election before its syncThread drains the pending requests and flushes them to disk. In consequence, this lastProcessedZxid is not the latest zxid in its log, leading to log inconsistency after the SYNC phase. (Similar to the symptoms of ZOOKEEPER-2845.) was: Follower.shutdown() and Observer.shutdown() do not correctly shutdown the syncProcessor. It may lead to potential data inconsistency (see Potential Risk). A follower / observer will invoke syncProcessor.shutdown() in LearnerZooKeeperServer.shutdown() / ObserverZooKeeperServer.shutdown(), respectively. However, after the [FIX|https://github.com/apache/zookeeper/commit/efbd660e1c4b90a8f538f2cccb5dcb7094cf9a22] of ZOOKEEPER-3642, Follower.shutdown() / Observer.shutdown() will not invoke LearnerZooKeeperServer.shutdown() / ObserverZooKeeperServer.shutdown() anymore. h4. Call stack h5. Version 3.8.1 / 3.8.0 / 3.7.1 / 3.7.0 / 3.6.4 / 3.6.3 / 3.5.10 ... * *(Buggy)* Observer.shutdown() -> Learner.shutdown() -> ZooKeeperServer.shutdown(boolean) * *(Buggy)* Follower.shutdown() -> Learner.shutdown() ->ZooKeeperServer.shutdown(boolean) * (For comparison) Leader.shutdown(String) -> LeaderZooKeeper.shutdown() -> ZooKeeperServer.shutdown() -> ZooKeeperServer.shutdown(boolean) h5. For comparison, in version 3.4.X, * Observer.shutdown() -> Learner.shutdown() -> {*}ObserverZooKeeperServer.shutdown() -{*}> ZooKeeperServer.shutdown() -> ZooKeeperServer.shutdown(boolean) * Follower.shutdown() -> Learner.shutdown() -> {*}FollowerZooKeeperServer.shutdown() -{*}> ZooKeeperServer.shutdown() -> ZooKeeperServer.shutdown(boolean) h4. Code Details Take version 3.8.0 as an example. In Follower.shutdown() : {code:java} public
[jira] [Updated] (ZOOKEEPER-4712) Follower.shutdown() and Observer.shutdown() do not correctly shutdown the syncProcessor, which may lead to potential data inconsistency
[ https://issues.apache.org/jira/browse/ZOOKEEPER-4712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sirius updated ZOOKEEPER-4712: -- Description: Follower.shutdown() and Observer.shutdown() do not correctly shutdown the syncProcessor. It may lead to potential data inconsistency (see Potential Risk). A follower / observer will invoke syncProcessor.shutdown() in LearnerZooKeeperServer.shutdown() / ObserverZooKeeperServer.shutdown(), respectively. However, after the [fix|https://github.com/apache/zookeeper/commit/efbd660e1c4b90a8f538f2cccb5dcb7094cf9a22] of [ZOOKEEPER-3642|https://issues.apache.org/jira/browse/ZOOKEEPER-3642], Follower.shutdown() / Observer.shutdown() will not invoke LearnerZooKeeperServer.shutdown() / ObserverZooKeeperServer.shutdown() anymore. h4. Call stack h5. Version 3.8.1 / 3.8.0 / 3.7.1 / 3.7.0 / 3.6.4 / 3.6.3 / 3.5.10 ... * *(Buggy)* Observer.shutdown() -> Learner.shutdown() -> ZooKeeperServer.shutdown(boolean) * *(Buggy)* Follower.shutdown() -> Learner.shutdown() ->ZooKeeperServer.shutdown(boolean) * (For comparison) Leader.shutdown(String) -> LeaderZooKeeper.shutdown() -> ZooKeeperServer.shutdown() -> ZooKeeperServer.shutdown(boolean) h5. For comparison, in version 3.4.X, * Observer.shutdown() -> Learner.shutdown() -> {*}ObserverZooKeeperServer.shutdown() -{*}> ZooKeeperServer.shutdown() -> ZooKeeperServer.shutdown(boolean) * Follower.shutdown() -> Learner.shutdown() -> {*}FollowerZooKeeperServer.shutdown() -{*}> ZooKeeperServer.shutdown() -> ZooKeeperServer.shutdown(boolean) h4. Code Details Take version 3.8.0 as an example. In Follower.shutdown() : {code:java} public void shutdown() { LOG.info("shutdown Follower"); + // invoke Learner.shutdown() super.shutdown(); } {code} In Learner.java: {code:java} public void shutdown() { ... // shutdown previous zookeeper if (zk != null) { // If we haven't finished SNAP sync, force fully shutdown // to avoid potential inconsistency + // This will invoke ZooKeeperServer.shutdown(boolean), + // which will not shutdown syncProcessor + // Before the fix of ZOOLEEPER-3642, + // FollowerZooKeeperServer.shutdown() will be invoked here zk.shutdown(self.getSyncMode().equals(QuorumPeer.SyncMode.SNAP)); } } {code} In ZooKeeperServer.java: {code:java} public synchronized void shutdown(boolean fullyShutDown) { ... if (firstProcessor != null) { + // For a follower, this will not shutdown its syncProcessor. firstProcessor.shutdown(); } ... } {code} In expectation, Follower.shutdown() should invoke LearnerZooKeeperServer.shutdown() to shutdown the syncProcessor: {code:java} public synchronized void shutdown() { ... try { + // shutdown the syncProcessor here if (syncProcessor != null) { syncProcessor.shutdown(); } } ... } {code} Observer.shutdown() has the similar problem. h4. Potential Risk When Follower.shutdown() is called, the follower's QuorumPeer thread may update its lastProcessedZxid for the election before its syncThread drains the pending requests and flushes them to disk. In consequence, this lastProcessedZxid is not the latest zxid in its log, leading to log inconsistency after the SYNC phase. (Similar to the symptoms of ZOOKEEPER-2845.) was: Follower.shutdown() and Observer.shutdown() do not correctly shutdown the syncProcessor. It may lead to potential data inconsistency (see Potential Risk). A follower / observer will invoke syncProcessor.shutdown() in LearnerZooKeeperServer.shutdown() / ObserverZooKeeperServer.shutdown(), respectively. However, after the [fix|https://github.com/apache/zookeeper/commit/efbd660e1c4b90a8f538f2cccb5dcb7094cf9a22] of ZOOLEEPER-3642, Follower.shutdown() / Observer.shutdown() will not invoke LearnerZooKeeperServer.shutdown() / ObserverZooKeeperServer.shutdown() anymore. h4. Call stack h5. Version 3.8.1 / 3.8.0 / 3.7.1 / 3.7.0 / 3.6.4 / 3.6.3 / 3.5.10 ... * *(Buggy)* Observer.shutdown() -> Learner.shutdown() -> ZooKeeperServer.shutdown(boolean) * *(Buggy)* Follower.shutdown() -> Learner.shutdown() ->ZooKeeperServer.shutdown(boolean) * (For comparison) Leader.shutdown(String) -> LeaderZooKeeper.shutdown() -> ZooKeeperServer.shutdown() -> ZooKeeperServer.shutdown(boolean) h5. For comparison, in version 3.4.X, * Observer.shutdown() -> Learner.shutdown() -> {*}ObserverZooKeeperServer.shutdown() -{*}> ZooKeeperServer.shutdown() -> ZooKeeperServer.shutdown(boolean) * Follower.shutdown() -> Learner.shutdown() -> {*}FollowerZooKeeperServer.shutdown() -{*}> ZooKeeperServer.shutdown() -> ZooKeeperServer.shutdown(boolean) h4. Code Details Take version 3.8.0 as an
[jira] [Updated] (ZOOKEEPER-4712) Follower.shutdown() and Observer.shutdown() do not correctly shutdown the syncProcessor, which may lead to potential data inconsistency
[ https://issues.apache.org/jira/browse/ZOOKEEPER-4712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sirius updated ZOOKEEPER-4712: -- Description: Follower.shutdown() and Observer.shutdown() do not correctly shutdown the syncProcessor. It may lead to potential data inconsistency (see Potential Risk). A follower / observer will invoke syncProcessor.shutdown() in LearnerZooKeeperServer.shutdown() / ObserverZooKeeperServer.shutdown(), respectively. However, after the [fix|https://github.com/apache/zookeeper/commit/efbd660e1c4b90a8f538f2cccb5dcb7094cf9a22] of ZOOLEEPER-3642, Follower.shutdown() / Observer.shutdown() will not invoke LearnerZooKeeperServer.shutdown() / ObserverZooKeeperServer.shutdown() anymore. h4. Call stack h5. Version 3.8.1 / 3.8.0 / 3.7.1 / 3.7.0 / 3.6.4 / 3.6.3 / 3.5.10 ... * *(Buggy)* Observer.shutdown() -> Learner.shutdown() -> ZooKeeperServer.shutdown(boolean) * *(Buggy)* Follower.shutdown() -> Learner.shutdown() ->ZooKeeperServer.shutdown(boolean) * (For comparison) Leader.shutdown(String) -> LeaderZooKeeper.shutdown() -> ZooKeeperServer.shutdown() -> ZooKeeperServer.shutdown(boolean) h5. For comparison, in version 3.4.X, * Observer.shutdown() -> Learner.shutdown() -> {*}ObserverZooKeeperServer.shutdown() -{*}> ZooKeeperServer.shutdown() -> ZooKeeperServer.shutdown(boolean) * Follower.shutdown() -> Learner.shutdown() -> {*}FollowerZooKeeperServer.shutdown() -{*}> ZooKeeperServer.shutdown() -> ZooKeeperServer.shutdown(boolean) h4. Code Details Take version 3.8.0 as an example. In Follower.shutdown() : {code:java} public void shutdown() { LOG.info("shutdown Follower"); + // invoke Learner.shutdown() super.shutdown(); } {code} In Learner.java: {code:java} public void shutdown() { ... // shutdown previous zookeeper if (zk != null) { // If we haven't finished SNAP sync, force fully shutdown // to avoid potential inconsistency + // This will invoke ZooKeeperServer.shutdown(boolean), + // which will not shutdown syncProcessor + // Before the fix of ZOOLEEPER-3642, + // FollowerZooKeeperServer.shutdown() will be invoked here zk.shutdown(self.getSyncMode().equals(QuorumPeer.SyncMode.SNAP)); } } {code} In ZooKeeperServer.java: {code:java} public synchronized void shutdown(boolean fullyShutDown) { ... if (firstProcessor != null) { + // For a follower, this will not shutdown its syncProcessor. firstProcessor.shutdown(); } ... } {code} In expectation, Follower.shutdown() should invoke LearnerZooKeeperServer.shutdown() to shutdown the syncProcessor: {code:java} public synchronized void shutdown() { ... try { + // shutdown the syncProcessor here if (syncProcessor != null) { syncProcessor.shutdown(); } } ... } {code} Observer.shutdown() has the similar problem. h4. Potential Risk When Follower.shutdown() is called, the follower's QuorumPeer thread may update its lastProcessedZxid for the election before its syncThread drains the pending requests and flushes them to disk. In consequence, this lastProcessedZxid is not the latest zxid in its log, leading to log inconsistency after the SYNC phase. (Similar to the symptoms of ZOOKEEPER-2845.) was: Follower.shutdown() and Observer.shutdown() do not correctly shutdown the syncProcessor. It may lead to potential data inconsistency (see Potential Risk). A follower / observer will invoke syncProcessor.shutdown() in LearnerZooKeeperServer.shutdown() / ObserverZooKeeperServer.shutdown(), respectively. However, after the [fix|https://github.com/apache/zookeeper/commit/efbd660e1c4b90a8f538f2cccb5dcb7094cf9a22] of ZOOLEEPER-3642, Follower.shutdown() / Observer.shutdown() will not invoke LearnerZooKeeperServer.shutdown() / ObserverZooKeeperServer.shutdown() anymore. h4. Call stack h5. Version 3.8.1 / 3.8.0 / 3.7.1 / 3.7.0 / 3.6.4 / 3.6.3 / 3.5.10 ... * *(Buggy)* Observer.shutdown() -> Learner.shutdown() -> ZooKeeperServer.shutdown(boolean) * *(Buggy)* Follower.shutdown() -> Learner.shutdown() ->ZooKeeperServer.shutdown(boolean) * (For comparison) Leader.shutdown(String) -> LeaderZooKeeper.shutdown() -> ZooKeeperServer.shutdown() -> ZooKeeperServer.shutdown(boolean) h5. For comparison, in version 3.4.X, * Observer.shutdown() -> Learner.shutdown() -> {*}ObserverZooKeeperServer.shutdown() -{*}> ZooKeeperServer.shutdown() -> ZooKeeperServer.shutdown(boolean) * Follower.shutdown() -> Learner.shutdown() -> {*}FollowerZooKeeperServer.shutdown() -{*}> ZooKeeperServer.shutdown() -> ZooKeeperServer.shutdown(boolean) h4. Code Details Take version 3.8.0 as an example. In Follower.shutdown() : public void shutdown()
[jira] [Updated] (ZOOKEEPER-4712) Follower.shutdown() and Observer.shutdown() do not correctly shutdown the syncProcessor, which may lead to potential data inconsistency
[ https://issues.apache.org/jira/browse/ZOOKEEPER-4712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sirius updated ZOOKEEPER-4712: -- Description: Follower.shutdown() and Observer.shutdown() do not correctly shutdown the syncProcessor. It may lead to potential data inconsistency (see Potential Risk). A follower / observer will invoke syncProcessor.shutdown() in LearnerZooKeeperServer.shutdown() / ObserverZooKeeperServer.shutdown(), respectively. However, after the [fix|https://github.com/apache/zookeeper/commit/efbd660e1c4b90a8f538f2cccb5dcb7094cf9a22] of ZOOLEEPER-3642, Follower.shutdown() / Observer.shutdown() will not invoke LearnerZooKeeperServer.shutdown() / ObserverZooKeeperServer.shutdown() anymore. h4. Call stack h5. Version 3.8.1 / 3.8.0 / 3.7.1 / 3.7.0 / 3.6.4 / 3.6.3 / 3.5.10 ... * *(Buggy)* Observer.shutdown() -> Learner.shutdown() -> ZooKeeperServer.shutdown(boolean) * *(Buggy)* Follower.shutdown() -> Learner.shutdown() ->ZooKeeperServer.shutdown(boolean) * (For comparison) Leader.shutdown(String) -> LeaderZooKeeper.shutdown() -> ZooKeeperServer.shutdown() -> ZooKeeperServer.shutdown(boolean) h5. For comparison, in version 3.4.X, * Observer.shutdown() -> Learner.shutdown() -> {*}ObserverZooKeeperServer.shutdown() -{*}> ZooKeeperServer.shutdown() -> ZooKeeperServer.shutdown(boolean) * Follower.shutdown() -> Learner.shutdown() -> {*}FollowerZooKeeperServer.shutdown() -{*}> ZooKeeperServer.shutdown() -> ZooKeeperServer.shutdown(boolean) h4. Code Details Take version 3.8.0 as an example. In Follower.shutdown() : public void shutdown() { LOG.info("shutdown Follower"); + // invoke Learner.shutdown() super.shutdown(); } In Learner.java: public void shutdown() { ... // shutdown previous zookeeper if (zk != null) { // If we haven't finished SNAP sync, force fully shutdown // to avoid potential inconsistency + // This will invoke ZooKeeperServer.shutdown(boolean), + // which will not shutdown syncProcessor + // Before the fix of ZOOLEEPER-3642, + // FollowerZooKeeperServer.shutdown() will be invoked here zk.shutdown(self.getSyncMode().equals(QuorumPeer.SyncMode.SNAP)); } } In ZooKeeperServer.java: public synchronized void shutdown(boolean fullyShutDown) { ... if (firstProcessor != null) { + // For a follower, this will not shutdown its syncProcessor. firstProcessor.shutdown(); } ... } In expectation, Follower.shutdown() should invoke LearnerZooKeeperServer.shutdown() to shutdown the syncProcessor: public synchronized void shutdown() { ... try { + // shutdown the syncProcessor here if (syncProcessor != null) { syncProcessor.shutdown(); } } ... } Observer.shutdown() has the similar problem. h4. Potential Risk When Follower.shutdown() is called, the follower's QuorumPeer thread may update its lastProcessedZxid for the election before its syncThread drains the pending requests and flushes them to disk. In consequence, this lastProcessedZxid is not the latest zxid in its log, leading to log inconsistency after the SYNC phase. (Similar to the symptoms of ZOOKEEPER-2845.) was: Follower.shutdown() and Observer.shutdown() do not correctly shutdown the syncProcessor. It may lead to potential data inconsistency (see Potential Risk). A follower / observer will invoke syncProcessor.shutdown() in LearnerZooKeeperServer.shutdown() / ObserverZooKeeperServer.shutdown(), respectively. However, after the [fix|https://github.com/apache/zookeeper/commit/efbd660e1c4b90a8f538f2cccb5dcb7094cf9a22] of ZOOLEEPER-3642, Follower.shutdown() / Observer.shutdown() will not invoke LearnerZooKeeperServer.shutdown() / ObserverZooKeeperServer.shutdown() anymore. h4. Call stack h5. Version 3.8.1 / 3.8.0 / 3.7.1 / 3.7.0 / 3.6.4 / 3.6.3 / 3.5.10 ... * *(Buggy)* Observer.shutdown() -> Learner.shutdown() ->ZooKeeperServer.shutdown(boolean) * *(Buggy)* Follower.shutdown() -> Learner.shutdown() -- ->ZooKeeperServer.shutdown(boolean) * (For comparison) Leader.shutdown(String) -> LeaderZooKeeper.shutdown() ->ZooKeeperServer.shutdown()->ZooKeeperServer.shutdown(boolean) h5. For comparison, in version 3.4.X, * Observer.shutdown(){-}>Learner.shutdown(){-}>{*}ObserverZooKeeperServer.shutdown(){*}{-}>ZooKeeperServer.shutdown(){-}>ZooKeeperServer.shutdown(boolean) * Follower.shutdown(){-}>Learner.shutdown(){-}>{*}FollowerZooKeeperServer.shutdown(){*}>ZooKeeperServer.shutdown()->ZooKeeperServer.shutdown(boolean) h4. Code Details Take version 3.8.0 as an example. In Follower.shutdown() : public void shutdown() { LOG.info("shutdown Follower"); + // invoke Learner.shutdown() super.shutdown(); } In Learner.java: public void shutdown() { ... // shutdown
[jira] [Updated] (ZOOKEEPER-4712) Follower.shutdown() and Observer.shutdown() do not correctly shutdown the syncProcessor, which may lead to potential data inconsistency
[ https://issues.apache.org/jira/browse/ZOOKEEPER-4712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sirius updated ZOOKEEPER-4712: -- Description: Follower.shutdown() and Observer.shutdown() do not correctly shutdown the syncProcessor. It may lead to potential data inconsistency (see Potential Risk). A follower / observer will invoke syncProcessor.shutdown() in LearnerZooKeeperServer.shutdown() / ObserverZooKeeperServer.shutdown(), respectively. However, after the [fix|https://github.com/apache/zookeeper/commit/efbd660e1c4b90a8f538f2cccb5dcb7094cf9a22] of ZOOLEEPER-3642, Follower.shutdown() / Observer.shutdown() will not invoke LearnerZooKeeperServer.shutdown() / ObserverZooKeeperServer.shutdown() anymore. h4. Call stack h5. Version 3.8.1 / 3.8.0 / 3.7.1 / 3.7.0 / 3.6.4 / 3.6.3 / 3.5.10 ... * *(Buggy)* Observer.shutdown() -> Learner.shutdown() ->ZooKeeperServer.shutdown(boolean) * *(Buggy)* Follower.shutdown() -> Learner.shutdown() -- ->ZooKeeperServer.shutdown(boolean) * (For comparison) Leader.shutdown(String) -> LeaderZooKeeper.shutdown() ->ZooKeeperServer.shutdown()->ZooKeeperServer.shutdown(boolean) h5. For comparison, in version 3.4.X, * Observer.shutdown(){-}>Learner.shutdown(){-}>{*}ObserverZooKeeperServer.shutdown(){*}{-}>ZooKeeperServer.shutdown(){-}>ZooKeeperServer.shutdown(boolean) * Follower.shutdown(){-}>Learner.shutdown(){-}>{*}FollowerZooKeeperServer.shutdown(){*}>ZooKeeperServer.shutdown()->ZooKeeperServer.shutdown(boolean) h4. Code Details Take version 3.8.0 as an example. In Follower.shutdown() : public void shutdown() { LOG.info("shutdown Follower"); + // invoke Learner.shutdown() super.shutdown(); } In Learner.java: public void shutdown() { ... // shutdown previous zookeeper if (zk != null) { // If we haven't finished SNAP sync, force fully shutdown // to avoid potential inconsistency + // This will invoke ZooKeeperServer.shutdown(boolean), + // which will not shutdown syncProcessor + // Before the fix of ZOOLEEPER-3642, + // FollowerZooKeeperServer.shutdown() will be invoked here zk.shutdown(self.getSyncMode().equals(QuorumPeer.SyncMode.SNAP)); } } In ZooKeeperServer.java: public synchronized void shutdown(boolean fullyShutDown) { ... if (firstProcessor != null) { + // For a follower, this will not shutdown its syncProcessor. firstProcessor.shutdown(); } ... } In expectation, Follower.shutdown() should invoke LearnerZooKeeperServer.shutdown() to shutdown the syncProcessor: public synchronized void shutdown() { ... try { + // shutdown the syncProcessor here if (syncProcessor != null) { syncProcessor.shutdown(); } } ... } Observer.shutdown() has the similar problem. h4. Potential Risk When Follower.shutdown() is called, the follower's QuorumPeer thread may update its lastProcessedZxid for the election before its syncThread drains the pending requests and flushes them to disk. In consequence, this lastProcessedZxid is not the latest zxid in its log, leading to log inconsistency after the SYNC phase. (Similar to the symptoms of ZOOKEEPER-2845.) was: Follower.shutdown() and Observer.shutdown() do not correctly shutdown the syncProcessor. It may lead to potential data inconsistency (see Potential Risk). A follower / observer will invoke syncProcessor.shutdown() in LearnerZooKeeperServer.shutdown() / ObserverZooKeeperServer.shutdown(), respectively. However, after the [fix|https://github.com/apache/zookeeper/commit/efbd660e1c4b90a8f538f2cccb5dcb7094cf9a22] of ZOOLEEPER-3642, Follower.shutdown() / Observer.shutdown() will not invoke LearnerZooKeeperServer.shutdown() / ObserverZooKeeperServer.shutdown() anymore. h4. Call stack h5. Version 3.8.1 / 3.8.0 / 3.7.1 / 3.7.0 / 3.6.4 / 3.6.3 / 3.5.10 ... * *(Buggy)* Observer.shutdown()->Learner.shutdown()->ZooKeeperServer.shutdown(boolean) * *(Buggy)* Follower.shutdown()->Learner.shutdown()->ZooKeeperServer.shutdown(boolean) * (For comparison) Leader.shutdown(String)->LeaderZooKeeper.shutdown()->ZooKeeperServer.shutdown()->ZooKeeperServer.shutdown(boolean) h5. For comparison, in version 3.4.X, * Observer.shutdown()->Learner.shutdown()->*ObserverZooKeeperServer.shutdown()*->ZooKeeperServer.shutdown()->ZooKeeperServer.shutdown(boolean) * Follower.shutdown()->Learner.shutdown()->*FollowerZooKeeperServer.shutdown()*>ZooKeeperServer.shutdown()->ZooKeeperServer.shutdown(boolean) h4. Code Details Take version 3.8.0 as an example. In Follower.shutdown() : public void shutdown() { LOG.info("shutdown Follower"); + // invoke Learner.shutdown() super.shutdown(); } In Learner.java: public void shutdown() { ... // shutdown previous zookeeper
[jira] [Created] (ZOOKEEPER-4712) Follower.shutdown() and Observer.shutdown() do not correctly shutdown the syncProcessor, which may lead to potential data inconsistency
Sirius created ZOOKEEPER-4712: - Summary: Follower.shutdown() and Observer.shutdown() do not correctly shutdown the syncProcessor, which may lead to potential data inconsistency Key: ZOOKEEPER-4712 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4712 Project: ZooKeeper Issue Type: Bug Components: quorum, server Affects Versions: 3.8.1, 3.7.1, 3.8.0, 3.7.0, 3.6.3, 3.5.10, 3.6.4 Reporter: Sirius Follower.shutdown() and Observer.shutdown() do not correctly shutdown the syncProcessor. It may lead to potential data inconsistency (see Potential Risk). A follower / observer will invoke syncProcessor.shutdown() in LearnerZooKeeperServer.shutdown() / ObserverZooKeeperServer.shutdown(), respectively. However, after the [fix|https://github.com/apache/zookeeper/commit/efbd660e1c4b90a8f538f2cccb5dcb7094cf9a22] of ZOOLEEPER-3642, Follower.shutdown() / Observer.shutdown() will not invoke LearnerZooKeeperServer.shutdown() / ObserverZooKeeperServer.shutdown() anymore. h4. Call stack h5. Version 3.8.1 / 3.8.0 / 3.7.1 / 3.7.0 / 3.6.4 / 3.6.3 / 3.5.10 ... * *(Buggy)* Observer.shutdown()->Learner.shutdown()->ZooKeeperServer.shutdown(boolean) * *(Buggy)* Follower.shutdown()->Learner.shutdown()->ZooKeeperServer.shutdown(boolean) * (For comparison) Leader.shutdown(String)->LeaderZooKeeper.shutdown()->ZooKeeperServer.shutdown()->ZooKeeperServer.shutdown(boolean) h5. For comparison, in version 3.4.X, * Observer.shutdown()->Learner.shutdown()->*ObserverZooKeeperServer.shutdown()*->ZooKeeperServer.shutdown()->ZooKeeperServer.shutdown(boolean) * Follower.shutdown()->Learner.shutdown()->*FollowerZooKeeperServer.shutdown()*>ZooKeeperServer.shutdown()->ZooKeeperServer.shutdown(boolean) h4. Code Details Take version 3.8.0 as an example. In Follower.shutdown() : public void shutdown() { LOG.info("shutdown Follower"); + // invoke Learner.shutdown() super.shutdown(); } In Learner.java: public void shutdown() { ... // shutdown previous zookeeper if (zk != null) { // If we haven't finished SNAP sync, force fully shutdown // to avoid potential inconsistency + // This will invoke ZooKeeperServer.shutdown(boolean), + // which will not shutdown syncProcessor + // Before the fix of ZOOLEEPER-3642, + // FollowerZooKeeperServer.shutdown() will be invoked here zk.shutdown(self.getSyncMode().equals(QuorumPeer.SyncMode.SNAP)); } } In ZooKeeperServer.java: public synchronized void shutdown(boolean fullyShutDown) { ... if (firstProcessor != null) { + // For a follower, this will not shutdown its syncProcessor. firstProcessor.shutdown(); } ... } In expectation, Follower.shutdown() should invoke LearnerZooKeeperServer.shutdown() to shutdown the syncProcessor: public synchronized void shutdown() { ... try { + // shutdown the syncProcessor here if (syncProcessor != null) { syncProcessor.shutdown(); } } ... } Observer.shutdown() has the similar problem. h4. Potential Risk When Follower.shutdown() is called, the follower's QuorumPeer thread may update its lastProcessedZxid for the election before its syncThread drains the pending requests and flushes them to disk. In consequence, this lastProcessedZxid is not the latest zxid in its log, leading to log inconsistency after the SYNC phase. (Similar to the symptoms of ZOOKEEPER-2845.) -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ZOOKEEPER-4711) a data race in org.apache.zookeeper.server.watch.WatchManagerTest#testDeadWatchers
[ https://issues.apache.org/jira/browse/ZOOKEEPER-4711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lujie updated ZOOKEEPER-4711: - Description: When we run : mvn test -Dmaven.test.failure.ignore=true -Dtest=org.apache.zookeeper.server.watch.WatchManagerTest#testDeadWatchers -DfailIfNoTests=false -DredirectTestOutputToFile=false The following method in class : org.apache.zookeeper.server.watch.WatcherCleaner {code:java} public void addDeadWatcher(int watcherBit) { // Wait if there are too many watchers waiting to be closed, // this is will slow down the socket packet processing and // the adding watches in the ZK pipeline. while (maxInProcessingDeadWatchers > 0 && !stopped && totalDeadWatchers.get() >= maxInProcessingDeadWatchers) { try { RATE_LOGGER.rateLimitLog("Waiting for dead watchers cleaning"); long startTime = Time.currentElapsedTime(); synchronized (processingCompletedEvent) { processingCompletedEvent.wait(100); } long latency = Time.currentElapsedTime() - startTime; ServerMetrics.getMetrics().ADD_DEAD_WATCHER_STALL_TIME.add(latency); } catch (InterruptedException e) { LOG.info("Got interrupted while waiting for dead watches queue size"); break; } } synchronized (this) { if (deadWatchers.add(watcherBit)) { totalDeadWatchers.incrementAndGet(); ServerMetrics.getMetrics().DEAD_WATCHERS_QUEUED.add(1); if (deadWatchers.size() >= watcherCleanThreshold) { synchronized (cleanEvent) { cleanEvent.notifyAll(); } } } } }{code} {code:java} @Override public void run() { while (!stopped) { synchronized (cleanEvent) { try { // add some jitter to avoid cleaning dead watchers at the // same time in the quorum if (!stopped && deadWatchers.size() < watcherCleanThreshold) { int maxWaitMs = (watcherCleanIntervalInSeconds + ThreadLocalRandom.current().nextInt(watcherCleanIntervalInSeconds / 2 + 1)) * 1000; cleanEvent.wait(maxWaitMs); } } catch (InterruptedException e) { LOG.info("Received InterruptedException while waiting for cleanEvent"); break; } } if (deadWatchers.isEmpty()) { continue; } synchronized (this) { // Clean the dead watchers need to go through all the current // watches, which is pretty heavy and may take a second if // there are millions of watches, that's why we're doing lazily // batch clean up in a separate thread with a snapshot of the // current dead watchers. final Set snapshot = new HashSet<>(deadWatchers); deadWatchers.clear(); int total = snapshot.size(); LOG.info("Processing {} dead watchers", total); cleaners.schedule(new WorkRequest() { @Override public void doWork() throws Exception { long startTime = Time.currentElapsedTime(); listener.processDeadWatchers(snapshot); long latency = Time.currentElapsedTime() - startTime; LOG.info("Takes {} to process {} watches", latency, total); ServerMetrics.getMetrics().DEAD_WATCHERS_CLEANER_LATENCY.add(latency); ServerMetrics.getMetrics().DEAD_WATCHERS_CLEARED.add(total); totalDeadWatchers.addAndGet(-total); synchronized (processingCompletedEvent) { processingCompletedEvent.notifyAll(); } } }); } } LOG.info("WatcherCleaner thread exited"); }{code} As we can see, the two methods visist deadWatchers Object by different thread. *Thread in run()* is *read* operation on deadWachers and Thread in addDeadWatcher is *write* operation on deadWachers. This causes a data race without any lock. was: When we run : mvn test -Dmaven.test.failure.ignore=true -Dtest=org.apache.zookeeper.server.watch.WatchManagerTest#testDeadWatchers -DfailIfNoTests=false -DredirectTestOutputToFile=false The method of addDeadWatcher ( System.out.println("2s::" +Thread.currentThread().getName()+ "
[jira] [Updated] (ZOOKEEPER-4711) a data race in org.apache.zookeeper.server.watch.WatchManagerTest#testDeadWatchers
[ https://issues.apache.org/jira/browse/ZOOKEEPER-4711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lujie updated ZOOKEEPER-4711: - Summary: a data race in org.apache.zookeeper.server.watch.WatchManagerTest#testDeadWatchers (was: There is a data race bettween run() and addDeadWatcher in org.apache.zookeeper.server.watch.WatcherCleaner class when run org.apache.zookeeper.server.watch.WatchManagerTest#testDeadWatchers junit test.) > a data race in > org.apache.zookeeper.server.watch.WatchManagerTest#testDeadWatchers > --- > > Key: ZOOKEEPER-4711 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4711 > Project: ZooKeeper > Issue Type: Bug > Components: server >Affects Versions: 3.9.0 > Environment: download zookeeper 3.9.0-SNAPSHOT from github repository > ([https://github.com/apache/zookeeper)] > Then run : mvn test -Dmaven.test.failure.ignore=true > -Dtest=org.apache.zookeeper.server.watch.WatchManagerTest#testDeadWatchers > -DfailIfNoTests=false -DredirectTestOutputToFile=false >Reporter: lujie >Priority: Critical > > When we run : > mvn test -Dmaven.test.failure.ignore=true > -Dtest=org.apache.zookeeper.server.watch.WatchManagerTest#testDeadWatchers > -DfailIfNoTests=false -DredirectTestOutputToFile=false > The method of addDeadWatcher > ( > System.out.println("2s::" +Thread.currentThread().getName()+ " > "+System.identityHashCode(deadWatchers)+" " + System.currentTimeMillis()); > this is my debug info. > ) > {code:java} > public void addDeadWatcher(int watcherBit) { > // Wait if there are too many watchers waiting to be closed, > // this is will slow down the socket packet processing and > // the adding watches in the ZK pipeline. > while (maxInProcessingDeadWatchers > 0 && !stopped && > totalDeadWatchers.get() >= maxInProcessingDeadWatchers) { > try { > RATE_LOGGER.rateLimitLog("Waiting for dead watchers > cleaning"); > long startTime = Time.currentElapsedTime(); > synchronized (processingCompletedEvent) { > processingCompletedEvent.wait(100); > } > long latency = Time.currentElapsedTime() - startTime; > > ServerMetrics.getMetrics().ADD_DEAD_WATCHER_STALL_TIME.add(latency); > } catch (InterruptedException e) { > LOG.info("Got interrupted while waiting for dead watches > queue size"); > break; > } > } > synchronized (this) { > > if (deadWatchers.add(watcherBit)) { > totalDeadWatchers.incrementAndGet(); > ServerMetrics.getMetrics().DEAD_WATCHERS_QUEUED.add(1); > if (deadWatchers.size() >= watcherCleanThreshold) { > synchronized (cleanEvent) { > cleanEvent.notifyAll(); > } > } > } > } > }{code} > > {code:java} > @Override > public void run() { > while (!stopped) { > synchronized (cleanEvent) { > try { > // add some jitter to avoid cleaning dead watchers at the > // same time in the quorum > if (!stopped && deadWatchers.size() < > watcherCleanThreshold) { > > int maxWaitMs = (watcherCleanIntervalInSeconds > + > ThreadLocalRandom.current().nextInt(watcherCleanIntervalInSeconds / 2 + 1)) * > 1000; > cleanEvent.wait(maxWaitMs); > } > } catch (InterruptedException e) { > LOG.info("Received InterruptedException while waiting for > cleanEvent"); > break; > } > } if (deadWatchers.isEmpty()) { > continue; > } synchronized (this) { > // Clean the dead watchers need to go through all the current > // watches, which is pretty heavy and may take a second if > // there are millions of watches, that's why we're doing > lazily > // batch clean up in a separate thread with a snapshot of the > // current dead watchers. > final Set snapshot = new HashSet<>(deadWatchers); > deadWatchers.clear(); > int total = snapshot.size(); > LOG.info("Processing {} dead watchers", total); > cleaners.schedule(new WorkRequest() { > @Override > public void doWork() throws Exception { >
[jira] [Updated] (ZOOKEEPER-4711) There is a data race bettween run() and addDeadWatcher in org.apache.zookeeper.server.watch.WatcherCleaner class when run org.apache.zookeeper.server.watch.WatchManag
[ https://issues.apache.org/jira/browse/ZOOKEEPER-4711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lujie updated ZOOKEEPER-4711: - Summary: There is a data race bettween run() and addDeadWatcher in org.apache.zookeeper.server.watch.WatcherCleaner class when run org.apache.zookeeper.server.watch.WatchManagerTest#testDeadWatchers junit test. (was: There is a data race bettween run() and "public void addDeadWatcher(int watcherBit)" in org.apache.zookeeper.server.watch.WatcherCleaner class when run org.apache.zookeeper.server.watch.WatchManagerTest#testDeadWatchers junit test.) > There is a data race bettween run() and addDeadWatcher in > org.apache.zookeeper.server.watch.WatcherCleaner class when run > org.apache.zookeeper.server.watch.WatchManagerTest#testDeadWatchers junit > test. > - > > Key: ZOOKEEPER-4711 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4711 > Project: ZooKeeper > Issue Type: Bug > Components: server >Affects Versions: 3.9.0 > Environment: download zookeeper 3.9.0-SNAPSHOT from github repository > ([https://github.com/apache/zookeeper)] > Then run : mvn test -Dmaven.test.failure.ignore=true > -Dtest=org.apache.zookeeper.server.watch.WatchManagerTest#testDeadWatchers > -DfailIfNoTests=false -DredirectTestOutputToFile=false >Reporter: lujie >Priority: Critical > > When we run : > mvn test -Dmaven.test.failure.ignore=true > -Dtest=org.apache.zookeeper.server.watch.WatchManagerTest#testDeadWatchers > -DfailIfNoTests=false -DredirectTestOutputToFile=false > The method of addDeadWatcher > ( > System.out.println("2s::" +Thread.currentThread().getName()+ " > "+System.identityHashCode(deadWatchers)+" " + System.currentTimeMillis()); > this is my debug info. > ) > {code:java} > public void addDeadWatcher(int watcherBit) { > // Wait if there are too many watchers waiting to be closed, > // this is will slow down the socket packet processing and > // the adding watches in the ZK pipeline. > while (maxInProcessingDeadWatchers > 0 && !stopped && > totalDeadWatchers.get() >= maxInProcessingDeadWatchers) { > try { > RATE_LOGGER.rateLimitLog("Waiting for dead watchers > cleaning"); > long startTime = Time.currentElapsedTime(); > synchronized (processingCompletedEvent) { > processingCompletedEvent.wait(100); > } > long latency = Time.currentElapsedTime() - startTime; > > ServerMetrics.getMetrics().ADD_DEAD_WATCHER_STALL_TIME.add(latency); > } catch (InterruptedException e) { > LOG.info("Got interrupted while waiting for dead watches > queue size"); > break; > } > } > synchronized (this) { > > if (deadWatchers.add(watcherBit)) { > totalDeadWatchers.incrementAndGet(); > ServerMetrics.getMetrics().DEAD_WATCHERS_QUEUED.add(1); > if (deadWatchers.size() >= watcherCleanThreshold) { > synchronized (cleanEvent) { > cleanEvent.notifyAll(); > } > } > } > } > }{code} > > {code:java} > @Override > public void run() { > while (!stopped) { > synchronized (cleanEvent) { > try { > // add some jitter to avoid cleaning dead watchers at the > // same time in the quorum > if (!stopped && deadWatchers.size() < > watcherCleanThreshold) { > > int maxWaitMs = (watcherCleanIntervalInSeconds > + > ThreadLocalRandom.current().nextInt(watcherCleanIntervalInSeconds / 2 + 1)) * > 1000; > cleanEvent.wait(maxWaitMs); > } > } catch (InterruptedException e) { > LOG.info("Received InterruptedException while waiting for > cleanEvent"); > break; > } > } if (deadWatchers.isEmpty()) { > continue; > } synchronized (this) { > // Clean the dead watchers need to go through all the current > // watches, which is pretty heavy and may take a second if > // there are millions of watches, that's why we're doing > lazily > // batch clean up in a separate thread with a snapshot of the > // current dead
[jira] [Created] (ZOOKEEPER-4711) There is a data race bettween run() and "public void addDeadWatcher(int watcherBit)" in org.apache.zookeeper.server.watch.WatcherCleaner class when run org.apache.zoo
lujie created ZOOKEEPER-4711: Summary: There is a data race bettween run() and "public void addDeadWatcher(int watcherBit)" in org.apache.zookeeper.server.watch.WatcherCleaner class when run org.apache.zookeeper.server.watch.WatchManagerTest#testDeadWatchers junit test. Key: ZOOKEEPER-4711 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4711 Project: ZooKeeper Issue Type: Bug Components: server Affects Versions: 3.9.0 Environment: download zookeeper 3.9.0-SNAPSHOT from github repository ([https://github.com/apache/zookeeper)] Then run : mvn test -Dmaven.test.failure.ignore=true -Dtest=org.apache.zookeeper.server.watch.WatchManagerTest#testDeadWatchers -DfailIfNoTests=false -DredirectTestOutputToFile=false Reporter: lujie When we run : mvn test -Dmaven.test.failure.ignore=true -Dtest=org.apache.zookeeper.server.watch.WatchManagerTest#testDeadWatchers -DfailIfNoTests=false -DredirectTestOutputToFile=false The method of addDeadWatcher ( System.out.println("2s::" +Thread.currentThread().getName()+ " "+System.identityHashCode(deadWatchers)+" " + System.currentTimeMillis()); this is my debug info. ) {code:java} public void addDeadWatcher(int watcherBit) { // Wait if there are too many watchers waiting to be closed, // this is will slow down the socket packet processing and // the adding watches in the ZK pipeline. while (maxInProcessingDeadWatchers > 0 && !stopped && totalDeadWatchers.get() >= maxInProcessingDeadWatchers) { try { RATE_LOGGER.rateLimitLog("Waiting for dead watchers cleaning"); long startTime = Time.currentElapsedTime(); synchronized (processingCompletedEvent) { processingCompletedEvent.wait(100); } long latency = Time.currentElapsedTime() - startTime; ServerMetrics.getMetrics().ADD_DEAD_WATCHER_STALL_TIME.add(latency); } catch (InterruptedException e) { LOG.info("Got interrupted while waiting for dead watches queue size"); break; } } synchronized (this) { if (deadWatchers.add(watcherBit)) { totalDeadWatchers.incrementAndGet(); ServerMetrics.getMetrics().DEAD_WATCHERS_QUEUED.add(1); if (deadWatchers.size() >= watcherCleanThreshold) { synchronized (cleanEvent) { cleanEvent.notifyAll(); } } } } }{code} {code:java} @Override public void run() { while (!stopped) { synchronized (cleanEvent) { try { // add some jitter to avoid cleaning dead watchers at the // same time in the quorum if (!stopped && deadWatchers.size() < watcherCleanThreshold) { int maxWaitMs = (watcherCleanIntervalInSeconds + ThreadLocalRandom.current().nextInt(watcherCleanIntervalInSeconds / 2 + 1)) * 1000; cleanEvent.wait(maxWaitMs); } } catch (InterruptedException e) { LOG.info("Received InterruptedException while waiting for cleanEvent"); break; } } if (deadWatchers.isEmpty()) { continue; } synchronized (this) { // Clean the dead watchers need to go through all the current // watches, which is pretty heavy and may take a second if // there are millions of watches, that's why we're doing lazily // batch clean up in a separate thread with a snapshot of the // current dead watchers. final Set snapshot = new HashSet<>(deadWatchers); deadWatchers.clear(); int total = snapshot.size(); LOG.info("Processing {} dead watchers", total); cleaners.schedule(new WorkRequest() { @Override public void doWork() throws Exception { long startTime = Time.currentElapsedTime(); listener.processDeadWatchers(snapshot); long latency = Time.currentElapsedTime() - startTime; LOG.info("Takes {} to process {} watches", latency, total); ServerMetrics.getMetrics().DEAD_WATCHERS_CLEANER_LATENCY.add(latency); ServerMetrics.getMetrics().DEAD_WATCHERS_CLEARED.add(total); totalDeadWatchers.addAndGet(-total); synchronized