[jira] [Updated] (ZOOKEEPER-4721) Upgrade OWASP Dependency Check to 8.3.1
[ https://issues.apache.org/jira/browse/ZOOKEEPER-4721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andor Molnar updated ZOOKEEPER-4721: Affects Version/s: 3.8.1 3.7.1 3.9.0 > Upgrade OWASP Dependency Check to 8.3.1 > --- > > Key: ZOOKEEPER-4721 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4721 > Project: ZooKeeper > Issue Type: Bug > Components: build >Affects Versions: 3.5.4, 3.6.0, 3.4.12, 3.7.1, 3.9.0, 3.8.1 >Reporter: Abraham Fine >Assignee: Patrick D. Hunt >Priority: Major > Labels: newbie, pull-request-available > Fix For: 3.9.0, 3.7.2, 3.8.2 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (ZOOKEEPER-4552) Bump bouncycastle from 1.60 to 1.70
[ https://issues.apache.org/jira/browse/ZOOKEEPER-4552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ZhangJian He resolved ZOOKEEPER-4552. - Resolution: Abandoned > Bump bouncycastle from 1.60 to 1.70 > --- > > Key: ZOOKEEPER-4552 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4552 > Project: ZooKeeper > Issue Type: Task >Reporter: ZhangJian He >Priority: Minor > Labels: pull-request-available > Time Spent: 1.5h > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (ZOOKEEPER-4719) Use bouncycastle jdk18on instead of jdk15on
[ https://issues.apache.org/jira/browse/ZOOKEEPER-4719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zili Chen resolved ZOOKEEPER-4719. -- Fix Version/s: 3.9.0 Assignee: Zili Chen Resolution: Fixed master via 4882f7b63490971e44a669e98428615ef7bf472f > Use bouncycastle jdk18on instead of jdk15on > --- > > Key: ZOOKEEPER-4719 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4719 > Project: ZooKeeper > Issue Type: Bug >Reporter: ZhangJian He >Assignee: Zili Chen >Priority: Minor > Labels: pull-request-available > Fix For: 3.9.0 > > Time Spent: 40m > Remaining Estimate: 0h > > bouncycastle jdk15on is deprecated in > [https://github.com/bcgit/bc-java/issues/1139] > we can switch to bouncycastle jdk18on -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ZOOKEEPER-4714) Improve syncRequestProcessor performance
[ https://issues.apache.org/jira/browse/ZOOKEEPER-4714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17743717#comment-17743717 ] Zili Chen commented on ZOOKEEPER-4714: -- [~andor] in case you haven't cut 3.9.0 yet, this patch will now be included. > Improve syncRequestProcessor performance > > > Key: ZOOKEEPER-4714 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4714 > Project: ZooKeeper > Issue Type: Wish > Components: server >Affects Versions: 3.8.1 >Reporter: Yan Zhao >Assignee: Zili Chen >Priority: Major > Labels: pull-request-available > Fix For: 3.9.0 > > Attachments: 761688051587_.pic.jpg > > Time Spent: 3h 10m > Remaining Estimate: 0h > > In the SyncRequestProcessor, a write operation is performed for each write > request. Two methods are relatively time-consuming. > 1. Within SyncRequestProcessor#shouldSnapshot, the current size of the > current file is retrieved, which involves a system call. > Call stack: > java.io.File.length(File.java) > org.apache.zookeeper.server.persistence.FileTxnLog.getCurrentLogSize(FileTxnLog.java:211) > org.apache.zookeeper.server.persistence.FileTxnLog.getTotalLogSize(FileTxnLog.java:221) > org.apache.zookeeper.server.persistence.FileTxnSnapLog.getTotalLogSize(FileTxnSnapLog.java:671) > org.apache.zookeeper.server.ZKDatabase.getTxnSize(ZKDatabase.java:790) > org.apache.zookeeper.server.SyncRequestProcessor.shouldSnapshot(SyncRequestProcessor.java:145) > org.apache.zookeeper.server.SyncRequestProcessor.run(SyncRequestProcessor.java:182) > 2. Within ZKDatabase#append, the current position of the current file is > retrieved, which also involves a system call. 
> Call stack: > sun.nio.ch.FileDispatcherImpl.seek(FileDispatcherImpl.java) > sun.nio.ch.FileChannelImpl.position(FileChannelImpl.java:264) > org.apache.zookeeper.server.persistence.FilePadding.padFile(FilePadding.java:76) > org.apache.zookeeper.server.persistence.FileTxnLog.append(FileTxnLog.java:298) > org.apache.zookeeper.server.persistence.FileTxnSnapLog.append(FileTxnSnapLog.java:592) > org.apache.zookeeper.server.ZKDatabase.append(ZKDatabase.java:678) > org.apache.zookeeper.server.SyncRequestProcessor.run(SyncRequestProcessor.java:181) > Therefore, it is best to maintain the current size and position of the > current file ourselves, as this can greatly improve performance. -- This message was sent by Atlassian Jira (v8.20.10#820010)
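The optimization the report proposes (maintaining the current log size in memory instead of querying the file on every request) can be sketched as follows. All names here are illustrative, not ZooKeeper's actual classes; the sketch assumes a single writer thread, as in SyncRequestProcessor.

```java
// Hypothetical sketch: track the txn log size in a field so the snapshot
// check is a comparison on a long, not a File.length() system call.
public class TrackedLogSize {
    private long currentSize;          // bytes written to the current log file
    private final long snapSizeLimit;  // threshold that triggers a snapshot

    public TrackedLogSize(long snapSizeLimit) {
        this.snapSizeLimit = snapSizeLimit;
    }

    // Called once per appended txn; updating a long is far cheaper than
    // asking the filesystem for the file's length.
    public void onAppend(int bytesWritten) {
        currentSize += bytesWritten;
    }

    public boolean shouldSnapshot() {
        return currentSize >= snapSizeLimit;
    }

    // Reset when the log rolls over to a new file.
    public void onRoll() {
        currentSize = 0;
    }
}
```

The same idea applies to the file position used by padding: the writer already knows how many bytes it has written, so it can pass that count along instead of calling FileChannel.position().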
[jira] [Resolved] (ZOOKEEPER-4714) Improve syncRequestProcessor performance
[ https://issues.apache.org/jira/browse/ZOOKEEPER-4714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zili Chen resolved ZOOKEEPER-4714. -- Fix Version/s: (was: 3.8.3) Assignee: Zili Chen Resolution: Fixed master via e2e8ec661f8d50e5341bdefa0ccd8c5116f5ce4b > Improve syncRequestProcessor performance > > > Key: ZOOKEEPER-4714 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4714 > Project: ZooKeeper > Issue Type: Wish > Components: server >Affects Versions: 3.8.1 >Reporter: Yan Zhao >Assignee: Zili Chen >Priority: Major > Labels: pull-request-available > Fix For: 3.9.0 > > Attachments: 761688051587_.pic.jpg > > Time Spent: 3h 10m > Remaining Estimate: 0h > > In the SyncRequestProcessor, a write operation is performed for each write > request. Two methods are relatively time-consuming. > 1. Within SyncRequestProcessor#shouldSnapshot, the current size of the > current file is retrieved, which involves a system call. > Call stack: > java.io.File.length(File.java) > org.apache.zookeeper.server.persistence.FileTxnLog.getCurrentLogSize(FileTxnLog.java:211) > org.apache.zookeeper.server.persistence.FileTxnLog.getTotalLogSize(FileTxnLog.java:221) > org.apache.zookeeper.server.persistence.FileTxnSnapLog.getTotalLogSize(FileTxnSnapLog.java:671) > org.apache.zookeeper.server.ZKDatabase.getTxnSize(ZKDatabase.java:790) > org.apache.zookeeper.server.SyncRequestProcessor.shouldSnapshot(SyncRequestProcessor.java:145) > org.apache.zookeeper.server.SyncRequestProcessor.run(SyncRequestProcessor.java:182) > 2. Within ZKDatabase#append, the current position of the current file is > retrieved, which also involves a system call. 
> Call stack: > sun.nio.ch.FileDispatcherImpl.seek(FileDispatcherImpl.java) > sun.nio.ch.FileChannelImpl.position(FileChannelImpl.java:264) > org.apache.zookeeper.server.persistence.FilePadding.padFile(FilePadding.java:76) > org.apache.zookeeper.server.persistence.FileTxnLog.append(FileTxnLog.java:298) > org.apache.zookeeper.server.persistence.FileTxnSnapLog.append(FileTxnSnapLog.java:592) > org.apache.zookeeper.server.ZKDatabase.append(ZKDatabase.java:678) > org.apache.zookeeper.server.SyncRequestProcessor.run(SyncRequestProcessor.java:181) > Therefore, it is best to maintain the current size and position of the > current file ourselves, as this can greatly improve performance. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ZOOKEEPER-4714) Improve syncRequestProcessor performance
[ https://issues.apache.org/jira/browse/ZOOKEEPER-4714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andor Molnar updated ZOOKEEPER-4714: Fix Version/s: 3.9.0 > Improve syncRequestProcessor performance > > > Key: ZOOKEEPER-4714 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4714 > Project: ZooKeeper > Issue Type: Wish > Components: server >Affects Versions: 3.8.1 >Reporter: Yan Zhao >Priority: Major > Labels: pull-request-available > Fix For: 3.9.0, 3.8.3 > > Attachments: 761688051587_.pic.jpg > > Time Spent: 1h 50m > Remaining Estimate: 0h > > In the SyncRequestProcessor, a write operation is performed for each write > request. Two methods are relatively time-consuming. > 1. Within SyncRequestProcessor#shouldSnapshot, the current size of the > current file is retrieved, which involves a system call. > Call stack: > java.io.File.length(File.java) > org.apache.zookeeper.server.persistence.FileTxnLog.getCurrentLogSize(FileTxnLog.java:211) > org.apache.zookeeper.server.persistence.FileTxnLog.getTotalLogSize(FileTxnLog.java:221) > org.apache.zookeeper.server.persistence.FileTxnSnapLog.getTotalLogSize(FileTxnSnapLog.java:671) > org.apache.zookeeper.server.ZKDatabase.getTxnSize(ZKDatabase.java:790) > org.apache.zookeeper.server.SyncRequestProcessor.shouldSnapshot(SyncRequestProcessor.java:145) > org.apache.zookeeper.server.SyncRequestProcessor.run(SyncRequestProcessor.java:182) > 2. Within ZKDatabase#append, the current position of the current file is > retrieved, which also involves a system call. 
> Call stack: > sun.nio.ch.FileDispatcherImpl.seek(FileDispatcherImpl.java) > sun.nio.ch.FileChannelImpl.position(FileChannelImpl.java:264) > org.apache.zookeeper.server.persistence.FilePadding.padFile(FilePadding.java:76) > org.apache.zookeeper.server.persistence.FileTxnLog.append(FileTxnLog.java:298) > org.apache.zookeeper.server.persistence.FileTxnSnapLog.append(FileTxnSnapLog.java:592) > org.apache.zookeeper.server.ZKDatabase.append(ZKDatabase.java:678) > org.apache.zookeeper.server.SyncRequestProcessor.run(SyncRequestProcessor.java:181) > Therefore, it is best to maintain the current size and position of the > current file ourselves, as this can greatly improve performance. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ZOOKEEPER-4714) Improve syncRequestProcessor performance
[ https://issues.apache.org/jira/browse/ZOOKEEPER-4714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andor Molnar updated ZOOKEEPER-4714: Fix Version/s: (was: 3.9.0) > Improve syncRequestProcessor performance > > > Key: ZOOKEEPER-4714 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4714 > Project: ZooKeeper > Issue Type: Wish > Components: server >Affects Versions: 3.8.1 >Reporter: Yan Zhao >Priority: Major > Labels: pull-request-available > Fix For: 3.8.3 > > Attachments: 761688051587_.pic.jpg > > Time Spent: 1h 50m > Remaining Estimate: 0h > > In the SyncRequestProcessor, a write operation is performed for each write > request. Two methods are relatively time-consuming. > 1. Within SyncRequestProcessor#shouldSnapshot, the current size of the > current file is retrieved, which involves a system call. > Call stack: > java.io.File.length(File.java) > org.apache.zookeeper.server.persistence.FileTxnLog.getCurrentLogSize(FileTxnLog.java:211) > org.apache.zookeeper.server.persistence.FileTxnLog.getTotalLogSize(FileTxnLog.java:221) > org.apache.zookeeper.server.persistence.FileTxnSnapLog.getTotalLogSize(FileTxnSnapLog.java:671) > org.apache.zookeeper.server.ZKDatabase.getTxnSize(ZKDatabase.java:790) > org.apache.zookeeper.server.SyncRequestProcessor.shouldSnapshot(SyncRequestProcessor.java:145) > org.apache.zookeeper.server.SyncRequestProcessor.run(SyncRequestProcessor.java:182) > 2. Within ZKDatabase#append, the current position of the current file is > retrieved, which also involves a system call. 
> Call stack: > sun.nio.ch.FileDispatcherImpl.seek(FileDispatcherImpl.java) > sun.nio.ch.FileChannelImpl.position(FileChannelImpl.java:264) > org.apache.zookeeper.server.persistence.FilePadding.padFile(FilePadding.java:76) > org.apache.zookeeper.server.persistence.FileTxnLog.append(FileTxnLog.java:298) > org.apache.zookeeper.server.persistence.FileTxnSnapLog.append(FileTxnSnapLog.java:592) > org.apache.zookeeper.server.ZKDatabase.append(ZKDatabase.java:678) > org.apache.zookeeper.server.SyncRequestProcessor.run(SyncRequestProcessor.java:181) > Therefore, it is best to maintain the current size and position of the > current file ourselves, as this can greatly improve performance. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ZOOKEEPER-4720) Add jakarta.servlet support
Rajendra Rathore created ZOOKEEPER-4720: --- Summary: Add jakarta.servlet support Key: ZOOKEEPER-4720 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4720 Project: ZooKeeper Issue Type: New Feature Reporter: Rajendra Rathore In order to upgrade to Tomcat 10+ / Servlet 5+ it's required to switch to the Jakarta EE Namespace. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ZOOKEEPER-4720) Add jakarta.servlet support
[ https://issues.apache.org/jira/browse/ZOOKEEPER-4720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajendra Rathore updated ZOOKEEPER-4720: Issue Type: Wish (was: New Feature) > Add jakarta.servlet support > --- > > Key: ZOOKEEPER-4720 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4720 > Project: ZooKeeper > Issue Type: Wish >Reporter: Rajendra Rathore >Priority: Major > > In order to upgrade to Tomcat 10+ / Servlet 5+ it's required to switch to the > Jakarta EE Namespace. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (ZOOKEEPER-4717) Cache serialize data in the request to avoid repeat serialize.
[ https://issues.apache.org/jira/browse/ZOOKEEPER-4717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Enrico Olivelli resolved ZOOKEEPER-4717. Fix Version/s: (was: 3.8.2) Resolution: Fixed > Cache serialize data in the request to avoid repeat serialize. > -- > > Key: ZOOKEEPER-4717 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4717 > Project: ZooKeeper > Issue Type: Improvement > Components: server >Affects Versions: 3.8.1 >Reporter: Yan Zhao >Assignee: Enrico Olivelli >Priority: Minor > Labels: pull-request-available > Fix For: 3.9.0 > > Time Spent: 2h > Remaining Estimate: 0h > > For each request, it will be serialized three times. > 1. Leader proposal. It will serialize the request, wrap the serialized data > in a proposal, then send the proposal to the quorum members. > 2. SyncRequestProcessor append txn log. It will serialize the request, then > write the serialized data to the txn log. > 3. ZkDataBase addCommittedProposal. It will serialize the request, wrap the > serialized data in a proposal, then add the proposal to committedLog. > Serialization operations are CPU-sensitive, and when the CPU experiences > jitter, the time required for serialization operations will also skyrocket. > Therefore, we should avoid serializing the same request multiple times. -- This message was sent by Atlassian Jira (v8.20.10#820010)
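The caching idea behind ZOOKEEPER-4717 can be sketched with a minimal, hypothetical class (not ZooKeeper's real Request): serialize once on first access, then hand the same bytes to all three consumers (leader proposal, txn log append, committedLog). The String payload and the byte-count field are illustrative stand-ins.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: a request that caches its serialized form so the
// three consumers share one serialization instead of repeating it.
public class CachedSerializedRequest {
    private final String payload;  // stands in for the txn header + txn record
    private byte[] serialized;     // filled in lazily, then reused
    private int serializeCalls;    // counts actual serializations, for illustration

    public CachedSerializedRequest(String payload) {
        this.payload = payload;
    }

    // All three consumers call this; only the first call pays the CPU cost.
    public synchronized byte[] getSerialized() {
        if (serialized == null) {
            serializeCalls++;
            serialized = payload.getBytes(StandardCharsets.UTF_8);
        }
        return serialized;
    }

    public int getSerializeCalls() {
        return serializeCalls;
    }
}
```

One design caveat the real patch has to address: the cached bytes are shared, so callers must treat them as immutable.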
[jira] [Assigned] (ZOOKEEPER-4717) Cache serialize data in the request to avoid repeat serialize.
[ https://issues.apache.org/jira/browse/ZOOKEEPER-4717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Enrico Olivelli reassigned ZOOKEEPER-4717: -- Assignee: Enrico Olivelli > Cache serialize data in the request to avoid repeat serialize. > -- > > Key: ZOOKEEPER-4717 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4717 > Project: ZooKeeper > Issue Type: Improvement > Components: server >Affects Versions: 3.8.1 >Reporter: Yan Zhao >Assignee: Enrico Olivelli >Priority: Minor > Labels: pull-request-available > Fix For: 3.9.0, 3.8.2 > > Time Spent: 2h > Remaining Estimate: 0h > > For each request, it will be serialized three times. > 1. Leader proposal. It will serialize the request, wrap the serialized data > in a proposal, then send the proposal to the quorum members. > 2. SyncRequestProcessor append txn log. It will serialize the request, then > write the serialized data to the txn log. > 3. ZkDataBase addCommittedProposal. It will serialize the request, wrap the > serialized data in a proposal, then add the proposal to committedLog. > Serialization operations are CPU-sensitive, and when the CPU experiences > jitter, the time required for serialization operations will also skyrocket. > Therefore, we should avoid serializing the same request multiple times. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ZOOKEEPER-4717) Cache serialize data in the request to avoid repeat serialize.
[ https://issues.apache.org/jira/browse/ZOOKEEPER-4717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Enrico Olivelli updated ZOOKEEPER-4717: --- Fix Version/s: 3.9.0 > Cache serialize data in the request to avoid repeat serialize. > -- > > Key: ZOOKEEPER-4717 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4717 > Project: ZooKeeper > Issue Type: Improvement > Components: server >Affects Versions: 3.8.1 >Reporter: Yan Zhao >Priority: Minor > Labels: pull-request-available > Fix For: 3.9.0, 3.8.2 > > Time Spent: 2h > Remaining Estimate: 0h > > For each request, it will be serialized three times. > 1. Leader proposal. It will serialize the request, wrap the serialized data > in a proposal, then send the proposal to the quorum members. > 2. SyncRequestProcessor append txn log. It will serialize the request, then > write the serialized data to the txn log. > 3. ZkDataBase addCommittedProposal. It will serialize the request, wrap the > serialized data in a proposal, then add the proposal to committedLog. > Serialization operations are CPU-sensitive, and when the CPU experiences > jitter, the time required for serialization operations will also skyrocket. > Therefore, we should avoid serializing the same request multiple times. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ZOOKEEPER-4719) Use bouncycastle jdk18on instead of jdk15on
[ https://issues.apache.org/jira/browse/ZOOKEEPER-4719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ZOOKEEPER-4719: -- Labels: pull-request-available (was: ) > Use bouncycastle jdk18on instead of jdk15on > --- > > Key: ZOOKEEPER-4719 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4719 > Project: ZooKeeper > Issue Type: Bug >Reporter: ZhangJian He >Priority: Minor > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > bouncycastle jdk15on is deprecated in > [https://github.com/bcgit/bc-java/issues/1139] > we can switch to bouncycastle jdk18on -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ZOOKEEPER-4719) Use bouncycastle jdk18on instead of jdk15on
ZhangJian He created ZOOKEEPER-4719: --- Summary: Use bouncycastle jdk18on instead of jdk15on Key: ZOOKEEPER-4719 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4719 Project: ZooKeeper Issue Type: Bug Reporter: ZhangJian He bouncycastle jdk15on is deprecated in [https://github.com/bcgit/bc-java/issues/1139]; we can switch to bouncycastle jdk18on -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (ZOOKEEPER-4718) Removing unnecessary heap memory allocation in serialization can help reduce GC pressure.
[ https://issues.apache.org/jira/browse/ZOOKEEPER-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zili Chen resolved ZOOKEEPER-4718. -- Fix Version/s: 3.9.0 (was: 3.8.2) Assignee: Zili Chen Resolution: Fixed master via e08cc2a782982964a57651f179a468b19e2e6010 > Removing unnecessary heap memory allocation in serialization can help reduce > GC pressure. > - > > Key: ZOOKEEPER-4718 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4718 > Project: ZooKeeper > Issue Type: Improvement > Components: server >Affects Versions: 3.8.1 >Reporter: Yan Zhao >Assignee: Zili Chen >Priority: Major > Labels: pull-request-available > Fix For: 3.9.0 > > Time Spent: 0.5h > Remaining Estimate: 0h > > For each request, we will serialize it to a byte array. > In SerializeUtils#serializeRequest, before serializing the request, it always > allocates a 32-byte array. It's unnecessary; we can allocate the byte array in > the catch code block. > {code:java} > public static byte[] serializeRequest(Request request) { > if (request == null || request.getHdr() == null) { > return null; > } > byte[] data = new byte[32]; > try { > data = Util.marshallTxnEntry(request.getHdr(), request.getTxn(), > request.getTxnDigest()); > } catch (IOException e) { > LOG.error("This really should be impossible", e); > } > return data; > } > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
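The shape of the fix the description suggests (allocate the 32-byte fallback only on the error path) can be sketched in isolation. `Marshaller` below is a stand-in for Util.marshallTxnEntry, since the real Request, Util, and LOG types aren't reproduced here.

```java
import java.io.IOException;

// Sketch of the proposed change: the fallback array is allocated only in the
// catch block, so the common (successful) path performs no throwaway
// allocation per request.
public class SerializeSketch {
    interface Marshaller {
        byte[] marshal() throws IOException; // stand-in for Util.marshallTxnEntry
    }

    static byte[] serialize(Marshaller marshaller) {
        try {
            return marshaller.marshal(); // normal path: no extra allocation
        } catch (IOException e) {
            // error path only: keep the old behavior of returning a 32-byte
            // placeholder instead of pre-allocating it unconditionally
            return new byte[32];
        }
    }
}
```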
[jira] [Updated] (ZOOKEEPER-4712) Follower.shutdown() and Observer.shutdown() do not correctly shutdown the syncProcessor, which may lead to data inconsistency
[ https://issues.apache.org/jira/browse/ZOOKEEPER-4712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ZOOKEEPER-4712: -- Labels: pull-request-available (was: ) > Follower.shutdown() and Observer.shutdown() do not correctly shutdown the > syncProcessor, which may lead to data inconsistency > - > > Key: ZOOKEEPER-4712 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4712 > Project: ZooKeeper > Issue Type: Bug > Components: quorum, server >Affects Versions: 3.5.10, 3.6.3, 3.7.0, 3.8.0, 3.7.1, 3.6.4, 3.8.1 >Reporter: Sirius >Priority: Critical > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > Follower.shutdown() and Observer.shutdown() do not correctly shutdown the > syncProcessor. It may lead to potential data inconsistency (see {*}Potential > Risk{*}). > > A follower / observer will invoke syncProcessor.shutdown() in > LearnerZooKeeperServer.shutdown() / ObserverZooKeeperServer.shutdown(), > respectively. > However, after the > [FIX|https://github.com/apache/zookeeper/commit/efbd660e1c4b90a8f538f2cccb5dcb7094cf9a22] > of ZOOKEEPER-3642, Follower.shutdown() / Observer.shutdown() will not invoke > LearnerZooKeeperServer.shutdown() / ObserverZooKeeperServer.shutdown() > anymore. > > h2. Call stack > h5. Version 3.8.1 / 3.8.0 / 3.7.1 / 3.7.0 / 3.6.4 / 3.6.3 / 3.5.10 ... > * *(Buggy)* Observer.shutdown() -> Learner.shutdown() -> > ZooKeeperServer.shutdown(boolean) > * *(Buggy)* Follower.shutdown() -> Learner.shutdown() -> > ZooKeeperServer.shutdown(boolean) > * (For comparison) Leader.shutdown(String) -> LeaderZooKeeper.shutdown() -> > ZooKeeperServer.shutdown() -> ZooKeeperServer.shutdown(boolean) > > h5. 
For comparison, in version 3.4.X, > * Observer.shutdown() -> Learner.shutdown() -> > ObserverZooKeeperServer.shutdown() -> ZooKeeperServer.shutdown() -> > ZooKeeperServer.shutdown(boolean) > * Follower.shutdown() -> Learner.shutdown() -> > FollowerZooKeeperServer.shutdown() -> ZooKeeperServer.shutdown() -> > ZooKeeperServer.shutdown(boolean) > > h2. Code Details > Take version 3.8.0 as an example. > In Follower.shutdown() : > {code:java} > public void shutdown() { > LOG.info("shutdown Follower"); > + // invoke Learner.shutdown() > super.shutdown(); > } {code} > > In Learner.java: > {code:java} > public void shutdown() { > ... > // shutdown previous zookeeper > if (zk != null) { > // If we haven't finished SNAP sync, force fully shutdown > // to avoid potential inconsistency > + // This will invoke ZooKeeperServer.shutdown(boolean), > + // which will not shutdown syncProcessor > + // Before the fix of ZOOKEEPER-3642, > + // FollowerZooKeeperServer.shutdown() will be invoked here > zk.shutdown(self.getSyncMode().equals(QuorumPeer.SyncMode.SNAP)); > } > } {code} > > In ZooKeeperServer.java: > {code:java} > public synchronized void shutdown(boolean fullyShutDown) { > ... > if (firstProcessor != null) { > + // For a follower, this will not shutdown its syncProcessor. > firstProcessor.shutdown(); > } > ... > } {code} > > In expectation, Follower.shutdown() should invoke > LearnerZooKeeperServer.shutdown() to shutdown the syncProcessor: > {code:java} > public synchronized void shutdown() { > ... > try { > + // shutdown the syncProcessor here > if (syncProcessor != null) { > syncProcessor.shutdown(); > } > } ... > } {code} > Observer.shutdown() has a similar problem. > > h2. Potential Risk > When Follower.shutdown() is called, the follower's QuorumPeer thread may > update the lastProcessedZxid for the election and recovery phase before its > syncThread drains the pending requests and flushes them to disk. 
> In consequence, this lastProcessedZxid is not the latest zxid in its log, > leading to log inconsistency after the SYNC phase. (Similar to the symptoms > of ZOOKEEPER-2845.) > -- This message was sent by Atlassian Jira (v8.20.10#820010)
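The ordering requirement the report describes can be modeled with a minimal sketch (illustrative names, not ZooKeeper's real classes): a sync thread must drain and flush its pending requests during shutdown, so the last flushed zxid catches up with everything already accepted before anyone reads lastProcessedZxid for election.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch of drain-before-shutdown: shutdown() flushes the queue
// so lastFlushedZxid reflects every accepted request.
public class SyncProcessorSketch {
    private final Deque<Long> pending = new ArrayDeque<>(); // zxids awaiting flush
    private long lastFlushedZxid = -1;

    public void append(long zxid) {
        pending.add(zxid);
    }

    // Without this drain, a reader could observe a lastFlushedZxid older than
    // zxids already accepted -- the inconsistency described in the report.
    public void shutdown() {
        while (!pending.isEmpty()) {
            lastFlushedZxid = pending.poll(); // "flush to disk" in the real code
        }
    }

    public long getLastFlushedZxid() {
        return lastFlushedZxid;
    }
}
```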
[jira] [Comment Edited] (ZOOKEEPER-4669) Upgrade snappy-java to 1.1.9.1 (in order to support M1 macs)
[ https://issues.apache.org/jira/browse/ZOOKEEPER-4669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17740592#comment-17740592 ] AvnerW edited comment on ZOOKEEPER-4669 at 7/6/23 12:59 PM: [~mpolden] , [~cnauroth] it seems like only version 1.1.10.1 contains a fix for CVE-2023-34453, CVE-2023-34454 and CVE-2023-34455. Can the next ZK version include version 1.1.10.1 instead of 1.1.9.1? was (Author: avnerw): [~mpolden] , it seems like only version 1.1.10.1 contains a fix for CVE-2023-34453, CVE-2023-34454 and CVE-2023-34455. Can the next ZK version include version 1.1.10.1 instead of 1.1.9.1? > Upgrade snappy-java to 1.1.9.1 (in order to support M1 macs) > > > Key: ZOOKEEPER-4669 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4669 > Project: ZooKeeper > Issue Type: Task > Components: java client >Reporter: Enrico Olivelli >Assignee: Martin Polden >Priority: Major > Labels: pull-request-available > Fix For: 3.9.0, 3.7.2, 3.8.2 > > Time Spent: 40m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ZOOKEEPER-4669) Upgrade snappy-java to 1.1.9.1 (in order to support M1 macs)
[ https://issues.apache.org/jira/browse/ZOOKEEPER-4669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17740592#comment-17740592 ] AvnerW commented on ZOOKEEPER-4669: --- [~mpolden] , it seems like only version 1.1.10.1 contains a fix for CVE-2023-34453, CVE-2023-34454 and CVE-2023-34455. Can the next ZK version include version 1.1.10.1 instead of 1.1.9.1? > Upgrade snappy-java to 1.1.9.1 (in order to support M1 macs) > > > Key: ZOOKEEPER-4669 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4669 > Project: ZooKeeper > Issue Type: Task > Components: java client >Reporter: Enrico Olivelli >Assignee: Martin Polden >Priority: Major > Labels: pull-request-available > Fix For: 3.9.0, 3.7.2, 3.8.2 > > Time Spent: 40m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ZOOKEEPER-4718) Removing unnecessary heap memory allocation in serialization can help reduce GC pressure.
[ https://issues.apache.org/jira/browse/ZOOKEEPER-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yan Zhao updated ZOOKEEPER-4718: Description: For each request, we will serialize it to a byte array. In SerializeUtils#serializeRequest, before serializing the request, it always allocates 32 byte array. It's unnecessary; we can allocate the byte array in the catch code block. {code:java} public static byte[] serializeRequest(Request request) { if (request == null || request.getHdr() == null) { return null; } byte[] data = new byte[32]; try { data = Util.marshallTxnEntry(request.getHdr(), request.getTxn(), request.getTxnDigest()); } catch (IOException e) { LOG.error("This really should be impossible", e); } return data; } {code} was: In SerializeUtils#serializeRequest, before serializing the request, it always allocates 32 byte array. It's unnecessary; we can allocate the byte array in the catch code block. {code:java} public static byte[] serializeRequest(Request request) { if (request == null || request.getHdr() == null) { return null; } byte[] data = new byte[32]; try { data = Util.marshallTxnEntry(request.getHdr(), request.getTxn(), request.getTxnDigest()); } catch (IOException e) { LOG.error("This really should be impossible", e); } return data; } {code} > Removing unnecessary heap memory allocation in serialization can help reduce > GC pressure. > - > > Key: ZOOKEEPER-4718 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4718 > Project: ZooKeeper > Issue Type: Improvement > Components: server >Affects Versions: 3.8.1 >Reporter: Yan Zhao >Priority: Major > Labels: pull-request-available > Fix For: 3.8.2 > > Time Spent: 10m > Remaining Estimate: 0h > > For each request, we will serialize it to a byte array. > In SerializeUtils#serializeRequest, before serializing the request, it always > allocates 32 byte array. It's unnecessary; we can allocate the byte array in > the catch code block. 
> {code:java} > public static byte[] serializeRequest(Request request) { > if (request == null || request.getHdr() == null) { > return null; > } > byte[] data = new byte[32]; > try { > data = Util.marshallTxnEntry(request.getHdr(), request.getTxn(), > request.getTxnDigest()); > } catch (IOException e) { > LOG.error("This really should be impossible", e); > } > return data; > } > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ZOOKEEPER-4718) Removing unnecessary heap memory allocation in serialization can help reduce GC pressure.
[ https://issues.apache.org/jira/browse/ZOOKEEPER-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ZOOKEEPER-4718: -- Labels: pull-request-available (was: ) > Removing unnecessary heap memory allocation in serialization can help reduce > GC pressure. > - > > Key: ZOOKEEPER-4718 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4718 > Project: ZooKeeper > Issue Type: Improvement > Components: server >Affects Versions: 3.8.1 >Reporter: Yan Zhao >Priority: Major > Labels: pull-request-available > Fix For: 3.8.2 > > Time Spent: 10m > Remaining Estimate: 0h > > In SerializeUtils#serializeRequest, before serializing the request, it always > allocates a 32-byte array. It's unnecessary; we can allocate the byte array in > the catch code block. > {code:java} > public static byte[] serializeRequest(Request request) { > if (request == null || request.getHdr() == null) { > return null; > } > byte[] data = new byte[32]; > try { > data = Util.marshallTxnEntry(request.getHdr(), request.getTxn(), > request.getTxnDigest()); > } catch (IOException e) { > LOG.error("This really should be impossible", e); > } > return data; > } > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ZOOKEEPER-4718) Removing unnecessary heap memory allocation in serialization can help reduce GC pressure.
Yan Zhao created ZOOKEEPER-4718: --- Summary: Removing unnecessary heap memory allocation in serialization can help reduce GC pressure. Key: ZOOKEEPER-4718 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4718 Project: ZooKeeper Issue Type: Improvement Components: server Affects Versions: 3.8.1 Reporter: Yan Zhao Fix For: 3.8.2 In SerializeUtils#serializeRequest, before serializing the request, it always allocates a 32-byte array. It's unnecessary; we can allocate the byte array in the catch code block. {code:java} public static byte[] serializeRequest(Request request) { if (request == null || request.getHdr() == null) { return null; } byte[] data = new byte[32]; try { data = Util.marshallTxnEntry(request.getHdr(), request.getTxn(), request.getTxnDigest()); } catch (IOException e) { LOG.error("This really should be impossible", e); } return data; } {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ZOOKEEPER-4717) Cache serialize data in the request to avoid repeat serialize.
[ https://issues.apache.org/jira/browse/ZOOKEEPER-4717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ZOOKEEPER-4717: -- Labels: pull-request-available (was: ) > Cache serialize data in the request to avoid repeat serialize. > -- > > Key: ZOOKEEPER-4717 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4717 > Project: ZooKeeper > Issue Type: Improvement > Components: server >Affects Versions: 3.8.1 >Reporter: Yan Zhao >Priority: Minor > Labels: pull-request-available > Fix For: 3.8.2 > > Time Spent: 10m > Remaining Estimate: 0h > > For each request, it will be serialized three times. > 1. Leader proposal. It will serialize the request, wrap the serialized data > in a proposal, then send the proposal to the quorum members. > 2. SyncRequestProcessor append txn log. It will serialize the request, then > write the serialized data to the txn log. > 3. ZkDataBase addCommittedProposal. It will serialize the request, wrap the > serialized data in a proposal, then add the proposal to committedLog. > Serialization operations are CPU-sensitive, and when the CPU experiences > jitter, the time required for serialization operations will also skyrocket. > Therefore, we should avoid serializing the same request multiple times. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ZOOKEEPER-4717) Cache serialize data in the request to avoid repeat serialize.
Yan Zhao created ZOOKEEPER-4717: --- Summary: Cache serialize data in the request to avoid repeat serialize. Key: ZOOKEEPER-4717 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4717 Project: ZooKeeper Issue Type: Improvement Components: server Affects Versions: 3.8.1 Reporter: Yan Zhao Fix For: 3.8.2 Each request is serialized three times:
1. Leader proposal. It serializes the request, wraps the serialized data in a proposal, then sends the proposal to the quorum members.
2. SyncRequestProcessor txn log append. It serializes the request, then writes the serialized data to the txn log.
3. ZKDatabase addCommittedProposal. It serializes the request, wraps the serialized data in a proposal, then adds the proposal to committedLog.
Serialization is CPU-intensive, and when the CPU experiences jitter, serialization time also skyrockets. Therefore, we should avoid serializing the same request multiple times. -- This message was sent by Atlassian Jira (v8.20.10#820010)
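The caching idea above can be sketched as a request object that memoizes its serialized form, so the three consumers reuse one serialization. This is a minimal, self-contained illustration; `CachedSerializeSketch.Request`, its `serializeCount` counter, and the string-based "serialization" are illustrative stand-ins, not ZooKeeper's actual `Request` class or jute serialization.

```java
import java.nio.charset.StandardCharsets;

public class CachedSerializeSketch {
    // Toy request that caches its serialized bytes, so the leader proposal,
    // the txn-log append, and addCommittedProposal all share one serialization.
    static class Request {
        private final String payload;
        private byte[] serialized;   // cached after the first call
        int serializeCount = 0;      // instrumentation for this sketch only

        Request(String payload) {
            this.payload = payload;
        }

        byte[] getSerialized() {
            if (serialized == null) {
                serializeCount++;
                // stand-in for the real (expensive) jute serialization
                serialized = payload.getBytes(StandardCharsets.UTF_8);
            }
            return serialized;
        }
    }

    public static void main(String[] args) {
        Request r = new Request("txn-payload");
        r.getSerialized(); // 1. leader proposal
        r.getSerialized(); // 2. SyncRequestProcessor append
        r.getSerialized(); // 3. ZKDatabase.addCommittedProposal
        System.out.println(r.serializeCount); // 1
    }
}
```

One caveat the real change must handle: the cached bytes are only safe to share as long as the request is immutable after the first serialization.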
[jira] [Updated] (ZOOKEEPER-4599) Upgrade Jetty to avoid CVE-2022-2048
[ https://issues.apache.org/jira/browse/ZOOKEEPER-4599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mate Szalay-Beko updated ZOOKEEPER-4599: Issue Type: Task (was: Bug) > Upgrade Jetty to avoid CVE-2022-2048 > > > Key: ZOOKEEPER-4599 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4599 > Project: ZooKeeper > Issue Type: Task >Affects Versions: 3.6.3, 3.8.0, 3.7.1 >Reporter: Shivakumar >Assignee: Mate Szalay-Beko >Priority: Major > Labels: security > Fix For: 3.9.0, 3.7.2, 3.8.2 > > > |CVE ID|Type|Severity|Packages|Package Version|CVSS|Fix Status| > |CVE-2022-2048|java|high|org.eclipse.jetty_jetty-io|9.4.43.v20210629|7.5|fixed > in 11.0.9, 10.0.9, 9.4.47| > Our security scan detected the above vulnerability. > Please upgrade to one of the fixed Jetty versions to address it. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ZOOKEEPER-4674) C client tests don't pass on CI
[ https://issues.apache.org/jira/browse/ZOOKEEPER-4674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mate Szalay-Beko updated ZOOKEEPER-4674: Issue Type: Bug (was: Test) > C client tests don't pass on CI > --- > > Key: ZOOKEEPER-4674 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4674 > Project: ZooKeeper > Issue Type: Bug > Components: c client, tests >Reporter: Enrico Olivelli >Assignee: Damien Diederen >Priority: Blocker > Labels: pull-request-available > Fix For: 3.9.0, 3.7.2, 3.6.5, 3.8.2 > > Time Spent: 0.5h > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ZOOKEEPER-4647) Tests don't pass on JDK20 because we try to mock InetAddress
[ https://issues.apache.org/jira/browse/ZOOKEEPER-4647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mate Szalay-Beko updated ZOOKEEPER-4647: Issue Type: Bug (was: Test) > Tests don't pass on JDK20 because we try to mock InetAddress > > > Key: ZOOKEEPER-4647 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4647 > Project: ZooKeeper > Issue Type: Bug >Reporter: Enrico Olivelli >Assignee: Enrico Olivelli >Priority: Critical > Labels: pull-request-available > Fix For: 3.9.0, 3.8.2 > > Time Spent: 1.5h > Remaining Estimate: 0h > > This test fails on JDK20-Ea > org.apache.zookeeper.test.StaticHostProviderTest.testEmptyResolution > Mockito cannot mock this class: class java.net.InetAddress. Mockito can only > mock non-private & non-final classes. If you're not sure why you're getting > this error, please report to the mailing list. > if I try to upgrade Mockito to 4.9.0 the error is > org.mockito.exceptions.base.MockitoException: > Cannot mock/spy class java.net.InetAddress > Mockito cannot mock/spy because : > - sealed class > > at > org.apache.zookeeper.test.StaticHostProviderTest.testReResolvingSingle(StaticHostProviderTest.jav -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ZOOKEEPER-4716) Upgrade jackson to 2.15.2, suppress two false positive CVE errors
[ https://issues.apache.org/jira/browse/ZOOKEEPER-4716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mate Szalay-Beko updated ZOOKEEPER-4716: Issue Type: Task (was: Improvement) > Upgrade jackson to 2.15.2, suppress two false positive CVE errors > - > > Key: ZOOKEEPER-4716 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4716 > Project: ZooKeeper > Issue Type: Task >Affects Versions: 3.8.1 >Reporter: Mate Szalay-Beko >Assignee: Mate Szalay-Beko >Priority: Major > Labels: pull-request-available > Fix For: 3.9.0, 3.7.2, 3.8.2 > > Time Spent: 40m > Remaining Estimate: 0h > > Our jackson is quite old, I want to upgrade it before release 3.8.2. > Also we have a few false positive CVEs reported by OWASP: > * CVE-2023-35116: according to jackson community, this is not a security > issue, see > [https://github.com/FasterXML/jackson-databind/issues/3972#issuecomment-1596193098] > * CVE-2022-45688: the following CVE is not even jackson related, but a > vulnerability in json-java which we don't use in ZooKeeper > > {code:java} > [INFO] Finished at: 2023-06-30T13:23:38+02:00 > [INFO] > > [ERROR] Failed to execute goal org.owasp:dependency-check-maven:7.1.0:check > (default-cli) on project zookeeper: > [ERROR] > [ERROR] One or more dependencies were identified with vulnerabilities that > have a CVSS score greater than or equal to '0.0': > [ERROR] > [ERROR] jackson-core-2.13.4.jar: CVE-2022-45688(7.5) > [ERROR] jackson-databind-2.13.4.2.jar: CVE-2023-35116(7.5) > {code} > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ZOOKEEPER-4716) Upgrade jackson to 2.15.2, suppress two false positive CVE errors
[ https://issues.apache.org/jira/browse/ZOOKEEPER-4716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mate Szalay-Beko updated ZOOKEEPER-4716: Summary: Upgrade jackson to 2.15.2, suppress two false positive CVE errors (was: upgrade jackson to 2.15.2, suppress two false positive CVE errors) > Upgrade jackson to 2.15.2, suppress two false positive CVE errors > - > > Key: ZOOKEEPER-4716 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4716 > Project: ZooKeeper > Issue Type: Improvement >Affects Versions: 3.8.1 >Reporter: Mate Szalay-Beko >Assignee: Mate Szalay-Beko >Priority: Major > Labels: pull-request-available > Fix For: 3.9.0, 3.7.2, 3.8.2 > > Time Spent: 40m > Remaining Estimate: 0h > > Our jackson is quite old, I want to upgrade it before release 3.8.2. > Also we have a few false positive CVEs reported by OWASP: > * CVE-2023-35116: according to jackson community, this is not a security > issue, see > [https://github.com/FasterXML/jackson-databind/issues/3972#issuecomment-1596193098] > * CVE-2022-45688: the following CVE is not even jackson related, but a > vulnerability in json-java which we don't use in ZooKeeper > > {code:java} > [INFO] Finished at: 2023-06-30T13:23:38+02:00 > [INFO] > > [ERROR] Failed to execute goal org.owasp:dependency-check-maven:7.1.0:check > (default-cli) on project zookeeper: > [ERROR] > [ERROR] One or more dependencies were identified with vulnerabilities that > have a CVSS score greater than or equal to '0.0': > [ERROR] > [ERROR] jackson-core-2.13.4.jar: CVE-2022-45688(7.5) > [ERROR] jackson-databind-2.13.4.2.jar: CVE-2023-35116(7.5) > {code} > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ZOOKEEPER-4709) Upgrade Netty to 4.1.94.Final
[ https://issues.apache.org/jira/browse/ZOOKEEPER-4709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mate Szalay-Beko updated ZOOKEEPER-4709: Issue Type: Task (was: Improvement) > Upgrade Netty to 4.1.94.Final > - > > Key: ZOOKEEPER-4709 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4709 > Project: ZooKeeper > Issue Type: Task >Affects Versions: 3.7.1, 3.8.1 >Reporter: Fabio Buso >Priority: Major > Labels: dependency-upgrade, pull-request-available > Fix For: 3.9.0, 3.7.2, 3.8.2 > > Time Spent: 50m > Remaining Estimate: 0h > > [Netty 4.1.94|https://netty.io/news/2023/06/19/4-1-94-Final.html] includes > several improvements and bug fixes, including a resolution for > [CVE-2023-34462|https://github.com/netty/netty/security/advisories/GHSA-6mjq-h674-j845] > related to potential memory allocation vulnerabilities during a TLS > handshake with Server Name Indication. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ZOOKEEPER-4714) Improve syncRequestProcessor performance
[ https://issues.apache.org/jira/browse/ZOOKEEPER-4714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mate Szalay-Beko updated ZOOKEEPER-4714: Fix Version/s: 3.8.3 > Improve syncRequestProcessor performance > > > Key: ZOOKEEPER-4714 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4714 > Project: ZooKeeper > Issue Type: Wish > Components: server >Affects Versions: 3.8.1 >Reporter: Yan Zhao >Priority: Major > Labels: pull-request-available > Fix For: 3.9.0, 3.8.3 > > Attachments: 761688051587_.pic.jpg > > Time Spent: 20m > Remaining Estimate: 0h > > In the SyncRequestProcessor, a write operation is performed for each write > request. Two methods are relatively time-consuming. > 1. Within SyncRequestProcessor#shouldSnapshot, the current size of the > current file is retrieved, which involves a system call. > Call stack: > java.io.File.length(File.java) > org.apache.zookeeper.server.persistence.FileTxnLog.getCurrentLogSize(FileTxnLog.java:211) > org.apache.zookeeper.server.persistence.FileTxnLog.getTotalLogSize(FileTxnLog.java:221) > org.apache.zookeeper.server.persistence.FileTxnSnapLog.getTotalLogSize(FileTxnSnapLog.java:671) > org.apache.zookeeper.server.ZKDatabase.getTxnSize(ZKDatabase.java:790) > org.apache.zookeeper.server.SyncRequestProcessor.shouldSnapshot(SyncRequestProcessor.java:145) > org.apache.zookeeper.server.SyncRequestProcessor.run(SyncRequestProcessor.java:182) > 2. Within ZKDatabase#append, the current position of the current file is > retrieved, which also involves a system call. 
> Call stack: > sun.nio.ch.FileDispatcherImpl.seek(FileDispatcherImpl.java) > sun.nio.ch.FileChannelImpl.position(FileChannelImpl.java:264) > org.apache.zookeeper.server.persistence.FilePadding.padFile(FilePadding.java:76) > org.apache.zookeeper.server.persistence.FileTxnLog.append(FileTxnLog.java:298) > org.apache.zookeeper.server.persistence.FileTxnSnapLog.append(FileTxnSnapLog.java:592) > org.apache.zookeeper.server.ZKDatabase.append(ZKDatabase.java:678) > org.apache.zookeeper.server.SyncRequestProcessor.run(SyncRequestProcessor.java:181) > Therefore, it is best to maintain the current size and position of the > current file ourselves, as this can greatly improve performance. -- This message was sent by Atlassian Jira (v8.20.10#820010)
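The "maintain size and position ourselves" idea in the email above can be sketched as a log wrapper that mirrors the file position in memory, so size checks become field reads instead of `File.length()` / `FileChannel.position()` system calls. This is a self-contained toy, not the actual `FileTxnLog`; `TrackedLog` and its in-memory counter are illustrative stand-ins.

```java
public class TrackedLogSketch {
    // Toy txn log that tracks its own write position in memory, updated on
    // every append, so shouldSnapshot-style size checks need no syscall.
    static class TrackedLog {
        private long position = 0; // kept in sync with every append

        void append(byte[] data) {
            // real code would write data to the underlying FileChannel here,
            // then advance the mirrored position by the bytes written
            position += data.length;
        }

        // O(1) field read; the real FileTxnLog.getCurrentLogSize() calls
        // File.length(), which is a system call per invocation.
        long getCurrentLogSize() {
            return position;
        }
    }

    public static void main(String[] args) {
        TrackedLog log = new TrackedLog();
        log.append(new byte[100]);
        log.append(new byte[28]);
        System.out.println(log.getCurrentLogSize()); // 128
    }
}
```

The subtlety in the real fix is keeping the mirrored counter consistent with the channel across log rolls, truncation, and padding, which is why the test changes in ZOOKEEPER-4715 compare the tracked values against the file's actual size and position.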
[jira] [Updated] (ZOOKEEPER-4714) Improve syncRequestProcessor performance
[ https://issues.apache.org/jira/browse/ZOOKEEPER-4714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mate Szalay-Beko updated ZOOKEEPER-4714: Fix Version/s: (was: 3.8.2) > Improve syncRequestProcessor performance > > > Key: ZOOKEEPER-4714 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4714 > Project: ZooKeeper > Issue Type: Wish > Components: server >Affects Versions: 3.8.1 >Reporter: Yan Zhao >Priority: Major > Labels: pull-request-available > Fix For: 3.9.0 > > Attachments: 761688051587_.pic.jpg > > Time Spent: 20m > Remaining Estimate: 0h > > In the SyncRequestProcessor, a write operation is performed for each write > request. Two methods are relatively time-consuming. > 1. Within SyncRequestProcessor#shouldSnapshot, the current size of the > current file is retrieved, which involves a system call. > Call stack: > java.io.File.length(File.java) > org.apache.zookeeper.server.persistence.FileTxnLog.getCurrentLogSize(FileTxnLog.java:211) > org.apache.zookeeper.server.persistence.FileTxnLog.getTotalLogSize(FileTxnLog.java:221) > org.apache.zookeeper.server.persistence.FileTxnSnapLog.getTotalLogSize(FileTxnSnapLog.java:671) > org.apache.zookeeper.server.ZKDatabase.getTxnSize(ZKDatabase.java:790) > org.apache.zookeeper.server.SyncRequestProcessor.shouldSnapshot(SyncRequestProcessor.java:145) > org.apache.zookeeper.server.SyncRequestProcessor.run(SyncRequestProcessor.java:182) > 2. Within ZKDatabase#append, the current position of the current file is > retrieved, which also involves a system call. 
> Call stack: > sun.nio.ch.FileDispatcherImpl.seek(FileDispatcherImpl.java) > sun.nio.ch.FileChannelImpl.position(FileChannelImpl.java:264) > org.apache.zookeeper.server.persistence.FilePadding.padFile(FilePadding.java:76) > org.apache.zookeeper.server.persistence.FileTxnLog.append(FileTxnLog.java:298) > org.apache.zookeeper.server.persistence.FileTxnSnapLog.append(FileTxnSnapLog.java:592) > org.apache.zookeeper.server.ZKDatabase.append(ZKDatabase.java:678) > org.apache.zookeeper.server.SyncRequestProcessor.run(SyncRequestProcessor.java:181) > Therefore, it is best to maintain the current size and position of the > current file ourselves, as this can greatly improve performance. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ZOOKEEPER-4715) Verify file size and position in testGetCurrentLogSize.
[ https://issues.apache.org/jira/browse/ZOOKEEPER-4715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17740220#comment-17740220 ] Zili Chen commented on ZOOKEEPER-4715: -- It seems the latest tag to use is 3.9.0. Then if it's not included, please move it to the next version. > Verify file size and position in testGetCurrentLogSize. > --- > > Key: ZOOKEEPER-4715 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4715 > Project: ZooKeeper > Issue Type: Wish > Components: server >Affects Versions: 3.8.1 >Reporter: Yan Zhao >Assignee: Zili Chen >Priority: Major > Labels: pull-request-available > Fix For: 3.9.0 > > Time Spent: 1h > Remaining Estimate: 0h > > This is pre-PR for ZOOKEEPER-4714. > In ZOOKEEPER-4714, we maintain fileSize and filePosition ourselves and we > want our values to match the original values. Therefore, we added checks for > fileSize and filePosition in our tests. After adding the checks, we used a > new method to retrieve fileSize and filePosition in ZOOKEEPER-4714 and tested > whether the tests can still pass. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ZOOKEEPER-4715) Verify file size and position in testGetCurrentLogSize.
[ https://issues.apache.org/jira/browse/ZOOKEEPER-4715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17740218#comment-17740218 ] Zili Chen commented on ZOOKEEPER-4715: -- I'll move the fixed version to the following ones. [~andor] if it happens that you would include it in 3.9.0, please update the field then. > Verify file size and position in testGetCurrentLogSize. > --- > > Key: ZOOKEEPER-4715 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4715 > Project: ZooKeeper > Issue Type: Wish > Components: server >Affects Versions: 3.8.1 >Reporter: Yan Zhao >Priority: Major > Labels: pull-request-available > Fix For: 3.9.0, 3.8.2 > > Time Spent: 1h > Remaining Estimate: 0h > > This is pre-PR for ZOOKEEPER-4714. > In ZOOKEEPER-4714, we maintain fileSize and filePosition ourselves and we > want our values to match the original values. Therefore, we added checks for > fileSize and filePosition in our tests. After adding the checks, we used a > new method to retrieve fileSize and filePosition in ZOOKEEPER-4714 and tested > whether the tests can still pass. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (ZOOKEEPER-4715) Verify file size and position in testGetCurrentLogSize.
[ https://issues.apache.org/jira/browse/ZOOKEEPER-4715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zili Chen resolved ZOOKEEPER-4715. -- Fix Version/s: (was: 3.8.2) Assignee: Zili Chen Resolution: Fixed master via 2edb73a943928e0716b91e8a1d06a9c226fa393c > Verify file size and position in testGetCurrentLogSize. > --- > > Key: ZOOKEEPER-4715 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4715 > Project: ZooKeeper > Issue Type: Wish > Components: server >Affects Versions: 3.8.1 >Reporter: Yan Zhao >Assignee: Zili Chen >Priority: Major > Labels: pull-request-available > Fix For: 3.9.0 > > Time Spent: 1h > Remaining Estimate: 0h > > This is pre-PR for ZOOKEEPER-4714. > In ZOOKEEPER-4714, we maintain fileSize and filePosition ourselves and we > want our values to match the original values. Therefore, we added checks for > fileSize and filePosition in our tests. After adding the checks, we used a > new method to retrieve fileSize and filePosition in ZOOKEEPER-4714 and tested > whether the tests can still pass. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ZOOKEEPER-4714) Improve syncRequestProcessor performance
[ https://issues.apache.org/jira/browse/ZOOKEEPER-4714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17740117#comment-17740117 ] Yan Zhao commented on ZOOKEEPER-4714: - No. It's just an improvement. > Improve syncRequestProcessor performance > > > Key: ZOOKEEPER-4714 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4714 > Project: ZooKeeper > Issue Type: Wish > Components: server >Affects Versions: 3.8.1 >Reporter: Yan Zhao >Priority: Major > Labels: pull-request-available > Fix For: 3.9.0, 3.8.2 > > Attachments: 761688051587_.pic.jpg > > Time Spent: 20m > Remaining Estimate: 0h > > In the SyncRequestProcessor, a write operation is performed for each write > request. Two methods are relatively time-consuming. > 1. Within SyncRequestProcessor#shouldSnapshot, the current size of the > current file is retrieved, which involves a system call. > Call stack: > java.io.File.length(File.java) > org.apache.zookeeper.server.persistence.FileTxnLog.getCurrentLogSize(FileTxnLog.java:211) > org.apache.zookeeper.server.persistence.FileTxnLog.getTotalLogSize(FileTxnLog.java:221) > org.apache.zookeeper.server.persistence.FileTxnSnapLog.getTotalLogSize(FileTxnSnapLog.java:671) > org.apache.zookeeper.server.ZKDatabase.getTxnSize(ZKDatabase.java:790) > org.apache.zookeeper.server.SyncRequestProcessor.shouldSnapshot(SyncRequestProcessor.java:145) > org.apache.zookeeper.server.SyncRequestProcessor.run(SyncRequestProcessor.java:182) > 2. Within ZKDatabase#append, the current position of the current file is > retrieved, which also involves a system call. 
> Call stack: > sun.nio.ch.FileDispatcherImpl.seek(FileDispatcherImpl.java) > sun.nio.ch.FileChannelImpl.position(FileChannelImpl.java:264) > org.apache.zookeeper.server.persistence.FilePadding.padFile(FilePadding.java:76) > org.apache.zookeeper.server.persistence.FileTxnLog.append(FileTxnLog.java:298) > org.apache.zookeeper.server.persistence.FileTxnSnapLog.append(FileTxnSnapLog.java:592) > org.apache.zookeeper.server.ZKDatabase.append(ZKDatabase.java:678) > org.apache.zookeeper.server.SyncRequestProcessor.run(SyncRequestProcessor.java:181) > Therefore, it is best to maintain the current size and position of the > current file ourselves, as this can greatly improve performance. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ZOOKEEPER-4715) Verify file size and position in testGetCurrentLogSize.
[ https://issues.apache.org/jira/browse/ZOOKEEPER-4715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17740116#comment-17740116 ] Yan Zhao commented on ZOOKEEPER-4715: - No. It's just an improvement. > Verify file size and position in testGetCurrentLogSize. > --- > > Key: ZOOKEEPER-4715 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4715 > Project: ZooKeeper > Issue Type: Wish > Components: server >Affects Versions: 3.8.1 >Reporter: Yan Zhao >Priority: Major > Labels: pull-request-available > Fix For: 3.9.0, 3.8.2 > > Time Spent: 40m > Remaining Estimate: 0h > > This is pre-PR for ZOOKEEPER-4714. > In ZOOKEEPER-4714, we maintain fileSize and filePosition ourselves and we > want our values to match the original values. Therefore, we added checks for > fileSize and filePosition in our tests. After adding the checks, we used a > new method to retrieve fileSize and filePosition in ZOOKEEPER-4714 and tested > whether the tests can still pass. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ZOOKEEPER-4715) Verify file size and position in testGetCurrentLogSize.
[ https://issues.apache.org/jira/browse/ZOOKEEPER-4715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17740109#comment-17740109 ] Andor Molnar commented on ZOOKEEPER-4715: - Hi [~horizonzy]. Do you think this ticket is a blocker for 3.9.0? > Verify file size and position in testGetCurrentLogSize. > --- > > Key: ZOOKEEPER-4715 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4715 > Project: ZooKeeper > Issue Type: Wish > Components: server >Affects Versions: 3.8.1 >Reporter: Yan Zhao >Priority: Major > Labels: pull-request-available > Fix For: 3.9.0, 3.8.2 > > Time Spent: 40m > Remaining Estimate: 0h > > This is pre-PR for ZOOKEEPER-4714. > In ZOOKEEPER-4714, we maintain fileSize and filePosition ourselves and we > want our values to match the original values. Therefore, we added checks for > fileSize and filePosition in our tests. After adding the checks, we used a > new method to retrieve fileSize and filePosition in ZOOKEEPER-4714 and tested > whether the tests can still pass. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ZOOKEEPER-4714) Improve syncRequestProcessor performance
[ https://issues.apache.org/jira/browse/ZOOKEEPER-4714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17740108#comment-17740108 ] Andor Molnar commented on ZOOKEEPER-4714: - Thanks [~horizonzy]. Do you think this issue is a blocker for 3.9.0? > Improve syncRequestProcessor performance > > > Key: ZOOKEEPER-4714 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4714 > Project: ZooKeeper > Issue Type: Wish > Components: server >Affects Versions: 3.8.1 >Reporter: Yan Zhao >Priority: Major > Labels: pull-request-available > Fix For: 3.9.0, 3.8.2 > > Attachments: 761688051587_.pic.jpg > > Time Spent: 20m > Remaining Estimate: 0h > > In the SyncRequestProcessor, a write operation is performed for each write > request. Two methods are relatively time-consuming. > 1. Within SyncRequestProcessor#shouldSnapshot, the current size of the > current file is retrieved, which involves a system call. > Call stack: > java.io.File.length(File.java) > org.apache.zookeeper.server.persistence.FileTxnLog.getCurrentLogSize(FileTxnLog.java:211) > org.apache.zookeeper.server.persistence.FileTxnLog.getTotalLogSize(FileTxnLog.java:221) > org.apache.zookeeper.server.persistence.FileTxnSnapLog.getTotalLogSize(FileTxnSnapLog.java:671) > org.apache.zookeeper.server.ZKDatabase.getTxnSize(ZKDatabase.java:790) > org.apache.zookeeper.server.SyncRequestProcessor.shouldSnapshot(SyncRequestProcessor.java:145) > org.apache.zookeeper.server.SyncRequestProcessor.run(SyncRequestProcessor.java:182) > 2. Within ZKDatabase#append, the current position of the current file is > retrieved, which also involves a system call.
> Call stack: > sun.nio.ch.FileDispatcherImpl.seek(FileDispatcherImpl.java) > sun.nio.ch.FileChannelImpl.position(FileChannelImpl.java:264) > org.apache.zookeeper.server.persistence.FilePadding.padFile(FilePadding.java:76) > org.apache.zookeeper.server.persistence.FileTxnLog.append(FileTxnLog.java:298) > org.apache.zookeeper.server.persistence.FileTxnSnapLog.append(FileTxnSnapLog.java:592) > org.apache.zookeeper.server.ZKDatabase.append(ZKDatabase.java:678) > org.apache.zookeeper.server.SyncRequestProcessor.run(SyncRequestProcessor.java:181) > Therefore, it is best to maintain the current size and position of the > current file ourselves, as this can greatly improve performance. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ZOOKEEPER-4713) ObserverZooKeeperServer.shutdown() is redundant
[ https://issues.apache.org/jira/browse/ZOOKEEPER-4713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ZOOKEEPER-4713: -- Labels: pull-request-available (was: ) > ObserverZooKeeperServer.shutdown() is redundant > --- > > Key: ZOOKEEPER-4713 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4713 > Project: ZooKeeper > Issue Type: Improvement > Components: quorum, server >Affects Versions: 3.5.10, 3.6.3, 3.7.0, 3.8.0, 3.7.1, 3.6.4, 3.8.1 >Reporter: Sirius >Priority: Minor > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > After the > [FIX|https://github.com/apache/zookeeper/commit/66646796c2173423655c7faf2b458b658143e6b5] > of ZOOKEEPER-1796, LearnerZooKeeperServer.shutdown() should be responsible > for the shutdown logic of both the follower and observer. > ObserverZooKeeperServer.shutdown() seems redundant, because it is not in the > call stack of Observer.shutdown(). (Note that FollowerZooKeeperServer does > not have the shutdown() method.) > Related analysis can be found in ZOOKEEPER-4712 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (ZOOKEEPER-4707) Update snappy-java to address multiple CVEs
[ https://issues.apache.org/jira/browse/ZOOKEEPER-4707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mate Szalay-Beko resolved ZOOKEEPER-4707. - Fix Version/s: 3.9.0 3.7.2 3.8.2 Resolution: Fixed Thank you [~lhotari] for raising the issue and doing the fix! I merged it to all active branches; it will soon be released with 3.9.0 and 3.8.2. > Update snappy-java to address multiple CVEs > --- > > Key: ZOOKEEPER-4707 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4707 > Project: ZooKeeper > Issue Type: Task >Reporter: Lari Hotari >Priority: Major > Labels: pull-request-available > Fix For: 3.9.0, 3.7.2, 3.8.2 > > Time Spent: 40m > Remaining Estimate: 0h > > Address multiple CVEs: > CVE-2023-34453 > CVE-2023-34454 > CVE-2023-34455 > See https://github.com/xerial/snappy-java/releases/tag/v1.1.10.1 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ZOOKEEPER-4707) Update snappy-java to address multiple CVEs
[ https://issues.apache.org/jira/browse/ZOOKEEPER-4707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mate Szalay-Beko updated ZOOKEEPER-4707: Affects Version/s: 3.8.1 3.7.1 > Update snappy-java to address multiple CVEs > --- > > Key: ZOOKEEPER-4707 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4707 > Project: ZooKeeper > Issue Type: Task >Affects Versions: 3.7.1, 3.8.1 >Reporter: Lari Hotari >Priority: Major > Labels: pull-request-available > Fix For: 3.9.0, 3.7.2, 3.8.2 > > Time Spent: 40m > Remaining Estimate: 0h > > Address multiple CVEs: > CVE-2023-34453 > CVE-2023-34454 > CVE-2023-34455 > See https://github.com/xerial/snappy-java/releases/tag/v1.1.10.1 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ZOOKEEPER-4643) Committed txns may be improperly truncated if follower crashes right after updating currentEpoch but before persisting txns to disk
[ https://issues.apache.org/jira/browse/ZOOKEEPER-4643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ZOOKEEPER-4643: -- Labels: pull-request-available (was: ) > Committed txns may be improperly truncated if follower crashes right after > updating currentEpoch but before persisting txns to disk > --- > > Key: ZOOKEEPER-4643 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4643 > Project: ZooKeeper > Issue Type: Bug > Components: quorum, server >Affects Versions: 3.6.3, 3.7.0, 3.8.0, 3.7.1, 3.8.1 >Reporter: Sirius >Priority: Critical > Labels: pull-request-available > Attachments: Trace-ZK-4643.pdf > > Time Spent: 10m > Remaining Estimate: 0h > > When a follower is processing the NEWLEADER message in SYNC phase, it will > update its {{_currentEpoch_}} to the file *before* writing the txns (from the > PROPOSALs sent by leader in SYNC) to the log file. Such execution order may > lead to improper truncation of *committed* txns on other servers in later > rounds. > The critical step to trigger this problem is to make a follower node crash > right after it updates its {{_currentEpoch_}} to the file but before writing > the txns to the log file. The potential risk is that, this node with > incomplete committed txns might be later elected as the leader with its > larger {{{}_currentEpoch_{}}}, and then improperly uses TRUNC to ask other > nodes to truncate their committed txns! > > h2. Trace > [^Trace-ZK-4643.pdf] > Here is an example to trigger the bug. (Focus on {{_currentEpoch_}} and > {{{}_lastLoggedZxid_{}}}) > {*}Round 1 (Running nodes with their acceptedEpoch & currentEpoch set to > 1{*}{*}):{*} > - Start the ensemble with three nodes: S{+}0{+}, +S1+ & {+}S2{+}. > - +S2+ is elected leader. > - For all of them, _{{currentEpoch}}_ = 1, {{_lastLoggedZxid_}} (the last > zxid in the log)= <1, 3>, {{_lastProcessedZxid_}} = <1, 3>. > - +S0+ crashes. > - A new txn <1, 4> is logged and committed by +S1+ & {+}S2{+}. 
Then, +S1+ & > +S2+ have {{_lastLoggedZxid_}} = <1, 4>, {{_lastProcessedZxid_}} = <1, 4> . > - Verify clients can read the datatree with latest zxid <1, 4>. > *Round 2* {*}(Running nodes with their acceptedEpoch & currentEpoch set to > 2{*}{*}){*}{*}:{*} > * +S0+ & +S2+ restart, and +S1+ crashes. > * Again, +S2+ is elected leader. > * Then, during the SYNC phase, the leader +S2+ ({{{}_maxCommittedLog_{}}} = > <1, 4>) uses DIFF to sync with the follower +S0+ ({{{}_lastLoggedZxid_{}}} = > <1, 3>), and their {{_currentEpoch_}} will be set to 2 (and written to disk). > * ( Note that the follower +S0+ updates its currentEpoch file before writing > the txns to the log file when receiving NEWLEADER message. ) > * *Unfortunately, right after the follower +S0+ finishes updating its > currentEpoch file, it crashes.* > *Round 3* {*}(Running nodes with their acceptedEpoch & currentEpoch set to > 3{*}{*}){*}{*}:{*} > * +S0+ & +S1+ restart, and +S2+ crashes. > * Since +S0+ has {{_currentEpoch_}} = 2, +S1+ has {{_currentEpoch_}} = 1, > +S0+ will be elected leader. > * During the SYNC phase, the leader +S0+ ({{{}_maxCommittedLog_{}}} = <1, > 3>) will use TRUNC to sync with +S1+ ({{{}_lastLoggedZxid_{}}} = <1, 4>). > Then, +S1+ removes txn <1, 4>. > * ( However, <1, 4> was committed and visible by clients before, and is not > supposed to be truncated! ) > * Verify clients of +S0+ & +S1+ do NOT have the view of txn <1, 4>, a > violation of ZAB. > > Extra note: The trace can be constructed with quorum nodes alive at any > moment with careful time tuning of node crash & restart, e.g., let +S1+ > restart before +S0+ crashes at the end of Round 2. > > h2. Analysis > *Root Cause:* > When a follower updates its current epoch, it should guarantee that it has > already synced the uncommitted txns to the disk (or, taken snapshot). 
> Otherwise, after the current epoch is updated to the file but the history > (transaction log) of the follower is not updated yet, environment failures > might prevent the latter from going on smoothly. It is dangerous for a node > with updated current epoch but stale history to be elected leader. It might > truncate committed txns on other nodes. > > *Property Violation:* > * From the server side, the ensemble deletes a committed
[jira] [Commented] (ZOOKEEPER-4709) Upgrade Netty to 4.1.94.Final
[ https://issues.apache.org/jira/browse/ZOOKEEPER-4709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17739380#comment-17739380 ] Mate Szalay-Beko commented on ZOOKEEPER-4709: - I also pushed it to branch-3.7 > Upgrade Netty to 4.1.94.Final > - > > Key: ZOOKEEPER-4709 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4709 > Project: ZooKeeper > Issue Type: Improvement >Affects Versions: 3.7.1, 3.8.1 >Reporter: Fabio Buso >Priority: Major > Labels: dependency-upgrade, pull-request-available > Fix For: 3.9.0, 3.7.2, 3.8.2 > > Time Spent: 50m > Remaining Estimate: 0h > > [Netty 4.1.94|https://netty.io/news/2023/06/19/4-1-94-Final.html] includes > several improvements and bug fixes, including a resolution for > [CVE-2023-34462|https://github.com/netty/netty/security/advisories/GHSA-6mjq-h674-j845] > related to potential memory allocation vulnerabilities during a TLS > handshake with Server Name Indication. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ZOOKEEPER-4709) Upgrade Netty to 4.1.94.Final
[ https://issues.apache.org/jira/browse/ZOOKEEPER-4709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mate Szalay-Beko updated ZOOKEEPER-4709: Fix Version/s: 3.7.2 > Upgrade Netty to 4.1.94.Final > - > > Key: ZOOKEEPER-4709 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4709 > Project: ZooKeeper > Issue Type: Improvement >Affects Versions: 3.7.1, 3.8.1 >Reporter: Fabio Buso >Priority: Major > Labels: dependency-upgrade, pull-request-available > Fix For: 3.9.0, 3.7.2, 3.8.2 > > Time Spent: 50m > Remaining Estimate: 0h > > [Netty 4.1.94|https://netty.io/news/2023/06/19/4-1-94-Final.html] includes > several improvements and bug fixes, including a resolution for > [CVE-2023-34462|https://github.com/netty/netty/security/advisories/GHSA-6mjq-h674-j845] > related to potential memory allocation vulnerabilities during a TLS > handshake with Server Name Indication. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (ZOOKEEPER-4716) upgrade jackson to 2.15.2, suppress two false positive CVE errors
[ https://issues.apache.org/jira/browse/ZOOKEEPER-4716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mate Szalay-Beko resolved ZOOKEEPER-4716. - Fix Version/s: 3.9.0 3.7.2 3.8.2 Resolution: Done > upgrade jackson to 2.15.2, suppress two false positive CVE errors > - > > Key: ZOOKEEPER-4716 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4716 > Project: ZooKeeper > Issue Type: Improvement >Affects Versions: 3.8.1 >Reporter: Mate Szalay-Beko >Assignee: Mate Szalay-Beko >Priority: Major > Labels: pull-request-available > Fix For: 3.9.0, 3.7.2, 3.8.2 > > Time Spent: 40m > Remaining Estimate: 0h > > Our jackson is quite old, I want to upgrade it before release 3.8.2. > Also we have a few false positive CVEs reported by OWASP: > * CVE-2023-35116: according to jackson community, this is not a security > issue, see > [https://github.com/FasterXML/jackson-databind/issues/3972#issuecomment-1596193098] > * CVE-2022-45688: the following CVE is not even jackson related, but a > vulnerability in json-java which we don't use in ZooKeeper > > {code:java} > [INFO] Finished at: 2023-06-30T13:23:38+02:00 > [INFO] > > [ERROR] Failed to execute goal org.owasp:dependency-check-maven:7.1.0:check > (default-cli) on project zookeeper: > [ERROR] > [ERROR] One or more dependencies were identified with vulnerabilities that > have a CVSS score greater than or equal to '0.0': > [ERROR] > [ERROR] jackson-core-2.13.4.jar: CVE-2022-45688(7.5) > [ERROR] jackson-databind-2.13.4.2.jar: CVE-2023-35116(7.5) > {code} > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (ZOOKEEPER-4709) Upgrade Netty to 4.1.94.Final
[ https://issues.apache.org/jira/browse/ZOOKEEPER-4709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mate Szalay-Beko resolved ZOOKEEPER-4709. - Resolution: Done [~siroibaf], thank you for the contribution! The fix got merged to branch-3.8 and master. > Upgrade Netty to 4.1.94.Final > - > > Key: ZOOKEEPER-4709 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4709 > Project: ZooKeeper > Issue Type: Improvement >Affects Versions: 3.7.1, 3.8.1 >Reporter: Fabio Buso >Priority: Major > Labels: dependency-upgrade, pull-request-available > Fix For: 3.9.0, 3.8.2 > > Time Spent: 40m > Remaining Estimate: 0h > > [Netty 4.1.94|https://netty.io/news/2023/06/19/4-1-94-Final.html] includes > several improvements and bug fixes, including a resolution for > [CVE-2023-34462|https://github.com/netty/netty/security/advisories/GHSA-6mjq-h674-j845] > related to potential memory allocation vulnerabilities during a TLS > handshake with Server Name Indication. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ZOOKEEPER-4709) Upgrade Netty to 4.1.94.Final
[ https://issues.apache.org/jira/browse/ZOOKEEPER-4709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mate Szalay-Beko updated ZOOKEEPER-4709: Fix Version/s: 3.9.0 3.8.2 > Upgrade Netty to 4.1.94.Final > - > > Key: ZOOKEEPER-4709 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4709 > Project: ZooKeeper > Issue Type: Improvement >Affects Versions: 3.7.1, 3.8.1 >Reporter: Fabio Buso >Priority: Major > Labels: dependency-upgrade, pull-request-available > Fix For: 3.9.0, 3.8.2 > > Time Spent: 40m > Remaining Estimate: 0h > > [Netty 4.1.94|https://netty.io/news/2023/06/19/4-1-94-Final.html] includes > several improvements and bug fixes, including a resolution for > [CVE-2023-34462|https://github.com/netty/netty/security/advisories/GHSA-6mjq-h674-j845] > related to potential memory allocation vulnerabilities during a TLS > handshake with Server Name Indication. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ZOOKEEPER-4712) Follower.shutdown() and Observer.shutdown() do not correctly shutdown the syncProcessor, which may lead to data inconsistency
[ https://issues.apache.org/jira/browse/ZOOKEEPER-4712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sirius updated ZOOKEEPER-4712: -- Description: Follower.shutdown() and Observer.shutdown() do not correctly shutdown the syncProcessor. It may lead to potential data inconsistency (see {*}Potential Risk{*}). A follower / observer will invoke syncProcessor.shutdown() in LearnerZooKeeperServer.shutdown() / ObserverZooKeeperServer.shutdown(), respectively. However, after the [FIX|https://github.com/apache/zookeeper/commit/efbd660e1c4b90a8f538f2cccb5dcb7094cf9a22] of ZOOKEEPER-3642, Follower.shutdown() / Observer.shutdown() will not invoke LearnerZooKeeperServer.shutdown() / ObserverZooKeeperServer.shutdown() anymore. h2. Call stack h5. Version 3.8.1 / 3.8.0 / 3.7.1 / 3.7.0 / 3.6.4 / 3.6.3 / 3.5.10 ... * *(Buggy)* Observer.shutdown() -> Learner.shutdown() -> ZooKeeperServer.shutdown(boolean) * *(Buggy)* Follower.shutdown() -> Learner.shutdown() -> ZooKeeperServer.shutdown(boolean) * (For comparison) Leader.shutdown(String) -> LeaderZooKeeper.shutdown() -> ZooKeeperServer.shutdown() -> ZooKeeperServer.shutdown(boolean) h5. For comparison, in version 3.4.X, * Observer.shutdown() -> Learner.shutdown() -> {*}ObserverZooKeeperServer.shutdown() -{*}> ZooKeeperServer.shutdown() -> ZooKeeperServer.shutdown(boolean) * Follower.shutdown() -> Learner.shutdown() -> {*}FollowerZooKeeperServer.shutdown() -{*}> ZooKeeperServer.shutdown() -> ZooKeeperServer.shutdown(boolean) h2. Code Details Take version 3.8.0 as an example. In Follower.shutdown() : {code:java} public void shutdown() { LOG.info("shutdown Follower"); + // invoke Learner.shutdown() super.shutdown(); } {code} In Learner.java: {code:java} public void shutdown() { ... 
// shutdown previous zookeeper if (zk != null) { // If we haven't finished SNAP sync, force fully shutdown // to avoid potential inconsistency + // This will invoke ZooKeeperServer.shutdown(boolean), + // which will not shutdown syncProcessor + // Before the fix of ZOOKEEPER-3642, + // FollowerZooKeeperServer.shutdown() will be invoked here zk.shutdown(self.getSyncMode().equals(QuorumPeer.SyncMode.SNAP)); } } {code} In ZooKeeperServer.java: {code:java} public synchronized void shutdown(boolean fullyShutDown) { ... if (firstProcessor != null) { + // For a follower, this will not shutdown its syncProcessor. firstProcessor.shutdown(); } ... } {code} In expectation, Follower.shutdown() should invoke LearnerZooKeeperServer.shutdown() to shutdown the syncProcessor: {code:java} public synchronized void shutdown() { ... try { + // shutdown the syncProcessor here if (syncProcessor != null) { syncProcessor.shutdown(); } } ... } {code} Observer.shutdown() has a similar problem. h2. Potential Risk When Follower.shutdown() is called, the follower's QuorumPeer thread may update the lastProcessedZxid for the election and recovery phase before its syncThread drains the pending requests and flushes them to disk. In consequence, this lastProcessedZxid is not the latest zxid in its log, leading to log inconsistency after the SYNC phase. (Similar to the symptoms of ZOOKEEPER-2845.) was: Follower.shutdown() and Observer.shutdown() do not correctly shutdown the syncProcessor. It may lead to potential data inconsistency (see {*}Potential Risk{*}). A follower / observer will invoke syncProcessor.shutdown() in LearnerZooKeeperServer.shutdown() / ObserverZooKeeperServer.shutdown(), respectively. However, after the [FIX|https://github.com/apache/zookeeper/commit/efbd660e1c4b90a8f538f2cccb5dcb7094cf9a22] of ZOOKEEPER-3642, Follower.shutdown() / Observer.shutdown() will not invoke LearnerZooKeeperServer.shutdown() / ObserverZooKeeperServer.shutdown() anymore. h2. Call stack h5. 
Version 3.8.1 / 3.8.0 / 3.7.1 / 3.7.0 / 3.6.4 / 3.6.3 / 3.5.10 ... * *(Buggy)* Observer.shutdown() -> Learner.shutdown() -> ZooKeeperServer.shutdown(boolean) * *(Buggy)* Follower.shutdown() -> Learner.shutdown() -> ZooKeeperServer.shutdown(boolean) * (For comparison) Leader.shutdown(String) -> LeaderZooKeeper.shutdown() -> ZooKeeperServer.shutdown() -> ZooKeeperServer.shutdown(boolean) h5. For comparison, in version 3.4.X, * Observer.shutdown() -> Learner.shutdown() -> {*}ObserverZooKeeperServer.shutdown() -{*}> ZooKeeperServer.shutdown() -> ZooKeeperServer.shutdown(boolean) * Follower.shutdown() -> Learner.shutdown() -> {*}FollowerZooKeeperServer.shutdown() -{*}> ZooKeeperServer.shutd
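The risk described in the report boils down to an ordering problem: a value like lastProcessedZxid is read before the sync thread has drained and flushed its pending queue. A toy model of the drain-then-read ordering the report asks for (this is not ZooKeeper code; the class and field names here are illustrative):

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Toy model of a sync processor: shutdown() must drain the pending
// queue and join the worker BEFORE anyone reads lastFlushedZxid,
// otherwise the reader can observe a stale value.
public class SyncShutdownSketch {
    static final long SHUTDOWN = Long.MIN_VALUE; // poison pill

    final BlockingQueue<Long> queue = new LinkedBlockingQueue<>();
    volatile long lastFlushedZxid = -1;

    final Thread syncThread = new Thread(() -> {
        try {
            while (true) {
                long zxid = queue.take();
                if (zxid == SHUTDOWN) break;
                lastFlushedZxid = zxid; // stands in for "flush txn to disk"
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    });

    void start() { syncThread.start(); }

    void append(long zxid) { queue.add(zxid); }

    // Correct shutdown: enqueue the pill, then wait for the worker to
    // finish every pending txn. Skipping the join is the analogue of
    // skipping syncProcessor.shutdown() in the bug report.
    void shutdown() throws InterruptedException {
        queue.add(SHUTDOWN);
        syncThread.join();
    }

    public static void main(String[] args) throws Exception {
        SyncShutdownSketch s = new SyncShutdownSketch();
        s.start();
        for (long z = 1; z <= 1000; z++) s.append(z);
        s.shutdown();
        // Only safe to read after shutdown() has drained the queue.
        System.out.println(s.lastFlushedZxid); // 1000
    }
}
```

Without the join, the FIFO queue may still hold unflushed txns when lastFlushedZxid is read, which is the toy analogue of a stale lastProcessedZxid entering leader election.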
[jira] [Updated] (ZOOKEEPER-4716) upgrade jackson to 2.15.2, suppress two false positive CVE errors
[ https://issues.apache.org/jira/browse/ZOOKEEPER-4716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ZOOKEEPER-4716: -- Labels: pull-request-available (was: ) > upgrade jackson to 2.15.2, suppress two false positive CVE errors > - > > Key: ZOOKEEPER-4716 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4716 > Project: ZooKeeper > Issue Type: Improvement >Affects Versions: 3.8.1 >Reporter: Mate Szalay-Beko >Assignee: Mate Szalay-Beko >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > Our jackson is quite old, I want to upgrade it before release 3.8.2. > Also we have a few false positive CVEs reported by OWASP: > * CVE-2023-35116: according to jackson community, this is not a security > issue, see > [https://github.com/FasterXML/jackson-databind/issues/3972#issuecomment-1596193098] > * CVE-2022-45688: the following CVE is not even jackson related, but a > vulnerability in json-java which we don't use in ZooKeeper > > {code:java} > [INFO] Finished at: 2023-06-30T13:23:38+02:00 > [INFO] > > [ERROR] Failed to execute goal org.owasp:dependency-check-maven:7.1.0:check > (default-cli) on project zookeeper: > [ERROR] > [ERROR] One or more dependencies were identified with vulnerabilities that > have a CVSS score greater than or equal to '0.0': > [ERROR] > [ERROR] jackson-core-2.13.4.jar: CVE-2022-45688(7.5) > [ERROR] jackson-databind-2.13.4.2.jar: CVE-2023-35116(7.5) > {code} > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ZOOKEEPER-4716) upgrade jackson to 2.15.2, suppress two false positive CVE errors
[ https://issues.apache.org/jira/browse/ZOOKEEPER-4716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mate Szalay-Beko updated ZOOKEEPER-4716: Description: Our jackson is quite old, I want to upgrade it before release 3.8.2. Also we have a few false positive CVEs reported by OWASP: * CVE-2023-35116: according to jackson community, this is not a security issue, see [https://github.com/FasterXML/jackson-databind/issues/3972#issuecomment-1596193098] * CVE-2022-45688: the following CVE is not even jackson related, but a vulnerability in json-java which we don't use in ZooKeeper {code:java} [INFO] Finished at: 2023-06-30T13:23:38+02:00 [INFO] [ERROR] Failed to execute goal org.owasp:dependency-check-maven:7.1.0:check (default-cli) on project zookeeper: [ERROR] [ERROR] One or more dependencies were identified with vulnerabilities that have a CVSS score greater than or equal to '0.0': [ERROR] [ERROR] jackson-core-2.13.4.jar: CVE-2022-45688(7.5) [ERROR] jackson-databind-2.13.4.2.jar: CVE-2023-35116(7.5) {code} was: {code:java} [INFO] Finished at: 2023-06-30T13:23:38+02:00 [INFO] [ERROR] Failed to execute goal org.owasp:dependency-check-maven:7.1.0:check (default-cli) on project zookeeper: [ERROR] [ERROR] One or more dependencies were identified with vulnerabilities that have a CVSS score greater than or equal to '0.0': [ERROR] [ERROR] jackson-core-2.13.4.jar: CVE-2022-45688(7.5) [ERROR] jackson-databind-2.13.4.2.jar: CVE-2023-35116(7.5) {code} > upgrade jackson to 2.15.2, suppress two false positive CVE errors > - > > Key: ZOOKEEPER-4716 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4716 > Project: ZooKeeper > Issue Type: Improvement >Affects Versions: 3.8.1 >Reporter: Mate Szalay-Beko >Assignee: Mate Szalay-Beko >Priority: Major > > Our jackson is quite old, I want to upgrade it before release 3.8.2. 
> Also we have a few false positive CVEs reported by OWASP: > * CVE-2023-35116: according to jackson community, this is not a security > issue, see > [https://github.com/FasterXML/jackson-databind/issues/3972#issuecomment-1596193098] > * CVE-2022-45688: the following CVE is not even jackson related, but a > vulnerability in json-java which we don't use in ZooKeeper > > {code:java} > [INFO] Finished at: 2023-06-30T13:23:38+02:00 > [INFO] > > [ERROR] Failed to execute goal org.owasp:dependency-check-maven:7.1.0:check > (default-cli) on project zookeeper: > [ERROR] > [ERROR] One or more dependencies were identified with vulnerabilities that > have a CVSS score greater than or equal to '0.0': > [ERROR] > [ERROR] jackson-core-2.13.4.jar: CVE-2022-45688(7.5) > [ERROR] jackson-databind-2.13.4.2.jar: CVE-2023-35116(7.5) > {code} > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ZOOKEEPER-4716) upgrade jackson to 2.15.2, suppress two false positive CVE errors
[ https://issues.apache.org/jira/browse/ZOOKEEPER-4716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mate Szalay-Beko updated ZOOKEEPER-4716: Summary: upgrade jackson to 2.15.2, suppress two false positive CVE errors (was: Fix jackson related CVEs: CVE-2022-45688, CVE-2023-35116) > upgrade jackson to 2.15.2, suppress two false positive CVE errors > - > > Key: ZOOKEEPER-4716 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4716 > Project: ZooKeeper > Issue Type: Improvement >Affects Versions: 3.8.1 >Reporter: Mate Szalay-Beko >Assignee: Mate Szalay-Beko >Priority: Major > > {code:java} > [INFO] Finished at: 2023-06-30T13:23:38+02:00 > [INFO] > > [ERROR] Failed to execute goal org.owasp:dependency-check-maven:7.1.0:check > (default-cli) on project zookeeper: > [ERROR] > [ERROR] One or more dependencies were identified with vulnerabilities that > have a CVSS score greater than or equal to '0.0': > [ERROR] > [ERROR] jackson-core-2.13.4.jar: CVE-2022-45688(7.5) > [ERROR] jackson-databind-2.13.4.2.jar: CVE-2023-35116(7.5) > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ZOOKEEPER-4716) Fix jackson related CVEs: CVE-2022-45688, CVE-2023-35116
[ https://issues.apache.org/jira/browse/ZOOKEEPER-4716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mate Szalay-Beko updated ZOOKEEPER-4716: Description: {code:java} [INFO] Finished at: 2023-06-30T13:23:38+02:00 [INFO] [ERROR] Failed to execute goal org.owasp:dependency-check-maven:7.1.0:check (default-cli) on project zookeeper: [ERROR] [ERROR] One or more dependencies were identified with vulnerabilities that have a CVSS score greater than or equal to '0.0': [ERROR] [ERROR] jackson-core-2.13.4.jar: CVE-2022-45688(7.5) [ERROR] jackson-databind-2.13.4.2.jar: CVE-2023-35116(7.5) {code} > Fix jackson related CVEs: CVE-2022-45688, CVE-2023-35116 > > > Key: ZOOKEEPER-4716 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4716 > Project: ZooKeeper > Issue Type: Improvement >Affects Versions: 3.8.1 >Reporter: Mate Szalay-Beko >Assignee: Mate Szalay-Beko >Priority: Major > > {code:java} > [INFO] Finished at: 2023-06-30T13:23:38+02:00 > [INFO] > > [ERROR] Failed to execute goal org.owasp:dependency-check-maven:7.1.0:check > (default-cli) on project zookeeper: > [ERROR] > [ERROR] One or more dependencies were identified with vulnerabilities that > have a CVSS score greater than or equal to '0.0': > [ERROR] > [ERROR] jackson-core-2.13.4.jar: CVE-2022-45688(7.5) > [ERROR] jackson-databind-2.13.4.2.jar: CVE-2023-35116(7.5) > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ZOOKEEPER-4716) Fix jackson related CVEs: CVE-2022-45688, CVE-2023-35116
[ https://issues.apache.org/jira/browse/ZOOKEEPER-4716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mate Szalay-Beko updated ZOOKEEPER-4716: Affects Version/s: 3.8.1 > Fix jackson related CVEs: CVE-2022-45688, CVE-2023-35116 > > > Key: ZOOKEEPER-4716 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4716 > Project: ZooKeeper > Issue Type: Improvement >Affects Versions: 3.8.1 >Reporter: Mate Szalay-Beko >Assignee: Mate Szalay-Beko >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ZOOKEEPER-4716) Fix jackson related CVEs: CVE-2022-45688, CVE-2023-35116
Mate Szalay-Beko created ZOOKEEPER-4716: --- Summary: Fix jackson related CVEs: CVE-2022-45688, CVE-2023-35116 Key: ZOOKEEPER-4716 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4716 Project: ZooKeeper Issue Type: Improvement Reporter: Mate Szalay-Beko Assignee: Mate Szalay-Beko -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (ZOOKEEPER-4628) CVE-2022-42003 CVE-2022-42004 HIGH: upgrade jackson-databind-2.13.3.jar to 2.13.4.1
[ https://issues.apache.org/jira/browse/ZOOKEEPER-4628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mate Szalay-Beko resolved ZOOKEEPER-4628. - Resolution: Duplicate Thank you [~ivodujmovic] for reporting this issue and submitting a PR! I see that in the meantime this was fixed by ZOOKEEPER-4661. (Of course, since then we have had another CVE, but I will take care of that in a separate ticket.) > CVE-2022-42003 CVE-2022-42004 HIGH: upgrade jackson-databind-2.13.3.jar to > 2.13.4.1 > --- > > Key: ZOOKEEPER-4628 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4628 > Project: ZooKeeper > Issue Type: Task > Components: security >Affects Versions: 3.5.10, 3.8.0, 3.7.1 >Reporter: Ivo Dujmovic >Priority: Critical > Labels: pull-request-available > Time Spent: 40m > Remaining Estimate: 0h > > Two High issues > [https://nvd.nist.gov/vuln/detail/CVE-2022-42003] > [https://nvd.nist.gov/vuln/detail/CVE-2022-42004] > affect jackson version 2.13.3, which zk should update to 2.13.4.1. > Other projects have done this, but ZooKeeper has not. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ZOOKEEPER-4715) Verify file size and position in testGetCurrentLogSize.
[ https://issues.apache.org/jira/browse/ZOOKEEPER-4715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ZOOKEEPER-4715: -- Labels: pull-request-available (was: ) > Verify file size and position in testGetCurrentLogSize. > --- > > Key: ZOOKEEPER-4715 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4715 > Project: ZooKeeper > Issue Type: Wish > Components: server >Affects Versions: 3.8.1 >Reporter: Yan Zhao >Priority: Major > Labels: pull-request-available > Fix For: 3.9.0, 3.8.2 > > Time Spent: 10m > Remaining Estimate: 0h > > This is pre-PR for ZOOKEEPER-4714. > In ZOOKEEPER-4714, we maintain fileSize and filePosition ourselves and we > want our values to match the original values. Therefore, we added checks for > fileSize and filePosition in our tests. After adding the checks, we used a > new method to retrieve fileSize and filePosition in ZOOKEEPER-4714 and tested > whether the tests can still pass. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ZOOKEEPER-4715) Verify file size and position in testGetCurrentLogSize.
Yan Zhao created ZOOKEEPER-4715: --- Summary: Verify file size and position in testGetCurrentLogSize. Key: ZOOKEEPER-4715 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4715 Project: ZooKeeper Issue Type: Wish Components: server Affects Versions: 3.8.1 Reporter: Yan Zhao Fix For: 3.9.0, 3.8.2 This is pre-PR for ZOOKEEPER-4714. In ZOOKEEPER-4714, we maintain fileSize and filePosition ourselves and we want our values to match the original values. Therefore, we added checks for fileSize and filePosition in our tests. After adding the checks, we used a new method to retrieve fileSize and filePosition in ZOOKEEPER-4714 and tested whether the tests can still pass. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ZOOKEEPER-4714) Improve syncRequestProcessor performance
[ https://issues.apache.org/jira/browse/ZOOKEEPER-4714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ZOOKEEPER-4714: -- Labels: pull-request-available (was: ) > Improve syncRequestProcessor performance > > > Key: ZOOKEEPER-4714 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4714 > Project: ZooKeeper > Issue Type: Wish > Components: server >Affects Versions: 3.8.1 >Reporter: Yan Zhao >Priority: Major > Labels: pull-request-available > Fix For: 3.9.0, 3.8.2 > > Attachments: 761688051587_.pic.jpg > > Time Spent: 10m > Remaining Estimate: 0h > > In the SyncRequestProcessor, a write operation is performed for each write > request. Two methods are relatively time-consuming. > 1. Within SyncRequestProcessor#shouldSnapshot, the current size of the > current file is retrieved, which involves a system call. > Call stack: > java.io.File.length(File.java) > org.apache.zookeeper.server.persistence.FileTxnLog.getCurrentLogSize(FileTxnLog.java:211) > org.apache.zookeeper.server.persistence.FileTxnLog.getTotalLogSize(FileTxnLog.java:221) > org.apache.zookeeper.server.persistence.FileTxnSnapLog.getTotalLogSize(FileTxnSnapLog.java:671) > org.apache.zookeeper.server.ZKDatabase.getTxnSize(ZKDatabase.java:790) > org.apache.zookeeper.server.SyncRequestProcessor.shouldSnapshot(SyncRequestProcessor.java:145) > org.apache.zookeeper.server.SyncRequestProcessor.run(SyncRequestProcessor.java:182) > 2. Within ZKDatabase#append, the current position of the current file is > retrieved, which also involves a system call. 
> Call stack: > sun.nio.ch.FileDispatcherImpl.seek(FileDispatcherImpl.java) > sun.nio.ch.FileChannelImpl.position(FileChannelImpl.java:264) > org.apache.zookeeper.server.persistence.FilePadding.padFile(FilePadding.java:76) > org.apache.zookeeper.server.persistence.FileTxnLog.append(FileTxnLog.java:298) > org.apache.zookeeper.server.persistence.FileTxnSnapLog.append(FileTxnSnapLog.java:592) > org.apache.zookeeper.server.ZKDatabase.append(ZKDatabase.java:678) > org.apache.zookeeper.server.SyncRequestProcessor.run(SyncRequestProcessor.java:181) > Therefore, it is best to maintain the current size and position of the > current file ourselves, as this can greatly improve performance. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ZOOKEEPER-4714) Improve syncRequestProcessor performance
Yan Zhao created ZOOKEEPER-4714: --- Summary: Improve syncRequestProcessor performance Key: ZOOKEEPER-4714 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4714 Project: ZooKeeper Issue Type: Wish Components: server Affects Versions: 3.8.1 Reporter: Yan Zhao Fix For: 3.9.0, 3.8.2 Attachments: 761688051587_.pic.jpg In the SyncRequestProcessor, a write operation is performed for each write request. Two methods are relatively time-consuming. 1. Within SyncRequestProcessor#shouldSnapshot, the current size of the current file is retrieved, which involves a system call. Call stack: java.io.File.length(File.java) org.apache.zookeeper.server.persistence.FileTxnLog.getCurrentLogSize(FileTxnLog.java:211) org.apache.zookeeper.server.persistence.FileTxnLog.getTotalLogSize(FileTxnLog.java:221) org.apache.zookeeper.server.persistence.FileTxnSnapLog.getTotalLogSize(FileTxnSnapLog.java:671) org.apache.zookeeper.server.ZKDatabase.getTxnSize(ZKDatabase.java:790) org.apache.zookeeper.server.SyncRequestProcessor.shouldSnapshot(SyncRequestProcessor.java:145) org.apache.zookeeper.server.SyncRequestProcessor.run(SyncRequestProcessor.java:182) 2. Within ZKDatabase#append, the current position of the current file is retrieved, which also involves a system call. Call stack: sun.nio.ch.FileDispatcherImpl.seek(FileDispatcherImpl.java) sun.nio.ch.FileChannelImpl.position(FileChannelImpl.java:264) org.apache.zookeeper.server.persistence.FilePadding.padFile(FilePadding.java:76) org.apache.zookeeper.server.persistence.FileTxnLog.append(FileTxnLog.java:298) org.apache.zookeeper.server.persistence.FileTxnSnapLog.append(FileTxnSnapLog.java:592) org.apache.zookeeper.server.ZKDatabase.append(ZKDatabase.java:678) org.apache.zookeeper.server.SyncRequestProcessor.run(SyncRequestProcessor.java:181) Therefore, it is best to maintain the current size and position of the current file ourselves, as this can greatly improve performance. -- This message was sent by Atlassian Jira (v8.20.10#820010)
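The optimization direction described in ZOOKEEPER-4714, maintaining the file size and position in memory instead of issuing a system call per request, can be sketched with a counting stream wrapper. This is illustrative only (the class is not ZooKeeper's; it just shows the bookkeeping idea):

```java
import java.io.ByteArrayOutputStream;
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;

// Sketch: count bytes as they are written, so the "current log size"
// is a field read instead of a File.length() / FileChannel.position()
// system call on every append.
public class CountingOutputStream extends FilterOutputStream {
    private long position;

    public CountingOutputStream(OutputStream out, long initialPosition) {
        super(out);
        this.position = initialPosition;
    }

    @Override
    public void write(int b) throws IOException {
        out.write(b);
        position++;
    }

    @Override
    public void write(byte[] b, int off, int len) throws IOException {
        // Delegate directly; FilterOutputStream's default would loop
        // over write(int), which is both slow and double-counts here.
        out.write(b, off, len);
        position += len;
    }

    public long position() { return position; }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        CountingOutputStream log = new CountingOutputStream(sink, 0);
        log.write(new byte[128], 0, 128); // a padded txn record, say
        log.write(7);                     // one more byte
        System.out.println(log.position()); // 129, no syscall needed
    }
}
```

This matches the test strategy of ZOOKEEPER-4715: the self-maintained counter must always agree with what the file itself would report.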
[jira] [Updated] (ZOOKEEPER-4712) Follower.shutdown() and Observer.shutdown() do not correctly shutdown the syncProcessor, which may lead to data inconsistency
[ https://issues.apache.org/jira/browse/ZOOKEEPER-4712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sirius updated ZOOKEEPER-4712: -- Description: Follower.shutdown() and Observer.shutdown() do not correctly shutdown the syncProcessor. It may lead to potential data inconsistency (see {*}Potential Risk{*}). A follower / observer will invoke syncProcessor.shutdown() in LearnerZooKeeperServer.shutdown() / ObserverZooKeeperServer.shutdown(), respectively. However, after the [FIX|https://github.com/apache/zookeeper/commit/efbd660e1c4b90a8f538f2cccb5dcb7094cf9a22] of ZOOKEEPER-3642, Follower.shutdown() / Observer.shutdown() will not invoke LearnerZooKeeperServer.shutdown() / ObserverZooKeeperServer.shutdown() anymore. h2. Call stack h5. Version 3.8.1 / 3.8.0 / 3.7.1 / 3.7.0 / 3.6.4 / 3.6.3 / 3.5.10 ... * *(Buggy)* Observer.shutdown() -> Learner.shutdown() -> ZooKeeperServer.shutdown(boolean) * *(Buggy)* Follower.shutdown() -> Learner.shutdown() -> ZooKeeperServer.shutdown(boolean) * (For comparison) Leader.shutdown(String) -> LeaderZooKeeper.shutdown() -> ZooKeeperServer.shutdown() -> ZooKeeperServer.shutdown(boolean) h5. For comparison, in version 3.4.X, * Observer.shutdown() -> Learner.shutdown() -> {*}ObserverZooKeeperServer.shutdown() -{*}> ZooKeeperServer.shutdown() -> ZooKeeperServer.shutdown(boolean) * Follower.shutdown() -> Learner.shutdown() -> {*}FollowerZooKeeperServer.shutdown() -{*}> ZooKeeperServer.shutdown() -> ZooKeeperServer.shutdown(boolean) h2. Code Details Take version 3.8.0 as an example. In Follower.shutdown() : {code:java} public void shutdown() { LOG.info("shutdown Follower"); + // invoke Learner.shutdown() super.shutdown(); } {code} In Learner.java: {code:java} public void shutdown() { ... 
// shutdown previous zookeeper if (zk != null) { // If we haven't finished SNAP sync, force fully shutdown // to avoid potential inconsistency + // This will invoke ZooKeeperServer.shutdown(boolean), + // which will not shutdown syncProcessor + // Before the fix of ZOOKEEPER-3642, + // FollowerZooKeeperServer.shutdown() will be invoked here zk.shutdown(self.getSyncMode().equals(QuorumPeer.SyncMode.SNAP)); } } {code} In ZooKeeperServer.java: {code:java} public synchronized void shutdown(boolean fullyShutDown) { ... if (firstProcessor != null) { + // For a follower, this will not shutdown its syncProcessor. firstProcessor.shutdown(); } ... } {code} In expectation, Follower.shutdown() should invoke LearnerZooKeeperServer.shutdown() to shutdown the syncProcessor: {code:java} public synchronized void shutdown() { ... try { + // shutdown the syncProcessor here if (syncProcessor != null) { syncProcessor.shutdown(); } } ... } {code} Observer.shutdown() has a similar problem. h2. Potential Risk When Follower.shutdown() is called, the follower's QuorumPeer thread may invoke fastForwardDataBase() and update the lastProcessedZxid for the election and recovery phase before its syncThread drains the pending requests and flushes them to disk. In consequence, this lastProcessedZxid is not the latest zxid in its log, leading to log inconsistency after the SYNC phase. (Similar to the symptoms of ZOOKEEPER-2845.) was: Follower.shutdown() and Observer.shutdown() do not correctly shutdown the syncProcessor. It may lead to potential data inconsistency (see {*}Potential Risk{*}). A follower / observer will invoke syncProcessor.shutdown() in LearnerZooKeeperServer.shutdown() / ObserverZooKeeperServer.shutdown(), respectively. However, after the [FIX|https://github.com/apache/zookeeper/commit/efbd660e1c4b90a8f538f2cccb5dcb7094cf9a22] of ZOOKEEPER-3642, Follower.shutdown() / Observer.shutdown() will not invoke LearnerZooKeeperServer.shutdown() / ObserverZooKeeperServer.shutdown() anymore. 
h2. Call stack h5. Version 3.8.1 / 3.8.0 / 3.7.1 / 3.7.0 / 3.6.4 / 3.6.3 / 3.5.10 ... * *(Buggy)* Observer.shutdown() -> Learner.shutdown() -> ZooKeeperServer.shutdown(boolean) * *(Buggy)* Follower.shutdown() -> Learner.shutdown() -> ZooKeeperServer.shutdown(boolean) * (For comparison) Leader.shutdown(String) -> LeaderZooKeeper.shutdown() -> ZooKeeperServer.shutdown() -> ZooKeeperServer.shutdown(boolean) h5. For comparison, in version 3.4.X, * Observer.shutdown() -> Learner.shutdown() -> {*}ObserverZooKeeperServer.shutdown() -{*}> ZooKeeperServer.shutdown() -> ZooKeeperServer.shutdown(boolean) * Follower.shutdown() -> Learner.shutdown() -> {*}FollowerZooKeeperServer.shutdown() -{*}> Z
[jira] [Updated] (ZOOKEEPER-4712) Follower.shutdown() and Observer.shutdown() do not correctly shutdown the syncProcessor, which may lead to data inconsistency
[ https://issues.apache.org/jira/browse/ZOOKEEPER-4712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sirius updated ZOOKEEPER-4712: -- Description: Follower.shutdown() and Observer.shutdown() do not correctly shutdown the syncProcessor. It may lead to potential data inconsistency (see {*}Potential Risk{*}). A follower / observer will invoke syncProcessor.shutdown() in LearnerZooKeeperServer.shutdown() / ObserverZooKeeperServer.shutdown(), respectively. However, after the [FIX|https://github.com/apache/zookeeper/commit/efbd660e1c4b90a8f538f2cccb5dcb7094cf9a22] of ZOOKEEPER-3642, Follower.shutdown() / Observer.shutdown() will not invoke LearnerZooKeeperServer.shutdown() / ObserverZooKeeperServer.shutdown() anymore. h2. Call stack h5. Version 3.8.1 / 3.8.0 / 3.7.1 / 3.7.0 / 3.6.4 / 3.6.3 / 3.5.10 ... * *(Buggy)* Observer.shutdown() -> Learner.shutdown() -> ZooKeeperServer.shutdown(boolean) * *(Buggy)* Follower.shutdown() -> Learner.shutdown() -> ZooKeeperServer.shutdown(boolean) * (For comparison) Leader.shutdown(String) -> LeaderZooKeeperServer.shutdown() -> ZooKeeperServer.shutdown() -> ZooKeeperServer.shutdown(boolean) h5. For comparison, in version 3.4.X, * Observer.shutdown() -> Learner.shutdown() -> {*}ObserverZooKeeperServer.shutdown(){*} -> ZooKeeperServer.shutdown() -> ZooKeeperServer.shutdown(boolean) * Follower.shutdown() -> Learner.shutdown() -> {*}FollowerZooKeeperServer.shutdown(){*} -> ZooKeeperServer.shutdown() -> ZooKeeperServer.shutdown(boolean) h2. Code Details Take version 3.8.0 as an example. In Follower.shutdown() : {code:java} public void shutdown() { LOG.info("shutdown Follower"); + // invoke Learner.shutdown() super.shutdown(); } {code} In Learner.java: {code:java} public void shutdown() { ... 
// shutdown previous zookeeper if (zk != null) { // If we haven't finished SNAP sync, force fully shutdown // to avoid potential inconsistency + // This will invoke ZooKeeperServer.shutdown(boolean), + // which will not shutdown syncProcessor + // Before the fix of ZOOKEEPER-3642, + // FollowerZooKeeperServer.shutdown() will be invoked here zk.shutdown(self.getSyncMode().equals(QuorumPeer.SyncMode.SNAP)); } } {code} In ZooKeeperServer.java: {code:java} public synchronized void shutdown(boolean fullyShutDown) { ... if (firstProcessor != null) { + // For a follower, this will not shutdown its syncProcessor. firstProcessor.shutdown(); } ... } {code} In expectation, Follower.shutdown() should invoke LearnerZooKeeperServer.shutdown() to shutdown the syncProcessor. The QuorumPeer thread should wait for the exit of syncThread before going back to the LOOKING state: {code:java} public synchronized void shutdown() { ... try { + // shutdown the syncProcessor here if (syncProcessor != null) { syncProcessor.shutdown(); } } ... } {code} Observer.shutdown() has a similar problem. h2. Potential Risk When Follower.shutdown() is called, the follower's QuorumPeer thread may invoke fastForwardDataBase() and update the lastProcessedZxid for the election and recovery phase before its syncThread drains the pending requests and flushes them to disk. As a consequence, this lastProcessedZxid is not the latest zxid in its log, leading to log inconsistency after the SYNC phase. (Similar to the symptoms of ZOOKEEPER-2845.) was: Follower.shutdown() and Observer.shutdown() do not correctly shutdown the syncProcessor. It may lead to potential data inconsistency (see {*}Potential Risk{*}). A follower / observer will invoke syncProcessor.shutdown() in LearnerZooKeeperServer.shutdown() / ObserverZooKeeperServer.shutdown(), respectively. 
However, after the [FIX|https://github.com/apache/zookeeper/commit/efbd660e1c4b90a8f538f2cccb5dcb7094cf9a22] of ZOOKEEPER-3642, Follower.shutdown() / Observer.shutdown() will not invoke LearnerZooKeeperServer.shutdown() / ObserverZooKeeperServer.shutdown() anymore. h2. Call stack h5. Version 3.8.1 / 3.8.0 / 3.7.1 / 3.7.0 / 3.6.4 / 3.6.3 / 3.5.10 ... * *(Buggy)* Observer.shutdown() -> Learner.shutdown() -> ZooKeeperServer.shutdown(boolean) * *(Buggy)* Follower.shutdown() -> Learner.shutdown() -> ZooKeeperServer.shutdown(boolean) * (For comparison) Leader.shutdown(String) -> LeaderZooKeeperServer.shutdown() -> ZooKeeperServer.shutdown() -> ZooKeeperServer.shutdown(boolean) h5. For comparison, in version 3.4.X, * Observer.shutdown() -> Learner.shutdown() -> {*}ObserverZooKeeperServer.shutdown(){*} -> ZooKeeperServer.shutdown() -> ZooKeeperServer.shutdown(boolean)
[jira] [Updated] (ZOOKEEPER-4712) Follower.shutdown() and Observer.shutdown() do not correctly shutdown the syncProcessor, which may lead to data inconsistency
[ https://issues.apache.org/jira/browse/ZOOKEEPER-4712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sirius updated ZOOKEEPER-4712: -- Summary: Follower.shutdown() and Observer.shutdown() do not correctly shutdown the syncProcessor, which may lead to data inconsistency (was: Follower.shutdown() and Observer.shutdown() do not correctly shutdown the syncProcessor, which may lead to potential data inconsistency) > Follower.shutdown() and Observer.shutdown() do not correctly shutdown the > syncProcessor, which may lead to data inconsistency > - > > Key: ZOOKEEPER-4712 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4712 > Project: ZooKeeper > Issue Type: Bug > Components: quorum, server >Affects Versions: 3.5.10, 3.6.3, 3.7.0, 3.8.0, 3.7.1, 3.6.4, 3.8.1 >Reporter: Sirius >Priority: Critical > > Follower.shutdown() and Observer.shutdown() do not correctly shutdown the > syncProcessor. It may lead to potential data inconsistency (see Potential > Risk). > > A follower / observer will invoke syncProcessor.shutdown() in > LearnerZooKeeperServer.shutdown() / ObserverZooKeeperServer.shutdown(), > respectively. > However, after the > [FIX|https://github.com/apache/zookeeper/commit/efbd660e1c4b90a8f538f2cccb5dcb7094cf9a22] > of ZOOKEEPER-3642, Follower.shutdown() / Observer.shutdown() will not invoke > LearnerZooKeeperServer.shutdown() / ObserverZooKeeperServer.shutdown() > anymore. > > h2. Call stack > h5. Version 3.8.1 / 3.8.0 / 3.7.1 / 3.7.0 / 3.6.4 / 3.6.3 / 3.5.10 ... > * *(Buggy)* Observer.shutdown() -> Learner.shutdown() -> > ZooKeeperServer.shutdown(boolean) > * *(Buggy)* Follower.shutdown() -> Learner.shutdown() -> > ZooKeeperServer.shutdown(boolean) > * (For comparison) Leader.shutdown(String) -> LeaderZooKeeperServer.shutdown() -> > ZooKeeperServer.shutdown() -> ZooKeeperServer.shutdown(boolean) > > h5. 
For comparison, in version 3.4.X, > * Observer.shutdown() -> Learner.shutdown() -> > {*}ObserverZooKeeperServer.shutdown(){*} -> ZooKeeperServer.shutdown() -> > ZooKeeperServer.shutdown(boolean) > * Follower.shutdown() -> Learner.shutdown() -> > {*}FollowerZooKeeperServer.shutdown(){*} -> ZooKeeperServer.shutdown() -> > ZooKeeperServer.shutdown(boolean) > > h2. Code Details > Take version 3.8.0 as an example. > In Follower.shutdown() : > {code:java} > public void shutdown() { > LOG.info("shutdown Follower"); > + // invoke Learner.shutdown() > super.shutdown(); > } {code} > > In Learner.java: > {code:java} > public void shutdown() { > ... > // shutdown previous zookeeper > if (zk != null) { > // If we haven't finished SNAP sync, force fully shutdown > // to avoid potential inconsistency > + // This will invoke ZooKeeperServer.shutdown(boolean), > + // which will not shutdown syncProcessor > + // Before the fix of ZOOKEEPER-3642, > + // FollowerZooKeeperServer.shutdown() will be invoked here > zk.shutdown(self.getSyncMode().equals(QuorumPeer.SyncMode.SNAP)); > } > } {code} > > In ZooKeeperServer.java: > {code:java} > public synchronized void shutdown(boolean fullyShutDown) { > ... > if (firstProcessor != null) { > + // For a follower, this will not shutdown its syncProcessor. > firstProcessor.shutdown(); > } > ... > } {code} > > In expectation, Follower.shutdown() should invoke > LearnerZooKeeperServer.shutdown() to shutdown the syncProcessor: > {code:java} > public synchronized void shutdown() { > ... > try { > + // shutdown the syncProcessor here > if (syncProcessor != null) { > syncProcessor.shutdown(); > } > } ... > } {code} > Observer.shutdown() has a similar problem. > > h2. Potential Risk > When Follower.shutdown() is called, the follower's QuorumPeer thread may > invoke fastForwardDataBase() and > update the lastProcessedZxid for the election and recovery phase before its > syncThread drains the pending requests and flushes them to disk. 
> In consequence, this lastProcessedZxid is not the latest zxid in its log, > leading to log inconsistency after the SYNC phase. (Similar to the symptoms > of ZOOKEEPER-2845.) > -- This message was sent by Atlassian Jira (v8.20.10#820010)
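The risk described in the report comes down to shutdown ordering: the sync thread must drain and flush its queued requests before the peer updates its election-time bookkeeping. Below is a minimal, hypothetical Java sketch of such a draining shutdown (invented names; this is not the actual ZooKeeper SyncRequestProcessor), using a sentinel object so that shutdown() only returns once every pending request has been flushed:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical model of a sync-processor-like thread; names are invented
// for illustration and do not match the real ZooKeeper classes.
public class SyncShutdownSketch {
    private static final Object END = new Object(); // shutdown sentinel

    final LinkedBlockingQueue<Object> queued = new LinkedBlockingQueue<>();
    final List<Object> flushed = new ArrayList<>();

    private final Thread syncThread = new Thread(() -> {
        try {
            while (true) {
                Object req = queued.take();
                if (req == END) {
                    break;            // sentinel reached: queue is fully drained
                }
                flushed.add(req);     // stand-in for flushing to the txn log
            }
        } catch (InterruptedException ignored) {
        }
    });

    void start() { syncThread.start(); }

    void submit(Object req) { queued.add(req); }

    // Draining shutdown: everything submitted before this call is flushed
    // before the method returns, so any lastProcessedZxid-style bookkeeping
    // done afterwards cannot run ahead of the on-disk log.
    void shutdown() throws InterruptedException {
        queued.add(END);
        syncThread.join();
    }

    public static void main(String[] args) throws Exception {
        SyncShutdownSketch p = new SyncShutdownSketch();
        p.start();
        for (int i = 0; i < 100; i++) {
            p.submit(i);
        }
        p.shutdown();
        System.out.println(p.flushed.size()); // prints 100
    }
}
```

If shutdown() were never invoked on this sketch (the buggy path the ticket describes), the thread could be abandoned with requests still queued, which is exactly the lost-flush scenario behind the lastProcessedZxid mismatch.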
[jira] [Updated] (ZOOKEEPER-4713) ObserverZooKeeperServer.shutdown() is redundant
[ https://issues.apache.org/jira/browse/ZOOKEEPER-4713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sirius updated ZOOKEEPER-4713: -- Description: After the [FIX|https://github.com/apache/zookeeper/commit/66646796c2173423655c7faf2b458b658143e6b5] of ZOOKEEPER-1796, LearnerZooKeeperServer.shutdown() should be responsible for the shutdown logic of both the follower and observer. ObserverZooKeeperServer.shutdown() seems redundant, because it is not in the call stack of Observer.shutdown(). (Note that FollowerZooKeeperServer does not have the shutdown() method.) Related analysis can be found in ZOOKEEPER-4712 was: After the [FIX|https://github.com/apache/zookeeper/commit/66646796c2173423655c7faf2b458b658143e6b5] of ZOOKEEPER-1796, LearnerZooKeeperServer.shutdown() should be responsible for the shutdown logic of both the follower and observer. ObserverZooKeeperServer.shutdown() seems redundant. Related analysis can be found in [ZOOKEEPER-4712|https://issues.apache.org/jira/browse/ZOOKEEPER-4712] > ObserverZooKeeperServer.shutdown() is redundant > --- > > Key: ZOOKEEPER-4713 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4713 > Project: ZooKeeper > Issue Type: Improvement > Components: quorum, server >Affects Versions: 3.5.10, 3.6.3, 3.7.0, 3.8.0, 3.7.1, 3.6.4, 3.8.1 >Reporter: Sirius >Priority: Minor > > After the > [FIX|https://github.com/apache/zookeeper/commit/66646796c2173423655c7faf2b458b658143e6b5] > of ZOOKEEPER-1796, LearnerZooKeeperServer.shutdown() should be responsible > for the shutdown logic of both the follower and observer. > ObserverZooKeeperServer.shutdown() seems redundant, because it is not in the > call stack of Observer.shutdown(). (Note that FollowerZooKeeperServer does > not have the shutdown() method.) > Related analysis can be found in ZOOKEEPER-4712 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ZOOKEEPER-4713) ObserverZooKeeperServer.shutdown() is redundant
Sirius created ZOOKEEPER-4713: - Summary: ObserverZooKeeperServer.shutdown() is redundant Key: ZOOKEEPER-4713 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4713 Project: ZooKeeper Issue Type: Improvement Components: quorum, server Affects Versions: 3.8.1, 3.7.1, 3.8.0, 3.7.0, 3.6.3, 3.5.10, 3.6.4 Reporter: Sirius After the [FIX|https://github.com/apache/zookeeper/commit/66646796c2173423655c7faf2b458b658143e6b5] of ZOOKEEPER-1796, LearnerZooKeeperServer.shutdown() should be responsible for the shutdown logic of both the follower and observer. ObserverZooKeeperServer.shutdown() seems redundant. Related analysis can be found in [ZOOKEEPER-4712|https://issues.apache.org/jira/browse/ZOOKEEPER-4712] -- This message was sent by Atlassian Jira (v8.20.10#820010)
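Both tickets turn on which override actually runs when Learner.shutdown() calls zk.shutdown(boolean): after ZOOKEEPER-3642, the learner-side classes only override the no-argument shutdown(), which the boolean variant bypasses. A hypothetical Java sketch of the expected delegation (invented class names, not the real ZooKeeper sources): if the learner-side server overrides the overload that callers actually invoke, the sync processor is stopped even through a base-class reference.

```java
// Invented names for illustration; the real ZooKeeper classes differ.
class SyncProcessorSketch {
    volatile boolean stopped = false;

    void shutdown() {
        stopped = true; // the real processor would also drain its request queue
    }
}

class ZooKeeperServerSketch {
    synchronized void shutdown(boolean fullyShutDown) {
        // Generic pipeline shutdown; knows nothing about a learner's syncProcessor.
    }
}

class LearnerZooKeeperServerSketch extends ZooKeeperServerSketch {
    final SyncProcessorSketch syncProcessor = new SyncProcessorSketch();

    @Override
    synchronized void shutdown(boolean fullyShutDown) {
        syncProcessor.shutdown();       // stop the sync processor first...
        super.shutdown(fullyShutDown);  // ...then run the generic shutdown
    }
}

public class ShutdownDelegationSketch {
    public static void main(String[] args) {
        // Callers like Learner.shutdown() only hold a base-class reference,
        // so the fix must override the overload that is actually invoked.
        ZooKeeperServerSketch zk = new LearnerZooKeeperServerSketch();
        zk.shutdown(false);
        System.out.println(((LearnerZooKeeperServerSketch) zk).syncProcessor.stopped); // prints true
    }
}
```

Overriding only the no-argument shutdown() would leave the stopped flag false in this model, mirroring the bug: dynamic dispatch picks the most specific override of the *invoked* signature, not of a sibling overload.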
[jira] [Updated] (ZOOKEEPER-4712) Follower.shutdown() and Observer.shutdown() do not correctly shutdown the syncProcessor, which may lead to potential data inconsistency
[ https://issues.apache.org/jira/browse/ZOOKEEPER-4712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sirius updated ZOOKEEPER-4712: -- Description:

Follower.shutdown() and Observer.shutdown() do not correctly shut down the syncProcessor, which may lead to potential data inconsistency (see Potential Risk below).

A follower / observer is expected to invoke syncProcessor.shutdown() in LearnerZooKeeperServer.shutdown() / ObserverZooKeeperServer.shutdown(), respectively. However, since the [fix|https://github.com/apache/zookeeper/commit/efbd660e1c4b90a8f538f2cccb5dcb7094cf9a22] of [ZOOKEEPER-3642|https://issues.apache.org/jira/browse/ZOOKEEPER-3642], Follower.shutdown() / Observer.shutdown() no longer invoke LearnerZooKeeperServer.shutdown() / ObserverZooKeeperServer.shutdown().

h4. Call stack

h5. Versions 3.8.1 / 3.8.0 / 3.7.1 / 3.7.0 / 3.6.4 / 3.6.3 / 3.5.10 ...

* *(Buggy)* Observer.shutdown() -> Learner.shutdown() -> ZooKeeperServer.shutdown(boolean)
* *(Buggy)* Follower.shutdown() -> Learner.shutdown() -> ZooKeeperServer.shutdown(boolean)
* (For comparison) Leader.shutdown(String) -> LeaderZooKeeperServer.shutdown() -> ZooKeeperServer.shutdown() -> ZooKeeperServer.shutdown(boolean)

h5. For comparison, in version 3.4.X:

* Observer.shutdown() -> Learner.shutdown() -> *ObserverZooKeeperServer.shutdown()* -> ZooKeeperServer.shutdown() -> ZooKeeperServer.shutdown(boolean)
* Follower.shutdown() -> Learner.shutdown() -> *FollowerZooKeeperServer.shutdown()* -> ZooKeeperServer.shutdown() -> ZooKeeperServer.shutdown(boolean)

h4. Code Details

Take version 3.8.0 as an example. (Lines prefixed with "+" are the reporter's annotations, not source comments.) In Follower.shutdown():

{code:java}
public void shutdown() {
    LOG.info("shutdown Follower");
+   // invoke Learner.shutdown()
    super.shutdown();
}
{code}

In Learner.java:

{code:java}
public void shutdown() {
    ...
    // shutdown previous zookeeper
    if (zk != null) {
        // If we haven't finished SNAP sync, force fully shutdown
        // to avoid potential inconsistency
+       // This will invoke ZooKeeperServer.shutdown(boolean),
+       // which will not shut down the syncProcessor.
+       // Before the fix of ZOOKEEPER-3642,
+       // FollowerZooKeeperServer.shutdown() would be invoked here.
        zk.shutdown(self.getSyncMode().equals(QuorumPeer.SyncMode.SNAP));
    }
}
{code}

In ZooKeeperServer.java:

{code:java}
public synchronized void shutdown(boolean fullyShutDown) {
    ...
    if (firstProcessor != null) {
+       // For a follower, this will not shut down its syncProcessor.
        firstProcessor.shutdown();
    }
    ...
}
{code}

In expectation, Follower.shutdown() should invoke LearnerZooKeeperServer.shutdown() to shut down the syncProcessor:

{code:java}
public synchronized void shutdown() {
    ...
    try {
+       // shut down the syncProcessor here
        if (syncProcessor != null) {
            syncProcessor.shutdown();
        }
    }
    ...
}
{code}

Observer.shutdown() has a similar problem.

h4. Potential Risk

When Follower.shutdown() is called, the follower's QuorumPeer thread may update its lastProcessedZxid for the election before its syncThread drains the pending requests and flushes them to disk. In consequence, this lastProcessedZxid is not the latest zxid in its log, leading to log inconsistency after the SYNC phase. (Similar to the symptoms of ZOOKEEPER-2845.)
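The override pattern the report expects can be illustrated with a minimal, self-contained sketch. The class names below (SyncProcessorSketch, LearnerServerSketch) are hypothetical stand-ins, not the real ZooKeeper classes: the learner-side server stops its sync pipeline first, then delegates to the generic shutdown.

```java
// Hypothetical stand-in for SyncRequestProcessor: shutdown() drains and
// flushes pending requests in the real implementation.
class SyncProcessorSketch {
    volatile boolean running = true;

    void shutdown() {
        running = false;
    }
}

// Hypothetical stand-in for ZooKeeperServer: only stops firstProcessor,
// which does not reach a learner's syncProcessor.
class ZooKeeperServerSketch {
    void shutdown(boolean fullyShutDown) {
        // stops firstProcessor chain in the real implementation
    }
}

// The missing step from the report: the learner-specific subclass stops
// its syncProcessor before delegating, so nothing is left unflushed when
// lastProcessedZxid is later read for the election.
class LearnerServerSketch extends ZooKeeperServerSketch {
    final SyncProcessorSketch syncProcessor = new SyncProcessorSketch();

    @Override
    void shutdown(boolean fullyShutDown) {
        if (syncProcessor != null) {
            syncProcessor.shutdown();
        }
        super.shutdown(fullyShutDown);
    }
}
```

The point of the sketch is purely the call ordering: the subclass override guarantees syncProcessor.shutdown() runs on every shutdown path, which is what the buggy versions lose by skipping LearnerZooKeeperServer.shutdown().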
[jira] [Created] (ZOOKEEPER-4712) Follower.shutdown() and Observer.shutdown() do not correctly shutdown the syncProcessor, which may lead to potential data inconsistency
Sirius created ZOOKEEPER-4712: - Summary: Follower.shutdown() and Observer.shutdown() do not correctly shutdown the syncProcessor, which may lead to potential data inconsistency Key: ZOOKEEPER-4712 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4712 Project: ZooKeeper Issue Type: Bug Components: quorum, server Affects Versions: 3.8.1, 3.7.1, 3.8.0, 3.7.0, 3.6.3, 3.5.10, 3.6.4 Reporter: Sirius

-- This message was sent by Atlassian Jira (v8.20.10#820010)
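The "Potential Risk" of a stale lastProcessedZxid can be shown with a small simulation. This is an illustrative sketch, not ZooKeeper code: SyncPipelineSketch and its methods are hypothetical names modeling a sync queue that must be drained before the advertised zxid is trustworthy.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Models the hazard: requests accepted into the sync pipeline but not yet
// flushed are invisible to anyone reading the "last flushed" zxid.
class SyncPipelineSketch {
    private final Queue<Long> pending = new ArrayDeque<>();
    private long lastFlushedZxid = 0;

    // A request enters the pipeline but is not yet on disk.
    void submit(long zxid) {
        pending.add(zxid);
    }

    // Drain-and-flush: the step a correct shutdown must wait for before
    // the QuorumPeer thread reads the zxid for election.
    void drain() {
        Long zxid;
        while ((zxid = pending.poll()) != null) {
            lastFlushedZxid = zxid; // stands in for a disk flush
        }
    }

    long lastFlushedZxid() {
        return lastFlushedZxid;
    }

    boolean hasPending() {
        return !pending.isEmpty();
    }
}
```

If the election value is read while hasPending() is still true, the peer advertises a zxid older than what its log will eventually contain, matching the inconsistency the report describes after the SYNC phase.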
[jira] [Updated] (ZOOKEEPER-4711) a data race in org.apache.zookeeper.server.watch.WatchManagerTest#testDeadWatchers
[ https://issues.apache.org/jira/browse/ZOOKEEPER-4711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lujie updated ZOOKEEPER-4711: - Description:

When we run:

mvn test -Dmaven.test.failure.ignore=true -Dtest=org.apache.zookeeper.server.watch.WatchManagerTest#testDeadWatchers -DfailIfNoTests=false -DredirectTestOutputToFile=false

the following methods in the class org.apache.zookeeper.server.watch.WatcherCleaner are involved:

{code:java}
public void addDeadWatcher(int watcherBit) {
    // Wait if there are too many watchers waiting to be closed,
    // this is will slow down the socket packet processing and
    // the adding watches in the ZK pipeline.
    while (maxInProcessingDeadWatchers > 0 && !stopped
            && totalDeadWatchers.get() >= maxInProcessingDeadWatchers) {
        try {
            RATE_LOGGER.rateLimitLog("Waiting for dead watchers cleaning");
            long startTime = Time.currentElapsedTime();
            synchronized (processingCompletedEvent) {
                processingCompletedEvent.wait(100);
            }
            long latency = Time.currentElapsedTime() - startTime;
            ServerMetrics.getMetrics().ADD_DEAD_WATCHER_STALL_TIME.add(latency);
        } catch (InterruptedException e) {
            LOG.info("Got interrupted while waiting for dead watches queue size");
            break;
        }
    }
    synchronized (this) {
        if (deadWatchers.add(watcherBit)) {
            totalDeadWatchers.incrementAndGet();
            ServerMetrics.getMetrics().DEAD_WATCHERS_QUEUED.add(1);
            if (deadWatchers.size() >= watcherCleanThreshold) {
                synchronized (cleanEvent) {
                    cleanEvent.notifyAll();
                }
            }
        }
    }
}
{code}

{code:java}
@Override
public void run() {
    while (!stopped) {
        synchronized (cleanEvent) {
            try {
                // add some jitter to avoid cleaning dead watchers at the
                // same time in the quorum
                if (!stopped && deadWatchers.size() < watcherCleanThreshold) {
                    int maxWaitMs = (watcherCleanIntervalInSeconds
                            + ThreadLocalRandom.current().nextInt(watcherCleanIntervalInSeconds / 2 + 1)) * 1000;
                    cleanEvent.wait(maxWaitMs);
                }
            } catch (InterruptedException e) {
                LOG.info("Received InterruptedException while waiting for cleanEvent");
                break;
            }
        }
        if (deadWatchers.isEmpty()) {
            continue;
        }
        synchronized (this) {
            // Clean the dead watchers need to go through all the current
            // watches, which is pretty heavy and may take a second if
            // there are millions of watches, that's why we're doing lazily
            // batch clean up in a separate thread with a snapshot of the
            // current dead watchers.
            final Set<Integer> snapshot = new HashSet<>(deadWatchers);
            deadWatchers.clear();
            int total = snapshot.size();
            LOG.info("Processing {} dead watchers", total);
            cleaners.schedule(new WorkRequest() {
                @Override
                public void doWork() throws Exception {
                    long startTime = Time.currentElapsedTime();
                    listener.processDeadWatchers(snapshot);
                    long latency = Time.currentElapsedTime() - startTime;
                    LOG.info("Takes {} to process {} watches", latency, total);
                    ServerMetrics.getMetrics().DEAD_WATCHERS_CLEANER_LATENCY.add(latency);
                    ServerMetrics.getMetrics().DEAD_WATCHERS_CLEARED.add(total);
                    totalDeadWatchers.addAndGet(-total);
                    synchronized (processingCompletedEvent) {
                        processingCompletedEvent.notifyAll();
                    }
                }
            });
        }
    }
    LOG.info("WatcherCleaner thread exited");
}
{code}

As we can see, the two methods access the deadWatchers object from different threads. The thread in run() *reads* deadWatchers (the size() check under the cleanEvent monitor and the isEmpty() check, neither of which holds the `this` monitor), while the thread in addDeadWatcher() *writes* to deadWatchers under `this`. Because these reads and writes are not guarded by a common lock, this is a data race.

was:
When we run:
mvn test -Dmaven.test.failure.ignore=true -Dtest=org.apache.zookeeper.server.watch.WatchManagerTest#testDeadWatchers -DfailIfNoTests=false -DredirectTestOutputToFile=false
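The locking discipline that removes the race can be sketched in isolation. WatcherCleanerSketch below is a hypothetical, stripped-down stand-in (not the real WatcherCleaner): every access to the shared set, including the size check that the report flags, goes through one monitor.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: reader and writer threads agree on a single
// monitor (this), so the cleaner never observes the set mid-mutation.
class WatcherCleanerSketch {
    private final Set<Integer> deadWatchers = new HashSet<>();

    // Writer side: mutation under the shared monitor.
    void addDeadWatcher(int watcherBit) {
        synchronized (this) {
            deadWatchers.add(watcherBit);
        }
    }

    // Reader side: the buggy code checked deadWatchers.size() without
    // holding this monitor; taking it here removes the data race.
    boolean hasWorkBatch(int threshold) {
        synchronized (this) {
            return deadWatchers.size() >= threshold;
        }
    }

    // Snapshot-and-clear under the same monitor, mirroring the batch
    // clean-up pattern in run().
    Set<Integer> drainSnapshot() {
        synchronized (this) {
            Set<Integer> snapshot = new HashSet<>(deadWatchers);
            deadWatchers.clear();
            return snapshot;
        }
    }
}
```

An alternative with the same effect would be a thread-safe set (e.g. one wrapped by Collections.synchronizedSet), but a plain HashSet guarded by one monitor keeps the size check and the mutation atomic with respect to each other, which is exactly what the unguarded reads lack.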
[jira] [Updated] (ZOOKEEPER-4711) a data race in org.apache.zookeeper.server.watch.WatchManagerTest#testDeadWatchers
[ https://issues.apache.org/jira/browse/ZOOKEEPER-4711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lujie updated ZOOKEEPER-4711: - Summary: a data race in org.apache.zookeeper.server.watch.WatchManagerTest#testDeadWatchers (was: There is a data race bettween run() and addDeadWatcher in org.apache.zookeeper.server.watch.WatcherCleaner class when run org.apache.zookeeper.server.watch.WatchManagerTest#testDeadWatchers junit test.) > a data race in > org.apache.zookeeper.server.watch.WatchManagerTest#testDeadWatchers > --- > > Key: ZOOKEEPER-4711 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4711 > Project: ZooKeeper > Issue Type: Bug > Components: server >Affects Versions: 3.9.0 > Environment: download zookeeper 3.9.0-SNAPSHOT from github repository > ([https://github.com/apache/zookeeper)] > Then run : mvn test -Dmaven.test.failure.ignore=true > -Dtest=org.apache.zookeeper.server.watch.WatchManagerTest#testDeadWatchers > -DfailIfNoTests=false -DredirectTestOutputToFile=false >Reporter: lujie >Priority: Critical > > When we run : > mvn test -Dmaven.test.failure.ignore=true > -Dtest=org.apache.zookeeper.server.watch.WatchManagerTest#testDeadWatchers > -DfailIfNoTests=false -DredirectTestOutputToFile=false > The method of addDeadWatcher > ( > System.out.println("2s::" +Thread.currentThread().getName()+ " > "+System.identityHashCode(deadWatchers)+" " + System.currentTimeMillis()); > this is my debug info. > ) > {code:java} > public void addDeadWatcher(int watcherBit) { > // Wait if there are too many watchers waiting to be closed, > // this is will slow down the socket packet processing and > // the adding watches in the ZK pipeline. 
> while (maxInProcessingDeadWatchers > 0 && !stopped && > totalDeadWatchers.get() >= maxInProcessingDeadWatchers) { > try { > RATE_LOGGER.rateLimitLog("Waiting for dead watchers > cleaning"); > long startTime = Time.currentElapsedTime(); > synchronized (processingCompletedEvent) { > processingCompletedEvent.wait(100); > } > long latency = Time.currentElapsedTime() - startTime; > > ServerMetrics.getMetrics().ADD_DEAD_WATCHER_STALL_TIME.add(latency); > } catch (InterruptedException e) { > LOG.info("Got interrupted while waiting for dead watches > queue size"); > break; > } > } > synchronized (this) { > > if (deadWatchers.add(watcherBit)) { > totalDeadWatchers.incrementAndGet(); > ServerMetrics.getMetrics().DEAD_WATCHERS_QUEUED.add(1); > if (deadWatchers.size() >= watcherCleanThreshold) { > synchronized (cleanEvent) { > cleanEvent.notifyAll(); > } > } > } > } > }{code} > > {code:java} > @Override > public void run() { > while (!stopped) { > synchronized (cleanEvent) { > try { > // add some jitter to avoid cleaning dead watchers at the > // same time in the quorum > if (!stopped && deadWatchers.size() < > watcherCleanThreshold) { > > int maxWaitMs = (watcherCleanIntervalInSeconds > + > ThreadLocalRandom.current().nextInt(watcherCleanIntervalInSeconds / 2 + 1)) * > 1000; > cleanEvent.wait(maxWaitMs); > } > } catch (InterruptedException e) { > LOG.info("Received InterruptedException while waiting for > cleanEvent"); > break; > } > } if (deadWatchers.isEmpty()) { > continue; > } synchronized (this) { > // Clean the dead watchers need to go through all the current > // watches, which is pretty heavy and may take a second if > // there are millions of watches, that's why we're doing > lazily > // batch clean up in a separate thread with a snapshot of the > // current
[jira] [Updated] (ZOOKEEPER-4711) There is a data race between run() and addDeadWatcher in org.apache.zookeeper.server.watch.WatcherCleaner class when run org.apache.zookeeper.server.watch.WatchManag
[ https://issues.apache.org/jira/browse/ZOOKEEPER-4711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lujie updated ZOOKEEPER-4711: - Summary: There is a data race bettween run() and addDeadWatcher in org.apache.zookeeper.server.watch.WatcherCleaner class when run org.apache.zookeeper.server.watch.WatchManagerTest#testDeadWatchers junit test. (was: There is a data race bettween run() and "public void addDeadWatcher(int watcherBit)" in org.apache.zookeeper.server.watch.WatcherCleaner class when run org.apache.zookeeper.server.watch.WatchManagerTest#testDeadWatchers junit test.) > There is a data race bettween run() and addDeadWatcher in > org.apache.zookeeper.server.watch.WatcherCleaner class when run > org.apache.zookeeper.server.watch.WatchManagerTest#testDeadWatchers junit > test. > - > > Key: ZOOKEEPER-4711 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4711 > Project: ZooKeeper > Issue Type: Bug > Components: server >Affects Versions: 3.9.0 > Environment: download zookeeper 3.9.0-SNAPSHOT from github repository > ([https://github.com/apache/zookeeper)] > Then run : mvn test -Dmaven.test.failure.ignore=true > -Dtest=org.apache.zookeeper.server.watch.WatchManagerTest#testDeadWatchers > -DfailIfNoTests=false -DredirectTestOutputToFile=false >Reporter: lujie >Priority: Critical > > When we run : > mvn test -Dmaven.test.failure.ignore=true > -Dtest=org.apache.zookeeper.server.watch.WatchManagerTest#testDeadWatchers > -DfailIfNoTests=false -DredirectTestOutputToFile=false > The method of addDeadWatcher > ( > System.out.println("2s::" +Thread.currentThread().getName()+ " > "+System.identityHashCode(deadWatchers)+" " + System.currentTimeMillis()); > this is my debug info. > ) > {code:java} > public void addDeadWatcher(int watcherBit) { > // Wait if there are too many watchers waiting to be closed, > // this is will slow down the socket packet processing and > // the adding watches in the ZK pipeline. 
> while (maxInProcessingDeadWatchers > 0 && !stopped && > totalDeadWatchers.get() >= maxInProcessingDeadWatchers) { > try { > RATE_LOGGER.rateLimitLog("Waiting for dead watchers > cleaning"); > long startTime = Time.currentElapsedTime(); > synchronized (processingCompletedEvent) { > processingCompletedEvent.wait(100); > } > long latency = Time.currentElapsedTime() - startTime; > > ServerMetrics.getMetrics().ADD_DEAD_WATCHER_STALL_TIME.add(latency); > } catch (InterruptedException e) { > LOG.info("Got interrupted while waiting for dead watches > queue size"); > break; > } > } > synchronized (this) { > > if (deadWatchers.add(watcherBit)) { > totalDeadWatchers.incrementAndGet(); > ServerMetrics.getMetrics().DEAD_WATCHERS_QUEUED.add(1); > if (deadWatchers.size() >= watcherCleanThreshold) { > synchronized (cleanEvent) { > cleanEvent.notifyAll(); > } > } > } > } > }{code} > > {code:java} > @Override > public void run() { > while (!stopped) { > synchronized (cleanEvent) { > try { > // add some jitter to avoid cleaning dead watchers at the > // same time in the quorum > if (!stopped && deadWatchers.size() < > watcherCleanThreshold) { > > int maxWaitMs = (watcherCleanIntervalInSeconds > + > ThreadLocalRandom.current().nextInt(watcherCleanIntervalInSeconds / 2 + 1)) * > 1000; > cleanEvent.wait(maxWaitMs); > } > } catch (InterruptedException e) { > LOG.info("Received InterruptedException while waiting for > cleanEvent"); > break; > } > } if (deadWatchers.isEmpty()) { > continue; > }
[jira] [Created] (ZOOKEEPER-4711) There is a data race between run() and "public void addDeadWatcher(int watcherBit)" in org.apache.zookeeper.server.watch.WatcherCleaner class when run org.apache.zoo
lujie created ZOOKEEPER-4711: Summary: There is a data race bettween run() and "public void addDeadWatcher(int watcherBit)" in org.apache.zookeeper.server.watch.WatcherCleaner class when run org.apache.zookeeper.server.watch.WatchManagerTest#testDeadWatchers junit test. Key: ZOOKEEPER-4711 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4711 Project: ZooKeeper Issue Type: Bug Components: server Affects Versions: 3.9.0 Environment: download zookeeper 3.9.0-SNAPSHOT from github repository ([https://github.com/apache/zookeeper)] Then run : mvn test -Dmaven.test.failure.ignore=true -Dtest=org.apache.zookeeper.server.watch.WatchManagerTest#testDeadWatchers -DfailIfNoTests=false -DredirectTestOutputToFile=false Reporter: lujie When we run : mvn test -Dmaven.test.failure.ignore=true -Dtest=org.apache.zookeeper.server.watch.WatchManagerTest#testDeadWatchers -DfailIfNoTests=false -DredirectTestOutputToFile=false The method of addDeadWatcher ( System.out.println("2s::" +Thread.currentThread().getName()+ " "+System.identityHashCode(deadWatchers)+" " + System.currentTimeMillis()); this is my debug info. ) {code:java} public void addDeadWatcher(int watcherBit) { // Wait if there are too many watchers waiting to be closed, // this is will slow down the socket packet processing and // the adding watches in the ZK pipeline. 
while (maxInProcessingDeadWatchers > 0 && !stopped && totalDeadWatchers.get() >= maxInProcessingDeadWatchers) { try { RATE_LOGGER.rateLimitLog("Waiting for dead watchers cleaning"); long startTime = Time.currentElapsedTime(); synchronized (processingCompletedEvent) { processingCompletedEvent.wait(100); } long latency = Time.currentElapsedTime() - startTime; ServerMetrics.getMetrics().ADD_DEAD_WATCHER_STALL_TIME.add(latency); } catch (InterruptedException e) { LOG.info("Got interrupted while waiting for dead watches queue size"); break; } } synchronized (this) { if (deadWatchers.add(watcherBit)) { totalDeadWatchers.incrementAndGet(); ServerMetrics.getMetrics().DEAD_WATCHERS_QUEUED.add(1); if (deadWatchers.size() >= watcherCleanThreshold) { synchronized (cleanEvent) { cleanEvent.notifyAll(); } } } } }{code} {code:java} @Override public void run() { while (!stopped) { synchronized (cleanEvent) { try { // add some jitter to avoid cleaning dead watchers at the // same time in the quorum if (!stopped && deadWatchers.size() < watcherCleanThreshold) { int maxWaitMs = (watcherCleanIntervalInSeconds + ThreadLocalRandom.current().nextInt(watcherCleanIntervalInSeconds / 2 + 1)) * 1000; cleanEvent.wait(maxWaitMs); } } catch (InterruptedException e) { LOG.info("Received InterruptedException while waiting for cleanEvent"); break; } } if (deadWatchers.isEmpty()) { continue; } synchronized (this) { // Clean the dead watchers need to go through all the current // watches, which is pretty heavy and may take a second if // there are millions of watches, that's why we're doing lazily // batch clean up in a separate thread with a snapshot of the // current dead watchers. 
final Set snapshot = new HashSet<>(deadWatchers); deadWatchers.clear(); int total = snapshot.size(); LOG.info("Processing {} dead watchers", total); cleaners.schedule(new WorkRequest() { @Override public void doWork() throws Exception { long startTime = Time.currentElapsedTime(); listener.processDeadWatchers(snapshot); long latency = Time.currentElapsedTime() - startTime; LOG.info("Takes {} to process {} watches", latency, total); ServerMetrics.getMetrics().DEAD_WATCHERS_CLEANER_LATENCY.add(latency); ServerMetrics.getMetrics()
[jira] [Commented] (ZOOKEEPER-4628) CVE-2022-42003 CVE-2022-42004 HIGH: upgrade jackson-databind-2.13.3.jar to 2.13.4.1
[ https://issues.apache.org/jira/browse/ZOOKEEPER-4628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17737581#comment-17737581 ] AvnerW commented on ZOOKEEPER-4628: --- Are there any plans to upgrade jackson-databind, jackson-core etc. to 2.15.x for the next ZK releases 3.8.2/3.9.0? There are a few scanner reports about 2.13.x (e.g.: sonatype-2022-6438). > CVE-2022-42003 CVE-2022-42004 HIGH: upgrade jackson-databind-2.13.3.jar to > 2.13.4.1 > --- > > Key: ZOOKEEPER-4628 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4628 > Project: ZooKeeper > Issue Type: Task > Components: security >Affects Versions: 3.5.10, 3.8.0, 3.7.1 >Reporter: Ivo Dujmovic >Priority: Critical > Labels: pull-request-available > Time Spent: 40m > Remaining Estimate: 0h > > Two High issues > [https://nvd.nist.gov/vuln/detail/CVE-2022-42003] > [https://nvd.nist.gov/vuln/detail/CVE-2022-42004] > affect jackson version 2.13.3 which zk should update to 2.13.4.1 > Other projects have done this, but Zookeeper has not. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ZOOKEEPER-4708) ZooKeeper 3.6.4 quorum failing due to address
[ https://issues.apache.org/jira/browse/ZOOKEEPER-4708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17737473#comment-17737473 ] Paolo Patierno commented on ZOOKEEPER-4708: --- The problem we are facing looks pretty similar to this one: [https://github.com/confluentinc/cp-helm-charts/issues/205] AFAIU, ZooKeeper just gives up when, after some time/attempts, it is not able to form the quorum (maybe because of DNS resolution issues). Raising the NPE was helpful because it caused ZooKeeper to crash and Kubernetes to restart the container; the quorum then formed because the pod was still up and DNS had already resolved. Avoiding the NPE leads ZooKeeper to give up forming the quorum and get stuck. We also see that in this situation, if you try to make a connection, it logs "ZK Down" ... which is the truth, because the ensemble is not actually working. > ZooKeeper 3.6.4 quorum failing due to address > -- > > Key: ZOOKEEPER-4708 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4708 > Project: ZooKeeper > Issue Type: Bug >Affects Versions: 3.6.4, 3.8.1 >Reporter: Paolo Patierno >Priority: Major > > We work on the Strimzi project which is about deploying an Apache Kafka > cluster on Kubernetes together with a ZooKeeper ensamble. > Until ZooKeeper version 3.6.3 (brought by Kafka 3.4.0), there were no issues > when running on minikube for development purposes. > With using ZooKeeper version 3.6.4 (brought by Kafka 3.4.1), we started to > have issues during the quorum formation and leader election. > The first one was about ZooKeeper pods not able to bind the quorum port 3888 > to the Cluster IP but during the DNS resolution they get the loopback address > instead. > Following a possible log at ZooKeeper startup where you can see the binding > at 127.0.0.1:3888 instead of something like 172.17.0.4:3888 (so getting a > valid not loopback IP address). 
> > {code:java} > INFO 3 is accepting connections now, my election bind port: > my-cluster-zookeeper-2.my-cluster-zookeeper-nodes.default.svc/127.0.0.1:3888 > (org.apache.zookeeper.server.quorum.QuorumCnxManager) > [ListenerHandler-my-cluster-zookeeper-2.my-cluster-zookeeper-nodes.default.svc/127.0.0.1:3888] > This specific issue had two solutions: using quorumListenOnAllIPs=true on > ZooKeeper configuration or binding to 0.0.0.0 address. {code} > > Anyway it is actually not clear why it wasn't needed until 3.6.3, but needed > for getting 3.6.4 working. What is changed from this perspective? > Said that, While binding to 0.0.0.0 seems to work fine, using the > quorumListenOnAllIPs=true doesn't. > Assuming a ZooKeeper ensamble with 3 nodes, Getting the log of the current > ZooKeeper leader (ID=3) we see the following. > (Starting with ** you can see some additional logs added to > {{org.apache.zookeeper.server.quorum.Leader#getDesignatedLeader}} in order to > get more information.) > {code:java} > 2023-06-19 12:32:51,990 INFO Have quorum of supporters, sids: [[1, 3],[1, > 3]]; starting up and setting last processed zxid: 0x1 > (org.apache.zookeeper.server.quorum.Leader) > [QuorumPeer[myid=3](plain=127.0.0.1:12181)(secure=0.0.0.0:2181)] > 2023-06-19 12:32:51,990 INFO ** > newQVAcksetPair.getQuorumVerifier().getVotingMembers().get(self.getId()).addr > = > my-cluster-zookeeper-2.my-cluster-zookeeper-nodes.default.svc/172.17.0.6:2888 > (org.apache.zookeeper.server.quorum.Leader) > [QuorumPeer[myid=3](plain=127.0.0.1:12181)(secure=0.0.0.0:2181)] > 2023-06-19 12:32:51,990 INFO ** self.getQuorumAddress() = > my-cluster-zookeeper-2.my-cluster-zookeeper-nodes.default.svc/:2888 > (org.apache.zookeeper.server.quorum.Leader) > [QuorumPeer[myid=3](plain=127.0.0.1:12181)(secure=0.0.0.0:2181)] > 2023-06-19 12:32:51,992 INFO ** qs.addr > my-cluster-zookeeper-2.my-cluster-zookeeper-nodes.default.svc/172.17.0.6:2888, > qs.electionAddr > 
my-cluster-zookeeper-2.my-cluster-zookeeper-nodes.default.svc/172.17.0.6:3888, > qs.clientAddr/127.0.0.1:12181 > (org.apache.zookeeper.server.quorum.QuorumPeer) > [QuorumPeer[myid=3](plain=127.0.0.1:12181)(secure=0.0.0.0:2181)] > 2023-06-19 12:32:51,992 DEBUG zookeeper > (org.apache.zookeeper.common.PathTrie) > [QuorumPeer[myid=3](plain=127.0.0.1:12181)(secure=0.0.0.0:2181)] > 2023-06-19 12:32:51,993 WARN Restarting Leader Election > (org.apache.zookeeper.server.quorum.QuorumPeer) > [QuorumPeer[myid=3](plain=127.0.0.1:12181)(secure=0.0.0.0:2181)] {code} > So the leader is ZooKeeper with ID=3 and it was ACKed by the ZooKeeper node > ID=1. > As you can see we are in the {{Leader#startZ
[jira] [Commented] (ZOOKEEPER-4708) ZooKeeper 3.6.4 quorum failing due to address
[ https://issues.apache.org/jira/browse/ZOOKEEPER-4708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17737449#comment-17737449 ] Paolo Patierno commented on ZOOKEEPER-4708: --- I went with git bisect between the release-3.6.3 tag (good) and the release-3.6.4 tag (bad). It ended up highlighting the following commit as the cause of the non-working 3.6.4: {code:java} 357e88c1438780e28d36bf54784937e18547e136 is the first bad commit commit 357e88c1438780e28d36bf54784937e18547e136 Author: Enrico Olivelli Date: Tue Jan 25 12:48:34 2022 + ZOOKEEPER-3988: rg.apache.zookeeper.server.NettyServerCnxn.receiveMessage throws NullPointerException Modifications: - prevent the NPE, the code that throws NPE is only to record some metrics for non TLS requests Related to: - apache/pulsar#11070 - https://github.com/pravega/zookeeper-operator/issues/393 Author: Enrico Olivelli Reviewers: Nicolò Boschi , Andor Molnar , Mate Szalay-Beko Closes #1798 from eolivelli/fix/ZOOKEEPER-3988-npe (cherry picked from commit 957f8fc0afbeca638f13f6fb739e49a921da2b9d) Signed-off-by: Mate Szalay-Beko .../zookeeper/server/NettyServerCnxnFactory.java | 18 ++- .../zookeeper/server/NettyServerCnxnTest.java | 26 +++--- .../apache/zookeeper/server/TxnLogCountTest.java | 2 +- 3 files changed, 31 insertions(+), 15 deletions(-) {code} Taking a look at the NettyServerCnxnFactory class, the commit just adds a check around zkServer to avoid an NPE being raised when calling zkServer.serverStats() while it is null. I think there is nothing wrong with that in itself, but when the NPE was raised before the fix, it forced the container to restart, and the ZooKeeper nodes were able to form the quorum. Avoiding the NPE seems to leave ZooKeeper in a situation where it is not able to recover and form the quorum; it is stuck. At this point my question is: is it normal for zkServer to be null? Is this revealing a more subtle bug? The NPE wasn't happening with 3.6.3. 
> ZooKeeper 3.6.4 quorum failing due to address > -- > > Key: ZOOKEEPER-4708 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4708 > Project: ZooKeeper > Issue Type: Bug >Affects Versions: 3.6.4, 3.8.1 >Reporter: Paolo Patierno >Priority: Major > > We work on the Strimzi project which is about deploying an Apache Kafka > cluster on Kubernetes together with a ZooKeeper ensamble. > Until ZooKeeper version 3.6.3 (brought by Kafka 3.4.0), there were no issues > when running on minikube for development purposes. > With using ZooKeeper version 3.6.4 (brought by Kafka 3.4.1), we started to > have issues during the quorum formation and leader election. > The first one was about ZooKeeper pods not able to bind the quorum port 3888 > to the Cluster IP but during the DNS resolution they get the loopback address > instead. > Following a possible log at ZooKeeper startup where you can see the binding > at 127.0.0.1:3888 instead of something like 172.17.0.4:3888 (so getting a > valid not loopback IP address). > > {code:java} > INFO 3 is accepting connections now, my election bind port: > my-cluster-zookeeper-2.my-cluster-zookeeper-nodes.default.svc/127.0.0.1:3888 > (org.apache.zookeeper.server.quorum.QuorumCnxManager) > [ListenerHandler-my-cluster-zookeeper-2.my-cluster-zookeeper-nodes.default.svc/127.0.0.1:3888] > This specific issue had two solutions: using quorumListenOnAllIPs=true on > ZooKeeper configuration or binding to 0.0.0.0 address. {code} > > Anyway it is actually not clear why it wasn't needed until 3.6.3, but needed > for getting 3.6.4 working. What is changed from this perspective? > Said that, While binding to 0.0.0.0 seems to work fine, using the > quorumListenOnAllIPs=true doesn't. > Assuming a ZooKeeper ensamble with 3 nodes, Getting the log of the current > ZooKeeper leader (ID=3) we see the following. 
> (Starting with ** you can see some additional logs added to > {{org.apache.zookeeper.server.quorum.Leader#getDesignatedLeader}} in order to > get more information.) > {code:java} > 2023-06-19 12:32:51,990 INFO Have quorum of supporters, sids: [[1, 3],[1, > 3]]; starting up and setting last processed zxid: 0x1 > (org.apache.zookeeper.server.quorum.Leader) > [QuorumPeer[myid=3](plain=127.0.0.1:12181)(secure=0.0.0.0:2181)] > 2023-06-19 12:32:51,990 INFO ** > newQVAcksetPair.getQuorumVerifier().getVotingMembers().get(self.getId()).addr > = > my-cluster-zookeeper-2.my-cluster-zookeeper-nodes.default.svc/172.17.0.6:2888 > (org.apache.zookeeper.server.quorum.Leader) > [QuorumPeer[myi
[jira] [Updated] (ZOOKEEPER-4710) Fix ZkUtil deleteInBatch() by releasing semaphore after set flag
[ https://issues.apache.org/jira/browse/ZOOKEEPER-4710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yan Zhao updated ZOOKEEPER-4710: Summary: Fix ZkUtil deleteInBatch() by releasing semaphore after set flag (was: Flaky test of org.apache.zookeeper.ZooKeeperTest#testDeleteRecursiveFail) > Fix ZkUtil deleteInBatch() by releasing semaphore after set flag > > > Key: ZOOKEEPER-4710 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4710 > Project: ZooKeeper > Issue Type: Wish > Components: server >Affects Versions: 3.8.1 >Reporter: Yan Zhao >Assignee: Enrico Olivelli >Priority: Minor > Labels: pull-request-available > Fix For: 3.9.0 > > Time Spent: 40m > Remaining Estimate: 0h > > https://github.com/apache/zookeeper/blob/58eed9f5280be1c6a9ccacc47dd6afa65e916ae8/zookeeper-server/src/main/java/org/apache/zookeeper/ZKUtil.java#L111-L116 > We should set the flag before releasing the Semaphore. -- This message was sent by Atlassian Jira (v8.20.10#820010)
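The ordering issue behind ZOOKEEPER-4710 — a waiter can be woken by the semaphore before the failure flag it is about to read has been set — can be sketched as follows. All names here are illustrative, not the actual ZKUtil fields; the point is only that the flag write must happen before the release() that wakes the waiter:

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicBoolean;

// Sketch of the fix direction described in ZOOKEEPER-4710 (illustrative names,
// not the real ZKUtil code): a completion callback records failure and wakes a
// waiter blocked on a semaphore. Setting the flag BEFORE releasing the permit
// guarantees the waiter observes the failure; in the reverse order the waiter
// can acquire the permit and read the flag before the callback has set it.
class BatchDeleteStatus {
    private final AtomicBoolean failed = new AtomicBoolean(false);
    private final Semaphore done = new Semaphore(0);

    // Callback side: publish the outcome first, then release the waiter.
    public void onResult(boolean success) {
        if (!success) {
            failed.set(true); // set the flag...
        }
        done.release();       // ...then wake the waiter
    }

    // Waiter side: once the permit is acquired, the flag value is reliable.
    public boolean awaitSuccess() {
        done.acquireUninterruptibly();
        return !failed.get();
    }
}
```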
[jira] [Commented] (ZOOKEEPER-4710) Flaky test of org.apache.zookeeper.ZooKeeperTest#testDeleteRecursiveFail
[ https://issues.apache.org/jira/browse/ZOOKEEPER-4710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17736471#comment-17736471 ] Enrico Olivelli commented on ZOOKEEPER-4710: Committed to master branch > Flaky test of org.apache.zookeeper.ZooKeeperTest#testDeleteRecursiveFail > > > Key: ZOOKEEPER-4710 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4710 > Project: ZooKeeper > Issue Type: Wish > Components: server >Affects Versions: 3.8.1 >Reporter: Yan Zhao >Assignee: Enrico Olivelli >Priority: Minor > Labels: pull-request-available > Fix For: 3.9.0 > > Time Spent: 40m > Remaining Estimate: 0h > > https://github.com/apache/zookeeper/blob/58eed9f5280be1c6a9ccacc47dd6afa65e916ae8/zookeeper-server/src/main/java/org/apache/zookeeper/ZKUtil.java#L111-L116 > We should set the flag before releasing the Semaphore. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (ZOOKEEPER-4710) Flaky test of org.apache.zookeeper.ZooKeeperTest#testDeleteRecursiveFail
[ https://issues.apache.org/jira/browse/ZOOKEEPER-4710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Enrico Olivelli resolved ZOOKEEPER-4710. Fix Version/s: 3.9.0 Resolution: Fixed > Flaky test of org.apache.zookeeper.ZooKeeperTest#testDeleteRecursiveFail > > > Key: ZOOKEEPER-4710 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4710 > Project: ZooKeeper > Issue Type: Wish > Components: server >Affects Versions: 3.8.1 >Reporter: Yan Zhao >Assignee: Enrico Olivelli >Priority: Minor > Labels: pull-request-available > Fix For: 3.9.0 > > Time Spent: 40m > Remaining Estimate: 0h > > https://github.com/apache/zookeeper/blob/58eed9f5280be1c6a9ccacc47dd6afa65e916ae8/zookeeper-server/src/main/java/org/apache/zookeeper/ZKUtil.java#L111-L116 > We should set the flag before releasing the Semaphore. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (ZOOKEEPER-4710) Flaky test of org.apache.zookeeper.ZooKeeperTest#testDeleteRecursiveFail
[ https://issues.apache.org/jira/browse/ZOOKEEPER-4710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Enrico Olivelli reassigned ZOOKEEPER-4710: -- Assignee: Enrico Olivelli > Flaky test of org.apache.zookeeper.ZooKeeperTest#testDeleteRecursiveFail > > > Key: ZOOKEEPER-4710 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4710 > Project: ZooKeeper > Issue Type: Wish > Components: server >Affects Versions: 3.8.1 >Reporter: Yan Zhao >Assignee: Enrico Olivelli >Priority: Minor > Labels: pull-request-available > Time Spent: 40m > Remaining Estimate: 0h > > https://github.com/apache/zookeeper/blob/58eed9f5280be1c6a9ccacc47dd6afa65e916ae8/zookeeper-server/src/main/java/org/apache/zookeeper/ZKUtil.java#L111-L116 > We should set the flag before releasing the Semaphore. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ZOOKEEPER-4710) Flaky test of org.apache.zookeeper.ZooKeeperTest#testDeleteRecursiveFail
[ https://issues.apache.org/jira/browse/ZOOKEEPER-4710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ZOOKEEPER-4710: -- Labels: pull-request-available (was: ) > Flaky test of org.apache.zookeeper.ZooKeeperTest#testDeleteRecursiveFail > > > Key: ZOOKEEPER-4710 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4710 > Project: ZooKeeper > Issue Type: Wish > Components: server >Affects Versions: 3.8.1 >Reporter: Yan Zhao >Priority: Minor > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > https://github.com/apache/zookeeper/blob/58eed9f5280be1c6a9ccacc47dd6afa65e916ae8/zookeeper-server/src/main/java/org/apache/zookeeper/ZKUtil.java#L111-L116 > We should set the flag before releasing the Semaphore. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ZOOKEEPER-4710) Flaky test of org.apache.zookeeper.ZooKeeperTest#testDeleteRecursiveFail
Yan Zhao created ZOOKEEPER-4710: --- Summary: Flaky test of org.apache.zookeeper.ZooKeeperTest#testDeleteRecursiveFail Key: ZOOKEEPER-4710 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4710 Project: ZooKeeper Issue Type: Wish Components: server Affects Versions: 3.8.1 Reporter: Yan Zhao https://github.com/apache/zookeeper/blob/58eed9f5280be1c6a9ccacc47dd6afa65e916ae8/zookeeper-server/src/main/java/org/apache/zookeeper/ZKUtil.java#L111-L116 We should set the flag before releasing the Semaphore. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ZOOKEEPER-4709) Upgrade Netty to 4.1.94.Final
[ https://issues.apache.org/jira/browse/ZOOKEEPER-4709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ZOOKEEPER-4709: -- Labels: dependency-upgrade pull-request-available (was: dependency-upgrade) > Upgrade Netty to 4.1.94.Final > - > > Key: ZOOKEEPER-4709 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4709 > Project: ZooKeeper > Issue Type: Improvement >Affects Versions: 3.7.1, 3.8.1 >Reporter: Fabio Buso >Priority: Major > Labels: dependency-upgrade, pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > [Netty 4.1.94|https://netty.io/news/2023/06/19/4-1-94-Final.html] includes > several improvements and bug fixes, including a resolution for > [CVE-2023-34462|https://github.com/netty/netty/security/advisories/GHSA-6mjq-h674-j845] > related to potential memory allocation vulnerabilities during a TLS > handshake with Server Name Indication. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ZOOKEEPER-4709) Upgrade Netty to 4.1.94.Final
Fabio Buso created ZOOKEEPER-4709: - Summary: Upgrade Netty to 4.1.94.Final Key: ZOOKEEPER-4709 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4709 Project: ZooKeeper Issue Type: Improvement Affects Versions: 3.8.1, 3.7.1 Reporter: Fabio Buso [Netty 4.1.94|https://netty.io/news/2023/06/19/4-1-94-Final.html] includes several improvements and bug fixes, including a resolution for [CVE-2023-34462|https://github.com/netty/netty/security/advisories/GHSA-6mjq-h674-j845] related to potential memory allocation vulnerabilities during a TLS handshake with Server Name Indication. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (ZOOKEEPER-4026) CREATE2 requests embeded in a MULTI request only get a regular CREATE response
[ https://issues.apache.org/jira/browse/ZOOKEEPER-4026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Damien Diederen resolved ZOOKEEPER-4026. Fix Version/s: 3.7.2 3.9.0 3.8.2 Resolution: Fixed Issue resolved by pull request 1978 [https://github.com/apache/zookeeper/pull/1978] > CREATE2 requests embeded in a MULTI request only get a regular CREATE response > -- > > Key: ZOOKEEPER-4026 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4026 > Project: ZooKeeper > Issue Type: Bug > Components: server >Affects Versions: 3.5.8, 3.6.2 > Environment: Tested with official docker hub images of the server and > a python Zookeeper client (Kazoo, http://github.com/python-zk/kazoo) >Reporter: Charles-Henri de Boysson >Assignee: Damien Diederen >Priority: Major > Labels: pull-request-available > Fix For: 3.7.2, 3.9.0, 3.8.2 > > Attachments: MULTI_CREATE2_bug.txt > > Time Spent: 6h > Remaining Estimate: 0h > > When making a MULTI request with a CREATE2 payload, the reply from the server > only contains a regular CREATE response (the path but without the stat data). > > See attachment for a capture and decode of the request/reply. > > How to reproduce: > * Connect to the ensemble > * Make a MULTI (OpCode 14) request with a CREATE2 operation (OpCode 15) > * Reply from server is success, znode is create, but the MULTI reply > contains a CREATE (OpCode 1) > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ZOOKEEPER-4708) ZooKeeper 3.6.4 quorum failing due to address
[ https://issues.apache.org/jira/browse/ZOOKEEPER-4708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17735607#comment-17735607 ] Paolo Patierno commented on ZOOKEEPER-4708: --- It turned out that binding to 0.0.0.0 doesn't work properly for a 1-node ZooKeeper ensemble; we got the following problem: {code:java} 2023-06-21 07:32:59,906 INFO ** newQVAcksetPair.getQuorumVerifier().getVotingMembers().get(self.getId()).addr = my-cluster-zookeeper-0.my-cluster-zookeeper-nodes.default.svc/10.244.0.54:2888 (org.apache.zookeeper.server.quorum.Leader) [SyncThread:1] 2023-06-21 07:32:59,906 INFO ** self.getQuorumAddress() = /0.0.0.0:2888 (org.apache.zookeeper.server.quorum.Leader) [SyncThread:1] 2023-06-21 07:32:59,907 ERROR Severe unrecoverable error, from thread : SyncThread:1 (org.apache.zookeeper.server.ZooKeeperCriticalThread) [SyncThread:1] java.util.NoSuchElementException at java.base/java.util.HashMap$HashIterator.nextNode(HashMap.java:1599) at java.base/java.util.HashMap$KeyIterator.next(HashMap.java:1620) at org.apache.zookeeper.server.quorum.Leader.getDesignatedLeader(Leader.java:864) at org.apache.zookeeper.server.quorum.Leader.tryToCommit(Leader.java:939) at org.apache.zookeeper.server.quorum.Leader.processAck(Leader.java:1029) at org.apache.zookeeper.server.quorum.AckRequestProcessor.processRequest(AckRequestProcessor.java:47) at org.apache.zookeeper.server.SyncRequestProcessor.flush(SyncRequestProcessor.java:246) at org.apache.zookeeper.server.SyncRequestProcessor.run(SyncRequestProcessor.java:169) {code} This is still in Leader.getDesignatedLeader, and this time the self address is different from the one coming from the voting members (which is just one node). The code moves forward to get another candidate with long curCandidate = candidates.iterator().next(); but no such candidate exists. I was wondering why ZooKeeper is not able to recover or refresh the address resolution if this is really a slow DNS registration problem. 
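The NoSuchElementException in the stack trace above boils down to calling iterator().next() on an empty collection, as Leader.getDesignatedLeader does with candidates. A minimal illustration (a hypothetical helper, not the ZooKeeper code itself), including a defensive variant that makes the empty case explicit:

```java
import java.util.Optional;
import java.util.Set;

// Illustrates the failure mode in the stack trace above: iterator().next()
// on an empty set throws NoSuchElementException. A defensive caller makes
// the empty case explicit instead of assuming a candidate exists.
class CandidatePicker {
    // Unsafe: mirrors the shape of candidates.iterator().next().
    static long pickUnsafe(Set<Long> candidates) {
        return candidates.iterator().next(); // throws if candidates is empty
    }

    // Safe variant: the caller must handle the empty case.
    static Optional<Long> pickSafe(Set<Long> candidates) {
        return candidates.stream().findFirst();
    }
}
```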
And just to reinforce: this problem doesn't exist with ZooKeeper 3.6.3. > ZooKeeper 3.6.4 quorum failing due to address > -- > > Key: ZOOKEEPER-4708 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4708 > Project: ZooKeeper > Issue Type: Bug >Affects Versions: 3.6.4, 3.8.1 >Reporter: Paolo Patierno >Priority: Major > 
[jira] [Updated] (ZOOKEEPER-4708) ZooKeeper 3.6.4 quorum failing due to address
[ https://issues.apache.org/jira/browse/ZOOKEEPER-4708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Paolo Patierno updated ZOOKEEPER-4708: -- Description: We work on the Strimzi project, which is about deploying an Apache Kafka cluster on Kubernetes together with a ZooKeeper ensemble. Up to ZooKeeper 3.6.3 (brought in by Kafka 3.4.0), there were no issues when running on minikube for development purposes. With ZooKeeper 3.6.4 (brought in by Kafka 3.4.1), we started to have issues during quorum formation and leader election. The first one was the ZooKeeper pods not being able to bind the quorum port 3888 to the Cluster IP: during DNS resolution they get the loopback address instead. Here is a sample log at ZooKeeper startup where you can see the binding at 127.0.0.1:3888 instead of something like 172.17.0.4:3888 (a valid, non-loopback IP address). {code:java} INFO 3 is accepting connections now, my election bind port: my-cluster-zookeeper-2.my-cluster-zookeeper-nodes.default.svc/127.0.0.1:3888 (org.apache.zookeeper.server.quorum.QuorumCnxManager) [ListenerHandler-my-cluster-zookeeper-2.my-cluster-zookeeper-nodes.default.svc/127.0.0.1:3888] {code} This specific issue had two solutions: setting quorumListenOnAllIPs=true in the ZooKeeper configuration or binding to the 0.0.0.0 address. It is not clear why neither was needed until 3.6.3 but one is needed to get 3.6.4 working. What changed from this perspective? That said, while binding to 0.0.0.0 seems to work fine, quorumListenOnAllIPs=true doesn't. Assuming a ZooKeeper ensemble with 3 nodes, the log of the current ZooKeeper leader (ID=3) shows the following. (The lines starting with ** are additional logs we added to {{org.apache.zookeeper.server.quorum.Leader#getDesignatedLeader}} to get more information.) 
{code:java} 2023-06-19 12:32:51,990 INFO Have quorum of supporters, sids: [[1, 3],[1, 3]]; starting up and setting last processed zxid: 0x1 (org.apache.zookeeper.server.quorum.Leader) [QuorumPeer[myid=3](plain=127.0.0.1:12181)(secure=0.0.0.0:2181)] 2023-06-19 12:32:51,990 INFO ** newQVAcksetPair.getQuorumVerifier().getVotingMembers().get(self.getId()).addr = my-cluster-zookeeper-2.my-cluster-zookeeper-nodes.default.svc/172.17.0.6:2888 (org.apache.zookeeper.server.quorum.Leader) [QuorumPeer[myid=3](plain=127.0.0.1:12181)(secure=0.0.0.0:2181)] 2023-06-19 12:32:51,990 INFO ** self.getQuorumAddress() = my-cluster-zookeeper-2.my-cluster-zookeeper-nodes.default.svc/:2888 (org.apache.zookeeper.server.quorum.Leader) [QuorumPeer[myid=3](plain=127.0.0.1:12181)(secure=0.0.0.0:2181)] 2023-06-19 12:32:51,992 INFO ** qs.addr my-cluster-zookeeper-2.my-cluster-zookeeper-nodes.default.svc/172.17.0.6:2888, qs.electionAddr my-cluster-zookeeper-2.my-cluster-zookeeper-nodes.default.svc/172.17.0.6:3888, qs.clientAddr/127.0.0.1:12181 (org.apache.zookeeper.server.quorum.QuorumPeer) [QuorumPeer[myid=3](plain=127.0.0.1:12181)(secure=0.0.0.0:2181)] 2023-06-19 12:32:51,992 DEBUG zookeeper (org.apache.zookeeper.common.PathTrie) [QuorumPeer[myid=3](plain=127.0.0.1:12181)(secure=0.0.0.0:2181)] 2023-06-19 12:32:51,993 WARN Restarting Leader Election (org.apache.zookeeper.server.quorum.QuorumPeer) [QuorumPeer[myid=3](plain=127.0.0.1:12181)(secure=0.0.0.0:2181)] {code} So the leader is the ZooKeeper node with ID=3, and it was ACKed by node ID=1. As you can see, we are in the {{Leader#startZkServer}} method, and because reconfiguration is enabled, the designated leader is computed. The problem is that {{Leader#getDesignatedLeader}} is not returning "self" as leader but another node (ID=1), because of the difference in the quorum address. 
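The "difference in the quorum address" can be illustrated in isolation with java.net.InetSocketAddress: a resolved instance and an unresolved one never compare equal, even for the same hostname and port. A hedged standalone sketch (not ZooKeeper code; "localhost" stands in for the pod hostname):

```java
import java.net.InetSocketAddress;

public class QuorumAddressSketch {
    public static void main(String[] args) {
        // A voting-member entry whose hostname resolved normally...
        InetSocketAddress resolved = new InetSocketAddress("localhost", 2888);
        // ...versus a quorum address whose DNS lookup failed at startup,
        // like the "svc/:2888" (empty IP) entry in the log above.
        InetSocketAddress unresolved =
                InetSocketAddress.createUnresolved("localhost", 2888);

        // equals() compares InetAddress for resolved instances but hostname
        // for unresolved ones, so a mixed pair is never equal; an address
        // comparison between the two then fails despite the same hostname.
        System.out.println(resolved.equals(unresolved));  // false
        System.out.println(unresolved.isUnresolved());    // true
    }
}
```

This matches the reporter's observation that the addresses are "not an actual difference" in hostname terms, yet still compare as different.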
From the above log, it's not an actual difference in terms of addresses: {{self.getQuorumAddress()}} is returning an unresolved address (even if it's still the same hostname, referring to the ZooKeeper-2 instance). This difference causes allowedToCommit=false: ZooKeeper-2 is still reported as leader but it is not able to commit, so it prevents any requests and the ZooKeeper ensemble gets stuck. {code:java} 2023-06-19 12:32:51,996 WARN Suggested leader: 1 (org.apache.zookeeper.server.quorum.QuorumPeer) [QuorumPeer[myid=3](plain=127.0.0.1:12181)(secure=0.0.0.0:2181)] 2023-06-19 12:32:51,996 WARN This leader is not the designated leader, it will be initialized with allowedToCommit = false (org.apache.zookeeper.server.quorum.Leader) [QuorumPeer[myid=3](plain=127.0.0.1:12181)(secure=0.0.0.0:2181)] {code} The overall issue could be related to DNS problems, with DNS records not yet registered during pod initialization (ZooKeeper is running on Kubernetes), but we don't understand why it is not able to recover somehow. What we don't get is why ZooKeeper 3.6.3 didn't need any binding-specific configuration and was working just fine, while the new 3.6.4 needs it.
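For reference, the two workarounds described in this report would look roughly like this in zoo.cfg (a sketch: the hostnames follow the pattern in the logs, the server.1/server.2 entries and the exact module layout are assumptions, and the wildcard form is applied only to the server's own entry):

```
# Workaround 1: accept quorum/election connections on all local interfaces
quorumListenOnAllIPs=true

# Workaround 2 (alternative, shown here for server 3's own config file):
# bind this server's quorum address to the wildcard address instead of
# its hostname; ports 2888:3888 as in the logs above
server.1=my-cluster-zookeeper-0.my-cluster-zookeeper-nodes.default.svc:2888:3888
server.2=my-cluster-zookeeper-1.my-cluster-zookeeper-nodes.default.svc:2888:3888
server.3=0.0.0.0:2888:3888
```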
[jira] [Resolved] (ZOOKEEPER-4271) Flaky test - ReadOnlyModeTest.testConnectionEvents
[ https://issues.apache.org/jira/browse/ZOOKEEPER-4271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kezhu Wang resolved ZOOKEEPER-4271. --- Resolution: Duplicate > Flaky test - ReadOnlyModeTest.testConnectionEvents > -- > > Key: ZOOKEEPER-4271 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4271 > Project: ZooKeeper > Issue Type: Bug > Components: tests >Affects Versions: 3.6.2 >Reporter: Amichai Rothman >Priority: Minor > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > The test fails sometimes. If I run this test class (with > -Dtest=ReadOnlyModeTest) in a loop it always hits the failure eventually > after a few runs. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (ZOOKEEPER-3996) Flaky test: ReadOnlyModeTest.testConnectionEvents
[ https://issues.apache.org/jira/browse/ZOOKEEPER-3996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andor Molnar resolved ZOOKEEPER-3996. - Fix Version/s: 3.9.0 Assignee: Kezhu Wang Resolution: Fixed > Flaky test: ReadOnlyModeTest.testConnectionEvents > - > > Key: ZOOKEEPER-3996 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3996 > Project: ZooKeeper > Issue Type: Bug > Components: tests >Reporter: Ling Mao >Assignee: Kezhu Wang >Priority: Minor > Labels: pull-request-available > Fix For: 3.9.0 > > Time Spent: 50m > Remaining Estimate: 0h > > We noticed that the unit test ReadOnlyModeTest.testConnectionEvents has > been failing frequently in CI builds. > The link is: > https://ci-hadoop.apache.org/blue/organizations/jenkins/zookeeper-precommit-github-pr/detail/PR-1527/1/pipeline/ > {code:java} > [2020-11-06T13:21:34.527Z] [INFO] Running > org.apache.zookeeper.RemoveWatchesTest > [2020-11-06T13:21:36.136Z] [INFO] Tests run: 352, Failures: 0, Errors: 0, > Skipped: 0, Time elapsed: 14.475 s - in > org.apache.zookeeper.common.X509UtilTest > [2020-11-06T13:22:06.176Z] [INFO] Tests run: 13, Failures: 0, Errors: 0, > Skipped: 0, Time elapsed: 414.867 s - in > org.apache.zookeeper.server.quorum.QuorumSSLTest > [2020-11-06T13:22:41.949Z] [INFO] Tests run: 46, Failures: 0, Errors: 0, > Skipped: 0, Time elapsed: 66.898 s - in org.apache.zookeeper.RemoveWatchesTest > [2020-11-06T13:22:41.949Z] [INFO] > [2020-11-06T13:22:41.949Z] [INFO] Results: > [2020-11-06T13:22:41.949Z] [INFO] > [2020-11-06T13:22:41.949Z] [ERROR] Errors: > [2020-11-06T13:22:41.949Z] [ERROR] > ReadOnlyModeTest.testConnectionEvents:205 » Timeout Failed to connect in > read-... 
> [2020-11-06T13:22:41.949Z] [INFO] > [2020-11-06T13:22:41.949Z] [ERROR] Tests run: 2863, Failures: 0, Errors: 1, > Skipped: 4 > [2020-11-06T13:22:41.949Z] [INFO] > [2020-11-06T13:22:43.552Z] [INFO] > > [2020-11-06T13:22:43.552Z] [INFO] Reactor Summary for Apache ZooKeeper > 3.7.0-SNAPSHOT:{code} -- This message was sent by Atlassian Jira (v8.20.10#820010)