[jira] [Created] (ZOOKEEPER-4831) update slf4j from 1.x to 2.0.13, logback to 1.3.14
ZhangJian He created ZOOKEEPER-4831: --- Summary: update slf4j from 1.x to 2.0.13, logback to 1.3.14 Key: ZOOKEEPER-4831 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4831 Project: ZooKeeper Issue Type: Improvement Reporter: ZhangJian He -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ZOOKEEPER-4830) zk_learners is incorrectly referenced as zk_followers
Nicholas Feinberg created ZOOKEEPER-4830: Summary: zk_learners is incorrectly referenced as zk_followers Key: ZOOKEEPER-4830 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4830 Project: ZooKeeper Issue Type: Bug Affects Versions: 3.9.2, 3.8.4 Reporter: Nicholas Feinberg https://issues.apache.org/jira/browse/ZOOKEEPER-3117 renamed the `zk_followers` metric to `zk_learners`, but some references to `zk_followers` remained in the repo, including in the documentation. These should be corrected. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ZOOKEEPER-4829) Support DatadirCleanup in minutes
Purshotam Shah created ZOOKEEPER-4829: - Summary: Support DatadirCleanup in minutes Key: ZOOKEEPER-4829 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4829 Project: ZooKeeper Issue Type: Improvement Reporter: Purshotam Shah On the cloud, space can be limited. Currently, the DatadirCleanup only supports hours; we should also support cleanup intervals in minutes. 2024-02-20 20:55:28,862 - WARN [QuorumPeer[myid=5](plain=disabled)(secure=[0:0:0:0:0:0:0:0]:50512):o.a.z.s.q.Follower@131] - Exception when following the leader java.io.IOException: No space left on device at java.base/java.io.FileOutputStream.writeBytes(Native Method) at java.base/java.io.FileOutputStream.write(FileOutputStream.java:354) at org.apache.zookeeper.common.AtomicFileOutputStream.write(AtomicFileOutputStream.java:72) at java.base/sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:233) at java.base/sun.nio.cs.StreamEncoder.implFlushBuffer(StreamEncoder.java:312) at java.base/sun.nio.cs.StreamEncoder.implFlush(StreamEncoder.java:316) at java.base/sun.nio.cs.StreamEncoder.flush(StreamEncoder.java:153) at java.base/java.io.OutputStreamWriter.flush(OutputStreamWriter.java:251) at java.base/java.io.BufferedWriter.flush(BufferedWriter.java:257) at org.apache.zookeeper.common.AtomicFileWritingIdiom.<init>(AtomicFileWritingIdiom.java:72) at org.apache.zookeeper.common.AtomicFileWritingIdiom.<init>(AtomicFileWritingIdiom.java:54) at org.apache.zookeeper.server.quorum.QuorumPeer.writeLongToFile(QuorumPeer.java:2229) at org.apache.zookeeper.server.quorum.QuorumPeer.setAcceptedEpoch(QuorumPeer.java:2258) at org.apache.zookeeper.server.quorum.Learner.registerWithLeader(Learner.java:511) at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:91) at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1551) 2024-02-20 20:55:28,863 - INFO [QuorumPeer[myid=5](plain=disabled)(secure=[0:0:0:0:0:0:0:0]:50512):o.a.z.s.q.Follower@145] -- This message was sent by Atlassian Jira (v8.20.10#820010)
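[Editorial note: a minimal sketch of what a minutes-capable purge schedule could look like. The class and method below are hypothetical; only PurgeTxnLog.purge is the existing API, which DatadirCleanupManager today invokes from an hours-granularity timer.]
{code:java}
import java.io.File;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import org.apache.zookeeper.server.PurgeTxnLog;

// Hypothetical sketch: drive the existing purge routine from a
// TimeUnit-based schedule so intervals can be minutes, not just hours.
public class MinutePurgeSchedulerSketch {
    public static ScheduledExecutorService schedule(File dataLogDir, File snapDir,
            int snapRetainCount, long interval, TimeUnit unit) {
        ScheduledExecutorService ses = Executors.newSingleThreadScheduledExecutor();
        ses.scheduleWithFixedDelay(() -> {
            try {
                // Same purge routine DatadirCleanupManager runs today.
                PurgeTxnLog.purge(dataLogDir, snapDir, snapRetainCount);
            } catch (Exception e) {
                e.printStackTrace(); // keep the schedule alive on transient errors
            }
        }, interval, interval, unit);
        return ses;
    }
}
{code}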
[jira] [Created] (ZOOKEEPER-4828) Minor 3.9 broke custom TLS setup with ssl.context.supplier.class
Jon Marius Venstad created ZOOKEEPER-4828: - Summary: Minor 3.9 broke custom TLS setup with ssl.context.supplier.class Key: ZOOKEEPER-4828 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4828 Project: ZooKeeper Issue Type: Bug Reporter: Jon Marius Venstad We run embedded ZooKeeper in Vespa and use a custom TLS stack, where we, e.g., do additional validation and authorisation of client certificates in our TLS trust manager. The changes in https://github.com/apache/zookeeper/commit/4a794276d3d371071c31f86c14da824fdd2e53c0, done for ZOOKEEPER-4622, broke the `ssl.context.supplier.class` configuration parameter, documented in the ZK admin guide (https://zookeeper.apache.org/doc/current/zookeeperAdmin.html#sc_configuration). Consequently, current code (3.9.2) now enforces a file-based key and trust store for _any TLS_, which is not an option for us. I looked at two ways to fix this: 1. Add new configuration parameters for _key_ and _trust_ store _suppliers_, as an alternative to the key and trust store _files_ required by the new (as of 3.9.0) ClientX509Util code. This adds another pair of config options, of which there are already plenty, and the user is stuck with the default JDK `Provider` (the optional argument to SSLContext.getInstance(protocol, provider)); on the other hand, it lets users with a custom key and trust store use the native SSL support of Netty. Netty also provides the option to specify a JDK `Provider` in the SslContextBuilder, so that _could_ be made configurable as well. 2. Restore the option of specifying a custom SSL context, and prefer it over the Netty SslContextBuilder in the new ClientX509Util code when present. This lets users specify a JDK `Provider`, but file-based key and trust stores will still be required for the native SSL added in 3.9.0. I don't have a strong opinion on which option is better. I can also contribute a code change with either. -- This message was sent by Atlassian Jira (v8.20.10#820010)
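[Editorial note: for reference, a minimal sketch of the kind of supplier the `ssl.context.supplier.class` parameter loads, assuming (per the pre-3.9 code path) the class implements java.util.function.Supplier<SSLContext>. The extension point for a custom trust manager is marked in a comment; this is an illustration, not the Vespa implementation.]
{code:java}
import java.util.function.Supplier;
import javax.net.ssl.SSLContext;

public class CustomSslContextSupplier implements Supplier<SSLContext> {
    @Override
    public SSLContext get() {
        try {
            SSLContext ctx = SSLContext.getInstance("TLSv1.3");
            // Platform defaults shown here; a custom TrustManager doing the
            // extra validation/authorisation described above would be passed
            // as the second argument instead of null.
            ctx.init(null, null, null);
            return ctx;
        } catch (Exception e) {
            throw new IllegalStateException("cannot build SSL context", e);
        }
    }
}
{code}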
[jira] [Created] (ZOOKEEPER-4827) Bump bouncycastle version from 1.75 to 1.78
ZhangJian He created ZOOKEEPER-4827: --- Summary: Bump bouncycastle version from 1.75 to 1.78 Key: ZOOKEEPER-4827 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4827 Project: ZooKeeper Issue Type: Task Reporter: ZhangJian He Upgrade Bouncy Castle to 1.78 to address CVEs https://bouncycastle.org/releasenotes.html#r1rv78 - https://www.cve.org/CVERecord?id=CVE-2024-29857 (reserved) - https://security.snyk.io/vuln/SNYK-JAVA-ORGBOUNCYCASTLE-6613079 - https://www.cve.org/CVERecord?id=CVE-2024-30171 (reserved) - https://security.snyk.io/vuln/SNYK-JAVA-ORGBOUNCYCASTLE-6613076 - https://www.cve.org/CVERecord?id=CVE-2024-30172 (reserved) - https://security.snyk.io/vuln/SNYK-JAVA-ORGBOUNCYCASTLE-6612984 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ZOOKEEPER-4826) Reduce unnecessary executable permissions on files
ZhangJian He created ZOOKEEPER-4826: --- Summary: Reduce unnecessary executable permissions on files Key: ZOOKEEPER-4826 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4826 Project: ZooKeeper Issue Type: Improvement Reporter: ZhangJian He ***Summary:*** This patch aims to modify the permissions of various files within the ZooKeeper repository that currently have executable permissions set (755) but do not require such permissions for their operation. Changing these permissions to 644 enhances security and maintains the consistency of file permissions throughout the project. ***Details:*** Several non-executable files (not including scripts or executable binaries) are currently set with executable permissions. This is generally unnecessary and can lead to potential security concerns. This patch will adjust these permissions to a more appropriate setting (644), which is sufficient for reading and writing operations but does not allow execution. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ZOOKEEPER-4825) CVE-2023-6378 is present in the current logback version (1.2.13) and hence we need to upgrade to 1.4.12
Bhavya hoda created ZOOKEEPER-4825: -- Summary: CVE-2023-6378 is present in the current logback version (1.2.13) and hence we need to upgrade to 1.4.12 Key: ZOOKEEPER-4825 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4825 Project: ZooKeeper Issue Type: Bug Reporter: Bhavya hoda -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ZOOKEEPER-4824) fix CVE-2024-29025 in netty package
Nikita Pande created ZOOKEEPER-4824: --- Summary: fix CVE-2024-29025 in netty package Key: ZOOKEEPER-4824 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4824 Project: ZooKeeper Issue Type: Improvement Reporter: Nikita Pande [CVE-2024-29025|https://github.com/advisories/GHSA-5jpm-x58v-624v] is the CVE for all netty-codec-http < 4.1.108.Final -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ZOOKEEPER-4823) Proposal: Update the wiki of Zab 1.0 (Phase 2) to make it more precise and conform to the implementation
Sirius created ZOOKEEPER-4823: - Summary: Proposal: Update the wiki of Zab 1.0 (Phase 2) to make it more precise and conform to the implementation Key: ZOOKEEPER-4823 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4823 Project: ZooKeeper Issue Type: Improvement Reporter: Sirius As ZooKeeper has evolved over the years, its code implementation has deviated from the design of [Zab 1.0|https://cwiki.apache.org/confluence/display/ZOOKEEPER/Zab1.0] in several aspects. One critical deviation lies in the _atomic actions_ taken when a follower receives NEWLEADER (see 2.*f* in Phase 2). The protocol requires that the follower "_*atomically*_ applies the new state and sets *f*.currentEpoch = _e_". However, this atomicity is not guaranteed by the current code implementation. Asynchronous logging and committing across multiple threads, combined with node crashes, can interrupt this process and lead to possible data loss (see -ZOOKEEPER-3911-, ZOOKEEPER-4643, ZOOKEEPER-4646, -ZOOKEEPER-4785-). On the other hand, implementing atomicity is expensive and affects performance. It is reasonable to adopt an implementation that does not require atomic updates in this step. We highly recommend updating the design of Zab to drop the atomicity requirement in Step 2.*f*, so that it better guides the code implementation. h3. Update Step 2.*f* by removing the requirement of atomicity Here is a possible design of Step 2.*f* in Phase 2 with the atomicity requirement removed. h4. Phase 2: Sync with followers # *l* ... # *f* The follower syncs with the leader, but doesn't modify its state until it receives the NEWLEADER(_e_) packet. Once it receives NEWLEADER(_e_), -_it atomically applies the new state, and then sets f.currentEpoch = e. It then sends ACK(e << 32)._- it executes the following actions sequentially: *2.1. applies the new state;* *2.2. sets f.currentEpoch = e;* *2.3. sends ACK(e << 32).* # *l* ... Note: * To ensure correctness without requiring atomicity, the follower must persist and sync the data before it updates its currentEpoch and replies with the NEWLEADER ack (see the analysis in ZOOKEEPER-4643 & ZOOKEEPER-4785). * This new design conforms to the code implementation in the current latest code version (ZooKeeper v3.9.2). This code version has fixed the known data loss issues that stayed unresolved for a long time due to non-atomic executions in Step 2.*f*, including -ZOOKEEPER-3911-, ZOOKEEPER-4643, ZOOKEEPER-4646 & -ZOOKEEPER-4785- (see the code fixes in [PR-2111|https://github.com/apache/zookeeper/pull/2111] & [PR-2152|https://github.com/apache/zookeeper/pull/2152]). * The correctness of this new design has been verified with the TLA+ specifications of Zab at different abstraction levels, including ** the [high-level protocol specification|https://github.com/AlphaCanisMajoris/zookeeper-tla-spec/blob/main/Zab_new.tla] (developed based on the original [protocol spec|https://github.com/apache/zookeeper/blob/master/zookeeper-specifications/protocol-spec/Zab.tla]) ** the [multi-threading-level specification|https://github.com/AlphaCanisMajoris/zookeeper-tla-spec/blob/main/zk_pr_2152.tla] (developed based on the original [system spec|https://github.com/apache/zookeeper/blob/master/zookeeper-specifications/system-spec/zk-3.7/ZkV3_7_0.tla]; this spec corresponds to [PR-2152|https://github.com/apache/zookeeper/pull/2152], an effort to fix more known issues in Phase 2). In the verification, the TLC model checker checks whether the new design satisfies the properties given by the Zab paper.
No violation was found during checking with various configurations. We sincerely hope that the above update to the protocol design can be presented on the wiki page, so that it better guides the future code implementation! About us: We are a research team using TLA+ to verify the correctness of distributed systems. Looking forward to receiving feedback from the ZooKeeper community! -- This message was sent by Atlassian Jira (v8.20.10#820010)
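[Editorial note: a compact Java-flavoured sketch of the ordered, non-atomic handling proposed in Step 2.*f* above. All class, method, and field names here are illustrative, not ZooKeeper's actual ones.]
{code:java}
import java.io.IOException;

// Illustrative only: Step 2.f executed sequentially rather than atomically.
public abstract class FollowerSyncSketch {
    abstract void takeSnapshotAndSync() throws IOException; // persist + fsync
    abstract void setCurrentEpoch(long e) throws IOException;
    abstract void sendAck(long zxid);

    final void onNewLeader(long e) throws IOException {
        takeSnapshotAndSync(); // 2.1 persist the synced state BEFORE anything else
        setCurrentEpoch(e);    // 2.2 only afterwards record the new epoch
        sendAck(e << 32);      // 2.3 finally acknowledge NEWLEADER(e)
    }
}
{code}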
[jira] [Created] (ZOOKEEPER-4822) Quorum TLS - Enable member authorization based on certificate CN
Damien Diederen created ZOOKEEPER-4822: -- Summary: Quorum TLS - Enable member authorization based on certificate CN Key: ZOOKEEPER-4822 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4822 Project: ZooKeeper Issue Type: New Feature Components: server Reporter: Damien Diederen Assignee: Damien Diederen Quorum TLS enables mutual authentication of quorum members. Member authorization, however, cannot be configured on the basis of the presented principal CN; a round of SASL authentication has to be performed on top of the secured connection. This ticket is about enabling authorization based on trusted client certificates. -- This message was sent by Atlassian Jira (v8.20.10#820010)
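[Editorial note: a hedged sketch, using only standard JDK APIs (the class name is hypothetical), of extracting the CN that such an authorization check would key on.]
{code:java}
import java.security.cert.X509Certificate;
import javax.naming.InvalidNameException;
import javax.naming.ldap.LdapName;
import javax.naming.ldap.Rdn;

public final class CertCnSketch {
    // Returns the CN attribute of the certificate's subject DN, or null.
    static String extractCN(X509Certificate cert) throws InvalidNameException {
        LdapName dn = new LdapName(cert.getSubjectX500Principal().getName());
        for (Rdn rdn : dn.getRdns()) {
            if ("CN".equalsIgnoreCase(rdn.getType())) {
                return rdn.getValue().toString();
            }
        }
        return null;
    }
}
{code}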
[jira] [Created] (ZOOKEEPER-4821) ConnectRequest got NOTREADONLY ReplyHeader
Kezhu Wang created ZOOKEEPER-4821: - Summary: ConnectRequest got NOTREADONLY ReplyHeader Key: ZOOKEEPER-4821 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4821 Project: ZooKeeper Issue Type: Bug Components: java client, server Affects Versions: 3.9.2, 3.8.4 Reporter: Kezhu Wang I would expect {{ConnectRequest}} to have two kinds of response in normal conditions: {{ConnectResponse}} and socket close. But if the server is configured with {{readonlymode.enabled}} but not {{localSessionsEnabled}}, then the client can get {{NOTREADONLY}} in reply to {{ConnectRequest}}. As far as I can see, there is no handling for this in the Java client; I encountered it while writing tests for a Rust client. I guess it is not by design, and we could probably close the socket in an early phase. Alternatively, it could be solved on the client side, since {{sizeof(ConnectResponse)}} is larger than {{sizeof(ReplyHeader)}}: a client can tell a {{ReplyHeader}} apart by size, which gives us the ability to carry an error for {{ConnectRequest}} even though {{ConnectResponse}} itself cannot. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ZOOKEEPER-4820) zookeeper pom leaks logback dependency
PJ Fanning created ZOOKEEPER-4820: - Summary: zookeeper pom leaks logback dependency Key: ZOOKEEPER-4820 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4820 Project: ZooKeeper Issue Type: Task Components: java client Reporter: PJ Fanning Since v3.8.0 https://mvnrepository.com/artifact/org.apache.zookeeper/zookeeper/3.8.0 It's fine that Zookeeper uses Logback on the server side - but users who want to access Zookeeper using client-side code also add this zookeeper jar to their classpaths. When zookeeper is used as a client-side lib, it should ideally not expose a logback dependency - just an slf4j-api jar dependency. Would it be possible to rework the zookeeper pom so that client-side users don't have to explicitly exclude logback jars? Many users will have their own preferred logging framework. Is there another zookeeper client-side jar that could be used instead of zookeeper.jar? -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ZOOKEEPER-4819) Can't seek for writable tls server if connected to readonly server
Kezhu Wang created ZOOKEEPER-4819: - Summary: Can't seek for writable tls server if connected to readonly server Key: ZOOKEEPER-4819 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4819 Project: ZooKeeper Issue Type: Bug Components: java client Affects Versions: 3.9.2, 3.8.4 Reporter: Kezhu Wang {{[ClientCnxn::pingRwServer|https://github.com/apache/zookeeper/blob/d12aba599233b0fcba0b9b945ed3d2f45d4016f0/zookeeper-server/src/main/java/org/apache/zookeeper/ClientCnxn.java#L1280]}} uses a raw socket to issue the "isro" 4lw command. This results in an unsuccessful handshake when the remote server only accepts TLS. -- This message was sent by Atlassian Jira (v8.20.10#820010)
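[Editorial note: a hedged sketch of what a TLS-aware "isro" probe could look like. In real code the socket factory would have to come from the client's configured X509 settings; the JVM-default factory used below is just for illustration.]
{code:java}
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import javax.net.ssl.SSLSocket;
import javax.net.ssl.SSLSocketFactory;

public final class IsroOverTlsSketch {
    // Sends "isro" over a TLS connection and returns the reply ("rw" or "ro").
    static String probe(String host, int port) throws Exception {
        SSLSocketFactory factory = (SSLSocketFactory) SSLSocketFactory.getDefault();
        try (SSLSocket sock = (SSLSocket) factory.createSocket(host, port)) {
            sock.startHandshake(); // this is the step a raw Socket cannot do
            OutputStream out = sock.getOutputStream();
            out.write("isro".getBytes(StandardCharsets.UTF_8));
            out.flush();
            byte[] buf = new byte[8];
            int n = sock.getInputStream().read(buf);
            return n > 0 ? new String(buf, 0, n, StandardCharsets.UTF_8) : "";
        }
    }
}
{code}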
[jira] [Created] (ZOOKEEPER-4818) Export JVM heap metrics in ServerMetrics
Andrew Kyle Purtell created ZOOKEEPER-4818: -- Summary: Export JVM heap metrics in ServerMetrics Key: ZOOKEEPER-4818 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4818 Project: ZooKeeper Issue Type: Improvement Components: metric system Reporter: Andrew Kyle Purtell A metric for JVM heap occupancy is not included in ServerMetrics. According to [https://zookeeper.apache.org/doc/current/zookeeperMonitor.html], the recommended practice is to enable the PrometheusMetricsProvider, and the Prometheus base class upon which that provider is built does export that information. The example provided there for alerting on heap utilization is: {noformat} - alert: JvmMemoryFillingUp expr: jvm_memory_bytes_used / jvm_memory_bytes_max{area="heap"} > 0.8 for: 5m labels: severity: warning annotations: summary: "JVM memory filling up (instance {{ $labels.instance }})" description: "JVM memory is filling up (> 80%)\n labels: {{ $labels }} value = {{ $value }}\n" {noformat} where {{jvm_memory_bytes_used}} and {{jvm_memory_bytes_max}} are provided by a Prometheus base class. Where PrometheusMetricsProvider is the right choice, that's good enough; but where the ServerMetrics information is consumed another way, by 4-letter-word scraping or by JMX, ServerMetrics should provide the same information. {{jvm_memory_bytes_used}} and {{jvm_memory_bytes_max}} (presuming heap) are reasonable names. An alternative could be to calculate the heap occupancy and provide it as a percentage, either an integer in the range 0 - 100 or a floating point value in the range 0.0 - 1.0. -- This message was sent by Atlassian Jira (v8.20.10#820010)
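[Editorial note: if ServerMetrics were to export these values, they could plausibly come straight from the JDK's MemoryMXBean, mirroring what the Prometheus base class reports. A minimal sketch; the class and method names are hypothetical.]
{code:java}
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

public final class HeapMetricsSketch {
    public static void main(String[] args) {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        long used = heap.getUsed(); // analogous to jvm_memory_bytes_used{area="heap"}
        long max = heap.getMax();   // analogous to jvm_memory_bytes_max{area="heap"}
        // The suggested alternative: occupancy as a 0.0 - 1.0 ratio
        // (getMax() may be -1 when undefined, hence the guard).
        double occupancy = max > 0 ? (double) used / max : Double.NaN;
        System.out.printf("heap used=%d max=%d occupancy=%.2f%n", used, max, occupancy);
    }
}
{code}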
[jira] [Created] (ZOOKEEPER-4817) CancelledKeyException does not work in some cases.
gendong1 created ZOOKEEPER-4817: --- Summary: CancelledKeyException does not work in some cases. Key: ZOOKEEPER-4817 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4817 Project: ZooKeeper Issue Type: Bug Affects Versions: 3.10.0 Reporter: gendong1 If the client connection is disconnected from the ZooKeeper server, a CancelledKeyException will arise. Here is a strange scenario: NIOServerCnxn.doIO is blocked at line 333 by a fail-slow NIC. If the delay lasts more than 30s, the CancelledKeyException disappears; if the delay lasts for 25s, the CancelledKeyException arises. When doIO encounters the slowdown caused by the fail-slow NIC, the context is the same in both cases. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ZOOKEEPER-4816) A follower can not join the cluster for 20 seconds
gendong1 created ZOOKEEPER-4816: --- Summary: A follower can not join the cluster for 20 seconds Key: ZOOKEEPER-4816 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4816 Project: ZooKeeper Issue Type: Bug Affects Versions: 3.10.0 Reporter: gendong1 Attachments: node1.log, node2.log, node3.log We encountered a strange scenario. When we set up a ZooKeeper cluster (3 nodes in total), the third node got stuck serializing the snapshot to the local disk. However, leader election executed normally, and after the election the third node was elected as the leader. The other two nodes failed to connect to the leader, so the first and second nodes restarted leader election, and finally the second node was elected as the leader. At this time, the third node still acted as a leader, so there were two leaders in the cluster. The first node could not join the cluster for 20s, and during this period the client could not connect to any node of the cluster. Runtime logs are attached. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ZOOKEEPER-4815) Customize the data format of /zookeeper/config
yangoofy created ZOOKEEPER-4815: --- Summary: Customize the data format of /zookeeper/config Key: ZOOKEEPER-4815 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4815 Project: ZooKeeper Issue Type: Improvement Reporter: yangoofy When using QuorumMaj, I hope to support custom /zookeeper/config node data formats, such as: server.x=xx.xx.xx.xx:2888:3888:observer;0.0.0.0:2181;Group1 server.y=xx.xx.xx.xx:2888:3888:observer;0.0.0.0:2181;Group2 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ZOOKEEPER-4814) Protocol desynchronization after Connect for (some) old clients
Damien Diederen created ZOOKEEPER-4814: -- Summary: Protocol desynchronization after Connect for (some) old clients Key: ZOOKEEPER-4814 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4814 Project: ZooKeeper Issue Type: Bug Components: server Affects Versions: 3.9.0 Reporter: Damien Diederen Assignee: Damien Diederen Some old clients experience a protocol desynchronization after receiving a {{ConnectResponse}} from the server. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ZOOKEEPER-4813) Make zookeeper start successfully when the last log file is dirty during the restore process
Yan Zhao created ZOOKEEPER-4813: --- Summary: Make zookeeper start successfully when the last log file is dirty during the restore process Key: ZOOKEEPER-4813 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4813 Project: ZooKeeper Issue Type: Improvement Components: server Affects Versions: 3.9.1 Reporter: Yan Zhao Assignee: Yan Zhao Fix For: 3.9.2 When zookeeper restarts, it restores the data from the last valid snapshot file and replays the txn log to append data. But if the last log file is empty for some reason, the restore fails and zookeeper cannot restart. {noformat} 14:12:16.023 [main] INFO org.apache.zookeeper.server.persistence.SnapStream - Invalid snapshot snapshot.188700025d87. len = 761554294, byte = 45 14:12:16.024 [main] INFO org.apache.zookeeper.server.persistence.FileSnap - Reading snapshot /pulsar/data/zookeeper/version-2/snapshot.188700025a05 14:12:17.350 [main] INFO org.apache.zookeeper.server.DataTree - The digest in the snapshot has digest version of 2, with zxid as 0x188700025b07, and digest value as 510776662607117 14:12:17.492 [main] ERROR org.apache.zookeeper.server.quorum.QuorumPeer - Unable to load database on disk java.io.EOFException: null at java.io.DataInputStream.readInt(DataInputStream.java:386) ~[?:?] at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:96) ~[org.apache.zookeeper-zookeeper-jute-3.9.1.jar:3.9.1] at org.apache.zookeeper.server.persistence.FileHeader.deserialize(FileHeader.java:67) ~[org.apache.zookeeper-zookeeper-jute-3.9.1.jar:3.9.1] at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.inStreamCreated(FileTxnLog.java:725) ~[org.apache.zookeeper-zookeeper-3.9.1.jar:3.9.1] at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.createInputArchive(FileTxnLog.java:743) ~[org.apache.zookeeper-zookeeper-3.9.1.jar:3.9.1] at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.goToNextLog(FileTxnLog.java:711) ~[org.apache.zookeeper-zookeeper-3.9.1.jar:3.9.1] at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.next(FileTxnLog.java:792) ~[org.apache.zookeeper-zookeeper-3.9.1.jar:3.9.1] at org.apache.zookeeper.server.persistence.FileTxnSnapLog.fastForwardFromEdits(FileTxnSnapLog.java:361) ~[org.apache.zookeeper-zookeeper-3.9.1.jar:3.9.1] at org.apache.zookeeper.server.persistence.FileTxnSnapLog.lambda$restore$0(FileTxnSnapLog.java:267) ~[org.apache.zookeeper-zookeeper-3.9.1.jar:3.9.1] at org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java:312) ~[org.apache.zookeeper-zookeeper-3.9.1.jar:3.9.1] at org.apache.zookeeper.server.ZKDatabase.loadDataBase(ZKDatabase.java:288) ~[org.apache.zookeeper-zookeeper-3.9.1.jar:3.9.1] at org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:1149) ~[org.apache.zookeeper-zookeeper-3.9.1.jar:3.9.1] at org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:1135) ~[org.apache.zookeeper-zookeeper-3.9.1.jar:3.9.1] at org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:229) ~[org.apache.zookeeper-zookeeper-3.9.1.jar:3.9.1] at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:137) ~[org.apache.zookeeper-zookeeper-3.9.1.jar:3.9.1] at org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:91) ~[org.apache.zookeeper-zookeeper-3.9.1.jar:3.9.1] 14:12:17.502 [main] INFO org.apache.zookeeper.metrics.prometheus.PrometheusMetricsProvider - Shutdown executor service with timeout
1000 14:12:17.508 [main] INFO org.eclipse.jetty.server.AbstractConnector - Stopped ServerConnector@2484f433{HTTP/1.1, (http/1.1)}{0.0.0.0:8000} 14:12:17.510 [main] INFO org.eclipse.jetty.server.handler.ContextHandler - Stopped o.e.j.s.ServletContextHandler@59a67c3a{/,null,STOPPED} 14:12:17.515 [main] ERROR org.apache.zookeeper.server.quorum.QuorumPeerMain - Unexpected exception, exiting abnormally java.lang.RuntimeException: Unable to run quorum server at org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:1204) ~[org.apache.zookeeper-zookeeper-3.9.1.jar:3.9.1] at org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:1135) ~[org.apache.zookeeper-zookeeper-3.9.1.jar:3.9.1] at org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:229) ~[org.apache.zookeeper-zookeeper-3.9.1.jar:3.9.1] at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:137) ~[org.apache.zookeeper-zookeeper-3.9.1.jar:3.9.1
[jira] [Created] (ZOOKEEPER-4812) Another reconfiguration is in progress -- concurrent reconfigs not supported (yet)
sunfeifei created ZOOKEEPER-4812: Summary: Another reconfiguration is in progress -- concurrent reconfigs not supported (yet) Key: ZOOKEEPER-4812 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4812 Project: ZooKeeper Issue Type: Bug Environment: ZK version is 3.7.0 # echo mntr |nc 127.0.0.1 2181|grep version zk_version 3.7.0-e3704b390a6697bfdf4b0bef79e3da7a4f6bac4b, built on 2021-03-17 09:46 UTC Reporter: sunfeifei Attachments: image-2024-02-28-11-49-37-465.png, image-2024-02-28-11-49-52-155.png When using the reconfig command to add or remove members, it reports "Another reconfiguration is in progress -- concurrent reconfigs not supported (yet)", and we are never able to add or remove nodes in the cluster. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ZOOKEEPER-4811) When configuring the IP address of the zk server on the client side to connect to zk, the connection establishment time is high
yangoofy created ZOOKEEPER-4811: --- Summary: When configuring the IP address of the zk server on the client side to connect to zk, the connection establishment time is high Key: ZOOKEEPER-4811 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4811 Project: ZooKeeper Issue Type: Bug Affects Versions: 3.7.1, 3.7.0 Reporter: yangoofy When configuring the IP address of the zk server on the client side to connect to zk, the connection establishment time is high, mainly because obtaining the hostname of the address takes approximately 5 seconds. 3.4.6 had a mechanism to safely avoid the reverse DNS lookup, but 3.7 does not. 1. What is the reason? 2. Can we modify the method StaticHostProvider.resolve() to avoid the reverse DNS lookup? -- This message was sent by Atlassian Jira (v8.20.10#820010)
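[Editorial note: a small demo of the suspected mechanism (the address literal is made up). On java.net.InetSocketAddress, getHostName() can trigger a reverse DNS lookup for an address built from a raw IP, while getHostString() never does; whether resolve() can switch to the latter is exactly the question the reporter raises.]
{code:java}
import java.net.InetSocketAddress;

public class ReverseDnsDemo {
    public static void main(String[] args) {
        InetSocketAddress addr = new InetSocketAddress("10.0.0.12", 2181);
        // No reverse lookup: returns the literal "10.0.0.12" immediately.
        String hostString = addr.getHostString();
        // May perform a reverse DNS lookup, where the ~5s stall can occur.
        String hostName = addr.getHostName();
        System.out.println(hostString + " / " + hostName);
    }
}
{code}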
[jira] [Created] (ZOOKEEPER-4810) Fix data race in format_endpoint_info()
fanyang created ZOOKEEPER-4810: -- Summary: Fix data race in format_endpoint_info() Key: ZOOKEEPER-4810 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4810 Project: ZooKeeper Issue Type: Bug Components: c client Reporter: fanyang -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ZOOKEEPER-4809) Fix do_completion() use-after-free when log level is debug
fanyang created ZOOKEEPER-4809: -- Summary: Fix do_completion() use-after-free when log level is debug Key: ZOOKEEPER-4809 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4809 Project: ZooKeeper Issue Type: Bug Components: c client Reporter: fanyang -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ZOOKEEPER-4808) Fix the log statement in FastLeaderElection
Li Wang created ZOOKEEPER-4808: -- Summary: Fix the log statement in FastLeaderElection Key: ZOOKEEPER-4808 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4808 Project: ZooKeeper Issue Type: Bug Components: server Reporter: Li Wang The proposedZxid and proposedEpoch are out of order in the following debug statement. {code:java} LOG.debug( "Sending Notification: {} (n.leader), 0x{} (n.peerEpoch), 0x{} (n.zxid), 0x{} (n.round), {} (recipient)," + " {} (myid) ", proposedLeader, Long.toHexString(proposedZxid), Long.toHexString(proposedEpoch), Long.toHexString(logicalclock.get()), sid, self.getMyId()); {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ZOOKEEPER-4807) Add sid for the leader goodbye log
Yan Zhao created ZOOKEEPER-4807: --- Summary: Add sid for the leader goodbye log Key: ZOOKEEPER-4807 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4807 Project: ZooKeeper Issue Type: Wish Components: server Affects Versions: 3.9.1 Reporter: Yan Zhao Fix For: 3.9.2 When a follower disconnects from the leader, the leader prints the remote address. But if ZooKeeper runs alongside Istio, the remote address is not the real peer address. 2024-02-05T03:23:54,967+ [LearnerHandler-/127.0.0.6:56085] WARN org.apache.zookeeper.server.quorum.LearnerHandler - *** GOODBYE /127.0.0.6:56085 It would be better to print the sid in the goodbye log. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ZOOKEEPER-4806) Commits have to be refreshed after merging
Andor Molnar created ZOOKEEPER-4806: --- Summary: Commits have to be refreshed after merging Key: ZOOKEEPER-4806 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4806 Project: ZooKeeper Issue Type: Sub-task Reporter: Andor Molnar Assignee: Szucs Villo The following error occurs if somebody wants to cherry-pick immediately after merge: {noformat} All checks have passed on the github. Pull request #2115 merged. Sha: #18c78cd10bc02d764a46ac1659b263cf69f2671d Would you like to pick 18c78cd10bc02d764a46ac1659b263cf69f2671d into another branch? (y/n): y Enter a branch name [branch-3.9]: git fetch apache From https://gitbox.apache.org/repos/asf/zookeeper 72e3d9ce9..e571dd814 master -> apache/master git checkout -b PR_TOOL_PICK_PR_2115_BRANCH-3.9 apache/branch-3.9 Switched to a new branch 'PR_TOOL_PICK_PR_2115_BRANCH-3.9' git cherry-pick -sx 18c78cd10bc02d764a46ac1659b263cf69f2671d fatal: bad object 18c78cd10bc02d764a46ac1659b263cf69f2671d Error cherry-picking: Command '['git', 'cherry-pick', '-sx', '18c78cd10bc02d764a46ac1659b263cf69f2671d']' returned non-zero exit status 128.{noformat} The reason for this is that the local git repo doesn't know about the new commit yet. We should do a {{git fetch}} after a successful merge via GitHub. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ZOOKEEPER-4805) Update cwiki page with latest changes
Andor Molnar created ZOOKEEPER-4805: --- Summary: Update cwiki page with latest changes Key: ZOOKEEPER-4805 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4805 Project: ZooKeeper Issue Type: Sub-task Components: documentation Reporter: Andor Molnar Assignee: Szucs Villo Update the following wiki page with the latest changes and instructions on how to use the script: [https://cwiki.apache.org/confluence/display/ZOOKEEPER/Merging+Github+Pull+Requests] -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ZOOKEEPER-4804) Use daemon threads for Netty client
Istvan Toth created ZOOKEEPER-4804: -- Summary: Use daemon threads for Netty client Key: ZOOKEEPER-4804 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4804 Project: ZooKeeper Issue Type: Bug Affects Versions: 3.8.3 Reporter: Istvan Toth When the Netty client is used, the Java process hangs on System.exit if there is an open Zookeeper connection. This is caused by the non-daemon threads created by Netty. Exiting without closing the connection is not good practice, but this hang does not happen with the NIO client, and I think ZK should behave the same regardless of the client implementation used. The Netty ThreadFactory implementation is configurable, so it shouldn't be too hard to make sure that daemon threads are created. -- This message was sent by Atlassian Jira (v8.20.10#820010)
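[Editorial note: a hedged sketch of the suggested direction, using Netty's existing DefaultThreadFactory(poolName, daemon) constructor; the wrapper class and pool name below are illustrative, not ZooKeeper's code.]
{code:java}
import io.netty.channel.EventLoopGroup;
import io.netty.channel.nio.NioEventLoopGroup;
import io.netty.util.concurrent.DefaultThreadFactory;

public final class DaemonNettyGroupSketch {
    // Event loop threads marked as daemons no longer keep the JVM alive,
    // so System.exit can complete even with an open connection.
    public static EventLoopGroup newDaemonGroup() {
        return new NioEventLoopGroup(0, new DefaultThreadFactory("zk-client-netty", true));
    }
}
{code}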
[jira] [Created] (ZOOKEEPER-4803) Flaky test: QuorumPeerMainTest.testLeaderOutOfView
Ling Mao created ZOOKEEPER-4803: --- Summary: Flaky test: QuorumPeerMainTest.testLeaderOutOfView Key: ZOOKEEPER-4803 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4803 Project: ZooKeeper Issue Type: Bug Components: tests Affects Versions: 3.10 Reporter: Ling Mao {code:java} [2024-02-08T02:01:19.039Z] [INFO] [2024-02-08T02:01:19.039Z] [INFO] Results: [2024-02-08T02:01:19.039Z] [INFO] [2024-02-08T02:01:19.039Z] [ERROR] Failures: [2024-02-08T02:01:19.039Z] [ERROR] QuorumPeerMainTest.testLeaderOutOfView:881 expected: but was: [2024-02-08T02:01:19.039Z] [INFO] [2024-02-08T02:01:19.039Z] [ERROR] Tests run: 3116, Failures: 1, Errors: 0, Skipped: 4 [2024-02-08T02:01:19.039Z] [INFO] [2024-02-08T02:01:19.039Z] [INFO] [2024-02-08T02:01:19.039Z] [INFO] Reactor Summary for Apache ZooKeeper 3.10.0-SNAPSHOT: {code} Link: https://ci-hadoop.apache.org/blue/organizations/jenkins/zookeeper-precommit-github-pr/detail/PR-2043/4/pipeline -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ZOOKEEPER-4802) Flaky test:RestoreQuorumTest.testRestoreAfterQuorumLost
Ling Mao created ZOOKEEPER-4802: --- Summary: Flaky test:RestoreQuorumTest.testRestoreAfterQuorumLost Key: ZOOKEEPER-4802 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4802 Project: ZooKeeper Issue Type: Bug Components: tests Affects Versions: 3.10 Reporter: Ling Mao {code:java} [INFO] Running org.apache.zookeeper.server.admin.RestoreQuorumTest 836[ERROR] Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 3.489 s <<< FAILURE! - in org.apache.zookeeper.server.admin.RestoreQuorumTest 837[ERROR] testRestoreAfterQuorumLost Time elapsed: 3.344 s <<< ERROR! 838java.net.ConnectException: Connection refused (Connection refused) 839 at java.base/java.net.PlainSocketImpl.socketConnect(Native Method) 840 at java.base/java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:412) 841 at java.base/java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:255) 842 at java.base/java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:237) 843 at java.base/java.net.Socket.connect(Socket.java:609) 844 at java.base/java.net.Socket.connect(Socket.java:558) 845 at java.base/java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:527) 846 at org.apache.zookeeper.server.admin.SnapshotAndRestoreCommandTest.takeSnapshotAndValidate(SnapshotAndRestoreCommandTest.java:413) 847 at org.apache.zookeeper.server.admin.RestoreQuorumTest.testRestoreAfterQuorumLost(RestoreQuorumTest.java:56) 848 849[INFO] Running org.apache.zookeeper.server.admin.CommandResponseTest {code} Link: https://github.com/apache/zookeeper/actions/runs/7812662872/job/21310154983 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ZOOKEEPER-4801) Add memory size limitation policy for ZKDatabase#committedLog
Yan Zhao created ZOOKEEPER-4801: --- Summary: Add memory size limitation policy for ZKDatabase#committedLog Key: ZOOKEEPER-4801 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4801 Project: ZooKeeper Issue Type: Improvement Components: server Affects Versions: 3.9.1 Reporter: Yan Zhao Fix For: 3.9.2 The ZKDatabase supports a commit log count limit to bound memory usage, which is not precise: some request payloads may be huge and cost lots of heap memory. Supporting a payload size limit would be better. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ZOOKEEPER-4800) Flaky test:ReconfigRollingRestartCompatibilityTest.testRollingRestartWithExtendedMembershipConfig
Ling Mao created ZOOKEEPER-4800: --- Summary: Flaky test:ReconfigRollingRestartCompatibilityTest.testRollingRestartWithExtendedMembershipConfig Key: ZOOKEEPER-4800 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4800 Project: ZooKeeper Issue Type: Bug Components: tests Affects Versions: 3.10 Reporter: Ling Mao Link: https://github.com/apache/zookeeper/actions/runs/7781886238/job/21217202024?pr=1932 {code:java} [ERROR] Failures: 1023[ERROR] ReconfigRollingRestartCompatibilityTest.testRollingRestartWithExtendedMembershipConfig:263 waiting for server 2 being up ==> expected: but was: 1024[INFO] 1025[ERROR] Tests run: 3114, Failures: 1, Errors: 0, Skipped: 4 1026[INFO] 1027[INFO] {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ZOOKEEPER-4799) Refactor ACL check in addWatch command
Damien Diederen created ZOOKEEPER-4799: -- Summary: Refactor ACL check in addWatch command Key: ZOOKEEPER-4799 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4799 Project: ZooKeeper Issue Type: Improvement Reporter: Damien Diederen Assignee: Damien Diederen -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ZOOKEEPER-4798) Secure prometheus support
Purshotam Shah created ZOOKEEPER-4798: - Summary: Secure prometheus support Key: ZOOKEEPER-4798 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4798 Project: ZooKeeper Issue Type: Improvement Reporter: Purshotam Shah -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ZOOKEEPER-4797) Allow for -XX:MaxRAMPercentage JVM setting
Frederiko Costa created ZOOKEEPER-4797: -- Summary: Allow for -XX:MaxRAMPercentage JVM setting Key: ZOOKEEPER-4797 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4797 Project: ZooKeeper Issue Type: Improvement Components: scripts Reporter: Frederiko Costa When running Zk in a containerized environment, it's sometimes desirable to express your heap size as a percentage of the memory allocated to the container. As it stands, zkEnv.sh forces you to have -Xmx, which defaults to 1GB. Some environments want to set it higher, relative to the amount of RAM. This is a request to implement the option of using the -XX:MaxRAMPercentage option when starting zookeeper. The suggested implementation is to make a variable ZK_SERVER_MAXRAMPERCENTAGE available to be added to SERVER_JVMFLAGS. If the variable is set, ZK_SERVER_HEAP is ignored; if ZK_SERVER_MAXRAMPERCENTAGE is not set, ZK_SERVER_HEAP is applied as usual. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ZOOKEEPER-4796) Requests submitted first may carry a larger xid resulting in ZRUNTIMEINCONSISTENCY
fanyang created ZOOKEEPER-4796: -- Summary: Requests submitted first may carry a larger xid resulting in ZRUNTIMEINCONSISTENCY Key: ZOOKEEPER-4796 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4796 Project: ZooKeeper Issue Type: Bug Components: c client Reporter: fanyang When multiple threads attempt to submit requests, it's possible for a request from a thread that acquired its xid earlier to be inserted into the submission queue after a request from a thread that acquired its xid later, which causes a ZRUNTIMEINCONSISTENCY error. To fix it, acquire the lock before get_xid() and release it after request submission. -- This message was sent by Atlassian Jira (v8.20.10#820010)
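[Editorial note: a hedged Java analogue of the described fix; the actual C client fix moves its mutex around get_xid() and the queue insertion, and all names below are illustrative.]
{code:java}
import java.util.ArrayDeque;
import java.util.Deque;

public class XidOrderedQueueSketch {
    private final Object lock = new Object();
    private int lastXid = 0;
    private final Deque<Integer> outgoing = new ArrayDeque<>();

    // Assigning the xid and enqueueing happen under one lock, so queue order
    // always matches xid order even with many submitting threads.
    void submit() {
        synchronized (lock) {      // acquire before the get_xid() equivalent
            int xid = ++lastXid;   // xid assignment...
            outgoing.addLast(xid); // ...and insertion cannot be interleaved
        }                          // release after submission
    }
}
{code}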
[jira] [Created] (ZOOKEEPER-4795) add namespace support for prometheus metrics
Purshotam Shah created ZOOKEEPER-4795: - Summary: add namespace support for prometheus metrics Key: ZOOKEEPER-4795 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4795 Project: ZooKeeper Issue Type: Improvement Components: metric system Reporter: Purshotam Shah -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ZOOKEEPER-4794) Reduce the ZKDatabase#committedLog memory usage
Yan Zhao created ZOOKEEPER-4794: --- Summary: Reduce the ZKDatabase#committedLog memory usage Key: ZOOKEEPER-4794 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4794 Project: ZooKeeper Issue Type: Improvement Components: server Affects Versions: 3.9.1 Reporter: Yan Zhao In ZKDatabase, after a quorum request is committed successfully, the ZKDatabase wraps the request into a proposal and stores it in the committedLog. The wrap operation serializes the request to a byte array and wraps the byte array in a QuorumPacket, so if the request payload size is 1M, the Proposal occupies 2M of memory, which increases memory pressure. The committedLog is used for fast follower synchronization, so we can serialize the request during synchronization; there is no need to serialize it in advance. This can cut the memory used by committedLog in half. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ZOOKEEPER-4793) The zookeeper server returns the wrong response for the ruok command.
Yan Zhao created ZOOKEEPER-4793: --- Summary: The zookeeper server returns the wrong response for the ruok command. Key: ZOOKEEPER-4793 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4793 Project: ZooKeeper Issue Type: Improvement Components: server Affects Versions: 3.9.1 Reporter: Yan Zhao Users often use the ruok command to probe the zookeeper server status. If the ruok command doesn't return a response, the automation tool will restart zookeeper. But if the quorum zookeeper server encounters some unexpected error, it changes its state to State.ERROR, which means the server can't serve requests anymore; yet ruok still returns `imok`. In this case, it should return `This ZooKeeper instance is not currently serving requests`, like other commands (e.g. WatchCommand). -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ZOOKEEPER-4792) Tune the env log at the start of the process
Yan Zhao created ZOOKEEPER-4792: --- Summary: Tune the env log at the start of the process Key: ZOOKEEPER-4792 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4792 Project: ZooKeeper Issue Type: Improvement Components: server Affects Versions: 3.9.1 Reporter: Yan Zhao At the start of the process, it prints the env info to the log. There are three log entries for the memory metrics. {code:java} // Get memory information. Runtime runtime = Runtime.getRuntime(); int mb = 1024 * 1024; put(l, "os.memory.free", runtime.freeMemory() / mb + "MB"); put(l, "os.memory.max", runtime.maxMemory() / mb + "MB"); put(l, "os.memory.total", runtime.totalMemory() / mb + "MB"); {code} https://github.com/apache/zookeeper/blob/9e40464d98319b4553d93b12c6d7db4d240bbce9/zookeeper-server/src/main/java/org/apache/zookeeper/Environment.java#L88-L90 This is misleading for the user; using jvm as the prefix would be better. Change to: {code:java} // Get memory information. Runtime runtime = Runtime.getRuntime(); int mb = 1024 * 1024; put(l, "jvm.memory.free", runtime.freeMemory() / mb + "MB"); put(l, "jvm.memory.max", runtime.maxMemory() / mb + "MB"); put(l, "jvm.memory.total", runtime.totalMemory() / mb + "MB"); {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ZOOKEEPER-4791) Improve logging when the connection to a remote server is closed
Sönke Liebau created ZOOKEEPER-4791: --- Summary: Improve logging when the connection to a remote server is closed Key: ZOOKEEPER-4791 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4791 Project: ZooKeeper Issue Type: New Feature Components: server Affects Versions: 3.9.1 Reporter: Sönke Liebau When a server closes the connection to a remote server, the logging around why the connection is being closed could be improved a bit. https://github.com/apache/zookeeper/blob/1cc1eb6a2be7323a5c326652d59a070473bb8779/zookeeper-server/src/main/java/org/apache/zookeeper/server/NettyServerCnxn.java#L524 {code:java} ZooKeeperServer zks = this.zkServer; if (zks == null || !zks.isRunning()) { LOG.info("Closing connection to {} because the server is not ready", getRemoteSocketAddress()); close(DisconnectReason.IO_EXCEPTION); return; } {code} It would be helpful to log what zkServer is, because it can have multiple states that would trigger this shutdown. -- This message was sent by Atlassian Jira (v8.20.10#820010)
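[Editorial note: one hypothetical shape of the suggested improvement, assuming ZooKeeperServer exposes its state via getState(); this is a variant of the quoted block above, not a committed change.]
{code:java}
ZooKeeperServer zks = this.zkServer;
if (zks == null || !zks.isRunning()) {
    // Include the server reference/state so the log shows WHY it isn't ready.
    LOG.info("Closing connection to {} because the server is not ready (zkServer={})",
        getRemoteSocketAddress(), zks == null ? "null" : zks.getState());
    close(DisconnectReason.IO_EXCEPTION);
    return;
}
{code}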
[jira] [Created] (ZOOKEEPER-4790) TLS Quorum hostname verification breaks in some scenarios
Sönke Liebau created ZOOKEEPER-4790: --- Summary: TLS Quorum hostname verification breaks in some scenarios Key: ZOOKEEPER-4790 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4790 Project: ZooKeeper Issue Type: Improvement Affects Versions: 3.9.1 Reporter: Sönke Liebau Currently, enabling Quorum TLS makes the server validate the SANs of client certificates of connecting quorum peers against their reverse DNS address. We have seen this cause issues when running in Kubernetes, due to IP addresses resolving to multiple DNS names when ZooKeeper pods participate in multiple services. Since `InetAddress.getHostName()` returns a single String, it basically becomes a game of chance which DNS name is checked against the cert. This has caused issues in the Strimzi operator as well (see [this issue|https://github.com/strimzi/strimzi-kafka-operator/issues/3099]) - they solved this by pretty much adding anything they can find that might be relevant to the SAN, and a few wildcards on top of that. This is both error-prone and doesn't really add any relevant extra security, since "this certificate matches the connecting peer" shouldn't automatically mean "this peer should be allowed to connect". There are two (probably more) ways to fix this: # Retrieve _all_ reverse entries and check against all of them # The ZK server could verify the SAN against the list of servers ({{server.N}} in the config). A peer should be able to connect on the quorum port if and only if at least one SAN matches at least one of the listed servers. I'd argue that the second option is the better one, especially since the Java API doesn't even seem to have an option for retrieving all DNS entries, but also because it better matches the expressed intent of the ZK admin. Additionally, it would be nice to have a "disable client hostname verification" option that still leaves server hostname verification enabled. Strictly speaking this is a separate issue though; I'd be happy to spin that out into a ticket of its own. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ZOOKEEPER-4789) Avoid static blocks in QuorumAuth Tests
Muthuraj Ramalingakumar created ZOOKEEPER-4789: -- Summary: Avoid static blocks in QuorumAuth Tests Key: ZOOKEEPER-4789 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4789 Project: ZooKeeper Issue Type: Test Reporter: Muthuraj Ramalingakumar In the QuorumAuth tests, test setup code is written in static code blocks {}; use @BeforeAll instead, if possible. -- This message was sent by Atlassian Jira (v8.20.10#820010)
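[Editorial note: a minimal sketch of the suggested JUnit 5 pattern; the class name and setup body are illustrative.]
{code:java}
import org.junit.jupiter.api.BeforeAll;

public class QuorumAuthTestExample {
    @BeforeAll
    public static void setUpBeforeClass() throws Exception {
        // One-time setup formerly done in a static { ... } initializer,
        // e.g. writing the JAAS configuration for the test class. Unlike a
        // static block, failures here are reported as test lifecycle errors.
    }
}
{code}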
[jira] [Created] (ZOOKEEPER-4788) high maxlatency
yangoofy created ZOOKEEPER-4788: --- Summary: high maxlatency Key: ZOOKEEPER-4788 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4788 Project: ZooKeeper Issue Type: Bug Affects Versions: 3.7.1 Reporter: yangoofy We use Zookeeper as the registration center for microservices. Zookeeper uses the default configuration. The snapshot is about 600 MB, with 700,000 ephemeral nodes. A single Zookeeper server has 56 cores and 128 GB of RAM, with -Xmx16g. The number of client connections is 8k, and the number of watches is 4 million. When clients start and stop en masse, the maxlatency of the server is as high as 10 seconds. How can we optimize it? -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ZOOKEEPER-4787) Failed to establish connection between zookeeper
softrock created ZOOKEEPER-4787: --- Summary: Failed to establish connection between zookeeper Key: ZOOKEEPER-4787 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4787 Project: ZooKeeper Issue Type: Bug Components: server Affects Versions: 3.8.3 Environment: z/OS Reporter: softrock *Problem:* When running ZooKeeper 3.8.3 on the z/OS platform, the servers cannot establish connections to each other. Error: [2024-01-17 23:06:44,194] INFO Received connection request from /xx.xx.xx.xx:23840 (org.apache.zookeeper.server.quorum.QuorumCnxManager) [2024-01-17 23:06:44,197] ERROR Initial message parsing error! (org.apache.zookeeper.server.quorum.QuorumCnxManager) org.apache.zookeeper.server.quorum.QuorumCnxManager$InitialMessage$InitialMessageException: Badly formed address: K???K???K???z at org.apache.zookeeper.server.quorum.QuorumCnxManager$InitialMessage.parse(QuorumCnxManager.java:271) at org.apache.zookeeper.server.quorum.QuorumCnxManager.handleConnection(QuorumCnxManager.java:607) at org.apache.zookeeper.server.quorum.QuorumCnxManager.receiveConnection(QuorumCnxManager.java:555) at org.apache.zookeeper.server.quorum.QuorumCnxManager$Listener$ListenerHandler.acceptConnections(QuorumCnxManager.java:1085) at org.apache.zookeeper.server.quorum.QuorumCnxManager$Listener$ListenerHandler.run(QuorumCnxManager.java:1039) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:522) at java.util.concurrent.FutureTask.run(FutureTask.java:277) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1160) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) at java.lang.Thread.run(Thread.java:825) *Root cause:* The receiver cannot resolve the address from the sender requesting a connection. This is because the sender sends the address in UTF-8 encoding, but the receiver parses the address in IBM-1047 encoding (the platform default). *Resolution:* Both receiver and sender sides should use UTF-8 encoding. -- This message was sent by Atlassian Jira (v8.20.10#820010)
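[Editorial note: a small demo of the root cause (the hostname is made up). The same bytes decode correctly only when the charset is pinned to UTF-8 instead of left to the platform default, which is IBM-1047 on z/OS.]
{code:java}
import java.nio.charset.StandardCharsets;

public class CharsetMismatchDemo {
    public static void main(String[] args) {
        // Sender writes the address bytes as UTF-8.
        byte[] wire = "host1.example.com:3888".getBytes(StandardCharsets.UTF_8);
        // Decoding with the platform default charset garbles it on z/OS
        // (producing strings like "K???K???K???z" from the bug report).
        String platformDefault = new String(wire);
        // Pinning UTF-8 on the receiving side decodes correctly everywhere.
        String pinned = new String(wire, StandardCharsets.UTF_8);
        System.out.println(platformDefault + " vs " + pinned);
    }
}
{code}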
[jira] [Created] (ZOOKEEPER-4786) Bug in file "zookeeper/zookeeper-server/src/main/java/org/apache/zookeeper/server/util/OSMXBean.java"
Abdelaziz Assem created ZOOKEEPER-4786: -- Summary: Bug in file "zookeeper/zookeeper-server/src/main/java/org/apache/zookeeper/server/util/OSMXBean.java" Key: ZOOKEEPER-4786 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4786 Project: ZooKeeper Issue Type: Bug Components: other Affects Versions: 3.6.3 Environment: AIX Servers. Reporter: Abdelaziz Assem Attachments: Exception Error.PNG, Failed to Start.PNG The function {*}getOpenFileDescriptorCount{*} in that file (line 98) only supports IBM-vendor JVMs, and its implementation is written for Linux servers only. I am using it on AIX servers, and when I tried to start zookeeper it said "Failed to Start" because of a "java.lang.NumberFormatException" thrown at line 121 of {*}OSMXBean.java{*}. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ZOOKEEPER-4785) Txn loss due to race condition when follower DIFF sync with leader
Li Wang created ZOOKEEPER-4785: -- Summary: Txn loss due to race condition when follower DIFF sync with leader Key: ZOOKEEPER-4785 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4785 Project: ZooKeeper Issue Type: Bug Components: server Affects Versions: 3.9.1, 3.8.2, 3.7.2, 3.8.1, 3.7.1, 3.8.0 Reporter: Li Wang We had a txn loss incident in production recently. The root cause is that the follower writes the current epoch and sends the ACK_LD before successfully persisting all the txns from the DIFF sync, as persisting txns is handled asynchronously via the SyncRequestProcessor. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ZOOKEEPER-4784) Token based ASF JIRA authentication
Szucs Villo created ZOOKEEPER-4784: -- Summary: Token based ASF JIRA authentication Key: ZOOKEEPER-4784 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4784 Project: ZooKeeper Issue Type: Sub-task Components: tools Reporter: Szucs Villo Assignee: Szucs Villo https://issues.apache.org/jira/browse/SPARK-44802 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ZOOKEEPER-4783) leader crash because of zxid 32b rollover but no other server takes the lead
Stéphane Loeuillet created ZOOKEEPER-4783: - Summary: leader crash because of zxid 32b rollover but no other server takes the lead Key: ZOOKEEPER-4783 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4783 Project: ZooKeeper Issue Type: Bug Affects Versions: 3.8.3 Environment: Linux amd64 Ubuntu 20.04.5 Java OpenJDK17U-jre_x64_linux_hotspot_17.0.8.1_1.tar.gz Reporter: Stéphane Loeuillet Attachments: zookeeper_crash.log Got a 5-node cluster running on baremetal servers (with NVMe), used by a ClickHouse cluster running on separate machines. This morning, a crash on the leader left my cluster unusable: while the leader crashed, none of the 4 followers took over the lead. The zookeeper leader was zookeeper08; 05/06/07/09 were the followers. Only a restart of the zookeeper05 process unfroze the whole cluster. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ZOOKEEPER-4782) The installation guide in zookeeper-client/zookeeper-client-c is outdated
yangzhenxing created ZOOKEEPER-4782: --- Summary: The installation guide in zookeeper-client/zookeeper-client-c is outdated Key: ZOOKEEPER-4782 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4782 Project: ZooKeeper Issue Type: Bug Components: c client Affects Versions: 3.9.1 Reporter: yangzhenxing The installation guide in zookeeper-client/zookeeper-client-c is outdated; it should point out that the currently recommended way to build is described in zookeeper-docs/src/main/resources/markdown/zookeeperProgrammers.md -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ZOOKEEPER-4781) ZooKeeper not starting because the accepted epoch is less than the current epoch.
zhanglu153 created ZOOKEEPER-4781: - Summary: ZooKeeper not starting because the accepted epoch is less than the current epoch. Key: ZOOKEEPER-4781 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4781 Project: ZooKeeper Issue Type: Bug Components: server Affects Versions: 3.9.1, 3.8.3, 3.7.2, 3.6.4, 3.4.14, 3.5.10 Reporter: zhanglu153 This issue occurred in our abnormal testing environment, where the disk was injected with anomalies and frequently filled up. The the scenario is as follows: # Configure three node ZooKeeper cluster, lets say nodes are A, B and C. # Start the cluster, and node C becomes the leader. # The disk of Node C is injected with a full disk exception. # Node C called the org.apache.zookeeper.server.quorum.Leader#lead method. {code:java} void lead() throws IOException, InterruptedException { self.end_fle = Time.currentElapsedTime(); long electionTimeTaken = self.end_fle - self.start_fle; self.setElectionTimeTaken(electionTimeTaken); LOG.info("LEADING - LEADER ELECTION TOOK - {}", electionTimeTaken); self.start_fle = 0; self.end_fle = 0; zk.registerJMX(new LeaderBean(this, zk), self.jmxLocalPeerBean); try { self.tick.set(0); zk.loadData(); leaderStateSummary = new StateSummary(self.getCurrentEpoch(), zk.getLastProcessedZxid()); // Start thread that waits for connection requests from // new followers. cnxAcceptor = new LearnerCnxAcceptor(); cnxAcceptor.start(); readyToStart = true; long epoch = getEpochToPropose(self.getId(), self.getAcceptedEpoch()); zk.setZxid(ZxidUtils.makeZxid(epoch, 0)); synchronized(this){ lastProposed = zk.getZxid(); } newLeaderProposal.packet = new QuorumPacket(NEWLEADER, zk.getZxid(), null, null); if ((newLeaderProposal.packet.getZxid() & 0xL) != 0) { LOG.info("NEWLEADER proposal has Zxid of " + Long.toHexString(newLeaderProposal.packet.getZxid())); } waitForEpochAck(self.getId(), leaderStateSummary); self.setCurrentEpoch(epoch); ... {code} # Node C, as the leader, will start the LearnerCnxAcceptor thread, and then call the org.apache.zookeeper.server.quorum.Leader#getEpochToPropose method. At this time, the value of waitingForNewEpoch is true, and the size of connectingFollowers is not greater than n/2. Node C directly calls connectingFollowers.wait to wait. The maximum waiting time is self.getInitLimit()*self.getTickTime() ms. {code:java} public long getEpochToPropose(long sid, long lastAcceptedEpoch) throws InterruptedException, IOException { synchronized(connectingFollowers) { if (!waitingForNewEpoch) { return epoch; } if (lastAcceptedEpoch >= epoch) { epoch = lastAcceptedEpoch+1; } if (isParticipant(sid)) { connectingFollowers.add(sid); } QuorumVerifier verifier = self.getQuorumVerifier(); if (connectingFollowers.contains(self.getId()) && verifier.containsQuorum(connectingFollowers)) { self.setAcceptedEpoch(epoch); waitingForNewEpoch = false; connectingFollowers.notifyAll(); } else { long start = Time.currentElapsedTime(); long cur = start; long end = start + self.getInitLimit()*self.getTickTime(); while(waitingForNewEpoch && cur < end) { connectingFollowers.wait(end - cur); cur = Time.currentElapsedTime(); } if (waitingForNewEpoch) { throw new InterruptedException("Timeout while waiting for epoch from quorum"); } } return epoch; } } {code} # Node B connects to the 2888 communication port of node C and starts a new LeanerHandler thread. # Node A connects to the 2888 communication port of node C and starts a new LeanerHandler thread. 
# After node B connects to node C, the LearnerHandler thread calls the org.apache.zookeeper.server.quorum.Leader#getEpochToPropose method. At this point the value of waitingForNewEpoch is still true, and the size of connectingFollowers is now greater than n/2, so the leader calls self.setAcceptedEpoch(epoch). Because the disk of node C is full, writing the acceptedEpoch file fails with an IO exception; as a result, waitingForNewEpoch is never set to false and connectingFollowers.notifyAll() is never called. This causes node C to keep waiting at connectingFollowers.wait, up to a maximum of self.getInitLimit()*self.getTickTime() ms.
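A minimal, self-contained Java illustration of the hazard described above (this is not ZooKeeper code; the class and method names are invented for the sketch): an exception thrown before notifyAll() leaves every waiter blocked until its timed wait expires.
{code:java}
import java.io.IOException;

public class NotifySkippedDemo {

    private final Object connectingFollowers = new Object();
    private boolean waitingForNewEpoch = true;

    // stands in for QuorumPeer.setAcceptedEpoch(), assumed to fail on a full disk
    private void writeAcceptedEpoch() throws IOException {
        throw new IOException("No space left on device");
    }

    // mirrors the quorum branch of getEpochToPropose()
    public void quorumReached() {
        synchronized (connectingFollowers) {
            try {
                writeAcceptedEpoch();              // throws before the next two lines
                waitingForNewEpoch = false;        // never reached
                connectingFollowers.notifyAll();   // never reached
            } catch (IOException e) {
                // every waiter below now blocks until its timeout expires
            }
        }
    }

    // mirrors the waiting branch: wakes up only on notifyAll() or timeout
    public void awaitEpoch(long timeoutMs) throws InterruptedException {
        synchronized (connectingFollowers) {
            long end = System.currentTimeMillis() + timeoutMs;
            long cur = System.currentTimeMillis();
            while (waitingForNewEpoch && cur < end) {
                connectingFollowers.wait(end - cur);
                cur = System.currentTimeMillis();
            }
        }
    }
}
{code}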
[jira] [Created] (ZOOKEEPER-4780) Avoid creating temporary files in source directory.
Muthuraj Ramalingakumar created ZOOKEEPER-4780: -- Summary: Avoid creating temporary files in source directory. Key: ZOOKEEPER-4780 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4780 Project: ZooKeeper Issue Type: Test Reporter: Muthuraj Ramalingakumar The zookeeper-server module has several unit tests that create temporary folders and files in the test directory and carry logic to delete those files after the test run. We can use the JUnit @TempDir annotation to handle temp file/dir creation, so we don't have to manage that in the source code. -- This message was sent by Atlassian Jira (v8.20.10#820010)
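For illustration, a minimal JUnit 5 sketch of the @TempDir approach (the test class and file names are hypothetical, not taken from the ZooKeeper test suite):
{code:java}
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

import org.junit.jupiter.api.Test;
import org.junit.jupiter.api.io.TempDir;

class TempDirExampleTest {

    @TempDir
    Path tempDir; // created fresh for each test and deleted recursively afterwards

    @Test
    void writesIntoManagedTempDir() throws IOException {
        Path dataFile = tempDir.resolve("snapshot.0");
        Files.write(dataFile, new byte[] {1, 2, 3});
        // no tearDown needed: JUnit owns the directory lifecycle
    }
}
{code}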
[jira] [Created] (ZOOKEEPER-4779) ZKUtilTest fails to run on WSL
Muthuraj Ramalingakumar created ZOOKEEPER-4779: -- Summary: ZKUtilTest fails to run on WSL Key: ZOOKEEPER-4779 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4779 Project: ZooKeeper Issue Type: Task Components: tests Reporter: Muthuraj Ramalingakumar The `ZKUtilTest#testUnreadableFileInput` test fails when running on WSL (Windows Subsystem for Linux). Skip that test when running in WSL. -- This message was sent by Atlassian Jira (v8.20.10#820010)
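One common way to express such a skip, sketched here with JUnit 5 assumptions (the WSL detection heuristic via the kernel release in os.version is an assumption, not the actual patch):
{code:java}
import static org.junit.jupiter.api.Assumptions.assumeFalse;

import org.junit.jupiter.api.Test;

class WslSkipExampleTest {

    // heuristic: WSL kernels report "microsoft" in the kernel release,
    // which the JVM exposes on Linux as the os.version property
    private static boolean runningOnWsl() {
        return System.getProperty("os.version", "").toLowerCase().contains("microsoft");
    }

    @Test
    void testUnreadableFileInput() {
        assumeFalse(runningOnWsl(), "skipped on WSL: unreadable-file semantics differ");
        // ... original test body would follow here ...
    }
}
{code}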
[jira] [Created] (ZOOKEEPER-4778) Patch jetty, netty, and logback to remove high severity vulnerabilities
Ivgeni created ZOOKEEPER-4778: - Summary: Patch jetty, netty, and logback to remove high severity vulnerabilities Key: ZOOKEEPER-4778 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4778 Project: ZooKeeper Issue Type: Improvement Components: security Reporter: Ivgeni logback-core & logback-classic: [https://nvd.nist.gov/vuln/detail/CVE-2023-6378] netty-codec: [https://nvd.nist.gov/vuln/detail/CVE-2023-44487] jetty-io: [https://nvd.nist.gov/vuln/detail/CVE-2023-36478] -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ZOOKEEPER-4777) Zookeeper becomes unresponsive when using native GSSAPI
Rickey Visinski created ZOOKEEPER-4777: -- Summary: Zookeeper becomes unresponsive when using native GSSAPI Key: ZOOKEEPER-4777 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4777 Project: ZooKeeper Issue Type: Bug Components: kerberos, server Affects Versions: 3.8.3, 3.8.2, 3.7.2, 3.6.4, 3.7.1, 3.6.2, 3.5.7, 3.5.6, 3.4.14, 3.4.13 Environment: RHEL 7 and OpenJDK Runtime Environment (build 1.8.0_392-b08) RHEL 8 and OpenJDK Runtime Environment (Red_Hat-17.0.9.0.9-1) (build 17.0.9+9-LTS) Reporter: Rickey Visinski The ZooKeeper ensemble starts up properly after quorum is formed; the leader is elected and starts serving requests. After a while the leader gets stuck: it accepts requests but does not process them, and the same is the case with the participants. They accept requests, but since the leader does not process them, they keep piling up. This causes a sudden increase in the number of CLOSE_WAIT connections on the ZooKeeper servers. When this happens, the ensemble is completely unresponsive, causing connection loss/timeouts. Once the CLOSE_WAIT connections start, the number of open connections on each server spikes as high as 10 from a mere 200 connections within a few minutes. A pattern was found in the thread dumps, where we always saw the {{NIOServerCxnFactory}} selector thread blocked on a lock in {{org.apache.zookeeper.server.ZooKeeperSaslServer.createSaslServer}}:
{code:java}
"NIOServerCxnFactory.SelectorThread-0" #16 daemon prio=5 os_prio=0 cpu=9126323.70ms elapsed=25935.16s tid=0x7f9118702320 nid=0x20ed94 waiting for monitor entry [0x7f907e635000]
   java.lang.Thread.State: BLOCKED (on object monitor)
	at org.apache.zookeeper.server.ZooKeeperSaslServer.createSaslServer(ZooKeeperSaslServer.java:42)
	- waiting to lock <0x000700391098> (a org.apache.zookeeper.Login)
	at org.apache.zookeeper.server.ZooKeeperSaslServer.<init>(ZooKeeperSaslServer.java:38)
{code}
Seems to be related to https://issues.apache.org/jira/browse/ZOOKEEPER-2230 Thanks -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ZOOKEEPER-4776) CVE-2023-36478 | org.eclipse.jetty_jetty-io
Aayush Suri created ZOOKEEPER-4776: -- Summary: CVE-2023-36478 | org.eclipse.jetty_jetty-io Key: ZOOKEEPER-4776 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4776 Project: ZooKeeper Issue Type: Bug Affects Versions: 3.9.1 Reporter: Aayush Suri {*}Vulnerability summary{*}: Eclipse Jetty provides a web server and servlet container. In versions 11.0.0 through 11.0.15, 10.0.0 through 10.0.15, and 9.0.0 through 9.4.52, an integer overflow in `MetaDataBuilder.checkSize` allows for HTTP/2 HPACK header values to exceed their size limit. `MetaDataBuilder.java` determines if a header name or value exceeds the size limit, and throws an exception if the limit is exceeded. However, when length is very large and huffman is true, the multiplication by 4 in line 295 will overflow, and length will become negative. `(_size+length)` will now be negative, and the check on line 296 will not be triggered. Furthermore, `MetaDataBuilder.checkSize` allows for user-entered HPACK header value sizes to be negative, potentially leading to a very large buffer allocation later on when the user-entered size is multiplied by 2. This means that if a user provides a negative length value (or, more precisely, a length value which, when multiplied by the 4/3 fudge factor, is negative), and this length value is a very large positive number when multiplied by 2, then the user can cause a very large buffer to be allocated on the server. Users of HTTP/2 can be impacted by a remote denial of service attack. The issue has been fixed in versions 11.0.16, 10.0.16, and 9.4.53. There are no known workarounds. Looking for a version that fixes this vulnerability. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ZOOKEEPER-4775) Add a version of check_zookeeper that works with Python 3
Enrico Olivelli created ZOOKEEPER-4775: -- Summary: Add a version of check_zookeeper that works with Python 3 Key: ZOOKEEPER-4775 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4775 Project: ZooKeeper Issue Type: Improvement Reporter: Enrico Olivelli -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ZOOKEEPER-4774) zookeeper.ssl.context.supplier.class is not working in 3.9
Jerry Chung created ZOOKEEPER-4774: -- Summary: zookeeper.ssl.context.supplier.class is not working in 3.9 Key: ZOOKEEPER-4774 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4774 Project: ZooKeeper Issue Type: Bug Components: java client Affects Versions: 3.9.1 Reporter: Jerry Chung hi, {{zookeeper.ssl.context.supplier.class}} worked from 3.6 through 3.8, but it no longer works in 3.9. Is this a permanent decision? The documentation still mentions it. Thanks. jerry -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ZOOKEEPER-4773) Ephemeral node is not deleted when all followers are blocked with leader
May created ZOOKEEPER-4773: -- Summary: Ephemeral node is not deleted when all followers are blocked with leader Key: ZOOKEEPER-4773 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4773 Project: ZooKeeper Issue Type: Bug Components: quorum, server Affects Versions: 3.9.1, 3.8.3 Reporter: May The test case EphemeralNodeDeletionTest covers the scenario where a follower loses its connection with the leader while the client writes an ephemeral node; the node should be deleted after the client closes its session. However, the test fails when I make all followers lose their connections. To reproduce the bug, I simply modified testEphemeralNodeDeletion() as follows:
{code:java}
// 2: inject network problem in two followers
ArrayList<CustomQuorumPeer> followers = getFollowers();
for (CustomQuorumPeer follower : followers) {
    follower.setInjectError(true);
}
//CustomQuorumPeer follower = (CustomQuorumPeer) getByServerState(mt, ServerState.FOLLOWING);
//follower.setInjectError(true);

// 3: close the session so that ephemeral node is deleted
zk.close();

// remove the error
//follower.setInjectError(false);
for (CustomQuorumPeer follower : followers) {
    follower.setInjectError(false);
    assertTrue(ClientBase.waitForServerUp("127.0.0.1:" + follower.getClientPort(), CONNECTION_TIMEOUT),
        "Faulted Follower should have joined quorum by now");
}
{code}
And here is the added method getFollowers():
{code:java}
private ArrayList<CustomQuorumPeer> getFollowers() {
    ArrayList<CustomQuorumPeer> followers = new ArrayList<>();
    for (int i = 0; i <= mt.length - 1; i++) {
        QuorumPeer quorumPeer = mt[i].getQuorumPeer();
        if (null != quorumPeer && ServerState.FOLLOWING == quorumPeer.getPeerState()) {
            followers.add((CustomQuorumPeer) quorumPeer);
        }
    }
    return followers;
}
{code}
-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ZOOKEEPER-4772) Wrong sync logic in LearnerHandler when sync (0,0) to a new epoch follower
May created ZOOKEEPER-4772: -- Summary: Wrong sync logic in LearnerHandler when sync (0,0) to a new epoch follower Key: ZOOKEEPER-4772 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4772 Project: ZooKeeper Issue Type: Bug Components: server Affects Versions: 3.9.1, 3.8.3, 3.7.2 Reporter: May The current LearnerHandler#syncFollower does not consider the situation where the proposal (0,0) has been committed and snapshotted: it will not use a snap to sync when minCommittedLog is 0. The bug can be reproduced by modifying testNewEpochZxid in LearnerHandlerTest:
{code:java}
public void testNewEpochZxid() throws Exception {
    long peerZxid;
    db.txnLog.add(createProposal(getZxid(0, 0))); // Added
    db.txnLog.add(createProposal(getZxid(0, 1)));
    db.txnLog.add(createProposal(getZxid(1, 1)));
    db.txnLog.add(createProposal(getZxid(1, 2)));

    // After leader election, lastProcessedZxid will point to new epoch
    db.lastProcessedZxid = getZxid(2, 0);
    db.committedLog.add(createProposal(getZxid(0, 0))); // Added
    db.committedLog.add(createProposal(getZxid(1, 1)));
    db.committedLog.add(createProposal(getZxid(1, 2)));

    // Peer has zxid of epoch 0
    peerZxid = getZxid(0, 0);
    // We should get snap, we can do better here, but the main logic is
    // that we should never send diff if we have never seen any txn older
    // than peer zxid
    assertTrue(learnerHandler.syncFollower(peerZxid, leader)); // Fails here
{code}
-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ZOOKEEPER-4771) Fast leader election taking too long
Ivo Vrdoljak created ZOOKEEPER-4771: --- Summary: Fast leader election taking too long Key: ZOOKEEPER-4771 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4771 Project: ZooKeeper Issue Type: Bug Components: leaderElection Affects Versions: 3.4.10 Reporter: Ivo Vrdoljak Attachments: zookeeper10.log, zookeeper11.log, zookeeper12.log, zookeeper20.log, zookeeper21.log Hello ZooKeeper Community, Background: We are using ZooKeeper version 3.4.10 in our system, with 5 ZooKeeper servers distributed across 2 clusters of machines. In the first cluster we have 3 ZooKeeper servers, each deployed on its own machine, and in the second cluster we have 2 ZooKeeper servers, also each on its own machine. ZooKeeper servers in the same cluster communicate through the local network, and with the servers in the remote cluster through an external network. The setup is the following:
{code:java}
Cluster 1
Zookeeper server 10
Zookeeper server 11
Zookeeper server 12 -> Leader

Cluster 2
Zookeeper server 20
Zookeeper server 21
{code}
Problem: We have an issue with Fast Leader Election when we kill the ZooKeeper leader process. After the leader (server 12) is killed and leader election starts, we can see in the ZooKeeper logs that voting notifications are exchanged from each ZooKeeper server that remained alive towards all the others. Notifications between ZooKeeper servers located in the same cluster (communicating over the local network) are successfully exchanged. The problem seems to be with servers sending votes over the external network: according to the logs, those are only sent in one direction. Logs from zookeeper server 10:
{code:java}
Nov 22 10:31:13 sc_2_1 BC[myid: 10]: INFO LOOKING
Nov 22 10:31:13 sc_2_1 BC[myid: 10]: DEBUG Initializing leader election protocol...
Nov 22 10:31:13 sc_2_1 BC[myid: 10]: DEBUG Updating proposal: 10 (newleader), 0xe9c97 (newzxid), 12 (oldleader), 0xd1380 (oldzxid)
Nov 22 10:31:13 sc_2_1 BC[myid: 10]: INFO New election. My id = 10, proposed zxid=0xe9c97
Nov 22 10:31:13 sc_2_1 BC[myid: 10]: DEBUG Sending Notification: 10 (n.leader), 0xe9c97 (n.zxid), 0x11 (n.round), 20 (recipient), 10 (myid), 0xe (n.peerEpoch)
Nov 22 10:31:13 sc_2_1 BC[myid: 10]: DEBUG Sending Notification: 10 (n.leader), 0xe9c97 (n.zxid), 0x11 (n.round), 21 (recipient), 10 (myid), 0xe (n.peerEpoch)
Nov 22 10:31:13 sc_2_1 BC[myid: 10]: DEBUG Sending Notification: 10 (n.leader), 0xe9c97 (n.zxid), 0x11 (n.round), 10 (recipient), 10 (myid), 0xe (n.peerEpoch)
Nov 22 10:31:13 sc_2_1 BC[myid: 10]: DEBUG Sending Notification: 10 (n.leader), 0xe9c97 (n.zxid), 0x11 (n.round), 11 (recipient), 10 (myid), 0xe (n.peerEpoch)
Nov 22 10:31:13 sc_2_1 BC[myid: 10]: DEBUG Sending Notification: 10 (n.leader), 0xe9c97 (n.zxid), 0x11 (n.round), 12 (recipient), 10 (myid), 0xe (n.peerEpoch)
Nov 22 10:31:13 sc_2_1 BC[myid: 10]: DEBUG Adding vote: from=10, proposed leader=10, proposed zxid=0xe9c97, proposed election epoch=0x11
Nov 22 10:31:14 sc_2_1 BC[myid: 10]: DEBUG Adding vote: from=11, proposed leader=11, proposed zxid=0xe9c97, proposed election epoch=0x11
Nov 22 10:31:14 sc_2_1 BC[myid: 10]: DEBUG Adding vote: from=10, proposed leader=10, proposed zxid=0xe9c97, proposed election epoch=0x11
Nov 22 10:31:14 sc_2_1 BC[myid: 10]: DEBUG Adding vote: from=10, proposed leader=11, proposed zxid=0xe9c97, proposed election epoch=0x11
Nov 22 10:31:14 sc_2_1 BC[myid: 10]: DEBUG Adding vote: from=10, proposed leader=11, proposed zxid=0xe9c97, proposed election epoch=0x11
Nov 22 10:31:14 sc_2_1 BC[myid: 10]: DEBUG Adding vote: from=11, proposed leader=11, proposed zxid=0xe9c97, proposed election epoch=0x11
Nov 22 10:31:15 sc_2_1 BC[myid: 10]: DEBUG Adding vote: from=10, proposed leader=11, proposed zxid=0xe9c97, proposed election epoch=0x11
Nov 22 10:31:15 sc_2_1 BC[myid: 10]: DEBUG Adding vote: from=11, proposed leader=11, proposed zxid=0xe9c97, proposed election epoch=0x11{code}
Logs from zookeeper server 20:
{code:java}
Nov 22 10:31:13 sc_2_1 BC[myid: 20]: INFO LOOKING
Nov 22 10:31:13 sc_2_1 BC[myid: 20]: DEBUG Initializing leader election protocol...
Nov 22 10:31:13 sc_2_1 BC[myid: 20]: DEBUG Sending Notification: 20 (n.leader), 0xe9c97 (n.zxid), 0x11 (n.round), 20 (recipient), 20 (myid), 0xe (n.peerEpoch)
Nov 22 10:31:13 sc_2_1 BC[myid: 20]: DEBUG Sending Notification: 20 (n.leader), 0xe9c97 (n.zxid), 0x11 (n.round), 21 (recipient), 20 (myid), 0xe (n.peerEpoch)
Nov 22 10:31:13 sc_2_1 BC[myid: 20]: DEBUG Sending Notification: 20 (n.leader), 0xe9c97 (n.zxid), 0x11 (n.round), 10 (recipient), 20 (myid), 0xe (n.peerEpoch)
Nov 22 10:31:13 sc_2_1 BC[myid: 20]: DEBUG Sending Notification: 20 (n.leader), 0xe9c97 (n.zxid), 0x11 (n.round),
[jira] [Created] (ZOOKEEPER-4770) zkSnapshotRecursiveSummaryToolkit.sh Error: Could not find or load main class
nailcui created ZOOKEEPER-4770: -- Summary: zkSnapshotRecursiveSummaryToolkit.sh Error: Could not find or load main class Key: ZOOKEEPER-4770 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4770 Project: ZooKeeper Issue Type: Bug Components: scripts, tools Affects Versions: 3.9.1 Environment: CentOS Linux release 7.4.1708 Reporter: nailcui Fix For: 3.9.2 When I execute the following command to analyze a snapshot file:
{code:java}
./bin/zkSnapshotRecursiveSummaryToolkit.sh /data/version-2/snapshot.c0009 / 2
{code}
I get this error:
{code:java}
Error: Could not find or load main class
{code}
I checked the source code and found that $JVMFLAGS is surrounded by quotation marks. When $JVMFLAGS is empty, the quoted expansion becomes an empty argument, which java then tries to interpret as the main class name.
{code:java}
"$JAVA" -cp "$CLASSPATH" "$JVMFLAGS" \
  org.apache.zookeeper.server.SnapshotRecursiveSummary "$@"
{code}
The correct code should be:
{code:java}
"$JAVA" -cp "$CLASSPATH" $JVMFLAGS \
  org.apache.zookeeper.server.SnapshotRecursiveSummary "$@"
{code}
Thank you, I will fix it. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ZOOKEEPER-4769) Update plugin for SBOM generation to 2.7.10
Vinod Anandan created ZOOKEEPER-4769: Summary: Update plugin for SBOM generation to 2.7.10 Key: ZOOKEEPER-4769 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4769 Project: ZooKeeper Issue Type: Improvement Reporter: Vinod Anandan Update the CycloneDX Maven plugin for SBOM generation to 2.7.10 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ZOOKEEPER-4768) Flaky test org.apache.zookeeper.metrics.prometheus.ExportJvmInfoTest#exportInfo
Yike Xiao created ZOOKEEPER-4768: Summary: Flaky test org.apache.zookeeper.metrics.prometheus.ExportJvmInfoTest#exportInfo Key: ZOOKEEPER-4768 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4768 Project: ZooKeeper Issue Type: Test Components: metric system Affects Versions: 3.10.0 Reporter: Yike Xiao If the {{io.prometheus.client.hotspot.DefaultExports#initialize}} method has been executed by other test cases before running the {{org.apache.zookeeper.metrics.prometheus.ExportJvmInfoTest#exportInfo}} test case, then this test case will fail. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ZOOKEEPER-4767) New implementation of prometheus quantile metrics based on DataSketches
Yike Xiao created ZOOKEEPER-4767: Summary: New implementation of prometheus quantile metrics based on DataSketches Key: ZOOKEEPER-4767 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4767 Project: ZooKeeper Issue Type: Improvement Components: metric system Reporter: Yike Xiao If the built-in Prometheus metrics feature introduced in version 3.6 is enabled, then under high-load scenarios (such as a large number of read requests) the percentile metrics (Summary) used to collect request latencies can easily become a bottleneck and impact the service itself. This is because the internal implementation of Summary involves the overhead of lock operations; in scenarios with a large number of requests, lock contention can lead to a dramatic deterioration in request latency. The details of this issue and the related profiling can be viewed in ZOOKEEPER-4741. In ZOOKEEPER-4289, the updates to Summary were switched to be executed in a separate thread pool. While this approach avoids the lock contention caused by multiple threads updating Summary simultaneously, it introduces the operational overhead of the thread pool queue and additional garbage collection (GC) overhead. In particular, when the thread pool queue is full, a large number of RejectedExecutionException instances will be thrown, further increasing the pressure on GC. To address the problems above, I have implemented an almost lock-free solution based on DataSketches. Benchmark results show that it offers over a 10x speed improvement compared to version 3.9.1 and avoids the frequent GC caused by creating a large number of temporary objects. The trade-off is that the latency percentiles are displayed with a relative delay (default is 60 seconds), and each Summary metric has a certain amount of permanent memory overhead. This solution draws on Matteo Merli's optimization work on the percentile latency metrics for BookKeeper, as detailed in https://github.com/apache/bookkeeper/commit/3bff19956e70e37c025a8e29aa8428937af77aa1. -- This message was sent by Atlassian Jira (v8.20.10#820010)
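A hedged sketch of the general pattern described above, assuming the org.apache.datasketches:datasketches-java artifact (this is not the actual PR; in particular, a real implementation would have to handle the race between update() and merge(), which this sketch deliberately glosses over):
{code:java}
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

import org.apache.datasketches.kll.KllDoublesSketch;

// Per-thread KLL sketches: the request path updates a thread-local sketch
// (no shared monitor), and the scrape path merges them on demand.
public class PerThreadLatencySummary {

    private final Set<KllDoublesSketch> allSketches = ConcurrentHashMap.newKeySet();

    private final ThreadLocal<KllDoublesSketch> local = ThreadLocal.withInitial(() -> {
        KllDoublesSketch sketch = KllDoublesSketch.newHeapInstance();
        allSketches.add(sketch);
        return sketch;
    });

    // hot path: no contended lock, unlike io.prometheus.client.Summary
    public void observe(double latencyMs) {
        local.get().update(latencyMs);
    }

    // scrape path, e.g. invoked every 60 seconds, hence the reporting delay
    // the ticket mentions; merging while workers update is the race a real
    // implementation must handle
    public synchronized double quantile(double rank) {
        KllDoublesSketch merged = KllDoublesSketch.newHeapInstance();
        for (KllDoublesSketch sketch : allSketches) {
            merged.merge(sketch);
        }
        return merged.isEmpty() ? Double.NaN : merged.getQuantile(rank);
    }
}
{code}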
[jira] [Created] (ZOOKEEPER-4766) Ensure leader election time does not unnecessarily scale with tree size due to snapshotting
Rishabh Rai created ZOOKEEPER-4766: -- Summary: Ensure leader election time does not unnecessarily scale with tree size due to snapshotting Key: ZOOKEEPER-4766 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4766 Project: ZooKeeper Issue Type: Improvement Components: leaderElection Affects Versions: 3.8.3, 3.5.9 Environment: General behavior, should occur in all environments Reporter: Rishabh Rai Fix For: 3.8.3, 3.5.9 Hi ZK community, this is regarding a fix for a behavior that causes the leader election time to scale unnecessarily with the amount of data in the ZK data tree. *tl;dr:* During leader election, the leader always saves a snapshot when loading its data tree. This snapshot seems unnecessary, even in the case where the leader needs to send an updated SNAP to a learner, since it serializes the tree before sending anyway. Snapshotting slows down leader election and increases ZK downtime significantly as more data is stored in the tree. This improvement avoids taking that snapshot so that the unnecessary downtime is eliminated. During leader election, when the [data is loaded|https://github.com/apache/zookeeper/blob/79f1f71a9a76689065c14d0846a69d0d71d3586e/zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/Leader.java#L601] by the tentatively elected (i.e. pre-finalized quorum) leader server, a [snapshot of the tree is always taken|https://github.com/apache/zookeeper/blob/79f1f71a9a76689065c14d0846a69d0d71d3586e/zookeeper-server/src/main/java/org/apache/zookeeper/server/ZooKeeperServer.java#L540]. The loadData method is called from multiple places, but specifically in the context of leader election, the snapshotting step seems unnecessary for the leader:
* Because it has loaded the tree at this point, we know that if the leader were to go down again, it would still be able to recover to the state at which we are snapshotting without using the snapshot taken in loadData()
* There are no ongoing transactions until leader election is completed and the ZK ensemble is back up, so no data would be lost after the point at which the data tree is loaded
* Once the ensemble is healthy and the leader is handling transactions again, any new transactions are logged and the log is rolled over when needed anyway, so if the leader is recovering from a failure, the snapshot taken during loadData() does not afford us any additional benefit over the initial snapshot (if it existed) and the transaction log that the leader loaded its data from in loadData()
* When the leader is deciding to send a SNAP or a DIFF to a learner, a [SNAP is serialized|https://github.com/apache/zookeeper/blob/79f1f71a9a76689065c14d0846a69d0d71d3586e/zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/LearnerHandler.java#L582] and sent [if and only if it is needed|https://github.com/apache/zookeeper/blob/79f1f71a9a76689065c14d0846a69d0d71d3586e/zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/LearnerHandler.java#L562]. The snapshot taken in loadData() again does not seem to be beneficial here.
The PR for this fix only skips this snapshotting step in loadData() during leader election. The behavior of the function remains the same for other usages. With this change, during leader election the data tree would only be serialized when sending a SNAP to a learner. In other scenarios, no data tree serialization would be needed at all.
In both cases, there is a significant reduction in the time spent in leader election. If my understanding of any of this is incorrect, or if I'm failing to consider some other aspect of the process, please let me know. The PR can also be updated to enable/disable this behavior via a Java property. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ZOOKEEPER-4765) high maxlatency & xid out of order
yangoofy created ZOOKEEPER-4765: --- Summary: high maxlatency & xid out of order Key: ZOOKEEPER-4765 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4765 Project: ZooKeeper Issue Type: Wish Reporter: yangoofy 1. We use ZooKeeper as the registration center for microservices, with the default configuration. The snapshot is about 600 MB, with 700,000 ephemeral nodes. A single ZooKeeper server has 56 cores and 128 GB of memory, with -Xmx16g. There are 8,000 client connections and 4,000,000 watches. When clients start and stop en masse, the server's max latency climbs as high as 10 seconds. How can I optimize this? 2. Must the ZooKeeper client verify that responses arrive in the same order as the requests were sent, reporting an 'xid out of order' error otherwise? Could it be changed to match requests to responses by xid instead of strictly verifying the order? -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ZOOKEEPER-4764) Tune the log of refuse session request.
Yan Zhao created ZOOKEEPER-4764: --- Summary: Tune the log of refuse session request. Key: ZOOKEEPER-4764 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4764 Project: ZooKeeper Issue Type: Improvement Components: server Affects Versions: 3.9.1, 3.8.3, 3.7.2 Reporter: Yan Zhao Fix For: 3.7.3, 3.8.4, 3.9.2 The current log line: Refusing session request for client as it has seen zxid our last zxid is 0x0 client must try another server (org.apache.zookeeper.server.ZooKeeperServer) We should also print the sessionId in the message. After the improvement: Refusing session(0xab) request for client as it has seen zxid our last zxid is 0x0 client must try another server (org.apache.zookeeper.server.ZooKeeperServer) -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ZOOKEEPER-4763) Logback dependency should be scope provided/test
Thomas Mortagne created ZOOKEEPER-4763: -- Summary: Logback dependency should be scope provided/test Key: ZOOKEEPER-4763 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4763 Project: ZooKeeper Issue Type: Bug Components: other Affects Versions: 3.8.0 Reporter: Thomas Mortagne In general, a library that uses SLF4J is not supposed to impose a logging implementation (Logback, Log4j2, etc.); that choice remains with the final runtime/WAR. The Logback dependency in ZooKeeper should be set to scope "provided" or "test" so that Maven does not pull it in transitively. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ZOOKEEPER-4762) Update netty jars to 4.1.99+ to fix CVE-2023-4586
Dhoka Pramod created ZOOKEEPER-4762: --- Summary: Update netty jars to 4.1.99+ to fix CVE-2023-4586 Key: ZOOKEEPER-4762 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4762 Project: ZooKeeper Issue Type: Bug Affects Versions: 3.8.3 Reporter: Dhoka Pramod Fix For: 3.8.4 [https://nvd.nist.gov/vuln/detail/CVE-2023-4586] A vulnerability was found in the Hot Rod client. This security issue occurs as the Hot Rod client does not enable hostname validation when using TLS, possibly resulting in a man-in-the-middle (MITM) attack. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ZOOKEEPER-4761) Cli tool read saved clientid fail
Yang Guo created ZOOKEEPER-4761: --- Summary: Cli tool read saved clientid fail Key: ZOOKEEPER-4761 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4761 Project: ZooKeeper Issue Type: Bug Reporter: Yang Guo A very simple bug when reading the saved clientid using fread: it causes the CLI tool to fail to recover the last session connected to the ZooKeeper server. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ZOOKEEPER-4760) Add support for filename to get and set cli commands
Soumitra Kumar created ZOOKEEPER-4760: - Summary: Add support for filename to get and set cli commands Key: ZOOKEEPER-4760 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4760 Project: ZooKeeper Issue Type: Improvement Components: tools Reporter: Soumitra Kumar The CLI supports get and set commands to read and write data. Add support for:
# reading the input data for the set command from a file, and
# writing the output data of the get command to a file.
This will help in dealing with arbitrary byte arrays, and also with scripting reads/writes of a large number of znodes via the CLI. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ZOOKEEPER-4759) Handle Netty CVE-2023-44487 and CVE-2023-39325
Aurélien Pupier created ZOOKEEPER-4759: -- Summary: Handle Netty CVE-2023-44487 and CVE-2023-39325 Key: ZOOKEEPER-4759 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4759 Project: ZooKeeper Issue Type: Task Reporter: Aurélien Pupier https://netty.io/news/2023/10/10/4-1-100-Final.html -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ZOOKEEPER-4758) Upgrade snappy-java to 1.1.10.4 to fix CVE-2023-43642
Dhoka Pramod created ZOOKEEPER-4758: --- Summary: Upgrade snappy-java to 1.1.10.4 to fix CVE-2023-43642 Key: ZOOKEEPER-4758 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4758 Project: ZooKeeper Issue Type: Bug Affects Versions: 3.8.3 Reporter: Dhoka Pramod Fix For: 3.8.4 The SnappyInputStream was found to be vulnerable to Denial of Service (DoS) attacks when decompressing data with a too large chunk size. Due to missing upper bound check on chunk length, an unrecoverable fatal error can occur. All versions of snappy-java including the latest released version 1.1.10.3 are vulnerable to this issue. A fix has been introduced in commit `9f8c3cf74` which will be included in the 1.1.10.4 release. Users are advised to upgrade. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ZOOKEEPER-4757) Support JSON format logging
Jan Høydahl created ZOOKEEPER-4757: -- Summary: Support JSON format logging Key: ZOOKEEPER-4757 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4757 Project: ZooKeeper Issue Type: Improvement Reporter: Jan Høydahl More and more enterprise users request structured JSON logging for their applications, since it removes the need to configure custom log-line parsers for every application when collecting logs centrally. ZooKeeper has flexible logging through SLF4J and Logback, and there are several ways to achieve JSON logging with them, but for end users (such as Helm chart users) it is very difficult to set up. It should ideally be as simple as a configuration option. OpenTelemetry is a CNCF project that has become the de facto standard for metrics and trace collection. They also have a logging standard, and they recently [standardized on ECS JSON format|https://opentelemetry.io/blog/2023/ecs-otel-semconv-convergence/] as their log schema for OTEL logging. Although there are other JSON formats in use, a pragmatic option is to support only ECS. Proposed way to enable JSON logging:
{code:java}
export ZOO_LOG_FORMAT=json
bin/zkServer.sh start
# OR
bin/zkServer.sh start -Dzookeeper.log.format=json
{code}
-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ZOOKEEPER-4756) Merge script should use GitHub api to merge pull requests
Andor Molnar created ZOOKEEPER-4756: --- Summary: Merge script should use GitHub api to merge pull requests Key: ZOOKEEPER-4756 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4756 Project: ZooKeeper Issue Type: Improvement Components: tools Affects Versions: 3.9.0 Reporter: Andor Molnar The GitHub merge script (zk-merge-pr.py) is a nice tool which does a lot of housekeeping tasks when merging a PR, including fixing the commit message and closing the Jira. Merging on the GitHub UI is also possible, but can lead to mistakes like leaving the Jira id out of the commit message. Unfortunately, when the script merges a PR, it does so without GitHub, leaving the PR in the 'Closed' rather than the 'Merged' state, which is misleading. Let's improve the script to use the GitHub API for merging PRs, and possibly disable merging on the GitHub UI. Email thread: [https://lists.apache.org/thread/cbmktklydtlylkybvq6jrx5m4l8b2cm5] -- This message was sent by Atlassian Jira (v8.20.10#820010)
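For reference, GitHub exposes merging as PUT /repos/{owner}/{repo}/pulls/{number}/merge. A minimal sketch of that call (shown in Java for concreteness; the actual zk-merge-pr.py is Python, and the PR number and commit title below are hypothetical):
{code:java}
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class MergePullRequest {

    public static void main(String[] args) throws Exception {
        String repo = "apache/zookeeper";
        String prNumber = "1234"; // hypothetical PR number
        String token = System.getenv("GITHUB_TOKEN");

        // squash-merge with an explicit title so the Jira id stays in the commit message
        String body = "{\"commit_title\":\"ZOOKEEPER-XXXX: summary (#" + prNumber + ")\","
                + "\"merge_method\":\"squash\"}";

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://api.github.com/repos/" + repo + "/pulls/" + prNumber + "/merge"))
                .header("Accept", "application/vnd.github+json")
                .header("Authorization", "Bearer " + token)
                .method("PUT", HttpRequest.BodyPublishers.ofString(body))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        // a 200 response means GitHub performed the merge itself,
        // so the PR ends up in the 'Merged' state rather than 'Closed'
        System.out.println(response.statusCode() + " " + response.body());
    }
}
{code}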
[jira] [Created] (ZOOKEEPER-4755) Handle Netty CVE-2023-4586
Damien Diederen created ZOOKEEPER-4755: -- Summary: Handle Netty CVE-2023-4586 Key: ZOOKEEPER-4755 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4755 Project: ZooKeeper Issue Type: Task Reporter: Damien Diederen Assignee: Damien Diederen The {{dependency-check:check}} goal currently fails with the following: {noformat} [ERROR] netty-handler-4.1.94.Final.jar: CVE-2023-4586(6.5) {noformat} According to https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2023-4586 , CVE-2023-4586 is reserved. No fix or additional information is available as of the creation of this ticket. We have to: # Temporarily suppress the check; # Monitor CVE-2023-4586 and apply the remediation as soon as it becomes available. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ZOOKEEPER-4754) Update Jetty to avoid CVE-2023-36479, CVE-2023-40167, and CVE-2023-41900
Damien Diederen created ZOOKEEPER-4754: -- Summary: Update Jetty to avoid CVE-2023-36479, CVE-2023-40167, and CVE-2023-41900 Key: ZOOKEEPER-4754 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4754 Project: ZooKeeper Issue Type: Task Reporter: Damien Diederen Assignee: Damien Diederen -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ZOOKEEPER-4753) Explicit handling of DIGEST-MD5 vs GSSAPI in quorum auth
Damien Diederen created ZOOKEEPER-4753: -- Summary: Explicit handling of DIGEST-MD5 vs GSSAPI in quorum auth Key: ZOOKEEPER-4753 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4753 Project: ZooKeeper Issue Type: Improvement Components: server Affects Versions: 3.9.0 Reporter: Damien Diederen Assignee: Damien Diederen The SASL-based quorum authorizer does not explicitly distinguish between the DIGEST-MD5 and GSSAPI mechanisms: it simply relies on {{NameCallback}} and {{PasswordCallback}} for authentication with the former, and examines Kerberos principals in {{AuthorizeCallback}} for the latter. It turns out that some SASL/DIGEST-MD5 configurations cause authentication and authorization IDs not to match the expected format, and the DIGEST-MD5-based portions of the quorum test suite to fail with obscure errors. (They can be traced to failures to join the quorum, but only by looking into detailed logs.) We can use the login module name to determine whether DIGEST-MD5 or GSSAPI is used, and relax the authentication ID check for the former. As a cleanup, we can keep the password-based credential map empty when Kerberos principals are expected. Finally, we can adapt tests to ensure "weirdly-shaped" credentials only cause authentication failures in the GSSAPI case. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ZOOKEEPER-4752) Remove version files in zookeeper-server/src/main from .gitignore
Istvan Toth created ZOOKEEPER-4752: -- Summary: Remove version files in zookeeper-server/src/main from .gitignore Key: ZOOKEEPER-4752 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4752 Project: ZooKeeper Issue Type: Bug Components: build Affects Versions: 3.8.2 Reporter: Istvan Toth The Info.java and VersionInfoMain.java files are currently generated into the target/generated-sources directory. However, .gitignore still includes the following lines for the main src directory:
{noformat}
zookeeper-server/src/main/java/org/apache/zookeeper/version/Info.java
zookeeper-server/src/main/java/org/apache/zookeeper/version/VersionInfoMain.java
{noformat}
Let's remove them. I just spent two hours debugging mysterious build failures caused by an old Info.java file in src, which didn't show up in git status because of those out-of-date .gitignore entries. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ZOOKEEPER-4751) Update snappy-java to 1.1.10.5 to address CVE-2023-43642
Lari Hotari created ZOOKEEPER-4751: -- Summary: Update snappy-java to 1.1.10.5 to address CVE-2023-43642 Key: ZOOKEEPER-4751 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4751 Project: ZooKeeper Issue Type: Task Reporter: Lari Hotari snappy-java 1.1.10.1 contains CVE-2023-43642. Upgrade the dependency to 1.1.10.5 to get rid of the CVE. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ZOOKEEPER-4750) RequestPathMetricsCollector does not align with FinalRequestProcessor
Kezhu Wang created ZOOKEEPER-4750: - Summary: RequestPathMetricsCollector does not align with FinalRequestProcessor Key: ZOOKEEPER-4750 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4750 Project: ZooKeeper Issue Type: Bug Components: server Affects Versions: 3.9.0 Reporter: Kezhu Wang For example, it does not handle {{createTTL}}. {noformat} 2023-09-30 17:46:59,212 [myid:] - ERROR [SyncThread:0:o.a.z.s.u.RequestPathMetricsCollector@216] - We should not handle 21 {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ZOOKEEPER-4749) Request timeout is not respected for asynchronous api
Kezhu Wang created ZOOKEEPER-4749: - Summary: Request timeout is not respected for asynchronous api Key: ZOOKEEPER-4749 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4749 Project: ZooKeeper Issue Type: Bug Components: java client Affects Versions: 3.9.0 Reporter: Kezhu Wang "zookeeper.request.timeout" is only consulted in the synchronous code path. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ZOOKEEPER-4748) quorum.QuorumCnxManager: BufferUnderflowException
Ke Han created ZOOKEEPER-4748: - Summary: quorum.QuorumCnxManager: BufferUnderflowException Key: ZOOKEEPER-4748 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4748 Project: ZooKeeper Issue Type: Bug Components: quorum Reporter: Ke Han Attachments: hbase--zookeeper-8db357045302.log, persistent.tar.gz When running ZooKeeper (3.5.7, integrated in HBase-2.4.7), I met the following error message.
{code:java}
2023-09-25T11:24:41,326 ERROR [SendWorker:1] quorum.QuorumCnxManager: BufferUnderflowException
java.nio.BufferUnderflowException: null
	at java.nio.HeapByteBuffer.get(HeapByteBuffer.java:155) ~[?:1.8.0_362]
	at java.nio.ByteBuffer.get(ByteBuffer.java:723) ~[?:1.8.0_362]
	at org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.send(QuorumCnxManager.java:1083) ~[zookeeper-3.5.7.jar:3.5.7]
	at org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:1115) ~[zookeeper-3.5.7.jar:3.5.7]
{code}
Here is the structure of my cluster:
N0: ZK0, HMaster
N1: ZK1, Regionserver1
N2: ZK2, Regionserver2
N_100: HDFS
The error happened when I upgraded the HBase cluster; the ZooKeeper cluster also got restarted. The error message happens rarely, and considering its ERROR level, I am not sure whether it will cause other issues, but the cluster still seems to be working correctly. I noticed that the send() code remains the same in the latest version, so I suspect the problem might also occur there. If it's benign, would it be better to output it at WARN level? I have attached my full logs (persistent.tar.gz). The specific error occurred in hbase--zookeeper-8db357045302.log. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ZOOKEEPER-4747) Java api lacks synchronous version of sync() call
Kezhu Wang created ZOOKEEPER-4747: - Summary: Java api lacks synchronous version of sync() call Key: ZOOKEEPER-4747 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4747 Project: ZooKeeper Issue Type: New Feature Components: java client Reporter: Kezhu Wang Assignee: Kezhu Wang Fix For: 3.10.0 Ideally, it should be redundant, just as [~breed] says in ZOOKEEPER-1167. {quote} it wasn't an oversight. there is no reason for a synchronous version. because of the ordering guarantees, if you issue an asynchronous sync, the next call, whether synchronous or asynchronous will see the updated state. {quote} But in the case of connection loss, and in the absence of ZOOKEEPER-22, the client has to check the result of the asynchronous sync before the next call. So, currently, we can't simply issue a fire-and-forget asynchronous sync followed by a read to gain strong consistency. In a synchronous call chain, the client therefore has to convert the asynchronous {{sync}} to a synchronous one to gain strong consistency. This is what I do in [EagerACLFilterTest::syncClient|https://github.com/apache/zookeeper/blob/f42c01de73867ffbc12707b3e9f9cd7f847fe462/zookeeper-server/src/test/java/org/apache/zookeeper/server/quorum/EagerACLFilterTest.java#L98], and it is apparently unfriendly to end users. -- This message was sent by Atlassian Jira (v8.20.10#820010)
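A minimal sketch of that conversion, in the spirit of the test helper mentioned above (the class name is invented; the callback and exception APIs are the standard ZooKeeper client ones):
{code:java}
import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;

// Blocks on the asynchronous sync() and surfaces a CONNECTIONLOSS (or any
// other non-OK code) as a KeeperException instead of silently ignoring it.
public final class SyncUtils {

    public static void syncBlocking(ZooKeeper zk, String path)
            throws InterruptedException, KeeperException {
        CountDownLatch latch = new CountDownLatch(1);
        int[] rc = new int[1];
        zk.sync(path, (code, p, ctx) -> {
            rc[0] = code;
            latch.countDown();
        }, null);
        latch.await();
        if (rc[0] != KeeperException.Code.OK.intValue()) {
            throw KeeperException.create(KeeperException.Code.get(rc[0]), path);
        }
    }
}
{code}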
[jira] [Created] (ZOOKEEPER-4746) cppunit tests hang and cancelled
Kezhu Wang created ZOOKEEPER-4746: - Summary: cppunit tests hang and cancelled Key: ZOOKEEPER-4746 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4746 Project: ZooKeeper Issue Type: Test Components: tests Affects Versions: 3.10.0 Reporter: Kezhu Wang * https://github.com/apache/zookeeper/actions/runs/6007712384/job/16337953123 * https://github.com/apache/zookeeper/actions/runs/6047057349/job/16409786315 * https://github.com/apache/zookeeper/actions/runs/6195151365/job/16819317479 * https://github.com/apache/zookeeper/actions/runs/6196548582/job/16823409398 The tests hang for so long that the runner cancels them. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ZOOKEEPER-4745) End to End tests fail occasionally
Kezhu Wang created ZOOKEEPER-4745: - Summary: End to End tests fail occasionally Key: ZOOKEEPER-4745 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4745 Project: ZooKeeper Issue Type: Test Components: tests Affects Versions: 3.10.0 Reporter: Kezhu Wang I saw: * https://github.com/apache/zookeeper/actions/runs/5587157838/job/15131211778 * https://github.com/kezhuw/zookeeper/actions/runs/5251205631/job/14209201285 * https://github.com/kezhuw/zookeeper/actions/runs/6198985701/job/16830576384#step:9:38 * https://github.com/apache/zookeeper/actions/runs/6244974218/job/16952757583#step:11:44
{noformat}
2023-07-18 12:08:34,046 [myid:] - ERROR [main:o.a.z.u.ServiceUtils@48] - Exiting JVM with code 1
ZooKeeper JMX enabled by default
Using config: /home/runner/work/zookeeper/zookeeper/apache-zookeeper-3.7.0-bin/bin/../conf/zoo_sample.cfg
Stopping zookeeper ... STOPPED
Traceback (most recent call last):
  File "/home/runner/work/zookeeper/zookeeper/tools/ci/test-connectivity.py", line 48, in <module>
    subprocess.run([f'{client_binpath}', 'sync', '/'], check=True)
  File "/usr/lib/python3.10/subprocess.py", line 524, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['/home/runner/work/zookeeper/zookeeper/bin/zkCli.sh', 'sync', '/']' returned non-zero exit status 1.
Error: Process completed with exit code 1.
{noformat}
I guess it could be caused by the asynchronous start. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ZOOKEEPER-4744) Zookeeper fails to start after power failure
Maria Ramos created ZOOKEEPER-4744: -- Summary: Zookeeper fails to start after power failure Key: ZOOKEEPER-4744 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4744 Project: ZooKeeper Issue Type: Bug Affects Versions: 3.7.1 Environment: These are the configurations of the ZooKeeper cluster (omitting IPs): {{tickTime=2000}} {{dataDir=/home/data/zk37}} {{clientPort=2181}} {{maxClientCnxns=60}} {{initLimit=100}} {{syncLimit=100}} {{server.1=[IP1]:2888:3888}} {{server.2=[IP2]:2888:3888}} {{server.3=[IP3]:2888:3888}} Reporter: Maria Ramos Attachments: reported_error.txt The underlying issue stems from consecutive writes to the log file that are not interleaved with {{fsync}} operations. This is a well-documented behavior of operating systems, and there are several references addressing this problem: - [https://www.usenix.org/conference/osdi14/technical-sessions/presentation/pillai] - [https://dl.acm.org/doi/pdf/10.1145/2872362.2872406] - [https://mariadb.com/kb/en/atomic-write-support/] - [https://pages.cs.wisc.edu/~remzi/OSTEP/file-journaling.pdf] (Page 9) This issue can be replicated using [LazyFS|https://github.com/dsrhaslab/lazyfs], a file system capable of simulating power failures and exhibiting the OS behavior mentioned above, i.e., out-of-order file writes at the disk level. LazyFS persists these writes out of order and then crashes to simulate a power failure. To reproduce this problem, one can follow these steps: {*}1{*}. Mount LazyFS on a directory where ZooKeeper data will be saved, with a specified root directory. Assuming the data path for ZooKeeper is {{/home/data/zk}} and the root directory is {{/home/data/zk-root}}, add the following lines to the default configuration file ({{config/default.toml}}): {{[[injection]] }} {{type="reorder" }} {{occurrence=1 }} {{op="write" }} {{file="/home/data/zk-root/version-2/log.10001" }} {{persist=[3]}} These lines define a fault to be injected. A power failure will be simulated after the third write to the {{/home/data/zk-root/version-2/log.10001}} file. The `occurrence` parameter allows specifying that this is the first group where this happens, as there might be more than one group of consecutive writes. {*}2{*}. Start LazyFS as the underlying file system of a node in the cluster with the following command: {{ ./scripts/mount-lazyfs.sh -c config/default.toml -m /home/data/zk -r /home/data/zk-root -f}} {*}3{*}. Start ZooKeeper with the command: {{ apache-zookeeper-3.7.1-bin/bin/zkServer.sh start-foreground}} {*}4{*}. Connect a client to the node that has LazyFS as the underlying file system: {{apache-zookeeper-3.7.1-bin/bin/zkCli.sh -server 127.0.0.1:2181}} Immediately after this step, LazyFS will be unmounted, simulating a power failure, and ZooKeeper will keep printing error messages in the terminal, requiring a forced shutdown. At this point, one can analyze the logs produced by LazyFS to examine the system calls issued up to the moment of the fault. 
Here is a simplified version of the log:
{noformat}
{'syscall': 'create', 'path': '/home/gsd/data/zk37-root/version-2/log.10001', 'mode': 'O_TRUNC'}
{'syscall': 'write', 'path': '/home/data/zk37-root/version-2/log.10001', 'size': '16', 'off': '0'}
{'syscall': 'write', 'path': '/home/data/zk37-root/version-2/log.10001', 'size': '1', 'off': '67108879'}
{'syscall': 'write', 'path': '/home/data/zk37-root/version-2/log.10001', 'size': '67108863', 'off': '16'}
{'syscall': 'write', 'path': '/home/data/zk37-root/version-2/log.10001', 'size': '61', 'off': '16'}
{noformat}
Note that the third write is issued by LazyFS for padding.
{*}5{*}. Remove the fault from the configuration file and unmount the file system with {{fusermount -uz /home/data/zk}}.
{*}6{*}. Mount LazyFS again with the previously provided command.
{*}7{*}. Attempt to start ZooKeeper (it fails).
By following these steps, one can replicate the issue and analyze the effects of the power failure on ZooKeeper's restart process. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ZOOKEEPER-4743) Jump from -2 to 0 when zookeeper increments znode dataVersion
HaiyuanZhao created ZOOKEEPER-4743: -- Summary: Jump from -2 to 0 when zookeeper increments znode dataVersion Key: ZOOKEEPER-4743 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4743 Project: ZooKeeper Issue Type: Improvement Reporter: HaiyuanZhao -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ZOOKEEPER-4742) Config watch path get truncated abnormally for chroot "/zoo" or alikes
Kezhu Wang created ZOOKEEPER-4742: - Summary: Config watch path get truncated abnormally for chroot "/zoo" or alikes Key: ZOOKEEPER-4742 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4742 Project: ZooKeeper Issue Type: Bug Components: java client Affects Versions: 3.8.2, 3.9.0, 3.7.1 Reporter: Kezhu Wang Assignee: Kezhu Wang This is a leftover of ZOOKEEPER-4565, split from [pr#1996|https://github.com/apache/zookeeper/pull/1996] to keep ZOOKEEPER-4601 focused. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ZOOKEEPER-4741) High latency under heavy load when prometheus metrics enabled
Yike Xiao created ZOOKEEPER-4741: Summary: High latency under heavy load when prometheus metrics enabled Key: ZOOKEEPER-4741 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4741 Project: ZooKeeper Issue Type: Bug Components: server Affects Versions: 3.8.2, 3.6.4 Environment: zookeeper version: 3.6.4 kernel: 3.10.0-1160.95.1.el7.x86_64 java version "1.8.0_111" metricsProvider.className=org.apache.zookeeper.metrics.prometheus.PrometheusMetricsProvider Reporter: Yike Xiao Attachments: 32010.threaddump.001.txt, 32010.wallclock.profile.html, image-2023-09-11-16-17-21-166.png In our production, we use the ZooKeeper built-in PrometheusMetricsProvider to monitor ZooKeeper status. Recently we observed very high latency in one of our ZooKeeper clusters which serves a heavy load; measured on a heavily loaded client, the latency could be more than 25 seconds. !image-2023-09-11-16-17-21-166.png! We observed many connections with a high Recv-Q on the server side, and CommitProcWorkThread threads *BLOCKED* in {{org.apache.zookeeper.server.ServerStats#updateLatency}}:
{noformat}
"CommitProcWorkThread-15" #21595 daemon prio=5 os_prio=0 tid=0x7f86d804a000 nid=0x6bca waiting for monitor entry [0x7f86deb95000]
   java.lang.Thread.State: BLOCKED (on object monitor)
	at io.prometheus.client.CKMSQuantiles.insert(CKMSQuantiles.java:91)
	- waiting to lock <0x000784dd1a18> (a io.prometheus.client.CKMSQuantiles)
	at io.prometheus.client.TimeWindowQuantiles.insert(TimeWindowQuantiles.java:38)
	at io.prometheus.client.Summary$Child.observe(Summary.java:281)
	at io.prometheus.client.Summary.observe(Summary.java:307)
	at org.apache.zookeeper.metrics.prometheus.PrometheusMetricsProvider$PrometheusSummary.add(PrometheusMetricsProvider.java:355)
	at org.apache.zookeeper.server.ServerStats.updateLatency(ServerStats.java:153)
	at org.apache.zookeeper.server.FinalRequestProcessor.updateStats(FinalRequestProcessor.java:669)
	at org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:585)
	at org.apache.zookeeper.server.quorum.CommitProcessor$CommitWorkRequest.doWork(CommitProcessor.java:545)
	at org.apache.zookeeper.server.WorkerService$ScheduledWorkRequest.run(WorkerService.java:154)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
{noformat}
The wall-clock lock profile shows lock contention among the {{CommitProcWorkThread}} threads. !https://gitlab.dev.zhaopin.com/sucheng.wang/notes/uploads/b9da2552d6b00c3f9130d87caf01325e/image.png! -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ZOOKEEPER-4740) I want to use kerberos for Zookeeper, but my authentication has been unsuccessful
LiJie2023 created ZOOKEEPER-4740: Summary: I want to use kerberos for Zookeeper, but my authentication has been unsuccessful Key: ZOOKEEPER-4740 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4740 Project: ZooKeeper Issue Type: Wish Components: kerberos Affects Versions: 3.5.9 Reporter: LiJie2023 Attachments: image-2023-09-01-16-37-20-848.png zookeeper_jaas.conf:
{code:java}
Server {
  com.sun.security.auth.module.Krb5LoginModule required
  useKeyTab=true
  storeKey=true
  useTicketCache=false
  keyTab="/opt/test2.keytab"
  principal="test2/bigdata.hadoop.master01";
};

Client {
  com.sun.security.auth.module.Krb5LoginModule required
  useKeyTab=true
  keyTab="/opt/test2.keytab"
  principal="test2/bigdata.hadoop.master01"
  useTicketCache=false
  debug=true;
};
{code}
[root@bigdata conf]# cat java.env
{code:java}
export JVMFLAGS="-Djava.security.auth.login.config=/usr/lib/zookeeper/conf/zookeeper_jaas.conf"
{code}
/etc/krb5.conf:
{code:java}
# Configuration snippets may be placed in this directory as well
includedir /etc/krb5.conf.d/

[logging]
 default = FILE:/var/log/krb5libs.log
 kdc = FILE:/var/log/krb5kdc.log
 admin_server = FILE:/var/log/kadmind.log

[libdefaults]
 dns_lookup_realm = false
 ticket_lifetime = 24h
 renew_lifetime = 7d
 forwardable = true
 rdns = false
 default_realm = EXAMPLE.COM
 default_ccache_name = KEYRING:persistent:%{uid}

[realms]
 EXAMPLE.COM = {
  kdc = bigdata.hadoop.master01
  admin_server = bigdata.hadoop.master01
 }

[domain_realm]
 .bigdata.hadoop.master01 = EXAMPLE.COM
 bigdata.hadoop.master01 = EXAMPLE.COM
{code}
!image-2023-09-01-16-37-20-848.png!
When I use a client connection:
{code:java}
zookeeper-client -server localhost:12181
{code}
Connecting to localhost:12181
2023-09-01 16:38:05,528 - INFO [main:Environment@109] - Client environment:zookeeper.version=3.5.9-83df9301aa5c2a5d284a9940177808c01bc35cef, built on 10/25/2022 23:07 GMT
2023-09-01 16:38:05,530 - INFO [main:Environment@109] - Client environment:host.name=bigdata.hadoop.master01
2023-09-01 16:38:05,530 - INFO [main:Environment@109] - Client environment:java.version=1.8.0_351
2023-09-01 16:38:05,532 - INFO [main:Environment@109] - Client environment:java.vendor=Oracle Corporation
2023-09-01 16:38:05,532 - INFO [main:Environment@109] - Client environment:java.home=/usr/java/jdk1.8.0_351-amd64/jre
2023-09-01 16:38:05,532 - INFO [main:Environment@109] - Client
environment:java.class.path=/usr/lib/zookeeper/bin/../zookeeper-server/target/classes:/usr/lib/zookeeper/bin/../build/classes:/usr/lib/zookeeper/bin/../zookeeper-server/target/lib/*.jar:/usr/lib/zookeeper/bin/../build/lib/*.jar:/usr/lib/zookeeper/bin/../lib/zookeeper-jute-3.5.9.jar:/usr/lib/zookeeper/bin/../lib/zookeeper-3.5.9.jar:/usr/lib/zookeeper/bin/../lib/slf4j-log4j12-1.7.25.jar:/usr/lib/zookeeper/bin/../lib/slf4j-api-1.7.25.jar:/usr/lib/zookeeper/bin/../lib/netty-transport-native-unix-common-4.1.50.Final.jar:/usr/lib/zookeeper/bin/../lib/netty-transport-native-epoll-4.1.50.Final.jar:/usr/lib/zookeeper/bin/../lib/netty-transport-4.1.50.Final.jar:/usr/lib/zookeeper/bin/../lib/netty-resolver-4.1.50.Final.jar:/usr/lib/zookeeper/bin/../lib/netty-handler-4.1.50.Final.jar:/usr/lib/zookeeper/bin/../lib/netty-common-4.1.50.Final.jar:/usr/lib/zookeeper/bin/../lib/netty-codec-4.1.50.Final.jar:/usr/lib/zookeeper/bin/../lib/netty-buffer-4.1.50.Final.jar:/usr/lib/zookeeper/bin/../lib/log4j-1.2.17.jar:/usr/lib/zookeeper/bin/../lib/json-simple-1.1.1.jar:/usr/lib/zookeeper/bin/../lib/jline-2.14.6.jar:/usr/lib/zookeeper/bin/../lib/jetty-util-ajax-9.4.35.v20201120.jar:/usr/lib/zookeeper/bin/../lib/jetty-util-9.4.35.v20201120.jar:/usr/lib/zookeeper/bin/../lib/jetty-servlet-9.4.35.v20201120.jar:/usr/lib/zookeeper/bin/../lib/jetty-server-9.4.35.v20201120.jar:/usr/lib/zookeeper/bin/../lib/jetty-security-9.4.35.v20201120.jar:/usr/lib/zookeeper/bin/../lib/jetty-io-9.4.35.v20201120.jar:/usr/lib/zookeeper/bin/../lib/jetty-http-9.4.35.v20201120.jar:/usr/lib/zookeeper/bin/../lib/javax.servlet-api-3.1.0.jar:/usr/lib/zookeeper/bin/../lib/jackson-databind-2.10.5.1.jar:/usr/lib/zookeeper/bin/../lib/jackson-core-2.10.5.jar:/usr/lib/zookeeper/bin/../lib/jackson-annotations-2.10.5.jar:/usr/lib/zookeeper/bin/../lib/commons-cli-1.2.jar:/usr/lib/zookeeper/bin/../lib/audience-annotations-0.5.0.jar:/usr/lib/zookeeper/bin/../zookeeper-jute.jar:/usr/lib/zookeeper/bin/../zookeeper-jute-3.5.9.jar:/usr/lib/zookeeper/bin/../zookeeper-3.5.9.jar:/usr/lib/zookeeper/bin/../zookeeper-server/src/main/resources/lib/*.jar:/etc/zookeeper/conf::/etc/zookeeper/conf:/usr/lib/zookeeper/zookeeper-3.5.9.jar:/usr/lib/zookeeper/zookeeper-jute-3.5.9.jar:/usr/lib/zookeeper/zookeeper-jute.jar:/usr/lib/zookeeper/zookeeper.jar:/usr/lib/zookeeper/lib/audience-annotations-0.5.0.jar:/usr/lib/zookeeper/lib/commons-cli-
[jira] [Created] (ZOOKEEPER-4739) Disable netty-tcnative for s390x arch
Vibhuti Sawant created ZOOKEEPER-4739: - Summary: Disable netty-tcnative for s390x arch Key: ZOOKEEPER-4739 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4739 Project: ZooKeeper Issue Type: Bug Components: server Affects Versions: 3.9.0, 3.9.1 Reporter: Vibhuti Sawant As netty-tcnative is not supported on the s390x architecture, test-case failures were observed in ClientSSLTest; hence we skip those test cases on s390x only. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ZOOKEEPER-4738) Clean test cases by replacing assertFalse(equals()) with assertNotEquals
Taher Ghaleb created ZOOKEEPER-4738: --- Summary: Clean test cases by refactoring assertFalse(equals()) with assertNotEquals Key: ZOOKEEPER-4738 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4738 Project: ZooKeeper Issue Type: Improvement Reporter: Taher Ghaleb
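The refactoring is mechanical; a representative before/after, using assertNotEquals as available in both org.junit.Assert and org.junit.jupiter.api.Assertions:
{code:java}
// before: on failure, only reports that the condition was unexpectedly true,
// and throws NullPointerException if expected is null
assertFalse(expected.equals(actual));

// after: on failure, reports both values, and handles nulls safely
assertNotEquals(expected, actual);
{code}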
[jira] [Created] (ZOOKEEPER-4737) Error occurs with the zookeeper_interest() function in version 3.5.8 of zookeeper-client-c
wangyuzhi created ZOOKEEPER-4737: Summary: Error occurs with the zookeeper_interest() function in version 3.5.8 of zookeeper-client-c Key: ZOOKEEPER-4737 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4737 Project: ZooKeeper Issue Type: Bug Components: c client Affects Versions: 3.5.8 Environment: zookeeper server version: 3.4.12 Reporter: wangyuzhi I encountered an intermittent error while using zookeeper-client-c:
/lib64/libc.so.6(cfree+0x1c) [0x7f3f9e74f4dc]
/lib64/libc.so.6(_IO_free_backup_area+0x1a) [0x7f3f9e74713a]
/lib64/libc.so.6(_IO_file_overflow+0x1d5) [0x7f3f9e7468d5]
/lib64/libc.so.6(_IO_file_xsputn+0xb1) [0x7f3f9e745651]
/lib64/libc.so.6(_IO_vfprintf+0x151d) [0x7f3f9e71769d]
/lib64/libc.so.6(_IO_fprintf+0x87) [0x7f3f9e720827]
proxy(log_message+0x20c) [0x8c39d2]
proxy(zookeeper_interest+0x16b) [0x8b2c5c]
We are only using ZooKeeper for service discovery, but this error occurs every once in a while.
[jira] [Created] (ZOOKEEPER-4736) socket fd leak
lchq created ZOOKEEPER-4736: --- Summary: socket fd leak Key: ZOOKEEPER-4736 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4736 Project: ZooKeeper Issue Type: Bug Components: java client, server Affects Versions: 3.9.0, 3.8.0, 3.7.0, 3.6.3 Environment: zookeeper 3.6.3 !4cea510a57af58c08e73d146e8535ee4.jpg! Reporter: lchq Attachments: 4cea510a57af58c08e73d146e8535ee4.jpg, IMG_20230815_114433.jpg If the network is unavailable on a node (e.g. after "ifdown eth0" or "service network stop"), any zk-client process on that node leaks file descriptors. The leak is triggered by invoking "new ZooKeeper(..)": while the network is down, ClientCnxn::SendThread::run() repeatedly calls startConnect() and hits "SocketException: Network is unreachable". The exception handler then runs SendThread::cleanup(), but because ClientCnxnSocketNIO::registerAndConnect registers the socket with the selector before calling sock.connect(), the socket's fd can never be closed. Swapping the order of sock.connect() and the selector registration fixes the issue without changing the original semantics, because the registration only takes effect once selector.select(waitTimeOut) is triggered; see the sketch below.
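A sketch of the reordering described above, based on the general shape of ClientCnxnSocketNIO#registerAndConnect in the affected branches (simplified; sockKey, selector, and sendThread are fields of the surrounding class; this is not the committed patch):
{code:java}
void registerAndConnect(SocketChannel sock, InetSocketAddress addr) throws IOException {
    // Connect first: if this throws "Network is unreachable", the channel
    // is not yet tied to the selector, so cleanup() can actually close its fd.
    boolean immediateConnect = sock.connect(addr);

    // Registering afterwards is safe because the registration only takes
    // effect on the next selector.select(waitTimeOut) call.
    sockKey = sock.register(selector, SelectionKey.OP_CONNECT);

    if (immediateConnect) {
        sendThread.primeConnection();
    }
}
{code}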
[jira] [Created] (ZOOKEEPER-4735) set the RMI port to address issues with monitoring Zookeeper running in containers
Enrico Olivelli created ZOOKEEPER-4735: -- Summary: set the RMI port to address issues with monitoring Zookeeper running in containers Key: ZOOKEEPER-4735 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4735 Project: ZooKeeper Issue Type: Improvement Components: server Reporter: Enrico Olivelli Fix For: 3.10.0
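Background: with the default settings the JMX agent opens its RMI server on a second, ephemeral port, which cannot be published from a container. The usual remedy is to pin both ports to the same value; a sketch of the JVM flags involved (port and hostname are illustrative):
{code}
export SERVER_JVMFLAGS="-Dcom.sun.management.jmxremote \
  -Dcom.sun.management.jmxremote.port=9999 \
  -Dcom.sun.management.jmxremote.rmi.port=9999 \
  -Dcom.sun.management.jmxremote.authenticate=false \
  -Dcom.sun.management.jmxremote.ssl=false \
  -Djava.rmi.server.hostname=zk-0.example.com"
{code}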
[jira] [Created] (ZOOKEEPER-4734) FuzzySnapshotRelatedTest becomes flaky when transient disk failure appears
Haoze Wu created ZOOKEEPER-4734: --- Summary: FuzzySnapshotRelatedTest becomes flaky when transient disk failure appears Key: ZOOKEEPER-4734 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4734 Project: ZooKeeper Issue Type: Bug Components: tests Affects Versions: 3.6.0 Reporter: Haoze Wu In testPZxidUpdatedWhenLoadingSnapshot(), a quorum server is stopped and restarted to test snapshot loading. However, while the quorum server restarts, we call into ZkDataBase#loadDataBase(), from which an IOException can be thrown because of a transient disk failure.
{code:java}
public long loadDataBase() throws IOException {
    long zxid = snapLog.restore(dataTree, sessionsWithTimeouts, commitProposalPlaybackListener); // line 240, IOException thrown here
    initialized = true;
    return zxid;
}
{code}
In FileTxnSnapLog#restore:
{code:java}
public long restore(DataTree dt, Map sessions, PlayBackListener listener) throws IOException {
    long deserializeResult = snapLog.deserialize(dt, sessions); // IOException
    ...
}
{code}
Here is the stacktrace:
{code:java}
at org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java)
at org.apache.zookeeper.server.ZKDatabase.loadDataBase(ZKDatabase.java:240)
at org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:862)
at org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:848)
at org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:201)
at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:124)
at org.apache.zookeeper.server.quorum.QuorumPeerTestBase$MainThread.run(QuorumPeerTestBase.java:330)
at java.lang.Thread.run(Thread.java:748)
{code}
Because of this IOException, the restart fails and the test fails. As for the fix, we could either retry the test, as proposed in ZOOKEEPER-3157, or add a configurable retry mechanism to ZkDataBase#loadDataBase() to tolerate transient disk failures; a sketch of the latter follows.
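A sketch of the second option, a bounded retry inside ZkDataBase#loadDataBase() (the attempt count and logging are illustrative; no such knob exists in ZooKeeper today):
{code:java}
public long loadDataBase() throws IOException {
    final int maxAttempts = 3; // illustrative; would come from configuration
    for (int attempt = 1; ; attempt++) {
        try {
            long zxid = snapLog.restore(dataTree, sessionsWithTimeouts,
                    commitProposalPlaybackListener);
            initialized = true;
            return zxid;
        } catch (IOException e) {
            if (attempt >= maxAttempts) {
                throw e; // persistent failure: give up, as the code does today
            }
            LOG.warn("Transient failure restoring database (attempt {}/{}), retrying",
                    attempt, maxAttempts, e);
        }
    }
}
{code}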
[jira] [Created] (ZOOKEEPER-4733) non-return function error and asan error in CPPUNIT TESTs
whyer created ZOOKEEPER-4733: Summary: non-return function error and asan error in CPPUNIT TESTs Key: ZOOKEEPER-4733 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4733 Project: ZooKeeper Issue Type: Bug Components: c client Affects Versions: 3.8.2 Environment: gcc (Debian 6.3.0-18+deb9u1) 6.3.0 20170516 Reporter: whyer When the -Werror=return-type check is enabled in gcc, the following error occurs:
{quote}zookeeper/zookeeper-client/zookeeper-client-c/tests/ZooKeeperQuorumServer.cc: In static member function ‘static std::vector ZooKeeperQuorumServer::getCluster(uint32_t, ZooKeeperQuorumServer::tConfigPairs, std::__cxx11::string)’:
zookeeper/zookeeper-client/zookeeper-client-c/tests/ZooKeeperQuorumServer.cc:230:1: error: control reaches end of non-void function [-Werror=return-type]
}{quote}
When AddressSanitizer is enabled for the CppUnit tests, the following error occurs:
{quote}1: Zookeeper_reconfig::testMigrationCycle=
1: ==415554==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x60d0cf6f at pc 0x560905cbbd12 bp 0x7ffe32d10af0 sp 0x7ffe32d10ae8
1: READ of size 1 at 0x60d0cf6f thread T0
1: #0 0x560905cbbd11 in Zookeeper_reconfig::testMigrationCycle() zookeeper/zookeeper-client/zookeeper-client-c/tests/TestReconfig.cc:502
1: #1 0x560905cc21a1 in CppUnit::TestCaller::runTest() /usr/include/cppunit/TestCaller.h:166
1: #2 0x7fb8248815b1 in CppUnit::TestCaseMethodFunctor::operator()() const (/usr/lib/x86_64-linux-gnu/libcppunit-1.13.so.0+0x235b1)
1: #3 0x7fb824877eb2 in CppUnit::DefaultProtector::protect(CppUnit::Functor const&, CppUnit::ProtectorContext const&) (/usr/lib/x86_64-linux-gnu/libcppunit-1.13.so.0+0x19eb2)
1: #4 0x7fb82487e7e1 in CppUnit::ProtectorChain::protect(CppUnit::Functor const&, CppUnit::ProtectorContext const&) (/usr/lib/x86_64-linux-gnu/libcppunit-1.13.so.0+0x207e1)
1: #5 0x7fb824886e4f in CppUnit::TestResult::protect(CppUnit::Functor const&, CppUnit::Test*, std::__cxx11::basic_string, std::allocator > const&) (/usr/lib/x86_64-linux-gnu/libcppunit-1.13.so.0+0x28e4f)
1: #6 0x7fb82488138f in CppUnit::TestCase::run(CppUnit::TestResult*) (/usr/lib/x86_64-linux-gnu/libcppunit-1.13.so.0+0x2338f)
1: #7 0x7fb8248818e2 in CppUnit::TestComposite::doRunChildTests(CppUnit::TestResult*) (/usr/lib/x86_64-linux-gnu/libcppunit-1.13.so.0+0x238e2)
1: #8 0x7fb8248817fd in CppUnit::TestComposite::run(CppUnit::TestResult*) (/usr/lib/x86_64-linux-gnu/libcppunit-1.13.so.0+0x237fd)
1: #9 0x7fb8248818e2 in CppUnit::TestComposite::doRunChildTests(CppUnit::TestResult*) (/usr/lib/x86_64-linux-gnu/libcppunit-1.13.so.0+0x238e2)
1: #10 0x7fb8248817fd in CppUnit::TestComposite::run(CppUnit::TestResult*) (/usr/lib/x86_64-linux-gnu/libcppunit-1.13.so.0+0x237fd)
1: #11 0x7fb824886d71 in CppUnit::TestResult::runTest(CppUnit::Test*) (/usr/lib/x86_64-linux-gnu/libcppunit-1.13.so.0+0x28d71)
1: #12 0x7fb82488947d in CppUnit::TestRunner::run(CppUnit::TestResult&, std::__cxx11::basic_string, std::allocator > const&) (/usr/lib/x86_64-linux-gnu/libcppunit-1.13.so.0+0x2b47d)
1: #13 0x560905c918a5 in main zookeeper/zookeeper-client/zookeeper-client-c/tests/TestDriver.cc:152
1: #14 0x7fb8230b42e0 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x202e0)
1: #15 0x560905c914c9 in _start (build/zookeeper/zookeeper-client/zookeeper-client-c/zktest+0x154c9)
1:
1: 0x60d0cf6f is located 1 bytes to the left of 138-byte region [0x60d0cf70,0x60d0cffa)
1: allocated by thread T0 here:
1: #0 0x7fb824b5fbf0 in operator new(unsigned long) (/usr/lib/x86_64-linux-gnu/libasan.so.3+0xc2bf0)
1: #1 0x7fb823a5f1f6 (/usr/lib/x86_64-linux-gnu/libstdc++.so.6+0xf6)
1: #2 0x7fb823a5bcb9 in std::ostream& std::ostream::_M_insert(unsigned long) (/usr/lib/x86_64-linux-gnu/libstdc++.so.6+0x10dcb9)
1: #3 0x7ffe32d1090f ()
1:
1: SUMMARY: AddressSanitizer: heap-buffer-overflow zookeeper/zookeeper-client/zookeeper-client-c/tests/TestReconfig.cc:502 in Zookeeper_reconfig::testMigrationCycle()
1: Shadow bytes around the buggy address:
1: 0x0c1a7fff9990: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
1: 0x0c1a7fff99a0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
1: 0x0c1a7fff99b0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
1: 0x0c1a7fff99c0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
1: 0x0c1a7fff99d0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
1: =>0x0c1a7fff99e0: fa fa fa fa fa fa fa fa fa fa fa fa fa[fa]00 00
1: 0x0c1a7fff99f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 02
1: 0x0c1a7fff9a00: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
1: 0x0c1a7fff9a10: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
1: 0x0c1a7fff9a20: fa fa fa fa fa fa fa f{quote}
[jira] [Created] (ZOOKEEPER-4732) improve Reproducible Builds
Herve Boutemy created ZOOKEEPER-4732: Summary: improve Reproducible Builds Key: ZOOKEEPER-4732 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4732 Project: ZooKeeper Issue Type: Improvement Components: build Affects Versions: 3.9.0 Reporter: Herve Boutemy Rebuilding ZooKeeper 3.9.0 shows that it is only partially reproducible: https://github.com/jvm-repo-rebuild/reproducible-central/blob/master/content/org/apache/zookeeper/README.md Analyzing the root cause, there are two issues: 1. a few old plugins to upgrade (easy); 2. generated code that contains the build timestamp: replacing it with the git commit timestamp would make the build reproducible (removing it entirely would be a bigger change, as it impacts the API). A sketch of the Maven side follows.
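For the timestamp issue, Maven's reproducible-builds support keys archive timestamps off a single pom property, which release tooling can set from the commit being built; a minimal sketch (the value shown is a placeholder):
{code:xml}
<properties>
  <!-- Read by recent maven-jar/source/assembly plugins; fixing it makes
       archive entries reproducible. Typically set to the release commit's
       timestamp, e.g. obtained via: git log -1 --format=%cI -->
  <project.build.outputTimestamp>2023-08-01T00:00:00Z</project.build.outputTimestamp>
</properties>
{code}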