[jira] [Created] (ZOOKEEPER-4831) update slf4j from 1.x to 2.0.13, logback to 1.3.14

2024-04-29 Thread ZhangJian He (Jira)
ZhangJian He created ZOOKEEPER-4831:
---

 Summary: update slf4j from 1.x to 2.0.13, logback to 1.3.14
 Key: ZOOKEEPER-4831
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4831
 Project: ZooKeeper
  Issue Type: Improvement
Reporter: ZhangJian He






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ZOOKEEPER-4830) zk_learners is incorrectly referenced as zk_followers

2024-04-23 Thread Nicholas Feinberg (Jira)
Nicholas Feinberg created ZOOKEEPER-4830:


 Summary: zk_learners is incorrectly referenced as zk_followers
 Key: ZOOKEEPER-4830
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4830
 Project: ZooKeeper
  Issue Type: Bug
Affects Versions: 3.9.2, 3.8.4
Reporter: Nicholas Feinberg


https://issues.apache.org/jira/browse/ZOOKEEPER-3117 renamed the `zk_followers` 
metric to `zk_learners`, but some references to `zk_followers` remained in the 
repo, including in the documentation. These should be corrected.
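
For reference, the renamed metric can be checked via the mntr 4lw command; on a fixed build the output should use the new name (illustrative output, the value depends on the ensemble):

{noformat}
$ echo mntr | nc localhost 2181 | grep -E 'zk_(followers|learners)'
zk_learners	2
{noformat}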




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ZOOKEEPER-4829) Support DatadirCleanup in minutes

2024-04-23 Thread Purshotam Shah (Jira)
Purshotam Shah created ZOOKEEPER-4829:
-

 Summary: Support DatadirCleanup in minutes
 Key: ZOOKEEPER-4829
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4829
 Project: ZooKeeper
  Issue Type: Improvement
Reporter: Purshotam Shah


In the cloud, disk space can be limited. Currently, DatadirCleanup only supports purge intervals in hours; we should also support cleanup intervals in minutes.
 

2024-02-20 20:55:28,862 - WARN  
[QuorumPeer[myid=5](plain=disabled)(secure=[0:0:0:0:0:0:0:0]:50512):o.a.z.s.q.Follower@131]
 - Exception when following the leader
java.io.IOException: No space left on device
    at java.base/java.io.FileOutputStream.writeBytes(Native Method)
    at java.base/java.io.FileOutputStream.write(FileOutputStream.java:354)
    at 
org.apache.zookeeper.common.AtomicFileOutputStream.write(AtomicFileOutputStream.java:72)
    at java.base/sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:233)
    at 
java.base/sun.nio.cs.StreamEncoder.implFlushBuffer(StreamEncoder.java:312)
    at java.base/sun.nio.cs.StreamEncoder.implFlush(StreamEncoder.java:316)
    at java.base/sun.nio.cs.StreamEncoder.flush(StreamEncoder.java:153)
    at java.base/java.io.OutputStreamWriter.flush(OutputStreamWriter.java:251)
    at java.base/java.io.BufferedWriter.flush(BufferedWriter.java:257)
    at org.apache.zookeeper.common.AtomicFileWritingIdiom.<init>(AtomicFileWritingIdiom.java:72)
    at org.apache.zookeeper.common.AtomicFileWritingIdiom.<init>(AtomicFileWritingIdiom.java:54)
    at 
org.apache.zookeeper.server.quorum.QuorumPeer.writeLongToFile(QuorumPeer.java:2229)
    at 
org.apache.zookeeper.server.quorum.QuorumPeer.setAcceptedEpoch(QuorumPeer.java:2258)
    at 
org.apache.zookeeper.server.quorum.Learner.registerWithLeader(Learner.java:511)
    at 
org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:91)
    at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1551)
2024-02-20 20:55:28,863 - INFO  
[QuorumPeer[myid=5](plain=disabled)(secure=[0:0:0:0:0:0:0:0]:50512):o.a.z.s.q.Follower@145]
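
For context, today's autopurge settings in zoo.cfg are hour-granularity only; a sketch of the current options plus the requested knob (the minutes option name is hypothetical, not an existing setting):

{noformat}
# Existing options (hours only):
autopurge.snapRetainCount=3
autopurge.purgeInterval=1
# Hypothetical option this issue requests (illustrative name only):
# autopurge.purgeIntervalMinutes=30
{noformat}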



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ZOOKEEPER-4828) Minor 3.9 broke custom TLS setup with ssl.context.supplier.class

2024-04-22 Thread Jon Marius Venstad (Jira)
Jon Marius Venstad created ZOOKEEPER-4828:
-

 Summary: Minor 3.9 broke custom TLS setup with 
ssl.context.supplier.class
 Key: ZOOKEEPER-4828
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4828
 Project: ZooKeeper
  Issue Type: Bug
Reporter: Jon Marius Venstad


We run embedded ZooKeeper in Vespa, and use a custom TLS stack, where we, e.g., 
do additional validation and authorisation in our TLS trust manager, for client 
certificates.
The changes in https://github.com/apache/zookeeper/commit/4a794276d3d371071c31f86c14da824fdd2e53c0, done for ZOOKEEPER-4622, broke the `ssl.context.supplier.class` configuration parameter, documented in the ZK admin guide (https://zookeeper.apache.org/doc/current/zookeeperAdmin.html#sc_configuration).
Consequently, current code (3.9.2) enforces a file-based key and trust store for _any TLS_, which is not an option for us.

I looked at two ways to fix this:
1. Add new configuration parameters for _key_ and _trust_ store _suppliers_, as an alternative to the key and trust store _files_ required by the (new with 3.9.0) ClientX509Util code. This adds another pair of config options, of which there are already plenty, and the user is stuck with the default JDK `Provider` (the optional argument to SSLContext.getInstance(protocol, provider)); on the other hand, it lets users with a custom key and trust store use the native SSL support of Netty. Netty also provides the option to specify a JDK `Provider` in the SslContextBuilder, so that _could_ be made configurable as well.
2. Restore the option of specifying a custom SSL context, and prefer it over the Netty SslContextBuilder in the new ClientX509Util code when present. This lets users specify a JDK `Provider`, but file-based key and trust stores will still be required for the native SSL support added in 3.9.0.

I don't have a strong opinion on which option is better. I can also contribute 
a code change with either.
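
For context, a minimal sketch of the kind of custom context supplier that worked before 3.9 via `ssl.context.supplier.class`, assuming the class is instantiated as a java.util.function.Supplier<SSLContext> (the class name and trust-manager factory below are ours, not real ZooKeeper or Vespa code):

{code:java}
import java.util.function.Supplier;
import javax.net.ssl.SSLContext;
import javax.net.ssl.TrustManager;

public class CustomSslContextSupplier implements Supplier<SSLContext> {
    @Override
    public SSLContext get() {
        try {
            SSLContext ctx = SSLContext.getInstance("TLS");
            // myCustomTrustManagers() is a hypothetical factory returning trust
            // managers that do the additional validation and authorisation.
            TrustManager[] tms = myCustomTrustManagers();
            ctx.init(null, tms, null);
            return ctx;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    private TrustManager[] myCustomTrustManagers() {
        return new TrustManager[0]; // placeholder for the custom validation logic
    }
}
{code}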

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ZOOKEEPER-4827) Bump bouncycastle version from 1.75 to 1.78

2024-04-16 Thread ZhangJian He (Jira)
ZhangJian He created ZOOKEEPER-4827:
---

 Summary: Bump bouncycastle version from 1.75 to 1.78
 Key: ZOOKEEPER-4827
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4827
 Project: ZooKeeper
  Issue Type: Task
Reporter: ZhangJian He


Upgrade Bouncy Castle to 1.78 to address CVEs
https://bouncycastle.org/releasenotes.html#r1rv78

- https://www.cve.org/CVERecord?id=CVE-2024-29857 (reserved)
  - https://security.snyk.io/vuln/SNYK-JAVA-ORGBOUNCYCASTLE-6613079
- https://www.cve.org/CVERecord?id=CVE-2024-30171 (reserved)
  - https://security.snyk.io/vuln/SNYK-JAVA-ORGBOUNCYCASTLE-6613076
- https://www.cve.org/CVERecord?id=CVE-2024-30172 (reserved)
  - https://security.snyk.io/vuln/SNYK-JAVA-ORGBOUNCYCASTLE-6612984



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ZOOKEEPER-4826) Reduce unnecessary executable permissions on files

2024-04-16 Thread ZhangJian He (Jira)
ZhangJian He created ZOOKEEPER-4826:
---

 Summary: Reduce unnecessary executable permissions on files
 Key: ZOOKEEPER-4826
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4826
 Project: ZooKeeper
  Issue Type: Improvement
Reporter: ZhangJian He


***Summary:*** This patch modifies the permissions of various files within the ZooKeeper repository that currently have executable permissions set (755) but do not require them for their operation. Changing these permissions to 644 enhances security and keeps file permissions consistent throughout the project.

***Details:*** Several non-executable files (i.e. not scripts or executable binaries) are currently set with executable permissions. This is generally unnecessary and can lead to potential security concerns. This patch adjusts their permissions to a more appropriate setting (644), which is sufficient for read and write operations but does not allow execution.
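
One way to locate and fix such files (a sketch using GNU find; review the output first, since genuine scripts must keep their exec bit):

{noformat}
# List non-script source files that carry an exec bit:
find . -type f \( -name '*.java' -o -name '*.md' -o -name '*.xml' \) -perm /111 -print
# Drop the exec bit from them:
find . -type f \( -name '*.java' -o -name '*.md' -o -name '*.xml' \) -perm /111 -exec chmod 644 {} +
{noformat}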



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ZOOKEEPER-4825) CVE-2023-6378 is present in the current logback version (1.2.13) and hence we need to upgrade to 1.4.12

2024-04-11 Thread Bhavya hoda (Jira)
Bhavya hoda created ZOOKEEPER-4825:
--

 Summary: CVE-2023-6378 is present in the current logback version 
(1.2.13) and hence we need to upgrade to 1.4.12
 Key: ZOOKEEPER-4825
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4825
 Project: ZooKeeper
  Issue Type: Bug
Reporter: Bhavya hoda






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ZOOKEEPER-4824) fix CVE-2024-29025 in netty package

2024-04-10 Thread Nikita Pande (Jira)
Nikita Pande created ZOOKEEPER-4824:
---

 Summary: fix CVE-2024-29025 in netty package
 Key: ZOOKEEPER-4824
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4824
 Project: ZooKeeper
  Issue Type: Improvement
Reporter: Nikita Pande


[CVE-2024-29025|https://github.com/advisories/GHSA-5jpm-x58v-624v] is the CVE for all netty-codec-http < 4.1.108.Final



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ZOOKEEPER-4823) Proposal: Update the wiki of Zab 1.0 (Phase 2) to make it more precise and conform to the implementation

2024-04-03 Thread Sirius (Jira)
Sirius created ZOOKEEPER-4823:
-

 Summary: Proposal: Update the wiki of Zab 1.0 (Phase 2) to make it 
more precise and conform to the implementation
 Key: ZOOKEEPER-4823
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4823
 Project: ZooKeeper
  Issue Type: Improvement
Reporter: Sirius


As ZooKeeper has evolved over the years, its code implementation has deviated from the design of [Zab 1.0|https://cwiki.apache.org/confluence/display/ZOOKEEPER/Zab1.0] in several aspects.

One critical deviation lies in the _atomic actions_ taken when a follower receives NEWLEADER (see 2.*f* in Phase 2).

The protocol requires that the follower "_*atomically*_ applies the new state and sets *f*.currentEpoch = _e_". However, this atomicity is not guaranteed by the current code implementation. Asynchronous logging and committing by multiple threads, combined with node crashes, can interrupt this process and lead to possible data loss (see -ZOOKEEPER-3911-, ZOOKEEPER-4643, ZOOKEEPER-4646, -ZOOKEEPER-4785-).

On the other hand, implementing atomicity is expensive and hurts performance. It is reasonable to adopt an implementation that does not require atomic updates in this step, and we highly recommend updating the design of Zab to drop the atomicity requirement in Step 2.*f* so that it better guides the code implementation.
h3. Update Step 2.*f* by removing the requirement of atomicity

Here is a possible design of Step 2.*f* in Phase 2 with the atomicity requirement removed.
h4. Phase 2: Sync with followers
 # *l* ...

 # *f* The follower syncs with the leader, but doesn't modify its state until it receives the NEWLEADER(_e_) packet. Once it receives NEWLEADER(_e_), -_it atomically applies the new state, and then sets f.currentEpoch = e. It then sends ACK(e << 32)._-

it executes the following actions sequentially (see the code sketch after this list):

*2.1. applies the new state;*

*2.2. sets f.currentEpoch = e;*

*2.3. sends ACK(e << 32).*

 # *l* ...
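
Below is a minimal Java-style sketch of the sequential handling described in Step 2.*f* (method names are illustrative, not actual ZooKeeper identifiers):

{code:java}
// Illustrative sketch only: the three sequential (non-atomic) actions of Step 2.f.
void onNewLeader(long e) throws IOException {
    applyAndFsyncNewState(); // 2.1: persist and sync the synced data first
    setCurrentEpoch(e);      // 2.2: only then update f.currentEpoch
    sendAck(e << 32);        // 2.3: finally reply with the NEWLEADER ACK
}
{code}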

 

Note: 
 * To ensure correctness without requiring atomicity, the follower must persist and sync the data before it updates its currentEpoch and replies with the NEWLEADER ACK (see the analysis in ZOOKEEPER-4643 & ZOOKEEPER-4785).

 * This new design conforms to the code implementation in the current latest version (ZooKeeper v3.9.2), which has fixed the known data loss issues that long stayed unresolved due to non-atomic execution in Step 2.*f*, including -ZOOKEEPER-3911-, ZOOKEEPER-4643, ZOOKEEPER-4646 & -ZOOKEEPER-4785- (see the code fixes in [PR-2111|https://github.com/apache/zookeeper/pull/2111] & [PR-2152|https://github.com/apache/zookeeper/pull/2152]).

 * The correctness of this new design has been verified with the TLA+ 
specifications of Zab at different abstraction levels, including

 ** [High-level protocol 
specification|https://github.com/AlphaCanisMajoris/zookeeper-tla-spec/blob/main/Zab_new.tla]
 (developed based on the original [protocol 
spec|https://github.com/apache/zookeeper/blob/master/zookeeper-specifications/protocol-spec/Zab.tla])
 

 ** [Multi-threading-level specification|https://github.com/AlphaCanisMajoris/zookeeper-tla-spec/blob/main/zk_pr_2152.tla] (developed based on the original [system spec|https://github.com/apache/zookeeper/blob/master/zookeeper-specifications/system-spec/zk-3.7/ZkV3_7_0.tla]. This spec corresponds to [PR-2152|https://github.com/apache/zookeeper/pull/2152], an effort to fix more known issues in Phase 2.)

In the verification, the TLC model checker checks whether the new design satisfies the properties given in the Zab paper. No violation was found during checking with various configurations.

 

We sincerely hope that the above update to the protocol design can be presented on the wiki page, so that it better guides future code implementation!

 

About us: 

We are a research team using TLA+ to verify the correctness of distributed 
systems. 

Looking forward to receiving feedback from the ZooKeeper community!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ZOOKEEPER-4822) Quorum TLS - Enable member authorization based on certificate CN

2024-03-29 Thread Damien Diederen (Jira)
Damien Diederen created ZOOKEEPER-4822:
--

 Summary: Quorum TLS - Enable member authorization based on 
certificate CN
 Key: ZOOKEEPER-4822
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4822
 Project: ZooKeeper
  Issue Type: New Feature
  Components: server
Reporter: Damien Diederen
Assignee: Damien Diederen


Quorum TLS enables mutual authentication of quorum members.

Member authorization, however, cannot be configured on the basis of the 
presented principal CN; a round of SASL authentication has to be performed on 
top of the secured connection.

This ticket is about enabling authorization based on trusted client 
certificates.




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ZOOKEEPER-4821) ConnectRequest got NOTREADONLY ReplyHeader

2024-03-28 Thread Kezhu Wang (Jira)
Kezhu Wang created ZOOKEEPER-4821:
-

 Summary: ConnectRequest got NOTREADONLY ReplyHeader
 Key: ZOOKEEPER-4821
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4821
 Project: ZooKeeper
  Issue Type: Bug
  Components: java client, server
Affects Versions: 3.9.2, 3.8.4
Reporter: Kezhu Wang


I would expect {{ConnectRequest}} to have two kinds of response under normal conditions: a {{ConnectResponse}} or a socket close. But if the server was configured with {{readonlymode.enabled}} but not {{localSessionsEnabled}}, then the client could get {{NOTREADONLY}} in reply to a {{ConnectRequest}}. I saw no handling for this in the Java client, at least, and I encountered it while writing tests for a Rust client.

I guess this is not by design, and we could probably close the socket at an early phase. But it could also be solved on the client side, since {{sizeof(ConnectResponse)}} is larger than {{sizeof(ReplyHeader)}}; that way we gain the ability to carry an error for {{ConnectRequest}}, which {{ConnectResponse}} cannot.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ZOOKEEPER-4820) zookeeper pom leaks logback dependency

2024-03-27 Thread PJ Fanning (Jira)
PJ Fanning created ZOOKEEPER-4820:
-

 Summary: zookeeper pom leaks logback dependency
 Key: ZOOKEEPER-4820
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4820
 Project: ZooKeeper
  Issue Type: Task
  Components: java client
Reporter: PJ Fanning


Since v3.8.0

https://mvnrepository.com/artifact/org.apache.zookeeper/zookeeper/3.8.0

It's fine that ZooKeeper uses Logback on the server side - but users who want to access ZooKeeper from client-side code also add this zookeeper jar to their classpaths. When zookeeper is used as a client-side lib, it should ideally not expose a logback dependency - just an slf4j-api dependency.

Would it be possible to rework the zookeeper pom so that client-side users don't have to explicitly exclude the logback jars? Many users will have their own preferred logging framework.

Is there another zookeeper client-side jar that could be used instead of zookeeper.jar?
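
For reference, the workaround client-side users need today is an exclusion along these lines in their own pom:

{noformat}
<dependency>
  <groupId>org.apache.zookeeper</groupId>
  <artifactId>zookeeper</artifactId>
  <version>3.8.0</version>
  <exclusions>
    <exclusion>
      <groupId>ch.qos.logback</groupId>
      <artifactId>logback-classic</artifactId>
    </exclusion>
    <exclusion>
      <groupId>ch.qos.logback</groupId>
      <artifactId>logback-core</artifactId>
    </exclusion>
  </exclusions>
</dependency>
{noformat}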



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ZOOKEEPER-4819) Can't seek for writable tls server if connected to readonly server

2024-03-26 Thread Kezhu Wang (Jira)
Kezhu Wang created ZOOKEEPER-4819:
-

 Summary: Can't seek for writable tls server if connected to 
readonly server
 Key: ZOOKEEPER-4819
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4819
 Project: ZooKeeper
  Issue Type: Bug
  Components: java client
Affects Versions: 3.9.2, 3.8.4
Reporter: Kezhu Wang


{{[ClientCnxn::pingRwServer|https://github.com/apache/zookeeper/blob/d12aba599233b0fcba0b9b945ed3d2f45d4016f0/zookeeper-server/src/main/java/org/apache/zookeeper/ClientCnxn.java#L1280]}}
 uses a raw socket to issue the "isro" 4lw command. This results in an unsuccessful handshake when the target is a TLS server.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ZOOKEEPER-4818) Export JVM heap metrics in ServerMetrics

2024-03-25 Thread Andrew Kyle Purtell (Jira)
Andrew Kyle Purtell created ZOOKEEPER-4818:
--

 Summary: Export JVM heap metrics in ServerMetrics
 Key: ZOOKEEPER-4818
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4818
 Project: ZooKeeper
  Issue Type: Improvement
  Components: metric system
Reporter: Andrew Kyle Purtell


A metric for JVM heap occupancy is not included in ServerMetrics.

According to [https://zookeeper.apache.org/doc/current/zookeeperMonitor.html], the recommended practice is to enable the PrometheusMetricsProvider, and the Prometheus base class upon which that provider is built does export that information. The example provided there for alerting on heap utilization is:
{noformat}
  - alert: JvmMemoryFillingUp
expr: jvm_memory_bytes_used / jvm_memory_bytes_max{area="heap"} > 0.8
for: 5m
labels:
  severity: warning
annotations:
  summary: "JVM memory filling up (instance {{ $labels.instance }})"
  description: "JVM memory is filling up (> 80%)\n labels: {{ $labels }}  
value = {{ $value }}\n"
{noformat}
where {{jvm_memory_bytes_used}} and {{jvm_memory_bytes_max}} are provided by a 
Prometheus base class.

Where PrometheusMetricsProvider is the right choice, that's good enough; but where the ServerMetrics information is consumed in an alternate way, by 4-letter-word scraping or by JMX, ServerMetrics should provide the same information. {{jvm_memory_bytes_used}} and {{jvm_memory_bytes_max}} (presuming heap) are reasonable names. An alternative could be to calculate the heap occupancy and provide it as a percentage, either an integer in the range 0 - 100 or a floating point value in the range 0.0 - 1.0.
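
A minimal sketch of how such a gauge could be computed with the standard JMX MemoryMXBean (the wiring into ServerMetrics is hypothetical):

{code:java}
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

public class HeapGauge {
    // Illustrative only: values for hypothetical jvm_memory_bytes_used /
    // jvm_memory_bytes_max gauges, or a 0.0 - 1.0 occupancy ratio.
    public static double heapOccupancy() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        long used = heap.getUsed();
        long max = heap.getMax(); // -1 if undefined
        return max > 0 ? (double) used / max : 0.0;
    }
}
{code}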



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ZOOKEEPER-4817) CancelledKeyException does not work in some cases.

2024-03-22 Thread gendong1 (Jira)
gendong1 created ZOOKEEPER-4817:
---

 Summary: CancelledKeyException does not work in some cases.
 Key: ZOOKEEPER-4817
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4817
 Project: ZooKeeper
  Issue Type: Bug
Affects Versions: 3.10.0
Reporter: gendong1


If the client connection to the ZooKeeper server is disconnected, a CancelledKeyException will arise.

Here is a strange scenario.

NIOServerCnxn.doIO is blocked at line 333 by the fail-slow NIC.

If the delay lasts more than 30s, the CancelledKeyException disappears.

If the delay lasts for 25s, the CancelledKeyException arises.

When doIO encounters the slowdown caused by the fail-slow NIC, the context is the same in both cases.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ZOOKEEPER-4816) A follower can not join the cluster for 20 seconds

2024-03-14 Thread gendong1 (Jira)
gendong1 created ZOOKEEPER-4816:
---

 Summary: A follower can not join the cluster for 20 seconds
 Key: ZOOKEEPER-4816
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4816
 Project: ZooKeeper
  Issue Type: Bug
Affects Versions: 3.10.0
Reporter: gendong1
 Attachments: node1.log, node2.log, node3.log

We encountered a strange scenario. When we set up a ZooKeeper cluster (3 nodes in total), the third node got stuck serializing the snapshot to the local disk. However, the leader election executed normally, and after the election the third node was elected as the leader. The other two nodes failed to connect to the leader, so the first and second nodes restarted the leader election, and finally the second node was elected as the leader. At this time, the third node still acted as a leader, so there were two leaders in the cluster. The first node could not join the cluster for 20 seconds, and during this period the client could not connect to any node of the cluster.

Runtime logs are attached.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ZOOKEEPER-4815) custom the data format of /zookeeper/config

2024-03-12 Thread yangoofy (Jira)
yangoofy created ZOOKEEPER-4815:
---

 Summary: custom the data format of /zookeeper/config
 Key: ZOOKEEPER-4815
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4815
 Project: ZooKeeper
  Issue Type: Improvement
Reporter: yangoofy


When using QuorumMaj, I hope to support custom /zookeeper/config node data formats, such as
server.x=xx.xx.xx.xx:2888:3888:observer;0.0.0.0:2181;Group1
server.y=xx.xx.xx.xx:2888:3888:observer;0.0.0.0:2181;Group2



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ZOOKEEPER-4814) Protocol desynchronization after Connect for (some) old clients

2024-03-07 Thread Damien Diederen (Jira)
Damien Diederen created ZOOKEEPER-4814:
--

 Summary: Protocol desynchronization after Connect for (some) old 
clients
 Key: ZOOKEEPER-4814
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4814
 Project: ZooKeeper
  Issue Type: Bug
  Components: server
Affects Versions: 3.9.0
Reporter: Damien Diederen
Assignee: Damien Diederen


Some old clients experience a protocol desynchronization after receiving a {{ConnectResponse}} from the server.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ZOOKEEPER-4813) Make zookeeper start successfully when the last log file is dirty during the restore process

2024-03-06 Thread Yan Zhao (Jira)
Yan Zhao created ZOOKEEPER-4813:
---

 Summary: Make zookeeper start successfully when the last log file is dirty during the restore process
 Key: ZOOKEEPER-4813
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4813
 Project: ZooKeeper
  Issue Type: Improvement
  Components: server
Affects Versions: 3.9.1
Reporter: Yan Zhao
Assignee: Yan Zhao
 Fix For: 3.9.2


When ZooKeeper restarts, it will restore the data from the last valid snapshot file and replay the txn log to append data.
But if the last log file is empty for some reason, the restore will fail and ZooKeeper cannot restart.



{noformat}
14:12:16.023 [main] INFO  org.apache.zookeeper.server.persistence.SnapStream - 
Invalid snapshot snapshot.188700025d87. len = 761554294, byte = 45
14:12:16.024 [main] INFO  org.apache.zookeeper.server.persistence.FileSnap - 
Reading snapshot /pulsar/data/zookeeper/version-2/snapshot.188700025a05
14:12:17.350 [main] INFO  org.apache.zookeeper.server.DataTree - The digest in 
the snapshot has digest version of 2, with zxid as 0x188700025b07, and digest 
value as 510776662607117
14:12:17.492 [main] ERROR org.apache.zookeeper.server.quorum.QuorumPeer - 
Unable to load database on disk
java.io.EOFException: null
at java.io.DataInputStream.readInt(DataInputStream.java:386) ~[?:?]
at 
org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:96) 
~[org.apache.zookeeper-zookeeper-jute-3.9.1.jar:3.9.1]
at 
org.apache.zookeeper.server.persistence.FileHeader.deserialize(FileHeader.java:67)
 ~[org.apache.zookeeper-zookeeper-jute-3.9.1.jar:3.9.1]
at 
org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.inStreamCreated(FileTxnLog.java:725)
 ~[org.apache.zookeeper-zookeeper-3.9.1.jar:3.9.1]
at 
org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.createInputArchive(FileTxnLog.java:743)
 ~[org.apache.zookeeper-zookeeper-3.9.1.jar:3.9.1]
at 
org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.goToNextLog(FileTxnLog.java:711)
 ~[org.apache.zookeeper-zookeeper-3.9.1.jar:3.9.1]
at 
org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.next(FileTxnLog.java:792)
 ~[org.apache.zookeeper-zookeeper-3.9.1.jar:3.9.1]
at 
org.apache.zookeeper.server.persistence.FileTxnSnapLog.fastForwardFromEdits(FileTxnSnapLog.java:361)
 ~[org.apache.zookeeper-zookeeper-3.9.1.jar:3.9.1]
at 
org.apache.zookeeper.server.persistence.FileTxnSnapLog.lambda$restore$0(FileTxnSnapLog.java:267)
 ~[org.apache.zookeeper-zookeeper-3.9.1.jar:3.9.1]
at 
org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java:312)
 ~[org.apache.zookeeper-zookeeper-3.9.1.jar:3.9.1]
at 
org.apache.zookeeper.server.ZKDatabase.loadDataBase(ZKDatabase.java:288) 
~[org.apache.zookeeper-zookeeper-3.9.1.jar:3.9.1]
at 
org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:1149)
 ~[org.apache.zookeeper-zookeeper-3.9.1.jar:3.9.1]
at 
org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:1135) 
~[org.apache.zookeeper-zookeeper-3.9.1.jar:3.9.1]
at 
org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:229)
 ~[org.apache.zookeeper-zookeeper-3.9.1.jar:3.9.1]
at 
org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:137)
 ~[org.apache.zookeeper-zookeeper-3.9.1.jar:3.9.1]
at 
org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:91) 
~[org.apache.zookeeper-zookeeper-3.9.1.jar:3.9.1]
14:12:17.502 [main] INFO  
org.apache.zookeeper.metrics.prometheus.PrometheusMetricsProvider - Shutdown 
executor service with timeout 1000
14:12:17.508 [main] INFO  org.eclipse.jetty.server.AbstractConnector - Stopped 
ServerConnector@2484f433{HTTP/1.1, (http/1.1)}{0.0.0.0:8000}
14:12:17.510 [main] INFO  org.eclipse.jetty.server.handler.ContextHandler - 
Stopped o.e.j.s.ServletContextHandler@59a67c3a{/,null,STOPPED}
14:12:17.515 [main] ERROR org.apache.zookeeper.server.quorum.QuorumPeerMain - 
Unexpected exception, exiting abnormally
java.lang.RuntimeException: Unable to run quorum server 
at 
org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:1204)
 ~[org.apache.zookeeper-zookeeper-3.9.1.jar:3.9.1]
at 
org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:1135) 
~[org.apache.zookeeper-zookeeper-3.9.1.jar:3.9.1]
at 
org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:229)
 ~[org.apache.zookeeper-zookeeper-3.9.1.jar:3.9.1]
at 
org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:137)
 ~[org.apache.zookeeper-zookeeper-3.9.1.jar:3.9.1

[jira] [Created] (ZOOKEEPER-4812) Another reconfiguration is in progress -- concurrent reconfigs not supported (yet)

2024-02-27 Thread sunfeifei (Jira)
sunfeifei created ZOOKEEPER-4812:


 Summary: Another reconfiguration is in progress -- concurrent 
reconfigs not supported (yet)
 Key: ZOOKEEPER-4812
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4812
 Project: ZooKeeper
  Issue Type: Bug
 Environment: ZooKeeper version is 3.7.0

# echo mntr |nc 127.0.0.1 2181|grep version
zk_version  3.7.0-e3704b390a6697bfdf4b0bef79e3da7a4f6bac4b, built on 2021-03-17 09:46 UTC
Reporter: sunfeifei
 Attachments: image-2024-02-28-11-49-37-465.png, 
image-2024-02-28-11-49-52-155.png

When using the reconfig command to add or remove members, the following message is reported, and we are never able to add or remove nodes in the cluster:
Another reconfiguration is in progress -- concurrent reconfigs not supported (yet)







--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ZOOKEEPER-4811) When configuring the IP address of the zk server on the client side to connect to zk, the connection establishment time is high

2024-02-25 Thread yangoofy (Jira)
yangoofy created ZOOKEEPER-4811:
---

 Summary: When configuring the IP address of the zk server on the 
client side to connect to zk, the connection establishment time is high
 Key: ZOOKEEPER-4811
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4811
 Project: ZooKeeper
  Issue Type: Bug
Affects Versions: 3.7.1, 3.7.0
Reporter: yangoofy


When configuring the IP address of the zk server on the client side to connect to zk, the connection establishment time is high. This is mainly because obtaining the hostname of the address takes approximately 5 seconds. 3.4.6 had a mechanism to safely avoid the reverse DNS lookup, but 3.7 doesn't do that.

1. What's the reason?

2. Can we modify the method StaticHostProvider.resolve() to avoid the reverse DNS lookup?
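
For context, the cost comes from reverse DNS being triggered when a hostname is requested for an address built from a literal IP; a sketch of avoiding it (illustrative, not the actual StaticHostProvider code):

{code:java}
import java.net.InetAddress;
import java.net.InetSocketAddress;

public class NoReverseDns {
    public static void main(String[] args) throws Exception {
        // Building from a literal IP does no lookup by itself.
        InetSocketAddress addr =
            new InetSocketAddress(InetAddress.getByName("10.0.0.1"), 2181);
        // getHostString() returns the literal without a reverse DNS lookup...
        System.out.println(addr.getHostString());
        // ...whereas addr.getHostName() may block for seconds doing the lookup.
    }
}
{code}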



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ZOOKEEPER-4810) Fix data race in format_endpoint_info()

2024-02-19 Thread fanyang (Jira)
fanyang created ZOOKEEPER-4810:
--

 Summary: Fix data race in format_endpoint_info()
 Key: ZOOKEEPER-4810
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4810
 Project: ZooKeeper
  Issue Type: Bug
  Components: c client
Reporter: fanyang






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ZOOKEEPER-4809) Fix do_completion() use-after-free when log level is debug

2024-02-19 Thread fanyang (Jira)
fanyang created ZOOKEEPER-4809:
--

 Summary: Fix do_completion() use-after-free when log level is debug
 Key: ZOOKEEPER-4809
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4809
 Project: ZooKeeper
  Issue Type: Bug
  Components: c client
Reporter: fanyang






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ZOOKEEPER-4808) Fix the log statement in FastLeaderElection

2024-02-12 Thread Li Wang (Jira)
Li Wang created ZOOKEEPER-4808:
--

 Summary: Fix the log statement in FastLeaderElection
 Key: ZOOKEEPER-4808
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4808
 Project: ZooKeeper
  Issue Type: Bug
  Components: server
Reporter: Li Wang


The proposedZxid and proposedEpoch are out of order in the following debug statement.

{code:java}
LOG.debug(
    "Sending Notification: {} (n.leader), 0x{} (n.peerEpoch), 0x{} (n.zxid), 0x{} (n.round), {} (recipient),"
        + " {} (myid) ",
    proposedLeader,
    Long.toHexString(proposedZxid),
    Long.toHexString(proposedEpoch),
    Long.toHexString(logicalclock.get()),
    sid,
    self.getMyId());
{code}









--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ZOOKEEPER-4807) Add sid for the leader goodbye log

2024-02-09 Thread Yan Zhao (Jira)
Yan Zhao created ZOOKEEPER-4807:
---

 Summary: Add sid for the leader goodbye log
 Key: ZOOKEEPER-4807
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4807
 Project: ZooKeeper
  Issue Type: Wish
  Components: server
Affects Versions: 3.9.1
Reporter: Yan Zhao
 Fix For: 3.9.2


When a follower disconnects from the leader, the leader will print the remote address.

But if ZooKeeper is deployed alongside Istio, the remote address is not right.

2024-02-05T03:23:54,967+ [LearnerHandler-/127.0.0.6:56085] WARN  org.apache.zookeeper.server.quorum.LearnerHandler - *** GOODBYE /127.0.0.6:56085 

It would be better to print the sid in the goodbye log.





--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ZOOKEEPER-4806) Commits have to be refreshed after merging

2024-02-09 Thread Andor Molnar (Jira)
Andor Molnar created ZOOKEEPER-4806:
---

 Summary: Commits have to be refreshed after merging
 Key: ZOOKEEPER-4806
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4806
 Project: ZooKeeper
  Issue Type: Sub-task
Reporter: Andor Molnar
Assignee: Szucs Villo


The following error occurs if somebody wants to cherry-pick immediately after 
merge:
{noformat}
All checks have passed on the github.
Pull request #2115 merged. Sha: #18c78cd10bc02d764a46ac1659b263cf69f2671d

Would you like to pick 18c78cd10bc02d764a46ac1659b263cf69f2671d into another 
branch? (y/n): y
Enter a branch name [branch-3.9]:
git fetch apache
From https://gitbox.apache.org/repos/asf/zookeeper
   72e3d9ce9..e571dd814  master -> apache/master
git checkout -b PR_TOOL_PICK_PR_2115_BRANCH-3.9 apache/branch-3.9
Switched to a new branch 'PR_TOOL_PICK_PR_2115_BRANCH-3.9'
git cherry-pick -sx 18c78cd10bc02d764a46ac1659b263cf69f2671d
fatal: bad object 18c78cd10bc02d764a46ac1659b263cf69f2671d

Error cherry-picking: Command '['git', 'cherry-pick', '-sx', 
'18c78cd10bc02d764a46ac1659b263cf69f2671d']' returned non-zero exit status 
128.{noformat}
The reason for this is that the local git repo doesn't know about the new commit yet.

We should do a {{git fetch}} after a successful merge via GitHub.
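
i.e. the sequence the merge script should run before offering the cherry-pick (sketch, matching the log above):

{noformat}
git fetch apache   # pick up the just-merged commit first
git checkout -b PR_TOOL_PICK_PR_2115_BRANCH-3.9 apache/branch-3.9
git cherry-pick -sx 18c78cd10bc02d764a46ac1659b263cf69f2671d
{noformat}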



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ZOOKEEPER-4805) Update cwiki page with latest changes

2024-02-09 Thread Andor Molnar (Jira)
Andor Molnar created ZOOKEEPER-4805:
---

 Summary: Update cwiki page with latest changes
 Key: ZOOKEEPER-4805
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4805
 Project: ZooKeeper
  Issue Type: Sub-task
  Components: documentation
Reporter: Andor Molnar
Assignee: Szucs Villo


Update the following wiki page with latest changes and instructions how to use 
the script:

[https://cwiki.apache.org/confluence/display/ZOOKEEPER/Merging+Github+Pull+Requests]

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ZOOKEEPER-4804) Use daemon threads for Netty client

2024-02-07 Thread Istvan Toth (Jira)
Istvan Toth created ZOOKEEPER-4804:
--

 Summary: Use daemon threads for Netty client
 Key: ZOOKEEPER-4804
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4804
 Project: ZooKeeper
  Issue Type: Bug
Affects Versions: 3.8.3
Reporter: Istvan Toth


When the Netty client is used, the Java process hangs on System.exit if there 
is an open Zookeeper connection.

This is caused by the non-daemon threads created by Netty.

Exiting without closing the connection is not a good practice, but this hang 
does not happen with the NIO client, and I think ZK should behave the same 
regardless of the client implementation used.

The Netty ThreadFactory implementation is configurable, so it shouldn't be too hard to make sure that daemon threads are created.
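
A sketch of what that could look like with Netty's stock DefaultThreadFactory (illustrative; not the current ClientCnxnSocketNetty code):

{code:java}
import io.netty.channel.nio.NioEventLoopGroup;
import io.netty.util.concurrent.DefaultThreadFactory;

public class DaemonNettyGroup {
    public static void main(String[] args) {
        // The second constructor argument marks the pool's threads as daemons,
        // so they no longer block System.exit.
        NioEventLoopGroup group =
            new NioEventLoopGroup(0, new DefaultThreadFactory("zk-netty-client", true));
        group.shutdownGracefully();
    }
}
{code}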




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ZOOKEEPER-4803) Flaky test: QuorumPeerMainTest.testLeaderOutOfView

2024-02-07 Thread Ling Mao (Jira)
Ling Mao created ZOOKEEPER-4803:
---

 Summary: Flaky test: QuorumPeerMainTest.testLeaderOutOfView
 Key: ZOOKEEPER-4803
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4803
 Project: ZooKeeper
  Issue Type: Bug
  Components: tests
Affects Versions: 3.10
Reporter: Ling Mao


{code:java}
[2024-02-08T02:01:19.039Z] [INFO] 
[2024-02-08T02:01:19.039Z] [INFO] Results:
[2024-02-08T02:01:19.039Z] [INFO] 
[2024-02-08T02:01:19.039Z] [ERROR] Failures: 
[2024-02-08T02:01:19.039Z] [ERROR]   QuorumPeerMainTest.testLeaderOutOfView:881 expected: <true> but was: <false>
[2024-02-08T02:01:19.039Z] [INFO] 
[2024-02-08T02:01:19.039Z] [ERROR] Tests run: 3116, Failures: 1, Errors: 0, 
Skipped: 4
[2024-02-08T02:01:19.039Z] [INFO] 
[2024-02-08T02:01:19.039Z] [INFO] 

[2024-02-08T02:01:19.039Z] [INFO] Reactor Summary for Apache ZooKeeper 
3.10.0-SNAPSHOT: {code}
Link: 
https://ci-hadoop.apache.org/blue/organizations/jenkins/zookeeper-precommit-github-pr/detail/PR-2043/4/pipeline



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ZOOKEEPER-4802) Flaky test:RestoreQuorumTest.testRestoreAfterQuorumLost

2024-02-07 Thread Ling Mao (Jira)
Ling Mao created ZOOKEEPER-4802:
---

 Summary: Flaky test:RestoreQuorumTest.testRestoreAfterQuorumLost
 Key: ZOOKEEPER-4802
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4802
 Project: ZooKeeper
  Issue Type: Bug
  Components: tests
Affects Versions: 3.10
Reporter: Ling Mao


{code:java}
[INFO] Running org.apache.zookeeper.server.admin.RestoreQuorumTest
[ERROR] Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 3.489 s <<< FAILURE! - in org.apache.zookeeper.server.admin.RestoreQuorumTest
[ERROR] testRestoreAfterQuorumLost  Time elapsed: 3.344 s  <<< ERROR!
java.net.ConnectException: Connection refused (Connection refused)
	at java.base/java.net.PlainSocketImpl.socketConnect(Native Method)
	at java.base/java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:412)
	at java.base/java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:255)
	at java.base/java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:237)
	at java.base/java.net.Socket.connect(Socket.java:609)
	at java.base/java.net.Socket.connect(Socket.java:558)
	at java.base/java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:527)
	at org.apache.zookeeper.server.admin.SnapshotAndRestoreCommandTest.takeSnapshotAndValidate(SnapshotAndRestoreCommandTest.java:413)
	at org.apache.zookeeper.server.admin.RestoreQuorumTest.testRestoreAfterQuorumLost(RestoreQuorumTest.java:56)

[INFO] Running org.apache.zookeeper.server.admin.CommandResponseTest {code}
Link: 
https://github.com/apache/zookeeper/actions/runs/7812662872/job/21310154983



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ZOOKEEPER-4801) Add memory size limitation policy for ZkDataBase#committedLog

2024-02-07 Thread Yan Zhao (Jira)
Yan Zhao created ZOOKEEPER-4801:
---

 Summary: Add memory size limitation policy for 
ZkDataBase#committedLog
 Key: ZOOKEEPER-4801
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4801
 Project: ZooKeeper
  Issue Type: Improvement
  Components: server
Affects Versions: 3.9.1
Reporter: Yan Zhao
 Fix For: 3.9.2


ZKDatabase supports a commit-log count to limit memory, which is not precise: some request payloads may be huge and cost lots of heap memory. Supporting a payload-size limitation would be better.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ZOOKEEPER-4800) Flaky test:ReconfigRollingRestartCompatibilityTest.testRollingRestartWithExtendedMembershipConfig

2024-02-05 Thread Ling Mao (Jira)
Ling Mao created ZOOKEEPER-4800:
---

 Summary: Flaky 
test:ReconfigRollingRestartCompatibilityTest.testRollingRestartWithExtendedMembershipConfig
 Key: ZOOKEEPER-4800
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4800
 Project: ZooKeeper
  Issue Type: Bug
  Components: tests
Affects Versions: 3.10
Reporter: Ling Mao


Link: 
https://github.com/apache/zookeeper/actions/runs/7781886238/job/21217202024?pr=1932
{code:java}
[ERROR] Failures: 
[ERROR]   ReconfigRollingRestartCompatibilityTest.testRollingRestartWithExtendedMembershipConfig:263 waiting for server 2 being up ==> expected: <true> but was: <false>
[INFO] 
[ERROR] Tests run: 3114, Failures: 1, Errors: 0, Skipped: 4
[INFO] 
[INFO] 
 {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ZOOKEEPER-4799) Refactor ACL check in addWatch command

2024-02-01 Thread Damien Diederen (Jira)
Damien Diederen created ZOOKEEPER-4799:
--

 Summary: Refactor ACL check in addWatch command
 Key: ZOOKEEPER-4799
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4799
 Project: ZooKeeper
  Issue Type: Improvement
Reporter: Damien Diederen
Assignee: Damien Diederen






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ZOOKEEPER-4798) Secure prometheus support

2024-01-29 Thread Purshotam Shah (Jira)
Purshotam Shah created ZOOKEEPER-4798:
-

 Summary: Secure prometheus support
 Key: ZOOKEEPER-4798
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4798
 Project: ZooKeeper
  Issue Type: Improvement
Reporter: Purshotam Shah






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ZOOKEEPER-4797) Allow for -XX:MaxRamPercentage JVM setting

2024-01-29 Thread Frederiko Costa (Jira)
Frederiko Costa created ZOOKEEPER-4797:
--

 Summary: Allow for -XX:MaxRamPercentage JVM setting
 Key: ZOOKEEPER-4797
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4797
 Project: ZooKeeper
  Issue Type: Improvement
  Components: scripts
Reporter: Frederiko Costa


When running ZK in a containerized environment, it's sometimes desirable to express your heap size as a percentage of the memory allocated to the container.

As it stands, zkEnv.sh forces you to have -Xmx, which defaults to 1GB. Some environments want to set it higher, usually relative to the amount of RAM.

This is a request to implement the option of using the -XX:MaxRAMPercentage option when starting ZooKeeper.

The suggested implementation is to make a variable ZK_SERVER_MAXRAMPERCENTAGE available to be added to SERVER_JVMFLAGS. If the variable is set, ZK_SERVER_HEAP is ignored; if ZK_SERVER_MAXRAMPERCENTAGE is not set, ZK_SERVER_HEAP is used as usual.
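
A sketch of the suggested zkEnv.sh logic (the variable name is the one proposed above; the -Xmx fallback mirrors the current 1000m default):

{noformat}
# Hypothetical zkEnv.sh fragment:
if [ -n "$ZK_SERVER_MAXRAMPERCENTAGE" ]; then
  SERVER_JVMFLAGS="-XX:MaxRAMPercentage=$ZK_SERVER_MAXRAMPERCENTAGE $SERVER_JVMFLAGS"
else
  SERVER_JVMFLAGS="-Xmx${ZK_SERVER_HEAP:-1000}m $SERVER_JVMFLAGS"
fi
{noformat}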



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ZOOKEEPER-4796) Requests submitted first may carry a larger xid resulting in ZRUNTIMEINCONSISTENCY

2024-01-29 Thread fanyang (Jira)
fanyang created ZOOKEEPER-4796:
--

 Summary: Requests submitted first may carry a larger xid resulting 
in ZRUNTIMEINCONSISTENCY
 Key: ZOOKEEPER-4796
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4796
 Project: ZooKeeper
  Issue Type: Bug
  Components: c client
Reporter: fanyang


When multiple threads attempt to submit requests, it's possible for a request 
from a thread that acquired its xid earlier to be inserted after a request from 
a thread that acquired its xid later in the submission queue, which causes a 
ZRUNTIMEINCONSISTENCY error.

To fix it, acquire the lock before get_xid() and release it after request submission.
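
The invariant, sketched in Java for clarity (the actual fix is in the C client around get_xid() and the send queue; names here are illustrative):

{code:java}
import java.util.ArrayDeque;
import java.util.Queue;

class XidOrderingSketch {
    private final Object lock = new Object();
    private int nextXid = 1;
    private final Queue<Integer> sendQueue = new ArrayDeque<>();

    // Assign the xid and enqueue under one lock, so queue order
    // always matches xid order.
    void submit() {
        synchronized (lock) {
            int xid = nextXid++;
            sendQueue.add(xid);
        }
    }
}
{code}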



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ZOOKEEPER-4795) add namespace support for prometheus metrics

2024-01-25 Thread Purshotam Shah (Jira)
Purshotam Shah created ZOOKEEPER-4795:
-

 Summary: add namespace support for prometheus metrics
 Key: ZOOKEEPER-4795
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4795
 Project: ZooKeeper
  Issue Type: Improvement
  Components: metric system
Reporter: Purshotam Shah






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ZOOKEEPER-4794) Reduce the ZKDatabase#committedLog memory usage

2024-01-24 Thread Yan Zhao (Jira)
Yan Zhao created ZOOKEEPER-4794:
---

 Summary: Reduce the ZKDatabase#committedLog memory usage
 Key: ZOOKEEPER-4794
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4794
 Project: ZooKeeper
  Issue Type: Improvement
  Components: server
Affects Versions: 3.9.1
Reporter: Yan Zhao


In ZKDatabase, after a quorum request is committed successfully, ZKDatabase wraps the request into a proposal and stores it in the committedLog.
The wrap operation serializes the request to a byte array and wraps that array in a QuorumPacket, so if the request payload size is 1M, the Proposal occupies 2M of memory, which increases memory pressure.

The committedLog is used for fast follower synchronization, so we can serialize the request during synchronization; there is no need to serialize it in advance.

This can cut the committedLog memory usage roughly in half.
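
A rough sketch of the idea (illustrative names, not the actual ZKDatabase types):

{code:java}
import java.util.function.Supplier;

class LazyProposal {
    private final Supplier<byte[]> serializer; // serializes the request on demand
    private volatile byte[] data;              // built on first follower sync

    LazyProposal(Supplier<byte[]> serializer) {
        this.serializer = serializer;
    }

    byte[] serialized() {
        byte[] d = data;
        if (d == null) {
            // Serialize once, only when a follower sync actually needs it
            // (a benign race may serialize twice; the result is identical).
            d = data = serializer.get();
        }
        return d;
    }
}
{code}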




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ZOOKEEPER-4793) The zookeeper server returns the wrong response for the ruok command.

2024-01-24 Thread Yan Zhao (Jira)
Yan Zhao created ZOOKEEPER-4793:
---

 Summary: The zookeeper server returns the wrong response for the ruok command.
 Key: ZOOKEEPER-4793
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4793
 Project: ZooKeeper
  Issue Type: Improvement
  Components: server
Affects Versions: 3.9.1
Reporter: Yan Zhao


Users often use the ruok command to probe the ZooKeeper server status. If the ruok command doesn't return a response, an automation tool will restart ZooKeeper.

But if a quorum ZooKeeper server encounters some unexpected error, it changes its state to State.ERROR, meaning the server can't serve requests anymore. Yet ruok still returns `imok`.

In this case, it should return `This ZooKeeper instance is not currently serving requests`, like other commands (e.g. WatchCommand).
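
i.e. today a probe like the following still reports healthy even when the server state is State.ERROR (illustrative session):

{noformat}
$ echo ruok | nc localhost 2181
imok
{noformat}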



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ZOOKEEPER-4792) Tune the env log at the start of the process

2024-01-24 Thread Yan Zhao (Jira)
Yan Zhao created ZOOKEEPER-4792:
---

 Summary: Tune the env log at the start of the process
 Key: ZOOKEEPER-4792
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4792
 Project: ZooKeeper
  Issue Type: Improvement
  Components: server
Affects Versions: 3.9.1
Reporter: Yan Zhao


At the start of the process, ZooKeeper prints the env info in the log. Three of these entries are memory metrics.

{code:java}
// Get memory information.
Runtime runtime = Runtime.getRuntime();
int mb = 1024 * 1024;
put(l, "os.memory.free", runtime.freeMemory() / mb + "MB");
put(l, "os.memory.max", runtime.maxMemory() / mb + "MB");
put(l, "os.memory.total", runtime.totalMemory() / mb + "MB");
{code}

https://github.com/apache/zookeeper/blob/9e40464d98319b4553d93b12c6d7db4d240bbce9/zookeeper-server/src/main/java/org/apache/zookeeper/Environment.java#L88-L90

It's misleading for the user; using jvm as the prefix would be better.

Change to:
{code:java}
// Get memory information.
Runtime runtime = Runtime.getRuntime();
int mb = 1024 * 1024;
put(l, "jvm.memory.free", runtime.freeMemory() / mb + "MB");
put(l, "jvm.memory.max", runtime.maxMemory() / mb + "MB");
put(l, "jvm.memory.total", runtime.totalMemory() / mb + "MB");
{code}




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ZOOKEEPER-4791) Improve logging when the connection to a remote server is closed

2024-01-24 Thread Jira
Sönke Liebau created ZOOKEEPER-4791:
---

 Summary: Improve logging when the connection to a remote server is 
closed
 Key: ZOOKEEPER-4791
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4791
 Project: ZooKeeper
  Issue Type: New Feature
  Components: server
Affects Versions: 3.9.1
Reporter: Sönke Liebau


When a server closes the connection to a remote server, the logging around why 
the connection is being closed could be improved a bit.

https://github.com/apache/zookeeper/blob/1cc1eb6a2be7323a5c326652d59a070473bb8779/zookeeper-server/src/main/java/org/apache/zookeeper/server/NettyServerCnxn.java#L524

{code:java}
ZooKeeperServer zks = this.zkServer;
if (zks == null || !zks.isRunning()) {
    LOG.info("Closing connection to {} because the server is not ready",
        getRemoteSocketAddress());
    close(DisconnectReason.IO_EXCEPTION);
    return;
}
{code}

It would be helpful to log what zkServer is, because it can have multiple 
states that would trigger this shutdown.
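
A sketch of the suggested improvement (zks may be null, and what its toString shows depends on the server implementation):

{code:java}
// Illustrative only: include the zkServer reference in the message so the
// triggering condition (null vs. not running) is visible in the log.
LOG.info("Closing connection to {} because the server is not ready (zkServer: {})",
        getRemoteSocketAddress(), zks);
close(DisconnectReason.IO_EXCEPTION);
return;
{code}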



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ZOOKEEPER-4790) TLS Quorum hostname verification breaks in some scenarios

2024-01-24 Thread Jira
Sönke Liebau created ZOOKEEPER-4790:
---

 Summary: TLS Quorum hostname verification breaks in some scenarios
 Key: ZOOKEEPER-4790
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4790
 Project: ZooKeeper
  Issue Type: Improvement
Affects Versions: 3.9.1
Reporter: Sönke Liebau


Currently, enabling Quorum TLS will make the server validate the SANs of the client certificates of connecting quorum peers against their reverse DNS name.

 We have seen this cause issues when running in Kubernetes, due to ip addresses 
resolving to multiple dns names, when ZooKeeper pods participate in multiple 
services.

Since `InetAddress.getHostAddress()` returns a String, it basically becomes a 
game of chance which dns name is checked against the cert. This has caused 
issues in the Strimzi operator as well (see [this 
issue|https://github.com/strimzi/strimzi-kafka-operator/issues/3099]) - they 
solved this by pretty much adding anything they can find that might be relevant 
to the SAN, and a few wildcards on top of that.

This is both error-prone and doesn't really add any relevant extra security, since "this certificate matches the connecting peer" shouldn't automatically mean "this peer should be allowed to connect".

 There are two (probably more) ways to fix this:

# Retrieve _all_ reverse entries and check against all of them
# The ZK server could verify the SAN against the list of servers ({{server.N}} in the config). A peer should be able to connect on the quorum port if and only if at least one SAN matches at least one of the listed servers.

I'd argue that the second option is the better one, especially since the java 
api doesn't even seem to have the option of retrieving all dns entries, but 
also because it better matches the expressed intent of the ZK admin.

Additionally, it would be nice to have a "disable client hostname verification" option that still leaves server hostname verification enabled. Strictly speaking this is a separate issue though; I'd be happy to spin it out into a ticket of its own.





--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ZOOKEEPER-4789) Avoid static blocks in QuorumAuth Tests

2024-01-23 Thread Muthuraj Ramalingakumar (Jira)
Muthuraj Ramalingakumar created ZOOKEEPER-4789:
--

 Summary: Avoid static blocks in QuorumAuth Tests
 Key: ZOOKEEPER-4789
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4789
 Project: ZooKeeper
  Issue Type: Test
Reporter: Muthuraj Ramalingakumar


In the QuorumAuth tests, test setup code is written in static initializer blocks ({}); use @BeforeAll instead, if possible.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ZOOKEEPER-4788) high maxlatency

2024-01-21 Thread yangoofy (Jira)
yangoofy created ZOOKEEPER-4788:
---

 Summary: high maxlatency
 Key: ZOOKEEPER-4788
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4788
 Project: ZooKeeper
  Issue Type: Bug
Affects Versions: 3.7.1
Reporter: yangoofy


We use ZooKeeper as the registration center for microservices, with the default configuration. The snapshot is about 600 MB, with 700k ephemeral nodes. A single ZooKeeper server has 56 cores / 128 GB RAM and -Xmx16g. The number of client connections is 8k, and the number of watches is 4 million. When clients start and stop en masse, the max latency of the server is as high as 10 seconds. How can we optimize it?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ZOOKEEPER-4787) Failed to establish connection between zookeeper

2024-01-17 Thread softrock (Jira)
softrock created ZOOKEEPER-4787:
---

 Summary: Failed to establish connection between zookeeper
 Key: ZOOKEEPER-4787
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4787
 Project: ZooKeeper
  Issue Type: Bug
  Components: server
Affects Versions: 3.8.3
 Environment: z/OS
Reporter: softrock


*Problem:*

When running ZooKeeper version 3.8.3 on the z/OS platform, the servers cannot establish connections to each other.

Error:

[2024-01-17 23:06:44,194] INFO Received connection request from 
/xx.xx.xx.xx:23840 (org.apache.zookeeper.server.quorum.QuorumCnxManager)
 [2024-01-17 23:06:44,197] ERROR Initial message parsing error! 
(org.apache.zookeeper.server.quorum.QuorumCnxManager)
 
org.apache.zookeeper.server.quorum.QuorumCnxManager$InitialMessage$InitialMessageException:
 Badly formed address: K???K???K???z
     at 
org.apache.zookeeper.server.quorum.QuorumCnxManager$InitialMessage.parse(QuorumCnxManager.java:271)
     at 
org.apache.zookeeper.server.quorum.QuorumCnxManager.handleConnection(QuorumCnxManager.java:607)
     at 
org.apache.zookeeper.server.quorum.QuorumCnxManager.receiveConnection(QuorumCnxManager.java:555)
     at 
org.apache.zookeeper.server.quorum.QuorumCnxManager$Listener$ListenerHandler.acceptConnections(QuorumCnxManager.java:1085)
     at 
org.apache.zookeeper.server.quorum.QuorumCnxManager$Listener$ListenerHandler.run(QuorumCnxManager.java:1039)
     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:522)
     at java.util.concurrent.FutureTask.run(FutureTask.java:277)
     at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1160)
     at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
     at java.lang.Thread.run(Thread.java:825)

*Root cause:*

The receiver cannot resolve the address from the sender requesting a 
connection. This is because the sender sends the address in UTF-8 encoding, but 
the receiver parses the address in IBM-1047 encoding (the default).

*Resolution:*

 Make both the receiver and sender sides use UTF-8 encoding.
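
i.e. the parsing side should decode the address bytes with an explicit charset rather than the platform default (a sketch; addressBytes is a stand-in for the bytes read in QuorumCnxManager$InitialMessage.parse):

{code:java}
import java.nio.charset.StandardCharsets;

// Illustrative only: decode with an explicit charset so a z/OS receiver
// (IBM-1047 default) parses the UTF-8 bytes the sender wrote.
String addr = new String(addressBytes, StandardCharsets.UTF_8);
{code}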

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ZOOKEEPER-4786) Bug in file "zookeeper/zookeeper-server/src/main/java/org/apache/zookeeper/server/util/OSMXBean.java"

2024-01-17 Thread Abdelaziz Assem (Jira)
Abdelaziz Assem created ZOOKEEPER-4786:
--

 Summary: Bug in file "zookeeper/zookeeper-server/src/main/java/org/apache/zookeeper/server/util/OSMXBean.java"
 Key: ZOOKEEPER-4786
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4786
 Project: ZooKeeper
  Issue Type: Bug
  Components: other
Affects Versions: 3.6.3
 Environment: AIX Servers.
Reporter: Abdelaziz Assem
 Attachments: Exception Error.PNG, Failed to Start.PNG

Function "{*}getOpenFileDescriptorCount{*}" in that file line 98 supports only 
ibmvendor machines and the implantation is done for Linux servers only,

 

Now I am using it on AIX servers and when I tried to start zookeeper it says 
"Failed to Start", because "java.lang.NumberFormatException" which happened in 
line 121 in "{*}OSMXBean.java{*}".



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ZOOKEEPER-4785) Txn loss due to race condition when follower DIFF sync with leader

2024-01-11 Thread Li Wang (Jira)
Li Wang created ZOOKEEPER-4785:
--

 Summary: Txn loss due to race condition when follower DIFF sync 
with leader
 Key: ZOOKEEPER-4785
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4785
 Project: ZooKeeper
  Issue Type: Bug
  Components: server
Affects Versions: 3.9.1, 3.8.2, 3.7.2, 3.8.1, 3.7.1, 3.8.0
Reporter: Li Wang


We had a txn loss incident in production recently. The root cause is that the follower writes the current epoch and sends the ACK_LD before successfully persisting all the txns from the DIFF sync, as persisting txns is handled asynchronously via SyncRequestProcessor.






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ZOOKEEPER-4784) Token based ASF JIRA authentication

2024-01-09 Thread Szucs Villo (Jira)
Szucs Villo created ZOOKEEPER-4784:
--

 Summary: Token based ASF JIRA authentication
 Key: ZOOKEEPER-4784
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4784
 Project: ZooKeeper
  Issue Type: Sub-task
  Components: tools
Reporter: Szucs Villo
Assignee: Szucs Villo


https://issues.apache.org/jira/browse/SPARK-44802



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ZOOKEEPER-4783) leader crash because of zxid 32b rollover but no other server takes the lead

2024-01-04 Thread Jira
Stéphane Loeuillet created ZOOKEEPER-4783:
-

 Summary: leader crash because of zxid 32b rollover but no other 
server takes the lead
 Key: ZOOKEEPER-4783
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4783
 Project: ZooKeeper
  Issue Type: Bug
Affects Versions: 3.8.3
 Environment: Linux amd64 Ubuntu 20.04.5

Java OpenJDK17U-jre_x64_linux_hotspot_17.0.8.1_1.tar.gz
Reporter: Stéphane Loeuillet
 Attachments: zookeeper_crash.log

Got a 5-node cluster running on baremetal servers (with NVMe), used by a ClickHouse cluster on separate hardware.

This morning, a crash on the leader left my clusters unusable: while the leader crashed, none of the 4 followers took the lead.

 

zookeeper leader was zookeeper08

05/06/07/09 were the followers

 

Only a restart of the zookeeper05 process unfroze the whole cluster



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ZOOKEEPER-4782) The installation guide in zookeeper-client/zookeeper-client-c is outdated

2023-12-26 Thread yangzhenxing (Jira)
yangzhenxing created ZOOKEEPER-4782:
---

 Summary: The installation guide in zookeeper-client/zookeeper-client-c is outdated
 Key: ZOOKEEPER-4782
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4782
 Project: ZooKeeper
  Issue Type: Bug
  Components: c client
Affects Versions: 3.9.1
Reporter: yangzhenxing


The installation guide in zookeeper-client/zookeeper-client-c is outdated; it should point out that the currently recommended way to build is described in zookeeper-docs/src/main/resources/markdown/zookeeperProgrammers.md



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ZOOKEEPER-4781) ZooKeeper not starting because the accepted epoch is less than the current epoch.

2023-12-21 Thread zhanglu153 (Jira)
zhanglu153 created ZOOKEEPER-4781:
-

 Summary: ZooKeeper not starting because the accepted epoch is less 
than the current epoch.
 Key: ZOOKEEPER-4781
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4781
 Project: ZooKeeper
  Issue Type: Bug
  Components: server
Affects Versions: 3.9.1, 3.8.3, 3.7.2, 3.6.4, 3.4.14, 3.5.10
Reporter: zhanglu153


This issue occurred in our abnormal testing environment, where the disk was 
injected with anomalies and frequently filled up.

The scenario is as follows:
 # Configure three node ZooKeeper cluster, lets say nodes are A, B and C.
 # Start the cluster, and node C becomes the leader.
 # The disk of Node C is injected with a full disk exception.
 # Node C called the org.apache.zookeeper.server.quorum.Leader#lead method.
{code:java}
void lead() throws IOException, InterruptedException {
self.end_fle = Time.currentElapsedTime();
long electionTimeTaken = self.end_fle - self.start_fle;
self.setElectionTimeTaken(electionTimeTaken);
LOG.info("LEADING - LEADER ELECTION TOOK - {}", electionTimeTaken);
self.start_fle = 0;
self.end_fle = 0;

zk.registerJMX(new LeaderBean(this, zk), self.jmxLocalPeerBean);

try {
self.tick.set(0);
zk.loadData();

leaderStateSummary = new StateSummary(self.getCurrentEpoch(), 
zk.getLastProcessedZxid());

// Start thread that waits for connection requests from 
// new followers.
cnxAcceptor = new LearnerCnxAcceptor();
cnxAcceptor.start();

readyToStart = true;
long epoch = getEpochToPropose(self.getId(), self.getAcceptedEpoch());

zk.setZxid(ZxidUtils.makeZxid(epoch, 0));

synchronized(this){
lastProposed = zk.getZxid();
}

newLeaderProposal.packet = new QuorumPacket(NEWLEADER, zk.getZxid(),
null, null);


if ((newLeaderProposal.packet.getZxid() & 0xffffffffL) != 0) {
LOG.info("NEWLEADER proposal has Zxid of "
+ Long.toHexString(newLeaderProposal.packet.getZxid()));
}

waitForEpochAck(self.getId(), leaderStateSummary);
self.setCurrentEpoch(epoch);
... {code}

 # Node C, as the leader, will start the LearnerCnxAcceptor thread, and then 
call the org.apache.zookeeper.server.quorum.Leader#getEpochToPropose method. At 
this time, the value of waitingForNewEpoch is true, and the size of 
connectingFollowers is not greater than n/2. Node C directly calls 
connectingFollowers.wait to wait. The maximum waiting time is 
self.getInitLimit()*self.getTickTime() ms.
{code:java}
public long getEpochToPropose(long sid, long lastAcceptedEpoch) throws 
InterruptedException, IOException {
synchronized(connectingFollowers) {
if (!waitingForNewEpoch) {
return epoch;
}
if (lastAcceptedEpoch >= epoch) {
epoch = lastAcceptedEpoch+1;
}
if (isParticipant(sid)) {
connectingFollowers.add(sid);
}
QuorumVerifier verifier = self.getQuorumVerifier();
if (connectingFollowers.contains(self.getId()) && verifier.containsQuorum(connectingFollowers)) {
self.setAcceptedEpoch(epoch);
waitingForNewEpoch = false;
connectingFollowers.notifyAll();
} else {
long start = Time.currentElapsedTime();
long cur = start;
long end = start + self.getInitLimit()*self.getTickTime();
while(waitingForNewEpoch && cur < end) {
connectingFollowers.wait(end - cur);
cur = Time.currentElapsedTime();
}
if (waitingForNewEpoch) {
throw new InterruptedException("Timeout while waiting for epoch 
from quorum");
}
}
return epoch;
}
} {code}

 # Node B connects to the 2888 communication port of node C and starts a new 
LearnerHandler thread. 
 # Node A connects to the 2888 communication port of node C and starts a new 
LearnerHandler thread.
 # After node B connects to node C, the 
org.apache.zookeeper.server.quorum.Leader#getEpochToPropose method is called in 
the LearnerHandler thread. At this point, the value of waitingForNewEpoch is 
true, and the size of connectingFollowers is greater than n/2, so the quorum 
branch is taken. Because the disk of node C is full, calling setAcceptedEpoch 
to write the acceptedEpoch value fails with an IO exception. Node C fails to 
update the acceptedEpoch file and never reaches the waitingForNewEpoch = false 
assignment or the connectingFollowers.notifyAll() call. This will cause node C 
to wait at connectingFollowers.wait, with a maximum wait of 
self.getInitLimit()*self.getTickTime() ms

[jira] [Created] (ZOOKEEPER-4780) Avoid creating temporary files in source directory.

2023-12-19 Thread Muthuraj Ramalingakumar (Jira)
Muthuraj Ramalingakumar created ZOOKEEPER-4780:
--

 Summary: Avoid creating temporary files in source directory.
 Key: ZOOKEEPER-4780
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4780
 Project: ZooKeeper
  Issue Type: Test
Reporter: Muthuraj Ramalingakumar


The zookeeper-server module has several unit tests which create temporary 
folders and files in the test directory and contain logic to delete the files 
after the test run.

We can use the JUnit TempDir annotation to handle tempfile/tempdir creation, 
so we don't have to manage that in source code.
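
For illustration, a minimal sketch of the proposed pattern (the test and file 
names here are invented):

{code:java}
import java.io.File;
import org.junit.jupiter.api.Test;
import org.junit.jupiter.api.io.TempDir;

public class SnapshotDirTest {

    // JUnit 5 creates this directory before each test and deletes it,
    // including its contents, afterwards; no manual cleanup logic needed.
    @TempDir
    File tempDir;

    @Test
    public void testWritesSnapshot() throws Exception {
        File snapshot = new File(tempDir, "snapshot.0");
        // ... exercise the code under test against 'snapshot' ...
    }
}
{code}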



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ZOOKEEPER-4779) ZKUtilTest fails to run on WSL

2023-12-16 Thread Muthuraj Ramalingakumar (Jira)
Muthuraj Ramalingakumar created ZOOKEEPER-4779:
--

 Summary: ZKUtilTest fails to run on WSL
 Key: ZOOKEEPER-4779
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4779
 Project: ZooKeeper
  Issue Type: Task
  Components: tests
Reporter: Muthuraj Ramalingakumar


The `ZKUtilTest#testUnreadableFileInput` test fails when running on WSL 
(Windows Subsystem for Linux).

Skip that test when running in WSL.
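
For example, a sketch of one way to do that (the detection heuristic below, 
reading /proc/version, is an assumption and not necessarily the approach the 
project will take; it also assumes Java 11+):

{code:java}
import static org.junit.jupiter.api.Assumptions.assumeFalse;

import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical guard: WSL kernels report "microsoft" in /proc/version.
static boolean runningOnWsl() throws Exception {
    Path procVersion = Path.of("/proc/version");
    return Files.exists(procVersion)
            && Files.readString(procVersion).toLowerCase().contains("microsoft");
}

// At the top of testUnreadableFileInput():
//     assumeFalse(runningOnWsl(), "file permissions are not enforced under WSL");
{code}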



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ZOOKEEPER-4778) Patch jetty, netty, and logback to remove high severity vulnerabilities

2023-12-16 Thread Ivgeni (Jira)
Ivgeni created ZOOKEEPER-4778:
-

 Summary: Patch jetty, netty, and logback to remove high severity 
vulnerabilities
 Key: ZOOKEEPER-4778
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4778
 Project: ZooKeeper
  Issue Type: Improvement
  Components: security
Reporter: Ivgeni


logback-core & logback-classic:

[https://nvd.nist.gov/vuln/detail/CVE-2023-6378]

netty-codec:

[https://nvd.nist.gov/vuln/detail/CVE-2023-44487]

jetty-io:

[https://nvd.nist.gov/vuln/detail/CVE-2023-36478]

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ZOOKEEPER-4777) Zookeeper becomes unresponsive when using native GSSAPI

2023-12-05 Thread Rickey Visinski (Jira)
Rickey Visinski created ZOOKEEPER-4777:
--

 Summary: Zookeeper becomes unresponsive when using native GSSAPI
 Key: ZOOKEEPER-4777
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4777
 Project: ZooKeeper
  Issue Type: Bug
  Components: kerberos, server
Affects Versions: 3.8.3, 3.8.2, 3.7.2, 3.6.4, 3.7.1, 3.6.2, 3.5.7, 3.5.6, 
3.4.14, 3.4.13
 Environment: RHEL 7 and OpenJDK Runtime Environment (build 
1.8.0_392-b08)

RHEL 8 and OpenJDK Runtime Environment (Red_Hat-17.0.9.0.9-1) (build 
17.0.9+9-LTS)
Reporter: Rickey Visinski


The ZooKeeper ensemble starts up properly after quorum is formed. The leader is 
elected and starts serving requests. After a while the leader gets stuck: it 
keeps accepting requests but stops processing them, and the same is the case 
with the participants. They accept requests, but since the leader doesn't 
process them, they keep piling up.

This causes a sudden increase in the number of CLOSE_WAIT connections on the 
zookeeper servers. When this happens, the ensemble is completely unresponsive, 
causing connection loss/timeouts. Once the CLOSE_WAIT connections start, the 
number of open connections on each server spikes as high as 10 from a mere 
200 connections within a few minutes.

A pattern was found in thread dump where we always saw {{NIOServerCxnFactory}} 
selector thread blocked on a lock waiting in 
{{org.apache.zookeeper.server.ZooKeeperSaslServer.createSaslServer}}
{code:java}
tdump_zkdev14.i.ia55.net_1694037623.logs-"NIOServerCxnFactory.SelectorThread-0" 
#16 daemon prio=5 os_prio=0 cpu=9126323.70ms elapsed=25935.16s 
tid=0x7f9118702320 nid=0x20ed94 waiting for monitor entry  
[0x7f907e635000]
tdump_zkdev14.i.ia55.net_1694037623.logs:   java.lang.Thread.State: BLOCKED (on 
object monitor)
tdump_zkdev14.i.ia55.net_1694037623.logs-   at 
org.apache.zookeeper.server.ZooKeeperSaslServer.createSaslServer(ZooKeeperSaslServer.java:42)
tdump_zkdev14.i.ia55.net_1694037623.logs-   - waiting to lock 
<0x000700391098> (a org.apache.zookeeper.Login)
tdump_zkdev14.i.ia55.net_1694037623.logs-   at 
org.apache.zookeeper.server.ZooKeeperSaslServer.(ZooKeeperSaslServer.java:38)
 {code}

Seems to be related to https://issues.apache.org/jira/browse/ZOOKEEPER-2230

 

Thanks



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ZOOKEEPER-4776) CVE-2023-36478 | org.eclipse.jetty_jetty-io

2023-12-04 Thread Aayush Suri (Jira)
Aayush Suri created ZOOKEEPER-4776:
--

 Summary: CVE-2023-36478 | org.eclipse.jetty_jetty-io
 Key: ZOOKEEPER-4776
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4776
 Project: ZooKeeper
  Issue Type: Bug
Affects Versions: 3.9.1
Reporter: Aayush Suri


{*}Vulnerability summary{*}: Eclipse Jetty provides a web server and servlet 
container. In versions 11.0.0 through 11.0.15, 10.0.0 through 10.0.15, and 
9.0.0 through 9.4.52, an integer overflow in `MetaDataBuilder.checkSize` allows 
for HTTP/2 HPACK header values to exceed their size limit. 
`MetaDataBuilder.java` determines if a header name or value exceeds the size 
limit, and throws an exception if the limit is exceeded. However, when length 
is very large and huffman is true, the multiplication by 4 in line 295 will 
overflow, and length will become negative. `(_size+length)` will now be 
negative, and the check on line 296 will not be triggered. Furthermore, 
`MetaDataBuilder.checkSize` allows for user-entered HPACK header value sizes to 
be negative, potentially leading to a very large buffer allocation later on 
when the user-entered size is multiplied by 2. This means that if a user 
provides a negative length value (or, more precisely, a length value which, 
when multiplied by the 4/3 fudge factor, is negative), and this length value is 
a very large positive number when multiplied by 2, then the user can cause a 
very large buffer to be allocated on the server. Users of HTTP/2 can be 
impacted by a remote denial of service attack. The issue has been fixed in 
versions 11.0.16, 10.0.16, and 9.4.53. There are no known workarounds.

Looking for a version that fixes this vulnerability. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ZOOKEEPER-4775) Add a version of check_zookeeper that works with Python 3

2023-11-30 Thread Enrico Olivelli (Jira)
Enrico Olivelli created ZOOKEEPER-4775:
--

 Summary: Add a version of check_zookeeper that works with Python 3
 Key: ZOOKEEPER-4775
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4775
 Project: ZooKeeper
  Issue Type: Improvement
Reporter: Enrico Olivelli






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ZOOKEEPER-4774) zookeeper.ssl.context.supplier.class is not working in 3.9

2023-11-29 Thread Jerry Chung (Jira)
Jerry Chung created ZOOKEEPER-4774:
--

 Summary: zookeeper.ssl.context.supplier.class is not working in 3.9
 Key: ZOOKEEPER-4774
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4774
 Project: ZooKeeper
  Issue Type: Bug
  Components: java client
Affects Versions: 3.9.1
Reporter: Jerry Chung


hi,

 

{{zookeeper.ssl.context.supplier.class}} worked from 3.6 through 3.8, but it no 
longer works in 3.9.

Is this a permanent decision? The documentation still mentions it.

 

Thanks.

jerry



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ZOOKEEPER-4773) Ephemeral node is not deleted when all followers are blocked with leader

2023-11-27 Thread May (Jira)
May created ZOOKEEPER-4773:
--

 Summary: Ephemeral node is not deleted when all followers are 
blocked with leader
 Key: ZOOKEEPER-4773
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4773
 Project: ZooKeeper
  Issue Type: Bug
  Components: quorum, server
Affects Versions: 3.9.1, 3.8.3
Reporter: May


The test case EphemeralNodeDeletionTest covers the scenario where a follower 
loses its connection with the leader while the client writes an ephemeral node; 
the node should be deleted after the client closes its session. However, the 
case fails when I make all the followers lose their connections.


To reproduce the bug, I simply modified testEphemeralNodeDeletion() as 
following:
{code:java}
// 2: inject network problem in two followers
ArrayList<CustomQuorumPeer> followers = getFollowers();
for (CustomQuorumPeer follower : followers) {
follower.setInjectError(true);
}
//CustomQuorumPeer follower = (CustomQuorumPeer) getByServerState(mt, 
ServerState.FOLLOWING);
//follower.setInjectError(true);

// 3: close the session so that ephemeral node is deleted
zk.close();

// remove the error
//follower.setInjectError(false);
for (CustomQuorumPeer follower : followers) {
follower.setInjectError(false);
assertTrue(ClientBase.waitForServerUp("127.0.0.1:" + 
follower.getClientPort(), CONNECTION_TIMEOUT),
"Faulted Follower should have joined quorum by now");
}
{code}
And here is the added method getFollowers():
{code:java}
private ArrayList<CustomQuorumPeer> getFollowers() {
ArrayList<CustomQuorumPeer> followers = new ArrayList<>();
for (int i = 0; i <= mt.length - 1; i++) {
QuorumPeer quorumPeer = mt[i].getQuorumPeer();
if (null != quorumPeer && ServerState.FOLLOWING == 
quorumPeer.getPeerState()) {
followers.add((CustomQuorumPeer)quorumPeer);
}
}
return followers;
}
{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ZOOKEEPER-4772) Wrong sync logic in LearnerHandler when sync (0,0) to a new epoch follower

2023-11-27 Thread May (Jira)
May created ZOOKEEPER-4772:
--

 Summary: Wrong sync logic in LearnerHandler when sync (0,0) to a 
new epoch follower
 Key: ZOOKEEPER-4772
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4772
 Project: ZooKeeper
  Issue Type: Bug
  Components: server
Affects Versions: 3.9.1, 3.8.3, 3.7.2
Reporter: May


The current LearnerHandler#syncFollower does not consider the situation where 
the proposal (0,0) has been committed and snapshotted. It will not use a 
snapshot to sync when minCommittedLog is 0.

The bug can be reproduced by modifying testNewEpochZxid in LearnerHandlerTest:

{code:java}
public void testNewEpochZxid() throws Exception {
long peerZxid;
db.txnLog.add(createProposal(getZxid(0, 0))); // Added
db.txnLog.add(createProposal(getZxid(0, 1)));
db.txnLog.add(createProposal(getZxid(1, 1)));
db.txnLog.add(createProposal(getZxid(1, 2)));

// After leader election, lastProcessedZxid will point to new epoch
db.lastProcessedZxid = getZxid(2, 0);
db.committedLog.add(createProposal(getZxid(0, 0))); // Added
db.committedLog.add(createProposal(getZxid(1, 1)));
db.committedLog.add(createProposal(getZxid(1, 2)));

// Peer has zxid of epoch 0
peerZxid = getZxid(0, 0);
// We should get snap, we can do better here, but the main logic is
// that we should never send diff if we have never seen any txn older
// than peer zxid
assertTrue(learnerHandler.syncFollower(peerZxid, leader)); // Fail here
{code}




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ZOOKEEPER-4771) Fast leader election taking too long

2023-11-22 Thread Ivo Vrdoljak (Jira)
Ivo Vrdoljak created ZOOKEEPER-4771:
---

 Summary: Fast leader election taking too long
 Key: ZOOKEEPER-4771
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4771
 Project: ZooKeeper
  Issue Type: Bug
  Components: leaderElection
Affects Versions: 3.4.10
Reporter: Ivo Vrdoljak
 Attachments: zookeeper10.log, zookeeper11.log, zookeeper12.log, 
zookeeper20.log, zookeeper21.log

Hello ZooKeeper Community,
 
Background:
We are using ZooKeeper version 3.4.10 in our system, and we have 5 ZooKeeper 
servers running, distributed across 2 clusters of servers.
In the first cluster, we have 3 Zookeeper servers, each deployed on its own 
machine, and in the second cluster we have 2 Zookeeper servers, also each on 
its own machine. Zookeeper servers that are distributed on the same cluster 
communicate through the local network, and with the servers on the remote 
cluster through an external network.
The situation is the following:

 
{code:java}
Cluster 1
Zookeeper server 10
Zookeeper server 11
Zookeeper server 12 -> Leader
Cluster 2
Zookeeper server 20
Zookeeper server 21
{code}
 

Problem:
We have an issue with Fast Leader Election when we kill the ZooKeeper leader 
process.
After the leader (server 12) is killed and leader election starts, we can see 
in the ZooKeeper logs that voting notifications are sent from each ZooKeeper 
server that remained alive towards all the others. Notifications between 
ZooKeeper servers located in the same cluster (communicating over the local 
network) are successfully exchanged. The problem seems to be with ZooKeeper 
servers sending votes over the external network, as according to the logs 
those are only sent in one direction.
 
Logs from zookeeper server 10:
{code:java}
Nov 22 10:31:13 sc_2_1 BC[myid: 10]: INFO  LOOKING
Nov 22 10:31:13 sc_2_1 BC[myid: 10]: DEBUG Initializing leader election 
protocol...
Nov 22 10:31:13 sc_2_1 BC[myid: 10]: DEBUG Updating proposal: 10 (newleader), 
0xe9c97 (newzxid), 12 (oldleader), 0xd1380 (oldzxid)
Nov 22 10:31:13 sc_2_1 BC[myid: 10]: INFO  New election. My id =  10, proposed 
zxid=0xe9c97
Nov 22 10:31:13 sc_2_1 BC[myid: 10]: DEBUG Sending Notification: 10 (n.leader), 
0xe9c97 (n.zxid), 0x11 (n.round), 20 (recipient), 10 (myid), 0xe 
(n.peerEpoch)
Nov 22 10:31:13 sc_2_1 BC[myid: 10]: DEBUG Sending Notification: 10 (n.leader), 
0xe9c97 (n.zxid), 0x11 (n.round), 21 (recipient), 10 (myid), 0xe 
(n.peerEpoch)
Nov 22 10:31:13 sc_2_1 BC[myid: 10]: DEBUG Sending Notification: 10 (n.leader), 
0xe9c97 (n.zxid), 0x11 (n.round), 10 (recipient), 10 (myid), 0xe 
(n.peerEpoch)
Nov 22 10:31:13 sc_2_1 BC[myid: 10]: DEBUG Sending Notification: 10 (n.leader), 
0xe9c97 (n.zxid), 0x11 (n.round), 11 (recipient), 10 (myid), 0xe 
(n.peerEpoch)
Nov 22 10:31:13 sc_2_1 BC[myid: 10]: DEBUG Sending Notification: 10 (n.leader), 
0xe9c97 (n.zxid), 0x11 (n.round), 12 (recipient), 10 (myid), 0xe 
(n.peerEpoch)
Nov 22 10:31:13 sc_2_1 BC[myid: 10]: DEBUG Adding vote: from=10, proposed 
leader=10, proposed zxid=0xe9c97, proposed election epoch=0x11
Nov 22 10:31:14 sc_2_1 BC[myid: 10]: DEBUG Adding vote: from=11, proposed 
leader=11, proposed zxid=0xe9c97, proposed election epoch=0x11
Nov 22 10:31:14 sc_2_1 BC[myid: 10]: DEBUG Adding vote: from=10, proposed 
leader=10, proposed zxid=0xe9c97, proposed election epoch=0x11
Nov 22 10:31:14 sc_2_1 BC[myid: 10]: DEBUG Adding vote: from=10, proposed 
leader=11, proposed zxid=0xe9c97, proposed election epoch=0x11
Nov 22 10:31:14 sc_2_1 BC[myid: 10]: DEBUG Adding vote: from=10, proposed 
leader=11, proposed zxid=0xe9c97, proposed election epoch=0x11
Nov 22 10:31:14 sc_2_1 BC[myid: 10]: DEBUG Adding vote: from=11, proposed 
leader=11, proposed zxid=0xe9c97, proposed election epoch=0x11
Nov 22 10:31:15 sc_2_1 BC[myid: 10]: DEBUG Adding vote: from=10, proposed 
leader=11, proposed zxid=0xe9c97, proposed election epoch=0x11
Nov 22 10:31:15 sc_2_1 BC[myid: 10]: DEBUG Adding vote: from=11, proposed 
leader=11, proposed zxid=0xe9c97, proposed election epoch=0x11{code}
Logs from zookeeper server 20:
{code:java}
Nov 22 10:31:13 sc_2_1 BC[myid: 20]: INFO  LOOKING
Nov 22 10:31:13 sc_2_1 BC[myid: 20]: DEBUG Initializing leader election 
protocol...
Nov 22 10:31:13 sc_2_1 BC[myid: 20]: DEBUG Sending Notification: 20 (n.leader), 
0xe9c97 (n.zxid), 0x11 (n.round), 20 (recipient), 20 (myid), 0xe 
(n.peerEpoch)
Nov 22 10:31:13 sc_2_1 BC[myid: 20]: DEBUG Sending Notification: 20 (n.leader), 
0xe9c97 (n.zxid), 0x11 (n.round), 21 (recipient), 20 (myid), 0xe 
(n.peerEpoch)
Nov 22 10:31:13 sc_2_1 BC[myid: 20]: DEBUG Sending Notification: 20 (n.leader), 
0xe9c97 (n.zxid), 0x11 (n.round), 10 (recipient), 20 (myid), 0xe 
(n.peerEpoch)
Nov 22 10:31:13 sc_2_1 BC[myid: 20]: DEBUG Sending Notification: 20 (n.leader), 
0xe9c97 (n.zxid), 0x11 (n.round),

[jira] [Created] (ZOOKEEPER-4770) zkSnapshotRecursiveSummaryToolkit.sh Error: Could not find or load main class

2023-11-17 Thread nailcui (Jira)
nailcui created ZOOKEEPER-4770:
--

 Summary: zkSnapshotRecursiveSummaryToolkit.sh Error: Could not 
find or load main class
 Key: ZOOKEEPER-4770
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4770
 Project: ZooKeeper
  Issue Type: Bug
  Components: scripts, tools
Affects Versions: 3.9.1
 Environment: CentOS Linux release 7.4.1708
Reporter: nailcui
 Fix For: 3.9.2


When I execute the following command to analyze the snapshot file:
{code:java}
./bin/zkSnapshotRecursiveSummaryToolkit.sh /data/version-2/snapshot.c0009 / 
2 {code}
Getting this error:

 
{code:java}
Error: Could not find or load main class {code}
I checked the source code and found that $JVMFLAGS was surrounded by quotation 
marks. This problem occurs when the variable $JVMFLAGS is empty.
{code:java}
"$JAVA" -cp "$CLASSPATH" "$JVMFLAGS" \
     org.apache.zookeeper.server.SnapshotRecursiveSummary "$@" {code}
The correct code should be like this:

 
{code:java}
"$JAVA" -cp "$CLASSPATH" $JVMFLAGS \
     org.apache.zookeeper.server.SnapshotRecursiveSummary "$@"{code}
Thank you, I will solve it.

 

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ZOOKEEPER-4769) Update plugin for SBOM generation to 2.7.10

2023-11-09 Thread Vinod Anandan (Jira)
Vinod Anandan created ZOOKEEPER-4769:


 Summary: Update plugin for SBOM generation to 2.7.10
 Key: ZOOKEEPER-4769
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4769
 Project: ZooKeeper
  Issue Type: Improvement
Reporter: Vinod Anandan


Update the CycloneDX Maven plugin for SBOM generation to 2.7.10



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ZOOKEEPER-4768) Flaky test org.apache.zookeeper.metrics.prometheus.ExportJvmInfoTest#exportInfo

2023-11-02 Thread Yike Xiao (Jira)
Yike Xiao created ZOOKEEPER-4768:


 Summary: Flaky test 
org.apache.zookeeper.metrics.prometheus.ExportJvmInfoTest#exportInfo
 Key: ZOOKEEPER-4768
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4768
 Project: ZooKeeper
  Issue Type: Test
  Components: metric system
Affects Versions: 3.10.0
Reporter: Yike Xiao


If the {{io.prometheus.client.hotspot.DefaultExports#initialize}} method has 
been executed by other test cases before running the 
{{org.apache.zookeeper.metrics.prometheus.ExportJvmInfoTest#exportInfo}} test 
case, then this test case will fail.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ZOOKEEPER-4767) New implementation of prometheus quantile metrics based on DataSketches

2023-11-01 Thread Yike Xiao (Jira)
Yike Xiao created ZOOKEEPER-4767:


 Summary: New implementation of prometheus quantile metrics based 
on DataSketches
 Key: ZOOKEEPER-4767
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4767
 Project: ZooKeeper
  Issue Type: Improvement
  Components: metric system
Reporter: Yike Xiao


If the built-in Prometheus metrics feature introduced after version 3.6 is 
enabled, under high-load scenarios (such as when there are a large number of 
read requests), the percentile metrics (Summary) used to collect request 
latencies can easily become a bottleneck and impact the service itself. This is 
because the internal implementation of Summary involves the overhead of lock 
operations. In scenarios with a large number of requests, lock contention can 
lead to a dramatic deterioration in request latency. The details of this issue 
and related profiling can be viewed in ZOOKEEPER-4741.

In ZOOKEEPER-4289, the updates to Summary were switched to be executed in a 
separate thread pool. While this approach avoids the overhead of lock 
contention caused by multiple threads updating Summary simultaneously, it 
introduces the operational overhead of the thread pool queue and additional 
garbage collection (GC) overhead. Especially when the thread pool queue is 
full, a large number of RejectedExecutionException instances will be thrown, 
further increasing the pressure on GC.

To address problems above, I have implemented an almost lock-free solution 
based on DataSketches. Benchmark results show that it offers over a 10x speed 
improvement compared to version 3.9.1 and avoids frequent GC caused by creating 
a large number of temporary objects. The trade-off is that the latency 
percentiles will be displayed with a relative delay (default is 60 seconds), 
and each Summary metric will have a certain amount of permanent memory overhead.

This solution refers to Matteo Merli's optimization work on the percentile 
latency metrics for Bookkeeper, as detailed in 
https://github.com/apache/bookkeeper/commit/3bff19956e70e37c025a8e29aa8428937af77aa1.
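
For illustration only, a minimal sketch of the recording side (this assumes the 
DataSketches KLL quantile sketch from datasketches-java; the class name and 
rotation scheme below are mine, not necessarily what the patch does):

{code:java}
import org.apache.datasketches.kll.KllDoublesSketch;

// Hypothetical recorder: the hot path updates a sketch, and a reporter
// thread swaps it out once per reporting interval (e.g. every 60s) and
// reads quantiles from the frozen copy, so scrapes never block writers.
// Note: KllDoublesSketch.update() itself is not thread-safe; a real
// implementation would need striping or a concurrent variant.
public class LatencySummary {

    private volatile KllDoublesSketch current = KllDoublesSketch.newHeapInstance();

    public void observe(double latencyMs) {
        current.update(latencyMs);  // no global lock on the request path
    }

    public double[] rotateAndReport() {
        KllDoublesSketch snapshot = current;
        current = KllDoublesSketch.newHeapInstance();
        return snapshot.getQuantiles(new double[] {0.5, 0.99, 0.999});
    }
}
{code}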



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ZOOKEEPER-4766) Ensure leader election time does not unnecessarily scale with tree size due to snapshotting

2023-10-30 Thread Rishabh Rai (Jira)
Rishabh Rai created ZOOKEEPER-4766:
--

 Summary: Ensure leader election time does not unnecessarily scale 
with tree size due to snapshotting
 Key: ZOOKEEPER-4766
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4766
 Project: ZooKeeper
  Issue Type: Improvement
  Components: leaderElection
Affects Versions: 3.8.3, 3.5.9
 Environment: General behavior, should occur in all environments
Reporter: Rishabh Rai
 Fix For: 3.8.3, 3.5.9


Hi ZK community, this is regarding a fix for a behavior that is causing the 
leader election time to unnecessarily scale with the amount of data in the ZK 
data tree.



*tl;dr:* During leader election, the leader always saves a snapshot when 
loading its data tree. This snapshot seems unnecessary, even in the case where 
the leader needs to send an updated SNAP to a learner, since it serializes the 
tree before sending anyway. Snapshotting slows down leader election and 
increases ZK downtime significantly as more data is stored in the tree. This 
improvement is to avoid taking a snapshot so that this unnecessary downtime is 
avoided.


During leader election, when the [data is 
loaded|https://github.com/apache/zookeeper/blob/79f1f71a9a76689065c14d0846a69d0d71d3586e/zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/Leader.java#L601]
 by the tentatively elected (i.e. pre-finalized quorum) leader server, a 
[snapshot of the tree is always 
taken|https://github.com/apache/zookeeper/blob/79f1f71a9a76689065c14d0846a69d0d71d3586e/zookeeper-server/src/main/java/org/apache/zookeeper/server/ZooKeeperServer.java#L540].
 The loadData method is called from multiple places, but specifically in the 
context of leader election, it seems like the snapshotting step is unnecessary 
for the leader when loading data:
 * Because it has loaded the tree at this point, we know that if the leader 
were to go down again, it would still be able to recover back to the current 
state at which we are snapshotting without using the snapshot that we are 
taking in loadData()
 * There are no ongoing transactions until leader election is completed and the 
ZK ensemble is back up, so no data would be lost after the point at which the 
data tree is loaded
 * Once the ensemble is healthy and the leader is handling transactions again, 
any new transactions are being logged and the log is being rolled over when 
needed anyway, so if the leader is recovering from a failure, the snapshot 
taken during loadData() does not afford us any additional benefits over the 
initial snapshot (if it existed) and transaction log that the leader used to 
load its data from in loadData()
 * When the leader is deciding to send a SNAP or a DIFF to a learner, a [SNAP 
is 
serialized|https://github.com/apache/zookeeper/blob/79f1f71a9a76689065c14d0846a69d0d71d3586e/zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/LearnerHandler.java#L582]
 and sent [if and only if it is 
needed|https://github.com/apache/zookeeper/blob/79f1f71a9a76689065c14d0846a69d0d71d3586e/zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/LearnerHandler.java#L562].
 The snapshot taken in loadData() again does not seem to be beneficial here.

The PR for this fix only skips this snapshotting step in loadData() during 
leader election. The behavior of the function remains the same for other 
usages. With this change, during leader election the data tree would only be 
serialized when sending a SNAP to a learner. In other scenarios, no data tree 
serialization would be needed at all. In both cases, there is a significant 
reduction in the time spent in leader election.
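
For concreteness, a hypothetical shape of the change (the boolean parameter is 
invented here for illustration; the actual PR may structure it differently):

{code:java}
// Hypothetical: callers state whether loadData() should snapshot.
//
//   leader election path:   zks.loadData(false);  // skip the snapshot
//   all other callers:      zks.loadData(true);   // behavior unchanged
//
// Inside loadData(), the existing takeSnapshot() call would then run
// only when the flag is true, so only the leader-election path changes.
{code}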

If my understanding of any of this is incorrect, or if I'm failing to consider 
some other aspect of the process, please let me know. The PR for the change can 
also be changed to enable/disable this behavior via a java property.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ZOOKEEPER-4765) high maxlatency & xid out of order

2023-10-29 Thread yangoofy (Jira)
yangoofy created ZOOKEEPER-4765:
---

 Summary: high maxlatency & xid out of order
 Key: ZOOKEEPER-4765
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4765
 Project: ZooKeeper
  Issue Type: Wish
Reporter: yangoofy


1. We use ZooKeeper as the registration center for microservices, with the 
default configuration. The snapshot is about 600 MB, with 700,000 (70w) 
ephemeral nodes. A single ZooKeeper server has 56 cores and 128 GB of memory, 
with -Xmx16g. The number of client connections is 8k, and the number of watches 
is 4 million (400w). When clients start and stop en masse, the maxlatency of 
the server is as high as 10 seconds. How can I optimize this?

2. Is it necessary for the ZooKeeper client to verify that responses arrive in 
the same order as the requests were sent, reporting an 'xid out of order' error 
otherwise? Can we modify it to match responses to requests by xid instead of 
strictly verifying the order?




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ZOOKEEPER-4764) Tune the log of refuse session request.

2023-10-26 Thread Yan Zhao (Jira)
Yan Zhao created ZOOKEEPER-4764:
---

 Summary: Tune the log of refuse session request.
 Key: ZOOKEEPER-4764
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4764
 Project: ZooKeeper
  Issue Type: Improvement
  Components: server
Affects Versions: 3.9.1, 3.8.3, 3.7.2
Reporter: Yan Zhao
 Fix For: 3.7.3, 3.8.4, 3.9.2


The log:
Refusing session request for client as it has seen zxid our last zxid is 0x0 
client must try another server (org.apache.zookeeper.server.ZooKeeperServer)

We had better print the sessionId in the message.

After improvement:
Refusing session(0xab) request for client as it has seen zxid our last zxid is 
0x0 client must try another server (org.apache.zookeeper.server.ZooKeeperServer)





--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ZOOKEEPER-4763) Logback dependency should be scope provided/test

2023-10-20 Thread Thomas Mortagne (Jira)
Thomas Mortagne created ZOOKEEPER-4763:
--

 Summary: Logback dependency should be scope provided/test
 Key: ZOOKEEPER-4763
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4763
 Project: ZooKeeper
  Issue Type: Bug
  Components: other
Affects Versions: 3.8.0
Reporter: Thomas Mortagne


In general, a library that uses SLF4J is not supposed to impose an 
implementation (Logback, Log4j2, etc.); that choice belongs to the final 
runtime/WAR.

The logback dependency in zookeeper should be set to scope "provided" or "test" 
so that Maven does not follow it.
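
For illustration, the suggested change would look roughly like this in the POM 
(a sketch, assuming the standard Maven scope mechanism; not a tested patch):

{code:xml}
<!-- ZooKeeper would compile and test against Logback, but consumers
     would no longer inherit it and could pick their own SLF4J binding. -->
<dependency>
  <groupId>ch.qos.logback</groupId>
  <artifactId>logback-classic</artifactId>
  <scope>provided</scope>
</dependency>
{code}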



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ZOOKEEPER-4762) Update netty jars to 4.1.99+ to fix CVE-2023-4586

2023-10-16 Thread Dhoka Pramod (Jira)
Dhoka Pramod created ZOOKEEPER-4762:
---

 Summary: Update netty jars to 4.1.99+ to fix CVE-2023-4586
 Key: ZOOKEEPER-4762
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4762
 Project: ZooKeeper
  Issue Type: Bug
Affects Versions: 3.8.3
Reporter: Dhoka Pramod
 Fix For: 3.8.4


[https://nvd.nist.gov/vuln/detail/CVE-2023-4586]
A vulnerability was found in the Hot Rod client. This security issue occurs as 
the Hot Rod client does not enable hostname validation when using TLS, possibly 
resulting in a man-in-the-middle (MITM) attack.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ZOOKEEPER-4761) Cli tool read saved clientid fail

2023-10-15 Thread Yang Guo (Jira)
Yang Guo created ZOOKEEPER-4761:
---

 Summary:  Cli tool read saved clientid fail
 Key: ZOOKEEPER-4761
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4761
 Project: ZooKeeper
  Issue Type: Bug
Reporter: Yang Guo


A very simple bug when reading the saved clientid using fread. It causes the 
cli tool to fail to recover the last session connected to the zookeeper server.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ZOOKEEPER-4760) Add support for filename to get and set cli commands

2023-10-12 Thread Soumitra Kumar (Jira)
Soumitra Kumar created ZOOKEEPER-4760:
-

 Summary: Add support for filename to get and set cli commands
 Key: ZOOKEEPER-4760
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4760
 Project: ZooKeeper
  Issue Type: Improvement
  Components: tools
Reporter: Soumitra Kumar


CLI supports get and set commands to read and write data. Add support for:
 # reading input data for set command from a file, and
 # writing output data in get command to a file

This will help in dealing with arbitrary byte arrays and also in scripting 
reads/writes to a large number of znodes using the CLI.
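
For illustration, hypothetical CLI invocations (the flag names are invented 
here, not part of any agreed design):

{code:java}
# write znode data from a file (hypothetical -f flag)
set -f /tmp/payload.bin /app/config

# read znode data into a file (hypothetical -o flag)
get -o /tmp/payload.bin /app/config
{code}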



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ZOOKEEPER-4759) Handle Netty CVE-2023-44487 and CVE-2023-39325

2023-10-12 Thread Jira
Aurélien Pupier created ZOOKEEPER-4759:
--

 Summary: Handle Netty CVE-2023-44487 and CVE-2023-39325 
 Key: ZOOKEEPER-4759
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4759
 Project: ZooKeeper
  Issue Type: Task
Reporter: Aurélien Pupier


https://netty.io/news/2023/10/10/4-1-100-Final.html



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ZOOKEEPER-4758) Upgrade snappy-java to 1.1.10.4 to fix CVE-2023-43642

2023-10-11 Thread Dhoka Pramod (Jira)
Dhoka Pramod created ZOOKEEPER-4758:
---

 Summary: Upgrade snappy-java to 1.1.10.4 to fix CVE-2023-43642
 Key: ZOOKEEPER-4758
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4758
 Project: ZooKeeper
  Issue Type: Bug
Affects Versions: 3.8.3
Reporter: Dhoka Pramod
 Fix For: 3.8.4


The SnappyInputStream was found to be vulnerable to Denial of Service (DoS) 
attacks when decompressing data with a too large chunk size. Due to missing 
upper bound check on chunk length, an unrecoverable fatal error can occur. All 
versions of snappy-java including the latest released version 1.1.10.3 are 
vulnerable to this issue. A fix has been introduced in commit `9f8c3cf74` which 
will be included in the 1.1.10.4 release. Users are advised to upgrade.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ZOOKEEPER-4757) Support JSON format logging

2023-10-10 Thread Jira
Jan Høydahl created ZOOKEEPER-4757:
--

 Summary: Support JSON format logging
 Key: ZOOKEEPER-4757
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4757
 Project: ZooKeeper
  Issue Type: Improvement
Reporter: Jan Høydahl


More and more enterprise users request structured JSON format logging for their 
applications. This removes the need for configuring custom log line parsers for 
every application when collecting logs centrally.

Zookeeper has flexible logging through Slf4j and Logback, for which there are 
several ways to achieve JSON logging. But for end users (such as Helm chart 
users) it is currently very difficult to achieve; ideally it should be as 
simple as a configuration option.

OpenTelemetry is a CNCF project that has become the defacto standard for 
metrics and traces collection. They also have a logging standard, and they 
recently [standardized on ECS JSON 
format|https://opentelemetry.io/blog/2023/ecs-otel-semconv-convergence/] as 
their log schema for OTEL-logging. Although there are other JSON formats in 
use, a pragmatic option is to only support ECS.
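
For reference, an ECS-formatted log line has roughly this shape (an 
illustrative example with invented values, not actual ZooKeeper output):

{code:java}
{"@timestamp":"2023-10-10T12:00:00.000Z","log.level":"INFO","message":"Started ZooKeeper server","ecs.version":"1.2.0","log.logger":"org.apache.zookeeper.server.ZooKeeperServer","process.thread.name":"main"}
{code}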

Proposed way to enable JSON logging:
{code:java}
export ZOO_LOG_FORMAT=json
bin/zkServer.sh start

# OR

bin/zkServer.sh start -Dzookeeper.log.format=json{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ZOOKEEPER-4756) Merge script should use GitHub api to merge pull requests

2023-10-06 Thread Andor Molnar (Jira)
Andor Molnar created ZOOKEEPER-4756:
---

 Summary: Merge script should use GitHub api to merge pull requests
 Key: ZOOKEEPER-4756
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4756
 Project: ZooKeeper
  Issue Type: Improvement
  Components: tools
Affects Versions: 3.9.0
Reporter: Andor Molnar


The GitHub merge script (zk-merge-pr.py) is a nice tool which does a lot of 
housekeeping tasks when merging a PR, including fixing the commit message and 
closing the Jira. Merging on the GitHub UI is also possible, but could lead to 
mistakes like leaving the commit message without the Jira id.

Unfortunately, when the script merges the PR it does so outside of GitHub, 
leaving the PR in 'Closed' rather than 'Merged' state. This is misleading. 
Let's improve the script to use the GitHub API for merging PRs and possibly 
disable merging on the GitHub UI.

Email thread:

[https://lists.apache.org/thread/cbmktklydtlylkybvq6jrx5m4l8b2cm5]

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ZOOKEEPER-4755) Handle Netty CVE-2023-4586

2023-10-03 Thread Damien Diederen (Jira)
Damien Diederen created ZOOKEEPER-4755:
--

 Summary: Handle Netty CVE-2023-4586
 Key: ZOOKEEPER-4755
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4755
 Project: ZooKeeper
  Issue Type: Task
Reporter: Damien Diederen
Assignee: Damien Diederen


The {{dependency-check:check}} goal currently fails with the following:

{noformat}
[ERROR] netty-handler-4.1.94.Final.jar: CVE-2023-4586(6.5)
{noformat}

According to https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2023-4586 , 
CVE-2023-4586 is reserved.  No fix or additional information is available as of 
the creation of this ticket.

We have to:

# Temporarily suppress the check;
# Monitor CVE-2023-4586 and apply the remediation as soon as it becomes 
available.




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ZOOKEEPER-4754) Update Jetty to avoid CVE-2023-36479, CVE-2023-40167, and CVE-2023-41900

2023-10-03 Thread Damien Diederen (Jira)
Damien Diederen created ZOOKEEPER-4754:
--

 Summary: Update Jetty to avoid CVE-2023-36479, CVE-2023-40167, and 
CVE-2023-41900
 Key: ZOOKEEPER-4754
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4754
 Project: ZooKeeper
  Issue Type: Task
Reporter: Damien Diederen
Assignee: Damien Diederen






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ZOOKEEPER-4753) Explicit handling of DIGEST-MD5 vs GSSAPI in quorum auth

2023-10-03 Thread Damien Diederen (Jira)
Damien Diederen created ZOOKEEPER-4753:
--

 Summary: Explicit handling of DIGEST-MD5 vs GSSAPI in quorum auth
 Key: ZOOKEEPER-4753
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4753
 Project: ZooKeeper
  Issue Type: Improvement
  Components: server
Affects Versions: 3.9.0
Reporter: Damien Diederen
Assignee: Damien Diederen


The SASL-based quorum authorizer does not explicitly distinguish between the 
DIGEST-MD5 and GSSAPI mechanisms: it is simply relying on {{NameCallback}} and 
{{PasswordCallback}} for authentication with the former and examining Kerberos 
principals in {{AuthorizeCallback}} for the latter.

It turns out that some SASL/DIGEST-MD5 configurations cause authentication and 
authorization IDs not to match the expected format, and the DIGEST-MD5-based 
portions of the quorum test suite to fail with obscure errors. (They can be 
traced to failures to join the quorum, but only by looking into detailed logs.)

We can use the login module name to determine whether DIGEST-MD5 or GSSAPI is 
used, and relax the authentication ID check for the former.  As a cleanup, we 
can keep the password-based credential map empty when Kerberos principals are 
expected.  Finally, we can adapt tests to ensure "weirdly-shaped" credentials 
only cause authentication failures in the GSSAPI case.
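
For illustration, a rough sketch of the mechanism check (assuming the JAAS 
configuration is consulted for the quorum login context; the helper and its use 
are mine, not the actual patch):

{code:java}
import javax.security.auth.login.AppConfigurationEntry;
import javax.security.auth.login.Configuration;

// Hypothetical helper: decide the SASL mechanism from the configured
// login module instead of inferring it from callback behavior.
static boolean usesKerberos(String loginContextName) {
    AppConfigurationEntry[] entries =
        Configuration.getConfiguration().getAppConfigurationEntry(loginContextName);
    if (entries == null) {
        return false;
    }
    for (AppConfigurationEntry entry : entries) {
        if (entry.getLoginModuleName().contains("Krb5LoginModule")) {
            return true;  // GSSAPI: expect Kerberos principals, no password map
        }
    }
    return false;  // DIGEST-MD5: relax the authentication ID check
}
{code}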



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ZOOKEEPER-4752) Remove version files in zookeeper-server/src/main from .gitignore

2023-10-03 Thread Istvan Toth (Jira)
Istvan Toth created ZOOKEEPER-4752:
--

 Summary: Remove version files  in zookeeper-server/src/main from 
.gitignore
 Key: ZOOKEEPER-4752
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4752
 Project: ZooKeeper
  Issue Type: Bug
  Components: build
Affects Versions: 3.8.2
Reporter: Istvan Toth


The Info.java and VersionInfoMain.java files are currently generated into the 
target/generated-sources directory. 

However .gitignore still includes the following lines for the main src 
directory.
{noformat}
zookeeper-server/src/main/java/org/apache/zookeeper/version/Info.java
zookeeper-server/src/main/java/org/apache/zookeeper/version/VersionInfoMain.java
{noformat}
Let's remove them.

I've just spent two hours trying to debug mysterious build failures, which were 
caused by an old Info.java file in src, which didn't show up in git status 
because of those out-of-date .gitignore entries.



 

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ZOOKEEPER-4751) Update snappy-java to 1.1.10.5 to address CVE-2023-43642

2023-09-30 Thread Lari Hotari (Jira)
Lari Hotari created ZOOKEEPER-4751:
--

 Summary: Update snappy-java to 1.1.10.5 to address CVE-2023-43642
 Key: ZOOKEEPER-4751
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4751
 Project: ZooKeeper
  Issue Type: Task
Reporter: Lari Hotari


snappy-java 1.1.10.1 contains CVE-2023-43642 . Upgrade the dependency to 
1.1.10.5 to get rid of the CVE.

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ZOOKEEPER-4750) RequestPathMetricsCollector does not align with FinalRequestProcessor

2023-09-30 Thread Kezhu Wang (Jira)
Kezhu Wang created ZOOKEEPER-4750:
-

 Summary: RequestPathMetricsCollector does not align with 
FinalRequestProcessor
 Key: ZOOKEEPER-4750
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4750
 Project: ZooKeeper
  Issue Type: Bug
  Components: server
Affects Versions: 3.9.0
Reporter: Kezhu Wang


For example, it does not handle {{createTTL}}.

{noformat}
2023-09-30 17:46:59,212 [myid:] - ERROR 
[SyncThread:0:o.a.z.s.u.RequestPathMetricsCollector@216] - We should not handle 
21
{noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ZOOKEEPER-4749) Request timeout is not respected for asynchronous api

2023-09-27 Thread Kezhu Wang (Jira)
Kezhu Wang created ZOOKEEPER-4749:
-

 Summary: Request timeout is not respected for asynchronous api
 Key: ZOOKEEPER-4749
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4749
 Project: ZooKeeper
  Issue Type: Bug
  Components: java client
Affects Versions: 3.9.0
Reporter: Kezhu Wang


"zookeeper.request.timeout" is only consulted in synchronous code path.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ZOOKEEPER-4748) quorum.QuorumCnxManager: BufferUnderflowException

2023-09-25 Thread Ke Han (Jira)
Ke Han created ZOOKEEPER-4748:
-

 Summary: quorum.QuorumCnxManager: BufferUnderflowException
 Key: ZOOKEEPER-4748
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4748
 Project: ZooKeeper
  Issue Type: Bug
  Components: quorum
Reporter: Ke Han
 Attachments: hbase--zookeeper-8db357045302.log, persistent.tar.gz

While running ZooKeeper (3.5.7, integrated in HBase 2.4.7), I encountered the 
following error message.
{code:java}
2023-09-25T11:24:41,326 ERROR [SendWorker:1] quorum.QuorumCnxManager: 
BufferUnderflowException
java.nio.BufferUnderflowException: null
        at java.nio.HeapByteBuffer.get(HeapByteBuffer.java:155) ~[?:1.8.0_362]
        at java.nio.ByteBuffer.get(ByteBuffer.java:723) ~[?:1.8.0_362]
        at 
org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.send(QuorumCnxManager.java:1083)
 ~[zookeeper-3.5.7.jar:3.5.7]
        at 
org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:1115)
 ~[zookeeper-3.5.7.jar:3.5.7] {code}
Here's the structure of my cluster
N0: ZK0, HMaster
N1: ZK1, Regionserver1
N2: ZK2, Regionserver2
N_100: HDFS

This error happened when I upgraded the HBase cluster; the zookeeper cluster 
also got restarted. 
The error message happens rarely. Considering its ERROR level, I am not sure 
whether it will cause other issues, but the cluster still seems to be working 
correctly. I noticed that the send() code remains the same in the new version, 
so I suspect it might also happen in the latest version. If it's benign, would 
it be better to output it at WARN level?

I have attached my full logs (persistent.tar.gz). The specific error occurred 
in hbase--zookeeper-8db357045302.log.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ZOOKEEPER-4747) Java api lacks synchronous version of sync() call

2023-09-24 Thread Kezhu Wang (Jira)
Kezhu Wang created ZOOKEEPER-4747:
-

 Summary: Java api lacks synchronous version of sync() call
 Key: ZOOKEEPER-4747
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4747
 Project: ZooKeeper
  Issue Type: New Feature
  Components: java client
Reporter: Kezhu Wang
Assignee: Kezhu Wang
 Fix For: 3.10.0


Ideally, it should be redundant just as what [~breed] says in ZOOKEEPER-1167.

{quote}
it wasn't an oversight. there is no reason for a synchronous version. because 
of the ordering guarantees, if you issue an asynchronous sync, the next call, 
whether synchronous or asynchronous will see the updated state.
{quote}

But in case of connection loss, and absent ZOOKEEPER-22, the client has to 
check the result of the asynchronous sync before the next call. So, currently, 
we can't simply issue a fire-and-forget asynchronous sync followed by a read to 
gain strong consistency. In a synchronous call chain, the client therefore has 
to convert the asynchronous {{sync}} into a synchronous one to gain strong 
consistency. This is what I do in 
[EagerACLFilterTest::syncClient|https://github.com/apache/zookeeper/blob/f42c01de73867ffbc12707b3e9f9cd7f847fe462/zookeeper-server/src/test/java/org/apache/zookeeper/server/quorum/EagerACLFilterTest.java#L98];
 it is apparently unfriendly to end users.
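
For reference, the conversion boils down to something like this (a minimal 
sketch wrapping the existing asynchronous API; the helper name is mine):

{code:java}
import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;

// Blocks until the async sync() completes, surfacing errors such as
// connection loss as KeeperException, similar in spirit to the
// EagerACLFilterTest helper referenced above.
static void syncBlocking(ZooKeeper zk, String path)
        throws InterruptedException, KeeperException {
    CountDownLatch latch = new CountDownLatch(1);
    int[] rc = new int[1];
    zk.sync(path, (code, p, ctx) -> {
        rc[0] = code;
        latch.countDown();
    }, null);
    latch.await();
    if (rc[0] != KeeperException.Code.OK.intValue()) {
        throw KeeperException.create(KeeperException.Code.get(rc[0]), path);
    }
}
{code}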



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ZOOKEEPER-4746) cppunit tests hang and cancelled

2023-09-20 Thread Kezhu Wang (Jira)
Kezhu Wang created ZOOKEEPER-4746:
-

 Summary: cppunit tests hang and cancelled
 Key: ZOOKEEPER-4746
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4746
 Project: ZooKeeper
  Issue Type: Test
  Components: tests
Affects Versions: 3.10.0
Reporter: Kezhu Wang


* https://github.com/apache/zookeeper/actions/runs/6007712384/job/16337953123
* https://github.com/apache/zookeeper/actions/runs/6047057349/job/16409786315
* https://github.com/apache/zookeeper/actions/runs/6195151365/job/16819317479
* https://github.com/apache/zookeeper/actions/runs/6196548582/job/16823409398

They hang for too long and get cancelled by the runner.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ZOOKEEPER-4745) End to End tests fail occasionally

2023-09-20 Thread Kezhu Wang (Jira)
Kezhu Wang created ZOOKEEPER-4745:
-

 Summary: End to End tests fail occasionally
 Key: ZOOKEEPER-4745
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4745
 Project: ZooKeeper
  Issue Type: Test
  Components: tests
Affects Versions: 3.10.0
Reporter: Kezhu Wang


I saw:
* https://github.com/apache/zookeeper/actions/runs/5587157838/job/15131211778
* https://github.com/kezhuw/zookeeper/actions/runs/5251205631/job/14209201285
* 
https://github.com/kezhuw/zookeeper/actions/runs/6198985701/job/16830576384#step:9:38
* 
https://github.com/apache/zookeeper/actions/runs/6244974218/job/16952757583#step:11:44

{noformat}
2023-07-18 12:08:34,046 [myid:] - ERROR [main:o.a.z.u.ServiceUtils@48] - 
Exiting JVM with code 1
ZooKeeper JMX enabled by default
Using config: 
/home/runner/work/zookeeper/zookeeper/apache-zookeeper-3.7.0-bin/bin/../conf/zoo_sample.cfg
Stopping zookeeper ... STOPPED
Traceback (most recent call last):
  File "/home/runner/work/zookeeper/zookeeper/tools/ci/test-connectivity.py", 
line 48, in 
subprocess.run([f'{client_binpath}', 'sync', '/'], check=True)
  File "/usr/lib/python3.10/subprocess.py", line 524, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command 
'['/home/runner/work/zookeeper/zookeeper/bin/zkCli.sh', 'sync', '/']' returned 
non-zero exit status 1.
Error: Process completed with exit code 1.
{noformat}

I guess it could be caused by the asynchronous start.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ZOOKEEPER-4744) Zookeeper fails to start after power failure

2023-09-15 Thread Maria Ramos (Jira)
Maria Ramos created ZOOKEEPER-4744:
--

 Summary: Zookeeper fails to start after power failure
 Key: ZOOKEEPER-4744
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4744
 Project: ZooKeeper
  Issue Type: Bug
Affects Versions: 3.7.1
 Environment: These are the configurations of the ZooKeeper cluster 
(omitting IPs):

{{tickTime=2000}}
{{dataDir=/home/data/zk37}}
{{clientPort=2181}}
{{maxClientCnxns=60}}
{{initLimit=100}}
{{syncLimit=100}}
{{server.1=[IP1]:2888:3888}}
{{server.2=[IP2]:2888:3888}}
{{server.3=[IP3]:2888:3888}}
Reporter: Maria Ramos
 Attachments: reported_error.txt

The underlying issue stems from consecutive writes to the log file that are not 
interleaved with {{fsync}} operations. This is a well-documented behavior of 
operating systems, and there are several references addressing this problem:
 - 
[https://www.usenix.org/conference/osdi14/technical-sessions/presentation/pillai]
 - [https://dl.acm.org/doi/pdf/10.1145/2872362.2872406]
 - [https://mariadb.com/kb/en/atomic-write-support/]
 - [https://pages.cs.wisc.edu/~remzi/OSTEP/file-journaling.pdf] (Page 9)

This issue can be replicated using 
[LazyFS|https://github.com/dsrhaslab/lazyfs], a file system capable of 
simulating power failures and exhibiting the OS behavior mentioned above, i.e., 
the out-of-order file writes at the disk level. LazyFS persists these writes 
out of order and then crashes to simulate a power failure.

To reproduce this problem, one can follow these steps:

   {*}1{*}. Mount LazyFS on a directory where ZooKeeper data will be saved, 
with a specified root directory. Assuming the data path for ZooKeeper is 
{{/home/data/zk}} and the root directory is {{{}/home/data/zk-root{}}}, add the 
following lines to the default configuration file (the 
{{config/default.toml}} file):

{{[[injection]] }}
{{type="reorder" }}
{{occurrence=1 }}
{{op="write" }}
{{file="/home/data/zk-root/version-2/log.10001" }}
{{persist=[3]}}

These lines define a fault to be injected. A power failure will be simulated 
after the third write to the {{/home/data/zk-root/version-2/log.10001}} 
file. The `occurrence` parameter allows specifying that this is the first group 
where this happens, as there might be more than one group of consecutive writes.

   {*}2{*}. Start LazyFS as the underlying file system of a node in the 
cluster with the following command:

{{     ./scripts/mount-lazyfs.sh -c config/default.toml -m /home/data/zk -r 
/home/data/zk-root -f}}


   {*}3{*}. Start ZooKeeper with the command:
{{     apache-zookeeper-3.7.1-bin/bin/zkServer.sh start-foreground}}


   {*}4{*}. Connect a client to the node that has LazyFS as the underlying file 
system:

          {{apache-zookeeper-3.7.1-bin/bin/zkCli.sh -server 127.0.0.1:2181}}

Immediately after this step, LazyFS will be unmounted, simulating a power 
failure, and ZooKeeper will keep printing error messages in the terminal, 
requiring a forced shutdown.
At this point, one can analyze the logs produced by LazyFS to examine the 
system calls issued up to the moment of the fault. Here is a simplified version 
of the log:

{'syscall': 'create', 'path': 
'/home/gsd/data/zk37-root/version-2/log.10001', 'mode': 'O_TRUNC'} 
{'syscall': 'write', 'path': '/home/data/zk37-root/version-2/log.10001', 
'size': '16', 'off': '0'} 
{'syscall': 'write', 'path': '/home/data/zk37-root/version-2/log.10001', 
'size': '1', 'off': '67108879'} 
{'syscall': 'write', 'path': '/home/data/zk37-root/version-2/log.10001', 
'size': '67108863', 'off': '16'} 
{'syscall': 'write', 'path': '/home/data/zk37-root/version-2/log.10001', 
'size': '61', 'off': '16'} 

Note that the third write is issued by LazyFS for padding.

 
   {*}5{*}. Remove the fault from the configuration file, unmount the file 
system with

           {{fusermount -uz /home/data/zk}}

   {*}6{*}. Mount LazyFS again with the previously provided command.

   {*}7{*}. Attempt to start ZooKeeper (it fails).

By following these steps, one can replicate the issue and analyze the effects 
of the power failure on ZooKeeper's restart process.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ZOOKEEPER-4743) Jump from -2 to 0 when zookeeper increments znode dataVersion

2023-09-15 Thread HaiyuanZhao (Jira)
HaiyuanZhao created ZOOKEEPER-4743:
--

 Summary: Jump from -2 to 0 when zookeeper increments znode 
dataVersion
 Key: ZOOKEEPER-4743
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4743
 Project: ZooKeeper
  Issue Type: Improvement
Reporter: HaiyuanZhao






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ZOOKEEPER-4742) Config watch path get truncated abnormally for chroot "/zoo" or alikes

2023-09-15 Thread Kezhu Wang (Jira)
Kezhu Wang created ZOOKEEPER-4742:
-

 Summary: Config watch path get truncated abnormally for chroot 
"/zoo" or alikes
 Key: ZOOKEEPER-4742
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4742
 Project: ZooKeeper
  Issue Type: Bug
  Components: java client
Affects Versions: 3.8.2, 3.9.0, 3.7.1
Reporter: Kezhu Wang
Assignee: Kezhu Wang


This is a leftover of ZOOKEEPER-4565, split from 
[pr#1996|https://github.com/apache/zookeeper/pull/1996] to keep ZOOKEEPER-4601 
focused.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ZOOKEEPER-4741) High latency under heavy load when prometheus metrics enabled

2023-09-11 Thread Yike Xiao (Jira)
Yike Xiao created ZOOKEEPER-4741:


 Summary: High latency under heavy load when prometheus metrics 
enabled
 Key: ZOOKEEPER-4741
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4741
 Project: ZooKeeper
  Issue Type: Bug
  Components: server
Affects Versions: 3.8.2, 3.6.4
 Environment: zookeeper version: 3.6.4

kernel: 3.10.0-1160.95.1.el7.x86_64

java version "1.8.0_111"

 

metricsProvider.className=org.apache.zookeeper.metrics.prometheus.PrometheusMetricsProvider
Reporter: Yike Xiao
 Attachments: 32010.threaddump.001.txt, 32010.wallclock.profile.html, 
image-2023-09-11-16-17-21-166.png

In our production, we use the ZooKeeper built-in PrometheusMetricsProvider to 
monitor ZooKeeper status. Recently we observed very high latency in one of our 
ZooKeeper clusters, which serves a heavy load.

Measured on a heavily loaded client, the latency could exceed 25 seconds. 

!image-2023-09-11-16-17-21-166.png!

We observed many connections with high Recv-Q on the server side. 

CommitProcWorkThread is *BLOCKED* in 
{{org.apache.zookeeper.server.ServerStats#updateLatency}}:
{noformat}
"CommitProcWorkThread-15" #21595 daemon prio=5 os_prio=0 tid=0x7f86d804a000 nid=0x6bca waiting for monitor entry [0x7f86deb95000]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at io.prometheus.client.CKMSQuantiles.insert(CKMSQuantiles.java:91)
        - waiting to lock <0x000784dd1a18> (a io.prometheus.client.CKMSQuantiles)
        at io.prometheus.client.TimeWindowQuantiles.insert(TimeWindowQuantiles.java:38)
        at io.prometheus.client.Summary$Child.observe(Summary.java:281)
        at io.prometheus.client.Summary.observe(Summary.java:307)
        at org.apache.zookeeper.metrics.prometheus.PrometheusMetricsProvider$PrometheusSummary.add(PrometheusMetricsProvider.java:355)
        at org.apache.zookeeper.server.ServerStats.updateLatency(ServerStats.java:153)
        at org.apache.zookeeper.server.FinalRequestProcessor.updateStats(FinalRequestProcessor.java:669)
        at org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:585)
        at org.apache.zookeeper.server.quorum.CommitProcessor$CommitWorkRequest.doWork(CommitProcessor.java:545)
        at org.apache.zookeeper.server.WorkerService$ScheduledWorkRequest.run(WorkerService.java:154)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
{noformat}
The wall lock profile shows that there is lock contention within the 
{{CommitProcWorkThread}} threads.

!https://gitlab.dev.zhaopin.com/sucheng.wang/notes/uploads/b9da2552d6b00c3f9130d87caf01325e/image.png!

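For illustration, the contention is reproducible outside ZooKeeper with a plain 
simpleclient {{Summary}} (a self-contained, hedged sketch; the metric name, 
thread count, and iteration count are arbitrary):

{code:java}
import io.prometheus.client.Summary;

// Many threads observing into one Summary configured with quantiles all
// serialize on the shared CKMSQuantiles sketch, mirroring the BLOCKED
// CommitProcWorkThread stacks above.
public class SummaryContention {
    public static void main(String[] args) throws InterruptedException {
        Summary latency = Summary.build()
                .name("demo_latency").help("illustration only")
                .quantile(0.5, 0.05)
                .quantile(0.99, 0.001)
                .register();
        Runnable worker = () -> {
            for (int i = 0; i < 1_000_000; i++) {
                latency.observe(i % 100); // each observe() locks the quantile sketch
            }
        };
        Thread[] threads = new Thread[16];
        for (int i = 0; i < threads.length; i++) {
            threads[i] = new Thread(worker);
            threads[i].start();
        }
        for (Thread t : threads) {
            t.join();
        }
    }
}
{code}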



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ZOOKEEPER-4740) I want to use kerberos for Zookeeper, but my authentication has been unsuccessful

2023-09-01 Thread LiJie2023 (Jira)
LiJie2023 created ZOOKEEPER-4740:


 Summary: I want to use kerberos for Zookeeper, but my 
authentication has been unsuccessful
 Key: ZOOKEEPER-4740
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4740
 Project: ZooKeeper
  Issue Type: Wish
  Components: kerberos
Affects Versions: 3.5.9
Reporter: LiJie2023
 Attachments: image-2023-09-01-16-37-20-848.png

zookeeper_jaas.conf
{code:java}
Server {
 com.sun.security.auth.module.Krb5LoginModule required
 useKeyTab=true
 storeKey=true
 useTicketCache=false
 keyTab="/opt/test2.keytab"
 principal="test2/bigdata.hadoop.master01";
};

Client {
 com.sun.security.auth.module.Krb5LoginModule required
 useKeyTab=true
 keyTab="/opt/test2.keytab"
 principal="test2/bigdata.hadoop.master01"
 useTicketCache=false
 debug=true;
}; {code}
[root@bigdata conf]# cat java.env
{code:java}
export 
JVMFLAGS="-Djava.security.auth.login.config=/usr/lib/zookeeper/conf/zookeeper_jaas.conf"
 {code}
/etc/krb5.conf
{code:java}
# Configuration snippets may be placed in this directory as well
includedir /etc/krb5.conf.d/

[logging]
 default = FILE:/var/log/krb5libs.log
 kdc = FILE:/var/log/krb5kdc.log
 admin_server = FILE:/var/log/kadmind.log

[libdefaults]
 dns_lookup_realm = false
 ticket_lifetime = 24h
 renew_lifetime = 7d
 forwardable = true
 rdns = false
 default_realm = EXAMPLE.COM
 default_ccache_name = KEYRING:persistent:%{uid}

[realms]
 EXAMPLE.COM = {
  kdc = bigdata.hadoop.master01
  admin_server = bigdata.hadoop.master01
 }

[domain_realm]
.bigdata.hadoop.master01 = EXAMPLE.COM
bigdata.hadoop.master01 = EXAMPLE.COM {code}
!image-2023-09-01-16-37-20-848.png!

 

 

When I use a client connection:
{code:java}
zookeeper-client -server localhost:12181 {code}
Connecting to localhost:12181 2023-09-01 16:38:05,528 - INFO  
[main:Environment@109] - Client 
environment:zookeeper.version=3.5.9-83df9301aa5c2a5d284a9940177808c01bc35cef, 
built on 10/25/2022 23:07 GMT 2023-09-01 16:38:05,530 - INFO  
[main:Environment@109] - Client environment:host.name=bigdata.hadoop.master01 
2023-09-01 16:38:05,530 - INFO  [main:Environment@109] - Client 
environment:java.version=1.8.0_351 2023-09-01 16:38:05,532 - INFO  
[main:Environment@109] - Client environment:java.vendor=Oracle Corporation 
2023-09-01 16:38:05,532 - INFO  [main:Environment@109] - Client 
environment:java.home=/usr/java/jdk1.8.0_351-amd64/jre 2023-09-01 16:38:05,532 
- INFO  [main:Environment@109] - Client 
environment:java.class.path=/usr/lib/zookeeper/bin/../zookeeper-server/target/classes:/usr/lib/zookeeper/bin/../build/classes:/usr/lib/zookeeper/bin/../zookeeper-server/target/lib/*.jar:/usr/lib/zookeeper/bin/../build/lib/*.jar:/usr/lib/zookeeper/bin/../lib/zookeeper-jute-3.5.9.jar:/usr/lib/zookeeper/bin/../lib/zookeeper-3.5.9.jar:/usr/lib/zookeeper/bin/../lib/slf4j-log4j12-1.7.25.jar:/usr/lib/zookeeper/bin/../lib/slf4j-api-1.7.25.jar:/usr/lib/zookeeper/bin/../lib/netty-transport-native-unix-common-4.1.50.Final.jar:/usr/lib/zookeeper/bin/../lib/netty-transport-native-epoll-4.1.50.Final.jar:/usr/lib/zookeeper/bin/../lib/netty-transport-4.1.50.Final.jar:/usr/lib/zookeeper/bin/../lib/netty-resolver-4.1.50.Final.jar:/usr/lib/zookeeper/bin/../lib/netty-handler-4.1.50.Final.jar:/usr/lib/zookeeper/bin/../lib/netty-common-4.1.50.Final.jar:/usr/lib/zookeeper/bin/../lib/netty-codec-4.1.50.Final.jar:/usr/lib/zookeeper/bin/../lib/netty-buffer-4.1.50.Final.jar:/usr/lib/zookeeper/bin/../lib/log4j-1.2.17.jar:/usr/lib/zookeeper/bin/../lib/json-simple-1.1.1.jar:/usr/lib/zookeeper/bin/../lib/jline-2.14.6.jar:/usr/lib/zookeeper/bin/../lib/jetty-util-ajax-9.4.35.v20201120.jar:/usr/lib/zookeeper/bin/../lib/jetty-util-9.4.35.v20201120.jar:/usr/lib/zookeeper/bin/../lib/jetty-servlet-9.4.35.v20201120.jar:/usr/lib/zookeeper/bin/../lib/jetty-server-9.4.35.v20201120.jar:/usr/lib/zookeeper/bin/../lib/jetty-security-9.4.35.v20201120.jar:/usr/lib/zookeeper/bin/../lib/jetty-io-9.4.35.v20201120.jar:/usr/lib/zookeeper/bin/../lib/jetty-http-9.4.35.v20201120.jar:/usr/lib/zookeeper/bin/../lib/javax.servlet-api-3.1.0.jar:/usr/lib/zookeeper/bin/../lib/jackson-databind-2.10.5.1.jar:/usr/lib/zookeeper/bin/../lib/jackson-core-2.10.5.jar:/usr/lib/zookeeper/bin/../lib/jackson-annotations-2.10.5.jar:/usr/lib/zookeeper/bin/../lib/commons-cli-1.2.jar:/usr/lib/zookeeper/bin/../lib/audience-annotations-0.5.0.jar:/usr/lib/zookeeper/bin/../zookeeper-jute.jar:/usr/lib/zookeeper/bin/../zookeeper-jute-3.5.9.jar:/usr/lib/zookeeper/bin/../zookeeper-3.5.9.jar:/usr/lib/zookeeper/bin/../zookeeper-server/src/main/resources/lib/*.jar:/etc/zookeeper/conf::/etc/zookeeper/conf:/usr/lib/zookeeper/zookeeper-3.5.9.jar:/usr/lib/zookeeper/zookeeper-jute-3.5.9.jar:/usr/lib/zookeeper/zookeeper-jute.jar:/usr/lib/zookeeper/zookeeper.jar:/usr/lib/zookeeper/lib/audience-annotations-0.5.0.jar:/usr/lib/zookeeper/lib/commons-cli-

[jira] [Created] (ZOOKEEPER-4739) Disable netty-tcnative for s390x arch

2023-08-31 Thread Vibhuti Sawant (Jira)
Vibhuti Sawant created ZOOKEEPER-4739:
-

 Summary: Disable netty-tcnative for s390x arch
 Key: ZOOKEEPER-4739
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4739
 Project: ZooKeeper
  Issue Type: Bug
  Components: server
Affects Versions: 3.9.0, 3.9.1
Reporter: Vibhuti Sawant


As netty-tcnative is not supported on the s390x architecture, test failures 
were observed in ClientSSLTest; this test is therefore skipped on s390x only.
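
One way to express such an architecture-specific skip (a hedged sketch using 
the JUnit 5 {{Assumptions}} API; the actual patch may differ):

{code:java}
import static org.junit.jupiter.api.Assumptions.assumeFalse;

import org.junit.jupiter.api.BeforeEach;

public class ClientSSLTest {

    @BeforeEach
    public void skipOnS390x() {
        // Abort (rather than fail) every test in this class on s390x,
        // where netty-tcnative has no native build.
        assumeFalse("s390x".equals(System.getProperty("os.arch")),
                "netty-tcnative is not supported on s390x");
    }
}
{code}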



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ZOOKEEPER-4738) Clean test cases by refactoring assertFalse(equals()) with assertNotEquals

2023-08-29 Thread Taher Ghaleb (Jira)
Taher Ghaleb created ZOOKEEPER-4738:
---

 Summary: Clean test cases by refactoring assertFalse(equals()) 
with assertNotEquals
 Key: ZOOKEEPER-4738
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4738
 Project: ZooKeeper
  Issue Type: Improvement
Reporter: Taher Ghaleb
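
For illustration, the kind of rewrite proposed here (a hypothetical JUnit 5 
test; the class name and values are made up):

{code:java}
import static org.junit.jupiter.api.Assertions.assertFalse;
import static org.junit.jupiter.api.Assertions.assertNotEquals;

import org.junit.jupiter.api.Test;

public class VersionCheckTest {

    @Test
    public void versionsDiffer() {
        String expected = "3.9.0";
        String actual = "3.9.1";

        // Before: a failure only reports "expected: <false> but was: <true>"
        assertFalse(expected.equals(actual));

        // After: a failure reports both compared values
        assertNotEquals(expected, actual);
    }
}
{code}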






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ZOOKEEPER-4737) Error occurs with the zookeeper_interest() function in version 3.5.8 of zookeeper-client-c

2023-08-20 Thread wangyuzhi (Jira)
wangyuzhi created ZOOKEEPER-4737:


 Summary: Error occurs with the zookeeper_interest() function in 
version 3.5.8 of zookeeper-client-c
 Key: ZOOKEEPER-4737
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4737
 Project: ZooKeeper
  Issue Type: Bug
  Components: c client
Affects Versions: 3.5.8
 Environment: zookeeper server version: 3.4.12
Reporter: wangyuzhi


"I encountered an intermittent error while using zookeeper-client-c:
/lib64/libc.so.6(cfree+0x1c) [0x7f3f9e74f4dc]
/lib64/libc.so.6(_IO_free_backup_area+0x1a) [0x7f3f9e74713a]
/lib64/libc.so.6(_IO_file_overflow+0x1d5) [0x7f3f9e7468d5]
/lib64/libc.so.6(_IO_file_xsputn+0xb1) [0x7f3f9e745651]
/lib64/libc.so.6(_IO_vfprintf+0x151d) [0x7f3f9e71769d]
/lib64/libc.so.6(_IO_fprintf+0x87) [0x7f3f9e720827]
proxy(log_message+0x20c) [0x8c39d2]
proxy(zookeeper_interest+0x16b) [0x8b2c5c]
We are only using ZooKeeper for service discovery, but this error occurs every 
once in a while.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ZOOKEEPER-4736) socket fd leak

2023-08-14 Thread lchq (Jira)
lchq created ZOOKEEPER-4736:
---

 Summary: socket fd leak
 Key: ZOOKEEPER-4736
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4736
 Project: ZooKeeper
  Issue Type: Bug
  Components: java client, server
Affects Versions: 3.9.0, 3.8.0, 3.7.0, 3.6.3
 Environment: zookeeper 3.6.3  !4cea510a57af58c08e73d146e8535ee4.jpg!
Reporter: lchq
 Attachments: 4cea510a57af58c08e73d146e8535ee4.jpg, 
IMG_20230815_114433.jpg

If the network service is unavailable (e.g. after "ifdown eth0" or "service 
network stop"), a zk-client process running on that node will leak file 
descriptors. This happens when invoking "new ZooKeeper(..)".

While the network is unavailable, the ClientCnxn::SendThread::run() method 
repeatedly calls startConnect() and hits "SocketException: Network is 
unreachable". The exception handler catches this and calls 
SendThread::cleanup() to clean up, but because 
ClientCnxnSocketNIO::registerAndConnect registers the socket with the selector 
first and only then calls sock.connect, the socket's fd cannot be closed.

Swapping the order of sock.connect and sock.register resolves the issue 
without changing the original behavior, because the registration only takes 
effect once selector.select(waitTimeOut) is triggered, as sketched below.
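
A minimal sketch of the proposed reordering, assuming the current shape of 
{{ClientCnxnSocketNIO#registerAndConnect}} ({{sockKey}}, {{selector}}, and 
{{sendThread}} are fields of the enclosing class; the committed fix may 
differ):

{code:java}
// Today the channel is registered with the selector before connect(); if
// connect() then throws ("Network is unreachable"), the registration keeps
// the fd from being released. Connecting first and registering only on
// success avoids that, and the registration still only takes effect once
// selector.select(waitTimeOut) runs.
void registerAndConnect(SocketChannel sock, InetSocketAddress addr) throws IOException {
    boolean immediateConnect = sock.connect(addr); // may throw; fd still closable
    sockKey = sock.register(selector, SelectionKey.OP_CONNECT);
    if (immediateConnect) {
        sendThread.primeConnection();
    }
}
{code}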



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ZOOKEEPER-4735) set the RMI port to address issues with monitoring Zookeeper running in containers

2023-08-14 Thread Enrico Olivelli (Jira)
Enrico Olivelli created ZOOKEEPER-4735:
--

 Summary: set the RMI port to address issues with monitoring 
Zookeeper running in containers
 Key: ZOOKEEPER-4735
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4735
 Project: ZooKeeper
  Issue Type: Improvement
  Components: server
Reporter: Enrico Olivelli
 Fix For: 3.10.0






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ZOOKEEPER-4734) FuzzySnapshotRelatedTest becomes flaky when transient disk failure appears

2023-08-11 Thread Haoze Wu (Jira)
Haoze Wu created ZOOKEEPER-4734:
---

 Summary: FuzzySnapshotRelatedTest becomes flaky when transient 
disk failure appears
 Key: ZOOKEEPER-4734
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4734
 Project: ZooKeeper
  Issue Type: Bug
  Components: tests
Affects Versions: 3.6.0
Reporter: Haoze Wu


In testPZxidUpdatedWhenLoadingSnapshot(), a quorum server is stopped and 
restarted to test loading of snapshots. However, while the quorum server is 
restarting, it calls into ZKDatabase#loadDataBase(), from which an IOException 
can be thrown because of a transient disk failure. 
{code:java}
public long loadDataBase() throws IOException {
    long zxid = snapLog.restore(dataTree, sessionsWithTimeouts,
            commitProposalPlaybackListener); // line 240; IOException thrown here
    initialized = true;
    return zxid;
} {code}
In FileTxnSnapLog#restore

 
{code:java}
public long restore(DataTree dt, Map<Long, Integer> sessions,
                    PlayBackListener listener) throws IOException {
    long deserializeResult = snapLog.deserialize(dt, sessions); // IOException
    ...
}{code}
Here is the stacktrace: 
{code:java}
        at 
org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java)
        at 
org.apache.zookeeper.server.ZKDatabase.loadDataBase(ZKDatabase.java:240)
        at 
org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:862)
        at 
org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:848)
        at 
org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:201)
        at 
org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:124)
        at 
org.apache.zookeeper.server.quorum.QuorumPeerTestBase$MainThread.run(QuorumPeerTestBase.java:330)
        at java.lang.Thread.run(Thread.java:748) {code}
Finally, because of this IOException, the restart fails and so does the test.

In terms of a fix, we could either retry the test, as proposed in 
ZOOKEEPER-3157, or add a configurable retry mechanism to 
ZKDatabase#loadDataBase() to tolerate transient disk failures (sketched below).
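
A hedged sketch of what such a retry inside ZKDatabase could look like (the 
wrapper method and the system property name are hypothetical):

{code:java}
// Hypothetical bounded retry around the existing loadDataBase(); the
// "zookeeper.db.load.retries" property is invented for illustration.
public long loadDataBaseWithRetry() throws IOException {
    final int maxRetries = Integer.getInteger("zookeeper.db.load.retries", 3);
    IOException lastFailure = null;
    for (int attempt = 1; attempt <= maxRetries; attempt++) {
        try {
            return loadDataBase();
        } catch (IOException e) {
            lastFailure = e; // disk failure may be transient: remember and retry
        }
    }
    throw lastFailure;
}
{code}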

 

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ZOOKEEPER-4733) non-return function error and asan error in CPPUNIT TESTs

2023-08-04 Thread whyer (Jira)
whyer created ZOOKEEPER-4733:


 Summary:  non-return function error and asan error in CPPUNIT TESTs
 Key: ZOOKEEPER-4733
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4733
 Project: ZooKeeper
  Issue Type: Bug
  Components: c client
Affects Versions: 3.8.2
 Environment: gcc (Debian 6.3.0-18+deb9u1) 6.3.0 20170516
Reporter: whyer


When the Werror=return-type check is enabled in gcc, the following error occurs:
{quote}zookeeper/zookeeper-client/zookeeper-client-c/tests/ZooKeeperQuorumServer.cc:
 In static member function ‘static std::vector 
ZooKeeperQuorumServer::getCluster(uint32_t, 
ZooKeeperQuorumServer::tConfigPairs, std::__cxx11::string)’:
zookeeper/zookeeper-client/zookeeper-client-c/tests/ZooKeeperQuorumServer.cc:230:1:
 error: control reaches end of non-void function [-Werror=return-type]
 }{quote}

When the ASan option is enabled for the cppunit tests, the following error occurs:
{quote}1: 
Zookeeper_reconfig::testMigrationCycle=
1: ==415554==ERROR: AddressSanitizer: heap-buffer-overflow on address 
0x60d0cf6f at pc 0x560905cbbd12 bp 0x7ffe32d10af0 sp 0x7ffe32d10ae8
1: READ of size 1 at 0x60d0cf6f thread T0
1: #0 0x560905cbbd11 in Zookeeper_reconfig::testMigrationCycle() 
zookeeper/zookeeper-client/zookeeper-client-c/tests/TestReconfig.cc:502
1: #1 0x560905cc21a1 in CppUnit::TestCaller::runTest() 
/usr/include/cppunit/TestCaller.h:166
1: #2 0x7fb8248815b1 in CppUnit::TestCaseMethodFunctor::operator()() const 
(/usr/lib/x86_64-linux-gnu/libcppunit-1.13.so.0+0x235b1)
1: #3 0x7fb824877eb2 in CppUnit::DefaultProtector::protect(CppUnit::Functor 
const&, CppUnit::ProtectorContext const&) 
(/usr/lib/x86_64-linux-gnu/libcppunit-1.13.so.0+0x19eb2)
1: #4 0x7fb82487e7e1 in CppUnit::ProtectorChain::protect(CppUnit::Functor 
const&, CppUnit::ProtectorContext const&) 
(/usr/lib/x86_64-linux-gnu/libcppunit-1.13.so.0+0x207e1)
1: #5 0x7fb824886e4f in CppUnit::TestResult::protect(CppUnit::Functor 
const&, CppUnit::Test*, std::__cxx11::basic_string, std::allocator > const&) 
(/usr/lib/x86_64-linux-gnu/libcppunit-1.13.so.0+0x28e4f)
1: #6 0x7fb82488138f in CppUnit::TestCase::run(CppUnit::TestResult*) 
(/usr/lib/x86_64-linux-gnu/libcppunit-1.13.so.0+0x2338f)
1: #7 0x7fb8248818e2 in 
CppUnit::TestComposite::doRunChildTests(CppUnit::TestResult*) 
(/usr/lib/x86_64-linux-gnu/libcppunit-1.13.so.0+0x238e2)
1: #8 0x7fb8248817fd in CppUnit::TestComposite::run(CppUnit::TestResult*) 
(/usr/lib/x86_64-linux-gnu/libcppunit-1.13.so.0+0x237fd)
1: #9 0x7fb8248818e2 in 
CppUnit::TestComposite::doRunChildTests(CppUnit::TestResult*) 
(/usr/lib/x86_64-linux-gnu/libcppunit-1.13.so.0+0x238e2)
1: #10 0x7fb8248817fd in CppUnit::TestComposite::run(CppUnit::TestResult*) 
(/usr/lib/x86_64-linux-gnu/libcppunit-1.13.so.0+0x237fd)
1: #11 0x7fb824886d71 in CppUnit::TestResult::runTest(CppUnit::Test*) 
(/usr/lib/x86_64-linux-gnu/libcppunit-1.13.so.0+0x28d71)
1: #12 0x7fb82488947d in CppUnit::TestRunner::run(CppUnit::TestResult&, 
std::__cxx11::basic_string, std::allocator > 
const&) (/usr/lib/x86_64-linux-gnu/libcppunit-1.13.so.0+0x2b47d)
1: #13 0x560905c918a5 in main 
zookeeper/zookeeper-client/zookeeper-client-c/tests/TestDriver.cc:152
1: #14 0x7fb8230b42e0 in __libc_start_main 
(/lib/x86_64-linux-gnu/libc.so.6+0x202e0)
1: #15 0x560905c914c9 in _start 
(build/zookeeper/zookeeper-client/zookeeper-client-c/zktest+0x154c9)
1: 
1: 0x60d0cf6f is located 1 bytes to the left of 138-byte region 
[0x60d0cf70,0x60d0cffa)
1: allocated by thread T0 here:
1: #0 0x7fb824b5fbf0 in operator new(unsigned long) 
(/usr/lib/x86_64-linux-gnu/libasan.so.3+0xc2bf0)
1: #1 0x7fb823a5f1f6  (/usr/lib/x86_64-linux-gnu/libstdc++.so.6+0xf6)
1: #2 0x7fb823a5bcb9 in std::ostream& std::ostream::_M_insert(unsigned long) (/usr/lib/x86_64-linux-gnu/libstdc++.so.6+0x10dcb9)
1: #3 0x7ffe32d1090f  ()
1: 
1: SUMMARY: AddressSanitizer: heap-buffer-overflow 
zookeeper/zookeeper-client/zookeeper-client-c/tests/TestReconfig.cc:502 in 
Zookeeper_reconfig::testMigrationCycle()
1: Shadow bytes around the buggy address:
1:   0x0c1a7fff9990: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
1:   0x0c1a7fff99a0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
1:   0x0c1a7fff99b0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
1:   0x0c1a7fff99c0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
1:   0x0c1a7fff99d0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
1: =>0x0c1a7fff99e0: fa fa fa fa fa fa fa fa fa fa fa fa fa[fa]00 00
1:   0x0c1a7fff99f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 02
1:   0x0c1a7fff9a00: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
1:   0x0c1a7fff9a10: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
1:   0x0c1a7fff9a20: fa fa fa fa fa fa fa f

[jira] [Created] (ZOOKEEPER-4732) improve Reproducible Builds

2023-08-04 Thread Herve Boutemy (Jira)
Herve Boutemy created ZOOKEEPER-4732:


 Summary: improve Reproducible Builds
 Key: ZOOKEEPER-4732
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4732
 Project: ZooKeeper
  Issue Type: Improvement
  Components: build
Affects Versions: 3.9.0
Reporter: Herve Boutemy


Rebuilding ZooKeeper 3.9.0 shows that it is only partially reproducible: 
https://github.com/jvm-repo-rebuild/reproducible-central/blob/master/content/org/apache/zookeeper/README.md

Analyzing the root cause, there are 2 issues:
1. a few old plugins to upgrade (easy)
2. generated code that contains the build timestamp: replacing it with the git 
commit timestamp would make the build reproducible (or even removing it, but 
removal is a bigger change as it impacts the API)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

