[jira] [Created] (ZOOKEEPER-4828) Minor 3.9 broke custom TLS setup with ssl.context.supplier.class

2024-04-22 Thread Jon Marius Venstad (Jira)
Jon Marius Venstad created ZOOKEEPER-4828:
-

 Summary: Minor 3.9 broke custom TLS setup with 
ssl.context.supplier.class
 Key: ZOOKEEPER-4828
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4828
 Project: ZooKeeper
  Issue Type: Bug
Reporter: Jon Marius Venstad


We run embedded ZooKeeper in Vespa, and use a custom TLS stack, where we, for example, do additional validation and authorisation of client certificates in our TLS trust manager.
The changes in 
https://github.com/apache/zookeeper/commit/4a794276d3d371071c31f86c14da824fdd2e53c0,
 done for ZOOKEEPER-4622,
broke the `ssl.context.supplier.class` configuration parameter, documented in 
the ZK admin guide
(https://zookeeper.apache.org/doc/current/zookeeperAdmin.html#sc_configuration).
Consequently, current code (3.9.2) now enforces a file-based key and trust 
store for _any TLS_, which is not an option for us.
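
For context, this is roughly how we rely on that parameter: if I read the pre-3.9 code correctly, the class named in {{ssl.context.supplier.class}} is instantiated reflectively and used as a {{java.util.function.Supplier<SSLContext>}}. A simplified, illustrative sketch (class and helper names are made up, not our actual code):

{code:java}
import java.util.function.Supplier;
import javax.net.ssl.KeyManager;
import javax.net.ssl.SSLContext;
import javax.net.ssl.TrustManager;

// Illustrative sketch only; class and helper names are made up. The supplier builds
// the whole SSLContext in code, so no key or trust store files exist on disk.
public class CustomTlsContextSupplier implements Supplier<SSLContext> {
    @Override
    public SSLContext get() {
        try {
            SSLContext context = SSLContext.getInstance("TLS");
            // Key managers backed by in-memory credentials, and a trust manager that also
            // does the additional validation and authorisation of client certificates.
            context.init(inMemoryKeyManagers(), validatingTrustManagers(), null);
            return context;
        } catch (Exception e) {
            throw new IllegalStateException("Failed to build custom SSLContext", e);
        }
    }

    private KeyManager[] inMemoryKeyManagers() {
        return new KeyManager[0];   // placeholder for the real in-memory key material
    }

    private TrustManager[] validatingTrustManagers() {
        return new TrustManager[0]; // placeholder for the real authorising trust manager
    }
}
{code}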

I looked at two ways to fix this:
1. Add new configuration parameters for _key_ and _trust_ store _suppliers_, as an alternative to the key and trust store _files_ required by the (new with 3.9.0) ClientX509Util code. This adds another pair of config options, of which there are already plenty, and the user is still stuck with the default JDK `Provider` (the optional argument to SSLContext.getInstance(protocol, provider)); on the other hand, it lets users with a custom key and trust store use the native SSL support of Netty. Netty also allows a JDK `Provider` to be specified on the SslContextBuilder, so that _could_ be made configurable as well; see the sketch below.
2. Restore the option of specifying a custom SSL context, and prefer it over the Netty SslContextBuilder in the new ClientX509Util code when present. This lets users specify a JDK `Provider`, but file-based key and trust stores will still be required for the native SSL support added in 3.9.0.

I don't have a strong opinion on which option is better. I can also contribute 
a code change with either.
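
To illustrate option 1, it would roughly amount to feeding supplier-provided key and trust managers into the Netty builder that ClientX509Util already uses. This is a sketch only; the "supplied*" parameters stand in for hypothetical supplier-style config options, and the exact wiring would depend on the Netty version ZooKeeper ships:

{code:java}
import io.netty.handler.ssl.SslContext;
import io.netty.handler.ssl.SslContextBuilder;
import io.netty.handler.ssl.SslProvider;
import java.security.Provider;
import javax.net.ssl.KeyManagerFactory;
import javax.net.ssl.SSLException;
import javax.net.ssl.TrustManagerFactory;

// Sketch of option 1; the "supplied*" parameters stand in for hypothetical
// key/trust store *supplier* config options that would replace the file paths.
class SupplierBasedNettyContext {
    static SslContext forServer(KeyManagerFactory suppliedKeyManagers,
                                TrustManagerFactory suppliedTrustManagers,
                                Provider jdkProvider) throws SSLException {
        return SslContextBuilder.forServer(suppliedKeyManagers) // key material from a supplier, not a key store file
                .trustManager(suppliedTrustManagers)            // trust material from a supplier, not a trust store file
                .sslProvider(SslProvider.JDK)                    // or OPENSSL, to keep Netty's native SSL available
                .sslContextProvider(jdkProvider)                 // the JDK Provider that could also be made configurable
                .build();
    }
}
{code}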

 





[jira] [Commented] (ZOOKEEPER-4541) Ephemeral znode owned by closed session visible in 1 of 3 servers

2022-09-23 Thread Jon Marius Venstad (Jira)


[ https://issues.apache.org/jira/browse/ZOOKEEPER-4541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17608752#comment-17608752 ]

Jon Marius Venstad commented on ZOOKEEPER-4541:
---

After some thorough investigation, we have a good idea of what has happened. There are two faults in the ZK codebase, both of which we attempt to fix in [https://github.com/apache/zookeeper/pull/1925]. The details of our findings are as follows:

 

First, some background: we do rolling restarts of embedded ZK clusters of size 
3, roughly 50 times a day, across many data centres. 

A few times a week, we find a ZK server with a different view of the world, 
from what the other servers in the cluster have. 

As far as we could ascertain, this always happens as a result of the leader restarting, that is, the enclosing JVM shuts down (cleanly, and we call {{QuorumPeer.shutdown()}}), and starts up again a few minutes later. 

Looking at the data directories of the servers after a recent episode, we found 
two interesting things, of which the second is the critical:

Transaction logs had segments of transactions that were repeated, both two and three times. This was found on all servers, but it seems to be harmless, despite the ERROR logging it produces when the data directory is loaded on ZK server start. It causes a digest mismatch, which isn't acted upon, and it becomes invisible once a new snapshot is taken. This masked the real error, though, because we were looking for digest mismatches, and the digest had already failed earlier in the log than the point of the second finding: 

_The transaction log_ on the server that did _not_ agree with the others (in particular, it typically insists there are ephemeral nodes present, but for sessions which died a long time ago!) was _missing entries at the end of one of the log files!_ The transaction log in question was the active one at the end of the epoch of the leader which restarted.

We've also noticed this error is more prevalent on machines with slower disks, 
where we sometimes see complaints from ZK about slow fsyncing. 

 

Putting this all together, we finally discovered that the following happens:

 

ZK3 is leader, ZK1 and ZK2 are followers. X < Z are transaction IDs. 

ZK1 is busy fsyncing transactions <= X.

ZK3 proposes Z, and ZK1 and ZK2 receive the proposal.

ZK1 enqueues the proposal, but does not ACK it. ZK2 ACKs the proposal. 

ZK3 observes quorum for Z, and sends a COMMIT to ZK1 and ZK2. 

ZK1 applies Z to its data tree, but it is still not processed by the 
SyncRequestProcessor, which is busy fsyncing. 

ZK3 shuts down, closing the socket to ZK1 and ZK2.

ZK1 stops following the dead leader, and shuts down its {{Follower}}, to prepare for a new leader election and epoch.

ZK1 joins a new leader election, using Z as its {{lastSeenZxid}}, because that was COMMITTed, and applied to its data tree. 

ZK1 completes fsync up to X, and then proceeds to ACK those transactions, but 
the socket is gone, and the sync thread dies from an unexpected NPE. The 
remaining transaction Z is not written to disk. 

ZK1 and ZK2 agree that Z is the newest transaction, ZK2 becomes the new leader, 
and ZK3 eventually rejoins as a follower. 

ZK1 has a data tree which is correct, and new transactions come in, and are 
written to the transaction log of the new epoch. 

ZK1 restarts. When it comes back up, Z is missing from its transaction log, and 
its data is inconsistent with that of the other servers. It seems we never hit 
SNAP syncs, unless we intervene, so it stays this way indefinitely. 

 

So, expecting the socket of the learner to possibly be {{null}}, in {{SendAckRequestProcessor}}, is enough to fix this problem in most cases, because that seems to be the common reason for the sync thread to die; had it lived on, it would eventually have synced Z as well. Nevertheless, there would be a thread leak, and there could be other failure scenarios that I'm not aware of, both now, and certainly later. 
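
For illustration, the shape of the guard is roughly the following; {{LearnerConnection}} and the names below are hypothetical stand-ins for the {{Learner}}/{{SendAckRequestProcessor}} internals, not the actual PR diff:

{code:java}
import java.io.IOException;
import java.net.Socket;

// Self-contained sketch of the guard; LearnerConnection and the names below are
// hypothetical stand-ins for ZooKeeper's Learner/SendAckRequestProcessor internals.
interface LearnerConnection {
    Socket socket();                       // null once the leader connection is torn down
    void writeAckPacket() throws IOException;
}

class AckFlusher {
    private final LearnerConnection learner;

    AckFlusher(LearnerConnection learner) {
        this.learner = learner;
    }

    void flush() throws IOException {
        // A missing socket is an expected state after the leader shuts down; skip the
        // ACK instead of dying from a NullPointerException, so the sync thread stays
        // alive and still fsyncs the remaining transaction (Z in the scenario above).
        if (learner.socket() == null) {
            return;
        }
        learner.writeAckPacket();
    }
}
{code}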

The attempted fix therefore also refactors the shutdown for child classes of {{ZooKeeperServer}}, such that they all override the {{shutdown(boolean)}} method, which is ultimately the one that is called. I also moved the shutdown (and sync) of the {{SyncRequestProcessor}} to before the shutdown of the parent {{ZooKeeperServer}}, because of the {{fastForwardFromEdits()}} that's called in the latter in an attempt to be up to date when the leader election starts; obviously, this makes more sense if the pending transactions are actually written first. I believe the missing synchronisation with the persistence thread is also what caused the duplicate series of transactions in the transaction logs: inflight transactions (not yet committed) could be written to the log by the {{SyncRequestProcessor}} in parallel with a leader election using an older Zxid than the tail of the log.
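
To illustrate the intended ordering only (simplified, hypothetical class names; not the actual patch):

{code:java}
// Hypothetical, simplified stand-ins for the shutdown path; the point is only the
// ordering: flush and stop the persistence thread before the parent shutdown runs
// fastForwardFromEdits() and leader election begins.
class ParentServer {
    void shutdown(boolean fullyShutDown) {
        // ... in the real code, this is where fastForwardFromEdits() runs ...
    }
}

class LearnerServer extends ParentServer {
    private final Runnable syncProcessorShutdown; // flushes queued transactions to disk

    LearnerServer(Runnable syncProcessorShutdown) {
        this.syncProcessorShutdown = syncProcessorShutdown;
    }

    @Override
    void shutdown(boolean fullyShutDown) {
        syncProcessorShutdown.run();   // 1) persist everything that was proposed/committed
        super.shutdown(fullyShutDown); // 2) only then fast-forward and prepare for election
    }
}
{code}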