[GitHub] [zookeeper] jonmv commented on pull request #1925: ZOOKEEPER-4541 Ephemeral znode owned by closed session visible in 1 of 3 servers
jonmv commented on PR #1925: URL: https://github.com/apache/zookeeper/pull/1925#issuecomment-1471531108 @eolivelli any thoughts on what to do next here? I think it would be good to conclude this work soon. It's getting a bit stale :) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: notifications-unsubscr...@zookeeper.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [zookeeper] jonmv commented on pull request #1925: ZOOKEEPER-4541 Ephemeral znode owned by closed session visible in 1 of 3 servers
jonmv commented on PR #1925: URL: https://github.com/apache/zookeeper/pull/1925#issuecomment-1415721176 > As a meta note, I wonder if it's worth splitting this PR into multiple PRs because it fixes distinct bugs (ZOOKEEPER-4409, ZOOKEEPER-4502, ZOOKEEPER-4394, ZOOKEEPER-4541). Likely not worth the effort, but perhaps it will get through review more easily that way ;) Oof, would probably have been a good idea, but one thing led to another, and now it'd be a lot of work to split this -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: notifications-unsubscr...@zookeeper.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [zookeeper] jonmv commented on pull request #1925: ZOOKEEPER-4541 Ephemeral znode owned by closed session visible in 1 of 3 servers
jonmv commented on PR #1925: URL: https://github.com/apache/zookeeper/pull/1925#issuecomment-1369710624 Any further thoughts on this @breed, @eolivelli ? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: notifications-unsubscr...@zookeeper.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [zookeeper] jonmv commented on pull request #1925: ZOOKEEPER-4541 Ephemeral znode owned by closed session visible in 1 of 3 servers
jonmv commented on PR #1925: URL: https://github.com/apache/zookeeper/pull/1925#issuecomment-1283316836 So ... you may be right we don't need to keep all this auxiliary structure during sync, but I believe we need to _if we want to precisely preserve today's behaviour_ (except what's needed to fixi those bugs, obviously). What complicates the sync is that some transactions aren't logged and ack'ed; and possibly also that state is used while syncing. I can't say whether the first behaviour is needed, and whether the second is actually the case, without a much deeper dive into all of this, but I do see tests failing when I change this behaviour. Perhaps it is a good idea to first fix these bugs, and then possibly look for ways to simplify? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: notifications-unsubscr...@zookeeper.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [zookeeper] jonmv commented on pull request #1925: ZOOKEEPER-4541 Ephemeral znode owned by closed session visible in 1 of 3 servers
jonmv commented on PR #1925: URL: https://github.com/apache/zookeeper/pull/1925#issuecomment-1280715961 Hehe, yes, it's too complex. I'll try to extract that sync state state to a helper class, with descriptive method names. That should hopefully clarify the meaning of those two queues. And, yes, we need them both. I won't offer an explanation here, but rather hope the next commit makes that superfluous. Regarding the `isPreZAB1_0`: that must have been broken for some time as well—there's no `NEWLEADER` in the diff there, so we never start the ZK server (after that code was split and moved into the `NEWLEADER` handling. I can remove it in a separate commit. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: notifications-unsubscr...@zookeeper.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [zookeeper] jonmv commented on pull request #1925: ZOOKEEPER-4541 Ephemeral znode owned by closed session visible in 1 of 3 servers
jonmv commented on PR #1925: URL: https://github.com/apache/zookeeper/pull/1925#issuecomment-1275933384 We've had this (the last commit, see referencing PR) running across hundreds of clusters for some time now, with more than a thousand rolling restarts concurrently with read and write traffic, and not detected any further issues. Also, it seems @hanm is no longer active? @eolivelli would you like a different reviewer as well, or are you satisfied? @jeffrey-xiao seems capable, imo :) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: notifications-unsubscr...@zookeeper.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [zookeeper] jonmv commented on pull request #1925: ZOOKEEPER-4541 Ephemeral znode owned by closed session visible in 1 of 3 servers
jonmv commented on PR #1925: URL: https://github.com/apache/zookeeper/pull/1925#issuecomment-1265729118 (Pushed again after fixing a complaint from code analysis, about an always true condition.) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: notifications-unsubscr...@zookeeper.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [zookeeper] jonmv commented on pull request #1925: ZOOKEEPER-4541 Ephemeral znode owned by closed session visible in 1 of 3 servers
jonmv commented on PR #1925: URL: https://github.com/apache/zookeeper/pull/1925#issuecomment-1265720403 Hmm, that other PR doesn't look right to me. It fails to store the TXNs that aren't already committed before ACKing the NEWLEADER, which was what ZOOKEEPER-3911 was all about, in the first place. Agree? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: notifications-unsubscr...@zookeeper.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [zookeeper] jonmv commented on pull request #1925: ZOOKEEPER-4541 Ephemeral znode owned by closed session visible in 1 of 3 servers
jonmv commented on PR #1925: URL: https://github.com/apache/zookeeper/pull/1925#issuecomment-1265515004 There ... Not pretty, but seems like the only way to make this right is by making it possible to delay the `ACK`s that would otherwise be sent once the `SyncRequestProcessor` flushes, so they arrive after the `ACK` og the `NEWLEADER`. This is implemented by adding additional special requests that can be enqueued by the sync processor—one to rendezvous with it, from the syncing thread at startup, to ensure TXNs are actually flushed; and two to toggle whether to delay forwarding to the `SendAckRequestProcessor`, used by the `FollowerZooKeeperServer`. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: notifications-unsubscr...@zookeeper.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [zookeeper] jonmv commented on pull request #1925: ZOOKEEPER-4541 Ephemeral znode owned by closed session visible in 1 of 3 servers
jonmv commented on PR #1925: URL: https://github.com/apache/zookeeper/pull/1925#issuecomment-1265325492 Hmm, no, the code has a race, as it is now. The `LeaderHandler` expects the first `ACK` after starting a `DIFF` sync to be the `NEWLEADER` ack, but if there are lots of `PROPOSAL`s in the diff, before the `NEWLEADER`, then this may also cause an `ACK` to be sent, which will crash the `Leader.wantForNewLeaderAck`. Ensuring the transactions are indeed flushed (through the usual request processor pipeline) _guarantees_ these `ACK`s, and thus always crashes the leader. Meh. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: notifications-unsubscr...@zookeeper.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [zookeeper] jonmv commented on pull request #1925: ZOOKEEPER-4541 Ephemeral znode owned by closed session visible in 1 of 3 servers
jonmv commented on PR #1925: URL: https://github.com/apache/zookeeper/pull/1925#issuecomment-1265250318 Hmm, no, this isn't quite right either, although none of the tests fail. Not sure why the rendezvous-with-sync-thread didn't work, and that's probably the right way to do this. It could be just insufficient test-setup, of course. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: notifications-unsubscr...@zookeeper.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [zookeeper] jonmv commented on pull request #1925: ZOOKEEPER-4541 Ephemeral znode owned by closed session visible in 1 of 3 servers
jonmv commented on PR #1925: URL: https://github.com/apache/zookeeper/pull/1925#issuecomment-1264342280 We raced indeed :) Given that this PR addresses some additional concerns, I'd vote for it to be merged. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: notifications-unsubscr...@zookeeper.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [zookeeper] jonmv commented on pull request #1925: ZOOKEEPER-4541 Ephemeral znode owned by closed session visible in 1 of 3 servers
jonmv commented on PR #1925: URL: https://github.com/apache/zookeeper/pull/1925#issuecomment-1263585329 The symptom of the bug fixed in the first of these two commits is that the learner crashes with an NPE during sync, and then, when it restarts, it typically writes a duplicated series of transactions to its transaction log, complains about that, and later may observe a digest mismatch when replaying that transaction log from file, during startup. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: notifications-unsubscr...@zookeeper.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [zookeeper] jonmv commented on pull request #1925: ZOOKEEPER-4541 Ephemeral znode owned by closed session visible in 1 of 3 servers
jonmv commented on PR #1925: URL: https://github.com/apache/zookeeper/pull/1925#issuecomment-1263579500 We've had this running for almost a week now, without any issues, and the data inconsistencies have not been observed. The sample size isn't large enough to conclude, though :) Anyway, we saw some other digest mismatches, and I started digging around for their cause. I found one problem introduced with [this commit](https://github.com/apache/zookeeper/commit/b978dfb949e4ac4d703e956c6ef811415c831bcd), fixed in [8121711](https://github.com/apache/zookeeper/pull/1925/commits/81217117c272b91cd7a91d06568adf6b53047801). The problem was the a `COMMIT` between `NEWLEADER` (which flushes the `packetsNotCommited`) and `UPTODATE` would crash the learner, which would peek at this queue and expect entries in it. This is fixed by passing on the entries, but not removing them; instead, the already written entries are simply skipped when updating the log after `UPTODATE`. Working on the above, I also found the fix for reconfig between `NEWLEADER` and `UPTODATE`, in [this commit](https://github.com/apache/zookeeper/commit/c38787f355b6dcd612fc57db0202fc68a01108f7), to be incomplete: since the `packetsNotCommited` is no longer emptied after `NEWLEADER`, the head doesn't change, and if there are other `PROPOSAL`s between the `NEWLEADER` and the `PROPOSAL` that the `COMMITANDACTIVATE` is meant for, then reconfig still doesn't happen. The unit test added back then was insufficient to test this. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: notifications-unsubscr...@zookeeper.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [zookeeper] jonmv commented on pull request #1925: ZOOKEEPER-4541 Ephemeral znode owned by closed session visible in 1 of 3 servers
jonmv commented on PR #1925: URL: https://github.com/apache/zookeeper/pull/1925#issuecomment-1256365917 Thanks for the quick reply! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: notifications-unsubscr...@zookeeper.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [zookeeper] jonmv commented on pull request #1925: ZOOKEEPER-4541 Ephemeral znode owned by closed session visible in 1 of 3 servers
jonmv commented on PR #1925: URL: https://github.com/apache/zookeeper/pull/1925#issuecomment-1256364188 > LGTM > > Great work. Did you experiment this fix in some test environment ? Not yet, but we may do that before we merge, if you wish. We will find a way to run a patched 3.8.0, probably early next week. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: notifications-unsubscr...@zookeeper.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [zookeeper] jonmv commented on pull request #1925: ZOOKEEPER-4541 Ephemeral znode owned by closed session visible in 1 of 3 servers
jonmv commented on PR #1925: URL: https://github.com/apache/zookeeper/pull/1925#issuecomment-1256258545 @hanm I believe you review a [related PR](https://github.com/apache/zookeeper/pull/1224) earlier, so perhaps you're the right reviewer here as well? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: notifications-unsubscr...@zookeeper.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org