[GitHub] [zookeeper] jonmv commented on pull request #1925: ZOOKEEPER-4541 Ephemeral znode owned by closed session visible in 1 of 3 servers

2023-03-16 Thread via GitHub


jonmv commented on PR #1925:
URL: https://github.com/apache/zookeeper/pull/1925#issuecomment-1471531108

   @eolivelli any thoughts on what to do next here? I think it would be good to 
conclude this work soon. It's getting a bit stale :)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscr...@zookeeper.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [zookeeper] jonmv commented on pull request #1925: ZOOKEEPER-4541 Ephemeral znode owned by closed session visible in 1 of 3 servers

2023-02-03 Thread via GitHub


jonmv commented on PR #1925:
URL: https://github.com/apache/zookeeper/pull/1925#issuecomment-1415721176

   > As a meta note, I wonder if it's worth splitting this PR into multiple PRs 
because it fixes distinct bugs (ZOOKEEPER-4409, ZOOKEEPER-4502, ZOOKEEPER-4394, 
ZOOKEEPER-4541). Likely not worth the effort, but perhaps it will get through 
review more easily that way ;)
   
   Oof, would probably have been a good idea, but one thing led to another, and 
now it'd be a lot of work to split this  


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscr...@zookeeper.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [zookeeper] jonmv commented on pull request #1925: ZOOKEEPER-4541 Ephemeral znode owned by closed session visible in 1 of 3 servers

2023-01-03 Thread GitBox


jonmv commented on PR #1925:
URL: https://github.com/apache/zookeeper/pull/1925#issuecomment-1369710624

   Any further thoughts on this @breed, @eolivelli ? 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscr...@zookeeper.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [zookeeper] jonmv commented on pull request #1925: ZOOKEEPER-4541 Ephemeral znode owned by closed session visible in 1 of 3 servers

2022-10-18 Thread GitBox


jonmv commented on PR #1925:
URL: https://github.com/apache/zookeeper/pull/1925#issuecomment-1283316836

   So ... you may be right we don't need to keep all this auxiliary structure 
during sync, but I believe we need to _if we want to precisely preserve today's 
behaviour_ (except what's needed to fixi those bugs, obviously). 
   What complicates the sync is that some transactions aren't logged and 
ack'ed; and possibly also that state is used while syncing. I can't say whether 
the first behaviour is needed, and whether the second is actually the case, 
without a much deeper dive into all of this, but I do see tests failing when I 
change this behaviour. 
   Perhaps it is a good idea to first fix these bugs, and then possibly look 
for ways to simplify? 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscr...@zookeeper.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [zookeeper] jonmv commented on pull request #1925: ZOOKEEPER-4541 Ephemeral znode owned by closed session visible in 1 of 3 servers

2022-10-17 Thread GitBox


jonmv commented on PR #1925:
URL: https://github.com/apache/zookeeper/pull/1925#issuecomment-1280715961

   Hehe, yes, it's too complex. I'll try to extract that sync state state to a 
helper class, with descriptive method names. That should hopefully clarify the 
meaning of those two queues. And, yes, we need them both. I won't offer an 
explanation here, but rather hope the next commit makes that superfluous. 
   
   Regarding the `isPreZAB1_0`: that must have been broken for some time as 
well—there's no `NEWLEADER` in the diff there, so we never start the ZK server 
(after that code was split and moved into the `NEWLEADER` handling. I can 
remove it in a separate commit. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscr...@zookeeper.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [zookeeper] jonmv commented on pull request #1925: ZOOKEEPER-4541 Ephemeral znode owned by closed session visible in 1 of 3 servers

2022-10-12 Thread GitBox


jonmv commented on PR #1925:
URL: https://github.com/apache/zookeeper/pull/1925#issuecomment-1275933384

   We've had this (the last commit, see referencing PR) running across hundreds 
of clusters for some time now, with more than a thousand rolling restarts 
concurrently with read and write traffic, and not detected any further issues.
   
   Also, it seems @hanm is no longer active? @eolivelli would you like a 
different reviewer as well, or are you satisfied? @jeffrey-xiao seems capable, 
imo :) 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscr...@zookeeper.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [zookeeper] jonmv commented on pull request #1925: ZOOKEEPER-4541 Ephemeral znode owned by closed session visible in 1 of 3 servers

2022-10-05 Thread GitBox


jonmv commented on PR #1925:
URL: https://github.com/apache/zookeeper/pull/1925#issuecomment-1265729118

   (Pushed again after fixing a complaint from code analysis, about an always 
true condition.)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscr...@zookeeper.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [zookeeper] jonmv commented on pull request #1925: ZOOKEEPER-4541 Ephemeral znode owned by closed session visible in 1 of 3 servers

2022-10-05 Thread GitBox


jonmv commented on PR #1925:
URL: https://github.com/apache/zookeeper/pull/1925#issuecomment-1265720403

   Hmm, that other PR doesn't look right to me. It fails to store the TXNs that 
aren't already committed before ACKing the NEWLEADER, which was what 
ZOOKEEPER-3911 was all about, in the first place. Agree?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscr...@zookeeper.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [zookeeper] jonmv commented on pull request #1925: ZOOKEEPER-4541 Ephemeral znode owned by closed session visible in 1 of 3 servers

2022-10-05 Thread GitBox


jonmv commented on PR #1925:
URL: https://github.com/apache/zookeeper/pull/1925#issuecomment-1265515004

   There ... Not pretty, but seems like the only way to make this right is by 
making it possible to delay the `ACK`s that would otherwise be sent once the 
`SyncRequestProcessor` flushes, so they arrive after the `ACK` og the 
`NEWLEADER`. 
   This is implemented by adding additional special requests that can be 
enqueued by the sync processor—one to rendezvous with it, from the syncing 
thread at startup, to ensure TXNs are actually flushed; and two to toggle 
whether to delay forwarding to the `SendAckRequestProcessor`, used by the 
`FollowerZooKeeperServer`. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscr...@zookeeper.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [zookeeper] jonmv commented on pull request #1925: ZOOKEEPER-4541 Ephemeral znode owned by closed session visible in 1 of 3 servers

2022-10-05 Thread GitBox


jonmv commented on PR #1925:
URL: https://github.com/apache/zookeeper/pull/1925#issuecomment-1265325492

   Hmm, no, the code has a race, as it is now. The `LeaderHandler` expects the 
first `ACK` after starting a `DIFF` sync to be the `NEWLEADER` ack, but if 
there are lots of `PROPOSAL`s in the diff, before the `NEWLEADER`, then this 
may also cause an `ACK` to be sent, which will crash the 
`Leader.wantForNewLeaderAck`. 
   
   Ensuring the transactions are indeed flushed (through the usual request 
processor pipeline) _guarantees_ these `ACK`s, and thus always crashes the 
leader. Meh. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscr...@zookeeper.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [zookeeper] jonmv commented on pull request #1925: ZOOKEEPER-4541 Ephemeral znode owned by closed session visible in 1 of 3 servers

2022-10-05 Thread GitBox


jonmv commented on PR #1925:
URL: https://github.com/apache/zookeeper/pull/1925#issuecomment-1265250318

   Hmm, no, this isn't quite right either, although none of the tests fail. 
   Not sure why the rendezvous-with-sync-thread didn't work, and that's 
probably the right way to do this. It could be just insufficient test-setup, of 
course. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscr...@zookeeper.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [zookeeper] jonmv commented on pull request #1925: ZOOKEEPER-4541 Ephemeral znode owned by closed session visible in 1 of 3 servers

2022-10-01 Thread GitBox


jonmv commented on PR #1925:
URL: https://github.com/apache/zookeeper/pull/1925#issuecomment-1264342280

   We raced indeed :) Given that this PR addresses some additional concerns, 
I'd vote for it to be merged. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscr...@zookeeper.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [zookeeper] jonmv commented on pull request #1925: ZOOKEEPER-4541 Ephemeral znode owned by closed session visible in 1 of 3 servers

2022-09-30 Thread GitBox


jonmv commented on PR #1925:
URL: https://github.com/apache/zookeeper/pull/1925#issuecomment-1263585329

   The symptom of the bug fixed in the first of these two commits is that the 
learner crashes with an NPE during sync, and then, when it restarts, it 
typically writes a duplicated series of transactions to its transaction log, 
complains about that, and later may observe a digest mismatch when replaying 
that transaction log from file, during startup. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscr...@zookeeper.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [zookeeper] jonmv commented on pull request #1925: ZOOKEEPER-4541 Ephemeral znode owned by closed session visible in 1 of 3 servers

2022-09-30 Thread GitBox


jonmv commented on PR #1925:
URL: https://github.com/apache/zookeeper/pull/1925#issuecomment-1263579500

   We've had this running for almost a week now, without any issues, and the 
data inconsistencies have not been observed. The sample size isn't large enough 
to conclude, though :)
   
   
   Anyway, we saw some other digest mismatches, and I started digging around 
for their cause. I found one problem introduced with [this 
commit](https://github.com/apache/zookeeper/commit/b978dfb949e4ac4d703e956c6ef811415c831bcd),
 fixed in 
[8121711](https://github.com/apache/zookeeper/pull/1925/commits/81217117c272b91cd7a91d06568adf6b53047801).
   The problem was the a `COMMIT` between `NEWLEADER` (which flushes the 
`packetsNotCommited`) and `UPTODATE` would crash the learner, which would peek 
at this queue and expect entries in it. This is fixed by passing on the 
entries, but not removing them; instead, the already written entries are simply 
skipped when updating the log after `UPTODATE`. 
   
   Working on the above, I also found the fix for reconfig between `NEWLEADER` 
and `UPTODATE`, in [this 
commit](https://github.com/apache/zookeeper/commit/c38787f355b6dcd612fc57db0202fc68a01108f7),
 to be incomplete: since the `packetsNotCommited` is no longer emptied after 
`NEWLEADER`, the head doesn't change, and if there are other `PROPOSAL`s 
between the `NEWLEADER` and the `PROPOSAL` that the `COMMITANDACTIVATE` is 
meant for, then reconfig still doesn't happen. The unit test added back then 
was insufficient to test this. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscr...@zookeeper.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [zookeeper] jonmv commented on pull request #1925: ZOOKEEPER-4541 Ephemeral znode owned by closed session visible in 1 of 3 servers

2022-09-23 Thread GitBox


jonmv commented on PR #1925:
URL: https://github.com/apache/zookeeper/pull/1925#issuecomment-1256365917

   Thanks for the quick reply!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscr...@zookeeper.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [zookeeper] jonmv commented on pull request #1925: ZOOKEEPER-4541 Ephemeral znode owned by closed session visible in 1 of 3 servers

2022-09-23 Thread GitBox


jonmv commented on PR #1925:
URL: https://github.com/apache/zookeeper/pull/1925#issuecomment-1256364188

   > LGTM
   > 
   > Great work. Did you experiment this fix in some test environment ?
   
   Not yet, but we may do that before we merge, if you wish. We will find a way 
to run a patched 3.8.0, probably early next week. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscr...@zookeeper.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [zookeeper] jonmv commented on pull request #1925: ZOOKEEPER-4541 Ephemeral znode owned by closed session visible in 1 of 3 servers

2022-09-23 Thread GitBox


jonmv commented on PR #1925:
URL: https://github.com/apache/zookeeper/pull/1925#issuecomment-1256258545

   @hanm I believe you review a [related 
PR](https://github.com/apache/zookeeper/pull/1224) earlier, so perhaps you're 
the right reviewer here as well? 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscr...@zookeeper.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org