[kudu-CR] KUDU-1678: Race during abort of pending operations during raft shutdown
Will Berkeley has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/11759 ) Change subject: KUDU-1678: Race during abort of pending operations during raft shutdown .. KUDU-1678: Race during abort of pending operations during raft shutdown When a tablet replica is shutting down, the following race can occur: 0: The replica receives an ALTER_SCHEMA op adding the column 'foo'. 1: The replica receives a WRITE_OP inserting a row with column 'foo' present. 2: The replica starts to abort its pending operations because it is shutting down. The ALTER_SCHEMA is aborted. 3: Before the WRITE_OP can be aborted, it replicates. 4: The tablet server crashes as a result, with a message like: F1023 20:49:18.088703 1409 transaction_driver.cc:382] T 3fe5651e31d8486780d0e480f8748ead P 12e38846d8c648df8a11b9d8da2ad407 S R-NP Ts 6309182492379934720: Cannot cancel transactions that have already replicated: Invalid argument: Client provided column c587 INT32 NOT NULL not present in tablet The tablet server will crash with this same message every time it tries to bootstrap. Other tablet servers hosting replicas may crash if they replicated the bad write. The problem is that transactions need to be aborted in reverse order of their index, since later transactions may depend on earlier ones. This bug reproduced about 1% of the time in alter_table-randomized-test when run in DEBUG mode with 16 stress threads: I saw it 10 times in 1000 runs. With this change, I saw 0 failures in 1000 runs. Change-Id: Idde75bd1fe966a1a3d53aa1e5de6a01a48ff1103 Reviewed-on: http://gerrit.cloudera.org:8080/11759 Reviewed-by: Alexey Serbin Tested-by: Will Berkeley --- M src/kudu/consensus/pending_rounds.cc M src/kudu/integration-tests/alter_table-randomized-test.cc 2 files changed, 23 insertions(+), 13 deletions(-) Approvals: Alexey Serbin: Looks good to me, approved Will Berkeley: Verified -- To view, visit http://gerrit.cloudera.org:8080/11759 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: merged Gerrit-Change-Id: Idde75bd1fe966a1a3d53aa1e5de6a01a48ff1103 Gerrit-Change-Number: 11759 Gerrit-PatchSet: 3 Gerrit-Owner: Will Berkeley Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Will Berkeley
[kudu-CR] KUDU-1678: Race during abort of pending operations during raft shutdown
Will Berkeley has posted comments on this change. ( http://gerrit.cloudera.org:8080/11759 ) Change subject: KUDU-1678: Race during abort of pending operations during raft shutdown .. Patch Set 2: Verified+1 -- To view, visit http://gerrit.cloudera.org:8080/11759 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Idde75bd1fe966a1a3d53aa1e5de6a01a48ff1103 Gerrit-Change-Number: 11759 Gerrit-PatchSet: 2 Gerrit-Owner: Will Berkeley Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Will Berkeley Gerrit-Comment-Date: Tue, 23 Oct 2018 23:25:18 + Gerrit-HasComments: No
[kudu-CR] KUDU-1678: Race during abort of pending operations during raft shutdown
Will Berkeley has removed Kudu Jenkins from this change. ( http://gerrit.cloudera.org:8080/11759 ) Change subject: KUDU-1678: Race during abort of pending operations during raft shutdown .. Removed reviewer Kudu Jenkins with the following votes: * Verified-1 by Kudu Jenkins (120) -- To view, visit http://gerrit.cloudera.org:8080/11759 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: deleteReviewer Gerrit-Change-Id: Idde75bd1fe966a1a3d53aa1e5de6a01a48ff1103 Gerrit-Change-Number: 11759 Gerrit-PatchSet: 2 Gerrit-Owner: Will Berkeley Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Will Berkeley
[kudu-CR] KUDU-1678: Race during abort of pending operations during raft shutdown
Alexey Serbin has posted comments on this change. ( http://gerrit.cloudera.org:8080/11759 ) Change subject: KUDU-1678: Race during abort of pending operations during raft shutdown .. Patch Set 2: Code-Review+2 -- To view, visit http://gerrit.cloudera.org:8080/11759 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Idde75bd1fe966a1a3d53aa1e5de6a01a48ff1103 Gerrit-Change-Number: 11759 Gerrit-PatchSet: 2 Gerrit-Owner: Will Berkeley Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Kudu Jenkins (120) Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Will Berkeley Gerrit-Comment-Date: Tue, 23 Oct 2018 22:45:47 + Gerrit-HasComments: No
[kudu-CR] KUDU-1678: Race during abort of pending operations during raft shutdown
Will Berkeley has posted comments on this change. ( http://gerrit.cloudera.org:8080/11759 ) Change subject: KUDU-1678: Race during abort of pending operations during raft shutdown .. Patch Set 2: (4 comments) http://gerrit.cloudera.org:8080/#/c/11759/1//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/11759/1//COMMIT_MSG@9 PS1, Line 9: When a tablet replica is shutting down, the following race can occur: > I'm kind of surprised that we don't "quiesce" a tablet during shutdown so t Yeah, that would work too. http://gerrit.cloudera.org:8080/#/c/11759/1//COMMIT_MSG@17 PS1, Line 17: 4: The tablet server crashes as a result, with a message like: > Specifically, a different tserver crashes, right? The one hosting the FOLLO I think it can cause multiple to crash-- which servers end up with the bad write in their log. http://gerrit.cloudera.org:8080/#/c/11759/1/src/kudu/consensus/pending_rounds.cc File src/kudu/consensus/pending_rounds.cc: http://gerrit.cloudera.org:8080/#/c/11759/1/src/kudu/consensus/pending_rounds.cc@66 PS1, Line 66: tions in reverse index order. See KUDU-1678. > nit: why not to use crbegin() and crend() if the elements of the container Done http://gerrit.cloudera.org:8080/#/c/11759/1/src/kudu/consensus/pending_rounds.cc@67 PS1, Line 67: pending_txns > nit: does txn->second look any better? Done -- To view, visit http://gerrit.cloudera.org:8080/11759 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Idde75bd1fe966a1a3d53aa1e5de6a01a48ff1103 Gerrit-Change-Number: 11759 Gerrit-PatchSet: 2 Gerrit-Owner: Will Berkeley Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Kudu Jenkins (120) Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Will Berkeley Gerrit-Comment-Date: Tue, 23 Oct 2018 22:41:55 + Gerrit-HasComments: Yes
[kudu-CR] KUDU-1678: Race during abort of pending operations during raft shutdown
Hello Mike Percy, Alexey Serbin, Kudu Jenkins, Andrew Wong, Adar Dembo, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/11759 to look at the new patch set (#2). Change subject: KUDU-1678: Race during abort of pending operations during raft shutdown .. KUDU-1678: Race during abort of pending operations during raft shutdown When a tablet replica is shutting down, the following race can occur: 0: The replica receives an ALTER_SCHEMA op adding the column 'foo'. 1: The replica receives a WRITE_OP inserting a row with column 'foo' present. 2: The replica starts to abort its pending operations because it is shutting down. The ALTER_SCHEMA is aborted. 3: Before the WRITE_OP can be aborted, it replicates. 4: The tablet server crashes as a result, with a message like: F1023 20:49:18.088703 1409 transaction_driver.cc:382] T 3fe5651e31d8486780d0e480f8748ead P 12e38846d8c648df8a11b9d8da2ad407 S R-NP Ts 6309182492379934720: Cannot cancel transactions that have already replicated: Invalid argument: Client provided column c587 INT32 NOT NULL not present in tablet The tablet server will crash with this same message every time it tries to bootstrap. Other tablet servers hosting replicas may crash if they replicated the bad write. The problem is that transactions need to be aborted in reverse order of their index, since later transactions may depend on earlier ones. This bug reproduced about 1% of the time in alter_table-randomized-test when run in DEBUG mode with 16 stress threads: I saw it 10 times in 1000 runs. With this change, I saw 0 failures in 1000 runs. Change-Id: Idde75bd1fe966a1a3d53aa1e5de6a01a48ff1103 --- M src/kudu/consensus/pending_rounds.cc M src/kudu/integration-tests/alter_table-randomized-test.cc 2 files changed, 23 insertions(+), 13 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/59/11759/2 -- To view, visit http://gerrit.cloudera.org:8080/11759 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: Idde75bd1fe966a1a3d53aa1e5de6a01a48ff1103 Gerrit-Change-Number: 11759 Gerrit-PatchSet: 2 Gerrit-Owner: Will Berkeley Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Kudu Jenkins (120) Gerrit-Reviewer: Mike Percy
[kudu-CR] KUDU-1678: Race during abort of pending operations during raft shutdown
Alexey Serbin has posted comments on this change. ( http://gerrit.cloudera.org:8080/11759 ) Change subject: KUDU-1678: Race during abort of pending operations during raft shutdown .. Patch Set 1: Code-Review+2 -- To view, visit http://gerrit.cloudera.org:8080/11759 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Idde75bd1fe966a1a3d53aa1e5de6a01a48ff1103 Gerrit-Change-Number: 11759 Gerrit-PatchSet: 1 Gerrit-Owner: Will Berkeley Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Kudu Jenkins (120) Gerrit-Reviewer: Mike Percy Gerrit-Comment-Date: Tue, 23 Oct 2018 22:40:34 + Gerrit-HasComments: No
[kudu-CR] KUDU-1678: Race during abort of pending operations during raft shutdown
Alexey Serbin has posted comments on this change. ( http://gerrit.cloudera.org:8080/11759 ) Change subject: KUDU-1678: Race during abort of pending operations during raft shutdown .. Patch Set 1: (2 comments) http://gerrit.cloudera.org:8080/#/c/11759/1/src/kudu/consensus/pending_rounds.cc File src/kudu/consensus/pending_rounds.cc: http://gerrit.cloudera.org:8080/#/c/11759/1/src/kudu/consensus/pending_rounds.cc@66 PS1, Line 66: pending_txns_.rbegin(); txn != pending_txns_.rend() nit: why not to use crbegin() and crend() if the elements of the container are not changed anyway? http://gerrit.cloudera.org:8080/#/c/11759/1/src/kudu/consensus/pending_rounds.cc@67 PS1, Line 67: (*txn).second nit: does txn->second look any better? -- To view, visit http://gerrit.cloudera.org:8080/11759 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Idde75bd1fe966a1a3d53aa1e5de6a01a48ff1103 Gerrit-Change-Number: 11759 Gerrit-PatchSet: 1 Gerrit-Owner: Will Berkeley Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Kudu Jenkins (120) Gerrit-Reviewer: Mike Percy Gerrit-Comment-Date: Tue, 23 Oct 2018 22:35:38 + Gerrit-HasComments: Yes
[kudu-CR] KUDU-1678: Race during abort of pending operations during raft shutdown
Adar Dembo has posted comments on this change. ( http://gerrit.cloudera.org:8080/11759 ) Change subject: KUDU-1678: Race during abort of pending operations during raft shutdown .. Patch Set 1: Code-Review+2 (2 comments) http://gerrit.cloudera.org:8080/#/c/11759/1//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/11759/1//COMMIT_MSG@9 PS1, Line 9: When a tablet replica is shutting down, the following race can occur: I'm kind of surprised that we don't "quiesce" a tablet during shutdown so that its pending transactions can't make forward progress as we cancel them. http://gerrit.cloudera.org:8080/#/c/11759/1//COMMIT_MSG@17 PS1, Line 17: 4: The tablet server crashes as a result, with a message like: Specifically, a different tserver crashes, right? The one hosting the FOLLOWER of this tablet, the recipient of the replication? Although the example in the bug report doesn't involve replication; just scheduling of the uncanceled write on the local tserver's prepare threadpool. -- To view, visit http://gerrit.cloudera.org:8080/11759 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Idde75bd1fe966a1a3d53aa1e5de6a01a48ff1103 Gerrit-Change-Number: 11759 Gerrit-PatchSet: 1 Gerrit-Owner: Will Berkeley Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Kudu Jenkins (120) Gerrit-Reviewer: Mike Percy Gerrit-Comment-Date: Tue, 23 Oct 2018 22:33:25 + Gerrit-HasComments: Yes
[kudu-CR] KUDU-1678: Race during abort of pending operations during raft shutdown
Will Berkeley has uploaded this change for review. ( http://gerrit.cloudera.org:8080/11759 Change subject: KUDU-1678: Race during abort of pending operations during raft shutdown .. KUDU-1678: Race during abort of pending operations during raft shutdown When a tablet replica is shutting down, the following race can occur: 0: The replica receives an ALTER_SCHEMA op adding the column 'foo'. 1: The replica receives a WRITE_OP inserting a row with column 'foo' present. 2: The replica starts to abort its pending operations because it is shutting down. The ALTER_SCHEMA is aborted. 3: Before the WRITE_OP can be aborted, it replicates. 4: The tablet server crashes as a result, with a message like: F1023 20:49:18.088703 1409 transaction_driver.cc:382] T 3fe5651e31d8486780d0e480f8748ead P 12e38846d8c648df8a11b9d8da2ad407 S R-NP Ts 6309182492379934720: Cannot cancel transactions that have already replicated: Invalid argument: Client provided column c587 INT32 NOT NULL not present in tablet The tablet server will crash with this same message every time it tries to bootstrap. The problem is that transactions need to be aborted in reverse order of their index, since later transactions may depend on earlier ones. This bug reproduced about 1% of the time in alter_table-randomized-test when run in DEBUG mode with 16 stress threads: I saw it 10 times in 1000 runs. With this change, I saw 0 failures in 1000 runs. Change-Id: Idde75bd1fe966a1a3d53aa1e5de6a01a48ff1103 --- M src/kudu/consensus/pending_rounds.cc M src/kudu/integration-tests/alter_table-randomized-test.cc 2 files changed, 22 insertions(+), 13 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/59/11759/1 -- To view, visit http://gerrit.cloudera.org:8080/11759 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: newchange Gerrit-Change-Id: Idde75bd1fe966a1a3d53aa1e5de6a01a48ff1103 Gerrit-Change-Number: 11759 Gerrit-PatchSet: 1 Gerrit-Owner: Will Berkeley