[kudu-CR] KUDU-1678: Race during abort of pending operations during raft shutdown

2018-10-23 Thread Will Berkeley (Code Review)
Will Berkeley has submitted this change and it was merged. ( 
http://gerrit.cloudera.org:8080/11759 )

Change subject: KUDU-1678: Race during abort of pending operations during raft 
shutdown
..

KUDU-1678: Race during abort of pending operations during raft shutdown

When a tablet replica is shutting down, the following race can occur:

0: The replica receives an ALTER_SCHEMA op adding the column 'foo'.
1: The replica receives a WRITE_OP inserting a row with column 'foo'
   present.
2: The replica starts to abort its pending operations because it is
   shutting down. The ALTER_SCHEMA is aborted.
3: Before the WRITE_OP can be aborted, it replicates.
4: The tablet server crashes as a result, with a message like:

F1023 20:49:18.088703  1409 transaction_driver.cc:382] T 
3fe5651e31d8486780d0e480f8748ead P 12e38846d8c648df8a11b9d8da2ad407 S R-NP Ts 
6309182492379934720: Cannot cancel transactions that have already replicated: 
Invalid argument: Client provided column c587 INT32 NOT NULL not present in 
tablet

The tablet server will crash with this same message every time it tries
to bootstrap. Other tablet servers hosting replicas may crash if they
replicated the bad write.

The problem is that transactions need to be aborted in reverse order of
their index, since later transactions may depend on earlier ones.

This bug reproduced about 1% of the time in alter_table-randomized-test
when run in DEBUG mode with 16 stress threads: I saw it 10 times in 1000
runs. With this change, I saw 0 failures in 1000 runs.

Change-Id: Idde75bd1fe966a1a3d53aa1e5de6a01a48ff1103
Reviewed-on: http://gerrit.cloudera.org:8080/11759
Reviewed-by: Alexey Serbin 
Tested-by: Will Berkeley 
---
M src/kudu/consensus/pending_rounds.cc
M src/kudu/integration-tests/alter_table-randomized-test.cc
2 files changed, 23 insertions(+), 13 deletions(-)

Approvals:
  Alexey Serbin: Looks good to me, approved
  Will Berkeley: Verified

--
To view, visit http://gerrit.cloudera.org:8080/11759
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: Idde75bd1fe966a1a3d53aa1e5de6a01a48ff1103
Gerrit-Change-Number: 11759
Gerrit-PatchSet: 3
Gerrit-Owner: Will Berkeley 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Will Berkeley 


[kudu-CR] KUDU-1678: Race during abort of pending operations during raft shutdown

2018-10-23 Thread Will Berkeley (Code Review)
Will Berkeley has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/11759 )

Change subject: KUDU-1678: Race during abort of pending operations during raft 
shutdown
..


Patch Set 2: Verified+1


--
To view, visit http://gerrit.cloudera.org:8080/11759
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Idde75bd1fe966a1a3d53aa1e5de6a01a48ff1103
Gerrit-Change-Number: 11759
Gerrit-PatchSet: 2
Gerrit-Owner: Will Berkeley 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Will Berkeley 
Gerrit-Comment-Date: Tue, 23 Oct 2018 23:25:18 +
Gerrit-HasComments: No


[kudu-CR] KUDU-1678: Race during abort of pending operations during raft shutdown

2018-10-23 Thread Will Berkeley (Code Review)
Will Berkeley has removed Kudu Jenkins from this change.  ( 
http://gerrit.cloudera.org:8080/11759 )

Change subject: KUDU-1678: Race during abort of pending operations during raft 
shutdown
..


Removed reviewer Kudu Jenkins with the following votes:

* Verified-1 by Kudu Jenkins (120)
--
To view, visit http://gerrit.cloudera.org:8080/11759
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: deleteReviewer
Gerrit-Change-Id: Idde75bd1fe966a1a3d53aa1e5de6a01a48ff1103
Gerrit-Change-Number: 11759
Gerrit-PatchSet: 2
Gerrit-Owner: Will Berkeley 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Will Berkeley 


[kudu-CR] KUDU-1678: Race during abort of pending operations during raft shutdown

2018-10-23 Thread Alexey Serbin (Code Review)
Alexey Serbin has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/11759 )

Change subject: KUDU-1678: Race during abort of pending operations during raft 
shutdown
..


Patch Set 2: Code-Review+2


--
To view, visit http://gerrit.cloudera.org:8080/11759
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Idde75bd1fe966a1a3d53aa1e5de6a01a48ff1103
Gerrit-Change-Number: 11759
Gerrit-PatchSet: 2
Gerrit-Owner: Will Berkeley 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Will Berkeley 
Gerrit-Comment-Date: Tue, 23 Oct 2018 22:45:47 +
Gerrit-HasComments: No


[kudu-CR] KUDU-1678: Race during abort of pending operations during raft shutdown

2018-10-23 Thread Will Berkeley (Code Review)
Will Berkeley has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/11759 )

Change subject: KUDU-1678: Race during abort of pending operations during raft 
shutdown
..


Patch Set 2:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/11759/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/11759/1//COMMIT_MSG@9
PS1, Line 9: When a tablet replica is shutting down, the following race can 
occur:
> I'm kind of surprised that we don't "quiesce" a tablet during shutdown so t
Yeah, that would work too.


http://gerrit.cloudera.org:8080/#/c/11759/1//COMMIT_MSG@17
PS1, Line 17: 4: The tablet server crashes as a result, with a message like:
> Specifically, a different tserver crashes, right? The one hosting the FOLLO
I think it can cause multiple to crash-- which servers end up with the bad 
write in their log.


http://gerrit.cloudera.org:8080/#/c/11759/1/src/kudu/consensus/pending_rounds.cc
File src/kudu/consensus/pending_rounds.cc:

http://gerrit.cloudera.org:8080/#/c/11759/1/src/kudu/consensus/pending_rounds.cc@66
PS1, Line 66: tions in reverse index order. See KUDU-1678.
> nit: why not to use crbegin() and crend() if the elements of the container
Done


http://gerrit.cloudera.org:8080/#/c/11759/1/src/kudu/consensus/pending_rounds.cc@67
PS1, Line 67:  pending_txns
> nit: does txn->second look any better?
Done



--
To view, visit http://gerrit.cloudera.org:8080/11759
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Idde75bd1fe966a1a3d53aa1e5de6a01a48ff1103
Gerrit-Change-Number: 11759
Gerrit-PatchSet: 2
Gerrit-Owner: Will Berkeley 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Will Berkeley 
Gerrit-Comment-Date: Tue, 23 Oct 2018 22:41:55 +
Gerrit-HasComments: Yes


[kudu-CR] KUDU-1678: Race during abort of pending operations during raft shutdown

2018-10-23 Thread Will Berkeley (Code Review)
Hello Mike Percy, Alexey Serbin, Kudu Jenkins, Andrew Wong, Adar Dembo,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/11759

to look at the new patch set (#2).

Change subject: KUDU-1678: Race during abort of pending operations during raft 
shutdown
..

KUDU-1678: Race during abort of pending operations during raft shutdown

When a tablet replica is shutting down, the following race can occur:

0: The replica receives an ALTER_SCHEMA op adding the column 'foo'.
1: The replica receives a WRITE_OP inserting a row with column 'foo'
   present.
2: The replica starts to abort its pending operations because it is
   shutting down. The ALTER_SCHEMA is aborted.
3: Before the WRITE_OP can be aborted, it replicates.
4: The tablet server crashes as a result, with a message like:

F1023 20:49:18.088703  1409 transaction_driver.cc:382] T 
3fe5651e31d8486780d0e480f8748ead P 12e38846d8c648df8a11b9d8da2ad407 S R-NP Ts 
6309182492379934720: Cannot cancel transactions that have already replicated: 
Invalid argument: Client provided column c587 INT32 NOT NULL not present in 
tablet

The tablet server will crash with this same message every time it tries
to bootstrap. Other tablet servers hosting replicas may crash if they
replicated the bad write.

The problem is that transactions need to be aborted in reverse order of
their index, since later transactions may depend on earlier ones.

This bug reproduced about 1% of the time in alter_table-randomized-test
when run in DEBUG mode with 16 stress threads: I saw it 10 times in 1000
runs. With this change, I saw 0 failures in 1000 runs.

Change-Id: Idde75bd1fe966a1a3d53aa1e5de6a01a48ff1103
---
M src/kudu/consensus/pending_rounds.cc
M src/kudu/integration-tests/alter_table-randomized-test.cc
2 files changed, 23 insertions(+), 13 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/59/11759/2
--
To view, visit http://gerrit.cloudera.org:8080/11759
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Idde75bd1fe966a1a3d53aa1e5de6a01a48ff1103
Gerrit-Change-Number: 11759
Gerrit-PatchSet: 2
Gerrit-Owner: Will Berkeley 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mike Percy 


[kudu-CR] KUDU-1678: Race during abort of pending operations during raft shutdown

2018-10-23 Thread Alexey Serbin (Code Review)
Alexey Serbin has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/11759 )

Change subject: KUDU-1678: Race during abort of pending operations during raft 
shutdown
..


Patch Set 1: Code-Review+2


--
To view, visit http://gerrit.cloudera.org:8080/11759
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Idde75bd1fe966a1a3d53aa1e5de6a01a48ff1103
Gerrit-Change-Number: 11759
Gerrit-PatchSet: 1
Gerrit-Owner: Will Berkeley 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mike Percy 
Gerrit-Comment-Date: Tue, 23 Oct 2018 22:40:34 +
Gerrit-HasComments: No


[kudu-CR] KUDU-1678: Race during abort of pending operations during raft shutdown

2018-10-23 Thread Alexey Serbin (Code Review)
Alexey Serbin has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/11759 )

Change subject: KUDU-1678: Race during abort of pending operations during raft 
shutdown
..


Patch Set 1:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/11759/1/src/kudu/consensus/pending_rounds.cc
File src/kudu/consensus/pending_rounds.cc:

http://gerrit.cloudera.org:8080/#/c/11759/1/src/kudu/consensus/pending_rounds.cc@66
PS1, Line 66: pending_txns_.rbegin(); txn != pending_txns_.rend()
nit: why not to use crbegin() and crend() if the elements of the container are 
not changed anyway?


http://gerrit.cloudera.org:8080/#/c/11759/1/src/kudu/consensus/pending_rounds.cc@67
PS1, Line 67: (*txn).second
nit: does txn->second look any better?



--
To view, visit http://gerrit.cloudera.org:8080/11759
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Idde75bd1fe966a1a3d53aa1e5de6a01a48ff1103
Gerrit-Change-Number: 11759
Gerrit-PatchSet: 1
Gerrit-Owner: Will Berkeley 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mike Percy 
Gerrit-Comment-Date: Tue, 23 Oct 2018 22:35:38 +
Gerrit-HasComments: Yes


[kudu-CR] KUDU-1678: Race during abort of pending operations during raft shutdown

2018-10-23 Thread Adar Dembo (Code Review)
Adar Dembo has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/11759 )

Change subject: KUDU-1678: Race during abort of pending operations during raft 
shutdown
..


Patch Set 1: Code-Review+2

(2 comments)

http://gerrit.cloudera.org:8080/#/c/11759/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/11759/1//COMMIT_MSG@9
PS1, Line 9: When a tablet replica is shutting down, the following race can 
occur:
I'm kind of surprised that we don't "quiesce" a tablet during shutdown so that 
its pending transactions can't make forward progress as we cancel them.


http://gerrit.cloudera.org:8080/#/c/11759/1//COMMIT_MSG@17
PS1, Line 17: 4: The tablet server crashes as a result, with a message like:
Specifically, a different tserver crashes, right? The one hosting the FOLLOWER 
of this tablet, the recipient of the replication? Although the example in the 
bug report doesn't involve replication; just scheduling of the uncanceled write 
on the local tserver's prepare threadpool.



--
To view, visit http://gerrit.cloudera.org:8080/11759
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Idde75bd1fe966a1a3d53aa1e5de6a01a48ff1103
Gerrit-Change-Number: 11759
Gerrit-PatchSet: 1
Gerrit-Owner: Will Berkeley 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mike Percy 
Gerrit-Comment-Date: Tue, 23 Oct 2018 22:33:25 +
Gerrit-HasComments: Yes


[kudu-CR] KUDU-1678: Race during abort of pending operations during raft shutdown

2018-10-23 Thread Will Berkeley (Code Review)
Will Berkeley has uploaded this change for review. ( 
http://gerrit.cloudera.org:8080/11759


Change subject: KUDU-1678: Race during abort of pending operations during raft 
shutdown
..

KUDU-1678: Race during abort of pending operations during raft shutdown

When a tablet replica is shutting down, the following race can occur:

0: The replica receives an ALTER_SCHEMA op adding the column 'foo'.
1: The replica receives a WRITE_OP inserting a row with column 'foo'
   present.
2: The replica starts to abort its pending operations because it is
   shutting down. The ALTER_SCHEMA is aborted.
3: Before the WRITE_OP can be aborted, it replicates.
4: The tablet server crashes as a result, with a message like:

F1023 20:49:18.088703  1409 transaction_driver.cc:382] T 
3fe5651e31d8486780d0e480f8748ead P 12e38846d8c648df8a11b9d8da2ad407 S R-NP Ts 
6309182492379934720: Cannot cancel transactions that have already replicated: 
Invalid argument: Client provided column c587 INT32 NOT NULL not present in 
tablet

The tablet server will crash with this same message every time it tries
to bootstrap.

The problem is that transactions need to be aborted in reverse order of
their index, since later transactions may depend on earlier ones.

This bug reproduced about 1% of the time in alter_table-randomized-test
when run in DEBUG mode with 16 stress threads: I saw it 10 times in 1000
runs. With this change, I saw 0 failures in 1000 runs.

Change-Id: Idde75bd1fe966a1a3d53aa1e5de6a01a48ff1103
---
M src/kudu/consensus/pending_rounds.cc
M src/kudu/integration-tests/alter_table-randomized-test.cc
2 files changed, 22 insertions(+), 13 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/59/11759/1
--
To view, visit http://gerrit.cloudera.org:8080/11759
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: Idde75bd1fe966a1a3d53aa1e5de6a01a48ff1103
Gerrit-Change-Number: 11759
Gerrit-PatchSet: 1
Gerrit-Owner: Will Berkeley