[kudu-CR] master: add read-write lock to serialize operations around elections
Adar Dembo has submitted this change and it was merged. Change subject: master: add read-write lock to serialize operations around elections .. master: add read-write lock to serialize operations around elections This rigmarole began with an investigation into a test failure [1], which led to a new integration test that hammers VisitTablesAndTablets() while creating tables. That test revealed other locking issues, which brings us to where we are now. This patch introduces a read-write lock to serialize all master operations so that they fall on one side or the other of a leader election. The idea is to avoid performing operations concurrently with a reload of the master metadata; doing so can lead to problems in Shutdown() and (very rarely, perhaps only conceptually) to inconsistent on-disk state. I was hoping this lock could replace the fencing done by leader_ready_term_, but eventually reasoned that we need both; without leader_ready_term_ fencing, the master's consensus state machine could fool an operation into thinking the master became the leader before the metadata was reloaded. Three other things of note here: - The new lock is acquired via TryLock() so that, if the lock could not be acquired, the RPC will fail rather than block. A future patch modifies TSHeartbeat() to partially accept heartbeats even if the master is a follower; TryLock() means that a transitioning leader that is pelted with RPCs won't fill up its service queue and can still process heartbeats. - TableInfo's AddTask() and RemoveTask() methods now don't hold the table's lock when adding and removing refs from the task respectively. This is the fix for the original test failure. - When reloading metadata, we now abort all outstanding table tasks to avoid orphaning them. 1. http://dist-test.cloudera.org:8080/diagnose?key=224b3aa2-3c87-11e6-9a09-0242ac110001 Change-Id: I5084c09f1a77ccf620fb6cd621094c4778d636f8 Reviewed-on: http://gerrit.cloudera.org:8080/3550 Reviewed-by: Dan BurkertTested-by: Adar Dembo --- M src/kudu/integration-tests/create-table-stress-test.cc M src/kudu/integration-tests/mini_cluster.cc M src/kudu/master/catalog_manager.cc M src/kudu/master/catalog_manager.h M src/kudu/master/master.cc M src/kudu/master/master_service.cc 6 files changed, 334 insertions(+), 144 deletions(-) Approvals: Dan Burkert: Looks good to me, approved Adar Dembo: Verified -- To view, visit http://gerrit.cloudera.org:8080/3550 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: merged Gerrit-Change-Id: I5084c09f1a77ccf620fb6cd621094c4778d636f8 Gerrit-PatchSet: 11 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Adar Dembo Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Todd Lipcon
[kudu-CR] master: add read-write lock to serialize operations around elections
Dan Burkert has posted comments on this change. Change subject: master: add read-write lock to serialize operations around elections .. Patch Set 10: Code-Review+2 -- To view, visit http://gerrit.cloudera.org:8080/3550 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I5084c09f1a77ccf620fb6cd621094c4778d636f8 Gerrit-PatchSet: 10 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Adar DemboGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: No
[kudu-CR] master: add read-write lock to serialize operations around elections
Kudu Jenkins has posted comments on this change. Change subject: master: add read-write lock to serialize operations around elections .. Patch Set 10: Build Started http://104.196.14.100/job/kudu-gerrit/2464/ -- To view, visit http://gerrit.cloudera.org:8080/3550 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I5084c09f1a77ccf620fb6cd621094c4778d636f8 Gerrit-PatchSet: 10 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Adar DemboGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: No
[kudu-CR] master: add read-write lock to serialize operations around elections
Dan Burkert has posted comments on this change. Change subject: master: add read-write lock to serialize operations around elections .. Patch Set 8: Code-Review+2 -- To view, visit http://gerrit.cloudera.org:8080/3550 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I5084c09f1a77ccf620fb6cd621094c4778d636f8 Gerrit-PatchSet: 8 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Adar DemboGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: No
[kudu-CR] master: add read-write lock to serialize operations around elections
Hello Dan Burkert, Todd Lipcon, Kudu Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/3550 to look at the new patch set (#8). Change subject: master: add read-write lock to serialize operations around elections .. master: add read-write lock to serialize operations around elections This rigmarole began with an investigation into a test failure [1], which led to a new integration test that hammers VisitTablesAndTablets() while creating tables. That test revealed other locking issues, which brings us to where we are now. This patch introduces a read-write lock to serialize all master operations so that they fall on one side or the other of a leader election. The idea is to avoid performing operations concurrently with a reload of the master metadata; doing so can lead to problems in Shutdown() and (very rarely, perhaps only conceptually) to inconsistent on-disk state. I was hoping this lock could replace the fencing done by leader_ready_term_, but eventually reasoned that we need both; without leader_ready_term_ fencing, the master's consensus state machine could fool an operation into thinking the master became the leader before the metadata was reloaded. Three other things of note here: - The new lock is acquired via TryLock() so that, if the lock could not be acquired, the RPC will fail rather than block. A future patch modifies TSHeartbeat() to partially accept heartbeats even if the master is a follower; TryLock() means that a transitioning leader that is pelted with RPCs won't fill up its service queue and can still process heartbeats. - TableInfo's AddTask() and RemoveTask() methods now don't hold the table's lock when adding and removing refs from the task respectively. This is the fix for the original test failure. - When reloading metadata, we now abort all outstanding table tasks to avoid orphaning them. 1. http://dist-test.cloudera.org:8080/diagnose?key=224b3aa2-3c87-11e6-9a09-0242ac110001 Change-Id: I5084c09f1a77ccf620fb6cd621094c4778d636f8 --- M src/kudu/integration-tests/create-table-stress-test.cc M src/kudu/integration-tests/mini_cluster.cc M src/kudu/master/catalog_manager.cc M src/kudu/master/catalog_manager.h M src/kudu/master/master.cc M src/kudu/master/master_service.cc 6 files changed, 334 insertions(+), 144 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/50/3550/8 -- To view, visit http://gerrit.cloudera.org:8080/3550 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I5084c09f1a77ccf620fb6cd621094c4778d636f8 Gerrit-PatchSet: 8 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Adar DemboGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Todd Lipcon
[kudu-CR] master: add read-write lock to serialize operations around elections
Kudu Jenkins has posted comments on this change. Change subject: master: add read-write lock to serialize operations around elections .. Patch Set 8: Build Started http://104.196.14.100/job/kudu-gerrit/2424/ -- To view, visit http://gerrit.cloudera.org:8080/3550 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I5084c09f1a77ccf620fb6cd621094c4778d636f8 Gerrit-PatchSet: 8 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Adar DemboGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: No
[kudu-CR] master: add read-write lock to serialize operations around elections
Hello Dan Burkert, Todd Lipcon, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/3550 to look at the new patch set (#7). Change subject: master: add read-write lock to serialize operations around elections .. master: add read-write lock to serialize operations around elections This rigmarole began with an investigation into a test failure [1], which led to a new integration test that hammers VisitTablesAndTablets() while creating tables. That test revealed other locking issues, which brings us to where we are now. This patch introduces a read-write lock to serialize all master operations so that they fall on one side or the other of a leader election. The idea is to avoid performing operations concurrently with a reload of the master metadata; doing so can lead to problems in Shutdown() and (very rarely, perhaps only conceptually) to inconsistent on-disk state. I was hoping this lock could replace the fencing done by leader_ready_term_, but eventually reasoned that we need both; without leader_ready_term_ fencing, the master's consensus state machine could fool an operation into thinking the master became the leader before the metadata was reloaded. Three other things of note here: - The new lock is acquired via TryLock() so that, if the lock could not be acquired, the RPC will fail rather than block. A future patch modifies TSHeartbeat() to partially accept heartbeats even if the master is a follower; TryLock() means that a transitioning leader that is pelted with RPCs won't fill up its service queue and can still process heartbeats. - TableInfo's AddTask() and RemoveTask() methods now don't hold the table's lock when adding and removing refs from the task respectively. This is the fix for the original test failure. - When reloading metadata, we now abort all outstanding table tasks to avoid orphaning them. 1. http://dist-test.cloudera.org:8080/diagnose?key=224b3aa2-3c87-11e6-9a09-0242ac110001 Change-Id: I5084c09f1a77ccf620fb6cd621094c4778d636f8 --- M src/kudu/integration-tests/create-table-stress-test.cc M src/kudu/integration-tests/mini_cluster.cc M src/kudu/master/catalog_manager.cc M src/kudu/master/catalog_manager.h M src/kudu/master/master.cc M src/kudu/master/master_service.cc 6 files changed, 331 insertions(+), 144 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/50/3550/7 -- To view, visit http://gerrit.cloudera.org:8080/3550 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I5084c09f1a77ccf620fb6cd621094c4778d636f8 Gerrit-PatchSet: 7 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Adar DemboGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Todd Lipcon
[kudu-CR] master: add read-write lock to serialize operations around elections
Todd Lipcon has posted comments on this change. Change subject: master: add read-write lock to serialize operations around elections .. Patch Set 6: Code-Review+2 (1 comment) http://gerrit.cloudera.org:8080/#/c/3550/6//COMMIT_MSG Commit Message: Line 30: RPCs won't fill up its service queue and can still process heartbeats. ah yes, we had basically this problem with the original HDFS HA implementation! -- To view, visit http://gerrit.cloudera.org:8080/3550 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I5084c09f1a77ccf620fb6cd621094c4778d636f8 Gerrit-PatchSet: 6 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Adar DemboGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: Yes
[kudu-CR] master: add read-write lock to serialize operations around elections
Dan Burkert has posted comments on this change. Change subject: master: add read-write lock to serialize operations around elections .. Patch Set 6: Code-Review+1 -- To view, visit http://gerrit.cloudera.org:8080/3550 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I5084c09f1a77ccf620fb6cd621094c4778d636f8 Gerrit-PatchSet: 6 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Adar DemboGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: No
[kudu-CR] master: add read-write lock to serialize operations around elections
Kudu Jenkins has posted comments on this change. Change subject: master: add read-write lock to serialize operations around elections .. Patch Set 6: Build Started http://104.196.14.100/job/kudu-gerrit/2317/ -- To view, visit http://gerrit.cloudera.org:8080/3550 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I5084c09f1a77ccf620fb6cd621094c4778d636f8 Gerrit-PatchSet: 6 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Adar DemboGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: No
[kudu-CR] master: add read-write lock to serialize operations around elections
Dan Burkert has posted comments on this change. Change subject: master: add read-write lock to serialize operations around elections .. Patch Set 5: Code-Review+1 -- To view, visit http://gerrit.cloudera.org:8080/3550 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I5084c09f1a77ccf620fb6cd621094c4778d636f8 Gerrit-PatchSet: 5 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Adar DemboGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: No
[kudu-CR] master: add read-write lock to serialize operations around elections
Kudu Jenkins has posted comments on this change. Change subject: master: add read-write lock to serialize operations around elections .. Patch Set 5: Build Started http://104.196.14.100/job/kudu-gerrit/2296/ -- To view, visit http://gerrit.cloudera.org:8080/3550 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I5084c09f1a77ccf620fb6cd621094c4778d636f8 Gerrit-PatchSet: 5 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Adar DemboGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: No
[kudu-CR] master: add read-write lock to serialize operations around elections
Hello Kudu Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/3550 to look at the new patch set (#5). Change subject: master: add read-write lock to serialize operations around elections .. master: add read-write lock to serialize operations around elections This rigmarole began with an investigation into a test failure [1], which led to a new integration test that hammers VisitTablesAndTablets() while creating tables. That test revealed other locking issues, which brings us to where we are now. This patch introduces a read-write lock to serialize all master operations so that they fall on one side or the other of a leader election. The idea is to avoid performing operations concurrently with a reload of the master metadata; doing so can lead to problems in Shutdown() and (very rarely, perhaps only conceptually) to inconsistent on-disk state. I was hoping this lock could replace the fencing done by leader_ready_term_, but eventually reasoned that we need both; without leader_ready_term_ fencing, the master's consensus state machine could fool an operation into thinking the master became the leader before the metadata was reloaded. Three other things of note here: - The new lock is acquired via TryLock() so that, if the lock could not be acquired, the RPC will fail rather than block. A future patch modifies TSHeartbeat() to partially accept heartbeats even if the master is a follower; TryLock() means that a transitioning leader that is pelted with RPCs won't fill up its service queue and can still process heartbeats. - TableInfo's AddTask() and RemoveTask() methods now don't hold the table's lock when adding and removing refs from the task respectively. This is the fix for the original test failure. - When reloading metadata, we now abort all outstanding table tasks to avoid orphaning them. 1. http://dist-test.cloudera.org:8080/diagnose?key=224b3aa2-3c87-11e6-9a09-0242ac110001 Change-Id: I5084c09f1a77ccf620fb6cd621094c4778d636f8 --- M src/kudu/integration-tests/create-table-stress-test.cc M src/kudu/integration-tests/mini_cluster.cc M src/kudu/master/catalog_manager.cc M src/kudu/master/catalog_manager.h M src/kudu/master/master.cc M src/kudu/master/master_service.cc 6 files changed, 331 insertions(+), 144 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/50/3550/5 -- To view, visit http://gerrit.cloudera.org:8080/3550 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I5084c09f1a77ccf620fb6cd621094c4778d636f8 Gerrit-PatchSet: 5 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Adar DemboGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Todd Lipcon
[kudu-CR] master: add read-write lock to serialize operations around elections
Adar Dembo has posted comments on this change. Change subject: master: add read-write lock to serialize operations around elections .. Patch Set 4: (2 comments) http://gerrit.cloudera.org:8080/#/c/3550/4/src/kudu/master/catalog_manager.cc File src/kudu/master/catalog_manager.cc: Line 358: // If there is an error (e.g., we are not the leader) abort this task > Should this comment be update? Not necessarily; this master could lose its leadership after taking the lock. It'll find out when trying to replicate to the followers, and that error will propagate up here. Line 3147: std::lock_guard l(catalog_->state_lock_); > I expected these checks to be made 'on demand' in CheckIsInitializedOrRespo For one, some users don't use the CheckIs...() methods; they call the status() accessors only. For two, I modeled this class after lock_guard and shared_lock; I want its users to think that the resource is taken when the class is declared, and released when the instance goes out of scope. -- To view, visit http://gerrit.cloudera.org:8080/3550 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I5084c09f1a77ccf620fb6cd621094c4778d636f8 Gerrit-PatchSet: 4 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Adar DemboGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: Yes
[kudu-CR] master: add read-write lock to serialize operations around elections
Dan Burkert has posted comments on this change. Change subject: master: add read-write lock to serialize operations around elections .. Patch Set 4: (2 comments) http://gerrit.cloudera.org:8080/#/c/3550/4/src/kudu/master/catalog_manager.cc File src/kudu/master/catalog_manager.cc: Line 358: // If there is an error (e.g., we are not the leader) abort this task Should this comment be update? Line 3147: std::lock_guard l(catalog_->state_lock_); I expected these checks to be made 'on demand' in CheckIsInitializedOrRespond/CheckIsInitializedAndIsLeaderOrRespond. Any reason not to do it there? -- To view, visit http://gerrit.cloudera.org:8080/3550 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I5084c09f1a77ccf620fb6cd621094c4778d636f8 Gerrit-PatchSet: 4 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Adar DemboGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: Yes
[kudu-CR] master: add read-write lock to serialize operations around elections
Kudu Jenkins has posted comments on this change. Change subject: master: add read-write lock to serialize operations around elections .. Patch Set 4: Build Started http://104.196.14.100/job/kudu-gerrit/2262/ -- To view, visit http://gerrit.cloudera.org:8080/3550 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I5084c09f1a77ccf620fb6cd621094c4778d636f8 Gerrit-PatchSet: 4 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Adar DemboGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: No
[kudu-CR] master: add read-write lock to serialize operations around elections
Hello Dan Burkert, David Ribeiro Alves, Todd Lipcon, Kudu Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/3550 to look at the new patch set (#4). Change subject: master: add read-write lock to serialize operations around elections .. master: add read-write lock to serialize operations around elections This rigmarole began with an investigation into a test failure [1], which led to a new integration test that hammers VisitTablesAndTablets() while creating tables. That test revealed other locking issues, which brings us to where we are now. This patch introduces a read-write lock to serialize all master operations so that they fall on one side or the other of a leader election. The idea is to avoid performing operations concurrently with a reload of the master metadata; doing so can lead to problems in Shutdown() and (very rarely, perhaps only conceptually) to inconsistent on-disk state. I was hoping this lock could replace the fencing done by leader_ready_term_, but eventually reasoned that we need both; without leader_ready_term_ fencing, the master's consensus state machine could fool an operation into thinking the master became the leader before the metadata was reloaded. Three other things of note here: - The new lock is acquired via TryLock() so that, if the lock could not be acquired, the RPC will fail rather than block. A future patch modifies TSHeartbeat() to partially accept heartbeats even if the master is a follower; TryLock() means that a transitioning leader that is pelted with RPCs won't fill up its service queue and can still process heartbeats. - TableInfo's AddTask() and RemoveTask() methods now don't hold the table's lock when adding and removing refs from the task respectively. This is the fix for the original test failure. - When reloading metadata, we now abort all outstanding table tasks to avoid orphaning them. 1. http://dist-test.cloudera.org:8080/diagnose?key=224b3aa2-3c87-11e6-9a09-0242ac110001 Change-Id: I5084c09f1a77ccf620fb6cd621094c4778d636f8 --- M src/kudu/integration-tests/create-table-stress-test.cc M src/kudu/integration-tests/mini_cluster.cc M src/kudu/master/catalog_manager.cc M src/kudu/master/catalog_manager.h M src/kudu/master/master.cc M src/kudu/master/master_service.cc 6 files changed, 331 insertions(+), 144 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/50/3550/4 -- To view, visit http://gerrit.cloudera.org:8080/3550 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I5084c09f1a77ccf620fb6cd621094c4778d636f8 Gerrit-PatchSet: 4 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Adar DemboGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Todd Lipcon
[kudu-CR] master: add read-write lock to serialize operations around elections
Adar Dembo has posted comments on this change. Change subject: master: add read-write lock to serialize operations around elections .. Patch Set 3: (8 comments) http://gerrit.cloudera.org:8080/#/c/3550/3/src/kudu/master/catalog_manager.cc File src/kudu/master/catalog_manager.cc: PS3, Line 345: l.c > could be more descriptive here? Done PS3, Line 367: aborting the current task... > hmm, this (old) log message is nonsense, right? Yeah. I'll just remove the whole thing. Not useful and adds noise to the control flow. Line 657: e.second->WaitTasksCompletion(); > does this put a restriction on the tasks themselves that they may not block Correct. As it turns out, there was a separate deadlock issue which I've fixed: the task callbacks ran on the same singleton thread pool as this function. Now none of the tasks touch the leader lock. http://gerrit.cloudera.org:8080/#/c/3550/3/src/kudu/master/catalog_manager.h File src/kudu/master/catalog_manager.h: PS3, Line 304: General status > a specific example of what types of "bad status" this might represent would Done PS3, Line 305: leadership_status > you mean leader_status()? Yes to both. PS3, Line 308: Leadership status > would a simple bool suffice? CheckIsInitializedAndIsLeaderOrRespond() needs access to the status itself because it writes the message into the response, but that doesn't preclude the leader_status() accessor from being a bool. But, this way it's more consistent with catalog_status(), which I like. Do you feel strongly? PS3, Line 316: '. > and returns false Done PS3, Line 691: etc > what about read operations? should you specifically say write operations? a As you observed, reads should also be guarded by this lock. It may be possible for reads to get by with the existing locks, but that means blocking for a long period of time on a spinlock, which isn't ideal. I went back and forth on this, but in the end reasoned that it's far simpler for _every_ operation to acquire it. -- To view, visit http://gerrit.cloudera.org:8080/3550 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I5084c09f1a77ccf620fb6cd621094c4778d636f8 Gerrit-PatchSet: 3 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Adar DemboGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: Yes
[kudu-CR] master: add read-write lock to serialize operations around elections
Todd Lipcon has posted comments on this change. Change subject: master: add read-write lock to serialize operations around elections .. Patch Set 3: (8 comments) http://gerrit.cloudera.org:8080/#/c/3550/3/src/kudu/master/catalog_manager.cc File src/kudu/master/catalog_manager.cc: PS3, Line 345: l.c could be more descriptive here? PS3, Line 367: aborting the current task... hmm, this (old) log message is nonsense, right? Line 657: e.second->WaitTasksCompletion(); does this put a restriction on the tasks themselves that they may not block on the leader lock? otherwise is there a deadlock case? http://gerrit.cloudera.org:8080/#/c/3550/3/src/kudu/master/catalog_manager.h File src/kudu/master/catalog_manager.h: PS3, Line 304: General status a specific example of what types of "bad status" this might represent would be useful PS3, Line 305: leadership_status you mean leader_status()? should we DCHECK(catalog_status.ok()) within leader_status() in this case? PS3, Line 308: Leadership status would a simple bool suffice? PS3, Line 316: '. and returns false PS3, Line 691: etc what about read operations? should you specifically say write operations? are read operations already protected by the tableinfo locks, etc? or should we acquire in them too to prevent a race where the Visit*() functions clear the maps just as a read request arrives (but after the read request checks leadership) -- To view, visit http://gerrit.cloudera.org:8080/3550 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I5084c09f1a77ccf620fb6cd621094c4778d636f8 Gerrit-PatchSet: 3 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Adar DemboGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: Yes
[kudu-CR] master: add read-write lock to serialize operations around elections
Kudu Jenkins has posted comments on this change. Change subject: master: add read-write lock to serialize operations around elections .. Patch Set 3: Build Started http://104.196.14.100/job/kudu-gerrit/2151/ -- To view, visit http://gerrit.cloudera.org:8080/3550 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I5084c09f1a77ccf620fb6cd621094c4778d636f8 Gerrit-PatchSet: 3 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Adar DemboGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: No
[kudu-CR] master: add read-write lock to serialize operations around elections
Hello Kudu Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/3550 to look at the new patch set (#3). Change subject: master: add read-write lock to serialize operations around elections .. master: add read-write lock to serialize operations around elections This rigmarole began with an investigation into a test failure [1], which led to a new integration test that hammers VisitTablesAndTablets() while creating tables. That test revealed other locking issues, which brings us to where we are now. This patch introduces a read-write lock to serialize all master operations so that they fall on one side or the other of a leader election. The idea is to avoid performing operations concurrently with a reload of the master metadata; doing so can lead to problems in Shutdown() and (very rarely, perhaps only conceptually) to inconsistent on-disk state. I was hoping this lock could replace the fencing done by leader_ready_term_, but eventually reasoned that we need both; without leader_ready_term_ fencing, the master's consensus state machine could fool an operation into thinking the master became the leader before the metadata was reloaded. Two other things of note here: - TableInfo's AddTask() and RemoveTask() methods now don't hold the table's lock when adding and removing refs from the task respectively. This is the fix for the original test failure. - When reloading metadata, we now abort all outstanding table tasks to avoid orphaning them. 1. http://dist-test.cloudera.org:8080/diagnose?key=224b3aa2-3c87-11e6-9a09-0242ac110001 Change-Id: I5084c09f1a77ccf620fb6cd621094c4778d636f8 --- M src/kudu/integration-tests/create-table-stress-test.cc M src/kudu/integration-tests/mini_cluster.cc M src/kudu/master/catalog_manager.cc M src/kudu/master/catalog_manager.h M src/kudu/master/master.cc M src/kudu/master/master_service.cc 6 files changed, 294 insertions(+), 136 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/50/3550/3 -- To view, visit http://gerrit.cloudera.org:8080/3550 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I5084c09f1a77ccf620fb6cd621094c4778d636f8 Gerrit-PatchSet: 3 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Adar DemboGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Todd Lipcon
[kudu-CR] master: add read-write lock to serialize operations around elections
Kudu Jenkins has posted comments on this change. Change subject: master: add read-write lock to serialize operations around elections .. Patch Set 2: Build Started http://104.196.14.100/job/kudu-gerrit/2150/ -- To view, visit http://gerrit.cloudera.org:8080/3550 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I5084c09f1a77ccf620fb6cd621094c4778d636f8 Gerrit-PatchSet: 2 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Adar DemboGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: No
[kudu-CR] master: add read-write lock to serialize operations around elections
Adar Dembo has posted comments on this change. Change subject: master: add read-write lock to serialize operations around elections .. Patch Set 1: I've been reviewing the test failure and reproducing it locally. I think I need to continue incorporating leader_term_ into the check in some way; at least checking that it's not -1, I believe. Otherwise it's possible for a caller to grab a ScopedLeaderSharedLock just as the master is starting up, determine that the master is the leader, and let go of the lock, after which VisitTablesAndTablets grabs the lock and begins loading metadata. -- To view, visit http://gerrit.cloudera.org:8080/3550 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I5084c09f1a77ccf620fb6cd621094c4778d636f8 Gerrit-PatchSet: 1 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Adar DemboGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: No
[kudu-CR] master: add read-write lock to serialize operations around elections
Hello Dan Burkert, Todd Lipcon, I'd like you to do a code review. Please visit http://gerrit.cloudera.org:8080/3550 to review the following change. Change subject: master: add read-write lock to serialize operations around elections .. master: add read-write lock to serialize operations around elections This rigmarole began with an investigation into a test failure [1], which led to a new integration test that hammers VisitTablesAndTablets() while creating tables. That test revealed other locking issues, which brings us to where we are now. This patch introduces a read-write lock to serialize all master operations so that they fall on one side or the other of a leader election. The idea is to avoid performing operations concurrently with a reload of the master metadata; doing so can lead to problems in Shutdown() and (very rarely, perhaps only conceptually) to inconsistent on-disk state. Two other things of note here: - TableInfo's AddTask() and RemoveTask() methods now don't hold the table's lock when adding and removing refs from the task respectively. This is the fix for the original test failure. - When reloading metadata, we now abort all outstanding table tasks to avoid orphaning them. 1. http://dist-test.cloudera.org:8080/diagnose?key=224b3aa2-3c87-11e6-9a09-0242ac110001 Change-Id: I5084c09f1a77ccf620fb6cd621094c4778d636f8 --- M src/kudu/integration-tests/create-table-stress-test.cc M src/kudu/integration-tests/mini_cluster.cc M src/kudu/master/catalog_manager.cc M src/kudu/master/catalog_manager.h M src/kudu/master/master.cc M src/kudu/master/master_service.cc 6 files changed, 292 insertions(+), 146 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/50/3550/1 -- To view, visit http://gerrit.cloudera.org:8080/3550 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newchange Gerrit-Change-Id: I5084c09f1a77ccf620fb6cd621094c4778d636f8 Gerrit-PatchSet: 1 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Adar DemboGerrit-Reviewer: Dan Burkert Gerrit-Reviewer: Todd Lipcon
[kudu-CR] master: add read-write lock to serialize operations around elections
Kudu Jenkins has posted comments on this change. Change subject: master: add read-write lock to serialize operations around elections .. Patch Set 1: Build Started http://104.196.14.100/job/kudu-gerrit/2143/ -- To view, visit http://gerrit.cloudera.org:8080/3550 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I5084c09f1a77ccf620fb6cd621094c4778d636f8 Gerrit-PatchSet: 1 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Adar DemboGerrit-Reviewer: Dan Burkert Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: No