[kudu-CR](branch-1.4.x) KUDU-2193 (part 2): avoid holding TSTabletManager::lock for a long time

2018-01-16 Thread Todd Lipcon (Code Review)
Todd Lipcon has submitted this change and it was merged. ( 
http://gerrit.cloudera.org:8080/8528 )

Change subject: KUDU-2193 (part 2): avoid holding TSTabletManager::lock_ for a 
long time
..

KUDU-2193 (part 2): avoid holding TSTabletManager::lock_ for a long time

This changes tablet report generation to only hold the TSTabletManager lock
long enough to copy a list of refs to the relevant tablets.

Even though the lock is a reader-writer lock, a long read-side critical
section can end up blocking other readers as long as any writer shows up.
I saw the following in a cluster:

- election storm ongoing
- T1 generating a tablet report (thus holding the reader lock)
  - blocks for a long time getting ConsensusState from some tablets currently
mid-fsync.
- T2 handling a DeleteTablet or CreateTablet call (waiting for writer lock)
- T3 through T20: blocking on reader lock in LookupTablet()

The effect here is that all other threads end up blocked until T1 finishes
its tablet report generation, which in this case can be tens of seconds due
to all of the fsync traffic. These blocked threads then contribute to the
ongoing election storm since they may delay timely responses to vote requests
from other tablets.

I tested this and the previous patch on a cluster that was previously
experiencing the issue. I triggered some elections by kill -STOPping some
servers and resuming them a few seconds later. Without these patches,
other servers started doing lots of context switches (due to the spinlock
used prior to this patch series) and I saw lots of blocked threads in
pstack. With the patch, things seem cleaner.

The following screenshot shows before/after voluntary_context_switch rate:

https://i.imgur.com/7zcbImw.png

(the missing data around 5pm is my restarting the servers with this patch)

Change-Id: I0f645775e8347a18af112b308dba56a3b4a2c681
Reviewed-on: http://gerrit.cloudera.org:8080/8346
Tested-by: Kudu Jenkins
Reviewed-by: Todd Lipcon 
(cherry picked from commit f2c21b4fac15fcaa5f53a6c7960ecd19a0664c45)
Reviewed-on: http://gerrit.cloudera.org:8080/8528
Reviewed-by: Adar Dembo 
---
M src/kudu/tserver/ts_tablet_manager.cc
M src/kudu/tserver/ts_tablet_manager.h
2 files changed, 31 insertions(+), 18 deletions(-)

Approvals:
  Kudu Jenkins: Verified
  Adar Dembo: Looks good to me, approved

--
To view, visit http://gerrit.cloudera.org:8080/8528
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: branch-1.4.x
Gerrit-MessageType: merged
Gerrit-Change-Id: I0f645775e8347a18af112b308dba56a3b4a2c681
Gerrit-Change-Number: 8528
Gerrit-PatchSet: 2
Gerrit-Owner: Todd Lipcon 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Reviewer: Will Berkeley 


[kudu-CR](branch-1.4.x) KUDU-2193 (part 2): avoid holding TSTabletManager::lock for a long time

2018-01-16 Thread Adar Dembo (Code Review)
Adar Dembo has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8528 )

Change subject: KUDU-2193 (part 2): avoid holding TSTabletManager::lock_ for a 
long time
..


Patch Set 1: Code-Review+2


--
To view, visit http://gerrit.cloudera.org:8080/8528
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: branch-1.4.x
Gerrit-MessageType: comment
Gerrit-Change-Id: I0f645775e8347a18af112b308dba56a3b4a2c681
Gerrit-Change-Number: 8528
Gerrit-PatchSet: 1
Gerrit-Owner: Todd Lipcon 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Reviewer: Will Berkeley 
Gerrit-Comment-Date: Tue, 16 Jan 2018 19:10:36 +
Gerrit-HasComments: No


[kudu-CR](branch-1.4.x) KUDU-2193 (part 2): avoid holding TSTabletManager::lock for a long time

2017-11-12 Thread Todd Lipcon (Code Review)
Hello Kudu Jenkins,

I'd like you to do a code review. Please visit

http://gerrit.cloudera.org:8080/8528

to review the following change.


Change subject: KUDU-2193 (part 2): avoid holding TSTabletManager::lock_ for a 
long time
..

KUDU-2193 (part 2): avoid holding TSTabletManager::lock_ for a long time

This changes tablet report generation to only hold the TSTabletManager lock
long enough to copy a list of refs to the relevant tablets.

Even though the lock is a reader-writer lock, a long read-side critical
section can end up blocking other readers as long as any writer shows up.
I saw the following in a cluster:

- election storm ongoing
- T1 generating a tablet report (thus holding the reader lock)
  - blocks for a long time getting ConsensusState from some tablets currently
mid-fsync.
- T2 handling a DeleteTablet or CreateTablet call (waiting for writer lock)
- T3 through T20: blocking on reader lock in LookupTablet()

The effect here is that all other threads end up blocked until T1 finishes
its tablet report generation, which in this case can be tens of seconds due
to all of the fsync traffic. These blocked threads then contribute to the
ongoing election storm since they may delay timely responses to vote requests
from other tablets.

I tested this and the previous patch on a cluster that was previously
experiencing the issue. I triggered some elections by kill -STOPping some
servers and resuming them a few seconds later. Without these patches,
other servers started doing lots of context switches (due to the spinlock
used prior to this patch series) and I saw lots of blocked threads in
pstack. With the patch, things seem cleaner.

The following screenshot shows before/after voluntary_context_switch rate:

https://i.imgur.com/7zcbImw.png

(the missing data around 5pm is my restarting the servers with this patch)

Change-Id: I0f645775e8347a18af112b308dba56a3b4a2c681
Reviewed-on: http://gerrit.cloudera.org:8080/8346
Tested-by: Kudu Jenkins
Reviewed-by: Todd Lipcon 
(cherry picked from commit f2c21b4fac15fcaa5f53a6c7960ecd19a0664c45)
---
M src/kudu/tserver/ts_tablet_manager.cc
M src/kudu/tserver/ts_tablet_manager.h
2 files changed, 31 insertions(+), 18 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/28/8528/1
--
To view, visit http://gerrit.cloudera.org:8080/8528
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: branch-1.4.x
Gerrit-MessageType: newchange
Gerrit-Change-Id: I0f645775e8347a18af112b308dba56a3b4a2c681
Gerrit-Change-Number: 8528
Gerrit-PatchSet: 1
Gerrit-Owner: Todd Lipcon 
Gerrit-Reviewer: Kudu Jenkins