[kudu-CR] [java client] RPCs can get lost in a TabletClient race
Jean-Daniel Cryans has submitted this change and it was merged. Change subject: [java client] RPCs can get lost in a TabletClient race .. [java client] RPCs can get lost in a TabletClient race We saw hangs after running ITBLL for hours. Turns out that the recent fixes in TabletClient introduced a new race condition. rpcs_inflight is being cleaned in cleanup() by copying all the elements from it and then calling clear(). Even though this is done under a lock, that lock isn't protecting rpcs_inflight so it's possible to clear() rpcs that were not copied out. I haven't been able to recreate this race in unit tests, but it fixed ITBLL. Change-Id: Iaff89eb832d0d6f0dede198661856fae1a8585a0 Reviewed-on: http://gerrit.cloudera.org:8080/3541 Tested-by: Kudu Jenkins Reviewed-by: Adar Dembo --- M java/kudu-client/src/main/java/org/kududb/client/TabletClient.java 1 file changed, 20 insertions(+), 11 deletions(-) Approvals: Adar Dembo: Looks good to me, approved Kudu Jenkins: Verified -- To view, visit http://gerrit.cloudera.org:8080/3541 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: merged Gerrit-Change-Id: Iaff89eb832d0d6f0dede198661856fae1a8585a0 Gerrit-PatchSet: 4 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Jean-Daniel Cryans Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: Jean-Daniel Cryans Gerrit-Reviewer: Kudu Jenkins
[kudu-CR] [java client] RPCs can get lost in a TabletClient race
Adar Dembo has posted comments on this change. Change subject: [java client] RPCs can get lost in a TabletClient race .. Patch Set 3: Code-Review+2 -- To view, visit http://gerrit.cloudera.org:8080/3541 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Iaff89eb832d0d6f0dede198661856fae1a8585a0 Gerrit-PatchSet: 3 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Jean-Daniel Cryans Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: Jean-Daniel Cryans Gerrit-Reviewer: Kudu Jenkins Gerrit-HasComments: No
[kudu-CR] [java client] RPCs can get lost in a TabletClient race
Jean-Daniel Cryans has posted comments on this change. Change subject: [java client] RPCs can get lost in a TabletClient race .. Patch Set 1: > (4 comments) > > What releases (if any) are affected by this? None. > > We've been applying bandaid after bandaid for these concurrency > issues, and the lack of regression tests is increasingly > concerning. Is there something missing from the client that'd make > writing such tests easier? I was hoping ITClient would be that thing, but I think it's missing enough concurrency to trigger most issues we see. You're right that this is concerning. -- To view, visit http://gerrit.cloudera.org:8080/3541 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Iaff89eb832d0d6f0dede198661856fae1a8585a0 Gerrit-PatchSet: 1 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Jean-Daniel Cryans Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: Jean-Daniel Cryans Gerrit-Reviewer: Kudu Jenkins Gerrit-HasComments: No
[kudu-CR] [java client] RPCs can get lost in a TabletClient race
Hello Kudu Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/3541 to look at the new patch set (#3). Change subject: [java client] RPCs can get lost in a TabletClient race .. [java client] RPCs can get lost in a TabletClient race We saw hangs after running ITBLL for hours. Turns out that the recent fixes in TabletClient introduced a new race condition. rpcs_inflight is being cleaned in cleanup() by copying all the elements from it and then calling clear(). Even though this is done under a lock, that lock isn't protecting rpcs_inflight so it's possible to clear() rpcs that were not copied out. I haven't been able to recreate this race in unit tests, but it fixed ITBLL. Change-Id: Iaff89eb832d0d6f0dede198661856fae1a8585a0 --- M java/kudu-client/src/main/java/org/kududb/client/TabletClient.java 1 file changed, 20 insertions(+), 11 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/41/3541/3 -- To view, visit http://gerrit.cloudera.org:8080/3541 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: Iaff89eb832d0d6f0dede198661856fae1a8585a0 Gerrit-PatchSet: 3 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Jean-Daniel Cryans Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: Jean-Daniel Cryans Gerrit-Reviewer: Kudu Jenkins
[kudu-CR] [java client] RPCs can get lost in a TabletClient race
Kudu Jenkins has posted comments on this change. Change subject: [java client] RPCs can get lost in a TabletClient race .. Patch Set 3: Build Started http://104.196.14.100/job/kudu-gerrit/2109/ -- To view, visit http://gerrit.cloudera.org:8080/3541 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Iaff89eb832d0d6f0dede198661856fae1a8585a0 Gerrit-PatchSet: 3 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Jean-Daniel Cryans Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: Jean-Daniel Cryans Gerrit-Reviewer: Kudu Jenkins Gerrit-HasComments: No
[kudu-CR] [java client] RPCs can get lost in a TabletClient race
Jean-Daniel Cryans has posted comments on this change. Change subject: [java client] RPCs can get lost in a TabletClient race .. Patch Set 1: (4 comments) http://gerrit.cloudera.org:8080/#/c/3541/1/java/kudu-client/src/main/java/org/kududb/client/TabletClient.java File java/kudu-client/src/main/java/org/kududb/client/TabletClient.java: PS1, Line 187: We got disconnected, cleanup ran, then we added the rpc to rpcs_inflight, so we need to : // manually failOrRetry it. > I think this is too vague. Multiple threads are at work here, right? It's n Done PS1, Line 678: // Removing from the iterator because ConcurrentHashMap doesn't have a way to retrieve and : // remove all the entries atomically. : for (Iterator> iterator = rpcs_inflight.values().iterator(); iterator.hasNext();) { : KuduRpc rpc = iterator.next(); : rpcs.add(rpc); : iterator.remove(); : } > I don't get it; how is this any more "atomic" than the previous code? What It's not more atomic, we get around this by assuming we might be missing some rpcs that got added while iterating, hence the check in sendRpc above. I removed the comment, it's not really making things clearer. Line 685: // After this, rpcs_inflight might still have entries since it could have been added > Nit: it -> they Done Line 686: // concurrently, and those RPCs will be handled by their called in sendRpc. > Nit: "their called"? Done -- To view, visit http://gerrit.cloudera.org:8080/3541 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Iaff89eb832d0d6f0dede198661856fae1a8585a0 Gerrit-PatchSet: 1 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Jean-Daniel Cryans Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: Jean-Daniel Cryans Gerrit-Reviewer: Kudu Jenkins Gerrit-HasComments: Yes
[kudu-CR] [java client] RPCs can get lost in a TabletClient race
Kudu Jenkins has posted comments on this change. Change subject: [java client] RPCs can get lost in a TabletClient race .. Patch Set 2: Build Started http://104.196.14.100/job/kudu-gerrit/2108/ -- To view, visit http://gerrit.cloudera.org:8080/3541 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Iaff89eb832d0d6f0dede198661856fae1a8585a0 Gerrit-PatchSet: 2 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Jean-Daniel Cryans Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: Kudu Jenkins Gerrit-HasComments: No
[kudu-CR] [java client] RPCs can get lost in a TabletClient race
Adar Dembo has posted comments on this change. Change subject: [java client] RPCs can get lost in a TabletClient race .. Patch Set 1: (4 comments) What releases (if any) are affected by this? We've been applying bandaid after bandaid for these concurrency issues, and the lack of regression tests is increasingly concerning. Is there something missing from the client that'd make writing such tests easier? http://gerrit.cloudera.org:8080/#/c/3541/1/java/kudu-client/src/main/java/org/kududb/client/TabletClient.java File java/kudu-client/src/main/java/org/kududb/client/TabletClient.java: PS1, Line 187: We got disconnected, cleanup ran, then we added the rpc to rpcs_inflight, so we need to : // manually failOrRetry it. I think this is too vague. Multiple threads are at work here, right? It's not clear what the race is from this terse description. Can you rewrite with a concrete example? PS1, Line 678: // Removing from the iterator because ConcurrentHashMap doesn't have a way to retrieve and : // remove all the entries atomically. : for (Iterator> iterator = rpcs_inflight.values().iterator(); iterator.hasNext();) { : KuduRpc rpc = iterator.next(); : rpcs.add(rpc); : iterator.remove(); : } I don't get it; how is this any more "atomic" than the previous code? What am I missing? Line 685: // After this, rpcs_inflight might still have entries since it could have been added Nit: it -> they Line 686: // concurrently, and those RPCs will be handled by their called in sendRpc. Nit: "their called"? Maybe just rewrite this paragraph to make it more clear. -- To view, visit http://gerrit.cloudera.org:8080/3541 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Iaff89eb832d0d6f0dede198661856fae1a8585a0 Gerrit-PatchSet: 1 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Jean-Daniel Cryans Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: Kudu Jenkins Gerrit-HasComments: Yes
[kudu-CR] [java client] RPCs can get lost in a TabletClient race
Hello Kudu Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/3541 to look at the new patch set (#2). Change subject: [java client] RPCs can get lost in a TabletClient race .. [java client] RPCs can get lost in a TabletClient race We saw hangs after running ITBLL for hours. Turns out that the recent fixes in TabletClient introduced a new race condition. rpcs_inflight is being cleaned in cleanup() by copying all the elements from it and then calling clear(). Even though this is done under a lock, that lock isn't protecting rpcs_inflight so it's possible to clear() rpcs that were not copied out. I haven't been able to recreate this race in unit tests, but it fixed ITBLL. Change-Id: Iaff89eb832d0d6f0dede198661856fae1a8585a0 --- M java/kudu-client/src/main/java/org/kududb/client/TabletClient.java 1 file changed, 20 insertions(+), 11 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/41/3541/2 -- To view, visit http://gerrit.cloudera.org:8080/3541 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: Iaff89eb832d0d6f0dede198661856fae1a8585a0 Gerrit-PatchSet: 2 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Jean-Daniel Cryans Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: Kudu Jenkins
[kudu-CR] [java client] RPCs can get lost in a TabletClient race
Dan Burkert has posted comments on this change. Change subject: [java client] RPCs can get lost in a TabletClient race .. Patch Set 1: (1 comment) http://gerrit.cloudera.org:8080/#/c/3541/1/java/kudu-client/src/main/java/org/kududb/client/TabletClient.java File java/kudu-client/src/main/java/org/kududb/client/TabletClient.java: Line 686: // concurrently, and those RPCs will be handled by their called in sendRpc. I think "handled by their called in sendRpc" is a typo -- To view, visit http://gerrit.cloudera.org:8080/3541 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Iaff89eb832d0d6f0dede198661856fae1a8585a0 Gerrit-PatchSet: 1 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Jean-Daniel Cryans Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: Kudu Jenkins Gerrit-HasComments: Yes
[kudu-CR] [java client] RPCs can get lost in a TabletClient race
Kudu Jenkins has posted comments on this change. Change subject: [java client] RPCs can get lost in a TabletClient race .. Patch Set 1: Build Started http://104.196.14.100/job/kudu-gerrit/2106/ -- To view, visit http://gerrit.cloudera.org:8080/3541 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Iaff89eb832d0d6f0dede198661856fae1a8585a0 Gerrit-PatchSet: 1 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Jean-Daniel Cryans Gerrit-Reviewer: Kudu Jenkins Gerrit-HasComments: No
[kudu-CR] [java client] RPCs can get lost in a TabletClient race
Jean-Daniel Cryans has uploaded a new change for review. http://gerrit.cloudera.org:8080/3541 Change subject: [java client] RPCs can get lost in a TabletClient race .. [java client] RPCs can get lost in a TabletClient race We saw hangs after running ITBLL for hours. Turns out that the recent fixes in TabletClient introduced a new race condition. rpcs_inflight is being cleaned in cleanup() by copying all the elements from it and then calling clear(). Even though this is done under a lock, that lock isn't protecting rpcs_inflight so it's possible to clear() rpcs that were not copied out. I haven't been able to recreate this race in unit tests, but it fixed ITBLL. Change-Id: Iaff89eb832d0d6f0dede198661856fae1a8585a0 --- M java/kudu-client/src/main/java/org/kududb/client/TabletClient.java 1 file changed, 20 insertions(+), 11 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/41/3541/1 -- To view, visit http://gerrit.cloudera.org:8080/3541 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newchange Gerrit-Change-Id: Iaff89eb832d0d6f0dede198661856fae1a8585a0 Gerrit-PatchSet: 1 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Jean-Daniel Cryans