[kudu-CR] [java client] RPCs can get lost in a TabletClient race

2016-06-30 Thread Jean-Daniel Cryans (Code Review)
Jean-Daniel Cryans has submitted this change and it was merged.

Change subject: [java client] RPCs can get lost in a TabletClient race
..


[java client] RPCs can get lost in a TabletClient race

We saw hangs after running ITBLL for hours. Turns out that the recent
fixes in TabletClient introduced a new race condition. rpcs_inflight
is being cleaned in cleanup() by copying all the elements from it
and then calling clear(). Even though this is done under a lock, that
lock isn't protecting rpcs_inflight so it's possible to clear() rpcs
that were not copied out.

I haven't been able to recreate this race in unit tests, but it fixed
ITBLL.

Change-Id: Iaff89eb832d0d6f0dede198661856fae1a8585a0
Reviewed-on: http://gerrit.cloudera.org:8080/3541
Tested-by: Kudu Jenkins
Reviewed-by: Adar Dembo 
---
M java/kudu-client/src/main/java/org/kududb/client/TabletClient.java
1 file changed, 20 insertions(+), 11 deletions(-)

Approvals:
  Adar Dembo: Looks good to me, approved
  Kudu Jenkins: Verified



-- 
To view, visit http://gerrit.cloudera.org:8080/3541
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: merged
Gerrit-Change-Id: Iaff89eb832d0d6f0dede198661856fae1a8585a0
Gerrit-PatchSet: 4
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Jean-Daniel Cryans 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: Jean-Daniel Cryans 
Gerrit-Reviewer: Kudu Jenkins


[kudu-CR] [java client] RPCs can get lost in a TabletClient race

2016-06-30 Thread Adar Dembo (Code Review)
Adar Dembo has posted comments on this change.

Change subject: [java client] RPCs can get lost in a TabletClient race
..


Patch Set 3: Code-Review+2

-- 
To view, visit http://gerrit.cloudera.org:8080/3541
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Iaff89eb832d0d6f0dede198661856fae1a8585a0
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Jean-Daniel Cryans 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: Jean-Daniel Cryans 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-HasComments: No


[kudu-CR] [java client] RPCs can get lost in a TabletClient race

2016-06-30 Thread Jean-Daniel Cryans (Code Review)
Jean-Daniel Cryans has posted comments on this change.

Change subject: [java client] RPCs can get lost in a TabletClient race
..


Patch Set 1:

> (4 comments)
 > 
 > What releases (if any) are affected by this?

None.

 > 
 > We've been applying bandaid after bandaid for these concurrency
 > issues, and the lack of regression tests is increasingly
 > concerning. Is there something missing from the client that'd make
 > writing such tests easier?

I was hoping ITClient would be that thing, but I think it's missing enough 
concurrency to trigger most issues we see. You're right that this is concerning.

-- 
To view, visit http://gerrit.cloudera.org:8080/3541
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Iaff89eb832d0d6f0dede198661856fae1a8585a0
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Jean-Daniel Cryans 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: Jean-Daniel Cryans 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-HasComments: No


[kudu-CR] [java client] RPCs can get lost in a TabletClient race

2016-06-29 Thread Jean-Daniel Cryans (Code Review)
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/3541

to look at the new patch set (#3).

Change subject: [java client] RPCs can get lost in a TabletClient race
..

[java client] RPCs can get lost in a TabletClient race

We saw hangs after running ITBLL for hours. Turns out that the recent
fixes in TabletClient introduced a new race condition. rpcs_inflight
is being cleaned in cleanup() by copying all the elements from it
and then calling clear(). Even though this is done under a lock, that
lock isn't protecting rpcs_inflight so it's possible to clear() rpcs
that were not copied out.

I haven't been able to recreate this race in unit tests, but it fixed
ITBLL.

Change-Id: Iaff89eb832d0d6f0dede198661856fae1a8585a0
---
M java/kudu-client/src/main/java/org/kududb/client/TabletClient.java
1 file changed, 20 insertions(+), 11 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/41/3541/3
-- 
To view, visit http://gerrit.cloudera.org:8080/3541
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Iaff89eb832d0d6f0dede198661856fae1a8585a0
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Jean-Daniel Cryans 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: Jean-Daniel Cryans 
Gerrit-Reviewer: Kudu Jenkins


[kudu-CR] [java client] RPCs can get lost in a TabletClient race

2016-06-29 Thread Kudu Jenkins (Code Review)
Kudu Jenkins has posted comments on this change.

Change subject: [java client] RPCs can get lost in a TabletClient race
..


Patch Set 3:

Build Started http://104.196.14.100/job/kudu-gerrit/2109/

-- 
To view, visit http://gerrit.cloudera.org:8080/3541
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Iaff89eb832d0d6f0dede198661856fae1a8585a0
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Jean-Daniel Cryans 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: Jean-Daniel Cryans 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-HasComments: No


[kudu-CR] [java client] RPCs can get lost in a TabletClient race

2016-06-29 Thread Jean-Daniel Cryans (Code Review)
Jean-Daniel Cryans has posted comments on this change.

Change subject: [java client] RPCs can get lost in a TabletClient race
..


Patch Set 1:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/3541/1/java/kudu-client/src/main/java/org/kududb/client/TabletClient.java
File java/kudu-client/src/main/java/org/kududb/client/TabletClient.java:

PS1, Line 187: We got disconnected, cleanup ran, then we added the rpc to 
rpcs_inflight, so we need to
 :   // manually failOrRetry it.
> I think this is too vague. Multiple threads are at work here, right? It's n
Done


PS1, Line 678: // Removing from the iterator because ConcurrentHashMap doesn't 
have a way to retrieve and
 :   // remove all the entries atomically.
 :   for (Iterator> iterator = 
rpcs_inflight.values().iterator(); iterator.hasNext();) {
 : KuduRpc rpc = iterator.next();
 : rpcs.add(rpc);
 : iterator.remove();
 :   }
> I don't get it; how is this any more "atomic" than the previous code? What 
It's not more atomic, we get around this by assuming we might be missing some 
rpcs that got added while iterating, hence the check in sendRpc above.

I removed the comment, it's not really making things clearer.


Line 685:   // After this, rpcs_inflight might still have entries since it 
could have been added
> Nit: it -> they
Done


Line 686:   // concurrently, and those RPCs will be handled by their called 
in sendRpc.
> Nit: "their called"?
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/3541
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Iaff89eb832d0d6f0dede198661856fae1a8585a0
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Jean-Daniel Cryans 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: Jean-Daniel Cryans 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-HasComments: Yes


[kudu-CR] [java client] RPCs can get lost in a TabletClient race

2016-06-29 Thread Kudu Jenkins (Code Review)
Kudu Jenkins has posted comments on this change.

Change subject: [java client] RPCs can get lost in a TabletClient race
..


Patch Set 2:

Build Started http://104.196.14.100/job/kudu-gerrit/2108/

-- 
To view, visit http://gerrit.cloudera.org:8080/3541
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Iaff89eb832d0d6f0dede198661856fae1a8585a0
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Jean-Daniel Cryans 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-HasComments: No


[kudu-CR] [java client] RPCs can get lost in a TabletClient race

2016-06-29 Thread Adar Dembo (Code Review)
Adar Dembo has posted comments on this change.

Change subject: [java client] RPCs can get lost in a TabletClient race
..


Patch Set 1:

(4 comments)

What releases (if any) are affected by this?

We've been applying bandaid after bandaid for these concurrency issues, and the 
lack of regression tests is increasingly concerning. Is there something missing 
from the client that'd make writing such tests easier?

http://gerrit.cloudera.org:8080/#/c/3541/1/java/kudu-client/src/main/java/org/kududb/client/TabletClient.java
File java/kudu-client/src/main/java/org/kududb/client/TabletClient.java:

PS1, Line 187: We got disconnected, cleanup ran, then we added the rpc to 
rpcs_inflight, so we need to
 :   // manually failOrRetry it.
I think this is too vague. Multiple threads are at work here, right? It's not 
clear what the race is from this terse description. Can you rewrite with a 
concrete example?


PS1, Line 678: // Removing from the iterator because ConcurrentHashMap doesn't 
have a way to retrieve and
 :   // remove all the entries atomically.
 :   for (Iterator> iterator = 
rpcs_inflight.values().iterator(); iterator.hasNext();) {
 : KuduRpc rpc = iterator.next();
 : rpcs.add(rpc);
 : iterator.remove();
 :   }
I don't get it; how is this any more "atomic" than the previous code? What am I 
missing?


Line 685:   // After this, rpcs_inflight might still have entries since it 
could have been added
Nit: it -> they


Line 686:   // concurrently, and those RPCs will be handled by their called 
in sendRpc.
Nit: "their called"?

Maybe just rewrite this paragraph to make it more clear.


-- 
To view, visit http://gerrit.cloudera.org:8080/3541
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Iaff89eb832d0d6f0dede198661856fae1a8585a0
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Jean-Daniel Cryans 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-HasComments: Yes


[kudu-CR] [java client] RPCs can get lost in a TabletClient race

2016-06-29 Thread Jean-Daniel Cryans (Code Review)
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/3541

to look at the new patch set (#2).

Change subject: [java client] RPCs can get lost in a TabletClient race
..

[java client] RPCs can get lost in a TabletClient race

We saw hangs after running ITBLL for hours. Turns out that the recent
fixes in TabletClient introduced a new race condition. rpcs_inflight
is being cleaned in cleanup() by copying all the elements from it
and then calling clear(). Even though this is done under a lock, that
lock isn't protecting rpcs_inflight so it's possible to clear() rpcs
that were not copied out.

I haven't been able to recreate this race in unit tests, but it fixed
ITBLL.

Change-Id: Iaff89eb832d0d6f0dede198661856fae1a8585a0
---
M java/kudu-client/src/main/java/org/kududb/client/TabletClient.java
1 file changed, 20 insertions(+), 11 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/41/3541/2
-- 
To view, visit http://gerrit.cloudera.org:8080/3541
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Iaff89eb832d0d6f0dede198661856fae1a8585a0
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Jean-Daniel Cryans 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: Kudu Jenkins


[kudu-CR] [java client] RPCs can get lost in a TabletClient race

2016-06-29 Thread Dan Burkert (Code Review)
Dan Burkert has posted comments on this change.

Change subject: [java client] RPCs can get lost in a TabletClient race
..


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/3541/1/java/kudu-client/src/main/java/org/kududb/client/TabletClient.java
File java/kudu-client/src/main/java/org/kududb/client/TabletClient.java:

Line 686:   // concurrently, and those RPCs will be handled by their called 
in sendRpc.
I think "handled by their called in sendRpc" is a typo


-- 
To view, visit http://gerrit.cloudera.org:8080/3541
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Iaff89eb832d0d6f0dede198661856fae1a8585a0
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Jean-Daniel Cryans 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-HasComments: Yes


[kudu-CR] [java client] RPCs can get lost in a TabletClient race

2016-06-29 Thread Kudu Jenkins (Code Review)
Kudu Jenkins has posted comments on this change.

Change subject: [java client] RPCs can get lost in a TabletClient race
..


Patch Set 1:

Build Started http://104.196.14.100/job/kudu-gerrit/2106/

-- 
To view, visit http://gerrit.cloudera.org:8080/3541
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Iaff89eb832d0d6f0dede198661856fae1a8585a0
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Jean-Daniel Cryans 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-HasComments: No


[kudu-CR] [java client] RPCs can get lost in a TabletClient race

2016-06-29 Thread Jean-Daniel Cryans (Code Review)
Jean-Daniel Cryans has uploaded a new change for review.

  http://gerrit.cloudera.org:8080/3541

Change subject: [java client] RPCs can get lost in a TabletClient race
..

[java client] RPCs can get lost in a TabletClient race

We saw hangs after running ITBLL for hours. Turns out that the recent
fixes in TabletClient introduced a new race condition. rpcs_inflight
is being cleaned in cleanup() by copying all the elements from it
and then calling clear(). Even though this is done under a lock, that
lock isn't protecting rpcs_inflight so it's possible to clear() rpcs
that were not copied out.

I haven't been able to recreate this race in unit tests, but it fixed
ITBLL.

Change-Id: Iaff89eb832d0d6f0dede198661856fae1a8585a0
---
M java/kudu-client/src/main/java/org/kududb/client/TabletClient.java
1 file changed, 20 insertions(+), 11 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/41/3541/1
-- 
To view, visit http://gerrit.cloudera.org:8080/3541
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: Iaff89eb832d0d6f0dede198661856fae1a8585a0
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Jean-Daniel Cryans