[jira] [Commented] (HBASE-8776) port HBASE-8723 to 0.94

2013-06-24 Thread Sergey Shelukhin (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13692169#comment-13692169
 ] 

Sergey Shelukhin commented on HBASE-8776:
-

If the machine disappears for 5 minutes, we hope recovery can proceed in ~2 
minutes, so hopefully we only need 2-3 minutes of timeout.
Agreed that the 128s fallback is a little too long; maybe the last one should 
be 64s?

 port HBASE-8723 to 0.94
 ---

 Key: HBASE-8776
 URL: https://issues.apache.org/jira/browse/HBASE-8776
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.94.8
Reporter: Sergey Shelukhin
Assignee: Sergey Shelukhin
 Fix For: 0.94.10

 Attachments: HBASE-8776-v0.patch




--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-8776) port HBASE-8723 to 0.94

2013-06-24 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13692177#comment-13692177
 ] 

stack commented on HBASE-8776:
--

bq. ...HBase contract is 'any operation will eventually succeed'

Stating the above helps here.  That said, 5 minutes is good, ten minutes at a 
stretch.  40 minutes is abusive; ops won't be able to tell the difference 
between this and a hung process, I'd say.

I'd be good w/ coming down from 128s for last one to 64s.





[jira] [Commented] (HBASE-8776) port HBASE-8723 to 0.94

2013-06-24 Thread Lars Hofhansl (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13692220#comment-13692220
 ] 

Lars Hofhansl commented on HBASE-8776:
--

+1 to what Stack said.
I'd be happy with 5 mins.




[jira] [Commented] (HBASE-8776) port HBASE-8723 to 0.94

2013-06-24 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13692368#comment-13692368
 ] 

stack commented on HBASE-8776:
--

+1 on patch.  Add a note on commit that it makes for about 5.5 minutes in 
total.  Please do the same for trunk.

 port HBASE-8723 to 0.94
 ---

 Key: HBASE-8776
 URL: https://issues.apache.org/jira/browse/HBASE-8776
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.94.8
Reporter: Sergey Shelukhin
Assignee: Sergey Shelukhin
 Fix For: 0.94.10

 Attachments: HBASE-8776-v0.patch, HBASE-8776-v1.patch






[jira] [Commented] (HBASE-8776) port HBASE-8723 to 0.94

2013-06-24 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13692369#comment-13692369
 ] 

Hadoop QA commented on HBASE-8776:
--

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12589480/HBASE-8776-v1.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:red}-1 patch{color}.  The patch command could not apply the patch.

Console output: 
https://builds.apache.org/job/PreCommit-HBASE-Build/6128//console

This message is automatically generated.



[jira] [Commented] (HBASE-8776) port HBASE-8723 to 0.94

2013-06-21 Thread Lars Hofhansl (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13690119#comment-13690119
 ] 

Lars Hofhansl commented on HBASE-8776:
--

It was late yesterday evening (in Germany). So you want to match the 180s ZK 
timeout, to cover RS failure detection?
The default is already awfully long when ZK or the HMaster/RS is actually down. 
I prefer letting the caller know quickly rather than retrying endlessly, and 
just making it long enough to ride over a split/region move.
It looks like HBASE-8723 was specifically about an integration test, so maybe 
just increase it for the test?

Count me +0. If you think we need this change in 0.94, it's fine to commit.


 port HBASE-8723 to 0.94
 ---

 Key: HBASE-8776
 URL: https://issues.apache.org/jira/browse/HBASE-8776
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.94.8
Reporter: Sergey Shelukhin
Assignee: Sergey Shelukhin
 Fix For: 0.94.9

 Attachments: HBASE-8776-v0.patch






[jira] [Commented] (HBASE-8776) port HBASE-8723 to 0.94

2013-06-21 Thread Sergey Shelukhin (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13690508#comment-13690508
 ] 

Sergey Shelukhin commented on HBASE-8776:
-

The default ZK timeout is 40s, because the default in the ZK config trumps the 
180s default.
The problem, with 40s and especially with 180s, is that with the current retries 
we cannot even ride over one RS crash if it goes down in a bad way (i.e. without 
closing the socket to ZK, which would result in immediate recovery).
This is not specific to integration test.




[jira] [Commented] (HBASE-8776) port HBASE-8723 to 0.94

2013-06-21 Thread Sergey Shelukhin (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13690509#comment-13690509
 ] 

Sergey Shelukhin commented on HBASE-8776:
-

An interesting fact about the integration test is that by default, when CM kills 
an RS, the socket to ZK is closed, so the ZK session insta-terminates and the 
master insta-recovers. A more realistic scenario, e.g. for network issues, is 
that the ZK session timeout actually takes place, so recovery is delayed by 40s, 
and a put is highly likely to fail because the current retries last no more than 
~71s in total, and the last server selection is done at 39s in 0.94.
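
The timing argument above can be sketched in a few lines of Python. This is only an illustration, not HBase code: the multiplier list is the one implied elsewhere in this thread, the 1-second base pause and the initial lookup at t=0 are assumptions, and the function names are made up for the example.

```python
from itertools import accumulate

# Backoff multipliers as described in this thread (assumed, not read from HBase).
BACKOFF_MULTIPLIERS = [1, 1, 1, 2, 2, 4, 4, 8, 16, 32]


def selection_times(multipliers, pause_s=1):
    """Times (in seconds) at which the client picks a server.

    Assumption: the first lookup happens at t=0 and every later lookup
    precedes its sleep, so the final multiplier never buys a new lookup.
    """
    return [0] + [pause_s * t for t in accumulate(multipliers[:-1])]


def first_useful_lookup(recovered_at_s, multipliers, pause_s=1):
    """Return the first lookup time at or after recovery, or None if there is none."""
    for t in selection_times(multipliers, pause_s):
        if t >= recovered_at_s:
            return t
    return None


print(selection_times(BACKOFF_MULTIPLIERS))          # [0, 1, 2, 3, 5, 7, 11, 15, 23, 39]
print(first_useful_lookup(5, BACKOFF_MULTIPLIERS))   # 5    (clean kill, fast recovery)
print(first_useful_lookup(40, BACKOFF_MULTIPLIERS))  # None (40s ZK session timeout)
```

Under these assumptions, a region that only becomes available after the 40s session timeout is never looked up again, even though the client keeps sleeping until ~71s.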



[jira] [Commented] (HBASE-8776) port HBASE-8723 to 0.94

2013-06-21 Thread Lars Hofhansl (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13690568#comment-13690568
 ] 

Lars Hofhansl commented on HBASE-8776:
--

I guess the argument is whether by default we should be able to ride over a RS 
crash, or just over regular splits/moves.



[jira] [Commented] (HBASE-8776) port HBASE-8723 to 0.94

2013-06-21 Thread Sergey Shelukhin (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13690739#comment-13690739
 ] 

Sergey Shelukhin commented on HBASE-8776:
-

I'd say we should :)



[jira] [Commented] (HBASE-8776) port HBASE-8723 to 0.94

2013-06-21 Thread Lars Hofhansl (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13690765#comment-13690765
 ] 

Lars Hofhansl commented on HBASE-8776:
--

So is it 39 or 71 right now? (Is the initial call the first retry or not?)
Setting this the way you have it puts the timeout at over 10 minutes. I would 
guess that a client that waits 10 minutes for a call to finish will cause bad 
problems for the caller.

Maybe we are starting from a different premise. I think the defaults should be 
appropriate for a client running in an AppServer (where even the current 
setting is too long).
Are you mostly thinking about M/R workloads?




[jira] [Commented] (HBASE-8776) port HBASE-8723 to 0.94

2013-06-21 Thread Lars Hofhansl (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13690843#comment-13690843
 ] 

Lars Hofhansl commented on HBASE-8776:
--

:)

I'm from the old Unix world, man. Bubble errors up to the caller immediately 
instead of trying to recover, because only the caller typically has the 
semantic context to perform the appropriate recovery action.
Bump it up to 15 then. At Salesforce we override it to 3 anyway :)

Any comments from anybody else?




[jira] [Commented] (HBASE-8776) port HBASE-8723 to 0.94

2013-06-21 Thread Devaraj Das (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13690841#comment-13690841
 ] 

Devaraj Das commented on HBASE-8776:


Rather than increasing the defaults, couldn't we increase the site-specific 
config values to get an increased timeout? I am a little concerned that 
increasing the defaults might have far-reaching effects on existing 
applications that use the defaults (they will take more time to time out, etc.). 
I am +0 on this as well.



[jira] [Commented] (HBASE-8776) port HBASE-8723 to 0.94

2013-06-21 Thread Sergey Shelukhin (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13690826#comment-13690826
 ] 

Sergey Shelukhin commented on HBASE-8776:
-

I am thinking about not failing requests unless necessary ;)
If requests fail in normal situations, clients will retry externally to the 
HBase client, which is something we want to avoid if possible, imo. So, by 
default, requests should rather take long than fail... at least long enough to 
cover common transient failures. It's configurable, so clients can set it to 
whatever value they want.
It's 30 retries on trunk btw, which is 40+ min :)



[jira] [Commented] (HBASE-8776) port HBASE-8723 to 0.94

2013-06-21 Thread Sergey Shelukhin (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13690827#comment-13690827
 ] 

Sergey Shelukhin commented on HBASE-8776:
-

And it was 20 until the recent bump.



[jira] [Commented] (HBASE-8776) port HBASE-8723 to 0.94

2013-06-21 Thread Sergey Shelukhin (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13690847#comment-13690847
 ] 

Sergey Shelukhin commented on HBASE-8776:
-

In case of RS errors, the HBase client actually has more context on whether we 
can recover; that was my point. The less error handling we do for known errors, 
the more each user will have to do, badly.



[jira] [Commented] (HBASE-8776) port HBASE-8723 to 0.94

2013-06-21 Thread Sergey Shelukhin (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13690850#comment-13690850
 ] 

Sergey Shelukhin commented on HBASE-8776:
-

I was actually thinking that maybe I should come back to HBASE-7659 and 
introduce a strict timeout, both in config and as an optional RPC argument. All 
this retry count stuff is not very intuitive...

Bottom line on this one, this change is definitely in line with the state of 
trunk...
[~eclark] wdyt?
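
To illustrate what a strict timeout could look like compared with a retry count, here is a minimal Python sketch. It is not the HBASE-7659 design and not the HBase client: `call_with_deadline`, its parameters, and the backoff tuple are hypothetical names chosen for the example. The idea is that the caller states one time budget and the loop stops as soon as the next sleep would overshoot it.

```python
import time


def call_with_deadline(operation, timeout_s,
                       backoff_s=(1, 1, 1, 2, 2, 4, 4, 8, 16, 32),
                       clock=time.monotonic, sleep=time.sleep):
    """Retry `operation` until it succeeds or the time budget is exhausted.

    Hypothetical helper: the budget is expressed in seconds, so the caller
    does not have to translate a retry count into elapsed time.
    """
    deadline = clock() + timeout_s
    attempt = 0
    while True:
        try:
            return operation()
        except Exception:
            # Reuse the last backoff value once the table is exhausted.
            pause = backoff_s[min(attempt, len(backoff_s) - 1)]
            if clock() + pause > deadline:
                raise  # the next sleep would exceed the budget: fail now
            sleep(pause)
            attempt += 1
```

`clock` and `sleep` are injectable only so that the loop can be exercised without real waiting.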



[jira] [Commented] (HBASE-8776) port HBASE-8723 to 0.94

2013-06-21 Thread Lars Hofhansl (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13690851#comment-13690851
 ] 

Lars Hofhansl commented on HBASE-8776:
--

Devaraj beat me to the request for more comments :)




[jira] [Commented] (HBASE-8776) port HBASE-8723 to 0.94

2013-06-21 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13691044#comment-13691044
 ] 

stack commented on HBASE-8776:
--

40 minutes is crazy.  Can we change that?  10 minutes is bad too.  We are just 
going to piss people off w/ timeouts like this.

I'd say the default should ride over an RS crash (interesting that in hbase-it, 
the kills are clean).

40s was the timeout on zk, but now it is whatever maxSessionTimeout is (180s in 
0.94 and 90s in 0.95).

Where is our recovery from a crashed regionserver in one minute! I thought I saw 
a talk on that recently (smile).  So we have to time out zk (90s in trunk), then 
do recovery (a minute or so, hopefully), and then assign (tens of seconds), for 
a total of 5 minutes max (the default should be sluggish, in favor of things 
continuing to chug along, I'd say).
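
As a quick sanity check of that budget, the rough figures from the comment can be added up in Python. The recovery time and the 5-minute ceiling are the ones quoted above; the 30s assignment time is an assumed stand-in for "tens of seconds", and the names are illustrative only.

```python
# Rough figures quoted in the comment above; ASSIGN_S is an assumed stand-in.
RECOVERY_S = 60      # "a minute or so hopefully"
ASSIGN_S = 30        # assumption: one reading of "tens of seconds"
BUDGET_S = 5 * 60    # "a total of 5 minutes max"


def outage_s(zk_session_timeout_s):
    """Detection (ZK session timeout) + recovery + region assignment."""
    return zk_session_timeout_s + RECOVERY_S + ASSIGN_S


# ZK session timeouts mentioned in this thread: 40s, 90s (trunk), 180s (0.94).
for zk_timeout in (40, 90, 180):
    total = outage_s(zk_timeout)
    print(zk_timeout, total, total <= BUDGET_S)
```

With these assumed numbers, even the 180s session timeout leaves the whole sequence under the 5-minute ceiling, which is why a client-side retry window of about that length would be enough to ride over a single RS crash.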





[jira] [Commented] (HBASE-8776) port HBASE-8723 to 0.94

2013-06-21 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13691046#comment-13691046
 ] 

stack commented on HBASE-8776:
--

[~nkeywal] Comments?



[jira] [Commented] (HBASE-8776) port HBASE-8723 to 0.94

2013-06-21 Thread Nicolas Liochon (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13691053#comment-13691053
 ] 

Nicolas Liochon commented on HBASE-8776:


This two-line patch touches many different topics :-).
1) On ZooKeeper: any default between 30s and 90s is fine imho. Less can become 
an issue for some environments. More is a little bit ridiculous.
2) By default we should be able to ride over an RS crash: I really think it's 
mandatory. I'm currently running tests on AWS. So far my stats say that a given 
machine will disappear for 5 minutes once per week. We must handle that well.
2.1) We can have rack-wide failures as well. Rack hardware will need around 5 
minutes to recover. We must support that too imho (at least in our timeouts; we 
would have a hard time recovering from such a failure today).
3) Cluster-wide fail fast vs. retry: I personally think that the HBase contract 
is 'any operation will eventually succeed', so I'm ok with more retries and 
longer timeouts, which allow us to manage multiple failures in a row. So 40 
minutes is fine. 
4) The final backoff time of 128 seconds seems huge to me, but I'm not against 
it.

So I'm totally +1 for the HBASE-8723 patch.
Then for 0.94... I think we could just do it, change all the settings like this 
one (i.e. zk timeout to 90s as in trunk), and write a nice release note. If we 
do that, plus some communication when we release the next 0.94, we will be fine 
imho.

= +1 if we do a release note and change the zk setting.



[jira] [Commented] (HBASE-8776) port HBASE-8723 to 0.94

2013-06-20 Thread Sergey Shelukhin (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13689781#comment-13689781
 ] 

Sergey Shelukhin commented on HBASE-8776:
-

[~lhofhansl] are you ok with this change to client retries?



[jira] [Commented] (HBASE-8776) port HBASE-8723 to 0.94

2013-06-20 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13689783#comment-13689783
 ] 

Hadoop QA commented on HBASE-8776:
--

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12588947/HBASE-8776-v0.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:red}-1 patch{color}.  The patch command could not apply the patch.

Console output: 
https://builds.apache.org/job/PreCommit-HBASE-Build/6086//console

This message is automatically generated.



[jira] [Commented] (HBASE-8776) port HBASE-8723 to 0.94

2013-06-20 Thread Lars Hofhansl (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13689801#comment-13689801
 ] 

Lars Hofhansl commented on HBASE-8776:
--

Don't the current defaults already add up to 47? 1+1+1+2+2+4+4+8+8+16 = 47
10 seems good enough, unless I am missing something. Will check the original 
jira tomorrow.




[jira] [Commented] (HBASE-8776) port HBASE-8723 to 0.94

2013-06-20 Thread Sergey Shelukhin (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13689861#comment-13689861
 ] 

Sergey Shelukhin commented on HBASE-8776:
-

There's only one 8, and there is a 32.
The problem is that we determine the server before the delay, so recovery has to 
happen before the delay for the last retry (I filed a JIRA for that). 
1+1+1+2+2+4+4+8+16 = 39. Recovery after a zk timeout is also not instant.
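
To reconcile the two sums, here is a minimal Python sketch. It assumes the 0.94 multipliers are the ones implied by the exchange above (a single 8 and a final 32) and a 1-second base pause; the names are illustrative and are not HBase identifiers.

```python
# Multipliers as described in this thread; names are illustrative only.
ASSUMED_0_94_BACKOFF = [1, 1, 1, 2, 2, 4, 4, 8, 16, 32]
EARLIER_GUESS = [1, 1, 1, 2, 2, 4, 4, 8, 8, 16]  # the sequence from the previous comment
PAUSE_S = 1  # assumption: 1-second base pause


def total_wait_s(multipliers, pause_s=PAUSE_S):
    """Sum of all sleeps if every retry fails."""
    return pause_s * sum(multipliers)


def last_selection_s(multipliers, pause_s=PAUSE_S):
    """Time of the final server lookup, which precedes the last sleep."""
    return pause_s * sum(multipliers[:-1])


print(total_wait_s(EARLIER_GUESS))               # 47
print(total_wait_s(ASSUMED_0_94_BACKOFF))        # 71
print(last_selection_s(ASSUMED_0_94_BACKOFF))    # 39
```

The total is longer than the earlier estimate of 47s, but the number that matters for riding over a crash is the 39s mark, because no server is selected after it.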
