[jira] [Commented] (CASSANDRA-14297) Startup checker should wait for count rather than percentage

2018-11-12 Thread Joseph Lynch (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-14297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16684231#comment-16684231
 ] 

Joseph Lynch commented on CASSANDRA-14297:
--

Sweet, thanks! Yeah, I was holding off on adding the NEWS/CHANGES entries until 
you marked it ready to commit. In the future I'll include them with the dtest run.

Thanks for all the great feedback; I think this feature is now much more 
valuable to users.

> Startup checker should wait for count rather than percentage
> 
>
> Key: CASSANDRA-14297
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14297
> Project: Cassandra
>  Issue Type: Bug
>  Components: Lifecycle
>Reporter: Joseph Lynch
>Assignee: Joseph Lynch
>Priority: Minor
>  Labels: 4.0-feature-freeze-review-requested, PatchAvailable, 
> pull-request-available
>  Time Spent: 6h 40m
>  Remaining Estimate: 0h
>
> As I commented in CASSANDRA-13993, the current wait-for functionality is a 
> great step in the right direction, but I don't think that the current setting 
> (70% of nodes in the cluster) is the right configuration option. First, waiting 
> for 70% of the cluster does not protect against errors: you could still very 
> easily see {{UnavailableException}} or {{ReadTimeoutException}}, because with 
> even two nodes down in different racks these exceptions are possible (and with 
> the default {{num_tokens}} setting of 256 they are basically guaranteed). 
> Second, this option is not easy for operators to set; the only setting I could 
> think of that would "just work" is 100%.
> I proposed in that ticket that instead of having {{block_for_peers_percentage}} 
> default to 70%, we have {{block_for_peers}} as a count of nodes that are 
> allowed to be down before the starting node makes itself available as a 
> coordinator. Of course, we would still have the timeout to limit startup time 
> and deal with really extreme situations (whole datacenters down, etc.).
> I started working on a patch for this change [on 
> github|https://github.com/jasobrown/cassandra/compare/13993...jolynch:13993], 
> and am happy to finish it up with unit tests and such if someone can 
> review/commit it (maybe [~aweisberg]?).
> I think the short version of my proposal is that we replace:
> {noformat}
> block_for_peers_percentage: <percentage of peers that must be up>
> {noformat}
> with either
> {noformat}
> block_for_peers: <count of peers allowed to be down>
> {noformat}
> or, if we want to do even better (imo) and enable advanced operators to finely 
> tune this behavior (while still having good defaults that work for almost 
> everyone):
> {noformat}
> block_for_peers_local_dc: <count of local-DC peers allowed to be down>
> block_for_peers_each_dc: <count of peers allowed to be down in each DC>
> block_for_peers_all_dcs: <count of peers allowed to be down across all DCs>
> {noformat}
> For example, if an operator knows that they must be available at 
> {{LOCAL_QUORUM}} they would set {{block_for_peers_local_dc=1}}; if they use 
> {{EACH_QUORUM}} they would set {{block_for_peers_each_dc=1}}; and if they use 
> {{QUORUM}} (RF=3, dcs=2) they would set {{block_for_peers_all_dcs=2}}. 
> Naturally, everything would still have a timeout to prevent startup from 
> taking too long.






[jira] [Commented] (CASSANDRA-14297) Startup checker should wait for count rather than percentage

2018-11-12 Thread Ariel Weisberg (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-14297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16684136#comment-16684136
 ] 

Ariel Weisberg commented on CASSANDRA-14297:


+1.

There is one unused import in 
[StartupClusterConnectivityCheckerTest.java|https://github.com/apache/cassandra/pull/212/files#diff-c74adeeae072ee4af35c12a157cd7d61L26]; 
I'll fix it on commit.




[jira] [Commented] (CASSANDRA-14297) Startup checker should wait for count rather than percentage

2018-11-07 Thread Joseph Lynch (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-14297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16678920#comment-16678920
 ] 

Joseph Lynch commented on CASSANDRA-14297:
--

[~aweisberg] Awesome, I just rebased and merged in your suggestions:

||trunk||
|[eb05d461|https://github.com/apache/cassandra/pull/212/commits/eb05d461224d753d5253ade6634d69a1ae91c437]|
|[!https://circleci.com/gh/jolynch/cassandra/tree/CASSANDRA-14297.png?circle-token=1102a59698d04899ec971dd36e925928f7b521f5!|https://circleci.com/gh/jolynch/cassandra/tree/CASSANDRA-14297]|

Dtests are running.
