[jira] [Commented] (CASSANDRA-14317) Auditing Plug-in for Cassandra

2018-10-25 Thread Laxmikant Upadhyay (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-14317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16664656#comment-16664656
 ] 

Laxmikant Upadhyay commented on CASSANDRA-14317:


FYI: Apache Cassandra users who can't wait for a future release with the audit 
feature, and who don't want the overhead of back-porting the 4.x audit feature 
to existing Cassandra releases, can use the open source Cassandra audit 
plugin [ecaudit|https://github.com/Ericsson/ecaudit] to get auditing 
functionality now and switch to the audit feature available in future 
Cassandra releases later.

> Auditing Plug-in for Cassandra
> --
>
> Key: CASSANDRA-14317
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14317
> Project: Cassandra
>  Issue Type: New Feature
>  Components: Tools
> Environment: Cassandra 3.11.x
>Reporter: Anuj Wadehra
>Priority: Major
>  Labels: security
>   Original Estimate: 504h
>  Remaining Estimate: 504h
>
> Cassandra lacks a database auditing feature. Until the new feature is 
> implemented as part of CASSANDRA-12151, a database auditing plug-in can be 
> built. The plug-in can be implemented and plugged into Cassandra by 
> customizing components such as the Query Handler, Authenticator, and Role 
> Manager. The auditing plug-in shall log all CQL queries and user logins. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-14297) Optional startup delay for peers should wait for count rather than percentage

2018-10-25 Thread Joseph Lynch (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-14297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16664534#comment-16664534
 ] 

Joseph Lynch commented on CASSANDRA-14297:
--

Ok, I've uploaded a patch that does what was asked in IRC, I believe. Let me 
know if it looks good and I can run dtests and such against it.

> Optional startup delay for peers should wait for count rather than percentage
> -
>
> Key: CASSANDRA-14297
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14297
> Project: Cassandra
>  Issue Type: Bug
>  Components: Lifecycle
>Reporter: Joseph Lynch
>Assignee: Joseph Lynch
>Priority: Minor
>  Labels: 4.0-feature-freeze-review-requested, PatchAvailable, 
> pull-request-available
>  Time Spent: 3h
>  Remaining Estimate: 0h
>
> As I commented in CASSANDRA-13993, the current wait-for functionality is a 
> great step in the right direction, but I don't think that the current setting 
> (70% of nodes in the cluster) is the right configuration option. First, I 
> think this because 70% will not protect against errors: if you wait for 70% 
> of the cluster you could still very easily get {{UnavailableException}} or 
> {{ReadTimeoutException}} exceptions. This is because if you have even two 
> nodes down in different racks in a Cassandra cluster these exceptions are 
> possible (or with the default {{num_tokens}} setting of 256 they are basically 
> guaranteed). Second, I think this option is not easy for operators to set; the 
> only setting I could think of that would "just work" is 100%.
> I proposed in that ticket that instead of having `block_for_peers_percentage` 
> default to 70%, we instead have `block_for_peers` as a count of nodes that 
> are allowed to be down before the starting node makes itself available as a 
> coordinator. Of course, we would still have the timeout to limit startup time 
> and deal with really extreme situations (whole datacenters down, etc.).
> I started working on a patch for this change [on 
> github|https://github.com/jasobrown/cassandra/compare/13993...jolynch:13993], 
> and am happy to finish it up with unit tests and such if someone can 
> review/commit it (maybe [~aweisberg]?).
> I think the short version of my proposal is we replace:
> {noformat}
> block_for_peers_percentage: 
> {noformat}
> with either
> {noformat}
> block_for_peers: 
> {noformat}
> or, if we want to do even better imo and enable advanced operators to finely 
> tune this behavior (while still having good defaults that work for almost 
> everyone):
> {noformat}
> block_for_peers_local_dc:  
> block_for_peers_each_dc: 
> block_for_peers_all_dcs: 
> {noformat}
> For example, if an operator knows that they must be available at 
> {{LOCAL_QUORUM}} they would set {{block_for_peers_local_dc=1}}; if they use 
> {{EACH_QUORUM}} they would set {{block_for_peers_local_dc=1}}; if they use 
> {{QUORUM}} (RF=3, dcs=2) they would set {{block_for_peers_all_dcs=2}}. 
> Naturally, everything would of course have a timeout to prevent startup from 
> taking too long.
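
To make the multi-option variant concrete, here is how the example above could 
look in cassandra.yaml. The option names come straight from the proposal; the 
values are illustrative only, and none of these options exist yet:
{noformat}
# Illustrative values only; these options are proposed in this ticket.
block_for_peers_local_dc: 1   # nodes allowed down in the local DC (e.g. to stay available at LOCAL_QUORUM)
block_for_peers_each_dc: 1    # nodes allowed down in each DC
block_for_peers_all_dcs: 2    # nodes allowed down across all DCs (e.g. QUORUM with RF=3, dcs=2)
{noformat}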



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Comment Edited] (CASSANDRA-14297) Optional startup delay for peers should wait for count rather than percentage

2018-10-25 Thread Joseph Lynch (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-14297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16664534#comment-16664534
 ] 

Joseph Lynch edited comment on CASSANDRA-14297 at 10/26/18 1:59 AM:


Ok, I've uploaded a patch to my branch that does what was asked in IRC, I 
believe. Let me know if it looks good and I can run dtests and such against it.


was (Author: jolynch):
Ok, I've uploaded a patch that does what was asked in IRC, I believe. Let me 
know if it looks good and I can run dtests and such against it.

> Optional startup delay for peers should wait for count rather than percentage
> -
>
> Key: CASSANDRA-14297
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14297
> Project: Cassandra
>  Issue Type: Bug
>  Components: Lifecycle
>Reporter: Joseph Lynch
>Assignee: Joseph Lynch
>Priority: Minor
>  Labels: 4.0-feature-freeze-review-requested, PatchAvailable, 
> pull-request-available
>  Time Spent: 3h
>  Remaining Estimate: 0h
>
> As I commented in CASSANDRA-13993, the current wait-for functionality is a 
> great step in the right direction, but I don't think that the current setting 
> (70% of nodes in the cluster) is the right configuration option. First, I 
> think this because 70% will not protect against errors: if you wait for 70% 
> of the cluster you could still very easily get {{UnavailableException}} or 
> {{ReadTimeoutException}} exceptions. This is because if you have even two 
> nodes down in different racks in a Cassandra cluster these exceptions are 
> possible (or with the default {{num_tokens}} setting of 256 they are basically 
> guaranteed). Second, I think this option is not easy for operators to set; the 
> only setting I could think of that would "just work" is 100%.
> I proposed in that ticket that instead of having `block_for_peers_percentage` 
> default to 70%, we instead have `block_for_peers` as a count of nodes that 
> are allowed to be down before the starting node makes itself available as a 
> coordinator. Of course, we would still have the timeout to limit startup time 
> and deal with really extreme situations (whole datacenters down, etc.).
> I started working on a patch for this change [on 
> github|https://github.com/jasobrown/cassandra/compare/13993...jolynch:13993], 
> and am happy to finish it up with unit tests and such if someone can 
> review/commit it (maybe [~aweisberg]?).
> I think the short version of my proposal is we replace:
> {noformat}
> block_for_peers_percentage: 
> {noformat}
> with either
> {noformat}
> block_for_peers: 
> {noformat}
> or, if we want to do even better imo and enable advanced operators to finely 
> tune this behavior (while still having good defaults that work for almost 
> everyone):
> {noformat}
> block_for_peers_local_dc:  
> block_for_peers_each_dc: 
> block_for_peers_all_dcs: 
> {noformat}
> For example, if an operator knows that they must be available at 
> {{LOCAL_QUORUM}} they would set {{block_for_peers_local_dc=1}}; if they use 
> {{EACH_QUORUM}} they would set {{block_for_peers_local_dc=1}}; if they use 
> {{QUORUM}} (RF=3, dcs=2) they would set {{block_for_peers_all_dcs=2}}. 
> Naturally, everything would of course have a timeout to prevent startup from 
> taking too long.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-14798) Improve wording around partitioner selection

2018-10-25 Thread Dinesh Joshi (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-14798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16664309#comment-16664309
 ] 

Dinesh Joshi commented on CASSANDRA-14798:
--

LGTM

> Improve wording around partitioner selection
> 
>
> Key: CASSANDRA-14798
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14798
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Documentation and Website
>Reporter: Aaron Ploetz
>Assignee: Aaron Ploetz
>Priority: Trivial
> Fix For: 4.0
>
> Attachments: 14798-trunk.patch
>
>
> Given some recent community interactions on Stack Overflow, Nate McCall asked 
> me to provide some stronger wording on partitioner selection, specifically to 
> further discourage people from using the other partitioners (namely, the 
> ByteOrderedPartitioner).
> Right now, this is the language that I'm leaning toward:
> {{# The partitioner is responsible for distributing groups of rows (by}}
> {{# partition key) across nodes in the cluster. The partitioner can NOT be}}
> {{# changed without reloading all data.  If you are upgrading, you should set 
> this}}
> {{# to the same partitioner that you are currently using.}}
> {{#}}
> {{# The default partitioner is the Murmur3Partitioner. Older partitioners}}
> {{# such as the RandomPartitioner, ByteOrderedPartitioner, and}}
> {{# OrderPreservingPartitioner have been included for backward compatibility 
> only.}}
> {{# For new clusters, you should NOT change this value.}}
> {{#}}
> {{partitioner: org.apache.cassandra.dht.Murmur3Partitioner  }}
> I'm open to suggested improvements.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-14849) some empty/invalid bounds aren't caught by SelectStatement

2018-10-25 Thread Blake Eggleston (JIRA)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-14849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Blake Eggleston updated CASSANDRA-14849:

Reviewer: Aleksey Yeschenko

> some empty/invalid bounds aren't caught by SelectStatement
> --
>
> Key: CASSANDRA-14849
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14849
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Blake Eggleston
>Assignee: Blake Eggleston
>Priority: Major
> Fix For: 4.0
>
>
> Nonsensical clustering bounds like "c >= 100 AND c < 100" aren't converted to 
> Slices.NONE like they should be. Although this seems to be completely benign, 
> it is technically incorrect and complicates some testing since it can cause 
> memtables and sstables to return different results for the same data for 
> these bounds in some cases.
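
For reference, a query hitting this kind of empty bound would look something 
like the following (table and column names are hypothetical):
{noformat}
-- hypothetical table: CREATE TABLE ks.t (pk int, c int, v int, PRIMARY KEY (pk, c));
SELECT * FROM ks.t WHERE pk = 0 AND c >= 100 AND c < 100;
{noformat}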



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Comment Edited] (CASSANDRA-14297) Optional startup delay for peers should wait for count rather than percentage

2018-10-25 Thread Joseph Lynch (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-14297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16663035#comment-16663035
 ] 

Joseph Lynch edited comment on CASSANDRA-14297 at 10/25/18 8:59 PM:


Alright, per the discussion on 
[IRC|https://wilderness.apache.org/channels/?f=cassandra-dev/2018-10-17#1539793033]
 with Ariel and Jason, we've decided that instead of counts we should always 
wait for all but a single local DC node and replace the percentage option with:
{noformat}
block_for_remote_dcs: 
{noformat}
The startup connectivity checker will wait for all but a single node in the 
local datacenter, and if you also want startup to block until every datacenter 
has no more than a single node down, you can set this to true.

The timeout will be the fallback for when multiple nodes are down in a local DC.
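
For clarity, this is the shape the replacement would take in cassandra.yaml (a 
sketch based on the description above; the name and default are still subject 
to review on this ticket):
{noformat}
# Always wait for all but one node in the local DC; if true, also wait for
# all but one node in every remote DC before serving as a coordinator.
block_for_remote_dcs: true
{noformat}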


was (Author: jolynch):
Alright, per the discussion on 
[IRC|https://wilderness.apache.org/channels/?f=cassandra-dev/2018-10-17#1539793033]
 with Ariel and Jason, we've decided that instead of counts we should always 
wait for all but a single node and have the additional option of:

{noformat}
wait_for_remote_dcs: 
{noformat}

The startup connectivity checker will wait for all but a single node in the 
local datacenter, and if you also want startup to block until every datacenter 
has no more than a single node down, you can set this to true.

> Optional startup delay for peers should wait for count rather than percentage
> -
>
> Key: CASSANDRA-14297
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14297
> Project: Cassandra
>  Issue Type: Bug
>  Components: Lifecycle
>Reporter: Joseph Lynch
>Assignee: Joseph Lynch
>Priority: Minor
>  Labels: 4.0-feature-freeze-review-requested, PatchAvailable, 
> pull-request-available
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> As I commented in CASSANDRA-13993, the current wait-for functionality is a 
> great step in the right direction, but I don't think that the current setting 
> (70% of nodes in the cluster) is the right configuration option. First, I 
> think this because 70% will not protect against errors: if you wait for 70% 
> of the cluster you could still very easily get {{UnavailableException}} or 
> {{ReadTimeoutException}} exceptions. This is because if you have even two 
> nodes down in different racks in a Cassandra cluster these exceptions are 
> possible (or with the default {{num_tokens}} setting of 256 they are basically 
> guaranteed). Second, I think this option is not easy for operators to set; the 
> only setting I could think of that would "just work" is 100%.
> I proposed in that ticket that instead of having `block_for_peers_percentage` 
> default to 70%, we instead have `block_for_peers` as a count of nodes that 
> are allowed to be down before the starting node makes itself available as a 
> coordinator. Of course, we would still have the timeout to limit startup time 
> and deal with really extreme situations (whole datacenters down, etc.).
> I started working on a patch for this change [on 
> github|https://github.com/jasobrown/cassandra/compare/13993...jolynch:13993], 
> and am happy to finish it up with unit tests and such if someone can 
> review/commit it (maybe [~aweisberg]?).
> I think the short version of my proposal is we replace:
> {noformat}
> block_for_peers_percentage: 
> {noformat}
> with either
> {noformat}
> block_for_peers: 
> {noformat}
> or, if we want to do even better imo and enable advanced operators to finely 
> tune this behavior (while still having good defaults that work for almost 
> everyone):
> {noformat}
> block_for_peers_local_dc:  
> block_for_peers_each_dc: 
> block_for_peers_all_dcs: 
> {noformat}
> For example, if an operator knows that they must be available at 
> {{LOCAL_QUORUM}} they would set {{block_for_peers_local_dc=1}}; if they use 
> {{EACH_QUORUM}} they would set {{block_for_peers_local_dc=1}}; if they use 
> {{QUORUM}} (RF=3, dcs=2) they would set {{block_for_peers_all_dcs=2}}. 
> Naturally, everything would of course have a timeout to prevent startup from 
> taking too long.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-14798) Improve wording around partitioner selection

2018-10-25 Thread Aaron Ploetz (JIRA)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-14798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aaron Ploetz updated CASSANDRA-14798:
-
Status: Patch Available  (was: In Progress)

Patch attached to main discussion.

> Improve wording around partitioner selection
> 
>
> Key: CASSANDRA-14798
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14798
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Documentation and Website
>Reporter: Aaron Ploetz
>Assignee: Aaron Ploetz
>Priority: Trivial
> Fix For: 4.0
>
> Attachments: 14798-trunk.patch
>
>
> Given some recent community interactions on Stack Overflow, Nate McCall asked 
> me to provide some stronger wording on partitioner selection, specifically to 
> further discourage people from using the other partitioners (namely, the 
> ByteOrderedPartitioner).
> Right now, this is the language that I'm leaning toward:
> {{# The partitioner is responsible for distributing groups of rows (by}}
> {{# partition key) across nodes in the cluster. The partitioner can NOT be}}
> {{# changed without reloading all data.  If you are upgrading, you should set 
> this}}
> {{# to the same partitioner that you are currently using.}}
> {{#}}
> {{# The default partitioner is the Murmur3Partitioner. Older partitioners}}
> {{# such as the RandomPartitioner, ByteOrderedPartitioner, and}}
> {{# OrderPreservingPartitioner have been included for backward compatibility 
> only.}}
> {{# For new clusters, you should NOT change this value.}}
> {{#}}
> {{partitioner: org.apache.cassandra.dht.Murmur3Partitioner  }}
> I'm open to suggested improvements.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-14849) some empty/invalid bounds aren't caught by SelectStatement

2018-10-25 Thread Blake Eggleston (JIRA)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-14849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Blake Eggleston updated CASSANDRA-14849:

Status: Patch Available  (was: Open)

[trunk|https://github.com/bdeggleston/cassandra/tree/14849-trunk]
[circle|https://circleci.com/workflow-run/8f1492ae-d04e-4880-a93b-b9ff891d855d]

> some empty/invalid bounds aren't caught by SelectStatement
> --
>
> Key: CASSANDRA-14849
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14849
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Blake Eggleston
>Assignee: Blake Eggleston
>Priority: Major
> Fix For: 4.0
>
>
> Nonsensical clustering bounds like "c >= 100 AND c < 100" aren't converted to 
> Slices.NONE like they should be. Although this seems to be completely benign, 
> it is technically incorrect and complicates some testing since it can cause 
> memtables and sstables to return different results for the same data for 
> these bounds in some cases.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Created] (CASSANDRA-14849) some empty/invalid bounds aren't caught by SelectStatement

2018-10-25 Thread Blake Eggleston (JIRA)
Blake Eggleston created CASSANDRA-14849:
---

 Summary: some empty/invalid bounds aren't caught by SelectStatement
 Key: CASSANDRA-14849
 URL: https://issues.apache.org/jira/browse/CASSANDRA-14849
 Project: Cassandra
  Issue Type: Bug
Reporter: Blake Eggleston
Assignee: Blake Eggleston
 Fix For: 4.0


Nonsensical clustering bounds like "c >= 100 AND c < 100" aren't converted to 
Slices.NONE like they should be. Although this seems to be completely benign, 
it is technically incorrect and complicates some testing since it can cause 
memtables and sstables to return different results for the same data for these 
bounds in some cases.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Comment Edited] (CASSANDRA-11748) Schema version mismatch may leads to Casandra OOM at bootstrap during a rolling upgrade process

2018-10-25 Thread Matt Byrd (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-11748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16663994#comment-16663994
 ] 

Matt Byrd edited comment on CASSANDRA-11748 at 10/25/18 4:46 PM:
-

I think it would be great to try and fix these related issues in the 4.0 
timeframe. I'd be keen on trying the approach outlined above; I'll have a go at 
sketching it out in a PR to see what folks think.
To reiterate what I believe to be the fundamental problem: the way we tee up a 
schema pull whenever a relevant gossip event shows a node with a different 
schema version results in far too many superfluous pulls for the same schema 
contents. When there are enough endpoints and a sufficiently large schema, 
doing so can lead to the instance OOMing.

The proposed solution above solves this by decoupling the schema pulls from the 
incoming gossip messages: gossip is instead used to update the node's view of 
which other nodes have which schema version, and a thread periodically checks 
for and attempts to resolve any inconsistencies.
There are some details to flesh out, and I think an important part will be to 
ensure we have tests that demonstrate the issues and demonstrate we've fixed 
them. I'm hoping that we can perhaps leverage 
[CASSANDRA-14821|https://issues.apache.org/jira/browse/CASSANDRA-14821] to do 
so, though we may want to augment this with dtests or something else.
Let me know if you have any thoughts on the above approach; perhaps a sketch in 
code will better illuminate it and help flush out potential problems. 
[~iamaleksey] / [~spo...@gmail.com] / [~michael.fong] / [~jjirsa] 
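
To make the shape of that more concrete, here is a rough, self-contained sketch 
of the decoupled approach; every class and method name below is hypothetical 
and stands in for the real gossip/MigrationManager plumbing:
{code:java}
// Sketch only: decouple schema pulls from gossip events and reconcile
// mismatches on a schedule. All names below are stand-ins, not Cassandra APIs.
import java.net.InetAddress;
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

class SchemaMismatchResolver
{
    // Gossip updates this map; no pull is triggered from the gossip path itself.
    private final Map<InetAddress, UUID> endpointVersions = new ConcurrentHashMap<>();
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    private final Supplier<UUID> localVersion;   // supplies this node's current schema version
    private final SchemaPuller puller;           // issues a single schema pull to one endpoint

    interface SchemaPuller { void pullFrom(InetAddress endpoint); }

    SchemaMismatchResolver(Supplier<UUID> localVersion, SchemaPuller puller)
    {
        this.localVersion = localVersion;
        this.puller = puller;
    }

    // Called from the gossip handler instead of submitting a migration task per event.
    void onGossipSchemaVersion(InetAddress endpoint, UUID version)
    {
        endpointVersions.put(endpoint, version);
    }

    void start()
    {
        // Periodically pull from at most one endpoint per distinct mismatched version,
        // so many endpoints sharing one newer schema cause one pull, not hundreds.
        scheduler.scheduleWithFixedDelay(() -> {
            UUID local = localVersion.get();
            endpointVersions.entrySet().stream()
                .filter(e -> !e.getValue().equals(local))
                .collect(java.util.stream.Collectors.toMap(Map.Entry::getValue, Map.Entry::getKey, (a, b) -> a))
                .values()
                .forEach(puller::pullFrom);
        }, 10, 10, TimeUnit.SECONDS);
    }
}
{code}
A real change would also need to handle endpoints leaving, failed pulls, and 
racing schema changes, which is exactly the kind of detail a PR sketch should 
flush out.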


was (Author: mbyrd):
I think it would be great to try and fix these related issues in the 4.0 
timeframe. I'd be keen on trying the approach outlined above; I'll have a go at 
sketching it out in a PR to see what folks think.
To reiterate what I believe to be the fundamental problem: the way we tee up a 
schema pull whenever a relevant gossip event shows a node with a different 
schema version results in far too many superfluous pulls for the same schema 
contents. When there are enough endpoints and a sufficiently large schema, 
doing so can lead to the instance OOMing.

The proposed solution above solves this by decoupling the schema pulls from the 
incoming gossip messages: gossip is instead used to update the node's view of 
which other nodes have which schema version, and a thread periodically checks 
for and attempts to resolve any inconsistencies.
There are some details to flesh out, and I think an important part will be to 
ensure we have tests that demonstrate the issues and demonstrate we've fixed 
them. I'm hoping that we can perhaps leverage 
[CASSANDRA-14821|https://issues.apache.org/jira/browse/CASSANDRA-14821] to do 
so, though we may want to augment this with dtests or something else.
Let me know if you have any thoughts on the above approach; perhaps a sketch in 
code will better illuminate it and help flush out potential problems. 
[~iamaleksey][~spo...@gmail.com][~michael.fong][~jjirsa] 

> Schema version mismatch may leads to Casandra OOM at bootstrap during a 
> rolling upgrade process
> ---
>
> Key: CASSANDRA-11748
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11748
> Project: Cassandra
>  Issue Type: Bug
> Environment: Rolling upgrade process from 1.2.19 to 2.0.17. 
> CentOS 6.6
> Occurred in different C* node of different scale of deployment (2G ~ 5G)
>Reporter: Michael Fong
>Assignee: Matt Byrd
>Priority: Critical
> Fix For: 3.0.x, 3.11.x, 4.x
>
>
> We have observed multiple times that a multi-node C* (v2.0.17) cluster ran 
> into OOM during bootstrap in a rolling upgrade from 1.2.19 to 2.0.17. 
> Here is a simple outline of our rolling upgrade process:
> 1. Update the schema on a node, and wait until all nodes are in schema 
> version agreement - via nodetool describecluster
> 2. Restart a Cassandra node
> 3. After the restart, there is a chance that the restarted node has a 
> different schema version.
> 4. All nodes in the cluster start to rapidly exchange schema information, and 
> any node could run into OOM. 
> The following is the system.log output that occurred in one of our 2-node 
> cluster test beds
> --
> Before rebooting node 2:
> Node 1: DEBUG [MigrationStage:1] 2016-04-19 11:09:42,326 
> MigrationManager.java (line 328) Gossiping my schema version 
> 4cb463f8-5376-3baf-8e88-a5cc6a94f58f
> Node 2: DEBUG [MigrationStage:1] 2016-04-19 11:09:42,122 
> MigrationManager.java (line 328) Gossiping my schema version 
> 4cb463f8-5376-3baf-8e88-a5cc6a94f58f
> After rebooting node 2, 
> Node 2: DEBUG [main] 2016-04-19 11:18:18,016 MigrationManager.java 

[jira] [Commented] (CASSANDRA-11748) Schema version mismatch may leads to Casandra OOM at bootstrap during a rolling upgrade process

2018-10-25 Thread Matt Byrd (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-11748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16663994#comment-16663994
 ] 

Matt Byrd commented on CASSANDRA-11748:
---

I think it would be great to try and fix these related issues in the 4.0 
timeframe. I'd be keen on trying the approach outlined above; I'll have a go at 
sketching it out in a PR to see what folks think.
To reiterate what I believe to be the fundamental problem: the way we tee up a 
schema pull whenever a relevant gossip event shows a node with a different 
schema version results in far too many superfluous pulls for the same schema 
contents. When there are enough endpoints and a sufficiently large schema, 
doing so can lead to the instance OOMing.

The proposed solution above solves this by decoupling the schema pulls from the 
incoming gossip messages: gossip is instead used to update the node's view of 
which other nodes have which schema version, and a thread periodically checks 
for and attempts to resolve any inconsistencies.
There are some details to flesh out, and I think an important part will be to 
ensure we have tests that demonstrate the issues and demonstrate we've fixed 
them. I'm hoping that we can perhaps leverage 
[CASSANDRA-14821|https://issues.apache.org/jira/browse/CASSANDRA-14821] to do 
so, though we may want to augment this with dtests or something else.
Let me know if you have any thoughts on the above approach; perhaps a sketch in 
code will better illuminate it and help flush out potential problems. 
[~iamaleksey][~spo...@gmail.com][~michael.fong][~jjirsa] 

> Schema version mismatch may leads to Casandra OOM at bootstrap during a 
> rolling upgrade process
> ---
>
> Key: CASSANDRA-11748
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11748
> Project: Cassandra
>  Issue Type: Bug
> Environment: Rolling upgrade process from 1.2.19 to 2.0.17. 
> CentOS 6.6
> Occurred in different C* node of different scale of deployment (2G ~ 5G)
>Reporter: Michael Fong
>Assignee: Matt Byrd
>Priority: Critical
> Fix For: 3.0.x, 3.11.x, 4.x
>
>
> We have observed multiple times that a multi-node C* (v2.0.17) cluster ran 
> into OOM during bootstrap in a rolling upgrade from 1.2.19 to 2.0.17. 
> Here is a simple outline of our rolling upgrade process:
> 1. Update the schema on a node, and wait until all nodes are in schema 
> version agreement - via nodetool describecluster
> 2. Restart a Cassandra node
> 3. After the restart, there is a chance that the restarted node has a 
> different schema version.
> 4. All nodes in the cluster start to rapidly exchange schema information, and 
> any node could run into OOM. 
> The following is the system.log output that occurred in one of our 2-node 
> cluster test beds
> --
> Before rebooting node 2:
> Node 1: DEBUG [MigrationStage:1] 2016-04-19 11:09:42,326 
> MigrationManager.java (line 328) Gossiping my schema version 
> 4cb463f8-5376-3baf-8e88-a5cc6a94f58f
> Node 2: DEBUG [MigrationStage:1] 2016-04-19 11:09:42,122 
> MigrationManager.java (line 328) Gossiping my schema version 
> 4cb463f8-5376-3baf-8e88-a5cc6a94f58f
> After rebooting node 2, 
> Node 2: DEBUG [main] 2016-04-19 11:18:18,016 MigrationManager.java (line 328) 
> Gossiping my schema version f5270873-ba1f-39c7-ab2e-a86db868b09b
> Node 2 keeps submitting the migration task 100+ times to the other node.
> INFO [GossipStage:1] 2016-04-19 11:18:18,261 Gossiper.java (line 1011) Node 
> /192.168.88.33 has restarted, now UP
> INFO [GossipStage:1] 2016-04-19 11:18:18,262 TokenMetadata.java (line 414) 
> Updating topology for /192.168.88.33
> ...
> DEBUG [GossipStage:1] 2016-04-19 11:18:18,265 MigrationManager.java (line 
> 102) Submitting migration task for /192.168.88.33
> ... ( over 100+ times)
> --
> On the other hand, Node 1 keeps updating its gossip information, followed by 
> receiving and submitting migration tasks afterwards: 
> INFO [RequestResponseStage:3] 2016-04-19 11:18:18,333 Gossiper.java (line 
> 978) InetAddress /192.168.88.34 is now UP
> ...
> DEBUG [MigrationStage:1] 2016-04-19 11:18:18,496 
> MigrationRequestVerbHandler.java (line 41) Received migration request from 
> /192.168.88.34.
> …… ( over 100+ times)
> DEBUG [OptionalTasks:1] 2016-04-19 11:19:18,337 MigrationManager.java (line 
> 127) submitting migration task for /192.168.88.34
> .  (over 50+ times)
> On a side note, we have 200+ column families defined in the Cassandra 
> database, which may be related to this amount of RPC traffic.
> P.S. 2: The over-requested schema migration tasks will eventually have 
> InternalResponseStage performing the schema merge operation. Since this 
> operation requires a 

[jira] [Commented] (CASSANDRA-14848) When upgrading 3.11.3->4.0 using SSL 4.0 nodes does not connect to old non seed nodes

2018-10-25 Thread Tommy Stendahl (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-14848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16663735#comment-16663735
 ] 

Tommy Stendahl commented on CASSANDRA-14848:


I think the problem is in 
{{OutboundMessagingConnection.maybeUpdateConnectionId()}} in combination with 
line 186 in that class:
{code:java}
targetVersion = MessagingService.instance().getVersion(connectionId.remote());
{code}
What happens is that when the {{OutboundMessagingConnection}} is created for 
the seed node, {{targetVersion}} is set to 12 since we don't know the version 
of that node yet. When we get incoming messages from the old seed node we 
detect that it has a lower version, and the if statement in 
{{maybeUpdateConnectionId()}} will be true:
{code:java}
if (version < targetVersion)
{code}
and we will change the port.

But when creating the {{OutboundMessagingConnection}} for the non-seed nodes we 
already know their versions (from gossiping with the old seed), so on line 186 
{{targetVersion}} will be set to 11, the if statement in 
{{maybeUpdateConnectionId()}} will never be true, and we will continue using 
the wrong port.

I verified this by hard-coding {{targetVersion=12}} on line 186 and then 
everything worked, but I don't think that's the proper fix.
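
Restating that flow as a self-contained sketch (this is a paraphrase of the 
behaviour described above, not the actual {{OutboundMessagingConnection}} 
source; field and method names are simplified):
{code:java}
// Paraphrase of the described behaviour, not Cassandra source.
class ConnectionIdSketch
{
    static final int VERSION_40 = 12;   // messaging version assumed until the peer is known
    static final int VERSION_311 = 11;

    int targetVersion;                  // set at construction from MessagingService's best guess
    boolean usingSslPort = false;

    ConnectionIdSketch(int knownPeerVersionOrDefault)
    {
        // Seed node: version unknown at construction => defaults to 12.
        // Non-seed nodes: version already learned via gossip => 11 from the start.
        this.targetVersion = knownPeerVersionOrDefault;
    }

    void maybeUpdateConnectionId(int versionFromIncomingMessage)
    {
        // Only fires when the incoming version is LOWER than what we assumed.
        // Seed:      11 < 12  -> true  -> switch to ssl_storage_port.
        // Non-seeds: 11 < 11  -> false -> stuck on the wrong port (storage_port).
        if (versionFromIncomingMessage < targetVersion)
        {
            targetVersion = versionFromIncomingMessage;
            usingSslPort = true;
        }
    }
}
{code}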

> When upgrading 3.11.3->4.0 using SSL 4.0 nodes does not connect to old non 
> seed nodes
> -
>
> Key: CASSANDRA-14848
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14848
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Tommy Stendahl
>Priority: Major
>
> When upgrading from 3.11.3 to 4.0 with server encryption enabled, the new 4.0 
> node only connects to the 3.11.3 seed node; no connections are established 
> to the non-seed nodes on the old version.
> I have four nodes: *.242 is upgraded to 4.0, *.243 and *.244 are 3.11.3 
> non-seeds, and *.246 is the 3.11.3 seed. After starting the 4.0 node I get 
> this nodetool status on the different nodes:
> {noformat}
> *.242
> -- Address Load Tokens Owns (effective) Host ID Rack
> UN 10.216.193.242 1017.77 KiB 256 75,1% 7d278e14-d549-42f3-840d-77cfd852fbf4 
> RAC1
> DN 10.216.193.243 743.32 KiB 256 74,8% 5586243a-ca74-4125-8e7e-09e82e23c4e5 
> RAC1
> DN 10.216.193.244 711.54 KiB 256 75,2% c155e262-b898-4e86-9e1d-d4d0f97e88f6 
> RAC1
> UN 10.216.193.246 659.81 KiB 256 74,9% 502dd00f-fc02-4024-b65f-b98ba3808291 
> RAC1
> *.243 and *.244
> -- Address Load Tokens Owns (effective) Host ID Rack
> DN 10.216.193.242 657.4 KiB 256 75,1% 7d278e14-d549-42f3-840d-77cfd852fbf4 
> RAC1
> UN 10.216.193.243 471 KiB 256 74,8% 5586243a-ca74-4125-8e7e-09e82e23c4e5 RAC1
> UN 10.216.193.244 471.71 KiB 256 75,2% c155e262-b898-4e86-9e1d-d4d0f97e88f6 
> RAC1
> UN 10.216.193.246 388.54 KiB 256 74,9% 502dd00f-fc02-4024-b65f-b98ba3808291 
> RAC1
> *.246
> -- Address Load Tokens Owns (effective) Host ID Rack
> UN 10.216.193.242 657.4 KiB 256 75,1% 7d278e14-d549-42f3-840d-77cfd852fbf4 
> RAC1
> UN 10.216.193.243 471 KiB 256 74,8% 5586243a-ca74-4125-8e7e-09e82e23c4e5 RAC1
> UN 10.216.193.244 471.71 KiB 256 75,2% c155e262-b898-4e86-9e1d-d4d0f97e88f6 
> RAC1
> UN 10.216.193.246 388.54 KiB 256 74,9% 502dd00f-fc02-4024-b65f-b98ba3808291 
> RAC1
> {noformat}
>  
> I have built 4.0 with wire tracing activated, and in my config 
> storage_port=12700 and ssl_storage_port=12701. In the log I can see that the 
> 4.0 node starts to connect to the 3.11.3 seed node on the storage_port but 
> quickly switches to the ssl_storage_port; when connecting to the non-seed 
> nodes it never switches to the ssl_storage_port.
> {noformat}
> >grep 193.246 system.log | grep Outbound
> 2018-10-25T10:57:36.799+0200 [MessagingService-NettyOutbound-Thread-4-1] INFO 
> i.n.u.internal.logging.Slf4JLogger:101 info [id: 0x2f0e5e55] CONNECT: 
> /10.216.193.246:12700
> 2018-10-25T10:57:36.902+0200 [MessagingService-NettyOutbound-Thread-4-2] INFO 
> i.n.u.internal.logging.Slf4JLogger:101 info [id: 0x9e81f62c] CONNECT: 
> /10.216.193.246:12701
> 2018-10-25T10:57:36.905+0200 [MessagingService-NettyOutbound-Thread-4-2] INFO 
> i.n.u.internal.logging.Slf4JLogger:101 info [id: 0x9e81f62c, 
> L:/10.216.193.242:37252 - R:10.216.193.246/10.216.193.246:12701] ACTIVE
> 2018-10-25T10:57:36.906+0200 [MessagingService-NettyOutbound-Thread-4-2] INFO 
> i.n.u.internal.logging.Slf4JLogger:101 info [id: 0x9e81f62c, 
> L:/10.216.193.242:37252 - R:10.216.193.246/10.216.193.246:12701] WRITE: 8B
> >grep 193.243 system.log | grep Outbound
> 2018-10-25T10:57:38.438+0200 [MessagingService-NettyOutbound-Thread-4-3] INFO 
> i.n.u.internal.logging.Slf4JLogger:101 info [id: 0xd8f1d6c4] CONNECT: 
> /10.216.193.243:12700
> 2018-10-25T10:57:38.540+0200 [MessagingService-NettyOutbound-Thread-4-4] INFO 
> i.n.u.internal.logging.Slf4JLogger:101 info [id: 0xfde6cc9f] 

[jira] [Updated] (CASSANDRA-14848) When upgrading 3.11.3->4.0 using SSL 4.0 nodes does not connect to old non seed nodes

2018-10-25 Thread Tommy Stendahl (JIRA)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-14848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tommy Stendahl updated CASSANDRA-14848:
---
Description: 
When upgrading from 3.11.3 to 4.0 with server encryption enabled, the new 4.0 
node only connects to the 3.11.3 seed node; no connections are established to 
the non-seed nodes on the old version.

I have four nodes: *.242 is upgraded to 4.0, *.243 and *.244 are 3.11.3 
non-seeds, and *.246 is the 3.11.3 seed. After starting the 4.0 node I get this 
nodetool status on the different nodes:
{noformat}
*.242
-- Address Load Tokens Owns (effective) Host ID Rack
UN 10.216.193.242 1017.77 KiB 256 75,1% 7d278e14-d549-42f3-840d-77cfd852fbf4 
RAC1
DN 10.216.193.243 743.32 KiB 256 74,8% 5586243a-ca74-4125-8e7e-09e82e23c4e5 RAC1
DN 10.216.193.244 711.54 KiB 256 75,2% c155e262-b898-4e86-9e1d-d4d0f97e88f6 RAC1
UN 10.216.193.246 659.81 KiB 256 74,9% 502dd00f-fc02-4024-b65f-b98ba3808291 RAC1

*.243 and *.244
-- Address Load Tokens Owns (effective) Host ID Rack
DN 10.216.193.242 657.4 KiB 256 75,1% 7d278e14-d549-42f3-840d-77cfd852fbf4 RAC1
UN 10.216.193.243 471 KiB 256 74,8% 5586243a-ca74-4125-8e7e-09e82e23c4e5 RAC1
UN 10.216.193.244 471.71 KiB 256 75,2% c155e262-b898-4e86-9e1d-d4d0f97e88f6 RAC1
UN 10.216.193.246 388.54 KiB 256 74,9% 502dd00f-fc02-4024-b65f-b98ba3808291 RAC1

*.246
-- Address Load Tokens Owns (effective) Host ID Rack
UN 10.216.193.242 657.4 KiB 256 75,1% 7d278e14-d549-42f3-840d-77cfd852fbf4 RAC1
UN 10.216.193.243 471 KiB 256 74,8% 5586243a-ca74-4125-8e7e-09e82e23c4e5 RAC1
UN 10.216.193.244 471.71 KiB 256 75,2% c155e262-b898-4e86-9e1d-d4d0f97e88f6 RAC1
UN 10.216.193.246 388.54 KiB 256 74,9% 502dd00f-fc02-4024-b65f-b98ba3808291 RAC1
{noformat}
 

I have built 4.0 with wire tracing activated, and in my config 
storage_port=12700 and ssl_storage_port=12701. In the log I can see that the 
4.0 node starts to connect to the 3.11.3 seed node on the storage_port but 
quickly switches to the ssl_storage_port; when connecting to the non-seed 
nodes it never switches to the ssl_storage_port.
{noformat}
>grep 193.246 system.log | grep Outbound
2018-10-25T10:57:36.799+0200 [MessagingService-NettyOutbound-Thread-4-1] INFO 
i.n.u.internal.logging.Slf4JLogger:101 info [id: 0x2f0e5e55] CONNECT: 
/10.216.193.246:12700
2018-10-25T10:57:36.902+0200 [MessagingService-NettyOutbound-Thread-4-2] INFO 
i.n.u.internal.logging.Slf4JLogger:101 info [id: 0x9e81f62c] CONNECT: 
/10.216.193.246:12701
2018-10-25T10:57:36.905+0200 [MessagingService-NettyOutbound-Thread-4-2] INFO 
i.n.u.internal.logging.Slf4JLogger:101 info [id: 0x9e81f62c, 
L:/10.216.193.242:37252 - R:10.216.193.246/10.216.193.246:12701] ACTIVE
2018-10-25T10:57:36.906+0200 [MessagingService-NettyOutbound-Thread-4-2] INFO 
i.n.u.internal.logging.Slf4JLogger:101 info [id: 0x9e81f62c, 
L:/10.216.193.242:37252 - R:10.216.193.246/10.216.193.246:12701] WRITE: 8B

>grep 193.243 system.log | grep Outbound
2018-10-25T10:57:38.438+0200 [MessagingService-NettyOutbound-Thread-4-3] INFO 
i.n.u.internal.logging.Slf4JLogger:101 info [id: 0xd8f1d6c4] CONNECT: 
/10.216.193.243:12700
2018-10-25T10:57:38.540+0200 [MessagingService-NettyOutbound-Thread-4-4] INFO 
i.n.u.internal.logging.Slf4JLogger:101 info [id: 0xfde6cc9f] CONNECT: 
/10.216.193.243:12700
2018-10-25T10:57:38.694+0200 [MessagingService-NettyOutbound-Thread-4-5] INFO 
i.n.u.internal.logging.Slf4JLogger:101 info [id: 0x7e87fc4e] CONNECT: 
/10.216.193.243:12700
2018-10-25T10:57:38.741+0200 [MessagingService-NettyOutbound-Thread-4-7] INFO 
i.n.u.internal.logging.Slf4JLogger:101 info [id: 0x39395296] CONNECT: 
/10.216.193.243:12700{noformat}
 

When I had the debug log activated and started the 4.0 node I can see that it 
switches port for *.246 but not for *.243 and *.244.
{noformat}
>grep DEBUG system.log| grep OutboundMessagingConnection | grep 
>maybeUpdateConnectionId
2018-10-25T13:12:56.095+0200 [ScheduledFastTasks:1] DEBUG 
o.a.c.n.a.OutboundMessagingConnection:314 maybeUpdateConnectionId changing 
connectionId to 10.216.193.246:12701 (GOSSIP), with a different port for secure 
communication, because peer version is 11
2018-10-25T13:12:58.100+0200 [ReadStage-1] DEBUG 
o.a.c.n.a.OutboundMessagingConnection:314 maybeUpdateConnectionId changing 
connectionId to 10.216.193.246:12701 (SMALL_MESSAGE), with a different port for 
secure communication, because peer version is 11
2018-10-25T13:13:05.764+0200 [main] DEBUG 
o.a.c.n.a.OutboundMessagingConnection:314 maybeUpdateConnectionId changing 
connectionId to 10.216.193.246:12701 (LARGE_MESSAGE), with a different port for 
secure communication, because peer version is 11
{noformat}
 

  was:
When upgrading from 3.11.3 to 4.0 with server encryption enabled, the new 4.0 
node only connects to the 3.11.3 seed node; no connections are established to 
the non-seed nodes on the old version.

I have four nodes, *.242 is upgraded to 4.0, *.243 and *.244 are 3.11.3 
non-seed 

[jira] [Created] (CASSANDRA-14848) When upgrading 3.11.3->4.0 using SSL 4.0 nodes does not connect to old non seed nodes

2018-10-25 Thread Tommy Stendahl (JIRA)
Tommy Stendahl created CASSANDRA-14848:
--

 Summary: When upgrading 3.11.3->4.0 using SSL 4.0 nodes does not 
connect to old non seed nodes
 Key: CASSANDRA-14848
 URL: https://issues.apache.org/jira/browse/CASSANDRA-14848
 Project: Cassandra
  Issue Type: Bug
Reporter: Tommy Stendahl


When upgrading from 3.11.3 to 4.0 with server encryption enabled, the new 4.0 
node only connects to the 3.11.3 seed node; no connections are established to 
the non-seed nodes on the old version.

I have four nodes: *.242 is upgraded to 4.0, *.243 and *.244 are 3.11.3 
non-seeds, and *.246 is the 3.11.3 seed. After starting the 4.0 node I get this 
nodetool status on the different nodes:

 
{noformat}
*.242
-- Address Load Tokens Owns (effective) Host ID Rack
UN 10.216.193.242 1017.77 KiB 256 75,1% 7d278e14-d549-42f3-840d-77cfd852fbf4 
RAC1
DN 10.216.193.243 743.32 KiB 256 74,8% 5586243a-ca74-4125-8e7e-09e82e23c4e5 RAC1
DN 10.216.193.244 711.54 KiB 256 75,2% c155e262-b898-4e86-9e1d-d4d0f97e88f6 RAC1
UN 10.216.193.246 659.81 KiB 256 74,9% 502dd00f-fc02-4024-b65f-b98ba3808291 RAC1

*.243 and *.244
-- Address Load Tokens Owns (effective) Host ID Rack
DN 10.216.193.242 657.4 KiB 256 75,1% 7d278e14-d549-42f3-840d-77cfd852fbf4 RAC1
UN 10.216.193.243 471 KiB 256 74,8% 5586243a-ca74-4125-8e7e-09e82e23c4e5 RAC1
UN 10.216.193.244 471.71 KiB 256 75,2% c155e262-b898-4e86-9e1d-d4d0f97e88f6 RAC1
UN 10.216.193.246 388.54 KiB 256 74,9% 502dd00f-fc02-4024-b65f-b98ba3808291 RAC1

*.246
-- Address Load Tokens Owns (effective) Host ID Rack
UN 10.216.193.242 657.4 KiB 256 75,1% 7d278e14-d549-42f3-840d-77cfd852fbf4 RAC1
UN 10.216.193.243 471 KiB 256 74,8% 5586243a-ca74-4125-8e7e-09e82e23c4e5 RAC1
UN 10.216.193.244 471.71 KiB 256 75,2% c155e262-b898-4e86-9e1d-d4d0f97e88f6 RAC1
UN 10.216.193.246 388.54 KiB 256 74,9% 502dd00f-fc02-4024-b65f-b98ba3808291 RAC1
{noformat}
I have built 4.0 with wire tracing activated, and in my config 
storage_port=12700 and ssl_storage_port=12701. In the log I can see that the 
4.0 node starts to connect to the 3.11.3 seed node on the storage_port but 
quickly switches to the ssl_storage_port; when connecting to the non-seed 
nodes it never switches to the ssl_storage_port.

 

 
{noformat}
>grep 193.246 system.log | grep Outbound
2018-10-25T10:57:36.799+0200 [MessagingService-NettyOutbound-Thread-4-1] INFO 
i.n.u.internal.logging.Slf4JLogger:101 info [id: 0x2f0e5e55] CONNECT: 
/10.216.193.246:12700
2018-10-25T10:57:36.902+0200 [MessagingService-NettyOutbound-Thread-4-2] INFO 
i.n.u.internal.logging.Slf4JLogger:101 info [id: 0x9e81f62c] CONNECT: 
/10.216.193.246:12701
2018-10-25T10:57:36.905+0200 [MessagingService-NettyOutbound-Thread-4-2] INFO 
i.n.u.internal.logging.Slf4JLogger:101 info [id: 0x9e81f62c, 
L:/10.216.193.242:37252 - R:10.216.193.246/10.216.193.246:12701] ACTIVE
2018-10-25T10:57:36.906+0200 [MessagingService-NettyOutbound-Thread-4-2] INFO 
i.n.u.internal.logging.Slf4JLogger:101 info [id: 0x9e81f62c, 
L:/10.216.193.242:37252 - R:10.216.193.246/10.216.193.246:12701] WRITE: 8B

>grep 193.243 system.log | grep Outbound
2018-10-25T10:57:38.438+0200 [MessagingService-NettyOutbound-Thread-4-3] INFO 
i.n.u.internal.logging.Slf4JLogger:101 info [id: 0xd8f1d6c4] CONNECT: 
/10.216.193.243:12700
2018-10-25T10:57:38.540+0200 [MessagingService-NettyOutbound-Thread-4-4] INFO 
i.n.u.internal.logging.Slf4JLogger:101 info [id: 0xfde6cc9f] CONNECT: 
/10.216.193.243:12700
2018-10-25T10:57:38.694+0200 [MessagingService-NettyOutbound-Thread-4-5] INFO 
i.n.u.internal.logging.Slf4JLogger:101 info [id: 0x7e87fc4e] CONNECT: 
/10.216.193.243:12700
2018-10-25T10:57:38.741+0200 [MessagingService-NettyOutbound-Thread-4-7] INFO 
i.n.u.internal.logging.Slf4JLogger:101 info [id: 0x39395296] CONNECT: 
/10.216.193.243:12700{noformat}
When I had the debug log activated and started the 4.0 node I can see that it 
switches port for *.246 but not for *.243 and *.244.

 

 
{noformat}
>grep DEBUG system.log| grep OutboundMessagingConnection | grep 
>maybeUpdateConnectionId
2018-10-25T13:12:56.095+0200 [ScheduledFastTasks:1] DEBUG 
o.a.c.n.a.OutboundMessagingConnection:314 maybeUpdateConnectionId changing 
connectionId to 10.216.193.246:12701 (GOSSIP), with a different port for secure 
communication, because peer version is 11
2018-10-25T13:12:58.100+0200 [ReadStage-1] DEBUG 
o.a.c.n.a.OutboundMessagingConnection:314 maybeUpdateConnectionId changing 
connectionId to 10.216.193.246:12701 (SMALL_MESSAGE), with a different port for 
secure communication, because peer version is 11
2018-10-25T13:13:05.764+0200 [main] DEBUG 
o.a.c.n.a.OutboundMessagingConnection:314 maybeUpdateConnectionId changing 
connectionId to 10.216.193.246:12701 (LARGE_MESSAGE), with a different port for 
secure communication, because peer version is 11
{noformat}
 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (CASSANDRA-14842) SSL connection problems when upgrading to 4.0 when upgrading from 3.0.x

2018-10-25 Thread Tommy Stendahl (JIRA)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-14842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tommy Stendahl updated CASSANDRA-14842:
---
Priority: Major  (was: Blocker)

> SSL connection problems when upgrading to 4.0 when upgrading from 3.0.x
> ---
>
> Key: CASSANDRA-14842
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14842
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Tommy Stendahl
>Priority: Major
>
> While testing an upgrade from 3.0.15 to 4.0, the old nodes fail to connect to 
> the 4.0 node; I get this exception on the 4.0 node:
>  
> {noformat}
> 2018-10-22T11:57:44.366+0200 ERROR [MessagingService-NettyInbound-Thread-3-8] 
> InboundHandshakeHandler.java:300 Failed to properly handshake with peer 
> /10.216.193.246:58296. Closing the channel.
> io.netty.handler.codec.DecoderException: javax.net.ssl.SSLHandshakeException: 
> SSLv2Hello is disabled
> at 
> io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:459)
> at 
> io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:265)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
> at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
> at 
> io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1434)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
> at 
> io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:965)
> at 
> io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:808)
> at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:417)
> at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:317)
> at 
> io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:884)
> at 
> io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: javax.net.ssl.SSLHandshakeException: SSLv2Hello is disabled
> at sun.security.ssl.InputRecord.handleUnknownRecord(InputRecord.java:637)
> at sun.security.ssl.InputRecord.read(InputRecord.java:527)
> at sun.security.ssl.EngineInputRecord.read(EngineInputRecord.java:382)
> at sun.security.ssl.SSLEngineImpl.readRecord(SSLEngineImpl.java:962)
> at sun.security.ssl.SSLEngineImpl.readNetRecord(SSLEngineImpl.java:907)
> at sun.security.ssl.SSLEngineImpl.unwrap(SSLEngineImpl.java:781)
> at javax.net.ssl.SSLEngine.unwrap(SSLEngine.java:624)
> at io.netty.handler.ssl.SslHandler$SslEngineType$3.unwrap(SslHandler.java:294)
> at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1275)
> at io.netty.handler.ssl.SslHandler.decodeJdkCompatible(SslHandler.java:1177)
> at io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1221)
> at 
> io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:489)
> at 
> io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:428)
> ... 14 common frames omitted{noformat}
> In the server encryption options on the 4.0 node I have both "enabled" and 
> "enable_legacy_ssl_storage_port" set to true, so it should accept incoming 
> connections on the "ssl_storage_port".
>  
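
For reference, the relevant part of the 4.0 node's configuration as described 
above would look roughly like this (the keystore/truststore entries are 
placeholders, not values taken from the ticket):
{noformat}
server_encryption_options:
  enabled: true
  enable_legacy_ssl_storage_port: true   # keep accepting inbound connections on ssl_storage_port
  keystore: conf/.keystore                # placeholder
  keystore_password: cassandra            # placeholder
  truststore: conf/.truststore            # placeholder
  truststore_password: cassandra          # placeholder
{noformat}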



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-14842) SSL connection problems when upgrading to 4.0 when upgrading from 3.0.x

2018-10-25 Thread Tommy Stendahl (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-14842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16663622#comment-16663622
 ] 

Tommy Stendahl commented on CASSANDRA-14842:


Could this be related to CASSANDRA-8265 "Disable SSLv3 for POODLE"?

> SSL connection problems when upgrading to 4.0 when upgrading from 3.0.x
> ---
>
> Key: CASSANDRA-14842
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14842
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Tommy Stendahl
>Priority: Blocker
>
> While testing an upgrade from 3.0.15 to 4.0, the old nodes fail to connect to 
> the 4.0 node; I get this exception on the 4.0 node:
>  
> {noformat}
> 2018-10-22T11:57:44.366+0200 ERROR [MessagingService-NettyInbound-Thread-3-8] 
> InboundHandshakeHandler.java:300 Failed to properly handshake with peer 
> /10.216.193.246:58296. Closing the channel.
> io.netty.handler.codec.DecoderException: javax.net.ssl.SSLHandshakeException: 
> SSLv2Hello is disabled
> at 
> io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:459)
> at 
> io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:265)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
> at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
> at 
> io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1434)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
> at 
> io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:965)
> at 
> io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:808)
> at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:417)
> at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:317)
> at 
> io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:884)
> at 
> io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: javax.net.ssl.SSLHandshakeException: SSLv2Hello is disabled
> at sun.security.ssl.InputRecord.handleUnknownRecord(InputRecord.java:637)
> at sun.security.ssl.InputRecord.read(InputRecord.java:527)
> at sun.security.ssl.EngineInputRecord.read(EngineInputRecord.java:382)
> at sun.security.ssl.SSLEngineImpl.readRecord(SSLEngineImpl.java:962)
> at sun.security.ssl.SSLEngineImpl.readNetRecord(SSLEngineImpl.java:907)
> at sun.security.ssl.SSLEngineImpl.unwrap(SSLEngineImpl.java:781)
> at javax.net.ssl.SSLEngine.unwrap(SSLEngine.java:624)
> at io.netty.handler.ssl.SslHandler$SslEngineType$3.unwrap(SslHandler.java:294)
> at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1275)
> at io.netty.handler.ssl.SslHandler.decodeJdkCompatible(SslHandler.java:1177)
> at io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1221)
> at 
> io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:489)
> at 
> io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:428)
> ... 14 common frames omitted{noformat}
> In the server encryption options on the 4.0 node I have both "enabled" and 
> "enable_legacy_ssl_storage_port" set to true, so it should accept incoming 
> connections on the "ssl_storage_port".
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Comment Edited] (CASSANDRA-14842) SSL connection problems when upgrading to 4.0 when upgrading from 3.0.x

2018-10-25 Thread Tommy Stendahl (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-14842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16662348#comment-16662348
 ] 

Tommy Stendahl edited comment on CASSANDRA-14842 at 10/25/18 11:38 AM:
---

The issue when upgrading from 3.0.x remains the same. I activated wire tracing 
in {{NettyFactory.java}} to get some more logging.
{noformat}
2018-10-24T15:13:31.724+0200 [MessagingService-NettyInbound-Thread-3-3] INFO 
i.n.u.internal.logging.Slf4JLogger:101 info [id: 0x68a0cdd6, 
L:/10.216.193.242:12701 - R:/10.216.193.243:60911] REGISTERED
2018-10-24T15:13:31.725+0200 [MessagingService-NettyInbound-Thread-3-3] INFO 
i.n.u.internal.logging.Slf4JLogger:101 info [id: 0x68a0cdd6, 
L:/10.216.193.242:12701 - R:/10.216.193.243:60911] ACTIVE
2018-10-24T15:13:31.725+0200 [MessagingService-NettyInbound-Thread-3-3] INFO 
i.n.u.internal.logging.Slf4JLogger:101 info [id: 0x68a0cdd6, 
L:/10.216.193.242:12701 - R:/10.216.193.243:60911] USER_EVENT: 
SslHandshakeCompletionEvent(javax.net.ssl.SSLHandshakeException: SSLv2Hello is 
disabled)
2018-10-24T15:13:31.725+0200 [MessagingService-NettyInbound-Thread-3-3] INFO 
i.n.u.internal.logging.Slf4JLogger:121 info [id: 0x68a0cdd6, 
L:/10.216.193.242:12701 ! R:/10.216.193.243:60911] EXCEPTION: 
io.netty.handler.codec.DecoderException: javax.net.ssl.SSLHandshakeException: 
SSLv2Hello is disabled
io.netty.handler.codec.DecoderException: javax.net.ssl.SSLHandshakeException: 
SSLv2Hello is disabled
at 
io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:459)
at 
io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:265)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
at 
io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1434)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
at 
io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:965)
at 
io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:808)
at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:417)
at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:317)
at 
io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:884)
at 
io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.lang.Thread.run(Thread.java:748)
Caused by: javax.net.ssl.SSLHandshakeException: SSLv2Hello is disabled
at sun.security.ssl.InputRecord.handleUnknownRecord(InputRecord.java:637)
at sun.security.ssl.InputRecord.read(InputRecord.java:527)
at sun.security.ssl.EngineInputRecord.read(EngineInputRecord.java:382)
at sun.security.ssl.SSLEngineImpl.readRecord(SSLEngineImpl.java:962)
at sun.security.ssl.SSLEngineImpl.readNetRecord(SSLEngineImpl.java:907)
at sun.security.ssl.SSLEngineImpl.unwrap(SSLEngineImpl.java:781)
at javax.net.ssl.SSLEngine.unwrap(SSLEngine.java:624)
at io.netty.handler.ssl.SslHandler$SslEngineType$3.unwrap(SslHandler.java:294)
at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1275)
at io.netty.handler.ssl.SslHandler.decodeJdkCompatible(SslHandler.java:1177)
at io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1221)
at 
io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:489)
at 
io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:428)
... 14 common frames omitted
2018-10-24T15:13:31.725+0200 [MessagingService-NettyInbound-Thread-3-3] ERROR 
o.a.c.n.a.InboundHandshakeHandler:300 exceptionCaught Failed to properly 
handshake with peer /10.216.193.243:60911. Closing the channel.
io.netty.handler.codec.DecoderException: javax.net.ssl.SSLHandshakeException: 
SSLv2Hello is disabled
at 
io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:459)
at 
io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:265)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
at 
io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1434)
at 

[jira] [Updated] (CASSANDRA-14842) SSL connection problems when upgrading to 4.0 when upgrading from 3.0.x

2018-10-25 Thread Tommy Stendahl (JIRA)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-14842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tommy Stendahl updated CASSANDRA-14842:
---
Description: 
While testing an upgrade from 3.0.15 to 4.0, the old nodes fail to connect to 
the 4.0 node; I get this exception on the 4.0 node:

 
{noformat}
2018-10-22T11:57:44.366+0200 ERROR [MessagingService-NettyInbound-Thread-3-8] 
InboundHandshakeHandler.java:300 Failed to properly handshake with peer 
/10.216.193.246:58296. Closing the channel.
io.netty.handler.codec.DecoderException: javax.net.ssl.SSLHandshakeException: 
SSLv2Hello is disabled
at 
io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:459)
at 
io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:265)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
at 
io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1434)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
at 
io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:965)
at 
io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:808)
at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:417)
at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:317)
at 
io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:884)
at 
io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.lang.Thread.run(Thread.java:748)
Caused by: javax.net.ssl.SSLHandshakeException: SSLv2Hello is disabled
at sun.security.ssl.InputRecord.handleUnknownRecord(InputRecord.java:637)
at sun.security.ssl.InputRecord.read(InputRecord.java:527)
at sun.security.ssl.EngineInputRecord.read(EngineInputRecord.java:382)
at sun.security.ssl.SSLEngineImpl.readRecord(SSLEngineImpl.java:962)
at sun.security.ssl.SSLEngineImpl.readNetRecord(SSLEngineImpl.java:907)
at sun.security.ssl.SSLEngineImpl.unwrap(SSLEngineImpl.java:781)
at javax.net.ssl.SSLEngine.unwrap(SSLEngine.java:624)
at io.netty.handler.ssl.SslHandler$SslEngineType$3.unwrap(SslHandler.java:294)
at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1275)
at io.netty.handler.ssl.SslHandler.decodeJdkCompatible(SslHandler.java:1177)
at io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1221)
at 
io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:489)
at 
io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:428)
... 14 common frames omitted{noformat}
In the server encryption options on the 4.0 node I have both "enabled" and 
"enable_legacy_ssl_storage_port" set to true, so it should accept incoming 
connections on the "ssl_storage_port".
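The handshake failure above happens inside JSSE: the 3.0.x peer's initial hello arrives in the legacy SSLv2Hello record format, and the accepting SSLEngine on the 4.0 side rejects it unless "SSLv2Hello" is among its enabled protocols. Below is a minimal sketch of that JSSE behaviour; the helper name is hypothetical and keystore/context setup is omitted, so this is an illustration rather than the actual 4.0 server code.
{code:java}
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLEngine;

public final class LegacyHelloSupport
{
    // Hypothetical helper: builds a server-side SSLEngine that also accepts the
    // SSLv2Hello-framed ClientHello an older peer may send. If "SSLv2Hello" is
    // missing from the enabled protocols, JSSE fails the handshake with
    // "javax.net.ssl.SSLHandshakeException: SSLv2Hello is disabled".
    public static SSLEngine newServerEngine(SSLContext context)
    {
        SSLEngine engine = context.createSSLEngine();
        engine.setUseClientMode(false);
        // SSLv2Hello is only a hello framing format; it must be enabled together
        // with at least one real protocol version.
        engine.setEnabledProtocols(new String[]{ "SSLv2Hello", "TLSv1", "TLSv1.1", "TLSv1.2" });
        return engine;
    }
}
{code}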

 

  was:
While testing to upgrade from 3.0.15 to 4.0 the old nodes fails to connect to 
the 4.0 node, I get this exception on the 4.0 node:

 
{noformat}
2018-10-22T11:57:44.366+0200 ERROR [MessagingService-NettyInbound-Thread-3-8] 
InboundHandshakeHandler.java:300 Failed to properly handshake with peer 
/10.216.193.246:58296. Closing the channel.
io.netty.handler.codec.DecoderException: javax.net.ssl.SSLHandshakeException: 
SSLv2Hello is disabled
at 
io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:459)
at 
io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:265)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
at 
io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1434)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
at 
io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:965)
at 
io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:808)
at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:417)
at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:317)
at 

[jira] [Updated] (CASSANDRA-14847) improvement of nodetool status -r

2018-10-25 Thread Fumiya Yamashita (JIRA)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-14847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fumiya Yamashita updated CASSANDRA-14847:
-
Description: 
Hello,

When using "nodetool -r", I found a problem that the response time becomes 
longer depending on the number of vnodes.
 In my testing environment, when the num_token is 256 and the number of nodes 
is 6, the response takes about 60 seconds.

It turned out that the findMaxAddressLength method in status.java is causing 
the delay.
 Despite only obtaining the maximum length of the address by the number of 
vnodes, `tokenrange * vnode` times also loop processing, there is redundancy.

To prevent duplicate host names from being referenced every time, I modified to 
check with hash.
 In my environment, the response time has been reduced from 60 seconds to 2 
seconds.

I attached the patch, so please check it.
 Thank you
{code:java}
[before]
Datacenter: dc1
===
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN *** 559.32 KB 256 48.7% 0555746a-60c2-4717-b042-94ba951ef679 ***
UN *** 721.48 KB 256 51.4% 1af4acb6-e0a0-4bcb-8bba-76ae2e225cd5 ***
UN *** 699.98 KB 256 48.3% 5215c728-9b80-4e3c-b46b-c5b8e5eb753f ***
UN *** 691.65 KB 256 48.1% 57da4edf-4acb-474d-b26c-27f048c37bd6 ***
UN *** 705.66 KB 256 52.8% 07520eab-47d2-4f5d-aeeb-f6e599c9b084 ***
UN *** 610.87 KB 256 50.7% 6b39acaf-6ed6-42e4-a357-0d258bdf87b7 ***

time : 66s

[after]
Datacenter: dc1
===
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN *** 559.32 KB 256 48.7% 0555746a-60c2-4717-b042-94ba951ef679 ***
UN *** 721.48 KB 256 51.4% 1af4acb6-e0a0-4bcb-8bba-76ae2e225cd5 ***
UN *** 699.98 KB 256 48.3% 5215c728-9b80-4e3c-b46b-c5b8e5eb753f ***
UN *** 691.65 KB 256 48.1% 57da4edf-4acb-474d-b26c-27f048c37bd6 ***
UN *** 705.66 KB 256 52.8% 07520eab-47d2-4f5d-aeeb-f6e599c9b084 ***
UN *** 610.87 KB 256 50.7% 6b39acaf-6ed6-42e4-a357-0d258bdf87b7 ***

time : 2s
{code}
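A rough illustration of the deduplication idea described above (this is only a sketch of the approach, not the attached 3.11.1.patch; the method signature is simplified and hypothetical):
{code:java}
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public final class MaxAddressLengthExample
{
    // Simplified, hypothetical version of findMaxAddressLength: each endpoint is
    // measured at most once, instead of once per (token range, vnode) pairing.
    public static int findMaxAddressLength(List<List<String>> endpointsPerTokenRange)
    {
        Set<String> seen = new HashSet<>();
        int max = 0;
        for (List<String> endpoints : endpointsPerTokenRange)
            for (String endpoint : endpoints)
                if (seen.add(endpoint))   // true only the first time this host is encountered
                    max = Math.max(max, endpoint.length());
        return max;
    }
}
{code}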

  was:
Hello,

When using "nodetool -r", I found a problem that the response time becomes 
longer depending on the number of vnodes.
In my testing environment, when the num_token is 256 and the number of nodes is 
6, the response takes about 60 seconds.

It turned out that the findMaxAddressLength method in status.java is causing 
the delay.
Despite only obtaining the maximum length of the address by the number of 
vnodes, `tokenrange * vnode` times also loop processing, there is redundancy.

To prevent duplicate host names from being referenced every time, I modified to 
check with hash.
In my environment, the response time has been reduced from 60 seconds to 2 
seconds.

I attached the patch, so please check it.
Thank you
{code:java}
[before] Datacenter: dc1 === Status=Up/Down |/ 
State=Normal/Leaving/Joining/Moving -- Address Load Tokens Owns (effective) 
Host ID Rack UN *** 559.32 KB 256 48.7% 
0555746a-60c2-4717-b042-94ba951ef679 *** UN *** 721.48 KB 256 51.4% 
1af4acb6-e0a0-4bcb-8bba-76ae2e225cd5 *** UN *** 699.98 KB 256 48.3% 
5215c728-9b80-4e3c-b46b-c5b8e5eb753f *** UN *** 691.65 KB 256 48.1% 
57da4edf-4acb-474d-b26c-27f048c37bd6 *** UN *** 705.66 KB 256 52.8% 
07520eab-47d2-4f5d-aeeb-f6e599c9b084 *** UN *** 610.87 KB 256 50.7% 
6b39acaf-6ed6-42e4-a357-0d258bdf87b7 *** time : 66s [after] Datacenter: dc1 
=== Status=Up/Down |/ State=Normal/Leaving/Joining/Moving -- 
Address Load Tokens Owns (effective) Host ID Rack UN *** 559.32 KB 256 
48.7% 0555746a-60c2-4717-b042-94ba951ef679 *** UN *** 721.48 KB 256 
51.4% 1af4acb6-e0a0-4bcb-8bba-76ae2e225cd5 *** UN *** 699.98 KB 256 
48.3% 5215c728-9b80-4e3c-b46b-c5b8e5eb753f *** UN *** 691.65 KB 256 
48.1% 57da4edf-4acb-474d-b26c-27f048c37bd6 *** UN *** 705.66 KB 256 
52.8% 07520eab-47d2-4f5d-aeeb-f6e599c9b084 *** UN *** 610.87 KB 256 
50.7% 6b39acaf-6ed6-42e4-a357-0d258bdf87b7 *** time : 2s
{code}


> improvement of nodetool status -r
> -
>
> Key: CASSANDRA-14847
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14847
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Fumiya Yamashita
>Priority: Major
> Fix For: 3.11.x
>
> Attachments: 3.11.1.patch
>
>
> Hello,
> When using "nodetool status -r", I found that the response time grows with 
> the number of vnodes.
>  In my test environment, with num_tokens set to 256 and 6 nodes, the response 
> takes about 60 seconds.
> It turned out that the findMaxAddressLength method in status.java is causing 
> the delay.
>  Despite only obtaining the 

[jira] [Created] (CASSANDRA-14847) improvement of nodetool status -r

2018-10-25 Thread Fumiya Yamashita (JIRA)
Fumiya Yamashita created CASSANDRA-14847:


 Summary: improvement of nodetool status -r
 Key: CASSANDRA-14847
 URL: https://issues.apache.org/jira/browse/CASSANDRA-14847
 Project: Cassandra
  Issue Type: Improvement
Reporter: Fumiya Yamashita
 Fix For: 3.11.x
 Attachments: 3.11.1.patch

Hello,

When using "nodetool -r", I found a problem that the response time becomes 
longer depending on the number of vnodes.
In my testing environment, when the num_token is 256 and the number of nodes is 
6, the response takes about 60 seconds.

It turned out that the findMaxAddressLength method in status.java is causing 
the delay.
Despite only obtaining the maximum length of the address by the number of 
vnodes, `tokenrange * vnode` times also loop processing, there is redundancy.

To prevent duplicate host names from being referenced every time, I modified to 
check with hash.
In my environment, the response time has been reduced from 60 seconds to 2 
seconds.

I attached the patch, so please check it.
Thank you
{code:java}
[before]
Datacenter: dc1
===
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN *** 559.32 KB 256 48.7% 0555746a-60c2-4717-b042-94ba951ef679 ***
UN *** 721.48 KB 256 51.4% 1af4acb6-e0a0-4bcb-8bba-76ae2e225cd5 ***
UN *** 699.98 KB 256 48.3% 5215c728-9b80-4e3c-b46b-c5b8e5eb753f ***
UN *** 691.65 KB 256 48.1% 57da4edf-4acb-474d-b26c-27f048c37bd6 ***
UN *** 705.66 KB 256 52.8% 07520eab-47d2-4f5d-aeeb-f6e599c9b084 ***
UN *** 610.87 KB 256 50.7% 6b39acaf-6ed6-42e4-a357-0d258bdf87b7 ***

time : 66s

[after]
Datacenter: dc1
===
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN *** 559.32 KB 256 48.7% 0555746a-60c2-4717-b042-94ba951ef679 ***
UN *** 721.48 KB 256 51.4% 1af4acb6-e0a0-4bcb-8bba-76ae2e225cd5 ***
UN *** 699.98 KB 256 48.3% 5215c728-9b80-4e3c-b46b-c5b8e5eb753f ***
UN *** 691.65 KB 256 48.1% 57da4edf-4acb-474d-b26c-27f048c37bd6 ***
UN *** 705.66 KB 256 52.8% 07520eab-47d2-4f5d-aeeb-f6e599c9b084 ***
UN *** 610.87 KB 256 50.7% 6b39acaf-6ed6-42e4-a357-0d258bdf87b7 ***

time : 2s
{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org