[jira] [Updated] (CASSANDRA-15989) Provide easy copypasta config formatting for nodetool get commands
[ https://issues.apache.org/jira/browse/CASSANDRA-15989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Shook updated CASSANDRA-15989:
---------------------------------------
Description:

Allow all nodetool commands which print out the state of the node or cluster to do so in a way that makes the output easy to re-use or paste on other nodes or in config files.

For example, the command getcompactionthroughput formats its output like this:

{noformat}
[jshook@cass4 bin]$ ./nodetool getcompactionthroughput
Current compaction throughput: 64 MB/s
{noformat}

But with an --as-yaml option, it could do this instead:

{noformat}
[jshook@cass4 bin]$ ./nodetool getcompactionthroughput --as-yaml
compaction_throughput_mb_per_sec: 64
{noformat}

and with an --as-cli option, it could do this:

{noformat}
[jshook@cass4 bin]$ ./nodetool getcompactionthroughput --as-cli
./nodetool setcompactionthroughput 64
{noformat}

Any other standard nodetool options should simply be carried along to the --as-cli form, with the exception of -pw. Any -pw options should be elided with a warning in comments, but -pwf options should be allowed. This would let users of -pw append a password at their discretion, while -pwf would continue to work as usual.

In the absence of either of the options above (--as-yaml or --as-cli), the formatting should not be changed, to avoid breaking extant tool integrations.


> Provide easy copypasta config formatting for nodetool get commands
> ------------------------------------------------------------------
>
>                 Key: CASSANDRA-15989
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-15989
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Jonathan Shook
>            Priority: Normal


--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org
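The proposed option handling can be sketched as a small mapping from each get command to its yaml key and set-command counterpart, plus the -pw elision rule. Everything below is a hypothetical illustration of the ticket's proposal, not actual nodetool internals; the command and key names are taken from the ticket's own example.

```python
# Hypothetical sketch of the proposed --as-yaml / --as-cli output modes.
# The mapping entries mirror the ticket's getcompactionthroughput example;
# real nodetool code is not structured this way.

SETTINGS = {
    "getcompactionthroughput": {
        "yaml_key": "compaction_throughput_mb_per_sec",
        "set_command": "setcompactionthroughput",
        "plain": "Current compaction throughput: {v} MB/s",
    },
}

def format_value(command, value, as_yaml=False, as_cli=False):
    """Render a get-command result in plain, yaml, or cli-pasteable form."""
    entry = SETTINGS[command]
    if as_yaml:
        return f"{entry['yaml_key']}: {value}"
    if as_cli:
        return f"./nodetool {entry['set_command']} {value}"
    # Default formatting stays unchanged so extant integrations keep working.
    return entry["plain"].format(v=value)

def elide_pw(args):
    """Drop '-pw <password>' pairs from carried-along options; -pwf is kept."""
    out, i = [], 0
    while i < len(args):
        if args[i] == "-pw":
            i += 2  # skip the flag and its password argument
            continue
        out.append(args[i])
        i += 1
    return out

print(format_value("getcompactionthroughput", 64, as_yaml=True))
# compaction_throughput_mb_per_sec: 64
print(format_value("getcompactionthroughput", 64, as_cli=True))
# ./nodetool setcompactionthroughput 64
```

The design point worth noting is the last branch: the plain format is the fallback, so omitting both flags is guaranteed to produce byte-identical output to today's behavior.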
[jira] [Created] (CASSANDRA-15989) Provide easy copypasta config formatting for nodetool get commands
Jonathan Shook created CASSANDRA-15989:
---------------------------------------

             Summary: Provide easy copypasta config formatting for nodetool get commands
                 Key: CASSANDRA-15989
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-15989
             Project: Cassandra
          Issue Type: Improvement
            Reporter: Jonathan Shook
[jira] [Commented] (CASSANDRA-15988) Add nodetool getfullquerylog
[ https://issues.apache.org/jira/browse/CASSANDRA-15988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17165937#comment-17165937 ]

Jonathan Shook commented on CASSANDRA-15988:
--------------------------------------------

My take on this: I think it is reasonable to show the user whether FQL is enabled or not as the first item. Additionally, the FQL configuration should be dumped to stdout in the same formatting convention as other nodetool get... commands.

In terms of whether it goes into 4.0 or 4.1, I think it is obviously missing functionality. Not being able to query the state of the service without changing it is a problem. Consider a scenario where multiple users are managing a system together and need to double-check the state of things before proceeding to the next step in their process. As manual as this sounds, many teams still do this type of ops work and need visibility into the operational state of the system.


> Add nodetool getfullquerylog
> ----------------------------
>
>                 Key: CASSANDRA-15988
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-15988
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Ekaterina Dimitrova
>            Priority: Normal
>
> This ticket is raised based on CASSANDRA-15791 and valuable feedback provided
> by [~jshook].
> There are two outstanding questions:
> * forming the exact shape of such a command and how it can benefit the
>   users; to be discussed in detail in this ticket
> * whether this is a thing we as a project can add to 4.0 beta, or whether it
>   should be considered in 4.1, for example
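The comment's suggestion (enabled state first, then the configuration in the usual nodetool get... key/value convention) can be sketched as a tiny renderer. This is an assumption about what a getfullquerylog report could look like; the field names below are illustrative, not taken from actual Cassandra code.

```python
# Hypothetical sketch of a "nodetool getfullquerylog" report: the enabled
# flag leads, followed by FQL options in key: value form. Field names are
# assumed for illustration only.

def render_fql_status(enabled, options):
    """Render enabled state first, then each option as 'key: value'."""
    lines = [f"enabled: {str(enabled).lower()}"]
    for key, value in options.items():
        lines.append(f"{key}: {value}")
    return "\n".join(lines)

print(render_fql_status(True, {
    "log_dir": "/var/lib/cassandra/fullquerylogs",
    "roll_cycle": "HOURLY",
    "block": "true",
}))
```

Putting the enabled flag on the first line means a human (or a script grepping the output) can answer the most common question without parsing the rest.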
[jira] [Commented] (CASSANDRA-15971) full query log needs improvement
[ https://issues.apache.org/jira/browse/CASSANDRA-15971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17163905#comment-17163905 ]

Jonathan Shook commented on CASSANDRA-15971:
--------------------------------------------

I was able to get the server to log FQL data with both Java 8 and Java 11. This appears to be a docs issue, as I had read that you could configure fql logging either in yaml or via nodetool. The official docs seem correct, so no issue there. However, the other usability issues still apply in my view, and we might want to triage them separately.


> full query log needs improvement
> --------------------------------
>
>                 Key: CASSANDRA-15971
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-15971
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Tool/fql
>            Reporter: Jonathan Shook
>            Priority: Normal
>         Attachments: st1.txt
>
> When trying out full query logging as a possible integration for nosqlbench
> usage, I ran across many issues which would make it painful for users. Since
> there were several, they are collected in a single issue for now. This issue
> can be broken up if needed.
>
> FQL doesn't work on my system, even though it says it is logging queries.
> With the following configuration in cassandra.yaml:
>
> {noformat}
> full_query_logging_options:
>     log_dir: /REDACTED/fullquerylogs
>     roll_cycle: HOURLY
>     block: true
>     max_queue_weight: 268435456 # 256 MiB
>     max_log_size: 17179869184 # 16 GiB
>     ## archive command is "/path/to/script.sh %path" where %path is replaced
>     ## with the file being rolled:
>     # archive_command:
>     # max_archive_retries: 10
> {noformat}
>
> which appears to be the minimal configuration needed to enable fql, only a
> single file `directory-listing.cq4t` is created, which is a 64K file of
> zeroes.
>
> Calling bin/nodetool enablefullquerylog throws an error initially:
>
> {noformat}
> [jshook@cass4 bin]$ ./nodetool enablefullquerylog
> error: sun.nio.ch.FileChannelImpl.map0(int,long,long)
> -- StackTrace --
> java.lang.NoSuchMethodException: sun.nio.ch.FileChannelImpl.map0(int,long,long)
> at java.base/java.lang.Class.getDeclaredMethod(Class.java:2553)
> at net.openhft.chronicle.core.OS.lambda$static$0(OS.java:51)
> at net.openhft.chronicle.core.ClassLocal.computeValue(ClassLocal.java:53)
> {noformat}
>
> (full stack trace attached to this ticket)
>
> Subsequent calls produce normal output:
>
> {noformat}
> [jshook@cass4 c4b1]$ bin/nodetool enablefullquerylog
> nodetool: Already logging to /home/jshook/c4b1/data/fullquerylogs
> See 'nodetool help' or 'nodetool help <command>'.
> {noformat}
>
> nodetool is missing getfullquerylog, which makes it difficult to verify the
> current fullquerylog state without changing it. The conventions for nodetool
> commands should be followed to avoid confusing users.
>
> (maybe)
> {noformat}
> tools/bin/fqltool help
> {noformat}
> should print out help for all fqltool commands rather than simply repeating
> the default "The most commonly used fqltool commands are..."
>
> [https://cassandra.apache.org/doc/latest/new/fqllogging.html] is
> malformatted, mixing the appearance of configuration with comments, which is
> confusing at best.
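The byte values in the sample cassandra.yaml can be sanity-checked against their inline comments, since mismatched size comments are a common source of confusion in hand-edited configs:

```python
# Verify that the raw byte values in the sample full_query_logging_options
# match the MiB/GiB figures given in their comments.
MiB = 1024 ** 2
GiB = 1024 ** 3

max_queue_weight = 268435456    # commented as 256 MiB
max_log_size = 17179869184      # commented as 16 GiB

assert max_queue_weight == 256 * MiB
assert max_log_size == 16 * GiB
print("size comments are consistent with the byte values")
```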
[jira] [Commented] (CASSANDRA-15971) full query log needs improvement
[ https://issues.apache.org/jira/browse/CASSANDRA-15971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17163718#comment-17163718 ]

Jonathan Shook commented on CASSANDRA-15971:
--------------------------------------------

My issue is with the actual fql logging on the server not writing logs. I'll try to look into it today.
[jira] [Commented] (CASSANDRA-15971) full query log needs improvement
[ https://issues.apache.org/jira/browse/CASSANDRA-15971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17163072#comment-17163072 ]

Jonathan Shook commented on CASSANDRA-15971:
--------------------------------------------

This was with Java 11.
[jira] [Updated] (CASSANDRA-15971) full query log needs improvement
[ https://issues.apache.org/jira/browse/CASSANDRA-15971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Shook updated CASSANDRA-15971:
---------------------------------------
Description:

When trying out full query logging as a possible integration for nosqlbench usage, I ran across many issues which would make it painful for users. Since there were several, they are collected in a single issue for now. This issue can be broken up if needed.

FQL doesn't work on my system, even though it says it is logging queries. With the following configuration in cassandra.yaml:

{noformat}
full_query_logging_options:
    log_dir: /REDACTED/fullquerylogs
    roll_cycle: HOURLY
    block: true
    max_queue_weight: 268435456 # 256 MiB
    max_log_size: 17179869184 # 16 GiB
    ## archive command is "/path/to/script.sh %path" where %path is replaced
    ## with the file being rolled:
    # archive_command:
    # max_archive_retries: 10
{noformat}

which appears to be the minimal configuration needed to enable fql, only a single file `directory-listing.cq4t` is created, which is a 64K file of zeroes.

Calling bin/nodetool enablefullquerylog throws an error initially:

{noformat}
[jshook@cass4 bin]$ ./nodetool enablefullquerylog
error: sun.nio.ch.FileChannelImpl.map0(int,long,long)
-- StackTrace --
java.lang.NoSuchMethodException: sun.nio.ch.FileChannelImpl.map0(int,long,long)
at java.base/java.lang.Class.getDeclaredMethod(Class.java:2553)
at net.openhft.chronicle.core.OS.lambda$static$0(OS.java:51)
at net.openhft.chronicle.core.ClassLocal.computeValue(ClassLocal.java:53)
{noformat}

(full stack trace attached to this ticket)

Subsequent calls produce normal output:

{noformat}
[jshook@cass4 c4b1]$ bin/nodetool enablefullquerylog
nodetool: Already logging to /home/jshook/c4b1/data/fullquerylogs
See 'nodetool help' or 'nodetool help <command>'.
{noformat}

nodetool is missing getfullquerylog, which makes it difficult to verify the current fullquerylog state without changing it. The conventions for nodetool commands should be followed to avoid confusing users.

(maybe)

{noformat}
tools/bin/fqltool help
{noformat}

should print out help for all fqltool commands rather than simply repeating the default "The most commonly used fqltool commands are..."

[https://cassandra.apache.org/doc/latest/new/fqllogging.html] is malformatted, mixing the appearance of configuration with comments, which is confusing at best.
[jira] [Updated] (CASSANDRA-15971) full query log needs improvement
[ https://issues.apache.org/jira/browse/CASSANDRA-15971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Shook updated CASSANDRA-15971:
---------------------------------------
[jira] [Created] (CASSANDRA-15971) full query log needs improvement
Jonathan Shook created CASSANDRA-15971:
---------------------------------------

             Summary: full query log needs improvement
                 Key: CASSANDRA-15971
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-15971
             Project: Cassandra
          Issue Type: Improvement
          Components: Tool/fql
            Reporter: Jonathan Shook
         Attachments: st1.txt
[jira] [Updated] (CASSANDRA-12268) Make MV Index creation robust for wide referent rows
[ https://issues.apache.org/jira/browse/CASSANDRA-12268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Shook updated CASSANDRA-12268:
---------------------------------------
Description:

When creating an index for a materialized view over extant data, heap pressure depends heavily on the cardinality of rows associated with each index value. With the way that per-index-value rows are created within the index, this can cause unbounded heap pressure, which can cause OOM. This appears to be a side effect of how each index row is applied atomically, as with batches.

The commit logs can accumulate enough during the process to prevent the node from being restarted. Given that this occurs during global index creation, it can happen on multiple nodes, making stable recovery of a node set difficult, as co-replicas become unavailable to assist in back-filling data from commitlogs.

While it is understandable that you want to avoid relatively wide rows even in materialized views, this represents a particularly difficult scenario for triage.

The basic recommendation for improving this is to sub-group the index creation into smaller chunks internally, providing a maximal bound on heap pressure when it is needed.


> Make MV Index creation robust for wide referent rows
> ----------------------------------------------------
>
>                 Key: CASSANDRA-12268
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-12268
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Jonathan Shook


--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
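The sub-grouping recommendation can be sketched generically: rather than applying all rows for one index value in a single atomic batch (whose memory footprint grows with row cardinality), apply them in fixed-size chunks so peak memory is bounded by the chunk size. This is a sketch of the general technique only; `apply_batch` is a stand-in, not a real Cassandra API.

```python
# Generic sketch of chunked (sub-grouped) batch application. Peak memory
# per apply_batch call is bounded by chunk_size instead of by the total
# number of rows behind one index value.

def chunked(rows, chunk_size):
    """Yield successive chunks of at most chunk_size rows."""
    for i in range(0, len(rows), chunk_size):
        yield rows[i:i + chunk_size]

def build_index(rows, chunk_size, apply_batch):
    """Apply rows in bounded chunks; apply_batch is a hypothetical sink."""
    for chunk in chunked(rows, chunk_size):
        apply_batch(chunk)

applied = []
build_index(list(range(10)), 4, applied.append)
print([len(c) for c in applied])  # [4, 4, 2]
```

The trade-off is the usual one: chunking gives up single-batch atomicity per index value in exchange for a hard bound on heap pressure, which is exactly what the ticket argues is needed during global index creation.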
[jira] [Created] (CASSANDRA-12268) Make MV Index creation robust for wide referent rows
Jonathan Shook created CASSANDRA-12268: -- Summary: Make MV Index creation robust for wide referent rows Key: CASSANDRA-12268 URL: https://issues.apache.org/jira/browse/CASSANDRA-12268 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Jonathan Shook When creating an index for a materialized view for extant data, heap pressure is very dependent on the cardinality of rows associated with each index value. With the way that per-index value rows are created within the index, this can cause unbounded heap pressure, which can cause OOM. This appears to be a side-effect of how each index row is applied atomically, as with batches. The commit logs can accumulate enough during the process to prevent the node from being restarted. Given that this occurs during global index creation, this can happen on multiple nodes, making stable recovery of a node set difficult, as co-replicas become unavailable to assist in back-filling data from commitlogs. While it is understandable that you want to avoid having relatively wide rows even in materialized views, this represents a particularly difficult scenario for triage. The basic recommendation for improving this is to sub-group the index creation into smaller chunks internally, providing a maximal bound against the heap pressure when it is needed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
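The "sub-group the index creation into smaller chunks" recommendation above can be sketched as follows. This is a minimal illustration, not Cassandra's actual index-build code; `build_index_chunked`, `apply_batch`, and the chunk size are hypothetical names chosen for the example.

```python
# Hedged sketch: instead of applying all index rows for one index value as a
# single atomic batch (unbounded heap), apply them in fixed-size chunks so
# the in-flight batch size, and thus heap pressure, has a maximal bound.

def build_index_chunked(rows, apply_batch, chunk_size=1000):
    """Apply index mutations in bounded chunks; returns the batch count."""
    chunk = []
    batches = 0
    for row in rows:
        chunk.append(row)
        if len(chunk) >= chunk_size:
            apply_batch(chunk)  # each batch holds at most chunk_size rows
            batches += 1
            chunk = []
    if chunk:  # flush the final partial chunk
        apply_batch(chunk)
        batches += 1
    return batches
```

The trade-off is that a chunked build is no longer atomic per index value, but during a long-running backfill of extant data that atomicity is what drives the heap growth described above.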
[jira] [Created] (CASSANDRA-11753) cqlsh show sessions truncates time_elapsed values > 999999
Jonathan Shook created CASSANDRA-11753: -- Summary: cqlsh show sessions truncates time_elapsed values > 999999 Key: CASSANDRA-11753 URL: https://issues.apache.org/jira/browse/CASSANDRA-11753 Project: Cassandra Issue Type: Bug Components: CQL, Observability, Testing, Tools Reporter: Jonathan Shook Output from show sessions in cqlsh: {quote} Submit hint for /10.255.227.20 [EXPIRING-MAP-REAPER:1] | 2016-05-11 15:57:53.73 | 10.255.226.163 | 283246 {quote} Output from select * from trace_events where session_id=(same as above): {quote} 1bbce5c0-1791-11e6-9598-3b9ec975a2e6 | 1ee37a20-1791-11e6-9598-3b9ec975a2e6 | Submit hint for /10.255.227.20 | 10.255.226.163 | 5283246 | EXPIRING-MAP-REAPER:1 {quote} Notice that the leading 5 (the seconds digit) is being truncated in the cqlsh output. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
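The symptom above (5283246 microseconds displayed as 283246) is consistent with a display column that keeps only the low-order six digits of the elapsed value. This is an assumption about the cause, not a reading of the cqlsh source; a minimal reproduction of the symptom:

```python
def show_elapsed(micros, width=6):
    # Hypothetical illustration: a column that keeps only the last `width`
    # digits silently drops anything above 999999 microseconds.
    return str(micros)[-width:]

# 283246 fits in six digits and displays intact;
# 5283246 loses its leading "5", i.e. five whole seconds of elapsed time.
```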
[jira] [Updated] (CASSANDRA-11688) Replace_address should sanity check prior node state before migrating tokens
[ https://issues.apache.org/jira/browse/CASSANDRA-11688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Shook updated CASSANDRA-11688: --- Description: During a node replacement, a replace_address was used which was associated with a different node than the intended one. The result was that both nodes remained active after the node came up. This caused several other issues which were difficult to diagnose, including invalid gossip state, etc. Replace_address should be more robust in this scenario. It would be much more user friendly if the replace_address logic would first do some basic sanity checks, possibly to include: - Pinging the other node to see if it is indeed “down”, if the address is different than all local interface addresses - Checking gossip state of the node to verify that it is not known to peers. It may even be safest to require that both address reachability and gossip state are required to show the replace_address as down by default before allowing any token migration or other replace_address actions to occur. In the case that the replace_address is not ready to be replaced, the log should indicate that you are trying to replace an active node, and cassandra should refuse to start. was: During a node replacement, a customer used an ip address associated with a different node than the intended one. The result was that both nodes remained active after the node came up. This caused several other issues which were difficult to diagnose, including invalid gossip state, etc. Replace_address should be more robust in this scenario. It would be much more user friendly if the replace_address logic would first do some basic sanity checks, possibly to include: - Pinging the other node to see if it is indeed “down”, if the address is different than all local interface addresses - Checking gossip state of the node to verify that it is not known to peers. 
It may even be safest to require that both address reachability and gossip state are required to show the replace_address as down by default before allowing any token migration or other replace_address actions to occur. In the case that the replace_address is not ready to be replaced, the log should indicate that you are trying to replace an active node, and cassandra should refuse to start. > Replace_address should sanity check prior node state before migrating tokens > > > Key: CASSANDRA-11688 > URL: https://issues.apache.org/jira/browse/CASSANDRA-11688 > Project: Cassandra > Issue Type: Improvement >Reporter: Jonathan Shook > > During a node replacement, a replace_address was used which was associated > with a different node than the intended one. The result was that both nodes > remained active after the node came up. This caused several other issues > which were difficult to diagnose, including invalid gossip state, etc. > Replace_address should be more robust in this scenario. It would be much more > user friendly if the replace_address logic would first do some basic sanity > checks, possibly to include: > - Pinging the other node to see if it is indeed “down”, if the address is > different than all local interface addresses > - Checking gossip state of the node to verify that it is not known to peers. > It may even be safest to require that both address reachability and gossip > state are required to show the replace_address as down by default before > allowing any token migration or other replace_address actions to occur. > In the case that the replace_address is not ready to be replaced, the log > should indicate that you are trying to replace an active node, and cassandra > should refuse to start. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (CASSANDRA-11688) Replace_address should sanity check prior node state before migrating tokens
Jonathan Shook created CASSANDRA-11688: -- Summary: Replace_address should sanity check prior node state before migrating tokens Key: CASSANDRA-11688 URL: https://issues.apache.org/jira/browse/CASSANDRA-11688 Project: Cassandra Issue Type: Improvement Reporter: Jonathan Shook During a node replacement, a customer used an ip address associated with a different node than the intended one. The result was that both nodes remained active after the node came up. This caused several other issues which were difficult to diagnose, including invalid gossip state, etc. Replace_address should be more robust in this scenario. It would be much more user friendly if the replace_address logic would first do some basic sanity checks, possibly to include: - Pinging the other node to see if it is indeed “down”, if the address is different than all local interface addresses - Checking gossip state of the node to verify that it is not known to peers. It may even be safest to require that both address reachability and gossip state are required to show the replace_address as down by default before allowing any token migration or other replace_address actions to occur. In the case that the replace_address is not ready to be replaced, the log should indicate that you are trying to replace an active node, and cassandra should refuse to start. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
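The proposed sanity checks (reachability plus gossip state, both required to show the address as down) can be sketched as below. `ping_host` and `gossip_knows_peer` are hypothetical stand-ins for real probes; this is an illustration of the decision logic, not Cassandra's startup code.

```python
# Hedged sketch of the proposed replace_address pre-flight checks:
# refuse token migration unless the target address looks genuinely down
# by BOTH reachability and gossip-state criteria.

def safe_to_replace(replace_address, local_addresses, ping_host, gossip_knows_peer):
    """Return (ok, message); ok is False if the node appears active."""
    if replace_address not in local_addresses and ping_host(replace_address):
        return False, "refusing to start: %s is still reachable" % replace_address
    if gossip_knows_peer(replace_address):
        return False, "refusing to start: %s is still known to gossip peers" % replace_address
    return True, "ok to replace %s" % replace_address
```

In the failure cases the caller would log that an active node is being replaced and refuse to start, as the ticket describes.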
[jira] [Commented] (CASSANDRA-9666) Provide an alternative to DTCS
[ https://issues.apache.org/jira/browse/CASSANDRA-9666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15216688#comment-15216688 ] Jonathan Shook commented on CASSANDRA-9666: --- There are two areas of concern that we should discuss more directly: 1. The pacing of memtable flushing on a given system can be matched up with the base window size with DTCS, avoiding logical write amplification that can occur before the scheduling discipline kicks in. This is not so easy when you water down the configuration and remove the ability to manage the fresh sstables. The benefits from time-series friendly compaction can be had for both the newest and the oldest tables, and both are relevant here. 2. The window placement. From what I've seen, the anchoring point for whether a cell goes into a bucket or not is different between the two approaches. To me this is fairly arbitrary in terms of processing overhead comparisons, all else assumed close enough. However, when trying to reconcile, shifting all of your data to a different bucket will not be a welcome event for most users. This makes "graceful" reconciliation difficult at best. Can we simply try to make DTCS as easy to use (perceptually) for the default case as TWCS? To me, this is more about the user entry point and understanding behavior as designed than it is about the machinery that makes it happen. The basic design between them has so much in common that reconciling them completely would be mostly a shell game of parameter names as well as lopping off some functionality that can be completely bypassed, given the right settings. Can we identify the functionally equivalent settings for TWCS that DTCS needs to emulate, given proper settings (possibly including anchoring point), and then simply provide the same simple configuration to users, without having to maintain two separate sibling compaction strategies? 
One sticking point that I've had on this suggestion in conversation is the bucketing logic being too difficult to think about. If we were able to provide the self-same behavior for TWCS-like configuration, the bucketing logic could be used only when the parameters require non-uniform windows. Would that make everyone happy? > Provide an alternative to DTCS > -- > > Key: CASSANDRA-9666 > URL: https://issues.apache.org/jira/browse/CASSANDRA-9666 > Project: Cassandra > Issue Type: Improvement >Reporter: Jeff Jirsa >Assignee: Jeff Jirsa > Fix For: 2.1.x, 2.2.x > > Attachments: dtcs-twcs-io.png, dtcs-twcs-load.png > > > DTCS is great for time series data, but it comes with caveats that make it > difficult to use in production (typical operator behaviors such as bootstrap, > removenode, and repair have MAJOR caveats as they relate to > max_sstable_age_days, and hints/read repair break the selection algorithm). > I'm proposing an alternative, TimeWindowCompactionStrategy, that sacrifices > the tiered nature of DTCS in order to address some of DTCS' operational > shortcomings. I believe it is necessary to propose an alternative rather than > simply adjusting DTCS, because it fundamentally removes the tiered nature in > order to remove the parameter max_sstable_age_days - the result is very very > different, even if it is heavily inspired by DTCS. > Specifically, rather than creating a number of windows of ever increasing > sizes, this strategy allows an operator to choose the window size, compact > with STCS within the first window of that size, and aggressively compact down > to a single sstable once that window is no longer current. The window size is > a combination of unit (minutes, hours, days) and size (1, etc), such that an > operator can expect all data using a block of that size to be compacted > together (that is, if your unit is hours, and size is 6, you will create > roughly 4 sstables per day, each one containing roughly 6 hours of data). 
> The result addresses a number of the problems with > DateTieredCompactionStrategy: > - At the present time, DTCS’s first window is compacted using an unusual > selection criteria, which prefers files with earlier timestamps, but ignores > sizes. In TimeWindowCompactionStrategy, the first window data will be > compacted with the well tested, fast, reliable STCS. All STCS options can be > passed to TimeWindowCompactionStrategy to configure the first window’s > compaction behavior. > - HintedHandoff may put old data in new sstables, but it will have little > impact other than slightly reduced efficiency (sstables will cover a wider > range, but the old timestamps will not impact sstable selection criteria > during compaction) > - ReadRepair may put old data in new sstables, but it will have little impact > other than slightly reduced efficiency
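The uniform window arithmetic described above (unit times size blocks, with the anchoring point deciding which bucket a timestamp lands in) can be sketched as follows. This is an illustrative model of epoch-anchored bucketing, not the actual TWCS or DTCS implementation.

```python
def twcs_window(timestamp_s, unit_s=3600, size=6):
    """Map a timestamp (seconds) to the start of its uniform window.

    With unit=hours and size=6, all data within the same 6-hour block
    falls into one window, i.e. roughly 4 windows (sstables) per day as
    described above. The anchoring point here is the epoch; a different
    anchor shifts every bucket boundary, which is exactly why changing
    it would move existing data to different buckets.
    """
    window_s = unit_s * size
    return (timestamp_s // window_s) * window_s
```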
[jira] [Comment Edited] (CASSANDRA-9666) Provide an alternative to DTCS
[ https://issues.apache.org/jira/browse/CASSANDRA-9666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15216688#comment-15216688 ] Jonathan Shook edited comment on CASSANDRA-9666 at 3/29/16 7:21 PM: There are two areas of concern that we should discuss more directly.. 1. The pacing of memtable flushing on a given system can be matched up with the base window size with DTCS, avoiding logical write amplification that can occur before the scheduling discipline kicks in. This is not so easy when you water down the configuration and remove the ability to manage the fresh sstables. The benefits from time-series friendly compaction can be had for both the newest and the oldest tables, and both are relevant here. 2. The window placement. From what I've seen, the anchoring point for whether a cell goes into a bucket or not is different between the two approaches. To me this is fairly arbitrary in terms of processing overhead comparisons, all else assumed close enough. However, when trying to reconcile, shifting all of your data to a different bucket will not be a welcome event for most users. This makes "graceful" reconciliation difficult at best. Can we simply try to make DTCS as (perceptually) easy to use for the default case as TWCS (perceptually) ? To me, this is more about the user entry point and understanding behavior as designed than it is about the machinery that makes it happen. The basic design between them has so much in common that reconciling them completely would be mostly a shell game of parameter names as well as lobbing off some functionality that can be completely bypassed, given the right settings. Can we identify the functionally equivalent settings for TWCS that DTCS needs to emulate, given proper settings (possibly including anchoring point), and then simply provide the same simple configuration to users, without having to maintain two separate sibling compaction strategies? 
One sticking point that I've had on this suggestion in conversation is the bucketing logic being too difficult to think about. If we were able to provide the self-same behavior for TWCS-like configuration, the bucketing logic could be used only when the parameters require non-uniform windows. Would that make everyone happy? was (Author: jshook): There are two areas of concern that we should discuss more directly.. 1. The pacing of memtable flushing on a given system can be matched up with the base window size with DTCS, avoiding logical write amplification that can occur before the scheduling discipline kicks in. This is not so easy when you water down the configuration and remove the ability to manage the fresh sstables. The benefits from time-series friendly compaction can be had for both the newest and the oldest tables, and both are relevant here. 2. The window placement. From what I've seen, the anchoring point for whether a cell goes into a bucket or not is different between the two approaches. To me this is fairly arbitrary in terms of processing overhead comparisons, all else assumed close enough. However, when trying to reconcile, shifting all of your data to a different bucket will not be a welcome event for most users. This makes "graceful" reconciliation difficult at best. Can we simply try to make DTCS as (perceptually) easy to use for the default case as TWCS (perceptually) ? To me, this is more about the user entry point and understanding behavior as designed than it is about the machinery that makes it happen. The basic design between them has so much in common that reconciling them completely would be mostly a shell game of parameter names as well as lobbing off some functionality that can be complete bypassed, given the right settings. 
Can we identify the functionally equivalent settings for TWCS that DTCS needs to emulate, given proper settings (possibly including anchoring point), and then simply provide the same simple configuration to users, without having to maintain two separate sibling compaction strategies? One sticking point that I've had on this suggesting in conversation is the bucketing logic being too difficult to think about. If we were able to provide the self-same behavior for TWCS-like configuration, the bucketing logic could be used only when the parameters require non-uniform windows. Would that make everyone happy? > Provide an alternative to DTCS > -- > > Key: CASSANDRA-9666 > URL: https://issues.apache.org/jira/browse/CASSANDRA-9666 > Project: Cassandra > Issue Type: Improvement >Reporter: Jeff Jirsa >Assignee: Jeff Jirsa > Fix For: 2.1.x, 2.2.x > > Attachments: dtcs-twcs-io.png, dtcs-twcs-load.png > > > DTCS is great for time series data, but it comes with caveats that make it >
[jira] [Updated] (CASSANDRA-11408) simple compaction defaults for common scenarios
[ https://issues.apache.org/jira/browse/CASSANDRA-11408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Shook updated CASSANDRA-11408: --- Description: As compaction strategies get more flexible over time, some users might prefer to have a simple named profile for their settings. {code:title=example, syntax variant|borderStyle=solid} alter table foo.bar with compaction = 'timeseries-hourly-for-a-week'; {code} {code:title=example, syntax variant |borderStyle=solid} alter table foo.bar with compaction = { 'profile' : 'key-value-balanced-ops' }; {code} These would simply be a map into sets of well-tested and documented defaults across any of the core compaction strategies. This would simplify setting up compaction for well-understood workloads, but still allow for customization where desired. was: As compaction strategies get more flexible over time, some users might prefer to have a simple named profile for their settings. For example, alter table foo.bar with compaction = 'timeseries-hourly-for-a-week'; or, with slightly different syntax: alter table foo.bar with compaction = { 'profile' : 'key-value-balanced-ops' }; These would simply be a map into sets of well-tested and documented defaults across any of the core compaction strategies. This would simplify setting up compaction for well-understood workloads, but still allow for customization where desired. > simple compaction defaults for common scenarios > --- > > Key: CASSANDRA-11408 > URL: https://issues.apache.org/jira/browse/CASSANDRA-11408 > Project: Cassandra > Issue Type: Improvement > Components: Core >Reporter: Jonathan Shook > > As compaction strategies get more flexible over time, some users might prefer > to have a simple named profile for their settings. 
> {code:title=example, syntax variant|borderStyle=solid} > alter table foo.bar with compaction = 'timeseries-hourly-for-a-week'; > {code} > {code:title=example, syntax variant |borderStyle=solid} > alter table foo.bar with compaction = { 'profile' : 'key-value-balanced-ops' > }; > {code} > These would simply be a map into sets of well-tested and documented defaults > across any of the core compaction strategies. > This would simplify setting up compaction for well-understood workloads, but > still allow for customization where desired. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (CASSANDRA-11408) simple compaction defaults for common scenarios
Jonathan Shook created CASSANDRA-11408: -- Summary: simple compaction defaults for common scenarios Key: CASSANDRA-11408 URL: https://issues.apache.org/jira/browse/CASSANDRA-11408 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Jonathan Shook As compaction strategies get more flexible over time, some users might prefer to have a simple named profile for their settings. For example, alter table foo.bar with compaction = 'timeseries-hourly-for-a-week'; or, with slightly different syntax: alter table foo.bar with compaction = { 'profile' : 'key-value-balanced-ops' }; These would simply be a map into sets of well-tested and documented defaults across any of the core compaction strategies. This would simplify setting up compaction for well-understood workloads, but still allow for customization where desired. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
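The "map into sets of well-tested and documented defaults" idea above can be sketched as a simple lookup. The profile names come from the ticket's examples, but the expanded option sets shown here are purely hypothetical, not vetted defaults.

```python
# Hypothetical profile table: names are from the examples above; the
# expansions are illustrative placeholders only.
COMPACTION_PROFILES = {
    "timeseries-hourly-for-a-week": {
        "class": "TimeWindowCompactionStrategy",
        "compaction_window_unit": "HOURS",
        "compaction_window_size": "1",
    },
    "key-value-balanced-ops": {
        "class": "SizeTieredCompactionStrategy",
    },
}

def resolve_compaction(options):
    """Expand {'profile': name} into concrete compaction options."""
    profile = options.get("profile")
    if profile is None:
        return dict(options)  # already explicit settings, pass through
    expanded = dict(COMPACTION_PROFILES[profile])
    # Explicit options still override the profile, preserving the
    # "customization where desired" behavior described above.
    expanded.update({k: v for k, v in options.items() if k != "profile"})
    return expanded
```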
[jira] [Comment Edited] (CASSANDRA-10425) Autoselect GC settings depending on system memory
[ https://issues.apache.org/jira/browse/CASSANDRA-10425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15095515#comment-15095515 ] Jonathan Shook edited comment on CASSANDRA-10425 at 1/13/16 3:05 AM: - I think we should try to come up with a way of handling settings which one would choose differently for a new install. Settings like this will live forever without a better approach. I agree entirely with the principle of least surprise. However, according to this default, there will be new systems deployed in 2020 with CMS. There has to be a better way. If we were able to have an install mode which would honor previous settings or take new defaults that are more desirable for current code and systems, perhaps we can avoid the CMS in 2020 problem. Installers may require a user to specify a mode in order to make this truly unsurprising. If I were installing a new cluster in 2020, I would be quite surprised to find it running CMS. Also, the point of having the settings be size-specific is to avoid surprising performance deficiencies. This is the kind of change that I would expect to go with a major version upgrade. So, to follow the principle of least surprise, perhaps we need to consider making this possible for those who expect to be able to use more than 32GB with G1 to address GC bandwidth and pause issues for heavy workloads, as we've come to expect through field experience. Otherwise, we'll be manually rewiring this from now on for all but historic pizza-boxen. was (Author: jshook): I think we should try to come up with a way of handling settings which one would choose differently for a new install. Settings like this will live forever without a better approach. I agree entirely with the principle of least surprise. However, according to this default, there will be new systems deployed in 2020 with CMS. There has to be a better way. 
If we were able to have an install mode which would honor previous settings or take new defaults that are more desirable for current code and systems, perhaps we can avoid the CMS in 2020 problem. Installers may require a user to specify a mode in order to make this truly unsurprising. If I were installing a new cluster in 2020, I would be quite surprised to find it running CMS. Also, the point of having the settings be size-specific is to avoid surprising performance deficiencies. This is the kind of change that I would expect to go with a major version. So, to follow the principle of least surprise, perhaps we need to consider making this possible for those who expect to be able to use more than 32GB with G1 to address GC bandwidth and pause issues for heavy workloads, as we've come to expect through field experience. Otherwise, we'll be manually rewiring this from now on for all but historic pizza-boxen. > Autoselect GC settings depending on system memory > - > > Key: CASSANDRA-10425 > URL: https://issues.apache.org/jira/browse/CASSANDRA-10425 > Project: Cassandra > Issue Type: Improvement > Components: Configuration >Reporter: Jonathan Shook > > 1) Make GC modular within cassandra-env > 2) For systems with 32GB or less of ram, use the classic CMS with the > established default settings. > 3) For systems with 48GB or more of ram, use 1/2 or up to 32GB of heap with > G1, whichever is lower. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
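The three-point selection quoted above (CMS with established defaults at 32GB or less; for 48GB or more, G1 with the lower of half of RAM or 32GB) can be sketched as below. The ticket does not specify behavior in the 32-48GB gap, so CMS is assumed there for this illustration.

```python
def select_gc_settings(ram_gb):
    """Sketch of the proposed size-based GC selection (thresholds from the ticket).

    Returns (collector, heap_gb); heap_gb is None where the established
    CMS default sizing would apply. The 32-48GB gap is not covered by the
    ticket text; CMS is assumed here.
    """
    if ram_gb >= 48:
        return "G1", min(ram_gb // 2, 32)  # 1/2 of RAM, capped at 32GB
    return "CMS", None  # classic CMS with established default settings
```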
[jira] [Created] (CASSANDRA-11006) Allow upgrades and installs to take modern defaults
Jonathan Shook created CASSANDRA-11006: -- Summary: Allow upgrades and installs to take modern defaults Key: CASSANDRA-11006 URL: https://issues.apache.org/jira/browse/CASSANDRA-11006 Project: Cassandra Issue Type: Improvement Components: Configuration, Lifecycle, Packaging, Tools Reporter: Jonathan Shook See CASSANDRA-10425 for background. We simply need to provide a way to install or upgrade C* on a system with modern settings. Keeping the previous defaults has been the standard rule of thumb to avoid surprises. This is a reasonable approach, but we haven't yet provided an alternative for full upgrades with new defaults nor for more appropriate installs of new systems. The number of previous defaults which may need to be modified for a saner deployment has become a form of technical baggage. Often, users will have to micro-manage basic settings to more reasonable defaults for every single deployment, upgrade or not. This is surprising. For newer settings that would be more appropriate, we could force the user to make a choice. If you are installing a new cluster or node, you may want the modern defaults. If you are upgrading an existing node, you may still want the modern defaults. If you are upgrading an existing node and have some very carefully selected tunings for your hardware, then you may want to keep them. Even then, they may be worse than the modern defaults, given version changes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (CASSANDRA-11006) Allow upgrades and installs to take modern defaults
[ https://issues.apache.org/jira/browse/CASSANDRA-11006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15095532#comment-15095532 ] Jonathan Shook edited comment on CASSANDRA-11006 at 1/13/16 3:34 AM: - The difference in the original ticket CASSANDRA-10425 was not that we were opting into auto-tuning. The difference was simply that we could take into consideration more contemporary hardware that is being deployed, including the trending size of RAM. I would generally expect that auto-tuning settings like this could be adapted for major versions, and added to the release notes like other potentially surprising, yet generally useful changes. If this is not the case for GC settings, then how do we allow for the change for CMS to G1 as average RAM sizing continues to change? was (Author: jshook): The difference in the original ticket CASSANDRA-10425 was not that we were opting into auto-tuning. The difference was simply that we could take account of more contemporary hardware that is being deployed presently, including the trending size of RAM. I would generally expect that auto-tuning settings like this could be adapted for major versions, and added to the release notes like other potentially surprising, yet generally useful changes. If this is not the case for GC settings, then how do we allow for the change for CMS to G1 as average RAM sizing continues to change? > Allow upgrades and installs to take modern defaults > --- > > Key: CASSANDRA-11006 > URL: https://issues.apache.org/jira/browse/CASSANDRA-11006 > Project: Cassandra > Issue Type: Improvement > Components: Configuration, Lifecycle, Packaging, Tools >Reporter: Jonathan Shook > > See CASSANDRA-10425 for background. > We simply need to provide a way to install or upgrade C* on a system with > modern settings. Keeping the previous defaults has been the standard rule of > thumb to avoid surprises. 
This is a reasonable approach, but we haven't yet > provided an alternative for full upgrades with new default nor for more > appropriate installs of new systems. The number of previous defaults which > may need to be modified for a saner deployment has become a form of technical > baggage. Often, users will have to micro-manage basic settings to more > reasonable defaults for every single deployment, upgrade or not. This is > surprising. > For newer settings that would be more appropriate, we could force the user to > make a choice. If you are installing a new cluster or node, you may want the > modern defaults. If you are upgrading an existing node, you may still want > the modern defaults. If you are upgrading an existing node and have some very > carefully selected tunings for your hardware, then you may want to keep them. > Even then, they may be worse than the modern defaults, given version changes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-10425) Autoselect GC settings depending on system memory
[ https://issues.apache.org/jira/browse/CASSANDRA-10425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15095515#comment-15095515 ] Jonathan Shook commented on CASSANDRA-10425: I think we should try to come up with a way of handling settings which one would choose differently for a new install. Settings like this will live forever without a better approach. I agree entirely with the principle of least surprise. However, according to this default, there will be new systems deployed in 2020 with CMS. There has to be a better way. If we were able to have an install mode which would honor previous settings or take new defaults that are more desirable for current code and systems, perhaps we can avoid the CMS in 2020 problem. Installers may require a user to specify a mode in order to make this truly unsurprising. If I were installing a new cluster in 2020, I would be quite surprised to find it running CMS. Also, the point of having the settings be size-specific is to avoid surprising performance deficiencies. This is the kind of change that I would expect to go with a major version. So, to follow the principle of least surprise, perhaps we need to consider making this possible for those who expect to be able to use more than 32GB with G1 to address GC bandwidth and pause issues for heavy workloads, as we've come to expect through field experience. Otherwise, we'll be manually rewiring this from now on for all but historic pizza-boxen. > Autoselect GC settings depending on system memory > - > > Key: CASSANDRA-10425 > URL: https://issues.apache.org/jira/browse/CASSANDRA-10425 > Project: Cassandra > Issue Type: Improvement > Components: Configuration >Reporter: Jonathan Shook > > 1) Make GC modular within cassandra-env > 2) For systems with 32GB or less of ram, use the classic CMS with the > established default settings. > 3) For systems with 48GB or more of ram, use 1/2 or up to 32GB of heap with > G1, whichever is lower. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-11006) Allow upgrades and installs to take modern defaults
[ https://issues.apache.org/jira/browse/CASSANDRA-11006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15095532#comment-15095532 ] Jonathan Shook commented on CASSANDRA-11006: The difference in the original ticket CASSANDRA-10425 was not that we were opting into auto-tuning. The difference was simply that we could take account of more contemporary hardware that is being deployed presently, including the trending size of RAM. I would generally expect that auto-tuning settings like this could be adapted for major versions, and added to the release notes like other potentially surprising, yet generally useful changes. If this is not the case for GC settings, then how do we allow for the change for CMS to G1 as average RAM sizing continues to change? > Allow upgrades and installs to take modern defaults > --- > > Key: CASSANDRA-11006 > URL: https://issues.apache.org/jira/browse/CASSANDRA-11006 > Project: Cassandra > Issue Type: Improvement > Components: Configuration, Lifecycle, Packaging, Tools >Reporter: Jonathan Shook > > See CASSANDRA-10425 for background. > We simply need to provide a way to install or upgrade C* on a system with > modern settings. Keeping the previous defaults has been the standard rule of > thumb to avoid surprises. This is a reasonable approach, but we haven't yet > provided an alternative for full upgrades with new default nor for more > appropriate installs of new systems. The number of previous defaults which > may need to be modified for a saner deployment has become a form of technical > baggage. Often, users will have to micro-manage basic settings to more > reasonable defaults for every single deployment, upgrade or not. This is > surprising. > For newer settings that would be more appropriate, we could force the user to > make a choice. If you are installing a new cluster or node, you may want the > modern defaults. If you are upgrading an existing node, you may still want > the modern defaults. 
If you are upgrading an existing node and have some very > carefully selected tunings for your hardware, then you may want to keep them. > Even then, they may be worse than the modern defaults, given version changes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-10425) Autoselect GC settings depending on system memory
[ https://issues.apache.org/jira/browse/CASSANDRA-10425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15095534#comment-15095534 ] Jonathan Shook commented on CASSANDRA-10425: CASSANDRA-11006 was created to discuss possible ways of handling this. > Autoselect GC settings depending on system memory > - > > Key: CASSANDRA-10425 > URL: https://issues.apache.org/jira/browse/CASSANDRA-10425 > Project: Cassandra > Issue Type: Improvement > Components: Configuration >Reporter: Jonathan Shook > > 1) Make GC modular within cassandra-env > 2) For systems with 32GB or less of ram, use the classic CMS with the > established default settings. > 3) For systems with 48GB or more of ram, use 1/2 or up to 32GB of heap with > G1, whichever is lower. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
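The three numbered rules quoted above amount to a simple RAM-based selection. A minimal sketch of that logic, in Python rather than the cassandra-env shell script where it would actually live; the function name is illustrative, and treating the unspecified 32-48GB gap as CMS is an assumption, not something the ticket states:

```python
def select_gc_profile(system_ram_gb: float) -> dict:
    """Pick GC settings from total system RAM, per the ticket's rules.

    The 32GB/48GB thresholds and the 32GB heap cap come straight from
    the ticket text; everything else here is illustrative.
    """
    if system_ram_gb <= 32:
        # Rule 2: smaller systems keep the classic CMS defaults.
        return {"collector": "CMS", "heap_gb": None}  # None = established defaults
    if system_ram_gb >= 48:
        # Rule 3: G1 with half of RAM or 32 GB of heap, whichever is lower.
        return {"collector": "G1", "heap_gb": min(system_ram_gb / 2, 32)}
    # The 32-48 GB gap is not specified in the ticket; assume CMS here.
    return {"collector": "CMS", "heap_gb": None}
```

So a 48GB box would get G1 with a 24GB heap, while a 128GB box would hit the 32GB cap.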
[jira] [Commented] (CASSANDRA-10742) Real world DateTieredCompaction tests
[ https://issues.apache.org/jira/browse/CASSANDRA-10742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15024539#comment-15024539 ] Jonathan Shook commented on CASSANDRA-10742: [~krummas], Some notes on test setup, and some observations from data models we've seen. We can try to get some additional details from willing users if this doesn't get us close enough. The baseline test I use is high-ingest, read-most-recent, with some read-cold mixed-in. The idea is to simulate the typical access patterns of time-series telemetry with roll-up processing, with the occasional historic query or reprocessing of old data. I use 90/10/1 ratio for write/recent-read/cold-read as a starting point. I usually back off the ingest rate from a saturating load in order to find a stable steady-state reference point. This still is much higher load per-node than you would often have in a production scenario. It does provide for good contrast with trade-offs, like compaction load. Often, you will be accumulating data over a longer period of time, so ingest rates that approach the reasonable saturating load are closer to stress tests than real-world. As such, they are still good tests. If you can run a node at 10x to 1000x the data rates that you would expect in production, then 1) you can complete the test in a reasonable amount of time and 2) you're not too worried about the margin of error. The data model I use is essentially ((datasource, timebucket), parametername, timestamp) -> value, although future testing will likely drop the timebucket component, relying instead on the time-based layout of sstables as a simplification. (Still needs supporting data from tests). parametername is just a variable name that is associated with a type of measurement. This is selected from a fixed set, as is often the case in the wild. The value can vary in type and size according to the type of data logging. I use a range from 1k to 5k, depending on the type of test. 
In the simplest cases, a value is an int or float, but it can also be a log line from a stack trace. The model of writes/read-most-recent/read-cold can cover lots of ground in terms of time-series. The ratios can be varied. Also, the number of partitions per node in conjunction with the number of parameters should vary. In some cases in the wild, time-series partitions are single-series. In other cases, they can have hundreds of related series by name (by cluster). In some cases, the parameters associated with a data source are distributed by partition to support async loading the cluster for responsive reads of significant data. To cover this, simply move the parenthesis right by one term above. If you cover some of the permutations above for op ratios, clustering structure, grain of partition, and payload size, then you'll be covering lots of the space we see in practice. > Real world DateTieredCompaction tests > - > > Key: CASSANDRA-10742 > URL: https://issues.apache.org/jira/browse/CASSANDRA-10742 > Project: Cassandra > Issue Type: Test >Reporter: Marcus Eriksson > > So, to be able to actually evaluate DTCS (or TWCS) we need stress profiles > that are similar to something that could be found in real production systems. > We should then run these profiles for _weeks_, and do regular operational > tasks on the cluster - like bootstrap, decom, repair etc. > [~jjirsa] [~jshook] (or anyone): could you describe any write/read patterns > you have seen people use with DTCS in production? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
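The 90/10/1 write/recent-read/cold-read mix described above can be driven by a simple weighted chooser. A hypothetical sketch, with the operation names invented for illustration and the data model from the comment noted alongside:

```python
import random

# Data model sketch from the comment:
#   PRIMARY KEY ((datasource, timebucket), parametername, timestamp) -> value
# Moving the parenthesis right by one term distributes parameters by partition.

# Ratios from the comment: 90 writes : 10 recent reads : 1 cold read.
OP_WEIGHTS = {"write": 90, "read_recent": 10, "read_cold": 1}

def next_op(rng: random.Random) -> str:
    """Pick the next operation type according to the 90/10/1 mix."""
    ops = list(OP_WEIGHTS)
    weights = [OP_WEIGHTS[op] for op in ops]
    return rng.choices(ops, weights=weights, k=1)[0]
```

Varying OP_WEIGHTS, the partition count, and the payload size (1k to 5k per the comment) covers the permutations described above.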
[jira] [Commented] (CASSANDRA-10403) Consider reverting to CMS GC on 3.0
[ https://issues.apache.org/jira/browse/CASSANDRA-10403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14951568#comment-14951568 ] Jonathan Shook commented on CASSANDRA-10403: Anecdote: https://www.youtube.com/watch?v=1R-mgOcOSd4=youtu.be=24m27s > Consider reverting to CMS GC on 3.0 > --- > > Key: CASSANDRA-10403 > URL: https://issues.apache.org/jira/browse/CASSANDRA-10403 > Project: Cassandra > Issue Type: Improvement > Components: Config >Reporter: Joshua McKenzie >Assignee: Paulo Motta > Fix For: 3.0.0 rc2 > > > Reference discussion on CASSANDRA-7486. > For smaller heap sizes G1 appears to have some throughput/latency issues when > compared to CMS. With our default max heap size at 8G on 3.0, there's a > strong argument to be made for having CMS as the default for the 3.0 release. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-10495) Improve the way we do streaming with vnodes
[ https://issues.apache.org/jira/browse/CASSANDRA-10495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14950983#comment-14950983 ] Jonathan Shook commented on CASSANDRA-10495: What if the streaming protocol were enhanced to allow sending nodes to provide an offer manifest, blocking until the receiver responded with a preferred ordering and grouping. Does this help address any of the planning issues better? > Improve the way we do streaming with vnodes > --- > > Key: CASSANDRA-10495 > URL: https://issues.apache.org/jira/browse/CASSANDRA-10495 > Project: Cassandra > Issue Type: Improvement >Reporter: Marcus Eriksson > Fix For: 3.x > > > Streaming with vnodes usually creates a large amount of sstables on the > target node - for example if each source node has 100 sstables and we use > num_tokens = 256, the bootstrapping (for example) node might get 100*256 > sstables > One approach could be to do an on-the-fly compaction on the source node, > meaning we would only stream out one sstable per range. Note that we will > want the compaction strategy to decide how to combine the sstables, for > example LCS will not want to mix sstables from different levels while STCS > can probably just combine everything > cc [~yukim] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
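One way the receiver's "preferred ordering and grouping" response might be computed, purely as an illustration of the proposed offer/response exchange; none of these types or names exist in Cassandra's streaming code:

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class SSTableOffer:
    """One entry in a hypothetical sender-side offer manifest."""
    sstable_id: str
    level: int
    size_bytes: int

def plan_streaming(offers: List[SSTableOffer]) -> List[List[SSTableOffer]]:
    """Receiver-side planning sketch: group offered sstables by level
    (so an LCS-style strategy never mixes levels) and order groups
    largest-first so big transfers start early."""
    by_level: Dict[int, List[SSTableOffer]] = {}
    for offer in offers:
        by_level.setdefault(offer.level, []).append(offer)
    groups = [sorted(group, key=lambda o: o.size_bytes, reverse=True)
              for group in by_level.values()]
    return sorted(groups, key=lambda g: sum(o.size_bytes for o in g), reverse=True)
```

The sender would block on this response and stream groups in the returned order.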
[jira] [Commented] (CASSANDRA-10489) arbitrary order by on partitions
[ https://issues.apache.org/jira/browse/CASSANDRA-10489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14949492#comment-14949492 ] Jonathan Shook commented on CASSANDRA-10489: So, against a non-indexed field, the processing bound will be the size of the partition. If you only hold a scoreboard of limit items in memory and stream through the rest, replacing items, the memory requirements are lower, but the IO requirements could be substantial. If you do this with RF>1 and CL>1, then you may have semantics of result merging at the coordinator, but this should still be bounded to the result size and not the search space. I would like for us to consider this operation for indexed fields and non-indexed fields as separate features, possibly putting the non-indexed version behind a warning or such. I'm sure some will absolutely try to sort 10^9 items with limit 10. At least they should know that it has a completely different op cost. > arbitrary order by on partitions > > > Key: CASSANDRA-10489 > URL: https://issues.apache.org/jira/browse/CASSANDRA-10489 > Project: Cassandra > Issue Type: Improvement >Reporter: Jon Haddad >Priority: Minor > > We've got aggregations, we might as well allow sorting rows within a > partition on arbitrary fields. Currently the advice is "do it client side", > but when combined with a LIMIT clause it makes sense do this server side. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
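The "scoreboard of limit items" pass described in the comment is a bounded-memory top-k scan: memory stays proportional to the limit while IO still covers the whole partition. A minimal sketch of that technique, illustrative only and not Cassandra's actual read path:

```python
import heapq
from typing import Iterable, List, Tuple

def top_k_scoreboard(rows: Iterable[Tuple[int, str]], limit: int) -> List[Tuple[int, str]]:
    """Stream through all rows, keeping only the best `limit` in memory.

    Memory is O(limit); IO is still O(partition size), which is the
    asymmetry the comment warns about.
    """
    if limit <= 0:
        return []
    board: List[Tuple[int, str]] = []  # min-heap holding the current best rows
    for row in rows:
        if len(board) < limit:
            heapq.heappush(board, row)
        elif row > board[0]:
            # Replace the weakest item on the scoreboard.
            heapq.heapreplace(board, row)
    return sorted(board, reverse=True)  # best first
```

Sorting 10^9 rows this way with limit 10 holds only 10 rows in memory, but still reads all 10^9.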
[jira] [Commented] (CASSANDRA-10489) arbitrary order by on partitions
[ https://issues.apache.org/jira/browse/CASSANDRA-10489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14949436#comment-14949436 ] Jonathan Shook commented on CASSANDRA-10489: Would this need to be limited to indexed (in some form) fields? Without an index, it would be difficult for the coordinator to know the bound of sorting ahead of time. Or would this be for rows selected by some indexed field with limit, and then sorted only after limit was applied? Essentially, should we define this as a valid goal for results for which we already can know the cardinality bounds without traversing the whole partition? > arbitrary order by on partitions > > > Key: CASSANDRA-10489 > URL: https://issues.apache.org/jira/browse/CASSANDRA-10489 > Project: Cassandra > Issue Type: Improvement >Reporter: Jon Haddad >Priority: Minor > > We've got aggregations, we might as well allow sorting rows within a > partition on arbitrary fields. Currently the advice is "do it client side", > but when combined with a LIMIT clause it makes sense do this server side. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (CASSANDRA-10490) DTCS historic compaction, possibly with major compaction
Jonathan Shook created CASSANDRA-10490: -- Summary: DTCS historic compaction, possibly with major compaction Key: CASSANDRA-10490 URL: https://issues.apache.org/jira/browse/CASSANDRA-10490 Project: Cassandra Issue Type: Bug Reporter: Jonathan Shook Presently, it's simply painful to run a major compaction with DTCS. It doesn't really serve a useful purpose. Instead, a DTCS major compaction should allow for compaction to go back before max_sstable_age_days. We can call this a historic compaction, for lack of a better term. Such a compaction should not take precedence over normal compaction work, but should be considered a background task. By default there should be a cap on the number of these tasks running. It would be nice to have a separate "max_historic_compaction_tasks" and possibly a "max_historic_compaction_throughput" in the compaction settings to allow for separate throttles on this. I would set these at 1 and 20% of the usual compaction throughput if they aren't set explicitly. It may also be desirable to allow historic compaction to run apart from running a major compaction, and to simply disable major compaction altogether for DTCS. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (CASSANDRA-10489) arbitrary order by on partitions
[ https://issues.apache.org/jira/browse/CASSANDRA-10489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14949492#comment-14949492 ] Jonathan Shook edited comment on CASSANDRA-10489 at 10/8/15 10:14 PM: -- So, against a non-indexed field, the processing bound will be the size of the partition. If you only hold a scoreboard of limit items in memory and stream through the rest, replacing items, the memory requirements are lower, but the IO requirements could be substantial. If you do this with RF>1 and CL>1, then you may have semantics of result merging at the coordinator, but this should still be bounded to the result size and not the search space. I would like for us to consider this operation for indexed fields and non-indexed fields as separate features, possibly putting the non-indexed version behind a warning or such. I'm sure some will absolutely try to sort 10^9 *unindexed* items with limit 10. At least they should know that it has a completely different op cost. was (Author: jshook): So, against a non-indexed field, the processing bound will be the size of the partition. If you only hold a scoreboard of limit items in memory and stream through the rest, replacing items, the memory requirements are lower, but the IO requirements could be substantial. If you do this with RF>1 and CL>1, then you may have semantics of result merging at the coordinator, but this should still be bounded to the result size and not the search space. I would like for us to consider this operation for indexed fields and non-indexed fields as separate features, possibly putting the non-indexed version behind a warning or such. I'm sure some will absolutely try to sort 10^9 items with limit 10. At least they should know that it has a completely different op cost. 
> arbitrary order by on partitions > > > Key: CASSANDRA-10489 > URL: https://issues.apache.org/jira/browse/CASSANDRA-10489 > Project: Cassandra > Issue Type: Improvement >Reporter: Jon Haddad >Priority: Minor > > We've got aggregations, we might as well allow sorting rows within a > partition on arbitrary fields. Currently the advice is "do it client side", > but when combined with a LIMIT clause it makes sense do this server side. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (CASSANDRA-10490) DTCS historic compaction, possibly with major compaction
[ https://issues.apache.org/jira/browse/CASSANDRA-10490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Shook updated CASSANDRA-10490: --- Description: Presently, it's simply painful to run a major compaction with DTCS. It doesn't really serve a useful purpose. Instead, a DTCS major compaction should allow for a DTCS-style compaction to go back before max_sstable_age_days. We can call this a historic compaction, for lack of a better term. Such a compaction should not take precedence over normal compaction work, but should be considered a background task. By default there should be a cap on the number of these tasks running. It would be nice to have a separate "max_historic_compaction_tasks" and possibly a "max_historic_compaction_throughput" in the compaction settings to allow for separate throttles on this. I would set these at 1 and 20% of the usual compaction throughput if they aren't set explicitly. It may also be desirable to allow historic compaction to run apart from running a major compaction, and to simply disable major compaction altogether for DTCS. was: Presently, it's simply painful to run a major compaction with DTCS. It doesn't really serve a useful purpose. Instead, a DTCS major compaction should allow for compaction to go back before max_sstable_age_days. We can call this a historic compaction, for lack of a better term. Such a compaction should not take precedence over normal compaction work, but should be considered a background task. By default there should be a cap on the number of these tasks running. It would be nice to have a separate "max_historic_compaction_tasks" and possibly a "max_historic_compaction_throughput" in the compaction settings to allow for separate throttles on this. I would set these at 1 and 20% of the usual compaction throughput if they aren't set explicitly. 
It may also be desirable to allow historic compaction to run apart from running a major compaction, and to simply disable major compaction altogether for DTCS. > DTCS historic compaction, possibly with major compaction > > > Key: CASSANDRA-10490 > URL: https://issues.apache.org/jira/browse/CASSANDRA-10490 > Project: Cassandra > Issue Type: Bug >Reporter: Jonathan Shook > > Presently, it's simply painful to run a major compaction with DTCS. It > doesn't really serve a useful purpose. Instead, a DTCS major compaction > should allow for a DTCS-style compaction to go back before > max_sstable_age_days. We can call this a historic compaction, for lack of a > better term. > Such a compaction should not take precedence over normal compaction work, but > should be considered a background task. By default there should be a cap on > the number of these tasks running. It would be nice to have a separate > "max_historic_compaction_tasks" and possibly a > "max_historic_compaction_throughput" in the compaction settings to allow for > separate throttles on this. I would set these at 1 and 20% of the usual > compaction throughput if they aren't set explicitly. > It may also be desirable to allow historic compaction to run apart from > running a major compaction, and to simply disable major compaction altogether > for DTCS. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
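The proposed fallbacks (one background task, and 20% of the usual compaction throughput) resolve mechanically when the new knobs are left unset. A hypothetical sketch of that resolution, with the setting names taken from the ticket but the function itself invented for illustration:

```python
from typing import Optional

def resolve_historic_compaction_settings(
    compaction_throughput_mb_per_sec: float,
    max_historic_compaction_tasks: Optional[int] = None,
    max_historic_compaction_throughput: Optional[float] = None,
) -> dict:
    """Apply the ticket's proposed defaults when the knobs are unset."""
    if max_historic_compaction_tasks is None:
        max_historic_compaction_tasks = 1  # proposed default: one background task
    if max_historic_compaction_throughput is None:
        # Proposed default: 20% of the usual compaction throughput.
        max_historic_compaction_throughput = 0.2 * compaction_throughput_mb_per_sec
    return {
        "max_historic_compaction_tasks": max_historic_compaction_tasks,
        "max_historic_compaction_throughput": max_historic_compaction_throughput,
    }
```

With the 64 MB/s compaction throughput seen elsewhere in this archive, the historic throttle would default to about 12.8 MB/s.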
[jira] [Commented] (CASSANDRA-10489) arbitrary order by on partitions
[ https://issues.apache.org/jira/browse/CASSANDRA-10489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14949613#comment-14949613 ] Jonathan Shook commented on CASSANDRA-10489: I'm totally cool with a threshold warning here. But something that is easily ignored is easily ignored, like log spam. Also, if it is documented clearly in terms of op costs, I'm ok with that too. Anywhere we have a list of "these things that can be expensive if you don't understand what they are doing", this should be on it. > arbitrary order by on partitions > > > Key: CASSANDRA-10489 > URL: https://issues.apache.org/jira/browse/CASSANDRA-10489 > Project: Cassandra > Issue Type: Improvement >Reporter: Jon Haddad >Priority: Minor > > We've got aggregations, we might as well allow sorting rows within a > partition on arbitrary fields. Currently the advice is "do it client side", > but when combined with a LIMIT clause it makes sense do this server side. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (CASSANDRA-10443) CQLSStableWriter example fails on 3.0rc1
[ https://issues.apache.org/jira/browse/CASSANDRA-10443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Shook updated CASSANDRA-10443: --- Summary: CQLSStableWriter example fails on 3.0rc1 (was: CQLSStableWriter example fails on C*3.0)
> CQLSStableWriter example fails on 3.0rc1
>
> Key: CASSANDRA-10443
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10443
> Project: Cassandra
> Issue Type: Bug
> Components: Core, Tools
> Reporter: Jonathan Shook
>
> CQLSSTableWriter which works with 2.2.1 does not work with 3.0rc1. Something like https://github.com/yukim/cassandra-bulkload-example should be added to the test suite.
>
> Exception in thread "main" java.lang.RuntimeException: java.lang.ExceptionInInitializerError
> at org.apache.cassandra.io.sstable.SSTableSimpleUnsortedWriter.close(SSTableSimpleUnsortedWriter.java:136)
> at org.apache.cassandra.io.sstable.CQLSSTableWriter.close(CQLSSTableWriter.java:274)
> at com.metawiring.sandbox.BulkLoadExample.main(BulkLoadExample.java:160)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:497)
> at com.intellij.rt.execution.application.AppMain.main(AppMain.java:140)
> Caused by: java.lang.ExceptionInInitializerError
> at org.apache.cassandra.db.Keyspace.initCf(Keyspace.java:372)
> at org.apache.cassandra.db.Keyspace.<init>(Keyspace.java:309)
> at org.apache.cassandra.db.Keyspace.open(Keyspace.java:133)
> at org.apache.cassandra.db.Keyspace.open(Keyspace.java:110)
> at org.apache.cassandra.io.sstable.SSTableTxnWriter.create(SSTableTxnWriter.java:97)
> at org.apache.cassandra.io.sstable.AbstractSSTableSimpleWriter.createWriter(AbstractSSTableSimpleWriter.java:63)
> at org.apache.cassandra.io.sstable.SSTableSimpleUnsortedWriter$DiskWriter.run(SSTableSimpleUnsortedWriter.java:206)
> Caused by: java.lang.NullPointerException
> at org.apache.cassandra.config.DatabaseDescriptor.getFlushWriters(DatabaseDescriptor.java:1153)
> at org.apache.cassandra.db.ColumnFamilyStore.<init>(ColumnFamilyStore.java:116)
> ... 7 more
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (CASSANDRA-10443) CQLSStableWriter example fails on C*3.0
Jonathan Shook created CASSANDRA-10443: --
Summary: CQLSStableWriter example fails on C*3.0
Key: CASSANDRA-10443
URL: https://issues.apache.org/jira/browse/CASSANDRA-10443
Project: Cassandra
Issue Type: Bug
Components: Core, Tools
Reporter: Jonathan Shook

CQLSSTableWriter which works with 2.2.1 does not work with 3.0rc1. Something like https://github.com/yukim/cassandra-bulkload-example should be added to the test suite.

Exception in thread "main" java.lang.RuntimeException: java.lang.ExceptionInInitializerError
at org.apache.cassandra.io.sstable.SSTableSimpleUnsortedWriter.close(SSTableSimpleUnsortedWriter.java:136)
at org.apache.cassandra.io.sstable.CQLSSTableWriter.close(CQLSSTableWriter.java:274)
at com.metawiring.sandbox.BulkLoadExample.main(BulkLoadExample.java:160)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:140)
Caused by: java.lang.ExceptionInInitializerError
at org.apache.cassandra.db.Keyspace.initCf(Keyspace.java:372)
at org.apache.cassandra.db.Keyspace.<init>(Keyspace.java:309)
at org.apache.cassandra.db.Keyspace.open(Keyspace.java:133)
at org.apache.cassandra.db.Keyspace.open(Keyspace.java:110)
at org.apache.cassandra.io.sstable.SSTableTxnWriter.create(SSTableTxnWriter.java:97)
at org.apache.cassandra.io.sstable.AbstractSSTableSimpleWriter.createWriter(AbstractSSTableSimpleWriter.java:63)
at org.apache.cassandra.io.sstable.SSTableSimpleUnsortedWriter$DiskWriter.run(SSTableSimpleUnsortedWriter.java:206)
Caused by: java.lang.NullPointerException
at org.apache.cassandra.config.DatabaseDescriptor.getFlushWriters(DatabaseDescriptor.java:1153)
at org.apache.cassandra.db.ColumnFamilyStore.<init>(ColumnFamilyStore.java:116)
... 7 more
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-10403) Consider reverting to CMS GC on 3.0
[ https://issues.apache.org/jira/browse/CASSANDRA-10403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14938223#comment-14938223 ] Jonathan Shook commented on CASSANDRA-10403: [~JoshuaMcKenzie] I'd prefer not to make too many assumptions about confirmation or (human) memory bias on this. We will not get off this carousel without actual data. However, to the degree that you are right about it, it should encourage us to explore further, not less. CMS's pain in those cases has much to do with its inability to scale with hardware sizing and concurrency trends, which we seem to be working really hard to disregard. Until someone puts together a view of current and emerging system parameters, we really don't have the data that we need to set a default. I posit that the general case system is much bigger in practice than in the past. I also posit that on those systems, G1 is an obviously better default than CMS. So, we are likely going to get some data on 1) what the hardware data looks like in the field and 2) whether or not we can demonstrate the CMS improvements with larger memory that we've seen with *actual workloads* on *current system profiles*. I'm simply eager to see more data at this point. This is a bit out of scope of the ticket, but it is important. If we were able to set a default depending on the available memory, there would not be a single default. Trying to scale GC bandwidth up on bigger metal with CMS is arguably more painful than trying to make G1 usable with lower memory. However, we don't have to make that bargain as either-or. We can have the best of both, if we simply align the GC settings to the type of hardware that they work well for. I'll create another ticket for that. 
> Consider reverting to CMS GC on 3.0 > --- > > Key: CASSANDRA-10403 > URL: https://issues.apache.org/jira/browse/CASSANDRA-10403 > Project: Cassandra > Issue Type: Improvement > Components: Config >Reporter: Joshua McKenzie >Assignee: Paulo Motta > Fix For: 3.0.0 rc2 > > > Reference discussion on CASSANDRA-7486. > For smaller heap sizes G1 appears to have some throughput/latency issues when > compared to CMS. With our default max heap size at 8G on 3.0, there's a > strong argument to be made for having CMS as the default for the 3.0 release. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-10403) Consider reverting to CMS GC on 3.0
[ https://issues.apache.org/jira/browse/CASSANDRA-10403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14938237#comment-14938237 ] Jonathan Shook commented on CASSANDRA-10403: I created CASSANDRA-10425 to discuss the per-size defaults. > Consider reverting to CMS GC on 3.0 > --- > > Key: CASSANDRA-10403 > URL: https://issues.apache.org/jira/browse/CASSANDRA-10403 > Project: Cassandra > Issue Type: Improvement > Components: Config >Reporter: Joshua McKenzie >Assignee: Paulo Motta > Fix For: 3.0.0 rc2 > > > Reference discussion on CASSANDRA-7486. > For smaller heap sizes G1 appears to have some throughput/latency issues when > compared to CMS. With our default max heap size at 8G on 3.0, there's a > strong argument to be made for having CMS as the default for the 3.0 release. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (CASSANDRA-10403) Consider reverting to CMS GC on 3.0
[ https://issues.apache.org/jira/browse/CASSANDRA-10403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14938336#comment-14938336 ] Jonathan Shook edited comment on CASSANDRA-10403 at 9/30/15 7:28 PM: - [~JoshuaMcKenzie] I understand and appreciate the need to control scoping effort for 3.0 planning. bq. Shouldn't the read/write workload distribution also play into that? Yes, but there is a mostly orthogonal effect to the nuances of the workload mix which has to do with the vertical scalability of GC when the system is more fully utilized. This is visible along the sizing spectrum. Run the same workload and try to scale the heap proportionally over the memory (1/4 or whatever) and you will likely see CMS suffer no matter what. This is slightly conjectural, but easily verifiable with some effort. bq. the idea of having a default that's optimal for everyone is unrealistic I think we are converging on a common perspective on this. [~slebresne] bq. 3.2 will come only 2 months after 3.0 My preference would be to have the CASSANDRA-10425 out of the gate, but this still would require some testing effort for safety. The reason being that 3.0 represents a reframing of performance expectations, and after that, any changes to default, even for larger memory systems constitute a bigger chance of surprise. Do we have a chance to learn about sizing from surveys, etc before the runway ends for 3.0? If we could get something like CASSANDRA-10425 in place, it would cover both bases. was (Author: jshook): [~JoshuaMcKenzie] I understand and appreciate the need to control scoping effort for 3.0 planning. bq. Shouldn't the read/write workload distribution also play into that? Yes, but there is a mostly orthogonal effect to the nuances of the workload mix which has to do with the vertical scalability of GC when the system. This is visible along the sizing spectrum. 
Run the same workload and try to scale the heap proportionally over the memory (1/4 or whatever) and you will likely see CMS suffer no matter what. This is slightly conjectural, but easily verifiable with some effort. bq. the idea of having a default that's optimal for everyone is unrealistic I think we are converging on a common perspective on this. [~slebresne] bq. 3.2 will come only 2 months after 3.0 My preference would be to have the CASSANDRA-10425 out of the gate, but this still would require some testing effort for safety. The reason being that 3.0 represents a reframing of performance expectations, and after that, any changes to default, even for larger memory systems constitute a bigger chance of surprise. Do we have a chance to learn about sizing from surveys, etc before the runway ends for 3.0? If we could get something like CASSANDRA-10425 in place, it would cover both bases. > Consider reverting to CMS GC on 3.0 > --- > > Key: CASSANDRA-10403 > URL: https://issues.apache.org/jira/browse/CASSANDRA-10403 > Project: Cassandra > Issue Type: Improvement > Components: Config >Reporter: Joshua McKenzie >Assignee: Paulo Motta > Fix For: 3.0.0 rc2 > > > Reference discussion on CASSANDRA-7486. > For smaller heap sizes G1 appears to have some throughput/latency issues when > compared to CMS. With our default max heap size at 8G on 3.0, there's a > strong argument to be made for having CMS as the default for the 3.0 release. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-10403) Consider reverting to CMS GC on 3.0
[ https://issues.apache.org/jira/browse/CASSANDRA-10403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14938336#comment-14938336 ] Jonathan Shook commented on CASSANDRA-10403: [~JoshuaMcKenzie] I understand and appreciate the need to control scoping effort for 3.0 planning. bq. Shouldn't the read/write workload distribution also play into that? Yes, but there is a mostly orthogonal effect to the nuances of the workload mix which has to do with the vertical scalability of GC when the system is more fully utilized. This is visible along the sizing spectrum. Run the same workload and try to scale the heap proportionally over the memory (1/4 or whatever) and you will likely see CMS suffer no matter what. This is slightly conjectural, but easily verifiable with some effort. bq. the idea of having a default that's optimal for everyone is unrealistic I think we are converging on a common perspective on this. [~slebresne] bq. 3.2 will come only 2 months after 3.0 My preference would be to have the CASSANDRA-10425 out of the gate, but this still would require some testing effort for safety. The reason being that 3.0 represents a reframing of performance expectations, and after that, any changes to default, even for larger memory systems constitute a bigger chance of surprise. Do we have a chance to learn about sizing from surveys, etc before the runway ends for 3.0? If we could get something like CASSANDRA-10425 in place, it would cover both bases. > Consider reverting to CMS GC on 3.0 > --- > > Key: CASSANDRA-10403 > URL: https://issues.apache.org/jira/browse/CASSANDRA-10403 > Project: Cassandra > Issue Type: Improvement > Components: Config >Reporter: Joshua McKenzie >Assignee: Paulo Motta > Fix For: 3.0.0 rc2 > > > Reference discussion on CASSANDRA-7486. > For smaller heap sizes G1 appears to have some throughput/latency issues when > compared to CMS. 
With our default max heap size at 8G on 3.0, there's a > strong argument to be made for having CMS as the default for the 3.0 release. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-10425) Autoselect GC settings depending on system memory
[ https://issues.apache.org/jira/browse/CASSANDRA-10425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14938236#comment-14938236 ] Jonathan Shook commented on CASSANDRA-10425: Consider adding some weightings for different levels of buffer-cache sensitivity in workload. > Autoselect GC settings depending on system memory > - > > Key: CASSANDRA-10425 > URL: https://issues.apache.org/jira/browse/CASSANDRA-10425 > Project: Cassandra > Issue Type: Bug > Components: Config, Core >Reporter: Jonathan Shook > > 1) Make GC modular within cassandra-env > 2) For systems with 32GB or less of ram, use the classic CMS with the > established default settings. > 3) For systems with 48GB or more of ram, use 1/2 or up to 32GB of heap with > G1, whichever is lower. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (CASSANDRA-10425) Autoselect GC settings depending on system memory
Jonathan Shook created CASSANDRA-10425: -- Summary: Autoselect GC settings depending on system memory Key: CASSANDRA-10425 URL: https://issues.apache.org/jira/browse/CASSANDRA-10425 Project: Cassandra Issue Type: Bug Components: Config, Core Reporter: Jonathan Shook 1) Make GC modular within cassandra-env 2) For systems with 32GB or less of ram, use the classic CMS with the established default settings. 3) For systems with 48GB or more of ram, use 1/2 or up to 32GB of heap with G1, whichever is lower. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
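The sizing rule proposed above (items 2 and 3) can be sketched as a small shell function in the style of cassandra-env.sh. This is a hypothetical illustration, not the patch: the function name and the 8192 MB CMS default heap are assumptions, the thresholds come from the ticket, and the ticket leaves the 32GB-48GB band unspecified (the sketch folds it into the G1 path for simplicity).

```shell
#!/bin/sh
# Sketch of the proposed GC/heap autoselection (hypothetical names).
# Input: total system memory in MB. Output: proposed max heap in MB.
proposed_heap_mb() {
    mem_mb=$1
    if [ "$mem_mb" -le 32768 ]; then
        # 2) 32GB of RAM or less: classic CMS with the established
        #    default settings (8G default max heap assumed here)
        echo 8192
    else
        # 3) larger systems: G1 with half of RAM, capped at 32GB,
        #    whichever is lower
        half_mb=$(( mem_mb / 2 ))
        if [ "$half_mb" -lt 32768 ]; then
            echo "$half_mb"
        else
            echo 32768
        fi
    fi
}
```

In a real cassandra-env.sh this would feed MAX_HEAP_SIZE and select which modular GC options file to source (item 1), e.g. a CMS file on the first branch and a G1 file on the second.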
[jira] [Comment Edited] (CASSANDRA-10403) Consider reverting to CMS GC on 3.0
[ https://issues.apache.org/jira/browse/CASSANDRA-10403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14938731#comment-14938731 ] Jonathan Shook edited comment on CASSANDRA-10403 at 9/30/15 7:45 PM: - To simplify, implementing CASSANDRA-10425 is effectively the same as reverting for the systems that we have commonly tested for, while allowing a likely better starting point for those systems where we have field experience with G1. was (Author: jshook): To simplify, implementing CASSANDRA-10425 is effectively the same as reverting for the system that we have tested for, while allowing a likely better starting point for those that we have field experience with G1. > Consider reverting to CMS GC on 3.0 > --- > > Key: CASSANDRA-10403 > URL: https://issues.apache.org/jira/browse/CASSANDRA-10403 > Project: Cassandra > Issue Type: Improvement > Components: Config >Reporter: Joshua McKenzie >Assignee: Paulo Motta > Fix For: 3.0.0 rc2 > > > Reference discussion on CASSANDRA-7486. > For smaller heap sizes G1 appears to have some throughput/latency issues when > compared to CMS. With our default max heap size at 8G on 3.0, there's a > strong argument to be made for having CMS as the default for the 3.0 release. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-10403) Consider reverting to CMS GC on 3.0
[ https://issues.apache.org/jira/browse/CASSANDRA-10403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14938731#comment-14938731 ] Jonathan Shook commented on CASSANDRA-10403: To simplify, implementing CASSANDRA-10425 is effectively the same as reverting for the system that we have tested for, while allowing a likely better starting point for those that we have field experience with G1. > Consider reverting to CMS GC on 3.0 > --- > > Key: CASSANDRA-10403 > URL: https://issues.apache.org/jira/browse/CASSANDRA-10403 > Project: Cassandra > Issue Type: Improvement > Components: Config >Reporter: Joshua McKenzie >Assignee: Paulo Motta > Fix For: 3.0.0 rc2 > > > Reference discussion on CASSANDRA-7486. > For smaller heap sizes G1 appears to have some throughput/latency issues when > compared to CMS. With our default max heap size at 8G on 3.0, there's a > strong argument to be made for having CMS as the default for the 3.0 release. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-10403) Consider reverting to CMS GC on 3.0
[ https://issues.apache.org/jira/browse/CASSANDRA-10403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14937449#comment-14937449 ] Jonathan Shook commented on CASSANDRA-10403: So, just to be clear: are we disregarding G1 for systems with larger memory, on the assumption that 8GB is all you'll ever need for "all but the most write-heavy workloads", even for systems that have larger memory? > Consider reverting to CMS GC on 3.0 > --- > > Key: CASSANDRA-10403 > URL: https://issues.apache.org/jira/browse/CASSANDRA-10403 > Project: Cassandra > Issue Type: Improvement > Components: Config >Reporter: Joshua McKenzie >Assignee: Paulo Motta > Fix For: 3.0.0 rc2 > > > Reference discussion on CASSANDRA-7486. > For smaller heap sizes G1 appears to have some throughput/latency issues when > compared to CMS. With our default max heap size at 8G on 3.0, there's a > strong argument to be made for having CMS as the default for the 3.0 release. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-10403) Consider reverting to CMS GC on 3.0
[ https://issues.apache.org/jira/browse/CASSANDRA-10403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14937456#comment-14937456 ] Jonathan Shook commented on CASSANDRA-10403: [~pauloricardomg] I understand, with your updated comment. For systems that can't support a larger heap, CMS is fine, as long as you don't mind saturating survivor and triggering the cascade of GC-induced side-effects. Still, this is a performance trade-off with resiliency. I want to be clear that I think it would be a loss for us to just disregard G1 for larger memory systems as the general case. There seems to be some tension between the actual field experience and prognostication as to how it should work. I would like for data to lead the way on this, as it should. > Consider reverting to CMS GC on 3.0 > --- > > Key: CASSANDRA-10403 > URL: https://issues.apache.org/jira/browse/CASSANDRA-10403 > Project: Cassandra > Issue Type: Improvement > Components: Config >Reporter: Joshua McKenzie >Assignee: Paulo Motta > Fix For: 3.0.0 rc2 > > > Reference discussion on CASSANDRA-7486. > For smaller heap sizes G1 appears to have some throughput/latency issues when > compared to CMS. With our default max heap size at 8G on 3.0, there's a > strong argument to be made for having CMS as the default for the 3.0 release. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-10403) Consider reverting to CMS GC on 3.0
[ https://issues.apache.org/jira/browse/CASSANDRA-10403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14937039#comment-14937039 ] Jonathan Shook commented on CASSANDRA-10403: This statement carries certain assumptions about the whole system, which may not be fair across the board. For example, buffer cache is a critical consideration, but to a varying degree depending on how cache-friendly the workload is. Further, the storage subsystem determines a very large part of how much of a cache-miss penalty there is. So, prioritizing the cache at the expense of the heap is not a sure win. Often it is not the right balance. With systems that have high concurrency, it is possible to scale up the performance on the node as long as you can provide reasonable tunings to effectively take advantage of available resources without critically bottle-necking on one. For example, with systems that have higher effective IO concurrency and IO bandwidth across many devices, you actually need higher GC throughput in order to match the overall IO capacity of the system, from storage subsystem all the way to the network stack. This rationale has been evidenced in the field when we have made tuning improvements with G1 in certain systems as an opportunistic test. My explanation above is probably a gross oversimplification, but it reflects experience addressing GC throughput (and pauses, and phi, and hints, and load shifting ... etc) issues. > Consider reverting to CMS GC on 3.0 > --- > > Key: CASSANDRA-10403 > URL: https://issues.apache.org/jira/browse/CASSANDRA-10403 > Project: Cassandra > Issue Type: Improvement > Components: Config >Reporter: Joshua McKenzie >Assignee: Paulo Motta > Fix For: 3.0.0 rc2 > > > Reference discussion on CASSANDRA-7486. > For smaller heap sizes G1 appears to have some throughput/latency issues when > compared to CMS. 
With our default max heap size at 8G on 3.0, there's a > strong argument to be made for having CMS as the default for the 3.0 release. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-10403) Consider reverting to CMS GC on 3.0
[ https://issues.apache.org/jira/browse/CASSANDRA-10403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14937041#comment-14937041 ] Jonathan Shook commented on CASSANDRA-10403: To be clear, in some cases, we found G1 to be a better production GC, and those tests simply allowed us to verify this before leaving it in place. > Consider reverting to CMS GC on 3.0 > --- > > Key: CASSANDRA-10403 > URL: https://issues.apache.org/jira/browse/CASSANDRA-10403 > Project: Cassandra > Issue Type: Improvement > Components: Config >Reporter: Joshua McKenzie >Assignee: Paulo Motta > Fix For: 3.0.0 rc2 > > > Reference discussion on CASSANDRA-7486. > For smaller heap sizes G1 appears to have some throughput/latency issues when > compared to CMS. With our default max heap size at 8G on 3.0, there's a > strong argument to be made for having CMS as the default for the 3.0 release. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-10403) Consider reverting to CMS GC on 3.0
[ https://issues.apache.org/jira/browse/CASSANDRA-10403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14936132#comment-14936132 ] Jonathan Shook commented on CASSANDRA-10403: To be fair, m1.xlarge has less than 16GB of RAM, which is still on the small side for G1 effectiveness, although at some point between 14G and 24G you should start seeing G1 provide more stability than CMS for GC-saturating loads. (Assuming you don't set the GC pause target down too low) G1 should start to be the obvious choice when you run with more than about 24GB, and even more obviously with 32GB of heap. This might seem large, but if you look at what businesses tend to deploy in data centers for bare metal, they aren't just 32GB systems anymore. You'll often see 64, 128, or more GB of DRAM. There are some other ec2 profiles which get up to this range, but they are disproportionately more expensive. So, tests that go up to 32G of heap on a system with 64GB of main memory are really where the proof points are. Saturating loads are good too. > Consider reverting to CMS GC on 3.0 > --- > > Key: CASSANDRA-10403 > URL: https://issues.apache.org/jira/browse/CASSANDRA-10403 > Project: Cassandra > Issue Type: Improvement > Components: Config >Reporter: Joshua McKenzie >Assignee: Paulo Motta > Fix For: 3.0.0 rc2 > > > Reference discussion on CASSANDRA-7486. > For smaller heap sizes G1 appears to have some throughput/latency issues when > compared to CMS. With our default max heap size at 8G on 3.0, there's a > strong argument to be made for having CMS as the default for the 3.0 release. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (CASSANDRA-10403) Consider reverting to CMS GC on 3.0
[ https://issues.apache.org/jira/browse/CASSANDRA-10403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14936186#comment-14936186 ] Jonathan Shook edited comment on CASSANDRA-10403 at 9/30/15 12:58 AM: -- Note about memory sizes. Everything I wrote above assumes that we are talking about smaller heaps. Things clearly change when we go up in heap size beyond what CMS can handle well. (for those reading from the middle) was (Author: jshook): Note about memory sizes. Everything I wrote above assumes that we are talking about smaller heaps. Things clearly change when we go up in heap size beyond what CMS can handle well. > Consider reverting to CMS GC on 3.0 > --- > > Key: CASSANDRA-10403 > URL: https://issues.apache.org/jira/browse/CASSANDRA-10403 > Project: Cassandra > Issue Type: Improvement > Components: Config >Reporter: Joshua McKenzie >Assignee: Paulo Motta > Fix For: 3.0.0 rc2 > > > Reference discussion on CASSANDRA-7486. > For smaller heap sizes G1 appears to have some throughput/latency issues when > compared to CMS. With our default max heap size at 8G on 3.0, there's a > strong argument to be made for having CMS as the default for the 3.0 release. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-10403) Consider reverting to CMS GC on 3.0
[ https://issues.apache.org/jira/browse/CASSANDRA-10403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14936186#comment-14936186 ] Jonathan Shook commented on CASSANDRA-10403: Note about memory sizes. Everything I wrote above assumes that we are talking about smaller heaps. Things clearly change when we go up in heap size beyond what CMS can handle well. > Consider reverting to CMS GC on 3.0 > --- > > Key: CASSANDRA-10403 > URL: https://issues.apache.org/jira/browse/CASSANDRA-10403 > Project: Cassandra > Issue Type: Improvement > Components: Config >Reporter: Joshua McKenzie >Assignee: Paulo Motta > Fix For: 3.0.0 rc2 > > > Reference discussion on CASSANDRA-7486. > For smaller heap sizes G1 appears to have some throughput/latency issues when > compared to CMS. With our default max heap size at 8G on 3.0, there's a > strong argument to be made for having CMS as the default for the 3.0 release. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (CASSANDRA-10419) Make JBOD compaction and flushing more robust
Jonathan Shook created CASSANDRA-10419: -- Summary: Make JBOD compaction and flushing more robust Key: CASSANDRA-10419 URL: https://issues.apache.org/jira/browse/CASSANDRA-10419 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Jonathan Shook Attachments: timeseries-study-overview-jbods.png With JBOD and several smaller disks, like SSDs at 1.2 TB or lower, it is possible to run out of space prematurely. With a sufficient ingestion rate, disk selection logic seems to overselect on certain JBOD targets. This causes a premature C* shutdown when there is a significant amount of space left. With DTCS, for example, it should be possible to utilize over 90% of the available space with certain settings. However in the scenario I tested, only about 50% was utilized before a filesystem-full error. (see below). It is likely that this is a scheduling challenge between high rates of ingest and smaller data directories. It would be good to use an anticipatory model if possible to more carefully select compaction targets according to fill rates. The attached image shows a test with 12 1.2TB JBOD data directories. At the end, the utilizations are: 59GiB, 83GiB, 83GiB, 97GiB, 330GiB, 589GiB, 604GiB, 630GiB, 697GiB, 1.055TiB, 1.083TiB, 1.092TiB, -- This message was sent by Atlassian JIRA (v6.3.4#6332)
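As a point of comparison for the anticipatory model the ticket asks for, the simplest possible target-selection policy (always pick the directory with the most remaining space) can be sketched as a shell filter. This is a hypothetical illustration only; the function name is made up, and the real fix would also need to weight choices by projected flush/compaction output size rather than free space alone.

```shell
#!/bin/sh
# Naive JBOD target selection (illustrative only): given lines of
# "directory free_kb" on stdin, print the directory with the most
# remaining space.
pick_least_full() {
    awk 'NR == 1 || $2 > best { best = $2; dir = $1 } END { print dir }'
}
```

Fed with per-data-directory free space (e.g. from df), a policy like this still overselects a single target under sustained ingest, which is roughly the skew visible in the utilization numbers above; an anticipatory model would reserve space on a target before committing a compaction or flush to it.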
[jira] [Updated] (CASSANDRA-10419) Make JBOD compaction and flushing more robust
[ https://issues.apache.org/jira/browse/CASSANDRA-10419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Shook updated CASSANDRA-10419: --- Description: With JBOD and several smaller disks, like SSDs at 1.2 TB or lower, it is possible to run out of space prematurely. With a sufficient ingestion rate, disk selection logic seems to overselect on certain JBOD targets. This causes a premature C* shutdown when there is a significant amount of space left. With DTCS, for example, it should be possible to utilize over 90% of the available space with certain settings. However in the scenario I tested, only about 50% was utilized, before a filesystem full error. (see below). It is likely that this is a scheduling challenge between high rates of ingest and smaller data directories. It would be good to use an anticipatory model if possible to more carefully select compaction targets according to fill rates. As well, if the largest sstable that can be supported is constrained by the largest JBOD extent, we should make that visible to the compaction logic where possible. The attached image shows a test with 12 1.2TB JBOD data directories. At the end, the utilizations are: 59GiB, 83GiB, 83GiB, 97GiB, 330GiB, 589GiB, 604GiB, 630GiB, 697GiB, 1.055TiB, 1.083TB, 1092TiB, was: With JBOD and several smaller disks, like SSDs at 1.2 TB or lower, it is possible to run out of space prematurely. With a sufficient ingestion rate, disk selection logic seems to overselect on certain JBOD targets. This causes a premature C* shutdown when there is a significant amount of space left. With DTCS, for example, it should be possible to utilize over 90% of the available space with certain settings. However in the scenario I tested, only about 50% was utilized, before a filesystem full error. (see below). It is likely that this is a scheduling challenge between high rates of ingest and smaller data directories. 
It would be good to use an anticipatory model if possible to more carefully select compaction targets according to fill rates. The attached image shows a test with 12 1.2TB JBOD data directories. At the end, the utilizations are: 59GiB, 83GiB, 83GiB, 97GiB, 330GiB, 589GiB, 604GiB, 630GiB, 697GiB, 1.055TiB, 1.083TB, 1092TiB, > Make JBOD compaction and flushing more robust > - > > Key: CASSANDRA-10419 > URL: https://issues.apache.org/jira/browse/CASSANDRA-10419 > Project: Cassandra > Issue Type: Improvement > Components: Core >Reporter: Jonathan Shook > Attachments: timeseries-study-overview-jbods.png > > > With JBOD and several smaller disks, like SSDs at 1.2 TB or lower, it is > possible to run out of space prematurely. With a sufficient ingestion rate, > disk selection logic seems to overselect on certain JBOD targets. This causes > a premature C* shutdown when there is a significant amount of space left. > With DTCS, for example, it should be possible to utilize over 90% of the > available space with certain settings. However in the scenario I tested, only > about 50% was utilized, before a filesystem full error. (see below). It is > likely that this is a scheduling challenge between high rates of ingest and > smaller data directories. It would be good to use an anticipatory model if > possible to more carefully select compaction targets according to fill rates. > As well, if the largest sstable that can be supported is constrained by the > largest JBOD extent, we should make that visible to the compaction logic > where possible. > The attached image shows a test with 12 1.2TB JBOD data directories. At the > end, the utilizations are: > 59GiB, 83GiB, 83GiB, 97GiB, 330GiB, 589GiB, 604GiB, 630GiB, 697GiB, 1.055TiB, > 1.083TB, 1092TiB, -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-10403) Consider reverting to CMS GC on 3.0
[ https://issues.apache.org/jira/browse/CASSANDRA-10403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14936156#comment-14936156 ] Jonathan Shook commented on CASSANDRA-10403: I do think it is valid, however I expect the findings to be slightly different. The promise of G1 on smaller systems is more robust performance across a range of workloads without manual tuning. That said, it probably won't perform as well in terms of ops/s, etc. The question to me is really whether we are trying to save people from the pain of not going fast enough or whether we are trying to save them from the pain of a CMS once they start having cascading IO and heap pressure through the system. I am very curious about our tests proving this out as we would expect. As an operator and a developer, I'd take an easily tuned and stable setting over one that goes fast until it doesn't go, any day. However, some will have already adjusted their cluster sizing around one expectation, so we'd want to make sure to avoid surprises. With 3.0 having other changes as well to offset, it might be a wash. Raw performance is only part of the picture. I would like to see your results, for sure. > Consider reverting to CMS GC on 3.0 > --- > > Key: CASSANDRA-10403 > URL: https://issues.apache.org/jira/browse/CASSANDRA-10403 > Project: Cassandra > Issue Type: Improvement > Components: Config >Reporter: Joshua McKenzie >Assignee: Paulo Motta > Fix For: 3.0.0 rc2 > > > Reference discussion on CASSANDRA-7486. > For smaller heap sizes G1 appears to have some throughput/latency issues when > compared to CMS. With our default max heap size at 8G on 3.0, there's a > strong argument to be made for having CMS as the default for the 3.0 release. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (CASSANDRA-10419) Make JBOD compaction and flushing more robust
[ https://issues.apache.org/jira/browse/CASSANDRA-10419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Shook updated CASSANDRA-10419: --- Description: With JBOD and several smaller disks, like SSDs at 1.2 TB or lower, it is possible to run out of space prematurely. With a sufficient ingestion rate, disk selection logic seems to overselect on certain JBOD targets. This causes a premature C* shutdown when there is a significant amount of space left. With DTCS, for example, it should be possible to utilize over 90% of the available space with certain settings. However in the scenario I tested, only about 50% was utilized, before a filesystem full error. (see below). It is likely that this is a scheduling challenge between high rates of ingest and smaller data directories. It would be good to use an anticipatory model if possible to more carefully select compaction targets according to fill rates. As well, if the largest sstable that can be supported is constrained by the largest JBOD extent, we should make that visible to the compaction logic where possible. The attached image shows a test with 12 1.2TB JBOD data directories. At the end, the utilizations are: 59GiB, 83GiB, 83GiB, 97GiB, 330GiB, 589GiB, 604GiB, 630GiB, 697GiB, 1.055TiB, 1.083TB, 1.092TiB, was: With JBOD and several smaller disks, like SSDs at 1.2 TB or lower, it is possible to run out of space prematurely. With a sufficient ingestion rate, disk selection logic seems to overselect on certain JBOD targets. This causes a premature C* shutdown when there is a significant amount of space left. With DTCS, for example, it should be possible to utilize over 90% of the available space with certain settings. However in the scenario I tested, only about 50% was utilized, before a filesystem full error. (see below). It is likely that this is a scheduling challenge between high rates of ingest and smaller data directories. 
It would be good to use an anticipatory model if possible to more carefully select compaction targets according to fill rates. As well, if the largest sstable that can be supported is constrained by the largest JBOD extent, we should make that visible to the compaction logic where possible. The attached image shows a test with 12 1.2TB JBOD data directories. At the end, the utilizations are: 59GiB, 83GiB, 83GiB, 97GiB, 330GiB, 589GiB, 604GiB, 630GiB, 697GiB, 1.055TiB, 1.083TB, 1092TiB, > Make JBOD compaction and flushing more robust > - > > Key: CASSANDRA-10419 > URL: https://issues.apache.org/jira/browse/CASSANDRA-10419 > Project: Cassandra > Issue Type: Improvement > Components: Core >Reporter: Jonathan Shook > Attachments: timeseries-study-overview-jbods.png > > > With JBOD and several smaller disks, like SSDs at 1.2 TB or lower, it is > possible to run out of space prematurely. With a sufficient ingestion rate, > disk selection logic seems to overselect on certain JBOD targets. This causes > a premature C* shutdown when there is a significant amount of space left. > With DTCS, for example, it should be possible to utilize over 90% of the > available space with certain settings. However in the scenario I tested, only > about 50% was utilized, before a filesystem full error. (see below). It is > likely that this is a scheduling challenge between high rates of ingest and > smaller data directories. It would be good to use an anticipatory model if > possible to more carefully select compaction targets according to fill rates. > As well, if the largest sstable that can be supported is constrained by the > largest JBOD extent, we should make that visible to the compaction logic > where possible. > The attached image shows a test with 12 1.2TB JBOD data directories. At the > end, the utilizations are: > 59GiB, 83GiB, 83GiB, 97GiB, 330GiB, 589GiB, 604GiB, 630GiB, 697GiB, 1.055TiB, > 1.083TB, 1.092TiB, -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-10403) Consider reverting to CMS GC on 3.0
[ https://issues.apache.org/jira/browse/CASSANDRA-10403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14933621#comment-14933621 ] Jonathan Shook commented on CASSANDRA-10403: I would be entirely in favor of having a separate settings file that can simply be sourced in. Having several related GC options sprinkled through the -env file is bothersome. This should apply as well to the CMS settings. Perhaps it should even be a soft setting, as long as the possible values are marshaled against any injection. > Consider reverting to CMS GC on 3.0 > --- > > Key: CASSANDRA-10403 > URL: https://issues.apache.org/jira/browse/CASSANDRA-10403 > Project: Cassandra > Issue Type: Improvement > Components: Config >Reporter: Joshua McKenzie > Fix For: 3.0.0 rc2 > > > Reference discussion on CASSANDRA-7486. > For smaller heap sizes G1 appears to have some throughput/latency issues when > compared to CMS. With our default max heap size at 8G on 3.0, there's a > strong argument to be made for having CMS as the default for the 3.0 release. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
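The separate sourced-in settings file suggested above could look like the following minimal sketch. The file name "gc.options" and the helper name are assumptions for illustration; the only contract assumed is that the file appends flags to JVM_OPTS, as cassandra-env.sh already does inline.

```shell
#!/bin/sh
# Hypothetical hook for cassandra-env.sh: keep all GC flags in one
# optional file instead of sprinkling them through the -env script.
source_gc_options() {
    gc_file="${CASSANDRA_CONF:-/etc/cassandra}/gc.options"
    if [ -r "$gc_file" ]; then
        # the sourced file is expected to append to JVM_OPTS, e.g.
        #   JVM_OPTS="$JVM_OPTS -XX:+UseG1GC"
        . "$gc_file"
    fi
    return 0
}
```

Making it a "soft setting" as suggested would mean validating the file's contents against a whitelist of known flags before sourcing, rather than executing it blindly as this sketch does.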
[jira] [Commented] (CASSANDRA-10403) Consider reverting to CMS GC on 3.0
[ https://issues.apache.org/jira/browse/CASSANDRA-10403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14933505#comment-14933505 ] Jonathan Shook commented on CASSANDRA-10403: Can we get some G1 tests with a 24+G heap to see if it's worth making this machine-specific? The notion of "commodity" changes with time. The settings need to adapt if possible. > Consider reverting to CMS GC on 3.0 > --- > > Key: CASSANDRA-10403 > URL: https://issues.apache.org/jira/browse/CASSANDRA-10403 > Project: Cassandra > Issue Type: Improvement > Components: Config >Reporter: Joshua McKenzie > Fix For: 3.0.0 rc2 > > > Reference discussion on CASSANDRA-7486. > For smaller heap sizes G1 appears to have some throughput/latency issues when > compared to CMS. With our default max heap size at 8G on 3.0, there's a > strong argument to be made for having CMS as the default for the 3.0 release. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-10280) Make DTCS work well with old data
[ https://issues.apache.org/jira/browse/CASSANDRA-10280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14901028#comment-14901028 ] Jonathan Shook commented on CASSANDRA-10280: I've read the patch and the comments. Deprecating max_sstable_age_days in favor of the max window size is a good simplification. It also does what I originally had hoped max_sstable_age_days would do. So +1 on all of that. Just to make sure, can we identify whether or not this might affect tombstone compaction scheduling? As in, could it cause tombstone compactions that would otherwise happen to not occur? > Make DTCS work well with old data > - > > Key: CASSANDRA-10280 > URL: https://issues.apache.org/jira/browse/CASSANDRA-10280 > Project: Cassandra > Issue Type: Sub-task > Components: Core >Reporter: Marcus Eriksson >Assignee: Marcus Eriksson > Fix For: 3.x, 2.1.x, 2.2.x > > > Operational tasks become incredibly expensive if you keep around a long > timespan of data with DTCS - with default settings and 1 year of data, the > oldest window covers about 180 days. Bootstrapping a node with vnodes with > this data layout will force cassandra to compact very many sstables in this > window. > We should probably put a cap on how big the biggest windows can get. We could > probably default this to something sane based on max_sstable_age (ie, say we > can reasonably handle 1000 sstables per node, then we can calculate how big > the windows should be to allow that) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-7486) Migrate to G1GC by default
[ https://issues.apache.org/jira/browse/CASSANDRA-7486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14877316#comment-14877316 ] Jonathan Shook commented on CASSANDRA-7486: --- This seems pretty open-and-shut where I would expect a bit more of a nuanced test. We've honestly seen G1 be the operative improvement in some cases in the field. I'd much prefer to see "needs more analysis" than to see it resolved as fixed. CMS will *not* scale with hardware as we go forward. This is not in debate. > Migrate to G1GC by default > -- > > Key: CASSANDRA-7486 > URL: https://issues.apache.org/jira/browse/CASSANDRA-7486 > Project: Cassandra > Issue Type: New Feature > Components: Config >Reporter: Jonathan Ellis >Assignee: Albert P Tobey > Fix For: 3.0 alpha 1 > > > See > http://www.slideshare.net/MonicaBeckwith/garbage-first-garbage-collector-g1-7486gc-migration-to-expectations-and-advanced-tuning > and https://twitter.com/rbranson/status/482113561431265281 > May want to default 2.1 to G1. > 2.1 is a different animal from 2.0 after moving most of memtables off heap. > Suspect this will help G1 even more than CMS. (NB this is off by default but > needs to be part of the test.) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (CASSANDRA-7486) Migrate to G1GC by default
[ https://issues.apache.org/jira/browse/CASSANDRA-7486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14877411#comment-14877411 ] Jonathan Shook edited comment on CASSANDRA-7486 at 9/20/15 5:15 AM: I do believe that there is a gap between the maximum effective CMS heap sizes and the minimum effective G1 sizes. I'd estimate it to be about the 14GB - 24GB range. Neither does admirably when taxed for GC throughput in that range. Put in another way, I've never and would never advocate that someone use G1 with less than 24G of heap. In practice, I use it only on systems with 64GB of memory, where it is no big deal to give G1 32GB to work with. We have simply seen G1 go slower when it doesn't have adequate scratch space. In essence, it really likes to have more memory. We have also seen anecdotal evidence that G1 seems to settle in, performance wise, after a warm-up time. It could be that it needs to collect metrics long enough under steady state before it learns how to handle GC and heap allocation better. This hasn't been proven out definitively, but is strongly evidenced in some longer-run workload studies. I do agree that when you don't really need more than 12GB of heap, CMS will be difficult to beat with the appropriate tunings. I'm not really sure what to do about the mid-band where neither CMS nor G1 are very happy. We may have to be prescriptive in the sense that if you want to use G1, then you should give it enough to work with effectively. Perhaps we need to make the startup scripts source a different GC config file depending on the detected memory in the system. I normally configure G1 as a sourced (included) file to the -env.sh script, so this would be fairly straightforward. [~ato...@datastax.com], any comments on this? was (Author: jshook): I do believe that there is a gap between the maximum effective CMS heap sizes and the minimum effective G1 sizes. I'd estimate it to be about the 14GB - 24GB range. 
Neither does admirably when taxed for GC throughput in that range. Put in another way, I've never and would never advocate that someone use G1 with less than 24G of heap. In practice, I use it only on systems with 64GB of memory, where it is no big deal to give G1 32GB to work with. We have simply seen G1 go slower when it doesn't have adequate scratch space. In essence, it really likes to have more memory. We have also seen anecdotal evidence that G1 seems to settle in, performance wise, after a warm-up time. It could be that it needs to collect metrics long enough under steady state before it learns how to handle GC and heap allocation better. This hasn't been provided definitively, but is strongly evidenced in some longer-run workload studies. I do agree that when you don't really need more than 12GB of heap, CMS will be difficult to beat with the appropriate tunings. I'm not really sure what to do about the mid-band where neither CMS nor G1 are very happy. We may have to be prescriptive in the sense that if you want to use G1, then you should give it enough to work with effectively. Perhaps we need to make the startup scripts source a different GC config file depending on the detected memory in the system. I normally configure G1 as a sourced (included) file to the -env.sh script, so this would be fairly straightforward. [~ato...@datastax.com], any comments on this? > Migrate to G1GC by default > -- > > Key: CASSANDRA-7486 > URL: https://issues.apache.org/jira/browse/CASSANDRA-7486 > Project: Cassandra > Issue Type: New Feature > Components: Config >Reporter: Jonathan Ellis >Assignee: Benedict > Fix For: 3.0 alpha 1 > > > See > http://www.slideshare.net/MonicaBeckwith/garbage-first-garbage-collector-g1-7486gc-migration-to-expectations-and-advanced-tuning > and https://twitter.com/rbranson/status/482113561431265281 > May want to default 2.1 to G1. > 2.1 is a different animal from 2.0 after moving most of memtables off heap. 
> Suspect this will help G1 even more than CMS. (NB this is off by default but > needs to be part of the test.) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
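The memory-dependent GC config selection suggested in the comment above could be sketched as follows. This is a hypothetical helper, not Cassandra's actual startup logic; the file names are invented, and the 48GB threshold is an assumption derived from the comment's rule of thumb (G1 only when it can be given a 24GB+ heap, in practice on 64GB boxes):

```python
def pick_gc_profile(system_memory_gb: int) -> str:
    """Pick a GC config file to source based on detected system memory.

    Assumption from the comment above: CMS is hard to beat for heaps up to
    ~12-14GB, and G1 should not be used with less than ~24GB of heap. A box
    with 48GB+ of RAM can afford a 24GB+ heap, so it gets the G1 profile.
    """
    if system_memory_gb >= 48:
        return "gc-g1.sh"   # hypothetical sourced file with G1 flags
    return "gc-cms.sh"      # hypothetical sourced file with CMS flags
```

A startup script (e.g. cassandra-env.sh) could then source the returned file, keeping the GC flags out of the main script as the comment describes.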
[jira] [Commented] (CASSANDRA-7486) Migrate to G1GC by default
[ https://issues.apache.org/jira/browse/CASSANDRA-7486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14877432#comment-14877432 ] Jonathan Shook commented on CASSANDRA-7486: --- I'd argue that there already is an increase in pain as you try to use more of the metal on a node. We've just become acclimated to it. Instead of scaling the compute side over the metal, we do silly things like run multiple instances per box. It's not really silly if it gets results, but it is an example of where we do something tactically, get so used to it as a necessary complexity, and then just keep taking for granted that this is how we do it. I personally don't want to keep going down this path. So, I am inclined to carry on with the testing and characterization, in time. We should compare notes and methods and see what can be done to reduce the overall effort. > Migrate to G1GC by default > -- > > Key: CASSANDRA-7486 > URL: https://issues.apache.org/jira/browse/CASSANDRA-7486 > Project: Cassandra > Issue Type: New Feature > Components: Config >Reporter: Jonathan Ellis >Assignee: Benedict > Fix For: 3.0 alpha 1 > > > See > http://www.slideshare.net/MonicaBeckwith/garbage-first-garbage-collector-g1-7486gc-migration-to-expectations-and-advanced-tuning > and https://twitter.com/rbranson/status/482113561431265281 > May want to default 2.1 to G1. > 2.1 is a different animal from 2.0 after moving most of memtables off heap. > Suspect this will help G1 even more than CMS. (NB this is off by default but > needs to be part of the test.) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-7486) Migrate to G1GC by default
[ https://issues.apache.org/jira/browse/CASSANDRA-7486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14877411#comment-14877411 ] Jonathan Shook commented on CASSANDRA-7486: --- I do believe that there is a gap between the maximum effective CMS heap sizes and the minimum effective G1 sizes. I'd estimate it to be about the 14GB - 24GB range. Neither does admirably when taxed for GC throughput in that range. Put another way, I've never and would never advocate that someone use G1 with less than 24G of heap. In practice, I use it only on systems with 64GB of memory, where it is no big deal to give G1 32GB to work with. We have simply seen G1 go slower when it doesn't have adequate scratch space. In essence, it really likes to have more memory. We have also seen anecdotal evidence that G1 seems to settle in, performance-wise, after a warm-up time. It could be that it needs to collect metrics long enough under steady state before it learns how to handle GC and heap allocation better. This hasn't been proven definitively, but is strongly evidenced in some longer-run workload studies. I do agree that when you don't really need more than 12GB of heap, CMS will be difficult to beat with the appropriate tunings. I'm not really sure what to do about the mid-band where neither CMS nor G1 is very happy. We may have to be prescriptive in the sense that if you want to use G1, then you should give it enough to work with effectively. Perhaps we need to make the startup scripts source a different GC config file depending on the detected memory in the system. I normally configure G1 as a sourced (included) file to the -env.sh script, so this would be fairly straightforward. [~ato...@datastax.com], any comments on this? 
> Migrate to G1GC by default > -- > > Key: CASSANDRA-7486 > URL: https://issues.apache.org/jira/browse/CASSANDRA-7486 > Project: Cassandra > Issue Type: New Feature > Components: Config >Reporter: Jonathan Ellis >Assignee: Albert P Tobey > Fix For: 3.0 alpha 1 > > > See > http://www.slideshare.net/MonicaBeckwith/garbage-first-garbage-collector-g1-7486gc-migration-to-expectations-and-advanced-tuning > and https://twitter.com/rbranson/status/482113561431265281 > May want to default 2.1 to G1. > 2.1 is a different animal from 2.0 after moving most of memtables off heap. > Suspect this will help G1 even more than CMS. (NB this is off by default but > needs to be part of the test.) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (CASSANDRA-7486) Migrate to G1GC by default
[ https://issues.apache.org/jira/browse/CASSANDRA-7486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14877316#comment-14877316 ] Jonathan Shook edited comment on CASSANDRA-7486 at 9/19/15 8:44 PM: This seems pretty open-and-shut where I would expect a bit more of a nuanced test. We've honestly seen G1 be the operative improvement in some cases in the field. I'd much prefer to see "needs more analysis" than to see it resolved as fixed. CMS will *not* scale with hardware as we go forward. This is not in debate. Ah, nevermind. I see that is what the status is now. was (Author: jshook): This seems pretty open-and-shut where I would expect a bit more of a nuanced test. We've honestly seen G1 be the operative improvement in some cases in the field. I'd much prefer to see "needs more analysis" than to see it resolved as fixed. CMS will *not* scale with hardware as we go forward. This is not in debate. > Migrate to G1GC by default > -- > > Key: CASSANDRA-7486 > URL: https://issues.apache.org/jira/browse/CASSANDRA-7486 > Project: Cassandra > Issue Type: New Feature > Components: Config >Reporter: Jonathan Ellis >Assignee: Albert P Tobey > Fix For: 3.0 alpha 1 > > > See > http://www.slideshare.net/MonicaBeckwith/garbage-first-garbage-collector-g1-7486gc-migration-to-expectations-and-advanced-tuning > and https://twitter.com/rbranson/status/482113561431265281 > May want to default 2.1 to G1. > 2.1 is a different animal from 2.0 after moving most of memtables off heap. > Suspect this will help G1 even more than CMS. (NB this is off by default but > needs to be part of the test.) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (CASSANDRA-7486) Migrate to G1GC by default
[ https://issues.apache.org/jira/browse/CASSANDRA-7486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14877316#comment-14877316 ] Jonathan Shook edited comment on CASSANDRA-7486 at 9/20/15 3:33 AM: This seems pretty open-and-shut where I would expect a bit more of a nuanced test. We've honestly seen G1 be the operative improvement in some cases in the field. I'd much prefer to see "needs more analysis" than to see it resolved as fixed. CMS will *not* scale with hardware as we go forward. This is not in debate. Ah, nevermind. I see that is what the status is now. was (Author: jshook): This seems pretty open-and-shut where I would expect a bit more of a nuanced test. We've honestly seen G1 be the operative improvement in some cases in the field. I'd much prefer to see "needs more analysis" than to see it resolved as fixed. CMS will *not* scale with hardware as we go forward. This is not in debate. An, nevermind. I see that is what the status is now. > Migrate to G1GC by default > -- > > Key: CASSANDRA-7486 > URL: https://issues.apache.org/jira/browse/CASSANDRA-7486 > Project: Cassandra > Issue Type: New Feature > Components: Config >Reporter: Jonathan Ellis >Assignee: Albert P Tobey > Fix For: 3.0 alpha 1 > > > See > http://www.slideshare.net/MonicaBeckwith/garbage-first-garbage-collector-g1-7486gc-migration-to-expectations-and-advanced-tuning > and https://twitter.com/rbranson/status/482113561431265281 > May want to default 2.1 to G1. > 2.1 is a different animal from 2.0 after moving most of memtables off heap. > Suspect this will help G1 even more than CMS. (NB this is off by default but > needs to be part of the test.) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (CASSANDRA-10297) Low-effort configuration of metrics reporters via JMX/nodetool
Jonathan Shook created CASSANDRA-10297: -- Summary: Low-effort configuration of metrics reporters via JMX/nodetool Key: CASSANDRA-10297 URL: https://issues.apache.org/jira/browse/CASSANDRA-10297 Project: Cassandra Issue Type: Improvement Components: Tools Reporter: Jonathan Shook Fix For: 3.x Provide the ability to configure metrics reporters via JMX, with default support for common reporters out of the box, including graphite. Configuration commands should allow for full programmatic configuration of reporters, including managing active reporters and their filtering settings. The prefix value should be configurable with support for several common tokens which will be interpolated when the prefix value is set: hostname, rpc_ipaddr, cluster_name, etc. Optionally, the configuration should be backed by a configuration file which is automatically loaded at startup if it exists, but with no errors if it doesn't. JMX options added here should also be supported with nodetool. The purpose of this improvement is to allow for bulk (re)configuration of metrics collection in larger deployments. A possible approach that would be easier to implement would be to provide the yaml reporter configuration via a JMX method parameter, with an optional boolean which would signal the method to persist the file in a pre-defined 'reporter config' location. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
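The token-interpolated prefix described in the ticket might work as sketched below. This is purely illustrative: the token names come from the ticket text (hostname, rpc_ipaddr, cluster_name), but the `${...}` syntax and the helper are assumptions, not an existing Cassandra API:

```python
import string

def interpolate_prefix(template: str, node_info: dict) -> str:
    """Expand tokens such as ${hostname} or ${cluster_name} in a
    metrics prefix template, using per-node values."""
    return string.Template(template).substitute(node_info)

# Hypothetical per-node values; extra keys are simply ignored.
prefix = interpolate_prefix(
    "cassandra.${cluster_name}.${hostname}",
    {"cluster_name": "prod1", "hostname": "cass4", "rpc_ipaddr": "10.0.0.4"},
)
# prefix is now "cassandra.prod1.cass4"
```

Setting the template once via JMX or nodetool and letting each node fill in its own values is what would make bulk (re)configuration low-effort across a large cluster.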
[jira] [Commented] (CASSANDRA-10249) Reduce over-read for standard disk io by 16x
[ https://issues.apache.org/jira/browse/CASSANDRA-10249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14729510#comment-14729510 ] Jonathan Shook commented on CASSANDRA-10249: +1 on configurable > Reduce over-read for standard disk io by 16x > > > Key: CASSANDRA-10249 > URL: https://issues.apache.org/jira/browse/CASSANDRA-10249 > Project: Cassandra > Issue Type: Improvement >Reporter: Albert P Tobey > Fix For: 2.1.x > > Attachments: patched-2.1.9-dstat-lvn10.png, > stock-2.1.9-dstat-lvn10.png, yourkit-screenshot.png > > > On read workloads, Cassandra 2.1 reads drastically more data than it emits > over the network. This causes problems throughout the system by wasting disk > IO and causing unnecessary GC. > I have reproduced the issue on clusters and locally with a single instance. > The only requirement to reproduce the issue is enough data to blow through > the page cache. The default schema and data size with cassandra-stress is > sufficient for exposing the issue. > With stock 2.1.9 I regularly observed anywhere from 300:1 to 500:1 > disk:network ratio. That is to say, for 1MB/s of network IO, Cassandra was > doing 300-500MB/s of disk reads, saturating the drive. > After applying this patch for standard IO mode > https://gist.github.com/tobert/10c307cf3709a585a7cf the ratio fell to around > 100:1 on my local test rig. Latency improved considerably and GC became a lot > less frequent. > I tested with 512 byte reads as well, but got the same performance, which > makes sense since all HDD and SSD made in the last few years have a 4K block > size (many of them lie and say 512). > I'm re-running the numbers now and will post them tomorrow. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-10249) Reduce over-read for standard disk io by 16x
[ https://issues.apache.org/jira/browse/CASSANDRA-10249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14727703#comment-14727703 ] Jonathan Shook commented on CASSANDRA-10249: I'm not so sure that this is a niche. Compression is not a default win, and I'd prefer that it be "unset" and require users to pick "compressed" or "uncompressed" in the DDL. But we don't do that. So, compressed is a default. Still, uncompressed is not quite a niche. I'm less sure about the buffered IO angle. If these are reasonable options for some scenarios, then I don't feel quite right calling them niche. One person's niche is another's standard. For those that need these settings to get the most out of their current hardware, the large minimum read size is, in fact, a deoptimization from normal. > Reduce over-read for standard disk io by 16x > > > Key: CASSANDRA-10249 > URL: https://issues.apache.org/jira/browse/CASSANDRA-10249 > Project: Cassandra > Issue Type: Improvement >Reporter: Albert P Tobey > Fix For: 2.1.x > > Attachments: patched-2.1.9-dstat-lvn10.png, > stock-2.1.9-dstat-lvn10.png, yourkit-screenshot.png > > > On read workloads, Cassandra 2.1 reads drastically more data than it emits > over the network. This causes problems throughout the system by wasting disk > IO and causing unnecessary GC. > I have reproduced the issue on clusters and locally with a single instance. > The only requirement to reproduce the issue is enough data to blow through > the page cache. The default schema and data size with cassandra-stress is > sufficient for exposing the issue. > With stock 2.1.9 I regularly observed anywhere from 300:1 to 500:1 > disk:network ratio. That is to say, for 1MB/s of network IO, Cassandra was > doing 300-500MB/s of disk reads, saturating the drive. > After applying this patch for standard IO mode > https://gist.github.com/tobert/10c307cf3709a585a7cf the ratio fell to around > 100:1 on my local test rig. 
Latency improved considerably and GC became a lot > less frequent. > I tested with 512 byte reads as well, but got the same performance, which > makes sense since all HDD and SSD made in the last few years have a 4K block > size (many of them lie and say 512). > I'm re-running the numbers now and will post them tomorrow. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
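The 16x in the ticket title falls out of simple buffer-size arithmetic: shrinking the standard-IO read buffer from 64KB down to the 4KB device block size cuts the bytes pulled from disk per small request by a factor of 16. A quick sketch of that arithmetic (illustrative numbers, not measured values):

```python
def over_read_ratio(buffer_bytes: int, payload_bytes: int) -> float:
    """Bytes read from disk per byte actually needed, assuming each small
    request drags in one full read buffer."""
    return buffer_bytes / payload_bytes

KB = 1024
# A ~1KB row served through a 64KB buffer vs a 4KB block-aligned buffer:
print(over_read_ratio(64 * KB, 1 * KB))  # 64.0 bytes read per byte needed
print(over_read_ratio(4 * KB, 1 * KB))   # 4.0
print((64 * KB) / (4 * KB))              # 16.0 -- the reduction in the title
```

This also explains why 512-byte reads performed no better than 4KB reads in the description: modern drives have a 4K physical block size, so 4KB is the smallest read the hardware will actually do.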
[jira] [Commented] (CASSANDRA-10013) Default commitlog_total_space_in_mb to 4G
[ https://issues.apache.org/jira/browse/CASSANDRA-10013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14712420#comment-14712420 ] Jonathan Shook commented on CASSANDRA-10013: +1 Commit log space is not in short supply. I think it would be ok to make it larger even, but don't have any recent results to support that idea. At least setting it to 4G is an improvement. Default commitlog_total_space_in_mb to 4G - Key: CASSANDRA-10013 URL: https://issues.apache.org/jira/browse/CASSANDRA-10013 Project: Cassandra Issue Type: Improvement Components: Config Reporter: Brandon Williams Fix For: 2.1.x First, it bothers me that we default to 1G but have 4G commented out in the config. More importantly though is more than once I've seen this lead to dropped mutations, because you have ~100 tables (which isn't that hard to do with OpsCenter and CFS and an application that uses a moderately high but still reasonable amount of tables itself) and when the limit is reached CLA flushes the oldest tables to try to free up CL space, but this in turn causes a flush stampede that in some cases never ends and backs up the flush queue which then causes the drops. This leaves you thinking you have a load shedding situation (which I guess you kind of do) but it would go away if you had just uncommented that config line. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-9264) Cassandra should not persist files without checksums
[ https://issues.apache.org/jira/browse/CASSANDRA-9264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14707820#comment-14707820 ] Jonathan Shook commented on CASSANDRA-9264: --- [~aweisberg] Am I correct in assuming that you agree with the need for the checksums, but simply want a method that is simple to reason about as well as more in line with the related data? Is there an opportunity here to consider this topic for inclusion in the table format tickets? Cassandra should not persist files without checksums Key: CASSANDRA-9264 URL: https://issues.apache.org/jira/browse/CASSANDRA-9264 Project: Cassandra Issue Type: Wish Reporter: Ariel Weisberg Fix For: 3.x Even if checksums aren't validated on the read side every time it is helpful to have them persisted with checksums so that if a corrupted file is encountered you can at least validate that the issue is corruption and not an application level error that generated a corrupt file. We should standardize on conventions for how to checksum a file and which checksums to use so we can ensure we get the best performance possible. For a small checksum I think we should use CRC32 because the hardware support appears quite good. For cases where a 4-byte checksum is not enough I think we can look at either xxhash64 or MurmurHash3. The problem with xxhash64 is that output is only 8-bytes. The problem with MurmurHash3 is that the Java implementation is slow. If we can live with 8-bytes and make it easy to switch hash implementations I think xxhash64 is a good choice because we already ship a good implementation with LZ4. I would also like to see hashes always prefixed by a type so that we can swap hashes without running into pain trying to figure out what hash implementation is present. I would also like to avoid making assumptions about the number of bytes in a hash field where possible keeping in mind compatibility and space issues. 
Hashing after compression is also desirable over hashing before compression. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-9264) Cassandra should not persist files without checksums
[ https://issues.apache.org/jira/browse/CASSANDRA-9264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14706942#comment-14706942 ] Jonathan Shook commented on CASSANDRA-9264: --- This came up in discussion with a customer today. There is effectively a difference in read response handling between data from compressed sstables vs non-compressed sstables. This is due to the fact that the block checksums on compressed sstables can disqualify corrupted data. Non-compressed sstables have no equivalent checksum mechanism, so are susceptible to passing hardware-level corruption up without detection. Sectors that have been corrupted may cause an sstable to be unreadable, but it may also manifest as an undetected change in data. Cassandra should not persist files without checksums Key: CASSANDRA-9264 URL: https://issues.apache.org/jira/browse/CASSANDRA-9264 Project: Cassandra Issue Type: Wish Reporter: Ariel Weisberg Fix For: 3.x Even if checksums aren't validated on the read side every time it is helpful to have them persisted with checksums so that if a corrupted file is encountered you can at least validate that the issue is corruption and not an application level error that generated a corrupt file. We should standardize on conventions for how to checksum a file and which checksums to use so we can ensure we get the best performance possible. For a small checksum I think we should use CRC32 because the hardware support appears quite good. For cases where a 4-byte checksum is not enough I think we can look at either xxhash64 or MurmurHash3. The problem with xxhash64 is that output is only 8-bytes. The problem with MurmurHash3 is that the Java implementation is slow. If we can live with 8-bytes and make it easy to switch hash implementations I think xxhash64 is a good choice because we already ship a good implementation with LZ4. 
I would also like to see hashes always prefixed by a type so that we can swap hashes without running into pain trying to figure out what hash implementation is present. I would also like to avoid making assumptions about the number of bytes in a hash field where possible keeping in mind compatibility and space issues. Hashing after compression is also desirable over hashing before compression. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
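The type-prefixed hash field proposed in the ticket could look like the sketch below. The layout (one algorithm tag byte followed by the checksum bytes) is hypothetical, not Cassandra's on-disk format; CRC32 here comes from the standard library:

```python
import struct
import zlib

# Hypothetical one-byte algorithm tag, so the hash field is self-describing
# and implementations can be swapped without guessing which hash was used.
CRC32_TAG = 0x01

def checksum_field(data: bytes) -> bytes:
    """Return a type-prefixed checksum: 1 tag byte + 4-byte big-endian CRC32."""
    return struct.pack(">BI", CRC32_TAG, zlib.crc32(data) & 0xFFFFFFFF)

def verify(data: bytes, field: bytes) -> bool:
    tag, value = struct.unpack(">BI", field)
    if tag != CRC32_TAG:
        raise ValueError("unknown checksum type: %d" % tag)
    return (zlib.crc32(data) & 0xFFFFFFFF) == value

field = checksum_field(b"sstable block bytes")
# verify(b"sstable block bytes", field) -> True; a flipped bit -> False
```

A longer hash (e.g. xxhash64) would simply use a different tag byte and a wider value field, which is exactly the flexibility the type prefix buys.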
[jira] [Commented] (CASSANDRA-6477) Materialized Views (was: Global Indexes)
[ https://issues.apache.org/jira/browse/CASSANDRA-6477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14632018#comment-14632018 ] Jonathan Shook commented on CASSANDRA-6477: --- The comment about adding a hop was with respect to what users would currently be doing to maintain multiple views of data. They don't expect that there is a proxy for their writes, no matter whether they are using async or not, batches or not. Materialized Views (was: Global Indexes) Key: CASSANDRA-6477 URL: https://issues.apache.org/jira/browse/CASSANDRA-6477 Project: Cassandra Issue Type: New Feature Components: API, Core Reporter: Jonathan Ellis Assignee: Carl Yeksigian Labels: cql Fix For: 3.0 beta 1 Attachments: test-view-data.sh, users.yaml Local indexes are suitable for low-cardinality data, where spreading the index across the cluster is a Good Thing. However, for high-cardinality data, local indexes require querying most nodes in the cluster even if only a handful of rows is returned. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-6477) Materialized Views (was: Global Indexes)
[ https://issues.apache.org/jira/browse/CASSANDRA-6477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14631983#comment-14631983 ] Jonathan Shook commented on CASSANDRA-6477: --- If we look at this from the perspective of a typical developer who simply wants query tables to be easier to manage, then the basic requirements are pretty simple: Emulate current practice. That isn't to say that we shouldn't dig deeper in terms of what could make sense in different contexts, but the basic usage pattern that it is meant to simplify is pretty basic: * Logged batches are not commonly used to wrap a primary table with its query tables during writes. The failure modes of these are usually well understood, meaning that it is clear what the implications are for a failed write in nearly every case. * The same CL is generally used for all related tables. * Savvy users will do this with async with the same CL for all of these operations. So effectively, I would expect the very basic form of this feature to look much like it would in practice already, except that it requires much less effort on the end user to maintain. I would like for us to consider that where the implementation varies from this, that there may be lots of potential for surprise. I really think we need to be following the principle of least surprise here as a start. It is almost certain that MV will be adopted quickly in places that have a need for it because they are essentially doing this manually at the present. If you require them to micro-manage the settings in order to even get close to the current result (performance, availability assumptions, ...) then we should change the defaults. It doesn't really matter to me that we force the coordinator node to be a replica. This is orthogonal to the base problem, and has controls in topology aware clients already. As well, it does add potentially another hop, which I do have concerns about with respect to the above. 
Materialized Views (was: Global Indexes) Key: CASSANDRA-6477 URL: https://issues.apache.org/jira/browse/CASSANDRA-6477 Project: Cassandra Issue Type: New Feature Components: API, Core Reporter: Jonathan Ellis Assignee: Carl Yeksigian Labels: cql Fix For: 3.0 beta 1 Attachments: test-view-data.sh, users.yaml Local indexes are suitable for low-cardinality data, where spreading the index across the cluster is a Good Thing. However, for high-cardinality data, local indexes require querying most nodes in the cluster even if only a handful of rows is returned. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
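The "current practice" being emulated above — the application writing the base table and each query table itself, with the same CL — can be illustrated with a small in-memory sketch. Plain dicts stand in for tables here; no driver or cluster is involved, and all names are invented for illustration:

```python
def denormalize_write(base, views, row, key_field="id"):
    """Apply one logical write to the base table and every query table,
    the way users maintain query tables by hand today. Each 'table' is a
    dict keyed by that table's partition key field."""
    base[row[key_field]] = row
    for view_key_field, view in views:
        view[row[view_key_field]] = row

users = {}            # base table, keyed by id
users_by_email = {}   # query table, keyed by email

denormalize_write(
    users,
    [("email", users_by_email)],
    {"id": 42, "email": "jshook@example.com", "name": "Jonathan"},
)
# users[42] and users_by_email["jshook@example.com"] now hold the same row
```

In a real application each dict assignment would be an (often async) INSERT at the same CL, and a failure on any one of them is the partial-write case whose semantics the comment argues MV should preserve rather than surprise.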
[jira] [Comment Edited] (CASSANDRA-6477) Materialized Views (was: Global Indexes)
[ https://issues.apache.org/jira/browse/CASSANDRA-6477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14631983#comment-14631983 ] Jonathan Shook edited comment on CASSANDRA-6477 at 7/17/15 9:55 PM: If we look at this from the perspective of a typical developer who simply wants query tables to be easier to manage, then the basic requirements are pretty simple: Emulate current practice. That isn't to say that we shouldn't dig deeper in terms of what could make sense in different contexts, but the basic usage pattern that it is meant to simplify is pretty basic: * Logged batches are not commonly used to wrap a primary table with its query tables during writes. The failure modes of these are usually well understood, meaning that it is clear what the implications are for a failed write in nearly every case. * The same CL is generally used for all related tables. * Savvy users will do this with async with the same CL for all of these operations. So effectively, I would expect the very basic form of this feature to look much like it would in practice already, except that it requires much less effort on the end user to maintain. I would like for us to consider that where the implementation varies from this, that there may be lots of potential for surprise. I really think we need to be following the principle of least surprise here as a start. It is almost certain that MV will be adopted quickly in places that have a need for it because they are essentially doing this manually at the present. If you require them to micro-manage the settings in order to even get close to the current result (performance, availability assumptions, ...) then we should change the defaults. It doesn't really seem necessary that we force the coordinator node to be a replica. This is orthogonal to the base problem, and has controls in topology aware clients already. 
As well, it does add potentially another hop, which I do have concerns about with respect to the above. was (Author: jshook): If we look at this from the perspective of a typical developer who simply wants query tables to be easier to manage, then the basic requirements are pretty simple: Emulate current practice. That isn't to say that we shouldn't dig deeper in terms of what would could make sense in different contexts, but the basic usage pattern that it is meant to simplify is pretty basic: * Logged batches are not commonly used to wrap a primary table with it's query tables during writes. The failure modes of these are usually well understood, meaning that it is clear what the implications are for a failed write in nearly every case. * The same CL is generally used for all related tables. * Savvy users will do this with async with the same CL for all of these operations. So effectively, I would expect the very basic form of this feature to look much like it would in practice already, except that it requires much less effort on the end user to maintain. I would like for us to consider that where the implementation varies from this, that there may be lots of potential for surprise. I really think we need to be following the principle of least surprise here as a start. It is almost certain that MV will be adopted quickly in places that have a need for it because the are essentially doing this manually at the present. If you require them to micro-manage the settings in order to even get close to the current result (performance, availability assumptions, ...) then we should change the defaults. It doesn't really matter to me that we force the coordinator node to be a replica. This is orthogonal to the base problem, and has controls in topology aware clients already. As well, it does add potentially another hop, which I do have concerns about with respect to the above. 
Materialized Views (was: Global Indexes) Key: CASSANDRA-6477 URL: https://issues.apache.org/jira/browse/CASSANDRA-6477 Project: Cassandra Issue Type: New Feature Components: API, Core Reporter: Jonathan Ellis Assignee: Carl Yeksigian Labels: cql Fix For: 3.0 beta 1 Attachments: test-view-data.sh, users.yaml Local indexes are suitable for low-cardinality data, where spreading the index across the cluster is a Good Thing. However, for high-cardinality data, local indexes require querying most nodes in the cluster even if only a handful of rows is returned. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (CASSANDRA-6477) Materialized Views (was: Global Indexes)
[ https://issues.apache.org/jira/browse/CASSANDRA-6477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14631983#comment-14631983 ] Jonathan Shook edited comment on CASSANDRA-6477 at 7/17/15 9:58 PM: If we look at this from the perspective of a typical developer who simply wants query tables to be easier to manage, then the basic requirements are pretty simple: Emulate current practice. That isn't to say that we shouldn't dig deeper in terms of what could make sense in different contexts, but the basic usage pattern that it is meant to simplify is pretty basic: * Logged batches are not commonly used to wrap a primary table with its query tables during writes. The failure modes of these are usually well understood, meaning that it is clear what the implications are for a failed write in nearly every case. * The same CL is generally used for all related tables. * Savvy users will do this with async with the same CL for all of these operations. So effectively, I would expect the very basic form of this feature to look much like it would in practice already, except that it requires much less effort on the end user to maintain. I would like for us to consider that where the implementation varies from this, that there may be lots of potential for surprise. I really think we need to be following the principle of least surprise here as a start. It is almost certain that MV will be adopted quickly in places that have a need for it because they are essentially doing this manually at the present. If you require them to micro-manage the settings in order to even get close to the current result (performance, availability assumptions, ...) then we should change the defaults. It doesn't really seem necessary that we force the coordinator node to be a replica. This is orthogonal to the base problem, and has controls in topology aware clients already. 
As well, it does add potentially another hop, which I do have concerns about with respect to the above. was (Author: jshook): If we look at this from the perspective of a typical developer who simply wants query tables to be easier to manage, then the basic requirements are pretty simple: Emulate current practice. That isn't to say that we shouldn't dig deeper in terms of what would could make sense in different contexts, but the basic usage pattern that it is meant to simplify is pretty basic: * Logged batches are not commonly used to wrap a primary table with it's query tables during writes. The failure modes of these are usually well understood, meaning that it is clear what the implications are for a failed write in nearly every case. * The same CL is generally used for all related tables. * Savvy users will do this with async with the same CL for all of these operations. So effectively, I would expect the very basic form of this feature to look much like it would in practice already, except that it requires much less effort on the end user to maintain. I would like for us to consider that where the implementation varies from this, that there may be lots of potential for surprise. I really think we need to be following the principle of least surprise here as a start. It is almost certain that MV will be adopted quickly in places that have a need for it because the are essentially doing this manually at the present. If you require them to micro-manage the settings in order to even get close to the current result (performance, availability assumptions, ...) then we should change the defaults. It doesn't really seem necessary that we force the coordinator node to be a replica. This is orthogonal to the base problem, and has controls in topology aware clients already. As well, it does add potentially another hop, which I do have concerns about with respect to the above. 
Materialized Views (was: Global Indexes) Key: CASSANDRA-6477 URL: https://issues.apache.org/jira/browse/CASSANDRA-6477 Project: Cassandra Issue Type: New Feature Components: API, Core Reporter: Jonathan Ellis Assignee: Carl Yeksigian Labels: cql Fix For: 3.0 beta 1 Attachments: test-view-data.sh, users.yaml Local indexes are suitable for low-cardinality data, where spreading the index across the cluster is a Good Thing. However, for high-cardinality data, local indexes require querying most nodes in the cluster even if only a handful of rows is returned. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-6477) Materialized Views (was: Global Indexes)
[ https://issues.apache.org/jira/browse/CASSANDRA-6477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14632015#comment-14632015 ] Jonathan Shook commented on CASSANDRA-6477: --- This goes directly to my point. It would be ideal if we simply allow users to simplify what they are already doing, with the least amount of special handling we can add to the mix. In terms of solving the problem in a way that users understand, we must strive to compose a solution from the already-established primitives that we teach users about all the time. Any failure modes should be explained in those terms as well. Other approaches are likely to create more special cases, which I think we can all agree are not good for anybody.
[jira] [Commented] (CASSANDRA-9130) reduce default dtcs max_sstable_age
[ https://issues.apache.org/jira/browse/CASSANDRA-9130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14615546#comment-14615546 ] Jonathan Shook commented on CASSANDRA-9130: --- I'm not particularly concerned about the corner cases for lots of sstables, but it does need to be documented better. We do not yet have tools to manage re-compacting DTCS past max_sstable_age_days. Even if we did, it would not be an automatic win in every case. The operational trade-offs that come with different max_sstable_age_days settings are simply too stark to ignore. I still believe that 365 is way too high. Studying the total bytes compacted over different DTCS settings and ingest rates can show the IO load. 365 is way beyond the point at which you start paying for more compaction than you need in most systems. I do agree, though, about the boundary condition. We should have a safety in place to avoid max_sstable_age_days being shorter than the table TTL, until we can verify that a TTL-specific compaction pass will occur as needed. This might be a concern as well for per-write TTLs. [~jjirsa] Is there a way that you would like to see the interplay between TTLs and max_sstable_age_days handled? Is there a solution which you would consider safe? reduce default dtcs max_sstable_age --- Key: CASSANDRA-9130 URL: https://issues.apache.org/jira/browse/CASSANDRA-9130 Project: Cassandra Issue Type: Improvement Reporter: Jonathan Ellis Assignee: Marcus Eriksson Priority: Minor Fix For: 3.x, 2.1.x, 2.0.x Now that CASSANDRA-9056 is fixed it should be safe to reduce the default age and increase performance correspondingly. [~jshook] suggests that two weeks may be appropriate, or we could make it dynamic based on gcgs (since that's the window past which we should expect repair to not introduce fragmentation anymore).
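The boundary condition being discussed reduces to a simple arithmetic check. The sketch below is illustrative only; the function name and the idea of an explicit safety check are assumptions, not an actual Cassandra configuration API.

```python
# If a table's TTL is longer than max_sstable_age_days, fully-expired rows
# can sit in sstables that DTCS will no longer compact, so they may never
# be purged. This check captures that condition.
SECONDS_PER_DAY = 86_400

def ttl_outlives_compaction(default_ttl_seconds, max_sstable_age_days):
    """True when expired rows may never be purged: the sstable ages out of
    DTCS compaction before its data's TTL fires."""
    return default_ttl_seconds > max_sstable_age_days * SECONDS_PER_DAY

# A 30-day TTL under the (then-default) 365-day max age is safe;
# a 90-day TTL with a 30-day max age trips the condition.
print(ttl_outlives_compaction(30 * SECONDS_PER_DAY, 365))   # safe case
print(ttl_outlives_compaction(90 * SECONDS_PER_DAY, 30))    # unsafe case
```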
[jira] [Created] (CASSANDRA-9378) Instrument the logger with error count metrics by level
Jonathan Shook created CASSANDRA-9378: - Summary: Instrument the logger with error count metrics by level Key: CASSANDRA-9378 URL: https://issues.apache.org/jira/browse/CASSANDRA-9378 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Jonathan Shook Priority: Minor The ability to do sanity checks against logged error and warning counts could be helpful for several reasons. One of the most obvious would be as a way to verify that no errors were logged during a (semi-)automated upgrade or restart process. Fortunately, this is easy to enable as described here: https://dropwizard.github.io/metrics/3.1.0/manual/logback/ It was pointed out by [~jjordan] that this ability already exists in the current version if the user is willing to drop in the right jars and modify the appender config. It would also be helpful as a programmatic feature with a toggle to enable or disable it, possibly with a cassandra.yaml config parameter. There may be some users who would prefer to disable it to avoid calling another appender. If testing shows the overhead to be sufficiently low, we could just leave it on by default. These counts should be exposed via JMX when they are enabled.
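The idea is small enough to sketch in a few lines: attach a handler/appender that counts records by level, then assert on the counts. Cassandra's actual implementation would be the dropwizard-metrics InstrumentedAppender for logback linked above; this Python analogue just demonstrates the mechanism.

```python
import logging

class CountingHandler(logging.Handler):
    """Counts log records by level name, so a restart or upgrade script
    can assert 'no errors were logged' after the fact."""
    def __init__(self):
        super().__init__()
        self.counts = {}

    def emit(self, record):
        self.counts[record.levelname] = self.counts.get(record.levelname, 0) + 1

log = logging.getLogger("sanity-check-demo")
log.setLevel(logging.DEBUG)
counter = CountingHandler()
log.addHandler(counter)

log.warning("disk nearly full")
log.error("failed to flush")
log.error("failed to flush again")

print(counter.counts)   # counts by level, e.g. one WARNING and two ERRORs
```

Exposing `counter.counts` over a management interface (JMX in the Java case) is then just a read-only view of this dictionary.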
[jira] [Commented] (CASSANDRA-8303) Create a capability limitation framework
[ https://issues.apache.org/jira/browse/CASSANDRA-8303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14538244#comment-14538244 ] Jonathan Shook commented on CASSANDRA-8303: --- I am concerned that we are creating more complexity through pedantry here. Let me explain... I know this is late in coming, but I'm going to explain my position anyway. Whether a user is authorized for an operation does not depend on that operation being on specific data. There is no hard and fast rule that says you must not use authorization to control access to types of actions. From the general perspective, that is what authorization is about. It is simply a mechanism to answer the question "Is the current user allowed to do this?" That is not strictly limited to accessing specific data, but may also be used to limit access to specific types of actions which are not data-specific. To assert otherwise is to ignore lots of established practice across a great number of systems. Authorization is a general concept which needs to be applied in a conceptually useful and idiomatic way for each system. There is a variety of approaches in the wild for how to structure permissions. Some of them assume no access, with exclusions in the form of grants. Some do the opposite. My favorite is the 3-state logic used by the postfix MTA (thank you, Venema!), which allows for a very pluggable system. In this system, a chain of evaluators is used, and each may grant, deny, or indicate "I don't know, ask the next one." And all you have to do to establish a default is put a static grant or deny at the end of the evaluator chain. So, this is a bit of a non sequitur, but the point is to illustrate the variety and flexibility of the authorization systems out there. A competing concern for the flexibility of these systems is always how easy they are to understand and use. That feeds directly into my next point.
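The postfix-style 3-state chain described above can be sketched in a few lines. The rule names here are invented for illustration; the point is only the grant/deny/abstain shape and the static default at the end of the chain.

```python
# Each evaluator may GRANT, DENY, or ABSTAIN ("I don't know, ask the next one").
GRANT, DENY, ABSTAIN = "grant", "deny", "abstain"

def superuser_rule(user, action):
    return GRANT if user == "admin" else ABSTAIN

def deny_truncate_rule(user, action):
    return DENY if action == "TRUNCATE" else ABSTAIN

def default_grant(user, action):
    return GRANT   # the chain's static default, as described above

def authorize(chain, user, action):
    for evaluator in chain:
        verdict = evaluator(user, action)
        if verdict != ABSTAIN:   # first non-abstaining evaluator decides
            return verdict
    return DENY                  # fail closed if every evaluator abstains

chain = [superuser_rule, deny_truncate_rule, default_grant]
```

With this shape, swapping the default or inserting a new rule is just a change to the list, which is what makes the scheme so pluggable.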
I don't understand why we would want to create two semantically distinct interfaces for users when we are really talking about a basic authorization problem. Group or individual, data-oriented or command-oriented. Once you've made the user pay the price of entry to use the authentication system, are you going to tell them that they have to learn a different system to do capability limiting, because we decided to name it and treat it differently? I think this is a case of accidental complexity in the name of separation of concerns, when in fact they are not really separate concerns. You can't completely separate the mechanisms of authorization, group membership, and capabilities. They may have cleanly defined APIs, but if you look at the implementation details of CASSANDRA-7653, saying that they are logically separate would be a half-truth at best. Indeed, the authentication data has been pulled up to be visible to and owned by the group management logic. You simply can't have group authentication without authentication. How they are mapped together is another matter. I would assert that authenticating individuals and mapping them via groups would probably be cleaner, but the mechanisms would still need to be inextricably linked at some layer. From a security perspective, limiting commands is every bit as much a part of managing system availability, in a security and continuity sense, as other forms of authentication and authorization. Think about DOS attacks and what it means to prevent them via command restrictions. It's just that some of the commands are data-specific, and some are not. You simply cannot have a proper, separate subsystem for limiting commands without mostly reinventing the wheel around authentication and authorization. So, regardless of how it's implemented, I don't think we should be trying to designate authorization and limiting allowed commands as different concerns from the user's perspective.
If you subscribe to that mindset, any notion that they would be implemented as rigidly isolated subsystems would be a reason for concern. I'm all for keeping the implementation clean and composable. I just don't want to see us shoot ourselves in the foot because we are forcing separation where there is conceptual, logical, and mechanical affinity. I'm uncomfortable with the suggestion under "What does this buy us?", #5, that this would simply be vetoed if it weren't done the way suggested above. Was that meant to be a qualification of how this ticket can move forward? Create a capability limitation framework Key: CASSANDRA-8303 URL: https://issues.apache.org/jira/browse/CASSANDRA-8303 Project: Cassandra Issue Type: Improvement Reporter: Anupam Arora Assignee: Sam Tunnicliffe Fix For: 3.x In addition to our current Auth framework that acts as a white list, and regulates access to data,
[jira] [Comment Edited] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator
[ https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14536846#comment-14536846 ] Jonathan Shook edited comment on CASSANDRA-9318 at 5/10/15 12:32 AM: - I would venture that a solid load shedding system may improve the degenerate overloading case, but it is not the preferred method for dealing with overloading for most users. The concept of back-pressure is more squarely what people expect, for better or worse. Here is what I think reasonable users want to see, with some variations:
1) The system performs with stability, up to the workload that it is able to handle with stability.
2a) Once it reaches that limit, it starts pushing back in terms of how quickly it accepts new work. This means that it simply blocks the operations or submissions of new requests, with some useful bound that is determined by the system. It does not yet have to shed load. It does not yet have to give exceptions. This is a very reasonable expectation for most users. This is what they expect. Load shedding is a term of art which does not change the users' expectations.
2b) Once it reaches that limit, it starts throwing OE to the client. It does not have to shed load yet. (Perhaps this exception or something like it can be thrown _before_ load shedding occurs.) This is a very reasonable expectation for users who are savvy enough to do active load management at the client level. It may have to start writing hints, but if you are writing hints merely because of load, this might not be the best justification for having the hints system kick in. To me this is inherently a convenient remedy for the wrong problem, even if it works well. Yes, hints are there as a general mechanism, but it does not solve the problem of needing to know when the system is being pushed beyond capacity and how to handle it proactively. You could also say that hints actively hurt capacity when you need them most sometimes.
They are expensive to process given the current implementation, and will always be load shifting even at theoretical best. Still, we need them for node availability concerns, although we should be careful not to use them as a crutch for general capacity issues.
2c) Once it reaches that limit, it starts backlogging (perhaps without a helpful signature of such in the responses, or maybe a BackloggingException with some queue estimate). This is a very reasonable expectation for users who are savvy enough to manage their peak and valley workloads in a sensible way. Sometimes you actually want to tax the ingest and flush side of the system for a bit before allowing it to switch modes and catch up with compaction. The fact that C* can do this is an interesting capability, but those who want backpressure will not easily see it that way.
2d) If the system is being pushed beyond its capacity, then it may have to shed load. This should only happen if the user has decided that they want to be responsible for such and has pushed the system beyond the reasonable limit without paying attention to the indications in 2a, 2b, and 2c. In the current system, this decision is already made for them. They have no choice.
In a more optimistic world, users would get near-optimal performance for a well-tuned workload with back-pressure active throughout the system, or something very much like it. We could call it a different kind of scheduler, different queue management methods, or whatever. As long as the user could prioritize stability at some bounded load over possible instability at an over-saturating load, I think they would in most cases. Like I said, they really don't have this choice right now. I know this is not trivial. We can't remove the need to make sane judgments about sizing and configuration. We might be able to, however, make the system ramp more predictably up to saturation, and behave more reasonably at that level.
Order of precedence, how to designate a mode of operation, and other such concerns aren't really addressed here. I just provided the examples above as types of behaviors which are nuanced yet perfectly valid for different types of system designs. The real point here is that there is not a single overall QoS/capacity/back-pressure behavior which is going to be acceptable to all users. Still, we need to ensure stability under saturating load where possible. I would like to think that with CASSANDRA-8099 we can start discussing some of the client-facing back-pressure ideas more earnestly. I do believe that these ideas are all compatible ideas on a spectrum of behavior. They are not mutually exclusive from a design/implementation perspective. It's possible that they could even be specified per operation, with some traffic yielding to other traffic due to client policies. For example, a lower priority client could yield when it knows the
[jira] [Commented] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator
[ https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14536846#comment-14536846 ] Jonathan Shook commented on CASSANDRA-9318: --- I would venture that a solid load shedding system may improve the degenerate overloading case, but it is not the preferred method for dealing with overloading for most users. The concept of back-pressure is more squarely what people expect, for better or worse. Here is what I think reasonable users want to see, with some variations:
1) The system performs with stability, up to the workload that it is able to handle with stability.
2a) Once it reaches that limit, it starts pushing back in terms of how quickly it accepts new work. This means that it simply blocks the operations or submissions of new requests, with some useful bound that is determined by the system. It does not yet have to shed load. It does not yet have to give exceptions. This is a very reasonable expectation for most users. This is what they expect. Load shedding is a term of art which does not change the users' expectations.
2b) Once it reaches that limit, it starts throwing OE to the client. It does not have to shed load yet. This is a very reasonable expectation for users who are savvy enough to do active load management at the client level. It may have to start writing hints, but if you are writing hints because of load, this might not be the best justification for having the hints system kick in. To me this is inherently a convenient remedy for the wrong problem, even if it works well. Yes, hints are there as a general mechanism, but it does not relieve us of the problem of needing to know when the system is at capacity and how to handle it proactively. You could also say that hints actively hurt capacity when you need them most sometimes. They are expensive to process given the current implementation, and will always be load shifting even at theoretical best.
Still, we need them for node availability concerns, although we should be careful not to use them as a crutch for general capacity issues.
2c) Once it reaches that limit, it starts backlogging (perhaps without a helpful signature of such in the responses, or maybe a BackloggingException with some queue estimate). This is a very reasonable expectation for users who are savvy enough to manage their peak and valley workloads in a sensible way. Sometimes you actually want to tax the ingest and flush side of the system for a bit before allowing it to switch modes and catch up with compaction. The fact that C* can do this is an interesting capability, but those who want backpressure will not easily see it that way.
2d) If the system is being pushed beyond its capacity, then it may have to shed load. This should only happen if the user has decided that they want to be responsible for such and has pushed the system beyond the reasonable limit without paying attention to the indications in 2a, 2b, and 2c.
Order of precedence, designated mode of operation, and other such concerns aren't really addressed here. I just provided them as examples of types of behaviors which are nuanced yet perfectly valid for different types of system designs. The real point here is that there is not a single overall design which is going to be acceptable to all users. Still, we need to ensure stability under saturating load where possible. I would like to think that with CASSANDRA-8099 we can start discussing some of the client-facing back-pressure ideas more earnestly. We can come up with methods to improve the reliable and responsive capacity of the system even with some internal load management. If the first cut ends up being sub-optimal, then we can measure it against non-bounded workload tests and strive to close the gap.
If it is implemented in a way that can support multiple usage scenarios, as described above, then such a limitation might be unlimited, bounded at level ___, or bounded by inline resource management, but in any case it would be controllable by the user/admin or client. If we could ultimately give the categories of users above the ability to enable the various modes, then the 2a) scenario would already be perfectly desirable for many users, even if the back-pressure logic only gave you 70% of the effective system capacity. Once testing shows that performance with active back-pressure to the client is close enough to the unbounded workloads, it could be enabled. Summary: We still need reasonable back-pressure support throughout the system and eventually to the client. Features like this, which can be stepping stones towards that, are still needed. The most perfect load shedding and hinting systems will still not be a sufficient replacement for back-pressure and capacity management. Bound the number of in-flight requests at the coordinator
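The 2a) (block at the bound) and 2b) (fail fast at the bound) behaviors above can both be expressed as one in-flight limiter with a mode switch. This is an illustrative sketch only: `OverloadedError`, `InFlightLimiter`, and the mode names are invented here, not Cassandra's actual types.

```python
import threading

class OverloadedError(Exception):
    """Illustrative stand-in for an overloaded/backpressure exception."""
    pass

class InFlightLimiter:
    """Bounds in-flight requests: mode 'block' gives 2a-style back-pressure
    (callers wait at the bound); mode 'fail-fast' gives 2b-style behavior
    (an exception is raised instead of queueing more work)."""
    def __init__(self, max_in_flight, mode="block"):
        self.sem = threading.Semaphore(max_in_flight)
        self.mode = mode

    def acquire(self):
        if self.mode == "block":
            self.sem.acquire()                      # back-pressure: caller waits
        elif not self.sem.acquire(blocking=False):  # fail-fast: refuse at the bound
            raise OverloadedError("too many in-flight requests")

    def release(self):
        self.sem.release()

limiter = InFlightLimiter(max_in_flight=2, mode="fail-fast")
limiter.acquire()
limiter.acquire()
try:
    limiter.acquire()            # third request exceeds the bound
except OverloadedError:
    print("refused at the bound")
limiter.release()
limiter.acquire()                # capacity freed, accepted again
```

A 2c)-style variant would queue past the bound and report a backlog estimate instead of refusing; the point is that all of these are policies layered on the same bounded counter.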
[jira] [Comment Edited] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator
[ https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14536846#comment-14536846 ] Jonathan Shook edited comment on CASSANDRA-9318 at 5/9/15 7:46 PM: --- I would venture that a solid load-shedding system may improve the degenerate overloading case, but it is not the preferred method for dealing with overload for most users. The concept of back-pressure is more squarely what people expect, for better or worse. Here is what I think reasonable users want to see, with some variations:
1) The system performs with stability, up to the workload that it is able to handle with stability.
2a) Once it reaches that limit, it starts pushing back on how quickly it accepts new work. That is, it simply blocks the submission of new requests within some useful bound determined by the system. It does not yet have to shed load. It does not yet have to throw exceptions. This is a very reasonable expectation for most users; it is what they expect. Load shedding is a term of art which does not change the user's expectations.
2b) Once it reaches that limit, it starts throwing OverloadedException to the client. It does not have to shed load yet. This is a very reasonable expectation for users who are savvy enough to do active load management at the client level. It may have to start writing hints, but if you are writing hints because of load, this might not be the best justification for having the hints system kick in. To me this is inherently a convenient remedy for the wrong problem, even if it works well. Yes, hints are there as a general mechanism, but they do not relieve us of the problem of needing to know when the system is at capacity and how to handle it proactively. You could also say that hints sometimes actively hurt capacity when you need it most. They are expensive to process given the current implementation, and will always be load shifting even at theoretical best. We still need them for node availability concerns, although we should be careful not to use them as a crutch for general capacity issues.
2c) Once it reaches that limit, it starts backlogging (ideally with a helpful signature of such in the responses, maybe a BackloggingException with some queue estimate). This is a very reasonable expectation for users who are savvy enough to manage their peak and valley workloads in a sensible way. Sometimes you actually want to tax the ingest and flush side of the system for a bit before allowing it to switch modes and catch up with compaction. The fact that C* can do this is an interesting capability, but those who want back-pressure will not easily see it that way.
2d) If the system is being pushed beyond its capacity, then it may have to shed load. This should only happen if the user has decided that they want to be responsible for such, and has pushed the system beyond the reasonable limit without paying attention to the indications in 2a, 2b, and 2c.
Order of precedence, designated mode of operation, or any other concerns aren't really addressed here. I just provided the examples above as types of behaviors which are nuanced yet perfectly valid for different types of system designs. The real point here is that there is not a single overall QoS/capacity/back-pressure behavior which is going to be acceptable to all users. Still, we need to ensure stability under saturating load where possible. I would like to think that with CASSANDRA-8099 we can start discussing some of the client-facing back-pressure ideas more earnestly. We can come up with methods to improve the reliable and responsive capacity of the system even with some internal load management. If the first cut ends up being sub-optimal, then we can measure it against non-bounded workload tests and strive to close the gap.
If it is implemented in a way that can support multiple usage scenarios, as described above, then such a limitation might be unlimited, bounded at level ___, or bounded by inline resource management, and in any case it would be controllable by the user/admin or client. If we could ultimately give the categories of users above the ability to enable the various modes, then the 2a) scenario would be perfectly desirable for many users already, even if the back-pressure logic only gave you 70% of the effective system capacity. Once testing shows that performance with active back-pressure to the client is close enough to the unbounded workloads, it could be enabled by default. Summary: We still need reasonable back-pressure support throughout the system and eventually to the client. Features like this that can be a stepping stone towards such are still needed. The most perfect load shedding and hinting systems will still not be a sufficient replacement for back-pressure and capacity management.
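The 2a/2b/2d behaviors above can be sketched as a small admission-control model. This is an illustrative Python sketch, not Cassandra code; the class and mode names are hypothetical.

```python
import threading

class OverloadedError(Exception):
    """Raised in 'reject' mode when the system is at capacity (the 2b behavior)."""

class AdmissionController:
    """Toy admission controller modeling the modes described above.

    mode='block'  -> 2a: callers wait until capacity frees up (back-pressure)
    mode='reject' -> 2b: callers get an exception immediately
    mode='shed'   -> 2d: the request is dropped and counted
    """
    def __init__(self, capacity, mode='block'):
        self._slots = threading.Semaphore(capacity)
        self.mode = mode
        self.shed_count = 0

    def admit(self):
        if self.mode == 'block':
            self._slots.acquire()          # push back: block the submitter
            return True
        if self._slots.acquire(blocking=False):
            return True
        if self.mode == 'reject':
            raise OverloadedError("at capacity")
        self.shed_count += 1               # 'shed' mode: drop and count
        return False

    def release(self):
        self._slots.release()
```

The point of the sketch is that all three modes share one capacity bound; only the policy at the limit differs, which is why a single mechanism could plausibly serve all of the user categories described.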
[jira] [Commented] (CASSANDRA-9234) Disable single-sstable tombstone compactions for DTCS
[ https://issues.apache.org/jira/browse/CASSANDRA-9234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14515852#comment-14515852 ] Jonathan Shook commented on CASSANDRA-9234: --- +1 Disable single-sstable tombstone compactions for DTCS - Key: CASSANDRA-9234 URL: https://issues.apache.org/jira/browse/CASSANDRA-9234 Project: Cassandra Issue Type: Bug Reporter: Marcus Eriksson Assignee: Marcus Eriksson Fix For: 2.0.15 Attachments: 0001-9234.patch We should probably disable tombstone compactions by default for DTCS for these reasons:
# users should not do deletes with DTCS
# the only way we should get rid of data is by TTL - and then we don't want to trigger a single sstable compaction whenever an sstable is 20%+ expired, we want to drop the whole thing when it is fully expired
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8929) Workload sampling
[ https://issues.apache.org/jira/browse/CASSANDRA-8929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14496850#comment-14496850 ] Jonathan Shook commented on CASSANDRA-8929: --- There was an interesting discussion on this today. Notably, user effort to take a workload sample and create a stress tool or profile from it could be very low. It should be possible to take a sample of workload from a development system as the basis for a fully-configured stress test. It would also be less theoretical than any of the other approaches we are currently discussing, so would have a relatively limited scope of implementation. Workload sampling - Key: CASSANDRA-8929 URL: https://issues.apache.org/jira/browse/CASSANDRA-8929 Project: Cassandra Issue Type: New Feature Components: Tools Reporter: Jonathan Ellis Workload *recording* looks to be unworkable (CASSANDRA-6572). We could build something almost as useful by sampling the requests sent to a node and building a synthetic workload with the same characteristics using the same (or anonymized) schema. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (CASSANDRA-8826) Distributed aggregates
[ https://issues.apache.org/jira/browse/CASSANDRA-8826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14486692#comment-14486692 ] Jonathan Shook edited comment on CASSANDRA-8826 at 4/9/15 4:56 AM: --- Consider that many systems are implementing aggregate processing at the client node. A more optimal system would allow those aggregates to be processed close to storage rather than bulk-shipping operands across the wire to the client before any computation can even be started. Even using the coordinator for this is relatively wasteful. After considering multiple options for how to handle aggregates in a Cassandra-idiomatic way, I arrived at pretty much the same place as [~benedict]. The point is not to try to emulate other systems, but to highly optimize a very common and traffic-sensitive usage pattern. The partial data scenarios (CL1) are interesting, but you can easily describe what a reasonable behavior would be if data were missing from a replica. In the most basic case, you simply reflect the standard CL interpretation that the results from these nodes are not consistent at CL=Q. While this is not helpful to clients as such, it is a consistent interpretation of the semantics. The same types of things you might do as a user to deal with it do not change. If the data of interest is consistent, then aggregations of that data will be consistent, and vice-versa. That almost certainly invites more questions about the likely scenario of partial data for near-time reads at CL1. That, to me, is the most interesting and challenging part of this idea. If you simply do active read repair logic as an intermediate step (when needed), you still maintain the same CL semantics that users would expect. Am I missing something that makes this more complicated than I am thinking?
My impression is that the concern for complexity is more fairly placed on the more advanced things that you might build on top of distributed single partition aggregates, not the basic idea of it.
Distributed aggregates -- Key: CASSANDRA-8826 URL: https://issues.apache.org/jira/browse/CASSANDRA-8826 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Robert Stupp Priority: Minor Aggregations have been implemented in CASSANDRA-4914. All calculation is performed on the coordinator. This means that all data is pulled by the coordinator and processed there. This ticket is about distributing aggregates to make them more efficient. Some related tickets (esp. CASSANDRA-8099) are currently in progress - we should wait for them to land before talking about implementation. Another playground (not covered by this ticket) that might be related is _distributed filtering_. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
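The push-down idea above amounts to each replica computing a mergeable partial aggregate so that only small partials, not raw rows, cross the wire. A minimal Python sketch of the merge step (function names are hypothetical; this is not Cassandra code):

```python
def partial_avg(rows):
    """Each replica reduces its local rows to a mergeable partial: (count, sum)."""
    vals = list(rows)
    return (len(vals), sum(vals))

def merge_avg(partials):
    """The coordinator merges the replica partials into the final average."""
    count = sum(c for c, _ in partials)
    total = sum(s for _, s in partials)
    return total / count if count else None

# Three replicas each ship a 2-tuple instead of their raw rows:
replicas = [[1, 2], [3, 4], [5, 6]]
result = merge_avg(partial_avg(r) for r in replicas)
```

Note that AVG must be decomposed into (count, sum) to merge correctly; averaging per-replica averages would be wrong when replicas hold different row counts. That decomposition requirement is exactly what makes some aggregates distribution-friendly and others not.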
[jira] [Commented] (CASSANDRA-8359) Make DTCS consider removing SSTables much more frequently
[ https://issues.apache.org/jira/browse/CASSANDRA-8359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14389501#comment-14389501 ] Jonathan Shook commented on CASSANDRA-8359: --- Linking to CASSANDRA-9056, possibly a duplicate. Make DTCS consider removing SSTables much more frequently - Key: CASSANDRA-8359 URL: https://issues.apache.org/jira/browse/CASSANDRA-8359 Project: Cassandra Issue Type: Improvement Reporter: Björn Hegerfors Assignee: Björn Hegerfors Priority: Minor Attachments: cassandra-2.0-CASSANDRA-8359.txt When I run DTCS on a table where every value has a TTL (always the same TTL), SSTables are completely expired, but still stay on disk for much longer than they need to. I've applied CASSANDRA-8243, but it doesn't make an apparent difference (probably because the subject SSTables are purged via compaction anyway, if not by directly dropping them). Disk size graphs show clearly that tombstones are only removed when the oldest SSTable participates in compaction. In the long run, size on disk continually grows bigger. This should not have to happen. It should easily be able to stay constant, thanks to DTCS separating the expired data from the rest. I think checks for whether SSTables can be dropped should happen much more frequently. This is something that probably only needs to be tweaked for DTCS, but perhaps there's a more general place to put this. Anyway, my thinking is that DTCS should, on every call to getNextBackgroundTask, check which SSTables can be dropped. It would be something like a call to CompactionController.getFullyExpiredSSTables with all non-compacting SSTables sent in as compacting and all other SSTables sent in as overlapping. The returned SSTables, if any, are then added to whichever set of SSTables that DTCS decides to compact. Then before the compaction happens, Cassandra is going to make another call to CompactionController.getFullyExpiredSSTables, where it will see that it can just drop them.
This approach has a bit of redundancy in that it needs to call CompactionController.getFullyExpiredSSTables twice. To avoid that, the code path for deciding SSTables to drop would have to be changed. (Side tracking a little here: I'm also thinking that tombstone compactions could be considered more often in DTCS. Maybe even some kind of multi-SSTable tombstone compaction involving the oldest couple of SSTables...) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
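The drop check being proposed can be modeled outside Cassandra. This is a simplified Python sketch under the assumption that each SSTable can report the expiration time of its newest cell; the real check (CompactionController.getFullyExpiredSSTables) additionally verifies that no overlapping SSTable could shadow the expired data, which this sketch ignores for brevity.

```python
from collections import namedtuple

# Hypothetical minimal model of an SSTable: just a name and the expiration
# timestamp of its newest cell (everything in it is TTL'd, per the ticket).
SSTable = namedtuple('SSTable', ['name', 'max_deletion_time'])

def fully_expired(sstables, now):
    """Return the tables in which every cell's TTL has already passed.

    With DTCS separating data by time window, whole tables age out together,
    so a frequent, cheap check like this can drop them without compaction.
    """
    return [s for s in sstables if s.max_deletion_time < now]
```

Running this on every getNextBackgroundTask call, as suggested above, would keep disk usage flat for pure-TTL workloads instead of waiting for the oldest SSTable to participate in a compaction.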
[jira] [Commented] (CASSANDRA-8986) Major cassandra-stress refactor
[ https://issues.apache.org/jira/browse/CASSANDRA-8986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14368394#comment-14368394 ] Jonathan Shook commented on CASSANDRA-8986: --- It is good to see the discussion move in this direction. [~benedict], All, Nearly all of what you describe in the list of behaviors is on my list for another project as well. Although it's still a fairly new project, there have been some early successes with demos and training tools. Here is a link that explains the project and motives: https://github.com/jshook/metagener/blob/master/metagener-core/docs/README.md I'd be happy to talk in more detail about it. It seems like we have lots of the same ideas about what is needed at the foundational level. It's possible to achieve a drastic simplification of the user-facing part, but only if we are willing to revamp the notion of how we define test loads. RE: distributing test loads: I have been thinking about how to distribute stress across multiple clients as well. The gist of it is that we can't get there without having a way to automatically partition the client workload across some spectrum. As follow-on work, I think it can be done. First we need a conceptually obvious and clean way to define whole test loads such that they can be partitioned compatibly with the behaviors described above. If I can help, given the other work I've been doing, let's keep the conversation going. Major cassandra-stress refactor --- Key: CASSANDRA-8986 URL: https://issues.apache.org/jira/browse/CASSANDRA-8986 Project: Cassandra Issue Type: Improvement Components: Tools Reporter: Benedict Assignee: Benedict We need a tool for both stressing _and_ validating more complex workloads than stress currently supports. Stress needs a raft of changes, and I think it would be easier to deliver many of these as a single major endeavour which I think is justifiable given its audience.
The rough behaviours I want stress to support are:
* Ability to know exactly how many rows it will produce, for any clustering prefix, without generating those prefixes
* Ability to generate an amount of data proportional to the amount it will produce to the server (or consume from the server), rather than proportional to the variation in clustering columns
* Ability to reliably produce near identical behaviour each run
* Ability to understand complex overlays of operation types (LWT, Delete, Expiry, although perhaps not all implemented immediately, the framework for supporting them easily)
* Ability to (with minimal internal state) understand the complete cluster state through overlays of multiple procedural generations
* Ability to understand the in-flight state of in-progress operations (i.e. if we're applying a delete, understand that the delete may have been applied, and may not have been, for potentially multiple conflicting in flight operations)
I think the necessary changes to support this would give us the _functional_ base to support all the functionality I can currently envisage stress needing. Before embarking on this (which I may attempt very soon), it would be helpful to get input from others as to features missing from stress that I haven't covered here that we will certainly want in the future, so that they can be factored in to the overall design and hopefully avoid another refactor one year from now, as its complexity is scaling each time, and each time it is a higher sunk cost. [~jbellis] [~iamaleksey] [~slebresne] [~tjake] [~enigmacurry] [~aweisberg] [~blambov] [~jshook] ... and @everyone else :) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
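The first and third bullets (knowing exact row counts without generating rows, and near-identical behaviour each run) point at seed-stable procedural generation. A Python sketch of the idea, with hypothetical names; this is not the actual stress design:

```python
import hashlib

def stable_hash(*parts):
    """Deterministic 64-bit hash of the run seed plus any key parts.

    Unlike Python's built-in hash(), this is stable across processes and
    runs, which is what makes reproducible workloads possible.
    """
    digest = hashlib.sha256("|".join(map(str, parts)).encode()).digest()
    return int.from_bytes(digest[:8], "big")

def rows_in_partition(seed, pk, min_rows=1, max_rows=100):
    """Row count is a pure function of (seed, pk): the tool can know it
    exactly, for any partition, without materializing a single row."""
    return min_rows + stable_hash(seed, pk, "count") % (max_rows - min_rows + 1)

def row_value(seed, pk, row_index):
    """Cell values are likewise reproducible across runs with the same seed."""
    return stable_hash(seed, pk, row_index) % 10**6
```

Because every quantity is a pure function of the seed and keys, validation can recompute what the cluster *should* contain, and a workload can be partitioned across clients by splitting the key space, which connects to the distribution discussion above.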
[jira] [Comment Edited] (CASSANDRA-8929) Workload sampling
[ https://issues.apache.org/jira/browse/CASSANDRA-8929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14350956#comment-14350956 ] Jonathan Shook edited comment on CASSANDRA-8929 at 3/7/15 12:50 AM: Ideas on how I would like to see this work: (This is where I contradict myself in terms of simplicity by asking for more.) Intercept at the coordinator; only record samples at the coordinator. Make sampling a sticky setting. Make it a table option, but also soft-settable via JMX.
Sampling controls:
* sample_probability: just like trace probability
* sample_interval_seconds: number of seconds for each sampling interval (I can't imagine why we'd need something finer grained, but maybe?)
* sample_max_per_interval: number of samples per sampling interval, after which samples are suppressed. When the interval completes, the number of suppressed samples should also be written to the sample log, and reset. It's ok for this to be inconsistent with respect to restarts, etc. The main purpose is to avoid significant over-sampling load, while still being able to see meaningful data during unexpected bursts.
Data controls, for anonymizing field values when needed: the ability to select a level of obfuscation via sample_data_obfuscate:
* actualfields - no changes; record samples with full field values
* hashedfields - use md5 or something better to hide original sample values, but allow for statistical analysis
* fieldsizes - discard values, but record string lengths and collection counts
* nofields - do not retain the field values
Data coverage: what to record.
* the statement itself
* whether it was prepared or not
* consistency level
* the client address
* any changes to sampling policy or settings - this could be a separate type of record in the sample log, as long as the formatting is stable for each value it encodes
* any counts for suppressed samples (written lazily at unthrottling time)
Workload sampling - Key: CASSANDRA-8929 URL: https://issues.apache.org/jira/browse/CASSANDRA-8929 Project: Cassandra Issue Type: New Feature Components: Tools Reporter: Jonathan Ellis Workload *recording* looks to be unworkable (CASSANDRA-6572). We could build something almost as useful by sampling the requests sent to a node and building a synthetic workload with the same characteristics using the same (or anonymized) schema. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
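The sampling controls proposed above (a probability, an interval, and a per-interval cap with a lazily logged suppressed count) can be sketched as a small sampler. Python, with hypothetical names; the real feature would live server-side at the coordinator:

```python
import random

class StatementSampler:
    """Models sample_probability / sample_interval_seconds / sample_max_per_interval."""
    def __init__(self, probability, interval_seconds, max_per_interval, rng=None):
        self.probability = probability
        self.interval = interval_seconds
        self.cap = max_per_interval
        self.rng = rng or random.Random()
        self._interval_start = 0.0
        self._taken = 0
        self.suppressed = 0
        self.log = []                       # stand-in for the sample log

    def offer(self, statement, now):
        """Return True if the statement was recorded as a sample."""
        if now - self._interval_start >= self.interval:
            if self.suppressed:             # lazily record the suppressed count
                self.log.append(('suppressed', self.suppressed))
            self._interval_start, self._taken, self.suppressed = now, 0, 0
        if self.rng.random() >= self.probability:
            return False                    # probabilistically skipped
        if self._taken >= self.cap:
            self.suppressed += 1            # over the cap: count, don't record
            return False
        self._taken += 1
        self.log.append(('sample', statement))
        return True
```

The cap bounds the sampling overhead during a burst while the suppressed counter preserves the fact that a burst happened, which is the trade-off the proposal is after.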
[jira] [Commented] (CASSANDRA-8929) Workload sampling
[ https://issues.apache.org/jira/browse/CASSANDRA-8929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14350897#comment-14350897 ] Jonathan Shook commented on CASSANDRA-8929: --- The ability to build testing tools around particular workloads is something we have been needing for a long time. I don't understand why the implementation would be complex. It is arguably much simpler than something like tracing, possibly even just a subset of tracing. All that has to be done is to support probabilistic sampling of statements, either at the coordinator or at the replica level. It's not complicated. Capturing the data in sample form is just the first step. The ability to look at a set of captured data and build a reasonably accurate test profile is something that we can't yet do automatically. However, it is something that can be made possible by having the samples. Still, I'd consider analysis of samples as a separate scope, and not the thrust of this request. Consuming sstables offline as a way to generate stress profiles is really avoiding the whole idea of sampling. You might be able to use CDC for that eventually (CASSANDRA-8844). Capturing meaningful samples at a reasonable cost and level of operational simplicity means that we have to treat this as an operational feature worth pursuing. There are other reasons to want sampling besides just feeding stress. There are other testing tools which might make use of the data to help with full-stack testing. I can easily see someone wanting to use samples in an operational monitoring sense as well. Workload sampling - Key: CASSANDRA-8929 URL: https://issues.apache.org/jira/browse/CASSANDRA-8929 Project: Cassandra Issue Type: New Feature Components: Tools Reporter: Jonathan Ellis Workload *recording* looks to be unworkable (CASSANDRA-6572).
We could build something almost as useful by sampling the requests sent to a node and building a synthetic workload with the same characteristics using the same (or anonymized) schema. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8929) Workload sampling
[ https://issues.apache.org/jira/browse/CASSANDRA-8929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14350901#comment-14350901 ] Jonathan Shook commented on CASSANDRA-8929: --- Responding to [~jbellis], as we posted in parallel. Short of having sampling support on the server side, I do not see us getting useful samples. In all the environments that we operate in, the most reliable tools we have are those that are built into Cassandra directly. This feature would allow us to stop reinventing the wheel with users every time we need to understand what their workload is with respect to POCs and forward planning. I've personally started leaning more and more on settraceprobability for this, but it comes with its own caveats. To have something that is more tailored around sampling *just* the statements would save lots of time and energy. This is the type of feature that, when you need it, there is no substitute. If we could go into a new environment and make reasonable suggestions for how to configure sampling up front, we would be able to simply refer back to the data for historic context, changes in workload patterns, changes in data rates, etc. The short answer is, No, I don't know of an easier way, given all the trade-offs. Workload sampling - Key: CASSANDRA-8929 URL: https://issues.apache.org/jira/browse/CASSANDRA-8929 Project: Cassandra Issue Type: New Feature Components: Tools Reporter: Jonathan Ellis Workload *recording* looks to be unworkable (CASSANDRA-6572). We could build something almost as useful by sampling the requests sent to a node and building a synthetic workload with the same characteristics using the same (or anonymized) schema. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (CASSANDRA-8929) Workload sampling
[ https://issues.apache.org/jira/browse/CASSANDRA-8929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14350956#comment-14350956 ] Jonathan Shook edited comment on CASSANDRA-8929 at 3/6/15 9:48 PM: --- Ideas on how I would like to see this work: (This is where I contradict myself in terms of simplicity by asking for more.) Intercept at the coordinator, only record samples at the coordinator. Make sampling a sticky setting. Make it a table option, but also soft-settable via JMX. Sampling controls: * sample_probability: Just like trace probability * sample_interval_seconds: Number of seconds for each sampling interval (I can't imagine why we'd need something finer grained, but maybe?) * sample_max_per_interval: explained below sample_max_per_interval: Number of samples per sampling interval, after which samples are suppressed. In this case, when the interval completes, the number of suppressed samples should also be written to the sample log, and reset. It's ok for this to be inconsistent with respect to restarts, etc. The main purpose it to avoid significant over sampling load, while still being able to see meaningful data during unexpected bursts. Data controls, for anonymizing field values, when needed, the ability to select a level of obfuscation: sample_data_obfuscate: * actual - No changes, record samples with full field values * hashed - Use md5 or something better to hide original sample values, but allow for statistical analysis * sizes - Discard value, but record string lengths and collection counts Data coverage: What to record. 
* the statement itself
* whether it was prepared or not
* consistency level
* the client address
* any changes to sampling policy or settings - This could be a separate type of record in the sample log, as long as the formatting is stable for each value it encodes
* any counts for suppressed samples (written lazily at unthrottling time)
Workload sampling - Key: CASSANDRA-8929 URL: https://issues.apache.org/jira/browse/CASSANDRA-8929 Project: Cassandra Issue Type: New Feature Components: Tools Reporter: Jonathan Ellis Workload *recording* looks to be unworkable (CASSANDRA-6572). We could build something almost as useful by sampling the requests sent to a node and building a synthetic workload with the same characteristics using the same (or anonymized) schema. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
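The throttling semantics proposed above (a per-interval cap with suppressed-sample counts written lazily when the interval rolls over) can be sketched as follows. This is a minimal illustration, not Cassandra code; the parameter names mirror the proposed settings, and the in-memory `log` list is a stand-in for the hypothetical sample log.

```python
import random
import time

class ThrottledSampler:
    """Sketch of the proposed coordinator-side sampling throttle: sample with
    a fixed probability, cap samples per interval, and lazily record the count
    of suppressed samples when a new interval begins."""

    def __init__(self, sample_probability, sample_interval_seconds,
                 sample_max_per_interval, clock=time.monotonic):
        self.sample_probability = sample_probability
        self.sample_interval_seconds = sample_interval_seconds
        self.sample_max_per_interval = sample_max_per_interval
        self.clock = clock
        self.interval_start = clock()
        self.taken = 0
        self.suppressed = 0
        self.log = []  # stand-in for the sample log

    def _roll_interval(self, now):
        if now - self.interval_start >= self.sample_interval_seconds:
            if self.suppressed:
                # written lazily at unthrottling time, per the proposal
                self.log.append(("suppressed", self.suppressed))
            self.interval_start = now
            self.taken = 0
            self.suppressed = 0

    def offer(self, statement):
        """Returns True if the statement was sampled, False otherwise."""
        now = self.clock()
        self._roll_interval(now)
        if random.random() >= self.sample_probability:
            return False
        if self.taken >= self.sample_max_per_interval:
            self.suppressed += 1  # counted, but not recorded individually
            return False
        self.taken += 1
        self.log.append(("sample", statement))
        return True
```

The point of the design is that a burst beyond the cap costs only a counter increment, so sampling load stays bounded while the suppressed count still shows that the burst happened.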
[jira] [Commented] (CASSANDRA-8869) Normalize prepared query text
[ https://issues.apache.org/jira/browse/CASSANDRA-8869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14339310#comment-14339310 ] Jonathan Shook commented on CASSANDRA-8869: --- We looked at this in more detail. It initially looked like a trivial change, but after digging in a bit, there are some potentially thorny issues in how it might behave in practice. For example, if you are using 75% of the allotted space for prepared statement caching, a rolling upgrade churning the cache could seriously impact a production load. Fixing this might not be worth the risk, or even the trouble of logging a warning. Normalize prepared query text - Key: CASSANDRA-8869 URL: https://issues.apache.org/jira/browse/CASSANDRA-8869 Project: Cassandra Issue Type: Improvement Components: API Reporter: Michael Penick Priority: Trivial Labels: lhf It's possible for equivalent queries with different case and/or whitespace to resolve to different prepared statement hashes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
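The reported behavior, and the subtlety behind it, can be illustrated with a short sketch. This is not Cassandra's actual hashing code; it just shows that an ID derived from raw query text differs for equivalent queries, and why a naive normalization is risky:

```python
import hashlib
import re

def statement_id(query: str) -> str:
    # An ID derived from the raw query text: equivalent queries with
    # different case or whitespace get different IDs.
    return hashlib.md5(query.encode("utf-8")).hexdigest()

def normalize(query: str) -> str:
    # Naive normalization: collapse whitespace and lowercase everything.
    # Real CQL normalization would have to preserve case inside quoted
    # identifiers and string literals, which is part of what makes this
    # less trivial than it first appears.
    return re.sub(r"\s+", " ", query).strip().lower()
```

Note also the cache-churn concern above: changing how IDs are computed invalidates every cached prepared statement across a rolling upgrade, which is a behavioral cost independent of the normalization logic itself.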
[jira] [Commented] (CASSANDRA-8406) Add option to set max_sstable_age in seconds in DTCS
[ https://issues.apache.org/jira/browse/CASSANDRA-8406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14306403#comment-14306403 ] Jonathan Shook commented on CASSANDRA-8406: --- +1 on 0001-8406.patch Add option to set max_sstable_age in seconds in DTCS Key: CASSANDRA-8406 URL: https://issues.apache.org/jira/browse/CASSANDRA-8406 Project: Cassandra Issue Type: Bug Reporter: Marcus Eriksson Assignee: Marcus Eriksson Fix For: 2.0.13 Attachments: 0001-8406.patch, 0001-patch.patch Using days as the unit for max_sstable_age in DTCS might be too much, add option to set it in seconds -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8621) For streaming operations, when a socket is closed/reset, we should retry/reinitiate that stream
[ https://issues.apache.org/jira/browse/CASSANDRA-8621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14279768#comment-14279768 ] Jonathan Shook commented on CASSANDRA-8621: --- For the scenario that prompted this ticket, the streaming process appeared to be completely stalled. One side of the stream (the sender side) had an exception that appeared to be a connection reset. The receiving side appeared to think the connection was still active, at least according to the netstats reported by nodetool. We were unable to verify whether this was actually the case in terms of connected sockets, because there were multiple streams for those peers and there is no simple way to correlate a specific stream to a TCP session. [~yukim] If there is a diagnostic method we can use to get more information about specific stalled streams, please let us know so that we can approach the user for more data. For streaming operations, when a socket is closed/reset, we should retry/reinitiate that stream --- Key: CASSANDRA-8621 URL: https://issues.apache.org/jira/browse/CASSANDRA-8621 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Jeremy Hanna Assignee: Yuki Morishita Currently we have a setting (streaming_socket_timeout_in_ms) that will timeout and retry the stream operation in the case where tcp is idle for a period of time. However in the case where the socket is closed or reset, we do not retry the operation. This can happen for a number of reasons, including when a firewall sends a reset message on a socket during a streaming operation, such as nodetool rebuild necessarily across DCs or repairs. Doing a retry would make the streaming operations more resilient. It would be good to log the retry clearly as well (with the stream session ID and node address). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
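The retry-and-log behavior the ticket asks for could look something like the sketch below. This is a hypothetical wrapper, not Cassandra's streaming code; the function and parameter names are illustrative, and it only shows the shape of the logic (catch reset/closed-socket errors, log with session ID and peer address, reinitiate with backoff):

```python
import logging
import time

log = logging.getLogger("streaming")

def stream_with_retry(send_stream, session_id, peer,
                      max_retries=3, backoff_seconds=1.0):
    """Hypothetical retry wrapper: reinitiate a stream when the socket is
    closed/reset, logging the stream session ID and peer on each retry."""
    attempt = 0
    while True:
        try:
            return send_stream()
        except (ConnectionResetError, BrokenPipeError) as e:
            attempt += 1
            if attempt > max_retries:
                log.error("stream %s to %s failed after %d retries: %s",
                          session_id, peer, max_retries, e)
                raise
            log.warning("stream %s to %s reset (%s); retry %d/%d",
                        session_id, peer, e, attempt, max_retries)
            time.sleep(backoff_seconds * attempt)
```

Logging the session ID on each retry is what would make stalls like the one described above diagnosable, since it ties a specific stream to the reconnection attempts.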
[jira] [Commented] (CASSANDRA-8621) For streaming operations, when a socket is closed/reset, we should retry/reinitiate that stream
[ https://issues.apache.org/jira/browse/CASSANDRA-8621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14279774#comment-14279774 ] Jonathan Shook commented on CASSANDRA-8621: --- Also, there were no TCP-level errors showing on the receiving side, so it is unclear whether exceptions are being swallowed or whether something genuinely strange was happening with the network. For streaming operations, when a socket is closed/reset, we should retry/reinitiate that stream --- Key: CASSANDRA-8621 URL: https://issues.apache.org/jira/browse/CASSANDRA-8621 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Jeremy Hanna Assignee: Yuki Morishita Currently we have a setting (streaming_socket_timeout_in_ms) that will timeout and retry the stream operation in the case where tcp is idle for a period of time. However in the case where the socket is closed or reset, we do not retry the operation. This can happen for a number of reasons, including when a firewall sends a reset message on a socket during a streaming operation, such as nodetool rebuild necessarily across DCs or repairs. Doing a retry would make the streaming operations more resilient. It would be good to log the retry clearly as well (with the stream session ID and node address). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8371) DateTieredCompactionStrategy is always compacting
[ https://issues.apache.org/jira/browse/CASSANDRA-8371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14268134#comment-14268134 ] Jonathan Shook commented on CASSANDRA-8371: --- [~Bj0rn], [~michaelsembwever] Is there any new data on this? Any changes to settings or observations since the last major update? DateTieredCompactionStrategy is always compacting -- Key: CASSANDRA-8371 URL: https://issues.apache.org/jira/browse/CASSANDRA-8371 Project: Cassandra Issue Type: Bug Components: Core Reporter: mck Assignee: Björn Hegerfors Labels: compaction, performance Attachments: java_gc_counts_rate-month.png, read-latency-recommenders-adview.png, read-latency.png, sstables-recommenders-adviews.png, sstables.png, vg2_iad-month.png Running 2.0.11 and having switched a table to [DTCS|https://issues.apache.org/jira/browse/CASSANDRA-6602] we've seen that disk IO and gc count increase, along with the number of reads happening in the compaction hump of cfhistograms. Data, and generally performance, looks good, but compactions are always happening, and pending compactions are building up. The schema for this is
{code}
CREATE TABLE search (
    loginid text,
    searchid timeuuid,
    description text,
    searchkey text,
    searchurl text,
    PRIMARY KEY ((loginid), searchid)
);
{code}
We're sitting on about 82G (per replica) across 6 nodes in 4 DCs. CQL executed against this keyspace, and traffic patterns, can be seen in slides 7+8 of https://prezi.com/b9-aj6p2esft/ Attached are sstables-per-read and read-latency graphs from cfhistograms, and screenshots of our munin graphs as we have gone from STCS, to LCS (week ~44), to DTCS (week ~46). These screenshots are also found in the prezi on slides 9-11. [~pmcfadin], [~Bj0rn], Can this be a consequence of occasional deleted rows, as is described under (3) in the description of CASSANDRA-6602 ? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8303) Provide strict mode for CQL Queries
[ https://issues.apache.org/jira/browse/CASSANDRA-8303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14266550#comment-14266550 ] Jonathan Shook commented on CASSANDRA-8303: --- It might be nice if the auth system were always in play (when that auth provider is set), with the system defaults applied to a virtual role with a name like defaults. This cleans up any layering questions by casting the yaml defaults into the authz conceptual model. If a user isn't assigned to another defined role, they should be automatically assigned to the defaults role. Otherwise, explaining the result of layering the two mechanisms, even with precedence, might become overly cumbersome. With a defaults role, both can be used together. Provide strict mode for CQL Queries - Key: CASSANDRA-8303 URL: https://issues.apache.org/jira/browse/CASSANDRA-8303 Project: Cassandra Issue Type: Improvement Reporter: Anupam Arora Fix For: 3.0 Please provide a strict mode option in cassandra that will kick out any CQL queries that are expensive, e.g. any query with ALLOW FILTERING, multi-partition queries, secondary index queries, etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
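The "virtual defaults role" idea above can be sketched in a few lines. Everything here is hypothetical (the role and permission names are invented for illustration); the point is that yaml-level defaults become just another role, so exactly one layer applies to any user:

```python
# Yaml-level restrictions modeled as a role named "defaults"; a user with
# no explicitly assigned role falls back to it, avoiding any layering.
DEFAULTS_ROLE = "defaults"

ROLE_PERMISSIONS = {
    # hypothetical permission flags in the spirit of the strict-mode proposal
    "defaults":  {"ALLOW_FILTERING": False, "MULTI_PARTITION_QUERIES": False},
    "analytics": {"ALLOW_FILTERING": True,  "MULTI_PARTITION_QUERIES": True},
}

USER_ROLES = {"etl_user": "analytics"}  # explicit assignments only

def effective_role(user):
    # Exactly one role ever applies: the assigned one, else defaults.
    return USER_ROLES.get(user, DEFAULTS_ROLE)

def is_allowed(user, permission):
    return ROLE_PERMISSIONS[effective_role(user)].get(permission, False)
```

Because resolution picks a single role rather than merging layers, there is no precedence question to explain: a query check consults one permission set, period.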
[jira] [Commented] (CASSANDRA-8303) Provide strict mode for CQL Queries
[ https://issues.apache.org/jira/browse/CASSANDRA-8303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14261885#comment-14261885 ] Jonathan Shook commented on CASSANDRA-8303: --- A permission that might be helpful to add to the list: UNPREPARED_STATEMENTS. I can easily see unprepared statements being disallowed in some environments, such as prod app accounts. Provide strict mode for CQL Queries - Key: CASSANDRA-8303 URL: https://issues.apache.org/jira/browse/CASSANDRA-8303 Project: Cassandra Issue Type: Improvement Reporter: Anupam Arora Fix For: 3.0 Please provide a strict mode option in cassandra that will kick out any CQL queries that are expensive, e.g. any query with ALLOW FILTERING, multi-partition queries, secondary index queries, etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332)