[jira] [Commented] (CASSANDRA-6694) Slightly More Off-Heap Memtables
[ https://issues.apache.org/jira/browse/CASSANDRA-6694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13944357#comment-13944357 ]

Benedict commented on CASSANDRA-6694:
-------------------------------------

FTR, with a very simple and short performance comparison, simulating writing a lot of small (integer) fields, using cassandra-stress write n=40 -col size=fixed\(4\) n=fixed\(100\), I see a 25% throughput improvement using offheap_objects as the allocator type vs either on- or off-heap buffers. I should expect to see performance improve further as the length of the test increases, as write amplification takes its toll more rapidly on the heap buffers.

Slightly More Off-Heap Memtables
--------------------------------

                 Key: CASSANDRA-6694
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-6694
             Project: Cassandra
          Issue Type: Improvement
          Components: Core
            Reporter: Benedict
            Assignee: Benedict
              Labels: performance
             Fix For: 2.1 beta2

The off-heap memtables introduced in CASSANDRA-6689 don't go far enough: the on-heap overhead is still very large. It should not be tremendously difficult to extend these changes so that we allocate entire Cells off-heap, instead of multiple BBs per Cell (with all their associated overhead). The goal (if possible) is to reach an overhead of 16 bytes per Cell, plus 4-6 bytes per cell on average for the btree overhead, for a total overhead of around 20-22 bytes. This translates to an 8-byte object overhead, a 4-byte address (we will do alignment tricks like the VM to allow us to address a reasonably large memory space, although this trick is unlikely to last us forever, at which point we will have to bite the bullet and accept a 24-byte-per-cell overhead), and a 4-byte object reference for maintaining our internal list of allocations. That list is unfortunately necessary, since we cannot otherwise safely (and cheaply) walk the object graph we allocate, which is needed for (allocation-)compaction and pointer rewriting.
The ugliest thing here is going to be implementing the various CellName instances so that they may be backed by native memory OR heap memory.

--
This message was sent by Atlassian JIRA (v6.2#6252)
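The byte accounting in the description can be checked with a quick sketch; the constant names here are ours for illustration, not Cassandra identifiers:

```python
# Per-cell overhead target from the ticket: one on-heap handle per off-heap Cell.
OBJECT_HEADER = 8    # JVM object header for the on-heap handle
NATIVE_ADDRESS = 4   # compressed off-heap address (via alignment tricks)
ALLOCATION_REF = 4   # reference kept in the internal allocation list

PER_CELL = OBJECT_HEADER + NATIVE_ADDRESS + ALLOCATION_REF   # 16 bytes

BTREE_MIN, BTREE_MAX = 4, 6   # average btree overhead per cell

def total_overhead(btree_bytes):
    # Total per-cell overhead once the amortised btree cost is included.
    return PER_CELL + btree_bytes

assert PER_CELL == 16
assert (total_overhead(BTREE_MIN), total_overhead(BTREE_MAX)) == (20, 22)
```

If the 4-byte compressed address ever stops being enough, swapping NATIVE_ADDRESS to 8 yields the 24-byte fallback the ticket mentions.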
[jira] [Comment Edited] (CASSANDRA-6694) Slightly More Off-Heap Memtables
[ https://issues.apache.org/jira/browse/CASSANDRA-6694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13944357#comment-13944357 ]

Benedict edited comment on CASSANDRA-6694 at 3/23/14 7:05 AM:
--------------------------------------------------------------

FTR, with a very simple and short performance comparison, simulating writing a lot of small (integer) fields, using cassandra-stress write n=40 -col size=fixed\(4\) n=fixed\(100\), I see a 25% throughput improvement using offheap_objects as the allocator type vs either on- or off-heap buffers. I would expect to see performance improve further as the length of the test increases, as write amplification takes its toll more rapidly on the heap buffers, but I don't intend to test this much further, as it was just to get a ballpark idea of how much impact it might have.

was (Author: benedict):
FTR, with a very simple and short performance comparison, simulating writing a lot of small (integer) fields, using cassandra-stress write n=40 -col size=fixed\(4\) n=fixed\(100\), I see a 25% throughput improvement using offheap_objects as the allocator type vs either on/off heap buffers. I should expect to see performance improve further as the length of the test increases, as write amplification takes its toll more rapidly on the heap buffers.
[jira] [Updated] (CASSANDRA-6746) Reads have a slow ramp up in speed
[ https://issues.apache.org/jira/browse/CASSANDRA-6746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pavel Yaskevich updated CASSANDRA-6746:
---------------------------------------
    Attachment: buffered-io-tweaks.patch

[~enigmacurry] Here is a patch (rebased against the latest cassandra-2.1 branch) which should improve the warm-up period (it does on my SSD machine). What it does is simple: it sets all RARs to FADV_RANDOM (for the whole file); when SegmentedFile.getSegment(position) is called on a PoolingSegmentedFile (enabled by setting 'disk_access_mode: standard' in cassandra.yaml), it marks the first buffer, 64KB by default, as a sequential area and does FADV_WILLNEED on the first page starting from position. That works as a kind of smart read-ahead (if we discard the idea that we are already thrashing by pulling in 64KB to read one small row). Can you please test it on your HDD machines to see whether it actually works in an environment with higher I/O latencies? Another useful test would be to run this code in mixed write/read mode, to effectively check how good the kernel's page replacement mechanism is :)

P.S. Please set the device read-ahead (blockdev --setra ...) back to its default value before running the tests.

Reads have a slow ramp up in speed
----------------------------------

                 Key: CASSANDRA-6746
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-6746
             Project: Cassandra
          Issue Type: Bug
          Components: Core
            Reporter: Ryan McGuire
            Assignee: Benedict
              Labels: performance
             Fix For: 2.1 beta2
         Attachments: 2.1_vs_2.0_read.png, 6746-patched.png, 6746.blockdev_setra.full.png, 6746.blockdev_setra.zoomed.png, 6746.txt, buffered-io-tweaks.patch, cassandra-2.0-bdplab-trial-fincore.tar.bz2, cassandra-2.1-bdplab-trial-fincore.tar.bz2

On a physical four-node cluster I am doing a big write and then a big read. The read takes a long time to ramp up to respectable speeds.

!2.1_vs_2.0_read.png!

[See data here|http://ryanmcguire.info/ds/graph/graph.html?stats=stats.2.1_vs_2.0_vs_1.2.retry1.json&metric=interval_op_rate&operation=stress-read&smoothing=1]
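The fadvise sequence the patch describes can be sketched with the POSIX hints Python exposes. This is a simplified, Linux-only illustration of the two steps; the function names and the 4KB page size are our assumptions, not part of the actual patch:

```python
import os

PAGE = 4096          # assumed page size
BUFFER = 64 * 1024   # the default RAR buffer size mentioned above

def open_random_access(path):
    # Step 1 of the patch's approach: advise FADV_RANDOM over the whole
    # file (length 0) so the kernel stops doing its own read-ahead.
    fd = os.open(path, os.O_RDONLY)
    os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_RANDOM)
    return fd

def get_segment(fd, position):
    # Step 2: on a segment request, FADV_WILLNEED only the first page at
    # `position`, a "smart read-ahead" that avoids pulling in a full 64KB
    # buffer to read one small row. Returns the advised page offset.
    page_start = (position // PAGE) * PAGE
    os.posix_fadvise(fd, page_start, PAGE, os.POSIX_FADV_WILLNEED)
    return page_start
```

os.posix_fadvise is only available where the platform provides it (Linux, Python 3.3+); Cassandra itself issues these hints through JNA.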
[jira] [Comment Edited] (CASSANDRA-6746) Reads have a slow ramp up in speed
[ https://issues.apache.org/jira/browse/CASSANDRA-6746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13944388#comment-13944388 ]

Pavel Yaskevich edited comment on CASSANDRA-6746 at 3/23/14 9:31 AM:
---------------------------------------------------------------------

[~enigmacurry] Here is a patch (rebased against the latest cassandra-2.1 branch) which should improve the warm-up period (it does on my SSD machine). What it does is simple: it sets all RARs to FADV_RANDOM (for the whole file); when SegmentedFile.getSegment(position) is called on a PoolingSegmentedFile (enabled by setting 'disk_access_mode: standard' in cassandra.yaml), it marks the first buffer, 64KB by default, as a sequential area and does FADV_WILLNEED on the first page starting from position. That works as a kind of smart read-ahead (if we discard the idea that we are already thrashing by pulling in 64KB to read one small row), because getSegment(position) for buffered files points to the start of the row. Can you please test it on your HDD machines to see whether it actually works in an environment with higher I/O latencies? Another useful test would be to run this code in mixed write/read mode, to effectively check how good the kernel's page replacement mechanism is :)

P.S. Please set the device read-ahead (blockdev --setra ...) back to its default value before running the tests.

was (Author: xedin):
[~enigmacurry] Here is a patch (rebased with the latest cassandra-2.1 branch) which should improve the warm up period (it does on my SSD machine), what it does is simple - sets all RAR to FADV_RANDOM (whole file), when SegmentedFile.getSegment(position) is called on PoolingSegmentedFile (which is enabled by setting 'disk_access_mode: standard' in cassandra.yaml) it would mark first buffer, 64KB by default, as sequential area and do FADV_WILLNEED on the first page starting from position, that works as kind of of smart read-ahead (if we discard they idea that we already thashing by polling 64KB to read one small row). Can you please test it on your HDD machines to see if that actually works in the environment with higher I/O latencies?... Another useful test would be to test this code in mixed write/read mode to effectively check how good is page replacement mechanism in the kernel :) P.S. please set device read-ahead (blockdev --setra ...) back to it's default value before doing the tests.
[jira] [Created] (CASSANDRA-6908) Dynamic endpoint snitch destabilizes cluster under heavy load
Bartłomiej Romański created CASSANDRA-6908:
-------------------------------------------

             Summary: Dynamic endpoint snitch destabilizes cluster under heavy load
                 Key: CASSANDRA-6908
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-6908
             Project: Cassandra
          Issue Type: Improvement
            Reporter: Bartłomiej Romański

We observe that with the dynamic snitch disabled our cluster is much more stable than with it enabled.

We've got a 15-node cluster with pretty strong machines (2x E5-2620, 64 GB RAM, 2x 480 GB SSD). We mostly do reads (about 300k/s).

We use Astyanax on the client side with the TOKEN_AWARE option enabled. It automatically directs read queries to one of the nodes responsible for the given token. In that case, with the dynamic snitch disabled, Cassandra always handles the read locally. With the dynamic snitch enabled, Cassandra very often decides to proxy the read to some other node. This causes much higher CPU usage and produces much more garbage, which results in more frequent GC pauses (the young generation fills up more quickly). By "much higher" and "much more" I mean 1.5-2x.

I'm aware that a higher dynamic_snitch_badness_threshold value should solve that issue. The default value is 0.1. I've looked at the scores exposed in JMX, and the problem is that our values seem to be completely random. They are usually between 0.5 and 2.0, but change randomly every time I hit refresh. Of course, I could set dynamic_snitch_badness_threshold to 5.0 or so, but the result would be similar to simply disabling the dynamic snitch altogether (which is what we did).

I've tried to understand the logic behind these scores, and I'm not sure I get the idea. The score is a sum (without any multipliers) of two components:
- the ratio of the given node's recent latency to the recent average node latency
- something called 'severity', which, if I read the code correctly, is the result of BackgroundActivityMonitor.getIOWait(): the ratio of iowait CPU time to total CPU time as reported in /proc/stats, multiplied by 100

In our case the second value is around 0-2%, but it varies quite heavily from second to second.

What's the idea behind simply adding these two values without any multipliers (e.g. the second one is a percentage while the first one is not)? Are we sure this is the best possible way of calculating the final score?

Is there a way to force Cassandra to use (much) longer samples? In our case we probably need that to get stable values. The 'severity' is calculated every second. The mean latency is calculated based on some magic, hardcoded values (ALPHA = 0.75, WINDOW_SIZE = 100). Am I right that there's no way to tune that without hacking the code? I'm aware of the dynamic_snitch_update_interval_in_ms property in the config file, but that only determines how often the scores are recalculated, not how long the samples are. Is that correct?

To sum up, it would be really nice to have more control over the dynamic snitch behavior, or at least an official option to disable it, documented in the default config file (it took me some time to discover that we can just disable it instead of hacking around with dynamic_snitch_badness_threshold=1000). Currently, for some scenarios (like ours: optimized cluster, token-aware client, heavy load) it causes more harm than good.
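The score composition questioned in the report can be modeled roughly as follows. This is a sketch of the described behavior, not Cassandra's DynamicEndpointSnitch code; the helper names are ours, and only the two constants come from the report:

```python
ALPHA = 0.75        # hardcoded smoothing factor cited in the report
WINDOW_SIZE = 100   # hardcoded sample-window size cited in the report

def ema(samples, alpha=ALPHA):
    # Exponentially weighted moving average over up to WINDOW_SIZE samples,
    # with newer samples weighted by alpha.
    acc = samples[0]
    for s in samples[1:WINDOW_SIZE]:
        acc = alpha * s + (1 - alpha) * acc
    return acc

def score(node_latencies, all_latencies, iowait_fraction):
    latency_ratio = ema(node_latencies) / ema(all_latencies)  # unitless, ~1.0
    severity = iowait_fraction * 100                          # a percentage
    # The two terms are summed with no scaling, which is the unit mismatch
    # the report complains about: a 1% iowait blip adds a full 1.0 to a
    # latency ratio that is itself around 1.0.
    return latency_ratio + severity

print(score([10.0] * 5, [10.0] * 5, 0.01))  # 2.0 for an otherwise ideal node
```

With scores jumping by whole units on every iowait sample, it is plausible that the JMX values look random from refresh to refresh, as observed.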
[jira] [Created] (CASSANDRA-6909) A way to expire columns without converting to tombstones
Bartłomiej Romański created CASSANDRA-6909:
-------------------------------------------

             Summary: A way to expire columns without converting to tombstones
                 Key: CASSANDRA-6909
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-6909
             Project: Cassandra
          Issue Type: New Feature
          Components: Core
            Reporter: Bartłomiej Romański

Imagine the following scenario:
- You need to store some data, knowing that you will need it only for a limited time (say, 7 days).
- After that you just don't care. You don't need the data to be returned by queries, but if it is returned that's not a problem at all; you won't look at it anyway.
- Your records are small. Row keys and column names are even longer than the actual values (e.g. ints vs strings).
- You reuse rows. You add some new columns to most of the rows every day or two. This means that columns expire often, but rows usually don't.
- You generate a lot of data and want to make sure that expired records do not consume disk space for too long.

The current TTL feature does not handle that situation well. When compaction finally decides it's worth compacting a given sstable, it won't simply get rid of expired columns; instead it transforms them into tombstones. For small values that's no saving at all. Even if you set the grace period to 0, tombstones cannot be removed early, because some other sstable may still hold values that should be covered by the tombstone. You can get rid of a tombstone in only two cases:
- it's a major compaction (which never happens with LCS and requires a lot of space with STCS)
- bloom filters tell you that no other sstable contains this row key

The second case is rare if you usually have multiple columns in a single row that were not written at once. There's a good chance your row is spread across multiple sstables, and since new ones are generated from time to time, there's very little chance they'll all meet in one compaction at some point.

What's funny, bloom filters return true if there's a tombstone for the given row in the given sstable. So you won't remove tombstones during compaction, because there's some other tombstone in another sstable for that row :/ After a while, you end up with a lot of tombstones (the majority of your data) and can do nothing about it.

Now imagine that Cassandra knows we just don't care about data older than 7 days. Firstly, it can simply drop such columns during compactions (without converting them to tombstones or anything like that). Secondly, if it detects an sstable older than 7 days, it can safely remove the whole file (it cannot contain any live data). These two rules *guarantee* that your data will be removed within 14 days (2x TTL): if a compaction runs within 7 days, the expired data is removed then; if not, the whole sstable is removed after another 7 days.

That's what I expected from CASSANDRA-3974, but it turned out to be just a trivial frontend feature. I suggest rethinking this mechanism. I don't believe it's a common scenario that someone who sets a TTL for a whole CF needs all the strong guarantees that data will not reappear in the future in case of some consistency issues (which is why we need this whole mess with tombstones). I believe the common case for a per-CF TTL is that you just want an efficient way to recover your disk space (and improve read performance by having fewer sstables and less data in general).

To work around this we currently stop Cassandra periodically, simply remove sstables that are too old, and start it back up. That works OK, but does not solve the problem fully (if a tombstone is rewritten by compactions often enough, we will never remove it).
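The two proposed rules and the 2x TTL retention bound can be sketched with a toy model. The data layout here (lists of (name, day_written) tuples standing in for sstables) is a deliberate simplification, not Cassandra's storage format:

```python
TTL = 7  # days; a hypothetical table-wide TTL

def compact(columns, now):
    # Rule 1: during compaction, drop expired columns outright instead of
    # writing tombstones. `columns` is a list of (name, day_written) tuples.
    return [(name, day) for name, day in columns if now - day < TTL]

def sweep_sstables(sstables, now):
    # Rule 2: an sstable whose newest write is older than TTL cannot hold
    # live data, so the whole file can be deleted without compacting it.
    return [s for s in sstables if now - max(day for _, day in s) < TTL]

# A column written on day 0 into an sstable that is never compacted is still
# removed once the sstable's newest write ages past TTL, bounding retention
# at roughly 2x TTL as the description argues.
tables = [[("a", 0), ("b", 6)]]
assert compact(tables[0], 7) == [("b", 6)]   # "a" expired; no tombstone written
assert sweep_sstables(tables, 13) == []      # whole file dropped on day 13
```

Nothing in either rule consults other sstables, which is exactly why the tombstone/grace-period machinery can be skipped when the TTL applies to the whole table.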
[jira] [Updated] (CASSANDRA-6909) A way to expire columns without converting to tombstones
[ https://issues.apache.org/jira/browse/CASSANDRA-6909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bartłomiej Romański updated CASSANDRA-6909:
-------------------------------------------
    Description: edited (wording only; the full text is quoted in the creation notice above)
[jira] [Updated] (CASSANDRA-6908) Dynamic endpoint snitch destabilizes cluster under heavy load
[ https://issues.apache.org/jira/browse/CASSANDRA-6908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bartłomiej Romański updated CASSANDRA-6908:
-------------------------------------------
    Component/s: Core
                 Config
[jira] [Commented] (CASSANDRA-6908) Dynamic endpoint snitch destabilizes cluster under heavy load
[ https://issues.apache.org/jira/browse/CASSANDRA-6908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13944423#comment-13944423 ]

Brandon Williams commented on CASSANDRA-6908:
---------------------------------------------

What version are you on? We removed the latency calculation somewhat recently in 2.0.
[jira] [Commented] (CASSANDRA-876) Support session (read-after-write) consistency
[ https://issues.apache.org/jira/browse/CASSANDRA-876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1396#comment-1396 ]

Muhammad Adel commented on CASSANDRA-876:
-----------------------------------------

Is this issue still open for the latest version of Cassandra? As far as I understand from reading various documentation and articles about Memtables, they are already searched for data before the SSTables when performing a query.

Support session (read-after-write) consistency
----------------------------------------------

                 Key: CASSANDRA-876
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-876
             Project: Cassandra
          Issue Type: New Feature
          Components: Core
            Reporter: Jonathan Ellis
            Priority: Minor
              Labels: gsoc, gsoc2010
         Attachments: 876-v2.txt, CASSANDRA-876.patch

In http://www.allthingsdistributed.com/2007/10/amazons_dynamo.html and http://www.allthingsdistributed.com/2008/12/eventually_consistent.html Amazon discusses the concept of eventual consistency. Cassandra uses eventual consistency in a design similar to Dynamo's.

Supporting session consistency would be useful and relatively easy to add: we already have the concept of a Memtable (see http://wiki.apache.org/cassandra/MemtableSSTable) to stage updates in before flushing to disk. If we applied mutations to a session-level memtable on the coordinator machine (that is, the machine the client is connected to), and then did a final merge from that table against query results before handing them to the client, we'd get it almost for free. Of course, the devil is in the details; Thrift doesn't provide any hooks for session-level data out of the box, but we could do this with a thread-local approach fairly easily. CASSANDRA-569 has some (probably out-of-date) code that might be useful here.
[jira] [Comment Edited] (CASSANDRA-876) Support session (read-after-write) consistency
[ https://issues.apache.org/jira/browse/CASSANDRA-876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=1396#comment-1396 ] Muhammad Adel edited comment on CASSANDRA-876 at 3/23/14 3:20 PM: -- Is this issue still open for the latest version of Cassandra? As far as I understand from reading different documentations and articles about MemTables, They are already searched for data before searching the SSTable when performing a query. was (Author: muhammadadel): Is this issue still open for the latest version of Cassandra? As far as I understand from reading different documentations and articles about MemTables they are already searched for data before searching the SSTable when performing a query. Support session (read-after-write) consistency -- Key: CASSANDRA-876 URL: https://issues.apache.org/jira/browse/CASSANDRA-876 Project: Cassandra Issue Type: New Feature Components: Core Reporter: Jonathan Ellis Priority: Minor Labels: gsoc, gsoc2010 Attachments: 876-v2.txt, CASSANDRA-876.patch In http://www.allthingsdistributed.com/2007/10/amazons_dynamo.html and http://www.allthingsdistributed.com/2008/12/eventually_consistent.html Amazon discusses the concept of eventual consistency. Cassandra uses eventual consistency in a design similar to Dynamo. Supporting session consistency would be useful and relatively easy to add: we already have the concept of a Memtable (see http://wiki.apache.org/cassandra/MemtableSSTable ) to stage updates in before flushing to disk; if we applied mutations to a session-level memtable on the coordinator machine (that is, the machine the client is connected to), and then did a final merge from that table against query results before handing them to the client, we'd get it almost for free. Of course, the devil is in the details; thrift doesn't provide any hooks for session-level data out of the box, but we could do this with a threadlocal approach fairly easily. 
CASSANDRA-569 has some (probably out of date now) code that might be useful here. -- This message was sent by Atlassian JIRA (v6.2#6252)
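The session-consistency idea described in the ticket (a session-level memtable on the coordinator, merged into query results before they are returned) can be sketched roughly as follows. This is an illustrative sketch only, not Cassandra code; the function names and the key -> (timestamp, value) shape are our own, and the thread-local mirrors the "threadlocal approach" mentioned above.

```python
import threading

# Hypothetical sketch: keep a per-session (here, per-thread) record of a
# client's own writes, and merge it into query results so the session
# always sees its own writes, even before replicas converge.
_session_memtable = threading.local()

def _table():
    if not hasattr(_session_memtable, "data"):
        _session_memtable.data = {}  # key -> (timestamp, value)
    return _session_memtable.data

def session_write(key, value, timestamp):
    # Stage the mutation in the session-level memtable, keeping the
    # newest timestamp per key.
    table = _table()
    existing = table.get(key)
    if existing is None or timestamp >= existing[0]:
        table[key] = (timestamp, value)

def merge_results(query_results):
    # query_results: key -> (timestamp, value) as returned by replicas.
    # The session's own, possibly newer, writes win on timestamp - this
    # is the "final merge" step described in the ticket.
    merged = dict(query_results)
    for key, (ts, value) in _table().items():
        if key not in merged or ts >= merged[key][0]:
            merged[key] = (ts, value)
    return merged
```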
[jira] [Created] (CASSANDRA-6910) Better table structure display in cqlsh
Tupshin Harper created CASSANDRA-6910: - Summary: Better table structure display in cqlsh Key: CASSANDRA-6910 URL: https://issues.apache.org/jira/browse/CASSANDRA-6910 Project: Cassandra Issue Type: Improvement Components: Tools Reporter: Tupshin Harper Priority: Minor It should be possible to make it more immediately obvious what the structure of your CQL table is from cqlsh. Two minor enhancements could go a long way: 1) If there are no results, display the column headers anyway. Right now, if a query returns no results, it's common to have to display the table schema to figure out what you did wrong. Showing the columns whenever you run a query wouldn't get in the way, and would be more visual than describing the table. 2) Along with the first one, if we could highlight the partition/clustering columns in different colors, the underlying partition structure would be much more intuitively understandable. tl;dr: the forms below should each have a distinct visual representation when displaying the column headers, and the column headers should always be shown. {code} CREATE TABLE usertest ( userid text, email text, name text, PRIMARY KEY (userid) ) CREATE TABLE usertest2 ( userid text, email text, name text, PRIMARY KEY (userid, email) ) CREATE TABLE usertest3 ( userid text, email text, name text, PRIMARY KEY ((userid, email)) ) {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
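Enhancement 2 above amounts to coloring headers by their role in the primary key. A minimal sketch of that idea, assuming ANSI terminal colors (this is not cqlsh code; the function and color choices are illustrative):

```python
# Illustrative sketch: render column headers with distinct ANSI colors
# for partition-key and clustering columns, leaving regular columns
# uncolored, so the primary-key structure is visible at a glance.
RED, BLUE, RESET = "\033[31m", "\033[34m", "\033[0m"

def color_headers(columns, partition_keys, clustering_keys):
    out = []
    for name in columns:
        if name in partition_keys:
            out.append(RED + name + RESET)    # partition key column
        elif name in clustering_keys:
            out.append(BLUE + name + RESET)   # clustering column
        else:
            out.append(name)                  # regular column
    return " | ".join(out)
```

With usertest2 from above, `color_headers(["userid", "email", "name"], {"userid"}, {"email"})` would color userid and email differently, immediately distinguishing it from usertest and usertest3.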
[jira] [Updated] (CASSANDRA-6746) Reads have a slow ramp up in speed
[ https://issues.apache.org/jira/browse/CASSANDRA-6746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan McGuire updated CASSANDRA-6746: Attachment: 6746-buffered-io-tweaks.png [~xedin] Here's a benchmark of your buffered-io-tweaks patch: !6746-buffered-io-tweaks.png! It seemed to delay the ramp up and shorten its duration, but the ramp up still happened. I did two trials of it to make sure. I'll get you a mixed workload benchmark soon. Reads have a slow ramp up in speed -- Key: CASSANDRA-6746 URL: https://issues.apache.org/jira/browse/CASSANDRA-6746 Project: Cassandra Issue Type: Bug Components: Core Reporter: Ryan McGuire Assignee: Benedict Labels: performance Fix For: 2.1 beta2 Attachments: 2.1_vs_2.0_read.png, 6746-buffered-io-tweaks.png, 6746-patched.png, 6746.blockdev_setra.full.png, 6746.blockdev_setra.zoomed.png, 6746.txt, buffered-io-tweaks.patch, cassandra-2.0-bdplab-trial-fincore.tar.bz2, cassandra-2.1-bdplab-trial-fincore.tar.bz2 On a physical four node cluster I am doing a big write and then a big read. The read takes a long time to ramp up to respectable speeds. !2.1_vs_2.0_read.png! [See data here|http://ryanmcguire.info/ds/graph/graph.html?stats=stats.2.1_vs_2.0_vs_1.2.retry1.json&metric=interval_op_rate&operation=stress-read&smoothing=1] -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-6506) counters++ split counter context shards into separate cells
[ https://issues.apache.org/jira/browse/CASSANDRA-6506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13944567#comment-13944567 ] Aleksey Yeschenko commented on CASSANDRA-6506: -- Any objections to at least committing the first two cleanup commits now (the first one is there to mostly kill all the IDEA warnings, at last, but the second one - c503d6ae89651186b9ac7fc8026eab0ace137af - does deconfuse the API a bit) ? counters++ split counter context shards into separate cells --- Key: CASSANDRA-6506 URL: https://issues.apache.org/jira/browse/CASSANDRA-6506 Project: Cassandra Issue Type: Improvement Reporter: Aleksey Yeschenko Assignee: Aleksey Yeschenko Fix For: 2.1 beta2 This change is related to, but somewhat orthogonal to CASSANDRA-6504. Currently all the shard tuples for a given counter cell are packed, in sorted order, in one binary blob. Thus reconciling N counter cells requires allocating a new byte buffer capable of holding the union of the two context's shards N-1 times. For writes, in post CASSANDRA-6504 world, it also means reading more data than we have to (the complete context, when all we need is the local node's global shard). Splitting the context into separate cells, one cell per shard, will help to improve this. We did a similar thing with super columns for CASSANDRA-3237. Incidentally, doing this split is now possible thanks to CASSANDRA-3237. Doing this would also simplify counter reconciliation logic. Getting rid of old contexts altogether can be done trivially with upgradesstables. In fact, we should be able to put the logical clock into the cell's timestamp, and use regular Cell-s and regular Cell reconcile() logic for the shards, especially once we get rid of the local/remote shards some time in the future (until then we still have to differentiate between global/remote/local shards and their priority rules). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (CASSANDRA-6911) Netty dependency update broke stress
Ryan McGuire created CASSANDRA-6911: --- Summary: Netty dependency update broke stress Key: CASSANDRA-6911 URL: https://issues.apache.org/jira/browse/CASSANDRA-6911 Project: Cassandra Issue Type: Bug Components: Tools Reporter: Ryan McGuire Assignee: Benedict I compiled stress fresh from cassandra-2.1 and running this command: {code} cassandra-stress write n=1900 -rate threads=50 -node bdplab {code} I get the following traceback: {code} Exception in thread Thread-49 java.lang.NoClassDefFoundError: org/jboss/netty/channel/ChannelFactory at com.datastax.driver.core.Cluster$Manager.init(Cluster.java:941) at com.datastax.driver.core.Cluster$Manager.init(Cluster.java:889) at com.datastax.driver.core.Cluster.init(Cluster.java:88) at com.datastax.driver.core.Cluster.buildFrom(Cluster.java:144) at com.datastax.driver.core.Cluster$Builder.build(Cluster.java:854) at org.apache.cassandra.stress.util.JavaDriverClient.connect(JavaDriverClient.java:74) at org.apache.cassandra.stress.settings.StressSettings.getJavaDriverClient(StressSettings.java:155) at org.apache.cassandra.stress.settings.StressSettings.getSmartThriftClient(StressSettings.java:70) at org.apache.cassandra.stress.StressAction$Consumer.run(StressAction.java:275) Caused by: java.lang.ClassNotFoundException: org.jboss.netty.channel.ChannelFactory at java.net.URLClassLoader$1.run(URLClassLoader.java:366) at java.net.URLClassLoader$1.run(URLClassLoader.java:355) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:354) at java.lang.ClassLoader.loadClass(ClassLoader.java:425) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308) at java.lang.ClassLoader.loadClass(ClassLoader.java:358) ... 9 more {code} It seems this was introduced with an updated netty jar in cbf304ebd0436a321753e81231545b705aa8dd23 -- This message was sent by Atlassian JIRA (v6.2#6252)
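A quick way to diagnose a `NoClassDefFoundError` like the one above is to scan the jars on the classpath for the missing class's entry. This is a generic diagnostic sketch (the helper name is ours, not a Cassandra or stress tool), using the fact that jars are zip archives:

```python
import os
import zipfile

def jars_containing(class_name, jar_dir):
    # Convert e.g. org.jboss.netty.channel.ChannelFactory into the zip
    # entry path that a jar would contain for that class.
    entry = class_name.replace(".", "/") + ".class"
    hits = []
    for fname in sorted(os.listdir(jar_dir)):
        if not fname.endswith(".jar"):
            continue
        path = os.path.join(jar_dir, fname)
        with zipfile.ZipFile(path) as jar:
            if entry in jar.namelist():
                hits.append(fname)
    return hits
```

Running this over the stress tool's lib directory for `org.jboss.netty.channel.ChannelFactory` would show whether the old-namespace Netty classes are present at all after the dependency update.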
[jira] [Commented] (CASSANDRA-6746) Reads have a slow ramp up in speed
[ https://issues.apache.org/jira/browse/CASSANDRA-6746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13944580#comment-13944580 ] Pavel Yaskevich commented on CASSANDRA-6746: [~enigmacurry] Yes, it would not eliminate it completely, just shorten the duration and speed up the initial warmup. But this drop in op rate is worrisome - can you check whether it could be something JVM-related, or something on the Cassandra side happening at the same time as the drop in op rate? -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (CASSANDRA-6746) Reads have a slow ramp up in speed
[ https://issues.apache.org/jira/browse/CASSANDRA-6746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan McGuire updated CASSANDRA-6746: Attachment: 6746.buffered_io_tweaks.logs.tar.gz I'm using java 1.7.0_51, with a default cassandra.yaml except for the disk_access_mode: standard setting. I don't see anything weird in the logs, but I've uploaded them in case you want to check them out (6746.buffered_io_tweaks.logs.tar.gz). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-6575) By default, Cassandra should refuse to start if JNA can't be initialized properly
[ https://issues.apache.org/jira/browse/CASSANDRA-6575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13944649#comment-13944649 ] Ryan McGuire commented on CASSANDRA-6575: - This got broken. I have deleted the JNA jar file from my lib directory, and the error message telling me that Cassandra refuses to start still shows up correctly. However, if I use the boot_without_jna option it suggests, I get this traceback: {code} ERROR 23:28:22 Exception in thread Thread[MemtableFlushWriter:1,5,main] java.lang.NoClassDefFoundError: com/sun/jna/Native at org.apache.cassandra.io.util.Memory.asByteBuffers(Memory.java:305) ~[main/:na] at org.apache.cassandra.io.util.AbstractDataOutput.write(AbstractDataOutput.java:326) ~[main/:na] at org.apache.cassandra.io.sstable.IndexSummary$IndexSummarySerializer.serialize(IndexSummary.java:221) ~[main/:na] at org.apache.cassandra.io.sstable.SSTableReader.saveSummary(SSTableReader.java:709) ~[main/:na] at org.apache.cassandra.io.sstable.SSTableReader.saveSummary(SSTableReader.java:696) ~[main/:na] at org.apache.cassandra.io.sstable.SSTableWriter.closeAndOpenReader(SSTableWriter.java:356) ~[main/:na] at org.apache.cassandra.io.sstable.SSTableWriter.closeAndOpenReader(SSTableWriter.java:331) ~[main/:na] at org.apache.cassandra.io.sstable.SSTableWriter.closeAndOpenReader(SSTableWriter.java:326) ~[main/:na] at org.apache.cassandra.db.Memtable$FlushRunnable.writeSortedContents(Memtable.java:363) ~[main/:na] at org.apache.cassandra.db.Memtable$FlushRunnable.runWith(Memtable.java:321) ~[main/:na] at org.apache.cassandra.io.util.DiskAwareRunnable.runMayThrow(DiskAwareRunnable.java:48) ~[main/:na] at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) ~[main/:na] at com.google.common.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:297) ~[guava-16.0.jar:na] at org.apache.cassandra.db.ColumnFamilyStore$Flush.run(ColumnFamilyStore.java:1029) ~[main/:na] at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) ~[na:1.7.0_51] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) ~[na:1.7.0_51] at java.lang.Thread.run(Thread.java:744) ~[na:1.7.0_51] Caused by: java.lang.ClassNotFoundException: com.sun.jna.Native at java.net.URLClassLoader$1.run(URLClassLoader.java:366) ~[na:1.7.0_51] at java.net.URLClassLoader$1.run(URLClassLoader.java:355) ~[na:1.7.0_51] at java.security.AccessController.doPrivileged(Native Method) ~[na:1.7.0_51] at java.net.URLClassLoader.findClass(URLClassLoader.java:354) ~[na:1.7.0_51] at java.lang.ClassLoader.loadClass(ClassLoader.java:425) ~[na:1.7.0_51] at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308) ~[na:1.7.0_51] at java.lang.ClassLoader.loadClass(ClassLoader.java:358) ~[na:1.7.0_51] ... 17 common frames omitted {code} This was working at the time this ticket was closed before, but it's now broken on cassandra-2.1 HEAD. By default, Cassandra should refuse to start if JNA can't be initialized properly - Key: CASSANDRA-6575 URL: https://issues.apache.org/jira/browse/CASSANDRA-6575 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Tupshin Harper Assignee: Clément Lardeur Priority: Minor Labels: lhf Fix For: 2.1 beta1 Attachments: trunk-6575-v2.patch, trunk-6575-v3.patch, trunk-6575-v4.patch, trunk-6575.patch Failure to have JNA working properly is such a common undetected problem that it would be far preferable to have Cassandra refuse to startup unless JNA is initialized. In theory, this should be much less of a problem with Cassandra 2.1 due to CASSANDRA-5872, but even there, it might fail due to native lib problems, or might otherwise be misconfigured. A yaml override, such as boot_without_jna would allow the deliberate overriding of this policy. -- This message was sent by Atlassian JIRA (v6.2#6252)
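The startup policy the ticket asks for - refuse to boot unless native access initializes, with an explicit yaml override - can be sketched like this. This mirrors the intent only, not Cassandra's actual startup code; the exception, function, and message strings are illustrative:

```python
import ctypes.util

class StartupError(Exception):
    """Raised when the node should refuse to start."""

def check_native_access(boot_without_jna=False, lib_name="c"):
    # Try to locate the native library; treat failure as fatal unless the
    # operator has deliberately opted out via boot_without_jna (the yaml
    # override proposed in the ticket).
    found = ctypes.util.find_library(lib_name)
    if found is not None:
        return "native access available ({})".format(found)
    if boot_without_jna:
        return "native access unavailable; continuing (boot_without_jna=true)"
    raise StartupError(
        "native access could not be initialized; refusing to start "
        "(set boot_without_jna to override)")
```

The bug reported above is essentially that the override path still lets later code assume `com.sun.jna.Native` is loadable; in the sketch's terms, callers must also honor the "unavailable" result instead of using the native path unconditionally.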
[jira] [Reopened] (CASSANDRA-6575) By default, Cassandra should refuse to start if JNA can't be initialized properly
[ https://issues.apache.org/jira/browse/CASSANDRA-6575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan McGuire reopened CASSANDRA-6575: - -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-6746) Reads have a slow ramp up in speed
[ https://issues.apache.org/jira/browse/CASSANDRA-6746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13944654#comment-13944654 ] Ryan McGuire commented on CASSANDRA-6746: - I ran a mixed read/write workload on a number of branches. [You can see the results here|http://localhost:8000/graph.html?stats=stats.6746.buffered-io-tweaks.mixed.json] That chart is a bit messy, so you need to click the colored squares to see results for only a few branches at a time. The branches tested: * [~xedin]'s buffered-io-tweaks patch on cassandra-2.1 HEAD * cassandra-2.1 HEAD * cassandra-2.0 HEAD with JNA * cassandra-2.1 HEAD without JNA Similar to the buffered-io-tweaks run I did for solo reads, it looks to improve things here as well. However, even in mixed workloads, simply disabling JNA still works better. I cannot currently test cassandra-2.1 without JNA because of CASSANDRA-6575, which I have just now reopened. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Comment Edited] (CASSANDRA-6746) Reads have a slow ramp up in speed
[ https://issues.apache.org/jira/browse/CASSANDRA-6746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13944654#comment-13944654 ] Ryan McGuire edited comment on CASSANDRA-6746 at 3/24/14 12:04 AM: --- I ran a mixed read / write workload on a number of branches. [You can see the results here|http://riptano.github.io/cassandra_performance/graph/graph.html?stats=stats.6746.buffered-io-tweaks.mixed.json] That chart is a bit messy, so you need to click the colored squares to only see results for a few branches at a time. The branches tested: * [~xedin]'s buffered-io-tweaks patch on cassandra-2.1 HEAD * cassandra-2.1 HEAD * cassandra-2.0 HEAD with JNA * cassandra-2.1 HEAD without JNA Similar to the buffered-io-tweaks run I did for solo-reads, it looks to improve things here as well. However, even in mixed workloads, simply disabling JNA is still working better. I cannot currently test cassandra-2.1 without JNA because of CASSANDRA-6575 which I have just now reopened. was (Author: enigmacurry): I ran a mixed read / write workload on a number of branches. [You can see the results here|http://localhost:8000/graph.html?stats=stats.6746.buffered-io-tweaks.mixed.json] That chart is a bit messy, so you need to click the colored squares to only see results for a few branches at a time. The branches tested: * [~xedin]'s buffered-io-tweaks patch on cassandra-2.1 HEAD * cassandra-2.1 HEAD * cassandra-2.0 HEAD with JNA * cassandra-2.1 HEAD without JNA Similar to the buffered-io-tweaks run I did for solo-reads, it looks to improve things here as well. However, even in mixed workloads, simply disabling JNA is still working better. I cannot currently test cassandra-2.1 without JNA because of CASSANDRA-6575 which I have just now reopened. 
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Comment Edited] (CASSANDRA-6746) Reads have a slow ramp up in speed
[ https://issues.apache.org/jira/browse/CASSANDRA-6746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13944654#comment-13944654 ] Ryan McGuire edited comment on CASSANDRA-6746 at 3/24/14 12:05 AM: --- I ran a mixed read / write workload on a number of branches. [You can see the results here|http://riptano.github.io/cassandra_performance/graph/graph.html?stats=stats.6746.buffered-io-tweaks.mixed.jsonoperation=mixed] That chart is a bit messy, so you need to click the colored squares to only see results for a few branches at a time. The branches tested: * [~xedin]'s buffered-io-tweaks patch on cassandra-2.1 HEAD * cassandra-2.1 HEAD * cassandra-2.0 HEAD with JNA * cassandra-2.1 HEAD without JNA Similar to the buffered-io-tweaks run I did for solo-reads, it looks to improve things here as well. However, even in mixed workloads, simply disabling JNA is still working better. I cannot currently test cassandra-2.1 without JNA because of CASSANDRA-6575 which I have just now reopened. was (Author: enigmacurry): I ran a mixed read / write workload on a number of branches. [You can see the results here|http://riptano.github.io/cassandra_performance/graph/graph.html?stats=stats.6746.buffered-io-tweaks.mixed.json] That chart is a bit messy, so you need to click the colored squares to only see results for a few branches at a time. The branches tested: * [~xedin]'s buffered-io-tweaks patch on cassandra-2.1 HEAD * cassandra-2.1 HEAD * cassandra-2.0 HEAD with JNA * cassandra-2.1 HEAD without JNA Similar to the buffered-io-tweaks run I did for solo-reads, it looks to improve things here as well. However, even in mixed workloads, simply disabling JNA is still working better. I cannot currently test cassandra-2.1 without JNA because of CASSANDRA-6575 which I have just now reopened. 
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Comment Edited] (CASSANDRA-6746) Reads have a slow ramp up in speed
[ https://issues.apache.org/jira/browse/CASSANDRA-6746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13944654#comment-13944654 ] Ryan McGuire edited comment on CASSANDRA-6746 at 3/24/14 12:08 AM: --- I ran a mixed read / write workload on a number of branches. [You can see the results here|http://riptano.github.io/cassandra_performance/graph/graph.html?stats=stats.6746.buffered-io-tweaks.mixed.jsonoperation=mixed] That chart is a bit messy, so you need to click the colored squares to only see results for a few branches at a time. The branches tested: * [~xedin]'s buffered-io-tweaks patch on cassandra-2.1 HEAD * cassandra-2.1 HEAD * cassandra-2.0 HEAD with JNA * cassandra-2.0 HEAD without JNA Similar to the buffered-io-tweaks run I did for solo-reads, it looks to improve things here as well. However, even in mixed workloads, simply disabling JNA is still working better. I cannot currently test cassandra-2.1 without JNA because of CASSANDRA-6575 which I have just now reopened. was (Author: enigmacurry): I ran a mixed read / write workload on a number of branches. [You can see the results here|http://riptano.github.io/cassandra_performance/graph/graph.html?stats=stats.6746.buffered-io-tweaks.mixed.jsonoperation=mixed] That chart is a bit messy, so you need to click the colored squares to only see results for a few branches at a time. The branches tested: * [~xedin]'s buffered-io-tweaks patch on cassandra-2.1 HEAD * cassandra-2.1 HEAD * cassandra-2.0 HEAD with JNA * cassandra-2.1 HEAD without JNA Similar to the buffered-io-tweaks run I did for solo-reads, it looks to improve things here as well. However, even in mixed workloads, simply disabling JNA is still working better. I cannot currently test cassandra-2.1 without JNA because of CASSANDRA-6575 which I have just now reopened. 
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-6908) Dynamic endpoint snitch destabilizes cluster under heavy load
[ https://issues.apache.org/jira/browse/CASSANDRA-6908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13944672#comment-13944672 ] Bartłomiej Romański commented on CASSANDRA-6908: We're using 2.0.5. I'm looking at that code in 2.0.6 now and it looks like it's still there (however, I'm not very familiar with the code, so it's possible I'm misunderstanding something). Dynamic endpoint snitch destabilizes cluster under heavy load - Key: CASSANDRA-6908 URL: https://issues.apache.org/jira/browse/CASSANDRA-6908 Project: Cassandra Issue Type: Improvement Components: Config, Core Reporter: Bartłomiej Romański We observe that with the dynamic snitch disabled our cluster is much more stable than with it enabled. We've got a 15-node cluster with pretty strong machines (2x E5-2620, 64 GB RAM, 2x 480 GB SSD). We mostly do reads (about 300k/s). We use Astyanax on the client side with the TOKEN_AWARE option enabled. It automatically directs read queries to one of the nodes responsible for the given token. In that case, with the dynamic snitch disabled, Cassandra always handles the read locally. With the dynamic snitch enabled, Cassandra very often decides to proxy the read to some other node. This causes much higher CPU usage and produces much more garbage, which results in more frequent GC pauses (the young generation fills up quicker). By much higher and much more I mean 1.5-2x. I'm aware that a higher dynamic_snitch_badness_threshold value should solve that issue. The default value is 0.1. I've looked at the scores exposed in JMX and the problem is that our values seem to be completely random. They are usually between 0.5 and 2.0, but change randomly every time I hit refresh. Of course, I can set dynamic_snitch_badness_threshold to 5.0 or something like that, but the result will be similar to simply disabling the dynamic snitch altogether (which is what we did). I've tried to understand the logic behind these scores and I'm not sure if I get the idea... 
It's a sum (without any multipliers) of two components: - the ratio of the given node's recent latency to the recent average node latency - something called 'severity', which, if I analyzed the code correctly, is the result of BackgroundActivityMonitor.getIOWait() - the ratio of iowait CPU time to total CPU time as reported in /proc/stats (the ratio is multiplied by 100) In our case the second value is somewhere around 0-2% but varies quite heavily every second. What's the idea behind simply adding these two values without any multipliers (e.g. the second one is a percentage while the first one is not)? Are we sure this is the best possible way of calculating the final score? Is there a way to force Cassandra to use (much) longer samples? In our case we probably need that to get stable values. The 'severity' is calculated for each second. The mean latency is calculated based on some magic, hardcoded values (ALPHA = 0.75, WINDOW_SIZE = 100). Am I right that there's no way to tune that without hacking the code? I'm aware that there's a dynamic_snitch_update_interval_in_ms property in the config file, but that only determines how often the scores are recalculated, not how long samples are taken. Is that correct? To sum up, it would be really nice to have more control over dynamic snitch behavior, or at least have the official option to disable it described in the default config file (it took me some time to discover that we can just disable it instead of hacking around with dynamic_snitch_badness_threshold=1000). Currently, for some scenarios (like ours - optimized cluster, token-aware client, heavy load) it causes more harm than good. -- This message was sent by Atlassian JIRA (v6.2#6252)
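The scoring described above - latency ratio plus unweighted iowait percentage - can be written out to make the reported unit mismatch concrete. This is a sketch of the reporter's reading of the code, not the actual DynamicEndpointSnitch implementation:

```python
# Sketch of the score as described in the report: for each node,
#   score = (node latency / mean latency across nodes) + severity
# where severity is an iowait *percentage* (0-100) added with no
# weighting to a dimensionless ratio near 1.0 - so a 2% iowait blip
# swamps the latency term, matching the "random-looking" scores seen.
def snitch_scores(latencies_ms, severities_pct):
    mean = sum(latencies_ms.values()) / len(latencies_ms)
    return {
        node: latencies_ms[node] / mean + severities_pct.get(node, 0.0)
        for node in latencies_ms
    }
```

For example, a node with half the mean latency but a momentary 2% iowait scores far worse than a node twice as slow with 0% iowait, which is the instability being described.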
[jira] [Commented] (CASSANDRA-6746) Reads have a slow ramp up in speed
[ https://issues.apache.org/jira/browse/CASSANDRA-6746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13944690#comment-13944690 ] Pavel Yaskevich commented on CASSANDRA-6746: [~enigmacurry] Thanks for the results, this looks promising, although one question remains - why is there that dip for the buffered-io patch? It might be related to the last compaction combining 4 sstables into one... Can you please do the following experiment - write the data, force a flush + major compaction, and once all compactions complete, run the buffered-io-tweaks patch to see if that dip in the middle of the run is actually caused by compaction replacing a pre-heated file set with a completely cold file? -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Comment Edited] (CASSANDRA-6746) Reads have a slow ramp up in speed
[ https://issues.apache.org/jira/browse/CASSANDRA-6746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13944690#comment-13944690 ] Pavel Yaskevich edited comment on CASSANDRA-6746 at 3/24/14 1:49 AM: - [~enigmacurry] Thanks for the results, this looks promising although, one question still remains - why is the dip for buffered-io patch happening. It might be related to the last compaction combining 4 sstables into one... Can you please do the following experiment - write the data, force a flush + major compaction, once all compactions complete run the buffered-io-tweaks patch to see if that deep in the middle of the run is actually caused by compaction replacing pre-heated file set with completely cold file? was (Author: xedin): [~enigmacurry] Thanks for the results, this looks promising although I one question for me remains why is there that deep for buffered-io patch, It might be related to the last compaction combining 4 sstables into one... Can you please do the following experiment - write the data, force a flush + major compaction, once all compactions complete run the buffered-io-tweaks patch to see if that deep in the middle of the run is actually caused by compaction replacing pre-heated file set with completely cold file? Reads have a slow ramp up in speed -- Key: CASSANDRA-6746 URL: https://issues.apache.org/jira/browse/CASSANDRA-6746 Project: Cassandra Issue Type: Bug Components: Core Reporter: Ryan McGuire Assignee: Benedict Labels: performance Fix For: 2.1 beta2 Attachments: 2.1_vs_2.0_read.png, 6746-buffered-io-tweaks.png, 6746-patched.png, 6746.blockdev_setra.full.png, 6746.blockdev_setra.zoomed.png, 6746.buffered_io_tweaks.logs.tar.gz, 6746.txt, buffered-io-tweaks.patch, cassandra-2.0-bdplab-trial-fincore.tar.bz2, cassandra-2.1-bdplab-trial-fincore.tar.bz2 On a physical four node cluister I am doing a big write and then a big read. The read takes a long time to ramp up to respectable speeds. 
!2.1_vs_2.0_read.png! [See data here|http://ryanmcguire.info/ds/graph/graph.html?stats=stats.2.1_vs_2.0_vs_1.2.retry1.jsonmetric=interval_op_rateoperation=stress-readsmoothing=1] -- This message was sent by Atlassian JIRA (v6.2#6252)
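The write, flush, major-compaction, then read experiment suggested above can be sketched as a dry-run shell script. This is a sketch only: the stress option `n=40000000` and the step ordering details are assumptions not taken from the ticket, and each command is echoed rather than executed, since actually running it assumes a live Cassandra cluster with `cassandra-stress` and `nodetool` on the PATH.

```shell
#!/bin/sh
# Dry-run sketch of the suggested experiment. The 'run' helper only echoes
# each command; on a real cluster, replace 'echo' with direct execution.
# n=40000000 below is an assumed placeholder, not a value from the ticket.
run() { echo "+ $*"; }

run cassandra-stress write n=40000000   # 1. write the data
run nodetool flush                      # 2. force memtables to disk
run nodetool compact                    # 3. force a major compaction
# 4. once all compactions complete, run the read phase against the
#    buffered-io-tweaks build and watch for the mid-run dip
run cassandra-stress read n=40000000
```

The point of flushing and major-compacting before the read phase is to rule out compaction replacing a pre-heated file set with a cold file mid-run.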
[jira] [Updated] (CASSANDRA-6746) Reads have a slow ramp up in speed
[ https://issues.apache.org/jira/browse/CASSANDRA-6746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan McGuire updated CASSANDRA-6746: Attachment: 6746.buffered_io_tweaks.write-read-flush-compact.png Write, then flush, then compact, then read - that seems to work well: !6746.buffered_io_tweaks.write-read-flush-compact.png! [data here|http://riptano.github.io/cassandra_performance/graph/graph.html?stats=stats.6746.buffered-io-tweaks.write-flush-compact-read.jsonoperation=read] I'll run a (write, flush, compact, mixed read/write) test next to make sure that looks good too.
[jira] [Updated] (CASSANDRA-6746) Reads have a slow ramp up in speed
[ https://issues.apache.org/jira/browse/CASSANDRA-6746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan McGuire updated CASSANDRA-6746: Attachment: 6746.buffered_io_tweaks.write-flush-compact-mixed.png write, flush, compact, mixed read/write: !6746.buffered_io_tweaks.write-flush-compact-mixed.png! [data here|http://riptano.github.io/cassandra_performance/graph/graph.html?stats=stats.6746.buffered-io-tweaks.write-flush-compact-mixed.jsonmetric=op_rateoperation=mixedsmoothing=1xmin=0xmax=381.59ymin=0ymax=98910.9] So it looks like this eliminates the drop at the start. I don't have an explanation for the short, periodic drops here in mixed mode; I don't have enough experience with this new stress mode yet to know, except that 2.0 without JNA still fared better.
[jira] [Comment Edited] (CASSANDRA-6746) Reads have a slow ramp up in speed
[ https://issues.apache.org/jira/browse/CASSANDRA-6746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13944760#comment-13944760 ] Ryan McGuire edited comment on CASSANDRA-6746 at 3/24/14 5:28 AM: -- [~benedict] can you confirm whether my stress options look alright for mixed mode, or do you have a better suggestion?
[jira] [Commented] (CASSANDRA-6746) Reads have a slow ramp up in speed
[ https://issues.apache.org/jira/browse/CASSANDRA-6746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13944760#comment-13944760 ] Ryan McGuire commented on CASSANDRA-6746: - @benedict can you confirm if my stress options look alright for mixed mode, do you have a better suggestion?