[jira] [Updated] (CASSANDRA-2698) Instrument repair to be able to assess it's efficiency (precision)
[ https://issues.apache.org/jira/browse/CASSANDRA-2698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benedict updated CASSANDRA-2698:

    Attachment: patch.diff

> Instrument repair to be able to assess it's efficiency (precision)
> ------------------------------------------------------------------
>
>         Key: CASSANDRA-2698
>         URL: https://issues.apache.org/jira/browse/CASSANDRA-2698
>     Project: Cassandra
>  Issue Type: Improvement
>    Reporter: Sylvain Lebresne
>    Priority: Minor
>      Labels: lhf
> Attachments: nodetool_repair_and_cfhistogram.tar.gz, patch_2698_v1.txt, patch.diff
>
> Some reports indicate that repair sometimes transfers huge amounts of data. One hypothesis is that the merkle tree precision may deteriorate too much at some data size. To check this hypothesis, it would be reasonable to gather statistics during merkle tree building on how many rows each merkle tree range accounts for (and the size this represents). It is probably an interesting statistic to have anyway.

--
This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (CASSANDRA-2698) Instrument repair to be able to assess it's efficiency (precision)
[ https://issues.apache.org/jira/browse/CASSANDRA-2698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13609625#comment-13609625 ]

Benedict commented on CASSANDRA-2698:

Hi, I've uploaded a patch for this issue (patch.diff - apologies for the potentially future-clashing name). Logging is performed in two places:

1) On the requesting node in AntiEntropyService.Difference.run(), after the MerkleTree difference is calculated and before the StreamingRepairTask is created.

2) On the source node, on which StreamingRepairTask is run, in StreamOut.createPendingFiles().

In both cases we log, at debug level, a sample of the largest ranges followed by a histogram of the range size distribution. The first is achieved by inserting each range directly into an EstimatedHistogram, on which we call the new logSummary() method; the second by calling the new groupByFrequency() method on that same histogram, to yield a histogram based on the frequency of sizes present in the original (on which we simply call log()).

In case 1, we construct the MerkleTree to include a size taken from the AbstractCompactedRow we compute the hash from, and use this in MerkleTree.difference to estimate the size of mismatching ranges. This tends to underestimate, versus the size reported by StreamOut, by around 15%. One design decision of note here: instead of modifying AbstractCompactedRow to return a size (which would be invasive and in some cases incur an unnecessary penalty), we use a custom implementation of MessageDigest that counts the number of bytes provided to it.

Case 2 is much simpler, as we already have the ranges and their sizes available to us.

There are some other changes, particularly in MerkleTree, with some refactoring/renames/new subclasses as part of updating MerkleTree.difference().
In particular, TreeDifference is returned instead of TreeRange (to accommodate the extra size information), and it is used generally in place of it within this method tree where applicable; hash() and hashHelper() have also been renamed to find() and findHelper(), with a new hash() implementation depending on find(). I'm sure there are other minutiae, but hopefully nothing too opaque. If you need any clarification, feel free to ask.
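[Editorial aside: the byte-counting MessageDigest trick described in the comment above can be sketched roughly as follows. This is an illustration only, not the patch's actual class; the name CountingDigest and its bytesConsumed() accessor are hypothetical.]

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical sketch: wrap a real digest and count every byte fed to it,
// so a row's serialized size can be observed as a side effect of hashing,
// without changing the interface of the code that produces the bytes.
public class CountingDigest extends MessageDigest {
    private final MessageDigest delegate;
    private long bytes;

    public CountingDigest(MessageDigest delegate) {
        super(delegate.getAlgorithm());
        this.delegate = delegate;
    }

    @Override
    protected void engineUpdate(byte input) {
        bytes += 1;
        delegate.update(input);
    }

    @Override
    protected void engineUpdate(byte[] input, int offset, int len) {
        bytes += len;
        delegate.update(input, offset, len);
    }

    @Override
    protected byte[] engineDigest() {
        return delegate.digest();
    }

    @Override
    protected void engineReset() {
        bytes = 0;
        delegate.reset();
    }

    // Total bytes hashed since the last reset().
    public long bytesConsumed() {
        return bytes;
    }
}
```

Feeding, say, 11 bytes through update() leaves bytesConsumed() at 11 while the digest itself is computed as usual by the wrapped algorithm.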
[jira] [Updated] (CASSANDRA-2698) Instrument repair to be able to assess it's efficiency (precision)
[ https://issues.apache.org/jira/browse/CASSANDRA-2698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benedict updated CASSANDRA-2698:

    Attachment: patch-rebased.diff
[jira] [Commented] (CASSANDRA-2698) Instrument repair to be able to assess it's efficiency (precision)
[ https://issues.apache.org/jira/browse/CASSANDRA-2698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13616569#comment-13616569 ]

Benedict commented on CASSANDRA-2698:

Hi Yuki, The patch was created some time ago, and there were some minor renames/changes to MerkleTree and AntiEntropyService in the meantime. I've pulled the latest changes, merged, and regenerated the patch. This is against the main trunk.
[jira] [Comment Edited] (CASSANDRA-2698) Instrument repair to be able to assess it's efficiency (precision)
[ https://issues.apache.org/jira/browse/CASSANDRA-2698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13616569#comment-13616569 ]

Benedict edited comment on CASSANDRA-2698 at 3/28/13 8:26 PM:

Hi Yuki, The patch was created some time ago, and there were some minor renames/changes to MerkleTree and AntiEntropyService in the meantime. I've pulled the latest changes, merged, and regenerated the patch. This is against the main trunk / HEAD branch.
[jira] [Commented] (CASSANDRA-2698) Instrument repair to be able to assess it's efficiency (precision)
[ https://issues.apache.org/jira/browse/CASSANDRA-2698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13626520#comment-13626520 ]

Benedict commented on CASSANDRA-2698:

Hi Yuki,

Without in some way collecting (or at least sampling) the sizes of the differences, I don't know what bucket sizes to use. Since I need to reinsert all the records once I've decided this anyway, I need to retain them all, which I chose to do in an EstimatedHistogram as they do, in effect, constitute a histogram. I also sample the largest records, which I figure could be useful for debugging purposes (though that was just a guess). I don't see why thousands of items is a major issue?

I agree that logging is suboptimal for this data. Presumably similar data for other tasks may be optionally logged in future, and so I would guess this should form part of a discussion about metric logging?

{quote} fix coding style (especially whitespace) to match other code. {quote}

Do you have an Eclipse formatter profile I could use for your coding convention? I did my best to keep it correct manually, but it is difficult to spot differences in an unfamiliar convention. Whitespace should be comparatively easy, though.

{quote} EstimatedHistogram#testGroupBy is failing. {quote}

Noted - will fix and resubmit.

{quote} comparator in Arrays#sort in EstimatedHistogram#logSummary has the same conditions in both if and else if. {quote}

Thanks, good spot. I'm surprised Eclipse didn't warn me.
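[Editorial aside: for readers unfamiliar with the frequency-grouping idea being debated above, here is a rough stdlib sketch of the concept. It is an assumption-laden illustration, not Cassandra's actual EstimatedHistogram#groupByFrequency; the real class uses a different (geometric) bucket series.]

```java
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: group per-range sizes into buckets and count how often
// each bucket occurs, yielding a histogram over the size distribution.
// Buckets here are powers of two, chosen purely for simplicity.
public class FrequencyGrouping {
    // Round a size up to its power-of-two bucket.
    static long bucket(long size) {
        long b = 1;
        while (b < size) b <<= 1;
        return b;
    }

    // Map from bucket -> number of ranges whose size falls in that bucket.
    static Map<Long, Long> groupByFrequency(long[] rangeSizes) {
        Map<Long, Long> histogram = new TreeMap<>();
        for (long size : rangeSizes)
            histogram.merge(bucket(size), 1L, Long::sum);
        return histogram;
    }
}
```

For example, sizes {100, 120, 4000, 5000} land in buckets 128 (twice), 4096, and 8192, which is the kind of distribution summary the comment describes logging.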
[jira] [Commented] (CASSANDRA-7631) Allow Stress to write directly to SSTables
[ https://issues.apache.org/jira/browse/CASSANDRA-7631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14080759#comment-14080759 ]

Benedict commented on CASSANDRA-7631:

Feel free to leave this one for me, as I'll be looking at stress soon anyway.

> Allow Stress to write directly to SSTables
> ------------------------------------------
>
>         Key: CASSANDRA-7631
>         URL: https://issues.apache.org/jira/browse/CASSANDRA-7631
>     Project: Cassandra
>  Issue Type: Improvement
>  Components: Tools
>    Reporter: Russell Alexander Spitzer
>    Assignee: Russell Alexander Spitzer
>
> One common difficulty with benchmarking machines is the amount of time it takes to initially load data. For machines with a large amount of RAM this becomes especially onerous, because a very large amount of data needs to be placed on the machine before the page cache can be circumvented.
> To remedy this, I suggest we add a top-level flag to cassandra-stress which would cause the tool to write directly to sstables rather than actually performing CQL inserts. Internally this would use CQLSSTableWriter to write directly to sstables while skipping any keys which are not owned by the node stress is running on. The same stress command run on each node in the cluster would then write unique sstables containing only data which that node is responsible for. Following this, no further network IO would be required to distribute data, as it would all already be correctly in place.

--
This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-7593) Errors when upgrading through several versions to 2.1
[ https://issues.apache.org/jira/browse/CASSANDRA-7593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14080825#comment-14080825 ]

Benedict commented on CASSANDRA-7593:

Might be better to just expose this in CSCNT, since we have access to it when doing this.

> Errors when upgrading through several versions to 2.1
> ------------------------------------------------------
>
>         Key: CASSANDRA-7593
>         URL: https://issues.apache.org/jira/browse/CASSANDRA-7593
>     Project: Cassandra
>  Issue Type: Bug
> Environment: java 1.7
>    Reporter: Russ Hatch
>    Assignee: Tyler Hobbs
>    Priority: Critical
>     Fix For: 2.1.0
> Attachments: 0001-keep-clusteringSize-in-CompoundComposite.patch, 7593.txt
>
> I'm seeing two different errors cropping up in the dtest which upgrades a cluster through several versions. This is the more common error:
> {noformat}
> ERROR [GossipStage:10] 2014-07-22 13:14:30,028 CassandraDaemon.java:168 - Exception in thread Thread[GossipStage:10,5,main]
> java.lang.AssertionError: null
>     at org.apache.cassandra.db.filter.SliceQueryFilter.shouldInclude(SliceQueryFilter.java:347) ~[main/:na]
>     at org.apache.cassandra.db.filter.QueryFilter.shouldInclude(QueryFilter.java:249) ~[main/:na]
>     at org.apache.cassandra.db.CollationController.collectAllData(CollationController.java:249) ~[main/:na]
>     at org.apache.cassandra.db.CollationController.getTopLevelColumns(CollationController.java:60) ~[main/:na]
>     at org.apache.cassandra.db.ColumnFamilyStore.getTopLevelColumns(ColumnFamilyStore.java:1873) ~[main/:na]
>     at org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1681) ~[main/:na]
>     at org.apache.cassandra.db.Keyspace.getRow(Keyspace.java:345) ~[main/:na]
>     at org.apache.cassandra.db.SliceFromReadCommand.getRow(SliceFromReadCommand.java:59) ~[main/:na]
>     at org.apache.cassandra.cql3.statements.SelectStatement.readLocally(SelectStatement.java:293) ~[main/:na]
>     at org.apache.cassandra.cql3.statements.SelectStatement.executeInternal(SelectStatement.java:302) ~[main/:na]
>     at org.apache.cassandra.cql3.statements.SelectStatement.executeInternal(SelectStatement.java:60) ~[main/:na]
>     at org.apache.cassandra.cql3.QueryProcessor.executeInternal(QueryProcessor.java:263) ~[main/:na]
>     at org.apache.cassandra.db.SystemKeyspace.getPreferredIP(SystemKeyspace.java:514) ~[main/:na]
>     at org.apache.cassandra.net.OutboundTcpConnectionPool.init(OutboundTcpConnectionPool.java:51) ~[main/:na]
>     at org.apache.cassandra.net.MessagingService.getConnectionPool(MessagingService.java:522) ~[main/:na]
>     at org.apache.cassandra.net.MessagingService.getConnection(MessagingService.java:536) ~[main/:na]
>     at org.apache.cassandra.net.MessagingService.sendOneWay(MessagingService.java:689) ~[main/:na]
>     at org.apache.cassandra.net.MessagingService.sendReply(MessagingService.java:663) ~[main/:na]
>     at org.apache.cassandra.service.EchoVerbHandler.doVerb(EchoVerbHandler.java:40) ~[main/:na]
>     at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:62) ~[main/:na]
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) ~[na:1.7.0_60]
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) ~[na:1.7.0_60]
>     at java.lang.Thread.run(Thread.java:745) ~[na:1.7.0_60]
> {noformat}
> The same test sometimes fails with this exception instead:
> {noformat}
> ERROR [CompactionExecutor:4] 2014-07-22 16:18:21,008 CassandraDaemon.java:168 - Exception in thread Thread[CompactionExecutor:4,1,RMI Runtime]
> java.util.concurrent.RejectedExecutionException: Task java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask@7059d3e9 rejected from org.apache.cassandra.concurrent.DebuggableScheduledThreadPoolExecutor@108f1504[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 95]
>     at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2048) ~[na:1.7.0_60]
>     at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:821) ~[na:1.7.0_60]
>     at java.util.concurrent.ScheduledThreadPoolExecutor.delayedExecute(ScheduledThreadPoolExecutor.java:325) ~[na:1.7.0_60]
>     at java.util.concurrent.ScheduledThreadPoolExecutor.schedule(ScheduledThreadPoolExecutor.java:530) ~[na:1.7.0_60]
>     at java.util.concurrent.ScheduledThreadPoolExecutor.execute(ScheduledThreadPoolExecutor.java:619) ~[na:1.7.0_60]
>     at org.apache.cassandra.io.sstable.SSTableReader.scheduleTidy(SSTableReader.java:628)
[jira] [Commented] (CASSANDRA-7567) when the commit_log disk for a single node is overwhelmed the entire cluster slows down
[ https://issues.apache.org/jira/browse/CASSANDRA-7567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14081045#comment-14081045 ]

Benedict commented on CASSANDRA-7567:

Which mode of connectivity? Smart thrift and cql native3 both use token-aware routing (smart thrift does its own fairly dumb round-robin for a given token range), so will go directly to a random node in the cluster. With the Java driver, I don't think we have any easy API control over which nodes we connect to, and I'm not sure there's a lot of point making smart thrift too smart, since it's only there to compare fairly against cql native3's token-aware routing. Regular thrift mode won't do this.

> when the commit_log disk for a single node is overwhelmed the entire cluster slows down
> ---------------------------------------------------------------------------------------
>
>         Key: CASSANDRA-7567
>         URL: https://issues.apache.org/jira/browse/CASSANDRA-7567
>     Project: Cassandra
>  Issue Type: Bug
>  Components: Core
> Environment: debian 7.5, bare metal, 14 nodes, 64 CPUs, 64GB RAM, commit_log disk SATA, data disk SSD, vnodes, leveled compaction strategy
>    Reporter: David O'Dell
>    Assignee: Brandon Williams
>     Fix For: 2.1.0
> Attachments: 7567.logs.bz2, write_request_latency.png
>
> We've run into a situation where a single node out of 14 is experiencing high disk I/O. This can happen when a node is being decommissioned, or after it joins the ring and runs into the bug CASSANDRA-6621. When this occurs, the write latency for the entire cluster spikes, from 0.3ms to 170ms.
> To simulate this, simply run dd on the commit_log disk (dd if=/dev/zero of=/tmp/foo bs=1024) and you will see that instantly all nodes in the cluster have slowed down. BTW, overwhelming the data disk does not have this same effect. Also, I've tried this where the overwhelmed node isn't being connected to directly by the client and it still has the same effect.
[jira] [Commented] (CASSANDRA-7567) when the commit_log disk for a single node is overwhelmed the entire cluster slows down
[ https://issues.apache.org/jira/browse/CASSANDRA-7567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14081060#comment-14081060 ]

Benedict commented on CASSANDRA-7567:

So Java driver then (cql3 native prepared is the default). There isn't anything we can really do about this without Java Driver support.
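[Editorial aside: as background to the token-aware routing discussion above, a toy consistent-hash ring shows why a token-aware client can send each request straight to an owning node. This is an illustrative sketch only, not the DataStax Java driver's load-balancing policy; the class and its tokens are hypothetical.]

```java
import java.util.Map;
import java.util.TreeMap;

// Toy consistent-hash ring: a key is routed to the first node whose token is
// >= the key's token, wrapping around past the highest token on the ring.
// A token-aware client keeps a copy of this mapping so it can contact an
// owning node directly instead of round-robining through coordinators.
public class TokenRing {
    private final TreeMap<Long, String> ring = new TreeMap<>();

    void addNode(long token, String node) {
        ring.put(token, node);
    }

    String nodeFor(long keyToken) {
        Map.Entry<Long, String> owner = ring.ceilingEntry(keyToken);
        // Wrap around: tokens past the last node belong to the first node.
        return (owner != null ? owner : ring.firstEntry()).getValue();
    }
}
```

With nodes at tokens 100, 200, and 300, a key token of 150 routes to the node at 200, and a key token of 350 wraps around to the node at 100.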
[jira] [Updated] (CASSANDRA-7658) stress connects to all nodes when it shouldn't
[ https://issues.apache.org/jira/browse/CASSANDRA-7658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benedict updated CASSANDRA-7658:

    Issue Type: Improvement (was: Bug)

> stress connects to all nodes when it shouldn't
> ----------------------------------------------
>
>         Key: CASSANDRA-7658
>         URL: https://issues.apache.org/jira/browse/CASSANDRA-7658
>     Project: Cassandra
>  Issue Type: Improvement
>  Components: Tools
>    Reporter: Brandon Williams
>    Assignee: Benedict
>     Fix For: 2.1.1
>
> If you tell stress -node 1,2 in a cluster with more nodes, stress appears to do ring discovery and connect to them all anyway (checked via netstat). This led to the confusion on CASSANDRA-7567.
[jira] [Updated] (CASSANDRA-7658) stress connects to all nodes when it shouldn't
[ https://issues.apache.org/jira/browse/CASSANDRA-7658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benedict updated CASSANDRA-7658:

    Priority: Minor (was: Major)
[jira] [Commented] (CASSANDRA-7658) stress connects to all nodes when it shouldn't
[ https://issues.apache.org/jira/browse/CASSANDRA-7658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14081063#comment-14081063 ]

Benedict commented on CASSANDRA-7658:

You're inferring a property of the 'node' option it doesn't promise; that's the list of nodes it connects to initially to get started. You're looking for a whitelist, which is a different thing, and not currently supported. For dumb routing it necessarily behaves as both, but this is a feature request, not a bug. Either way, we need Java Driver support that I don't think currently exists via the API.
[jira] [Commented] (CASSANDRA-7658) stress connects to all nodes when it shouldn't
[ https://issues.apache.org/jira/browse/CASSANDRA-7658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14081071#comment-14081071 ]

Benedict commented on CASSANDRA-7658:

old-stress has no distinction between a whitelist and an initial list.
[jira] [Assigned] (CASSANDRA-7631) Allow Stress to write directly to SSTables
[ https://issues.apache.org/jira/browse/CASSANDRA-7631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benedict reassigned CASSANDRA-7631:

    Assignee: Benedict (was: Russell Alexander Spitzer)
[jira] [Updated] (CASSANDRA-7631) Allow Stress to write directly to SSTables
[ https://issues.apache.org/jira/browse/CASSANDRA-7631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benedict updated CASSANDRA-7631:

    Assignee: Russell Alexander Spitzer (was: Benedict)
[jira] [Commented] (CASSANDRA-7631) Allow Stress to write directly to SSTables
[ https://issues.apache.org/jira/browse/CASSANDRA-7631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14081116#comment-14081116 ]

Benedict commented on CASSANDRA-7631:

Ok, in that case some various random thoughts on this:

1) I suspect you're not blocking on the ABQ, but on the single thread you have consuming from it (and having this separate thread is bad anyway). It's likely you're getting some misattribution in your profiler due to rapid thread sleeping/waking there.

2) We should for now complain if the whole partition isn't being inserted for this mode.

3) We should create the CF on each individual thread, and we should append them unsorted onto a ConcurrentLinkedQueue, track the total memory used in the buffer, and have a separate thread that sorts the partition keys and flushes out to disk once we exceed our threshold for doing so (much like memtable flushing).

4) We should modify the PartitionGenerator to support sorting the clustering components it generates; this way we can reduce the sorting cost fairly dramatically, as sorting individual components is much cheaper than sorting all components at once.

5) Ideally we would visit the partition keys in approximately sorted order, so that we can flush a single file, as this will be most efficient for loading. This will require a minor portion of the changes I'll be introducing soon for more realistic workload generation, and then a custom SeedGenerator that (externally) pre-sorts the seeds based on the partitions they generate.
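[Editorial aside: thought (3) above, appending partitions unsorted to a concurrent queue and sorting only at flush time, might look roughly like the sketch below. All names here are hypothetical, and a real implementation would write an sstable at flush rather than collecting batches in memory.]

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: writer threads append partitions, unsorted, onto a
// ConcurrentLinkedQueue while tracking buffered bytes; once the threshold is
// exceeded the buffer is drained, sorted by partition key, and "flushed"
// (here just collected into a list) -- much like memtable flushing.
public class BufferedWriter {
    record Partition(long key, byte[] payload) {}

    private final ConcurrentLinkedQueue<Partition> buffer = new ConcurrentLinkedQueue<>();
    private final AtomicLong bufferedBytes = new AtomicLong();
    private final long flushThreshold;
    final List<List<Partition>> flushed = new ArrayList<>();

    BufferedWriter(long flushThreshold) {
        this.flushThreshold = flushThreshold;
    }

    void add(Partition p) {
        buffer.add(p);
        // Sorting is deferred: we pay it once per flush, not per insert.
        if (bufferedBytes.addAndGet(p.payload().length) >= flushThreshold)
            flush();
    }

    synchronized void flush() {
        List<Partition> batch = new ArrayList<>();
        Partition p;
        while ((p = buffer.poll()) != null)
            batch.add(p);
        if (batch.isEmpty())
            return;
        batch.sort(Comparator.comparingLong(Partition::key));
        bufferedBytes.addAndGet(-batch.stream().mapToLong(x -> x.payload().length).sum());
        flushed.add(batch); // a real implementation would write an sstable here
    }
}
```

The design choice mirrors the comment: inserts stay cheap and lock-free on the hot path, while the sort cost is amortized over an entire flush batch.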
[jira] [Commented] (CASSANDRA-7644) tracing does not log commitlog/memtable ops when the coordinator is a replica
[ https://issues.apache.org/jira/browse/CASSANDRA-7644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14081129#comment-14081129 ]

Benedict commented on CASSANDRA-7644:

LGTM

> tracing does not log commitlog/memtable ops when the coordinator is a replica
> -----------------------------------------------------------------------------
>
>         Key: CASSANDRA-7644
>         URL: https://issues.apache.org/jira/browse/CASSANDRA-7644
>     Project: Cassandra
>  Issue Type: Bug
>  Components: Core
>    Reporter: Brandon Williams
>    Assignee: Jonathan Ellis
>    Priority: Minor
>     Fix For: 2.1.0
> Attachments: 7644.txt
>
> For instance:
> {noformat}
> session_id | event_id | activity | source | source_elapsed | thread
> -------------------------------------+--------------------------------------+---------------------------------------------------------------+---------------+----------------+-------------------------
> bb1c4bc0-176f-11e4-8893-4b4842ed69b9 | bb1c4bc1-176f-11e4-8893-4b4842ed69b9 | Parsing insert into Standard1 (key, C0) VALUES ( 0xff, 0xff); | 10.208.8.123 | 86 | SharedPool-Worker-5
> bb1c4bc0-176f-11e4-8893-4b4842ed69b9 | bb1c72d0-176f-11e4-8893-4b4842ed69b9 | Preparing statement | 10.208.8.123 | 434 | SharedPool-Worker-5
> bb1c4bc0-176f-11e4-8893-4b4842ed69b9 | bb1c72d1-176f-11e4-8893-4b4842ed69b9 | Determining replicas for mutation | 10.208.8.123 | 534 | SharedPool-Worker-5
> bb1c4bc0-176f-11e4-8893-4b4842ed69b9 | bb1c72d2-176f-11e4-8893-4b4842ed69b9 | Sending message to /10.208.8.63 | 10.208.8.123 | 1157 | WRITE-/10.208.8.63
> bb1c4bc0-176f-11e4-8893-4b4842ed69b9 | bb1c99e0-176f-11e4-8893-4b4842ed69b9 | Sending message to /10.208.35.225 | 10.208.8.123 | 1975 | WRITE-/10.208.35.225
> bb1c4bc0-176f-11e4-8893-4b4842ed69b9 | bb1d0f10-176f-11e4-8893-4b4842ed69b9 | Message received from /10.208.8.63 | 10.208.8.123 | 4732 | Thread-5
> bb1c4bc0-176f-11e4-8893-4b4842ed69b9 | bb1d0f11-176f-11e4-8893-4b4842ed69b9 | Message received from /10.208.35.225 | 10.208.8.123 | 5086 | Thread-4
> bb1c4bc0-176f-11e4-8893-4b4842ed69b9 | bb1d3620-176f-11e4-8893-4b4842ed69b9 | Processing response from /10.208.8.63 | 10.208.8.123 | 5288 | SharedPool-Worker-7
> bb1c4bc0-176f-11e4-8893-4b4842ed69b9 | bb1d3620-176f-11e4-93e6-517bcdb23258 | Message received from /10.208.8.123 | 10.208.35.225 | 76 | Thread-4
> bb1c4bc0-176f-11e4-8893-4b4842ed69b9 | bb1d3620-176f-11e4-9b20-3b546d897db7 | Message received from /10.208.8.123 | 10.208.8.63 | 317 | Thread-4
> bb1c4bc0-176f-11e4-8893-4b4842ed69b9 | bb1d3621-176f-11e4-8893-4b4842ed69b9 | Processing response from /10.208.35.225 | 10.208.8.123 | 5332 | SharedPool-Worker-7
> bb1c4bc0-176f-11e4-8893-4b4842ed69b9 | bb1d3621-176f-11e4-93e6-517bcdb23258 | Appending to commitlog | 10.208.35.225 | 322 | SharedPool-Worker-4
> bb1c4bc0-176f-11e4-8893-4b4842ed69b9 | bb1d3622-176f-11e4-93e6-517bcdb23258 | Adding to Standard1 memtable | 10.208.35.225 | 386 | SharedPool-Worker-4
> bb1c4bc0-176f-11e4-8893-4b4842ed69b9 | bb1d3623-176f-11e4-93e6-517bcdb23258 | Enqueuing response to /10.208.8.123 | 10.208.35.225 | 451 | SharedPool-Worker-4
> bb1c4bc0-176f-11e4-8893-4b4842ed69b9 | bb1d5d30-176f-11e4-93e6-517bcdb23258 | Sending message to bw-1/10.208.8.123 | 10.208.35.225 | 1538 | WRITE-bw-1/10.208.8.123
> bb1c4bc0-176f-11e4-8893-4b4842ed69b9 | bb1d5d30-176f-11e4-9b20-3b546d897db7 | Appending to commitlog | 10.208.8.63 | 1191 | SharedPool-Worker-7
> bb1c4bc0-176f-11e4-8893-4b4842ed69b9 | bb1d5d31-176f-11e4-9b20-3b546d897db7 | Adding to Standard1 memtable | 10.208.8.63 | 1226 | SharedPool-Worker-7
> bb1c4bc0-176f-11e4-8893-4b4842ed69b9 | bb1d5d32-176f-11e4-9b20-3b546d897db7 | Enqueuing response to /10.208.8.123 | 10.208.8.63 | 1277 | SharedPool-Worker-7
> {noformat}
[jira] [Commented] (CASSANDRA-7593) Errors when upgrading through several versions to 2.1
[ https://issues.apache.org/jira/browse/CASSANDRA-7593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14081135#comment-14081135 ] Benedict commented on CASSANDRA-7593: - Yes, this is what I meant when I said expose it in CSCNT; didn't spot it was already exposed Errors when upgrading through several versions to 2.1 - Key: CASSANDRA-7593 URL: https://issues.apache.org/jira/browse/CASSANDRA-7593 Project: Cassandra Issue Type: Bug Environment: java 1.7 Reporter: Russ Hatch Assignee: Tyler Hobbs Priority: Critical Fix For: 2.1.0 Attachments: 0001-keep-clusteringSize-in-CompoundComposite.patch, 7593.txt I'm seeing two different errors cropping up in the dtest which upgrades a cluster through several versions. This is the more common error: {noformat} ERROR [GossipStage:10] 2014-07-22 13:14:30,028 CassandraDaemon.java:168 - Exception in thread Thread[GossipStage:10,5,main] java.lang.AssertionError: null at org.apache.cassandra.db.filter.SliceQueryFilter.shouldInclude(SliceQueryFilter.java:347) ~[main/:na] at org.apache.cassandra.db.filter.QueryFilter.shouldInclude(QueryFilter.java:249) ~[main/:na] at org.apache.cassandra.db.CollationController.collectAllData(CollationController.java:249) ~[main/:na] at org.apache.cassandra.db.CollationController.getTopLevelColumns(CollationController.java:60) ~[main/:na] at org.apache.cassandra.db.ColumnFamilyStore.getTopLevelColumns(ColumnFamilyStore.java:1873) ~[main/:na] at org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1681) ~[main/:na] at org.apache.cassandra.db.Keyspace.getRow(Keyspace.java:345) ~[main/:na] at org.apache.cassandra.db.SliceFromReadCommand.getRow(SliceFromReadCommand.java:59) ~[main/:na] at org.apache.cassandra.cql3.statements.SelectStatement.readLocally(SelectStatement.java:293) ~[main/:na] at org.apache.cassandra.cql3.statements.SelectStatement.executeInternal(SelectStatement.java:302) ~[main/:na] at 
org.apache.cassandra.cql3.statements.SelectStatement.executeInternal(SelectStatement.java:60) ~[main/:na] at org.apache.cassandra.cql3.QueryProcessor.executeInternal(QueryProcessor.java:263) ~[main/:na] at org.apache.cassandra.db.SystemKeyspace.getPreferredIP(SystemKeyspace.java:514) ~[main/:na] at org.apache.cassandra.net.OutboundTcpConnectionPool.init(OutboundTcpConnectionPool.java:51) ~[main/:na] at org.apache.cassandra.net.MessagingService.getConnectionPool(MessagingService.java:522) ~[main/:na] at org.apache.cassandra.net.MessagingService.getConnection(MessagingService.java:536) ~[main/:na] at org.apache.cassandra.net.MessagingService.sendOneWay(MessagingService.java:689) ~[main/:na] at org.apache.cassandra.net.MessagingService.sendReply(MessagingService.java:663) ~[main/:na] at org.apache.cassandra.service.EchoVerbHandler.doVerb(EchoVerbHandler.java:40) ~[main/:na] at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:62) ~[main/:na] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) ~[na:1.7.0_60] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) ~[na:1.7.0_60] at java.lang.Thread.run(Thread.java:745) ~[na:1.7.0_60] {noformat} The same test sometimes fails with this exception instead: {noformat} ERROR [CompactionExecutor:4] 2014-07-22 16:18:21,008 CassandraDaemon.java:168 - Exception in thread Thread[CompactionExecutor:4,1,RMI Runtime] java.util.concurrent.RejectedExecutionException: Task java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask@7059d3e9 rejected from org.apache.cassandra.concurrent.DebuggableScheduledThreadPoolExecutor@108f1504[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 95] at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2048) ~[na:1.7.0_60] at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:821) ~[na:1.7.0_60] at 
java.util.concurrent.ScheduledThreadPoolExecutor.delayedExecute(ScheduledThreadPoolExecutor.java:325) ~[na:1.7.0_60] at java.util.concurrent.ScheduledThreadPoolExecutor.schedule(ScheduledThreadPoolExecutor.java:530) ~[na:1.7.0_60] at java.util.concurrent.ScheduledThreadPoolExecutor.execute(ScheduledThreadPoolExecutor.java:619) ~[na:1.7.0_60] at org.apache.cassandra.io.sstable.SSTableReader.scheduleTidy(SSTableReader.java:628)
[jira] [Commented] (CASSANDRA-7511) Always flush on TRUNCATE
[ https://issues.apache.org/jira/browse/CASSANDRA-7511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14082295#comment-14082295 ] Benedict commented on CASSANDRA-7511: - Looking at 2.1, it is actually still affected by this bug. I don't mind which solution we go for in 2.1; always flush, or grab the last replay position from the memtable (either is pretty trivial) Always flush on TRUNCATE Key: CASSANDRA-7511 URL: https://issues.apache.org/jira/browse/CASSANDRA-7511 Project: Cassandra Issue Type: Bug Environment: CentOS 6.5, Oracle Java 7u60, C* 2.0.6, 2.0.9, including earlier 1.0.* versions. Reporter: Viktor Jevdokimov Assignee: Jeremiah Jordan Priority: Minor Labels: commitlog Fix For: 2.0.10 Attachments: 7511-2.0-v2.txt, 7511-v3-remove-renewMemtable.txt, 7511-v3-test.txt, 7511-v3.txt, 7511.txt Commit log grows infinitely after a CF truncate operation via cassandra-cli, regardless of whether the CF receives writes thereafter. CFs could be non-CQL Standard or Super column type. Creation of snapshots after truncate is turned off. Commit log may start growing promptly or later, on a few nodes only or on all nodes at once. Nothing special in the system log. No idea how to reproduce. After a rolling restart commit logs are cleared and back to normal. Just annoying to do a rolling restart after each truncate. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-7631) Allow Stress to write directly to SSTables
[ https://issues.apache.org/jira/browse/CASSANDRA-7631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14082854#comment-14082854 ] Benedict commented on CASSANDRA-7631: - I'd be inclined to keep this feature for user commands only, to keep maintenance complexity down. The writing is on the wall for the legacy mode anyway, for anything other than a very quick benchmark of general server performance. I don't see a reason to build the sstables up in advance for that kind of use case. Allow Stress to write directly to SSTables -- Key: CASSANDRA-7631 URL: https://issues.apache.org/jira/browse/CASSANDRA-7631 Project: Cassandra Issue Type: Improvement Components: Tools Reporter: Russell Alexander Spitzer Assignee: Russell Alexander Spitzer One common difficulty with benchmarking machines is the amount of time it takes to initially load data. For machines with a large amount of RAM this becomes especially onerous because a very large amount of data needs to be placed on the machine before the page cache can be circumvented. To remedy this I suggest we add a top-level flag to cassandra-stress which would cause the tool to write directly to sstables rather than actually performing CQL inserts. Internally this would use CQLSSTableWriter to write directly to sstables while skipping any keys which are not owned by the node stress is running on. The same stress command run on each node in the cluster would then write unique sstables only containing data which that node is responsible for. Following this no further network IO would be required to distribute data as it would all already be correctly in place. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-7631) Allow Stress to write directly to SSTables
[ https://issues.apache.org/jira/browse/CASSANDRA-7631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14082880#comment-14082880 ] Benedict commented on CASSANDRA-7631: - I think it's a bit early for that. Let's pencil that in for 3.0. Allow Stress to write directly to SSTables -- Key: CASSANDRA-7631 URL: https://issues.apache.org/jira/browse/CASSANDRA-7631 Project: Cassandra Issue Type: Improvement Components: Tools Reporter: Russell Alexander Spitzer Assignee: Russell Alexander Spitzer One common difficulty with benchmarking machines is the amount of time it takes to initially load data. For machines with a large amount of ram this becomes especially onerous because a very large amount of data needs to be placed on the machine before page-cache can be circumvented. To remedy this I suggest we add a top level flag to Cassandra-Stress which would cause the tool to write directly to sstables rather than actually performing CQL inserts. Internally this would use CQLSStable writer to write directly to sstables while skipping any keys which are not owned by the node stress is running on. The same stress command run on each node in the cluster would then write unique sstables only containing data which that node is responsible for. Following this no further network IO would be required to distribute data as it would all already be correctly in place. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-6276) CQL: Map can not be created with the same name as a previously dropped list
[ https://issues.apache.org/jira/browse/CASSANDRA-6276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14084921#comment-14084921 ] Benedict commented on CASSANDRA-6276: - 3.0 storage engine won't be universal for a while (maybe never for thrift), but will index directly into columns (i.e. won't touch any columns not requested), so could trivially avoid retrieving data for dropped columns. The only problem is we'd need to track the range of sstables for which they were previously dropped (and may contain stale data), and to which we now apply the new comparator, which would be a bit ugly/annoying. CQL: Map can not be created with the same name as a previously dropped list --- Key: CASSANDRA-6276 URL: https://issues.apache.org/jira/browse/CASSANDRA-6276 Project: Cassandra Issue Type: Bug Environment: Cassandra 2.0.2 | CQL spec 3.1.0 centos 64 bit Java(TM) SE Runtime Environment (build 1.7.0-b147) Reporter: Oli Schacher Assignee: Benjamin Lerer Priority: Minor Labels: cql Fix For: 2.1.1 Attachments: CASSANDRA-6276.txt If I create a list, drop it and create a map with the same name, I get Bad Request: comparators do not match or are not compatible. {quote} cqlsh:os_test1 create table thetable(id timeuuid primary key, somevalue text); cqlsh:os_test1 alter table thetable add mycollection list<text>; cqlsh:os_test1 alter table thetable drop mycollection; cqlsh:os_test1 alter table thetable add mycollection map<text,text>; Bad Request: comparators do not match or are not compatible. {quote} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (CASSANDRA-7546) AtomicSortedColumns.addAllWithSizeDelta has a spin loop that allocates memory
[ https://issues.apache.org/jira/browse/CASSANDRA-7546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benedict updated CASSANDRA-7546: Attachment: 7546.20_6.txt I've attached a slightly tweaked version, making things a little clearer (IMO) and removing some of the unnecessary comments, as well as fixing a couple of bugs and removing the AtomicReferenceHolder to recoup the extra space we're now using in the Holder. I must admit I'm still not madly keen on the nested synchronized() calls - I think they're a little ugly, and also increase call depth, which is not ideal. I also cannot find any evidence that invoking unsafe.monitorenter/monitorexit would result in negative optimisations (this discussion on the relevant mailing list makes no such assertion whilst discussing its potential removal [http://openjdk.5641.n7.nabble.com/Unsafe-removing-the-monitorEnter-monitorExit-tryMonitorEnter-methods-td179462.html], but suggests exposing them more safely), however mostly I think the usage is clearer than nested calls passing the state of the method (isSynchronized is esp. ugly to me). I am not dead set against it though. Perhaps [~iamaleksey] can offer a third opinion? Otherwise, WDYT [~graham sanderson]? Could you give this patch a test and see how it behaves? AtomicSortedColumns.addAllWithSizeDelta has a spin loop that allocates memory - Key: CASSANDRA-7546 URL: https://issues.apache.org/jira/browse/CASSANDRA-7546 Project: Cassandra Issue Type: Bug Components: Core Reporter: graham sanderson Assignee: graham sanderson Attachments: 7546.20.txt, 7546.20_2.txt, 7546.20_3.txt, 7546.20_4.txt, 7546.20_5.txt, 7546.20_6.txt, 7546.20_alt.txt, suggestion1.txt, suggestion1_21.txt In order to preserve atomicity, this code attempts to read, clone/update, then CAS the state of the partition. Under heavy contention for updating a single partition this can cause some fairly staggering memory growth (the more cores on your machine, the worse it gets).
Whilst many usage patterns don't do highly concurrent updates to the same partition, hinting today does, and in this case wild (order(s) of magnitude more than expected) memory allocation rates can be seen (especially when the updates being hinted are small updates to different partitions, which can happen very fast on their own) - see CASSANDRA-7545. It would be best to eliminate/reduce/limit the spinning memory allocation whilst not slowing down the very common un-contended case. -- This message was sent by Atlassian JIRA (v6.2#6252)
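The contended update pattern the ticket describes - read, clone, mutate, CAS, retry - can be sketched as follows. This is a toy illustration of why failed CAS attempts translate directly into wasted allocation, not the actual AtomicSortedColumns code; all names are made up:

```java
import java.util.TreeMap;
import java.util.concurrent.atomic.AtomicReference;

// Toy sketch of the read/clone/CAS pattern: every failed CAS throws away a
// full copy of the partition state, which is where the allocation blow-up
// under contention comes from. Names are illustrative only.
class CasPartition {
    private final AtomicReference<TreeMap<String, String>> ref =
            new AtomicReference<>(new TreeMap<>());

    /** Adds a column, returning the number of CAS attempts it took. */
    int addColumn(String name, String value) {
        int attempts = 0;
        while (true) {
            attempts++;
            TreeMap<String, String> current = ref.get();
            // Clone the whole state; this allocation is wasted if the CAS fails.
            TreeMap<String, String> updated = new TreeMap<>(current);
            updated.put(name, value);
            if (ref.compareAndSet(current, updated))
                return attempts;
        }
    }

    String get(String name) {
        return ref.get().get(name);
    }
}
```

With one writer the loop succeeds on the first attempt; with many writers hitting the same partition, each retry repeats the clone, multiplying allocation by the contention level.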
[jira] [Updated] (CASSANDRA-7704) FileNotFoundException during STREAM-OUT triggers 100% CPU usage
[ https://issues.apache.org/jira/browse/CASSANDRA-7704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benedict updated CASSANDRA-7704: Attachment: 7704.txt Attaching a patch that I think addresses this. There are a number of concurrency bugs here, and whilst we could fix them with more advanced lock-freedom, there is no compelling reason for this class not to use synchronized everywhere, which would probably have avoided this problem in the first place. There is only one place where the execution is not guaranteed to be prompt, and I have left this out of the synchronization. I have at the same time simplified the logic and fixed the cancellation of timeouts, as well as made the scheduled executor for timeouts globally shared (there's no good reason to spin up a new executor for each set of transfers). In this particular instance the issue seems to have been a lack of atomicity between abort() and complete(); an ACK arrived at the same time as abort() was cancelling all transfers, causing a reference to be released twice. This could also occur with the timeouts, but since they occur only every 12 hours, the risk is low. FileNotFoundException during STREAM-OUT triggers 100% CPU usage --- Key: CASSANDRA-7704 URL: https://issues.apache.org/jira/browse/CASSANDRA-7704 Project: Cassandra Issue Type: Bug Reporter: Rick Branson Attachments: 7704.txt, backtrace.txt See attached backtrace which was what triggered this. This stream failed and then ~12 seconds later it emitted that exception. At that point, all CPUs went to 100%. A thread dump shows all the ReadStage threads stuck inside IntervalTree.searchInternal inside of CFS.markReferenced(). -- This message was sent by Atlassian JIRA (v6.2#6252)
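The abort()/complete() race described in the comment can be sketched roughly as below (illustrative names only, not the actual streaming classes): without a common lock, an ACK arriving while abort() is cancelling transfers can release the same file reference twice; with synchronized on both paths, each transfer is released exactly once.

```java
import java.util.HashSet;
import java.util.Set;

// Rough sketch of the coarse-synchronization fix: complete() (ACK path) and
// abort() (failure path) both mutate the shared transfer set under the same
// monitor, so a file can never be released twice. Names are illustrative.
class TransferSession {
    private final Set<String> active = new HashSet<>();
    private int releases = 0;

    synchronized void begin(String file) {
        active.add(file);
    }

    // Called when the receiver ACKs a file; releasing only on successful
    // removal makes a duplicate or racing ACK a no-op.
    synchronized void complete(String file) {
        if (active.remove(file))
            releases++;
    }

    // Called on failure/timeout; releases whatever is still outstanding,
    // and cannot double-release files already ACKed.
    synchronized void abort() {
        releases += active.size();
        active.clear();
    }

    synchronized int releaseCount() {
        return releases;
    }
}
```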
[jira] [Assigned] (CASSANDRA-7704) FileNotFoundException during STREAM-OUT triggers 100% CPU usage
[ https://issues.apache.org/jira/browse/CASSANDRA-7704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benedict reassigned CASSANDRA-7704: --- Assignee: Benedict FileNotFoundException during STREAM-OUT triggers 100% CPU usage --- Key: CASSANDRA-7704 URL: https://issues.apache.org/jira/browse/CASSANDRA-7704 Project: Cassandra Issue Type: Bug Reporter: Rick Branson Assignee: Benedict Attachments: 7704.txt, backtrace.txt See attached backtrace which was what triggered this. This stream failed and then ~12 seconds later it emitted that exception. At that point, all CPUs went to 100%. A thread dump shows all the ReadStage threads stuck inside IntervalTree.searchInternal inside of CFS.markReferenced(). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-7704) FileNotFoundException during STREAM-OUT triggers 100% CPU usage
[ https://issues.apache.org/jira/browse/CASSANDRA-7704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14087577#comment-14087577 ] Benedict commented on CASSANDRA-7704: - [~rbranson], ftr, could we get the earlier stack traces you saw and other related info? I suspect it's possible the earlier failing transfer caused a file to be deleted prematurely, which then caused this failure. Both the same bug. FileNotFoundException during STREAM-OUT triggers 100% CPU usage --- Key: CASSANDRA-7704 URL: https://issues.apache.org/jira/browse/CASSANDRA-7704 Project: Cassandra Issue Type: Bug Reporter: Rick Branson Assignee: Benedict Attachments: 7704.txt, backtrace.txt See attached backtrace which was what triggered this. This stream failed and then ~12 seconds later it emitted that exception. At that point, all CPUs went to 100%. A thread dump shows all the ReadStage threads stuck inside IntervalTree.searchInternal inside of CFS.markReferenced(). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (CASSANDRA-7705) Safer Resource Management
Benedict created CASSANDRA-7705: --- Summary: Safer Resource Management Key: CASSANDRA-7705 URL: https://issues.apache.org/jira/browse/CASSANDRA-7705 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Benedict Fix For: 3.0 We've had a spate of bugs recently with bad reference counting. These can have potentially dire consequences, generally either randomly deleting data or giving us infinite loops. Since in 2.1 we only reference count resources that are relatively expensive and infrequently managed, we could without any negative consequences (and only slight code complexity) introduce a safer resource management scheme. Basically, I propose that when we want to acquire a resource we allocate an object that manages the reference. This can only be released once; if it is released twice, we fail immediately at the second release, reporting where the bug is (rather than letting it continue fine until the next correct release corrupts the count). The reference counter remains the same, but we obtain guarantees that the reference count itself is never badly maintained, although code using it could mistakenly release its own handle early (typically this is only an issue when cleaning up after a failure, in which case under the new scheme this would be an innocuous error) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (CASSANDRA-7705) Safer Resource Management
[ https://issues.apache.org/jira/browse/CASSANDRA-7705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benedict updated CASSANDRA-7705: Description: We've had a spate of bugs recently with bad reference counting. These can have potentially dire consequences, generally either randomly deleting data or giving us infinite loops. Since in 2.1 we only reference count resources that are relatively expensive and infrequently managed (or in places where this safety is probably not as necessary, e.g. SerializingCache), we could without any negative consequences (and only slight code complexity) introduce a safer resource management scheme for these more expensive/infrequent actions. Basically, I propose that when we want to acquire a resource we allocate an object that manages the reference. This can only be released once; if it is released twice, we fail immediately at the second release, reporting where the bug is (rather than letting it continue fine until the next correct release corrupts the count). The reference counter remains the same, but we obtain guarantees that the reference count itself is never badly maintained, although code using it could mistakenly release its own handle early (typically this is only an issue when cleaning up after a failure, in which case under the new scheme this would be an innocuous error) was: We've had a spate of bugs recently with bad reference counting. These can have potentially dire consequences, generally either randomly deleting data or giving us infinite loops. Since in 2.1 we only reference count resources that are relatively expensive and infrequently managed, we could without any negative consequences (and only slight code complexity) introduce a safer resource management scheme. Basically, I propose that when we want to acquire a resource we allocate an object that manages the reference.
This can only be released once; if it is released twice, we fail immediately at the second release, reporting where the bug is (rather than letting it continue fine until the next correct release corrupts the count). The reference counter remains the same, but we obtain guarantees that the reference count itself is never badly maintained, although code using it could mistakenly release its own handle early (typically this is only an issue when cleaning up after a failure, in which case under the new scheme this would be an innocuous error) Safer Resource Management - Key: CASSANDRA-7705 URL: https://issues.apache.org/jira/browse/CASSANDRA-7705 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Benedict Fix For: 3.0 We've had a spate of bugs recently with bad reference counting. These can have potentially dire consequences, generally either randomly deleting data or giving us infinite loops. Since in 2.1 we only reference count resources that are relatively expensive and infrequently managed (or in places where this safety is probably not as necessary, e.g. SerializingCache), we could without any negative consequences (and only slight code complexity) introduce a safer resource management scheme for these more expensive/infrequent actions. Basically, I propose that when we want to acquire a resource we allocate an object that manages the reference. This can only be released once; if it is released twice, we fail immediately at the second release, reporting where the bug is (rather than letting it continue fine until the next correct release corrupts the count). The reference counter remains the same, but we obtain guarantees that the reference count itself is never badly maintained, although code using it could mistakenly release its own handle early (typically this is only an issue when cleaning up after a failure, in which case under the new scheme this would be an innocuous error) -- This message was sent by Atlassian JIRA (v6.2#6252)
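A minimal sketch of the scheme proposed above (hypothetical names, not an actual Cassandra API): each acquirer holds its own release-once handle, so a double release throws at the offending call site instead of silently corrupting the shared count.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of the proposed safer scheme: the shared counter is only ever
// touched through per-acquirer handles, each of which can be released
// exactly once. Class and method names are illustrative.
class RefCounted {
    private final AtomicInteger count = new AtomicInteger(0);

    Ref acquire() {
        count.incrementAndGet();
        return new Ref();
    }

    int refCount() {
        return count.get();
    }

    class Ref {
        private final AtomicBoolean released = new AtomicBoolean(false);

        void release() {
            // Fail loudly at the second release, pinpointing the buggy
            // call site, rather than letting the count go negative.
            if (!released.compareAndSet(false, true))
                throw new IllegalStateException("double release; the bug is at this call site");
            count.decrementAndGet();
        }
    }
}
```

The key property is that no sequence of calls on handles can drive the shared count below the number of genuinely outstanding references.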
[jira] [Commented] (CASSANDRA-7546) AtomicSortedColumns.addAllWithSizeDelta has a spin loop that allocates memory
[ https://issues.apache.org/jira/browse/CASSANDRA-7546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14087814#comment-14087814 ] Benedict commented on CASSANDRA-7546: - Well, technically we never ever call addColumn() directly, but in 2.0 we haven't removed / UnsupportedOperationException'd that path, so I'm not totally comfortable leaving it as a regular int, as an external call to addColumn would break it (but then, this probably isn't the end of the world). However, I actually introduced a double counting bug in changing that :/ ... and since we don't want to incur the incAndGet every change, and we don't want to dup code, let's settle for the possible race for maintaining size if somebody uses the API in a way it isn't used in the codebase right now. However I think I would prefer to make size final in this case. AtomicSortedColumns.addAllWithSizeDelta has a spin loop that allocates memory - Key: CASSANDRA-7546 URL: https://issues.apache.org/jira/browse/CASSANDRA-7546 Project: Cassandra Issue Type: Bug Components: Core Reporter: graham sanderson Assignee: graham sanderson Attachments: 7546.20.txt, 7546.20_2.txt, 7546.20_3.txt, 7546.20_4.txt, 7546.20_5.txt, 7546.20_6.txt, 7546.20_alt.txt, suggestion1.txt, suggestion1_21.txt In order to preserve atomicity, this code attempts to read, clone/update, then CAS the state of the partition. Under heavy contention for updating a single partition this can cause some fairly staggering memory growth (the more cores on your machine, the worse it gets).
Whilst many usage patterns don't do highly concurrent updates to the same partition, hinting today, does, and in this case wild (order(s) of magnitude more than expected) memory allocation rates can be seen (especially when the updates being hinted are small updates to different partitions which can happen very fast on their own) - see CASSANDRA-7545 It would be best to eliminate/reduce/limit the spinning memory allocation whilst not slowing down the very common un-contended case. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Comment Edited] (CASSANDRA-7546) AtomicSortedColumns.addAllWithSizeDelta has a spin loop that allocates memory
[ https://issues.apache.org/jira/browse/CASSANDRA-7546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14087814#comment-14087814 ] Benedict edited comment on CASSANDRA-7546 at 8/6/14 3:56 PM: - Well, technically we never ever call addColumn() directly, but in 2.0 we haven't removed / UnsupportedOperationException'd that path, so I'm not totally comfortable leaving it as a regular int, as an external call to addColumn would break it (but then, this probably isn't the end of the world). However, I actually introduced a double counting bug in changing that :/ ... and since we don't want to incur the incAndGet every change, and we don't want to dup code, let's settle for the possible race for maintaining size if somebody uses the API in a way it isn't used in the codebase right now. -However I think I would prefer to make size final in this case.- Looking again, it's too ugly to make it final, so let's settle for the ugliness of it being non-final, and revert to your behaviour here. This bit is soon to be superseded by 2.1 anyway, so let's not agonise over the beauty of it. was (Author: benedict): Well, technically we never ever call addColumn() directly, but in 2.0 we haven't removed / UnsupportedOperationException'd that path, so I'm not totally comfortable leaving it as a regular int, as an external call to addColumn would break it (but then, this probably isn't the end of the world). However, I actually introduced a double counting bug in changing that :/ ... and since we don't want to incur the incAndGet every change, and we don't want to dup code, let's settle for the possible race for maintaining size if somebody uses the API in a way it isn't used in the codebase right now. However I think I would prefer to make size final in this case.
AtomicSortedColumns.addAllWithSizeDelta has a spin loop that allocates memory - Key: CASSANDRA-7546 URL: https://issues.apache.org/jira/browse/CASSANDRA-7546 Project: Cassandra Issue Type: Bug Components: Core Reporter: graham sanderson Assignee: graham sanderson Attachments: 7546.20.txt, 7546.20_2.txt, 7546.20_3.txt, 7546.20_4.txt, 7546.20_5.txt, 7546.20_6.txt, 7546.20_alt.txt, suggestion1.txt, suggestion1_21.txt In order to preserve atomicity, this code attempts to read, clone/update, then CAS the state of the partition. Under heavy contention for updating a single partition this can cause some fairly staggering memory growth (the more cores on your machine the worst it gets). Whilst many usage patterns don't do highly concurrent updates to the same partition, hinting today, does, and in this case wild (order(s) of magnitude more than expected) memory allocation rates can be seen (especially when the updates being hinted are small updates to different partitions which can happen very fast on their own) - see CASSANDRA-7545 It would be best to eliminate/reduce/limit the spinning memory allocation whilst not slowing down the very common un-contended case. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-7546) AtomicSortedColumns.addAllWithSizeDelta has a spin loop that allocates memory
[ https://issues.apache.org/jira/browse/CASSANDRA-7546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14087850#comment-14087850 ] Benedict commented on CASSANDRA-7546: - bq. We probably mean to the left of... before or after are a bit confusing here! Yep, good catch! bq. Volatile read of the wasteTracker in the fast path. We mostly optimise for x86 for the moment, and it's essentially free here as you say. Even on platforms where it isn't, it's unlikely to be a significant part of the overall costs, so better to keep it cleaner. bq. Adjacent in memory CASed vars in the AtomicSortedColumns - Again not majorly worried here... I don't think the (CASed) variables themselves are highly contended, it is more that we are doing lots of slow concurrent work, and then failing the CAS. Absolutely not worried about this. Like you say, most of the cost is elsewhere. Would be much worse to pollute the cache with padding to avoid it. AtomicSortedColumns.addAllWithSizeDelta has a spin loop that allocates memory - Key: CASSANDRA-7546 URL: https://issues.apache.org/jira/browse/CASSANDRA-7546 Project: Cassandra Issue Type: Bug Components: Core Reporter: graham sanderson Assignee: graham sanderson Attachments: 7546.20.txt, 7546.20_2.txt, 7546.20_3.txt, 7546.20_4.txt, 7546.20_5.txt, 7546.20_6.txt, 7546.20_alt.txt, suggestion1.txt, suggestion1_21.txt In order to preserve atomicity, this code attempts to read, clone/update, then CAS the state of the partition. Under heavy contention for updating a single partition this can cause some fairly staggering memory growth (the more cores on your machine, the worse it gets).
Whilst many usage patterns don't do highly concurrent updates to the same partition, hinting today, does, and in this case wild (order(s) of magnitude more than expected) memory allocation rates can be seen (especially when the updates being hinted are small updates to different partitions which can happen very fast on their own) - see CASSANDRA-7545 It would be best to eliminate/reduce/limit the spinning memory allocation whilst not slowing down the very common un-contended case. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-7546) AtomicSortedColumns.addAllWithSizeDelta has a spin loop that allocates memory
[ https://issues.apache.org/jira/browse/CASSANDRA-7546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14087885#comment-14087885 ] Benedict commented on CASSANDRA-7546: - Sounds good, thanks! AtomicSortedColumns.addAllWithSizeDelta has a spin loop that allocates memory - Key: CASSANDRA-7546 URL: https://issues.apache.org/jira/browse/CASSANDRA-7546 Project: Cassandra Issue Type: Bug Components: Core Reporter: graham sanderson Assignee: graham sanderson Attachments: 7546.20.txt, 7546.20_2.txt, 7546.20_3.txt, 7546.20_4.txt, 7546.20_5.txt, 7546.20_6.txt, 7546.20_7.txt, 7546.20_alt.txt, suggestion1.txt, suggestion1_21.txt In order to preserve atomicity, this code attempts to read, clone/update, then CAS the state of the partition. Under heavy contention for updating a single partition this can cause some fairly staggering memory growth (the more cores on your machine the worst it gets). Whilst many usage patterns don't do highly concurrent updates to the same partition, hinting today, does, and in this case wild (order(s) of magnitude more than expected) memory allocation rates can be seen (especially when the updates being hinted are small updates to different partitions which can happen very fast on their own) - see CASSANDRA-7545 It would be best to eliminate/reduce/limit the spinning memory allocation whilst not slowing down the very common un-contended case. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-7282) Faster Memtable map
[ https://issues.apache.org/jira/browse/CASSANDRA-7282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14089019#comment-14089019 ] Benedict commented on CASSANDRA-7282: - Just pushed a minor update removing an extraneous comment and making the resize threshold triggering uniform + clearer. Agree it would be good to get some performance numbers on this, but I'm not sure which magical facilities you're referring to? New stress isn't likely to stress this bit out any more interestingly than old stress, and we don't yet have a working magical performance service I don't think... Faster Memtable map --- Key: CASSANDRA-7282 URL: https://issues.apache.org/jira/browse/CASSANDRA-7282 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Benedict Assignee: Benedict Labels: performance Fix For: 3.0 Currently we maintain a ConcurrentSkipListMap of DecoratedKey -> Partition in our memtables. Maintaining this is an O(lg(n)) operation; since the vast majority of users use a hash partitioner, it occurs to me we could maintain a hybrid ordered list / hash map. The list would impose the normal order on the collection, but a hash index would live alongside as part of the same data structure, simply mapping into the list and permitting O(1) lookups and inserts. I've chosen to implement this initial version as a linked-list node per item, but we can optimise this in future by storing fatter nodes that permit a cache-line's worth of hashes to be checked at once, further reducing the constant factor costs for lookups. -- This message was sent by Atlassian JIRA (v6.2#6252)
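The hybrid ordered list / hash map described in the ticket can be sketched as follows. This is a hypothetical, heavily simplified Java illustration (single-writer, append-only; not the actual patch): the linked list imposes iteration order, while a hash index over the same nodes gives O(1) point lookups.

```java
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the hybrid structure: one set of nodes, two views.
// The list provides ordered iteration; the hash index maps keys directly
// to list nodes for O(1) lookups. (Names are illustrative, not Cassandra's.)
class HybridOrderedMap<K, V> {
    static final class Node<K, V> {
        final K key; final V value; Node<K, V> next;
        Node(K key, V value) { this.key = key; this.value = value; }
    }

    private final ConcurrentHashMap<K, Node<K, V>> index = new ConcurrentHashMap<>();
    private Node<K, V> head, tail;

    // Single-writer append keeps this sketch simple; a real memtable insert
    // must splice into sorted position and cope with concurrent writers.
    synchronized void append(K key, V value) {
        Node<K, V> node = new Node<>(key, value);
        if (tail == null) head = node; else tail.next = node;
        tail = node;
        index.put(key, node); // hash index points straight at the list node
    }

    V get(K key) {
        Node<K, V> n = index.get(key);
        return n == null ? null : n.value;
    }
}
```

The "fatter nodes" optimisation mentioned above would pack several hashes per node so a lookup probes a whole cache line at once.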
[jira] [Commented] (CASSANDRA-7695) Inserting the same row in parallel causes bad data to be returned to the client
[ https://issues.apache.org/jira/browse/CASSANDRA-7695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14089282#comment-14089282 ] Benedict commented on CASSANDRA-7695: - LGTM. Would prefer we call the test NativeTransportBufferRecycleTest, and comment it to explain. Also remove the LOCAL_QUORUM, since it's meaningless here and we don't want to confuse future readers. It's also not clear why we're bothering to 'dump keys', but this is a test so I'm not going to vet it too hard. Inserting the same row in parallel causes bad data to be returned to the client --- Key: CASSANDRA-7695 URL: https://issues.apache.org/jira/browse/CASSANDRA-7695 Project: Cassandra Issue Type: Bug Environment: Linux 3.12.21, JVM 1.7u60 Cassandra server 2.1.0 RC 5 Cassandra datastax client version 2.1.0RC1 Reporter: Johan Bjork Assignee: T Jake Luciani Priority: Blocker Fix For: 2.1.0 Attachments: 7695-workaround.txt, PutFailureRepro.java, bad-data-tid43-get, bad-data-tid43-put Running the attached test program against a cassandra 2.1 server results in scrambled data returned by the SELECT statement. Running it against latest stable works fine. Attached: * Program that reproduces the failure * Example output files from mentioned test-program with the scrambled output. Failure mode: The value returned by 'get' is scrambled, the size is correct but some bytes have shifted locations in the returned buffer. Cluster info: For the test we set up a single cassandra node using the stock configuration file. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-7628) Tools java driver needs to be updated
[ https://issues.apache.org/jira/browse/CASSANDRA-7628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14089959#comment-14089959 ] Benedict commented on CASSANDRA-7628: - FTR, 2.1.0-rc1 works fine dropped in as well - only thing that doesn't compile is CqlPagingRecordReader, which needs a couple of delegate methods auto-generating. Tools java driver needs to be updated - Key: CASSANDRA-7628 URL: https://issues.apache.org/jira/browse/CASSANDRA-7628 Project: Cassandra Issue Type: Bug Components: Tools Reporter: Brandon Williams Assignee: Benedict Priority: Minor Fix For: 2.1.1 When you run stress currently you get a bunch of harmless stacktraces like:
{noformat}
ERROR 21:11:51 Error parsing schema options for table system_traces.sessions: Cluster.getMetadata().getKeyspace(system_traces).getTable(sessions).getOptions() will return null
java.lang.IllegalArgumentException: populate_io_cache_on_flush is not a column defined in this metadata
at com.datastax.driver.core.ColumnDefinitions.getAllIdx(ColumnDefinitions.java:273) ~[cassandra-driver-core-2.0.1.jar:na]
at com.datastax.driver.core.ColumnDefinitions.getFirstIdx(ColumnDefinitions.java:279) ~[cassandra-driver-core-2.0.1.jar:na]
at com.datastax.driver.core.ArrayBackedRow.isNull(ArrayBackedRow.java:56) ~[cassandra-driver-core-2.0.1.jar:na]
at com.datastax.driver.core.TableMetadata$Options.init(TableMetadata.java:529) ~[cassandra-driver-core-2.0.1.jar:na]
at com.datastax.driver.core.TableMetadata.build(TableMetadata.java:119) ~[cassandra-driver-core-2.0.1.jar:na]
at com.datastax.driver.core.Metadata.buildTableMetadata(Metadata.java:131) [cassandra-driver-core-2.0.1.jar:na]
at com.datastax.driver.core.Metadata.rebuildSchema(Metadata.java:92) [cassandra-driver-core-2.0.1.jar:na]
at com.datastax.driver.core.ControlConnection.refreshSchema(ControlConnection.java:293) [cassandra-driver-core-2.0.1.jar:na]
at com.datastax.driver.core.ControlConnection.tryConnect(ControlConnection.java:230) [cassandra-driver-core-2.0.1.jar:na]
at com.datastax.driver.core.ControlConnection.reconnectInternal(ControlConnection.java:170) [cassandra-driver-core-2.0.1.jar:na]
at com.datastax.driver.core.ControlConnection.connect(ControlConnection.java:78) [cassandra-driver-core-2.0.1.jar:na]
at com.datastax.driver.core.Cluster$Manager.init(Cluster.java:1029) [cassandra-driver-core-2.0.1.jar:na]
at com.datastax.driver.core.Cluster.getMetadata(Cluster.java:270) [cassandra-driver-core-2.0.1.jar:na]
at org.apache.cassandra.stress.util.JavaDriverClient.connect(JavaDriverClient.java:90) [stress/:na]
at org.apache.cassandra.stress.settings.StressSettings.getJavaDriverClient(StressSettings.java:177) [stress/:na]
at org.apache.cassandra.stress.settings.StressSettings.getJavaDriverClient(StressSettings.java:159) [stress/:na]
at org.apache.cassandra.stress.StressAction$Consumer.run(StressAction.java:264) [stress/:na]
{noformat}
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-7628) Tools java driver needs to be updated
[ https://issues.apache.org/jira/browse/CASSANDRA-7628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14090426#comment-14090426 ] Benedict commented on CASSANDRA-7628: - Since 2.1.0 is still rc, I'd say that's a good idea -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-7628) Tools java driver needs to be updated
[ https://issues.apache.org/jira/browse/CASSANDRA-7628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14090557#comment-14090557 ] Benedict commented on CASSANDRA-7628: - Looks like you didn't update the tools/lib directory -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-7628) Tools java driver needs to be updated
[ https://issues.apache.org/jira/browse/CASSANDRA-7628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14090569#comment-14090569 ] Benedict commented on CASSANDRA-7628: - I think we had to upgrade it independently at some point...? Can't remember, I think there was discussion about it a while back. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-7447) New sstable format with support for columnar layout
[ https://issues.apache.org/jira/browse/CASSANDRA-7447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14090655#comment-14090655 ] Benedict commented on CASSANDRA-7447: - bq. CASSANDRA-7443 is good to clean up the code base and should be done first This, and further related patches, are a necessary prerequisite to this ticket, yes. bq. Using 32bit instead of 64bit pointers could also save some space. I would prefer not to go down this route just yet, as it is error prone to be optimising this in the first version. Any optimisations that can be made universally (i.e. guaranteed to be safe for all file sizes) I'm onboard with, but obfuscating code dependent on file size I'm not. Especially as this introduces an extra condition to execute on every single field access, potentially stalling the processor pipeline more readily. bq. Trie + byte ordered types: would this mean to do some special serialization e.g. for timeuuid to make them binary comparable? Yes bq. If one partition only contains one row, plain row-oriented storage seems to be more efficient. Is this what small partition layout is meant for? No, it is because it requires fewer disk accesses to have it all packed into the same block (or we can have smaller blocks, increasing IOPS esp. on SSDs). In fact it is quite reasonable to assume that even with single row partitions the column oriented storage will be more efficient, as the columns do not care about partitions; they extend across all partitions, and so the serialization costs are reduced even if there are no clustering columns. I should note that the presentation at ngcc is only for historical reference and to get familiar with the general discussion. 
As mentioned in the description of this ticket, I now favour a row-oriented approach backed by the new index structures for many of the non-optimal column-oriented use cases, which *may* reduce the necessity of a compact column-oriented form, although it would still be useful as just described. bq. Column names (CQL): I'd prefer to extend the table definition schema with an integer column id and use that. Could save lots of String.hashCode/equals() calls - even if the column-id is also used in native protocol. (Think this was discussed elsewhere) There is a separate ticket for this, and I consider it to be an orthogonal effort. We can more easily deliver it here than we can cross-cluster (personally I favour cross-cluster names to be supported by a general enum type (CASSANDRA-6917)) bq. Bikeshed: Is the term sstable still correct? The original sstable was only imposing a sort-order on the partition keys. This will still be imposed, so yes, but I don't have any strong attachment to it. bq. I didn't catch the point why only maps and sets don't naturally fit into columnar format but lists, strings and blobs do. Or is it just because of their mean serialized size? They don't logically fit because they are an extra dimension, much as static columns are one _fewer_ dimension. Columnar layouts really need fixed dimensionality. You can flatten maps, sets and lists (my list was not exhaustive), but this incurs significant cost and complexity on reading these across multiple sstables, as opposed to relying on the standard machinery. Strings and blobs can more trivially be split out into an extra file if they are too large (for simplicity of first delivery we can just append all values larger than some limit to a file, and replace them with their location in the file), but storing large strings in a columnar layout is generally not sensible/beneficial anyway. 
In all likelihood I think the best approach may be to permit collections and statics on column oriented tables by splitting them into a separate row-oriented sstable, at least in the near-term. The heap-blocks outlined in the ngcc talk could be delivered later, although I might be inclined to tell users that column oriented storage is not for them if they want to store these things in the table. New sstable format with support for columnar layout --- Key: CASSANDRA-7447 URL: https://issues.apache.org/jira/browse/CASSANDRA-7447 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Benedict Assignee: Benedict Labels: performance, storage Fix For: 3.0 Attachments: ngcc-storage.odp h2. Storage Format Proposal C* has come a long way over the past few years, and unfortunately our storage format hasn't kept pace with the data models we are now encouraging people to utilise. This ticket proposes a collection of storage primitives that can be combined to serve these data models more optimally. It would probably help to first state the data model at the most abstract
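The "byte-order comparable" requirement discussed above (special serialization so types like timeuuid become binary comparable) can be illustrated with a hypothetical encoding of a signed long: flipping the sign bit of the big-endian representation makes unsigned lexicographic byte comparison agree with numeric order, which is the property a trie over raw key bytes needs. This is a sketch of the general idea, not Cassandra's actual encoding.

```java
// Hypothetical illustration of a byte-order comparable encoding.
final class ByteOrdered {
    // Flip the sign bit and emit big-endian bytes, so that comparing the
    // encodings as unsigned byte strings matches numeric order.
    static byte[] encodeLong(long v) {
        long flipped = v ^ Long.MIN_VALUE; // sign-bit flip
        byte[] out = new byte[8];
        for (int i = 0; i < 8; i++)
            out[i] = (byte) (flipped >>> (56 - 8 * i));
        return out;
    }

    // Unsigned lexicographic byte comparison, as a trie (or memcmp) would do.
    static int compare(byte[] a, byte[] b) {
        for (int i = 0; i < Math.min(a.length, b.length); i++) {
            int c = Integer.compare(a[i] & 0xFF, b[i] & 0xFF);
            if (c != 0) return c;
        }
        return Integer.compare(a.length, b.length);
    }
}
```

With encodings like this, merge and range operations over the trie never need to deserialize values to compare them.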
[jira] [Commented] (CASSANDRA-7704) FileNotFoundException during STREAM-OUT triggers 100% CPU usage
[ https://issues.apache.org/jira/browse/CASSANDRA-7704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14091053#comment-14091053 ] Benedict commented on CASSANDRA-7704: - My mistake. I thought on IRC you said there were errors preceding it that might be related. Not necessary at all, just thought they might be explicable. FileNotFoundException during STREAM-OUT triggers 100% CPU usage --- Key: CASSANDRA-7704 URL: https://issues.apache.org/jira/browse/CASSANDRA-7704 Project: Cassandra Issue Type: Bug Reporter: Rick Branson Assignee: Benedict Attachments: 7704.txt, backtrace.txt, other-errors.txt See attached backtrace which was what triggered this. This stream failed and then ~12 seconds later it emitted that exception. At that point, all CPUs went to 100%. A thread dump shows all the ReadStage threads stuck inside IntervalTree.searchInternal inside of CFS.markReferenced(). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-7447) New sstable format with support for columnar layout
[ https://issues.apache.org/jira/browse/CASSANDRA-7447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14091084#comment-14091084 ] Benedict commented on CASSANDRA-7447: - bq. Is there any reason why you want to put the row index block next to the data? If we are going out of cache, we may as well read the index + data, rather than the index _then_ data. With HDDs this should avoid any penalty. Bear in mind the index also _is_ the data in this brave new world. My goal with the new format is to, as far as possible, guarantee as many or fewer seeks than the old format (even if SSDs are becoming more prevalent), whilst reducing the total amount of space necessary (so reduce requisite disk bandwidth and improve cache occupancy). bq. Is there any reason why you want to put the row index block next to the data? This actually makes it tricky to make sstables pluggable since right now we would put this index in the index.db file. It could be in both places I suppose since it would help with recovery to have multiple copies. Why does this make pluggability hard? The index is an artefact of the sstable type (or it should be, before we roll this out), so it shouldn't matter? bq. Also if you plan on putting the index at the front of the row you would need to do some kind of two pass to write the partition. Maybe. I'd prefer not to get down to this level of specifics just yet, I'm pretty sure it's solvable either way. It would be preferable to focus mostly on the overall design, featureset, etc. for the moment. The format is likely to be agnostic to where the two records live with respect to each other, but there are some optimisations possible on read if they're adjacent, assuming the records are all smaller than a page. If they are much larger than that, no optimisation is likely to help so it doesn't matter too much, and if they are smaller we only have to buffer two pages. 
[jira] [Commented] (CASSANDRA-7447) New sstable format with support for columnar layout
[ https://issues.apache.org/jira/browse/CASSANDRA-7447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14091220#comment-14091220 ] Benedict commented on CASSANDRA-7447: - bq. you already have all the code in place to put the index next to the partition location I'm not sure I follow, but one of the goals is to permit faster _in memory_ (cached) performance, which means being more targeted with where we hit data inside our pages so that we can cache with finer granularity (and so have a higher cache hit rate), so we don't want to scan entire pages if we can avoid it. We have one index right now, and one data file, within both of which we persist clustering keys. This new scheme has one partition index, one data file, and one hybrid dataset, which can live by itself or in the datafile, but behaves as both a clustering index and the data itself. So when we're talking about an index things can get confusing. However we want to be able to support (and improve upon) the current ability to seek directly within partitions, and we want to be able to do so efficiently, without extra disk seeks, so we ideally want these clustering key records to be cached independently of the rest of the data since they will/may be referenced more frequently. New sstable format with support for columnar layout --- Key: CASSANDRA-7447 URL: https://issues.apache.org/jira/browse/CASSANDRA-7447 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Benedict Assignee: Benedict Labels: performance, storage Fix For: 3.0 Attachments: ngcc-storage.odp h2. Storage Format Proposal C* has come a long way over the past few years, and unfortunately our storage format hasn't kept pace with the data models we are now encouraging people to utilise. This ticket proposes a collection of storage primitives that can be combined to serve these data models more optimally. It would probably help to first state the data model at the most abstract level. 
We have a fixed three-tier structure: We have the partition key, the clustering columns, and the data columns. Each have their own characteristics and so require their own specialised treatment. I should note that these changes will necessarily be delivered in stages, and that we will be making some assumptions about what the most useful features to support initially will be. Any features not supported will require sticking with the old format until we extend support to all C* functionality. h3. Partition Key * This really has two components: the partition, and the value. Although the partition is primarily used to distribute across nodes, it can also be used to optimise lookups for a given key within a node * Generally partitioning is by hash, and for the moment I want to focus this ticket on the assumption that this is the case * Given this, it makes sense to optimise our storage format to permit O(1) searching of a given partition. It may be possible to achieve this with little overhead based on the fact we store the hashes in order and know they are approximately randomly distributed, as this effectively forms an immutable contiguous split-ordered list (see Shalev/Shavit, or CASSANDRA-7282), so we only need to store an amount of data based on how imperfectly distributed the hashes are, or at worst a single value per block. * This should completely obviate the need for a separate key-cache, which will be relegated to supporting the old storage format only h3. 
Primary Key / Clustering Columns * Given we have a hierarchical data model, I propose the use of a cache-oblivious trie * The main advantage of the trie is that it is extremely compact and _supports optimally efficient merges with other tries_ so that we can support more efficient reads when multiple sstables are touched * The trie will be preceded by a small amount of related data; the full partition key, a timestamp epoch (for offset-encoding timestamps) and any other partition level optimisation data, such as (potentially) a min/max timestamp to abort merges earlier * Initially I propose to limit the trie to byte-order comparable data types only (the number of which we can expand through translations of the important types that are not currently) * Crucially the trie will also encapsulate any range tombstones, so that these are merged early in the process and avoids re-iterating the same data * Results in true bidirectional streaming without having to read entire range into memory h3. Values There are generally two approaches to storing rows of data: columnar, or row-oriented. The above two data structures can be combined with a value storage scheme that is based on either.
[jira] [Created] (CASSANDRA-7735) Remove ref-counting of netty buffers
Benedict created CASSANDRA-7735: --- Summary: Remove ref-counting of netty buffers Key: CASSANDRA-7735 URL: https://issues.apache.org/jira/browse/CASSANDRA-7735 Project: Cassandra Issue Type: Bug Components: Core Reporter: Benedict Assignee: T Jake Luciani Priority: Critical Fix For: 2.1 rc5 This has turned out to be more bug-prone than we'd hoped, and it no longer seems a justified risk, since the performance gains were generally quite modest. When there's some time we can reengineer the API to make it easier to produce obviously correct usage, but in the meantime I propose rolling back this change before general availability of 2.1 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (CASSANDRA-7736) Clean-up, justify (and reduce) each use of @Inline
Benedict created CASSANDRA-7736: --- Summary: Clean-up, justify (and reduce) each use of @Inline Key: CASSANDRA-7736 URL: https://issues.apache.org/jira/browse/CASSANDRA-7736 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Benedict Assignee: T Jake Luciani Priority: Minor Fix For: 2.1.0 \@Inline is a delicate tool, and in every case where we've used it (and where we use it in future) it should be accompanied by a comment justifying its use in the given context both theoretically and, preferably, with some brief description of/link to steps taken to demonstrate its benefit. We should aim not to use it unless we are very confident we can do better than the normal behaviour, as poor use pollutes the instruction cache: it can yield better results in tight benchmarks, but worse results in general use. It looks to me that we have too many uses already. I'll look over each one as well, and we can compare notes. If there's disagreement on any use, we can discuss, and if there is still any dissent we should always err in favour of *not* using \@Inline. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-7736) Clean-up, justify (and reduce) each use of @Inline
[ https://issues.apache.org/jira/browse/CASSANDRA-7736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14092484#comment-14092484 ] Benedict commented on CASSANDRA-7736: - Thanks. In general inlining is unlikely to ever make a material difference if it impacts only a handful of calls for each database operation. We should restrict its use to methods invoked disproportionately often and, especially, in tight loops, where we know the instruction cache pollution will pay off (i.e. where the heuristics fall down). Clean-up, justify (and reduce) each use of @Inline -- Key: CASSANDRA-7736 URL: https://issues.apache.org/jira/browse/CASSANDRA-7736 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Benedict Assignee: T Jake Luciani Priority: Minor Fix For: 2.1.0 \@Inline is a delicate tool: every use of it, existing and future, should be accompanied by a comment justifying it in the given context both theoretically and, preferably, with a brief description of (or link to) the steps taken to demonstrate its benefit. We should avoid using it unless we are very confident we can do better than the normal inlining behaviour, as poor use pollutes the instruction cache and can yield better results in tight benchmarks but worse results in general use. It looks to me that we have too many uses already. I'll look over each one as well, and we can compare notes. If there's disagreement on any use we can discuss it, and if there is still any dissent we should err in favour of *not* using \@Inline. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (CASSANDRA-7738) Permit CL overuse to be explicitly bounded
Benedict created CASSANDRA-7738: --- Summary: Permit CL overuse to be explicitly bounded Key: CASSANDRA-7738 URL: https://issues.apache.org/jira/browse/CASSANDRA-7738 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Benedict Assignee: Benedict Priority: Minor Fix For: 2.1.1 As mentioned in CASSANDRA-7554, we do not currently offer any way to explicitly bound CL growth, which can be problematic in some scenarios (e.g. EC2 where the system drive is quite small). We should offer a configurable amount of headroom, beyond which we stop accepting writes until the backlog clears. -- This message was sent by Atlassian JIRA (v6.2#6252)
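The proposed bound could look something like the following sketch, with entirely hypothetical names (the real patch would hook into Cassandra's commit log allocator, not a standalone class): track outstanding CL bytes and refuse new writes once the configured headroom is exhausted, accepting them again as segments are recycled.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of explicitly bounding commit log overuse: writes
// are refused once usage exceeds the configured target plus headroom,
// and accepted again as the backlog clears. Names are illustrative.
class BoundedCommitLog
{
    private final long targetBytes;   // e.g. the commitlog total-space target
    private final long headroomBytes; // configurable overuse allowance
    private final AtomicLong used = new AtomicLong();

    BoundedCommitLog(long targetBytes, long headroomBytes)
    {
        this.targetBytes = targetBytes;
        this.headroomBytes = headroomBytes;
    }

    // Returns false when the write must be refused until the backlog clears.
    boolean tryAppend(long mutationSize)
    {
        long newUsed = used.addAndGet(mutationSize);
        if (newUsed > targetBytes + headroomBytes)
        {
            used.addAndGet(-mutationSize); // roll back; caller applies backpressure
            return false;
        }
        return true;
    }

    // Called when a segment is flushed and recycled.
    void segmentRecycled(long bytes) { used.addAndGet(-bytes); }
}
```

The key design point is that the bound is on total disk footprint, not rate: once segments flush and recycle, tryAppend() starts succeeding again without intervention.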
[jira] [Commented] (CASSANDRA-7743) Possible C* OOM issue during long running test
[ https://issues.apache.org/jira/browse/CASSANDRA-7743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14093206#comment-14093206 ] Benedict commented on CASSANDRA-7743: - Are you running with memtable_allocation_type: offheap_buffers? If so, switch to offheap_objects. If not, it's surprising to be hitting that limit with netty buffers, as we don't allocate them anywhere else. Either way, the fact that this is failing inside netty is surprising, since this is prior to the fix for CASSANDRA-7695, so we shouldn't in principle be allocating direct buffers with netty. Possible C* OOM issue during long running test -- Key: CASSANDRA-7743 URL: https://issues.apache.org/jira/browse/CASSANDRA-7743 Project: Cassandra Issue Type: Bug Components: Core Environment: Google Compute Engine, n1-standard-1 Reporter: Pierre Laporte During a long running test, we ended up with a lot of java.lang.OutOfMemoryError: Direct buffer memory errors on the Cassandra instances.
Here is an example of stacktrace from system.log : {code} ERROR [SharedPool-Worker-1] 2014-08-11 11:09:34,610 ErrorMessage.java:218 - Unexpected exception during request java.lang.OutOfMemoryError: Direct buffer memory at java.nio.Bits.reserveMemory(Bits.java:658) ~[na:1.7.0_25] at java.nio.DirectByteBuffer.init(DirectByteBuffer.java:123) ~[na:1.7.0_25] at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:306) ~[na:1.7.0_25] at io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:434) ~[netty-all-4.0.20.Final.jar:4.0.20.Final] at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:179) ~[netty-all-4.0.20.Final.jar:4.0.20.Final] at io.netty.buffer.PoolArena.allocate(PoolArena.java:168) ~[netty-all-4.0.20.Final.jar:4.0.20.Final] at io.netty.buffer.PoolArena.allocate(PoolArena.java:98) ~[netty-all-4.0.20.Final.jar:4.0.20.Final] at io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:251) ~[netty-all-4.0.20.Final.jar:4.0.20.Final] at io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:155) ~[netty-all-4.0.20.Final.jar:4.0.20.Final] at io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:146) ~[netty-all-4.0.20.Final.jar:4.0.20.Final] at io.netty.buffer.AbstractByteBufAllocator.ioBuffer(AbstractByteBufAllocator.java:107) ~[netty-all-4.0.20.Final.jar:4.0.20.Final] at io.netty.channel.AdaptiveRecvByteBufAllocator$HandleImpl.allocate(AdaptiveRecvByteBufAllocator.java:104) ~[netty-all-4.0.20.Final.jar:4.0.20.Final] at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:112) ~[netty-all-4.0.20.Final.jar:4.0.20.Final] at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:507) ~[netty-all-4.0.20.Final.jar:4.0.20.Final] at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:464) ~[netty-all-4.0.20.Final.jar:4.0.20.Final] at 
io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:378) ~[netty-all-4.0.20.Final.jar:4.0.20.Final] at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:350) ~[netty-all-4.0.20.Final.jar:4.0.20.Final] at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116) ~[netty-all-4.0.20.Final.jar:4.0.20.Final] at java.lang.Thread.run(Thread.java:724) ~[na:1.7.0_25] {code} The test consisted of a 3-node cluster of n1-standard-1 GCE instances (1 vCPU, 3.75 GB RAM) running cassandra-2.1.0-rc5, and a n1-standard-2 instance running the test. After ~2.5 days, several requests start to fail and we see the previous stacktraces in the system.log file. The output from linux ‘free’ and ‘meminfo’ suggests that there is still memory available. {code} $ free -m total used free shared buffers cached Mem: 3702 3532 169 0 161 854 -/+ buffers/cache: 2516 1185 Swap: 0 0 0 $ head -n 4 /proc/meminfo MemTotal: 3791292 kB MemFree: 173568 kB Buffers: 165608 kB Cached: 874752 kB {code} These errors do not affect all the queries we run. The cluster is still responsive but is unable to display tracing information using cqlsh: {code} $ ./bin/nodetool --host 10.240.137.253 status duration_test Datacenter: DC1 === Status=Up/Down |/ State=Normal/Leaving/Joining/Moving -- Address Load Tokens Owns (effective) Host ID Rack UN 10.240.98.27
[jira] [Resolved] (CASSANDRA-7732) Counter replication mutation can have corrupt cell name values (via pooled Netty buffers)
[ https://issues.apache.org/jira/browse/CASSANDRA-7732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benedict resolved CASSANDRA-7732. - Resolution: Not a Problem Closing as we're removing ref-counting for 2.1 until we can come up with a safer strategy. Counter replication mutation can have corrupt cell name values (via pooled Netty buffers) - Key: CASSANDRA-7732 URL: https://issues.apache.org/jira/browse/CASSANDRA-7732 Project: Cassandra Issue Type: Bug Reporter: Andrew Montalenti Assignee: Aleksey Yeschenko Priority: Critical Fix For: 2.1.0 Attachments: 7732.txt Counter replication mutation can have corrupt cell name values (via pooled Netty buffers), because the replication mutation created by CM.apply() doesn't copy the cell names/partition key from the original mutation AND ultimately doesn't share the original's source frame, so ref-counting doesn't work there. -- This message was sent by Atlassian JIRA (v6.2#6252)
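The underlying hazard — retaining a view of a pooled buffer instead of copying it — can be reproduced with plain NIO buffers (a hedged illustration only; the actual bug involved netty's pooled ByteBufs inside counter replication, not this code):

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Shows why a mutation built from views of a pooled buffer is unsafe:
// once the pool reuses the backing memory, the view silently changes,
// while an explicit copy stays intact. Plain NIO stands in for ByteBuf.
class AliasedCellName
{
    static ByteBuffer copyOf(ByteBuffer src)
    {
        ByteBuffer copy = ByteBuffer.allocate(src.remaining());
        copy.put(src.duplicate()).flip();
        return copy;
    }

    public static void main(String[] args)
    {
        byte[] pooled = "cellname".getBytes(StandardCharsets.US_ASCII);
        ByteBuffer frame = ByteBuffer.wrap(pooled);

        ByteBuffer view = frame.duplicate(); // shares the backing array
        ByteBuffer copy = copyOf(frame);     // independent storage

        Arrays.fill(pooled, (byte) 'X');     // the pool "recycles" the frame

        System.out.println(StandardCharsets.US_ASCII.decode(view)); // corrupted by reuse
        System.out.println(StandardCharsets.US_ASCII.decode(copy)); // still intact
    }
}
```

Copying on construction (or guaranteeing the view never outlives the frame's refcount) is exactly the invariant the replication mutation failed to uphold.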
[jira] [Commented] (CASSANDRA-7735) Remove ref-counting of netty buffers
[ https://issues.apache.org/jira/browse/CASSANDRA-7735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14093323#comment-14093323 ] Benedict commented on CASSANDRA-7735: - Yes. Linked/closed CASSANDRA-7732. Remove ref-counting of netty buffers Key: CASSANDRA-7735 URL: https://issues.apache.org/jira/browse/CASSANDRA-7735 Project: Cassandra Issue Type: Bug Components: Core Reporter: Benedict Assignee: T Jake Luciani Priority: Critical Labels: correctness, performance Fix For: 2.1.0 Attachments: 7735.txt This has turned out to be more bug-prone than we'd hoped, and it no longer seems a justified risk, since the performance gains were generally quite modest. When there's some time we can reengineer the API to make it easier to produce obviously correct usage, but in the meantime I propose rolling back this change before general availability of 2.1. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-7732) Counter replication mutation can have corrupt cell name values (via pooled Netty buffers)
[ https://issues.apache.org/jira/browse/CASSANDRA-7732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14093348#comment-14093348 ] Benedict commented on CASSANDRA-7732: - Yes, see the linked (superseded-by) issue, CASSANDRA-7735 Counter replication mutation can have corrupt cell name values (via pooled Netty buffers) - Key: CASSANDRA-7732 URL: https://issues.apache.org/jira/browse/CASSANDRA-7732 Project: Cassandra Issue Type: Bug Reporter: Andrew Montalenti Assignee: Aleksey Yeschenko Priority: Critical Fix For: 2.1.0 Attachments: 7732.txt Counter replication mutation can have corrupt cell name values (via pooled Netty buffers), because the replication mutation created by CM.apply() doesn't copy the cell names/partition key from the original mutation AND ultimately doesn't share the original's source frame, so ref-counting doesn't work there. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-7704) FileNotFoundException during STREAM-OUT triggers 100% CPU usage
[ https://issues.apache.org/jira/browse/CASSANDRA-7704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14093450#comment-14093450 ] Benedict commented on CASSANDRA-7704: - [~yukim] was that comment a +1 on the 2.0 patch, and asking for a corresponding 2.1 patch? FileNotFoundException during STREAM-OUT triggers 100% CPU usage --- Key: CASSANDRA-7704 URL: https://issues.apache.org/jira/browse/CASSANDRA-7704 Project: Cassandra Issue Type: Bug Reporter: Rick Branson Assignee: Benedict Attachments: 7704.txt, backtrace.txt, other-errors.txt See attached backtrace which was what triggered this. This stream failed and then ~12 seconds later it emitted that exception. At that point, all CPUs went to 100%. A thread dump shows all the ReadStage threads stuck inside IntervalTree.searchInternal inside of CFS.markReferenced(). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (CASSANDRA-7704) FileNotFoundException during STREAM-OUT triggers 100% CPU usage
[ https://issues.apache.org/jira/browse/CASSANDRA-7704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benedict updated CASSANDRA-7704: Attachment: 7704.20.v2.txt FTR, there was a (probably innocuous) mistake in that patch; fixed version attached. FileNotFoundException during STREAM-OUT triggers 100% CPU usage --- Key: CASSANDRA-7704 URL: https://issues.apache.org/jira/browse/CASSANDRA-7704 Project: Cassandra Issue Type: Bug Reporter: Rick Branson Assignee: Benedict Attachments: 7704.20.v2.txt, 7704.txt, backtrace.txt, other-errors.txt See attached backtrace which was what triggered this. This stream failed and then ~12 seconds later it emitted that exception. At that point, all CPUs went to 100%. A thread dump shows all the ReadStage threads stuck inside IntervalTree.searchInternal inside of CFS.markReferenced(). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-7735) Remove ref-counting of netty buffers
[ https://issues.apache.org/jira/browse/CASSANDRA-7735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14093476#comment-14093476 ] Benedict commented on CASSANDRA-7735: - LGTM. nit: BatchStatement, ModificationStatement, Mutation, QueryState each have an unused Frame import, and ResponseVerbHandler has an unused IMutation import Remove ref-counting of netty buffers Key: CASSANDRA-7735 URL: https://issues.apache.org/jira/browse/CASSANDRA-7735 Project: Cassandra Issue Type: Bug Components: Core Reporter: Benedict Assignee: T Jake Luciani Priority: Critical Labels: correctness, performance Fix For: 2.1.0 Attachments: 7735.txt This has turned out to be more bug-prone than we'd hoped, and it no longer seems a justified risk, since the performance gains were generally quite modest. When there's some time we can reengineer the API to make it easier to produce obviously correct usage, but in the meantime I propose rolling back this change before general availability of 2.1. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-7728) ConcurrentModificationException after upgrade to trunk
[ https://issues.apache.org/jira/browse/CASSANDRA-7728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14093499#comment-14093499 ] Benedict commented on CASSANDRA-7728: - Looks related to CASSANDRA-7116; it looks like 2.0 may also be affected ConcurrentModificationException after upgrade to trunk -- Key: CASSANDRA-7728 URL: https://issues.apache.org/jira/browse/CASSANDRA-7728 Project: Cassandra Issue Type: Bug Reporter: Russ Hatch Trying to repro another issue, I ran across this exception. It occurred during a rolling upgrade to trunk. It happened during or right after the test script checked that the counters were correct. {noformat} ERROR [Thrift:2] 2014-08-11 13:47:09,668 CustomTThreadPoolServer.java:219 - Error occurred during processing of message. java.util.ConcurrentModificationException: null at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:859) ~[na:1.7.0_65] at java.util.ArrayList$Itr.next(ArrayList.java:831) ~[na:1.7.0_65] at org.apache.cassandra.service.RowDigestResolver.getData(RowDigestResolver.java:40) ~[main/:na] at org.apache.cassandra.service.RowDigestResolver.getData(RowDigestResolver.java:28) ~[main/:na] at org.apache.cassandra.service.ReadCallback.get(ReadCallback.java:110) ~[main/:na] at org.apache.cassandra.service.AbstractReadExecutor.get(AbstractReadExecutor.java:144) ~[main/:na] at org.apache.cassandra.service.StorageProxy.fetchRows(StorageProxy.java:1262) ~[main/:na] at org.apache.cassandra.service.StorageProxy.read(StorageProxy.java:1188) ~[main/:na] at org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:256) ~[main/:na] at org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:212) ~[main/:na] at org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:61) ~[main/:na] at org.apache.cassandra.cql3.QueryProcessor.processStatement(QueryProcessor.java:186) ~[main/:na] at 
org.apache.cassandra.cql3.QueryProcessor.process(QueryProcessor.java:205) ~[main/:na] at org.apache.cassandra.thrift.CassandraServer.execute_cql3_query(CassandraServer.java:1916) ~[main/:na] at org.apache.cassandra.thrift.Cassandra$Processor$execute_cql3_query.getResult(Cassandra.java:4588) ~[thrift/:na] at org.apache.cassandra.thrift.Cassandra$Processor$execute_cql3_query.getResult(Cassandra.java:4572) ~[thrift/:na] at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39) ~[libthrift-0.9.1.jar:0.9.1] at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39) ~[libthrift-0.9.1.jar:0.9.1] at org.apache.cassandra.thrift.CustomTThreadPoolServer$WorkerProcess.run(CustomTThreadPoolServer.java:201) ~[main/:na] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_65] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_65] at java.lang.Thread.run(Thread.java:745) [na:1.7.0_65] {noformat} It's not happening 100% of the time, but may be triggered by running this dtest: {noformat} nosetests -vs upgrade_through_versions_test.py:TestUpgradeThroughVersions.upgrade_test_mixed {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
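The failure mode in the trace above — an ArrayList iterated while another code path appends to it, as in RowDigestResolver.getData() when responses keep arriving — is the classic fail-fast case, and is reproducible even single-threaded. A minimal sketch (not the Cassandra code; the snapshot-then-iterate fix pattern is one common remedy):

```java
import java.util.ArrayList;
import java.util.ConcurrentModificationException;
import java.util.List;

// Reproduces the fail-fast ConcurrentModificationException: structurally
// modifying an ArrayList while an iterator over it is live. The safe
// variant snapshots the list first and iterates the copy.
class CmeDemo
{
    // Returns true if the fail-fast iterator detected the modification.
    static boolean iterateWhileAdding(List<Integer> live)
    {
        try
        {
            for (Integer i : live)
                live.add(i + 1); // structural change mid-iteration
            return false;
        }
        catch (ConcurrentModificationException e)
        {
            return true;
        }
    }

    // Iterates a snapshot, so mutating the live list is safe; returns
    // the number of elements visited.
    static int iterateSnapshot(List<Integer> live)
    {
        int seen = 0;
        for (Integer i : new ArrayList<>(live)) // copy, then iterate
        {
            live.add(i + 1);
            seen++;
        }
        return seen;
    }
}
```

In the real code the modification comes from a response-handler thread rather than the loop body, but the iterator's modCount check trips identically either way.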
[jira] [Commented] (CASSANDRA-7743) Possible C* OOM issue during long running test
[ https://issues.apache.org/jira/browse/CASSANDRA-7743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14093976#comment-14093976 ] Benedict commented on CASSANDRA-7743: - Could we get some heap dumps? Sounds to me like it's possibly a netty bug, or a ref counting bug coupled with a leaked/held reference somewhere. We need to see where these ByteBuffer references are being retained and why. Possible C* OOM issue during long running test -- Key: CASSANDRA-7743 URL: https://issues.apache.org/jira/browse/CASSANDRA-7743 Project: Cassandra Issue Type: Bug Components: Core Environment: Google Compute Engine, n1-standard-1 Reporter: Pierre Laporte During a long running test, we ended up with a lot of java.lang.OutOfMemoryError: Direct buffer memory errors on the Cassandra instances. Here is an example of stacktrace from system.log : {code} ERROR [SharedPool-Worker-1] 2014-08-11 11:09:34,610 ErrorMessage.java:218 - Unexpected exception during request java.lang.OutOfMemoryError: Direct buffer memory at java.nio.Bits.reserveMemory(Bits.java:658) ~[na:1.7.0_25] at java.nio.DirectByteBuffer.init(DirectByteBuffer.java:123) ~[na:1.7.0_25] at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:306) ~[na:1.7.0_25] at io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:434) ~[netty-all-4.0.20.Final.jar:4.0.20.Final] at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:179) ~[netty-all-4.0.20.Final.jar:4.0.20.Final] at io.netty.buffer.PoolArena.allocate(PoolArena.java:168) ~[netty-all-4.0.20.Final.jar:4.0.20.Final] at io.netty.buffer.PoolArena.allocate(PoolArena.java:98) ~[netty-all-4.0.20.Final.jar:4.0.20.Final] at io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:251) ~[netty-all-4.0.20.Final.jar:4.0.20.Final] at io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:155) ~[netty-all-4.0.20.Final.jar:4.0.20.Final] at 
io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:146) ~[netty-all-4.0.20.Final.jar:4.0.20.Final] at io.netty.buffer.AbstractByteBufAllocator.ioBuffer(AbstractByteBufAllocator.java:107) ~[netty-all-4.0.20.Final.jar:4.0.20.Final] at io.netty.channel.AdaptiveRecvByteBufAllocator$HandleImpl.allocate(AdaptiveRecvByteBufAllocator.java:104) ~[netty-all-4.0.20.Final.jar:4.0.20.Final] at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:112) ~[netty-all-4.0.20.Final.jar:4.0.20.Final] at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:507) ~[netty-all-4.0.20.Final.jar:4.0.20.Final] at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:464) ~[netty-all-4.0.20.Final.jar:4.0.20.Final] at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:378) ~[netty-all-4.0.20.Final.jar:4.0.20.Final] at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:350) ~[netty-all-4.0.20.Final.jar:4.0.20.Final] at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116) ~[netty-all-4.0.20.Final.jar:4.0.20.Final] at java.lang.Thread.run(Thread.java:724) ~[na:1.7.0_25] {code} The test consisted of a 3-node cluster of n1-standard-1 GCE instances (1 vCPU, 3.75 GB RAM) running cassandra-2.1.0-rc5, and a n1-standard-2 instance running the test. After ~2.5 days, several requests start to fail and we see the previous stacktraces in the system.log file. The output from linux ‘free’ and ‘meminfo’ suggests that there is still memory available. {code} $ free -m total used free shared buffers cached Mem: 3702 3532 169 0 161 854 -/+ buffers/cache: 2516 1185 Swap: 0 0 0 $ head -n 4 /proc/meminfo MemTotal: 3791292 kB MemFree: 173568 kB Buffers: 165608 kB Cached: 874752 kB {code} These errors do not affect all the queries we run. 
The cluster is still responsive but is unable to display tracing information using cqlsh: {code} $ ./bin/nodetool --host 10.240.137.253 status duration_test Datacenter: DC1 === Status=Up/Down |/ State=Normal/Leaving/Joining/Moving -- Address Load Tokens Owns (effective) Host ID Rack UN 10.240.98.27 925.17 KB 256 100.0% 41314169-eff5-465f-85ea-d501fd8f9c5e RAC1 UN 10.240.137.253 1.1 MB 256 100.0% c706f5f9-c5f3-4d5e-95e9-a8903823827e RAC1
[jira] [Commented] (CASSANDRA-7750) Do not flush on truncate if durable_writes is false.
[ https://issues.apache.org/jira/browse/CASSANDRA-7750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14094071#comment-14094071 ] Benedict commented on CASSANDRA-7750: - I'd rather we did not reintroduce the 'renew memtable' method, as it is inherently dangerous. If we are to do so, it should have clear danger warnings around it, OR it should explicitly clear the CL of any records it contains. Do not flush on truncate if durable_writes is false. -- Key: CASSANDRA-7750 URL: https://issues.apache.org/jira/browse/CASSANDRA-7750 Project: Cassandra Issue Type: Bug Components: Core Reporter: Jeremiah Jordan Assignee: Jeremiah Jordan Priority: Minor Fix For: 2.0.10, 2.1.1 Attachments: 7750-2.0.txt, 7750-2.1.txt CASSANDRA-7511 changed truncate so it will always flush, to fix commit log issues. If durable_writes is false, then there will not be any data in the commit log for the table, so we can safely just drop the memtables and not flush. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-7743) Possible C* OOM issue during long running test
[ https://issues.apache.org/jira/browse/CASSANDRA-7743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14095411#comment-14095411 ] Benedict commented on CASSANDRA-7743: - No, but I don't think it's likely to be related: they would still be collected when unreferenced, so we'd likely see LEAK DETECTOR warnings from netty, at which time the associated resources would also be freed, making it somewhat unlikely we'd see this bug. No harm in trying, of course, but it sounds like it takes a few days to reproduce. Possible C* OOM issue during long running test -- Key: CASSANDRA-7743 URL: https://issues.apache.org/jira/browse/CASSANDRA-7743 Project: Cassandra Issue Type: Bug Components: Core Environment: Google Compute Engine, n1-standard-1 Reporter: Pierre Laporte Fix For: 2.1.0 During a long running test, we ended up with a lot of java.lang.OutOfMemoryError: Direct buffer memory errors on the Cassandra instances. Here is an example of stacktrace from system.log: {code} ERROR [SharedPool-Worker-1] 2014-08-11 11:09:34,610 ErrorMessage.java:218 - Unexpected exception during request java.lang.OutOfMemoryError: Direct buffer memory at java.nio.Bits.reserveMemory(Bits.java:658) ~[na:1.7.0_25] at java.nio.DirectByteBuffer.init(DirectByteBuffer.java:123) ~[na:1.7.0_25] at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:306) ~[na:1.7.0_25] at io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:434) ~[netty-all-4.0.20.Final.jar:4.0.20.Final] at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:179) ~[netty-all-4.0.20.Final.jar:4.0.20.Final] at io.netty.buffer.PoolArena.allocate(PoolArena.java:168) ~[netty-all-4.0.20.Final.jar:4.0.20.Final] at io.netty.buffer.PoolArena.allocate(PoolArena.java:98) ~[netty-all-4.0.20.Final.jar:4.0.20.Final] at io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:251) ~[netty-all-4.0.20.Final.jar:4.0.20.Final] at 
io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:155) ~[netty-all-4.0.20.Final.jar:4.0.20.Final] at io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:146) ~[netty-all-4.0.20.Final.jar:4.0.20.Final] at io.netty.buffer.AbstractByteBufAllocator.ioBuffer(AbstractByteBufAllocator.java:107) ~[netty-all-4.0.20.Final.jar:4.0.20.Final] at io.netty.channel.AdaptiveRecvByteBufAllocator$HandleImpl.allocate(AdaptiveRecvByteBufAllocator.java:104) ~[netty-all-4.0.20.Final.jar:4.0.20.Final] at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:112) ~[netty-all-4.0.20.Final.jar:4.0.20.Final] at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:507) ~[netty-all-4.0.20.Final.jar:4.0.20.Final] at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:464) ~[netty-all-4.0.20.Final.jar:4.0.20.Final] at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:378) ~[netty-all-4.0.20.Final.jar:4.0.20.Final] at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:350) ~[netty-all-4.0.20.Final.jar:4.0.20.Final] at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116) ~[netty-all-4.0.20.Final.jar:4.0.20.Final] at java.lang.Thread.run(Thread.java:724) ~[na:1.7.0_25] {code} The test consisted of a 3-node cluster of n1-standard-1 GCE instances (1 vCPU, 3.75 GB RAM) running cassandra-2.1.0-rc5, and a n1-standard-2 instance running the test. After ~2.5 days, several requests start to fail and we see the previous stacktraces in the system.log file. The output from linux ‘free’ and ‘meminfo’ suggests that there is still memory available. 
{code} $ free -m total used free shared buffers cached Mem: 3702 3532 169 0 161 854 -/+ buffers/cache: 2516 1185 Swap: 0 0 0 $ head -n 4 /proc/meminfo MemTotal: 3791292 kB MemFree: 173568 kB Buffers: 165608 kB Cached: 874752 kB {code} These errors do not affect all the queries we run. The cluster is still responsive but is unable to display tracing information using cqlsh: {code} $ ./bin/nodetool --host 10.240.137.253 status duration_test Datacenter: DC1 === Status=Up/Down |/ State=Normal/Leaving/Joining/Moving -- Address Load Tokens Owns (effective) Host ID Rack UN 10.240.98.27 925.17 KB 256 100.0%
[jira] [Commented] (CASSANDRA-7743) Possible C* OOM issue during long running test
[ https://issues.apache.org/jira/browse/CASSANDRA-7743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14095780#comment-14095780 ] Benedict commented on CASSANDRA-7743: - It looks like the problem is caused by a number of changes in 2.1 composing to yield especially bad behaviour. We use pooled buffers in netty, but we also introduced an SEPWorker pool that has many threads (more than the number that actually service any single pool), and all threads may eventually service work on the netty executor side. This gives us ~130 threads periodically performing this work, and each of them apparently allocates a buffer at some point. These buffers are unfortunately allocated from a thread-local pool, which starts at 16MB, so each thread retains at least 16MB of largely useless memory. The best fix will be to stop the SEPWorker tasks from allocating any buffers, but [~tjake] has pointed out we can also tweak some settings to mitigate the negative impact of this kind of problem. I'll look into a patch tomorrow. Possible C* OOM issue during long running test -- Key: CASSANDRA-7743 URL: https://issues.apache.org/jira/browse/CASSANDRA-7743 Project: Cassandra Issue Type: Bug Components: Core Environment: Google Compute Engine, n1-standard-1 Reporter: Pierre Laporte Fix For: 2.1.0 During a long running test, we ended up with a lot of java.lang.OutOfMemoryError: Direct buffer memory errors on the Cassandra instances. 
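The arithmetic behind that diagnosis is easy to check. Netty's pooled allocator sizes a chunk as pageSize << maxOrder, which with the 4.0.x defaults (8192 and 11) is 16MiB, and each thread that touches the allocator grabs at least one chunk thread-locally; a hedged back-of-envelope, assuming each of the ~130 workers allocates once:

```java
// Back-of-envelope for the retained direct memory: netty's pooled
// allocator sizes a chunk as pageSize << maxOrder (defaults 8192 and 11
// in netty 4.0.x), and each allocating thread pins at least one chunk.
class ArenaMath
{
    static long chunkSize(int pageSize, int maxOrder)
    {
        return (long) pageSize << maxOrder;
    }

    public static void main(String[] args)
    {
        long chunk = chunkSize(8192, 11); // 16 MiB per thread-local chunk
        int workers = 130;                // approximate SharedPool worker count
        long retained = chunk * workers;  // lower bound on pinned direct memory
        System.out.printf("chunk=%dMiB retained~=%dMiB%n",
                          chunk >> 20, retained >> 20);
    }
}
```

That lower bound is roughly 2GiB on a 3.75GB instance, which matches the Direct buffer memory OOMs despite ‘free’ showing spare RAM: the limit hit is the JVM's direct memory cap, not physical memory.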
Here is an example of stacktrace from system.log : {code} ERROR [SharedPool-Worker-1] 2014-08-11 11:09:34,610 ErrorMessage.java:218 - Unexpected exception during request java.lang.OutOfMemoryError: Direct buffer memory at java.nio.Bits.reserveMemory(Bits.java:658) ~[na:1.7.0_25] at java.nio.DirectByteBuffer.init(DirectByteBuffer.java:123) ~[na:1.7.0_25] at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:306) ~[na:1.7.0_25] at io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:434) ~[netty-all-4.0.20.Final.jar:4.0.20.Final] at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:179) ~[netty-all-4.0.20.Final.jar:4.0.20.Final] at io.netty.buffer.PoolArena.allocate(PoolArena.java:168) ~[netty-all-4.0.20.Final.jar:4.0.20.Final] at io.netty.buffer.PoolArena.allocate(PoolArena.java:98) ~[netty-all-4.0.20.Final.jar:4.0.20.Final] at io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:251) ~[netty-all-4.0.20.Final.jar:4.0.20.Final] at io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:155) ~[netty-all-4.0.20.Final.jar:4.0.20.Final] at io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:146) ~[netty-all-4.0.20.Final.jar:4.0.20.Final] at io.netty.buffer.AbstractByteBufAllocator.ioBuffer(AbstractByteBufAllocator.java:107) ~[netty-all-4.0.20.Final.jar:4.0.20.Final] at io.netty.channel.AdaptiveRecvByteBufAllocator$HandleImpl.allocate(AdaptiveRecvByteBufAllocator.java:104) ~[netty-all-4.0.20.Final.jar:4.0.20.Final] at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:112) ~[netty-all-4.0.20.Final.jar:4.0.20.Final] at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:507) ~[netty-all-4.0.20.Final.jar:4.0.20.Final] at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:464) ~[netty-all-4.0.20.Final.jar:4.0.20.Final] at 
io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:378) ~[netty-all-4.0.20.Final.jar:4.0.20.Final] at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:350) ~[netty-all-4.0.20.Final.jar:4.0.20.Final] at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116) ~[netty-all-4.0.20.Final.jar:4.0.20.Final] at java.lang.Thread.run(Thread.java:724) ~[na:1.7.0_25] {code} The test consisted of a 3-node cluster of n1-standard-1 GCE instances (1 vCPU, 3.75 GB RAM) running cassandra-2.1.0-rc5, and a n1-standard-2 instance running the test. After ~2.5 days, several requests start to fail and we see the previous stacktraces in the system.log file. The output from linux ‘free’ and ‘meminfo’ suggests that there is still memory available. {code} $ free -m total used free shared buffers cached Mem: 3702 3532 169 0 161 854 -/+ buffers/cache: 2516 1185 Swap: 0 0 0 $ head -n 4 /proc/meminfo MemTotal: 3791292 kB MemFree: 173568 kB
[jira] [Assigned] (CASSANDRA-7743) Possible C* OOM issue during long running test
[ https://issues.apache.org/jira/browse/CASSANDRA-7743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benedict reassigned CASSANDRA-7743:

Assignee: Benedict

Possible C* OOM issue during long running test
----------------------------------------------
Key: CASSANDRA-7743
URL: https://issues.apache.org/jira/browse/CASSANDRA-7743
Project: Cassandra
Issue Type: Bug
Components: Core
Environment: Google Compute Engine, n1-standard-1
Reporter: Pierre Laporte
Assignee: Benedict
Fix For: 2.1.0

During a long-running test, we ended up with a lot of java.lang.OutOfMemoryError: Direct buffer memory errors on the Cassandra instances. Here is an example of a stacktrace from system.log:

{code}
ERROR [SharedPool-Worker-1] 2014-08-11 11:09:34,610 ErrorMessage.java:218 - Unexpected exception during request
java.lang.OutOfMemoryError: Direct buffer memory
	at java.nio.Bits.reserveMemory(Bits.java:658) ~[na:1.7.0_25]
	at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:123) ~[na:1.7.0_25]
	at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:306) ~[na:1.7.0_25]
	at io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:434) ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
	at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:179) ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
	at io.netty.buffer.PoolArena.allocate(PoolArena.java:168) ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
	at io.netty.buffer.PoolArena.allocate(PoolArena.java:98) ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
	at io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:251) ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
	at io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:155) ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
	at io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:146) ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
	at io.netty.buffer.AbstractByteBufAllocator.ioBuffer(AbstractByteBufAllocator.java:107) ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
	at io.netty.channel.AdaptiveRecvByteBufAllocator$HandleImpl.allocate(AdaptiveRecvByteBufAllocator.java:104) ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
	at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:112) ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:507) ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:464) ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:378) ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:350) ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
	at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116) ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
	at java.lang.Thread.run(Thread.java:724) ~[na:1.7.0_25]
{code}

The test consisted of a 3-node cluster of n1-standard-1 GCE instances (1 vCPU, 3.75 GB RAM) running cassandra-2.1.0-rc5, and an n1-standard-2 instance running the test. After ~2.5 days, several requests start to fail and we see the previous stacktraces in the system.log file. The output from Linux ‘free’ and ‘meminfo’ suggests that there is still memory available.

{code}
$ free -m
             total       used       free     shared    buffers     cached
Mem:          3702       3532        169          0        161        854
-/+ buffers/cache:       2516       1185
Swap:            0          0          0

$ head -n 4 /proc/meminfo
MemTotal:        3791292 kB
MemFree:          173568 kB
Buffers:          165608 kB
Cached:           874752 kB
{code}

These errors do not affect all the queries we run. The cluster is still responsive but is unable to display tracing information using cqlsh:

{code}
$ ./bin/nodetool --host 10.240.137.253 status duration_test
Datacenter: DC1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address         Load       Tokens  Owns (effective)  Host ID                               Rack
UN  10.240.98.27    925.17 KB  256     100.0%            41314169-eff5-465f-85ea-d501fd8f9c5e  RAC1
UN  10.240.137.253  1.1 MB     256     100.0%            c706f5f9-c5f3-4d5e-95e9-a8903823827e  RAC1
UN  10.240.72.183   896.57 KB  256     100.0%            15735c4d-98d4-4ea4-a305-7ab2d92f65fc  RAC1

$ echo 'tracing on; select count(*) from duration_test.ints;' | ./bin/cqlsh
{code}
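For anyone reproducing this, the direct-buffer pool that the OOM refers to can be watched from inside the JVM via the standard java.lang.management.BufferPoolMXBean, independently of what `free` reports (the OS counts the memory as in use by the process either way). This is a generic diagnostic sketch, not part of any patch on this ticket:

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;

public class DirectMemoryProbe {
    public static void main(String[] args) {
        // The "direct" pool tracks ByteBuffer.allocateDirect usage; its capacity
        // is bounded by -XX:MaxDirectMemorySize (by default roughly the max heap).
        for (BufferPoolMXBean pool : ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            System.out.printf("%s: count=%d used=%d bytes capacity=%d bytes%n",
                    pool.getName(), pool.getCount(), pool.getMemoryUsed(), pool.getTotalCapacity());
        }
    }
}
```

Polling these beans over the ~2.5 day run would show whether direct usage climbs monotonically toward the limit, which is what the leak hypothesis below predicts.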
[jira] [Commented] (CASSANDRA-7743) Possible C* OOM issue during long running test
[ https://issues.apache.org/jira/browse/CASSANDRA-7743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14096619#comment-14096619 ] Benedict commented on CASSANDRA-7743:

Hmm. So, looking at this a little more closely, I think this may effectively be a netty bug after all. It looks like no matter what pool/thread a pooled bytebuf is allocated on, it gets returned to the pool of the thread that _releases_ it. This means it simply accumulates indefinitely (up to the pool limit, which defaults to 32Mb) in the SEPWorkers, since they never themselves _allocate_, only release. [~norman] is that analysis correct? If so, it looks like this behaviour is somewhat unexpected and not ideal. However we can work around it for now.
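The accumulation described here can be modelled without netty at all. In this deliberately simplified sketch (the ThreadLocal free-list is an illustrative stand-in for netty's per-thread buffer cache, not its actual API), release() returns memory to the *releasing* thread's cache, so a thread that only releases hoards buffers the allocating thread never sees again:

```java
import java.util.ArrayDeque;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

public class ThreadCacheModel {
    static final AtomicInteger freshAllocations = new AtomicInteger();

    // Stand-in for netty's per-thread buffer cache: each thread has its own free-list.
    static final ThreadLocal<ArrayDeque<byte[]>> cache =
            ThreadLocal.withInitial(ArrayDeque::new);

    static byte[] allocate() {
        byte[] buf = cache.get().poll();     // only *this* thread's cache is consulted
        if (buf != null) return buf;
        freshAllocations.incrementAndGet();  // cache miss: allocate new memory
        return new byte[1024];
    }

    static void release(byte[] buf) {
        cache.get().push(buf);               // cached by the *releasing* thread
    }

    public static void main(String[] args) throws Exception {
        // Main thread only allocates; a long-lived worker only releases
        // (the SEPWorker situation described in the comment).
        ExecutorService releaser = Executors.newSingleThreadExecutor();
        for (int i = 0; i < 100; i++) {
            byte[] buf = allocate();
            releaser.submit(() -> release(buf)).get();
        }
        releaser.shutdown();
        // Every allocation was fresh: released buffers piled up in the worker's
        // cache, which the allocating thread can never reach.
        System.out.println("fresh allocations: " + freshAllocations.get()); // 100
    }
}
```

In the real allocator the analogue of this cache is bounded per thread, but with many worker threads the bounded caches still add up to a large amount of direct memory that is never reused.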
[jira] [Commented] (CASSANDRA-7743) Possible C* OOM issue during long running test
[ https://issues.apache.org/jira/browse/CASSANDRA-7743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14096631#comment-14096631 ] Benedict commented on CASSANDRA-7743:

I haven't got to that stage yet, I'm just analysing the code right now. It's why I asked for your input, was hoping you could disabuse me if I'm completely wrong. I don't 100% understand the control flow, as it doesn't make much sense (to me) to be adding it to a different cache. However if you look in PooledByteBuf.deallocate(), it calls PoolArena.free() to release the memory, which in turn calls parent.threadCache.get().add() to cache its memory; obviously the threadCache.get() is grabbing the threadlocal cache for the thread releasing, not the source PoolThreadCache.

Also worth noting I'm not convinced that, even if I'm correct, this fully explains the behaviour. We should only release on a different thread if an exception occurs during processing anyway, so I'm still digging for a more satisfactory full explanation of the behaviour.
[jira] [Commented] (CASSANDRA-7743) Possible C* OOM issue during long running test
[ https://issues.apache.org/jira/browse/CASSANDRA-7743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14096638#comment-14096638 ] Benedict commented on CASSANDRA-7743:

We're conflating two pools maybe :) I mean the pool of memory the thread can allocate from. So, to confirm I have this right, if you have two threads A and B, A only allocating and B only releasing, you would get memory accumulating up to max pool size in B, and A always allocating new memory?
[jira] [Commented] (CASSANDRA-7743) Possible C* OOM issue during long running test
[ https://issues.apache.org/jira/browse/CASSANDRA-7743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14096651#comment-14096651 ] Benedict commented on CASSANDRA-7743:

bq. well it will be released after a while if not used.

how long? it shouldn't ever be used, and it looks like it accumulates gigabytes in total over the course of a few days (around 16-32Mb per thread)

bq. just pass in 0 for int tinyCacheSize, int smallCacheSize, int normalCacheSize.

Won't that obviate most of the benefit of the pooled buffers? I plan to simply prevent our deallocating on the other threads.
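The workaround sketched in that last sentence — not deallocating on the other threads — amounts to handing each release back to the thread that allocated, so buffers return to the cache they came from. A simplified model of the idea (same caveat as before: the ThreadLocal free-list is an illustrative stand-in, not netty code):

```java
import java.util.ArrayDeque;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

public class ReleaseOnOwnerThread {
    static final AtomicInteger freshAllocations = new AtomicInteger();
    static final ThreadLocal<ArrayDeque<byte[]>> cache =
            ThreadLocal.withInitial(ArrayDeque::new);

    static byte[] allocate() {
        byte[] buf = cache.get().poll();
        if (buf != null) return buf;
        freshAllocations.incrementAndGet();
        return new byte[1024];
    }

    static void release(byte[] buf) { cache.get().push(buf); }

    public static void main(String[] args) throws Exception {
        ExecutorService owner = Executors.newSingleThreadExecutor(); // the allocating (event loop) thread
        Callable<byte[]> alloc = ReleaseOnOwnerThread::allocate;
        for (int i = 0; i < 100; i++) {
            byte[] buf = owner.submit(alloc).get();
            // The finishing thread (here: main) does not release the buffer itself;
            // it schedules the release back on the owner, so the buffer lands in
            // the cache it was allocated from and is reused on the next iteration.
            owner.submit(() -> release(buf)).get();
        }
        owner.shutdown();
        System.out.println("fresh allocations: " + freshAllocations.get()); // 1
    }
}
```

Compared with zeroing the cache sizes, this keeps the pooling benefit: only the very first iteration pays for a real allocation.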
[jira] [Commented] (CASSANDRA-7704) FileNotFoundException during STREAM-OUT triggers 100% CPU usage
[ https://issues.apache.org/jira/browse/CASSANDRA-7704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14096779#comment-14096779 ] Benedict commented on CASSANDRA-7704:

Not cleaning up resources is really not ideal in my book, however there is absolutely no reason we need to cancel with interruption - note this does *not* always result in a cancelled state if it ran, only if it was in the middle of running at the time (but still completed), and this can be fixed by not permitting it to be interrupted.

However this is not the problem - in the test it will often be the case that the task was _genuinely_ successfully cancelled. In my opinion the test is broken, since previously there was _no_ guarantee that all cancellations would run (although the cancellation in the test case will); after the last task completes successfully the scheduled tasks were all removed from the queue (but *not* cancelled), so the behaviour of the future in this case would be to never return, which is much more surprising and inconsistent in my book. I'm not entirely sure what the offending line is intended to test, anyway?

FileNotFoundException during STREAM-OUT triggers 100% CPU usage
---------------------------------------------------------------
Key: CASSANDRA-7704
URL: https://issues.apache.org/jira/browse/CASSANDRA-7704
Project: Cassandra
Issue Type: Bug
Reporter: Rick Branson
Assignee: Benedict
Attachments: 7704.20.v2.txt, 7704.txt, backtrace.txt, other-errors.txt

See attached backtrace which was what triggered this. This stream failed and then ~12 seconds later it emitted that exception. At that point, all CPUs went to 100%. A thread dump shows all the ReadStage threads stuck inside IntervalTree.searchInternal inside of CFS.markReferenced().

-- This message was sent by Atlassian JIRA (v6.2#6252)
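The distinction being drawn — a cancel racing with a task that is mid-run versus one that has already completed — can be seen with a plain java.util.concurrent.FutureTask (a generic illustration of the JDK semantics, not the code under review):

```java
import java.util.concurrent.FutureTask;

public class CancelSemantics {
    public static void main(String[] args) throws Exception {
        // A task that has not started yet can be cancelled without interruption,
        // and it then reports a cancelled state.
        FutureTask<String> pending = new FutureTask<>(() -> "ran");
        System.out.println(pending.cancel(false) + " " + pending.isCancelled()); // true true

        // A task that already ran to completion cannot be cancelled at all,
        // with or without interruption; its result remains available.
        FutureTask<String> finished = new FutureTask<>(() -> "ran");
        finished.run();
        System.out.println(finished.cancel(true) + " " + finished.get()); // false ran
    }
}
```

The awkward middle case the comment describes is a cancel(true) that arrives while the task body is executing: the future then reports cancelled even though the body may run to completion, which is why cancelling without interruption is suggested.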
[jira] [Updated] (CASSANDRA-6726) Recycle CompressedRandomAccessReader/RandomAccessReader buffers independently of their owners, and move them off-heap when possible
[ https://issues.apache.org/jira/browse/CASSANDRA-6726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benedict updated CASSANDRA-6726: Assignee: Branimir Lambov Recycle CompressedRandomAccessReader/RandomAccessReader buffers independently of their owners, and move them off-heap when possible --- Key: CASSANDRA-6726 URL: https://issues.apache.org/jira/browse/CASSANDRA-6726 Project: Cassandra Issue Type: Improvement Reporter: Benedict Assignee: Branimir Lambov Priority: Minor Labels: performance Fix For: 3.0 Whilst CRAR and RAR are pooled, we could and probably should pool the buffers independently, so that they are not tied to a specific sstable. It may be possible to move the RAR buffer off-heap, and the CRAR sometimes (e.g. Snappy may possibly support off-heap buffers) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (CASSANDRA-5902) Dealing with hints after a topology change
[ https://issues.apache.org/jira/browse/CASSANDRA-5902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benedict updated CASSANDRA-5902: Assignee: Branimir Lambov Dealing with hints after a topology change -- Key: CASSANDRA-5902 URL: https://issues.apache.org/jira/browse/CASSANDRA-5902 Project: Cassandra Issue Type: Bug Reporter: Jonathan Ellis Assignee: Branimir Lambov Priority: Minor Hints are stored and delivered by destination node id. This allows them to survive IP changes in the target, while keeping a scan of all the hints for a given destination an efficient operation. However, we do not detect and handle a new node assuming responsibility for the hinted row via bootstrap before it can be delivered. I think we have to take a performance hit in this case -- we need to deliver such a hint to *all* replicas, since we don't know which is the new one. This happens infrequently enough, however -- requiring first the target node to be down to create the hint, then the hint owner to be down long enough for the target to both recover and stream to a new node -- that this should be okay. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (CASSANDRA-7039) DirectByteBuffer compatible LZ4 methods
[ https://issues.apache.org/jira/browse/CASSANDRA-7039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benedict updated CASSANDRA-7039: Assignee: Branimir Lambov (was: Lyuben Todorov) DirectByteBuffer compatible LZ4 methods --- Key: CASSANDRA-7039 URL: https://issues.apache.org/jira/browse/CASSANDRA-7039 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Benedict Assignee: Branimir Lambov Priority: Minor Labels: performance Fix For: 3.0 As we move more things off-heap, it's becoming more and more essential to be able to use DirectByteBuffer (or native pointers) in various places. Unfortunately LZ4 doesn't currently support this operation, despite being JNI based - this means not only do we have to perform unnecessary copies to de/compress data from a DBB, but we can also stall GC, since any JNI method operating over a java array using GetPrimitiveArrayCritical enters a critical section that prevents GC for its duration. This means STWs will be at least as long as any running compression/decompression (and no GC will happen until they complete, so it's additive). We should temporarily fork (and then resubmit upstream) jpountz-lz4 to support operating over a native pointer, so that we can pass a DBB or a raw pointer we have allocated ourselves. This will help improve performance when flushing the new offheap memtables, as well as enable us to implement CASSANDRA-6726 and finish CASSANDRA-4338. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (CASSANDRA-7546) AtomicSortedColumns.addAllWithSizeDelta has a spin loop that allocates memory
[ https://issues.apache.org/jira/browse/CASSANDRA-7546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benedict updated CASSANDRA-7546: Fix Version/s: 2.1.1 2.0.11 AtomicSortedColumns.addAllWithSizeDelta has a spin loop that allocates memory - Key: CASSANDRA-7546 URL: https://issues.apache.org/jira/browse/CASSANDRA-7546 Project: Cassandra Issue Type: Bug Components: Core Reporter: graham sanderson Assignee: graham sanderson Fix For: 2.0.11, 2.1.1 Attachments: 7546.20.txt, 7546.20_2.txt, 7546.20_3.txt, 7546.20_4.txt, 7546.20_5.txt, 7546.20_6.txt, 7546.20_7.txt, 7546.20_7b.txt, 7546.20_alt.txt, suggestion1.txt, suggestion1_21.txt In order to preserve atomicity, this code attempts to read, clone/update, then CAS the state of the partition. Under heavy contention for updating a single partition this can cause some fairly staggering memory growth (the more cores on your machine, the worse it gets). Whilst many usage patterns don't do highly concurrent updates to the same partition, hinting today does, and in this case wild (order(s) of magnitude more than expected) memory allocation rates can be seen (especially when the updates being hinted are small updates to different partitions which can happen very fast on their own) - see CASSANDRA-7545 It would be best to eliminate/reduce/limit the spinning memory allocation whilst not slowing down the very common un-contended case. -- This message was sent by Atlassian JIRA (v6.2#6252)
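The read/clone/CAS pattern described here can be sketched in miniature (TreeMap and AtomicReference are illustrative stand-ins for the partition's column holder, not Cassandra's classes). Every failed CAS discards a freshly copied map — exactly the contended allocation the ticket is about:

```java
import java.util.TreeMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;

public class SpinCloneCas {
    static final AtomicReference<TreeMap<String, String>> partition =
            new AtomicReference<>(new TreeMap<>());
    static final AtomicInteger copies = new AtomicInteger();

    static void addColumn(String name, String value) {
        while (true) {
            TreeMap<String, String> current = partition.get();
            TreeMap<String, String> updated = new TreeMap<>(current); // full copy on EVERY attempt
            copies.incrementAndGet();
            updated.put(name, value);
            if (partition.compareAndSet(current, updated))
                return; // on contention the copy above is garbage and we loop again
        }
    }

    public static void main(String[] args) throws Exception {
        Thread[] writers = new Thread[8];
        for (int t = 0; t < writers.length; t++) {
            final int id = t;
            writers[t] = new Thread(() -> {
                for (int i = 0; i < 1000; i++) addColumn("col-" + id + "-" + i, "v");
            });
            writers[t].start();
        }
        for (Thread w : writers) w.join();
        // 8000 entries always land; any copies beyond 8000 are pure contention garbage,
        // and the copies grow with the partition, so the waste compounds as it fills.
        System.out.println("entries=" + partition.get().size() + " copies=" + copies.get());
    }
}
```

On a many-core box the excess of `copies` over the number of successful updates is what shows up as the "staggering" allocation rate.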
[jira] [Updated] (CASSANDRA-7561) On DROP we should invalidate CounterKeyCache as well as Key/Row cache
[ https://issues.apache.org/jira/browse/CASSANDRA-7561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benedict updated CASSANDRA-7561: Fix Version/s: 2.1.0 On DROP we should invalidate CounterKeyCache as well as Key/Row cache - Key: CASSANDRA-7561 URL: https://issues.apache.org/jira/browse/CASSANDRA-7561 Project: Cassandra Issue Type: Bug Reporter: Benedict Assignee: Aleksey Yeschenko Priority: Minor Fix For: 2.1.0 We should also probably ensure we don't attempt to auto save _any_ of the caches while they are in an inconsistent state (i.e. there are keys present to be saved that should not be restored, or that would throw exceptions when we save (e.g. CounterCacheKey)) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (CASSANDRA-7704) FileNotFoundException during STREAM-OUT triggers 100% CPU usage
[ https://issues.apache.org/jira/browse/CASSANDRA-7704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benedict updated CASSANDRA-7704:

Fix Version/s: 2.1.0
               2.0.10

-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (CASSANDRA-3852) use LIFO queueing policy when queue size exceeds thresholds
[ https://issues.apache.org/jira/browse/CASSANDRA-3852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benedict updated CASSANDRA-3852: Fix Version/s: 3.0 use LIFO queueing policy when queue size exceeds thresholds --- Key: CASSANDRA-3852 URL: https://issues.apache.org/jira/browse/CASSANDRA-3852 Project: Cassandra Issue Type: Improvement Reporter: Peter Schuller Assignee: Peter Schuller Labels: performance Fix For: 3.0

A strict FIFO policy for queueing (between stages) is detrimental to latency and forward progress. Whenever a node is saturated beyond incoming request rate, *all* requests become slow. If it is consistently saturated, you start effectively timing out on *all* requests. A much better strategy from the point of view of latency is to serve a subset of requests quickly, letting some time out, rather than letting all either time out or be slow.

Care must be taken such that:

* We still guarantee that requests are processed reasonably timely (we couldn't go strict LIFO for example, as that would result in requests getting stuck potentially forever on a loaded node).
* Maybe, depending on the previous point's solution, ensure that some requests bypass the policy and get prioritized (e.g., schema migrations, or anything internal to a node).

A possible implementation is to go LIFO whenever there are requests in the queue that are older than N milliseconds (or a certain queue size, etc).

Benefits:

* All cases where the client is, directly or indirectly through other layers, affecting a system which has limited concurrency (e.g., thread pool size of X to serve some incoming request rate), it is *much* better for a few requests to time out while most are serviced quickly, than for all requests to become slow, as it doesn't explode concurrency. Think any random non-super-advanced php app, ruby web app, java servlet based app, etc. Essentially, it optimizes very heavily for improved average latencies.
* Systems with strict p95/p99/p999 requirements on latencies should greatly benefit from such a policy. For example, suppose you have a system at 85% of capacity, and it takes a write spike (or has a hiccup like a GC pause, blocking on a commit log write, etc). Suppose the hiccup racks up 500 ms worth of requests. At a 15% margin at steady state, that takes 500ms * 100/15 = 3.2 seconds to recover. Instead of *all* requests for an entire 3.2 second window being slow, we'd serve requests quickly for 2.7 of those seconds, with the incoming requests during that 500 ms interval being the ones primarily affected. The flip side though is that once you're at the point where more than N percent of requests end up having to wait for others to take LIFO priority, the p(100-N) latencies will actually be *worse* than without this change (but at this point you have to consider what the root reason for those pXX requirements are).
* In the case of complete saturation, it allows forward progress. Suppose you're taking 25% more traffic than you are able to handle. Instead of getting backed up and ending up essentially timing out *every single request*, you will succeed in processing up to 75% of them (I say up to because it depends; for example on a {{QUORUM}} request you need at least two of the requests from the co-ordinator to succeed, so the percentage is brought down) and allowing clients to make forward progress and get work done, rather than being stuck.

-- This message was sent by Atlassian JIRA (v6.2#6252)
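The "go LIFO whenever the queue holds requests older than N milliseconds" proposal can be sketched as a small single-threaded policy (the names and the explicit millisecond clock parameter are illustrative; a real implementation would need thread safety plus the bypass list for prioritized requests mentioned above):

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class AdaptiveQueueDemo {
    static class Entry<T> {
        final T item; final long enqueuedAt;
        Entry(T item, long enqueuedAt) { this.item = item; this.enqueuedAt = enqueuedAt; }
    }

    static class AdaptiveQueue<T> {
        private final Deque<Entry<T>> deque = new ArrayDeque<>();
        private final long thresholdMillis;
        AdaptiveQueue(long thresholdMillis) { this.thresholdMillis = thresholdMillis; }

        void offer(T item, long nowMillis) { deque.addLast(new Entry<>(item, nowMillis)); }

        T poll(long nowMillis) {
            Entry<T> head = deque.peekFirst();
            if (head == null) return null;
            // Backed up past the threshold: serve the *newest* request (LIFO) so some
            // requests stay fast, letting the old backlog absorb the timeouts.
            boolean backlogged = nowMillis - head.enqueuedAt > thresholdMillis;
            return (backlogged ? deque.pollLast() : deque.pollFirst()).item;
        }
    }

    public static void main(String[] args) {
        AdaptiveQueue<String> q = new AdaptiveQueue<>(10);
        q.offer("a", 0); q.offer("b", 1); q.offer("c", 2);
        System.out.println(q.poll(5));   // a  (oldest waited 5 ms <= 10: normal FIFO)
        System.out.println(q.poll(100)); // c  (oldest waited 99 ms > 10: switch to LIFO)
        System.out.println(q.poll(100)); // b
    }
}
```

Note the policy is self-correcting: once the backlog drains enough that the head is younger than the threshold, dequeueing reverts to FIFO, which addresses the "stuck forever" concern with strict LIFO.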
[jira] [Commented] (CASSANDRA-7542) Reduce CAS contention
[ https://issues.apache.org/jira/browse/CASSANDRA-7542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14096805#comment-14096805 ] Benedict commented on CASSANDRA-7542: - [~kohlisankalp] any news? Reduce CAS contention - Key: CASSANDRA-7542 URL: https://issues.apache.org/jira/browse/CASSANDRA-7542 Project: Cassandra Issue Type: Improvement Reporter: sankalp kohli Assignee: Benedict Fix For: 2.0.10 CAS updates on the same CQL partition can lead to heavy contention inside C*. I am looking for simple ways (no algorithmic changes) to reduce contention, as its penalty is high in terms of latency, especially for reads. We can put some sort of synchronization on the CQL partition at the StorageProxy level. This will reduce contention at least for all requests landing on one box for the same partition. Here is an example of why it will help: 1) Say 1 write and 2 read CAS requests for the same partition key are sent to C* in parallel. 2) Since the client is token-aware, it sends these 3 requests to the same C* instance A. (Let's assume that all 3 requests go to the same instance A.) 3) In this C* instance A, all 3 CAS requests will contend with each other in Paxos. (This is bad.) To improve the contention in 3), what I am proposing is to add a lock on the partition key, similar to what we do in PaxosState.java, to serialize these 3 requests. This will remove the contention and improve performance, as these 3 requests will not collide with each other. Another improvement we can make in the client is to pick a deterministic live replica for a given partition doing CAS. -- This message was sent by Atlassian JIRA (v6.2#6252)
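The StorageProxy-level synchronization proposed in the ticket could take the shape of lock striping keyed on the partition key. This is a minimal sketch under assumed names, not the actual patch:

```java
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.Supplier;

// Sketch of lock striping on the partition key, as proposed for serializing
// CAS requests that land on the same coordinator. Names are illustrative.
final class PartitionLocks {
    private final ReentrantLock[] locks;

    PartitionLocks(int stripes) {
        locks = new ReentrantLock[stripes];
        for (int i = 0; i < stripes; i++) locks[i] = new ReentrantLock();
    }

    private ReentrantLock lockFor(byte[] partitionKey) {
        int h = java.util.Arrays.hashCode(partitionKey);
        return locks[(h & 0x7fffffff) % locks.length];
    }

    // CAS requests for the same partition serialize here, instead of repeatedly
    // contending (and restarting ballots) inside Paxos itself.
    <T> T withLock(byte[] partitionKey, Supplier<T> casOperation) {
        ReentrantLock lock = lockFor(partitionKey);
        lock.lock();
        try { return casOperation.get(); }
        finally { lock.unlock(); }
    }
}
```

Striping trades a small chance of false sharing (two different partitions hashing to the same stripe) for bounded memory, which is the same trade-off made by the per-key locks in PaxosState.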
[jira] [Resolved] (CASSANDRA-6780) Memtable OffHeap GC Statistics
[ https://issues.apache.org/jira/browse/CASSANDRA-6780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benedict resolved CASSANDRA-6780. - Resolution: Later Since offheap GC has been postponed indefinitely, this ticket should also be closed, to be revisited later. Memtable OffHeap GC Statistics -- Key: CASSANDRA-6780 URL: https://issues.apache.org/jira/browse/CASSANDRA-6780 Project: Cassandra Issue Type: Improvement Reporter: Benedict Priority: Minor As mentioned in CASSANDRA-6689, it would be nice to expose via JMX some statistics on GC behaviour, instead of just optionally debug logging it (and maybe expand to cover some more things): - Time spent in GC - Amount of memory reclaimed - Number of collections (per CFS?), and average reclaimed per collection -- This message was sent by Atlassian JIRA (v6.2#6252)
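Exposing counters like those listed above over JMX is mechanically simple. The following sketch is illustrative only — the MBean object name and attribute names are assumptions, not what the ticket or Cassandra defines:

```java
import java.lang.management.ManagementFactory;
import java.util.concurrent.atomic.AtomicLong;
import javax.management.MBeanServer;
import javax.management.ObjectName;

// Sketch of exposing the requested memtable-GC statistics over JMX.
// Attribute and object names are illustrative; the real metrics would
// presumably hang off each ColumnFamilyStore.
interface MemtableGCStatsMBean {
    long getTimeSpentInGCNanos();
    long getBytesReclaimed();
    long getCollectionCount();
}

final class MemtableGCStats implements MemtableGCStatsMBean {
    final AtomicLong gcNanos = new AtomicLong();      // time spent in GC
    final AtomicLong reclaimed = new AtomicLong();    // memory reclaimed
    final AtomicLong collections = new AtomicLong();  // number of collections

    public long getTimeSpentInGCNanos() { return gcNanos.get(); }
    public long getBytesReclaimed() { return reclaimed.get(); }
    public long getCollectionCount() { return collections.get(); }

    // Register as a standard MBean so tools like jconsole/nodetool can read it.
    void register() throws Exception {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        server.registerMBean(this, new ObjectName("org.example:type=MemtableGCStats"));
    }
}
```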
[jira] [Resolved] (CASSANDRA-6709) Changes to KeyCache
[ https://issues.apache.org/jira/browse/CASSANDRA-6709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benedict resolved CASSANDRA-6709. - Resolution: Later Closing as the new sstable format most likely makes this unnecessary by eliminating the need for a separate key cache, although we *may* want to revisit this at some point afterwards, since a separate cache could still be beneficial by improving memory occupancy rate; so closing as Later instead of Duplicate. Changes to KeyCache --- Key: CASSANDRA-6709 URL: https://issues.apache.org/jira/browse/CASSANDRA-6709 Project: Cassandra Issue Type: Improvement Reporter: Benedict Priority: Minor It seems to me that KeyCache can be improved in a number of ways, but first let's state the basic goal of KeyCache: to reduce the average query response time by providing an exact seek position in a file for a given key. As it stands, KeyCache is 100% accurate but requires a lot of overhead per entry. I propose to make KeyCache *mostly* accurate (say 99.9%), by which I mean it will always fail accurately, but may rarely return an incorrect address, and to code its end users to confirm they seeked to the correct position in the file, and to retry without the cache if they did not. The advantage of this is that we can both take the cache off-heap easily and pack a lot more items into the cache. If we permit collisions across files and simply use the (full 128-bit) murmur hash of the key + cfId + file generation, we should get enough uniqueness to rarely have an erroneous collision; however, we will be using only 20 bytes per key, instead of the current 100 + key length bytes. This should allow us to answer far more queries from the key cache than before, so the positive improvement to performance should outweigh the cost of the occasional retry. 
For the structure I propose an associative cache, where a single contiguous address space is broken up into regions of, say, 8 entries, plus one counter. The counter tracks the recency of access of each of the entries, so that on write the least recently accessed/written can be replaced. A linear probe within the region is used to determine if the entry we're looking for is present. This should be very quick, as the entire region should fit into one or two lines of L1. Advantage: we may see 5x bump in cache hit-rate, or even more for large keys. -- This message was sent by Atlassian JIRA (v6.2#6252)
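The region structure described above can be modelled in a few lines. This is a simplified on-heap illustration of the proposal — 8-entry regions, a per-region recency record, linear probe on lookup, least-recently-touched replacement on insert. The real design would store 128-bit hashes off-heap in a contiguous address space; long keys/values in plain arrays are used here purely to show the structure:

```java
// Simplified model of the proposed associative key cache: the address space is
// split into regions of 8 entries; lookup is a linear probe within one region;
// on insert, the least recently touched entry in that region is replaced.
final class SetAssociativeCache {
    private static final int WAYS = 8;                 // entries per region
    private final long[] keys, values;
    private final long[] lastTouched;                  // recency "counter" per entry
    private long tick;

    SetAssociativeCache(int regions) {
        keys = new long[regions * WAYS];
        values = new long[regions * WAYS];
        lastTouched = new long[regions * WAYS];
        java.util.Arrays.fill(keys, Long.MIN_VALUE);   // sentinel for "empty"
    }

    private int regionStart(long key) {
        int regions = keys.length / WAYS;
        return (int) Long.remainderUnsigned(Long.hashCode(key) & 0xffffffffL, regions) * WAYS;
    }

    // Linear probe within the region; a miss means the caller falls back to the
    // index/file (and, per the proposal, verifies any hit before trusting it).
    Long get(long key) {
        int start = regionStart(key);
        for (int i = start; i < start + WAYS; i++)
            if (keys[i] == key) { lastTouched[i] = ++tick; return values[i]; }
        return null;
    }

    void put(long key, long value) {
        int start = regionStart(key), victim = start;
        for (int i = start; i < start + WAYS; i++) {
            if (keys[i] == key || keys[i] == Long.MIN_VALUE) { victim = i; break; }
            if (lastTouched[i] < lastTouched[victim]) victim = i;   // LRU within region
        }
        keys[victim] = key; values[victim] = value; lastTouched[victim] = ++tick;
    }
}
```

The point of the 8-way region is locality: one probe touches a handful of adjacent entries, so the whole region fits in one or two L1 cache lines, as the ticket notes.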
[jira] [Resolved] (CASSANDRA-6802) Row cache improvements
[ https://issues.apache.org/jira/browse/CASSANDRA-6802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benedict resolved CASSANDRA-6802. - Resolution: Later Since offheap GC has been postponed indefinitely, this ticket should also be closed, to be revisited later. Row cache improvements -- Key: CASSANDRA-6802 URL: https://issues.apache.org/jira/browse/CASSANDRA-6802 Project: Cassandra Issue Type: Improvement Reporter: Marcus Eriksson Labels: performance Fix For: 3.0 There are a few things we could do: * Start using the native memory constructs from CASSANDRA-6694 to avoid serialization/deserialization costs and to minimize the on-heap overhead * Stop invalidating cached rows on writes (update on write instead). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (CASSANDRA-5019) Still too much object allocation on reads
[ https://issues.apache.org/jira/browse/CASSANDRA-5019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benedict resolved CASSANDRA-5019. - Resolution: Duplicate Since the read path will be rewritten as part of efforts to introduce CASSANDRA-7447 (both as regards internal APIs and the implementation details of the new format), this ticket should be addressed by doing things right there. This may mean the legacy format continues to be somewhat inefficient, but it may or may not eventually be retired entirely, so there is probably not much point spending a lot of time optimising it, especially when the impact is unknown and probably not dramatic in relation to the other costs associated with this format. Still too much object allocation on reads - Key: CASSANDRA-5019 URL: https://issues.apache.org/jira/browse/CASSANDRA-5019 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Jonathan Ellis Labels: performance Fix For: 3.0 ArrayBackedSortedColumns was a step in the right direction but it's still relatively heavyweight thanks to allocating individual Columns. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-7029) Investigate alternative transport protocols for both client and inter-server communications
[ https://issues.apache.org/jira/browse/CASSANDRA-7029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14096817#comment-14096817 ] Benedict commented on CASSANDRA-7029: - mTCP is not stable enough, nor universal enough, to be useful to us. It requires very specific Linux kernel versions, and very specific network interfaces, in order to work. If it matures it will be worth revisiting. Investigate alternative transport protocols for both client and inter-server communications --- Key: CASSANDRA-7029 URL: https://issues.apache.org/jira/browse/CASSANDRA-7029 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Benedict Labels: performance Fix For: 3.0 There are a number of reasons to think we can do better than TCP for our communications: 1) We can actually tolerate sporadic small message losses, so guaranteed delivery isn't essential (although for larger messages it probably is) 2) As shown in \[1\] and \[2\], Linux can behave quite suboptimally with regard to TCP message delivery when the system is under load. Judging from the theoretical description, this is likely to apply even when the system load is not high, but the number of processes to schedule is high. Cassandra generally has a lot of threads to schedule, so this is quite pertinent for us. UDP performs substantially better here. 3) Even when the system is not under load, UDP has a lower CPU burden, and that burden is constant regardless of the number of connections it processes. 4) On a simple benchmark on my local PC, using non-blocking IO for UDP and busy spinning on IO, I can actually push 20-40% more throughput through loopback (where TCP should be optimal, as there is no latency), even for very small messages. 
Since we can see networking taking multiple CPUs' worth of time during a stress test, using a busy-spin for ~100micros after last message receipt is almost certainly acceptable, especially as we can (ultimately) process inter-server and client communications on the same thread/socket in this model. 5) We can optimise the threading model heavily: since we generally process very small messages (200 bytes not at all implausible), the thread signalling costs on the processing thread can actually dramatically impede throughput. In general it costs ~10micros to signal (and passing the message to another thread for processing in the current model requires signalling). For 200-byte messages this caps our throughput at 20MB/s. I propose to knock up a highly naive UDP-based connection protocol with super-trivial congestion control over the course of a few days, with the only initial goal being maximum possible performance (not fairness, reliability, or anything else), and trial it in Netty (possibly making some changes to Netty to mitigate thread signalling costs). The reason for knocking up our own here is to get a ceiling on what the absolute limit of potential for this approach is. Assuming this pans out with performance gains in C* proper, we then look to contributing to/forking the udt-java project and see how easy it is to bring performance in line with what we can get with our naive approach (I don't suggest starting here, as the project is using blocking old-IO, and modifying it with latency in mind may be challenging, and we won't know for sure what the best case scenario is). 
\[1\] http://test-docdb.fnal.gov/0016/001648/002/Potential%20Performance%20Bottleneck%20in%20Linux%20TCP.PDF \[2\] http://cd-docdb.fnal.gov/cgi-bin/RetrieveFile?docid=1968;filename=Performance%20Analysis%20of%20Linux%20Networking%20-%20Packet%20Receiving%20(Official).pdf;version=2 Further related reading: http://public.dhe.ibm.com/software/commerce/doc/mft/cdunix/41/UDTWhitepaper.pdf https://mospace.umsystem.edu/xmlui/bitstream/handle/10355/14482/ChoiUndPerTcp.pdf?sequence=1 https://access.redhat.com/site/documentation/en-US/JBoss_Enterprise_Web_Platform/5/html/Administration_And_Configuration_Guide/jgroups-perf-udpbuffer.html http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.153.3762rep=rep1type=pdf -- This message was sent by Atlassian JIRA (v6.2#6252)
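The busy-spin receive described in point 4 above can be illustrated with a non-blocking DatagramChannel polled in a tight loop for a short window after the last receipt. This is a minimal sketch of the idea only — a real implementation would fall back to a selector/park after the spin window rather than simply returning:

```java
import java.nio.ByteBuffer;
import java.nio.channels.DatagramChannel;

// Minimal illustration of the busy-spin receive idea: poll a non-blocking UDP
// socket in a tight loop, extending the spin window each time a message
// arrives, so a burst of small messages is drained without any syscall-level
// blocking or thread signalling.
final class SpinReceiver {
    static byte[] receiveSpinning(DatagramChannel ch, long spinNanos) throws Exception {
        ByteBuffer buf = ByteBuffer.allocateDirect(1500); // one MTU-sized datagram
        long deadline = System.nanoTime() + spinNanos;
        while (System.nanoTime() < deadline) {            // busy-spin, no blocking
            buf.clear();
            if (ch.receive(buf) != null) {
                buf.flip();
                byte[] out = new byte[buf.remaining()];
                buf.get(out);
                return out;                               // got a message within the window
            }
        }
        return null; // window expired; a real impl would now park or select()
    }
}
```

The ~100 microsecond window mentioned above would be passed as `spinNanos`; the trade-off is one core kept hot in exchange for avoiding the ~10 microsecond per-message signalling cost.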
[jira] [Commented] (CASSANDRA-7542) Reduce CAS contention
[ https://issues.apache.org/jira/browse/CASSANDRA-7542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14098326#comment-14098326 ] Benedict commented on CASSANDRA-7542: - OK. Not sure if it is worth our pursuing this right now then, at least as far as a 2.0 delivery is concerned. When I get some more free time I'll create some benchmarks to test how much of an improvement these (or future) changes have. Reduce CAS contention - Key: CASSANDRA-7542 URL: https://issues.apache.org/jira/browse/CASSANDRA-7542 Project: Cassandra Issue Type: Improvement Reporter: sankalp kohli Assignee: Benedict Fix For: 2.0.10 CAS updates on the same CQL partition can lead to heavy contention inside C*. I am looking for simple ways (no algorithmic changes) to reduce contention, as its penalty is high in terms of latency, especially for reads. We can put some sort of synchronization on the CQL partition at the StorageProxy level. This will reduce contention at least for all requests landing on one box for the same partition. Here is an example of why it will help: 1) Say 1 write and 2 read CAS requests for the same partition key are sent to C* in parallel. 2) Since the client is token-aware, it sends these 3 requests to the same C* instance A. (Let's assume that all 3 requests go to the same instance A.) 3) In this C* instance A, all 3 CAS requests will contend with each other in Paxos. (This is bad.) To improve the contention in 3), what I am proposing is to add a lock on the partition key, similar to what we do in PaxosState.java, to serialize these 3 requests. This will remove the contention and improve performance, as these 3 requests will not collide with each other. Another improvement we can make in the client is to pick a deterministic live replica for a given partition doing CAS. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (CASSANDRA-7704) FileNotFoundException during STREAM-OUT triggers 100% CPU usage
[ https://issues.apache.org/jira/browse/CASSANDRA-7704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benedict updated CASSANDRA-7704: Attachment: 7704-2.1.txt Attaching a new version which does not cancel the task that was run, and updates the unit tests to match the new behaviour FileNotFoundException during STREAM-OUT triggers 100% CPU usage --- Key: CASSANDRA-7704 URL: https://issues.apache.org/jira/browse/CASSANDRA-7704 Project: Cassandra Issue Type: Bug Reporter: Rick Branson Assignee: Benedict Fix For: 2.0.10, 2.1.0 Attachments: 7704-2.1.txt, 7704.txt, backtrace.txt, other-errors.txt See attached backtrace which was what triggered this. This stream failed and then ~12 seconds later it emitted that exception. At that point, all CPUs went to 100%. A thread dump shows all the ReadStage threads stuck inside IntervalTree.searchInternal inside of CFS.markReferenced(). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (CASSANDRA-7704) FileNotFoundException during STREAM-OUT triggers 100% CPU usage
[ https://issues.apache.org/jira/browse/CASSANDRA-7704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benedict updated CASSANDRA-7704: Attachment: (was: 7704.20.v2.txt) FileNotFoundException during STREAM-OUT triggers 100% CPU usage --- Key: CASSANDRA-7704 URL: https://issues.apache.org/jira/browse/CASSANDRA-7704 Project: Cassandra Issue Type: Bug Reporter: Rick Branson Assignee: Benedict Fix For: 2.0.10, 2.1.0 Attachments: 7704-2.1.txt, 7704.txt, backtrace.txt, other-errors.txt See attached backtrace which was what triggered this. This stream failed and then ~12 seconds later it emitted that exception. At that point, all CPUs went to 100%. A thread dump shows all the ReadStage threads stuck inside IntervalTree.searchInternal inside of CFS.markReferenced(). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-7763) cql_tests static_with_empty_clustering test failure
[ https://issues.apache.org/jira/browse/CASSANDRA-7763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14098406#comment-14098406 ] Benedict commented on CASSANDRA-7763: - It's a shame we've spotted this only now, as it's a bit late to optimise this again for 2.1, but we should perhaps revisit it later (for 3.0), as the introduction of these virtual method invocations was a large part of the reason for CASSANDRA-6934 in the first place. It should be possible to avoid these invocations on most calls, since static columns are only infrequently encountered, but let's leave it for now. This patch does need to include the changes to the AbstractCType.compareUnsigned, WithCollection.compare() and AbstractNativeCell.compare() methods as well, though. cql_tests static_with_empty_clustering test failure --- Key: CASSANDRA-7763 URL: https://issues.apache.org/jira/browse/CASSANDRA-7763 Project: Cassandra Issue Type: Bug Reporter: Ryan McGuire Assignee: Sylvain Lebresne Fix For: 2.1 rc6 Attachments: 7763.txt {code} == FAIL: static_with_empty_clustering_test (cql_tests.TestCQL) -- Traceback (most recent call last): File /home/ryan/git/datastax/cassandra-dtest/tools.py, line 213, in wrapped f(obj) File /home/ryan/git/datastax/cassandra-dtest/cql_tests.py, line 4082, in static_with_empty_clustering_test assert_one(cursor, SELECT * FROM test, ['partition1', '', 'static value', 'value']) File /home/ryan/git/datastax/cassandra-dtest/assertions.py, line 40, in assert_one assert res == [expected], res AssertionError: [[u'partition1', u'', None, None], [u'partition1', u'', None, None], [u'partition1', u'', None, u'value']] begin captured logging dtest: DEBUG: cluster ccm directory: /tmp/dtest-Ex54V7 - end captured logging - -- Ran 1 test in 6.866s FAILED (failures=1) {code} regression from CASSANDRA-7455? -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-7561) On DROP we should invalidate CounterKeyCache as well as Key/Row cache
[ https://issues.apache.org/jira/browse/CASSANDRA-7561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14098811#comment-14098811 ] Benedict commented on CASSANDRA-7561: - bq. Well. It shouldn't be throwing any exceptions, AFAIK CounterCacheKey.getPathInfo() is called during serialization, which is not safe if the CF has been dropped (since it will get a null cf back). So we still need to address preventing an autosave happening whilst the map contains keys that are in a dropped CF, or we need getPathInfo() at least to be safe during this (and return a result that is valid for all use cases), whichever is easiest. On DROP we should invalidate CounterKeyCache as well as Key/Row cache - Key: CASSANDRA-7561 URL: https://issues.apache.org/jira/browse/CASSANDRA-7561 Project: Cassandra Issue Type: Bug Reporter: Benedict Assignee: Aleksey Yeschenko Priority: Minor Fix For: 2.1.0 Attachments: 7561.txt We should also probably ensure we don't attempt to auto save _any_ of the caches while they are in an inconsistent state (i.e. there are keys present to be saved that should not be restored, or that would throw exceptions when we save (e.g. CounterCacheKey)) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Comment Edited] (CASSANDRA-7561) On DROP we should invalidate CounterKeyCache as well as Key/Row cache
[ https://issues.apache.org/jira/browse/CASSANDRA-7561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14098811#comment-14098811 ] Benedict edited comment on CASSANDRA-7561 at 8/15/14 5:46 PM: -- bq. Well. It shouldn't be throwing any exceptions, AFAIK CounterCacheKey.getPathInfo() is called during serialization, which is not safe if the CF has been dropped (since it will get a null cf back). So we still need to address preventing an autosave happening whilst the map contains keys that are in a dropped CF, or we need getPathInfo() at least to be safe during this (and return a result that is valid for all use cases), whichever is easiest. It looks like this bug may affect the row cache as well, except that we've simply never noticed it because the window is so small. I filed this ticket a long time ago so cannot remember where/why I saw this happen. Mea culpa for not recording it in the ticket in the first place. was (Author: benedict): bq. Well. It shouldn't be throwing any exceptions, AFAIK CounterCacheKey.getPathInfo() is called during serialization, which is not safe if the CF has been dropped (since it will get a null cf back). So we still need to address preventing an autosave happening whilst the map contains keys that are in a dropped CF, or we need getPathInfo() at least to be safe during this (and return a result that is valid for all use cases), whichever is easiest. On DROP we should invalidate CounterKeyCache as well as Key/Row cache - Key: CASSANDRA-7561 URL: https://issues.apache.org/jira/browse/CASSANDRA-7561 Project: Cassandra Issue Type: Bug Reporter: Benedict Assignee: Aleksey Yeschenko Priority: Minor Fix For: 2.1.0 Attachments: 7561.txt We should also probably ensure we don't attempt to auto save _any_ of the caches while they are in an inconsistent state (i.e. there are keys present to be saved that should not be restored, or that would throw exceptions when we save (e.g. CounterCacheKey)) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-7561) On DROP we should invalidate CounterKeyCache as well as Key/Row cache
[ https://issues.apache.org/jira/browse/CASSANDRA-7561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14098858#comment-14098858 ] Benedict commented on CASSANDRA-7561: - Since this is holding up 2.1-rc6, I'm comfortable splitting the remainder of the fix out into a separate ticket. The code as it stands at least reduces the bug to a window of risk after DROP rather than a guaranteed failure. On DROP we should invalidate CounterKeyCache as well as Key/Row cache - Key: CASSANDRA-7561 URL: https://issues.apache.org/jira/browse/CASSANDRA-7561 Project: Cassandra Issue Type: Bug Reporter: Benedict Assignee: Aleksey Yeschenko Priority: Minor Fix For: 2.1.0 Attachments: 7561.txt We should also probably ensure we don't attempt to auto save _any_ of the caches while they are in an inconsistent state (i.e. there are keys present to be saved that should not be restored, or that would throw exceptions when we save (e.g. CounterCacheKey)) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (CASSANDRA-7784) DROP table leaves the counter and row cache in a temporarily inconsistent state that, if saved during, will cause an exception to be thrown
Benedict created CASSANDRA-7784: --- Summary: DROP table leaves the counter and row cache in a temporarily inconsistent state that, if saved during, will cause an exception to be thrown Key: CASSANDRA-7784 URL: https://issues.apache.org/jira/browse/CASSANDRA-7784 Project: Cassandra Issue Type: Bug Reporter: Benedict Assignee: Aleksey Yeschenko Priority: Minor It looks like this is a realistic race to hit reasonably often: we forceBlockingFlush after removing from Schema.cfIdMap, so there could be a lengthy window in which to overlap with an auto-save -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Reopened] (CASSANDRA-7743) Possible C* OOM issue during long running test
[ https://issues.apache.org/jira/browse/CASSANDRA-7743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benedict reopened CASSANDRA-7743: - Tester: Pierre Laporte I'd like to get confirmation this bug is fixed before resolving it, but no reason to hold up rc6 for that. [~pingtimeout] do you think you'll be able to try this out? Possible C* OOM issue during long running test -- Key: CASSANDRA-7743 URL: https://issues.apache.org/jira/browse/CASSANDRA-7743 Project: Cassandra Issue Type: Bug Components: Core Environment: Google Compute Engine, n1-standard-1 Reporter: Pierre Laporte Assignee: Benedict Fix For: 2.1 rc6 During a long running test, we ended up with a lot of java.lang.OutOfMemoryError: Direct buffer memory errors on the Cassandra instances. Here is an example of stacktrace from system.log : {code} ERROR [SharedPool-Worker-1] 2014-08-11 11:09:34,610 ErrorMessage.java:218 - Unexpected exception during request java.lang.OutOfMemoryError: Direct buffer memory at java.nio.Bits.reserveMemory(Bits.java:658) ~[na:1.7.0_25] at java.nio.DirectByteBuffer.init(DirectByteBuffer.java:123) ~[na:1.7.0_25] at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:306) ~[na:1.7.0_25] at io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:434) ~[netty-all-4.0.20.Final.jar:4.0.20.Final] at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:179) ~[netty-all-4.0.20.Final.jar:4.0.20.Final] at io.netty.buffer.PoolArena.allocate(PoolArena.java:168) ~[netty-all-4.0.20.Final.jar:4.0.20.Final] at io.netty.buffer.PoolArena.allocate(PoolArena.java:98) ~[netty-all-4.0.20.Final.jar:4.0.20.Final] at io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:251) ~[netty-all-4.0.20.Final.jar:4.0.20.Final] at io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:155) ~[netty-all-4.0.20.Final.jar:4.0.20.Final] at io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:146) 
~[netty-all-4.0.20.Final.jar:4.0.20.Final] at io.netty.buffer.AbstractByteBufAllocator.ioBuffer(AbstractByteBufAllocator.java:107) ~[netty-all-4.0.20.Final.jar:4.0.20.Final] at io.netty.channel.AdaptiveRecvByteBufAllocator$HandleImpl.allocate(AdaptiveRecvByteBufAllocator.java:104) ~[netty-all-4.0.20.Final.jar:4.0.20.Final] at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:112) ~[netty-all-4.0.20.Final.jar:4.0.20.Final] at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:507) ~[netty-all-4.0.20.Final.jar:4.0.20.Final] at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:464) ~[netty-all-4.0.20.Final.jar:4.0.20.Final] at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:378) ~[netty-all-4.0.20.Final.jar:4.0.20.Final] at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:350) ~[netty-all-4.0.20.Final.jar:4.0.20.Final] at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116) ~[netty-all-4.0.20.Final.jar:4.0.20.Final] at java.lang.Thread.run(Thread.java:724) ~[na:1.7.0_25] {code} The test consisted of a 3-node cluster of n1-standard-1 GCE instances (1 vCPU, 3.75 GB RAM) running cassandra-2.1.0-rc5, and an n1-standard-2 instance running the test. After ~2.5 days, several requests start to fail and we see the previous stacktraces in the system.log file. The output from Linux ‘free’ and ‘meminfo’ suggests that there is still memory available. {code} $ free -m total used free shared buffers cached Mem: 3702 3532 169 0 161 854 -/+ buffers/cache: 2516 1185 Swap: 0 0 0 $ head -n 4 /proc/meminfo MemTotal: 3791292 kB MemFree: 173568 kB Buffers: 165608 kB Cached: 874752 kB {code} These errors do not affect all the queries we run. 
The cluster is still responsive but is unable to display tracing information using cqlsh : {code} $ ./bin/nodetool --host 10.240.137.253 status duration_test Datacenter: DC1 === Status=Up/Down |/ State=Normal/Leaving/Joining/Moving -- Address Load Tokens Owns (effective) Host ID Rack UN 10.240.98.27925.17 KB 256 100.0% 41314169-eff5-465f-85ea-d501fd8f9c5e RAC1 UN 10.240.137.253 1.1 MB 256 100.0% c706f5f9-c5f3-4d5e-95e9-a8903823827e RAC1 UN
[jira] [Updated] (CASSANDRA-6809) Compressed Commit Log
[ https://issues.apache.org/jira/browse/CASSANDRA-6809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benedict updated CASSANDRA-6809: Assignee: Branimir Lambov Compressed Commit Log - Key: CASSANDRA-6809 URL: https://issues.apache.org/jira/browse/CASSANDRA-6809 Project: Cassandra Issue Type: Improvement Reporter: Benedict Assignee: Branimir Lambov Priority: Minor Labels: performance Fix For: 3.0 It seems an unnecessary oversight that we don't compress the commit log. Doing so should improve throughput, but some care will need to be taken to ensure we use as much of a segment as possible. I propose decoupling the writing of the records from the segments: basically, write into a (queue of) DirectByteBuffer, and have the sync thread compress, say, ~64K chunks every X MB written to the CL (where X is ordinarily the CLS size), and then pack as many of the compressed chunks into a CLS as possible. -- This message was sent by Atlassian JIRA (v6.2#6252)
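The chunking scheme proposed above — stage records uncompressed, then have the sync thread compress fixed-size chunks and pack them, length-prefixed, into the segment — can be sketched as follows. The ticket does not fix a codec or framing; java.util.zip and the `[int length][chunk]` framing here are stand-ins:

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Sketch of chunked commit-log compression: staged bytes are compressed in
// fixed-size chunks, each emitted as a length-prefixed frame, so compressed
// chunks can be packed tightly into a segment and decompressed independently.
final class ChunkedCompressor {
    static final int CHUNK = 64 * 1024; // the ~64K chunk size from the ticket

    // Compress staged commit-log bytes as a sequence of [int length][chunk] frames.
    static byte[] compress(byte[] staged) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        Deflater deflater = new Deflater();
        byte[] scratch = new byte[CHUNK * 2]; // room for incompressible chunks
        for (int off = 0; off < staged.length; off += CHUNK) {
            int len = Math.min(CHUNK, staged.length - off);
            deflater.reset();
            deflater.setInput(staged, off, len);
            deflater.finish();
            int written = 0;
            while (!deflater.finished())
                written += deflater.deflate(scratch, written, scratch.length - written);
            out.write(ByteBuffer.allocate(4).putInt(written).array(), 0, 4);
            out.write(scratch, 0, written);
        }
        return out.toByteArray();
    }

    // Replay side: walk the frames and inflate each chunk in order.
    static byte[] decompress(byte[] segment) throws Exception {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        ByteBuffer in = ByteBuffer.wrap(segment);
        byte[] scratch = new byte[CHUNK];
        Inflater inflater = new Inflater();
        while (in.hasRemaining()) {
            int len = in.getInt();
            inflater.reset();
            inflater.setInput(segment, in.position(), len);
            in.position(in.position() + len);
            while (!inflater.finished())
                out.write(scratch, 0, inflater.inflate(scratch));
        }
        return out.toByteArray();
    }
}
```

Independent per-chunk framing is what lets the sync thread pack as many compressed chunks as fit into a commit-log segment, the packing concern the ticket raises.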
[jira] [Updated] (CASSANDRA-6572) Workload recording / playback
[ https://issues.apache.org/jira/browse/CASSANDRA-6572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benedict updated CASSANDRA-6572: Assignee: (was: Lyuben Todorov) Workload recording / playback - Key: CASSANDRA-6572 URL: https://issues.apache.org/jira/browse/CASSANDRA-6572 Project: Cassandra Issue Type: New Feature Components: Core, Tools Reporter: Jonathan Ellis Fix For: 2.1.1 Attachments: 6572-trunk.diff Write sample mode gets us part way to testing new versions against a real world workload, but we need an easy way to test the query side as well. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-7468) Add time-based execution to cassandra-stress
[ https://issues.apache.org/jira/browse/CASSANDRA-7468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14099525#comment-14099525 ] Benedict commented on CASSANDRA-7468: - FTR, I'm planning to address this once CASSANDRA-7519 is committed, since this is not super-high priority. Add time-based execution to cassandra-stress Key: CASSANDRA-7468 URL: https://issues.apache.org/jira/browse/CASSANDRA-7468 Project: Cassandra Issue Type: Improvement Components: Tools Reporter: Matt Kennedy Assignee: Matt Kennedy Priority: Minor Fix For: 2.1.1 Attachments: 7468v2.txt, trunk-7468-rebase.patch, trunk-7468.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-7232) Enable live replay of commit logs
[ https://issues.apache.org/jira/browse/CASSANDRA-7232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14099524#comment-14099524 ] Benedict commented on CASSANDRA-7232: - Missed this due to status != Patch Available. I'm not keen on passing properties around using System.get/setProperty after system startup. We should modify CommitLogReplay so we can instantiate it with a specific PIT, and construct one specifically for this out-of-band restore. Also the comment is inaccurate, stating it is the point to restore _from_, not _to_. However, it would be useful to be able to provide both, as presumably the commitlog archive directory will have more logs than needed. Enable live replay of commit logs - Key: CASSANDRA-7232 URL: https://issues.apache.org/jira/browse/CASSANDRA-7232 Project: Cassandra Issue Type: Improvement Components: Tools Reporter: Patrick McFadin Assignee: Lyuben Todorov Priority: Minor Fix For: 2.0.10 Attachments: 0001-Expose-CommitLog-recover-to-JMX-add-nodetool-cmd-for.patch, 0001-TRUNK-JMX-and-nodetool-cmd-for-commitlog-replay.patch Replaying commit logs takes a restart but restoring sstables can be an online operation with refresh. In order to restore a point-in-time without a restart, the node needs to live replay the commit logs from JMX and a nodetool command. nodetool refreshcommitlogs keyspace table -- This message was sent by Atlassian JIRA (v6.2#6252)
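The suggestion in the comment above — constructing the replayer with an explicit point-in-time window instead of reading system properties at replay time — might take a shape like the following. This is a hypothetical sketch; the class and the timestamp check are illustrative, not Cassandra's actual CommitLogReplay API:

```java
// Hypothetical shape of the suggestion: instantiate the replayer with an
// explicit restore window (both ends, since the archive directory may hold
// far more segments than the restore needs) rather than passing the PIT
// through System properties after startup.
final class WindowedReplayer {
    private final long restoreFromMillis, restoreToMillis;

    WindowedReplayer(long restoreFromMillis, long restoreToMillis) {
        this.restoreFromMillis = restoreFromMillis;
        this.restoreToMillis = restoreToMillis;
    }

    // Replay only mutations whose commit-log timestamp falls inside the window,
    // so an out-of-band restore can bound both the "from" and "to" ends.
    boolean shouldReplay(long mutationTimestampMillis) {
        return mutationTimestampMillis >= restoreFromMillis
            && mutationTimestampMillis <= restoreToMillis;
    }
}
```

An out-of-band restore triggered via JMX/nodetool would construct one of these with the requested window, leaving normal startup replay untouched.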
[jira] [Comment Edited] (CASSANDRA-7468) Add time-based execution to cassandra-stress
[ https://issues.apache.org/jira/browse/CASSANDRA-7468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14099525#comment-14099525 ] Benedict edited comment on CASSANDRA-7468 at 8/16/14 6:31 AM: -- FTR, I'm planning to address this (the lack of presence on user commands, not the behaviour with auto mode running the test multiple times, as this is not a bug) once CASSANDRA-7519 is committed, since this is not super-high priority. was (Author: benedict): FTR, I'm planning to address this once CASSANDRA-7519 is committed, since this is not super-high priority. Add time-based execution to cassandra-stress Key: CASSANDRA-7468 URL: https://issues.apache.org/jira/browse/CASSANDRA-7468 Project: Cassandra Issue Type: Improvement Components: Tools Reporter: Matt Kennedy Assignee: Matt Kennedy Priority: Minor Fix For: 2.1.1 Attachments: 7468v2.txt, trunk-7468-rebase.patch, trunk-7468.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-7704) FileNotFoundException during STREAM-OUT triggers 100% CPU usage
[ https://issues.apache.org/jira/browse/CASSANDRA-7704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14099527#comment-14099527 ] Benedict commented on CASSANDRA-7704: - Committed to 2.0, 2.1.0 and 2.1 branches. I overwrote 2.0's contents with 2.1's, only removing the repairedAt property, since the only other difference was the lack of aborted property preventing inconsistent state. FileNotFoundException during STREAM-OUT triggers 100% CPU usage --- Key: CASSANDRA-7704 URL: https://issues.apache.org/jira/browse/CASSANDRA-7704 Project: Cassandra Issue Type: Bug Reporter: Rick Branson Assignee: Benedict Fix For: 2.0.10, 2.1.0 Attachments: 7704-2.1.txt, 7704.txt, backtrace.txt, other-errors.txt See attached backtrace which was what triggered this. This stream failed and then ~12 seconds later it emitted that exception. At that point, all CPUs went to 100%. A thread dump shows all the ReadStage threads stuck inside IntervalTree.searchInternal inside of CFS.markReferenced(). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (CASSANDRA-7754) FileNotFoundException in MemtableFlushWriter
[ https://issues.apache.org/jira/browse/CASSANDRA-7754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benedict resolved CASSANDRA-7754. - Resolution: Not a Problem [~shalupov] the first exception you posted is occurring during creation of the initial file for writing, the last exception you posted is not related to the other two, and the middle exception appears to be thrown during abort of a write due to some other error, which then finds the data it had been writing is now missing, so I suspect you most likely have a problem with your file system. I would check that your ACLs are all in order, and look for background cleanup / archive processes. I currently doubt there is a problem with C* from the information you've posted, especially as this code is exercised regularly and we haven't seen any issues elsewhere, but if after further investigation you continue to be convinced there is a bug, please reopen the ticket with some more information and reproduction steps so we can try to replicate it ourselves.
FileNotFoundException in MemtableFlushWriter Key: CASSANDRA-7754 URL: https://issues.apache.org/jira/browse/CASSANDRA-7754 Project: Cassandra Issue Type: Bug Environment: Linux, OpenJDK 1.7 Reporter: Leonid Shalupov Priority: Critical Exception in cassandra logs, after upgrade to 2.1:
[MemtableFlushWriter:91] ERROR o.a.c.service.CassandraDaemon - Exception in thread Thread[MemtableFlushWriter:91,5,main]
java.lang.RuntimeException: java.io.FileNotFoundException: /xxx/cassandra/data/system/batchlog-0290003c977e397cac3efdfdc01d626b/system-batchlog-tmp-ka-186-Index.db (No such file or directory)
at org.apache.cassandra.io.util.SequentialWriter.<init>(SequentialWriter.java:75) ~[cassandra-all-2.1.0-rc5.jar:2.1.0-rc5]
at org.apache.cassandra.io.util.SequentialWriter.open(SequentialWriter.java:104) ~[cassandra-all-2.1.0-rc5.jar:2.1.0-rc5]
at org.apache.cassandra.io.util.SequentialWriter.open(SequentialWriter.java:99) ~[cassandra-all-2.1.0-rc5.jar:2.1.0-rc5]
at org.apache.cassandra.io.sstable.SSTableWriter$IndexWriter.<init>(SSTableWriter.java:550) ~[cassandra-all-2.1.0-rc5.jar:2.1.0-rc5]
at org.apache.cassandra.io.sstable.SSTableWriter.<init>(SSTableWriter.java:134) ~[cassandra-all-2.1.0-rc5.jar:2.1.0-rc5]
at org.apache.cassandra.db.Memtable$FlushRunnable.createFlushWriter(Memtable.java:383) ~[cassandra-all-2.1.0-rc5.jar:2.1.0-rc5]
at org.apache.cassandra.db.Memtable$FlushRunnable.writeSortedContents(Memtable.java:330) ~[cassandra-all-2.1.0-rc5.jar:2.1.0-rc5]
at org.apache.cassandra.db.Memtable$FlushRunnable.runWith(Memtable.java:314) ~[cassandra-all-2.1.0-rc5.jar:2.1.0-rc5]
at org.apache.cassandra.io.util.DiskAwareRunnable.runMayThrow(DiskAwareRunnable.java:48) ~[cassandra-all-2.1.0-rc5.jar:2.1.0-rc5]
at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) ~[cassandra-all-2.1.0-rc5.jar:2.1.0-rc5]
at com.google.common.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:297) ~[guava-16.0.jar:na]
at org.apache.cassandra.db.ColumnFamilyStore$Flush.run(ColumnFamilyStore.java:1054) ~[cassandra-all-2.1.0-rc5.jar:2.1.0-rc5]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) ~[na:1.7.0_65]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) ~[na:1.7.0_65]
at java.lang.Thread.run(Thread.java:745) ~[na:1.7.0_65]
Caused by: java.io.FileNotFoundException: /xxx/cassandra/data/system/batchlog-0290003c977e397cac3efdfdc01d626b/system-batchlog-tmp-ka-186-Index.db (No such file or directory)
at java.io.RandomAccessFile.open(Native Method) ~[na:1.7.0_65]
at java.io.RandomAccessFile.<init>(RandomAccessFile.java:241) ~[na:1.7.0_65]
at org.apache.cassandra.io.util.SequentialWriter.<init>(SequentialWriter.java:71) ~[cassandra-all-2.1.0-rc5.jar:2.1.0-rc5]
... 14 common frames omitted
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-7743) Possible C* OOM issue during long running test
[ https://issues.apache.org/jira/browse/CASSANDRA-7743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14100197#comment-14100197 ] Benedict commented on CASSANDRA-7743: - Did you see the actual error, or have more info than meminfo? Because that is not at all conclusive by itself. Possible C* OOM issue during long running test -- Key: CASSANDRA-7743 URL: https://issues.apache.org/jira/browse/CASSANDRA-7743 Project: Cassandra Issue Type: Bug Components: Core Environment: Google Compute Engine, n1-standard-1 Reporter: Pierre Laporte Assignee: Benedict Fix For: 2.1 rc6 During a long running test, we ended up with a lot of java.lang.OutOfMemoryError: Direct buffer memory errors on the Cassandra instances. Here is an example of stacktrace from system.log:
{code}
ERROR [SharedPool-Worker-1] 2014-08-11 11:09:34,610 ErrorMessage.java:218 - Unexpected exception during request
java.lang.OutOfMemoryError: Direct buffer memory
at java.nio.Bits.reserveMemory(Bits.java:658) ~[na:1.7.0_25]
at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:123) ~[na:1.7.0_25]
at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:306) ~[na:1.7.0_25]
at io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:434) ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:179) ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
at io.netty.buffer.PoolArena.allocate(PoolArena.java:168) ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
at io.netty.buffer.PoolArena.allocate(PoolArena.java:98) ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
at io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:251) ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
at io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:155) ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
at io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:146)
~[netty-all-4.0.20.Final.jar:4.0.20.Final]
at io.netty.buffer.AbstractByteBufAllocator.ioBuffer(AbstractByteBufAllocator.java:107) ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
at io.netty.channel.AdaptiveRecvByteBufAllocator$HandleImpl.allocate(AdaptiveRecvByteBufAllocator.java:104) ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:112) ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:507) ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:464) ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:378) ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:350) ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116) ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
at java.lang.Thread.run(Thread.java:724) ~[na:1.7.0_25]
{code}
The test consisted of a 3-node cluster of n1-standard-1 GCE instances (1 vCPU, 3.75 GB RAM) running cassandra-2.1.0-rc5, and a n1-standard-2 instance running the test. After ~2.5 days, several requests start to fail and we see the previous stacktraces in the system.log file. The output from linux ‘free’ and ‘meminfo’ suggests that there is still memory available.
{code}
$ free -m
             total       used       free     shared    buffers     cached
Mem:          3702       3532        169          0        161        854
-/+ buffers/cache:       2516       1185
Swap:            0          0          0

$ head -n 4 /proc/meminfo
MemTotal:        3791292 kB
MemFree:          173568 kB
Buffers:         165608 kB
Cached:          874752 kB
{code}
These errors do not affect all the queries we run.
The cluster is still responsive but is unable to display tracing information using cqlsh:
{code}
$ ./bin/nodetool --host 10.240.137.253 status duration_test
Datacenter: DC1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address         Load       Tokens  Owns (effective)  Host ID                               Rack
UN  10.240.98.27    925.17 KB  256     100.0%            41314169-eff5-465f-85ea-d501fd8f9c5e  RAC1
UN  10.240.137.253  1.1 MB     256     100.0%            c706f5f9-c5f3-4d5e-95e9-a8903823827e  RAC1
UN  10.240.72.183   896.57 KB  256     100.0%
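As background to why `free` showing spare RAM does not prevent this error: direct buffers live off the Java heap, and the pool Netty draws from is capped by the JVM's -XX:MaxDirectMemorySize setting, so java.nio.Bits.reserveMemory throws once that cap is exhausted regardless of free system memory. A minimal sketch of a direct allocation (the cassandra-env.sh line in the comment is only an assumption about where one might raise the cap, not a confirmed fix for this ticket):

```java
import java.nio.ByteBuffer;

public final class DirectBufferDemo {
    public static void main(String[] args) {
        // Allocated outside the heap; counts against -XX:MaxDirectMemorySize
        // (which defaults to roughly the heap size on HotSpot).
        ByteBuffer buf = ByteBuffer.allocateDirect(1024);
        System.out.println(buf.isDirect());  // true
        System.out.println(buf.capacity());  // 1024
        // One could raise the cap, e.g. in cassandra-env.sh:
        //   JVM_OPTS="$JVM_OPTS -XX:MaxDirectMemorySize=1G"
    }
}
```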
[jira] [Commented] (CASSANDRA-7519) Further stress improvements to generate more realistic workloads
[ https://issues.apache.org/jira/browse/CASSANDRA-7519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14100237#comment-14100237 ] Benedict commented on CASSANDRA-7519: - bq. I plan to run some test workloads to double check the logic, but first cut of the code looked good. I left a couple comments on the github branch Thanks! bq. I'm not very keen on the new labels you've chosen for the insert section of the yaml file. They should be more verbose Nomenclature is always tricky, certainly not fixed on them. Although by making these more verbose we'll need to make the command line correspondingly more verbose to keep them in sync, which I'm not super keen on, but not too fussed about either. bq. partitions_per_batch maybe? perhaps partitions_per_operation? because per_batch implies we might change the number of partitions between batches, whereas we work with the same partitions for the duration of an 'operation' (the n= declared on command line)... bq. batch_split_count batches_per_operation? Further stress improvements to generate more realistic workloads Key: CASSANDRA-7519 URL: https://issues.apache.org/jira/browse/CASSANDRA-7519 Project: Cassandra Issue Type: Improvement Components: Tools Reporter: Benedict Assignee: Benedict Priority: Minor Labels: tools Fix For: 2.1.1 We generally believe that the most common workload is for reads to exponentially prefer most recently written data. However as stress currently behaves we have two id generation modes: sequential and random (although random can be distributed). I propose introducing a new mode which is somewhat like sequential, except we essentially 'look back' from the current id by some amount defined by a distribution. I may possibly make the position only increment as it's first written to also, so that this mode can be run from a clean slate with a mixed workload. This should allow us to generate workloads that are more representative.
At the same time, I will introduce a timestamp value generator for primary key columns that is strictly ascending, i.e. has some random component but is based off of the actual system time (or some shared monotonically increasing state) so that we can again generate a more realistic workload. This may be challenging to tie in with the new procedurally generated partitions, but I'm sure it can be done without too much difficulty. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-7705) Safer Resource Management
[ https://issues.apache.org/jira/browse/CASSANDRA-7705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14100260#comment-14100260 ] Benedict commented on CASSANDRA-7705: - Linked four related tickets Safer Resource Management - Key: CASSANDRA-7705 URL: https://issues.apache.org/jira/browse/CASSANDRA-7705 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Benedict Fix For: 3.0 We've had a spate of bugs recently with bad reference counting. These can have potentially dire consequences, generally either randomly deleting data or giving us infinite loops. Since in 2.1 we only reference count resources that are relatively expensive and infrequently managed (or in places where this safety is probably not as necessary, e.g. SerializingCache), we could without any negative consequences (and only slight code complexity) introduce a safer resource management scheme for these more expensive/infrequent actions. Basically, I propose that when we want to acquire a resource we allocate an object that manages the reference. This can only be released once; if it is released twice, we fail immediately at the second release, reporting where the bug is (rather than letting it continue fine until the next correct release corrupts the count). The reference counter remains the same, but we obtain guarantees that the reference count itself is never badly maintained, although code using it could mistakenly release its own handle early (typically this is only an issue when cleaning up after a failure, in which case under the new scheme this would be an innocuous error). -- This message was sent by Atlassian JIRA (v6.2#6252)
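The one-shot handle described in this proposal could look roughly like the sketch below. Names (RefHandle, onRelease) are illustrative only, not the eventual Cassandra API; the essential property is that a double release fails at the second release, pointing at the buggy caller, rather than silently corrupting the shared count until some later correct release.

```java
import java.util.concurrent.atomic.AtomicBoolean;

public final class RefHandle {
    private final AtomicBoolean released = new AtomicBoolean(false);
    private final Runnable onRelease; // e.g. decrement the shared ref count

    public RefHandle(Runnable onRelease) {
        this.onRelease = onRelease;
    }

    public void release() {
        // compareAndSet makes the handle single-use and thread-safe:
        // only the first caller flips the flag and runs the release action.
        if (!released.compareAndSet(false, true))
            throw new IllegalStateException("reference released twice");
        onRelease.run();
    }

    public static void main(String[] args) {
        RefHandle h = new RefHandle(() -> System.out.println("count--"));
        h.release();        // runs the release action once
        try {
            h.release();    // the bug is caught here, at the second release
        } catch (IllegalStateException e) {
            System.out.println("caught: " + e.getMessage());
        }
    }
}
```

This matches the ticket's observation that a mistaken early release of one's own handle becomes an innocuous, immediately-reported error instead of a delayed count corruption.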
[jira] [Commented] (CASSANDRA-7220) Nodes hang with 100% CPU load
[ https://issues.apache.org/jira/browse/CASSANDRA-7220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14100261#comment-14100261 ] Benedict commented on CASSANDRA-7220: - Any other exceptions in the logs? Looks related to CASSANDRA-7262, CASSANDRA-7704, CASSANDRA-7705. It's likely this has been fixed in a newer release. Nodes hang with 100% CPU load - Key: CASSANDRA-7220 URL: https://issues.apache.org/jira/browse/CASSANDRA-7220 Project: Cassandra Issue Type: Bug Components: Core Environment: C* 2.0.7, 4-node cluster on 12-core machines Reporter: Robert Stupp Assignee: Ryan McGuire Attachments: c-12-read-100perc-cpu.zip I've run a test that both reads and writes rows. After some time, all writes succeeded and all reads stopped. Two of the four nodes have 16 of 16 threads of the ReadStage thread pool running. The number of pending tasks continuously grows on these nodes. I have attached outputs of the stack traces and some diagnostic output from nodetool tpstats. nodetool status shows all nodes as UN. I had run that test previously without any issues with the same configuration. Some specials from cassandra.yaml: - key_cache_size_in_mb: 1024 - row_cache_size_in_mb: 8192 The nodes running at 100% CPU are node2 and node3; node1 and node4 are fine. I'm not sure if it is reproducible - but it's definitely not good behaviour. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-7220) Nodes hang with 100% CPU load
[ https://issues.apache.org/jira/browse/CASSANDRA-7220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14100290#comment-14100290 ] Benedict commented on CASSANDRA-7220: - [~rarudduck] it looks like the issue that killed your server was OOM. I can't see a reason for this in the logs, so it's possible you simply need to increase your heap size, however upgrading may help as there are a LOT of exceptions related to CASSANDRA-7756 logged, and it's possible that's somehow causing a knock on effect of some kind. Nodes hang with 100% CPU load - Key: CASSANDRA-7220 URL: https://issues.apache.org/jira/browse/CASSANDRA-7220 Project: Cassandra Issue Type: Bug Components: Core Environment: C* 2.0.7 4 nodes cluster on 12 core machines Reporter: Robert Stupp Assignee: Ryan McGuire Attachments: c-12-read-100perc-cpu.zip, system.log -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-7786) Cassandra is shutting down out of no apparent reason
[ https://issues.apache.org/jira/browse/CASSANDRA-7786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14100293#comment-14100293 ] Benedict commented on CASSANDRA-7786: - Are you sure you haven't sent the message over JMX / nodeprobe somehow? Possibly a script accidentally has the command embedded? There doesn't seem to be any code path that could shut down the server without first logging an exception. Cassandra is shutting down out of no apparent reason Key: CASSANDRA-7786 URL: https://issues.apache.org/jira/browse/CASSANDRA-7786 Project: Cassandra Issue Type: Bug Components: Core Environment: C* 2.0.9 Reporter: Or Sher We've recently started facing an issue where one of the C* nodes in our dev and CI clusters (thank god it hasn't happened in Prod yet) shuts down from time to time without any exceptions or errors. There is usually something like this in the logs:
INFO [MemoryMeter:1] 2014-08-15 01:32:43,266 Memtable.java (line 481) CFS(Keyspace='system', ColumnFamily='sstable_activity') liveRatio is 14.597030881851438 (just-counted was 14.596825396825396). calculation took 2ms for 84 cells
INFO [StorageServiceShutdownHook] 2014-08-15 01:40:58,954 ThriftServer.java (line 141) Stop listening to thrift clients
INFO [StorageServiceShutdownHook] 2014-08-15 01:40:59,007 Server.java (line 182) Stop listening for CQL clients
INFO [StorageServiceShutdownHook] 2014-08-15 01:40:59,011 Gossiper.java (line 1279) Announcing shutdown
INFO [StorageServiceShutdownHook] 2014-08-15 01:41:01,011 MessagingService.java (line 683) Waiting for messaging service to quiesce
INFO [ACCEPT-/192.168.27.241] 2014-08-15 01:41:01,012 MessagingService.java (line 923) MessagingService has terminated the accept() thread
INFO [main] 2014-08-17 09:50:56,647 CassandraDaemon.java (line 135) Logging initialized
You can see the last line in the log is usually written at least 5 minutes before the shutdown, sometimes 30 minutes before.
I can't reproduce it as I have no idea why this is happening or how to attack this issue. I believe I'm not the only one suffering from this issue, as there was a thread about this behavior on the user mailing list. Any thoughts? -- This message was sent by Atlassian JIRA (v6.2#6252)