[jira] [Comment Edited] (CASSANDRA-7245) Out-of-Order keys with stress + CQL3
[ https://issues.apache.org/jira/browse/CASSANDRA-7245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14018257#comment-14018257 ] Benedict edited comment on CASSANDRA-7245 at 6/4/14 10:08 PM: -- [~jasobrown] I got this one if you've got better stuff to do :-) [~tjake]: * I think the change you made doesn't actually totally eliminate the bug: since it extends DroppableRunnable, the finally block may never be run * In SP.mutate() you can't release the mutations until after we get the all clear from the replicas, as we may have to use the mutations to write local hints. It might be nice, however, to have the response handler release() the reference once we receive enough responses from our replicas, so that we don't keep all of the references if we're just waiting for one mutation * Why do we need a ThreadLocal in ClientState to store the sourceFrame? Can't we just store it directly in a field in QueryState? Nit: * WS in ClientState constructor; would also prefer to just create the ThreadLocal in the var initialiser since it's always init'd empty Also, this is off-topic, but I wonder if we shouldn't replace the NBHM in ServerConnection with a simple array. We know the range is 32K (.1K for older clients), and each index is accessed by a single thread at any given time, so we'd just need to be able to atomically swap in larger arrays if we wanted dynamic sizing. Otherwise LGTM was (Author: benedict): [~jasobrown] I got this one if you've got better stuff to do :-) [~tjake]: * I think the change you made doesn't actually totally eliminate the bug: since it extends DroppableRunnable, the finally block may never be run * I think it would be good in SP.mutate() you can't release the mutations until after we get the all clear from the replicas, as we may have to use the mutations to write local hints. It might be nice, however, to have the response handler release() the reference once we receive enough responses from our replicas, so that we don't keep all of the references if we're just waiting for one mutation * Why do we need a ThreadLocal in ClientState to store the sourceFrame. Can't we just store it directly in a field in QueryState? Nit: * WS in ClientState constructor; would also prefer to just create ThreadLocal in var initialiser since it's always init'd empty Also, this is off-topic, but I wonder if we shouldn't replace the NBHM in ServerConnection with a simple array. We know the range is 32K (.1K for older clients), and each index is accessed by a single thread at any given time, so we'd just need to be able to atomically swap in larger arrays if we wanted dynamic sizing. Otherwise LGTM Out-of-Order keys with stress + CQL3 Key: CASSANDRA-7245 URL: https://issues.apache.org/jira/browse/CASSANDRA-7245 Project: Cassandra Issue Type: Bug Components: Core Reporter: Pavel Yaskevich Assignee: T Jake Luciani Fix For: 2.1.0 Attachments: 7245-v2.txt, 7245.txt, 7245v3-rebase.txt, 7245v3.txt, 7245v4.txt, netty-all-4.0.19.Final.jar We have been generating data (stress with CQL3 prepared) for CASSANDRA-4718 and found the following problem in almost every SSTable generated (~200 GB of data and 821 SSTables). We set up the keys to be 10 bytes in size (default) and a population between 1 and 6. Once I ran 'sstablekeys' on the generated SSTable files I got the following exceptions: _There is a problem with sorting of normal looking keys:_ 30303039443538353645 30303039443745364242 java.io.IOException: Key out of order!
DecoratedKey(-217680888487824985, *30303039443745364242*) DecoratedKey(-1767746583617597213, *30303039443437454333*) 0a30303033343933 3734441388343933 java.io.IOException: Key out of order! DecoratedKey(5440473860101999581, *3734441388343933*) DecoratedKey(-7565486415339257200, *30303033344639443137*) 30303033354244363031 30303033354133423742 java.io.IOException: Key out of order! DecoratedKey(2687072396429900180, *30303033354133423742*) DecoratedKey(-7838239767410066684, *30303033354145344534*) 30303034313442354137 3034313635363334 java.io.IOException: Key out of order! DecoratedKey(1516003874415400462, *3034313635363334*) DecoratedKey(-9106177395653818217, *3030303431444238*) 30303035373044373435 30303035373044334631 java.io.IOException: Key out of order! DecoratedKey(-3645715702154616540, *30303035373044334631*) DecoratedKey(-4296696226469000945, *30303035373132364138*) _And completely different ones:_ 30303041333745373543 7cd045c59a90d7587d8d java.io.IOException: Key out of order! DecoratedKey(-3595402345023230196, *7cd045c59a90d7587d8d*) DecoratedKey(-5146766422778260690, *30303041333943303232*) 3030303332314144
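To make the second review point above concrete, here is a minimal sketch of the suggested response-handler behaviour - not the actual StorageProxy code; the Ref handle and all names here are hypothetical stand-ins for whatever ref-counting API the patch uses, and hinting of replicas that never respond is glossed over:
{code:java}
import java.util.concurrent.atomic.AtomicInteger;

// Sketch only: drop our reference to a mutation as soon as enough replica
// acks arrive, rather than holding every mutation's reference until the
// whole batch completes.
final class ReleasingResponseHandler
{
    interface Ref { void release(); } // hypothetical ref-counting handle

    private final Ref mutationRef;
    private final AtomicInteger outstanding; // acks still required

    ReleasingResponseHandler(Ref mutationRef, int blockFor)
    {
        this.mutationRef = mutationRef;
        this.outstanding = new AtomicInteger(blockFor);
    }

    // invoked once per replica response
    void onResponse()
    {
        if (outstanding.decrementAndGet() == 0)
            mutationRef.release(); // no longer needed for local hints
    }
}
{code}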
[jira] [Commented] (CASSANDRA-6572) Workload recording / playback
[ https://issues.apache.org/jira/browse/CASSANDRA-6572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14018301#comment-14018301 ] Benedict commented on CASSANDRA-6572: - * You need to move the getAndSet() on line 76 back inside the if statement * recycleQueue shouldn't compareAndSet its position to zero; it should just set it always, and always recycle * I'd probably make a static helper method for checking if a keyspace is one we want to avoid tracing. Otherwise this batch of changes LGTM Workload recording / playback - Key: CASSANDRA-6572 URL: https://issues.apache.org/jira/browse/CASSANDRA-6572 Project: Cassandra Issue Type: New Feature Components: Core, Tools Reporter: Jonathan Ellis Assignee: Lyuben Todorov Fix For: 2.1.1 Attachments: 6572-trunk.diff Write sample mode gets us part way to testing new versions against a real-world workload, but we need an easy way to test the query side as well. -- This message was sent by Atlassian JIRA (v6.2#6252)
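On the recycleQueue point, the argument is that by the time a buffer is recycled its owner has exclusive access, so a conditional compareAndSet can only fail spuriously and silently skip recycling; an unconditional set is cheaper and always correct. A minimal sketch (names hypothetical, not the patch's actual code):
{code:java}
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicInteger;

final class RecyclableBuffer
{
    static final Queue<RecyclableBuffer> recycleQueue = new ConcurrentLinkedQueue<>();

    final AtomicInteger position = new AtomicInteger();

    void recycle()
    {
        // nobody else can touch this buffer any more, so plain set - no CAS -
        // and always return it to the pool
        position.set(0);
        recycleQueue.add(this);
    }
}
{code}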
[jira] [Updated] (CASSANDRA-7468) Add time-based execution to cassandra-stress
[ https://issues.apache.org/jira/browse/CASSANDRA-7468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benedict updated CASSANDRA-7468: Assignee: Matt Kennedy (was: Benedict) Add time-based execution to cassandra-stress Key: CASSANDRA-7468 URL: https://issues.apache.org/jira/browse/CASSANDRA-7468 Project: Cassandra Issue Type: Improvement Components: Tools Reporter: Matt Kennedy Assignee: Matt Kennedy Priority: Minor Fix For: 2.1.1 Attachments: 7468v2.txt, trunk-7468-rebase.patch, trunk-7468.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (CASSANDRA-7468) Add time-based execution to cassandra-stress
[ https://issues.apache.org/jira/browse/CASSANDRA-7468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benedict resolved CASSANDRA-7468. - Resolution: Fixed This was a really minor issue of not passing the Duration settings object, so I ninja-fixed it and, at the same time, applied it to 2.1 - no point making merges more difficult whilst simultaneously depriving users of features for something small like this Add time-based execution to cassandra-stress Key: CASSANDRA-7468 URL: https://issues.apache.org/jira/browse/CASSANDRA-7468 Project: Cassandra Issue Type: Improvement Components: Tools Reporter: Matt Kennedy Assignee: Matt Kennedy Priority: Minor Fix For: 2.1.1 Attachments: 7468v2.txt, trunk-7468-rebase.patch, trunk-7468.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (CASSANDRA-7658) stress connects to all nodes when it shouldn't
[ https://issues.apache.org/jira/browse/CASSANDRA-7658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benedict updated CASSANDRA-7658: Attachment: 7658.txt Patch attached stress connects to all nodes when it shouldn't -- Key: CASSANDRA-7658 URL: https://issues.apache.org/jira/browse/CASSANDRA-7658 Project: Cassandra Issue Type: Improvement Components: Tools Reporter: Brandon Williams Assignee: Benedict Priority: Minor Fix For: 2.1.1 Attachments: 7658.txt If you tell stress -node 1,2 in a cluster with more nodes, stress appears to do ring discovery and connect to them all anyway (checked via netstat). This led to the confusion on CASSANDRA-7567 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (CASSANDRA-7646) DROP may not clean commit log segments when auto snapshot is false
[ https://issues.apache.org/jira/browse/CASSANDRA-7646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benedict resolved CASSANDRA-7646. - Resolution: Not a Problem I was mistaken when I thought this was also an issue: we don't special-case DROP at all; we always flush on it. There are no calls to renewMemtable, which is the danger zone. DROP may not clean commit log segments when auto snapshot is false -- Key: CASSANDRA-7646 URL: https://issues.apache.org/jira/browse/CASSANDRA-7646 Project: Cassandra Issue Type: Bug Components: Core Reporter: Jeremiah Jordan Assignee: Benedict Fix For: 2.0.11 Per discussion on CASSANDRA-7511, DROP may also be affected by the same issue. If auto snapshot is false, commit log segments may stay dirty because they are not flushed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (CASSANDRA-7282) Faster Memtable map
[ https://issues.apache.org/jira/browse/CASSANDRA-7282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benedict updated CASSANDRA-7282: Attachment: reads.svg writes.svg Okay, so I made some minor tweaks to this and did some more serious local testing (and graphing of results). The updated branch (also rebased to trunk) is [here|https://github.com/belliottsmith/cassandra/tree/7282-fastmemtablemap.trunk] On the whole it looks to me that bdplab is showing its usual reluctance to exhibit performance improvements. I suspect it's bottlenecking more readily on kernel operations. On my local machine I see around a 15-20% improvement in throughput for both reads and writes, making this a very sensible addition. It also sees considerably _reduced_ GC time (though not the amount of garbage generated) on writes, and sees a reduction in latency almost across the board (max latency on a read-only workload is slightly bumped, but since write workloads have the largest effect on latency, this seems worth overlooking; and given how closely run we are, it's possible this is noise in the measurement, since the median p999/pMax are almost exactly the same). I've also separately improved stress to collect GC data over JMX and created a patch to generate these pretty graphs from stress output automatically, which I'll be posting separately. Faster Memtable map --- Key: CASSANDRA-7282 URL: https://issues.apache.org/jira/browse/CASSANDRA-7282 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Benedict Assignee: Benedict Labels: performance Fix For: 3.0 Attachments: reads.svg, writes.svg Currently we maintain a ConcurrentSkipListMap of DecoratedKey -> Partition in our memtables. Maintaining this is an O(lg(n)) operation; since the vast majority of users use a hash partitioner, it occurs to me we could maintain a hybrid ordered list / hash map. The list would impose the normal order on the collection, but a hash index would live alongside as part of the same data structure, simply mapping into the list and permitting O(1) lookups and inserts. I've chosen to implement this initial version as a linked-list node per item, but we can optimise this in future by storing fatter nodes that permit a cache-line's worth of hashes to be checked at once, further reducing the constant factor costs for lookups. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
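For anyone skimming the ticket, a toy illustration of the structure being benchmarked (the linked branch is the real, lock-free implementation; this sketch is single-writer and keeps the list sorted naively, whereas the real structure exploits the hash index to make inserts O(1) as well):
{code:java}
import java.util.concurrent.ConcurrentHashMap;

// Hybrid ordered list / hash map: point reads go through the hash index in
// O(1) instead of the O(lg n) a skip list pays; the linked list alongside
// preserves key order for range scans.
final class HybridOrderedMap<K extends Comparable<K>, V>
{
    static final class Node<K, V>
    {
        final K key;
        volatile V value;
        Node<K, V> next;
        Node(K key, V value) { this.key = key; this.value = value; }
    }

    private final ConcurrentHashMap<K, Node<K, V>> index = new ConcurrentHashMap<>();
    private final Node<K, V> head = new Node<>(null, null); // sentinel

    V get(K key)
    {
        Node<K, V> n = index.get(key);
        return n == null ? null : n.value;
    }

    synchronized void put(K key, V value)
    {
        Node<K, V> existing = index.get(key);
        if (existing != null) { existing.value = value; return; }
        Node<K, V> node = new Node<>(key, value);
        Node<K, V> prev = head; // naive ordered insert, for illustration only
        while (prev.next != null && prev.next.key.compareTo(key) < 0)
            prev = prev.next;
        node.next = prev.next;
        prev.next = node;
        index.put(key, node);
    }
}
{code}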
[jira] [Created] (CASSANDRA-7916) Stress should collect and cross-cluster GC statistics
Benedict created CASSANDRA-7916: --- Summary: Stress should collect and cross-cluster GC statistics Key: CASSANDRA-7916 URL: https://issues.apache.org/jira/browse/CASSANDRA-7916 Project: Cassandra Issue Type: Improvement Components: Tools Reporter: Benedict Assignee: Benedict Priority: Minor Fix For: 2.1.1 It would be useful to see stress outputs deliver cross-cluster statistics, the most useful being GC data. Some simple changes to GCInspector collect the data, and can deliver to a nodetool request or to stress over JMX. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
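The raw numbers involved are already exposed by the JVM's GC MXBeans; a minimal sketch of reading them locally (the ticket's actual plumbing goes through GCInspector so stress and nodetool can poll each node):
{code:java}
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public final class GcPoller
{
    public static void main(String[] args)
    {
        // one bean per collector, e.g. ParNew and ConcurrentMarkSweep;
        // counts and times are cumulative since JVM start
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans())
            System.out.printf("%s: collections=%d timeMs=%d%n",
                              gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
    }
}
{code}
A remote poll, as stress would do it, amounts to the same reads performed through a JMXConnector against each node's platform MBean server.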
[jira] [Commented] (CASSANDRA-7916) Stress should collect and cross-cluster GC statistics
[ https://issues.apache.org/jira/browse/CASSANDRA-7916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14132017#comment-14132017 ] Benedict commented on CASSANDRA-7916: - Patch available [here|https://github.com/belliottsmith/cassandra/tree/stress-jmx] Stress should collect and cross-cluster GC statistics - Key: CASSANDRA-7916 URL: https://issues.apache.org/jira/browse/CASSANDRA-7916 Project: Cassandra Issue Type: Improvement Components: Tools Reporter: Benedict Assignee: Benedict Priority: Minor Fix For: 2.1.1 It would be useful to see stress outputs deliver cross-cluster statistics, the most useful being GC data. Some simple changes to GCInspector collect the data, and can deliver to a nodetool request or to stress over JMX. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (CASSANDRA-7918) Provide graphing tool along with cassandra-stress
Benedict created CASSANDRA-7918: --- Summary: Provide graphing tool along with cassandra-stress Key: CASSANDRA-7918 URL: https://issues.apache.org/jira/browse/CASSANDRA-7918 Project: Cassandra Issue Type: Improvement Components: Tools Reporter: Benedict Assignee: Benedict Priority: Minor Whilst cstar makes some pretty graphs, they're a little limited and also require you to run your tests through it. It would be useful to be able to graph results from any stress run easily. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-7918) Provide graphing tool along with cassandra-stress
[ https://issues.apache.org/jira/browse/CASSANDRA-7918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14132020#comment-14132020 ] Benedict commented on CASSANDRA-7918: - Patch available [here|https://github.com/belliottsmith/cassandra/tree/stress-multiplot] See CASSANDRA-7282 for sample output. This patch relies upon gnuplot. Provide graphing tool along with cassandra-stress - Key: CASSANDRA-7918 URL: https://issues.apache.org/jira/browse/CASSANDRA-7918 Project: Cassandra Issue Type: Improvement Components: Tools Reporter: Benedict Assignee: Benedict Priority: Minor Whilst cstar makes some pretty graphs, they're a little limited and also require you to run your tests through it. It would be useful to be able to graph results from any stress run easily. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-7915) Waiting for sync on the commit log could happen after writing to memtable
[ https://issues.apache.org/jira/browse/CASSANDRA-7915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14132022#comment-14132022 ] Benedict commented on CASSANDRA-7915: - I'm not totally convinced by this, although it's debatable. The issue, of course, is that the data becomes visible and can be read before it is considered 'durable' - meaning that you could lose the cluster, restore from CL, and find data is missing that was previously present. Users relying on batch CL probably would not like this scenario. Waiting for sync on the commit log could happen after writing to memtable - Key: CASSANDRA-7915 URL: https://issues.apache.org/jira/browse/CASSANDRA-7915 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Branimir Lambov Priority: Minor Currently the sync wait is part of CommitLog.add, which is executed in whole before any memtable write. The time for executing the latter is thus added on top of the time for file sync, which seems unnecessary. Moving the wait to a call at the end of Keyspace.apply should hide the memtable write time and may improve performance, especially for the batch sync strategy. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
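A sketch of the reordering under discussion (hypothetical names and interfaces; the real paths are CommitLog.add and Keyspace.apply). The visibility concern above shows up directly in the proposed variant - the write becomes readable before the sync completes, even though the client ack still waits:
{code:java}
import java.util.concurrent.CompletableFuture;

final class WritePathSketch
{
    interface Mutation {}
    interface Memtable { void put(Mutation m); }
    interface CommitLog { CompletableFuture<Void> append(Mutation m); }

    // today: wait for the commit log sync before touching the memtable
    void applyCurrent(CommitLog log, Memtable mt, Mutation m)
    {
        log.append(m).join(); // fsync first
        mt.put(m);            // only durable data ever becomes readable
    }

    // proposed: overlap the memtable write with the sync, wait at the end
    void applyProposed(CommitLog log, Memtable mt, Mutation m)
    {
        CompletableFuture<Void> synced = log.append(m);
        mt.put(m);     // readable now, but not yet durable
        synced.join(); // the client ack still waits for the fsync
    }
}
{code}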
[jira] [Commented] (CASSANDRA-7918) Provide graphing tool along with cassandra-stress
[ https://issues.apache.org/jira/browse/CASSANDRA-7918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14132485#comment-14132485 ] Benedict commented on CASSANDRA-7918: - Well, I developed this on a 12-hr flight, so doing it with something I need Google to achieve wasn't an option, although d3.js has some strengths. I really dislike Python, however, and have found that every time I try to use a scripting language to develop a tool it would be considered more suitable for, I waste time doing so. I am very productive in Java. I did not want to spend longer on this than necessary, so I stuck with Java this time. I'm very happy with the end result - these graphs are extremely informative. With a single glance a lot of comparisons can easily be made. If somebody wants to develop something of equal utility with different tools, that's fine by me, but in the meantime I don't see why that should prevent this from being made use of. Since this is an ad hoc tool, dropping it in favour of a future improvement is easily done; it's a non-breaking change. Provide graphing tool along with cassandra-stress - Key: CASSANDRA-7918 URL: https://issues.apache.org/jira/browse/CASSANDRA-7918 Project: Cassandra Issue Type: Improvement Components: Tools Reporter: Benedict Assignee: Benedict Priority: Minor Whilst cstar makes some pretty graphs, they're a little limited and also require you to run your tests through it. It would be useful to be able to graph results from any stress run easily. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-7919) Change timestamp representation to timeuuid
[ https://issues.apache.org/jira/browse/CASSANDRA-7919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14132487#comment-14132487 ] Benedict commented on CASSANDRA-7919: - See my comments on CASSANDRA-6108 for details of why I don't think this is suitable. Since we specifically want this for RAMP transactions, we need guaranteed uniqueness cross-cluster, which timeuuids only achieve if created _server side_. We will absolutely have to modify the semantics of TimeUUID if we want them to be guaranteed unique still, at which point we may as well consider the 64-bit representation I suggested on that ticket. Change timestamp representation to timeuuid --- Key: CASSANDRA-7919 URL: https://issues.apache.org/jira/browse/CASSANDRA-7919 Project: Cassandra Issue Type: Improvement Reporter: T Jake Luciani Priority: Minor Fix For: 3.0 In order to overcome some of the issues with timestamps (CASSANDRA-6123) we need to migrate to a better timestamp representation for cells. Since drivers already support timeuuid it makes sense to migrate to this internally (see CASSANDRA-7056) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
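For context, the uniqueness of a version 1 (time-based) UUID rests entirely on its least significant half - a clock sequence plus a node identifier - which a server can coordinate but arbitrary clients cannot. The standard library exposes the pieces (sketch; the example UUID is arbitrary):
{code:java}
import java.util.UUID;

public final class TimeUuidAnatomy
{
    public static void main(String[] args)
    {
        // any version-1 UUID; these accessors throw for other versions
        UUID u = UUID.fromString("58e0a7d7-eebc-11d8-9669-0800200c9a66");
        long timestamp = u.timestamp();   // 60-bit, 100ns units since 1582-10-15
        int clockSeq = u.clockSequence(); // disambiguates clock regressions/restarts
        long node = u.node();             // host identifier - two clients behind NAT
                                          // or on one box can collide on both fields
        System.out.printf("ts=%d clockSeq=%d node=%012x%n", timestamp, clockSeq, node);
    }
}
{code}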
[jira] [Commented] (CASSANDRA-7282) Faster Memtable map
[ https://issues.apache.org/jira/browse/CASSANDRA-7282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14132494#comment-14132494 ] Benedict commented on CASSANDRA-7282: - Yes, we want to restrict the workload to memtables only, and we want to make them big to generate enough numbers on insert to avoid noise. So from a fresh cluster, I start it with a memtable_cleanup_threshold of 0.99, memtable_allocation_type: offheap_objects, and I run a stress test with only _one column_ per partition, and make that column size 1 (although any size is fine if it's offheap). I'm just trying to make the memtable as large as possible, which with my 4GB laptop is difficult. I then set the memtable_heap_space_in_mb to 1024 (feel free to make it much bigger), and then insert around 5M items. If you stick to one column and ensure your offheap space is sufficiently large for the amount you insert into it, then 5M items per node per GB of on-heap space is achievable (my math tells me around 9M should be possible, but I overshot slightly and decided to be conservative). I then follow up immediately with a read run hitting a random selection of those PKs. Faster Memtable map --- Key: CASSANDRA-7282 URL: https://issues.apache.org/jira/browse/CASSANDRA-7282 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Benedict Assignee: Benedict Labels: performance Fix For: 3.0 Attachments: reads.svg, writes.svg Currently we maintain a ConcurrentSkipListMap of DecoratedKey -> Partition in our memtables. Maintaining this is an O(lg(n)) operation; since the vast majority of users use a hash partitioner, it occurs to me we could maintain a hybrid ordered list / hash map. The list would impose the normal order on the collection, but a hash index would live alongside as part of the same data structure, simply mapping into the list and permitting O(1) lookups and inserts. I've chosen to implement this initial version as a linked-list node per item, but we can optimise this in future by storing fatter nodes that permit a cache-line's worth of hashes to be checked at once, further reducing the constant factor costs for lookups. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-7919) Change timestamp representation to timeuuid
[ https://issues.apache.org/jira/browse/CASSANDRA-7919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14132501#comment-14132501 ] Benedict commented on CASSANDRA-7919: - I'm also a little uncomfortable making it explicitly a TimeUUID at the client level, since this may mean clients start expecting the exact TimeUUID back with each request also. This would definitely be a bad thing, and prevent almost all of the timestamp optimisation work we have pencilled for 3.0, as we would have to retain its full bytes forever, which are half random, the state space of which will grow linearly with cluster up-time. Change timestamp representation to timeuuid --- Key: CASSANDRA-7919 URL: https://issues.apache.org/jira/browse/CASSANDRA-7919 Project: Cassandra Issue Type: Improvement Reporter: T Jake Luciani Priority: Minor Fix For: 3.0 In order to overcome some of the issues with timestamps (CASSANDRA-6123) we need to migrate to a better timestamp representation for cells. Since drivers already support timeuuid it makes sense to migrate to this internally (see CASSANDRA-7056) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (CASSANDRA-7919) Change timestamp representation to timeuuid
[ https://issues.apache.org/jira/browse/CASSANDRA-7919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14132501#comment-14132501 ] Benedict edited comment on CASSANDRA-7919 at 9/13/14 3:26 AM: -- I'm also a little uncomfortable making it explicitly a TimeUUID at the client level, since this may mean clients start expecting the exact TimeUUID back with each request also. This would definitely be a bad thing, and prevent almost all of the timestamp optimisation work we have pencilled for 3.0, as we would have to retain its full bytes forever, the least significant bits of which become meaningless after repair, and are 'random' (a hash of interface + time salt), the state space of which will grow linearly with cluster up-time. was (Author: benedict): I'm also a little uncomfortable making it explicitly a TimeUUID at the client level, since this may mean clients start expecting the exact TimeUUID back with each request also. This would definitely be a bad thing, and prevent almost all of the timestamp optimisation work we have pencilled for 3.0, as we would have to retain its full bytes forever, which are half random, the state space of which will grow linearly with cluster up-time. Change timestamp representation to timeuuid --- Key: CASSANDRA-7919 URL: https://issues.apache.org/jira/browse/CASSANDRA-7919 Project: Cassandra Issue Type: Improvement Reporter: T Jake Luciani Priority: Minor Fix For: 3.0 In order to overcome some of the issues with timestamps (CASSANDRA-6123) we need to migrate to a better timestamp representation for cells. Since drivers already support timeuuid it makes sense to migrate to this internally (see CASSANDRA-7056) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-7282) Faster Memtable map
[ https://issues.apache.org/jira/browse/CASSANDRA-7282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14132508#comment-14132508 ] Benedict commented on CASSANDRA-7282: - No. You benchmark to isolate performance improvements. These numbers demonstrate a benefit _to any portion of a workload that fits these characteristics_. Different use cases will exhibit different ratios of this effect; some considerable, some not so much. Certainly the read performance enhancements will be more generally applicable, but being able to fill your memtables faster and with less garbage is also not a bad thing, even if you end up bottlenecking on disk. Especially since server characteristics are changing rapidly, so disk bottlenecks are disappearing. As optimisation work goes deeper, isolating the specific portion of work that is affected is pretty essential to clearly delineating the benefit. Faster Memtable map --- Key: CASSANDRA-7282 URL: https://issues.apache.org/jira/browse/CASSANDRA-7282 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Benedict Assignee: Benedict Labels: performance Fix For: 3.0 Attachments: reads.svg, writes.svg Currently we maintain a ConcurrentSkipListMap of DecoratedKey -> Partition in our memtables. Maintaining this is an O(lg(n)) operation; since the vast majority of users use a hash partitioner, it occurs to me we could maintain a hybrid ordered list / hash map. The list would impose the normal order on the collection, but a hash index would live alongside as part of the same data structure, simply mapping into the list and permitting O(1) lookups and inserts. I've chosen to implement this initial version as a linked-list node per item, but we can optimise this in future by storing fatter nodes that permit a cache-line's worth of hashes to be checked at once, further reducing the constant factor costs for lookups. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-7919) Change timestamp representation to timeuuid
[ https://issues.apache.org/jira/browse/CASSANDRA-7919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14132516#comment-14132516 ] Benedict commented on CASSANDRA-7919: - We can have multiple clients per interface, and we can have multiple clients start/stop at the same time. C* instances can pretty much guarantee this doesn't happen; clients cannot. Once we have a collision on the LSB, we are exposed indefinitely to conflicts. Change timestamp representation to timeuuid --- Key: CASSANDRA-7919 URL: https://issues.apache.org/jira/browse/CASSANDRA-7919 Project: Cassandra Issue Type: Improvement Reporter: T Jake Luciani Priority: Minor Fix For: 3.0 In order to overcome some of the issues with timestamps (CASSANDRA-6123) we need to migrate to a better timestamp representation for cells. Since drivers already support timeuuid it makes sense to migrate to this internally (see CASSANDRA-7056) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (CASSANDRA-7919) Change timestamp representation to timeuuid
[ https://issues.apache.org/jira/browse/CASSANDRA-7919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14132516#comment-14132516 ] Benedict edited comment on CASSANDRA-7919 at 9/13/14 4:09 AM: -- We can have multiple clients per interface (esp. cross-DC, as we could have multiple DCs with the same network address range that do not communicate directly, but communicate with a shared C* data centre), and we can have multiple clients start/stop at the same time. C* instances can pretty much guarantee this doesn't happen; clients cannot. Once we have a collision on the LSB, we are exposed indefinitely to conflicts. was (Author: benedict): We can have multiple clients per interface, we can have multiple clients starts/stop at the same time. C* instances can pretty much guarantee this doesn't happen; clients cannot. Once we have a collision on the LSB, we are exposed indefinitely to conflicts. Change timestamp representation to timeuuid --- Key: CASSANDRA-7919 URL: https://issues.apache.org/jira/browse/CASSANDRA-7919 Project: Cassandra Issue Type: Improvement Reporter: T Jake Luciani Priority: Minor Fix For: 3.0 In order to overcome some of the issues with timestamps (CASSANDRA-6123) we need to migrate to a better timestamp representation for cells. Since drivers already support timeuuid it makes sense to migrate to this internally (see CASSANDRA-7056) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-7919) Change timestamp representation to timeuuid
[ https://issues.apache.org/jira/browse/CASSANDRA-7919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14132518#comment-14132518 ] Benedict commented on CASSANDRA-7919: - bq. Once we have a collision on the LSB, we are exposed indefinitely to conflicts. As soon as we collide on LSB, the only thing delivering uniqueness is timestamp, which is definitely insufficient. Change timestamp representation to timeuuid --- Key: CASSANDRA-7919 URL: https://issues.apache.org/jira/browse/CASSANDRA-7919 Project: Cassandra Issue Type: Improvement Reporter: T Jake Luciani Priority: Minor Fix For: 3.0 In order to overcome some of the issues with timestamps (CASSANDRA-6123) we need to migrate to a better timestamp representation for cells. Since drivers already support timeuuid it makes sense to migrate to this internally (see CASSANDRA-7056) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-7282) Faster Memtable map
[ https://issues.apache.org/jira/browse/CASSANDRA-7282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14132523#comment-14132523 ] Benedict commented on CASSANDRA-7282: - The new code is not significantly more complex. In fact, it's arguably less complex; the only difference is that some of that complexity is not derived from an external source (CSLM is significantly more complex than the NBHOM). However, it is a pretty simple piece of code. The point of isolating is that we cannot possibly test all possible real world use cases. DSE, for instance, delivers Memory-Only SSTables, which depend on memtables, so all workloads hitting these would be affected fully by this change. All other workloads will be hit to different degrees. In general we have to make a judgement call about how common such workloads are, vs how significant the effect is for those workloads. However, trying to define a success metric based on a narrow definition of 'realistic workloads' that do not isolate the behaviour doesn't buy us anything in my book. Our hardware and workload definitions are not sufficiently universal. If we see a 15% bump for accesses to memtables, that is IMO worth incorporating, esp. as the bump will increase over time. As we move memtables fully offheap we can support larger and larger memtables, and since the algorithmic quality of this change is that the cost does not increase as memtable size grows, whereas CSLM's cost grows logarithmically, this change also has the likelihood of paying further future dividends. Faster Memtable map --- Key: CASSANDRA-7282 URL: https://issues.apache.org/jira/browse/CASSANDRA-7282 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Benedict Assignee: Benedict Labels: performance Fix For: 3.0 Attachments: reads.svg, writes.svg Currently we maintain a ConcurrentSkipListMap of DecoratedKey -> Partition in our memtables. Maintaining this is an O(lg(n)) operation; since the vast majority of users use a hash partitioner, it occurs to me we could maintain a hybrid ordered list / hash map. The list would impose the normal order on the collection, but a hash index would live alongside as part of the same data structure, simply mapping into the list and permitting O(1) lookups and inserts. I've chosen to implement this initial version as a linked-list node per item, but we can optimise this in future by storing fatter nodes that permit a cache-line's worth of hashes to be checked at once, further reducing the constant factor costs for lookups. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-7919) Change timestamp representation to timeuuid
[ https://issues.apache.org/jira/browse/CASSANDRA-7919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14132529#comment-14132529 ] Benedict commented on CASSANDRA-7919: - Well, I've defined situations in which it can happen: 2+ applications on the same server, machines connecting over NAT (e.g. over the internet), or machines connected to a common data centre that share address spaces. As client counts increase, risk increases. The number of affected customers is likely to be low, but we can guarantee with the size of our user base it _will_ happen. Change timestamp representation to timeuuid --- Key: CASSANDRA-7919 URL: https://issues.apache.org/jira/browse/CASSANDRA-7919 Project: Cassandra Issue Type: Improvement Reporter: T Jake Luciani Priority: Minor Fix For: 3.0 In order to overcome some of the issues with timestamps (CASSANDRA-6123) we need to migrate to a better timestamp representation for cells. Since drivers already support timeuuid it makes sense to migrate to this internally (see CASSANDRA-7056) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-7282) Faster Memtable map
[ https://issues.apache.org/jira/browse/CASSANDRA-7282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14132536#comment-14132536 ] Benedict commented on CASSANDRA-7282: - bq. But there definitely is some value in heavily battle-tested code when you work on a database. Sure, agreed :) Ok, I'm not disputing there's value in _seeing extra data_ (there always is), but my only point is that if the effect is not seen, or is lost in the noise, on a specific hardware / use case benchmark that _doesn't_ isolate, this doesn't IMO reflect badly on the change. Faster Memtable map --- Key: CASSANDRA-7282 URL: https://issues.apache.org/jira/browse/CASSANDRA-7282 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Benedict Assignee: Benedict Labels: performance Fix For: 3.0 Attachments: reads.svg, writes.svg Currently we maintain a ConcurrentSkipListMap of DecoratedKey -> Partition in our memtables. Maintaining this is an O(lg(n)) operation; since the vast majority of users use a hash partitioner, it occurs to me we could maintain a hybrid ordered list / hash map. The list would impose the normal order on the collection, but a hash index would live alongside as part of the same data structure, simply mapping into the list and permitting O(1) lookups and inserts. I've chosen to implement this initial version as a linked-list node per item, but we can optimise this in future by storing fatter nodes that permit a cache-line's worth of hashes to be checked at once, further reducing the constant factor costs for lookups. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-7919) Change timestamp representation to timeuuid
[ https://issues.apache.org/jira/browse/CASSANDRA-7919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14132537#comment-14132537 ] Benedict commented on CASSANDRA-7919: - Are they really used that widely in use cases that aren't server generated? Also, they're generated considerably less frequently than we will be generating them if we switch to all updates requiring them, so the attack vector increases. For RAMP transactions, I'm pretty sure the conflict becomes more dangerous as well. It's a while since I thought about them, but I remember reaching the conclusion that collisions would have more severe consequences. Either way, I still think they're a very bad idea based on my concerns about storage implications. Change timestamp representation to timeuuid --- Key: CASSANDRA-7919 URL: https://issues.apache.org/jira/browse/CASSANDRA-7919 Project: Cassandra Issue Type: Improvement Reporter: T Jake Luciani Priority: Minor Fix For: 3.0 In order to overcome some of the issues with timestamps (CASSANDRA-6123) we need to migrate to a better timestamp representation for cells. Since drivers already support timeuuid it makes sense to migrate to this internally (see CASSANDRA-7056) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (CASSANDRA-7923) When preparing a statement, do not parse the provided string if we already have the parsed statement cached
Benedict created CASSANDRA-7923: --- Summary: When preparing a statement, do not parse the provided string if we already have the parsed statement cached Key: CASSANDRA-7923 URL: https://issues.apache.org/jira/browse/CASSANDRA-7923 Project: Cassandra Issue Type: Improvement Reporter: Benedict Priority: Minor If there are many clients preparing the same statement (or the same client preparing it multiple times), there's no point parsing the statement each time. We already have it prepared; we should ship back the prior result. I would like us separately to consider introducing some checks to ensure that we never have a hash collision (and error if we do, asking the user to salt their query string), but this change in no way increases the risk profile here, since all we did was overwrite the prior statement with the new one. This change means that clients referencing the old statement continue to function and the client registering the colliding statement will not execute the correct statement, but this is in no way worse than the reverse situation. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
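A minimal sketch of the short-circuit being proposed (simplified names; the real cache lives in QueryProcessor and is keyed on an MD5 of the query string, which is what makes the collision discussion above relevant):
{code:java}
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

final class PreparedCacheSketch
{
    static final class Prepared {} // parsed statement elided

    private final ConcurrentMap<String, Prepared> prepared = new ConcurrentHashMap<>();

    Prepared prepare(String query) throws NoSuchAlgorithmException
    {
        String id = md5Hex(query);
        Prepared existing = prepared.get(id);
        if (existing != null)
            return existing; // already parsed: ship back the prior result

        Prepared stmt = parse(query); // the expensive step we want to skip
        Prepared raced = prepared.putIfAbsent(id, stmt);
        return raced != null ? raced : stmt;
    }

    private static String md5Hex(String s) throws NoSuchAlgorithmException
    {
        byte[] digest = MessageDigest.getInstance("MD5").digest(s.getBytes(StandardCharsets.UTF_8));
        StringBuilder sb = new StringBuilder();
        for (byte b : digest)
            sb.append(String.format("%02x", b));
        return sb.toString();
    }

    private static Prepared parse(String query) { return new Prepared(); } // stub
}
{code}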
[jira] [Updated] (CASSANDRA-7923) When preparing a statement, do not parse the provided string if we already have the parsed statement cached
[ https://issues.apache.org/jira/browse/CASSANDRA-7923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benedict updated CASSANDRA-7923: Attachment: 7923.txt When preparing a statement, do not parse the provided string if we already have the parsed statement cached --- Key: CASSANDRA-7923 URL: https://issues.apache.org/jira/browse/CASSANDRA-7923 Project: Cassandra Issue Type: Improvement Reporter: Benedict Priority: Minor Attachments: 7923.txt If there are many clients preparing the same statement (or the same client preparing it multiple times), there's no point parsing the statement each time. We already have it prepared; we should ship back the prior result. I would like us separately to consider introducing some checks to ensure that we never have a hash collision (and error if we do, asking the user to salt their query string), but this change in no way increases the risk profile here, since all we did was overwrite the prior statement with the new one. This change means that clients referencing the old statement continue to function and the client registering the colliding statement will not execute the correct statement, but this is in no way worse than the reverse situation. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (CASSANDRA-7923) When preparing a statement, do not parse the provided string if we already have the parsed statement cached
[ https://issues.apache.org/jira/browse/CASSANDRA-7923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benedict updated CASSANDRA-7923: Labels: cql performance (was: ) When preparing a statement, do not parse the provided string if we already have the parsed statement cached --- Key: CASSANDRA-7923 URL: https://issues.apache.org/jira/browse/CASSANDRA-7923 Project: Cassandra Issue Type: Improvement Reporter: Benedict Assignee: Benedict Priority: Minor Labels: cql, performance Fix For: 2.1.1 Attachments: 7923.txt If there are many clients preparing the same statement (or the same client preparing it multiple times), there's no point parsing the statement each time. We already have it prepared; we should ship back the prior result. I would like us separately to consider introducing some checks to ensure that we never have a hash collision (and error if we do, asking the user to salt their query string), but this change in no way increases the risk profile here, since all we did was overwrite the prior statement with the new one. This change means that clients referencing the old statement continue to function and the client registering the colliding statement will not execute the correct statement, but this is in no way worse than the reverse situation. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-7819) In progress compactions should not prevent deletion of stale sstables
[ https://issues.apache.org/jira/browse/CASSANDRA-7819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14132568#comment-14132568 ] Benedict commented on CASSANDRA-7819: - Looking at this a little more closely, the first patch actually looks completely safe. We never have a MergeTask outstanding in between iterations. In fact, I cannot see why we have the MergeTask submitted asynchronously at all, since we immediately synchronously wait on the result. The Deserializer is the only work done asynchronously, and it doesn't touch this code. If this weren't true, the changes above would not be sufficient; we would need to introduce some other synchronization, or simply delay this change until 2.1. ParallelCI returns an Unwrapper(MergeIterator); the _Reducer_ inside this MergeIterator, on a call to getReduced(), submits a Runnable to an executor, during which we need to be certain the sstables in that collection remain available. This getReduced() method is called on computeNext(), which is called on either hasNext() or next(). The Unwrapper always calls next() immediately after hasNext(), so there are never any left dangling, and _immediately_ calls FBUtilities.waitOnFuture() on the result. The only other places we call shouldPurge are: inside the regular CompactionIterable Reducer, which is not asynchronously executed, and synchronously inside doCleanupCompaction and Scrubber.scrub. I am a little sleep-deprived right now, so if you could double-check my logic, that would be great. But it looks solid to me. In progress compactions should not prevent deletion of stale sstables - Key: CASSANDRA-7819 URL: https://issues.apache.org/jira/browse/CASSANDRA-7819 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Benedict Assignee: Benedict Priority: Minor Labels: compaction Fix For: 2.0.11 Attachments: 0001-7819-v2.patch, 7819.txt Compactions retain references to potentially many sstables that existed when they were started but that are now obsolete; many concurrent compactions can compound this dramatically, and with very large files in size tiered compaction it is possible to inflate disk utilisation dramatically beyond what is necessary. I propose, during compaction, periodically checking which sstables are obsolete and simply replacing each with the sstable that replaced it. These sstables are by definition only used for lookup: since we are in the process of obsoleting the sstables we're compacting, they're only used to reference overlapping ranges which may be covered by tombstones. The simplest solution might even be to simply detect obsoletion and recalculate our overlapping tree afresh. This is a pretty quick operation in the grand scheme of things, certainly wrt compaction, so nothing is lost by doing this at the rate we obsolete sstables. See CASSANDRA-7139 for original discussion of the problem. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-7819) In progress compactions should not prevent deletion of stale sstables
[ https://issues.apache.org/jira/browse/CASSANDRA-7819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14132572#comment-14132572 ] Benedict commented on CASSANDRA-7819: - I'm tempted to say we should just get rid of that executor step to absolutely guarantee it, since it really doesn't buy us anything at all - it almost certainly just slows things down. In progress compactions should not prevent deletion of stale sstables - Key: CASSANDRA-7819 URL: https://issues.apache.org/jira/browse/CASSANDRA-7819 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Benedict Assignee: Benedict Priority: Minor Labels: compaction Fix For: 2.0.11 Attachments: 0001-7819-v2.patch, 7819.txt Compactions retain references to potentially many sstables that existed when they were started but that are now obsolete; many concurrent compactions can compound this dramatically, and with very large files in size tiered compaction it is possible to inflate disk utilisation dramatically beyond what is necessary. I propose, during compaction, periodically checking which sstables are obsolete and simply replacing each with the sstable that replaced it. These sstables are by definition only used for lookup: since we are in the process of obsoleting the sstables we're compacting, they're only used to reference overlapping ranges which may be covered by tombstones. The simplest solution might even be to simply detect obsoletion and recalculate our overlapping tree afresh. This is a pretty quick operation in the grand scheme of things, certainly wrt compaction, so nothing is lost by doing this at the rate we obsolete sstables. See CASSANDRA-7139 for original discussion of the problem. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
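The executor step in question reduces, in miniature, to the pattern below: submitting a task and immediately blocking on its future is behaviourally a synchronous call with extra hand-off cost, which is why no MergeTask can ever be outstanding between iterations (a sketch, not the actual CompactionIterable code):
{code:java}
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public final class SubmitThenWait
{
    static int merge() { return 42; } // stand-in for getReduced()'s merge work

    public static void main(String[] args) throws InterruptedException, ExecutionException
    {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        // nothing happens between submit and get, so this is equivalent to
        // calling merge() inline - just slower
        int result = executor.submit(SubmitThenWait::merge).get();
        System.out.println(result);
        executor.shutdown();
    }
}
{code}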
[jira] [Commented] (CASSANDRA-7282) Faster Memtable map
[ https://issues.apache.org/jira/browse/CASSANDRA-7282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14132586#comment-14132586 ] Benedict commented on CASSANDRA-7282: - Perhaps this is a good opportunity, for our second set of data points, to try taking the latest 'realistic workload' features of cassandra-stress for a whirl. Faster Memtable map --- Key: CASSANDRA-7282 URL: https://issues.apache.org/jira/browse/CASSANDRA-7282 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Benedict Assignee: Benedict Labels: performance Fix For: 3.0 Attachments: reads.svg, writes.svg Currently we maintain a ConcurrentSkipListMap of DecoratedKey -> Partition in our memtables. Maintaining this is an O(lg(n)) operation; since the vast majority of users use a hash partitioner, it occurs to me we could maintain a hybrid ordered list / hash map. The list would impose the normal order on the collection, but a hash index would live alongside as part of the same data structure, simply mapping into the list and permitting O(1) lookups and inserts. I've chosen to implement this initial version as a linked-list node per item, but we can optimise this in future by storing fatter nodes that permit a cache-line's worth of hashes to be checked at once, further reducing the constant factor costs for lookups. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (CASSANDRA-7282) Faster Memtable map
[ https://issues.apache.org/jira/browse/CASSANDRA-7282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benedict updated CASSANDRA-7282: Attachment: profile.yaml run1.svg Ok, so I ran a more realistic workload with the attached profile.yaml, 50/50 read/writes, with reads favouring recently written partitions following an extreme distribution, i.e. the following stress command: ./tools/bin/cassandra-stress user profile=profile.yaml ops\(insert=5,read=5\) n=2000 -pop seq=1..10M read-lookback=extreme\(1..1M,2\) -rate threads=200 -mode cql3 native prepared This is still a workload geared towards exhibiting favourable behaviour, but it is certainly a larger than memory workload. The attached graph comparing the results (run1.svg) demonstrates it is still showing a clear improvement: around 10% more throughput, reduced latencies, and reduced total GC work. It also results in less frequent flushes, presumably due to it requiring slightly less memory than CSLM. Faster Memtable map --- Key: CASSANDRA-7282 URL: https://issues.apache.org/jira/browse/CASSANDRA-7282 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Benedict Assignee: Benedict Labels: performance Fix For: 3.0 Attachments: profile.yaml, reads.svg, run1.svg, writes.svg Currently we maintain a ConcurrentSkipListMap of DecoratedKey -> Partition in our memtables. Maintaining this is an O(lg(n)) operation; since the vast majority of users use a hash partitioner, it occurs to me we could maintain a hybrid ordered list / hash map. The list would impose the normal order on the collection, but a hash index would live alongside as part of the same data structure, simply mapping into the list and permitting O(1) lookups and inserts. I've chosen to implement this initial version as a linked-list node per item, but we can optimise this in future by storing fatter nodes that permit a cache-line's worth of hashes to be checked at once, further reducing the constant factor costs for lookups. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-7919) Change timestamp representation to timeuuid
[ https://issues.apache.org/jira/browse/CASSANDRA-7919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14132620#comment-14132620 ] Benedict commented on CASSANDRA-7919: - It should be, but +1 reducing the vector of this problem, regardless of the discussion around using TimeUUID for timestamps. It's more important this get contributed to the various drivers, but since some users run multiple C* instances on the same box I'd support including this in C* as well, just in case. Change timestamp representation to timeuuid --- Key: CASSANDRA-7919 URL: https://issues.apache.org/jira/browse/CASSANDRA-7919 Project: Cassandra Issue Type: Improvement Reporter: T Jake Luciani Priority: Minor Fix For: 3.0 In order to overcome some of the issues with timestamps (CASSANDRA-6123) we need to migrate to a better timestamp representation for cells. Since drivers already support timeuuid it makes sense to migrate to this internally (see CASSANDRA-7056) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-7658) stress connects to all nodes when it shouldn't
[ https://issues.apache.org/jira/browse/CASSANDRA-7658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14132625#comment-14132625 ] Benedict commented on CASSANDRA-7658: - I figured I must have configured it correctly since there was only one method to provide it, and whether it worked or not depended on the Java Driver. But I was mistaken: you have to provide it in the builder. I've double-checked, and the updated patch works correctly now. stress connects to all nodes when it shouldn't -- Key: CASSANDRA-7658 URL: https://issues.apache.org/jira/browse/CASSANDRA-7658 Project: Cassandra Issue Type: Improvement Components: Tools Reporter: Brandon Williams Assignee: Benedict Priority: Minor Fix For: 2.1.1 Attachments: 7658.txt If you tell stress -node 1,2 in a cluster with more nodes, stress appears to do ring discovery and connect to them all anyway (checked via netstat). This led to the confusion on CASSANDRA-7567 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
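For reference, the builder wiring in question looks roughly like this with the 2.x DataStax Java driver - a sketch with placeholder addresses; WhiteListPolicy filters the hosts the child policy may use down to the supplied set, which is what stops the post-discovery connections:
{code:java}
import java.net.InetSocketAddress;
import java.util.Arrays;
import java.util.List;

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.policies.RoundRobinPolicy;
import com.datastax.driver.core.policies.WhiteListPolicy;

public final class WhitelistedConnect
{
    public static void main(String[] args)
    {
        List<InetSocketAddress> nodes = Arrays.asList(
            new InetSocketAddress("10.0.0.1", 9042),
            new InetSocketAddress("10.0.0.2", 9042));

        // without the whitelist policy the driver discovers the ring and
        // connects to every node, regardless of the contact points supplied
        Cluster cluster = Cluster.builder()
                                 .addContactPointsWithPorts(nodes)
                                 .withLoadBalancingPolicy(new WhiteListPolicy(new RoundRobinPolicy(), nodes))
                                 .build();
        cluster.connect();
        cluster.close();
    }
}
{code}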
[jira] [Commented] (CASSANDRA-7724) Native-Transport threads get stuck in StorageProxy.preparePaxos with no one making progress
[ https://issues.apache.org/jira/browse/CASSANDRA-7724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14132628#comment-14132628 ] Benedict commented on CASSANDRA-7724: - It might be worth trying the (beta) patch for CASSANDRA-7542, to see if that has any impact. If it does, we could accelerate polishing it off. FTR, it is not totally bizarre that all of them should be waiting - if another node is making progress on a round, all of the threads on a given node could be paused. Native-Transport threads get stuck in StorageProxy.preparePaxos with no one making progress --- Key: CASSANDRA-7724 URL: https://issues.apache.org/jira/browse/CASSANDRA-7724 Project: Cassandra Issue Type: Bug Components: Core Environment: Linux 3.13.11-4 #4 SMP PREEMPT x86_64 Intel(R) Core(TM) i7 CPU 950 @ 3.07GHz GenuineIntel java version 1.8.0_05 Java(TM) SE Runtime Environment (build 1.8.0_05-b13) Java HotSpot(TM) 64-Bit Server VM (build 25.5-b02, mixed mode) cassandra 2.0.9 Reporter: Anton Lebedevich Attachments: cassandra.threads2 We've got a lot of write timeouts (cas) when running INSERT INTO cas_demo(pri_id, sec_id, flag, something) VALUES(?, ?, ?, ?) IF NOT EXISTS from 16 connections in parallel using the same pri_id and different sec_id. Doing the same from 4 connections in parallel works ok. All configuration values are at their default values. CREATE TABLE cas_demo ( pri_id varchar, sec_id varchar, flag boolean, something set<varchar>, PRIMARY KEY (pri_id, sec_id) ); CREATE INDEX cas_demo_flag ON cas_demo(flag); Full thread dump is attached. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (CASSANDRA-6726) Recycle CompressedRandomAccessReader/RandomAccessReader buffers independently of their owners, and move them off-heap when possible
[ https://issues.apache.org/jira/browse/CASSANDRA-6726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benedict resolved CASSANDRA-6726. - Resolution: Won't Fix Recycle CompressedRandomAccessReader/RandomAccessReader buffers independently of their owners, and move them off-heap when possible --- Key: CASSANDRA-6726 URL: https://issues.apache.org/jira/browse/CASSANDRA-6726 Project: Cassandra Issue Type: Improvement Reporter: Benedict Assignee: Branimir Lambov Priority: Minor Labels: performance Fix For: 3.0 Attachments: cassandra-6726.patch Whilst CRAR and RAR are pooled, we could and probably should pool the buffers independently, so that they are not tied to a specific sstable. It may be possible to move the RAR buffer off-heap, and the CRAR sometimes (e.g. Snappy may possibly support off-heap buffers) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-7784) DROP table leaves the counter and row cache in a temporarily inconsistent state that, if saved during, will cause an exception to be thrown
[ https://issues.apache.org/jira/browse/CASSANDRA-7784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14132641#comment-14132641 ] Benedict commented on CASSANDRA-7784: - Any news on a patch for this? We should really have got this into 2.1.0, but should definitely fix it for 2.1.1 DROP table leaves the counter and row cache in a temporarily inconsistent state that, if saved during, will cause an exception to be thrown --- Key: CASSANDRA-7784 URL: https://issues.apache.org/jira/browse/CASSANDRA-7784 Project: Cassandra Issue Type: Bug Reporter: Benedict Assignee: Aleksey Yeschenko Priority: Minor It looks like this is quite a realistic race to hit reasonably often, since we forceBlockingFlush after removing from Schema.cfIdMap, so there could be a lengthy window to overlap with an auto-save -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (CASSANDRA-6696) Partition sstables by token range
[ https://issues.apache.org/jira/browse/CASSANDRA-6696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benedict updated CASSANDRA-6696: Labels: compaction correctness performance (was: ) Partition sstables by token range - Key: CASSANDRA-6696 URL: https://issues.apache.org/jira/browse/CASSANDRA-6696 Project: Cassandra Issue Type: Improvement Components: Core Reporter: sankalp kohli Assignee: Marcus Eriksson Labels: compaction, correctness, performance Fix For: 3.0 In JBOD, when someone gets a bad drive, the bad drive is replaced with a new empty one and repair is run. This can cause deleted data to come back in some cases. This is also true for corrupt sstables, where we delete the corrupt sstable and run repair. Here is an example: Say we have 3 nodes A, B and C, with RF=3 and GC grace=10 days. row=sankalp col=sankalp is written 20 days back and successfully went to all three nodes. Then a delete/tombstone was written successfully for the same row column 15 days back. Since this tombstone is older than gc grace, it got compacted away in nodes A and B along with the actual data. So there is no trace of this row column in nodes A and B. Now in node C, say the original data is in drive1 and the tombstone is in drive2. Compaction has not yet reclaimed the data and tombstone. Drive2 becomes corrupt and is replaced with a new empty drive. Due to the replacement, the tombstone is now gone and row=sankalp col=sankalp has come back to life. Now after replacing the drive we run repair. This data will be propagated to all nodes. Note: This is still a problem even if we run repair every gc grace. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (CASSANDRA-7658) stress connects to all nodes when it shouldn't
[ https://issues.apache.org/jira/browse/CASSANDRA-7658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benedict updated CASSANDRA-7658: Attachment: 7658.v2.txt stress connects to all nodes when it shouldn't -- Key: CASSANDRA-7658 URL: https://issues.apache.org/jira/browse/CASSANDRA-7658 Project: Cassandra Issue Type: Improvement Components: Tools Reporter: Brandon Williams Assignee: Benedict Priority: Minor Fix For: 2.1.1 Attachments: 7658.txt, 7658.v2.txt If you tell stress -node 1,2 in a cluster with more nodes, stress appears to do ring discovery and connect to them all anyway (checked via netstat). This led to the confusion on CASSANDRA-7567. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
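For comparison, pinning a client to specific nodes is what the DataStax Java driver's whitelist policy provides; a hedged sketch (assuming the driver 2.x API, with placeholder host names):
{noformat}
import java.net.InetSocketAddress;
import java.util.Arrays;

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.policies.RoundRobinPolicy;
import com.datastax.driver.core.policies.WhiteListPolicy;

public class PinnedClient
{
    public static void main(String[] args)
    {
        // Ring discovery still happens, but the whitelist policy ensures
        // connections are only opened to the listed nodes.
        Cluster cluster = Cluster.builder()
                                 .addContactPoint("node1")
                                 .withLoadBalancingPolicy(new WhiteListPolicy(
                                     new RoundRobinPolicy(),
                                     Arrays.asList(new InetSocketAddress("node1", 9042),
                                                   new InetSocketAddress("node2", 9042))))
                                 .build();
        System.out.println(cluster.getMetadata().getClusterName());
        cluster.close();
    }
}
{noformat}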
[jira] [Comment Edited] (CASSANDRA-7919) Change timestamp representation to timeuuid
[ https://issues.apache.org/jira/browse/CASSANDRA-7919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14132810#comment-14132810 ] Benedict edited comment on CASSANDRA-7919 at 9/13/14 4:17 PM: -- That is stale info, unfortunately :) If we correct the risk of collision within a single application server, which is pretty essential, our LSB will suddenly explode regardless of how static our client set-of-ips is, so optimising storage becomes impossible. And relying on a static client set-of-ips for performance is really unpleasant, anyway. It would be a very strange scenario for users to discover that their database performance degrades over time for reasons unrelated to their load. That all said, if we only promise to store (and deliver back to users) the _time_ component *_only_*, I think we can do it. was (Author: benedict): That is stale info, unfortunately :) If we correct the risk of collision within a single application server, which is pretty essential, our LSB will suddenly explode regardless of how static our client set-of-ips is, so optimising storage becomes impossible. And relying on a static client set-of-ips for performance is really unpleasant, anyway. It would be a very strange scenario for users to discover that their database performance degrades over time for reasons unrelated to their load. That all said, if we promise to store (and deliver back to users) the _time_ component *_only_*, I think we can do it. Change timestamp representation to timeuuid --- Key: CASSANDRA-7919 URL: https://issues.apache.org/jira/browse/CASSANDRA-7919 Project: Cassandra Issue Type: Improvement Reporter: T Jake Luciani Priority: Minor Fix For: 3.0 In order to overcome some of the issues with timestamps (CASSANDRA-6123) we need to migrate to a better timestamp representation for cells. Since drivers already support timeuuid it makes sense to migrate to this internally (see CASSANDRA-7056) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-7919) Change timestamp representation to timeuuid
[ https://issues.apache.org/jira/browse/CASSANDRA-7919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14132810#comment-14132810 ] Benedict commented on CASSANDRA-7919: - That is stale info, unfortunately :) If we correct the risk of collision within a single application server, which is pretty essential, our LSB will suddenly explode regardless of how static our client set-of-ips is, so optimising storage becomes impossible. And relying on a static client set-of-ips for performance is really unpleasant, anyway. It would be a very strange scenario for users to discover that their database performance degrades over time for reasons unrelated to their load. That all said, if we promise to store (and deliver back to users) the _time_ component *_only_*, I think we can do it. Change timestamp representation to timeuuid --- Key: CASSANDRA-7919 URL: https://issues.apache.org/jira/browse/CASSANDRA-7919 Project: Cassandra Issue Type: Improvement Reporter: T Jake Luciani Priority: Minor Fix For: 3.0 In order to overcome some of the issues with timestamps (CASSANDRA-6123) we need to migrate to a better timestamp representation for cells. Since drivers already support timeuuid it makes sense to migrate to this internally (see CASSANDRA-7056) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-7916) Stress should collect and cross-cluster GC statistics
[ https://issues.apache.org/jira/browse/CASSANDRA-7916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14132815#comment-14132815 ] Benedict commented on CASSANDRA-7916: - I doc'd them in the MXBean interface comment :) Not sure we want to pollute the README with this info, but definitely happy to insert it in other places... bq. It might also be nice to summarize the total gc collections in the Results summary. Good idea. Stress should collect and cross-cluster GC statistics - Key: CASSANDRA-7916 URL: https://issues.apache.org/jira/browse/CASSANDRA-7916 Project: Cassandra Issue Type: Improvement Components: Tools Reporter: Benedict Assignee: Benedict Priority: Minor Fix For: 2.1.1 It would be useful to see stress outputs deliver cross-cluster statistics, the most useful being GC data. Some simple changes to GCInspector collect the data, and can deliver to a nodetool request or to stress over JMX. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
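For context, the standard platform MXBeans already expose per-collector counts and pause times over JMX; a minimal sketch of polling them remotely (the JMX URL is a placeholder, and this uses the generic beans rather than the GCInspector MXBean the patch adds):
{noformat}
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import javax.management.MBeanServerConnection;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class GcSnapshot
{
    public static void main(String[] args) throws Exception
    {
        // 7199 is Cassandra's default JMX port; "node1" is a placeholder.
        JMXServiceURL url = new JMXServiceURL("service:jmx:rmi:///jndi/rmi://node1:7199/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url))
        {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();
            long collections = 0, millis = 0;
            // Sum counts and pause time across all collectors (e.g. ParNew + CMS).
            for (GarbageCollectorMXBean gc :
                 ManagementFactory.getPlatformMXBeans(mbs, GarbageCollectorMXBean.class))
            {
                collections += gc.getCollectionCount();
                millis += gc.getCollectionTime();
            }
            System.out.printf("%d collections, %d ms total pause%n", collections, millis);
        }
    }
}
{noformat}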
[jira] [Commented] (CASSANDRA-7916) Stress should collect and cross-cluster GC statistics
[ https://issues.apache.org/jira/browse/CASSANDRA-7916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14132819#comment-14132819 ] Benedict commented on CASSANDRA-7916: - Updated repository with results summary change Stress should collect and cross-cluster GC statistics - Key: CASSANDRA-7916 URL: https://issues.apache.org/jira/browse/CASSANDRA-7916 Project: Cassandra Issue Type: Improvement Components: Tools Reporter: Benedict Assignee: Benedict Priority: Minor Fix For: 2.1.1 It would be useful to see stress outputs deliver cross-cluster statistics, the most useful being GC data. Some simple changes to GCInspector collect the data, and can deliver to a nodetool request or to stress over JMX. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-7919) Change timestamp representation to timeuuid
[ https://issues.apache.org/jira/browse/CASSANDRA-7919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14132830#comment-14132830 ] Benedict commented on CASSANDRA-7919: - Well, we're talking about making _every field a timeuuid_, meaning this decision has rather more impact than just those specific TimeUUID fields (which I've no issue not optimising the storage of, or only managing to do so partially) Change timestamp representation to timeuuid --- Key: CASSANDRA-7919 URL: https://issues.apache.org/jira/browse/CASSANDRA-7919 Project: Cassandra Issue Type: Improvement Reporter: T Jake Luciani Priority: Minor Fix For: 3.0 In order to overcome some of the issues with timestamps (CASSANDRA-6123) we need to migrate to a better timestamp representation for cells. Since drivers already support timeuuid it makes sense to migrate to this internally (see CASSANDRA-7056) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (CASSANDRA-7926) Stress can OOM on merging of timing samples
Benedict created CASSANDRA-7926: --- Summary: Stress can OOM on merging of timing samples Key: CASSANDRA-7926 URL: https://issues.apache.org/jira/browse/CASSANDRA-7926 Project: Cassandra Issue Type: Improvement Components: Tools Reporter: Benedict Assignee: Benedict Priority: Minor Fix For: 2.1.1
{noformat}
Exception in thread StressMetrics:2 java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:2343)
at org.apache.cassandra.stress.util.SampleOfLongs.merge(SampleOfLongs.java:76)
at org.apache.cassandra.stress.util.TimingInterval.merge(TimingInterval.java:95)
at org.apache.cassandra.stress.util.Timing.snapInterval(Timing.java:95)
at org.apache.cassandra.stress.StressMetrics.update(StressMetrics.java:124)
at org.apache.cassandra.stress.StressMetrics.access$200(StressMetrics.java:36)
at org.apache.cassandra.stress.StressMetrics$1.run(StressMetrics.java:72)
at java.lang.Thread.run(Thread.java:744)
{noformat}
This is partially down to recently increasing the per-thread sample size, but also because we allocate temporary space linear in size to total sample size in all threads during merge. This can easily be avoided. We should also scale per-thread sample size based on total number of threads, so we limit total memory use. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-7916) Stress should collect and cross-cluster GC statistics
[ https://issues.apache.org/jira/browse/CASSANDRA-7916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14132850#comment-14132850 ] Benedict commented on CASSANDRA-7916: - Ah, I see. That makes sense. I don't think we have a stress-specific README, though? Might be worth creating one, but probably in a separate ticket... Stress should collect and cross-cluster GC statistics - Key: CASSANDRA-7916 URL: https://issues.apache.org/jira/browse/CASSANDRA-7916 Project: Cassandra Issue Type: Improvement Components: Tools Reporter: Benedict Assignee: Benedict Priority: Minor Fix For: 2.1.1 It would be useful to see stress outputs deliver cross-cluster statistics, the most useful being GC data. Some simple changes to GCInspector collect the data, and can deliver to a nodetool request or to stress over JMX. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-7282) Faster Memtable map
[ https://issues.apache.org/jira/browse/CASSANDRA-7282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14132887#comment-14132887 ] Benedict commented on CASSANDRA-7282: - For the Murmur3Partitioner, the Token (which is the key for this map) is sorted by hash, so all we require is that the hashCode() method returns a prefix of this hash. Faster Memtable map --- Key: CASSANDRA-7282 URL: https://issues.apache.org/jira/browse/CASSANDRA-7282 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Benedict Assignee: Benedict Labels: performance Fix For: 3.0 Attachments: profile.yaml, reads.svg, run1.svg, writes.svg Currently we maintain a ConcurrentSkipListMap of DecoratedKey -> Partition in our memtables. Maintaining this is an O(lg(n)) operation; since the vast majority of users use a hash partitioner, it occurs to me we could maintain a hybrid ordered list / hash map. The list would impose the normal order on the collection, but a hash index would live alongside as part of the same data structure, simply mapping into the list and permitting O(1) lookups and inserts. I've chosen to implement this initial version as a linked-list node per item, but we can optimise this in future by storing fatter nodes that permit a cache-line's worth of hashes to be checked at once, further reducing the constant factor costs for lookups. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (CASSANDRA-7282) Faster Memtable map
[ https://issues.apache.org/jira/browse/CASSANDRA-7282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14132887#comment-14132887 ] Benedict edited comment on CASSANDRA-7282 at 9/13/14 7:14 PM: -- For the Murmur3Partitioner, the Token (which is the key for this map) is sorted by hash (represented as a Long), so all we require is that the hashCode() method returns a prefix of this hash, or the top 32-bits of the long value. was (Author: benedict): For the Murmur3Partitioner, the Token (which is the key for this map) is sorted by hash, so all we require is that the hashCode() method returns a prefix of this hash. Faster Memtable map --- Key: CASSANDRA-7282 URL: https://issues.apache.org/jira/browse/CASSANDRA-7282 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Benedict Assignee: Benedict Labels: performance Fix For: 3.0 Attachments: profile.yaml, reads.svg, run1.svg, writes.svg Currently we maintain a ConcurrentSkipListMap of DecoratedKey -> Partition in our memtables. Maintaining this is an O(lg(n)) operation; since the vast majority of users use a hash partitioner, it occurs to me we could maintain a hybrid ordered list / hash map. The list would impose the normal order on the collection, but a hash index would live alongside as part of the same data structure, simply mapping into the list and permitting O(1) lookups and inserts. I've chosen to implement this initial version as a linked-list node per item, but we can optimise this in future by storing fatter nodes that permit a cache-line's worth of hashes to be checked at once, further reducing the constant factor costs for lookups. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
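A sketch of the constraint being described, assuming a Murmur3-style signed long token (illustrative only, not the committed code): the top 32 bits of the token double as the hashCode(), so hash order agrees with token order.
{noformat}
public final class LongToken implements Comparable<LongToken>
{
    public final long token; // the Murmur3 hash the map is sorted by

    public LongToken(long token)
    {
        this.token = token;
    }

    @Override
    public int hashCode()
    {
        // A 32-bit prefix of the sort key: if t1 <= t2 then
        // t1.hashCode() <= t2.hashCode(), with ties broken by compareTo().
        return (int) (token >>> 32);
    }

    @Override
    public boolean equals(Object o)
    {
        return o instanceof LongToken && ((LongToken) o).token == token;
    }

    @Override
    public int compareTo(LongToken that)
    {
        return Long.compare(this.token, that.token);
    }
}
{noformat}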
[jira] [Commented] (CASSANDRA-7282) Faster Memtable map
[ https://issues.apache.org/jira/browse/CASSANDRA-7282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14132906#comment-14132906 ] Benedict commented on CASSANDRA-7282: - As already stated, I disagree. Always, when making these decisions, the important factors are: 1) what workloads / portions of workloads are affected; and 2) do we consider these common enough (or to become common enough) to warrant inclusion? Since we are very constrained on our hardware and workload generation, picking a realistic workload that we can run at best helps us rule out regressions, if that workload is not capable of exhibiting the change. The important question is simply: do we consider it likely that it will impact other workloads we are not capable of benchmarking, given the known information we have from isolating its effect in a manner we _can_? Faster Memtable map --- Key: CASSANDRA-7282 URL: https://issues.apache.org/jira/browse/CASSANDRA-7282 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Benedict Assignee: Benedict Labels: performance Fix For: 3.0 Attachments: profile.yaml, reads.svg, run1.svg, writes.svg Currently we maintain a ConcurrentSkipListMap of DecoratedKey -> Partition in our memtables. Maintaining this is an O(lg(n)) operation; since the vast majority of users use a hash partitioner, it occurs to me we could maintain a hybrid ordered list / hash map. The list would impose the normal order on the collection, but a hash index would live alongside as part of the same data structure, simply mapping into the list and permitting O(1) lookups and inserts. I've chosen to implement this initial version as a linked-list node per item, but we can optimise this in future by storing fatter nodes that permit a cache-line's worth of hashes to be checked at once, further reducing the constant factor costs for lookups. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-7919) Change timestamp representation to timeuuid
[ https://issues.apache.org/jira/browse/CASSANDRA-7919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14132935#comment-14132935 ] Benedict commented on CASSANDRA-7919: - bq. Regarding the lsb, a given client shouldn't really change it during its lifetime (per the timeuuid spec) This is not true if we incorporate CASSANDRA-7925. If we fix this, which we should, and we use TimeUUID for solving this problem generally, we should _not_ deliver the TimeUUID back to clients, and we should not store it fully indefinitely. We should only use it for guaranteeing uniqueness, and after that truncate it to a simple timestamp. There is no benefit to storing it indefinitely. Change timestamp representation to timeuuid --- Key: CASSANDRA-7919 URL: https://issues.apache.org/jira/browse/CASSANDRA-7919 Project: Cassandra Issue Type: Improvement Reporter: T Jake Luciani Priority: Minor Fix For: 3.0 In order to overcome some of the issues with timestamps (CASSANDRA-6123) we need to migrate to a better timestamp representation for cells. Since drivers already support timeuuid it makes sense to migrate to this internally (see CASSANDRA-7056) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
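For illustration, truncating a version-1 UUID to its time component is straightforward in plain Java; a minimal sketch (the epoch-offset constant is the standard count of 100ns intervals between the UUID epoch of 1582-10-15 and the Unix epoch):
{noformat}
import java.util.UUID;

public class TimeOnly
{
    // 100ns intervals between the UUID epoch (1582-10-15) and the Unix epoch.
    private static final long UUID_EPOCH_OFFSET = 0x01b21dd213814000L;

    // Keep only the time component of a version-1 (time-based) UUID,
    // discarding clock sequence and node: the "truncate to a simple
    // timestamp" step described above.
    public static long toUnixMillis(UUID timeuuid)
    {
        return (timeuuid.timestamp() - UUID_EPOCH_OFFSET) / 10_000;
    }
}
{noformat}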
[jira] [Commented] (CASSANDRA-7282) Faster Memtable map
[ https://issues.apache.org/jira/browse/CASSANDRA-7282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14132953#comment-14132953 ] Benedict commented on CASSANDRA-7282: - bq. Since this is the case, should we restrict this method to Murmur tokens only? Well, if I can spend the time ensuring safety we should be able to use this for RandomPartitioner also. bq. I would still change the condition in the preface to use non-strict inequality, though, because cropping tokens to 32 bits will introduce collisions. There will be collisions with or without truncation. The fact that there are collisions doesn't affect the constraint I've imposed upon the data; we assume nothing about the dataset when two hashCode()s are equal, and simply resort to the underlying token comparator, so I think the constraint is sufficient. Non-strict inequality is surely more broken, however? It would hold true for k1:1, k2:2, hash(k1):2, hash(k2):2. We could strengthen it to a bi-directional implication, i.e. k1 <= k2 == k1.hashCode() <= k2.hashCode(). bq. I wouldn't even call this a hashCode to avoid the confusion, perhaps an ordering key or ordering prefix? Perhaps. This was the simplest approach, and since it *is* a hash key used to index a hash table it seems suitable to use hashCode(), and impose the extra constraint contextually. I'm fairly neutral though; we certainly could introduce a new interface. I'll see how it looks. Faster Memtable map --- Key: CASSANDRA-7282 URL: https://issues.apache.org/jira/browse/CASSANDRA-7282 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Benedict Assignee: Benedict Labels: performance Fix For: 3.0 Attachments: profile.yaml, reads.svg, run1.svg, writes.svg Currently we maintain a ConcurrentSkipListMap of DecoratedKey -> Partition in our memtables. Maintaining this is an O(lg(n)) operation; since the vast majority of users use a hash partitioner, it occurs to me we could maintain a hybrid ordered list / hash map. The list would impose the normal order on the collection, but a hash index would live alongside as part of the same data structure, simply mapping into the list and permitting O(1) lookups and inserts. I've chosen to implement this initial version as a linked-list node per item, but we can optimise this in future by storing fatter nodes that permit a cache-line's worth of hashes to be checked at once, further reducing the constant factor costs for lookups. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (CASSANDRA-7282) Faster Memtable map
[ https://issues.apache.org/jira/browse/CASSANDRA-7282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14132953#comment-14132953 ] Benedict edited comment on CASSANDRA-7282 at 9/13/14 9:49 PM: -- bq. Since this is the case, should we restrict this method to Murmur tokens only? Well, if I can spend the time ensuring safety we should be able to use this for RandomPartitioner also. bq. I would still change the condition in the preface to use non-strict inequality, though, because cropping tokens to 32 bits will introduce collisions. -There will be collisions with or without truncation. The fact that there are collisions doesn't affect the constraint I've imposed upon the data; we assume nothing about the dataset when two hashCode()s are equal, and simply resort to the underlying token comparator, so I think the constraint is sufficient. Non-strict inequality is surely more broken, however? It would hold true for k1:1, k2:2, hash(k1):2, hash(k2):2. We could strengthen it to a bi-directional implication, i.e. k1 <= k2 == k1.hashCode() <= k2.hashCode().- I just reread your comment and realise you meant the RHS only. Agreed this should be changed. bq. I wouldn't even call this a hashCode to avoid the confusion, perhaps an ordering key or ordering prefix? Perhaps. This was the simplest approach, and since it *is* a hash key used to index a hash table it seems suitable to use hashCode(), and impose the extra constraint contextually. I'm fairly neutral though; we certainly could introduce a new interface. I'll see how it looks. was (Author: benedict): bq. Since this is the case, should we restrict this method to Murmur tokens only? Well, if I can spend the time ensuring safety we should be able to use this for RandomPartitioner also. bq. I would still change the condition in the preface to use non-strict inequality, though, because cropping tokens to 32 bits will introduce collisions. There will be collisions with or without truncation. The fact that there are collisions doesn't affect the constraint I've imposed upon the data; we assume nothing about the dataset when two hashCode()s are equal, and simply resort to the underlying token comparator, so I think the constraint is sufficient. Non-strict inequality is surely more broken, however? It would hold true for k1:1, k2:2, hash(k1):2, hash(k2):2. We could strengthen it to a bi-directional implication, i.e. k1 <= k2 == k1.hashCode() <= k2.hashCode(). bq. I wouldn't even call this a hashCode to avoid the confusion, perhaps an ordering key or ordering prefix? Perhaps. This was the simplest approach, and since it *is* a hash key used to index a hash table it seems suitable to use hashCode(), and impose the extra constraint contextually. I'm fairly neutral though; we certainly could introduce a new interface. I'll see how it looks. Faster Memtable map --- Key: CASSANDRA-7282 URL: https://issues.apache.org/jira/browse/CASSANDRA-7282 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Benedict Assignee: Benedict Labels: performance Fix For: 3.0 Attachments: profile.yaml, reads.svg, run1.svg, writes.svg Currently we maintain a ConcurrentSkipListMap of DecoratedKey -> Partition in our memtables. Maintaining this is an O(lg(n)) operation; since the vast majority of users use a hash partitioner, it occurs to me we could maintain a hybrid ordered list / hash map. The list would impose the normal order on the collection, but a hash index would live alongside as part of the same data structure, simply mapping into the list and permitting O(1) lookups and inserts. I've chosen to implement this initial version as a linked-list node per item, but we can optimise this in future by storing fatter nodes that permit a cache-line's worth of hashes to be checked at once, further reducing the constant factor costs for lookups. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-7919) Change timestamp representation to timeuuid
[ https://issues.apache.org/jira/browse/CASSANDRA-7919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14132979#comment-14132979 ] Benedict commented on CASSANDRA-7919: - bq. a given client process will have one lsb for its lifetime. Sure, but when discussing state space we have to assume applications are restarted periodically :-) bq. I don't have a problem with that. Ok, if we can agree to that, fix the LSB and make it clear to users they should not connect through NAT, I think that's all of my concerns mostly dealt with. This will still be more annoying to store, but surmountable without severe loss of efficiency. Change timestamp representation to timeuuid --- Key: CASSANDRA-7919 URL: https://issues.apache.org/jira/browse/CASSANDRA-7919 Project: Cassandra Issue Type: Improvement Reporter: T Jake Luciani Priority: Minor Fix For: 3.0 In order to overcome some of the issues with timestamps (CASSANDRA-6123) we need to migrate to a better timestamp representation for cells. Since drivers already support timeuuid it makes sense to migrate to this internally (see CASSANDRA-7056) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (CASSANDRA-7282) Faster Memtable map
[ https://issues.apache.org/jira/browse/CASSANDRA-7282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14132953#comment-14132953 ] Benedict edited comment on CASSANDRA-7282 at 9/13/14 9:55 PM: -- bq. Since this is the case, should we restrict this method to Murmur tokens only? Well, if I can spend the time ensuring safety we should be able to use this for RandomPartitioner also. bq. I would still change the condition in the preface to use non-strict inequality, though, because cropping tokens to 32 bits will introduce collisions. -There will be collisions with or without truncation. The fact that there are collisions doesn't affect the constraint I've imposed upon the data; we assume nothing about the dataset when two hashCode()s are equal, and simply resort to the underlying token comparator, so I think the constraint is sufficient. Non-strict inequality is surely more broken, however? It would hold true for k1:1, k2:2, hash(k1):2, hash(k2):2. We could strengthen it to a bi-directional implication, i.e. k1 <= k2 == k1.hashCode() <= k2.hashCode().- I just reread your comment and realise you meant the RHS only. Agreed this should be changed. However we should probably opt for the stronger bidirectional constraint, since it is still more correct. bq. I wouldn't even call this a hashCode to avoid the confusion, perhaps an ordering key or ordering prefix? Perhaps. This was the simplest approach, and since it *is* a hash key used to index a hash table it seems suitable to use hashCode(), and impose the extra constraint contextually. I'm fairly neutral though; we certainly could introduce a new interface. I'll see how it looks. was (Author: benedict): bq. Since this is the case, should we restrict this method to Murmur tokens only? Well, if I can spend the time ensuring safety we should be able to use this for RandomPartitioner also. bq. I would still change the condition in the preface to use non-strict inequality, though, because cropping tokens to 32 bits will introduce collisions. -There will be collisions with or without truncation. The fact that there are collisions doesn't affect the constraint I've imposed upon the data; we assume nothing about the dataset when two hashCode()s are equal, and simply resort to the underlying token comparator, so I think the constraint is sufficient. Non-strict inequality is surely more broken, however? It would hold true for k1:1, k2:2, hash(k1):2, hash(k2):2. We could strengthen it to a bi-directional implication, i.e. k1 <= k2 == k1.hashCode() <= k2.hashCode().- I just reread your comment and realise you meant the RHS only. Agreed this should be changed. bq. I wouldn't even call this a hashCode to avoid the confusion, perhaps an ordering key or ordering prefix? Perhaps. This was the simplest approach, and since it *is* a hash key used to index a hash table it seems suitable to use hashCode(), and impose the extra constraint contextually. I'm fairly neutral though; we certainly could introduce a new interface. I'll see how it looks. Faster Memtable map --- Key: CASSANDRA-7282 URL: https://issues.apache.org/jira/browse/CASSANDRA-7282 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Benedict Assignee: Benedict Labels: performance Fix For: 3.0 Attachments: profile.yaml, reads.svg, run1.svg, writes.svg Currently we maintain a ConcurrentSkipListMap of DecoratedKey -> Partition in our memtables. Maintaining this is an O(lg(n)) operation; since the vast majority of users use a hash partitioner, it occurs to me we could maintain a hybrid ordered list / hash map. The list would impose the normal order on the collection, but a hash index would live alongside as part of the same data structure, simply mapping into the list and permitting O(1) lookups and inserts. I've chosen to implement this initial version as a linked-list node per item, but we can optimise this in future by storing fatter nodes that permit a cache-line's worth of hashes to be checked at once, further reducing the constant factor costs for lookups. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-7282) Faster Memtable map
[ https://issues.apache.org/jira/browse/CASSANDRA-7282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14132991#comment-14132991 ] Benedict commented on CASSANDRA-7282: - Hmm. Yes. I shouldn't get into these discussions at this time of night. Agreed. Faster Memtable map --- Key: CASSANDRA-7282 URL: https://issues.apache.org/jira/browse/CASSANDRA-7282 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Benedict Assignee: Benedict Labels: performance Fix For: 3.0 Attachments: profile.yaml, reads.svg, run1.svg, writes.svg Currently we maintain a ConcurrentSkipListMap of DecoratedKey -> Partition in our memtables. Maintaining this is an O(lg(n)) operation; since the vast majority of users use a hash partitioner, it occurs to me we could maintain a hybrid ordered list / hash map. The list would impose the normal order on the collection, but a hash index would live alongside as part of the same data structure, simply mapping into the list and permitting O(1) lookups and inserts. I've chosen to implement this initial version as a linked-list node per item, but we can optimise this in future by storing fatter nodes that permit a cache-line's worth of hashes to be checked at once, further reducing the constant factor costs for lookups. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-7931) Single Host CPU profiling data dump
[ https://issues.apache.org/jira/browse/CASSANDRA-7931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14133090#comment-14133090 ] Benedict commented on CASSANDRA-7931: - for (1): take a look at CASSANDRA-4718. The costs of thread signalling for our small messages are always dominant; however, we have already reduced this substantially with the SEPExecutor. I will be blogging about this in the near future. Further optimisation may well be possible, but avoiding blocking for the executor queues is exactly why this executor is more efficient. Thread signalling is expensive, and creates bottlenecks. for (3): You are incorrect. The park() is only entered if the data is _not_ already available. For local requests the SEPExecutor permits optimising the call directly into the calling thread _if there are fewer requests than permitted to run on the read stage_. So if you tuned your read stage higher than your max_rpc count, this call would never happen for local-only requests. for (2) and (4), we can certainly do more: see CASSANDRA-7907 and CASSANDRA-7029. Single Host CPU profiling data dump --- Key: CASSANDRA-7931 URL: https://issues.apache.org/jira/browse/CASSANDRA-7931 Project: Cassandra Issue Type: Improvement Environment: 2.1.0 on my local machine Reporter: Michael Nelson Priority: Minor Attachments: traces.txt At Boot Camp today I did some CPU profiling and wanted to turn in my findings somehow. So here they are in a JIRA. I ran a cassandra-stress read test against my local machine (single node: SSD, 8 proc, plenty of RAM). I'm using straight 2.1.0 with the Lightweight Java Profiler by Jeremy Manson (https://code.google.com/p/lightweight-java-profiler/). This is a lower-level profiler that more accurately reports what the CPU is actually doing rather than where time is going. Attached is the report. Here are a few high points with some commentary from some investigation I did today: 1. The number one _consumer_of_CPU_ is this stack trace:
{noformat}
sun.misc.Unsafe.park(Unsafe.java:-1)
java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:349)
org.apache.cassandra.concurrent.SEPWorker.doWaitSpin(SEPWorker.java:236)
org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:74)
java.lang.Thread.run(Thread.java:745)
{noformat}
There is a CPU cost to going into and out of Unsafe.park(), and the SEPWorkers are doing it so much when they are spinning that they were the largest single consumer of CPU. Note that this profiler is sampling the CPU, so it isn't the _blocking_ that this call does that is the issue: it is the CPU actually being consumed as it prepares to block and then later comes out. This kind of lower-level, fine-grained control of threads is tricky, and there appears to be room for improvement here. I tried a few approaches that didn't pan out. There really needs to be a way for threads to block waiting for the executor queues. That way they will be immediately available (rather than waiting for them to poll the executors) and will not consume CPU when they aren't needed. Maybe block for some short period of time and then become available for other queues after that? 2. Second is Netty writing to sockets. I didn't investigate this. Netty is pretty optimized. Someone mentioned there are native options for Netty. Probably worth trying. 3. Third is similar to #1:
{noformat}
sun.misc.Unsafe.park(Unsafe.java:-1)
java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:349)
org.apache.cassandra.utils.concurrent.WaitQueue$AbstractSignal.awaitUntil(WaitQueue.java:303)
org.apache.cassandra.utils.concurrent.SimpleCondition.await(SimpleCondition.java:63)
org.apache.cassandra.service.ReadCallback.await(ReadCallback.java:90)
org.apache.cassandra.service.ReadCallback.get(ReadCallback.java:100)
org.apache.cassandra.service.AbstractReadExecutor.get(AbstractReadExecutor.java:144)
org.apache.cassandra.service.StorageProxy.fetchRows(StorageProxy.java:1226)
{noformat}
I looked at this a little. The asynchronous callback using Unsafe.park() is being invoked even for local reads (like my local laptop). It only rarely blocked. It just paid the CPU cost of going in and immediately coming out because the read was already satisfied from the buffer cache. :-( 4. Fourth is Netty doing epoll. Again, worth considering Netty's native optimizations to help here. The others are interesting but, for this little benchmark, not as big a deal. For example, ArrayList.grow() shows up more than I would have expected. The Boot Camp was great. Feel free to contact me if I can help somehow. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
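The spin-then-park pattern at issue, reduced to a sketch (the spin count and park duration are illustrative, not SEPWorker's actual tuning):
{noformat}
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.locks.LockSupport;

public class SpinThenParkWorker implements Runnable
{
    private final Queue<Runnable> tasks = new ConcurrentLinkedQueue<>();

    public void submit(Runnable task)
    {
        tasks.add(task);
    }

    @Override
    public void run()
    {
        while (!Thread.currentThread().isInterrupted())
        {
            Runnable task = tasks.poll();
            // Spin briefly first: cheap when work arrives quickly, and it
            // avoids the park()/unpark() cost that dominates the profile.
            for (int i = 0; task == null && i < 1000; i++)
                task = tasks.poll();
            if (task != null)
                task.run();
            else
                LockSupport.parkNanos(100_000); // the expensive fallback
        }
    }
}
{noformat}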
[jira] [Commented] (CASSANDRA-7930) Warn when evicting prepared statements from cache
[ https://issues.apache.org/jira/browse/CASSANDRA-7930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14133091#comment-14133091 ] Benedict commented on CASSANDRA-7930: - Please use StorageService.scheduledTasks instead of creating a new executor for such low-volume work. Warn when evicting prepared statements from cache - Key: CASSANDRA-7930 URL: https://issues.apache.org/jira/browse/CASSANDRA-7930 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Robbie Strickland Assignee: Robbie Strickland Attachments: cassandra-2.0-v2.txt, cassandra-2.0-v3.txt, cassandra-2.0-v4.txt, cassandra-2.0.txt, cassandra-2.1.txt The prepared statement cache is an LRU, with a max size of maxMemory / 256. There is currently no warning when statements are evicted, which could be problematic if the user is unaware that this is happening. At the very least, we should provide a JMX metric and possibly a log message indicating this is happening. At some point it may also be worthwhile to make this tunable for users with large numbers of statements. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
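A generic sketch of the suggested pattern: one shared, process-wide scheduler (StorageService.scheduledTasks in Cassandra; a plain ScheduledExecutorService stands in here) rather than a dedicated executor per low-volume task:
{noformat}
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

public class EvictionWarner
{
    // Stand-in for a shared, process-wide scheduler such as
    // StorageService.scheduledTasks; the point is to reuse it rather
    // than spin up a dedicated executor for low-volume work.
    private static final ScheduledExecutorService SHARED =
        Executors.newSingleThreadScheduledExecutor();

    private final AtomicLong evictions = new AtomicLong();

    public void start()
    {
        // Periodically report evictions observed since the last check.
        SHARED.scheduleAtFixedRate(() -> {
            long count = evictions.getAndSet(0);
            if (count > 0)
                System.err.println(count + " prepared statements evicted in the last minute");
        }, 1, 1, TimeUnit.MINUTES);
    }

    public void onEvict()
    {
        evictions.incrementAndGet();
    }
}
{noformat}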
[jira] [Commented] (CASSANDRA-7705) Safer Resource Management
[ https://issues.apache.org/jira/browse/CASSANDRA-7705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14133110#comment-14133110 ] Benedict commented on CASSANDRA-7705: - Patch available [here|https://github.com/belliottsmith/cassandra/tree/7705-resourcemgmt] This patch traps reference leaks as well as double-releasing references. This version only modifies SSTableReader resource management, but it can be rolled out for any heavy-weight objects we manage, i.e. all instances of RefCountedMemory except those stored in SerializingCache. Safer Resource Management - Key: CASSANDRA-7705 URL: https://issues.apache.org/jira/browse/CASSANDRA-7705 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Benedict Fix For: 3.0 We've had a spate of bugs recently with bad reference counting. These can have potentially dire consequences, generally either randomly deleting data or giving us infinite loops. Since in 2.1 we only reference count resources that are relatively expensive and infrequently managed (or in places where this safety is probably not as necessary, e.g. SerializingCache), we could without any negative consequences (and only slight code complexity) introduce a safer resource management scheme for these more expensive/infrequent actions. Basically, I propose that when we want to acquire a resource we allocate an object that manages the reference. This can only be released once; if it is released twice, we fail immediately at the second release, reporting where the bug is (rather than letting it continue fine until the next correct release corrupts the count). The reference counter remains the same, but we obtain guarantees that the reference count itself is never badly maintained, although code using it could mistakenly release its own handle early (typically this is only an issue when cleaning up after a failure, in which case under the new scheme this would be an innocuous error) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
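A minimal sketch of the one-shot handle idea, with the real patch's bookkeeping hidden behind a callback (names here are illustrative, not the patch itself):
{noformat}
import java.util.concurrent.atomic.AtomicBoolean;

// A handle that can be released exactly once: a second release fails
// immediately at the buggy call site, instead of silently corrupting
// the shared reference count it guards.
public final class Ref
{
    private final AtomicBoolean released = new AtomicBoolean(false);
    private final Runnable onRelease; // e.g. decrement the shared counter

    public Ref(Runnable onRelease)
    {
        this.onRelease = onRelease;
    }

    public void release()
    {
        if (!released.compareAndSet(false, true))
            throw new IllegalStateException("reference released twice");
        onRelease.run();
    }
}
{noformat}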
[jira] [Created] (CASSANDRA-7932) Corrupt SSTable Cleanup Leak Resources
Benedict created CASSANDRA-7932: --- Summary: Corrupt SSTable Cleanup Leak Resources Key: CASSANDRA-7932 URL: https://issues.apache.org/jira/browse/CASSANDRA-7932 Project: Cassandra Issue Type: Bug Components: Core Reporter: Benedict Assignee: Benedict Fix For: 2.1.1 CASSANDRA-7705, during the BlacklistingCompactionsTest, detected leaks. I've tracked this down to DataTracker.removeUnreadableSSTables(), which does not release the reference to any sstables from the tracker. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (CASSANDRA-7932) Corrupt SSTable Cleanup Leak Resources
[ https://issues.apache.org/jira/browse/CASSANDRA-7932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benedict updated CASSANDRA-7932: Attachment: 7932.txt Patch attached Corrupt SSTable Cleanup Leak Resources -- Key: CASSANDRA-7932 URL: https://issues.apache.org/jira/browse/CASSANDRA-7932 Project: Cassandra Issue Type: Bug Components: Core Reporter: Benedict Assignee: Benedict Fix For: 2.1.1 Attachments: 7932.txt CASSANDRA-7705, during the BlacklistingCompactionsTest, detected leaks. I've tracked this down to DataTracker.removeUnreadableSSTables(), which does not release the reference to any sstables from the tracker. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-7282) Faster Memtable map
[ https://issues.apache.org/jira/browse/CASSANDRA-7282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14133122#comment-14133122 ] Benedict commented on CASSANDRA-7282: - bq. I wouldn't even call this a hashCode to avoid the confusion, perhaps an ordering key or ordering prefix? Creating an interface would be problematic, since we need to have our map key be a shared type for both CSLM keys and NBHOM keys. So I'm going to stick with the current situation. If you meant from a purely documentation point of view, it is absolutely essential that the value is a _hash_, otherwise performance will be O\(n\^2\), so whilst it may be worth clarifying, it is essential we call it a hashCode(). To elaborate on this in documentation, I've included the following extra comment: {quote} This data structure essentially only works for keys that are first sorted by some hash value (and may then be sorted within those hashes arbitrarily), where a 32-bit prefix of the hash we sort by is returned by hashCode() {quote} Faster Memtable map --- Key: CASSANDRA-7282 URL: https://issues.apache.org/jira/browse/CASSANDRA-7282 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Benedict Assignee: Benedict Labels: performance Fix For: 3.0 Attachments: profile.yaml, reads.svg, run1.svg, writes.svg Currently we maintain a ConcurrentSkipListMap of DecoratedKey -> Partition in our memtables. Maintaining this is an O(lg(n)) operation; since the vast majority of users use a hash partitioner, it occurs to me we could maintain a hybrid ordered list / hash map. The list would impose the normal order on the collection, but a hash index would live alongside as part of the same data structure, simply mapping into the list and permitting O(1) lookups and inserts. I've chosen to implement this initial version as a linked-list node per item, but we can optimise this in future by storing fatter nodes that permit a cache-line's worth of hashes to be checked at once, further reducing the constant factor costs for lookups. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (CASSANDRA-7933) Update cassandra-stress README
Benedict created CASSANDRA-7933: --- Summary: Update cassandra-stress README Key: CASSANDRA-7933 URL: https://issues.apache.org/jira/browse/CASSANDRA-7933 Project: Cassandra Issue Type: Task Reporter: Benedict Priority: Minor There is a README in the tools/stress directory. It is completely out of date. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-7916) Stress should collect and cross-cluster GC statistics
[ https://issues.apache.org/jira/browse/CASSANDRA-7916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14133127#comment-14133127 ] Benedict commented on CASSANDRA-7916: - Committed, and created a new ticket for sprucing up the README (since it's completely stale, not just missing this one addition) Stress should collect and cross-cluster GC statistics - Key: CASSANDRA-7916 URL: https://issues.apache.org/jira/browse/CASSANDRA-7916 Project: Cassandra Issue Type: Improvement Components: Tools Reporter: Benedict Assignee: Benedict Priority: Minor Fix For: 2.1.1 It would be useful to see stress outputs deliver cross-cluster statistics, the most useful being GC data. Some simple changes to GCInspector collect the data, and can deliver to a nodetool request or to stress over JMX. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-7932) Corrupt SSTable Cleanup Leak Resources
[ https://issues.apache.org/jira/browse/CASSANDRA-7932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14133155#comment-14133155 ] Benedict commented on CASSANDRA-7932: - I was mistaken. Whilst that patch fixes what appears to be a leak we could encounter from the same code path, the actual leak was more involved. I've patched on top of the leak detection code for trunk; however, if we get a thumbs up we can backport to 2.1. [Patch|https://github.com/belliottsmith/cassandra/tree/7932-corruptleak] Corrupt SSTable Cleanup Leak Resources -- Key: CASSANDRA-7932 URL: https://issues.apache.org/jira/browse/CASSANDRA-7932 Project: Cassandra Issue Type: Bug Components: Core Reporter: Benedict Assignee: Benedict Fix For: 2.1.1 Attachments: 7932.txt CASSANDRA-7705, during the BlacklistingCompactionsTest, detected leaks. I've tracked this down to DataTracker.removeUnreadableSSTables(), which does not release the reference to any sstables from the tracker. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (CASSANDRA-7932) Corrupt SSTable Cleanup Leak Resources
[ https://issues.apache.org/jira/browse/CASSANDRA-7932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benedict updated CASSANDRA-7932: Attachment: (was: 7932.txt) Corrupt SSTable Cleanup Leak Resources -- Key: CASSANDRA-7932 URL: https://issues.apache.org/jira/browse/CASSANDRA-7932 Project: Cassandra Issue Type: Bug Components: Core Reporter: Benedict Assignee: Benedict Fix For: 2.1.1 CASSANDRA-7705, during the BlacklistingCompactionsTest, detected leaks. I've tracked this down to DataTracker.removeUnreadableSSTables(), which does not release the reference to any sstables from the tracker. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-7282) Faster Memtable map
[ https://issues.apache.org/jira/browse/CASSANDRA-7282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14133164#comment-14133164 ] Benedict commented on CASSANDRA-7282: - Uploaded a version with updated comments [here|https://github.com/belliottsmith/cassandra/tree/7282-fastmemtablemap] Faster Memtable map --- Key: CASSANDRA-7282 URL: https://issues.apache.org/jira/browse/CASSANDRA-7282 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Benedict Assignee: Benedict Labels: performance Fix For: 3.0 Attachments: profile.yaml, reads.svg, run1.svg, writes.svg Currently we maintain a ConcurrentSkipListMap of DecoratedKey -> Partition in our memtables. Maintaining this is an O(lg(n)) operation; since the vast majority of users use a hash partitioner, it occurs to me we could maintain a hybrid ordered list / hash map. The list would impose the normal order on the collection, but a hash index would live alongside as part of the same data structure, simply mapping into the list and permitting O(1) lookups and inserts. I've chosen to implement this initial version as a linked-list node per item, but we can optimise this in future by storing fatter nodes that permit a cache-line's worth of hashes to be checked at once, further reducing the constant factor costs for lookups. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-7926) Stress can OOM on merging of timing samples
[ https://issues.apache.org/jira/browse/CASSANDRA-7926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14133191#comment-14133191 ] Benedict commented on CASSANDRA-7926: - Patch available [here|https://github.com/belliottsmith/cassandra/tree/7926-stressoom] I've made a few small changes:
* The number of samples we collect/accumulate at any point is now configurable with the \-samples setting
* When merging multiple samples, we no longer merge them all together and _then_ downsample, but instead downsample as we merge, ensuring our memory use is bounded much lower
* Switched to ThreadLocalRandom instead of Random for generating probabilities
Stress can OOM on merging of timing samples --- Key: CASSANDRA-7926 URL: https://issues.apache.org/jira/browse/CASSANDRA-7926 Project: Cassandra Issue Type: Improvement Components: Tools Reporter: Benedict Assignee: Benedict Priority: Minor Fix For: 2.1.1
{noformat}
Exception in thread StressMetrics:2 java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:2343)
at org.apache.cassandra.stress.util.SampleOfLongs.merge(SampleOfLongs.java:76)
at org.apache.cassandra.stress.util.TimingInterval.merge(TimingInterval.java:95)
at org.apache.cassandra.stress.util.Timing.snapInterval(Timing.java:95)
at org.apache.cassandra.stress.StressMetrics.update(StressMetrics.java:124)
at org.apache.cassandra.stress.StressMetrics.access$200(StressMetrics.java:36)
at org.apache.cassandra.stress.StressMetrics$1.run(StressMetrics.java:72)
at java.lang.Thread.run(Thread.java:744)
{noformat}
This is partially down to recently increasing the per-thread sample size, but also because we allocate temporary space linear in size to the total sample size in all threads during merge. This can easily be avoided. We should also scale the per-thread sample size based on the total number of threads, so we limit total memory use. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
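A sketch of the downsample-as-you-merge idea under the assumption of uniform samples (the class and method names are hypothetical, not the patch itself): each input element is kept with probability maxSamples/total, so temporary space is bounded by maxSamples rather than by the combined input size.
{noformat}
import java.util.Arrays;
import java.util.concurrent.ThreadLocalRandom;

public final class SampleMerge
{
    public static long[] merge(long[] a, long[] b, int maxSamples)
    {
        int total = a.length + b.length;
        if (total <= maxSamples)
        {
            long[] out = Arrays.copyOf(a, total);
            System.arraycopy(b, 0, out, a.length, b.length);
            return out;
        }
        // Keep each element with probability maxSamples/total, preserving
        // uniformity without ever materialising the full concatenation.
        ThreadLocalRandom rnd = ThreadLocalRandom.current();
        long[] out = new long[maxSamples];
        int n = 0;
        for (long[] in : new long[][]{ a, b })
            for (long v : in)
                if (n < maxSamples && rnd.nextInt(total) < maxSamples)
                    out[n++] = v;
        return n == maxSamples ? out : Arrays.copyOf(out, n);
    }
}
{noformat}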
[jira] [Created] (CASSANDRA-7934) Remove FBUtilities.threadLocalRandom
Benedict created CASSANDRA-7934: --- Summary: Remove FBUtilities.threadLocalRandom Key: CASSANDRA-7934 URL: https://issues.apache.org/jira/browse/CASSANDRA-7934 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Benedict Assignee: Benedict Priority: Minor Fix For: 2.1.1 We should use ThreadLocalRandom.current() instead, as it is not only more standard but also considerably faster. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
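The replacement in question is a one-liner; a trivial illustration:
{noformat}
import java.util.concurrent.ThreadLocalRandom;

public class Jitter
{
    public static void main(String[] args)
    {
        // No shared Random instance and no ThreadLocal<Random> plumbing:
        // current() returns a cheap per-thread generator.
        int sleepMillis = ThreadLocalRandom.current().nextInt(100);
        System.out.println("sleeping " + sleepMillis + "ms");
    }
}
{noformat}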
[jira] [Updated] (CASSANDRA-7738) Permit CL overuse to be explicitly bounded
[ https://issues.apache.org/jira/browse/CASSANDRA-7738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benedict updated CASSANDRA-7738: Assignee: (was: Benedict) Permit CL overuse to be explicitly bounded -- Key: CASSANDRA-7738 URL: https://issues.apache.org/jira/browse/CASSANDRA-7738 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Benedict Priority: Minor Fix For: 3.0 As mentioned in CASSANDRA-7554, we do not currently offer any way to explicitly bound CL growth, which can be problematic in some scenarios (e.g. EC2 where the system drive is quite small). We should offer a configurable amount of headroom, beyond which we stop accepting writes until the backlog clears. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (CASSANDRA-7915) Waiting for sync on the commit log could happen after writing to memtable
[ https://issues.apache.org/jira/browse/CASSANDRA-7915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benedict resolved CASSANDRA-7915. - Resolution: Not a Problem We don't actually do very much work at all besides the BTree merge on apply, which we cannot help but delay until the last moment without risking the work being wasted. So there's very little work to save whilst maintaining the current behaviour wrt correctness. Waiting for sync on the commit log could happen after writing to memtable - Key: CASSANDRA-7915 URL: https://issues.apache.org/jira/browse/CASSANDRA-7915 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Branimir Lambov Priority: Minor Currently the sync wait is part of CommitLog.add, which is executed in full before any memtable write. The time for executing the latter is thus added on top of the time for file sync, which seems unnecessary. Moving the wait to a call at the end of Keyspace.apply should hide the memtable write time and may improve performance, especially for the batch sync strategy. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
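For reference, the reordering the ticket proposed, sketched with hypothetical interfaces standing in for the real CommitLog/Keyspace classes:
{noformat}
class Mutation {}

// Hypothetical interfaces, not the actual Cassandra API:
interface CommitLog
{
    long append(Mutation m);         // buffer the entry, return its position
    void awaitSyncAt(long position); // block until that position is durable
}

interface Memtable
{
    void apply(Mutation m);
}

class Writer
{
    void write(CommitLog log, Memtable memtable, Mutation m)
    {
        long position = log.append(m); // no sync wait here...
        memtable.apply(m);             // ...so this overlaps with the disk sync
        log.awaitSyncAt(position);     // only now block, before acking the client
    }
}
{noformat}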
[jira] [Assigned] (CASSANDRA-7705) Safer Resource Management
[ https://issues.apache.org/jira/browse/CASSANDRA-7705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benedict reassigned CASSANDRA-7705: --- Assignee: Benedict Safer Resource Management - Key: CASSANDRA-7705 URL: https://issues.apache.org/jira/browse/CASSANDRA-7705 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Benedict Assignee: Benedict Fix For: 3.0 We've had a spate of bugs recently with bad reference counting. These can have potentially dire consequences, generally either randomly deleting data or giving us infinite loops. Since in 2.1 we only reference count resources that are relatively expensive and infrequently managed (or in places where this safety is probably not as necessary, e.g. SerializingCache), we could without any negative consequences (and only slight code complexity) introduce a safer resource management scheme for these more expensive/infrequent actions. Basically, I propose that when we want to acquire a resource we allocate an object that manages the reference. This can only be released once; if it is released twice, we fail immediately at the second release, reporting where the bug is (rather than letting it continue fine until the next correct release corrupts the count). The reference counter remains the same, but we obtain guarantees that the reference count itself is never badly maintained, although code using it could mistakenly release its own handle early (typically this is only an issue when cleaning up after a failure, in which case under the new scheme this would be an innocuous error) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (CASSANDRA-7724) Native-Transport threads get stuck in StorageProxy.preparePaxos with no one making progress
[ https://issues.apache.org/jira/browse/CASSANDRA-7724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benedict updated CASSANDRA-7724: Attachment: aggregateddump.txt Attached is your thread dump reformatted to be easier to digest. Looking at it, there appears to be an incoming read on the IncomingTcpConnection, which is quite likely one of the prepare requests or responses being processed by the node, although it's hard to say for certain. There are multiple threads involved - the native transport requests start the work; however, the mutation stage processes the paxos messages on receipt, and the incoming/outbound tcp connections deliver those messages. It's still a bit funny that you can never see these threads live, though, in any of your dumps, and I would be interested in getting a few to double check. Either way, this raises the sensible prospect of optimising CAS when RF=1. Native-Transport threads get stuck in StorageProxy.preparePaxos with no one making progress --- Key: CASSANDRA-7724 URL: https://issues.apache.org/jira/browse/CASSANDRA-7724 Project: Cassandra Issue Type: Bug Components: Core Environment: Linux 3.13.11-4 #4 SMP PREEMPT x86_64 Intel(R) Core(TM) i7 CPU 950 @ 3.07GHz GenuineIntel java version 1.8.0_05 Java(TM) SE Runtime Environment (build 1.8.0_05-b13) Java HotSpot(TM) 64-Bit Server VM (build 25.5-b02, mixed mode) cassandra 2.0.9 Reporter: Anton Lebedevich Attachments: aggregateddump.txt, cassandra.threads2 We've got a lot of write timeouts (CAS) when running INSERT INTO cas_demo(pri_id, sec_id, flag, something) VALUES(?, ?, ?, ?) IF NOT EXISTS from 16 connections in parallel using the same pri_id and different sec_id. Doing the same from 4 connections in parallel works OK. All configuration values are at their default values. CREATE TABLE cas_demo ( pri_id varchar, sec_id varchar, flag boolean, something set<varchar>, PRIMARY KEY (pri_id, sec_id) ); CREATE INDEX cas_demo_flag ON cas_demo(flag); Full thread dump is attached. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (CASSANDRA-7937) Apply backpressure gently when overloaded with writes
[ https://issues.apache.org/jira/browse/CASSANDRA-7937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benedict updated CASSANDRA-7937: Labels: performance (was: ) Apply backpressure gently when overloaded with writes - Key: CASSANDRA-7937 URL: https://issues.apache.org/jira/browse/CASSANDRA-7937 Project: Cassandra Issue Type: Bug Components: Core Environment: Cassandra 2.0 Reporter: Piotr Kołaczkowski Labels: performance When writing huge amounts of data into a C* cluster from analytic tools like Hadoop or Apache Spark, we can see that often C* can't keep up with the load. This is because analytic tools typically write data as fast as they can in parallel, from many nodes, and they are not artificially rate-limited, so C* is the bottleneck here. Also, increasing the number of nodes doesn't really help, because in a co-located setup this also increases the number of Hadoop/Spark nodes (writers), and although the potential write performance is higher, the problem still remains. We observe the following behavior: 1. data is ingested at an extremely fast pace into memtables and the flush queue fills up 2. the available memory limit for memtables is reached and writes are no longer accepted 3. the application gets hit by write timeouts, and retries repeatedly, in vain 4. after several failed attempts to write, the job gets aborted Desired behaviour: 1. data is ingested at an extremely fast pace into memtables and the flush queue fills up 2. after exceeding some memtable fill threshold, C* applies rate limiting to writes - the more the buffers are filled up, the fewer writes/s are accepted, but writes still occur within the write timeout. 3. thanks to the slowed-down data ingestion, a flush can now happen before all the memory gets used Of course the details of how rate limiting could be done are up for discussion. It may also be worth considering putting such logic into the driver, not C* core, but then C* needs to expose at least the following information to the driver, so we could calculate the desired maximum data rate: 1. current amount of memory available for writes before they would completely block 2. total amount of data queued to be flushed and flush progress (amount of data to flush remaining for the memtable currently being flushed) 3. average flush write speed -- This message was sent by Atlassian JIRA (v6.3.4#6332)
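One possible shape for the fill-proportional throttling described above, sketched with Guava's RateLimiter (the thresholds and the memtable-accounting hook are assumptions, not C*'s actual implementation):
{noformat}
import com.google.common.util.concurrent.RateLimiter;

public class WriteThrottle
{
    private static final double MAX_WRITES_PER_SEC = 100_000;
    private final RateLimiter limiter = RateLimiter.create(MAX_WRITES_PER_SEC);

    // Called as memtable occupancy changes, e.g. after each allocation/flush.
    public void onMemtableFill(double fillRatio)
    {
        // Above 50% fill, scale permitted throughput down linearly,
        // keeping writes flowing (slowly) instead of timing out.
        double factor = fillRatio <= 0.5 ? 1.0 : Math.max(0.05, (1.0 - fillRatio) / 0.5);
        limiter.setRate(MAX_WRITES_PER_SEC * factor);
    }

    public void beforeWrite()
    {
        limiter.acquire(); // blocks briefly when overloaded
    }
}
{noformat}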
[jira] [Commented] (CASSANDRA-7937) Apply backpressure gently when overloaded with writes
[ https://issues.apache.org/jira/browse/CASSANDRA-7937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14133862#comment-14133862 ] Benedict commented on CASSANDRA-7937: - This should certainly be dealt with by the cluster. We cannot rely on well-behaved clients, and clients cannot easily calculate a safe data-rate cross-cluster, so any client change would at best help direct writes only, which with RF>1 is not much help. Nor could it be as responsive. My preferred solution to this is CASSANDRA-6812, which should keep the server responding to writes within the timeout window even as it blocks for lengthy flushes, but during these windows writes would be acked much more slowly, at a steady drip. This solution won't make it into 2.0 or 2.1, and possibly not even 3.0, though. Apply backpressure gently when overloaded with writes - Key: CASSANDRA-7937 URL: https://issues.apache.org/jira/browse/CASSANDRA-7937 Project: Cassandra Issue Type: Bug Components: Core Environment: Cassandra 2.0 Reporter: Piotr Kołaczkowski Labels: performance When writing huge amounts of data into a C* cluster from analytic tools like Hadoop or Apache Spark, we often see that C* can't keep up with the load. This is because analytic tools typically write data as fast as they can in parallel, from many nodes, and they are not artificially rate-limited, so C* is the bottleneck here. Also, increasing the number of nodes doesn't really help, because in a collocated setup this also increases the number of Hadoop/Spark nodes (writers), and although the possible write performance is higher, the problem still remains. We observe the following behavior: 1. data is ingested at an extremely fast pace into memtables and the flush queue fills up 2. the available memory limit for memtables is reached and writes are no longer accepted 3. the application gets hit by write timeouts, and retries repeatedly, in vain 4. after several failed attempts to write, the job gets aborted Desired behaviour: 1. data is ingested at an extremely fast pace into memtables and the flush queue fills up 2. after exceeding some memtable fill threshold, C* applies rate limiting to writes - the more the buffers fill up, the fewer writes/s are accepted, however writes still complete within the write timeout 3. thanks to the slowed-down data ingestion, flush can now happen before all the memory gets used Of course the details of how rate limiting could be done are up for discussion. It may also be worth considering putting such logic into the driver, not the C* core, but then C* needs to expose at least the following information to the driver, so we could calculate the desired maximum data rate: 1. the current amount of memory available for writes before they would completely block 2. the total amount of data queued to be flushed and flush progress (amount of data to flush remaining for the memtable currently being flushed) 3. the average flush write speed -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-3017) add a Message size limit
[ https://issues.apache.org/jira/browse/CASSANDRA-3017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14133863#comment-14133863 ] Benedict commented on CASSANDRA-3017: - This is definitely a good idea. At the same time I think it might be worth considering introducing an upper limit on either the total size of requests we've currently got in-flight for MessagingService, or the total number, or possibly both. Once the threshold is exceeded we stop consuming input from all IncomingTcpConnection(s). This is not dramatically different to our imposition of a max rpc count, but it stops a single server being overloaded through a hotspot of queries driven by non-token-aware clients (and also protects against a slight variant of the malicious oversized-payload attack). add a Message size limit Key: CASSANDRA-3017 URL: https://issues.apache.org/jira/browse/CASSANDRA-3017 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Jonathan Ellis Priority: Minor Labels: lhf Attachments: 0001-use-the-thrift-max-message-size-for-inter-node-messa.patch, trunk-3017.txt We protect the server from allocating huge buffers for malformed messages with the Thrift frame size (CASSANDRA-475). But we don't have similar protection for the inter-node Message objects. Adding this would be good to deal with malicious adversaries as well as a malfunctioning cluster participant. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
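A sketch of the in-flight limit described above, assuming per-message sizes are known at the point a connection reads a message (the class and names are hypothetical, not existing Cassandra code):
{code}
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.locks.LockSupport;

// A global budget of in-flight message bytes: connections park before reading the
// next message once the budget is exhausted, and release the bytes on completion.
public class InflightLimiter
{
    private final long maxBytes;
    private final AtomicLong inflight = new AtomicLong();

    public InflightLimiter(long maxBytes) { this.maxBytes = maxBytes; }

    public void acquire(long messageBytes)
    {
        while (inflight.addAndGet(messageBytes) > maxBytes)
        {
            inflight.addAndGet(-messageBytes); // back off and retry
            LockSupport.parkNanos(100000);     // stop consuming input briefly
        }
    }

    public void release(long messageBytes)
    {
        inflight.addAndGet(-messageBytes);
    }
}
{code}
Because the connection simply stops reading, TCP flow control pushes the backpressure to the sending node, rather than the receiver buffering unboundedly.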
[jira] [Updated] (CASSANDRA-7402) limit the on heap memory available to requests
[ https://issues.apache.org/jira/browse/CASSANDRA-7402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benedict updated CASSANDRA-7402: Labels: ops performance stability (was: ops) limit the on heap memory available to requests -- Key: CASSANDRA-7402 URL: https://issues.apache.org/jira/browse/CASSANDRA-7402 Project: Cassandra Issue Type: Improvement Reporter: T Jake Luciani Labels: ops, performance, stability Fix For: 3.0 When running a production cluster one common operational issue is quantifying GC pauses caused by ongoing requests. Since different queries return varying amounts of data, you can easily get yourself into a situation where a couple of bad actors in the system stop the world. Or, more likely, the aggregate garbage generated on a single node across all in-flight requests causes a GC. We should be able to set a limit on the max heap we can allocate to all outstanding requests, and track the garbage per request, to stop this from happening. It should increase a single node's availability substantially. In the yaml this would be {code} total_request_memory_space_mb: 400 {code} It would also be nice to have a log of the queries which generate the most garbage, so operators can track this; a histogram would be useful too. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
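A minimal sketch of how such a limit could behave, assuming requests can estimate their heap footprint up front (the class name, the estimate, and the timeout are illustrative assumptions):
{code}
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Requests reserve an estimated number of heap bytes before executing and fail
// fast, rather than GC-thrashing, once the global budget is exhausted.
public class RequestMemoryLimiter
{
    private final Semaphore permits;

    public RequestMemoryLimiter(int totalRequestMemoryMb)
    {
        this.permits = new Semaphore(totalRequestMemoryMb * 1024 * 1024);
    }

    public boolean tryReserve(int estimatedBytes, long timeoutMs) throws InterruptedException
    {
        return permits.tryAcquire(estimatedBytes, timeoutMs, TimeUnit.MILLISECONDS);
    }

    public void release(int estimatedBytes)
    {
        permits.release(estimatedBytes);
    }
}
{code}
Failing the reservation quickly is what delivers the availability win described above: the node sheds load before the garbage accumulates, instead of stopping the world for everyone.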
[jira] [Commented] (CASSANDRA-7916) Stress should collect and cross-cluster GC statistics
[ https://issues.apache.org/jira/browse/CASSANDRA-7916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14134042#comment-14134042 ] Benedict commented on CASSANDRA-7916: - ninja fixed Stress should collect and cross-cluster GC statistics - Key: CASSANDRA-7916 URL: https://issues.apache.org/jira/browse/CASSANDRA-7916 Project: Cassandra Issue Type: Improvement Components: Tools Reporter: Benedict Assignee: Benedict Priority: Minor Fix For: 2.1.1 It would be useful to see stress outputs deliver cross-cluster statistics, the most useful being GC data. Some simple changes to GCInspector collect the data, and can deliver to a nodetool request or to stress over JMX. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
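For reference, the JVM already exposes the raw numbers needed here through the standard GC MXBeans; a minimal in-process sketch (a remote client such as stress would read the same beans over a JMX connection):
{code}
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Periodically sample cumulative GC counters and report the deltas.
public class GcSampler
{
    private long lastCount, lastTimeMs;

    public void sample()
    {
        long count = 0, timeMs = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans())
        {
            count += gc.getCollectionCount();
            timeMs += gc.getCollectionTime();
        }
        System.out.printf("gc: +%d collections, +%d ms since last sample%n",
                          count - lastCount, timeMs - lastTimeMs);
        lastCount = count;
        lastTimeMs = timeMs;
    }
}
{code}
Summing these deltas across every node in the cluster for each stress interval is essentially the statistic the ticket describes.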
[jira] [Commented] (CASSANDRA-7716) cassandra-stress: provide better error messages
[ https://issues.apache.org/jira/browse/CASSANDRA-7716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14134049#comment-14134049 ] Benedict commented on CASSANDRA-7716: - I think the old behaviour of StressProfile.select() makes more sense...? If you've specified a property key, but no value, that should probably fail rather than using the default, don't you think? Otherwise LGTM; no really strong feeling on that issue. cassandra-stress: provide better error messages --- Key: CASSANDRA-7716 URL: https://issues.apache.org/jira/browse/CASSANDRA-7716 Project: Cassandra Issue Type: Improvement Reporter: Robert Stupp Assignee: T Jake Luciani Priority: Trivial Fix For: 2.1.1 Attachments: 7166v2.txt, 7716.txt Just tried the new stress tool. It would be great if the stress tool gave better error messages by telling the user what option or config parameter/value caused an error. YAML parse errors are meaningful (they give code snippets etc.). Examples are:
{noformat}
WARN 16:59:39 Setting caching options with deprecated syntax.
Exception in thread "main" java.lang.NullPointerException
	at java.util.regex.Matcher.getTextLength(Matcher.java:1234)
	at java.util.regex.Matcher.reset(Matcher.java:308)
	at java.util.regex.Matcher.<init>(Matcher.java:228)
	at java.util.regex.Pattern.matcher(Pattern.java:1088)
	at org.apache.cassandra.stress.settings.OptionDistribution.get(OptionDistribution.java:67)
	at org.apache.cassandra.stress.StressProfile.init(StressProfile.java:151)
	at org.apache.cassandra.stress.StressProfile.load(StressProfile.java:482)
	at org.apache.cassandra.stress.settings.SettingsCommandUser.init(SettingsCommandUser.java:53)
	at org.apache.cassandra.stress.settings.SettingsCommandUser.build(SettingsCommandUser.java:114)
	at org.apache.cassandra.stress.settings.SettingsCommand.get(SettingsCommand.java:134)
	at org.apache.cassandra.stress.settings.StressSettings.get(StressSettings.java:218)
	at org.apache.cassandra.stress.settings.StressSettings.parse(StressSettings.java:206)
	at org.apache.cassandra.stress.Stress.main(Stress.java:58)
{noformat}
When the table definition is wrong:
{noformat}
Exception in thread "main" java.lang.RuntimeException: org.apache.cassandra.exceptions.SyntaxException: line 6:14 mismatched input '(' expecting ')'
	at org.apache.cassandra.config.CFMetaData.compile(CFMetaData.java:550)
	at org.apache.cassandra.stress.StressProfile.init(StressProfile.java:134)
	at org.apache.cassandra.stress.StressProfile.load(StressProfile.java:482)
	at org.apache.cassandra.stress.settings.SettingsCommandUser.init(SettingsCommandUser.java:53)
	at org.apache.cassandra.stress.settings.SettingsCommandUser.build(SettingsCommandUser.java:114)
	at org.apache.cassandra.stress.settings.SettingsCommand.get(SettingsCommand.java:134)
	at org.apache.cassandra.stress.settings.StressSettings.get(StressSettings.java:218)
	at org.apache.cassandra.stress.settings.StressSettings.parse(StressSettings.java:206)
	at org.apache.cassandra.stress.Stress.main(Stress.java:58)
Caused by: org.apache.cassandra.exceptions.SyntaxException: line 6:14 mismatched input '(' expecting ')'
	at org.apache.cassandra.cql3.CqlParser.throwLastRecognitionError(CqlParser.java:273)
	at org.apache.cassandra.cql3.QueryProcessor.parseStatement(QueryProcessor.java:456)
	at org.apache.cassandra.config.CFMetaData.compile(CFMetaData.java:541)
	... 8 more
{noformat}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-7716) cassandra-stress: provide better error messages
[ https://issues.apache.org/jira/browse/CASSANDRA-7716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14134055#comment-14134055 ] Benedict commented on CASSANDRA-7716: - I figure the default applies only if it's not provided; if it's provided but empty they perhaps messed up. But either is totally reasonable, so I'm not going to quibble. cassandra-stress: provide better error messages --- Key: CASSANDRA-7716 URL: https://issues.apache.org/jira/browse/CASSANDRA-7716 Project: Cassandra Issue Type: Improvement Reporter: Robert Stupp Assignee: T Jake Luciani Priority: Trivial Fix For: 2.1.1 Attachments: 7166v2.txt, 7716.txt Just tried the new stress tool. It would be great if the stress tool gave better error messages by telling the user what option or config parameter/value caused an error. YAML parse errors are meaningful (they give code snippets etc.). Examples are:
{noformat}
WARN 16:59:39 Setting caching options with deprecated syntax.
Exception in thread "main" java.lang.NullPointerException
	at java.util.regex.Matcher.getTextLength(Matcher.java:1234)
	at java.util.regex.Matcher.reset(Matcher.java:308)
	at java.util.regex.Matcher.<init>(Matcher.java:228)
	at java.util.regex.Pattern.matcher(Pattern.java:1088)
	at org.apache.cassandra.stress.settings.OptionDistribution.get(OptionDistribution.java:67)
	at org.apache.cassandra.stress.StressProfile.init(StressProfile.java:151)
	at org.apache.cassandra.stress.StressProfile.load(StressProfile.java:482)
	at org.apache.cassandra.stress.settings.SettingsCommandUser.init(SettingsCommandUser.java:53)
	at org.apache.cassandra.stress.settings.SettingsCommandUser.build(SettingsCommandUser.java:114)
	at org.apache.cassandra.stress.settings.SettingsCommand.get(SettingsCommand.java:134)
	at org.apache.cassandra.stress.settings.StressSettings.get(StressSettings.java:218)
	at org.apache.cassandra.stress.settings.StressSettings.parse(StressSettings.java:206)
	at org.apache.cassandra.stress.Stress.main(Stress.java:58)
{noformat}
When the table definition is wrong:
{noformat}
Exception in thread "main" java.lang.RuntimeException: org.apache.cassandra.exceptions.SyntaxException: line 6:14 mismatched input '(' expecting ')'
	at org.apache.cassandra.config.CFMetaData.compile(CFMetaData.java:550)
	at org.apache.cassandra.stress.StressProfile.init(StressProfile.java:134)
	at org.apache.cassandra.stress.StressProfile.load(StressProfile.java:482)
	at org.apache.cassandra.stress.settings.SettingsCommandUser.init(SettingsCommandUser.java:53)
	at org.apache.cassandra.stress.settings.SettingsCommandUser.build(SettingsCommandUser.java:114)
	at org.apache.cassandra.stress.settings.SettingsCommand.get(SettingsCommand.java:134)
	at org.apache.cassandra.stress.settings.StressSettings.get(StressSettings.java:218)
	at org.apache.cassandra.stress.settings.StressSettings.parse(StressSettings.java:206)
	at org.apache.cassandra.stress.Stress.main(Stress.java:58)
Caused by: org.apache.cassandra.exceptions.SyntaxException: line 6:14 mismatched input '(' expecting ')'
	at org.apache.cassandra.cql3.CqlParser.throwLastRecognitionError(CqlParser.java:273)
	at org.apache.cassandra.cql3.QueryProcessor.parseStatement(QueryProcessor.java:456)
	at org.apache.cassandra.config.CFMetaData.compile(CFMetaData.java:541)
	... 8 more
{noformat}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
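A sketch of the fail-fast behaviour being discussed, assuming the yaml has been parsed into a map (class and method names are hypothetical): a key that is absent means "use the default", while a key that is present but empty is treated as a user error.
{code}
import java.util.Map;

public final class Props
{
    public static String get(Map<String, String> yaml, String key, String defaultValue)
    {
        if (!yaml.containsKey(key))
            return defaultValue; // absent: falling back to the default is intended
        String value = yaml.get(key);
        if (value == null || value.trim().isEmpty())
            throw new IllegalArgumentException(
                "Property '" + key + "' is present but has no value; " +
                "remove it to use the default (" + defaultValue + ") or supply a value");
        return value;
    }
}
{code}
Either convention is defensible, as the comments above say; the value of this one is purely that the error message names the offending key, which is what the ticket asks for.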
[jira] [Commented] (CASSANDRA-7402) Add metrics to track memory used by client requests
[ https://issues.apache.org/jira/browse/CASSANDRA-7402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14134071#comment-14134071 ] Benedict commented on CASSANDRA-7402: - I'm not convinced tracking the per-client/per-query statistics is likely to be very viable. Once queries cross the MS threshold the information isn't available to us, and making it available could be costly. We could probably serialize the prepared statement id over the wire, and wire that up as the data is requested in nodetool, say, by attempting to locate a server with the statement. I think tracking and reporting this data in this manner should be a separate ticket from constraining it, however; the latter is a much more concretely beneficial and achievable goal. Add metrics to track memory used by client requests --- Key: CASSANDRA-7402 URL: https://issues.apache.org/jira/browse/CASSANDRA-7402 Project: Cassandra Issue Type: Improvement Reporter: T Jake Luciani Labels: ops, performance, stability Fix For: 3.0 When running a production cluster one common operational issue is quantifying GC pauses caused by ongoing requests. Since different queries return varying amounts of data, you can easily get yourself into a situation where a couple of bad actors in the system stop the world. Or, more likely, the aggregate garbage generated on a single node across all in-flight requests causes a GC. We should be able to set a limit on the max heap we can allocate to all outstanding requests, and track the garbage per request, to stop this from happening. It should increase a single node's availability substantially. In the yaml this would be {code} total_request_memory_space_mb: 400 {code} It would also be nice to have a log of the queries which generate the most garbage, so operators can track this; a histogram would be useful too. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-7402) Add metrics to track memory used by client requests
[ https://issues.apache.org/jira/browse/CASSANDRA-7402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14134076#comment-14134076 ] Benedict commented on CASSANDRA-7402: - It's the queries we cannot easily track, aggregated or not. At least, not those we receive over MessagingService; not easily and cheaply, at any rate. Add metrics to track memory used by client requests --- Key: CASSANDRA-7402 URL: https://issues.apache.org/jira/browse/CASSANDRA-7402 Project: Cassandra Issue Type: Improvement Reporter: T Jake Luciani Labels: ops, performance, stability Fix For: 3.0 When running a production cluster one common operational issue is quantifying GC pauses caused by ongoing requests. Since different queries return varying amounts of data, you can easily get yourself into a situation where a couple of bad actors in the system stop the world. Or, more likely, the aggregate garbage generated on a single node across all in-flight requests causes a GC. We should be able to set a limit on the max heap we can allocate to all outstanding requests, and track the garbage per request, to stop this from happening. It should increase a single node's availability substantially. In the yaml this would be {code} total_request_memory_space_mb: 400 {code} It would also be nice to have a log of the queries which generate the most garbage, so operators can track this; a histogram would be useful too. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-7402) Add metrics to track memory used by client requests
[ https://issues.apache.org/jira/browse/CASSANDRA-7402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14134093#comment-14134093 ] Benedict commented on CASSANDRA-7402: - Makes sense. If we're going to be grabbing the data cross-cluster (necessary for delivering per-node stats), then tracking based on statement id would also be sufficient, since we could populate the map in nodetool from the whole cluster, so users can drill down into hotspot nodes. Add metrics to track memory used by client requests --- Key: CASSANDRA-7402 URL: https://issues.apache.org/jira/browse/CASSANDRA-7402 Project: Cassandra Issue Type: Improvement Reporter: T Jake Luciani Labels: ops, performance, stability Fix For: 3.0 When running a production cluster one common operational issue is quantifying GC pauses caused by ongoing requests. Since different queries return varying amounts of data, you can easily get yourself into a situation where a couple of bad actors in the system stop the world. Or, more likely, the aggregate garbage generated on a single node across all in-flight requests causes a GC. We should be able to set a limit on the max heap we can allocate to all outstanding requests, and track the garbage per request, to stop this from happening. It should increase a single node's availability substantially. In the yaml this would be {code} total_request_memory_space_mb: 400 {code} It would also be nice to have a log of the queries which generate the most garbage, so operators can track this; a histogram would be useful too. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
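A minimal sketch of the per-statement-id aggregation this would imply, assuming an allocation estimate is available at collection time (the class name and the estimate are illustrative assumptions):
{code}
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Aggregate allocation estimates per prepared-statement id, so a tool like
// nodetool could pull this map from every node and rank the worst offenders.
public class StatementMemoryTracker
{
    private final ConcurrentHashMap<UUID, AtomicLong> bytesByStatement = new ConcurrentHashMap<>();

    public void record(UUID statementId, long estimatedBytes)
    {
        AtomicLong counter = bytesByStatement.get(statementId);
        if (counter == null)
        {
            AtomicLong fresh = new AtomicLong();
            counter = bytesByStatement.putIfAbsent(statementId, fresh);
            if (counter == null)
                counter = fresh;
        }
        counter.addAndGet(estimatedBytes);
    }

    public ConcurrentHashMap<UUID, AtomicLong> snapshot()
    {
        return bytesByStatement;
    }
}
{code}
Keying on the statement id rather than the query text is what makes cross-cluster aggregation cheap: ids are small, stable, and can be resolved back to CQL on any node that has the statement prepared.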
[jira] [Commented] (CASSANDRA-7938) Releases prior to 2.0 gratuitously invalidate buffer cache
[ https://issues.apache.org/jira/browse/CASSANDRA-7938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14134334#comment-14134334 ] Benedict commented on CASSANDRA-7938: - Also file streaming. Releases prior to 2.0 gratuitously invalidate buffer cache -- Key: CASSANDRA-7938 URL: https://issues.apache.org/jira/browse/CASSANDRA-7938 Project: Cassandra Issue Type: Bug Components: Core Reporter: Matt Stump Fix For: 1.2.19 RandomAccessReader gratuitously invalidates the buffer cache in releases prior to 2.0. Additionally, Linux 3.X kernels spend 30% of CPU time in bookkeeping for the invalidated pages, as captured by CPU flame graphs. fadvise DONT_NEED should never be called for files other than the commit log segments. https://github.com/apache/cassandra/blob/cassandra-2.1/src/java/org/apache/cassandra/io/util/RandomAccessReader.java#L168 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
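For illustration, a sketch of restricting the invalidation to commit log segments, assuming JNA for the native call; the class name and the guard are hypothetical, the flag value is Linux's, and this is not the shape of the actual RandomAccessReader fix:
{code}
import com.sun.jna.Native;

public final class PageCache
{
    static { Native.register("c"); }

    private static final int POSIX_FADV_DONTNEED = 4; // value on Linux

    private static native int posix_fadvise(int fd, long offset, long len, int advice);

    public static void dropCache(int fd, long offset, long len, boolean isCommitLogSegment)
    {
        // never invalidate the buffer cache for sstable reads; re-reading hot data
        // pages is exactly what the page cache is for
        if (!isCommitLogSegment)
            return;
        posix_fadvise(fd, offset, len, POSIX_FADV_DONTNEED);
    }
}
{code}
Commit log segments are write-once and never re-read in normal operation, which is why dropping their pages is cheap while dropping data file pages is not.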
[jira] [Commented] (CASSANDRA-7928) Digest queries do not require alder32 checks
[ https://issues.apache.org/jira/browse/CASSANDRA-7928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14134612#comment-14134612 ] Benedict commented on CASSANDRA-7928: - Regrettably this is very plausible, and adds credence to CASSANDRA-7130, which we should consider reopening. This ticket is also a sensible idea to help mitigate the issue. I knocked up a quick benchmark; the results show lz4 being consistently at least twice as fast. It's actually quite easy to explain: if the data is compressed, there is less data to operate over; if it is not easily compressed (say, it is highly random), it degrades itself to a simple copy to avoid wasting work (as demonstrated in the benchmark - it's 5 times faster over completely random data than over partially random data).
{noformat}
Benchmark            (duplicateLookback)  (pageSize)  (randomRatio)  (randomRunLength)  (uniquePages)   Mode  Samples    Score  Score error  Units
Compression.adler32               4..128       65536              0              4..16           8192  thrpt        5   16.476        1.954  ops/ms
Compression.adler32               4..128       65536              0           128..512           8192  thrpt        5   16.720        0.230  ops/ms
Compression.adler32               4..128       65536            0.1              4..16           8192  thrpt        5   16.269        2.118  ops/ms
Compression.adler32               4..128       65536            0.1           128..512           8192  thrpt        5   16.665        0.246  ops/ms
Compression.adler32               4..128       65536            1.0              4..16           8192  thrpt        5   16.653        0.147  ops/ms
Compression.adler32               4..128       65536            1.0           128..512           8192  thrpt        5   16.686        0.214  ops/ms
Compression.lz4                   4..128       65536              0              4..16           8192  thrpt        5   28.275        0.265  ops/ms
Compression.lz4                   4..128       65536              0           128..512           8192  thrpt        5  232.602       48.279  ops/ms
Compression.lz4                   4..128       65536            0.1              4..16           8192  thrpt        5   34.081        0.337  ops/ms
Compression.lz4                   4..128       65536            0.1           128..512           8192  thrpt        5  130.857       18.157  ops/ms
Compression.lz4                   4..128       65536            1.0              4..16           8192  thrpt        5  187.992        9.190  ops/ms
Compression.lz4                   4..128       65536            1.0           128..512           8192  thrpt        5  186.054        2.267  ops/ms
{noformat}
Digest queries do not require alder32 checks Key: CASSANDRA-7928 URL: https://issues.apache.org/jira/browse/CASSANDRA-7928 Project: Cassandra Issue Type: Improvement Reporter: sankalp kohli Priority: Minor While reading data from sstables, C* does Adler32 checks for any data being read. We have seen, while doing kernel profiling, that this causes higher CPU usage. These checks might not be useful for digest queries, as they will have a different digest in case of corruption. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-7928) Digest queries do not require alder32 checks
[ https://issues.apache.org/jira/browse/CASSANDRA-7928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14134625#comment-14134625 ] Benedict commented on CASSANDRA-7928: - For reference, I have uploaded the benchmark [here|https://github.com/belliottsmith/bench] Digest queries do not require alder32 checks Key: CASSANDRA-7928 URL: https://issues.apache.org/jira/browse/CASSANDRA-7928 Project: Cassandra Issue Type: Improvement Reporter: sankalp kohli Priority: Minor While reading data from sstables, C* does Adler32 checks for any data being read. We have seen, while doing kernel profiling, that this causes higher CPU usage. These checks might not be useful for digest queries, as they will have a different digest in case of corruption. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
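The gist of the comparison, as a rough standalone sketch; the linked repository is the authoritative benchmark, and this version assumes the lz4-java xxhash classes and a single 64KB random page rather than the JMH harness and parameter matrix above:
{code}
import java.util.Random;
import java.util.zip.Adler32;
import net.jpountz.xxhash.XXHash32;
import net.jpountz.xxhash.XXHashFactory;

public class ChecksumBench
{
    public static void main(String[] args)
    {
        byte[] page = new byte[65536];
        new Random(42).nextBytes(page);
        XXHash32 xxhash = XXHashFactory.fastestInstance().hash32();
        Adler32 adler = new Adler32();
        long sink = 0;
        for (int iter = 0; iter < 5; iter++)
        {
            long start = System.nanoTime();
            for (int i = 0; i < 10000; i++)
                sink += xxhash.hash(page, 0, page.length, 0);
            long xxNanos = System.nanoTime() - start;

            start = System.nanoTime();
            for (int i = 0; i < 10000; i++)
            {
                adler.reset();
                adler.update(page, 0, page.length);
                sink += adler.getValue();
            }
            long adlerNanos = System.nanoTime() - start;
            System.out.printf("xxhash: %dms, adler32: %dms%n", xxNanos / 1000000, adlerNanos / 1000000);
        }
        System.out.println("(" + sink + ")"); // keep the JIT from eliding the loops
    }
}
{code}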
[jira] [Reopened] (CASSANDRA-7130) Make sstable checksum type configurable and optional
[ https://issues.apache.org/jira/browse/CASSANDRA-7130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benedict reopened CASSANDRA-7130: - Assignee: (was: Benedict) As discussed on CASSANDRA-7928, this is probably worth revisiting, especially as it isn't too difficult. Probably worth delaying until next-gen storage is out the gate though. Make sstable checksum type configurable and optional Key: CASSANDRA-7130 URL: https://issues.apache.org/jira/browse/CASSANDRA-7130 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Benedict Priority: Minor Labels: performance Fix For: 3.0 A lot of our users are becoming bottlenecked on CPU rather than IO, and whilst Adler32 is faster than CRC, it isn't anything like as fast as xxhash (used by LZ4), which can push Gb/s. I propose making the checksum type configurable so that users who want speed can shift to xxhash, and those who want security can use Adler or CRC. It's worth noting that at some point in the future (JDK8?) optimised implementations using latest intel crc instructions will be added, though it's not clear from the mailing list discussion if/when that will materialise: http://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2013-May/010775.html -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-7546) AtomicSortedColumns.addAllWithSizeDelta has a spin loop that allocates memory
[ https://issues.apache.org/jira/browse/CASSANDRA-7546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14134674#comment-14134674 ] Benedict commented on CASSANDRA-7546: - Hi Graham, I must admit I'm a bit confused, and it's partially self-inflicted. In 2.1.1 we have changed stress again from what we released in 2.1.0, and I can't tell which version you're referring to, though it seems 2.1.1. Neither version has a 'visits' property in the yaml, but 2.1.1 supports -insert visits= revisit=, which are certainly functions worth exploring, and I recommend you use 2.1.1 for stress functionality either way. As far as using these functions is concerned, 'visits' splits a wide row up into multiple inserts; if a visits value of 10 is produced, and there are on average 100 rows generated for the partition, approximately 10 rows will be inserted, then the state of the partition will be stashed away, and the next insert that operates on that partition will pick up where the previous one left off. Which partition is visited next is decided by the 'revisit' distribution, which selects from the stash of partially completed inserts, with a value of 1 selecting the most recently stashed (the max value of this distribution defines the total number of partitions to stash); if it ever selects outside of the current stash, a new partition is generated instead. So the value for 'visits' is related to the number of unique clustering columns you generate for each partition, whereas the value for 'revisit' is determined by how diverse the data you operate over in a given time window is. Separately, it's worth mentioning that offheap_objects is likely a better choice than offheap_buffers, since it is considerably more memory-dense. AtomicSortedColumns.addAllWithSizeDelta has a spin loop that allocates memory - Key: CASSANDRA-7546 URL: https://issues.apache.org/jira/browse/CASSANDRA-7546 Project: Cassandra Issue Type: Bug Components: Core Reporter: graham sanderson Assignee: graham sanderson Fix For: 2.1.1 Attachments: 7546.20.txt, 7546.20_2.txt, 7546.20_3.txt, 7546.20_4.txt, 7546.20_5.txt, 7546.20_6.txt, 7546.20_7.txt, 7546.20_7b.txt, 7546.20_alt.txt, 7546.20_async.txt, 7546.21_v1.txt, hint_spikes.png, suggestion1.txt, suggestion1_21.txt, young_gen_gc.png In order to preserve atomicity, this code attempts to read, clone/update, then CAS the state of the partition. Under heavy contention for updating a single partition this can cause some fairly staggering memory growth (the more cores on your machine the worse it gets). Whilst many usage patterns don't do highly concurrent updates to the same partition, hinting today does, and in this case wild (order(s) of magnitude more than expected) memory allocation rates can be seen (especially when the updates being hinted are small updates to different partitions, which can happen very fast on their own) - see CASSANDRA-7545. It would be best to eliminate/reduce/limit the spinning memory allocation whilst not slowing down the very common un-contended case. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-7928) Digest queries do not require alder32 checks
[ https://issues.apache.org/jira/browse/CASSANDRA-7928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14134680#comment-14134680 ] Benedict commented on CASSANDRA-7928: - Agreed Digest queries do not require alder32 checks Key: CASSANDRA-7928 URL: https://issues.apache.org/jira/browse/CASSANDRA-7928 Project: Cassandra Issue Type: Improvement Reporter: sankalp kohli Priority: Minor Fix For: 2.1.1 While reading data from sstables, C* does Adler32 checks for any data being read. We have seen, while doing kernel profiling, that this causes higher CPU usage. These checks might not be useful for digest queries, as they will have a different digest in case of corruption. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (CASSANDRA-7928) Digest queries do not require alder32 checks
[ https://issues.apache.org/jira/browse/CASSANDRA-7928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benedict updated CASSANDRA-7928: Labels: performance (was: ) Digest queries do not require alder32 checks Key: CASSANDRA-7928 URL: https://issues.apache.org/jira/browse/CASSANDRA-7928 Project: Cassandra Issue Type: Improvement Reporter: sankalp kohli Priority: Minor Labels: performance Fix For: 2.1.1 While reading data from sstables, C* does Adler32 checks for any data being read. We have seen, while doing kernel profiling, that this causes higher CPU usage. These checks might not be useful for digest queries, as they will have a different digest in case of corruption. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (CASSANDRA-7928) Digest queries do not require alder32 checks
[ https://issues.apache.org/jira/browse/CASSANDRA-7928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benedict updated CASSANDRA-7928: Assignee: T Jake Luciani Digest queries do not require alder32 checks Key: CASSANDRA-7928 URL: https://issues.apache.org/jira/browse/CASSANDRA-7928 Project: Cassandra Issue Type: Improvement Reporter: sankalp kohli Assignee: T Jake Luciani Priority: Minor Labels: performance Fix For: 2.1.1 While reading data from sstables, C* does Adler32 checks for any data being read. We have seen, while doing kernel profiling, that this causes higher CPU usage. These checks might not be useful for digest queries, as they will have a different digest in case of corruption. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-7546) AtomicSortedColumns.addAllWithSizeDelta has a spin loop that allocates memory
[ https://issues.apache.org/jira/browse/CASSANDRA-7546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14135004#comment-14135004 ] Benedict commented on CASSANDRA-7546: - 1: that's great news :) 3: if you want lots of unique clustering key values per partition, currently stress has some limitations and you will need/want to have multiple clustering columns for it to be able to generate that smoothly without taking donkey's years per insert (on the workload generation side). Its minimum unit of generation (not insert) is a single tier of clustering values, so it would generate all 100B values each time you wanted to insert any number with your spec. So, you want to consider a yaml like this:
{noformat}
table_definition: |
  CREATE TABLE testtable (
        p text,
        c1 int,
        c2 int,
        c3 int,
        v blob,
        PRIMARY KEY(p, c1, c2, c3)
  ) WITH COMPACT STORAGE
    AND compaction = { 'class':'LeveledCompactionStrategy' }
    AND comment='TestTable'

columnspec:
  - name: p
    size: fixed(16)
  - name: c1
    cluster: fixed(100)
  - name: c2
    cluster: fixed(100)
  - name: c3
    cluster: fixed(100)
  - name: v
    size: gaussian(50..250)
{noformat}
Then you want to consider passing -pop seq=1..1M -insert visits=fixed(1M) revisit=uniform(1..1024). The visits parameter here tells stress to split each partition into 1M distinct inserts, which given its deterministic 1M keys means exactly 1 item inserted each visit. The revisit distribution defines the number of partition keys we will operate over until we exhaust one before selecting another to include in our working set. Notice I've removed the population spec from your partition key in the yaml. This is because it is not necessary to constrain it here, as you can constrain the _seed_ population with the -pop parameter, which is the better way to do it here (so you can use the same yaml across runs). However, in this case given our revisit() distribution we can also not constrain the seed population, since once our first 1024 have been generated no other PK will be visited until one of these has been fully exhausted (i.e. 1024 * 1M inserts, quite a few...). You may also constrain the seed to the same range, which once a key is exhausted would always result in filling that key back into the working set. It doesn't matter what distribution you choose in this case, since it will keep generating a value until one not present in the stash crops up, which if they operate over the same domain can only result in 1 item regardless of distribution, so I suggest a sequential distribution to ensure determinism. AtomicSortedColumns.addAllWithSizeDelta has a spin loop that allocates memory - Key: CASSANDRA-7546 URL: https://issues.apache.org/jira/browse/CASSANDRA-7546 Project: Cassandra Issue Type: Bug Components: Core Reporter: graham sanderson Assignee: graham sanderson Fix For: 2.1.1 Attachments: 7546.20.txt, 7546.20_2.txt, 7546.20_3.txt, 7546.20_4.txt, 7546.20_5.txt, 7546.20_6.txt, 7546.20_7.txt, 7546.20_7b.txt, 7546.20_alt.txt, 7546.20_async.txt, 7546.21_v1.txt, hint_spikes.png, suggestion1.txt, suggestion1_21.txt, young_gen_gc.png In order to preserve atomicity, this code attempts to read, clone/update, then CAS the state of the partition. Under heavy contention for updating a single partition this can cause some fairly staggering memory growth (the more cores on your machine the worse it gets).
Whilst many usage patterns don't do highly concurrent updates to the same partition, hinting today does, and in this case wild (order(s) of magnitude more than expected) memory allocation rates can be seen (especially when the updates being hinted are small updates to different partitions, which can happen very fast on their own) - see CASSANDRA-7545. It would be best to eliminate/reduce/limit the spinning memory allocation whilst not slowing down the very common un-contended case. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (CASSANDRA-7784) DROP table leaves the counter and row cache in a temporarily inconsistent state that, if saved during, will cause an exception to be thrown
[ https://issues.apache.org/jira/browse/CASSANDRA-7784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benedict updated CASSANDRA-7784: Reviewer: Benedict DROP table leaves the counter and row cache in a temporarily inconsistent state that, if saved during, will cause an exception to be thrown --- Key: CASSANDRA-7784 URL: https://issues.apache.org/jira/browse/CASSANDRA-7784 Project: Cassandra Issue Type: Bug Reporter: Benedict Assignee: Aleksey Yeschenko Priority: Minor Attachments: 7784.txt It looks like this is quite a realistic race to hit reasonably often, since we forceBlockingFlush after removing from Schema.cfIdMap, so there could be a lengthy window to overlap with an auto-save -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-7784) DROP table leaves the counter and row cache in a temporarily inconsistent state that, if saved during, will cause an exception to be thrown
[ https://issues.apache.org/jira/browse/CASSANDRA-7784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14135082#comment-14135082 ] Benedict commented on CASSANDRA-7784: - Mostly LGTM; in fact it seems like an obvious improvement all round. But I'd either remove the @Deprecated method entirely, or not deprecate it. Since it's only used in one very cheap-to-replace place, I'd simply drop it. DROP table leaves the counter and row cache in a temporarily inconsistent state that, if saved during, will cause an exception to be thrown --- Key: CASSANDRA-7784 URL: https://issues.apache.org/jira/browse/CASSANDRA-7784 Project: Cassandra Issue Type: Bug Reporter: Benedict Assignee: Aleksey Yeschenko Priority: Minor Attachments: 7784.txt It looks like this is quite a realistic race to hit reasonably often, since we forceBlockingFlush after removing from Schema.cfIdMap, so there could be a lengthy window to overlap with an auto-save -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-7784) DROP table leaves the counter and row cache in a temporarily inconsistent state that, if saved during, will cause an exception to be thrown
[ https://issues.apache.org/jira/browse/CASSANDRA-7784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14135100#comment-14135100 ] Benedict commented on CASSANDRA-7784: - Ah, yes, I see: the confusing part is that it hasn't been replaced in a place where it could have been, but the second usage is a place where it _couldn't_ be. It would be clearer if the first usage were replaced, so it was more obvious. Possibly also worth dropping the UUID from the parameter list. But no biggy. DROP table leaves the counter and row cache in a temporarily inconsistent state that, if saved during, will cause an exception to be thrown --- Key: CASSANDRA-7784 URL: https://issues.apache.org/jira/browse/CASSANDRA-7784 Project: Cassandra Issue Type: Bug Reporter: Benedict Assignee: Aleksey Yeschenko Priority: Minor Fix For: 2.1.1 Attachments: 7784.txt It looks like this is quite a realistic race to hit reasonably often, since we forceBlockingFlush after removing from Schema.cfIdMap, so there could be a lengthy window to overlap with an auto-save -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-7928) Digest queries do not require alder32 checks
[ https://issues.apache.org/jira/browse/CASSANDRA-7928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14135348#comment-14135348 ] Benedict commented on CASSANDRA-7928: - WFM :) Digest queries do not require alder32 checks Key: CASSANDRA-7928 URL: https://issues.apache.org/jira/browse/CASSANDRA-7928 Project: Cassandra Issue Type: Improvement Reporter: sankalp kohli Assignee: sankalp kohli Priority: Minor Labels: performance Fix For: 2.1.1 While reading data from sstables, C* does Adler32 checks for any data being read. We have seen, while doing kernel profiling, that this causes higher CPU usage. These checks might not be useful for digest queries, as they will have a different digest in case of corruption. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (CASSANDRA-7519) Further stress improvements to generate more realistic workloads
[ https://issues.apache.org/jira/browse/CASSANDRA-7519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benedict updated CASSANDRA-7519: Labels: docs tools (was: tools) Further stress improvements to generate more realistic workloads Key: CASSANDRA-7519 URL: https://issues.apache.org/jira/browse/CASSANDRA-7519 Project: Cassandra Issue Type: Improvement Components: Tools Reporter: Benedict Assignee: Benedict Priority: Minor Labels: docs, tools Fix For: 2.1.1 We generally believe that the most common workload is for reads to exponentially prefer the most recently written data. However, as stress currently behaves, we have two id generation modes: sequential and random (although random can be distributed). I propose introducing a new mode which is somewhat like sequential, except we essentially 'look back' from the current id by some amount defined by a distribution. I may possibly also make the position increment only as it's first written to, so that this mode can be run from a clean slate with a mixed workload. This should allow us to generate workloads that are more representative. At the same time, I will introduce a timestamp value generator for primary key columns that is strictly ascending, i.e. has some random component but is based off of the actual system time (or some shared monotonically increasing state) so that we can again generate a more realistic workload. This may be challenging to tie in with the new procedurally generated partitions, but I'm sure it can be done without too much difficulty. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
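A minimal sketch of the two generators described, with an exponential look-back standing in for whatever distribution is eventually chosen (the class, method names, and distribution are all illustrative, not the eventual stress implementation):
{code}
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicLong;

public class RecencyWorkload
{
    private final AtomicLong head = new AtomicLong();
    private final AtomicLong lastMicros = new AtomicLong();

    // ids 'look back' from a moving head, so recent ids are exponentially preferred
    public long nextId(double meanLookback)
    {
        long h = head.incrementAndGet();
        double u = 1.0 - ThreadLocalRandom.current().nextDouble(); // in (0,1]
        long lookback = (long) (-Math.log(u) * meanLookback);
        return Math.max(0, h - lookback);
    }

    // strictly ascending timestamps, even when many threads request them at once
    public long nextTimestampMicros()
    {
        while (true)
        {
            long now = System.currentTimeMillis() * 1000;
            long last = lastMicros.get();
            long next = Math.max(now, last + 1);
            if (lastMicros.compareAndSet(last, next))
                return next;
        }
    }
}
{code}
The CAS loop on the timestamp is the interesting part: it anchors values to real time when the clock is ahead, but guarantees monotonicity under contention by bumping the shared state instead.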
[jira] [Updated] (CASSANDRA-7916) Stress should collect and cross-cluster GC statistics
[ https://issues.apache.org/jira/browse/CASSANDRA-7916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benedict updated CASSANDRA-7916: Labels: docs (was: ) Stress should collect and cross-cluster GC statistics - Key: CASSANDRA-7916 URL: https://issues.apache.org/jira/browse/CASSANDRA-7916 Project: Cassandra Issue Type: Improvement Components: Tools Reporter: Benedict Assignee: Benedict Priority: Minor Labels: docs Fix For: 2.1.1 It would be useful to see stress outputs deliver cross-cluster statistics, the most useful being GC data. Some simple changes to GCInspector collect the data, and can deliver to a nodetool request or to stress over JMX. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (CASSANDRA-6146) CQL-native stress
[ https://issues.apache.org/jira/browse/CASSANDRA-6146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benedict updated CASSANDRA-6146: Labels: docs qa-resolved (was: qa-resolved) CQL-native stress - Key: CASSANDRA-6146 URL: https://issues.apache.org/jira/browse/CASSANDRA-6146 Project: Cassandra Issue Type: New Feature Components: Tools Reporter: Jonathan Ellis Assignee: T Jake Luciani Labels: docs, qa-resolved Fix For: 2.1 rc3 Attachments: 6146-v2.txt, 6146.txt, 6164-v3.txt The existing CQL support in stress is not worth discussing. We need to start over, and we might as well kill two birds with one stone and move to the native protocol while we're at it. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-5657) remove deprecated metrics
[ https://issues.apache.org/jira/browse/CASSANDRA-5657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14135464#comment-14135464 ] Benedict commented on CASSANDRA-5657: - This does _not_ appear to be the same as the just closed CASSANDRA-7943, which suggests dropping 'legacy metrics' - which seems to refer to the custom histogram. This ticket refers to removing deprecated mbeans and object wrappers only. I am -1 dropping the legacy histograms, since they are considerably more accurate than the yammer histograms. Before we consider dropping them we need to consider holistically our approach to metrics, which really need overhauling, since yammer metrics are lacking. remove deprecated metrics - Key: CASSANDRA-5657 URL: https://issues.apache.org/jira/browse/CASSANDRA-5657 Project: Cassandra Issue Type: Task Components: Tools Reporter: Jonathan Ellis Assignee: T Jake Luciani Labels: technical_debt Fix For: 3.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-7818) Improve compaction logging
[ https://issues.apache.org/jira/browse/CASSANDRA-7818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14135538#comment-14135538 ] Benedict commented on CASSANDRA-7818: - Is it worth shortening the SSTableReader() prefix in the message to clean the messages up a little further? There's no real need for it. Something like
{code}
StringBuilder ssTableLoggerMsg = new StringBuilder("[");
for (SSTableReader sstr : sstables)
{
    ssTableLoggerMsg.append(String.format("%s:level %d, ", sstr.getFilename(), sstr.getSSTableLevel()));
}
ssTableLoggerMsg.append("]");
{code}
? Improve compaction logging -- Key: CASSANDRA-7818 URL: https://issues.apache.org/jira/browse/CASSANDRA-7818 Project: Cassandra Issue Type: Improvement Reporter: Marcus Eriksson Assignee: Mihai Suteu Priority: Minor Labels: compaction, lhf Fix For: 3.0 Attachments: cassandra-7818.patch We should log more information about compactions to be able to debug issues more efficiently * give each CompactionTask an id that we log (so that you can relate the start-compaction messages to the finished-compaction ones) * log what level the sstables are taken from -- This message was sent by Atlassian JIRA (v6.3.4#6332)
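And for the other bullet in the description, a sketch of correlating start/finish messages via a per-task id; the class, field names, and message layout are illustrative, not the committed change:
{code}
import java.util.UUID;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class CompactionTaskLogging
{
    private static final Logger logger = LoggerFactory.getLogger(CompactionTaskLogging.class);

    // one id per task, echoed in both the start and finish lines
    private final UUID taskId = UUID.randomUUID();

    void logStart(String sstableSummary)
    {
        logger.info("Compacting ({}) {}", taskId, sstableSummary);
    }

    void logFinish(long inputBytes, long outputBytes, long durationMs)
    {
        logger.info("Compacted ({}) {} -> {} bytes in {}ms", taskId, inputBytes, outputBytes, durationMs);
    }
}
{code}
Grepping the log for one UUID then yields the complete lifecycle of a single compaction, which is exactly the debugging workflow the ticket is after.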