[jira] [Comment Edited] (CASSANDRA-7245) Out-of-Order keys with stress + CQL3

2014-06-04 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14018257#comment-14018257
 ] 

Benedict edited comment on CASSANDRA-7245 at 6/4/14 10:08 PM:
--

[~jasobrown] I got this one if you've got better stuff to do :-)

[~tjake]: 
* I think the change you made doesn't completely eliminate the bug: since it 
extends DroppableRunnable, the finally block may never be run
* In SP.mutate() you can't release the mutations until after we get the all 
clear from the replicas, as we may have to use the mutations to write local 
hints. It might be nice, however, to have the response handler release() the 
reference once we receive enough responses from our replicas, so that we don't 
keep all of the references if we're just waiting for one mutation (see the 
sketch after this list)
* Why do we need a ThreadLocal in ClientState to store the sourceFrame? Can't 
we just store it directly in a field in QueryState?
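
For the second point, a rough sketch of the release-on-enough-responses idea 
(all names here are hypothetical, not the actual SP / response handler code):

    import java.util.concurrent.atomic.AtomicInteger;

    // Hypothetical sketch: the handler holds its own reference to the mutation's
    // ref-counted buffer and releases it as soon as blockFor replicas have
    // responded; SP.mutate() keeps a separate reference for the local-hint path.
    class CountedReleaseHandler
    {
        interface RefCounted { void retain(); void release(); }

        private final AtomicInteger received = new AtomicInteger();
        private final int blockFor;
        private final RefCounted mutationRef;

        CountedReleaseHandler(int blockFor, RefCounted mutationRef)
        {
            this.blockFor = blockFor;
            this.mutationRef = mutationRef;
            mutationRef.retain(); // the handler's own reference
        }

        void onResponse()
        {
            // exactly one responder observes the transition to blockFor
            if (received.incrementAndGet() == blockFor)
                mutationRef.release(); // enough acks: stop pinning the mutation
        }
    }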

Nit:
* WS in ClientState constructor; would also prefer to just create ThreadLocal 
in var initialiser since it's always init'd empty

Also, this is off-topic, but I wonder if we shouldn't replace the NBHM in 
ServerConnection with a simple array. We know the range is < 32K (< .1K for 
older clients), and each index is accessed by a single thread at any given 
time, so we'd just need to be able to atomically swap in larger arrays if we 
wanted dynamic sizing.
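
Roughly what I mean by the array swap - a sketch only, assuming stream IDs 
index the array and each slot has a single writer at a time:

    import java.util.Arrays;
    import java.util.concurrent.atomic.AtomicReference;

    // Sketch: per-stream-id slots in a plain array, with dynamic sizing done by
    // atomically swapping in a larger copy. Only the resize needs to be atomic,
    // since each index has one writer at a time; a production version would still
    // need to ensure a write racing with the copy cannot be lost.
    class StreamIdSlots<V>
    {
        private final AtomicReference<Object[]> slots = new AtomicReference<>(new Object[128]);

        @SuppressWarnings("unchecked")
        V get(int streamId)
        {
            Object[] array = slots.get();
            return streamId < array.length ? (V) array[streamId] : null;
        }

        void put(int streamId, V value)
        {
            Object[] array = slots.get();
            while (streamId >= array.length)
            {
                slots.compareAndSet(array, Arrays.copyOf(array, array.length * 2));
                array = slots.get(); // whoever won the swap, reread the current array
            }
            array[streamId] = value;
        }
    }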

Otherwise LGTM


was (Author: benedict):
[~jasobrown] I got this one if you've got better stuff to do :-)

[~tjake]: 
* I think the change you made doesn't completely eliminate the bug: since it 
extends DroppableRunnable, the finally block may never be run
* I think it would be good if, in SP.mutate(), we didn't release the mutations 
until after we get the all clear from the replicas, as we may have to use the 
mutations to write local hints. It might be nice, however, to have the response 
handler release() the reference once we receive enough responses from our 
replicas, so that we don't keep all of the references if we're just waiting for 
one mutation
* Why do we need a ThreadLocal in ClientState to store the sourceFrame? Can't 
we just store it directly in a field in QueryState?

Nit:
* WS in ClientState constructor; would also prefer to just create ThreadLocal 
in var initialiser since it's always init'd empty

Also, this is off-topic, but I wonder if we shouldn't replace the NBHM in 
ServerConnection with a simple array. We know the range is < 32K (< .1K for 
older clients), and each index is accessed by a single thread at any given 
time, so we'd just need to be able to atomically swap in larger arrays if we 
wanted dynamic sizing.

Otherwise LGTM

 Out-of-Order keys with stress + CQL3
 

 Key: CASSANDRA-7245
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7245
 Project: Cassandra
  Issue Type: Bug
  Components: Core
Reporter: Pavel Yaskevich
Assignee: T Jake Luciani
 Fix For: 2.1.0

 Attachments: 7245-v2.txt, 7245.txt, 7245v3-rebase.txt, 7245v3.txt, 
 7245v4.txt, netty-all-4.0.19.Final.jar


 We have been generating data (stress with CQL3 prepared) for CASSANDRA-4718 
 and found the following problem in almost every SSTable generated (~200 GB of 
 data and 821 SSTables).
 We set up the keys to be 10 bytes in size (default) and population between 1 
 and 6.
 Once I ran 'sstablekeys' on the generated SSTable files I got the following 
 exceptions:
 _There is a problem with sorting of normal-looking keys:_
 30303039443538353645
 30303039443745364242
 java.io.IOException: Key out of order! DecoratedKey(-217680888487824985, 
 *30303039443745364242*) > DecoratedKey(-1767746583617597213, 
 *30303039443437454333*)
 0a30303033343933
 3734441388343933
 java.io.IOException: Key out of order! DecoratedKey(5440473860101999581, 
 *3734441388343933*) > DecoratedKey(-7565486415339257200, 
 *30303033344639443137*)
 30303033354244363031
 30303033354133423742
 java.io.IOException: Key out of order! DecoratedKey(2687072396429900180, 
 *30303033354133423742*) > DecoratedKey(-7838239767410066684, 
 *30303033354145344534*)
 30303034313442354137
 3034313635363334
 java.io.IOException: Key out of order! DecoratedKey(1516003874415400462, 
 *3034313635363334*) > DecoratedKey(-9106177395653818217, 
 *3030303431444238*)
 30303035373044373435
 30303035373044334631
 java.io.IOException: Key out of order! DecoratedKey(-3645715702154616540, 
 *30303035373044334631*) > DecoratedKey(-4296696226469000945, 
 *30303035373132364138*)
 _And completely different ones:_
 30303041333745373543
 7cd045c59a90d7587d8d
 java.io.IOException: Key out of order! DecoratedKey(-3595402345023230196, 
 *7cd045c59a90d7587d8d*) > DecoratedKey(-5146766422778260690, 
 *30303041333943303232*)
 3030303332314144
 

[jira] [Commented] (CASSANDRA-6572) Workload recording / playback

2014-06-04 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14018301#comment-14018301
 ] 

Benedict commented on CASSANDRA-6572:
-

* You need to move the getAndSet() on line 76 back inside the if statement
* recycleQueue shouldn't compareAndSet its position to zero; it should just set 
it always, and always recycle (see the sketch after this list)
* I'd probably make a static helper method for checking if a keyspace is one we 
want to avoid tracing
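
For the second point, the shape I have in mind is simply (hypothetical names, 
just to illustrate the unconditional reset):

    import java.util.Queue;
    import java.util.concurrent.ConcurrentLinkedQueue;
    import java.util.concurrent.atomic.AtomicInteger;

    // Hypothetical illustration: on recycle, reset the position with a plain
    // set rather than a compareAndSet, and always return the queue to the pool.
    class Recycler
    {
        static class RecyclableQueue { final AtomicInteger position = new AtomicInteger(); }

        private final Queue<RecyclableQueue> pool = new ConcurrentLinkedQueue<>();

        void recycleQueue(RecyclableQueue queue)
        {
            queue.position.set(0); // not compareAndSet(expected, 0)
            pool.offer(queue);     // and always recycle
        }
    }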

Otherwise this batch of changes LGTM

 Workload recording / playback
 -

 Key: CASSANDRA-6572
 URL: https://issues.apache.org/jira/browse/CASSANDRA-6572
 Project: Cassandra
  Issue Type: New Feature
  Components: Core, Tools
Reporter: Jonathan Ellis
Assignee: Lyuben Todorov
 Fix For: 2.1.1

 Attachments: 6572-trunk.diff


 Write sample mode gets us part way to testing new versions against a real 
 world workload, but we need an easy way to test the query side as well.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (CASSANDRA-7468) Add time-based execution to cassandra-stress

2014-09-10 Thread Benedict (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-7468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benedict updated CASSANDRA-7468:

Assignee: Matt Kennedy  (was: Benedict)

 Add time-based execution to cassandra-stress
 

 Key: CASSANDRA-7468
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7468
 Project: Cassandra
  Issue Type: Improvement
  Components: Tools
Reporter: Matt Kennedy
Assignee: Matt Kennedy
Priority: Minor
 Fix For: 2.1.1

 Attachments: 7468v2.txt, trunk-7468-rebase.patch, trunk-7468.patch






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (CASSANDRA-7468) Add time-based execution to cassandra-stress

2014-09-10 Thread Benedict (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-7468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benedict resolved CASSANDRA-7468.
-
Resolution: Fixed

This was a really minor issue of not passing the Duration settings object, so 
I ninja fixed it and, at the same time, applied it to 2.1 - there's no point 
making merges more difficult whilst simultaneously depriving users of features 
for something as small as this.

 Add time-based execution to cassandra-stress
 

 Key: CASSANDRA-7468
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7468
 Project: Cassandra
  Issue Type: Improvement
  Components: Tools
Reporter: Matt Kennedy
Assignee: Matt Kennedy
Priority: Minor
 Fix For: 2.1.1

 Attachments: 7468v2.txt, trunk-7468-rebase.patch, trunk-7468.patch






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (CASSANDRA-7658) stress connects to all nodes when it shouldn't

2014-09-10 Thread Benedict (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-7658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benedict updated CASSANDRA-7658:

Attachment: 7658.txt

Patch attached

 stress connects to all nodes when it shouldn't
 --

 Key: CASSANDRA-7658
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7658
 Project: Cassandra
  Issue Type: Improvement
  Components: Tools
Reporter: Brandon Williams
Assignee: Benedict
Priority: Minor
 Fix For: 2.1.1

 Attachments: 7658.txt


 If you tell stress -node 1,2 in a cluster with more nodes, stress appears to do 
 ring discovery and connect to them all anyway (checked via netstat.)  This 
 led to the confusion on CASSANDRA-7567



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (CASSANDRA-7646) DROP may not clean commit log segments when auto snapshot is false

2014-09-10 Thread Benedict (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-7646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benedict resolved CASSANDRA-7646.
-
Resolution: Not a Problem

I was mistaken when I thought this was also an issue: we don't special-case 
drop at all, we always flush on it. There are no calls to renewMemtable, which 
is the danger zone.

 DROP may not clean commit log segments when auto snapshot is false
 --

 Key: CASSANDRA-7646
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7646
 Project: Cassandra
  Issue Type: Bug
  Components: Core
Reporter: Jeremiah Jordan
Assignee: Benedict
 Fix For: 2.0.11


 Per discussion on CASSANDRA-7511 DROP may also be affected by the same issue. 
  If auto snapshot is false, commit log segments may stay dirty because they 
 are not flushed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (CASSANDRA-7282) Faster Memtable map

2014-09-12 Thread Benedict (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-7282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benedict updated CASSANDRA-7282:

Attachment: reads.svg
writes.svg

Okay, so I made some minor tweaks to this and did some more serious local 
testing (and graphing of results). The updated branch (also rebased to trunk) 
is 
[here|https://github.com/belliottsmith/cassandra/tree/7282-fastmemtablemap.trunk]

On the whole it looks to me that bdplab is as usual showing its reticence to 
exhibit performance improvements. I suspect it's bottlenecking more readily on 
kernel operations.

On my local machine I see around a 15-20% improvement in throughput for both 
reads and writes, making this a very sensible addition. It also sees 
considerably _reduced_ GC time (though not the amount of garbage generated) on 
writes, and sees a reduction in latency almost across the board (max latency on 
a read-only workload is slightly bumped, but since write workloads have the 
largest effect on latency, this seems worth overlooking; and given how closely 
run we are, it's possible this is noise in the measurement, since the median 
p999/pMax are almost exactly the same).

I've also separately improved stress to collect GC data over JMX and created a 
patch to generate these pretty graphs from stress output automatically, which 
I'll be posting separately.

 Faster Memtable map
 ---

 Key: CASSANDRA-7282
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7282
 Project: Cassandra
  Issue Type: Improvement
  Components: Core
Reporter: Benedict
Assignee: Benedict
  Labels: performance
 Fix For: 3.0

 Attachments: reads.svg, writes.svg


 Currently we maintain a ConcurrentSkipListMap of DecoratedKey -> Partition in 
 our memtables. Maintaining this is an O(lg(n)) operation; since the vast 
 majority of users use a hash partitioner, it occurs to me we could maintain a 
 hybrid ordered list / hash map. The list would impose the normal order on the 
 collection, but a hash index would live alongside as part of the same data 
 structure, simply mapping into the list and permitting O(1) lookups and 
 inserts.
 I've chosen to implement this initial version as a linked-list node per item, 
 but we can optimise this in future by storing fatter nodes that permit a 
 cache-line's worth of hashes to be checked at once,  further reducing the 
 constant factor costs for lookups.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (CASSANDRA-7916) Stress should collect and cross-cluster GC statistics

2014-09-12 Thread Benedict (JIRA)
Benedict created CASSANDRA-7916:
---

 Summary: Stress should collect and cross-cluster GC statistics
 Key: CASSANDRA-7916
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7916
 Project: Cassandra
  Issue Type: Improvement
  Components: Tools
Reporter: Benedict
Assignee: Benedict
Priority: Minor
 Fix For: 2.1.1


It would be useful to see stress outputs deliver cross-cluster statistics, the 
most useful being GC data. Some simple changes to GCInspector let it collect 
the data and deliver it to a nodetool request or to stress over JMX.
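
For reference, the kind of data involved can be pulled with the standard 
platform MXBeans; a generic sketch (this is not the GCInspector change itself, 
and the host/port are assumptions - 7199 is Cassandra's default JMX port):

    import java.lang.management.GarbageCollectorMXBean;
    import java.lang.management.ManagementFactory;
    import javax.management.MBeanServerConnection;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    // Generic sketch: read GC counts and total pause time from a node over JMX.
    public class RemoteGcStats
    {
        public static void main(String[] args) throws Exception
        {
            String host = args.length > 0 ? args[0] : "127.0.0.1";
            JMXServiceURL url = new JMXServiceURL("service:jmx:rmi:///jndi/rmi://" + host + ":7199/jmxrmi");
            JMXConnector connector = JMXConnectorFactory.connect(url);
            try
            {
                MBeanServerConnection connection = connector.getMBeanServerConnection();
                for (GarbageCollectorMXBean gc : ManagementFactory.getPlatformMXBeans(connection, GarbageCollectorMXBean.class))
                    System.out.printf("%s: %d collections, %dms total%n",
                                      gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
            }
            finally
            {
                connector.close();
            }
        }
    }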



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-7916) Stress should collect and cross-cluster GC statistics

2014-09-12 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14132017#comment-14132017
 ] 

Benedict commented on CASSANDRA-7916:
-

Patch available 
[here|https://github.com/belliottsmith/cassandra/tree/stress-jmx]

 Stress should collect and cross-cluster GC statistics
 -

 Key: CASSANDRA-7916
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7916
 Project: Cassandra
  Issue Type: Improvement
  Components: Tools
Reporter: Benedict
Assignee: Benedict
Priority: Minor
 Fix For: 2.1.1


 It would be useful to see stress outputs deliver cross-cluster statistics, 
 the most useful being GC data. Some simple changes to GCInspector let it 
 collect the data and deliver it to a nodetool request or to stress over JMX.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (CASSANDRA-7918) Provide graphing tool along with cassandra-stress

2014-09-12 Thread Benedict (JIRA)
Benedict created CASSANDRA-7918:
---

 Summary: Provide graphing tool along with cassandra-stress
 Key: CASSANDRA-7918
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7918
 Project: Cassandra
  Issue Type: Improvement
  Components: Tools
Reporter: Benedict
Assignee: Benedict
Priority: Minor


Whilst cstar makes some pretty graphs, they're a little limited and also 
require you to run your tests through it. It would be useful to be able to 
graph results from any stress run easily.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-7918) Provide graphing tool along with cassandra-stress

2014-09-12 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14132020#comment-14132020
 ] 

Benedict commented on CASSANDRA-7918:
-

Patch available 
[here|https://github.com/belliottsmith/cassandra/tree/stress-multiplot]

See CASSANDRA-7282 for sample output. This patch relies upon gnuplot.

 Provide graphing tool along with cassandra-stress
 -

 Key: CASSANDRA-7918
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7918
 Project: Cassandra
  Issue Type: Improvement
  Components: Tools
Reporter: Benedict
Assignee: Benedict
Priority: Minor

 Whilst cstar makes some pretty graphs, they're a little limited and also 
 require you to run your tests through it. It would be useful to be able to 
 graph results from any stress run easily.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-7915) Waiting for sync on the commit log could happen after writing to memtable

2014-09-12 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14132022#comment-14132022
 ] 

Benedict commented on CASSANDRA-7915:
-

I'm not totally convinced by this, although it's debatable. The issue, of 
course, is that the data becomes visible and can be read before it is 
considered 'durable' - meaning that you could lose the cluster, restore from 
the CL, and find that data which was previously readable is now missing. Users 
relying on batch CL probably would not like this scenario.

 Waiting for sync on the commit log could happen after writing to memtable
 -

 Key: CASSANDRA-7915
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7915
 Project: Cassandra
  Issue Type: Improvement
  Components: Core
Reporter: Branimir Lambov
Priority: Minor

 Currently the sync wait is part of CommitLog.add, which is executed in whole 
 before any memtable write. The time for executing the latter is thus added on 
 top of the time for file sync, which seems unnecessary.
 Moving the wait to a call at the end of Keyspace.apply should hide the 
 memtable write time and may improve performance, especially for the batch 
 sync strategy.
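
 A minimal sketch of the proposed reordering (only CommitLog.add and 
 Keyspace.apply are names from this ticket; the other names and signatures are 
 illustrative):

    // Illustrative only: today the sync wait happens inside the commit log add,
    // before the memtable write; the proposal moves it after.
    interface CommitLog
    {
        long add(Mutation m);            // append, returning a position in the log
        void waitForSync(long position); // block until a sync covers that position
    }
    interface Memtable { void apply(Mutation m); }
    interface Mutation {}

    class KeyspaceApplySketch
    {
        static void apply(CommitLog log, Memtable memtable, Mutation m)
        {
            long position = log.add(m); // no longer waits for the sync itself
            memtable.apply(m);          // memtable write now overlaps the file sync...
            log.waitForSync(position);  // ...and we only block here, at the end
            // Note the window raised in the comment above: after memtable.apply()
            // the data is readable, but not durable until waitForSync returns.
        }
    }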



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-7918) Provide graphing tool along with cassandra-stress

2014-09-12 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14132485#comment-14132485
 ] 

Benedict commented on CASSANDRA-7918:
-

Well, I developed this on a 12-hr flight, so doing it with something I'd need 
Google to achieve wasn't an option, although d3.js has some strengths. I really 
dislike python, however, and have found that every time I try to use a 
scripting language to develop a tool it would be considered more suitable for, 
I waste time doing so. I am very productive in Java. I did not want to spend 
longer on this than necessary, so I stuck with Java this time.

I'm very happy with the end result - these graphs are extremely informative. 
With a single glance a lot of comparisons can easily be made. If somebody wants 
to develop something of equal utility with different tools, that's fine by me, 
but in the meantime I don't see why that should prevent this being made use of. 
Since this is an ad hoc tool, dropping it in favour of a future improvement is 
easily done; it's a non-breaking change.

 Provide graphing tool along with cassandra-stress
 -

 Key: CASSANDRA-7918
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7918
 Project: Cassandra
  Issue Type: Improvement
  Components: Tools
Reporter: Benedict
Assignee: Benedict
Priority: Minor

 Whilst cstar makes some pretty graphs, they're a little limited and also 
 require you to run your tests through it. It would be useful to be able to 
 graph results from any stress run easily.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-7919) Change timestamp representation to timeuuid

2014-09-12 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14132487#comment-14132487
 ] 

Benedict commented on CASSANDRA-7919:
-

See my comments on CASSANDRA-6108 for details of why I don't think this is 
suitable. Since we specifically want this for RAMP transactions we need 
guaranteed uniqueness cross-cluster, which timeuuids only achieve if created 
_server side_. We will absolutely have to modify the semantics of TimeUUID if 
we want them to be guaranteed unique still, at which point we may as well 
consider the 64-bit representation I suggested on that ticket.

 Change timestamp representation to timeuuid
 ---

 Key: CASSANDRA-7919
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7919
 Project: Cassandra
  Issue Type: Improvement
Reporter: T Jake Luciani
Priority: Minor
 Fix For: 3.0


 In order to overcome some of the issues with timestamps (CASSANDRA-6123) we 
 need to migrate to a better timestamp representation for cells.
 Since drivers already support timeuuid it makes sense to migrate to this 
 internally (see CASSANDRA-7056)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-7282) Faster Memtable map

2014-09-12 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14132494#comment-14132494
 ] 

Benedict commented on CASSANDRA-7282:
-

Yes, we want to restrict the workload to memtables only, and we want to make 
them big to generate enough numbers on insert to avoid noise. So from a fresh 
cluster, I start it with a memtable_cleanup_threshold of 0.99, 
memtable_allocation_type: offheap_objects, and I run a stress test with only 
_one column_ per partition, and make that column size 1 (although any size is 
fine if it's offheap). I'm just trying to make the memtable as large as 
possible, which with my 4Gb laptop is difficult. I then set the 
memtable_heap_space_in_mb to 1024 (feel free to make it much bigger), and then 
insert around 5M items. If you stick to one column and ensure your offheap 
space is sufficiently large for the data you insert into it, then 5M items per 
node per GB of on-heap space is achievable (my math tells me around 9M should 
be possible, but I overshot slightly and decided to be conservative). I then 
follow up immediately with a read run hitting a random selection of those PKs.
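
Concretely, the setup described above amounts to something like this (the yaml 
keys are the 2.1 settings named in the comment; the stress invocations are an 
approximation of the shape of the run, not the exact commands used):

    # cassandra.yaml overrides for a memtable-only benchmark
    memtable_cleanup_threshold: 0.99
    memtable_allocation_type: offheap_objects
    memtable_heap_space_in_mb: 1024

    # ~5M single-column, size-1 partitions, then a read pass over random PKs
    ./tools/bin/cassandra-stress write n=5000000 -col n=fixed\(1\) size=fixed\(1\) -mode cql3 native prepared
    ./tools/bin/cassandra-stress read n=5000000 -pop dist=uniform\(1..5000000\) -mode cql3 native prepared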

 Faster Memtable map
 ---

 Key: CASSANDRA-7282
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7282
 Project: Cassandra
  Issue Type: Improvement
  Components: Core
Reporter: Benedict
Assignee: Benedict
  Labels: performance
 Fix For: 3.0

 Attachments: reads.svg, writes.svg


 Currently we maintain a ConcurrentSkipListMap of DecoratedKey -> Partition in 
 our memtables. Maintaining this is an O(lg(n)) operation; since the vast 
 majority of users use a hash partitioner, it occurs to me we could maintain a 
 hybrid ordered list / hash map. The list would impose the normal order on the 
 collection, but a hash index would live alongside as part of the same data 
 structure, simply mapping into the list and permitting O(1) lookups and 
 inserts.
 I've chosen to implement this initial version as a linked-list node per item, 
 but we can optimise this in future by storing fatter nodes that permit a 
 cache-line's worth of hashes to be checked at once,  further reducing the 
 constant factor costs for lookups.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-7919) Change timestamp representation to timeuuid

2014-09-12 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14132501#comment-14132501
 ] 

Benedict commented on CASSANDRA-7919:
-

I'm also a little uncomfortable making it explicitly a TimeUUID at the client 
level, since this may mean clients start expecting the exact TimeUUID back with 
each request also. This would definitely be a bad thing, and prevent almost all 
of the timestamp optimisation work we have pencilled for 3.0, as we would have 
to retain its full bytes forever, which are half random, the state space of 
which will grow linearly with cluster up-time.

 Change timestamp representation to timeuuid
 ---

 Key: CASSANDRA-7919
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7919
 Project: Cassandra
  Issue Type: Improvement
Reporter: T Jake Luciani
Priority: Minor
 Fix For: 3.0


 In order to overcome some of the issues with timestamps (CASSANDRA-6123) we 
 need to migrate to a better timestamp representation for cells.
 Since drivers already support timeuuid it makes sense to migrate to this 
 internally (see CASSANDRA-7056)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (CASSANDRA-7919) Change timestamp representation to timeuuid

2014-09-12 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14132501#comment-14132501
 ] 

Benedict edited comment on CASSANDRA-7919 at 9/13/14 3:26 AM:
--

I'm also a little uncomfortable making it explicitly a TimeUUID at the client 
level, since this may mean clients start expecting the exact TimeUUID back with 
each request also. This would definitely be a bad thing, and prevent almost all 
of the timestamp optimisation work we have pencilled for 3.0, as we would have 
to retain its full bytes forever, the least significant bits of which become 
meaningless after repair, and are 'random' (a hash of interface + time salt), 
the state space of which will grow linearly with cluster up-time.


was (Author: benedict):
I'm also a little uncomfortable making it explicitly a TimeUUID at the client 
level, since this may mean clients start expecting the exact TimeUUID back with 
each request also. This would definitely be a bad thing, and prevent almost all 
of the timestamp optimisation work we have pencilled for 3.0, as we would have 
to retain its full bytes forever, which are half random, the state space of 
which will grow linearly with cluster up-time.

 Change timestamp representation to timeuuid
 ---

 Key: CASSANDRA-7919
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7919
 Project: Cassandra
  Issue Type: Improvement
Reporter: T Jake Luciani
Priority: Minor
 Fix For: 3.0


 In order to overcome some of the issues with timestamps (CASSANDRA-6123) we 
 need to migrate to a better timestamp representation for cells.
 Since drivers already support timeuuid it makes sense to migrate to this 
 internally (see CASSANDRA-7056)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-7282) Faster Memtable map

2014-09-12 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14132508#comment-14132508
 ] 

Benedict commented on CASSANDRA-7282:
-

No. You benchmark to isolate performance improvements. These numbers 
demonstrate a benefit _to any portion of a workload that fits these 
characteristics_. Different use cases will exhibit different ratios of this 
effect; some considerable, some not so much. Certainly the read performance 
enhancements will be more generally applicable, but being able to fill your 
memtables faster and with less garbage is also not a bad thing, even if you end 
up bottlenecking on disk. Especially since server characteristics are changing 
rapidly, so that disk bottlenecks are disappearing.

As optimisation work goes deeper, isolating the specific portion of work that 
is affected is pretty essential to clearly delineating the benefit.

 Faster Memtable map
 ---

 Key: CASSANDRA-7282
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7282
 Project: Cassandra
  Issue Type: Improvement
  Components: Core
Reporter: Benedict
Assignee: Benedict
  Labels: performance
 Fix For: 3.0

 Attachments: reads.svg, writes.svg


 Currently we maintain a ConcurrentSkipListMap of DecoratedKey -> Partition in 
 our memtables. Maintaining this is an O(lg(n)) operation; since the vast 
 majority of users use a hash partitioner, it occurs to me we could maintain a 
 hybrid ordered list / hash map. The list would impose the normal order on the 
 collection, but a hash index would live alongside as part of the same data 
 structure, simply mapping into the list and permitting O(1) lookups and 
 inserts.
 I've chosen to implement this initial version as a linked-list node per item, 
 but we can optimise this in future by storing fatter nodes that permit a 
 cache-line's worth of hashes to be checked at once,  further reducing the 
 constant factor costs for lookups.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-7919) Change timestamp representation to timeuuid

2014-09-12 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14132516#comment-14132516
 ] 

Benedict commented on CASSANDRA-7919:
-

We can have multiple clients per interface, we can have multiple clients 
start/stop at the same time. C* instances can pretty much guarantee this 
doesn't happen; clients cannot. Once we have a collision on the LSB, we are 
exposed indefinitely to conflicts.

 Change timestamp representation to timeuuid
 ---

 Key: CASSANDRA-7919
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7919
 Project: Cassandra
  Issue Type: Improvement
Reporter: T Jake Luciani
Priority: Minor
 Fix For: 3.0


 In order to overcome some of the issues with timestamps (CASSANDRA-6123) we 
 need to migrate to a better timestamp representation for cells.
 Since drivers already support timeuuid it makes sense to migrate to this 
 internally (see CASSANDRA-7056)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (CASSANDRA-7919) Change timestamp representation to timeuuid

2014-09-12 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14132516#comment-14132516
 ] 

Benedict edited comment on CASSANDRA-7919 at 9/13/14 4:09 AM:
--

We can have multiple clients per interface (esp. cross DC, as we could have 
multiple DCs with the same network address range that do not communicate 
directly, but communicate with a shared C* data centre), and we can have 
multiple clients start/stop at the same time. C* instances can pretty much 
guarantee this doesn't happen; clients cannot. Once we have a collision on the 
LSB, we are exposed indefinitely to conflicts.


was (Author: benedict):
We can have multiple clients per interface, we can have multiple clients 
start/stop at the same time. C* instances can pretty much guarantee this 
doesn't happen; clients cannot. Once we have a collision on the LSB, we are 
exposed indefinitely to conflicts.

 Change timestamp representation to timeuuid
 ---

 Key: CASSANDRA-7919
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7919
 Project: Cassandra
  Issue Type: Improvement
Reporter: T Jake Luciani
Priority: Minor
 Fix For: 3.0


 In order to overcome some of the issues with timestamps (CASSANDRA-6123) we 
 need to migrate to a better timestamp representation for cells.
 Since drivers already support timeuuid it makes sense to migrate to this 
 internally (see CASSANDRA-7056)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-7919) Change timestamp representation to timeuuid

2014-09-12 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14132518#comment-14132518
 ] 

Benedict commented on CASSANDRA-7919:
-

bq. Once we have a collision on the LSB, we are exposed indefinitely to 
conflicts.

As soon as we collide on LSB, the only thing delivering uniqueness is 
timestamp, which is definitely insufficient.

 Change timestamp representation to timeuuid
 ---

 Key: CASSANDRA-7919
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7919
 Project: Cassandra
  Issue Type: Improvement
Reporter: T Jake Luciani
Priority: Minor
 Fix For: 3.0


 In order to overcome some of the issues with timestamps (CASSANDRA-6123) we 
 need to migrate to a better timestamp representation for cells.
 Since drivers already support timeuuid it makes sense to migrate to this 
 internally (see CASSANDRA-7056)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-7282) Faster Memtable map

2014-09-12 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14132523#comment-14132523
 ] 

Benedict commented on CASSANDRA-7282:
-

The new code is not significantly more complex. In fact, it's arguably less 
complex, the only difference being that some of that complexity is not derived 
from an external source (CSLM is significantly more complex than the NBHOM). 
However, it is a pretty simple piece of code.

The point of isolating is that we cannot possibly test all possible real world 
use cases. DSE, for instance, delivers Memory-Only SSTables, which depend on 
memtables, so all workloads hitting these would be affected fully by this 
change. All other workloads will be hit to different degrees.

In general we have to make a judgement call about how common such workloads 
are, vs how significant the effect is for those workloads. However trying to 
define a success metric based on a narrow definition of 'realistic workloads' 
that do not isolate the behaviour doesn't buy us anything in my book. Our 
hardware and workload definitions are not sufficiently universal. If we see a 
15% bump for accesses to memtables, that is IMO worth incorporating, esp. as 
the bump will increase over time. As we move memtables fully offheap we can 
support larger and larger memtables, and since the algorithmic quality of this 
change is that its cost does not increase as memtable size grows, whereas 
CSLM's grows logarithmically, this change also has the likelihood of paying 
further future dividends.



 Faster Memtable map
 ---

 Key: CASSANDRA-7282
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7282
 Project: Cassandra
  Issue Type: Improvement
  Components: Core
Reporter: Benedict
Assignee: Benedict
  Labels: performance
 Fix For: 3.0

 Attachments: reads.svg, writes.svg


 Currently we maintain a ConcurrentSkipListMap of DecoratedKey -> Partition in 
 our memtables. Maintaining this is an O(lg(n)) operation; since the vast 
 majority of users use a hash partitioner, it occurs to me we could maintain a 
 hybrid ordered list / hash map. The list would impose the normal order on the 
 collection, but a hash index would live alongside as part of the same data 
 structure, simply mapping into the list and permitting O(1) lookups and 
 inserts.
 I've chosen to implement this initial version as a linked-list node per item, 
 but we can optimise this in future by storing fatter nodes that permit a 
 cache-line's worth of hashes to be checked at once,  further reducing the 
 constant factor costs for lookups.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-7919) Change timestamp representation to timeuuid

2014-09-12 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14132529#comment-14132529
 ] 

Benedict commented on CASSANDRA-7919:
-

Well, I've defined situations in which it can happen: 2+ applications on the 
same server, machines connecting over NAT (e.g. over the internet), or machines 
connected to a common data centre that share address spaces.

As client counts increase, risk increases. The number of affected customers is 
likely to be low, but we can guarantee with the size of our user base it _will_ 
happen.

 Change timestamp representation to timeuuid
 ---

 Key: CASSANDRA-7919
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7919
 Project: Cassandra
  Issue Type: Improvement
Reporter: T Jake Luciani
Priority: Minor
 Fix For: 3.0


 In order to overcome some of the issues with timestamps (CASSANDRA-6123) we 
 need to migrate to a better timestamp representation for cells.
 Since drivers already support timeuuid it makes sense to migrate to this 
 internally (see CASSANDRA-7056)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-7282) Faster Memtable map

2014-09-12 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14132536#comment-14132536
 ] 

Benedict commented on CASSANDRA-7282:
-

bq. But there definitely is some value in heavily battle-tested code when you 
work on a database.

Sure, agreed :)

Ok, I'm not disputing there's value in _seeing extra data_ (there always is), 
but if the effect is not seen, or is lost in the noise, on the specific 
hardware / use case we're benchmarking that _doesn't_ isolate it, my only point 
is that this doesn't IMO reflect badly on the change.

 Faster Memtable map
 ---

 Key: CASSANDRA-7282
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7282
 Project: Cassandra
  Issue Type: Improvement
  Components: Core
Reporter: Benedict
Assignee: Benedict
  Labels: performance
 Fix For: 3.0

 Attachments: reads.svg, writes.svg


 Currently we maintain a ConcurrentSkipListMap of DecoratedKey -> Partition in 
 our memtables. Maintaining this is an O(lg(n)) operation; since the vast 
 majority of users use a hash partitioner, it occurs to me we could maintain a 
 hybrid ordered list / hash map. The list would impose the normal order on the 
 collection, but a hash index would live alongside as part of the same data 
 structure, simply mapping into the list and permitting O(1) lookups and 
 inserts.
 I've chosen to implement this initial version as a linked-list node per item, 
 but we can optimise this in future by storing fatter nodes that permit a 
 cache-line's worth of hashes to be checked at once,  further reducing the 
 constant factor costs for lookups.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-7919) Change timestamp representation to timeuuid

2014-09-12 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14132537#comment-14132537
 ] 

Benedict commented on CASSANDRA-7919:
-

Are they really used that widely in use cases that aren't server generated? 
Also, they're generated considerably less frequently than we will be generating 
them if we switch to all updates requiring them, so the attack vector increases.

For RAMP transactions, I'm pretty sure the conflict becomes more dangerous as 
well. It's a while since I thought about them, but I remember reaching the 
conclusion that collisions would have more severe consequences.

Either way, I still think they're a very bad idea based on my concerns about 
storage implications.

 Change timestamp representation to timeuuid
 ---

 Key: CASSANDRA-7919
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7919
 Project: Cassandra
  Issue Type: Improvement
Reporter: T Jake Luciani
Priority: Minor
 Fix For: 3.0


 In order to overcome some of the issues with timestamps (CASSANDRA-6123) we 
 need to migrate to a better timestamp representation for cells.
 Since drivers already support timeuuid it makes sense to migrate to this 
 internally (see CASSANDRA-7056)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (CASSANDRA-7923) When preparing a statement, do not parse the provided string if we already have the parsed statement cached

2014-09-12 Thread Benedict (JIRA)
Benedict created CASSANDRA-7923:
---

 Summary: When preparing a statement, do not parse the provided 
string if we already have the parsed statement cached
 Key: CASSANDRA-7923
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7923
 Project: Cassandra
  Issue Type: Improvement
Reporter: Benedict
Priority: Minor


If there are many clients preparing the same statement (or the same client 
preparing it multiple times), there's no point parsing the statement each 
time. We already have it prepared; we should ship back the prior result.
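
A minimal sketch of the lookup-before-parse (hypothetical names; the statement 
id Cassandra uses is an MD5 digest of the query string, which is what makes 
the collision discussion below relevant):

    import java.security.MessageDigest;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentMap;

    // Hypothetical sketch: key the cache by a digest of the query string and
    // return the previously prepared result instead of re-parsing.
    class PreparedCache
    {
        static class Prepared {} // stand-in for the real prepared statement type

        private final ConcurrentMap<String, Prepared> prepared = new ConcurrentHashMap<>();

        Prepared prepare(String queryString) throws Exception
        {
            String id = digest(queryString);
            Prepared existing = prepared.get(id);
            if (existing != null)
                return existing; // already parsed: ship back the prior result
            Prepared stmt = parseAndPrepare(queryString); // the expensive step
            prepared.putIfAbsent(id, stmt);
            return stmt;
        }

        private static String digest(String query) throws Exception
        {
            byte[] md5 = MessageDigest.getInstance("MD5").digest(query.getBytes("UTF-8"));
            StringBuilder sb = new StringBuilder();
            for (byte b : md5)
                sb.append(String.format("%02x", b));
            return sb.toString();
        }

        private Prepared parseAndPrepare(String q) { return new Prepared(); } // stub parser
    }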

I would like us separately to consider introducing some checks to ensure that 
we never have a hash collision (and error if we do, asking the user to salt 
their query string), but this change in no way increases the risk profile here, 
since all we did was overwrite the prior statement with the new one. This 
change means that clients referencing the old statement continue to function 
and the client registering the colliding statement will not execute the correct 
statement, but this is in no way worse than the reverse situation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (CASSANDRA-7923) When preparing a statement, do not parse the provided string if we already have the parsed statement cached

2014-09-12 Thread Benedict (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-7923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benedict updated CASSANDRA-7923:

Attachment: 7923.txt

 When preparing a statement, do not parse the provided string if we already 
 have the parsed statement cached
 ---

 Key: CASSANDRA-7923
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7923
 Project: Cassandra
  Issue Type: Improvement
Reporter: Benedict
Priority: Minor
 Attachments: 7923.txt


 If there are many clients preparing the same statement (or the same client 
 preparing it multiple times), there's no point parsing the statement each 
 time. We already have it prepared; we should ship back the prior result.
 I would like us separately to consider introducing some checks to ensure that 
 we never have a hash collision (and error if we do, asking the user to salt 
 their query string), but this change in no way increases the risk profile 
 here, since all we did was overwrite the prior statement with the new one. 
 This change means that clients referencing the old statement continue to 
 function and the client registering the colliding statement will not execute 
 the correct statement, but this is in no way worse than the reverse situation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (CASSANDRA-7923) When preparing a statement, do not parse the provided string if we already have the parsed statement cached

2014-09-12 Thread Benedict (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-7923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benedict updated CASSANDRA-7923:

Labels: cql performance  (was: )

 When preparing a statement, do not parse the provided string if we already 
 have the parsed statement cached
 ---

 Key: CASSANDRA-7923
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7923
 Project: Cassandra
  Issue Type: Improvement
Reporter: Benedict
Assignee: Benedict
Priority: Minor
  Labels: cql, performance
 Fix For: 2.1.1

 Attachments: 7923.txt


 If there are many clients preparing the same statement (or the same client 
 preparing it multiple times), there's no point parsing the statement each 
 time. We already have it prepared; we should ship back the prior result.
 I would like us separately to consider introducing some checks to ensure that 
 we never have a hash collision (and error if we do, asking the user to salt 
 their query string), but this change in no way increases the risk profile 
 here, since all we did was overwrite the prior statement with the new one. 
 This change means that clients referencing the old statement continue to 
 function and the client registering the colliding statement will not execute 
 the correct statement, but this is in no way worse than the reverse situation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-7819) In progress compactions should not prevent deletion of stale sstables

2014-09-13 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14132568#comment-14132568
 ] 

Benedict commented on CASSANDRA-7819:
-

Looking at this a little more closely, the first patch actually looks 
completely safe. We never have a MergeTask outstanding in between iterations. 
In fact, I cannot see why we have the MergeTask submitted asynchronously at 
all, since we immediately synchronously wait on the result. The Deserializer 
is the only work done asynchronously, and it doesn't touch this code.

If this weren't true, the changes above would not be sufficient; we would need 
to introduce some other synchronization, or simply delay this change until 2.1.

ParallelCI returns an Unwrapper(MergeIterator); the _Reducer_ inside this 
MergeIterator, on a call to getReduced(), submits a Runnable to an executor, 
during which we need to be certain the sstables in that collection remain 
available. This getReduced() method is called on computeNext(), which is called 
on either hasNext() or next(). The Unwrapper always calls next() immediately 
after hasNext(), so there are never any left dangling, and _immediately_ calls 
FBUtilities.waitOnFuture() on the result.

The only other places we call shouldPurge are: inside the regular 
CompactionIterable Reducer, which is not asynchronously executed, and 
synchronously inside doCleanupCompaction and Scrubber.scrub.

I am a little sleep-deprived this second, so if you could double-check my 
logic, that would be great. But it looks solid to me.

 In progress compactions should not prevent deletion of stale sstables
 -

 Key: CASSANDRA-7819
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7819
 Project: Cassandra
  Issue Type: Improvement
  Components: Core
Reporter: Benedict
Assignee: Benedict
Priority: Minor
  Labels: compaction
 Fix For: 2.0.11

 Attachments: 0001-7819-v2.patch, 7819.txt


 Compactions retain references to potentially many sstables that existed when 
 they were started but that are now obsolete; many concurrent compactions can 
 compound this dramatically, and with very large files in size tiered 
 compaction it is possible to inflate disk utilisation dramatically beyond 
 what is necessary.
 I propose, during compaction, periodically checking which sstables are 
 obsolete and simply replacing each with the sstable that replaced it. These 
 sstables are by definition only used for lookup: since we are in the process 
 of obsoleting the sstables we're compacting, they're only used to reference 
 overlapping ranges which may be covered by tombstones.
 The simplest solution might even be to simply detect obsoletion and recalculate 
 our overlapping tree afresh. This is a pretty quick operation in the grand 
 scheme of things, certainly wrt compaction, so nothing is lost by doing this at 
 the rate we obsolete sstables.
 See CASSANDRA-7139 for original discussion of the problem.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-7819) In progress compactions should not prevent deletion of stale sstables

2014-09-13 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14132572#comment-14132572
 ] 

Benedict commented on CASSANDRA-7819:
-

I'm tempted to say we should just get rid of that executor step to absolutely 
guarantee it, since it really doesn't buy us anything at all - it almost 
certainly just slows things down.

 In progress compactions should not prevent deletion of stale sstables
 -

 Key: CASSANDRA-7819
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7819
 Project: Cassandra
  Issue Type: Improvement
  Components: Core
Reporter: Benedict
Assignee: Benedict
Priority: Minor
  Labels: compaction
 Fix For: 2.0.11

 Attachments: 0001-7819-v2.patch, 7819.txt


 Compactions retain references to potentially many sstables that existed when 
 they were started but that are now obsolete; many concurrent compactions can 
 compound this dramatically, and with very large files in size tiered 
 compaction it is possible to inflate disk utilisation dramatically beyond 
 what is necessary.
 I propose, during compaction, periodically checking which sstables are 
 obsolete and simply replacing each with the sstable that replaced it. These 
 sstables are by definition only used for lookup: since we are in the process 
 of obsoleting the sstables we're compacting, they're only used to reference 
 overlapping ranges which may be covered by tombstones.
 The simplest solution might even be to simply detect obsoletion and recalculate 
 our overlapping tree afresh. This is a pretty quick operation in the grand 
 scheme of things, certainly wrt compaction, so nothing is lost by doing this at 
 the rate we obsolete sstables.
 See CASSANDRA-7139 for original discussion of the problem.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-7282) Faster Memtable map

2014-09-13 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14132586#comment-14132586
 ] 

Benedict commented on CASSANDRA-7282:
-

Perhaps this is a good opportunity, for our second set of data points, to try 
taking the latest 'realistic workload' features of cassandra-stress for a whirl

 Faster Memtable map
 ---

 Key: CASSANDRA-7282
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7282
 Project: Cassandra
  Issue Type: Improvement
  Components: Core
Reporter: Benedict
Assignee: Benedict
  Labels: performance
 Fix For: 3.0

 Attachments: reads.svg, writes.svg


 Currently we maintain a ConcurrentSkipListMap of DecoratedKey -> Partition in 
 our memtables. Maintaining this is an O(lg(n)) operation; since the vast 
 majority of users use a hash partitioner, it occurs to me we could maintain a 
 hybrid ordered list / hash map. The list would impose the normal order on the 
 collection, but a hash index would live alongside as part of the same data 
 structure, simply mapping into the list and permitting O(1) lookups and 
 inserts.
 I've chosen to implement this initial version as a linked-list node per item, 
 but we can optimise this in future by storing fatter nodes that permit a 
 cache-line's worth of hashes to be checked at once,  further reducing the 
 constant factor costs for lookups.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (CASSANDRA-7282) Faster Memtable map

2014-09-13 Thread Benedict (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-7282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benedict updated CASSANDRA-7282:

Attachment: profile.yaml
run1.svg

Ok, so I ran a more realistic workload with the attached profile.yaml, 50/50 
read/writes, with reads favouring recently written partitions following an 
extreme distribution. i.e. the following stress command:

./tools/bin/cassandra-stress user profile=profile.yaml ops\(insert=5,read=5\) 
n=2000 -pop seq=1..10M read-lookback=extreme\(1..1M,2\) -rate threads=200 
-mode cql3 native prepared

This is still a workload geared towards exhibiting favourable behaviour, but 
it is certainly a larger-than-memory workload.

The attached graph comparing the results (run1.svg) demonstrates it is still 
showing a clear improvement: around 10% higher throughput, reduced latencies, 
and reduced total GC work. It also results in less frequent flushes, presumably 
due to it requiring slightly less memory than CSLM.

 Faster Memtable map
 ---

 Key: CASSANDRA-7282
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7282
 Project: Cassandra
  Issue Type: Improvement
  Components: Core
Reporter: Benedict
Assignee: Benedict
  Labels: performance
 Fix For: 3.0

 Attachments: profile.yaml, reads.svg, run1.svg, writes.svg


 Currently we maintain a ConcurrentSkipListMap of DecoratedKey -> Partition in 
 our memtables. Maintaining this is an O(lg(n)) operation; since the vast 
 majority of users use a hash partitioner, it occurs to me we could maintain a 
 hybrid ordered list / hash map. The list would impose the normal order on the 
 collection, but a hash index would live alongside as part of the same data 
 structure, simply mapping into the list and permitting O(1) lookups and 
 inserts.
 I've chosen to implement this initial version as a linked-list node per item, 
 but we can optimise this in future by storing fatter nodes that permit a 
 cache-line's worth of hashes to be checked at once,  further reducing the 
 constant factor costs for lookups.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-7919) Change timestamp representation to timeuuid

2014-09-13 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14132620#comment-14132620
 ] 

Benedict commented on CASSANDRA-7919:
-

It should be, but +1 on reducing the vector of this problem, regardless of the 
discussion around using TimeUUID for timestamps.

It's more important that this gets contributed to the various drivers, but 
since some users run multiple C* instances on the same box I'd support 
including this in C* as well, just in case.

 Change timestamp representation to timeuuid
 ---

 Key: CASSANDRA-7919
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7919
 Project: Cassandra
  Issue Type: Improvement
Reporter: T Jake Luciani
Priority: Minor
 Fix For: 3.0


 In order to overcome some of the issues with timestamps (CASSANDRA-6123) we 
 need to migrate to a better timestamp representation for cells.
 Since drivers already support timeuuid it makes sense to migrate to this 
 internally (see CASSANDRA-7056)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-7658) stress connects to all nodes when it shouldn't

2014-09-13 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14132625#comment-14132625
 ] 

Benedict commented on CASSANDRA-7658:
-

I figured I must have configured it correctly, since there was only one method 
to provide it, and whether it worked or not depended on the Java Driver. But I 
was mistaken: you have to provide it in the builder. I've double-checked, and 
the updated patch works correctly now.
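
For the record, the builder wiring looks roughly like this (a sketch assuming 
Java Driver 2.1's WhiteListPolicy; the addresses are examples):

    import java.net.InetSocketAddress;
    import java.util.Arrays;
    import java.util.Collection;
    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.policies.RoundRobinPolicy;
    import com.datastax.driver.core.policies.WhiteListPolicy;

    // Sketch: the node list must be handed to the builder as a WhiteListPolicy;
    // otherwise the driver performs ring discovery and connects to every node,
    // regardless of the contact points given.
    public class WhitelistedCluster
    {
        public static Cluster build(Collection<InetSocketAddress> whitelist)
        {
            return Cluster.builder()
                          .addContactPointsWithPorts(whitelist)
                          .withLoadBalancingPolicy(new WhiteListPolicy(new RoundRobinPolicy(), whitelist))
                          .build();
        }

        public static void main(String[] args)
        {
            Cluster cluster = build(Arrays.asList(new InetSocketAddress("127.0.0.1", 9042),
                                                  new InetSocketAddress("127.0.0.2", 9042)));
            System.out.println(cluster.getMetadata().getClusterName());
            cluster.close();
        }
    }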

 stress connects to all nodes when it shouldn't
 --

 Key: CASSANDRA-7658
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7658
 Project: Cassandra
  Issue Type: Improvement
  Components: Tools
Reporter: Brandon Williams
Assignee: Benedict
Priority: Minor
 Fix For: 2.1.1

 Attachments: 7658.txt


 If you tell stress -node 1,2 in a cluster with more nodes, stress appears to do 
 ring discovery and connect to them all anyway (checked via netstat.)  This 
 led to the confusion on CASSANDRA-7567



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-7724) Native-Transport threads get stuck in StorageProxy.preparePaxos with no one making progress

2014-09-13 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14132628#comment-14132628
 ] 

Benedict commented on CASSANDRA-7724:
-

It might be worth trying the (beta) patch for CASSANDRA-7542, to see if that 
has any impact. If it does, we could accelerate polishing it off.

FTR, it is not totally bizarre that all of them should be waiting - if another 
node is making progress on a round, all of the threads on a given node could be 
paused

 Native-Transport threads get stuck in StorageProxy.preparePaxos with no one 
 making progress
 ---

 Key: CASSANDRA-7724
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7724
 Project: Cassandra
  Issue Type: Bug
  Components: Core
 Environment: Linux 3.13.11-4 #4 SMP PREEMPT x86_64 Intel(R) Core(TM) 
 i7 CPU 950 @ 3.07GHz GenuineIntel 
 java version 1.8.0_05
 Java(TM) SE Runtime Environment (build 1.8.0_05-b13)
 Java HotSpot(TM) 64-Bit Server VM (build 25.5-b02, mixed mode)
 cassandra 2.0.9
Reporter: Anton Lebedevich
 Attachments: cassandra.threads2


 We've got a lot of write timeouts (cas) when running 
 INSERT INTO cas_demo(pri_id, sec_id, flag, something) VALUES(?, ?, ?, ?) IF 
 NOT EXISTS
  from 16 connections in parallel using the same pri_id and different sec_id.
 Doing the same from 4 connections in parallel works ok.
 All configuration values are at their default values.
 CREATE TABLE cas_demo (
   pri_id varchar,
   sec_id varchar,
   flag boolean,
   something set<varchar>,
   PRIMARY KEY (pri_id, sec_id)
 );
 CREATE INDEX cas_demo_flag ON cas_demo(flag);
 Full thread dump is attached. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (CASSANDRA-6726) Recycle CompressedRandomAccessReader/RandomAccessReader buffers independently of their owners, and move them off-heap when possible

2014-09-13 Thread Benedict (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-6726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benedict resolved CASSANDRA-6726.
-
Resolution: Won't Fix

 Recycle CompressedRandomAccessReader/RandomAccessReader buffers independently 
 of their owners, and move them off-heap when possible
 ---

 Key: CASSANDRA-6726
 URL: https://issues.apache.org/jira/browse/CASSANDRA-6726
 Project: Cassandra
  Issue Type: Improvement
Reporter: Benedict
Assignee: Branimir Lambov
Priority: Minor
  Labels: performance
 Fix For: 3.0

 Attachments: cassandra-6726.patch


 Whilst CRAR and RAR are pooled, we could and probably should pool the buffers 
 independently, so that they are not tied to a specific sstable. It may be 
 possible to move the RAR buffer off-heap, and the CRAR sometimes (e.g. Snappy 
 may possibly support off-heap buffers)





[jira] [Commented] (CASSANDRA-7784) DROP table leaves the counter and row cache in a temporarily inconsistent state that, if saved during, will cause an exception to be thrown

2014-09-13 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14132641#comment-14132641
 ] 

Benedict commented on CASSANDRA-7784:
-

Any news on a patch for this? We should really have got this into 2.1.0, but 
should definitely fix it for 2.1.1

 DROP table leaves the counter and row cache in a temporarily inconsistent 
 state that, if saved during, will cause an exception to be thrown
 ---

 Key: CASSANDRA-7784
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7784
 Project: Cassandra
  Issue Type: Bug
Reporter: Benedict
Assignee: Aleksey Yeschenko
Priority: Minor

 It looks like this is quite a realistic race to hit reasonably often, since 
 we forceBlockingFlush after removing from Schema.cfIdMap, so there could be a 
 lengthy window to overlap with an auto-save





[jira] [Updated] (CASSANDRA-6696) Partition sstables by token range

2014-09-13 Thread Benedict (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-6696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benedict updated CASSANDRA-6696:

Labels: compaction correctness performance  (was: )

 Partition sstables by token range
 -

 Key: CASSANDRA-6696
 URL: https://issues.apache.org/jira/browse/CASSANDRA-6696
 Project: Cassandra
  Issue Type: Improvement
  Components: Core
Reporter: sankalp kohli
Assignee: Marcus Eriksson
  Labels: compaction, correctness, performance
 Fix For: 3.0


 In JBOD, when someone gets a bad drive, the bad drive is replaced with a new 
 empty one and repair is run. 
 This can cause deleted data to come back in some cases. Also this is true for 
 corrupt stables in which we delete the corrupt stable and run repair. 
 Here is an example:
 Say we have 3 nodes A,B and C and RF=3 and GC grace=10days. 
 row=sankalp col=sankalp is written 20 days back and successfully went to all 
 three nodes. 
 Then a delete/tombstone was written successfully for the same row column 15 
 days back. 
 Since this tombstone is more than gc grace, it got compacted in Nodes A and B 
 since it got compacted with the actual data. So there is no trace of this row 
 column in node A and B.
 Now in node C, say the original data is in drive1 and tombstone is in drive2. 
 Compaction has not yet reclaimed the data and tombstone.  
 Drive2 becomes corrupt and was replaced with new empty drive. 
 Due to the replacement, the tombstone in now gone and row=sankalp col=sankalp 
 has come back to life. 
 Now after replacing the drive we run repair. This data will be propagated to 
 all nodes. 
 Note: This is still a problem even if we run repair every gc grace. 
  





[jira] [Updated] (CASSANDRA-7658) stress connects to all nodes when it shouldn't

2014-09-13 Thread Benedict (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-7658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benedict updated CASSANDRA-7658:

Attachment: 7658.v2.txt

 stress connects to all nodes when it shouldn't
 --

 Key: CASSANDRA-7658
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7658
 Project: Cassandra
  Issue Type: Improvement
  Components: Tools
Reporter: Brandon Williams
Assignee: Benedict
Priority: Minor
 Fix For: 2.1.1

 Attachments: 7658.txt, 7658.v2.txt


 If you tell stress -node 1,2 in a cluster with more nodes, stress appears to 
 do ring discovery and connect to them all anyway (checked via netstat). This 
 led to the confusion on CASSANDRA-7567





[jira] [Comment Edited] (CASSANDRA-7919) Change timestamp representation to timeuuid

2014-09-13 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14132810#comment-14132810
 ] 

Benedict edited comment on CASSANDRA-7919 at 9/13/14 4:17 PM:
--

That is stale info, unfortunately :)

If we correct the risk of collision within a single application server, which 
is pretty essential, our LSB will suddenly explode regardless of how static our 
client set-of-ips is, so optimising storage becomes impossible. And relying on a 
static client set-of-ips for performance is really unpleasant, anyway. It would 
be a very strange scenario for users to discover that their database 
performance degrades over time for reasons unrelated to their load.

That all said, if we only promise to store (and deliver back to users) the 
_time_ component *_only_*, I think we can do it.
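
(To make that concrete, a hedged sketch: for a version-1 uuid the 60-bit time 
component is recoverable on its own via java.util.UUID, so once uniqueness has 
done its job we could truncate to just those bits. The example uuid below is 
made up.)

{noformat}
// timestamp() returns the 60-bit v1 time component (100ns units since
// 1582-10-15); the clock sequence and node/LSB bits could be discarded
// once conflict resolution no longer needs them
java.util.UUID u = java.util.UUID.fromString("c87ee120-3b42-11e4-916c-0800200c9a66");
long timeComponent = u.timestamp();
{noformat}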


was (Author: benedict):
That is stale info, unfortunately :)

If we correct the risk of collision within a single application server, which 
is pretty essential, our LSB will suddenly explode regardless of how static our 
client set-of-ips is, so optimising storage becomes impossible. And relying on a 
static client set-of-ips for performance is really unpleasant, anyway. It would 
be a very strange scenario for users to discover that their database 
performance degrades over time for reasons unrelated to their load.

That all said, if we promise to store (and deliver back to users) the _time_ 
component *_only_*, I think we can do it.

 Change timestamp representation to timeuuid
 ---

 Key: CASSANDRA-7919
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7919
 Project: Cassandra
  Issue Type: Improvement
Reporter: T Jake Luciani
Priority: Minor
 Fix For: 3.0


 In order to overcome some of the issues with timestamps (CASSANDRA-6123) we 
 need to migrate to a better timestamp representation for cells.
 Since drivers already support timeuuid it makes sense to migrate to this 
 internally (see CASSANDRA-7056)





[jira] [Commented] (CASSANDRA-7919) Change timestamp representation to timeuuid

2014-09-13 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14132810#comment-14132810
 ] 

Benedict commented on CASSANDRA-7919:
-

That is stale info, unfortunately :)

If we correct the risk of collision within a single application server, which 
is pretty essential, our LSB will suddenly explode regardless of how static our 
client set-of-ips is, so optimising storage becomes impossible. And relying on a 
static client set-of-ips for performance is really unpleasant, anyway. It would 
be a very strange scenario for users to discover that their database 
performance degrades over time for reasons unrelated to their load.

That all said, if we promise to store (and deliver back to users) the _time_ 
component *_only_*, I think we can do it.

 Change timestamp representation to timeuuid
 ---

 Key: CASSANDRA-7919
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7919
 Project: Cassandra
  Issue Type: Improvement
Reporter: T Jake Luciani
Priority: Minor
 Fix For: 3.0


 In order to overcome some of the issues with timestamps (CASSANDRA-6123) we 
 need to migrate to a better timestamp representation for cells.
 Since drivers already support timeuuid it makes sense to migrate to this 
 internally (see CASSANDRA-7056)





[jira] [Commented] (CASSANDRA-7916) Stress should collect and cross-cluster GC statistics

2014-09-13 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14132815#comment-14132815
 ] 

Benedict commented on CASSANDRA-7916:
-

I doc'd them in the MXBean interface comment :)

Not sure we want to pollute the README with this info, but definitely happy to 
insert it in other places..

bq. It might also be nice to summarize the total gc collections in the Results 
summary.

Good idea.

 Stress should collect and cross-cluster GC statistics
 -

 Key: CASSANDRA-7916
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7916
 Project: Cassandra
  Issue Type: Improvement
  Components: Tools
Reporter: Benedict
Assignee: Benedict
Priority: Minor
 Fix For: 2.1.1


 It would be useful to see stress outputs deliver cross-cluster statistics, 
 the most useful being GC data. Some simple changes to GCInspector collect the 
 data, and can deliver to a nodetool request or to stress over JMX.





[jira] [Commented] (CASSANDRA-7916) Stress should collect and cross-cluster GC statistics

2014-09-13 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14132819#comment-14132819
 ] 

Benedict commented on CASSANDRA-7916:
-

Updated repository with results summary change

 Stress should collect and cross-cluster GC statistics
 -

 Key: CASSANDRA-7916
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7916
 Project: Cassandra
  Issue Type: Improvement
  Components: Tools
Reporter: Benedict
Assignee: Benedict
Priority: Minor
 Fix For: 2.1.1


 It would be useful to see stress outputs deliver cross-cluster statistics, 
 the most useful being GC data. Some simple changes to GCInspector collect the 
 data, and can deliver to a nodetool request or to stress over JMX.





[jira] [Commented] (CASSANDRA-7919) Change timestamp representation to timeuuid

2014-09-13 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14132830#comment-14132830
 ] 

Benedict commented on CASSANDRA-7919:
-

Well, we're talking about making _every field a timeuuid_, meaning this 
decision has rather more impact than just those specific TimeUUID fields (which 
I've no issue not optimising the storage of, or only managing to do so 
partially).


 Change timestamp representation to timeuuid
 ---

 Key: CASSANDRA-7919
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7919
 Project: Cassandra
  Issue Type: Improvement
Reporter: T Jake Luciani
Priority: Minor
 Fix For: 3.0


 In order to overcome some of the issues with timestamps (CASSANDRA-6123) we 
 need to migrate to a better timestamp representation for cells.
 Since drivers already support timeuuid it makes sense to migrate to this 
 internally (see CASSANDRA-7056)





[jira] [Created] (CASSANDRA-7926) Stress can OOM on merging of timing samples

2014-09-13 Thread Benedict (JIRA)
Benedict created CASSANDRA-7926:
---

 Summary: Stress can OOM on merging of timing samples
 Key: CASSANDRA-7926
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7926
 Project: Cassandra
  Issue Type: Improvement
  Components: Tools
Reporter: Benedict
Assignee: Benedict
Priority: Minor
 Fix For: 2.1.1


{noformat}
Exception in thread "StressMetrics:2" java.lang.OutOfMemoryError: Java heap 
space
at java.util.Arrays.copyOf(Arrays.java:2343)
at 
org.apache.cassandra.stress.util.SampleOfLongs.merge(SampleOfLongs.java:76)
at 
org.apache.cassandra.stress.util.TimingInterval.merge(TimingInterval.java:95)
at org.apache.cassandra.stress.util.Timing.snapInterval(Timing.java:95)
at 
org.apache.cassandra.stress.StressMetrics.update(StressMetrics.java:124)
at 
org.apache.cassandra.stress.StressMetrics.access$200(StressMetrics.java:36)
at 
org.apache.cassandra.stress.StressMetrics$1.run(StressMetrics.java:72)
at java.lang.Thread.run(Thread.java:744)
{noformat}

This is partially down to recently increasing the per-thread sample size, but 
also because we allocate temporary space linear in the total sample size across 
all threads during merge. This can easily be avoided. We should also scale the 
per-thread sample size based on the total number of threads, so we limit total 
memory use.





[jira] [Commented] (CASSANDRA-7916) Stress should collect and cross-cluster GC statistics

2014-09-13 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14132850#comment-14132850
 ] 

Benedict commented on CASSANDRA-7916:
-

Ah, I see. That makes sense.

I don't think we have a stress-specific README, though? Might be worth creating 
one, but probably a separate ticket...

 Stress should collect and cross-cluster GC statistics
 -

 Key: CASSANDRA-7916
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7916
 Project: Cassandra
  Issue Type: Improvement
  Components: Tools
Reporter: Benedict
Assignee: Benedict
Priority: Minor
 Fix For: 2.1.1


 It would be useful to see stress outputs deliver cross-cluster statistics, 
 the most useful being GC data. Some simple changes to GCInspector collect the 
 data, and can deliver to a nodetool request or to stress over JMX.





[jira] [Commented] (CASSANDRA-7282) Faster Memtable map

2014-09-13 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14132887#comment-14132887
 ] 

Benedict commented on CASSANDRA-7282:
-

For the Murmur3Partitioner, the Token (which is key for this map) is sorted by 
hash, so all we require is that the hashCode() method returns a prefix of this 
hash.
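
A minimal sketch of that constraint (illustrative only, not the actual Token 
class):

{noformat}
// a token ordered by its 64-bit hash, whose hashCode() is the top 32 bits of
// that hash -- so hashCode() order is consistent with the map's sort order
final class HashToken implements Comparable<HashToken>
{
    final long hash;
    HashToken(long hash) { this.hash = hash; }
    public int compareTo(HashToken that) { return Long.compare(this.hash, that.hash); }
    @Override
    public int hashCode() { return (int) (hash >>> 32); }
}
{noformat}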

 Faster Memtable map
 ---

 Key: CASSANDRA-7282
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7282
 Project: Cassandra
  Issue Type: Improvement
  Components: Core
Reporter: Benedict
Assignee: Benedict
  Labels: performance
 Fix For: 3.0

 Attachments: profile.yaml, reads.svg, run1.svg, writes.svg


 Currently we maintain a ConcurrentSkipListMap of DecoratedKey -> Partition in 
 our memtables. Maintaining this is an O(lg(n)) operation; since the vast 
 majority of users use a hash partitioner, it occurs to me we could maintain a 
 hybrid ordered list / hash map. The list would impose the normal order on the 
 collection, but a hash index would live alongside as part of the same data 
 structure, simply mapping into the list and permitting O(1) lookups and 
 inserts.
 I've chosen to implement this initial version as a linked-list node per item, 
 but we can optimise this in future by storing fatter nodes that permit a 
 cache-line's worth of hashes to be checked at once,  further reducing the 
 constant factor costs for lookups.





[jira] [Comment Edited] (CASSANDRA-7282) Faster Memtable map

2014-09-13 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14132887#comment-14132887
 ] 

Benedict edited comment on CASSANDRA-7282 at 9/13/14 7:14 PM:
--

For the Murmur3Partitioner, the Token (which is key for this map) is sorted by 
hash (represented as a Long), so all we require is that the hashCode() method 
returns a prefix of this hash, or the top 32-bits of the long value.


was (Author: benedict):
For the Murmur3Partitioner, the Token (which is key for this map) is sorted by 
hash, so all we require is that the hashCode() method returns a prefix of this 
hash.

 Faster Memtable map
 ---

 Key: CASSANDRA-7282
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7282
 Project: Cassandra
  Issue Type: Improvement
  Components: Core
Reporter: Benedict
Assignee: Benedict
  Labels: performance
 Fix For: 3.0

 Attachments: profile.yaml, reads.svg, run1.svg, writes.svg


 Currently we maintain a ConcurrentSkipListMap of DecoratedKey -> Partition in 
 our memtables. Maintaining this is an O(lg(n)) operation; since the vast 
 majority of users use a hash partitioner, it occurs to me we could maintain a 
 hybrid ordered list / hash map. The list would impose the normal order on the 
 collection, but a hash index would live alongside as part of the same data 
 structure, simply mapping into the list and permitting O(1) lookups and 
 inserts.
 I've chosen to implement this initial version as a linked-list node per item, 
 but we can optimise this in future by storing fatter nodes that permit a 
 cache-line's worth of hashes to be checked at once,  further reducing the 
 constant factor costs for lookups.





[jira] [Commented] (CASSANDRA-7282) Faster Memtable map

2014-09-13 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14132906#comment-14132906
 ] 

Benedict commented on CASSANDRA-7282:
-

As already stated, I disagree. Always, when making these decisions, the 
important factors are: 1) what workloads / portions of workloads are affected; 
and 2) do we consider these common enough (or to become common enough) to 
warrant inclusion?

Since we are very constrained in our hardware and workload generation, picking 
a realistic workload that we can run at best helps us rule out regressions if 
it is not compatible with exhibiting the change. The important question is 
simply: do we consider it likely that it will impact other workloads we are not 
capable of benchmarking, given the known information we have from isolating its 
effect in a manner we _can_?

 Faster Memtable map
 ---

 Key: CASSANDRA-7282
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7282
 Project: Cassandra
  Issue Type: Improvement
  Components: Core
Reporter: Benedict
Assignee: Benedict
  Labels: performance
 Fix For: 3.0

 Attachments: profile.yaml, reads.svg, run1.svg, writes.svg


 Currently we maintain a ConcurrentSkipListMap of DecoratedKey -> Partition in 
 our memtables. Maintaining this is an O(lg(n)) operation; since the vast 
 majority of users use a hash partitioner, it occurs to me we could maintain a 
 hybrid ordered list / hash map. The list would impose the normal order on the 
 collection, but a hash index would live alongside as part of the same data 
 structure, simply mapping into the list and permitting O(1) lookups and 
 inserts.
 I've chosen to implement this initial version as a linked-list node per item, 
 but we can optimise this in future by storing fatter nodes that permit a 
 cache-line's worth of hashes to be checked at once,  further reducing the 
 constant factor costs for lookups.





[jira] [Commented] (CASSANDRA-7919) Change timestamp representation to timeuuid

2014-09-13 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14132935#comment-14132935
 ] 

Benedict commented on CASSANDRA-7919:
-

bq. Regarding the lsb, a given client shouldn't really change it during its 
lifetime (per the timeuuid spec)

This is not true if we incorporate CASSANDRA-7925

If we fix this, which we should, and we use TimeUUID for solving this problem 
generally, we should _not_ deliver the TimeUUID back to clients, and we should 
not store it fully indefinitely. We should only use it for guaranteeing 
uniqueness, and after that truncate it to a simple timestamp. There is no 
benefit to storing it indefinitely.

 Change timestamp representation to timeuuid
 ---

 Key: CASSANDRA-7919
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7919
 Project: Cassandra
  Issue Type: Improvement
Reporter: T Jake Luciani
Priority: Minor
 Fix For: 3.0


 In order to overcome some of the issues with timestamps (CASSANDRA-6123) we 
 need to migrate to a better timestamp representation for cells.
 Since drivers already support timeuuid it makes sense to migrate to this 
 internally (see CASSANDRA-7056)





[jira] [Commented] (CASSANDRA-7282) Faster Memtable map

2014-09-13 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14132953#comment-14132953
 ] 

Benedict commented on CASSANDRA-7282:
-

bq. Since this is the case, should we restrict this method to Murmur tokens 
only?

Well, if I can spend the time ensuring safety we should be able to use this for 
RandomPartitioner also.

bq. I would still change the condition in the preface to use non-strict 
inequality, though, because cropping tokens to 32 bits will introduce 
collisions.

There will be collisions with or without truncation. The fact that there are 
collisions doesn't affect the constraint I've imposed upon the data; we assume 
nothing about the dataset when two hashCode()s are equal, and simply resort to 
the underlying token comparator, so I think the constraint is sufficient. 
Non-strict equality is surely more broken, however? It would hold true for 
k1:1, k2:2, hash(k1):2, hash(k2):2. We could strengthen it to a bi-directional 
implication, i.e. k1 <= k2 <==> k1.hashCode() <= k2.hashCode().
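
(For concreteness, a tiny worked example of the collision truncation 
introduces, with made-up token values:)

{noformat}
// two distinct 64-bit tokens whose top 32 bits coincide: t1 < t2 strictly,
// yet their truncated hashCode()s are equal -- so the RHS can only be <=
long t1 = 0x00000001_00000000L;
long t2 = 0x00000001_FFFFFFFFL;
int h1 = (int) (t1 >>> 32); // 1
int h2 = (int) (t2 >>> 32); // 1, hence hash(k1) < hash(k2) cannot be required
{noformat}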

bq. I wouldn't even call this a hashCode to avoid the confusion, perhaps an 
ordering key or ordering prefix?

Perhaps. This was the simplest approach, and since it *is* a hash key used to 
index a hash table it seems suitable to use hashCode(), and impose the extra 
constraint contextually. I'm fairly neutral though; we certainly could 
introduce a new interface. I'll see how it looks.

 Faster Memtable map
 ---

 Key: CASSANDRA-7282
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7282
 Project: Cassandra
  Issue Type: Improvement
  Components: Core
Reporter: Benedict
Assignee: Benedict
  Labels: performance
 Fix For: 3.0

 Attachments: profile.yaml, reads.svg, run1.svg, writes.svg


 Currently we maintain a ConcurrentSkipListMap of DecoratedKey -> Partition in 
 our memtables. Maintaining this is an O(lg(n)) operation; since the vast 
 majority of users use a hash partitioner, it occurs to me we could maintain a 
 hybrid ordered list / hash map. The list would impose the normal order on the 
 collection, but a hash index would live alongside as part of the same data 
 structure, simply mapping into the list and permitting O(1) lookups and 
 inserts.
 I've chosen to implement this initial version as a linked-list node per item, 
 but we can optimise this in future by storing fatter nodes that permit a 
 cache-line's worth of hashes to be checked at once,  further reducing the 
 constant factor costs for lookups.





[jira] [Comment Edited] (CASSANDRA-7282) Faster Memtable map

2014-09-13 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14132953#comment-14132953
 ] 

Benedict edited comment on CASSANDRA-7282 at 9/13/14 9:49 PM:
--

bq. Since this is the case, should we restrict this method to Murmur tokens 
only?

Well, if I can spend the time ensuring safety we should be able to use this for 
RandomPartitioner also.

bq. I would still change the condition in the preface to use non-strict 
inequality, though, because cropping tokens to 32 bits will introduce 
collisions.

-There will be collisions with or without truncation. The fact that there are 
collisions doesn't affect the constraint I've imposed upon the data; we assume 
nothing about the dataset when two hashCode()s are equal, and simply resort to 
the underlying token comparator, so I think the constraint is sufficient. 
Non-strict equality is surely more broken, however? It would hold true for 
k1:1, k2:2, hash(k1):2, hash(k2):2. We could strengthen it to a bi-directional 
implication, i.e. k1 <= k2 <==> k1.hashCode() <= k2.hashCode().-

I just reread your comment and realise you meant the RHS only. Agreed this 
should be changed.

bq. I wouldn't even call this a hashCode to avoid the confusion, perhaps an 
ordering key or ordering prefix?

Perhaps. This was the simplest approach, and since it *is* a hash key used to 
index a hash table it seems suitable to use hashCode(), and impose the extra 
constraint contextually. I'm fairly neutral though; we certainly could 
introduce a new interface. I'll see how it looks.


was (Author: benedict):
bq. Since this is the case, should we restrict this method to Murmur tokens 
only?

Well, if I can spend the time ensuring safety we should be able to use this for 
RandomPartitioner also.

bq. I would still change the condition in the preface to use non-strict 
inequality, though, because cropping tokens to 32 bits will introduce 
collisions.

There will be collisions with or without truncation. The fact that there are 
collisions doesn't affect the constraint I've imposed upon the data; we assume 
nothing about the dataset when two hashCode()s are equal, and simply resort to 
the underlying token comparator, so I think the constraint is sufficient. 
Non-strict equality is surely more broken, however? It would hold true for 
k1:1, k2:2, hash(k1):2, hash(k2):2. We could strengthen it to a bi-directional 
implication, i.e. k1 <= k2 <==> k1.hashCode() <= k2.hashCode().

bq. I wouldn't even call this a hashCode to avoid the confusion, perhaps an 
ordering key or ordering prefix?

Perhaps. This was the simplest approach, and since it *is* a hash key used to 
index a hash table it seems suitable to use hashCode(), and impose the extra 
constraint contextually. I'm fairly neutral though; we certainly could 
introduce a new interface. I'll see how it looks.

 Faster Memtable map
 ---

 Key: CASSANDRA-7282
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7282
 Project: Cassandra
  Issue Type: Improvement
  Components: Core
Reporter: Benedict
Assignee: Benedict
  Labels: performance
 Fix For: 3.0

 Attachments: profile.yaml, reads.svg, run1.svg, writes.svg


 Currently we maintain a ConcurrentSkipListMap of DecoratedKey -> Partition in 
 our memtables. Maintaining this is an O(lg(n)) operation; since the vast 
 majority of users use a hash partitioner, it occurs to me we could maintain a 
 hybrid ordered list / hash map. The list would impose the normal order on the 
 collection, but a hash index would live alongside as part of the same data 
 structure, simply mapping into the list and permitting O(1) lookups and 
 inserts.
 I've chosen to implement this initial version as a linked-list node per item, 
 but we can optimise this in future by storing fatter nodes that permit a 
 cache-line's worth of hashes to be checked at once,  further reducing the 
 constant factor costs for lookups.





[jira] [Commented] (CASSANDRA-7919) Change timestamp representation to timeuuid

2014-09-13 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14132979#comment-14132979
 ] 

Benedict commented on CASSANDRA-7919:
-

bq. a given client process will have one lsb for its lifetime.

Sure, but when discussing state space we have to assume application processes 
are restarted periodically :-)

bq. I don't have a problem with that.

Ok, if we can agree to that, fix the LSB, and make clear to users that they 
should not connect through NAT, I think that deals with most of my concerns. 
This will still be more annoying to store, but surmountable without severe loss 
of efficiency.

 Change timestamp representation to timeuuid
 ---

 Key: CASSANDRA-7919
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7919
 Project: Cassandra
  Issue Type: Improvement
Reporter: T Jake Luciani
Priority: Minor
 Fix For: 3.0


 In order to overcome some of the issues with timestamps (CASSANDRA-6123) we 
 need to migrate to a better timestamp representation for cells.
 Since drivers already support timeuuid it makes sense to migrate to this 
 internally (see CASSANDRA-7056)





[jira] [Comment Edited] (CASSANDRA-7282) Faster Memtable map

2014-09-13 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14132953#comment-14132953
 ] 

Benedict edited comment on CASSANDRA-7282 at 9/13/14 9:55 PM:
--

bq. Since this is the case, should we restrict this method to Murmur tokens 
only?

Well, if I can spend the time ensuring safety we should be able to use this for 
RandomPartitioner also.

bq. I would still change the condition in the preface to use non-strict 
inequality, though, because cropping tokens to 32 bits will introduce 
collisions.

-There will be collisions with or without truncation. The fact that there are 
collisions doesn't affect the constraint I've imposed upon the data; we assume 
nothing about the dataset when two hashCode()s are equal, and simply resort to 
the underlying token comparator, so I think the constraint is sufficient. 
Non-strict equality is surely more broken, however? It would hold true for 
k1:1, k2:2, hash(k1):2, hash(k2):2. We could strengthen it to a bi-directional 
implication, i.e. k1 <= k2 <==> k1.hashCode() <= k2.hashCode().-

I just reread your comment and realise you meant the RHS only. Agreed this 
should be changed. However, we should probably opt for the stronger 
bidirectional constraint, since it is still more correct.

bq. I wouldn't even call this a hashCode to avoid the confusion, perhaps an 
ordering key or ordering prefix?

Perhaps. This was the simplest approach, and since it *is* a hash key used to 
index a hash table it seems suitable to use hashCode(), and impose the extra 
constraint contextually. I'm fairly neutral though; we certainly could 
introduce a new interface. I'll see how it looks.


was (Author: benedict):
bq. Since this is the case, should we restrict this method to Murmur tokens 
only?

Well, if I can spend the time ensuring safety we should be able to use this for 
RandomPartitioner also.

bq. I would still change the condition in the preface to use non-strict 
inequality, though, because cropping tokens to 32 bits will introduce 
collisions.

-There will be collisions with or without truncation. The fact that there are 
collisions doesn't affect the constraint I've imposed upon the data; we assume 
nothing about the dataset when two hashCode()s are equal, and simply resort to 
the underlying token comparator, so I think the constraint is sufficient. 
Non-strict equality is surely more broken, however? It would hold true for 
k1:1, k2:2, hash(k1):2, hash(k2):2. We could strengthen it to a bi-directional 
implication, i.e. k1 <= k2 <==> k1.hashCode() <= k2.hashCode().-

I just reread your comment and realise you meant the RHS only. Agreed this 
should be changed.

bq. I wouldn't even call this a hashCode to avoid the confusion, perhaps an 
ordering key or ordering prefix?

Perhaps. This was the simplest approach, and since it *is* a hash key used to 
index a hash table it seems suitable to use hashCode(), and impose the extra 
constraint contextually. I'm fairly neutral though; we certainly could 
introduce a new interface. I'll see how it looks.

 Faster Memtable map
 ---

 Key: CASSANDRA-7282
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7282
 Project: Cassandra
  Issue Type: Improvement
  Components: Core
Reporter: Benedict
Assignee: Benedict
  Labels: performance
 Fix For: 3.0

 Attachments: profile.yaml, reads.svg, run1.svg, writes.svg


 Currently we maintain a ConcurrentSkipListMap of DecoratedKey -> Partition in 
 our memtables. Maintaining this is an O(lg(n)) operation; since the vast 
 majority of users use a hash partitioner, it occurs to me we could maintain a 
 hybrid ordered list / hash map. The list would impose the normal order on the 
 collection, but a hash index would live alongside as part of the same data 
 structure, simply mapping into the list and permitting O(1) lookups and 
 inserts.
 I've chosen to implement this initial version as a linked-list node per item, 
 but we can optimise this in future by storing fatter nodes that permit a 
 cache-line's worth of hashes to be checked at once,  further reducing the 
 constant factor costs for lookups.





[jira] [Commented] (CASSANDRA-7282) Faster Memtable map

2014-09-13 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14132991#comment-14132991
 ] 

Benedict commented on CASSANDRA-7282:
-

Hmm. Yes. I shouldn't get into these discussions at this time of night.

Agreed.

 Faster Memtable map
 ---

 Key: CASSANDRA-7282
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7282
 Project: Cassandra
  Issue Type: Improvement
  Components: Core
Reporter: Benedict
Assignee: Benedict
  Labels: performance
 Fix For: 3.0

 Attachments: profile.yaml, reads.svg, run1.svg, writes.svg


 Currently we maintain a ConcurrentSkipListMap of DecoratedKey -> Partition in 
 our memtables. Maintaining this is an O(lg(n)) operation; since the vast 
 majority of users use a hash partitioner, it occurs to me we could maintain a 
 hybrid ordered list / hash map. The list would impose the normal order on the 
 collection, but a hash index would live alongside as part of the same data 
 structure, simply mapping into the list and permitting O(1) lookups and 
 inserts.
 I've chosen to implement this initial version as a linked-list node per item, 
 but we can optimise this in future by storing fatter nodes that permit a 
 cache-line's worth of hashes to be checked at once,  further reducing the 
 constant factor costs for lookups.





[jira] [Commented] (CASSANDRA-7931) Single Host CPU profiling data dump

2014-09-13 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14133090#comment-14133090
 ] 

Benedict commented on CASSANDRA-7931:
-

for (1): take a look at CASSANDRA-4718. The costs of thread signalling for our 
small messages are always dominant; however, we have already reduced this 
substantially with the SEPExecutor. I will be blogging about this in the near 
future. Further optimisation may well be possible, but avoiding blocking on the 
executor queues is exactly why this executor is more efficient. Thread 
signalling is expensive, and creates bottlenecks.

for (3): You are incorrect. The park() is only entered if the data is _not_ 
already available. For local requests the SEPExecutor permits optimising the 
call directly into the calling thread _if there are fewer requests than 
permitted to run on the read stage_. So if you tuned your read stage higher 
than your max_rpc count, this call would never happen for local-only requests.
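
(A toy sketch of that run-it-inline idea, not the actual SEPExecutor code:)

{noformat}
// uses java.util.concurrent.atomic.AtomicInteger. If the stage is below its
// concurrency limit, the submitting thread claims a permit and runs the task
// itself, so no worker thread needs to be signalled at all.
static boolean maybeExecuteInline(Runnable task, AtomicInteger running, int limit)
{
    while (true)
    {
        int current = running.get();
        if (current >= limit)
            return false; // stage saturated: caller should enqueue instead
        if (running.compareAndSet(current, current + 1))
        {
            try { task.run(); }
            finally { running.decrementAndGet(); }
            return true;
        }
    }
}
{noformat}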

for (2) and (4), we can certainly do more: see CASSANDRA-7907 and CASSANDRA-7029


 Single Host CPU profiling data dump
 ---

 Key: CASSANDRA-7931
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7931
 Project: Cassandra
  Issue Type: Improvement
 Environment: 2.1.0 on my local machine
Reporter: Michael Nelson
Priority: Minor
 Attachments: traces.txt


 At Boot Camp today I did some CPU profiling and wanted to turn in my findings 
 somehow. So here they are in a JIRA.
 I ran cassandra-stress read test against my local machine (single node: SSD, 
 8 proc, plenty of RAM). I'm using straight 2.1.0 with the Lightweight Java 
 Profiler by Jeremy Manson 
 (https://code.google.com/p/lightweight-java-profiler/). This is a lower-level 
 profiler that more accurately reports what the CPU is actually doing rather 
 than where time is going.
 Attached is the report. Here are a few high points with some commentary from 
 some investigation I did today:
 1. The number one _consumer_of_CPU_ is this stack trace:
 sun.misc.Unsafe.park(Unsafe.java:-1)
 java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:349)
 org.apache.cassandra.concurrent.SEPWorker.doWaitSpin(SEPWorker.java:236)
 org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:74)
 java.lang.Thread.run(Thread.java:745)
 There is a CPU cost to going into and out of Unsafe.park() and the SEPWorkers 
 are doing it so much when they are spinning that they were the largest single 
 consumer of CPU. Note that this profiler is sampling the CPU so it isn't the 
 _blocking_ that this call does that is the issue. It is the CPU actually 
 being consumed as it prepares to block and then later comes out. This kind of 
 lower level fine grained control of threads is tricky and there appears to be 
 room for improvement here. I tried a few approaches that didn't pan out. 
 There really needs to be a way for threads to block waiting for the executor 
 queues. That way they will be immediately available (rather than waiting for 
 them to poll the executors) and will not consume CPU when they aren't needed. 
 Maybe block for some short period of time and then become available for other 
 queues after that?
 2. Second is Netty writing to sockets. I didn't investigate this. Netty is 
 pretty optimized. Someone mentioned there are native options for Netty. 
 Probably worth trying.
 3. Third is similar to #1:
 sun.misc.Unsafe.park(Unsafe.java:-1)
 java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:349)
 org.apache.cassandra.utils.concurrent.WaitQueue$AbstractSignal.awaitUntil(WaitQueue.java:303)
  
 org.apache.cassandra.utils.concurrent.SimpleCondition.await(SimpleCondition.java:63)
 org.apache.cassandra.service.ReadCallback.await(ReadCallback.java:90)
 org.apache.cassandra.service.ReadCallback.get(ReadCallback.java:100)
 org.apache.cassandra.service.AbstractReadExecutor.get(AbstractReadExecutor.java:144)
 org.apache.cassandra.service.StorageProxy.fetchRows(StorageProxy.java:1226)
 I looked at this a little. The asynchronous callback using Unsafe.park() is 
 being invoked even for local reads (like my local laptop). It only rarely 
 blocked. It just paid the CPU cost of going in and immediately coming out 
 because the read was already satisfied from the buffer cache. :-(
 4. Fourth is Netty doing epoll. Again, worth considering Netty's native 
 optimizations to help here.
 The others are interesting but, for this little benchmark, not as big a deal. 
 For example, ArrayList.grow() shows up more than I would have expected.
 The Boot Camp was great. Feel free to contact me if I can help somehow.





[jira] [Commented] (CASSANDRA-7930) Warn when evicting prepared statements from cache

2014-09-13 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14133091#comment-14133091
 ] 

Benedict commented on CASSANDRA-7930:
-

Please use StorageService.scheduledTasks instead of creating a new executor for 
such low-volume work, e.g.:
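
{noformat}
// hedged sketch (the counter and logger are made up for illustration):
// schedule the periodic eviction check on the shared executor rather than
// allocating a dedicated thread pool for it
StorageService.scheduledTasks.scheduleAtFixedRate(new Runnable()
{
    public void run()
    {
        long evicted = evictedCount.getAndSet(0); // hypothetical AtomicLong
        if (evicted > 0)
            logger.warn("{} prepared statements evicted from cache", evicted);
    }
}, 1, 1, TimeUnit.MINUTES);
{noformat}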

 Warn when evicting prepared statements from cache
 -

 Key: CASSANDRA-7930
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7930
 Project: Cassandra
  Issue Type: Improvement
  Components: Core
Reporter: Robbie Strickland
Assignee: Robbie Strickland
 Attachments: cassandra-2.0-v2.txt, cassandra-2.0-v3.txt, 
 cassandra-2.0-v4.txt, cassandra-2.0.txt, cassandra-2.1.txt


 The prepared statement cache is an LRU, with a max size of maxMemory / 256.  
 There is currently no warning when statements are evicted, which could be 
 problematic if the user is unaware that this is happening.
 At the very least, we should provide a JMX metric and possibly a log message 
 indicating this is happening.  At some point it may also be worthwhile to 
 make this tunable for users with large numbers of statements.





[jira] [Commented] (CASSANDRA-7705) Safer Resource Management

2014-09-14 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14133110#comment-14133110
 ] 

Benedict commented on CASSANDRA-7705:
-

Patch available 
[here|https://github.com/belliottsmith/cassandra/tree/7705-resourcemgmt]

This patch traps reference leaks as well as double-releasing references. This 
version only modifies SSTableReader resource management, but it can be rolled 
out for any heavy-weight objects we manage, i.e. all instances of 
RefCountedMemory except those stored in SerializingCache.  
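
The core idea, as a hedged sketch (names illustrative; not the patch itself):

{noformat}
// uses java.util.concurrent.atomic.AtomicBoolean / AtomicInteger.
// A once-only handle over a shared refcount: releasing the same handle twice
// fails immediately at the second, buggy call site, instead of silently
// corrupting the shared count and failing much later
final class Ref
{
    private final AtomicInteger shared;
    private final AtomicBoolean released = new AtomicBoolean(false);
    Ref(AtomicInteger shared)
    {
        this.shared = shared;
        shared.incrementAndGet();
    }
    void release()
    {
        if (!released.compareAndSet(false, true))
            throw new IllegalStateException("double release of reference");
        shared.decrementAndGet();
    }
}
{noformat}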

 Safer Resource Management
 -

 Key: CASSANDRA-7705
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7705
 Project: Cassandra
  Issue Type: Improvement
  Components: Core
Reporter: Benedict
 Fix For: 3.0


 We've had a spate of bugs recently with bad reference counting. these can 
 have potentially dire consequences, generally either randomly deleting data 
 or giving us infinite loops. 
 Since in 2.1 we only reference count resources that are relatively expensive 
 and infrequently managed (or in places where this safety is probably not as 
 necessary, e.g. SerializingCache), we could without any negative consequences 
 (and only slight code complexity) introduce a safer resource management 
 scheme for these more expensive/infrequent actions.
 Basically, I propose when we want to acquire a resource we allocate an object 
 that manages the reference. This can only be released once; if it is released 
 twice, we fail immediately at the second release, reporting where the bug is 
 (rather than letting it continue fine until the next correct release corrupts 
 the count). The reference counter remains the same, but we obtain guarantees 
 that the reference count itself is never badly maintained, although code 
 using it could mistakenly release its own handle early (typically this is 
 only an issue when cleaning up after a failure, in which case under the new 
 scheme this would be an innocuous error)





[jira] [Created] (CASSANDRA-7932) Corrupt SSTable Cleanup Leak Resources

2014-09-14 Thread Benedict (JIRA)
Benedict created CASSANDRA-7932:
---

 Summary: Corrupt SSTable Cleanup Leak Resources
 Key: CASSANDRA-7932
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7932
 Project: Cassandra
  Issue Type: Bug
  Components: Core
Reporter: Benedict
Assignee: Benedict
 Fix For: 2.1.1


CASSANDRA-7705, during the BlacklistingCompactionsTest, detected leaks. I've 
tracked this down to DataTracker.removeUnreadableSSTables(), which does not 
release the tracker's reference to any of the sstables it removes.
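
i.e., sketching the missing step (the loop is illustrative; only the 
releaseReference() call is the point):

{noformat}
// when dropping unreadable sstables from the tracker, the tracker's reference
// on each must also be released, or the underlying resources leak
for (SSTableReader sstable : unreadable)
    sstable.releaseReference(); // the step removeUnreadableSSTables() omits
{noformat}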





[jira] [Updated] (CASSANDRA-7932) Corrupt SSTable Cleanup Leak Resources

2014-09-14 Thread Benedict (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-7932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benedict updated CASSANDRA-7932:

Attachment: 7932.txt

Patch attached

 Corrupt SSTable Cleanup Leak Resources
 --

 Key: CASSANDRA-7932
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7932
 Project: Cassandra
  Issue Type: Bug
  Components: Core
Reporter: Benedict
Assignee: Benedict
 Fix For: 2.1.1

 Attachments: 7932.txt


 CASSANDRA-7705, during the BlacklistingCompactionsTest, detected leaks. 
 I've tracked this down to DataTracker.removeUnreadableSSTables(), which does 
 not release the tracker's reference to any of the sstables it removes.





[jira] [Commented] (CASSANDRA-7282) Faster Memtable map

2014-09-14 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14133122#comment-14133122
 ] 

Benedict commented on CASSANDRA-7282:
-

bq. I wouldn't even call this a hashCode to avoid the confusion, perhaps an 
ordering key or ordering prefix?

Creating an interface would be problematic, since we need to have our map key 
be a shared type for both CSLM keys and NBHOM keys. So I'm going to stick with 
the current situation.

If you meant from a purely documentation point of view: it is absolutely 
essential that the value is a _hash_, otherwise performance will be O\(n\^2\), 
so whilst it may be worth clarifying this, it is essential we call it a 
hashCode().

To elaborate on this in documentation, I've included the following extra comment

{quote}
This data structure essentially only works for keys that are first sorted by 
some hash value (and may then be sorted 
within those hashes arbitrarily), where a 32-bit prefix of the hash we sort by 
is returned by hashCode()
{quote}

 Faster Memtable map
 ---

 Key: CASSANDRA-7282
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7282
 Project: Cassandra
  Issue Type: Improvement
  Components: Core
Reporter: Benedict
Assignee: Benedict
  Labels: performance
 Fix For: 3.0

 Attachments: profile.yaml, reads.svg, run1.svg, writes.svg


 Currently we maintain a ConcurrentSkipListMap of DecoratedKey -> Partition in 
 our memtables. Maintaining this is an O(lg(n)) operation; since the vast 
 majority of users use a hash partitioner, it occurs to me we could maintain a 
 hybrid ordered list / hash map. The list would impose the normal order on the 
 collection, but a hash index would live alongside as part of the same data 
 structure, simply mapping into the list and permitting O(1) lookups and 
 inserts.
 I've chosen to implement this initial version as a linked-list node per item, 
 but we can optimise this in future by storing fatter nodes that permit a 
 cache-line's worth of hashes to be checked at once,  further reducing the 
 constant factor costs for lookups.





[jira] [Created] (CASSANDRA-7933) Update cassandra-stress README

2014-09-14 Thread Benedict (JIRA)
Benedict created CASSANDRA-7933:
---

 Summary: Update cassandra-stress README
 Key: CASSANDRA-7933
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7933
 Project: Cassandra
  Issue Type: Task
Reporter: Benedict
Priority: Minor


There is a README in the tools/stress directory. It is completely out of date.





[jira] [Commented] (CASSANDRA-7916) Stress should collect and cross-cluster GC statistics

2014-09-14 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14133127#comment-14133127
 ] 

Benedict commented on CASSANDRA-7916:
-

Committed, and created a new ticket for sprucing up the README (since the whole 
thing is stale, not just the parts this ticket would elaborate on).

 Stress should collect and cross-cluster GC statistics
 -

 Key: CASSANDRA-7916
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7916
 Project: Cassandra
  Issue Type: Improvement
  Components: Tools
Reporter: Benedict
Assignee: Benedict
Priority: Minor
 Fix For: 2.1.1


 It would be useful to see stress outputs deliver cross-cluster statistics, 
 the most useful being GC data. Some simple changes to GCInspector collect the 
 data, and can deliver to a nodetool request or to stress over JMX.





[jira] [Commented] (CASSANDRA-7932) Corrupt SSTable Cleanup Leak Resources

2014-09-14 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14133155#comment-14133155
 ] 

Benedict commented on CASSANDRA-7932:
-

I was mistaken. Whilst that patch fixes what appears to be a leak we could 
encounter from the same code path, the actual leak was more involved. I've 
patched on top of the leak detection code for trunk; however, if we get a 
thumbs-up we can backport to 2.1.

[Patch|https://github.com/belliottsmith/cassandra/tree/7932-corruptleak]

 Corrupt SSTable Cleanup Leak Resources
 --

 Key: CASSANDRA-7932
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7932
 Project: Cassandra
  Issue Type: Bug
  Components: Core
Reporter: Benedict
Assignee: Benedict
 Fix For: 2.1.1

 Attachments: 7932.txt


 CASSANDRA-7705, during the BlacklistingCompactionsTest, detected leaks. 
 I've tracked this down to DataTracker.removeUnreadableSSTables(), which does 
 not release the tracker's reference to any of the sstables it removes.





[jira] [Updated] (CASSANDRA-7932) Corrupt SSTable Cleanup Leak Resources

2014-09-14 Thread Benedict (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-7932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benedict updated CASSANDRA-7932:

Attachment: (was: 7932.txt)

 Corrupt SSTable Cleanup Leak Resources
 --

 Key: CASSANDRA-7932
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7932
 Project: Cassandra
  Issue Type: Bug
  Components: Core
Reporter: Benedict
Assignee: Benedict
 Fix For: 2.1.1


 CASSANDRA-7705, during the BlacklistingCompactionsTest, detected leaks. 
 I've tracked this down to DataTracker.removeUnreadableSSTables(), which does 
 not release the tracker's reference to any of the sstables it removes.





[jira] [Commented] (CASSANDRA-7282) Faster Memtable map

2014-09-14 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14133164#comment-14133164
 ] 

Benedict commented on CASSANDRA-7282:
-

Uploaded a version with updated comments 
[here|https://github.com/belliottsmith/cassandra/tree/7282-fastmemtablemap]

 Faster Memtable map
 ---

 Key: CASSANDRA-7282
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7282
 Project: Cassandra
  Issue Type: Improvement
  Components: Core
Reporter: Benedict
Assignee: Benedict
  Labels: performance
 Fix For: 3.0

 Attachments: profile.yaml, reads.svg, run1.svg, writes.svg


 Currently we maintain a ConcurrentSkipListMap of DecoratedKey -> Partition in 
 our memtables. Maintaining this is an O(lg(n)) operation; since the vast 
 majority of users use a hash partitioner, it occurs to me we could maintain a 
 hybrid ordered list / hash map. The list would impose the normal order on the 
 collection, but a hash index would live alongside as part of the same data 
 structure, simply mapping into the list and permitting O(1) lookups and 
 inserts.
 I've chosen to implement this initial version as a linked-list node per item, 
 but we can optimise this in future by storing fatter nodes that permit a 
 cache-line's worth of hashes to be checked at once,  further reducing the 
 constant factor costs for lookups.





[jira] [Commented] (CASSANDRA-7926) Stress can OOM on merging of timing samples

2014-09-14 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14133191#comment-14133191
 ] 

Benedict commented on CASSANDRA-7926:
-

Patch available 
[here|https://github.com/belliottsmith/cassandra/tree/7926-stressoom]

I've made a few small changes:
* The number of samples we collect/accumulate at any point is now configurable 
with the \-samples setting
* When merging multiple samples, we no longer merge them all together and 
_then_ downsample, but instead downsample as we merge, ensuring our memory use 
is bounded much lower (see the sketch below)
* Switched to ThreadLocalRandom instead of Random for generating probabilities
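
A sketch of the downsample-as-we-merge idea (illustrative, not the patch 
itself): select the capped number of survivors in one pass over the inputs, so 
temporary space is bounded by the cap rather than by the combined sample size.

{noformat}
// uses java.util.concurrent.ThreadLocalRandom. Selection sampling over the
// concatenation of two samples: each element is kept with probability
// (slots still needed) / (elements not yet seen), which yields a uniform
// sample of size min(cap, a.length + b.length)
static long[] mergeBounded(long[] a, long[] b, int cap)
{
    ThreadLocalRandom rnd = ThreadLocalRandom.current();
    long[] merged = new long[Math.min(cap, a.length + b.length)];
    int total = a.length + b.length, taken = 0, seen = 0;
    for (long[] src : new long[][] { a, b })
        for (long v : src)
        {
            seen++;
            if (taken < merged.length
                && rnd.nextInt(total - seen + 1) < merged.length - taken)
                merged[taken++] = v;
        }
    return merged;
}
{noformat}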



 Stress can OOM on merging of timing samples
 ---

 Key: CASSANDRA-7926
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7926
 Project: Cassandra
  Issue Type: Improvement
  Components: Tools
Reporter: Benedict
Assignee: Benedict
Priority: Minor
 Fix For: 2.1.1


 {noformat}
 Exception in thread "StressMetrics:2" java.lang.OutOfMemoryError: Java heap 
  space
 at java.util.Arrays.copyOf(Arrays.java:2343)
 at 
 org.apache.cassandra.stress.util.SampleOfLongs.merge(SampleOfLongs.java:76)
 at 
 org.apache.cassandra.stress.util.TimingInterval.merge(TimingInterval.java:95)
 at 
 org.apache.cassandra.stress.util.Timing.snapInterval(Timing.java:95)
 at 
 org.apache.cassandra.stress.StressMetrics.update(StressMetrics.java:124)
 at 
 org.apache.cassandra.stress.StressMetrics.access$200(StressMetrics.java:36)
 at 
 org.apache.cassandra.stress.StressMetrics$1.run(StressMetrics.java:72)
 at java.lang.Thread.run(Thread.java:744)
 {noformat}
 This is partially down to recently increasing the per-thread sample size, but 
 also because we allocate temporary space linear in the total sample size 
 across all threads during merge. This can easily be avoided. We should also 
 scale the per-thread sample size based on the total number of threads, so we 
 limit total memory use.





[jira] [Created] (CASSANDRA-7934) Remove FBUtilities.threadLocalRandom

2014-09-14 Thread Benedict (JIRA)
Benedict created CASSANDRA-7934:
---

 Summary: Remove FBUtilities.threadLocalRandom
 Key: CASSANDRA-7934
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7934
 Project: Cassandra
  Issue Type: Improvement
  Components: Core
Reporter: Benedict
Assignee: Benedict
Priority: Minor
 Fix For: 2.1.1


We should use ThreadLocalRandom.current() instead, as it is not only more 
standard but also considerably faster, e.g. at call sites:
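
{noformat}
// before (assumed shape of the current helper):
//   FBUtilities.threadLocalRandom().nextInt(n);
// after:
int r = java.util.concurrent.ThreadLocalRandom.current().nextInt(n);
{noformat}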





[jira] [Updated] (CASSANDRA-7738) Permit CL overuse to be explicitly bounded

2014-09-14 Thread Benedict (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-7738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benedict updated CASSANDRA-7738:

Assignee: (was: Benedict)

 Permit CL overuse to be explicitly bounded
 --

 Key: CASSANDRA-7738
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7738
 Project: Cassandra
  Issue Type: Improvement
  Components: Core
Reporter: Benedict
Priority: Minor
 Fix For: 3.0


 As mentioned in CASSANDRA-7554, we do not currently offer any way to 
 explicitly bound CL growth, which can be problematic in some scenarios (e.g. 
 EC2 where the system drive is quite small). We should offer a configurable 
 amount of headroom, beyond which we stop accepting writes until the backlog 
 clears.





[jira] [Resolved] (CASSANDRA-7915) Waiting for sync on the commit log could happen after writing to memtable

2014-09-14 Thread Benedict (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-7915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benedict resolved CASSANDRA-7915.
-
Resolution: Not a Problem

We don't actually do very much work at all besides the BTree merge on apply, 
which we cannot help but delay until the last moment without risking the work 
being wasted. So there's very little work to save whilst maintaining the 
current behaviour wrt correctness.

 Waiting for sync on the commit log could happen after writing to memtable
 -

 Key: CASSANDRA-7915
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7915
 Project: Cassandra
  Issue Type: Improvement
  Components: Core
Reporter: Branimir Lambov
Priority: Minor

 Currently the sync wait is part of CommitLog.add, which is executed in whole 
 before any memtable write. The time for executing the latter is thus added on 
 top of the time for file sync, which seems unnecessary.
 Moving the wait to a call at the end of Keyspace.apply should hide the 
 memtable write time and may improve performance, especially for the batch 
 sync strategy.





[jira] [Assigned] (CASSANDRA-7705) Safer Resource Management

2014-09-14 Thread Benedict (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-7705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benedict reassigned CASSANDRA-7705:
---

Assignee: Benedict

 Safer Resource Management
 -

 Key: CASSANDRA-7705
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7705
 Project: Cassandra
  Issue Type: Improvement
  Components: Core
Reporter: Benedict
Assignee: Benedict
 Fix For: 3.0


 We've had a spate of bugs recently with bad reference counting. These can 
 have potentially dire consequences, generally either randomly deleting data 
 or giving us infinite loops. 
 Since in 2.1 we only reference count resources that are relatively expensive 
 and infrequently managed (or in places where this safety is probably not as 
 necessary, e.g. SerializingCache), we could without any negative consequences 
 (and only slight code complexity) introduce a safer resource management 
 scheme for these more expensive/infrequent actions.
 Basically, I propose that when we want to acquire a resource we allocate an 
 object that manages the reference. This can only be released once; if it is 
 released twice, we fail immediately at the second release, reporting where the 
 bug is (rather than letting things continue fine until the next correct 
 release corrupts the count). The reference counter remains the same, but we 
 gain the guarantee that the count itself is never badly maintained, although 
 code using it could still mistakenly release its own handle early (typically 
 this is only an issue when cleaning up after a failure, in which case under 
 the new scheme it would be an innocuous error).
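
 A minimal sketch of the scheme described above, with illustrative names; the 
 point is only that the second release of the same handle fails loudly at the 
 buggy call site:

 {code}
 import java.util.concurrent.atomic.AtomicBoolean;
 import java.util.concurrent.atomic.AtomicInteger;

 final class SharedResource
 {
     private final AtomicInteger count = new AtomicInteger(1);

     Ref ref() { count.incrementAndGet(); return new Ref(); }

     final class Ref
     {
         private final AtomicBoolean released = new AtomicBoolean(false);

         void release()
         {
             // A double release fails here, where the bug is, rather than
             // corrupting the count and blowing up at some later correct release.
             if (!released.compareAndSet(false, true))
                 throw new IllegalStateException("reference released twice");
             if (count.decrementAndGet() == 0)
                 cleanup();
         }
     }

     private void cleanup() { /* free buffers, delete files, ... */ }
 }
 {code}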





[jira] [Updated] (CASSANDRA-7724) Native-Transport threads get stuck in StorageProxy.preparePaxos with no one making progress

2014-09-15 Thread Benedict (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-7724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benedict updated CASSANDRA-7724:

Attachment: aggregateddump.txt

Attached is your thread dump reformatted to be easier to digest. Looking at it, 
there appears to be an incoming read on the IncomingTcpConnection, which is 
quite likely one of the prepare requests or responses being processed by the 
node, although it's hard to say for certain. There are multiple threads 
involved: the native transport requests start the work, the mutation stage 
processes the paxos messages on receipt, and the incoming/outgoing tcp 
connections deliver those messages. It's still a bit odd that these threads 
never appear live in any of your dumps, though, and I would be interested in 
getting a few more to double-check.

Either way, this raises the sensible prospect of optimising CAS when RF=1. 

 Native-Transport threads get stuck in StorageProxy.preparePaxos with no one 
 making progress
 ---

 Key: CASSANDRA-7724
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7724
 Project: Cassandra
  Issue Type: Bug
  Components: Core
 Environment: Linux 3.13.11-4 #4 SMP PREEMPT x86_64 Intel(R) Core(TM) 
 i7 CPU 950 @ 3.07GHz GenuineIntel 
 java version 1.8.0_05
 Java(TM) SE Runtime Environment (build 1.8.0_05-b13)
 Java HotSpot(TM) 64-Bit Server VM (build 25.5-b02, mixed mode)
 cassandra 2.0.9
Reporter: Anton Lebedevich
 Attachments: aggregateddump.txt, cassandra.threads2


 We've got a lot of write timeouts (cas) when running 
 INSERT INTO cas_demo(pri_id, sec_id, flag, something) VALUES(?, ?, ?, ?) IF 
 NOT EXISTS
  from 16 connections in parallel using the same pri_id and different sec_id.
 Doing the same from 4 connections in parallel works ok.
 All configuration values are at their default values.
 CREATE TABLE cas_demo (
   pri_id varchar,
   sec_id varchar,
   flag boolean,
   something set<varchar>,
   PRIMARY KEY (pri_id, sec_id)
 );
 CREATE INDEX cas_demo_flag ON cas_demo(flag);
 Full thread dump is attached. 





[jira] [Updated] (CASSANDRA-7937) Apply backpressure gently when overloaded with writes

2014-09-15 Thread Benedict (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-7937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benedict updated CASSANDRA-7937:

Labels: performance  (was: )

 Apply backpressure gently when overloaded with writes
 -

 Key: CASSANDRA-7937
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7937
 Project: Cassandra
  Issue Type: Bug
  Components: Core
 Environment: Cassandra 2.0
Reporter: Piotr Kołaczkowski
  Labels: performance

 When writing huge amounts of data into a C* cluster from analytic tools like 
 Hadoop or Apache Spark, we can see that C* often can't keep up with the load. 
 This is because analytic tools typically write data as fast as they can in 
 parallel, from many nodes, and they are not artificially rate-limited, so C* 
 is the bottleneck here. Also, increasing the number of nodes doesn't really 
 help, because in a collocated setup this also increases the number of 
 Hadoop/Spark nodes (writers), and although the achievable write performance is 
 higher, the problem still remains.
 We observe the following behavior:
 1. data is ingested at an extremely fast pace into memtables and the flush 
 queue fills up
 2. the available memory limit for memtables is reached and writes are no 
 longer accepted
 3. the application gets hit by write timeouts and retries repeatedly, in 
 vain 
 4. after several failed attempts to write, the job gets aborted 
 Desired behaviour:
 1. data is ingested at an extremely fast pace into memtables and the flush 
 queue fills up
 2. after exceeding some memtable fill threshold, C* applies rate limiting 
 to writes - the more the buffers fill up, the fewer writes/s are 
 accepted, but writes still complete within the write timeout.
 3. thanks to the slowed-down data ingestion, a flush can now happen before 
 all the memory gets used
 Of course the details of how rate limiting could be done are up for discussion.
 It may also be worth considering putting such logic into the driver, not the 
 C* core, but then C* needs to expose at least the following information to the 
 driver, so we could calculate the desired maximum data rate:
 1. the current amount of memory available for writes before they would 
 completely block
 2. the total amount of data queued to be flushed, and flush progress (the 
 amount of data left to flush for the memtable currently being flushed)
 3. the average flush write speed





[jira] [Commented] (CASSANDRA-7937) Apply backpressure gently when overloaded with writes

2014-09-15 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14133862#comment-14133862
 ] 

Benedict commented on CASSANDRA-7937:
-

This should certainly be dealt with by the cluster. We cannot rely on 
well-behaved clients, and clients cannot easily calculate a safe data rate 
across the cluster, so any client change would at best help direct writes only, 
which with RF>1 is not much help. Nor could it be as responsive. 

My preferred solution to this is CASSANDRA-6812, which should keep the server 
responding to writes within the timeout window even as it blocks for lengthy 
flushes, but during these windows writes would be acked much more slowly, at a 
steady drip. This solution won't make it into 2.0 or 2.1, and possibly not even 
3.0, though.
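
For illustration only, the "steady drip" could look something like the sketch 
below: scale the accepted write rate by the remaining memtable headroom instead 
of flipping from full speed to a hard stall. Guava's RateLimiter stands in for 
whatever mechanism CASSANDRA-6812 actually uses, and all names and the shaping 
function are assumptions:

{code}
import com.google.common.util.concurrent.RateLimiter;

final class GentleBackpressure
{
    private final double maxWritesPerSecond;
    private final RateLimiter limiter;

    GentleBackpressure(double maxWritesPerSecond)
    {
        this.maxWritesPerSecond = maxWritesPerSecond;
        this.limiter = RateLimiter.create(maxWritesPerSecond);
    }

    // fillRatio in [0, 1]: fraction of the memtable memory budget in use.
    void onMemoryPressure(double fillRatio)
    {
        double headroom = Math.max(0.01, 1.0 - fillRatio); // never ack at zero rate
        limiter.setRate(maxWritesPerSecond * headroom);
    }

    void beforeWrite()
    {
        limiter.acquire(); // brief, bounded stalls rather than timeouts
    }
}
{code}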

 Apply backpressure gently when overloaded with writes
 -

 Key: CASSANDRA-7937
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7937
 Project: Cassandra
  Issue Type: Bug
  Components: Core
 Environment: Cassandra 2.0
Reporter: Piotr Kołaczkowski
  Labels: performance

 When writing huge amounts of data into C* cluster from analytic tools like 
 Hadoop or Apache Spark, we can see that often C* can't keep up with the load. 
 This is because analytic tools typically write data as fast as they can in 
 parallel, from many nodes and they are not artificially rate-limited, so C* 
 is the bottleneck here. Also, increasing the number of nodes doesn't really 
 help, because in a collocated setup this also increases number of 
 Hadoop/Spark nodes (writers) and although possible write performance is 
 higher, the problem still remains.
 We observe the following behavior:
 1. data is ingested at an extreme fast pace into memtables and flush queue 
 fills up
 2. the available memory limit for memtables is reached and writes are no 
 longer accepted
 3. the application gets hit by write timeout, and retries repeatedly, in 
 vain 
 4. after several failed attempts to write, the job gets aborted 
 Desired behaviour:
 1. data is ingested at an extreme fast pace into memtables and flush queue 
 fills up
 2. after exceeding some memtable fill threshold, C* applies rate limiting 
 to writes - the more the buffers are filled-up, the less writes/s are 
 accepted, however writes still occur within the write timeout.
 3. thanks to slowed down data ingestion, now flush can happen before all the 
 memory gets used
 Of course the details how rate limiting could be done are up for a discussion.
 It may be also worth considering putting such logic into the driver, not C* 
 core, but then C* needs to expose at least the following information to the 
 driver, so we could calculate the desired maximum data rate:
 1. current amount of memory available for writes before they would completely 
 block
 2. total amount of data queued to be flushed and flush progress (amount of 
 data to flush remaining for the memtable currently being flushed)
 3. average flush write speed





[jira] [Commented] (CASSANDRA-3017) add a Message size limit

2014-09-15 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-3017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14133863#comment-14133863
 ] 

Benedict commented on CASSANDRA-3017:
-

This is definitely a good idea. At the same time I think it might be worth 
considering introducing an upper limit on either the total size of requests 
we've currently got in-flight for MessagingService, or the total number, or 
possibly both. Once the threshold is exceeded we stop consuming input from all 
IncomingTcpConnection(s). This is not dramatically different to our imposition 
of a max RPC count, but it stops a single server being overloaded through a 
hotspot of queries driven by non-token-aware clients (and also protects against 
a slight variant of the malicious oversized-payload attack).
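
A sketch of the in-flight budget idea, assuming byte-based accounting; while 
the semaphore is exhausted the connection's read loop simply stops consuming 
its socket, so TCP flow control pushes back on the sender. Names are 
illustrative:

{code}
import java.util.concurrent.Semaphore;

final class InFlightBudget
{
    private final Semaphore bytes;

    InFlightBudget(int maxInFlightBytes) { this.bytes = new Semaphore(maxInFlightBytes); }

    // Call from the IncomingTcpConnection read loop before deserializing a message body.
    void beforeDeserialize(int messageSize) throws InterruptedException
    {
        bytes.acquire(messageSize);
    }

    // Call once the message has been fully processed (or dropped).
    void afterProcess(int messageSize)
    {
        bytes.release(messageSize);
    }
}
{code}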

 add a Message size limit
 

 Key: CASSANDRA-3017
 URL: https://issues.apache.org/jira/browse/CASSANDRA-3017
 Project: Cassandra
  Issue Type: Improvement
  Components: Core
Reporter: Jonathan Ellis
Priority: Minor
  Labels: lhf
 Attachments: 
 0001-use-the-thrift-max-message-size-for-inter-node-messa.patch, 
 trunk-3017.txt


 We protect the server from allocating huge buffers for malformed message with 
 the Thrift frame size (CASSANDRA-475).  But we don't have similar protection 
 for the inter-node Message objects.
 Adding this would be good to deal with malicious adversaries as well as a 
 malfunctioning cluster participant.





[jira] [Updated] (CASSANDRA-7402) limit the on heap memory available to requests

2014-09-15 Thread Benedict (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-7402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benedict updated CASSANDRA-7402:

Labels: ops performance stability  (was: ops)

 limit the on heap memory available to requests
 --

 Key: CASSANDRA-7402
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7402
 Project: Cassandra
  Issue Type: Improvement
Reporter: T Jake Luciani
  Labels: ops, performance, stability
 Fix For: 3.0


 When running a production cluster, one common operational issue is quantifying 
 GC pauses caused by ongoing requests.
 Since different queries return varying amounts of data, you can easily get 
 yourself into a situation where you stop the world because of a couple of bad 
 actors in the system.  Or, more likely, the aggregate garbage generated on a 
 single node across all in-flight requests causes a GC.
 We should be able to set a limit on the max heap we can allocate to all 
 outstanding requests, and track the garbage per request, to stop this from 
 happening.  It should increase a single node's availability substantially.
 In the yaml this would be
 {code}
 total_request_memory_space_mb: 400
 {code}
 It would also be nice to have a log of the queries which generate the most 
 garbage, or a histogram, so operators can track this.





[jira] [Commented] (CASSANDRA-7916) Stress should collect and cross-cluster GC statistics

2014-09-15 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14134042#comment-14134042
 ] 

Benedict commented on CASSANDRA-7916:
-

ninja fixed

 Stress should collect and cross-cluster GC statistics
 -

 Key: CASSANDRA-7916
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7916
 Project: Cassandra
  Issue Type: Improvement
  Components: Tools
Reporter: Benedict
Assignee: Benedict
Priority: Minor
 Fix For: 2.1.1


 It would be useful to see stress outputs deliver cross-cluster statistics, 
 the most useful being GC data. Some simple changes to GCInspector collect the 
 data, and can deliver it to a nodetool request or to stress over JMX.





[jira] [Commented] (CASSANDRA-7716) cassandra-stress: provide better error messages

2014-09-15 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14134049#comment-14134049
 ] 

Benedict commented on CASSANDRA-7716:
-

I think the old behaviour of StressProfile.select() makes more sense...? If 
you've specified a property key, but no value, that should probably fail rather 
than falling back to the default, don't you think?

Otherwise LGTM, and no really strong feeling on that issue.

 cassandra-stress: provide better error messages
 ---

 Key: CASSANDRA-7716
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7716
 Project: Cassandra
  Issue Type: Improvement
Reporter: Robert Stupp
Assignee: T Jake Luciani
Priority: Trivial
 Fix For: 2.1.1

 Attachments: 7166v2.txt, 7716.txt


 Just tried new stress tool.
 It would be great if the stress tool gives better error messages by telling 
 the user what option or config parameter/value caused an error.
 YAML parse errors are meaningful (gives code snippets etc).
 Examples are:
 {noformat}
 WARN  16:59:39 Setting caching options with deprecated syntax.
 Exception in thread "main" java.lang.NullPointerException
   at java.util.regex.Matcher.getTextLength(Matcher.java:1234)
   at java.util.regex.Matcher.reset(Matcher.java:308)
   at java.util.regex.Matcher.<init>(Matcher.java:228)
   at java.util.regex.Pattern.matcher(Pattern.java:1088)
   at org.apache.cassandra.stress.settings.OptionDistribution.get(OptionDistribution.java:67)
   at org.apache.cassandra.stress.StressProfile.init(StressProfile.java:151)
   at org.apache.cassandra.stress.StressProfile.load(StressProfile.java:482)
   at org.apache.cassandra.stress.settings.SettingsCommandUser.<init>(SettingsCommandUser.java:53)
   at org.apache.cassandra.stress.settings.SettingsCommandUser.build(SettingsCommandUser.java:114)
   at org.apache.cassandra.stress.settings.SettingsCommand.get(SettingsCommand.java:134)
   at org.apache.cassandra.stress.settings.StressSettings.get(StressSettings.java:218)
   at org.apache.cassandra.stress.settings.StressSettings.parse(StressSettings.java:206)
   at org.apache.cassandra.stress.Stress.main(Stress.java:58)
 {noformat}
 When table-definition is wrong:
 {noformat}
 Exception in thread "main" java.lang.RuntimeException: org.apache.cassandra.exceptions.SyntaxException: line 6:14 mismatched input '(' expecting ')'
   at org.apache.cassandra.config.CFMetaData.compile(CFMetaData.java:550)
   at org.apache.cassandra.stress.StressProfile.init(StressProfile.java:134)
   at org.apache.cassandra.stress.StressProfile.load(StressProfile.java:482)
   at org.apache.cassandra.stress.settings.SettingsCommandUser.<init>(SettingsCommandUser.java:53)
   at org.apache.cassandra.stress.settings.SettingsCommandUser.build(SettingsCommandUser.java:114)
   at org.apache.cassandra.stress.settings.SettingsCommand.get(SettingsCommand.java:134)
   at org.apache.cassandra.stress.settings.StressSettings.get(StressSettings.java:218)
   at org.apache.cassandra.stress.settings.StressSettings.parse(StressSettings.java:206)
   at org.apache.cassandra.stress.Stress.main(Stress.java:58)
 Caused by: org.apache.cassandra.exceptions.SyntaxException: line 6:14 mismatched input '(' expecting ')'
   at org.apache.cassandra.cql3.CqlParser.throwLastRecognitionError(CqlParser.java:273)
   at org.apache.cassandra.cql3.QueryProcessor.parseStatement(QueryProcessor.java:456)
   at org.apache.cassandra.config.CFMetaData.compile(CFMetaData.java:541)
   ... 8 more
 {noformat}





[jira] [Commented] (CASSANDRA-7716) cassandra-stress: provide better error messages

2014-09-15 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14134055#comment-14134055
 ] 

Benedict commented on CASSANDRA-7716:
-

I figure default is only if it's not provided; if it's provided but empty they 
perhaps messed up. But either is totally reasonable, so not going to quibble.

 cassandra-stress: provide better error messages
 ---

 Key: CASSANDRA-7716
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7716
 Project: Cassandra
  Issue Type: Improvement
Reporter: Robert Stupp
Assignee: T Jake Luciani
Priority: Trivial
 Fix For: 2.1.1

 Attachments: 7166v2.txt, 7716.txt


 Just tried new stress tool.
 It would be great if the stress tool gives better error messages by telling 
 the user what option or config parameter/value caused an error.
 YAML parse errors are meaningful (gives code snippets etc).
 Examples are:
 {noformat}
 WARN  16:59:39 Setting caching options with deprecated syntax.
 Exception in thread "main" java.lang.NullPointerException
   at java.util.regex.Matcher.getTextLength(Matcher.java:1234)
   at java.util.regex.Matcher.reset(Matcher.java:308)
   at java.util.regex.Matcher.<init>(Matcher.java:228)
   at java.util.regex.Pattern.matcher(Pattern.java:1088)
   at org.apache.cassandra.stress.settings.OptionDistribution.get(OptionDistribution.java:67)
   at org.apache.cassandra.stress.StressProfile.init(StressProfile.java:151)
   at org.apache.cassandra.stress.StressProfile.load(StressProfile.java:482)
   at org.apache.cassandra.stress.settings.SettingsCommandUser.<init>(SettingsCommandUser.java:53)
   at org.apache.cassandra.stress.settings.SettingsCommandUser.build(SettingsCommandUser.java:114)
   at org.apache.cassandra.stress.settings.SettingsCommand.get(SettingsCommand.java:134)
   at org.apache.cassandra.stress.settings.StressSettings.get(StressSettings.java:218)
   at org.apache.cassandra.stress.settings.StressSettings.parse(StressSettings.java:206)
   at org.apache.cassandra.stress.Stress.main(Stress.java:58)
 {noformat}
 When table-definition is wrong:
 {noformat}
 Exception in thread "main" java.lang.RuntimeException: org.apache.cassandra.exceptions.SyntaxException: line 6:14 mismatched input '(' expecting ')'
   at org.apache.cassandra.config.CFMetaData.compile(CFMetaData.java:550)
   at org.apache.cassandra.stress.StressProfile.init(StressProfile.java:134)
   at org.apache.cassandra.stress.StressProfile.load(StressProfile.java:482)
   at org.apache.cassandra.stress.settings.SettingsCommandUser.<init>(SettingsCommandUser.java:53)
   at org.apache.cassandra.stress.settings.SettingsCommandUser.build(SettingsCommandUser.java:114)
   at org.apache.cassandra.stress.settings.SettingsCommand.get(SettingsCommand.java:134)
   at org.apache.cassandra.stress.settings.StressSettings.get(StressSettings.java:218)
   at org.apache.cassandra.stress.settings.StressSettings.parse(StressSettings.java:206)
   at org.apache.cassandra.stress.Stress.main(Stress.java:58)
 Caused by: org.apache.cassandra.exceptions.SyntaxException: line 6:14 mismatched input '(' expecting ')'
   at org.apache.cassandra.cql3.CqlParser.throwLastRecognitionError(CqlParser.java:273)
   at org.apache.cassandra.cql3.QueryProcessor.parseStatement(QueryProcessor.java:456)
   at org.apache.cassandra.config.CFMetaData.compile(CFMetaData.java:541)
   ... 8 more
 {noformat}





[jira] [Commented] (CASSANDRA-7402) Add metrics to track memory used by client requests

2014-09-15 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14134071#comment-14134071
 ] 

Benedict commented on CASSANDRA-7402:
-

I'm not convinced that tracking per-client/per-query statistics is likely to be 
very viable. Once queries cross the MessagingService boundary the information 
isn't available to us, and making it available could be costly. We could 
probably serialize the prepared statement id over the wire, and wire that up as 
the data is requested in nodetool, say, by attempting to locate a server with 
the statement. I think tracking and reporting this data in this manner should 
be a separate ticket to constraining it, however, which is a much more 
concretely beneficial and achievable goal.
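
If the statement id were shipped along, the per-node accounting itself is 
cheap; a sketch, with the id type and the hook points as assumptions for 
illustration:

{code}
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

final class StatementMemoryTracker
{
    // Keyed by prepared statement id (string form here purely for illustration).
    private final ConcurrentHashMap<String, LongAdder> bytesByStatement = new ConcurrentHashMap<>();

    void record(String statementId, long bytes)
    {
        bytesByStatement.computeIfAbsent(statementId, id -> new LongAdder()).add(bytes);
    }

    long bytesFor(String statementId)
    {
        LongAdder total = bytesByStatement.get(statementId);
        return total == null ? 0 : total.sum();
    }
}
{code}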

 Add metrics to track memory used by client requests
 ---

 Key: CASSANDRA-7402
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7402
 Project: Cassandra
  Issue Type: Improvement
Reporter: T Jake Luciani
  Labels: ops, performance, stability
 Fix For: 3.0


 When running a production cluster, one common operational issue is quantifying 
 GC pauses caused by ongoing requests.
 Since different queries return varying amounts of data, you can easily get 
 yourself into a situation where you stop the world because of a couple of bad 
 actors in the system.  Or, more likely, the aggregate garbage generated on a 
 single node across all in-flight requests causes a GC.
 We should be able to set a limit on the max heap we can allocate to all 
 outstanding requests, and track the garbage per request, to stop this from 
 happening.  It should increase a single node's availability substantially.
 In the yaml this would be
 {code}
 total_request_memory_space_mb: 400
 {code}
 It would also be nice to have a log of the queries which generate the most 
 garbage, or a histogram, so operators can track this.





[jira] [Commented] (CASSANDRA-7402) Add metrics to track memory used by client requests

2014-09-15 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14134076#comment-14134076
 ] 

Benedict commented on CASSANDRA-7402:
-

It's the queries we cannot easily track, aggregated or not. At least, not those 
we received over MessagingService, or not easily and cheaply.

 Add metrics to track memory used by client requests
 ---

 Key: CASSANDRA-7402
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7402
 Project: Cassandra
  Issue Type: Improvement
Reporter: T Jake Luciani
  Labels: ops, performance, stability
 Fix For: 3.0


 When running a production cluster, one common operational issue is quantifying 
 GC pauses caused by ongoing requests.
 Since different queries return varying amounts of data, you can easily get 
 yourself into a situation where you stop the world because of a couple of bad 
 actors in the system.  Or, more likely, the aggregate garbage generated on a 
 single node across all in-flight requests causes a GC.
 We should be able to set a limit on the max heap we can allocate to all 
 outstanding requests, and track the garbage per request, to stop this from 
 happening.  It should increase a single node's availability substantially.
 In the yaml this would be
 {code}
 total_request_memory_space_mb: 400
 {code}
 It would also be nice to have a log of the queries which generate the most 
 garbage, or a histogram, so operators can track this.





[jira] [Commented] (CASSANDRA-7402) Add metrics to track memory used by client requests

2014-09-15 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14134093#comment-14134093
 ] 

Benedict commented on CASSANDRA-7402:
-

Makes sense. If we're going to be grabbing the data cross-cluster (necessary 
for delivering per-node stats), then tracking based on statement id would also 
be sufficient, since we could populate the map in nodetool from the whole 
cluster, so users can drill down into hotspot nodes.

 Add metrics to track memory used by client requests
 ---

 Key: CASSANDRA-7402
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7402
 Project: Cassandra
  Issue Type: Improvement
Reporter: T Jake Luciani
  Labels: ops, performance, stability
 Fix For: 3.0


 When running a production cluster, one common operational issue is quantifying 
 GC pauses caused by ongoing requests.
 Since different queries return varying amounts of data, you can easily get 
 yourself into a situation where you stop the world because of a couple of bad 
 actors in the system.  Or, more likely, the aggregate garbage generated on a 
 single node across all in-flight requests causes a GC.
 We should be able to set a limit on the max heap we can allocate to all 
 outstanding requests, and track the garbage per request, to stop this from 
 happening.  It should increase a single node's availability substantially.
 In the yaml this would be
 {code}
 total_request_memory_space_mb: 400
 {code}
 It would also be nice to have a log of the queries which generate the most 
 garbage, or a histogram, so operators can track this.





[jira] [Commented] (CASSANDRA-7938) Releases prior to 2.0 gratuitously invalidate buffer cache

2014-09-15 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14134334#comment-14134334
 ] 

Benedict commented on CASSANDRA-7938:
-

Also file streaming

 Releases prior to 2.0 gratuitously invalidate buffer cache
 --

 Key: CASSANDRA-7938
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7938
 Project: Cassandra
  Issue Type: Bug
  Components: Core
Reporter: Matt Stump
 Fix For: 1.2.19


 RandomAccessReader gratuitously invalidates the buffer cache in releases 
 prior to 2.0. 
 Additionally, Linux 3.X kernels spend 30% of CPU time in bookkeeping for the 
 invalidated pages, as captured by CPU flame graphs.
 fadvise DONT_NEED should never be called for files other than the commit log 
 segments. 
 https://github.com/apache/cassandra/blob/cassandra-2.1/src/java/org/apache/cassandra/io/util/RandomAccessReader.java#L168





[jira] [Commented] (CASSANDRA-7928) Digest queries do not require alder32 checks

2014-09-15 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14134612#comment-14134612
 ] 

Benedict commented on CASSANDRA-7928:
-

Regrettably this is very plausible, and adds credence to CASSANDRA-7130, which 
we should consider reopening. This ticket is also a sensible idea to help 
mitigate the issue. 

I knocked up a quick benchmark; the results show lz4 being consistently at 
least twice as fast. It's actually quite easy to explain: if the data is 
compressible, there is actually less data to operate over; if it is not easily 
compressed (say, it is highly random), lz4 degrades to a simple copy to avoid 
wasting work (as demonstrated in the benchmark - it's 5 times faster over 
completely random data than over partially random data).

{noformat}
Benchmark            (duplicateLookback)  (pageSize)  (randomRatio)  (randomRunLength)  (uniquePages)   Mode  Samples    Score  Score error   Units
Compression.adler32               4..128       65536              0              4..16           8192  thrpt        5   16.476        1.954  ops/ms
Compression.adler32               4..128       65536              0           128..512           8192  thrpt        5   16.720        0.230  ops/ms
Compression.adler32               4..128       65536            0.1              4..16           8192  thrpt        5   16.269        2.118  ops/ms
Compression.adler32               4..128       65536            0.1           128..512           8192  thrpt        5   16.665        0.246  ops/ms
Compression.adler32               4..128       65536            1.0              4..16           8192  thrpt        5   16.653        0.147  ops/ms
Compression.adler32               4..128       65536            1.0           128..512           8192  thrpt        5   16.686        0.214  ops/ms
Compression.lz4                   4..128       65536              0              4..16           8192  thrpt        5   28.275        0.265  ops/ms
Compression.lz4                   4..128       65536              0           128..512           8192  thrpt        5  232.602       48.279  ops/ms
Compression.lz4                   4..128       65536            0.1              4..16           8192  thrpt        5   34.081        0.337  ops/ms
Compression.lz4                   4..128       65536            0.1           128..512           8192  thrpt        5  130.857       18.157  ops/ms
Compression.lz4                   4..128       65536            1.0              4..16           8192  thrpt        5  187.992        9.190  ops/ms
Compression.lz4                   4..128       65536            1.0           128..512           8192  thrpt        5  186.054        2.267  ops/ms
{noformat}
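
The benchmark itself is linked in the next comment; for a feel of what is 
being compared, a stripped-down version of the measured work might look like 
this (a plain loop over JDK Adler32 and lz4-java, not JMH, so not suitable for 
real measurement):

{code}
import java.util.Random;
import java.util.zip.Adler32;
import net.jpountz.lz4.LZ4Compressor;
import net.jpountz.lz4.LZ4Factory;

public final class ChecksumVsLz4
{
    public static void main(String[] args)
    {
        byte[] page = new byte[65536];
        new Random(42).nextBytes(page); // fully random: lz4 degrades to a near-copy

        Adler32 adler = new Adler32();
        adler.update(page, 0, page.length);
        System.out.println("adler32 = " + adler.getValue());

        LZ4Compressor lz4 = LZ4Factory.fastestInstance().fastCompressor();
        System.out.println("lz4 compressed to " + lz4.compress(page).length + " bytes");
    }
}
{code}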


 Digest queries do not require alder32 checks
 

 Key: CASSANDRA-7928
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7928
 Project: Cassandra
  Issue Type: Improvement
Reporter: sankalp kohli
Priority: Minor

  While reading data from sstables, C* does Adler32 checks for any data being 
 read. We have seen that this causes higher CPU usage while doing kernel 
 profiling. These checks might not be useful for digest queries, as they will 
 have a different digest in case of corruption. 
  





[jira] [Commented] (CASSANDRA-7928) Digest queries do not require alder32 checks

2014-09-15 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14134625#comment-14134625
 ] 

Benedict commented on CASSANDRA-7928:
-

For reference, I have uploaded the benchmark 
[here|https://github.com/belliottsmith/bench]

 Digest queries do not require alder32 checks
 

 Key: CASSANDRA-7928
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7928
 Project: Cassandra
  Issue Type: Improvement
Reporter: sankalp kohli
Priority: Minor

  While reading data from sstables, C* does Adler32 checks for any data being 
 read. We have seen that this causes higher CPU usage while doing kernel 
 profiling. These checks might not be useful for digest queries, as they will 
 have a different digest in case of corruption. 
  





[jira] [Reopened] (CASSANDRA-7130) Make sstable checksum type configurable and optional

2014-09-15 Thread Benedict (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-7130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benedict reopened CASSANDRA-7130:
-
  Assignee: (was: Benedict)

As discussed on CASSANDRA-7928, this is probably worth revisiting, especially 
as it isn't too difficult. Probably worth delaying until next-gen storage is 
out the gate though.

 Make sstable checksum type configurable and optional
 

 Key: CASSANDRA-7130
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7130
 Project: Cassandra
  Issue Type: Improvement
  Components: Core
Reporter: Benedict
Priority: Minor
  Labels: performance
 Fix For: 3.0


 A lot of our users are becoming bottlenecked on CPU rather than IO, and 
 whilst Adler32 is faster than CRC, it isn't anything like as fast as xxhash 
 (used by LZ4), which can push Gb/s. I propose making the checksum type 
 configurable so that users who want speed can shift to xxhash, and those who 
 want security can use Adler or CRC.
 It's worth noting that at some point in the future (JDK8?) optimised 
 implementations using the latest Intel CRC instructions will be added, though 
 it's not clear from the mailing list discussion if/when that will materialise:
 http://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2013-May/010775.html
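
 A sketch of what a configurable checksum might look like at the call site, 
 using the xxhash implementation that ships with the lz4-java library; the 
 wiring and the seed are assumptions, not a committed design:

 {code}
 import java.util.zip.Adler32;
 import net.jpountz.xxhash.XXHash32;
 import net.jpountz.xxhash.XXHashFactory;

 public final class ChecksumChoice
 {
     private static final XXHash32 XXHASH = XXHashFactory.fastestInstance().hash32();

     public static int checksum(byte[] buffer, boolean fast)
     {
         if (fast)
             return XXHASH.hash(buffer, 0, buffer.length, 0); // seed 0, illustrative

         Adler32 adler = new Adler32();
         adler.update(buffer, 0, buffer.length);
         return (int) adler.getValue();
     }
 }
 {code}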





[jira] [Commented] (CASSANDRA-7546) AtomicSortedColumns.addAllWithSizeDelta has a spin loop that allocates memory

2014-09-15 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14134674#comment-14134674
 ] 

Benedict commented on CASSANDRA-7546:
-

Hi Graham,

I must admit I'm a bit confused, and it's partially self-inflicted. In 2.1.1 we 
have changed stress again from what we released in 2.1.0, and I can't tell 
which version you're referring to, though it seems 2.1.1. Neither version has a 
'visits' property in the yaml, but 2.1.1 supports -insert visits= revisit=, 
which are certainly functions worth exploring, and I recommend you use 2.1.1 
for stress functionality either way. 

As far as using these functions are concerned, 'visits' splits a wide row up 
into multiple inserts; if a visits value of 10 is produced, and there are on 
average 100 rows generated for the partition, approximately 10 rows will be 
inserted, then the state of the partition will be stashed away and the next 
insert that operates on that partition will pick up where the previous one left 
off. Which partition is performed next is decided by the 'revisit' 
distribution, which selects from the stash of partially completed inserts, with 
a value of 1 selecting the most recently stashed (the max value of this 
distribution defines the total number of partitions to stash); if it ever 
selects outside of the current stash a new partition is generated instead. 

So the value for 'visits' is related to the number of unique clustering values 
you generate for each partition, whereas the value for 'revisit' is determined 
by how diverse the data you operate over in a given time window should be.  

Separately, it's worth mentioning that offheap_objects is likely a better 
choice than offheap_buffers, since it is considerably more memory dense.

 AtomicSortedColumns.addAllWithSizeDelta has a spin loop that allocates memory
 -

 Key: CASSANDRA-7546
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7546
 Project: Cassandra
  Issue Type: Bug
  Components: Core
Reporter: graham sanderson
Assignee: graham sanderson
 Fix For: 2.1.1

 Attachments: 7546.20.txt, 7546.20_2.txt, 7546.20_3.txt, 
 7546.20_4.txt, 7546.20_5.txt, 7546.20_6.txt, 7546.20_7.txt, 7546.20_7b.txt, 
 7546.20_alt.txt, 7546.20_async.txt, 7546.21_v1.txt, hint_spikes.png, 
 suggestion1.txt, suggestion1_21.txt, young_gen_gc.png


 In order to preserve atomicity, this code attempts to read, clone/update, 
 then CAS the state of the partition.
 Under heavy contention for updating a single partition this can cause some 
 fairly staggering memory growth (the more cores on your machine the worst it 
 gets).
 Whilst many usage patterns don't do highly concurrent updates to the same 
 partition, hinting today, does, and in this case wild (order(s) of magnitude 
 more than expected) memory allocation rates can be seen (especially when the 
 updates being hinted are small updates to different partitions which can 
 happen very fast on their own) - see CASSANDRA-7545
 It would be best to eliminate/reduce/limit the spinning memory allocation 
 whilst not slowing down the very common un-contended case.
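
 The pattern at issue, in miniature (illustrative types, not the real 
 AtomicSortedColumns): every failed CAS discards a freshly allocated copy, so 
 allocation scales with contention:

 {code}
 import java.util.concurrent.atomic.AtomicReference;
 import java.util.function.UnaryOperator;

 final class CopyOnWriteHolder<T>
 {
     private final AtomicReference<T> state;

     CopyOnWriteHolder(T initial) { this.state = new AtomicReference<>(initial); }

     void update(UnaryOperator<T> cloneAndUpdate)
     {
         while (true)
         {
             T current = state.get();
             T next = cloneAndUpdate.apply(current); // allocates on EVERY attempt
             if (state.compareAndSet(current, next))
                 return;                             // losers' copies become garbage
         }
     }
 }
 {code}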





[jira] [Commented] (CASSANDRA-7928) Digest queries do not require alder32 checks

2014-09-15 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14134680#comment-14134680
 ] 

Benedict commented on CASSANDRA-7928:
-

Agreed

 Digest queries do not require alder32 checks
 

 Key: CASSANDRA-7928
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7928
 Project: Cassandra
  Issue Type: Improvement
Reporter: sankalp kohli
Priority: Minor
 Fix For: 2.1.1


  While reading data from sstables, C* does Adler32 checks for any data being 
 read. We have seen that this causes higher CPU usage while doing kernel 
 profiling. These checks might not be useful for digest queries, as they will 
 have a different digest in case of corruption. 
  





[jira] [Updated] (CASSANDRA-7928) Digest queries do not require alder32 checks

2014-09-15 Thread Benedict (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-7928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benedict updated CASSANDRA-7928:

Labels: performance  (was: )

 Digest queries do not require alder32 checks
 

 Key: CASSANDRA-7928
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7928
 Project: Cassandra
  Issue Type: Improvement
Reporter: sankalp kohli
Priority: Minor
  Labels: performance
 Fix For: 2.1.1


  While reading data from sstables, C* does Adler32 checks for any data being 
 read. We have seen that this causes higher CPU usage while doing kernel 
 profiling. These checks might not be useful for digest queries, as they will 
 have a different digest in case of corruption. 
  





[jira] [Updated] (CASSANDRA-7928) Digest queries do not require alder32 checks

2014-09-15 Thread Benedict (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-7928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benedict updated CASSANDRA-7928:

Assignee: T Jake Luciani

 Digest queries do not require alder32 checks
 

 Key: CASSANDRA-7928
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7928
 Project: Cassandra
  Issue Type: Improvement
Reporter: sankalp kohli
Assignee: T Jake Luciani
Priority: Minor
  Labels: performance
 Fix For: 2.1.1


  While reading data from sstables, C* does Adler32 checks for any data being 
 read. We have seen that this causes higher CPU usage while doing kernel 
 profiling. These checks might not be useful for digest queries, as they will 
 have a different digest in case of corruption. 
  





[jira] [Commented] (CASSANDRA-7546) AtomicSortedColumns.addAllWithSizeDelta has a spin loop that allocates memory

2014-09-15 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14135004#comment-14135004
 ] 

Benedict commented on CASSANDRA-7546:
-

1: that's great news :)
3: if you want lots of unique clustering key values per partition, stress 
currently has some limitations and you will need/want multiple clustering 
columns for it to be able to generate them smoothly without taking donkey's 
years per insert (on the workload generation side). Its minimum unit of 
generation (not insertion) is a single tier of clustering values, so with your 
spec it would generate all 100B values each time you wanted to insert any 
number of them.

So, you want to consider a yaml like this:

{noformat}
table_definition: |
  CREATE TABLE testtable (
p text,
c1 int, c2 int, c3 int,
v blob,
PRIMARY KEY(p, c1, c2, c3)
  ) WITH COMPACT STORAGE 
AND compaction = { 'class':'LeveledCompactionStrategy' }
AND comment='TestTable'

columnspec:
  - name: p
size: fixed(16)
  - name: c1
cluster: fixed(100)
  - name: c2
cluster: fixed(100)
  - name: c3
cluster: fixed(100)
  - name: v
size: gaussian(50..250)
{noformat}

Then you want to consider passing -pop seq=1..1M -insert visits=fixed(1M) 
revisits=uniform(1..1024)

The visits parameter here tells stress to split each partition into 1M distinct 
inserts, which given its deterministic 1M keys means exactly 1 item inserted 
each visit. The revisits distribution defines the number of partition keys we 
will operate over until we exhaust one before selecting another to include in 
our working set. 

Notice I've removed the population spec from your partition key in the yaml. 
This is because it is not necessary to constrain it here, as you can constrain 
the _seed_ population with the -pop parameter, which is the better way to do it 
here (so you can use the same yaml across runs). However, in this case given 
our revisits() distribution we can also not constrain the seed population, 
since once our first 1024 have been generated no other PK will be visited until 
one of these has been fully exhausted (i.e. 1024 * 1M inserts, quite a few...). 

You may also constrain the seed to the same range, which once a key is 
exhausted would always result in filling back in that key to the working set. 
It doesn't matter what distribution you choose in this case, since it will keep 
generating a value until one not present in the stash crops up, which if they 
operate over the same domain can only result in 1 item regardless of 
distribution, so I suggest a sequential distribution to ensure determinism.

 AtomicSortedColumns.addAllWithSizeDelta has a spin loop that allocates memory
 -

 Key: CASSANDRA-7546
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7546
 Project: Cassandra
  Issue Type: Bug
  Components: Core
Reporter: graham sanderson
Assignee: graham sanderson
 Fix For: 2.1.1

 Attachments: 7546.20.txt, 7546.20_2.txt, 7546.20_3.txt, 
 7546.20_4.txt, 7546.20_5.txt, 7546.20_6.txt, 7546.20_7.txt, 7546.20_7b.txt, 
 7546.20_alt.txt, 7546.20_async.txt, 7546.21_v1.txt, hint_spikes.png, 
 suggestion1.txt, suggestion1_21.txt, young_gen_gc.png


 In order to preserve atomicity, this code attempts to read, clone/update, 
 then CAS the state of the partition.
 Under heavy contention for updating a single partition this can cause some 
 fairly staggering memory growth (the more cores on your machine the worst it 
 gets).
 Whilst many usage patterns don't do highly concurrent updates to the same 
 partition, hinting today, does, and in this case wild (order(s) of magnitude 
 more than expected) memory allocation rates can be seen (especially when the 
 updates being hinted are small updates to different partitions which can 
 happen very fast on their own) - see CASSANDRA-7545
 It would be best to eliminate/reduce/limit the spinning memory allocation 
 whilst not slowing down the very common un-contended case.





[jira] [Updated] (CASSANDRA-7784) DROP table leaves the counter and row cache in a temporarily inconsistent state that, if saved during, will cause an exception to be thrown

2014-09-16 Thread Benedict (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-7784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benedict updated CASSANDRA-7784:

Reviewer: Benedict

 DROP table leaves the counter and row cache in a temporarily inconsistent 
 state that, if saved during, will cause an exception to be thrown
 ---

 Key: CASSANDRA-7784
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7784
 Project: Cassandra
  Issue Type: Bug
Reporter: Benedict
Assignee: Aleksey Yeschenko
Priority: Minor
 Attachments: 7784.txt


 It looks like this is quite a realistic race to hit reasonably often, since 
 we forceBlockingFlush after removing from Schema.cfIdMap, so there could be a 
 lengthy window to overlap with an auto-save





[jira] [Commented] (CASSANDRA-7784) DROP table leaves the counter and row cache in a temporarily inconsistent state that, if saved during, will cause an exception to be thrown

2014-09-16 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14135082#comment-14135082
 ] 

Benedict commented on CASSANDRA-7784:
-

Mostly LGTM; in fact it seems like an obvious improvement all round. But I'd 
either remove the @Deprecated method entirely, or not deprecate it. Since it's 
only used in one very cheap-to-replace place, I'd simply drop it.

 DROP table leaves the counter and row cache in a temporarily inconsistent 
 state that, if saved during, will cause an exception to be thrown
 ---

 Key: CASSANDRA-7784
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7784
 Project: Cassandra
  Issue Type: Bug
Reporter: Benedict
Assignee: Aleksey Yeschenko
Priority: Minor
 Attachments: 7784.txt


 It looks like this is quite a realistic race to hit reasonably often, since 
 we forceBlockingFlush after removing from Schema.cfIdMap, so there could be a 
 lengthy window to overlap with an auto-save





[jira] [Commented] (CASSANDRA-7784) DROP table leaves the counter and row cache in a temporarily inconsistent state that, if saved during, will cause an exception to be thrown

2014-09-16 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14135100#comment-14135100
 ] 

Benedict commented on CASSANDRA-7784:
-

Ah, yes: the confusing part is that it hasn't been replaced in a place where it 
could have been, while the second usage is a place where it _couldn't_ be. It 
would be clearer if the first usage were replaced, so this were more obvious. 
Possibly also worth dropping the UUID from the parameter list. But no biggy.

 DROP table leaves the counter and row cache in a temporarily inconsistent 
 state that, if saved during, will cause an exception to be thrown
 ---

 Key: CASSANDRA-7784
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7784
 Project: Cassandra
  Issue Type: Bug
Reporter: Benedict
Assignee: Aleksey Yeschenko
Priority: Minor
 Fix For: 2.1.1

 Attachments: 7784.txt


 It looks like this is quite a realistic race to hit reasonably often, since 
 we forceBlockingFlush after removing from Schema.cfIdMap, so there could be a 
 lengthy window to overlap with an auto-save





[jira] [Commented] (CASSANDRA-7928) Digest queries do not require alder32 checks

2014-09-16 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14135348#comment-14135348
 ] 

Benedict commented on CASSANDRA-7928:
-

WFM :)

 Digest queries do not require alder32 checks
 

 Key: CASSANDRA-7928
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7928
 Project: Cassandra
  Issue Type: Improvement
Reporter: sankalp kohli
Assignee: sankalp kohli
Priority: Minor
  Labels: performance
 Fix For: 2.1.1


  While reading data from sstables, C* does Adler32 checks for any data being 
 read. We have seen that this causes higher CPU usage while doing kernel 
 profiling. These checks might not be useful for digest queries, as they will 
 have a different digest in case of corruption. 
  





[jira] [Updated] (CASSANDRA-7519) Further stress improvements to generate more realistic workloads

2014-09-16 Thread Benedict (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-7519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benedict updated CASSANDRA-7519:

Labels: docs tools  (was: tools)

 Further stress improvements to generate more realistic workloads
 

 Key: CASSANDRA-7519
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7519
 Project: Cassandra
  Issue Type: Improvement
  Components: Tools
Reporter: Benedict
Assignee: Benedict
Priority: Minor
  Labels: docs, tools
 Fix For: 2.1.1


 We generally believe that the most common workload is for reads to 
 exponentially prefer most recently written data. However as stress currently 
 behaves we have two id generation modes: sequential and random (although 
 random can be distributed). I propose introducing a new mode which is 
 somewhat like sequential, except we essentially 'look back' from the current 
 id by some amount defined by a distribution. I may possibly make the position 
 only increment as it's first written to, so that this mode can be run 
 from a clean slate with a mixed workload. This should allow us to generate 
 workloads that are more representative.
 At the same time, I will introduce a timestamp value generator for primary 
 key columns that is strictly ascending, i.e. has some random component but is 
 based off of the actual system time (or some shared monotonically increasing 
 state) so that we can again generate a more realistic workload. This may be 
 challenging to tie in with the new procedurally generated partitions, but I'm 
 sure it can be done without too much difficulty.





[jira] [Updated] (CASSANDRA-7916) Stress should collect and cross-cluster GC statistics

2014-09-16 Thread Benedict (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-7916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benedict updated CASSANDRA-7916:

Labels: docs  (was: )

 Stress should collect and cross-cluster GC statistics
 -

 Key: CASSANDRA-7916
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7916
 Project: Cassandra
  Issue Type: Improvement
  Components: Tools
Reporter: Benedict
Assignee: Benedict
Priority: Minor
  Labels: docs
 Fix For: 2.1.1


 It would be useful to see stress outputs deliver cross-cluster statistics, 
 the most useful being GC data. Some simple changes to GCInspector collect the 
 data, and can deliver it to a nodetool request or to stress over JMX.





[jira] [Updated] (CASSANDRA-6146) CQL-native stress

2014-09-16 Thread Benedict (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-6146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benedict updated CASSANDRA-6146:

Labels: docs qa-resolved  (was: qa-resolved)

 CQL-native stress
 -

 Key: CASSANDRA-6146
 URL: https://issues.apache.org/jira/browse/CASSANDRA-6146
 Project: Cassandra
  Issue Type: New Feature
  Components: Tools
Reporter: Jonathan Ellis
Assignee: T Jake Luciani
  Labels: docs, qa-resolved
 Fix For: 2.1 rc3

 Attachments: 6146-v2.txt, 6146.txt, 6164-v3.txt


 The existing CQL support in stress is not worth discussing.  We need to 
 start over, and we might as well kill two birds with one stone and move to 
 the native protocol while we're at it.





[jira] [Commented] (CASSANDRA-5657) remove deprecated metrics

2014-09-16 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-5657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14135464#comment-14135464
 ] 

Benedict commented on CASSANDRA-5657:
-

This does _not_ appear to be the same as the just closed CASSANDRA-7943, which 
suggests dropping 'legacy metrics' - which seems to refer to the custom 
histogram. This ticket refers to removing deprecated mbeans and object wrappers 
only. I am -1 dropping the legacy histograms, since they are considerably more 
accurate than the yammer histograms. Before we consider dropping them we need 
to consider holistically our approach to metrics, which really need 
overhauling, since yammer metrics are lacking.

 remove deprecated metrics
 -

 Key: CASSANDRA-5657
 URL: https://issues.apache.org/jira/browse/CASSANDRA-5657
 Project: Cassandra
  Issue Type: Task
  Components: Tools
Reporter: Jonathan Ellis
Assignee: T Jake Luciani
  Labels: technical_debt
 Fix For: 3.0








[jira] [Commented] (CASSANDRA-7818) Improve compaction logging

2014-09-16 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14135538#comment-14135538
 ] 

Benedict commented on CASSANDRA-7818:
-

Is it worth shortening the SSTableReader() prefix in the message to clean the 
messages up a little further? There's no real need for it. Something like
{code}
StringBuilder ssTableLoggerMsg = new StringBuilder("[");
for (SSTableReader sstr : sstables)
{
    ssTableLoggerMsg.append(String.format("%s:level %d, ", sstr.getFilename(), sstr.getSSTableLevel()));
}
ssTableLoggerMsg.append("]");
{code}
?

 Improve compaction logging
 --

 Key: CASSANDRA-7818
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7818
 Project: Cassandra
  Issue Type: Improvement
Reporter: Marcus Eriksson
Assignee: Mihai Suteu
Priority: Minor
  Labels: compaction, lhf
 Fix For: 3.0

 Attachments: cassandra-7818.patch


 We should log more information about compactions to be able to debug issues 
 more efficiently
 * give each CompactionTask an id that we log (so that you can relate the 
 start-compaction-messages to the finished-compaction ones)
 * log what level the sstables are taken from




