BEWARE https://issues.apache.org/jira/browse/CASSANDRA-9504

2015-10-19 Thread Graham Sanderson
If you had Cassandra 2.0.x (possibly before) and upgraded to Cassandra 2.1, you 
may have had

commitlog_sync: batch
commitlog_sync_batch_window_in_ms: 25

in your cassandra.yaml

It turned out that this setting was effectively broken in 2.0 (i.e. fsyncs
just happened immediately), but fixed in 2.1. That meant every mutation
blocked its writer thread for 25ms, so at 80 mutations/sec/writer thread
you’d start DROPPING mutations if your write timeout is 2000ms.
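
The arithmetic behind that claim can be sketched roughly as follows. This is a back-of-the-envelope model, not Cassandra code: it assumes each writer thread handles one mutation at a time and is blocked for the full batch window on every mutation, and the function names are made up for illustration. Under those assumptions a thread tops out at 40 mutations/sec with a 25ms window, and a backlog of 80 mutations behind one thread takes the whole 2000ms write timeout to drain, which is one possible reading of the figure of 80.

```python
def max_mutations_per_sec_per_thread(window_ms):
    """Throughput ceiling if every mutation blocks its thread for window_ms.

    Assumption: one mutation at a time per writer thread, each waiting
    the full batch window before it is acknowledged.
    """
    return 1000.0 / window_ms


def backlog_that_hits_timeout(window_ms, write_timeout_ms):
    """Number of mutations queued behind one thread that take the whole
    write timeout to drain at one mutation per window_ms."""
    return write_timeout_ms / window_ms


if __name__ == "__main__":
    # 25 is the old setting, 2 the new default, 1 the value some suggested.
    for window_ms in (25, 2, 1):
        print(
            window_ms,
            max_mutations_per_sec_per_thread(window_ms),
            backlog_that_hits_timeout(window_ms, 2000),
        )
```

With a 2ms window the same model gives a ceiling of 500 mutations/sec per thread, which is why shrinking the window relieves the problem.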

This turns out to be a massive problem if you write fast, so the default
commitlog_sync_batch_window_in_ms was changed to 2ms in 2.1.6 as a way of
addressing it (with some suggesting 1ms).

Neither of these changes got much fanfare beyond an eventual reference in
CHANGES.TXT.

With 2.1.9 if you aren’t doing periodic sync, then I think the new behavior is 
just to sync whenever the commit logs have a consistent/complete set of 
mutations ready.

Note this is hard to diagnose because CPU is idle and pretty much all latency 
metrics (except the overall coordinator write) do not count this time (and you 
probably weren’t noticing the 25ms write ACK time). It turned out for us that 
one of our nodes was getting more writes (> 20k mutations per second) which was 
about the magic number… anything shy of that and everything looked fine, but 
just by going slightly over, this node was dropping lots of mutations.








Re: BEWARE https://issues.apache.org/jira/browse/CASSANDRA-9504

2015-10-19 Thread Michael Shuler


If you would be kind enough to submit a patch to JIRA for NEWS.txt 
(aligned with the right versions you're warning about) that includes the 
info upgrading users might need, that would be great!


--
Kind regards,
Michael


Re: BEWARE https://issues.apache.org/jira/browse/CASSANDRA-9504

2015-10-19 Thread Graham Sanderson
But basically:
If you were on 2.1.0 through 2.1.5, you probably had no way of knowing that
you needed to change your config.
If you were on 2.1.6 through 2.1.8, you may not have noticed the NEWS.txt
change, and so may not have changed your config.
If you are on 2.1.9+, you are probably OK.

If you are using periodic fsync, then you don’t have an issue.
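
As a rough illustration, the version guidance above could be written down as a small check. This is only a sketch of the rules as stated in this thread: the function names and return strings are hypothetical, the "window of 2ms or less counts as already tuned" rule is an assumption based on the new suggested default, and versions are assumed to be plain "x.y.z" strings.

```python
def parse_version(text):
    """Turn '2.1.6' into (2, 1, 6). Assumes a plain x.y.z version string."""
    return tuple(int(part) for part in text.split("."))


def batch_window_risk(version, commitlog_sync, window_ms):
    """Classify a node using the guidance from this thread (not an official rule)."""
    if commitlog_sync != "batch":
        # Periodic fsync is not affected according to the thread.
        return "not affected: periodic sync"
    v = parse_version(version)
    if v < (2, 1, 0):
        # In 2.0.x the window was reportedly ignored (fsync happened at once).
        return "not affected yet: window effectively ignored before 2.1"
    if v >= (2, 1, 9):
        return "probably OK"
    if window_ms <= 2:
        # Assumption: the 2ms suggested default is treated as small enough.
        return "window already small"
    if v <= (2, 1, 5):
        return "at risk: these releases carried no warning"
    return "at risk: see the NEWS.txt note added in 2.1.6"


if __name__ == "__main__":
    print(batch_window_risk("2.1.3", "batch", 25))
    print(batch_window_risk("2.1.7", "batch", 25))
    print(batch_window_risk("2.1.9", "batch", 25))
```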






Re: BEWARE https://issues.apache.org/jira/browse/CASSANDRA-9504

2015-10-19 Thread Graham Sanderson
- commitlog_sync_batch_window_in_ms behavior has changed from the
  maximum time to wait between fsync to the minimum time.  We are 
  working on making this more user-friendly (see CASSANDRA-9533) but in the
  meantime, this means 2.1 needs a much smaller batch window to keep
  writer threads from starving.  The suggested default is now 2ms.
This note was added retroactively to NEWS.txt in 2.1.6, which is why it is not
obvious.
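
To make the "maximum versus minimum" distinction concrete, here is a toy model of how long a single write waits for its ack under each reading of the window. It is a simplification for illustration only, not how Cassandra's commit log is implemented: it assumes syncs under the "minimum" reading start only on fixed ticks spaced window_ms apart, and that an fsync takes a constant fsync_ms. The function name and parameters are made up.

```python
import math


def ack_delay_ms(arrival_ms, window_ms, fsync_ms, window_is_minimum):
    """Toy estimate of how long one write waits before it is acknowledged."""
    if not window_is_minimum:
        # 2.0-like behavior as described in this thread: fsync right away,
        # so the write only pays for the fsync itself.
        return fsync_ms
    # 2.1-like behavior in this toy model: syncs start only on ticks spaced
    # window_ms apart, so the write waits for the next tick, then the fsync.
    next_tick = math.ceil(arrival_ms / window_ms) * window_ms
    return (next_tick - arrival_ms) + fsync_ms


if __name__ == "__main__":
    # A write arriving 1ms after a tick, with a 1ms fsync.
    print(ack_delay_ms(1, 25, 1, window_is_minimum=False))
    print(ack_delay_ms(1, 25, 1, window_is_minimum=True))
    print(ack_delay_ms(1, 2, 1, window_is_minimum=True))
```

In this model an unlucky write with a 25ms window waits about the whole window, while a 2ms window keeps the wait near the fsync cost.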



