[jira] [Comment Edited] (CASSANDRA-13241) Lower default chunk_length_in_kb from 64kb to 16kb

2018-10-24 Thread Joseph Lynch (JIRA)


[ https://issues.apache.org/jira/browse/CASSANDRA-13241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16661979#comment-16661979 ]

Joseph Lynch edited comment on CASSANDRA-13241 at 10/24/18 9:21 AM:


[~brstgt]
{quote}4-8kb will not "destroy" the OS page cache. Linux Pages are 4kb by 
default, so 4kb chunks perfectly fit into cache pages. Actually read-ahead will 
kill your performance if you have a lot of disk-reads going on. This can kill 
your page cache if your dataset is a lot larger than available memory and you 
are doing many random reads with small resultsets.
{quote}
I don't mean the OS would be unable to fetch the pages; I mean that giving up 
that much compression ratio grows the dataset, so less of the hot set fits in 
the cache to begin with. For example, on most JSON datasets you get something 
like a 0.2-0.35 ratio. If that regresses to 0.5-0.6 you've roughly doubled your 
dataset, and it's plausible that instead of fitting 99%+ of your hot dataset in 
RAM you're only fitting around 60%. I stand by the claim that this much ratio 
loss would "destroy" the OS page cache hit rate, which very much impacts tail 
latencies. I've measured this on numerous production clusters using 
[cachestat|https://github.com/iovisor/bcc/blob/master/tools/cachestat_example.txt]
 and 
[biosnoop|https://github.com/iovisor/bcc/blob/master/tools/biosnoop_example.txt]
 and it is a very real problem.
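
To make the arithmetic concrete, here's a rough back-of-the-envelope sketch; 
the 300GB hot set and 100GB of page cache are made-up numbers purely for 
illustration, not measurements from any particular cluster:
{noformat}
# Hypothetical numbers for illustration only: a 300 GB uncompressed hot set
# and ~100 GB of RAM available to the OS page cache.
hot_set_uncompressed_gb = 300
page_cache_gb = 100

def cached_fraction(ratio):
    """Fraction of the hot set that fits in the page cache at a given
    compressed/uncompressed ratio (lower ratio = better compression)."""
    compressed_gb = hot_set_uncompressed_gb * ratio
    return min(1.0, page_cache_gb / compressed_gb)

for ratio in (0.30, 0.55):
    print(f"ratio {ratio:.2f}: {cached_fraction(ratio):.0%} of hot set cached")

# ratio 0.30: 100% of hot set cached  (90 GB compressed vs 100 GB of cache)
# ratio 0.55: 61% of hot set cached   (165 GB compressed vs 100 GB of cache)
{noformat}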
{quote}We use 4kb chunks and we observed a TREMENDOUS difference in IO reads 
when disabling read ahead completely. With default read ahead kernel settings, 
the physical read IO is roughly 20-30x in our use case, specifically it was 
like ~20MB/s vs 600MB/s.
{quote}
Yes, disabling IO read-ahead and using 4kb chunks works really well if you 
have fast local drives (e.g. NVMe SSDs) and are doing small point reads. But if 
you're doing any kind of key scans (like 10kb result sets) or running on 
limited-IOPS, high-latency drives (like AWS gp2 or GCE persistent SSDs), then 
this can be really bad, especially for compaction. I don't think the default 
setting should assume very low latency, high-IOPS local storage; assuming 
something in between seems more reasonable to me.
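
As a rough sketch of the trade-off being described here (the 10kb scan, the 
128kb read-ahead figure, and the worst-case cache-miss assumption are mine, 
for illustration only):
{noformat}
import math

# Illustrative trade-off: a 10kb key scan served entirely from cold data.
def io_cost(result_kb, chunk_kb, readahead_kb):
    """Return (physical IOs, kb read) assuming every chunk read misses the
    page cache (worst case) and triggers a full read-ahead window."""
    ios = math.ceil(result_kb / chunk_kb)
    kb_read = ios * max(chunk_kb, readahead_kb)
    return ios, kb_read

# 4kb chunks with a common 128kb Linux read-ahead: huge overread.
print(io_cost(10, 4, 128))   # (3, 384) -> ~38x the logical bytes
# 4kb chunks with read-ahead disabled: bytes are fine, but 3 IOs per scan,
# which hurts on limited-IOPS drives (gp2, persistent SSD) and compaction.
print(io_cost(10, 4, 0))     # (3, 12)
# 32kb chunks with read-ahead disabled: one IO, modest overread.
print(io_cost(10, 32, 0))    # (1, 32)
{noformat}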
{quote}Sum-up: Not 4KB chunk size alone is the problem but all components have 
to be tuned and aligned to remove bottlenecks and make the whole system 
performant. The specific params always depend on the particular case.
{quote}
I completely agree that you have to tune everything, and to be clear I 100% 
agree that reducing the 64kb default is a great idea (imo 32kb seems eminently 
reasonable, and 16kb is as low as I'd personally default LZ4 to). I also think 
figuring out how to properly {{madvise}} so that users don't have to tune 
read-ahead would be a huge win ... I just don't think memory usage is the main 
problem here (the memory usage is usually worth it); I think ratio loss, and 
the corresponding loss in cache hit rate driving higher IOPS, is going to be 
the big potential issue if we go much smaller by default.
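
For what it's worth, a minimal sketch of the kind of {{madvise}} hint I mean; 
the Python {{mmap}} wrapper and the data file path here are just illustrative 
assumptions (Cassandra would issue the equivalent hint from Java/JNA on its 
mmapped segments):
{noformat}
import mmap

# Hypothetical SSTable path, for illustration only.
with open("/var/lib/cassandra/data/ks/tbl/md-1-big-Data.db", "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ)

# MADV_RANDOM tells the kernel to expect random access on this mapping, so it
# skips its normal read-ahead for these pages (Python 3.8+, Linux).
mm.madvise(mmap.MADV_RANDOM)

# ... reads through `mm` now avoid read-ahead without the operator having to
# change the block-device readahead setting ...
mm.close()
{noformat}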



[jira] [Comment Edited] (CASSANDRA-13241) Lower default chunk_length_in_kb from 64kb to 16kb

2018-10-23 Thread Joseph Lynch (JIRA)


[ https://issues.apache.org/jira/browse/CASSANDRA-13241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16661492#comment-16661492 ]

Joseph Lynch edited comment on CASSANDRA-13241 at 10/24/18 12:05 AM:
-

Not to add yet another perspective to this, but I am not sure we're 
considering the compression ratio loss on real-world data enough here. We've 
been talking a lot about the memory requirements, but I think the bigger issues 
are:
 * Ratio loss leading to less of the dataset being hot in the OS page cache
 * OS read-ahead is usually 16 or 32kb, so if you're reading less than that 
from disk you're still going to read 16 or 32kb...

For Cassandra, which relies heavily on the OS page cache for performance, I 
think 16kb is the absolute minimum I would default to. For example, from IRC 
today: I ran Ariel's ratio 
[script|https://gist.github.com/jolynch/411e62ac592bfb55cfdd5db87c77ef6f] on a 
(somewhat arbitrary) 3.0.17 production cluster dataset and saw the following 
ratios:
{noformat}
Chunk size 4096, ratio 0.541505
Chunk size 8192, ratio 0.467537
Chunk size 16384, ratio 0.425122
Chunk size 32768, ratio 0.387040
Chunk size 65536, ratio 0.352454
{noformat}
The reduction in ratio at 4-8kb would destroy the OS page cache imo. 16kb isn't 
too bad, and 32kb is downright fine.
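
(For context, a minimal sketch of what such a per-chunk-size ratio measurement 
looks like; this is my illustration, not Ariel's actual script, and it assumes 
the third-party python {{lz4}} package plus some locally readable sample data 
file:)
{noformat}
import lz4.frame  # assumes the third-party `lz4` package is installed

# Illustrative only (not Ariel's script): compress a sample file in chunks of
# various sizes and report the compressed/uncompressed ratio per chunk size.
SAMPLE = "sample-Data.db"   # hypothetical path to representative data

with open(SAMPLE, "rb") as f:
    data = f.read()

for chunk_kb in (4, 8, 16, 32, 64):
    chunk = chunk_kb * 1024
    compressed = sum(
        len(lz4.frame.compress(data[i:i + chunk]))
        for i in range(0, len(data), chunk)
    )
    print(f"Chunk size {chunk}, ratio {compressed / len(data):.6f}")
{noformat}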

In my experience, 32kb is often an easy win, and 16kb is often a good idea for 
less compressible datasets. Last I checked Scylla uses direct io and bypasses 
the OS cache so I don't think we should use their default unless we implement 
direct io as well (and the buffer cache on top of it)...

If the hot dataset is much less than RAM, then yea 4kb all the way ...



> Lower default chunk_length_in_kb from 64kb to 16kb
> --
>
> Key: CASSANDRA-13241
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13241
> Project: Cassandra
>  Issue Type: Wish
>  Components: Core
>Reporter: Benjamin Roth
>Assignee: Ariel Weisberg
>Priority: Major
> Attachments: CompactIntegerSequence.java, 
> CompactIntegerSequenceBench.java, CompactSummingIntegerSequence.java
>
>
> Having a too low chunk size may result in some wasted disk space. A too high 
> chunk size may lead to massive overreads and may have a critical impact on 
> overall system performance.
> In my case, the default chunk size led to peak read IO of up to 1GB/s and 
> avg reads of 200MB/s. After lowering the chunk size (of course aligned with 
> read ahead), the avg read IO went below 20MB/s, more like 10-15MB/s.
> The risk of (physical) overreads increases with a lower (page cache size) / 
> (total data size) ratio.
> High chunk sizes are mostly appropriate for bigger payloads per request, but 
> if the model consists rather of small rows or small result sets, the read 
> overhead with a 64kb chunk size is insanely high. This applies, for example, 
> to (small) skinny rows.
> Please also see here:
> https://groups.google.com/forum/#!topic/scylladb-dev/j_qXSP-6-gY
> To give you some insights what a difference it can make (460GB data, 128GB 
> RAM):
> - Latency of a quite large CF: https://cl.ly/1r3e0W0S393L
> - Disk throughput: https://cl.ly/2a0Z250S1M3c
> - This shows that the request distribution remained the same, so no "dynamic 
> snitch magic": https://cl.ly/3E0t1T1z2c0J


