Re: CASSANDRA-13241 lower default chunk_length_in_kb

2018-10-11 Thread Jeff Jirsa



I think 16k is a better default, but it should only affect new tables. Whoever 
changes it, please make sure you think about the upgrade path. 


> On Oct 12, 2018, at 2:31 AM, Ben Bromhead  wrote:
> 
> This is something that's bugged me for ages. Tbh, the performance gain for
> most use cases far outweighs the increase in memory usage, and I would even
> be in favor of changing the default now and optimizing the storage cost
> later (if that's found to be worth it).
> 
> For some anecdotal evidence:
> 4kb is usually what we end up setting it to. 16kb feels more reasonable
> given the memory impact, but what would be the point if, in practice, most
> folks set it to 4kb anyway?
> 
> Note that chunk_length will largely depend on your read sizes, but 4k is
> the floor for most physical devices in terms of block size.
> 
> +1 for making this change in 4.0 given the small size of the change and
> the large improvement to new users' experience (as long as we are explicit
> in the documentation about memory consumption).
> 
> 
>> On Thu, Oct 11, 2018 at 7:11 PM Ariel Weisberg  wrote:
>> 
>> Hi,
>> 
>> This is regarding https://issues.apache.org/jira/browse/CASSANDRA-13241
>> 
>> This ticket has languished for a while. IMO it's too late in 4.0 to
>> implement a more memory-efficient representation for compressed chunk
>> offsets. However, I don't think we should put out another release with the
>> current 64k default, as it's pretty unreasonable.
>> 
>> I propose that we lower the value to 16kb. 4k might never be the correct
>> default anyway, as there is a cost to compression, and 16k will still be a
>> large improvement.
>> 
>> Benedict and Jon Haddad are both +1 on making this change for 4.0. In the
>> past there has been some consensus about reducing this value, though
>> perhaps in combination with a more memory-efficient representation.
>> 
>> The napkin math for what this costs is:
>> "If you have 1TB of uncompressed data, with 64k chunks that's 16M chunks
>> at 8 bytes each (128MB).
>> With 16k chunks, that's 512MB.
>> With 4k chunks, it's 2G.
>> Per terabyte of data (pre-compression)."
>> 
>> https://issues.apache.org/jira/browse/CASSANDRA-13241?focusedCommentId=15886621&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15886621
>> 
>> By way of comparison, memory mapping the files has a similar cost of 8
>> bytes per 4k page. Multiple mappings make this more expensive. With a
>> default of 16kb this would be 4x less expensive than memory mapping a file.
>> I only mention this to give a sense of the costs we are already paying; I
>> am not saying they are directly related.
>> 
>> I'll wait a week for discussion and, if there is consensus, make the change.
>> 
>> Regards,
>> Ariel
>> 
>> 
>> --
> Ben Bromhead
> CTO | Instaclustr 
> +1 650 284 9692
> Reliability at Scale
> Cassandra, Spark, Elasticsearch on AWS, Azure, GCP and Softlayer




Re: CASSANDRA-13241 lower default chunk_length_in_kb

2018-10-11 Thread Pavel Yaskevich
On Thu, Oct 11, 2018 at 4:31 PM Ben Bromhead  wrote:

> This is something that's bugged me for ages. Tbh, the performance gain for
> most use cases far outweighs the increase in memory usage, and I would even
> be in favor of changing the default now and optimizing the storage cost
> later (if that's found to be worth it).
>
> For some anecdotal evidence:
> 4kb is usually what we end up setting it to. 16kb feels more reasonable
> given the memory impact, but what would be the point if, in practice, most
> folks set it to 4kb anyway?
>
> Note that chunk_length will largely depend on your read sizes, but 4k is
> the floor for most physical devices in terms of block size.
>

It might be worthwhile to investigate how splitting the chunk size into
separate data, index, and compaction sizes would affect performance.


>
> +1 for making this change in 4.0 given the small size of the change and
> the large improvement to new users' experience (as long as we are explicit
> in the documentation about memory consumption).
>
>
> On Thu, Oct 11, 2018 at 7:11 PM Ariel Weisberg  wrote:
>
> > Hi,
> >
> > This is regarding https://issues.apache.org/jira/browse/CASSANDRA-13241
> >
> > This ticket has languished for a while. IMO it's too late in 4.0 to
> > implement a more memory-efficient representation for compressed chunk
> > offsets. However, I don't think we should put out another release with
> > the current 64k default, as it's pretty unreasonable.
> >
> > I propose that we lower the value to 16kb. 4k might never be the correct
> > default anyway, as there is a cost to compression, and 16k will still be
> > a large improvement.
> >
> > Benedict and Jon Haddad are both +1 on making this change for 4.0. In
> > the past there has been some consensus about reducing this value, though
> > perhaps in combination with a more memory-efficient representation.
> >
> > The napkin math for what this costs is:
> > "If you have 1TB of uncompressed data, with 64k chunks that's 16M chunks
> > at 8 bytes each (128MB).
> > With 16k chunks, that's 512MB.
> > With 4k chunks, it's 2G.
> > Per terabyte of data (pre-compression)."
> >
> >
> > https://issues.apache.org/jira/browse/CASSANDRA-13241?focusedCommentId=15886621&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15886621
> >
> > By way of comparison, memory mapping the files has a similar cost of 8
> > bytes per 4k page. Multiple mappings make this more expensive. With a
> > default of 16kb this would be 4x less expensive than memory mapping a
> > file. I only mention this to give a sense of the costs we are already
> > paying; I am not saying they are directly related.
> >
> > I'll wait a week for discussion and, if there is consensus, make the
> > change.
> >
> > Regards,
> > Ariel
> >
> >
> > --
> Ben Bromhead
> CTO | Instaclustr 
> +1 650 284 9692
> Reliability at Scale
> Cassandra, Spark, Elasticsearch on AWS, Azure, GCP and Softlayer
>


cluster launching tool for dev work

2018-10-11 Thread Jonathan Haddad
Recently I reached an inflection point where my annoyance at launching
clusters finally overcame my laziness.  I wanted something similar to CCM,
so I wrote it.

The tool was designed for our usage at TLP, which usually means quickly
firing up clusters for running tests.  It started out as some scripts,
picked up a little Docker, and eventually reached a point where it's generic
enough to open source.

I don't expect (or want, for that matter) this to be a tool for normal
Cassandra users to launch clusters.  It's designed for taking a C*
codebase, turning it into a deb package, and pushing it onto a handful of
servers.  It configures some things for you automatically, like seeds.  It
has bugs; you'll probably need to read the code to figure them out while I
work on improving the docs.  It definitely needs more features & polish as
well, but I figured, hey, might as well share what I have; maybe someone
will find it useful.

Code: https://github.com/thelastpickle/tlp-cluster
Docs: http://thelastpickle.com/tlp-cluster/

It's Apache licensed, so feel free to do what you want with it.  I'll try
to update the docs to make it a little friendlier to use.
-- 
Jon Haddad
http://www.rustyrazorblade.com
twitter: rustyrazorblade


Re: CASSANDRA-13241 lower default chunk_length_in_kb

2018-10-11 Thread Ben Bromhead
This is something that's bugged me for ages. Tbh, the performance gain for
most use cases far outweighs the increase in memory usage, and I would even
be in favor of changing the default now and optimizing the storage cost later
(if that's found to be worth it).

For some anecdotal evidence:
4kb is usually what we end up setting it to. 16kb feels more reasonable given
the memory impact, but what would be the point if, in practice, most folks
set it to 4kb anyway?

Note that chunk_length will largely depend on your read sizes, but 4k is the
floor for most physical devices in terms of block size.
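
To make the trade-off concrete, here's a rough sketch in Python (assuming a
point read decompresses exactly one chunk; the 500-byte row size is
hypothetical):

    # Bytes decompressed per point read, assuming one chunk per read.
    row_bytes = 500  # hypothetical average row size
    for chunk_kb in (4, 16, 64):
        chunk_bytes = chunk_kb * 1024
        print(f"{chunk_kb}kb chunks: ~{chunk_bytes / row_bytes:.0f}x read amplification")

With 64k chunks a 500-byte read decompresses ~131x the data it needs; 16k
cuts that to ~33x and 4k to ~8x, which is roughly why read-heavy workloads
push the setting down.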

+1 for making this change in 4.0 given the small size of the change and the
large improvement to new users' experience (as long as we are explicit in the
documentation about memory consumption).


On Thu, Oct 11, 2018 at 7:11 PM Ariel Weisberg  wrote:

> Hi,
>
> This is regarding https://issues.apache.org/jira/browse/CASSANDRA-13241
>
> This ticket has languished for a while. IMO it's too late in 4.0 to
> implement a more memory-efficient representation for compressed chunk
> offsets. However, I don't think we should put out another release with the
> current 64k default, as it's pretty unreasonable.
>
> I propose that we lower the value to 16kb. 4k might never be the correct
> default anyway, as there is a cost to compression, and 16k will still be a
> large improvement.
>
> Benedict and Jon Haddad are both +1 on making this change for 4.0. In the
> past there has been some consensus about reducing this value, though
> perhaps in combination with a more memory-efficient representation.
>
> The napkin math for what this costs is:
> "If you have 1TB of uncompressed data, with 64k chunks that's 16M chunks
> at 8 bytes each (128MB).
> With 16k chunks, that's 512MB.
> With 4k chunks, it's 2G.
> Per terabyte of data (pre-compression)."
>
> https://issues.apache.org/jira/browse/CASSANDRA-13241?focusedCommentId=15886621&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15886621
>
> By way of comparison, memory mapping the files has a similar cost of 8
> bytes per 4k page. Multiple mappings make this more expensive. With a
> default of 16kb this would be 4x less expensive than memory mapping a file.
> I only mention this to give a sense of the costs we are already paying; I
> am not saying they are directly related.
>
> I'll wait a week for discussion and, if there is consensus, make the change.
>
> Regards,
> Ariel
>
>
> --
Ben Bromhead
CTO | Instaclustr 
+1 650 284 9692
Reliability at Scale
Cassandra, Spark, Elasticsearch on AWS, Azure, GCP and Softlayer


CASSANDRA-13241 lower default chunk_length_in_kb

2018-10-11 Thread Ariel Weisberg
Hi,

This is regarding https://issues.apache.org/jira/browse/CASSANDRA-13241

This ticket has languished for a while. IMO it's too late in 4.0 to implement a
more memory-efficient representation for compressed chunk offsets. However, I
don't think we should put out another release with the current 64k default, as
it's pretty unreasonable.

I propose that we lower the value to 16kb. 4k might never be the correct
default anyway, as there is a cost to compression, and 16k will still be a
large improvement.
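
For what it's worth, anyone who doesn't want to wait on the default can
already override it per table. A hedged sketch using the DataStax Python
driver (ks.tbl and the node address are placeholders):

    # Override chunk_length_in_kb for one table (hypothetical keyspace/table).
    from cassandra.cluster import Cluster

    session = Cluster(["127.0.0.1"]).connect()
    session.execute(
        "ALTER TABLE ks.tbl WITH compression = "
        "{'class': 'LZ4Compressor', 'chunk_length_in_kb': 16}"
    )

This only affects SSTables written afterwards; existing ones keep their old
chunk size until rewritten (e.g. by compaction or upgradesstables).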

Benedict and Jon Haddad are both +1 on making this change for 4.0. In the past
there has been some consensus about reducing this value, though perhaps in
combination with a more memory-efficient representation.

The napkin math for what this costs is:
"If you have 1TB of uncompressed data, with 64k chunks that's 16M chunks at 8 
bytes each (128MB).
With 16k chunks, that's 512MB.
With 4k chunks, it's 2G.
Per terabyte of data (pre-compression)."
https://issues.apache.org/jira/browse/CASSANDRA-13241?focusedCommentId=15886621&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15886621
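
The arithmetic is easy to check (a quick sketch that just restates the quoted
figures; the 8 bytes per chunk offset is the premise, not something this
verifies):

    # Offset overhead: 8 bytes per compressed chunk, per TB of uncompressed data.
    TB = 1024 ** 4
    for chunk_kb in (64, 16, 4):
        chunks = TB // (chunk_kb * 1024)
        mb = chunks * 8 / 1024 ** 2
        print(f"{chunk_kb}kb chunks: {chunks / 2**20:.0f}M chunks, {mb:.0f}MB per TB")

This prints 16M/128MB, 64M/512MB, and 256M/2048MB respectively, matching the
numbers above.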

By way of comparison, memory mapping the files has a similar cost of 8 bytes
per 4k page. Multiple mappings make this more expensive. With a default of 16kb
this would be 4x less expensive than memory mapping a file. I only mention this
to give a sense of the costs we are already paying; I am not saying they are
directly related.

I'll wait a week for discussion and, if there is consensus, make the change.

Regards,
Ariel
