[jira] [Commented] (CASSANDRA-15213) DecayingEstimatedHistogramReservoir Inefficiencies

2020-02-18 Thread Jordan West (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17039297#comment-17039297
 ] 

Jordan West commented on CASSANDRA-15213:
-

Updated the default and the commit message. 
[branch|https://github.com/jrwest/cassandra/commits/jwest/15213] 

> DecayingEstimatedHistogramReservoir Inefficiencies
> --
>
> Key: CASSANDRA-15213
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15213
> Project: Cassandra
>  Issue Type: Bug
>  Components: Observability/Metrics
>Reporter: Benedict Elliott Smith
>Assignee: Jordan West
>Priority: Normal
> Fix For: 4.0-beta
>
> Attachments: 15213-perf-branch, 15213-perf-trunk
>
>
> * {{LongAdder}} introduced to trunk consumes 9MiB of heap without user 
> schemas, and this will grow significantly under contention and user schemas 
> with many tables.  This is because {{LongAdder}} is a very heavy class 
> designed for single contended values.  
>  ** This can likely be improved significantly, without significant loss of 
> performance in the contended case, by simply increasing the size of our 
> primitive backing array and providing multiple buckets, with each thread 
> picking a bucket to increment, or simply multiple backing arrays.  Probably a 
> better way still to do this would be to introduce some competition detection 
> to the update, much like {{LongAdder}} utilises, that increases the number of 
> backing arrays under competition.
>  ** To save memory this approach could partition the space into chunks that 
> are likely to be updated together, so that we do not need to duplicate the 
> entire array under competition.
>  * Similarly, binary search is costly: as a share of the new networking work 
> it is a measurable cost (without filtering it was > 10% of the CPU used 
> overall).  We can compute an approximation floor(log2 n / log2 1.2) extremely 
> cheaply, saving the random memory access costs.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-15213) DecayingEstimatedHistogramReservoir Inefficiencies

2020-02-18 Thread Benedict Elliott Smith (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17039092#comment-17039092
 ] 

Benedict Elliott Smith commented on CASSANDRA-15213:


[~jrwest] we discussed (somewhere) maybe reducing the default value to 2, at 
least initially?  I don't mind using 4, but I think doubling the memory we use 
here is probably a good first stab at balancing improvement for larger users 
(who are ordinarily better able to reconfigure) against the cost for users with 
smaller nodes?

Also, the standard commit message format for sign-off:

patch by A, B; reviewed by C, D for CASSANDRA-X

This is particularly helpful if using awk for analysis; since this is the only 
standard we've ever followed almost consistently, it's worth retaining.

Otherwise LGTM




[jira] [Commented] (CASSANDRA-15213) DecayingEstimatedHistogramReservoir Inefficiencies

2020-02-06 Thread Jordan West (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17031972#comment-17031972
 ] 

Jordan West commented on CASSANDRA-15213:
-

Was asked to run tlp-stress against this branch. After several runs I think at 
the cluster level it's pretty much a wash between this branch and trunk (except 
for the memory savings of this branch, of course). Added the output from one of 
the runs if folks want to look closer. It's worth noting this was pretty small / 
limited hardware.




[jira] [Commented] (CASSANDRA-15213) DecayingEstimatedHistogramReservoir Inefficiencies

2020-01-30 Thread Jordan West (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17026770#comment-17026770
 ] 

Jordan West commented on CASSANDRA-15213:
-

Incorporated your change, rebased/squashed, and pushed. Thanks [~benedict].




[jira] [Commented] (CASSANDRA-15213) DecayingEstimatedHistogramReservoir Inefficiencies

2020-01-30 Thread Benedict Elliott Smith (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17026615#comment-17026615
 ] 

Benedict Elliott Smith commented on CASSANDRA-15213:


Awesome.  I think we're ready.  I've pushed one last set of minor suggestions 
that _should_ permit C2 to produce branchless code for the calculation, at the 
expense only of values 0 through 8 being slower to compute (since these are 
likely extremely uncommon, this is probably preferable).  I think it would 
still be possible to reduce the total work by a few instructions with some time 
to think, but probably not worth it.

Entirely up to you if you prefer to use this suggestion, as it's not 
particularly important, and it's very hard to objectively determine the effect 
(since it will depend on branch predictor pollution).  Once you've decided, 
I'll get this committed.
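
For illustration, the kind of cheap bucket-index calculation discussed in this ticket might be sketched as follows. This is a hedged sketch, not the code on the branch: the class and method names are invented, and the approximation strategy (bit length of n plus a linear estimate of the fractional part of log2) is one plausible way to compute floor(log2 n / log2 1.2) without Math.log or the binary search's random memory accesses.

```java
public class BucketIndexApprox
{
    // log2(n) / log2(1.2) == log2(n) * (log 2 / log 1.2); precompute the constant.
    private static final double INV_LOG2_1_2 = Math.log(2.0) / Math.log(1.2);

    // Approximate floor(log2(n) / log2(1.2)) without Math.log or table lookups.
    static int approxIndex(long n)
    {
        if (n <= 1)
            return 0;
        // integer part of log2(n), from the bit length
        int intLog2 = 63 - Long.numberOfLeadingZeros(n);
        // linear estimate of the fractional part of log2(n); off by at most ~0.09
        double frac = (double) (n - (1L << intLog2)) / (1L << intLog2);
        return (int) ((intLog2 + frac) * INV_LOG2_1_2);
    }
}
```

Because the fractional part is estimated linearly, the result can differ from the exact floor by one bucket near bucket boundaries; whether that matters depends on how the histogram's bucket offsets are defined.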




[jira] [Commented] (CASSANDRA-15213) DecayingEstimatedHistogramReservoir Inefficiencies

2020-01-29 Thread Jordan West (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17026466#comment-17026466
 ] 

Jordan West commented on CASSANDRA-15213:
-

Pushed an update to the branch that includes prime distribution




[jira] [Commented] (CASSANDRA-15213) DecayingEstimatedHistogramReservoir Inefficiencies

2020-01-15 Thread Benedict Elliott Smith (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17016077#comment-17016077
 ] 

Benedict Elliott Smith commented on CASSANDRA-15213:


Except if we use the cleaner {{stripedIndex}} calculation, we might need to go 
up to e.g. 8x stripes, in which case we'd need to throw {{23}} into the mix.  
That seems to take us up to 16x, with {{29}} taking us all the way to 64x, which 
is far more stripes than we would ever provision.




[jira] [Commented] (CASSANDRA-15213) DecayingEstimatedHistogramReservoir Inefficiencies

2020-01-15 Thread Jordan West (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17016076#comment-17016076
 ] 

Jordan West commented on CASSANDRA-15213:
-

Sounds good. Thanks for the test / proof. 




[jira] [Commented] (CASSANDRA-15213) DecayingEstimatedHistogramReservoir Inefficiencies

2020-01-15 Thread Benedict Elliott Smith (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17016070#comment-17016070
 ] 

Benedict Elliott Smith commented on CASSANDRA-15213:


Also fwiw, it looks like the primes 17 and 19 are sufficient, so we can 
literally just try either of those. Proof:

{code}
// For every histogram size up to 238, check whether multiplying indices by one
// of the candidate primes produces a conflict-free permutation modulo that size.
int[] primes = new int[] { 17, 19 };
BitSet sizeWithoutConflict = new BitSet();
for (int prime : primes)
{
    for (int size = 1 ; size < 238 ; ++size)
    {
        BitSet conflict = new BitSet();
        boolean hasConflict = false;
        for (int i = 0 ; i < size ; ++i)
        {
            if (conflict.get((i * prime) % size))
                hasConflict = true;
            conflict.set((i * prime) % size);
        }
        if (!hasConflict)
            sizeWithoutConflict.set(size);
    }
}
// Any size printed here has no conflict-free prime among the candidates.
for (int size = 1 ; size < 238 ; ++size)
{
    if (!sizeWithoutConflict.get(size))
        System.out.println(size);
}
{code}




[jira] [Commented] (CASSANDRA-15213) DecayingEstimatedHistogramReservoir Inefficiencies

2020-01-15 Thread Jordan West (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17016055#comment-17016055
 ] 

Jordan West commented on CASSANDRA-15213:
-

Thanks. I'll start exploring that approach. I implemented a version using 
{{Integer.reverse}} (which distributed well) but didn't find an approach using 
it that didn't involve extra reads (was looking for something more along the 
lines of a simple calculation like this). Will report back with my testing 
results / findings. 




[jira] [Commented] (CASSANDRA-15213) DecayingEstimatedHistogramReservoir Inefficiencies

2020-01-14 Thread Benedict Elliott Smith (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17015512#comment-17015512
 ] 

Benedict Elliott Smith commented on CASSANDRA-15213:


fwiw, wrt bucket distribution, we should be able to do something as simple as 
multiplying by the smallest prime larger than two cache-lines (when multiplied 
by 8) that doesn't also divide the number of buckets, i.e. 17 (with 
verification that it doesn't divide a custom number of buckets, choosing 
another prime if it does)

i.e. 
{code}
stripedIndex(index, stripe) = ((index * prime) % (bucketOffsets.length + 1)) + ((bucketOffsets.length + 1) * stripe)
or
stripedIndex(index, stripe) = (((index * nStripes + stripe) * prime) % buckets.length())
{code}
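
As a concrete sketch of the first variant (hypothetical names; {{PRIME}} and the example bucket count below are assumptions, not values taken from the branch), the key property is that as long as the prime does not divide the bucket count, each logical index maps to a distinct physical slot within its stripe:

```java
public class StripedIndex
{
    // Smallest prime whose multiples step past two 64-byte cache lines of
    // longs (17 * 8 = 136 bytes), chosen so it does not divide the bucket count.
    static final int PRIME = 17;

    // First variant above: permute the index within its stripe, then offset
    // by the stripe's base so each stripe owns a contiguous region.
    static int stripedIndex(int index, int stripe, int nBuckets)
    {
        return ((index * PRIME) % nBuckets) + (nBuckets * stripe);
    }
}
```

Because gcd(17, nBuckets) == 1 whenever the bucket count is not divisible by 17, the map index -> (index * 17) % nBuckets is a bijection on [0, nBuckets), so no two indices within a stripe collide.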





[jira] [Commented] (CASSANDRA-15213) DecayingEstimatedHistogramReservoir Inefficiencies

2020-01-13 Thread Jordan West (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17014718#comment-17014718
 ] 

Jordan West commented on CASSANDRA-15213:
-

(y) I'll add the configurable stripe count and look more into bucket 
distribution as well as better benchmarking




[jira] [Commented] (CASSANDRA-15213) DecayingEstimatedHistogramReservoir Inefficiencies

2020-01-13 Thread Benedict Elliott Smith (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17014715#comment-17014715
 ] 

Benedict Elliott Smith commented on CASSANDRA-15213:


> but its encouraging we are seeing improved performance and memory usage with 
> the test we have and this branch

(y)




[jira] [Commented] (CASSANDRA-15213) DecayingEstimatedHistogramReservoir Inefficiencies

2020-01-13 Thread Jordan West (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17014713#comment-17014713
 ] 

Jordan West commented on CASSANDRA-15213:
-

PEBKAC. I was rushing to run the benchmark between meetings and used the wrong 
branch. I am able to reproduce your numbers: 

{code}
 [java] Benchmark                                       Mode  Cnt         Score        Error  Units
 [java] LatencyTrackingBench.benchInsertToDEHR         thrpt    5  39263498.481 ± 1809260.330  ops/s
 [java] LatencyTrackingBench.benchLatencyMetricsWrite  thrpt    5  10473724.356 ±  137670.100  ops/s

{code}







[jira] [Commented] (CASSANDRA-15213) DecayingEstimatedHistogramReservoir Inefficiencies

2020-01-13 Thread Benedict Elliott Smith (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17014704#comment-17014704
 ] 

Benedict Elliott Smith commented on CASSANDRA-15213:


Just to reiterate, though, it's still not a great test, and it might be 
worthwhile improving it first.  Ideally we would have tighter control on the 
amount of competition we're measuring, since that's what we're interested in.  
We probably also do not want to visit the same values in the same order for 
every execution, as this can lead to increased synchrony in execution lowering 
overall throughput artificially.  We also probably _do_ want to use the same 
random sequence on each run, to improve reproducibility.
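The two benchmark properties suggested above — per-run reproducibility and per-thread de-synchronization — can be sketched with {{SplittableRandom}} (an illustration of the idea, not the benchmark's actual harness):

```java
import java.util.SplittableRandom;

public class BenchSeeds
{
    public static void main(String[] args)
    {
        // One fixed seed makes every run reproducible: instances created with
        // the same seed generate identical sequences of values.
        System.out.println(new SplittableRandom(42).nextLong()
                           == new SplittableRandom(42).nextLong()); // true

        // split() derives a statistically independent stream per thread, so
        // threads do not visit the same values in the same order and avoid
        // the lock-step synchrony described above.
        SplittableRandom root = new SplittableRandom(42);
        SplittableRandom threadA = root.split();
        SplittableRandom threadB = root.split();
    }
}
```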








[jira] [Commented] (CASSANDRA-15213) DecayingEstimatedHistogramReservoir Inefficiencies

2020-01-13 Thread Benedict Elliott Smith (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17014686#comment-17014686
 ] 

Benedict Elliott Smith commented on CASSANDRA-15213:


Interesting.  I get:

{code}
addAndGet
# Run progress: 0.00% complete, ETA 00:00:13
# Fork: 1 of 1
# Warmup Iteration   1: 39047588.027 ops/s
# Warmup Iteration   2: 42432555.911 ops/s
# Warmup Iteration   3: 42290231.202 ops/s
Iteration   1: 42252484.876 ops/s
Iteration   2: 41715818.895 ops/s
Iteration   3: 40307394.422 ops/s
Iteration   4: 40383945.073 ops/s
Iteration   5: 38825162.703 ops/s


Result 
"org.apache.cassandra.test.microbench.LatencyTrackingBench.benchInsertToDEHR":
  40696961.194 ±(99.9%) 5170160.207 ops/s [Average]
{code}

{code}
compareAndSet
# Run progress: 0.00% complete, ETA 00:00:13
# Fork: 1 of 1
# Warmup Iteration   1: 18926088.938 ops/s
# Warmup Iteration   2: 19504293.182 ops/s
# Warmup Iteration   3: 19261074.598 ops/s
Iteration   1: 19399456.839 ops/s
Iteration   2: 18979470.788 ops/s
Iteration   3: 18720138.076 ops/s
Iteration   4: 18458839.475 ops/s
Iteration   5: 18776539.883 ops/s


Result 
"org.apache.cassandra.test.microbench.LatencyTrackingBench.benchInsertToDEHR":
  18866889.012 ±(99.9%) 1351167.820 ops/s [Average]
{code}

i.e. twice the throughput with {{addAndGet}}. There is no material difference 
to my version:

{code}
public void updateBucket(AtomicLongArray buckets, int index, long value)
{
    index = stripedIndex(index, (int) Thread.currentThread().getId() & (nStripes - 1));
    buckets.addAndGet(index, value);
}
{code}

It's possible this is a JDK issue: perhaps yours is not replacing 
{{addAndGet}} with the relevant {{lock xadd}} assembly. This seems pretty 
unlikely, since the JIT has performed this substitution for a long time, but 
you could try looking at the assembly that's produced.
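The two update strategies under discussion — {{addAndGet}} versus a {{compareAndSet}} retry loop — can be sketched as follows (illustrative only; the method names are ours, not the patch's):

```java
import java.util.concurrent.atomic.AtomicLongArray;

public class UpdateStrategies
{
    // Always succeeds: on x86 the JIT compiles this to a single `lock xadd`,
    // performed while the cache line is held in an exclusive state.
    static void addViaXadd(AtomicLongArray buckets, int i, long v)
    {
        buckets.addAndGet(i, v);
    }

    // May retry under contention: each failed CAS re-reads and tries again.
    static void addViaCas(AtomicLongArray buckets, int i, long v)
    {
        long cur;
        do
        {
            cur = buckets.get(i);
        }
        while (!buckets.compareAndSet(i, cur, cur + v));
    }

    public static void main(String[] args)
    {
        AtomicLongArray buckets = new AtomicLongArray(4);
        addViaXadd(buckets, 1, 5);
        addViaCas(buckets, 1, 7);
        System.out.println(buckets.get(1)); // 12
    }
}
```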







[jira] [Commented] (CASSANDRA-15213) DecayingEstimatedHistogramReservoir Inefficiencies

2020-01-13 Thread Jordan West (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17014661#comment-17014661
 ] 

Jordan West commented on CASSANDRA-15213:
-

Using {{addAndGet}} I see performance comparable to 3.0 and prior to 
CASSANDRA-14281.

{code}
15213 (4 core, 4 stripes, striping & linear search, addAndGet):

 [java] Benchmark                                       Mode  Cnt        Score       Error  Units
 [java] LatencyTrackingBench.benchInsertToDEHR         thrpt    5  5742550.865 ± 256043.651  ops/s
 [java] LatencyTrackingBench.benchLatencyMetricsWrite  thrpt    5  1979885.731 ± 117381.276  ops/s
{code}

Here is how I implemented {{updateBucket}}, in case I missed something:
{code:java}
public void updateBucket(AtomicLongArray buckets, int index, long value)
{
    int stripe = (int) (Thread.currentThread().getId() & (nStripes - 1));
    buckets.addAndGet(stripedIndex(index, stripe), value);
}
{code}







[jira] [Commented] (CASSANDRA-15213) DecayingEstimatedHistogramReservoir Inefficiencies

2020-01-10 Thread Benedict Elliott Smith (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17013139#comment-17013139
 ] 

Benedict Elliott Smith commented on CASSANDRA-15213:


I'm inclined to agree that a simple approach is best, and simply offering a 
tunable parameter to modify the number of stripes on startup might be 
sufficient.  Here's some further minor feedback/suggestions:

# It's faster to use {{addAndGet}} than {{compareAndSet}}, because it's 
guaranteed to succeed: the increment occurs whilst holding the cache line in 
an exclusive state, so it will always execute as quickly as the first failed 
{{compareAndSet}}.
# Combined with this, it probably makes most sense to pick a "random" bucket 
to always increment, using e.g. {{Thread.currentThread().getId() & (stripes-1)}}.
# Finally, it might make sense to "shuffle" the ordinary buckets as well, so 
that logically adjacent buckets are not spatially adjacent; since a histogram 
is _likely_ to receive updates mostly oscillating around a given median 
point, this means there won't be any false-sharing competition between 
updates to two logically adjacent buckets.

I think at this point it then makes most sense to improve the benchmarks, in 
one case to account for memory latency of lookup, and in another to produce 
more realistic distributions of values.
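The "shuffle" idea in point 3 could look something like the following. The odd-stride permutation is our own illustration, under the assumption that the backing array size is a power of two:

```java
public class ShuffledLayout
{
    // Walk the array with an odd stride larger than one cache line of longs
    // (8 longs x 8 bytes = 64 bytes), so logically adjacent buckets never
    // share a line. An odd stride over a power-of-two size is a permutation,
    // so every logical bucket still maps to a distinct slot.
    static int shuffledIndex(int logicalIndex, int size)
    {
        int stride = 9;
        return (logicalIndex * stride) & (size - 1);
    }

    public static void main(String[] args)
    {
        // Adjacent logical buckets land 9 slots (72 bytes) apart.
        System.out.println(shuffledIndex(0, 64) + " " + shuffledIndex(1, 64)); // 0 9
    }
}
```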








[jira] [Commented] (CASSANDRA-15213) DecayingEstimatedHistogramReservoir Inefficiencies

2020-01-09 Thread Jordan West (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17012511#comment-17012511
 ] 

Jordan West commented on CASSANDRA-15213:
-

I've updated [https://github.com/jrwest/cassandra/tree/jwest/15213] with the 
changes you suggested. Regarding the extra bucket, I realized that even more 
simply we could return {{bucketOffsets.length}} if the estimate is greater than 
or equal to it – saving us an access in this case altogether. 

Also added a very simple striping approach. The performance is about the same 
(the runs are within each other's margin of error) and still significantly 
better than before CASSANDRA-14281. From the memory-consumption perspective, 
however, 4 stripes ends up allocating more than 12MiB after startup. This 
isn't surprising: with a single stripe (or pre-CASSANDRA-14281) consumption 
after startup was a little more than 3MiB, and we're quadrupling the size of 
each array. However, these won't grow under contention like {{LongAdder[]}} 
would: memory consumption is proportional to the number of histograms 
created, rather than to the number created and the contention on them, as 
before. I've found comparable performance using 2 stripes on my 4-core 
machine. I have an 8-core machine available that I can test with tomorrow. If 
the memory consumption is still a concern I can investigate on-demand 
striping, but I prefer the simpler approach and think the trade-off of 2-4 
stripes is reasonable. 
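Assuming a layout where stripe {{s}} of logical bucket {{i}} lives at {{s * nBuckets + i}} (hypothetical; the branch's actual {{stripedIndex}} may differ), the read path has to merge the stripes back into one logical count:

```java
import java.util.concurrent.atomic.AtomicLongArray;

public class StripeMerge
{
    // Sum the per-stripe counters for one logical bucket. Writes spread
    // across stripes to reduce contention; reads pay a small cost to merge.
    static long bucketCount(AtomicLongArray buckets, int index, int nStripes, int nBuckets)
    {
        long sum = 0;
        for (int s = 0; s < nStripes; s++)
            sum += buckets.get(s * nBuckets + index);
        return sum;
    }

    public static void main(String[] args)
    {
        AtomicLongArray buckets = new AtomicLongArray(2 * 3); // 2 stripes x 3 buckets
        buckets.set(0 * 3 + 1, 4); // stripe 0, bucket 1
        buckets.set(1 * 3 + 1, 6); // stripe 1, bucket 1
        System.out.println(bucketCount(buckets, 1, 2, 3)); // 10
    }
}
```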

 
{code:java}
3.0 (4 core):
     [java] Benchmark                                       Mode  Cnt        Score        Error  Units
     [java] LatencyTrackingBench.benchInsertToDEHR         thrpt    5  5848504.963 ± 147881.511  ops/s
     [java] LatencyTrackingBench.benchLatencyMetricsWrite  thrpt    5  1946154.537 ± 623185.290  ops/s

Trunk (4 core):
     [java] Benchmark                                       Mode  Cnt         Score         Error  Units
     [java] LatencyTrackingBench.benchInsertToDEHR         thrpt    5  21887453.678 ± 2108481.285  ops/s
     [java] LatencyTrackingBench.benchLatencyMetricsWrite  thrpt    5   8908817.316 ±  115453.642  ops/s

15213 (4 core, linear search changes only):
     [java] Benchmark                                       Mode  Cnt         Score         Error  Units
     [java] LatencyTrackingBench.benchInsertToDEHR         thrpt    5  24646022.304 ± 1052105.818  ops/s
     [java] LatencyTrackingBench.benchLatencyMetricsWrite  thrpt    5   9175928.594 ±  269984.204  ops/s

15213 (4 core, 4 stripes, striping and linear search):
     [java] Benchmark                                       Mode  Cnt         Score        Error  Units
     [java] LatencyTrackingBench.benchInsertToDEHR         thrpt    5  18818181.576 ± 506997.366  ops/s
     [java] LatencyTrackingBench.benchLatencyMetricsWrite  thrpt    5   8895569.814 ± 154219.113  ops/s
{code}
 







[jira] [Commented] (CASSANDRA-15213) DecayingEstimatedHistogramReservoir Inefficiencies

2020-01-09 Thread Benedict Elliott Smith (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17012390#comment-17012390
 ] 

Benedict Elliott Smith commented on CASSANDRA-15213:


Looking good.  A couple of minor suggestions:

# The {{value <= 0}} check can be moved inside of the {{value <= 8}} check, 
since both will rarely be performed
# Can {{(int) value - (bucketOffsets[0] == 0 ? 0 : 1)}} be simplified to 
{{(int) value - bucketOffsets[0]}} ?
# {{value > bucketOffsets[bucketOffsets.length - 1]}} could be replaced by 
(e.g.) {{firstCandidate = min(firstCandidate,bucketOffsets.length-2)}}, which 
only requires touching the already-used header portion of the array, avoiding 
one potential cache-miss (though not one measured on the existing benchmarks I 
expect).

As it stands, I think in the normal course of execution it's already reasonable 
to expect no mispredicted branches in the implementation you've put up, and 
these tweaks should improve the situation a little.  I think it's fairly likely 
the final pair of return statements will be implemented as a {{cmov}}, but it 
would be interesting to take a look at the final assembly.  When we have a 
final product maybe we can take a look.
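Putting the cheap approximation, the clamping suggestion from point 3, and a short linear walk together might look like this. It is our own sketch under the assumption of offsets growing by roughly 1.2x; {{findIndex}} is not the branch's code:

```java
public class FastBucketSearch
{
    // Smallest i with bucketOffsets[i] >= value, or the overflow bucket.
    // Start from the cheap log-based estimate, clamp it so we never touch
    // the cold tail of the array, then walk the few remaining steps.
    static int findIndex(long[] bucketOffsets, long value)
    {
        if (value <= bucketOffsets[0])
            return 0;
        int log2 = 63 - Long.numberOfLeadingZeros(value);            // floor(log2 value)
        int i = Math.min((int) (log2 * 3.8), bucketOffsets.length - 2); // clamp the candidate
        while (i > 0 && bucketOffsets[i - 1] >= value)
            i--;                                                     // estimate overshot: walk back
        while (i < bucketOffsets.length - 1 && bucketOffsets[i] < value)
            i++;                                                     // estimate undershot: walk forward
        return i;
    }

    public static void main(String[] args)
    {
        long[] offsets = { 1, 2, 3, 4, 5, 6, 8, 10, 12, 15, 18, 22 };
        System.out.println(findIndex(offsets, 9)); // 7: offsets[7] == 10 is the first >= 9
    }
}
```

The back/forward walk makes the result exact regardless of how far the estimate is off; per the discussion above, it is usually within a handful of indexes, so only a few comparisons are needed.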







[jira] [Commented] (CASSANDRA-15213) DecayingEstimatedHistogramReservoir Inefficiencies

2020-01-09 Thread Jordan West (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17012330#comment-17012330
 ] 

Jordan West commented on CASSANDRA-15213:
-

My working branch is here: 
[https://github.com/jrwest/cassandra/tree/jwest/15213]

I've implemented the binary search replacement. I didn't make an attempt at 
going branchless, but even as implemented it's faster than the existing 
implementation. I'm working on the striping / replacement of {{LongAdder[]}} 
and will share when ready. 

 
{code:java}
Trunk:
 [java] Benchmark                                       Mode  Cnt         Score         Error  Units
 [java] LatencyTrackingBench.benchInsertToDEHR         thrpt    5  21887453.678 ± 2108481.285  ops/s
 [java] LatencyTrackingBench.benchLatencyMetricsWrite  thrpt    5   8908817.316 ±  115453.642  ops/s

15213 (linear search changes only):
 [java] Benchmark                                       Mode  Cnt         Score         Error  Units
 [java] LatencyTrackingBench.benchInsertToDEHR         thrpt    5  24646022.304 ± 1052105.818  ops/s
 [java] LatencyTrackingBench.benchLatencyMetricsWrite  thrpt    5   9175928.594 ±  269984.204  ops/s
{code}







[jira] [Commented] (CASSANDRA-15213) DecayingEstimatedHistogramReservoir Inefficiencies

2020-01-09 Thread Benedict Elliott Smith (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17012063#comment-17012063
 ] 

Benedict Elliott Smith commented on CASSANDRA-15213:


It might be that the difference is whether or not we're including a zero 
bucket: the fixed offset we use probably needs to be incremented by one in 
that case, since there's no logarithm of zero.







[jira] [Commented] (CASSANDRA-15213) DecayingEstimatedHistogramReservoir Inefficiencies

2020-01-09 Thread Jordan West (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17011952#comment-17011952
 ] 

Jordan West commented on CASSANDRA-15213:
-

Ok, I figured out what was going on. My linked test was generating only large 
values, and the approximation for the last bucket alone becomes very 
inaccurate as the value grows, though this is easy to account for. With that 
fixed, the test confirms your findings (sometimes I see a distance of 4 
indexes, but it's still a huge improvement):

Sample from 10k test run:
{code:java}
model = 80, estimate = 83
model = 83, estimate = 86
model = 81, estimate = 84
model = 82, estimate = 85
model = 69, estimate = 72
model = 79, estimate = 82
model = 81, estimate = 84
model = 79, estimate = 82
model = 75, estimate = 79
model = 83, estimate = 86
model = 77, estimate = 81
model = 72, estimate = 75
model = 84, estimate = 88
model = 61, estimate = 64
{code}







[jira] [Commented] (CASSANDRA-15213) DecayingEstimatedHistogramReservoir Inefficiencies

2020-01-09 Thread Benedict Elliott Smith (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17011774#comment-17011774
 ] 

Benedict Elliott Smith commented on CASSANDRA-15213:


bq. It looks to me like the approximation is always within precisely 2 or 3 of 
the offset

FWIW, this appears to hold just fine up to {{long}} overflow







[jira] [Commented] (CASSANDRA-15213) DecayingEstimatedHistogramReservoir Inefficiencies

2020-01-09 Thread Benedict Elliott Smith (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17011744#comment-17011744
 ] 

Benedict Elliott Smith commented on CASSANDRA-15213:


{quote}Fwiw, the proposed fast log code does improve performance of my local 
modifications when benchmarked but basically brings it back inline w/ the 
existing implementation (caveat: benchmarking issues already mentioned).
{quote}
Achieving equivalent performance in this synthetic benchmark is pretty much all 
we need, I think. This should predict a strict performance win once memory 
latency is taken into account, as our largest such array is ~1.2KiB, with 150 
elements, leading to a predicted 7 memory stalls for binary search - although 
execution will proceed speculatively and may mask one or so of those stalls. If 
the approximation can achieve an optimal 1 memory stall in the worst case, with 
similar performance when the entire structure is in cache, it’s a win in my 
book.
{quote}would result in several cases where a linear search would lead to over 
10-50 accesses
{quote}
Interesting, I’m not sure what’s happening in that case. For comparison, I ran 
the code below
{code:java}
long[] offsets = newOffsets(160, true);
for (int i = 0 ; i < offsets.length ; ++i)
{
    long start = i == 0 ? 0 : offsets[i - 1] + 1;
    long end = offsets[i];
    System.out.printf("%d %d %d %d %.1f %.1f %.1f %.1f\n", i, end,
                      (int) fastLog12(start) - i, (int) fastLog12(end) - i,
                      fastLog12(start), fastLog12(end),
                      slowLog12(start), slowLog12(end));
}
{code}
Which yields the below results, with columns:
 # The index in the offsets array
 # The value associated with the bucket
 # The integer difference between fastLog(lowest) and this offset, where lowest 
is the smallest value we want to bucket at this offset
 # The integer difference between fastLog(highest) and this offset
 # fastLog(lowest)
 # fastLog(highest)
 # slowLog(lowest)
 # slowLog(highest)

It looks to me like the approximation is always within precisely 2 or 3 of the 
offset, but _ahead_ by 2 or 3, because the array doesn’t follow a strict 1.2 
logarithm given the first few buckets must increment by whole integers. So for 
values > 2, we can subtract 2 from the result and walk backwards. With this 
approach we’d only ever have to consult between 1 and 2 buckets, both adjacent, 
incurring only a single memory stall. If we wanted to be really clever this 
could be implemented branchlessly, so that this memory latency wouldn’t pause 
further execution, or invalidate the pipeline, only consume some reorder buffer 
slots and register file entries.
{code:java}
0 0 0 0 0.0 0.0 -Infinity -Infinity
1 1 -1 -1 0.0 0.0 0.0 0.0
2 2 1 1 3.8 3.8 3.8 3.8
3 3 3 3 6.0 6.0 6.0 6.0
4 4 3 3 7.6 7.6 7.6 7.6
5 5 3 3 8.8 8.8 8.8 8.8
6 6 3 3 9.8 9.8 9.8 9.8
7 7 3 3 10.7 10.7 10.7 10.7
8 8 3 3 11.4 11.4 11.4 11.4
9 10 3 3 12.1 12.6 12.1 12.6
10 12 3 3 13.2 13.6 13.2 13.6
11 14 3 3 14.1 14.5 14.1 14.5
12 17 2 3 14.9 15.5 14.9 15.5
13 20 2 3 15.9 16.4 15.9 16.4
14 24 2 3 16.7 17.4 16.7 17.4
15 29 2 3 17.7 18.5 17.7 18.5
16 35 2 3 18.7 19.3 18.7 19.5
17 42 2 3 19.7 20.5 19.7 20.5
18 50 2 3 20.5 21.5 20.6 21.5
19 60 2 3 21.5 22.5 21.6 22.5
20 72 2 3 22.5 23.5 22.5 23.5
21 86 2 3 23.5 24.3 23.5 24.4
22 103 2 3 24.3 25.3 24.5 25.4
23 124 2 3 25.5 26.4 25.5 26.4
24 149 2 3 26.4 27.3 26.5 27.4
25 179 2 3 27.3 28.4 27.5 28.5
26 215 2 3 28.4 29.3 28.5 29.5
27 258 2 3 29.5 30.4 29.5 30.5
28 310 2 3 30.4 31.4 30.5 31.5
29 372 2 3 31.4 32.4 31.5 32.5
30 446 2 3 32.4 33.3 32.5 33.5
31 535 2 3 33.3 34.2 33.5 34.5
32 642 2 3 34.2 35.4 34.5 35.5
33 770 2 3 35.4 36.4 35.5 36.5
34 924 2 3 36.4 37.3 36.5 37.5
35 1109 2 3 37.3 38.4 37.5 38.5
36 1331 2 3 38.4 39.2 38.5 39.5
37 1597 2 3 39.2 40.2 39.5 40.5
38 1916 2 3 40.2 41.3 40.5 41.5
39 2299 2 3 41.3 42.2 41.5 42.5
40 2759 2 3 42.2 43.3 42.5 43.5
41 3311 2 3 43.3 44.3 43.5 44.5
42 3973 2 3 44.3 45.4 44.5 45.5
43 4768 2 3 45.4 46.3 45.5 46.5
44 5722 2 3 46.3 47.4 46.5 47.5
45 6866 2 3 47.4 48.3 47.5 48.5
46 8239 2 3 48.3 49.4 48.5 49.5
47 9887 2 3 49.4 50.4 49.5 50.5
48 11864 2 3 50.4 51.4 50.5 51.5
49 14237 2 3 51.4 52.3 51.5 52.5
50 17084 2 3 52.3 53.2 52.5 53.5
51 20501 2 3 53.2 54.4 53.5 54.5
52 24601 2 3 54.4 55.4 54.5 55.5
53 29521 2 3 55.4 56.3 55.5 56.5
54 35425 2 3 56.3 57.4 56.5 57.5
55 42510 2 3 57.4 58.3 57.5 58.5
56 51012 2 3 58.3 59.3 58.5 59.5
57 61214 2 3 59.3 60.3 59.5 60.5
58 73457 2 3 60.3 61.2 60.5 61.5
59 88148 2 3 61.2 62.3 61.5 62.5
60 105778 2 3 62.3 63.3 62.5 63.5
61 126934 2 3 63.3 64.3 63.5 64.5
62 152321 2 3 64.3 65.3 64.5 65.5
63 182785 2 3 65.3 66.4 65.5 66.5
64 219342 2 3 66.4 67.3 66.5 67.5
65 263210 2 3 67.3 68.4 67.5 68.5
66 315852 2 3 68.4 69.4 68.5 69.5
67 379022 2 3 69.4 70.4 69.5 70.5
68 454826 2 3 70.4 71.3 70.5 71.5
69 545791 2 3 71.3 72.2 71.5 72.5
70 654949 2 3 72.2 73.2 72.5 73.5
71 785939 2 3 73.2 74.2 73.5 74.5
72 943127 2 3 74.2 75.3 74.5 75.5
73 1131752 2 3
{code}
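The lookup described above — take the fast-log guess, subtract 2, and consult at most the adjacent buckets — can be sketched as follows. This is a hedged illustration, not the patch; `guess` stands in for `(int) fastLog12(value) - 2` and `offsets` for the reservoir's offsets array:

```java
// Hedged sketch of the lookup described above: start from the fast-log
// guess (which runs ~2-3 ahead of the true index for values > 2, hence
// the subtraction of 2 before calling this) and probe adjacent buckets.
public class GuessedLookup {
    // returns the smallest i with offsets[i] >= value
    static int findIndex(long[] offsets, long value, int guess) {
        int i = Math.min(Math.max(guess, 0), offsets.length - 1);
        while (i > 0 && offsets[i - 1] >= value)
            --i;                    // guess overshot: step back
        while (i < offsets.length - 1 && offsets[i] < value)
            ++i;                    // guess undershot: step forward
        return i;
    }
}
```

With a guess that is reliably within 1-2 of the true index, each loop executes at most once or twice, touching only adjacent array slots and so incurring a single memory stall in the common case.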

[jira] [Commented] (CASSANDRA-15213) DecayingEstimatedHistogramReservoir Inefficiencies

2020-01-08 Thread Jordan West (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17011493#comment-17011493
 ] 

Jordan West commented on CASSANDRA-15213:
-

Thanks for the clarification. I will take a stab at something to replace the 
usage of {{LongAdder[]}} along the lines of what you suggested and share when 
ready. I agree re: benchmarking and will try to come up with a more complete 
set -- clearly we used the existing benchmarks before without finding this 
issue. 

Regarding the binary search, I am still a bit confused. If you run: 
https://github.com/jrwest/cassandra/blob/jwest/15213/test/unit/org/apache/cassandra/metrics/DecayingEstimatedHistogramReservoirTest.java#L58
 you will see that the proposed estimation (using the code provided) would 
result in several cases where a linear search would lead to 10-50 accesses (the 
approximation gets less accurate as the value grows). 

Fwiw, the proposed fast log code does improve performance of my local 
modifications when benchmarked but basically brings it back inline w/ the 
existing implementation (caveat: benchmarking issues already mentioned). 







[jira] [Commented] (CASSANDRA-15213) DecayingEstimatedHistogramReservoir Inefficiencies

2020-01-08 Thread Benedict Elliott Smith (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17011029#comment-17011029
 ] 

Benedict Elliott Smith commented on CASSANDRA-15213:


FWIW, I think when benchmarking something like this you need to create a few 
hundred MiB worth of backing arrays, and cycle through them for each test. There 
are different ways to achieve this, but you ideally want tests that include 
memory latency, and cycling through a large working set is a simple way to get it.
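A minimal sketch of that setup (sizes here are illustrative assumptions, and a real harness would use JMH rather than a hand-rolled timing loop):

```java
// Sketch of the suggested setup: allocate many copies of the reservoir's
// backing array so the working set far exceeds the last-level cache, and
// cycle through them so each update pays realistic memory latency.
import java.util.concurrent.ThreadLocalRandom;

public class ColdUpdateBench {
    static long run(int copies, int buckets, int iterations) {
        long[][] reservoirs = new long[copies][buckets];
        long sink = 0;
        for (int i = 0; i < iterations; i++) {
            long[] r = reservoirs[i & (copies - 1)];      // cycle copies (copies is a power of 2)
            int bucket = ThreadLocalRandom.current().nextInt(buckets);
            sink += ++r[bucket];                          // simulated histogram update
        }
        return sink;                                      // returned so the JIT cannot elide the loop
    }

    public static void main(String[] args) {
        int copies = 1 << 15;   // ~43 MiB at 165 buckets; scale up toward a few hundred MiB
        long start = System.nanoTime();
        long sink = run(copies, 165, 10_000_000);
        System.out.println("elapsed ms: " + (System.nanoTime() - start) / 1_000_000
                           + " (sink=" + sink + ")");
    }
}
```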







[jira] [Commented] (CASSANDRA-15213) DecayingEstimatedHistogramReservoir Inefficiencies

2020-01-08 Thread Benedict Elliott Smith (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17010990#comment-17010990
 ] 

Benedict Elliott Smith commented on CASSANDRA-15213:


bq. Or are you suggesting increasing the bucket resolution
bq. Do you mean having multiple buckets per offset? 

Right.  I meant one option was to essentially inline the striping by 
introducing, e.g., 4x as many buckets as requested.  Each logical bucket can be 
represented by the sum of that many buckets.  The buckets would be offset from 
each other by at least two cache lines, and all buckets would ideally be more 
uniformly distributed in memory (by e.g. {{Integer.reverse(idx)}}), so that 
adjacent buckets that are most likely to be used together don't also compete 
for cache locks.
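The inlined striping described above can be sketched roughly as below. This is a hedged illustration, not the patch: the stripe count, the per-thread stripe choice, and the flat `stripe * logicalSize` layout (rather than the `Integer.reverse`-style scattering suggested) are all simplifying assumptions:

```java
// Hedged sketch: allocate STRIPES x as many slots as logical buckets,
// have each thread pick a stripe, and read a logical bucket as the sum
// of its stripes. With logicalSize >= 16, stripes of the same bucket sit
// at least two cache lines apart, reducing cache-lock competition.
import java.util.concurrent.atomic.AtomicLongArray;

public class StripedBuckets {
    static final int STRIPES = 4;
    final AtomicLongArray slots;
    final int logicalSize;

    StripedBuckets(int logicalSize) {
        this.logicalSize = logicalSize;
        this.slots = new AtomicLongArray(logicalSize * STRIPES);
    }

    void increment(int bucket) {
        // cheap per-thread stripe choice; real code might detect contention instead
        int stripe = (int) (Thread.currentThread().getId() & (STRIPES - 1));
        slots.incrementAndGet(stripe * logicalSize + bucket);
    }

    long sum(int bucket) {
        long total = 0;
        for (int s = 0; s < STRIPES; s++)
            total += slots.get(s * logicalSize + bucket);
        return total;
    }
}
```

The read side pays STRIPES loads per bucket, which is acceptable for metrics snapshots that are taken far less often than updates.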

bq. Additionally, re: your LongAdder reference, are you proposing creating new 
buckets for the same offset under contention (like LongAdder does w/ Cell)?

I meant an alternative approach might be to batch a range of buckets together - 
say 16 or more, to amortise the object overhead - and to perform 
{{LongAdder}}’s inflation of the number of buckets on the detection of 
competition, so that we can gradually increase the number of cells needed.  
This might have the added benefit of not wasting space for striping of buckets 
that are rarely used, but at the cost of greater complexity, object overheads 
and indirection.

bq. Regarding the binary search suggestion, are you suggesting using the 
approximation to replace the maximum index

I meant to use it as the floor for a linear search for the correct position, so 
that it would ordinarily be 1 or 2 comparisons, but typically only one cache 
miss.

Specifically, I meant to use an integer approximation, along the lines of...

{code}
private static final int TABLE_BITS = 4;
private static final int TABLE_MASK = -1 >>> (32 - TABLE_BITS);
private static final float[] LOG2_TABLE = computeTable(TABLE_BITS);
private static final float log2_12 = (float) slowLog2(1.2d);
private static final float log2_12_recp = (float) (1d / slowLog2(1.2d));

private static float[] computeTable(int bits)
{
float[] table = new float[1 << bits];
for (int i = 1 ; i < 1< DecayingEstimatedHistogramReservoir Inefficiencies






[jira] [Commented] (CASSANDRA-15213) DecayingEstimatedHistogramReservoir Inefficiencies

2020-01-08 Thread Jordan West (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17010956#comment-17010956
 ] 

Jordan West commented on CASSANDRA-15213:
-

I’ve verified the reported {{LongAdder}} memory consumption. I’m seeing 13.5MiB 
right after startup.

A few questions about your proposed solutions:
 * By the backing array do you mean the original AtomicLongArray? Two issues I 
see with increasing the number of buckets: the maximum we can create is 237 
(beyond that the offset overflows), and increasing the number of buckets doesn’t 
necessarily reduce contention. This is because the resolution of the buckets 
doesn’t change except for the last bucket (which for microseconds is over 200 
days and for nanoseconds is over 5 hours). Or are you suggesting increasing the 
bucket resolution by decreasing the factor in {{newOffsets}} when adding more 
buckets?
 * I’m not sure what you mean by “providing multiple buckets” but I suspect it 
might be what I am missing to better understand the proposal. Do you mean 
having multiple buckets per offset? Additionally, re: your {{LongAdder}} 
reference, are you proposing creating new buckets for the same offset under 
contention (like {{LongAdder}} does w/ {{Cell}})?
 * Regarding the binary search suggestion, are you suggesting using the 
approximation to replace the maximum index (e.g. binarySearch(offsets, 0, 
Math.floor(….) + 1, value))? Or to outright use that approximation? I wasn’t 
sure of your exact proposal but it does seem like using the equation you gave 
as an upper bound estimate and {{Math.floor(Math.log(Math.floor(n / 1.2)) / 
Math.log(1.2))}} as a lower bound estimate can reduce the space of the array we 
have to search by 25-60% (I saw up to 75% in one case with DEFAULT_BUCKET_COUNT 
but haven’t reproduced that high of a reduction since). I’ve added a 
QuickTheories test to show this here: 
[https://github.com/jrwest/cassandra/commit/a4795bc651cd0e0bd582a8df2414e312782b5020].
 * We could also circumvent the search entirely in the lowest buckets by using 
{{if value < bucketCount && bucketOffsets[value - 1] == value}}, although given 
our use case I’m not sure how often we will see sub-165ns operations.
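The bounded search in the third bullet can be sketched as below. This is an illustration of the idea, not the linked commit: the bound formulas follow the ones quoted above, and a cheap local fix-up is added because the estimates are heuristic:

```java
// Hedged sketch: narrow the binary search window with log-based
// lower/upper bound estimates, then fix up locally in case the
// estimated window missed the true index.
import java.util.Arrays;

public class BoundedBucketSearch {
    // returns the smallest index i with offsets[i] >= value
    // (offsets.length if the value exceeds every offset)
    static int findBucket(long[] offsets, long value) {
        double log12 = Math.log(1.2);
        int hi = value <= 1 ? 1 : (int) (Math.log(value) / log12) + 2;  // upper-bound estimate
        int lo = value <= 2 ? 0 : (int) (Math.log(value / 1.2) / log12); // lower-bound estimate
        hi = Math.min(hi, offsets.length);
        lo = Math.max(Math.min(lo, hi - 1), 0);
        int idx = Arrays.binarySearch(offsets, lo, hi, value);
        idx = idx < 0 ? -idx - 1 : idx;
        // cheap local fix-up: the bounds are heuristic, so verify neighbours
        while (idx > 0 && offsets[idx - 1] >= value) --idx;
        while (idx < offsets.length && offsets[idx] < value) ++idx;
        return idx;
    }
}
```

Searching only the estimated window cuts the comparison count roughly in line with the 25-60% reduction measured above, while the fix-up loops keep the result correct even when the window is off by a bucket or two.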



