[jira] [Commented] (CASSANDRA-12153) RestrictionSet.hasIN() is slow
[ https://issues.apache.org/jira/browse/CASSANDRA-12153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15375697#comment-15375697 ] Tyler Hobbs commented on CASSANDRA-12153: - +1 > RestrictionSet.hasIN() is slow > -- > > Key: CASSANDRA-12153 > URL: https://issues.apache.org/jira/browse/CASSANDRA-12153 > Project: Cassandra > Issue Type: Improvement > Components: Coordination >Reporter: Tyler Hobbs >Assignee: Tyler Hobbs >Priority: Minor > Fix For: 3.x > > > While profiling local in-memory reads for CASSANDRA-10993, I noticed that > {{RestrictionSet.hasIN()}} was responsible for about 1% of the time. It > looks like it's mostly slow because it creates a new LinkedHashSet (which is > expensive to init) and uses streams. This can be replaced with a simple for > loop. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-12153) RestrictionSet.hasIN() is slow
[ https://issues.apache.org/jira/browse/CASSANDRA-12153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15375573#comment-15375573 ] Benjamin Lerer commented on CASSANDRA-12153: CI results: ||[utest|http://cassci.datastax.com/view/Dev/view/blerer/job/blerer-12153-trunk-testall/]||[dtests|http://cassci.datastax.com/view/Dev/view/blerer/job/blerer-12153-trunk-dtest/]|| > RestrictionSet.hasIN() is slow > -- > > Key: CASSANDRA-12153 > URL: https://issues.apache.org/jira/browse/CASSANDRA-12153 > Project: Cassandra > Issue Type: Improvement > Components: Coordination >Reporter: Tyler Hobbs >Assignee: Tyler Hobbs >Priority: Minor > Fix For: 3.x > > > While profiling local in-memory reads for CASSANDRA-10993, I noticed that > {{RestrictionSet.hasIN()}} was responsible for about 1% of the time. It > looks like it's mostly slow because it creates a new LinkedHashSet (which is > expensive to init) and uses streams. This can be replaced with a simple for > loop. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-12153) RestrictionSet.hasIN() is slow
[ https://issues.apache.org/jira/browse/CASSANDRA-12153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15373170#comment-15373170 ] Tyler Hobbs commented on CASSANDRA-12153: - Nice! I'm +1 on your code changes. Can you start a CI test run? bq. PS: To be honest, I was not expecting that the use of new LinkHashSet, stream() and lambdas was so bad from a performance point of view. I was surprised by how much more expensive it was as well. > RestrictionSet.hasIN() is slow > -- > > Key: CASSANDRA-12153 > URL: https://issues.apache.org/jira/browse/CASSANDRA-12153 > Project: Cassandra > Issue Type: Improvement > Components: Coordination >Reporter: Tyler Hobbs >Assignee: Tyler Hobbs >Priority: Minor > Fix For: 3.x > > > While profiling local in-memory reads for CASSANDRA-10993, I noticed that > {{RestrictionSet.hasIN()}} was responsible for about 1% of the time. It > looks like it's mostly slow because it creates a new LinkedHashSet (which is > expensive to init) and uses streams. This can be replaced with a simple for > loop. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-12153) RestrictionSet.hasIN() is slow
[ https://issues.apache.org/jira/browse/CASSANDRA-12153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15370492#comment-15370492 ] Benjamin Lerer commented on CASSANDRA-12153: I profiled trunk with Mission Control with the {{-XX:+UnlockDiagnosticVMOptions}} and {{-XX:+DebugNonSafepoints}} options. To be sure that the problem was not hidden by the results processing part of the code, I used a program that fired queries on an empty table. On my initial run {{RestrictionSet.hasIn()}} was shown second on the hot methods list with 3.14% of the time in the call tree. Looking at the other hot methods, I found out that {{HashMap.putVal}}, {{HashMap.resize()}}, and {{HashSet.init()}} were pretty high on the list. In the allocation tab: {{HashMap$Node[]}}, {{LinkedHashMap}} and {{LinkedHashMap$Entry}} were representing a total pressure of 11,81%. All those were due to the {{LinkedHashSet}} creation in {{RestrictionSet.stream()}} and {{RestrictionSet.iterator()}} methods. Removing {{RestrictionSet.stream()}} and the use of lambdas by simple loops using {{RestrictionSet.iterator()}} reduced {{RestrictionSet.hasIn()}} to 0.61% in call tree. Replacing new {{LinkedHashSet()}} by a custom filtering iterator brought down {{RestrictionSet.hasIn()}} to 0.10% and removed all the allocations caused by the use of {{LinkedHashSet}}. As we can only have duplicates when multiple column restrictions are used, I added a variable to keep track of it and only filter duplicates when needed. That has a strong effect everywhere where {{RestrictionSet.iterator()}} was used and changed the complete picture. {{RestrictionSet.hasIn()}} came back up to 0.51% just because other part of the code were faster :-). I have pushed my experimental branch [here|https://github.com/apache/cassandra/compare/trunk...blerer:12153-trunk]. Waiting for your feedbacks. PS: To be honest, I was not expecting that the use of new {{LinkHashSet}}, {{stream()}} and lambdas was so bad from a performance point of view. > RestrictionSet.hasIN() is slow > -- > > Key: CASSANDRA-12153 > URL: https://issues.apache.org/jira/browse/CASSANDRA-12153 > Project: Cassandra > Issue Type: Improvement > Components: Coordination >Reporter: Tyler Hobbs >Assignee: Tyler Hobbs >Priority: Minor > Fix For: 3.x > > > While profiling local in-memory reads for CASSANDRA-10993, I noticed that > {{RestrictionSet.hasIN()}} was responsible for about 1% of the time. It > looks like it's mostly slow because it creates a new LinkedHashSet (which is > expensive to init) and uses streams. This can be replaced with a simple for > loop. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-12153) RestrictionSet.hasIN() is slow
[ https://issues.apache.org/jira/browse/CASSANDRA-12153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15368576#comment-15368576 ] Tyler Hobbs commented on CASSANDRA-12153: - bq. If we are really looking for speed, I think that we should have some field variables for hasIN, hasEq When profiling with the patch applied, {{hasIN()}} drops to ~0.02% of the time, so I think that's fast enough. bq. Any reason not to do the same with... Seems like we might as well do that too. Those didn't show up for the simple query I was profiling, but they might be in the hot path for some other types of queries. > RestrictionSet.hasIN() is slow > -- > > Key: CASSANDRA-12153 > URL: https://issues.apache.org/jira/browse/CASSANDRA-12153 > Project: Cassandra > Issue Type: Improvement > Components: Coordination >Reporter: Tyler Hobbs >Assignee: Tyler Hobbs >Priority: Minor > Fix For: 3.x > > > While profiling local in-memory reads for CASSANDRA-10993, I noticed that > {{RestrictionSet.hasIN()}} was responsible for about 1% of the time. It > looks like it's mostly slow because it creates a new LinkedHashSet (which is > expensive to init) and uses streams. This can be replaced with a simple for > loop. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-12153) RestrictionSet.hasIN() is slow
[ https://issues.apache.org/jira/browse/CASSANDRA-12153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15368229#comment-15368229 ] Benjamin Lerer commented on CASSANDRA-12153: I am the one to blame for the {{stream()}} method. My main concern, when I created it, was just to simplify the code. If we are really looking for speed, I think that we should have some field variables for {{hasIN}}, {{hasEq}} ... It will move the computation at preparation time rather than at execution time and will perform it only once (if my memory is correct {{hasIN()}} is called multiple times). bq. Then remove RestrictionSet stream() to discourage this from being reintroduced? There is 2 problems associated to the {{stream()}} method. The creation of the {{LinkedHashSet}} which is used to remove the duplicates {{MultiColumnRestrictions}} and the Lambda expressions. The {{{LinkedHashSet}} is unfortunatly also created in {{iterator()}} so removing {{stream()} will not solve that problem. I think, we could keep track of the fact that multicolumn restrictions are used or not and avoid creating the {{LinkedHashSet}} if they are not used. I have no idea of the cost associated to the use of the lambda. > RestrictionSet.hasIN() is slow > -- > > Key: CASSANDRA-12153 > URL: https://issues.apache.org/jira/browse/CASSANDRA-12153 > Project: Cassandra > Issue Type: Improvement > Components: Coordination >Reporter: Tyler Hobbs >Assignee: Tyler Hobbs >Priority: Minor > Fix For: 3.x > > > While profiling local in-memory reads for CASSANDRA-10993, I noticed that > {{RestrictionSet.hasIN()}} was responsible for about 1% of the time. It > looks like it's mostly slow because it creates a new LinkedHashSet (which is > expensive to init) and uses streams. This can be replaced with a simple for > loop. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-12153) RestrictionSet.hasIN() is slow
[ https://issues.apache.org/jira/browse/CASSANDRA-12153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15368069#comment-15368069 ] Jeff Jirsa commented on CASSANDRA-12153: Any reason not to do the same with: https://github.com/thobbs/cassandra/blob/194b56b3c604ed4d6a5a56dcca8c613be13cb8ee/src/java/org/apache/cassandra/cql3/restrictions/ClusteringColumnRestrictions.java#L155 https://github.com/thobbs/cassandra/blob/194b56b3c604ed4d6a5a56dcca8c613be13cb8ee/src/java/org/apache/cassandra/cql3/restrictions/ClusteringColumnRestrictions.java#L166 Then remove {{RestrictionSet stream()}} to discourage this from being reintroduced? > RestrictionSet.hasIN() is slow > -- > > Key: CASSANDRA-12153 > URL: https://issues.apache.org/jira/browse/CASSANDRA-12153 > Project: Cassandra > Issue Type: Improvement > Components: Coordination >Reporter: Tyler Hobbs >Assignee: Tyler Hobbs >Priority: Minor > Fix For: 3.x > > > While profiling local in-memory reads for CASSANDRA-10993, I noticed that > {{RestrictionSet.hasIN()}} was responsible for about 1% of the time. It > looks like it's mostly slow because it creates a new LinkedHashSet (which is > expensive to init) and uses streams. This can be replaced with a simple for > loop. -- This message was sent by Atlassian JIRA (v6.3.4#6332)