[jira] [Commented] (CASSANDRA-12674) [SASI] Confusing AND/OR semantics for StandardAnalyzer
[ https://issues.apache.org/jira/browse/CASSANDRA-12674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16428229#comment-16428229 ]

DOAN DuyHai commented on CASSANDRA-12674:
-----------------------------------------

Looks like this ticket is dead. Because the fix for the bug is far from trivial, I don't see any solution coming soon, or ever.

> [SASI] Confusing AND/OR semantics for StandardAnalyzer
> ------------------------------------------------------
>
>                 Key: CASSANDRA-12674
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-12674
>             Project: Cassandra
>          Issue Type: Bug
>          Components: sasi
>         Environment: Cassandra 3.7
>            Reporter: DOAN DuyHai
>            Assignee: Alex Petrov
>            Priority: Major
>
> {code:sql}
> Connected to Test Cluster at 127.0.0.1:9042.
> [cqlsh 5.0.1 | Cassandra 3.7 | CQL spec 3.4.2 | Native protocol v4]
> Use HELP for help.
> cqlsh> use test;
> cqlsh:test> CREATE TABLE sasi_bug(id int, clustering int, val text, PRIMARY KEY((id), clustering));
> cqlsh:test> CREATE CUSTOM INDEX ON sasi_bug(val) USING 'org.apache.cassandra.index.sasi.SASIIndex' WITH OPTIONS = {
>     'mode': 'CONTAINS',
>     'analyzer_class': 'org.apache.cassandra.index.sasi.analyzer.StandardAnalyzer',
>     'analyzed': 'true'};
> //1st example SAME PARTITION KEY
> cqlsh:test> INSERT INTO sasi_bug(id, clustering , val ) VALUES(1, 1, 'homeworker');
> cqlsh:test> INSERT INTO sasi_bug(id, clustering , val ) VALUES(1, 2, 'hardworker');
> cqlsh:test> SELECT * FROM sasi_bug WHERE val LIKE '%work home%';
>  id | clustering | val
> ----+------------+------------
>   1 |          1 | homeworker
>   1 |          2 | hardworker
> (2 rows)
> //2nd example DIFFERENT PARTITION KEY
> cqlsh:test> INSERT INTO sasi_bug(id, clustering, val) VALUES(10, 1, 'speedrun');
> cqlsh:test> INSERT INTO sasi_bug(id, clustering, val) VALUES(11, 1, 'longrun');
> cqlsh:test> SELECT * FROM sasi_bug WHERE val LIKE '%long run%';
>  id | clustering | val
> ----+------------+---------
>  11 |          1 | longrun
> (1 rows)
> {code}
> In the 1st example, both rows belong to the same partition, so SASI returns both values. Indeed, {{LIKE '%work home%'}} means {{contains 'work' OR 'home'}}, so the result makes sense.
> In the 2nd example, only one row is returned whereas we expect 2 rows, because {{LIKE '%long run%'}} means {{contains 'long' OR 'run'}}, so *speedrun* should be returned too.
> So where is the problem? Explanation:
> When there is only 1 predicate, the root operation type is an *AND*:
> {code:java|title=QueryPlan}
> private Operation analyze()
> {
>     try
>     {
>         Operation.Builder and = new Operation.Builder(OperationType.AND, controller);
>         controller.getExpressions().forEach(and::add);
>         return and.complete();
>     }
>     ...
> }
> {code}
> During the parsing of {{LIKE '%long run%'}}, SASI creates 2 expressions for the searched term, {{long}} and {{run}}, which corresponds to an *OR* logic. However, this piece of code just ruins the *OR* logic:
> {code:java|title=Operation}
> public Operation complete()
> {
>     if (!expressions.isEmpty())
>     {
>         ListMultimap<ColumnDefinition, Expression> analyzedExpressions = analyzeGroup(controller, op, expressions);
>         RangeIterator.Builder<Long, Token> range = controller.getIndexes(op, analyzedExpressions.values());
>         ...
> }
> {code}
> As you can see, we blindly take all the *values* of the MultiMap (which contains a single entry for the {{val}} column with 2 expressions) and pass them to {{controller.getIndexes(...)}}:
> {code:java|title=QueryController}
> public RangeIterator.Builder<Long, Token> getIndexes(OperationType op, Collection<Expression> expressions)
> {
>     if (resources.containsKey(expressions))
>         throw new IllegalArgumentException("Can't process the same expressions multiple times.");
>     RangeIterator.Builder<Long, Token> builder = op == OperationType.OR
>                                                 ? RangeUnionIterator.<Long, Token>builder()
>                                                 : RangeIntersectionIterator.<Long, Token>builder();
>     ...
> }
> {code}
> And because the root operation has the *AND* type, the {{RangeIntersectionIterator}} will be used on both expressions {{long}} and {{run}}.
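The partition-level effect described in the quoted explanation can be reproduced with a toy model. The sketch below is not Cassandra code: the class `SasiAndOrSketch` and its methods `partitionsFor`, `intersect`, `union` and `sampleTable` are hypothetical names made up for illustration, and the "index" is just an in-memory map from partition key to values. It only assumes what the ticket states, namely that each analyzed term resolves to a set of partition keys and that those sets are combined by intersection (AND) or union (OR).

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy model only: none of these names exist in Cassandra.
final class SasiAndOrSketch {

    // Partition keys holding at least one value that contains the term.
    static Set<Integer> partitionsFor(Map<Integer, List<String>> table, String term) {
        Set<Integer> keys = new TreeSet<>();
        table.forEach((key, values) -> {
            if (values.stream().anyMatch(v -> v.contains(term))) {
                keys.add(key);
            }
        });
        return keys;
    }

    // AND semantics: keep only partitions matched by every term.
    static Set<Integer> intersect(Map<Integer, List<String>> table, List<String> terms) {
        Set<Integer> result = null;
        for (String term : terms) {
            Set<Integer> keys = partitionsFor(table, term);
            if (result == null) {
                result = keys;
            } else {
                result.retainAll(keys);
            }
        }
        return result == null ? new TreeSet<>() : result;
    }

    // OR semantics: keep partitions matched by any term.
    static Set<Integer> union(Map<Integer, List<String>> table, List<String> terms) {
        Set<Integer> result = new TreeSet<>();
        for (String term : terms) {
            result.addAll(partitionsFor(table, term));
        }
        return result;
    }

    // The rows inserted in the ticket, grouped by partition key.
    static Map<Integer, List<String>> sampleTable() {
        Map<Integer, List<String>> table = new LinkedHashMap<>();
        table.put(1, Arrays.asList("homeworker", "hardworker"));
        table.put(10, Arrays.asList("speedrun"));
        table.put(11, Arrays.asList("longrun"));
        return table;
    }

    public static void main(String[] args) {
        Map<Integer, List<String>> table = sampleTable();
        List<String> terms = Arrays.asList("long", "run");
        System.out.println("AND: " + intersect(table, terms)); // AND: [11]
        System.out.println("OR:  " + union(table, terms));     // OR:  [10, 11]
    }
}
```

In this model, intersecting the key sets for `long` and `run` drops partition 10 (speedrun), while the union keeps it. For `work` and `home`, both terms resolve to partition 1, so the intersection still returns that partition.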
> So when data belong to different partitions, the *AND* logic applies and eliminates _speedrun_.
> When data belong to the same partition but different rows, the {{RangeIntersectionIterator}} returns a single partition, the rows are then filtered further by {{operationTree.satisfiedBy}}, and the results are correct:
> {code:java|title=QueryPlan}
> while (currentKeys.hasNext())
> {
>     DecoratedKey key = currentKeys.next();
>     if (!keyRange.right.isMinimum() &&
[jira] [Commented] (CASSANDRA-12674) [SASI] Confusing AND/OR semantics for StandardAnalyzer
[ https://issues.apache.org/jira/browse/CASSANDRA-12674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15830351#comment-15830351 ]

Alex Petrov commented on CASSANDRA-12674:
-----------------------------------------

[~xedin] sure! I was working on finalising and testing [CASSANDRA-11990], so I didn't get a chance to continue on this one. Hope to get to it soon.
[jira] [Commented] (CASSANDRA-12674) [SASI] Confusing AND/OR semantics for StandardAnalyzer
[ https://issues.apache.org/jira/browse/CASSANDRA-12674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15829060#comment-15829060 ]

Pavel Yaskevich commented on CASSANDRA-12674:
---------------------------------------------

[~ifesdjeen] Can you please take a look at this one?