DOAN DuyHai created CASSANDRA-12674:
---------------------------------------
             Summary: [SASI] Confusing AND/OR semantics for StandardAnalyzer
                 Key: CASSANDRA-12674
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-12674
             Project: Cassandra
          Issue Type: Bug
          Components: sasi
         Environment: Cassandra 3.7
            Reporter: DOAN DuyHai

{code:sql}
Connected to Test Cluster at 127.0.0.1:9042.
[cqlsh 5.0.1 | Cassandra 3.7 | CQL spec 3.4.2 | Native protocol v4]
Use HELP for help.
cqlsh> use test;
cqlsh:test> CREATE TABLE sasi_bug(id int, clustering int, val text, PRIMARY KEY((id), clustering));
cqlsh:test> CREATE CUSTOM INDEX ON sasi_bug(val) USING 'org.apache.cassandra.index.sasi.SASIIndex' WITH OPTIONS = {
    'mode': 'CONTAINS',
    'analyzer_class': 'org.apache.cassandra.index.sasi.analyzer.StandardAnalyzer',
    'analyzed': 'true'};

//1st example SAME PARTITION KEY
cqlsh:test> INSERT INTO sasi_bug(id, clustering , val ) VALUES(1, 1, 'homeworker');
cqlsh:test> INSERT INTO sasi_bug(id, clustering , val ) VALUES(1, 2, 'hardworker');
cqlsh:test> SELECT * FROM sasi_bug WHERE val LIKE '%work home%';

 id | clustering | val
----+------------+------------
  1 |          1 | homeworker
  1 |          2 | hardworker

(2 rows)

//2nd example DIFFERENT PARTITION KEY
cqlsh:test> INSERT INTO sasi_bug(id, clustering, val) VALUES(10, 1, 'speedrun');
cqlsh:test> INSERT INTO sasi_bug(id, clustering, val) VALUES(11, 1, 'longrun');
cqlsh:test> SELECT * FROM sasi_bug WHERE val LIKE '%long run%';

 id | clustering | val
----+------------+---------
 11 |          1 | longrun

(1 rows)
{code}

In the 1st example, both rows belong to the same partition, so SASI returns both values. Indeed, {{LIKE '%work home%'}} means {{contains 'work' OR 'home'}}, so the result makes sense.

In the 2nd example, only one row is returned, whereas we expect 2 rows: {{LIKE '%long run%'}} means {{contains 'long' OR 'run'}}, so *speedrun* should be returned too.

So where is the problem?

Explanation: when there is only 1 predicate, the root operation type is an *AND*:

{code:java|title=QueryPlan}
private Operation analyze()
{
    try
    {
        Operation.Builder and = new Operation.Builder(OperationType.AND, controller);
        controller.getExpressions().forEach(and::add);
        return and.complete();
    }
    ...
}
{code}

During the parsing of {{LIKE '%long run%'}}, SASI creates 2 expressions for the searched term, {{long}} and {{run}}, which corresponds to *OR* logic. However, this piece of code ruins the *OR* logic:

{code:java|title=Operation}
public Operation complete()
{
    if (!expressions.isEmpty())
    {
        ListMultimap<ColumnDefinition, Expression> analyzedExpressions = analyzeGroup(controller, op, expressions);
        RangeIterator.Builder<Long, Token> range = controller.getIndexes(op, analyzedExpressions.values());
        ...
}
{code}

As you can see, we blindly take all the *values* of the MultiMap (which contains a single entry for the {{val}} column with 2 expressions) and pass them to {{controller.getIndexes(...)}}:

{code:java|title=QueryController}
public RangeIterator.Builder<Long, Token> getIndexes(OperationType op, Collection<Expression> expressions)
{
    if (resources.containsKey(expressions))
        throw new IllegalArgumentException("Can't process the same expressions multiple times.");

    RangeIterator.Builder<Long, Token> builder = op == OperationType.OR
                                                ? RangeUnionIterator.<Long, Token>builder()
                                                : RangeIntersectionIterator.<Long, Token>builder();
    ...
}
{code}

Because the root operation has the *AND* type, the {{RangeIntersectionIterator}} is used on both expressions, {{long}} and {{run}}.
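
To make the effect of intersecting versus unioning the per-term results concrete, here is a minimal, self-contained sketch. It is not SASI code: the class {{PartitionLookupSketch}} and its methods are hypothetical, plain sets of partition keys stand in for the token ranges produced by {{RangeIntersectionIterator}} and {{RangeUnionIterator}}, and the data is the sample data from the cqlsh session above.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical toy model, NOT the SASI implementation.
public class PartitionLookupSketch {
    // Partition key -> values of the indexed column stored in that partition.
    static final Map<Integer, List<String>> DATA = new LinkedHashMap<>();
    static {
        DATA.put(1, Arrays.asList("homeworker", "hardworker"));
        DATA.put(10, Arrays.asList("speedrun"));
        DATA.put(11, Arrays.asList("longrun"));
    }

    // Partitions holding at least one value that contains the term
    // (a stand-in for a CONTAINS index lookup on a single expression).
    static Set<Integer> partitionsContaining(String term) {
        Set<Integer> keys = new TreeSet<>();
        for (Map.Entry<Integer, List<String>> e : DATA.entrySet())
            for (String value : e.getValue())
                if (value.contains(term))
                    keys.add(e.getKey());
        return keys;
    }

    // AND at the partition level: keep partitions matched by every term.
    static Set<Integer> intersect(String... terms) {
        Set<Integer> result = new TreeSet<>(DATA.keySet());
        for (String term : terms)
            result.retainAll(partitionsContaining(term));
        return result;
    }

    // OR at the partition level: keep partitions matched by any term.
    static Set<Integer> union(String... terms) {
        Set<Integer> result = new TreeSet<>();
        for (String term : terms)
            result.addAll(partitionsContaining(term));
        return result;
    }

    public static void main(String[] args) {
        System.out.println(intersect("long", "run")); // [11]     -> speedrun's partition is dropped
        System.out.println(union("long", "run"));     // [10, 11] -> what OR semantics would select
        System.out.println(intersect("work", "home")); // [1]     -> same partition, nothing is lost
    }
}
```

In this toy model the intersection only loses results when the two terms match in different partitions, which is consistent with the two cqlsh examples.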

So when data belong to different partitions, the *AND* logic applies and eliminates _speedrun_.

When data belong to the same partition but different rows, the {{RangeIntersectionIterator}} returns a single partition, the rows are then filtered further by {{operationTree.satisfiedBy}}, and the results are correct:

{code:java|title=QueryPlan}
while (currentKeys.hasNext())
{
    DecoratedKey key = currentKeys.next();

    if (!keyRange.right.isMinimum() && keyRange.right.compareTo(key) < 0)
        return endOfData();

    try (UnfilteredRowIterator partition = controller.getPartition(key, executionController))
    {
        Row staticRow = partition.staticRow();
        List<Unfiltered> clusters = new ArrayList<>();

        while (partition.hasNext())
        {
            Unfiltered row = partition.next();
            if (operationTree.satisfiedBy(row, staticRow, true))
                clusters.add(row);
        }
        ...
}
{code}

/cc [~xedin] [~ifesdjeen]

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)