[jira] [Commented] (CASSANDRA-14304) DELETE after INSERT IF NOT EXISTS does not work

2018-04-16 Thread DOAN DuyHai (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-14304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16439147#comment-16439147
 ] 

DOAN DuyHai commented on CASSANDRA-14304:
-

If you enforce server-generated timestamps:

Pros:
 * it is easier to synchronize clocks between servers than between clients
 * servers are usually co-located geographically, so NTP is easier to tune

Cons:
 * you cannot apply retry strategies on the client side, because on error the 
client would resend the same request but the server would then assign it a 
newer timestamp

With client-side timestamps, you can use elaborate retry strategies such as 
basic retries, speculative executions, etc.
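
For example, a client-supplied timestamp makes a retry idempotent: replaying 
the exact same mutation with the same timestamp cannot shadow a newer write, 
whereas a server-generated timestamp would move forward on every attempt. A 
minimal sketch (the timestamp value is illustrative):

{code:sql}
-- first attempt times out somewhere between client and coordinator
INSERT INTO ks.ta (id, col) VALUES ('myId', 'myCol') USING TIMESTAMP 1523880000000000;

-- the client-side retry resends the SAME statement with the SAME timestamp,
-- so even if both attempts eventually land, the outcome is identical
INSERT INTO ks.ta (id, col) VALUES ('myId', 'myCol') USING TIMESTAMP 1523880000000000;
{code}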

 

 

> DELETE after INSERT IF NOT EXISTS does not work
> ---
>
> Key: CASSANDRA-14304
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14304
> Project: Cassandra
>  Issue Type: Bug
>  Components: Core
>Reporter: Julien
>Assignee: Vinay Chella
>Priority: Major
> Attachments: debug.log, system.log
>
>
> DELETE a row immediately after INSERT IF NOT EXISTS does not work.
> Can be reproduced with this CQL script:
> {code:java}
> CREATE KEYSPACE ks WITH REPLICATION = { 'class' : 'SimpleStrategy', 
> 'replication_factor' : 1 };
> CREATE TABLE ks.ta ( id text PRIMARY KEY, col text );
> INSERT INTO ks.ta (id, col) VALUES ('myId', 'myCol') IF NOT EXISTS;
> DELETE FROM ks.ta WHERE id = 'myId';
> SELECT * FROM ks.ta WHERE id='myId';
> {code}
> {code:java}
> [cqlsh 5.0.1 | Cassandra 3.11.2 | CQL spec 3.4.4 | Native protocol v4]
> Use HELP for help.
> WARNING: pyreadline dependency missing.  Install to enable tab completion.
> cqlsh> CREATE KEYSPACE ks WITH REPLICATION = { 'class' : 'SimpleStrategy', 
> 'replication_factor' : 1 };
> cqlsh> CREATE TABLE ks.ta ( id text PRIMARY KEY, col text );
> cqlsh> INSERT INTO ks.ta (id, col) VALUES ('myId', 'myCol') IF NOT EXISTS;
>  [applied]
> ---
>   True
> cqlsh> DELETE FROM ks.ta WHERE id = 'myId';
> cqlsh> SELECT * FROM ks.ta WHERE id='myId';
>  id   | col
> --+---
>  myId | myCol
> {code}
>  * Only happens if the client is on a different host (works as expected on 
> the same host)
>  * Works as expected without IF NOT EXISTS
>  * A ~500 ms delay between INSERT and DELETE fixes the issue.
> Logs attached.






[jira] [Commented] (CASSANDRA-14304) DELETE after INSERT IF NOT EXISTS does not work

2018-04-16 Thread DOAN DuyHai (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-14304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16439076#comment-16439076
 ] 

DOAN DuyHai commented on CASSANDRA-14304:
-

> Does this mean that the processing of a LWT ignores the client-generated 
> timestamp and uses one generated by the server?

Exactly: the timestamp used by the server for all LWT operations is the Paxos 
ballot. This is necessary to guarantee a monotonically increasing timestamp for 
LWT operations.
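
This is also why a client-supplied timestamp cannot be combined with a LWT 
condition: the Paxos ballot provides the timestamp, so the server rejects the 
statement. A sketch (the exact error wording varies across versions):

{code:sql}
INSERT INTO ks.ta (id, col) VALUES ('myId', 'myCol') IF NOT EXISTS USING TIMESTAMP 1523880000000000;
-- rejected with InvalidRequest: custom timestamps are not allowed for conditional updates
{code}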

> DELETE after INSERT IF NOT EXISTS does not work
> ---
>
> Key: CASSANDRA-14304
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14304
> Project: Cassandra
>  Issue Type: Bug
>  Components: Core
>Reporter: Julien
>Assignee: Vinay Chella
>Priority: Major
> Attachments: debug.log, system.log
>
>
> DELETE a row immediately after INSERT IF NOT EXISTS does not work.
> Can be reproduced with this CQL script:
> {code:java}
> CREATE KEYSPACE ks WITH REPLICATION = { 'class' : 'SimpleStrategy', 
> 'replication_factor' : 1 };
> CREATE TABLE ks.ta ( id text PRIMARY KEY, col text );
> INSERT INTO ks.ta (id, col) VALUES ('myId', 'myCol') IF NOT EXISTS;
> DELETE FROM ks.ta WHERE id = 'myId';
> SELECT * FROM ks.ta WHERE id='myId';
> {code}
> {code:java}
> [cqlsh 5.0.1 | Cassandra 3.11.2 | CQL spec 3.4.4 | Native protocol v4]
> Use HELP for help.
> WARNING: pyreadline dependency missing.  Install to enable tab completion.
> cqlsh> CREATE KEYSPACE ks WITH REPLICATION = { 'class' : 'SimpleStrategy', 
> 'replication_factor' : 1 };
> cqlsh> CREATE TABLE ks.ta ( id text PRIMARY KEY, col text );
> cqlsh> INSERT INTO ks.ta (id, col) VALUES ('myId', 'myCol') IF NOT EXISTS;
>  [applied]
> ---
>   True
> cqlsh> DELETE FROM ks.ta WHERE id = 'myId';
> cqlsh> SELECT * FROM ks.ta WHERE id='myId';
>  id   | col
> --+---
>  myId | myCol
> {code}
>  * Only happens if the client is on a different host (works as expected on 
> the same host)
>  * Works as expected without IF NOT EXISTS
>  * A ~500 ms delay between INSERT and DELETE fixes the issue.
> Logs attached.






[jira] [Commented] (CASSANDRA-12674) [SASI] Confusing AND/OR semantics for StandardAnalyzer

2018-04-06 Thread DOAN DuyHai (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16428229#comment-16428229
 ] 

DOAN DuyHai commented on CASSANDRA-12674:
-

Looks like this ticket is dead.

Because the fix for this bug is far from trivial, I don't see a solution coming 
soon, or ever.

> [SASI] Confusing AND/OR semantics for StandardAnalyzer 
> ---
>
> Key: CASSANDRA-12674
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12674
> Project: Cassandra
>  Issue Type: Bug
>  Components: sasi
> Environment: Cassandra 3.7
>Reporter: DOAN DuyHai
>Assignee: Alex Petrov
>Priority: Major
>
> {code:sql}
> Connected to Test Cluster at 127.0.0.1:9042.
> [cqlsh 5.0.1 | Cassandra 3.7 | CQL spec 3.4.2 | Native protocol v4]
> Use HELP for help.
> cqlsh> use test;
> cqlsh:test> CREATE TABLE sasi_bug(id int, clustering int, val text, PRIMARY 
> KEY((id), clustering));
> cqlsh:test> CREATE CUSTOM INDEX ON sasi_bug(val) USING 
> 'org.apache.cassandra.index.sasi.SASIIndex' WITH OPTIONS = {
> 'mode': 'CONTAINS',
>  'analyzer_class': 
> 'org.apache.cassandra.index.sasi.analyzer.StandardAnalyzer',
> 'analyzed': 'true'};
> //1st example SAME PARTITION KEY
> cqlsh:test> INSERT INTO sasi_bug(id, clustering , val ) VALUES(1, 1, 
> 'homeworker');
> cqlsh:test> INSERT INTO sasi_bug(id, clustering , val ) VALUES(1, 2, 
> 'hardworker');
> cqlsh:test> SELECT * FROM sasi_bug WHERE val LIKE '%work home%';
>  id | clustering | val
> ----+------------+------------
>   1 |  1 | homeworker
>   1 |  2 | hardworker
> (2 rows)
> //2nd example DIFFERENT PARTITION KEY
> cqlsh:test> INSERT INTO sasi_bug(id, clustering, val) VALUES(10, 1, 
> 'speedrun');
> cqlsh:test> INSERT INTO sasi_bug(id, clustering, val) VALUES(11, 1, 
> 'longrun');
> cqlsh:test> SELECT * FROM sasi_bug WHERE val LIKE '%long run%';
>  id | clustering | val
> ----+------------+---------
>  11 |  1 | longrun
> (1 rows)
> {code}
> In the 1st example, both rows belong to the same partition so SASI returns 
> both values. Indeed {{LIKE '%work home%'}} means {{contains 'work' OR 
> 'home'}} so the result makes sense
> In the 2nd example, only one row is returned whereas we expect 2 rows because 
> {{LIKE '%long run%'}} means {{contains 'long' OR 'run'}} so *speedrun* should 
> be returned too.
> So where is the problem ? Explanation:
> When there is only 1 predicate, the root operation type is an *AND*:
> {code:java|title=QueryPlan}
> private Operation analyze()
> {
> try
> {
> Operation.Builder and = new Operation.Builder(OperationType.AND, 
> controller);
> controller.getExpressions().forEach(and::add);
> return and.complete();
> }
>...
> }
> {code}
> During the parsing of {{LIKE '%long run%'}}, SASI creates 2 expressions for 
> the searched term: {{long}} and {{run}}, which corresponds to an *OR* logic. 
> However, this piece of code just ruins the *OR* logic:
> {code:java|title=Operation}
> public Operation complete()
> {
> if (!expressions.isEmpty())
> {
> ListMultimap<ColumnDefinition, Expression> analyzedExpressions = 
> analyzeGroup(controller, op, expressions);
> RangeIterator.Builder<Long, Token> range = 
> controller.getIndexes(op, analyzedExpressions.values());
>  ...
> }
> {code}
> As you can see, we blindly take all the *values* of the MultiMap (which 
> contains a single entry for the {{val}} column with 2 expressions) and pass 
> it to {{controller.getIndexes(...)}}
> {code:java|title=QueryController}
> public RangeIterator.Builder<Long, Token> getIndexes(OperationType op, 
> Collection<Expression> expressions)
> {
>     if (resources.containsKey(expressions))
>         throw new IllegalArgumentException("Can't process the same 
> expressions multiple times.");
>     RangeIterator.Builder<Long, Token> builder = op == OperationType.OR
>             ? RangeUnionIterator.<Long, Token>builder()
>             : RangeIntersectionIterator.<Long, Token>builder();
>     ...
> }
> {code}
> And because the root operation has *AND* type, the 
> {{RangeIntersectionIterator}} will be used on both expressions {{long}} and 
> {{run}}.
> So when data belong to different partitions, we have the *AND* logic that 
> applies and eliminates _speedrun_
> When data belong to the same partition but different row, the 
> {{RangeIntersectionIterator}} returns a single partition and then the rows 
> are filtered further by {{operationTree.satisfiedBy}} and the results are 
> correct
> {code:java|title=QueryPlan}
> while (currentKeys.hasNext())
> {
>

[jira] [Commented] (CASSANDRA-14357) A Simple List of New Major Features Desired for Version 4.0

2018-04-02 Thread DOAN DuyHai (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-14357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16422581#comment-16422581
 ] 

DOAN DuyHai commented on CASSANDRA-14357:
-

My wished feature: 
 * Add support for arithmetic operators (CASSANDRA-11935)
 * Allow IN restrictions on column families with collections (CASSANDRA-12654)
 * Add support for + and - operations on dates (CASSANDRA-11936)
 * Add the currentTimestamp, currentDate, currentTime and currentTimeUUID 
functions (CASSANDRA-13132)
 * Allow selecting Map values and Set elements (CASSANDRA-7396)

 

Those are mostly useful for timeseries data models

 

 

> A Simple List of New Major Features Desired for Version 4.0
> ---
>
> Key: CASSANDRA-14357
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14357
> Project: Cassandra
>  Issue Type: Wish
>Reporter: Kenneth Brotman
>Priority: Major
>
> Just list any desired new major features for 4.0 that you want added.  I will 
> maintain a compiled list somewhere on this Jira as well.  Don't worry about 
> any steps beyond this.  Don't make any judgements about or make any comments 
> at all about what others add. 
> No judgments at this point.  This is a list of everyone's suggestions.  Add 
> your suggestions for new major features you desire be added for version 4.0 
> only.






[jira] [Comment Edited] (CASSANDRA-14357) A Simple List of New Major Features Desired for Version 4.0

2018-04-02 Thread DOAN DuyHai (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-14357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16422581#comment-16422581
 ] 

DOAN DuyHai edited comment on CASSANDRA-14357 at 4/2/18 2:51 PM:
-

My wished features: 
 * Add support for arithmetic operators (CASSANDRA-11935)
 * Allow IN restrictions on column families with collections (CASSANDRA-12654)
 * Add support for + and - operations on dates (CASSANDRA-11936)
 * Add the currentTimestamp, currentDate, currentTime and currentTimeUUID 
functions (CASSANDRA-13132)
 * Allow selecting Map values and Set elements (CASSANDRA-7396)

 

Those are mostly useful for timeseries data models

 

 


was (Author: doanduyhai):
My wished feature: 
 * Add support for arithmetic operators (CASSANDRA-11935)
 * Allow IN restrictions on column families with collections (CASSANDRA-12654)
 * Add support for + and - operations on dates (CASSANDRA-11936)
 * Add the currentTimestamp, currentDate, currentTime and currentTimeUUID 
functions (CASSANDRA-13132)
 * Allow selecting Map values and Set elements (CASSANDRA-7396)

 

Those are mostly useful for timeseries data models

 

 

> A Simple List of New Major Features Desired for Version 4.0
> ---
>
> Key: CASSANDRA-14357
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14357
> Project: Cassandra
>  Issue Type: Wish
>Reporter: Kenneth Brotman
>Priority: Major
>
> Just list any desired new major features for 4.0 that you want added.  I will 
> maintain a compiled list somewhere on this Jira as well.  Don't worry about 
> any steps beyond this.  Don't make any judgements about or make any comments 
> at all about what others add. 
> No judgments at this point.  This is a list of everyone's suggestions.  Add 
> your suggestions for new major features you desire be added for version 4.0 
> only.






[jira] [Commented] (CASSANDRA-14304) DELETE after INSERT IF NOT EXISTS does not work

2018-03-26 Thread DOAN DuyHai (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-14304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16413795#comment-16413795
 ] 

DOAN DuyHai commented on CASSANDRA-14304:
-

Hints:

1) Right after the INSERT ... IF NOT EXISTS, run SELECT writetime(col) FROM 
ks.ta WHERE id = 'myId' to get the timestamp/Paxos ballot.

2) Issue a DELETE with an explicit timestamp greater than that Paxos ballot.

3) Now do a SELECT * and check whether you still see the issue.
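
In CQL, those three steps look like the following sketch (the writetime value 
shown is only an example of the microsecond ballot you would read back):

{code:sql}
INSERT INTO ks.ta (id, col) VALUES ('myId', 'myCol') IF NOT EXISTS;

-- 1) read back the server-assigned (Paxos ballot) timestamp, in microseconds
SELECT writetime(col) FROM ks.ta WHERE id = 'myId';   -- e.g. 1522060690168000

-- 2) delete with an explicit timestamp strictly greater than that ballot
DELETE FROM ks.ta USING TIMESTAMP 1522060690168001 WHERE id = 'myId';

-- 3) check whether the row is really gone
SELECT * FROM ks.ta WHERE id = 'myId';
{code}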

> DELETE after INSERT IF NOT EXISTS does not work
> ---
>
> Key: CASSANDRA-14304
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14304
> Project: Cassandra
>  Issue Type: Bug
>  Components: Core
>Reporter: Julien
>Assignee: Vinay Chella
>Priority: Major
> Attachments: debug.log, system.log
>
>
> DELETE a row immediately after INSERT IF NOT EXISTS does not work.
> Can be reproduced with this CQL script:
> {code:java}
> CREATE KEYSPACE ks WITH REPLICATION = { 'class' : 'SimpleStrategy', 
> 'replication_factor' : 1 };
> CREATE TABLE ks.ta ( id text PRIMARY KEY, col text );
> INSERT INTO ks.ta (id, col) VALUES ('myId', 'myCol') IF NOT EXISTS;
> DELETE FROM ks.ta WHERE id = 'myId';
> SELECT * FROM ks.ta WHERE id='myId';
> {code}
> {code:java}
> [cqlsh 5.0.1 | Cassandra 3.11.2 | CQL spec 3.4.4 | Native protocol v4]
> Use HELP for help.
> WARNING: pyreadline dependency missing.  Install to enable tab completion.
> cqlsh> CREATE KEYSPACE ks WITH REPLICATION = { 'class' : 'SimpleStrategy', 
> 'replication_factor' : 1 };
> cqlsh> CREATE TABLE ks.ta ( id text PRIMARY KEY, col text );
> cqlsh> INSERT INTO ks.ta (id, col) VALUES ('myId', 'myCol') IF NOT EXISTS;
>  [applied]
> ---
>   True
> cqlsh> DELETE FROM ks.ta WHERE id = 'myId';
> cqlsh> SELECT * FROM ks.ta WHERE id='myId';
>  id   | col
> --+---
>  myId | myCol
> {code}
>  * Only happens if the client is on a different host (works as expected on 
> the same host)
>  * Works as expected without IF NOT EXISTS
>  * A ~500 ms delay between INSERT and DELETE fixes the issue.
> Logs attached.






[jira] [Commented] (CASSANDRA-14304) DELETE after INSERT IF NOT EXISTS does not work

2018-03-24 Thread DOAN DuyHai (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-14304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16412654#comment-16412654
 ] 

DOAN DuyHai commented on CASSANDRA-14304:
-

Hints:

1) LWT operations use a ballot based on an agreement on the timestamp value 
between a QUORUM of replicas. The timestamp can be slightly incremented (by a 
few microseconds) in case of conflict/contention on the cluster, so the 
timestamp used for the LWT can end up slightly (again, microseconds) in the 
future. Read the source code to find the part responsible for ballot agreement 
with Paxos.

2) For the DELETE request:

   a) it can use the client- (or coordinator-) generated timestamp, which can 
belong to the "past" with respect to the one used by the LWT, so the SELECT 
still returns a value

   b) it can hit a 3rd replica not involved in the previous LWT (remember LWT 
only uses QUORUM) whose clock is slightly in the past.

Solution:

a. use INSERT INTO ... IF NOT EXISTS

b. issue the DELETE using QUORUM
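
A minimal cqlsh sketch of option b (CONSISTENCY is the cqlsh command that sets 
the consistency level for the following requests; this only illustrates the 
suggestion above, not a guaranteed fix):

{code:sql}
CONSISTENCY QUORUM;
DELETE FROM ks.ta WHERE id = 'myId';
SELECT * FROM ks.ta WHERE id = 'myId';
{code}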

 

> DELETE after INSERT IF NOT EXISTS does not work
> ---
>
> Key: CASSANDRA-14304
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14304
> Project: Cassandra
>  Issue Type: Bug
>  Components: Core
>Reporter: Julien
>Assignee: Vinay Chella
>Priority: Major
> Attachments: debug.log, system.log
>
>
> DELETE a row immediately after INSERT IF NOT EXISTS does not work.
> Can be reproduced with this CQL script:
> {code:java}
> CREATE KEYSPACE ks WITH REPLICATION = { 'class' : 'SimpleStrategy', 
> 'replication_factor' : 1 };
> CREATE TABLE ks.ta ( id text PRIMARY KEY, col text );
> INSERT INTO ks.ta (id, col) VALUES ('myId', 'myCol') IF NOT EXISTS;
> DELETE FROM ks.ta WHERE id = 'myId';
> SELECT * FROM ks.ta WHERE id='myId';
> {code}
> {code:java}
> [cqlsh 5.0.1 | Cassandra 3.11.2 | CQL spec 3.4.4 | Native protocol v4]
> Use HELP for help.
> WARNING: pyreadline dependency missing.  Install to enable tab completion.
> cqlsh> CREATE KEYSPACE ks WITH REPLICATION = { 'class' : 'SimpleStrategy', 
> 'replication_factor' : 1 };
> cqlsh> CREATE TABLE ks.ta ( id text PRIMARY KEY, col text );
> cqlsh> INSERT INTO ks.ta (id, col) VALUES ('myId', 'myCol') IF NOT EXISTS;
>  [applied]
> ---
>   True
> cqlsh> DELETE FROM ks.ta WHERE id = 'myId';
> cqlsh> SELECT * FROM ks.ta WHERE id='myId';
>  id   | col
> --+---
>  myId | myCol
> {code}
>  * Only happens if the client is on a different host (works as expected on 
> the same host)
>  * Works as expected without IF NOT EXISTS
>  * A ~500 ms delay between INSERT and DELETE fixes the issue.
> Logs attached.






[jira] [Commented] (CASSANDRA-13442) Support a means of strongly consistent highly available replication with tunable storage requirements

2017-10-10 Thread DOAN DuyHai (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16198253#comment-16198253
 ] 

DOAN DuyHai commented on CASSANDRA-13442:
-

So I've wrapped my head around the design of transient replicas. So far I can 
spot 2 concerns:

1) It does not work with ONE or LOCAL_ONE. Of course transient replication is 
an opt-in feature, but it means users must be very careful about issuing 
queries at ONE/LOCAL_ONE against keyspaces that have transient replication 
enabled. Considering that ONE/LOCAL_ONE is the *default consistency level* for 
the drivers and the Spark connector, maybe we should throw an exception 
whenever a query at those consistency levels is issued against a transiently 
replicated keyspace?

2) *Consistency level* and *repair* have so far been 2 distinct and orthogonal 
notions. With transient replication they become strongly tied, because 
transient replication relies heavily on incremental repair. Of course that is 
an implementation detail, and [~aweisberg] has mentioned replicated hints as an 
alternative, but then we are making transient replication dependent on the 
hints implementation. Same story.

The consequence of point 2) is that any bug in incremental repair (or 
replicated hints) will badly impact the correctness/assumptions of transient 
replication. This point worries me much more than point 1)

> Support a means of strongly consistent highly available replication with 
> tunable storage requirements
> -
>
> Key: CASSANDRA-13442
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13442
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Compaction, Coordination, Distributed Metadata, Local 
> Write-Read Paths
>Reporter: Ariel Weisberg
>
> Replication factors like RF=2 can't provide strong consistency and 
> availability because if a single node is lost it's impossible to reach a 
> quorum of replicas. Stepping up to RF=3 will allow you to lose a node and 
> still achieve quorum for reads and writes, but requires committing additional 
> storage.
> The requirement of a quorum for writes/reads doesn't seem to be something 
> that can be relaxed without additional constraints on queries, but it seems 
> like it should be possible to relax the requirement that 3 full copies of the 
> entire data set are kept. What is actually required is a covering data set 
> for the range and we should be able to achieve a covering data set and high 
> availability without having three full copies. 
> After a repair we know that some subset of the data set is fully replicated. 
> At that point we don't have to read from a quorum of nodes for the repaired 
> data. It is sufficient to read from a single node for the repaired data and a 
> quorum of nodes for the unrepaired data.
> One way to exploit this would be to have N replicas, say the last N replicas 
> (where N varies with RF) in the preference list, delete all repaired data 
> after a repair completes. Subsequent quorum reads will be able to retrieve 
> the repaired data from any of the two full replicas and the unrepaired data 
> from a quorum read of any replica including the "transient" replicas.
> Configuration for something like this in NTS might be something similar to { 
> DC1="3-1", DC2="3-2" } where the first value is the replication factor used 
> for consistency and the second values is the number of transient replicas. If 
> you specify { DC1=3, DC2=3 } then the number of transient replicas defaults 
> to 0 and you get the same behavior you have today.






[jira] [Commented] (CASSANDRA-13442) Support a means of strongly consistent highly available replication with tunable storage requirements

2017-10-09 Thread DOAN DuyHai (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16197624#comment-16197624
 ] 

DOAN DuyHai commented on CASSANDRA-13442:
-

I did not mean end users, I meant core C* developers. We need to introduce 
some code changes to accommodate the asymmetry between replicas in the code 
base.

> Support a means of strongly consistent highly available replication with 
> tunable storage requirements
> -
>
> Key: CASSANDRA-13442
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13442
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Compaction, Coordination, Distributed Metadata, Local 
> Write-Read Paths
>Reporter: Ariel Weisberg
>
> Replication factors like RF=2 can't provide strong consistency and 
> availability because if a single node is lost it's impossible to reach a 
> quorum of replicas. Stepping up to RF=3 will allow you to lose a node and 
> still achieve quorum for reads and writes, but requires committing additional 
> storage.
> The requirement of a quorum for writes/reads doesn't seem to be something 
> that can be relaxed without additional constraints on queries, but it seems 
> like it should be possible to relax the requirement that 3 full copies of the 
> entire data set are kept. What is actually required is a covering data set 
> for the range and we should be able to achieve a covering data set and high 
> availability without having three full copies. 
> After a repair we know that some subset of the data set is fully replicated. 
> At that point we don't have to read from a quorum of nodes for the repaired 
> data. It is sufficient to read from a single node for the repaired data and a 
> quorum of nodes for the unrepaired data.
> One way to exploit this would be to have N replicas, say the last N replicas 
> (where N varies with RF) in the preference list, delete all repaired data 
> after a repair completes. Subsequent quorum reads will be able to retrieve 
> the repaired data from any of the two full replicas and the unrepaired data 
> from a quorum read of any replica including the "transient" replicas.
> Configuration for something like this in NTS might be something similar to { 
> DC1="3-1", DC2="3-2" } where the first value is the replication factor used 
> for consistency and the second values is the number of transient replicas. If 
> you specify { DC1=3, DC2=3 } then the number of transient replicas defaults 
> to 0 and you get the same behavior you have today.






[jira] [Commented] (CASSANDRA-13442) Support a means of strongly consistent highly available replication with tunable storage requirements

2017-10-09 Thread DOAN DuyHai (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16197593#comment-16197593
 ] 

DOAN DuyHai commented on CASSANDRA-13442:
-

Ok, I get it: you can provision transient replicas with much less disk space 
than normal replicas --> cost saving.

That being said, I think the cost saving only becomes a real argument for very 
large clusters. For average C* users in the range of 10 - 20 nodes, I'm not 
sure the added complexity in reasoning is worth the disk space saving.

> Support a means of strongly consistent highly available replication with 
> tunable storage requirements
> -
>
> Key: CASSANDRA-13442
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13442
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Compaction, Coordination, Distributed Metadata, Local 
> Write-Read Paths
>Reporter: Ariel Weisberg
>
> Replication factors like RF=2 can't provide strong consistency and 
> availability because if a single node is lost it's impossible to reach a 
> quorum of replicas. Stepping up to RF=3 will allow you to lose a node and 
> still achieve quorum for reads and writes, but requires committing additional 
> storage.
> The requirement of a quorum for writes/reads doesn't seem to be something 
> that can be relaxed without additional constraints on queries, but it seems 
> like it should be possible to relax the requirement that 3 full copies of the 
> entire data set are kept. What is actually required is a covering data set 
> for the range and we should be able to achieve a covering data set and high 
> availability without having three full copies. 
> After a repair we know that some subset of the data set is fully replicated. 
> At that point we don't have to read from a quorum of nodes for the repaired 
> data. It is sufficient to read from a single node for the repaired data and a 
> quorum of nodes for the unrepaired data.
> One way to exploit this would be to have N replicas, say the last N replicas 
> (where N varies with RF) in the preference list, delete all repaired data 
> after a repair completes. Subsequent quorum reads will be able to retrieve 
> the repaired data from any of the two full replicas and the unrepaired data 
> from a quorum read of any replica including the "transient" replicas.
> Configuration for something like this in NTS might be something similar to { 
> DC1="3-1", DC2="3-2" } where the first value is the replication factor used 
> for consistency and the second values is the number of transient replicas. If 
> you specify { DC1=3, DC2=3 } then the number of transient replicas defaults 
> to 0 and you get the same behavior you have today.






[jira] [Commented] (CASSANDRA-13442) Support a means of strongly consistent highly available replication with tunable storage requirements

2017-10-09 Thread DOAN DuyHai (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16197385#comment-16197385
 ] 

DOAN DuyHai commented on CASSANDRA-13442:
-

{quote}The target is 10x to 20x less storage{quote}

As far as I understand, with RF=3, if you remove repaired data on the transient 
replica, you reduce storage by 1/3. Where does this 10x - 20x come from, then?

> Support a means of strongly consistent highly available replication with 
> tunable storage requirements
> -
>
> Key: CASSANDRA-13442
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13442
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Compaction, Coordination, Distributed Metadata, Local 
> Write-Read Paths
>Reporter: Ariel Weisberg
>
> Replication factors like RF=2 can't provide strong consistency and 
> availability because if a single node is lost it's impossible to reach a 
> quorum of replicas. Stepping up to RF=3 will allow you to lose a node and 
> still achieve quorum for reads and writes, but requires committing additional 
> storage.
> The requirement of a quorum for writes/reads doesn't seem to be something 
> that can be relaxed without additional constraints on queries, but it seems 
> like it should be possible to relax the requirement that 3 full copies of the 
> entire data set are kept. What is actually required is a covering data set 
> for the range and we should be able to achieve a covering data set and high 
> availability without having three full copies. 
> After a repair we know that some subset of the data set is fully replicated. 
> At that point we don't have to read from a quorum of nodes for the repaired 
> data. It is sufficient to read from a single node for the repaired data and a 
> quorum of nodes for the unrepaired data.
> One way to exploit this would be to have N replicas, say the last N replicas 
> (where N varies with RF) in the preference list, delete all repaired data 
> after a repair completes. Subsequent quorum reads will be able to retrieve 
> the repaired data from any of the two full replicas and the unrepaired data 
> from a quorum read of any replica including the "transient" replicas.
> Configuration for something like this in NTS might be something similar to { 
> DC1="3-1", DC2="3-2" } where the first value is the replication factor used 
> for consistency and the second values is the number of transient replicas. If 
> you specify { DC1=3, DC2=3 } then the number of transient replicas defaults 
> to 0 and you get the same behavior you have today.






[jira] [Commented] (CASSANDRA-13442) Support a means of strongly consistent highly available replication with tunable storage requirements

2017-10-05 Thread DOAN DuyHai (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16193359#comment-16193359
 ] 

DOAN DuyHai commented on CASSANDRA-13442:
-

Wow, thank you [~aweisberg] for such a detailed answer! I'm going to take some 
time to digest all of this.

> Support a means of strongly consistent highly available replication with 
> tunable storage requirements
> -
>
> Key: CASSANDRA-13442
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13442
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Compaction, Coordination, Distributed Metadata, Local 
> Write-Read Paths
>Reporter: Ariel Weisberg
>
> Replication factors like RF=2 can't provide strong consistency and 
> availability because if a single node is lost it's impossible to reach a 
> quorum of replicas. Stepping up to RF=3 will allow you to lose a node and 
> still achieve quorum for reads and writes, but requires committing additional 
> storage.
> The requirement of a quorum for writes/reads doesn't seem to be something 
> that can be relaxed without additional constraints on queries, but it seems 
> like it should be possible to relax the requirement that 3 full copies of the 
> entire data set are kept. What is actually required is a covering data set 
> for the range and we should be able to achieve a covering data set and high 
> availability without having three full copies. 
> After a repair we know that some subset of the data set is fully replicated. 
> At that point we don't have to read from a quorum of nodes for the repaired 
> data. It is sufficient to read from a single node for the repaired data and a 
> quorum of nodes for the unrepaired data.
> One way to exploit this would be to have N replicas, say the last N replicas 
> (where N varies with RF) in the preference list, delete all repaired data 
> after a repair completes. Subsequent quorum reads will be able to retrieve 
> the repaired data from any of the two full replicas and the unrepaired data 
> from a quorum read of any replica including the "transient" replicas.
> Configuration for something like this in NTS might be something similar to { 
> DC1="3-1", DC2="3-2" } where the first value is the replication factor used 
> for consistency and the second values is the number of transient replicas. If 
> you specify { DC1=3, DC2=3 } then the number of transient replicas defaults 
> to 0 and you get the same behavior you have today.






[jira] [Commented] (CASSANDRA-13553) Map C* table schema to RocksDB key value data model

2017-10-03 Thread DOAN DuyHai (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16189682#comment-16189682
 ] 

DOAN DuyHai commented on CASSANDRA-13553:
-

Thanks [~dikanggu], can't wait to play with this new storage engine once we 
have a beta version

> Map C* table schema to RocksDB key value data model
> ---
>
> Key: CASSANDRA-13553
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13553
> Project: Cassandra
>  Issue Type: Sub-task
>  Components: Core
>Reporter: Dikang Gu
>Assignee: Dikang Gu
>
> The goal for this ticket is to find a way to map Cassandra's table data model 
> to RocksDB's key value data model.
> To support most common C* queries on top of RocksDB, we plan to use this 
> strategy, for each row in Cassandra:
> 1. Encode Cassandra partition key + clustering keys into RocksDB key.
> 2. Encode rest of Cassandra columns into RocksDB value.
> With this approach, there are two major problems we need to solve:
> 1. After we encode C* keys into RocksDB key, we need to preserve the same 
> sorting order in RocksDB byte comparator, as in original data type.
> 2. Support timestamp, ttl, and tombestone on the values.
> To solve problem 1, we need to carefully design the encoding algorithm for 
> each data type. Fortunately, there are some existing libraries we can play 
> with, such as orderly (https://github.com/ndimiduk/orderly), which is used by 
> HBase. Or flatbuffer (https://github.com/google/flatbuffers)
> To solve problem 2, our plan is to encode C* timestamp, ttl, and tombestone 
> together with the values, and then use RocksDB's merge operator/compaction 
> filter to merge different version of data, and handle ttl/tombestones. 






[jira] [Commented] (CASSANDRA-13476) RocksDB based storage engine

2017-10-02 Thread DOAN DuyHai (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16188636#comment-16188636
 ] 

DOAN DuyHai commented on CASSANDRA-13476:
-

Thanks [~jjirsa]

The bleeding-edge part of me gets very excited about a new storage engine.

The conservative part of me is worried about the impact on production, hence my 
list of questions.

> RocksDB based storage engine
> 
>
> Key: CASSANDRA-13476
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13476
> Project: Cassandra
>  Issue Type: New Feature
>Reporter: Dikang Gu
>
> As I mentioned in CASSANDRA-13474, we got huge P99 read latency gain from the 
> RocksDB integration experiment.
> After we make the existing storage engine to be pluggable, we want to 
> implement a RocksDB based storage engine, which can support existing 
> Cassandra data model, and provide better and more predictable performance.
> The effort will include but not limited to:
> 1. Wide column support on RocksDB
> 2. Streaming support on RocksDB
> 3. RocksDB instances management
> 4. Nodetool support
> 5. Metrics and monitoring
> 6. Counter support on RocksDB






[jira] [Commented] (CASSANDRA-13553) Map C* Wide column to RocksDB key value data model

2017-10-02 Thread DOAN DuyHai (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16188632#comment-16188632
 ] 

DOAN DuyHai commented on CASSANDRA-13553:
-

I like the idea of using a special encoding to preserve the order of clustering 
columns. Neat!

About point 2, storing all the normal C* columns in the RocksDB value: how 
would we deal with a CQL mutation on a single column of the row? Would it 
necessarily require a read-before-write, or do we have smarter alternatives?

If the read-before-write is unavoidable, maybe clearly document the performance 
trade-off of using RocksDB for writes so that users don't get surprised.

> Map C* Wide column to RocksDB key value data model
> --
>
> Key: CASSANDRA-13553
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13553
> Project: Cassandra
>  Issue Type: Sub-task
>  Components: Core
>Reporter: Dikang Gu
>Assignee: Dikang Gu
>
> The goal for this ticket is to find a way to map Cassandra's wide column data 
> model to RocksDB's key value data model.
> To support most common C* queries on top of RocksDB, we plan to use this 
> strategy, for each row in Cassandra:
> 1. Encode Cassandra partition key + clustering keys into RocksDB key.
> 2. Encode rest of Cassandra columns into RocksDB value.
> With this approach, there are two major problems we need to solve:
> 1. After we encode C* keys into RocksDB key, we need to preserve the same 
> sorting order in RocksDB byte comparator, as in original data type.
> 2. Support timestamp, ttl, and tombestone on the values.
> To solve problem 1, we need to carefully design the encoding algorithm for 
> each data type. Fortunately, there are some existing libraries we can play 
> with, such as orderly (https://github.com/ndimiduk/orderly), which is used by 
> HBase. Or flatbuffer (https://github.com/google/flatbuffers)
> To solve problem 2, our plan is to encode C* timestamp, ttl, and tombestone 
> together with the values, and then use RocksDB's merge operator/compaction 
> filter to merge different version of data, and handle ttl/tombestones. 






[jira] [Commented] (CASSANDRA-13476) RocksDB based storage engine

2017-10-02 Thread DOAN DuyHai (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16188503#comment-16188503
 ] 

DOAN DuyHai commented on CASSANDRA-13476:
-

A stupid question, but when using RocksDB, how can we support all the CQL 
semantics (collections, clustering, TTL, LWT, SASI, MV, UDF/UDA, ...)?

I don't like the idea of different storage engines supporting different subsets 
of CQL features, ending up with a matrix of supported/unsupported features.

Also, I guess that features like collections or counters will yield different 
performance characteristics when switching between storage engines. It would be 
nice to document this clearly so users are not caught by surprise.

Also, compaction strategies will be specific to each storage engine, right?

> RocksDB based storage engine
> 
>
> Key: CASSANDRA-13476
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13476
> Project: Cassandra
>  Issue Type: New Feature
>Reporter: Dikang Gu
>
> As I mentioned in CASSANDRA-13474, we got huge P99 read latency gain from the 
> RocksDB integration experiment.
> After we make the existing storage engine to be pluggable, we want to 
> implement a RocksDB based storage engine, which can support existing 
> Cassandra data model, and provide better and more predictable performance.
> The effort will include but not limited to:
> 1. Wide column support on RocksDB
> 2. Streaming support on RocksDB
> 3. RocksDB instances management
> 4. Nodetool support
> 5. Metrics and monitoring
> 6. Counter support on RocksDB






[jira] [Commented] (CASSANDRA-13442) Support a means of strongly consistent highly available replication with tunable storage requirements

2017-09-30 Thread DOAN DuyHai (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16186999#comment-16186999
 ] 

DOAN DuyHai commented on CASSANDRA-13442:
-

> After a repair we know that some subset of the data set is fully replicated. 
> At that point we don't have to read from a quorum of nodes for the repaired 
> data. It is sufficient to read from a single node for the repaired data and a 
> quorum of nodes for the unrepaired data.

Also, how do we define *repaired data*? Is it at the token range level? After a 
round of repair we can segregate SSTables into repaired and unrepaired buckets, 
fine. But does that carry over to the token range level?

Suppose for simplification the range [0 - 10[ (out of a [0 - 100[ total range). 
Even if a round of repair has just finished for this range, as soon as a single 
subsequent update falls into it, the whole range can no longer be considered 
repaired ...

> Support a means of strongly consistent highly available replication with 
> tunable storage requirements
> -
>
> Key: CASSANDRA-13442
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13442
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Compaction, Coordination, Distributed Metadata, Local 
> Write-Read Paths
>Reporter: Ariel Weisberg
>
> Replication factors like RF=2 can't provide strong consistency and 
> availability because if a single node is lost it's impossible to reach a 
> quorum of replicas. Stepping up to RF=3 will allow you to lose a node and 
> still achieve quorum for reads and writes, but requires committing additional 
> storage.
> The requirement of a quorum for writes/reads doesn't seem to be something 
> that can be relaxed without additional constraints on queries, but it seems 
> like it should be possible to relax the requirement that 3 full copies of the 
> entire data set are kept. What is actually required is a covering data set 
> for the range and we should be able to achieve a covering data set and high 
> availability without having three full copies. 
> After a repair we know that some subset of the data set is fully replicated. 
> At that point we don't have to read from a quorum of nodes for the repaired 
> data. It is sufficient to read from a single node for the repaired data and a 
> quorum of nodes for the unrepaired data.
> One way to exploit this would be to have N replicas, say the last N replicas 
> (where N varies with RF) in the preference list, delete all repaired data 
> after a repair completes. Subsequent quorum reads will be able to retrieve 
> the repaired data from any of the two full replicas and the unrepaired data 
> from a quorum read of any replica including the "transient" replicas.
> Configuration for something like this in NTS might be something similar to { 
> DC1="3-1", DC2="3-2" } where the first value is the replication factor used 
> for consistency and the second values is the number of transient replicas. If 
> you specify { DC1=3, DC2=3 } then the number of transient replicas defaults 
> to 0 and you get the same behavior you have today.






[jira] [Commented] (CASSANDRA-13442) Support a means of strongly consistent highly available replication with tunable storage requirements

2017-09-30 Thread DOAN DuyHai (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16186997#comment-16186997
 ] 

DOAN DuyHai commented on CASSANDRA-13442:
-

So by flagging 2 replicas as fully repaired and used for consistent reads and 
one as transient, we are indeed introducing *asymmetry* between the replicas 
themselves.

For the following reasoning, suppose RF=3 with 2 _consistent_ replicas and 1 
_transient_ replica.

*I  Consistency*

Reasoning about CL from the client's point of view is no longer 
straightforward: the client would need to know which replica(s) are fully 
repaired and which are not.

Furthermore, we now need to keep metadata about repaired and unrepaired ranges 
of data.

- suppose a {{SELECT * FROM table WHERE partition = xxx AND clustering >= yyy 
AND clustering <= zzz}}. What if some clustering ranges have been repaired and 
some others haven't? Do we query a QUORUM of the 3 replicas, or does the 
coordinator issue 2 queries, one for the repaired range on 1 consistent replica 
and one on a QUORUM of the 3 replicas for the unrepaired range?

- how do we reason about CLs like global multi-DC QUORUM and EACH_QUORUM for 
reads?

*II Failure handling*

Replica failure is now *asymmetric* too for reads. A failed _transient_ replica 
is a no-brainer; a failed _consistent_ replica is a bigger issue. With classic 
replication, CL=ONE guarantees you stay highly available in the face of 2 
failed replicas. With 2 failed _consistent_ replicas, since the repaired data 
has been deleted on the _transient_ replica, the CL=ONE contract is broken ...

*III Operations*

- how do we manage bootstrap of new node(s)? Is a new node considered transient 
for all ranges until at least one round of repair has finished? Obviously, 
bootstrapping new node(s) by streaming data from _consistent_ replicas makes 
more sense for repaired data, but in that case the bootstrap process itself 
needs to be repair-aware

- how do we manage decommission? How do we redistribute repaired and unrepaired 
ranges of data so that the remaining nodes in the cluster stay balanced (with 
regard to the _consistent_ and _transient_ replica states)?

- what about MVs and secondary indexes? I know MVs are likely to be flagged as 
experimental, but unless we remove them completely from the code base we should 
also think about the concept of _transient_ replicas at least for MVs.

- what about backup & restore? Any backup process would now also have to be 
repair-aware and know the location of the _consistent_ replicas

- for *existing* clusters, how do we transition from normal replication to this 
new replication scheme? And vice versa?





> Support a means of strongly consistent highly available replication with 
> tunable storage requirements
> -
>
> Key: CASSANDRA-13442
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13442
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Compaction, Coordination, Distributed Metadata, Local 
> Write-Read Paths
>Reporter: Ariel Weisberg
>
> Replication factors like RF=2 can't provide strong consistency and 
> availability because if a single node is lost it's impossible to reach a 
> quorum of replicas. Stepping up to RF=3 will allow you to lose a node and 
> still achieve quorum for reads and writes, but requires committing additional 
> storage.
> The requirement of a quorum for writes/reads doesn't seem to be something 
> that can be relaxed without additional constraints on queries, but it seems 
> like it should be possible to relax the requirement that 3 full copies of the 
> entire data set are kept. What is actually required is a covering data set 
> for the range and we should be able to achieve a covering data set and high 
> availability without having three full copies. 
> After a repair we know that some subset of the data set is fully replicated. 
> At that point we don't have to read from a quorum of nodes for the repaired 
> data. It is sufficient to read from a single node for the repaired data and a 
> quorum of nodes for the unrepaired data.
> One way to exploit this would be to have N replicas, say the last N replicas 
> (where N varies with RF) in the preference list, delete all repaired data 
> after a repair completes. Subsequent quorum reads will be able to retrieve 
> the repaired data from any of the two full replicas and the unrepaired data 
> from a quorum read of any replica including the "transient" replicas.
> Configuration for something like this in NTS might be something similar to { 
> DC1="3-1", DC2="3-2" } where the first value is the replication factor used 
> for consistency and the second values is the number of transient replicas. If 
> you specify { DC1=3, DC2=3 } then the number of transient 

[jira] [Commented] (CASSANDRA-13127) Materialized Views: View row expires too soon

2017-07-18 Thread DOAN DuyHai (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16091304#comment-16091304
 ] 

DOAN DuyHai commented on CASSANDRA-13127:
-

Hey guys

A question about TTLed columns, to make sure we cover this scenario: when a 
TTLed column expires in the base table, it is replaced by a tombstone on the 
next compaction.

Since this TTL -> tombstone mechanism only occurs during compaction, do we have 
something in the code path that *removes* the corresponding entry in the view?

ping [~jasonstack] [~fsander]

> Materialized Views: View row expires too soon
> -
>
> Key: CASSANDRA-13127
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13127
> Project: Cassandra
>  Issue Type: Bug
>  Components: Local Write-Read Paths, Materialized Views
>Reporter: Duarte Nunes
>Assignee: ZhaoYang
>
> Consider the following commands, ran against trunk:
> {code}
> echo "DROP MATERIALIZED VIEW ks.mv; DROP TABLE ks.base;" | bin/cqlsh
> echo "CREATE TABLE ks.base (p int, c int, v int, PRIMARY KEY (p, c));" | 
> bin/cqlsh
> echo "CREATE MATERIALIZED VIEW ks.mv AS SELECT p, c FROM base WHERE p IS NOT 
> NULL AND c IS NOT NULL PRIMARY KEY (c, p);" | bin/cqlsh
> echo "INSERT INTO ks.base (p, c) VALUES (0, 0) USING TTL 10;" | bin/cqlsh
> # wait for row liveness to get closer to expiration
> sleep 6;
> echo "UPDATE ks.base USING TTL 8 SET v = 0 WHERE p = 0 and c = 0;" | bin/cqlsh
> echo "SELECT p, c, ttl(v) FROM ks.base; SELECT * FROM ks.mv;" | bin/cqlsh
>  p | c | ttl(v)
> ---+---+
>  0 | 0 |  7
> (1 rows)
>  c | p
> ---+---
>  0 | 0
> (1 rows)
> # wait for row liveness to expire
> sleep 4;
> echo "SELECT p, c, ttl(v) FROM ks.base; SELECT * FROM ks.mv;" | bin/cqlsh
>  p | c | ttl(v)
> ---+---+
>  0 | 0 |  3
> (1 rows)
>  c | p
> ---+---
> (0 rows)
> {code}
> Notice how the view row is removed even though the base row is still live. I 
> would say this is because in ViewUpdateGenerator#computeLivenessInfoForEntry 
> the TTLs are compared instead of the expiration times, but I'm not sure I'm 
> getting that far ahead in the code when updating a column that's not in the 
> view.






[jira] [Commented] (CASSANDRA-12245) initial view build can be parallel

2017-07-10 Thread DOAN DuyHai (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16080299#comment-16080299
 ] 

DOAN DuyHai commented on CASSANDRA-12245:
-

[~adelapena] since you change the key of the 
{{system.views_builds_in_progress}} table, how would we handle migration from 
the previous version of the table to the new one? ALTER TABLE does not work on 
primary key columns.

> initial view build can be parallel
> --
>
> Key: CASSANDRA-12245
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12245
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Materialized Views
>Reporter: Tom van der Woerdt
>Assignee: Andrés de la Peña
> Fix For: 4.x
>
>
> On a node with lots of data (~3TB) building a materialized view takes several 
> weeks, which is not ideal. It's doing this in a single thread.
> There are several potential ways this can be optimized :
>  * do vnodes in parallel, instead of going through the entire range in one 
> thread
>  * just iterate through sstables, not worrying about duplicates, and include 
> the timestamp of the original write in the MV mutation. since this doesn't 
> exclude duplicates it does increase the amount of work and could temporarily 
> surface ghost rows (yikes) but I guess that's why they call it eventual 
> consistency. doing it this way can avoid holding references to all tables on 
> disk, allows parallelization, and removes the need to check other sstables 
> for existing data. this is essentially the 'do a full repair' path






[jira] [Commented] (CASSANDRA-10444) Create an option to forcibly disable tracing

2017-03-13 Thread DOAN DuyHai (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15907292#comment-15907292
 ] 

DOAN DuyHai commented on CASSANDRA-10444:
-

Would {{ALTER KEYSPACE system_traces WITH replication=\{ ..., 
'replication_factor': 0\} }} be sufficient for the trick?

> Create an option to forcibly disable tracing
> 
>
> Key: CASSANDRA-10444
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10444
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Brandon Williams
>Priority: Minor
> Fix For: 2.1.x
>
>
> Sometimes people will experience dropped TRACE messages.  Ostensibly, trace 
> is disabled on the server and we know it's from some client, somewhere.  With 
> an inability to locate exactly where client code is causing this, it would be 
> useful to just be able to kill it entirely on the server side.





[jira] [Comment Edited] (CASSANDRA-10444) Create an option to forcibly disable tracing

2017-03-13 Thread DOAN DuyHai (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15907292#comment-15907292
 ] 

DOAN DuyHai edited comment on CASSANDRA-10444 at 3/13/17 11:42 AM:
---

Would _ALTER KEYSPACE system_traces WITH replication=\{ ..., 
'replication_factor': 0\}_ be sufficient for the trick?


was (Author: doanduyhai):
Does {{ALTER KEYSPACE system_traces WITH replication=\{ ., 
'replication_factor': 0\} }} sufficient enough for the trick ?

> Create an option to forcibly disable tracing
> 
>
> Key: CASSANDRA-10444
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10444
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Brandon Williams
>Priority: Minor
> Fix For: 2.1.x
>
>
> Sometimes people will experience dropped TRACE messages.  Ostensibly, trace 
> is disabled on the server and we know it's from some client, somewhere.  With 
> an inability to locate exactly where client code is causing this, it would be 
> useful to just be able to kill it entirely on the server side.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (CASSANDRA-13268) Allow to create custom secondary index on static columns

2017-03-12 Thread DOAN DuyHai (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15906701#comment-15906701
 ] 

DOAN DuyHai commented on CASSANDRA-13268:
-

https://issues.apache.org/jira/browse/CASSANDRA-11183

> Allow to create custom secondary index on static columns
> 
>
> Key: CASSANDRA-13268
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13268
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core, CQL
>Reporter: vincent royer
>Priority: Trivial
>  Labels: features
> Fix For: 3.11.x, 4.x
>
> Attachments: 0001-CASSANDRA-13268-custom-index-on-static-columns.patch
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Custom secondary index implementations (like elassandra) could gain an advantage 
> from indexing static columns, even if not searchable with CQL. Here is a proposal 
> to allow index creation on static columns.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (CASSANDRA-13268) Allow to create custom secondary index on static columns

2017-03-12 Thread DOAN DuyHai (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15906700#comment-15906700
 ] 

DOAN DuyHai commented on CASSANDRA-13268:
-

It is *already* possible to create a custom secondary index on static columns ... 
for example with SASI ...
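
A minimal sketch of what that looks like (keyspace, table, and column names are made 
up for illustration):

{code:sql}
CREATE TABLE ks.sensor_readings (
    sensor_id    int,
    reading_time timestamp,
    owner        text STATIC,    -- static column shared by the whole partition
    value        double,
    PRIMARY KEY (sensor_id, reading_time)
);

CREATE CUSTOM INDEX sensor_owner_idx ON ks.sensor_readings (owner)
USING 'org.apache.cassandra.index.sasi.SASIIndex';
{code}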

> Allow to create custom secondary index on static columns
> 
>
> Key: CASSANDRA-13268
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13268
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core, CQL
>Reporter: vincent royer
>Priority: Trivial
>  Labels: features
> Fix For: 3.11.x, 4.x
>
> Attachments: 0001-CASSANDRA-13268-custom-index-on-static-columns.patch
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Custom secondary index implementations (like elassandra) could gain an advantage 
> from indexing static columns, even if not searchable with CQL. Here is a proposal 
> to allow index creation on static columns.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (CASSANDRA-13315) Consistency is confusing for new users

2017-03-09 Thread DOAN DuyHai (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-13315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15903882#comment-15903882
 ] 

DOAN DuyHai commented on CASSANDRA-13315:
-

As long as you don't remove EACH_QUORUM, I'm fine. There are very rare cases 
(where 2 DCs are geographically very close to each other) where customers want 
EACH_QUORUM to be sure that mutations have been applied in both DCs.
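
For context, a quick cqlsh sketch of that usage (keyspace and table names are made 
up); with EACH_QUORUM the write only succeeds once a quorum of replicas in every 
data center has acknowledged it:

{code:sql}
CONSISTENCY EACH_QUORUM;
INSERT INTO ks.orders (order_id, status) VALUES ('o-1', 'confirmed');
{code}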

> Consistency is confusing for new users
> --
>
> Key: CASSANDRA-13315
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13315
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Ryan Svihla
>
> New users really struggle with consistency level and fall into a large number 
> of tarpits trying to decide on the right one.
> 1. There are a LOT of consistency levels and it's up to the end user to 
> reason about what combinations are valid and what is really what they intend 
> it to be. Is there any reason why write at ALL and read at CL TWO is better 
> than read at CL ONE? 
> 2. They require a good understanding of failure modes to do well. It's not 
> uncommon for people to use CL one and wonder why their data is missing.
> 3. The serial consistency level "bucket" is confusing to even write about and 
> easy to get wrong even for experienced users.
> So I propose the following steps (EDIT based on Jonathan's comment):
> 1. Remove the "serial consistency" level of consistency levels and just have 
> all consistency levels in one bucket to set, conditions still need to be 
> required for SERIAL/LOCAL_SERIAL
> 2. add 3 new consistency levels pointing to existing ones but that infer 
> intent much more cleanly:
> EDIT better names bases on comments.
>* EVENTUALLY = LOCAL_ONE reads and writes
>* STRONG = LOCAL_QUORUM reads and writes
>* SERIAL = LOCAL_SERIAL reads and writes (though a ton of folks dont know 
> what SERIAL means so this is why I suggested TRANSACTIONAL even if its not as 
> correct as Id like)
> for global levels of this I propose keeping the old ones around, they're 
> rarely used in the field except by accident or particularly opinionated and 
> advanced users.
> Drivers should put the new consistency levels in a new package and docs 
> should be updated to suggest their use. Likewise setting default CL should 
> only provide those three settings and applying it for reads and writes at the 
> same time.
> CQLSH I'm gonna suggest should default to HIGHLY_CONSISTENT. New sysadmins 
> get surprised by this frequently and I can think of a couple very major 
> escalations because people were confused what the default behavior was.
> The benefit to all this change is we shrink the surface area that one has to 
> understand when learning Cassandra greatly, and we have far less bad initial 
> experiences and surprises. New users will more likely be able to wrap their 
> brains around those 3 ideas more readily than they can "what happens when I 
> have RF2, QUORUM writes and ONE reads". Advanced users get access to all the 
> way still, while new users don't have to learn all the ins and outs of 
> distributed theory just to write data and be able to read it back.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (CASSANDRA-12915) SASI: Index intersection can be very inefficient

2016-11-24 Thread DOAN DuyHai (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15693771#comment-15693771
 ] 

DOAN DuyHai commented on CASSANDRA-12915:
-

About expression ordering, there is already a priority defined on them, here: 
https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/index/sasi/plan/Operation.java#L343-L370

{code:java|title=Operation.getPriority}
private static int getPriority(Operator op)
{
    switch (op)
    {
        case EQ:
            return 5;

        case LIKE_PREFIX:
        case LIKE_SUFFIX:
        case LIKE_CONTAINS:
        case LIKE_MATCHES:
            return 4;

        case GTE:
        case GT:
            return 3;

        case LTE:
        case LT:
            return 2;

        case NEQ:
            return 1;

        default:
            return 0;
    }
}
{code}
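
As an illustration (hypothetical schema, both columns assumed to be SASI-indexed), 
this ordering means that in a query such as the one below the {{EQ}} expression 
(priority 5) is considered before the {{LIKE}} prefix expression (priority 4):

{code:sql}
SELECT data FROM ks.t WHERE index1 = 'foo' AND index2 LIKE 'bar%';
{code}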

> SASI: Index intersection can be very inefficient
> 
>
> Key: CASSANDRA-12915
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12915
> Project: Cassandra
>  Issue Type: Improvement
>  Components: sasi
>Reporter: Corentin Chary
> Fix For: 3.x
>
>
> It looks like RangeIntersectionIterator.java can be pretty inefficient in 
> some cases. Let's take the following query:
> SELECT data FROM table WHERE index1 = 'foo' AND index2 = 'bar';
> In this case:
> * index1 = 'foo' will match 2 items
> * index2 = 'bar' will match ~300k items
> On my setup, the query will take ~1 sec, most of the time being spent in 
> disk.TokenTree.getTokenAt().
> if I patch RangeIntersectionIterator so that it doesn't try to do the 
> intersection (and effectively only use 'index1') the query will run in a few 
> tenths of a millisecond.
> I see multiple solutions for that:
> * Add a static threshold to avoid the use of the index for the intersection 
> when we know it will be slow. Probably when the range size factor is very 
> small and the range size is big.
> * CASSANDRA-10765



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-12654) Query Validation Error : CQL IN operator over last partitioning|clustering column (valid) is rejected if a query fetches collection columns

2016-10-24 Thread DOAN DuyHai (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15601766#comment-15601766
 ] 

DOAN DuyHai commented on CASSANDRA-12654:
-

Because we need to understand the *rationale* for forbidding the selection of 
collections when using the {{IN}} clause on _clustering columns_.
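
For reference, a minimal sketch of the query shape being rejected (made-up schema):

{code:sql}
CREATE TABLE ks.events (pk int, ck int, tags set<text>, PRIMARY KEY (pk, ck));

-- accepted: IN on the last clustering column, no collection selected
SELECT pk, ck FROM ks.events WHERE pk = 1 AND ck IN (1, 2);

-- rejected by the validator (the restriction under discussion): same
-- restriction, but the select list now contains the collection column
SELECT tags FROM ks.events WHERE pk = 1 AND ck IN (1, 2);
{code}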

> Query Validation Error : CQL IN operator over last partitioning|clustering 
> column (valid) is rejected if a query fetches collection columns
> ---
>
> Key: CASSANDRA-12654
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12654
> Project: Cassandra
>  Issue Type: Bug
>  Components: CQL
>Reporter: Samba Siva Rao Kolusu
>Assignee: Alex Petrov
>Priority: Minor
>  Labels: easyfix
> Fix For: 3.x
>
>
> although IN operator is allowed over the last clustering or partitioning 
> columns, the CQL Query Validator is rejecting queries when they attempt to 
> fetch collection columns in their result set.
> It seems a similar bug (CASSANDRA-5376) was raised some time ago, and a fix 
> (rather mask) was provided to give a better error message to such queries in 
> 1.2.4. 
> Considering that Cassandra & CQL has evolved a great deal from that period, 
> it now seems possible to provide an actual fix to this problem, i.e. allowing 
> queries to fetch collection columns even when IN operator is used.
> please read the following mail thread to understand the context : 
> https://lists.apache.org/thread.html/8e1765d14bd9798bf9c0938a793f1dbc9c9349062a8705db2e28d291@%3Cuser.cassandra.apache.org%3E



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-12654) Query Validation Error : CQL IN operator over last partitioning|clustering column (valid) is rejected if a query fetches collection columns

2016-10-24 Thread DOAN DuyHai (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15601618#comment-15601618
 ] 

DOAN DuyHai commented on CASSANDRA-12654:
-

Ok, so we need to dig further into the Git history to find the original commit.

> Query Validation Error : CQL IN operator over last partitioning|clustering 
> column (valid) is rejected if a query fetches collection columns
> ---
>
> Key: CASSANDRA-12654
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12654
> Project: Cassandra
>  Issue Type: Bug
>  Components: CQL
>Reporter: Samba Siva Rao Kolusu
>Assignee: Alex Petrov
>Priority: Minor
>  Labels: easyfix
> Fix For: 3.x
>
>
> although IN operator is allowed over the last clustering or partitioning 
> columns, the CQL Query Validator is rejecting queries when they attempt to 
> fetch collection columns in their result set.
> It seems a similar bug (CASSANDRA-5376) was raised some time ago, and a fix 
> (rather mask) was provided to give a better error message to such queries in 
> 1.2.4. 
> Considering that Cassandra & CQL has evolved a great deal from that period, 
> it now seems possible to provide an actual fix to this problem, i.e. allowing 
> queries to fetch collection columns even when IN operator is used.
> please read the following mail thread to understand the context : 
> https://lists.apache.org/thread.html/8e1765d14bd9798bf9c0938a793f1dbc9c9349062a8705db2e28d291@%3Cuser.cassandra.apache.org%3E



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-12654) Query Validation Error : CQL IN operator over last partitioning|clustering column (valid) is rejected if a query fetches collection columns

2016-10-23 Thread DOAN DuyHai (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15599554#comment-15599554
 ] 

DOAN DuyHai commented on CASSANDRA-12654:
-

This restriction on collections when using the {{IN}} clause on clustering columns 
was committed as part of CASSANDRA-7981.

Ping [~blerer] for more details

> Query Validation Error : CQL IN operator over last partitioning|clustering 
> column (valid) is rejected if a query fetches collection columns
> ---
>
> Key: CASSANDRA-12654
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12654
> Project: Cassandra
>  Issue Type: Bug
>  Components: CQL
>Reporter: Samba Siva Rao Kolusu
>Assignee: Alex Petrov
>Priority: Minor
>  Labels: easyfix
> Fix For: 3.x
>
>
> although IN operator is allowed over the last clustering or partitioning 
> columns, the CQL Query Validator is rejecting queries when they attempt to 
> fetch collection columns in their result set.
> It seems a similar bug (CASSANDRA-5376) was raised some time ago, and a fix 
> (rather mask) was provided to give a better error message to such queries in 
> 1.2.4. 
> Considering that Cassandra & CQL has evolved a great deal from that period, 
> it now seems possible to provide an actual fix to this problem, i.e. allowing 
> queries to fetch collection columns even when IN operator is used.
> please read the following mail thread to understand the context : 
> https://lists.apache.org/thread.html/8e1765d14bd9798bf9c0938a793f1dbc9c9349062a8705db2e28d291@%3Cuser.cassandra.apache.org%3E



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-9754) Make index info heap friendly for large CQL partitions

2016-10-15 Thread DOAN DuyHai (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15578618#comment-15578618
 ] 

DOAN DuyHai commented on CASSANDRA-9754:


Stupid question: how do those improvements affect compaction? Did you also 
monitor the compaction time during your benchmark tests and compare the time 
taken by each implementation?

> Make index info heap friendly for large CQL partitions
> --
>
> Key: CASSANDRA-9754
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9754
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: sankalp kohli
>Assignee: Michael Kjellman
>Priority: Minor
> Fix For: 4.x
>
> Attachments: gc_collection_times_with_birch.png, 
> gc_collection_times_without_birch.png, gc_counts_with_birch.png, 
> gc_counts_without_birch.png, 
> perf_cluster_1_with_birch_read_latency_and_counts.png, 
> perf_cluster_1_with_birch_write_latency_and_counts.png, 
> perf_cluster_2_with_birch_read_latency_and_counts.png, 
> perf_cluster_2_with_birch_write_latency_and_counts.png, 
> perf_cluster_3_without_birch_read_latency_and_counts.png, 
> perf_cluster_3_without_birch_write_latency_and_counts.png
>
>
>  Looking at a heap dump of 2.0 cluster, I found that majority of the objects 
> are IndexInfo and its ByteBuffers. This is specially bad in endpoints with 
> large CQL partitions. If a CQL partition is say 6,4GB, it will have 100K 
> IndexInfo objects and 200K ByteBuffers. This will create a lot of churn for 
> GC. Can this be improved by not creating so many objects?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-12573) SASI index. Incorrect results for '%foo%bar%'-like search pattern.

2016-09-20 Thread DOAN DuyHai (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15506923#comment-15506923
 ] 

DOAN DuyHai commented on CASSANDRA-12573:
-

[~mkrupits]

 I've reproduced the issue and found the root cause. For the sake of clarity, I 
have created another JIRA, [CASSANDRA-12674], explaining the issue in detail. 
Can you please close this one?

> SASI index. Incorrect results for '%foo%bar%'-like search pattern. 
> ---
>
> Key: CASSANDRA-12573
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12573
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Mikhail Krupitskiy
>Priority: Critical
>  Labels: sasi
>
> We use Cassandra 3.7 and have faced a strange behaviour of SELECT requests 
> with "LIKE '%foo%bar%'" constraints on a column with SASI index.
> Below are few experiments that show this behaviour.
> Experiment 1:
> {noformat}
> drop keyspace if exists kmv;
> create keyspace if not exists kmv WITH REPLICATION = { 'class' : 
> 'SimpleStrategy', 'replication_factor':'1'} ;
> use kmv;
> CREATE TABLE if not exists kmv (id int primary key, c1 text, c2 text);
> CREATE CUSTOM INDEX ON kmv.kmv  ( c2 ) USING 
> 'org.apache.cassandra.index.sasi.SASIIndex' WITH OPTIONS = {
>  'mode': 'CONTAINS'
> };
> insert into kmv (id, c1, c2) values (1, 'f21', 'qwe') ;
> insert into kmv (id, c1, c2) values (2, 'f22', 'qweasd') ;
> insert into kmv (id, c1, c2) values (3, 'f23', 'qwea1') ;
> insert into kmv (id, c1, c2) values (4, 'f24', '1qwe') ;
> insert into kmv (id, c1, c2) values (5, 'f25', 'asdqwe') ;
> select c2 from kmv.kmv where c2 like '%w%a%';
> {noformat}
> Expected result: qweasd, qwea1.
> Actual result: no rows.
> Experiment 2 (NOTE: definition of index is changed):
> {noformat}
> drop keyspace if exists kmv;
> create keyspace if not exists kmv WITH REPLICATION = { 'class' : 
> 'SimpleStrategy', 'replication_factor':'1'} ;
> use kmv;
> CREATE TABLE if not exists kmv (id int primary key, c1 text, c2 text);
> CREATE CUSTOM INDEX ON kmv.kmv  ( c2 ) USING 
> 'org.apache.cassandra.index.sasi.SASIIndex' WITH OPTIONS = {
>  'mode': 'CONTAINS',
>  'analyzer_class': 
> 'org.apache.cassandra.index.sasi.analyzer.StandardAnalyzer',
>  'analyzed': 'true'
> };
> insert into kmv (id, c1, c2) values (1, 'f21', 'qwe') ;
> insert into kmv (id, c1, c2) values (2, 'f22', 'qweasd') ;
> insert into kmv (id, c1, c2) values (3, 'f23', 'qwea1') ;
> insert into kmv (id, c1, c2) values (4, 'f24', '1qwe') ;
> insert into kmv (id, c1, c2) values (5, 'f25', 'asdqwe') ;
> select c2 from kmv.kmv where c2 like '%w%a%';
> {noformat}
> Expected result: qweasd, qwea1.
> Actual result: asdqwe, qweasd, qwea1.
> Experiment 3 (NOTE: primary key is compound now and inserted data was 
> changed):
> {noformat}
> drop keyspace if exists kmv;
> create keyspace if not exists kmv WITH REPLICATION = { 'class' : 
> 'SimpleStrategy', 'replication_factor':'1'} ;
> use kmv;
> CREATE TABLE if not exists kmv (id int, c1 text, c2 text, PRIMARY KEY(id, 
> c1));
> CREATE CUSTOM INDEX ON kmv.kmv  ( c2 ) USING 
> 'org.apache.cassandra.index.sasi.SASIIndex' WITH OPTIONS = {
>  'mode': 'CONTAINS',
>  'analyzer_class': 
> 'org.apache.cassandra.index.sasi.analyzer.StandardAnalyzer',
>  'analyzed': 'true'
> };
> insert into kmv (id, c1, c2) values (1, 'f21', 'qwe') ;
> insert into kmv (id, c1, c2) values (1, 'f22', 'qweasd') ;
> insert into kmv (id, c1, c2) values (1, 'f23', 'qwea1') ;
> insert into kmv (id, c1, c2) values (1, 'f24', '1qwe') ;
> insert into kmv (id, c1, c2) values (1, 'f25', 'asdqwe') ;
> select c2 from kmv.kmv where c2 like '%w%a%';
> {noformat}
> Expected result: qweasd, qwea1.
> Actual result: qwe, qweasd, qwea1, 1qwe, asdqwe.
> Experiment 4 (NOTE: search criteria is changed):
> {noformat}
> drop keyspace if exists kmv;
> create keyspace if not exists kmv WITH REPLICATION = { 'class' : 
> 'SimpleStrategy', 'replication_factor':'1'} ;
> use kmv;
> CREATE TABLE if not exists kmv (id int, c1 text, c2 text, PRIMARY KEY(id, 
> c1));
> CREATE CUSTOM INDEX ON kmv.kmv  ( c2 ) USING 
> 'org.apache.cassandra.index.sasi.SASIIndex' WITH OPTIONS = {
>  'mode': 'CONTAINS',
>  'analyzer_class': 
> 'org.apache.cassandra.index.sasi.analyzer.StandardAnalyzer',
>  'analyzed': 'true'
> };
> insert into kmv (id, c1, c2) values (1, 'f21', 'qwe') ;
> insert into kmv (id, c1, c2) values (1, 'f22', 'qweasd') ;
> insert into kmv (id, c1, c2) values (1, 'f23', 'qwea1') ;
> insert into kmv (id, c1, c2) values (1, 'f24', '1qwe') ;
> insert into kmv (id, c1, c2) values (1, 'f25', 'asdqwe') ;
> select c2 from kmv.kmv where c2 like '%w22%a%';
> {noformat}
> Expected result: no rows.
> Actual result: qweasd, qwea1, asdqwe.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (CASSANDRA-12674) [SASI] Confusing AND/OR semantics for StandardAnalyzer

2016-09-20 Thread DOAN DuyHai (JIRA)
DOAN DuyHai created CASSANDRA-12674:
---

 Summary: [SASI] Confusing AND/OR semantics for StandardAnalyzer 
 Key: CASSANDRA-12674
 URL: https://issues.apache.org/jira/browse/CASSANDRA-12674
 Project: Cassandra
  Issue Type: Bug
  Components: sasi
 Environment: Cassandra 3.7
Reporter: DOAN DuyHai



{code:sql}
Connected to Test Cluster at 127.0.0.1:9042.
[cqlsh 5.0.1 | Cassandra 3.7 | CQL spec 3.4.2 | Native protocol v4]
Use HELP for help.
cqlsh> use test;
cqlsh:test> CREATE TABLE sasi_bug(id int, clustering int, val text, PRIMARY 
KEY((id), clustering));
cqlsh:test> CREATE CUSTOM INDEX ON sasi_bug(val) USING 
'org.apache.cassandra.index.sasi.SASIIndex' WITH OPTIONS = {
'mode': 'CONTAINS',
 'analyzer_class': 
'org.apache.cassandra.index.sasi.analyzer.StandardAnalyzer',
'analyzed': 'true'};

//1st example SAME PARTITION KEY
cqlsh:test> INSERT INTO sasi_bug(id, clustering , val ) VALUES(1, 1, 
'homeworker');
cqlsh:test> INSERT INTO sasi_bug(id, clustering , val ) VALUES(1, 2, 
'hardworker');
cqlsh:test> SELECT * FROM sasi_bug WHERE val LIKE '%work home%';

 id | clustering | val
++
  1 |  1 | homeworker
  1 |  2 | hardworker

(2 rows)

//2nd example DIFFERENT PARTITION KEY
cqlsh:test> INSERT INTO sasi_bug(id, clustering, val) VALUES(10, 1, 'speedrun');
cqlsh:test> INSERT INTO sasi_bug(id, clustering, val) VALUES(11, 1, 'longrun');
cqlsh:test> SELECT * FROM sasi_bug WHERE val LIKE '%long run%';

 id | clustering | val
++-
 11 |  1 | longrun

(1 rows)
{code}

In the 1st example, both rows belong to the same partition, so SASI returns both 
values. Indeed, {{LIKE '%work home%'}} means {{contains 'work' OR 'home'}}, so 
the result makes sense.

In the 2nd example, only one row is returned whereas we expect 2 rows, because 
{{LIKE '%long run%'}} means {{contains 'long' OR 'run'}}, so *speedrun* should 
be returned too.

So where is the problem ? Explanation:

When there is only 1 predicate, the root operation type is an *AND*:

{code:java|title=QueryPlan}
private Operation analyze()
{
    try
    {
        Operation.Builder and = new Operation.Builder(OperationType.AND, controller);
        controller.getExpressions().forEach(and::add);
        return and.complete();
    }
    ...
}
{code}

During the parsing of {{LIKE '%long run%'}}, SASI creates 2 expressions for the 
searched term, {{long}} and {{run}}, which correspond to *OR* logic. However, 
this piece of code ruins the *OR* logic:

{code:java|title=Operation}
public Operation complete()
{
    if (!expressions.isEmpty())
    {
        ListMultimap analyzedExpressions = analyzeGroup(controller, op, expressions);
        RangeIterator.Builder range = controller.getIndexes(op, analyzedExpressions.values());
        ...
    }
}
{code}

As you can see, we blindly take all the *values* of the MultiMap (which 
contains a single entry for the {{val}} column with 2 expressions) and pass 
them to {{controller.getIndexes(...)}}

{code:java|title=QueryController}
public RangeIterator.Builder getIndexes(OperationType op, Collection expressions)
{
    if (resources.containsKey(expressions))
        throw new IllegalArgumentException("Can't process the same expressions multiple times.");

    RangeIterator.Builder builder = op == OperationType.OR
                                  ? RangeUnionIterator.builder()
                                  : RangeIntersectionIterator.builder();
    ...
}
{code}

And because the root operation has the *AND* type, the 
{{RangeIntersectionIterator}} will be used on both expressions, {{long}} and 
{{run}}.

So when the data belong to different partitions, the *AND* logic applies and 
eliminates _speedrun_.

When the data belong to the same partition but to different rows, the 
{{RangeIntersectionIterator}} returns a single partition, the rows are then 
filtered further by {{operationTree.satisfiedBy}}, and the results are correct.

{code:java|title=QueryPlan}
while (currentKeys.hasNext())
{
    DecoratedKey key = currentKeys.next();

    if (!keyRange.right.isMinimum() && keyRange.right.compareTo(key) < 0)
        return endOfData();

    try (UnfilteredRowIterator partition = controller.getPartition(key, executionController))
    {
        Row staticRow = partition.staticRow();
        List clusters = new ArrayList<>();

        while (partition.hasNext())
        {
            Unfiltered row = partition.next();

[jira] [Comment Edited] (CASSANDRA-12573) SASI index. Incorrect results for '%foo%bar%'-like search pattern.

2016-09-18 Thread DOAN DuyHai (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15501472#comment-15501472
 ] 

DOAN DuyHai edited comment on CASSANDRA-12573 at 9/18/16 6:59 PM:
--

Ok it's my bad.  The root of the operation tree for the QueryPlanner is an 
{{AND}}

https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/index/sasi/plan/QueryPlan.java#L54-L60

The {{'%RevisionDiff%ItemImpl%'}} is split into 2 distinct predicates, 
{{CONTAINS RevisionDiff}} & {{CONTAINS ItemImpl}}, and the *AND* logic does 
apply.

 The comment in the source code is pretty misleading.

Back to the original experiments: exp. 1 is consistent, and the exp. 2 and 4 
results are also consistent.

Only experiment 3 results are wrong:

{code:sql}
insert into kmv (id, c1, c2) values (1, 'f21', 'qwe') ;
insert into kmv (id, c1, c2) values (1, 'f22', 'qweasd') ;
insert into kmv (id, c1, c2) values (1, 'f23', 'qwea1') ;
insert into kmv (id, c1, c2) values (1, 'f24', '1qwe') ;
insert into kmv (id, c1, c2) values (1, 'f25', 'asdqwe') ;

select c2 from kmv.kmv where c2 like '%w%a%';
{code}

Expected result: qweasd, qwea1.
Actual result: qwe, qweasd, qwea1, 1qwe, asdqwe.

 Let me reproduce it


was (Author: doanduyhai):
Ok it's my bad.  The root of the operation tree for the QueryPlanner is an 
{{AND}}

https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/index/sasi/plan/QueryPlan.java#L54-L60

The {{'%RevisionDiff%ItemImpl%'}} is split into 2 distincts predicates : 
{{CONTAINS RevisionDiff}} &  {{CONTAINS ItemImpl}} and the **AND** logic does 
apply.

 The comment in the source code is pretty misleading.

Back to the original experiments, exp. 1 is consistent, exp. 2 and 4 results 
are also consistent

Only experiment 3 results are wrong:

```sql
insert into kmv (id, c1, c2) values (1, 'f21', 'qwe') ;
insert into kmv (id, c1, c2) values (1, 'f22', 'qweasd') ;
insert into kmv (id, c1, c2) values (1, 'f23', 'qwea1') ;
insert into kmv (id, c1, c2) values (1, 'f24', '1qwe') ;
insert into kmv (id, c1, c2) values (1, 'f25', 'asdqwe') ;

select c2 from kmv.kmv where c2 like '%w%a%';

```

Expected result: qweasd, qwea1.
Actual result: qwe, qweasd, qwea1, 1qwe, asdqwe.

 Let me reproduce it

> SASI index. Incorrect results for '%foo%bar%'-like search pattern. 
> ---
>
> Key: CASSANDRA-12573
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12573
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Mikhail Krupitskiy
>Priority: Critical
>  Labels: sasi
>
> We use Cassandra 3.7 and have faced a strange behaviour of SELECT requests 
> with "LIKE '%foo%bar%'" constraints on a column with SASI index.
> Below are few experiments that show this behaviour.
> Experiment 1:
> {noformat}
> drop keyspace if exists kmv;
> create keyspace if not exists kmv WITH REPLICATION = { 'class' : 
> 'SimpleStrategy', 'replication_factor':'1'} ;
> use kmv;
> CREATE TABLE if not exists kmv (id int primary key, c1 text, c2 text);
> CREATE CUSTOM INDEX ON kmv.kmv  ( c2 ) USING 
> 'org.apache.cassandra.index.sasi.SASIIndex' WITH OPTIONS = {
>  'mode': 'CONTAINS'
> };
> insert into kmv (id, c1, c2) values (1, 'f21', 'qwe') ;
> insert into kmv (id, c1, c2) values (2, 'f22', 'qweasd') ;
> insert into kmv (id, c1, c2) values (3, 'f23', 'qwea1') ;
> insert into kmv (id, c1, c2) values (4, 'f24', '1qwe') ;
> insert into kmv (id, c1, c2) values (5, 'f25', 'asdqwe') ;
> select c2 from kmv.kmv where c2 like '%w%a%';
> {noformat}
> Expected result: qweasd, qwea1.
> Actual result: no rows.
> Experiment 2 (NOTE: definition of index is changed):
> {noformat}
> drop keyspace if exists kmv;
> create keyspace if not exists kmv WITH REPLICATION = { 'class' : 
> 'SimpleStrategy', 'replication_factor':'1'} ;
> use kmv;
> CREATE TABLE if not exists kmv (id int primary key, c1 text, c2 text);
> CREATE CUSTOM INDEX ON kmv.kmv  ( c2 ) USING 
> 'org.apache.cassandra.index.sasi.SASIIndex' WITH OPTIONS = {
>  'mode': 'CONTAINS',
>  'analyzer_class': 
> 'org.apache.cassandra.index.sasi.analyzer.StandardAnalyzer',
>  'analyzed': 'true'
> };
> insert into kmv (id, c1, c2) values (1, 'f21', 'qwe') ;
> insert into kmv (id, c1, c2) values (2, 'f22', 'qweasd') ;
> insert into kmv (id, c1, c2) values (3, 'f23', 'qwea1') ;
> insert into kmv (id, c1, c2) values (4, 'f24', '1qwe') ;
> insert into kmv (id, c1, c2) values (5, 'f25', 'asdqwe') ;
> select c2 from kmv.kmv where c2 like '%w%a%';
> {noformat}
> Expected result: qweasd, qwea1.
> Actual result: asdqwe, qweasd, qwea1.
> Experiment 3 (NOTE: primary key is compound now and inserted data was 
> changed):
> {noformat}
> drop keyspace if exists kmv;
> create keyspace if not exists kmv WITH REPLICATION = { 'class' : 
> 'SimpleStrategy', 

[jira] [Commented] (CASSANDRA-12573) SASI index. Incorrect results for '%foo%bar%'-like search pattern.

2016-09-18 Thread DOAN DuyHai (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15501472#comment-15501472
 ] 

DOAN DuyHai commented on CASSANDRA-12573:
-

Ok it's my bad.  The root of the operation tree for the QueryPlanner is an 
{{AND}}

https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/index/sasi/plan/QueryPlan.java#L54-L60

The {{'%RevisionDiff%ItemImpl%'}} is split into 2 distinct predicates, 
{{CONTAINS RevisionDiff}} & {{CONTAINS ItemImpl}}, and the **AND** logic does 
apply.

 The comment in the source code is pretty misleading.

Back to the original experiments: exp. 1 is consistent, and the exp. 2 and 4 
results are also consistent.

Only experiment 3 results are wrong:

```sql
insert into kmv (id, c1, c2) values (1, 'f21', 'qwe') ;
insert into kmv (id, c1, c2) values (1, 'f22', 'qweasd') ;
insert into kmv (id, c1, c2) values (1, 'f23', 'qwea1') ;
insert into kmv (id, c1, c2) values (1, 'f24', '1qwe') ;
insert into kmv (id, c1, c2) values (1, 'f25', 'asdqwe') ;

select c2 from kmv.kmv where c2 like '%w%a%';

```

Expected result: qweasd, qwea1.
Actual result: qwe, qweasd, qwea1, 1qwe, asdqwe.

 Let me reproduce it

> SASI index. Incorrect results for '%foo%bar%'-like search pattern. 
> ---
>
> Key: CASSANDRA-12573
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12573
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Mikhail Krupitskiy
>Priority: Critical
>  Labels: sasi
>
> We use Cassandra 3.7 and have faced a strange behaviour of SELECT requests 
> with "LIKE '%foo%bar%'" constraints on a column with SASI index.
> Below are few experiments that show this behaviour.
> Experiment 1:
> {noformat}
> drop keyspace if exists kmv;
> create keyspace if not exists kmv WITH REPLICATION = { 'class' : 
> 'SimpleStrategy', 'replication_factor':'1'} ;
> use kmv;
> CREATE TABLE if not exists kmv (id int primary key, c1 text, c2 text);
> CREATE CUSTOM INDEX ON kmv.kmv  ( c2 ) USING 
> 'org.apache.cassandra.index.sasi.SASIIndex' WITH OPTIONS = {
>  'mode': 'CONTAINS'
> };
> insert into kmv (id, c1, c2) values (1, 'f21', 'qwe') ;
> insert into kmv (id, c1, c2) values (2, 'f22', 'qweasd') ;
> insert into kmv (id, c1, c2) values (3, 'f23', 'qwea1') ;
> insert into kmv (id, c1, c2) values (4, 'f24', '1qwe') ;
> insert into kmv (id, c1, c2) values (5, 'f25', 'asdqwe') ;
> select c2 from kmv.kmv where c2 like '%w%a%';
> {noformat}
> Expected result: qweasd, qwea1.
> Actual result: no rows.
> Experiment 2 (NOTE: definition of index is changed):
> {noformat}
> drop keyspace if exists kmv;
> create keyspace if not exists kmv WITH REPLICATION = { 'class' : 
> 'SimpleStrategy', 'replication_factor':'1'} ;
> use kmv;
> CREATE TABLE if not exists kmv (id int primary key, c1 text, c2 text);
> CREATE CUSTOM INDEX ON kmv.kmv  ( c2 ) USING 
> 'org.apache.cassandra.index.sasi.SASIIndex' WITH OPTIONS = {
>  'mode': 'CONTAINS',
>  'analyzer_class': 
> 'org.apache.cassandra.index.sasi.analyzer.StandardAnalyzer',
>  'analyzed': 'true'
> };
> insert into kmv (id, c1, c2) values (1, 'f21', 'qwe') ;
> insert into kmv (id, c1, c2) values (2, 'f22', 'qweasd') ;
> insert into kmv (id, c1, c2) values (3, 'f23', 'qwea1') ;
> insert into kmv (id, c1, c2) values (4, 'f24', '1qwe') ;
> insert into kmv (id, c1, c2) values (5, 'f25', 'asdqwe') ;
> select c2 from kmv.kmv where c2 like '%w%a%';
> {noformat}
> Expected result: qweasd, qwea1.
> Actual result: asdqwe, qweasd, qwea1.
> Experiment 3 (NOTE: primary key is compound now and inserted data was 
> changed):
> {noformat}
> drop keyspace if exists kmv;
> create keyspace if not exists kmv WITH REPLICATION = { 'class' : 
> 'SimpleStrategy', 'replication_factor':'1'} ;
> use kmv;
> CREATE TABLE if not exists kmv (id int, c1 text, c2 text, PRIMARY KEY(id, 
> c1));
> CREATE CUSTOM INDEX ON kmv.kmv  ( c2 ) USING 
> 'org.apache.cassandra.index.sasi.SASIIndex' WITH OPTIONS = {
>  'mode': 'CONTAINS',
>  'analyzer_class': 
> 'org.apache.cassandra.index.sasi.analyzer.StandardAnalyzer',
>  'analyzed': 'true'
> };
> insert into kmv (id, c1, c2) values (1, 'f21', 'qwe') ;
> insert into kmv (id, c1, c2) values (1, 'f22', 'qweasd') ;
> insert into kmv (id, c1, c2) values (1, 'f23', 'qwea1') ;
> insert into kmv (id, c1, c2) values (1, 'f24', '1qwe') ;
> insert into kmv (id, c1, c2) values (1, 'f25', 'asdqwe') ;
> select c2 from kmv.kmv where c2 like '%w%a%';
> {noformat}
> Expected result: qweasd, qwea1.
> Actual result: qwe, qweasd, qwea1, 1qwe, asdqwe.
> Experiment 4 (NOTE: search criteria is changed):
> {noformat}
> drop keyspace if exists kmv;
> create keyspace if not exists kmv WITH REPLICATION = { 'class' : 
> 'SimpleStrategy', 'replication_factor':'1'} ;
> use kmv;
> CREATE TABLE if not exists kmv (id int, c1 text, c2 text, PRIMARY 

[jira] [Commented] (CASSANDRA-12573) SASI index. Incorrect results for '%foo%bar%'-like search pattern.

2016-09-18 Thread DOAN DuyHai (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15501454#comment-15501454
 ] 

DOAN DuyHai commented on CASSANDRA-12573:
-

Let me reproduce your results with a unit test.

> SASI index. Incorrect results for '%foo%bar%'-like search pattern. 
> ---
>
> Key: CASSANDRA-12573
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12573
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Mikhail Krupitskiy
>Priority: Critical
>  Labels: sasi
>
> We use Cassandra 3.7 and have faced a strange behaviour of SELECT requests 
> with "LIKE '%foo%bar%'" constraints on a column with SASI index.
> Below are few experiments that show this behaviour.
> Experiment 1:
> {noformat}
> drop keyspace if exists kmv;
> create keyspace if not exists kmv WITH REPLICATION = { 'class' : 
> 'SimpleStrategy', 'replication_factor':'1'} ;
> use kmv;
> CREATE TABLE if not exists kmv (id int primary key, c1 text, c2 text);
> CREATE CUSTOM INDEX ON kmv.kmv  ( c2 ) USING 
> 'org.apache.cassandra.index.sasi.SASIIndex' WITH OPTIONS = {
>  'mode': 'CONTAINS'
> };
> insert into kmv (id, c1, c2) values (1, 'f21', 'qwe') ;
> insert into kmv (id, c1, c2) values (2, 'f22', 'qweasd') ;
> insert into kmv (id, c1, c2) values (3, 'f23', 'qwea1') ;
> insert into kmv (id, c1, c2) values (4, 'f24', '1qwe') ;
> insert into kmv (id, c1, c2) values (5, 'f25', 'asdqwe') ;
> select c2 from kmv.kmv where c2 like '%w%a%';
> {noformat}
> Expected result: qweasd, qwea1.
> Actual result: no rows.
> Experiment 2 (NOTE: definition of index is changed):
> {noformat}
> drop keyspace if exists kmv;
> create keyspace if not exists kmv WITH REPLICATION = { 'class' : 
> 'SimpleStrategy', 'replication_factor':'1'} ;
> use kmv;
> CREATE TABLE if not exists kmv (id int primary key, c1 text, c2 text);
> CREATE CUSTOM INDEX ON kmv.kmv  ( c2 ) USING 
> 'org.apache.cassandra.index.sasi.SASIIndex' WITH OPTIONS = {
>  'mode': 'CONTAINS',
>  'analyzer_class': 
> 'org.apache.cassandra.index.sasi.analyzer.StandardAnalyzer',
>  'analyzed': 'true'
> };
> insert into kmv (id, c1, c2) values (1, 'f21', 'qwe') ;
> insert into kmv (id, c1, c2) values (2, 'f22', 'qweasd') ;
> insert into kmv (id, c1, c2) values (3, 'f23', 'qwea1') ;
> insert into kmv (id, c1, c2) values (4, 'f24', '1qwe') ;
> insert into kmv (id, c1, c2) values (5, 'f25', 'asdqwe') ;
> select c2 from kmv.kmv where c2 like '%w%a%';
> {noformat}
> Expected result: qweasd, qwea1.
> Actual result: asdqwe, qweasd, qwea1.
> Experiment 3 (NOTE: primary key is compound now and inserted data was 
> changed):
> {noformat}
> drop keyspace if exists kmv;
> create keyspace if not exists kmv WITH REPLICATION = { 'class' : 
> 'SimpleStrategy', 'replication_factor':'1'} ;
> use kmv;
> CREATE TABLE if not exists kmv (id int, c1 text, c2 text, PRIMARY KEY(id, 
> c1));
> CREATE CUSTOM INDEX ON kmv.kmv  ( c2 ) USING 
> 'org.apache.cassandra.index.sasi.SASIIndex' WITH OPTIONS = {
>  'mode': 'CONTAINS',
>  'analyzer_class': 
> 'org.apache.cassandra.index.sasi.analyzer.StandardAnalyzer',
>  'analyzed': 'true'
> };
> insert into kmv (id, c1, c2) values (1, 'f21', 'qwe') ;
> insert into kmv (id, c1, c2) values (1, 'f22', 'qweasd') ;
> insert into kmv (id, c1, c2) values (1, 'f23', 'qwea1') ;
> insert into kmv (id, c1, c2) values (1, 'f24', '1qwe') ;
> insert into kmv (id, c1, c2) values (1, 'f25', 'asdqwe') ;
> select c2 from kmv.kmv where c2 like '%w%a%';
> {noformat}
> Expected result: qweasd, qwea1.
> Actual result: qwe, qweasd, qwea1, 1qwe, asdqwe.
> Experiment 4 (NOTE: search criteria is changed):
> {noformat}
> drop keyspace if exists kmv;
> create keyspace if not exists kmv WITH REPLICATION = { 'class' : 
> 'SimpleStrategy', 'replication_factor':'1'} ;
> use kmv;
> CREATE TABLE if not exists kmv (id int, c1 text, c2 text, PRIMARY KEY(id, 
> c1));
> CREATE CUSTOM INDEX ON kmv.kmv  ( c2 ) USING 
> 'org.apache.cassandra.index.sasi.SASIIndex' WITH OPTIONS = {
>  'mode': 'CONTAINS',
>  'analyzer_class': 
> 'org.apache.cassandra.index.sasi.analyzer.StandardAnalyzer',
>  'analyzed': 'true'
> };
> insert into kmv (id, c1, c2) values (1, 'f21', 'qwe') ;
> insert into kmv (id, c1, c2) values (1, 'f22', 'qweasd') ;
> insert into kmv (id, c1, c2) values (1, 'f23', 'qwea1') ;
> insert into kmv (id, c1, c2) values (1, 'f24', '1qwe') ;
> insert into kmv (id, c1, c2) values (1, 'f25', 'asdqwe') ;
> select c2 from kmv.kmv where c2 like '%w22%a%';
> {noformat}
> Expected result: no rows.
> Actual result: qweasd, qwea1, asdqwe.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-12662) OOM when using SASI index

2016-09-18 Thread DOAN DuyHai (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15501440#comment-15501440
 ] 

DOAN DuyHai commented on CASSANDRA-12662:
-

Default hardcoded value for memIndexTable is 1Gb: 
https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/index/sasi/disk/PerSSTableIndexWriter.java#L369-L372

bq. 2.8Gb of the heap is taken by the index data, pending for flush (see the 
screenshot)

When you have more than 1Gb of index data, SASI flushes the index in chunks of 
1Gb into temporary index files: 
https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/index/sasi/disk/PerSSTableIndexWriter.java#L247-L250

Then it needs a 2nd pass to merge them in memory and write the final index 
file, see here:  
https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/index/sasi/disk/PerSSTableIndexWriter.java#L311-L326

Thus, it may take a while to write the final SASI index file. The speed of all 
this depends on many factors, mainly CPU and disk I/O. 

What are your disk hardware specs? SSD? Spinning disk? Shared storage?

bq. Why can't Cassandra keep up with the inserted data and flush it?

Writes are CPU-intensive; compactions are more disk-I/O intensive.

bq. What resources/configuration should be changed to improve the performance?

Right now, a 4-core CPU is below the official recommendation for running Cassandra 
in production, which is 8 cores. Same for RAM: the recommendation is 32Gb, see 
here: http://cassandra.apache.org/doc/latest/operating/hardware.html


> OOM when using SASI index
> -
>
> Key: CASSANDRA-12662
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12662
> Project: Cassandra
>  Issue Type: Bug
> Environment: Linux, 4 CPU cores, 16Gb RAM, Cassandra process utilizes 
> ~8Gb, of which ~4Gb is Java heap
>Reporter: Maxim Podkolzine
>Priority: Critical
> Fix For: 3.6
>
> Attachments: memory-dump.png
>
>
> 2.8Gb of the heap is taken by the index data, pending for flush (see the 
> screenshot). As a result the node fails with OOM.
> Questions:
> - Why can't Cassandra keep up with the inserted data and flush it?
> - What resources/configuration should be changed to improve the performance?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-12573) SASI index. Incorrect results for '%foo%bar%'-like search pattern.

2016-09-18 Thread DOAN DuyHai (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15501413#comment-15501413
 ] 

DOAN DuyHai commented on CASSANDRA-12573:
-

bq. That's good news. When do you plan to merge it?
See this JIRA:  [CASSANDRA-10765] (second comment)

bq. As a customer I have a slightly different view on this. My expectations are 
based on prior experience and common sense.

 What are you talking about? Customer of what? Apache Cassandra is open-source.

bq. My current impression is that this feature is half-baked and not well 
tested. But it's just my opinion.

Well, those are the risks of open-source software: you don't have any strong 
guarantees/SLAs whatsoever. But you can contribute to improve SASI. Any pull 
request is welcome, of course. The community will be more than happy to have 
contributors.

bq. After that I run the queries with '%' inside. As you can see multi-patterns 
are handled by AND:

Absolutely not. Your examples just show how the index mode {{CONTAINS}} works. 

The first query, {{name like '%RevisionDiff%';}}, means: give me all names 
containing the {{RevisionDiff}} substring.

The 2nd query, {{name like '%ItemImpl%';}}, means: give me all names containing 
the {{ItemImpl}} substring.

The 3rd query, {{name like '%RevisionDiff%ItemImpl%';}}, means: give me all 
names containing the {{RevisionDiff}} substring OR the {{ItemImpl}} substring.

Nowhere do I see *AND* semantics in your examples.





> SASI index. Incorrect results for '%foo%bar%'-like search pattern. 
> ---
>
> Key: CASSANDRA-12573
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12573
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Mikhail Krupitskiy
>Priority: Critical
>  Labels: sasi
>
> We use Cassandra 3.7 and have faced a strange behaviour of SELECT requests 
> with "LIKE '%foo%bar%'" constraints on a column with SASI index.
> Below are few experiments that show this behaviour.
> Experiment 1:
> {noformat}
> drop keyspace if exists kmv;
> create keyspace if not exists kmv WITH REPLICATION = { 'class' : 
> 'SimpleStrategy', 'replication_factor':'1'} ;
> use kmv;
> CREATE TABLE if not exists kmv (id int primary key, c1 text, c2 text);
> CREATE CUSTOM INDEX ON kmv.kmv  ( c2 ) USING 
> 'org.apache.cassandra.index.sasi.SASIIndex' WITH OPTIONS = {
>  'mode': 'CONTAINS'
> };
> insert into kmv (id, c1, c2) values (1, 'f21', 'qwe') ;
> insert into kmv (id, c1, c2) values (2, 'f22', 'qweasd') ;
> insert into kmv (id, c1, c2) values (3, 'f23', 'qwea1') ;
> insert into kmv (id, c1, c2) values (4, 'f24', '1qwe') ;
> insert into kmv (id, c1, c2) values (5, 'f25', 'asdqwe') ;
> select c2 from kmv.kmv where c2 like '%w%a%';
> {noformat}
> Expected result: qweasd, qwea1.
> Actual result: no rows.
> Experiment 2 (NOTE: definition of index is changed):
> {noformat}
> drop keyspace if exists kmv;
> create keyspace if not exists kmv WITH REPLICATION = { 'class' : 
> 'SimpleStrategy', 'replication_factor':'1'} ;
> use kmv;
> CREATE TABLE if not exists kmv (id int primary key, c1 text, c2 text);
> CREATE CUSTOM INDEX ON kmv.kmv  ( c2 ) USING 
> 'org.apache.cassandra.index.sasi.SASIIndex' WITH OPTIONS = {
>  'mode': 'CONTAINS',
>  'analyzer_class': 
> 'org.apache.cassandra.index.sasi.analyzer.StandardAnalyzer',
>  'analyzed': 'true'
> };
> insert into kmv (id, c1, c2) values (1, 'f21', 'qwe') ;
> insert into kmv (id, c1, c2) values (2, 'f22', 'qweasd') ;
> insert into kmv (id, c1, c2) values (3, 'f23', 'qwea1') ;
> insert into kmv (id, c1, c2) values (4, 'f24', '1qwe') ;
> insert into kmv (id, c1, c2) values (5, 'f25', 'asdqwe') ;
> select c2 from kmv.kmv where c2 like '%w%a%';
> {noformat}
> Expected result: qweasd, qwea1.
> Actual result: asdqwe, qweasd, qwea1.
> Experiment 3 (NOTE: primary key is compound now and inserted data was 
> changed):
> {noformat}
> drop keyspace if exists kmv;
> create keyspace if not exists kmv WITH REPLICATION = { 'class' : 
> 'SimpleStrategy', 'replication_factor':'1'} ;
> use kmv;
> CREATE TABLE if not exists kmv (id int, c1 text, c2 text, PRIMARY KEY(id, 
> c1));
> CREATE CUSTOM INDEX ON kmv.kmv  ( c2 ) USING 
> 'org.apache.cassandra.index.sasi.SASIIndex' WITH OPTIONS = {
>  'mode': 'CONTAINS',
>  'analyzer_class': 
> 'org.apache.cassandra.index.sasi.analyzer.StandardAnalyzer',
>  'analyzed': 'true'
> };
> insert into kmv (id, c1, c2) values (1, 'f21', 'qwe') ;
> insert into kmv (id, c1, c2) values (1, 'f22', 'qweasd') ;
> insert into kmv (id, c1, c2) values (1, 'f23', 'qwea1') ;
> insert into kmv (id, c1, c2) values (1, 'f24', '1qwe') ;
> insert into kmv (id, c1, c2) values (1, 'f25', 'asdqwe') ;
> select c2 from kmv.kmv where c2 like '%w%a%';
> {noformat}
> Expected result: qweasd, qwea1.
> Actual result: qwe, qweasd, qwea1, 1qwe, asdqwe.
> Experiment 4 (NOTE: search criteria is 

[jira] [Commented] (CASSANDRA-12573) SASI index. Incorrect results for '%foo%bar%'-like search pattern.

2016-09-16 Thread DOAN DuyHai (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15496927#comment-15496927
 ] 

DOAN DuyHai commented on CASSANDRA-12573:
-

"Since CQL doesn't allow multiple LIKE constraints in one query, it does look 
like a bug."

--> Wrong, a bug is something that does not work as expected, e.g. that does not 
work as documented.

For SASI, nowhere do we say that % in the middle of the search term works like a 
wildcard... it's not a bug.

What you are asking for is an enhancement.

"Can you suggest a workaround to search for several patterns?"

--> use the StandardAnalyzer to split the sentence into tokens and then query 
with LIKE 'pattern1 pattern2'. However, you'll get OR semantics, not AND.

 SASI initially supported multiple predicates, something like WHERE ((col1=xxx 
OR col2=yyy) AND (col3 LIKE '%zzz')), but it is not merged yet into the 3.x trunk.
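
A sketch of that StandardAnalyzer workaround, assuming the analyzed {{CONTAINS}} 
index from Experiment 2 of the issue description below is already in place (OR 
semantics between the tokens, as noted):

{code:sql}
-- with the StandardAnalyzer, the search term is tokenized on whitespace,
-- so this matches rows whose c2 contains 'qwe' OR 'asd' (not necessarily both)
SELECT c2 FROM kmv.kmv WHERE c2 LIKE 'qwe asd';
{code}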


> SASI index. Incorrect results for '%foo%bar%'-like search pattern. 
> ---
>
> Key: CASSANDRA-12573
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12573
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Mikhail Krupitskiy
>Priority: Critical
>  Labels: sasi
>
> We use Cassandra 3.7 and have faced a strange behaviour of SELECT requests 
> with "LIKE '%foo%bar%'" constraints on a column with SASI index.
> Below are few experiments that show this behaviour.
> Experiment 1:
> {noformat}
> drop keyspace if exists kmv;
> create keyspace if not exists kmv WITH REPLICATION = { 'class' : 
> 'SimpleStrategy', 'replication_factor':'1'} ;
> use kmv;
> CREATE TABLE if not exists kmv (id int primary key, c1 text, c2 text);
> CREATE CUSTOM INDEX ON kmv.kmv  ( c2 ) USING 
> 'org.apache.cassandra.index.sasi.SASIIndex' WITH OPTIONS = {
>  'mode': 'CONTAINS'
> };
> insert into kmv (id, c1, c2) values (1, 'f21', 'qwe') ;
> insert into kmv (id, c1, c2) values (2, 'f22', 'qweasd') ;
> insert into kmv (id, c1, c2) values (3, 'f23', 'qwea1') ;
> insert into kmv (id, c1, c2) values (4, 'f24', '1qwe') ;
> insert into kmv (id, c1, c2) values (5, 'f25', 'asdqwe') ;
> select c2 from kmv.kmv where c2 like '%w%a%';
> {noformat}
> Expected result: qweasd, qwea1.
> Actual result: no rows.
> Experiment 2 (NOTE: definition of index is changed):
> {noformat}
> drop keyspace if exists kmv;
> create keyspace if not exists kmv WITH REPLICATION = { 'class' : 
> 'SimpleStrategy', 'replication_factor':'1'} ;
> use kmv;
> CREATE TABLE if not exists kmv (id int primary key, c1 text, c2 text);
> CREATE CUSTOM INDEX ON kmv.kmv  ( c2 ) USING 
> 'org.apache.cassandra.index.sasi.SASIIndex' WITH OPTIONS = {
>  'mode': 'CONTAINS',
>  'analyzer_class': 
> 'org.apache.cassandra.index.sasi.analyzer.StandardAnalyzer',
>  'analyzed': 'true'
> };
> insert into kmv (id, c1, c2) values (1, 'f21', 'qwe') ;
> insert into kmv (id, c1, c2) values (2, 'f22', 'qweasd') ;
> insert into kmv (id, c1, c2) values (3, 'f23', 'qwea1') ;
> insert into kmv (id, c1, c2) values (4, 'f24', '1qwe') ;
> insert into kmv (id, c1, c2) values (5, 'f25', 'asdqwe') ;
> select c2 from kmv.kmv where c2 like '%w%a%';
> {noformat}
> Expected result: qweasd, qwea1.
> Actual result: asdqwe, qweasd, qwea1.
> Experiment 3 (NOTE: primary key is compound now and inserted data was 
> changed):
> {noformat}
> drop keyspace if exists kmv;
> create keyspace if not exists kmv WITH REPLICATION = { 'class' : 
> 'SimpleStrategy', 'replication_factor':'1'} ;
> use kmv;
> CREATE TABLE if not exists kmv (id int, c1 text, c2 text, PRIMARY KEY(id, 
> c1));
> CREATE CUSTOM INDEX ON kmv.kmv  ( c2 ) USING 
> 'org.apache.cassandra.index.sasi.SASIIndex' WITH OPTIONS = {
>  'mode': 'CONTAINS',
>  'analyzer_class': 
> 'org.apache.cassandra.index.sasi.analyzer.StandardAnalyzer',
>  'analyzed': 'true'
> };
> insert into kmv (id, c1, c2) values (1, 'f21', 'qwe') ;
> insert into kmv (id, c1, c2) values (1, 'f22', 'qweasd') ;
> insert into kmv (id, c1, c2) values (1, 'f23', 'qwea1') ;
> insert into kmv (id, c1, c2) values (1, 'f24', '1qwe') ;
> insert into kmv (id, c1, c2) values (1, 'f25', 'asdqwe') ;
> select c2 from kmv.kmv where c2 like '%w%a%';
> {noformat}
> Expected result: qweasd, qwea1.
> Actual result: qwe, qweasd, qwea1, 1qwe, asdqwe.
> Experiment 4 (NOTE: search criteria is changed):
> {noformat}
> drop keyspace if exists kmv;
> create keyspace if not exists kmv WITH REPLICATION = { 'class' : 
> 'SimpleStrategy', 'replication_factor':'1'} ;
> use kmv;
> CREATE TABLE if not exists kmv (id int, c1 text, c2 text, PRIMARY KEY(id, 
> c1));
> CREATE CUSTOM INDEX ON kmv.kmv  ( c2 ) USING 
> 'org.apache.cassandra.index.sasi.SASIIndex' WITH OPTIONS = {
>  'mode': 'CONTAINS',
>  'analyzer_class': 
> 'org.apache.cassandra.index.sasi.analyzer.StandardAnalyzer',
>  'analyzed': 'true'
> };
> insert into kmv (id, c1, c2) 

[jira] [Commented] (CASSANDRA-12573) SASI index. Incorrect results for '%foo%bar%'-like search pattern.

2016-09-16 Thread DOAN DuyHai (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15496060#comment-15496060
 ] 

DOAN DuyHai commented on CASSANDRA-12573:
-

"To let users to query strings by more flexible patterns. E.g. to find 
'123foo456bar789' by the '%foo%bar%' pattern." 

--> Feasible, but I'm not sure how complex it would be, nor whether it would be 
as optimized as the current implementation.

Anyway, please create a new JIRA for this feature request and close this one, 
since it's not a bug.

> SASI index. Incorrect results for '%foo%bar%'-like search pattern. 
> ---
>
> Key: CASSANDRA-12573
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12573
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Mikhail Krupitskiy
>Priority: Critical
>  Labels: sasi
>
> We use Cassandra 3.7 and have faced a strange behaviour of SELECT requests 
> with "LIKE '%foo%bar%'" constraints on a column with SASI index.
> Below are few experiments that show this behaviour.
> Experiment 1:
> {noformat}
> drop keyspace if exists kmv;
> create keyspace if not exists kmv WITH REPLICATION = { 'class' : 
> 'SimpleStrategy', 'replication_factor':'1'} ;
> use kmv;
> CREATE TABLE if not exists kmv (id int primary key, c1 text, c2 text);
> CREATE CUSTOM INDEX ON kmv.kmv  ( c2 ) USING 
> 'org.apache.cassandra.index.sasi.SASIIndex' WITH OPTIONS = {
>  'mode': 'CONTAINS'
> };
> insert into kmv (id, c1, c2) values (1, 'f21', 'qwe') ;
> insert into kmv (id, c1, c2) values (2, 'f22', 'qweasd') ;
> insert into kmv (id, c1, c2) values (3, 'f23', 'qwea1') ;
> insert into kmv (id, c1, c2) values (4, 'f24', '1qwe') ;
> insert into kmv (id, c1, c2) values (5, 'f25', 'asdqwe') ;
> select c2 from kmv.kmv where c2 like '%w%a%';
> {noformat}
> Expected result: qweasd, qwea1.
> Actual result: no rows.
> Experiment 2 (NOTE: definition of index is changed):
> {noformat}
> drop keyspace if exists kmv;
> create keyspace if not exists kmv WITH REPLICATION = { 'class' : 
> 'SimpleStrategy', 'replication_factor':'1'} ;
> use kmv;
> CREATE TABLE if not exists kmv (id int primary key, c1 text, c2 text);
> CREATE CUSTOM INDEX ON kmv.kmv  ( c2 ) USING 
> 'org.apache.cassandra.index.sasi.SASIIndex' WITH OPTIONS = {
>  'mode': 'CONTAINS',
>  'analyzer_class': 
> 'org.apache.cassandra.index.sasi.analyzer.StandardAnalyzer',
>  'analyzed': 'true'
> };
> insert into kmv (id, c1, c2) values (1, 'f21', 'qwe') ;
> insert into kmv (id, c1, c2) values (2, 'f22', 'qweasd') ;
> insert into kmv (id, c1, c2) values (3, 'f23', 'qwea1') ;
> insert into kmv (id, c1, c2) values (4, 'f24', '1qwe') ;
> insert into kmv (id, c1, c2) values (5, 'f25', 'asdqwe') ;
> select c2 from kmv.kmv where c2 like '%w%a%';
> {noformat}
> Expected result: qweasd, qwea1.
> Actual result: asdqwe, qweasd, qwea1.
> Experiment 3 (NOTE: primary key is compound now and inserted data was 
> changed):
> {noformat}
> drop keyspace if exists kmv;
> create keyspace if not exists kmv WITH REPLICATION = { 'class' : 
> 'SimpleStrategy', 'replication_factor':'1'} ;
> use kmv;
> CREATE TABLE if not exists kmv (id int, c1 text, c2 text, PRIMARY KEY(id, 
> c1));
> CREATE CUSTOM INDEX ON kmv.kmv  ( c2 ) USING 
> 'org.apache.cassandra.index.sasi.SASIIndex' WITH OPTIONS = {
>  'mode': 'CONTAINS',
>  'analyzer_class': 
> 'org.apache.cassandra.index.sasi.analyzer.StandardAnalyzer',
>  'analyzed': 'true'
> };
> insert into kmv (id, c1, c2) values (1, 'f21', 'qwe') ;
> insert into kmv (id, c1, c2) values (1, 'f22', 'qweasd') ;
> insert into kmv (id, c1, c2) values (1, 'f23', 'qwea1') ;
> insert into kmv (id, c1, c2) values (1, 'f24', '1qwe') ;
> insert into kmv (id, c1, c2) values (1, 'f25', 'asdqwe') ;
> select c2 from kmv.kmv where c2 like '%w%a%';
> {noformat}
> Expected result: qweasd, qwea1.
> Actual result: qwe, qweasd, qwea1, 1qwe, asdqwe.
> Experiment 4 (NOTE: search criteria is changed):
> {noformat}
> drop keyspace if exists kmv;
> create keyspace if not exists kmv WITH REPLICATION = { 'class' : 
> 'SimpleStrategy', 'replication_factor':'1'} ;
> use kmv;
> CREATE TABLE if not exists kmv (id int, c1 text, c2 text, PRIMARY KEY(id, 
> c1));
> CREATE CUSTOM INDEX ON kmv.kmv  ( c2 ) USING 
> 'org.apache.cassandra.index.sasi.SASIIndex' WITH OPTIONS = {
>  'mode': 'CONTAINS',
>  'analyzer_class': 
> 'org.apache.cassandra.index.sasi.analyzer.StandardAnalyzer',
>  'analyzed': 'true'
> };
> insert into kmv (id, c1, c2) values (1, 'f21', 'qwe') ;
> insert into kmv (id, c1, c2) values (1, 'f22', 'qweasd') ;
> insert into kmv (id, c1, c2) values (1, 'f23', 'qwea1') ;
> insert into kmv (id, c1, c2) values (1, 'f24', '1qwe') ;
> insert into kmv (id, c1, c2) values (1, 'f25', 'asdqwe') ;
> select c2 from kmv.kmv where c2 like '%w22%a%';
> {noformat}
> Expected result: no rows.
> Actual result: qweasd, qwea1, 

[jira] [Commented] (CASSANDRA-12573) SASI index. Incorrect results for '%foo%bar%'-like search pattern.

2016-09-16 Thread DOAN DuyHai (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15496023#comment-15496023
 ] 

DOAN DuyHai commented on CASSANDRA-12573:
-

"Am I right that '%' interprets as a wildcard only when it is a first or last 
character and it is an expected behaviour?"
  --> yes, it is

"If so then are there any plans to change it (interpret it as a wildcard in the 
middle of a string too)?" 

  --> why should we change it? And if we changed it, it would only make sense 
for the NonTokenizingAnalyzer, because the StandardAnalyzer will always consider 
{{%}} as a line separator
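
To make that concrete, here is a small sketch (with a hypothetical extra row, 
reusing the kmv schema from the experiments below) of what a literal, 
non-wildcard inner {{%}} means for a non-analyzed CONTAINS index:

{noformat}
-- hypothetical row whose value literally contains the three characters "w%a"
insert into kmv (id, c1, c2) values (6, 'f26', 'qw%asd') ;

-- with a non-analyzed CONTAINS index, LIKE '%w%a%' is a contains-search for
-- the literal term "w%a", so only the row above would match
select c2 from kmv.kmv where c2 like '%w%a%';
{noformat}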

> SASI index. Incorrect results for '%foo%bar%'-like search pattern. 
> ---
>
> Key: CASSANDRA-12573
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12573
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Mikhail Krupitskiy
>Priority: Critical
>  Labels: sasi
>
> We use Cassandra 3.7 and have faced a strange behaviour of SELECT requests 
> with "LIKE '%foo%bar%'" constraints on a column with SASI index.
> Below are few experiments that show this behaviour.
> Experiment 1:
> {noformat}
> drop keyspace if exists kmv;
> create keyspace if not exists kmv WITH REPLICATION = { 'class' : 
> 'SimpleStrategy', 'replication_factor':'1'} ;
> use kmv;
> CREATE TABLE if not exists kmv (id int primary key, c1 text, c2 text);
> CREATE CUSTOM INDEX ON kmv.kmv  ( c2 ) USING 
> 'org.apache.cassandra.index.sasi.SASIIndex' WITH OPTIONS = {
>  'mode': 'CONTAINS'
> };
> insert into kmv (id, c1, c2) values (1, 'f21', 'qwe') ;
> insert into kmv (id, c1, c2) values (2, 'f22', 'qweasd') ;
> insert into kmv (id, c1, c2) values (3, 'f23', 'qwea1') ;
> insert into kmv (id, c1, c2) values (4, 'f24', '1qwe') ;
> insert into kmv (id, c1, c2) values (5, 'f25', 'asdqwe') ;
> select c2 from kmv.kmv where c2 like '%w%a%';
> {noformat}
> Expected result: qweasd, qwea1.
> Actual result: no rows.
> Experiment 2 (NOTE: definition of index is changed):
> {noformat}
> drop keyspace if exists kmv;
> create keyspace if not exists kmv WITH REPLICATION = { 'class' : 
> 'SimpleStrategy', 'replication_factor':'1'} ;
> use kmv;
> CREATE TABLE if not exists kmv (id int primary key, c1 text, c2 text);
> CREATE CUSTOM INDEX ON kmv.kmv  ( c2 ) USING 
> 'org.apache.cassandra.index.sasi.SASIIndex' WITH OPTIONS = {
>  'mode': 'CONTAINS',
>  'analyzer_class': 
> 'org.apache.cassandra.index.sasi.analyzer.StandardAnalyzer',
>  'analyzed': 'true'
> };
> insert into kmv (id, c1, c2) values (1, 'f21', 'qwe') ;
> insert into kmv (id, c1, c2) values (2, 'f22', 'qweasd') ;
> insert into kmv (id, c1, c2) values (3, 'f23', 'qwea1') ;
> insert into kmv (id, c1, c2) values (4, 'f24', '1qwe') ;
> insert into kmv (id, c1, c2) values (5, 'f25', 'asdqwe') ;
> select c2 from kmv.kmv where c2 like '%w%a%';
> {noformat}
> Expected result: qweasd, qwea1.
> Actual result: asdqwe, qweasd, qwea1.
> Experiment 3 (NOTE: primary key is compound now and inserted data was 
> changed):
> {noformat}
> drop keyspace if exists kmv;
> create keyspace if not exists kmv WITH REPLICATION = { 'class' : 
> 'SimpleStrategy', 'replication_factor':'1'} ;
> use kmv;
> CREATE TABLE if not exists kmv (id int, c1 text, c2 text, PRIMARY KEY(id, 
> c1));
> CREATE CUSTOM INDEX ON kmv.kmv  ( c2 ) USING 
> 'org.apache.cassandra.index.sasi.SASIIndex' WITH OPTIONS = {
>  'mode': 'CONTAINS',
>  'analyzer_class': 
> 'org.apache.cassandra.index.sasi.analyzer.StandardAnalyzer',
>  'analyzed': 'true'
> };
> insert into kmv (id, c1, c2) values (1, 'f21', 'qwe') ;
> insert into kmv (id, c1, c2) values (1, 'f22', 'qweasd') ;
> insert into kmv (id, c1, c2) values (1, 'f23', 'qwea1') ;
> insert into kmv (id, c1, c2) values (1, 'f24', '1qwe') ;
> insert into kmv (id, c1, c2) values (1, 'f25', 'asdqwe') ;
> select c2 from kmv.kmv where c2 like '%w%a%';
> {noformat}
> Expected result: qweasd, qwea1.
> Actual result: qwe, qweasd, qwea1, 1qwe, asdqwe.
> Experiment 4 (NOTE: search criteria is changed):
> {noformat}
> drop keyspace if exists kmv;
> create keyspace if not exists kmv WITH REPLICATION = { 'class' : 
> 'SimpleStrategy', 'replication_factor':'1'} ;
> use kmv;
> CREATE TABLE if not exists kmv (id int, c1 text, c2 text, PRIMARY KEY(id, 
> c1));
> CREATE CUSTOM INDEX ON kmv.kmv  ( c2 ) USING 
> 'org.apache.cassandra.index.sasi.SASIIndex' WITH OPTIONS = {
>  'mode': 'CONTAINS',
>  'analyzer_class': 
> 'org.apache.cassandra.index.sasi.analyzer.StandardAnalyzer',
>  'analyzed': 'true'
> };
> insert into kmv (id, c1, c2) values (1, 'f21', 'qwe') ;
> insert into kmv (id, c1, c2) values (1, 'f22', 'qweasd') ;
> insert into kmv (id, c1, c2) values (1, 'f23', 'qwea1') ;
> insert into kmv (id, c1, c2) values (1, 'f24', '1qwe') ;
> insert into kmv (id, c1, c2) values (1, 'f25', 'asdqwe') ;
> select c2 from 

[jira] [Commented] (CASSANDRA-12573) SASI index. Incorrect results for '%foo%bar%'-like search pattern.

2016-09-15 Thread DOAN DuyHai (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15494341#comment-15494341
 ] 

DOAN DuyHai commented on CASSANDRA-12573:
-

Ok, I got to the bottom of the issue with %w%a%

This will first be interpreted by the CQL parser as a LIKE CONTAINS with 
searched term = w%a

And then things get complicated

1) if you're using NonTokenizingAnalyzer or NoOpAnalyzer, everything is fine: 
the % in 'w%a' is interpreted as a simple literal and not as a wildcard character

2) if you're using StandardAnalyzer, it's an entirely different story. During 
the parsing of the search predicates by the query planner, the term 'w%a' is 
passed to the analyzer (StandardAnalyzer here):  
https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/index/sasi/plan/Operation.java#L303-L323

The StandardAnalyzer tokenizes the search term, so 'w%a' becomes two distinct 
tokens, 'w' OR 'a'. Why does it ignore the %? Because according to the Unicode 
line breaking rules, % is a separator, read here: 
http://www.unicode.org/Public/UNIDATA/LineBreak.txt

Nowhere in the source code can we see this; in fact you'll need to look into 
the JFlex grammar file 
https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/index/sasi/analyzer/StandardTokenizerImpl.jflex
 to see a reference to the Unicode word breaking rules ...

So indeed, when using StandardAnalyzer, any % character will be interpreted as 
a separator, and our LIKE '%w%a%' is effectively transformed into LIKE '%w%' OR 
LIKE '%a%', i.e. all words containing 'w' OR 'a', irrespective of their relative 
position to each other ...

Why is it an OR predicate and not an AND predicate? The answer is a comment in 
the source code here: 
https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/index/sasi/plan/Operation.java#L290-L295

Experiment 1 returns 0 rows because it uses NonTokenizingAnalyzer, CORRECT
Experiment 2 returns 3 rows (asdqwe, qweasd, qwea1) because it uses 
StandardAnalyzer and all the words contain 'w' OR 'a', CORRECT

Same remark for experiments 3 & 4.

Indeed, it is not really a bug; it is because you're using the StandardAnalyzer 
with tokenization ...
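
To make the behaviour concrete, here is a minimal CQL sketch (reusing the kmv 
schema from the experiments; the client-side filtering step is only a 
suggestion, not something SASI does for you). With a non-analyzed CONTAINS 
index the server can answer a single-fragment contains-search, and the 
ordering constraint between the fragments then has to be applied on the client:

{noformat}
-- non-analyzed SASI index: no tokenization, % inside a term stays a literal
CREATE CUSTOM INDEX ON kmv.kmv (c2) USING 'org.apache.cassandra.index.sasi.SASIIndex'
WITH OPTIONS = { 'mode': 'CONTAINS' };

-- broad server-side contains-search on a single fragment
SELECT id, c2 FROM kmv.kmv WHERE c2 LIKE '%w%';

-- the "then 'a' somewhere after the 'w'" constraint must be checked client-side,
-- e.g. by matching the returned c2 values against a regex like .*w.*a.*
{noformat}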


> SASI index. Incorrect results for '%foo%bar%'-like search pattern. 
> ---
>
> Key: CASSANDRA-12573
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12573
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Mikhail Krupitskiy
>Assignee: Alex Petrov
>Priority: Critical
>  Labels: sasi
>
> We use Cassandra 3.7 and have faced a strange behaviour of SELECT requests 
> with "LIKE '%foo%bar%'" constraints on a column with SASI index.
> Below are few experiments that show this behaviour.
> Experiment 1:
> {noformat}
> drop keyspace if exists kmv;
> create keyspace if not exists kmv WITH REPLICATION = { 'class' : 
> 'SimpleStrategy', 'replication_factor':'1'} ;
> use kmv;
> CREATE TABLE if not exists kmv (id int primary key, c1 text, c2 text);
> CREATE CUSTOM INDEX ON kmv.kmv  ( c2 ) USING 
> 'org.apache.cassandra.index.sasi.SASIIndex' WITH OPTIONS = {
>  'mode': 'CONTAINS'
> };
> insert into kmv (id, c1, c2) values (1, 'f21', 'qwe') ;
> insert into kmv (id, c1, c2) values (2, 'f22', 'qweasd') ;
> insert into kmv (id, c1, c2) values (3, 'f23', 'qwea1') ;
> insert into kmv (id, c1, c2) values (4, 'f24', '1qwe') ;
> insert into kmv (id, c1, c2) values (5, 'f25', 'asdqwe') ;
> select c2 from kmv.kmv where c2 like '%w%a%';
> {noformat}
> Expected result: qweasd, qwea1.
> Actual result: no rows.
> Experiment 2 (NOTE: definition of index is changed):
> {noformat}
> drop keyspace if exists kmv;
> create keyspace if not exists kmv WITH REPLICATION = { 'class' : 
> 'SimpleStrategy', 'replication_factor':'1'} ;
> use kmv;
> CREATE TABLE if not exists kmv (id int primary key, c1 text, c2 text);
> CREATE CUSTOM INDEX ON kmv.kmv  ( c2 ) USING 
> 'org.apache.cassandra.index.sasi.SASIIndex' WITH OPTIONS = {
>  'mode': 'CONTAINS',
>  'analyzer_class': 
> 'org.apache.cassandra.index.sasi.analyzer.StandardAnalyzer',
>  'analyzed': 'true'
> };
> insert into kmv (id, c1, c2) values (1, 'f21', 'qwe') ;
> insert into kmv (id, c1, c2) values (2, 'f22', 'qweasd') ;
> insert into kmv (id, c1, c2) values (3, 'f23', 'qwea1') ;
> insert into kmv (id, c1, c2) values (4, 'f24', '1qwe') ;
> insert into kmv (id, c1, c2) values (5, 'f25', 'asdqwe') ;
> select c2 from kmv.kmv where c2 like '%w%a%';
> {noformat}
> Expected result: qweasd, qwea1.
> Actual result: asdqwe, qweasd, qwea1.
> Experiment 3 (NOTE: primary key is compound now and inserted data was 
> changed):
> {noformat}
> drop keyspace if exists kmv;
> create keyspace if not exists kmv WITH REPLICATION = { 'class' : 
> 'SimpleStrategy', 'replication_factor':'1'} ;
> use kmv;
> CREATE TABLE if 

[jira] [Issue Comment Deleted] (CASSANDRA-12573) SASI index. Incorrect results for '%foo%bar%'-like search pattern.

2016-09-15 Thread DOAN DuyHai (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-12573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DOAN DuyHai updated CASSANDRA-12573:

Comment: was deleted

(was: Right, the escaping issue does not matter here. What we want to 
understand is how SASI interprets the {{%}} in the middle of the term.

Please note that you're using C* 3.7. I have contributed a bug fix (that was 
scheduled for 3.9 and is in trunk) about skip stop words being applied after 
stemming whereas it should be applied before. I'm not sure if it is relevant to 
the current data set here but it rings a bell in my head when you get weird 
behaviors only when using StandardAnalyzer)

> SASI index. Incorrect results for '%foo%bar%'-like search pattern. 
> ---
>
> Key: CASSANDRA-12573
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12573
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Mikhail Krupitskiy
>Assignee: Alex Petrov
>Priority: Critical
>  Labels: sasi
>
> We use Cassandra 3.7 and have faced a strange behaviour of SELECT requests 
> with "LIKE '%foo%bar%'" constraints on a column with SASI index.
> Below are few experiments that show this behaviour.
> Experiment 1:
> {noformat}
> drop keyspace if exists kmv;
> create keyspace if not exists kmv WITH REPLICATION = { 'class' : 
> 'SimpleStrategy', 'replication_factor':'1'} ;
> use kmv;
> CREATE TABLE if not exists kmv (id int primary key, c1 text, c2 text);
> CREATE CUSTOM INDEX ON kmv.kmv  ( c2 ) USING 
> 'org.apache.cassandra.index.sasi.SASIIndex' WITH OPTIONS = {
>  'mode': 'CONTAINS'
> };
> insert into kmv (id, c1, c2) values (1, 'f21', 'qwe') ;
> insert into kmv (id, c1, c2) values (2, 'f22', 'qweasd') ;
> insert into kmv (id, c1, c2) values (3, 'f23', 'qwea1') ;
> insert into kmv (id, c1, c2) values (4, 'f24', '1qwe') ;
> insert into kmv (id, c1, c2) values (5, 'f25', 'asdqwe') ;
> select c2 from kmv.kmv where c2 like '%w%a%';
> {noformat}
> Expected result: qweasd, qwea1.
> Actual result: no rows.
> Experiment 2 (NOTE: definition of index is changed):
> {noformat}
> drop keyspace if exists kmv;
> create keyspace if not exists kmv WITH REPLICATION = { 'class' : 
> 'SimpleStrategy', 'replication_factor':'1'} ;
> use kmv;
> CREATE TABLE if not exists kmv (id int primary key, c1 text, c2 text);
> CREATE CUSTOM INDEX ON kmv.kmv  ( c2 ) USING 
> 'org.apache.cassandra.index.sasi.SASIIndex' WITH OPTIONS = {
>  'mode': 'CONTAINS',
>  'analyzer_class': 
> 'org.apache.cassandra.index.sasi.analyzer.StandardAnalyzer',
>  'analyzed': 'true'
> };
> insert into kmv (id, c1, c2) values (1, 'f21', 'qwe') ;
> insert into kmv (id, c1, c2) values (2, 'f22', 'qweasd') ;
> insert into kmv (id, c1, c2) values (3, 'f23', 'qwea1') ;
> insert into kmv (id, c1, c2) values (4, 'f24', '1qwe') ;
> insert into kmv (id, c1, c2) values (5, 'f25', 'asdqwe') ;
> select c2 from kmv.kmv where c2 like '%w%a%';
> {noformat}
> Expected result: qweasd, qwea1.
> Actual result: asdqwe, qweasd, qwea1.
> Experiment 3 (NOTE: primary key is compound now and inserted data was 
> changed):
> {noformat}
> drop keyspace if exists kmv;
> create keyspace if not exists kmv WITH REPLICATION = { 'class' : 
> 'SimpleStrategy', 'replication_factor':'1'} ;
> use kmv;
> CREATE TABLE if not exists kmv (id int, c1 text, c2 text, PRIMARY KEY(id, 
> c1));
> CREATE CUSTOM INDEX ON kmv.kmv  ( c2 ) USING 
> 'org.apache.cassandra.index.sasi.SASIIndex' WITH OPTIONS = {
>  'mode': 'CONTAINS',
>  'analyzer_class': 
> 'org.apache.cassandra.index.sasi.analyzer.StandardAnalyzer',
>  'analyzed': 'true'
> };
> insert into kmv (id, c1, c2) values (1, 'f21', 'qwe') ;
> insert into kmv (id, c1, c2) values (1, 'f22', 'qweasd') ;
> insert into kmv (id, c1, c2) values (1, 'f23', 'qwea1') ;
> insert into kmv (id, c1, c2) values (1, 'f24', '1qwe') ;
> insert into kmv (id, c1, c2) values (1, 'f25', 'asdqwe') ;
> select c2 from kmv.kmv where c2 like '%w%a%';
> {noformat}
> Expected result: qweasd, qwea1.
> Actual result: qwe, qweasd, qwea1, 1qwe, asdqwe.
> Experiment 4 (NOTE: search criteria is changed):
> {noformat}
> drop keyspace if exists kmv;
> create keyspace if not exists kmv WITH REPLICATION = { 'class' : 
> 'SimpleStrategy', 'replication_factor':'1'} ;
> use kmv;
> CREATE TABLE if not exists kmv (id int, c1 text, c2 text, PRIMARY KEY(id, 
> c1));
> CREATE CUSTOM INDEX ON kmv.kmv  ( c2 ) USING 
> 'org.apache.cassandra.index.sasi.SASIIndex' WITH OPTIONS = {
>  'mode': 'CONTAINS',
>  'analyzer_class': 
> 'org.apache.cassandra.index.sasi.analyzer.StandardAnalyzer',
>  'analyzed': 'true'
> };
> insert into kmv (id, c1, c2) values (1, 'f21', 'qwe') ;
> insert into kmv (id, c1, c2) values (1, 'f22', 'qweasd') ;
> insert into kmv (id, c1, c2) values (1, 'f23', 'qwea1') ;
> insert into kmv (id, c1, c2) values (1, 

[jira] [Commented] (CASSANDRA-12573) SASI index. Incorrect results for '%foo%bar%'-like search pattern.

2016-09-15 Thread DOAN DuyHai (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15493798#comment-15493798
 ] 

DOAN DuyHai commented on CASSANDRA-12573:
-

Right, the escaping issue does not matter here. What we want to understand is 
how SASI interprets the {{%}} in the middle of the term.

Please note that you're using C* 3.7. I have contributed a bug fix (that was 
scheduled for 3.9 and is in trunk) about stop-word skipping being applied after 
stemming whereas it should be applied before. I'm not sure whether it is relevant 
to the current data set, but it rings a bell when you get weird behaviours only 
with the StandardAnalyzer

> SASI index. Incorrect results for '%foo%bar%'-like search pattern. 
> ---
>
> Key: CASSANDRA-12573
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12573
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Mikhail Krupitskiy
>Assignee: Alex Petrov
>Priority: Critical
>  Labels: sasi
>
> We use Cassandra 3.7 and have faced a strange behaviour of SELECT requests 
> with "LIKE '%foo%bar%'" constraints on a column with SASI index.
> Below are few experiments that show this behaviour.
> Experiment 1:
> {noformat}
> drop keyspace if exists kmv;
> create keyspace if not exists kmv WITH REPLICATION = { 'class' : 
> 'SimpleStrategy', 'replication_factor':'1'} ;
> use kmv;
> CREATE TABLE if not exists kmv (id int primary key, c1 text, c2 text);
> CREATE CUSTOM INDEX ON kmv.kmv  ( c2 ) USING 
> 'org.apache.cassandra.index.sasi.SASIIndex' WITH OPTIONS = {
>  'mode': 'CONTAINS'
> };
> insert into kmv (id, c1, c2) values (1, 'f21', 'qwe') ;
> insert into kmv (id, c1, c2) values (2, 'f22', 'qweasd') ;
> insert into kmv (id, c1, c2) values (3, 'f23', 'qwea1') ;
> insert into kmv (id, c1, c2) values (4, 'f24', '1qwe') ;
> insert into kmv (id, c1, c2) values (5, 'f25', 'asdqwe') ;
> select c2 from kmv.kmv where c2 like '%w%a%';
> {noformat}
> Expected result: qweasd, qwea1.
> Actual result: no rows.
> Experiment 2 (NOTE: definition of index is changed):
> {noformat}
> drop keyspace if exists kmv;
> create keyspace if not exists kmv WITH REPLICATION = { 'class' : 
> 'SimpleStrategy', 'replication_factor':'1'} ;
> use kmv;
> CREATE TABLE if not exists kmv (id int primary key, c1 text, c2 text);
> CREATE CUSTOM INDEX ON kmv.kmv  ( c2 ) USING 
> 'org.apache.cassandra.index.sasi.SASIIndex' WITH OPTIONS = {
>  'mode': 'CONTAINS',
>  'analyzer_class': 
> 'org.apache.cassandra.index.sasi.analyzer.StandardAnalyzer',
>  'analyzed': 'true'
> };
> insert into kmv (id, c1, c2) values (1, 'f21', 'qwe') ;
> insert into kmv (id, c1, c2) values (2, 'f22', 'qweasd') ;
> insert into kmv (id, c1, c2) values (3, 'f23', 'qwea1') ;
> insert into kmv (id, c1, c2) values (4, 'f24', '1qwe') ;
> insert into kmv (id, c1, c2) values (5, 'f25', 'asdqwe') ;
> select c2 from kmv.kmv where c2 like '%w%a%';
> {noformat}
> Expected result: qweasd, qwea1.
> Actual result: asdqwe, qweasd, qwea1.
> Experiment 3 (NOTE: primary key is compound now and inserted data was 
> changed):
> {noformat}
> drop keyspace if exists kmv;
> create keyspace if not exists kmv WITH REPLICATION = { 'class' : 
> 'SimpleStrategy', 'replication_factor':'1'} ;
> use kmv;
> CREATE TABLE if not exists kmv (id int, c1 text, c2 text, PRIMARY KEY(id, 
> c1));
> CREATE CUSTOM INDEX ON kmv.kmv  ( c2 ) USING 
> 'org.apache.cassandra.index.sasi.SASIIndex' WITH OPTIONS = {
>  'mode': 'CONTAINS',
>  'analyzer_class': 
> 'org.apache.cassandra.index.sasi.analyzer.StandardAnalyzer',
>  'analyzed': 'true'
> };
> insert into kmv (id, c1, c2) values (1, 'f21', 'qwe') ;
> insert into kmv (id, c1, c2) values (1, 'f22', 'qweasd') ;
> insert into kmv (id, c1, c2) values (1, 'f23', 'qwea1') ;
> insert into kmv (id, c1, c2) values (1, 'f24', '1qwe') ;
> insert into kmv (id, c1, c2) values (1, 'f25', 'asdqwe') ;
> select c2 from kmv.kmv where c2 like '%w%a%';
> {noformat}
> Expected result: qweasd, qwea1.
> Actual result: qwe, qweasd, qwea1, 1qwe, asdqwe.
> Experiment 4 (NOTE: search criteria is changed):
> {noformat}
> drop keyspace if exists kmv;
> create keyspace if not exists kmv WITH REPLICATION = { 'class' : 
> 'SimpleStrategy', 'replication_factor':'1'} ;
> use kmv;
> CREATE TABLE if not exists kmv (id int, c1 text, c2 text, PRIMARY KEY(id, 
> c1));
> CREATE CUSTOM INDEX ON kmv.kmv  ( c2 ) USING 
> 'org.apache.cassandra.index.sasi.SASIIndex' WITH OPTIONS = {
>  'mode': 'CONTAINS',
>  'analyzer_class': 
> 'org.apache.cassandra.index.sasi.analyzer.StandardAnalyzer',
>  'analyzed': 'true'
> };
> insert into kmv (id, c1, c2) values (1, 'f21', 'qwe') ;
> insert into kmv (id, c1, c2) values (1, 'f22', 'qweasd') ;
> insert into kmv (id, c1, c2) values (1, 'f23', 'qwea1') ;
> insert into kmv (id, c1, c2) values 

[jira] [Commented] (CASSANDRA-12573) SASI index. Incorrect results for '%foo%bar%'-like search pattern.

2016-09-15 Thread DOAN DuyHai (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15493799#comment-15493799
 ] 

DOAN DuyHai commented on CASSANDRA-12573:
-

Right, the escaping issue does not matter here. What we want to understand is 
how SASI interprets the {{%}} in the middle of the term.

Please note that you're using C* 3.7. I have contributed a bug fix (that was 
scheduled for 3.9 and is in trunk) about stop-word skipping being applied after 
stemming whereas it should be applied before. I'm not sure whether it is relevant 
to the current data set, but it rings a bell when you get weird behaviours only 
with the StandardAnalyzer

> SASI index. Incorrect results for '%foo%bar%'-like search pattern. 
> ---
>
> Key: CASSANDRA-12573
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12573
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Mikhail Krupitskiy
>Assignee: Alex Petrov
>Priority: Critical
>  Labels: sasi
>
> We use Cassandra 3.7 and have faced a strange behaviour of SELECT requests 
> with "LIKE '%foo%bar%'" constraints on a column with SASI index.
> Below are few experiments that show this behaviour.
> Experiment 1:
> {noformat}
> drop keyspace if exists kmv;
> create keyspace if not exists kmv WITH REPLICATION = { 'class' : 
> 'SimpleStrategy', 'replication_factor':'1'} ;
> use kmv;
> CREATE TABLE if not exists kmv (id int primary key, c1 text, c2 text);
> CREATE CUSTOM INDEX ON kmv.kmv  ( c2 ) USING 
> 'org.apache.cassandra.index.sasi.SASIIndex' WITH OPTIONS = {
>  'mode': 'CONTAINS'
> };
> insert into kmv (id, c1, c2) values (1, 'f21', 'qwe') ;
> insert into kmv (id, c1, c2) values (2, 'f22', 'qweasd') ;
> insert into kmv (id, c1, c2) values (3, 'f23', 'qwea1') ;
> insert into kmv (id, c1, c2) values (4, 'f24', '1qwe') ;
> insert into kmv (id, c1, c2) values (5, 'f25', 'asdqwe') ;
> select c2 from kmv.kmv where c2 like '%w%a%';
> {noformat}
> Expected result: qweasd, qwea1.
> Actual result: no rows.
> Experiment 2 (NOTE: definition of index is changed):
> {noformat}
> drop keyspace if exists kmv;
> create keyspace if not exists kmv WITH REPLICATION = { 'class' : 
> 'SimpleStrategy', 'replication_factor':'1'} ;
> use kmv;
> CREATE TABLE if not exists kmv (id int primary key, c1 text, c2 text);
> CREATE CUSTOM INDEX ON kmv.kmv  ( c2 ) USING 
> 'org.apache.cassandra.index.sasi.SASIIndex' WITH OPTIONS = {
>  'mode': 'CONTAINS',
>  'analyzer_class': 
> 'org.apache.cassandra.index.sasi.analyzer.StandardAnalyzer',
>  'analyzed': 'true'
> };
> insert into kmv (id, c1, c2) values (1, 'f21', 'qwe') ;
> insert into kmv (id, c1, c2) values (2, 'f22', 'qweasd') ;
> insert into kmv (id, c1, c2) values (3, 'f23', 'qwea1') ;
> insert into kmv (id, c1, c2) values (4, 'f24', '1qwe') ;
> insert into kmv (id, c1, c2) values (5, 'f25', 'asdqwe') ;
> select c2 from kmv.kmv where c2 like '%w%a%';
> {noformat}
> Expected result: qweasd, qwea1.
> Actual result: asdqwe, qweasd, qwea1.
> Experiment 3 (NOTE: primary key is compound now and inserted data was 
> changed):
> {noformat}
> drop keyspace if exists kmv;
> create keyspace if not exists kmv WITH REPLICATION = { 'class' : 
> 'SimpleStrategy', 'replication_factor':'1'} ;
> use kmv;
> CREATE TABLE if not exists kmv (id int, c1 text, c2 text, PRIMARY KEY(id, 
> c1));
> CREATE CUSTOM INDEX ON kmv.kmv  ( c2 ) USING 
> 'org.apache.cassandra.index.sasi.SASIIndex' WITH OPTIONS = {
>  'mode': 'CONTAINS',
>  'analyzer_class': 
> 'org.apache.cassandra.index.sasi.analyzer.StandardAnalyzer',
>  'analyzed': 'true'
> };
> insert into kmv (id, c1, c2) values (1, 'f21', 'qwe') ;
> insert into kmv (id, c1, c2) values (1, 'f22', 'qweasd') ;
> insert into kmv (id, c1, c2) values (1, 'f23', 'qwea1') ;
> insert into kmv (id, c1, c2) values (1, 'f24', '1qwe') ;
> insert into kmv (id, c1, c2) values (1, 'f25', 'asdqwe') ;
> select c2 from kmv.kmv where c2 like '%w%a%';
> {noformat}
> Expected result: qweasd, qwea1.
> Actual result: qwe, qweasd, qwea1, 1qwe, asdqwe.
> Experiment 4 (NOTE: search criteria is changed):
> {noformat}
> drop keyspace if exists kmv;
> create keyspace if not exists kmv WITH REPLICATION = { 'class' : 
> 'SimpleStrategy', 'replication_factor':'1'} ;
> use kmv;
> CREATE TABLE if not exists kmv (id int, c1 text, c2 text, PRIMARY KEY(id, 
> c1));
> CREATE CUSTOM INDEX ON kmv.kmv  ( c2 ) USING 
> 'org.apache.cassandra.index.sasi.SASIIndex' WITH OPTIONS = {
>  'mode': 'CONTAINS',
>  'analyzer_class': 
> 'org.apache.cassandra.index.sasi.analyzer.StandardAnalyzer',
>  'analyzed': 'true'
> };
> insert into kmv (id, c1, c2) values (1, 'f21', 'qwe') ;
> insert into kmv (id, c1, c2) values (1, 'f22', 'qweasd') ;
> insert into kmv (id, c1, c2) values (1, 'f23', 'qwea1') ;
> insert into kmv (id, c1, c2) values 

[jira] [Commented] (CASSANDRA-12573) SASI index. Incorrect results for '%foo%bar%'-like search pattern.

2016-09-15 Thread DOAN DuyHai (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15493749#comment-15493749
 ] 

DOAN DuyHai commented on CASSANDRA-12573:
-

I'm going to try to reproduce the issue. In any case, right now there is indeed 
*no escaping* of {{%}}, whether as the first or last character or in the middle 
of the term.

I'm attempting escaping for the first & last character. 

> SASI index. Incorrect results for '%foo%bar%'-like search pattern. 
> ---
>
> Key: CASSANDRA-12573
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12573
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Mikhail Krupitskiy
>Assignee: Alex Petrov
>Priority: Critical
>  Labels: sasi
>
> We use Cassandra 3.7 and have faced a strange behaviour of SELECT requests 
> with "LIKE '%foo%bar%'" constraints on a column with SASI index.
> Below are few experiments that show this behaviour.
> Experiment 1:
> {noformat}
> drop keyspace if exists kmv;
> create keyspace if not exists kmv WITH REPLICATION = { 'class' : 
> 'SimpleStrategy', 'replication_factor':'1'} ;
> use kmv;
> CREATE TABLE if not exists kmv (id int primary key, c1 text, c2 text);
> CREATE CUSTOM INDEX ON kmv.kmv  ( c2 ) USING 
> 'org.apache.cassandra.index.sasi.SASIIndex' WITH OPTIONS = {
>  'mode': 'CONTAINS'
> };
> insert into kmv (id, c1, c2) values (1, 'f21', 'qwe') ;
> insert into kmv (id, c1, c2) values (2, 'f22', 'qweasd') ;
> insert into kmv (id, c1, c2) values (3, 'f23', 'qwea1') ;
> insert into kmv (id, c1, c2) values (4, 'f24', '1qwe') ;
> insert into kmv (id, c1, c2) values (5, 'f25', 'asdqwe') ;
> select c2 from kmv.kmv where c2 like '%w%a%';
> {noformat}
> Expected result: qweasd, qwea1.
> Actual result: no rows.
> Experiment 2 (NOTE: definition of index is changed):
> {noformat}
> drop keyspace if exists kmv;
> create keyspace if not exists kmv WITH REPLICATION = { 'class' : 
> 'SimpleStrategy', 'replication_factor':'1'} ;
> use kmv;
> CREATE TABLE if not exists kmv (id int primary key, c1 text, c2 text);
> CREATE CUSTOM INDEX ON kmv.kmv  ( c2 ) USING 
> 'org.apache.cassandra.index.sasi.SASIIndex' WITH OPTIONS = {
>  'mode': 'CONTAINS',
>  'analyzer_class': 
> 'org.apache.cassandra.index.sasi.analyzer.StandardAnalyzer',
>  'analyzed': 'true'
> };
> insert into kmv (id, c1, c2) values (1, 'f21', 'qwe') ;
> insert into kmv (id, c1, c2) values (2, 'f22', 'qweasd') ;
> insert into kmv (id, c1, c2) values (3, 'f23', 'qwea1') ;
> insert into kmv (id, c1, c2) values (4, 'f24', '1qwe') ;
> insert into kmv (id, c1, c2) values (5, 'f25', 'asdqwe') ;
> select c2 from kmv.kmv where c2 like '%w%a%';
> {noformat}
> Expected result: qweasd, qwea1.
> Actual result: asdqwe, qweasd, qwea1.
> Experiment 3 (NOTE: primary key is compound now and inserted data was 
> changed):
> {noformat}
> drop keyspace if exists kmv;
> create keyspace if not exists kmv WITH REPLICATION = { 'class' : 
> 'SimpleStrategy', 'replication_factor':'1'} ;
> use kmv;
> CREATE TABLE if not exists kmv (id int, c1 text, c2 text, PRIMARY KEY(id, 
> c1));
> CREATE CUSTOM INDEX ON kmv.kmv  ( c2 ) USING 
> 'org.apache.cassandra.index.sasi.SASIIndex' WITH OPTIONS = {
>  'mode': 'CONTAINS',
>  'analyzer_class': 
> 'org.apache.cassandra.index.sasi.analyzer.StandardAnalyzer',
>  'analyzed': 'true'
> };
> insert into kmv (id, c1, c2) values (1, 'f21', 'qwe') ;
> insert into kmv (id, c1, c2) values (1, 'f22', 'qweasd') ;
> insert into kmv (id, c1, c2) values (1, 'f23', 'qwea1') ;
> insert into kmv (id, c1, c2) values (1, 'f24', '1qwe') ;
> insert into kmv (id, c1, c2) values (1, 'f25', 'asdqwe') ;
> select c2 from kmv.kmv where c2 like '%w%a%';
> {noformat}
> Expected result: qweasd, qwea1.
> Actual result: qwe, qweasd, qwea1, 1qwe, asdqwe.
> Experiment 4 (NOTE: search criteria is changed):
> {noformat}
> drop keyspace if exists kmv;
> create keyspace if not exists kmv WITH REPLICATION = { 'class' : 
> 'SimpleStrategy', 'replication_factor':'1'} ;
> use kmv;
> CREATE TABLE if not exists kmv (id int, c1 text, c2 text, PRIMARY KEY(id, 
> c1));
> CREATE CUSTOM INDEX ON kmv.kmv  ( c2 ) USING 
> 'org.apache.cassandra.index.sasi.SASIIndex' WITH OPTIONS = {
>  'mode': 'CONTAINS',
>  'analyzer_class': 
> 'org.apache.cassandra.index.sasi.analyzer.StandardAnalyzer',
>  'analyzed': 'true'
> };
> insert into kmv (id, c1, c2) values (1, 'f21', 'qwe') ;
> insert into kmv (id, c1, c2) values (1, 'f22', 'qweasd') ;
> insert into kmv (id, c1, c2) values (1, 'f23', 'qwea1') ;
> insert into kmv (id, c1, c2) values (1, 'f24', '1qwe') ;
> insert into kmv (id, c1, c2) values (1, 'f25', 'asdqwe') ;
> select c2 from kmv.kmv where c2 like '%w22%a%';
> {noformat}
> Expected result: no rows.
> Actual result: qweasd, qwea1, asdqwe.



--
This message was sent by Atlassian JIRA

[jira] [Commented] (CASSANDRA-12573) SASI index. Incorrect results for '%foo%bar%'-like search pattern.

2016-09-15 Thread DOAN DuyHai (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15493699#comment-15493699
 ] 

DOAN DuyHai commented on CASSANDRA-12573:
-

Experiments 2, 3 and 4 also contain a {{%}} in the middle of the searched term ...

> SASI index. Incorrect results for '%foo%bar%'-like search pattern. 
> ---
>
> Key: CASSANDRA-12573
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12573
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Mikhail Krupitskiy
>Assignee: Alex Petrov
>Priority: Critical
>  Labels: sasi
>
> We use Cassandra 3.7 and have faced a strange behaviour of SELECT requests 
> with "LIKE '%foo%bar%'" constraints on a column with SASI index.
> Below are few experiments that show this behaviour.
> Experiment 1:
> {noformat}
> drop keyspace if exists kmv;
> create keyspace if not exists kmv WITH REPLICATION = { 'class' : 
> 'SimpleStrategy', 'replication_factor':'1'} ;
> use kmv;
> CREATE TABLE if not exists kmv (id int primary key, c1 text, c2 text);
> CREATE CUSTOM INDEX ON kmv.kmv  ( c2 ) USING 
> 'org.apache.cassandra.index.sasi.SASIIndex' WITH OPTIONS = {
>  'mode': 'CONTAINS'
> };
> insert into kmv (id, c1, c2) values (1, 'f21', 'qwe') ;
> insert into kmv (id, c1, c2) values (2, 'f22', 'qweasd') ;
> insert into kmv (id, c1, c2) values (3, 'f23', 'qwea1') ;
> insert into kmv (id, c1, c2) values (4, 'f24', '1qwe') ;
> insert into kmv (id, c1, c2) values (5, 'f25', 'asdqwe') ;
> select c2 from kmv.kmv where c2 like '%w%a%';
> {noformat}
> Expected result: qweasd, qwea1.
> Actual result: no rows.
> Experiment 2 (NOTE: definition of index is changed):
> {noformat}
> drop keyspace if exists kmv;
> create keyspace if not exists kmv WITH REPLICATION = { 'class' : 
> 'SimpleStrategy', 'replication_factor':'1'} ;
> use kmv;
> CREATE TABLE if not exists kmv (id int primary key, c1 text, c2 text);
> CREATE CUSTOM INDEX ON kmv.kmv  ( c2 ) USING 
> 'org.apache.cassandra.index.sasi.SASIIndex' WITH OPTIONS = {
>  'mode': 'CONTAINS',
>  'analyzer_class': 
> 'org.apache.cassandra.index.sasi.analyzer.StandardAnalyzer',
>  'analyzed': 'true'
> };
> insert into kmv (id, c1, c2) values (1, 'f21', 'qwe') ;
> insert into kmv (id, c1, c2) values (2, 'f22', 'qweasd') ;
> insert into kmv (id, c1, c2) values (3, 'f23', 'qwea1') ;
> insert into kmv (id, c1, c2) values (4, 'f24', '1qwe') ;
> insert into kmv (id, c1, c2) values (5, 'f25', 'asdqwe') ;
> select c2 from kmv.kmv where c2 like '%w%a%';
> {noformat}
> Expected result: qweasd, qwea1.
> Actual result: asdqwe, qweasd, qwea1.
> Experiment 3 (NOTE: primary key is compound now and inserted data was 
> changed):
> {noformat}
> drop keyspace if exists kmv;
> create keyspace if not exists kmv WITH REPLICATION = { 'class' : 
> 'SimpleStrategy', 'replication_factor':'1'} ;
> use kmv;
> CREATE TABLE if not exists kmv (id int, c1 text, c2 text, PRIMARY KEY(id, 
> c1));
> CREATE CUSTOM INDEX ON kmv.kmv  ( c2 ) USING 
> 'org.apache.cassandra.index.sasi.SASIIndex' WITH OPTIONS = {
>  'mode': 'CONTAINS',
>  'analyzer_class': 
> 'org.apache.cassandra.index.sasi.analyzer.StandardAnalyzer',
>  'analyzed': 'true'
> };
> insert into kmv (id, c1, c2) values (1, 'f21', 'qwe') ;
> insert into kmv (id, c1, c2) values (1, 'f22', 'qweasd') ;
> insert into kmv (id, c1, c2) values (1, 'f23', 'qwea1') ;
> insert into kmv (id, c1, c2) values (1, 'f24', '1qwe') ;
> insert into kmv (id, c1, c2) values (1, 'f25', 'asdqwe') ;
> select c2 from kmv.kmv where c2 like '%w%a%';
> {noformat}
> Expected result: qweasd, qwea1.
> Actual result: qwe, qweasd, qwea1, 1qwe, asdqwe.
> Experiment 4 (NOTE: search criteria is changed):
> {noformat}
> drop keyspace if exists kmv;
> create keyspace if not exists kmv WITH REPLICATION = { 'class' : 
> 'SimpleStrategy', 'replication_factor':'1'} ;
> use kmv;
> CREATE TABLE if not exists kmv (id int, c1 text, c2 text, PRIMARY KEY(id, 
> c1));
> CREATE CUSTOM INDEX ON kmv.kmv  ( c2 ) USING 
> 'org.apache.cassandra.index.sasi.SASIIndex' WITH OPTIONS = {
>  'mode': 'CONTAINS',
>  'analyzer_class': 
> 'org.apache.cassandra.index.sasi.analyzer.StandardAnalyzer',
>  'analyzed': 'true'
> };
> insert into kmv (id, c1, c2) values (1, 'f21', 'qwe') ;
> insert into kmv (id, c1, c2) values (1, 'f22', 'qweasd') ;
> insert into kmv (id, c1, c2) values (1, 'f23', 'qwea1') ;
> insert into kmv (id, c1, c2) values (1, 'f24', '1qwe') ;
> insert into kmv (id, c1, c2) values (1, 'f25', 'asdqwe') ;
> select c2 from kmv.kmv where c2 like '%w22%a%';
> {noformat}
> Expected result: no rows.
> Actual result: qweasd, qwea1, asdqwe.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (CASSANDRA-12573) SASI index. Incorrect results for '%foo%bar%'-like search pattern.

2016-09-15 Thread DOAN DuyHai (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15493680#comment-15493680
 ] 

DOAN DuyHai edited comment on CASSANDRA-12573 at 9/15/16 3:33 PM:
--

In your data set, there is no row containing the substring {{w%a}}. The {{%}} 
is interpreted as a literal % character and not as a wildcard matching *any characters*


was (Author: doanduyhai):
In your data set, there is no row containing the substring '%w%a%'

> SASI index. Incorrect results for '%foo%bar%'-like search pattern. 
> ---
>
> Key: CASSANDRA-12573
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12573
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Mikhail Krupitskiy
>Assignee: Alex Petrov
>Priority: Critical
>  Labels: sasi
>
> We use Cassandra 3.7 and have faced a strange behaviour of SELECT requests 
> with "LIKE '%foo%bar%'" constraints on a column with SASI index.
> Below are few experiments that show this behaviour.
> Experiment 1:
> {noformat}
> drop keyspace if exists kmv;
> create keyspace if not exists kmv WITH REPLICATION = { 'class' : 
> 'SimpleStrategy', 'replication_factor':'1'} ;
> use kmv;
> CREATE TABLE if not exists kmv (id int primary key, c1 text, c2 text);
> CREATE CUSTOM INDEX ON kmv.kmv  ( c2 ) USING 
> 'org.apache.cassandra.index.sasi.SASIIndex' WITH OPTIONS = {
>  'mode': 'CONTAINS'
> };
> insert into kmv (id, c1, c2) values (1, 'f21', 'qwe') ;
> insert into kmv (id, c1, c2) values (2, 'f22', 'qweasd') ;
> insert into kmv (id, c1, c2) values (3, 'f23', 'qwea1') ;
> insert into kmv (id, c1, c2) values (4, 'f24', '1qwe') ;
> insert into kmv (id, c1, c2) values (5, 'f25', 'asdqwe') ;
> select c2 from kmv.kmv where c2 like '%w%a%';
> {noformat}
> Expected result: qweasd, qwea1.
> Actual result: no rows.
> Experiment 2 (NOTE: definition of index is changed):
> {noformat}
> drop keyspace if exists kmv;
> create keyspace if not exists kmv WITH REPLICATION = { 'class' : 
> 'SimpleStrategy', 'replication_factor':'1'} ;
> use kmv;
> CREATE TABLE if not exists kmv (id int primary key, c1 text, c2 text);
> CREATE CUSTOM INDEX ON kmv.kmv  ( c2 ) USING 
> 'org.apache.cassandra.index.sasi.SASIIndex' WITH OPTIONS = {
>  'mode': 'CONTAINS',
>  'analyzer_class': 
> 'org.apache.cassandra.index.sasi.analyzer.StandardAnalyzer',
>  'analyzed': 'true'
> };
> insert into kmv (id, c1, c2) values (1, 'f21', 'qwe') ;
> insert into kmv (id, c1, c2) values (2, 'f22', 'qweasd') ;
> insert into kmv (id, c1, c2) values (3, 'f23', 'qwea1') ;
> insert into kmv (id, c1, c2) values (4, 'f24', '1qwe') ;
> insert into kmv (id, c1, c2) values (5, 'f25', 'asdqwe') ;
> select c2 from kmv.kmv where c2 like '%w%a%';
> {noformat}
> Expected result: qweasd, qwea1.
> Actual result: asdqwe, qweasd, qwea1.
> Experiment 3 (NOTE: primary key is compound now and inserted data was 
> changed):
> {noformat}
> drop keyspace if exists kmv;
> create keyspace if not exists kmv WITH REPLICATION = { 'class' : 
> 'SimpleStrategy', 'replication_factor':'1'} ;
> use kmv;
> CREATE TABLE if not exists kmv (id int, c1 text, c2 text, PRIMARY KEY(id, 
> c1));
> CREATE CUSTOM INDEX ON kmv.kmv  ( c2 ) USING 
> 'org.apache.cassandra.index.sasi.SASIIndex' WITH OPTIONS = {
>  'mode': 'CONTAINS',
>  'analyzer_class': 
> 'org.apache.cassandra.index.sasi.analyzer.StandardAnalyzer',
>  'analyzed': 'true'
> };
> insert into kmv (id, c1, c2) values (1, 'f21', 'qwe') ;
> insert into kmv (id, c1, c2) values (1, 'f22', 'qweasd') ;
> insert into kmv (id, c1, c2) values (1, 'f23', 'qwea1') ;
> insert into kmv (id, c1, c2) values (1, 'f24', '1qwe') ;
> insert into kmv (id, c1, c2) values (1, 'f25', 'asdqwe') ;
> select c2 from kmv.kmv where c2 like '%w%a%';
> {noformat}
> Expected result: qweasd, qwea1.
> Actual result: qwe, qweasd, qwea1, 1qwe, asdqwe.
> Experiment 4 (NOTE: search criteria is changed):
> {noformat}
> drop keyspace if exists kmv;
> create keyspace if not exists kmv WITH REPLICATION = { 'class' : 
> 'SimpleStrategy', 'replication_factor':'1'} ;
> use kmv;
> CREATE TABLE if not exists kmv (id int, c1 text, c2 text, PRIMARY KEY(id, 
> c1));
> CREATE CUSTOM INDEX ON kmv.kmv  ( c2 ) USING 
> 'org.apache.cassandra.index.sasi.SASIIndex' WITH OPTIONS = {
>  'mode': 'CONTAINS',
>  'analyzer_class': 
> 'org.apache.cassandra.index.sasi.analyzer.StandardAnalyzer',
>  'analyzed': 'true'
> };
> insert into kmv (id, c1, c2) values (1, 'f21', 'qwe') ;
> insert into kmv (id, c1, c2) values (1, 'f22', 'qweasd') ;
> insert into kmv (id, c1, c2) values (1, 'f23', 'qwea1') ;
> insert into kmv (id, c1, c2) values (1, 'f24', '1qwe') ;
> insert into kmv (id, c1, c2) values (1, 'f25', 'asdqwe') ;
> select c2 from kmv.kmv where c2 like '%w22%a%';
> {noformat}
> Expected result: no rows.
> Actual result: qweasd, qwea1, 

[jira] [Commented] (CASSANDRA-12573) SASI index. Incorrect results for '%foo%bar%'-like search pattern.

2016-09-15 Thread DOAN DuyHai (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15493680#comment-15493680
 ] 

DOAN DuyHai commented on CASSANDRA-12573:
-

In your data set, there is no row containing the substring '%w%a%'

> SASI index. Incorrect results for '%foo%bar%'-like search pattern. 
> ---
>
> Key: CASSANDRA-12573
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12573
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Mikhail Krupitskiy
>Assignee: Alex Petrov
>Priority: Critical
>  Labels: sasi
>
> We use Cassandra 3.7 and have faced a strange behaviour of SELECT requests 
> with "LIKE '%foo%bar%'" constraints on a column with SASI index.
> Below are few experiments that show this behaviour.
> Experiment 1:
> {noformat}
> drop keyspace if exists kmv;
> create keyspace if not exists kmv WITH REPLICATION = { 'class' : 
> 'SimpleStrategy', 'replication_factor':'1'} ;
> use kmv;
> CREATE TABLE if not exists kmv (id int primary key, c1 text, c2 text);
> CREATE CUSTOM INDEX ON kmv.kmv  ( c2 ) USING 
> 'org.apache.cassandra.index.sasi.SASIIndex' WITH OPTIONS = {
>  'mode': 'CONTAINS'
> };
> insert into kmv (id, c1, c2) values (1, 'f21', 'qwe') ;
> insert into kmv (id, c1, c2) values (2, 'f22', 'qweasd') ;
> insert into kmv (id, c1, c2) values (3, 'f23', 'qwea1') ;
> insert into kmv (id, c1, c2) values (4, 'f24', '1qwe') ;
> insert into kmv (id, c1, c2) values (5, 'f25', 'asdqwe') ;
> select c2 from kmv.kmv where c2 like '%w%a%';
> {noformat}
> Expected result: qweasd, qwea1.
> Actual result: no rows.
> Experiment 2 (NOTE: definition of index is changed):
> {noformat}
> drop keyspace if exists kmv;
> create keyspace if not exists kmv WITH REPLICATION = { 'class' : 
> 'SimpleStrategy', 'replication_factor':'1'} ;
> use kmv;
> CREATE TABLE if not exists kmv (id int primary key, c1 text, c2 text);
> CREATE CUSTOM INDEX ON kmv.kmv  ( c2 ) USING 
> 'org.apache.cassandra.index.sasi.SASIIndex' WITH OPTIONS = {
>  'mode': 'CONTAINS',
>  'analyzer_class': 
> 'org.apache.cassandra.index.sasi.analyzer.StandardAnalyzer',
>  'analyzed': 'true'
> };
> insert into kmv (id, c1, c2) values (1, 'f21', 'qwe') ;
> insert into kmv (id, c1, c2) values (2, 'f22', 'qweasd') ;
> insert into kmv (id, c1, c2) values (3, 'f23', 'qwea1') ;
> insert into kmv (id, c1, c2) values (4, 'f24', '1qwe') ;
> insert into kmv (id, c1, c2) values (5, 'f25', 'asdqwe') ;
> select c2 from kmv.kmv where c2 like '%w%a%';
> {noformat}
> Expected result: qweasd, qwea1.
> Actual result: asdqwe, qweasd, qwea1.
> Experiment 3 (NOTE: primary key is compound now and inserted data was 
> changed):
> {noformat}
> drop keyspace if exists kmv;
> create keyspace if not exists kmv WITH REPLICATION = { 'class' : 
> 'SimpleStrategy', 'replication_factor':'1'} ;
> use kmv;
> CREATE TABLE if not exists kmv (id int, c1 text, c2 text, PRIMARY KEY(id, 
> c1));
> CREATE CUSTOM INDEX ON kmv.kmv  ( c2 ) USING 
> 'org.apache.cassandra.index.sasi.SASIIndex' WITH OPTIONS = {
>  'mode': 'CONTAINS',
>  'analyzer_class': 
> 'org.apache.cassandra.index.sasi.analyzer.StandardAnalyzer',
>  'analyzed': 'true'
> };
> insert into kmv (id, c1, c2) values (1, 'f21', 'qwe') ;
> insert into kmv (id, c1, c2) values (1, 'f22', 'qweasd') ;
> insert into kmv (id, c1, c2) values (1, 'f23', 'qwea1') ;
> insert into kmv (id, c1, c2) values (1, 'f24', '1qwe') ;
> insert into kmv (id, c1, c2) values (1, 'f25', 'asdqwe') ;
> select c2 from kmv.kmv where c2 like '%w%a%';
> {noformat}
> Expected result: qweasd, qwea1.
> Actual result: qwe, qweasd, qwea1, 1qwe, asdqwe.
> Experiment 4 (NOTE: search criteria is changed):
> {noformat}
> drop keyspace if exists kmv;
> create keyspace if not exists kmv WITH REPLICATION = { 'class' : 
> 'SimpleStrategy', 'replication_factor':'1'} ;
> use kmv;
> CREATE TABLE if not exists kmv (id int, c1 text, c2 text, PRIMARY KEY(id, 
> c1));
> CREATE CUSTOM INDEX ON kmv.kmv  ( c2 ) USING 
> 'org.apache.cassandra.index.sasi.SASIIndex' WITH OPTIONS = {
>  'mode': 'CONTAINS',
>  'analyzer_class': 
> 'org.apache.cassandra.index.sasi.analyzer.StandardAnalyzer',
>  'analyzed': 'true'
> };
> insert into kmv (id, c1, c2) values (1, 'f21', 'qwe') ;
> insert into kmv (id, c1, c2) values (1, 'f22', 'qweasd') ;
> insert into kmv (id, c1, c2) values (1, 'f23', 'qwea1') ;
> insert into kmv (id, c1, c2) values (1, 'f24', '1qwe') ;
> insert into kmv (id, c1, c2) values (1, 'f25', 'asdqwe') ;
> select c2 from kmv.kmv where c2 like '%w22%a%';
> {noformat}
> Expected result: no rows.
> Actual result: qweasd, qwea1, asdqwe.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-12573) SASI index. Incorrect results for '%foo%bar%'-like search pattern.

2016-09-15 Thread DOAN DuyHai (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15493422#comment-15493422
 ] 

DOAN DuyHai commented on CASSANDRA-12573:
-

Currently SASI only understands the {{%}} at the beginning (suffix search) or 
at the end (prefix search) of the pattern. Any expression containing a {{%}} in 
the middle, like {{%w%a%}}, will *not* be interpreted by SASI as a wildcard.

{{%w%a%}} will translate into "Give me all results containing {{w%a}}"
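
In other words (a small sketch of the supported forms, assuming a CONTAINS-mode 
SASI index on c2 as in the experiments above):

{noformat}
-- supported: prefix search (trailing %)
select c2 from kmv.kmv where c2 like 'qwe%';

-- supported: suffix search (leading %)
select c2 from kmv.kmv where c2 like '%qwe';

-- supported: contains search (% at both ends)
select c2 from kmv.kmv where c2 like '%qwe%';

-- NOT a multi-wildcard: the inner % is part of the searched term 'w%a'
select c2 from kmv.kmv where c2 like '%w%a%';
{noformat}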

> SASI index. Incorrect results for '%foo%bar%'-like search pattern. 
> ---
>
> Key: CASSANDRA-12573
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12573
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Mikhail Krupitskiy
>Assignee: Alex Petrov
>Priority: Critical
>  Labels: sasi
>
> We use Cassandra 3.7 and have faced a strange behaviour of SELECT requests 
> with "LIKE '%foo%bar%'" constraints on a column with SASI index.
> Below are few experiments that show this behaviour.
> Experiment 1:
> {noformat}
> drop keyspace if exists kmv;
> create keyspace if not exists kmv WITH REPLICATION = { 'class' : 
> 'SimpleStrategy', 'replication_factor':'1'} ;
> use kmv;
> CREATE TABLE if not exists kmv (id int primary key, c1 text, c2 text);
> CREATE CUSTOM INDEX ON kmv.kmv  ( c2 ) USING 
> 'org.apache.cassandra.index.sasi.SASIIndex' WITH OPTIONS = {
>  'mode': 'CONTAINS'
> };
> insert into kmv (id, c1, c2) values (1, 'f21', 'qwe') ;
> insert into kmv (id, c1, c2) values (2, 'f22', 'qweasd') ;
> insert into kmv (id, c1, c2) values (3, 'f23', 'qwea1') ;
> insert into kmv (id, c1, c2) values (4, 'f24', '1qwe') ;
> insert into kmv (id, c1, c2) values (5, 'f25', 'asdqwe') ;
> select c2 from kmv.kmv where c2 like '%w%a%';
> {noformat}
> Expected result: qweasd, qwea1.
> Actual result: no rows.
> Experiment 2 (NOTE: definition of index is changed):
> {noformat}
> drop keyspace if exists kmv;
> create keyspace if not exists kmv WITH REPLICATION = { 'class' : 
> 'SimpleStrategy', 'replication_factor':'1'} ;
> use kmv;
> CREATE TABLE if not exists kmv (id int primary key, c1 text, c2 text);
> CREATE CUSTOM INDEX ON kmv.kmv  ( c2 ) USING 
> 'org.apache.cassandra.index.sasi.SASIIndex' WITH OPTIONS = {
>  'mode': 'CONTAINS',
>  'analyzer_class': 
> 'org.apache.cassandra.index.sasi.analyzer.StandardAnalyzer',
>  'analyzed': 'true'
> };
> insert into kmv (id, c1, c2) values (1, 'f21', 'qwe') ;
> insert into kmv (id, c1, c2) values (2, 'f22', 'qweasd') ;
> insert into kmv (id, c1, c2) values (3, 'f23', 'qwea1') ;
> insert into kmv (id, c1, c2) values (4, 'f24', '1qwe') ;
> insert into kmv (id, c1, c2) values (5, 'f25', 'asdqwe') ;
> select c2 from kmv.kmv where c2 like '%w%a%';
> {noformat}
> Expected result: qweasd, qwea1.
> Actual result: asdqwe, qweasd, qwea1.
> Experiment 3 (NOTE: primary key is compound now and inserted data was 
> changed):
> {noformat}
> drop keyspace if exists kmv;
> create keyspace if not exists kmv WITH REPLICATION = { 'class' : 
> 'SimpleStrategy', 'replication_factor':'1'} ;
> use kmv;
> CREATE TABLE if not exists kmv (id int, c1 text, c2 text, PRIMARY KEY(id, 
> c1));
> CREATE CUSTOM INDEX ON kmv.kmv  ( c2 ) USING 
> 'org.apache.cassandra.index.sasi.SASIIndex' WITH OPTIONS = {
>  'mode': 'CONTAINS',
>  'analyzer_class': 
> 'org.apache.cassandra.index.sasi.analyzer.StandardAnalyzer',
>  'analyzed': 'true'
> };
> insert into kmv (id, c1, c2) values (1, 'f21', 'qwe') ;
> insert into kmv (id, c1, c2) values (1, 'f22', 'qweasd') ;
> insert into kmv (id, c1, c2) values (1, 'f23', 'qwea1') ;
> insert into kmv (id, c1, c2) values (1, 'f24', '1qwe') ;
> insert into kmv (id, c1, c2) values (1, 'f25', 'asdqwe') ;
> select c2 from kmv.kmv where c2 like '%w%a%';
> {noformat}
> Expected result: qweasd, qwea1.
> Actual result: qwe, qweasd, qwea1, 1qwe, asdqwe.
> Experiment 4 (NOTE: search criteria is changed):
> {noformat}
> drop keyspace if exists kmv;
> create keyspace if not exists kmv WITH REPLICATION = { 'class' : 
> 'SimpleStrategy', 'replication_factor':'1'} ;
> use kmv;
> CREATE TABLE if not exists kmv (id int, c1 text, c2 text, PRIMARY KEY(id, 
> c1));
> CREATE CUSTOM INDEX ON kmv.kmv  ( c2 ) USING 
> 'org.apache.cassandra.index.sasi.SASIIndex' WITH OPTIONS = {
>  'mode': 'CONTAINS',
>  'analyzer_class': 
> 'org.apache.cassandra.index.sasi.analyzer.StandardAnalyzer',
>  'analyzed': 'true'
> };
> insert into kmv (id, c1, c2) values (1, 'f21', 'qwe') ;
> insert into kmv (id, c1, c2) values (1, 'f22', 'qweasd') ;
> insert into kmv (id, c1, c2) values (1, 'f23', 'qwea1') ;
> insert into kmv (id, c1, c2) values (1, 'f24', '1qwe') ;
> insert into kmv (id, c1, c2) values (1, 'f25', 'asdqwe') ;
> select c2 from kmv.kmv where c2 like '%w22%a%';
> {noformat}
> Expected result: no rows.
> Actual result: qweasd, qwea1, 

[jira] [Commented] (CASSANDRA-10993) Make read and write requests paths fully non-blocking, eliminate related stages

2016-08-25 Thread DOAN DuyHai (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15437027#comment-15437027
 ] 

DOAN DuyHai commented on CASSANDRA-10993:
-

Although I understand all the benefits of using off-the-shelf libraries (not 
re-inventing the wheel, etc ...), I think that in this particular case 
[~iamaleksey]'s point is very relevant. 

 For such core functionality as TPC, we must have total control over our 
destiny, with enough room to change internals and low-level impls whenever 
needed. Of course, forking and modifying RxJava or other libraries to suit our 
requirements is always possible, but *if* we had to go that way later, I'm not 
sure the effort we'd have to put into that task (of rewriting/modifying) would 
be a win vs writing our own core lib for TPC purposes. 

My 2 cents

> Make read and write requests paths fully non-blocking, eliminate related 
> stages
> ---
>
> Key: CASSANDRA-10993
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10993
> Project: Cassandra
>  Issue Type: Sub-task
>  Components: Coordination, Local Write-Read Paths
>Reporter: Aleksey Yeschenko
>Assignee: Tyler Hobbs
> Fix For: 3.x
>
> Attachments: 10993-reads-no-evloop-integration-six-node-stress.svg, 
> tpc-benchmarks-2.txt, tpc-benchmarks.txt
>
>
> Building on work done by [~tjake] (CASSANDRA-10528), [~slebresne] 
> (CASSANDRA-5239), and others, convert read and write request paths to be 
> fully non-blocking, to enable the eventual transition from SEDA to TPC 
> (CASSANDRA-10989)
> Eliminate {{MUTATION}}, {{COUNTER_MUTATION}}, {{VIEW_MUTATION}}, {{READ}}, 
> and {{READ_REPAIR}} stages, move read and write execution directly to Netty 
> context.
> For lack of decent async I/O options on Linux, we’ll still have to retain an 
> extra thread pool for serving read requests for data not residing in our page 
> cache (CASSANDRA-5863), however.
> Implementation-wise, we only have two options available to us: explicit FSMs 
> and chained futures. Fibers would be the third, and easiest option, but 
> aren’t feasible in Java without resorting to direct bytecode manipulation 
> (ourselves or using [quasar|https://github.com/puniverse/quasar]).
> I have seen 4 implementations bases on chained futures/promises now - three 
> in Java and one in C++ - and I’m not convinced that it’s the optimal (or 
> sane) choice for representing our complex logic - think 2i quorum read 
> requests with timeouts at all levels, read repair (blocking and 
> non-blocking), and speculative retries in the mix, {{SERIAL}} reads and 
> writes.
> I’m currently leaning towards an implementation based on explicit FSMs, and 
> intend to provide a prototype - soonish - for comparison with 
> {{CompletableFuture}}-like variants.
> Either way the transition is a relatively boring straightforward refactoring.
> There are, however, some extension points on both write and read paths that 
> we do not control:
> - authorisation implementations will have to be non-blocking. We have control 
> over built-in ones, but for any custom implementation we will have to execute 
> them in a separate thread pool
> - 2i hooks on the write path will need to be non-blocking
> - any trigger implementations will not be allowed to block
> - UDFs and UDAs
> We are further limited by API compatibility restrictions in the 3.x line, 
> forbidding us to alter, or add any non-{{default}} interface methods to those 
> extension points, so these pose a problem.
> Depending on logistics, expecting to get this done in time for 3.4 or 3.6 
> feature release.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-11594) Too many open files on directories

2016-08-25 Thread DOAN DuyHai (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15436412#comment-15436412
 ] 

DOAN DuyHai commented on CASSANDRA-11594:
-

[~n0rad] To confirm that the prometheus exporter is the culprit, you can also 
try deactivating it for a while and watching whether the number of open files 
decreases or remains stable

> Too many open files on directories
> --
>
> Key: CASSANDRA-11594
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11594
> Project: Cassandra
>  Issue Type: Bug
>  Components: Core
>Reporter: n0rad
>Assignee: Stefania
>Priority: Critical
> Attachments: Grafana   Cassandra   Cluster.png, openfiles.zip, 
> screenshot.png
>
>
> I have a 6 nodes cluster in prod in 3 racks.
> each node :
> - 4Gb commitlogs on 343 files
> - 275Gb data on 504 files 
> On saturday, 1 node in each rack crash with with too many open files (seems 
> to be the similar node in each rack).
> {code}
> lsof -n -p $PID give me 66899 out of 65826 max
> {code}
> it contains 64527 open directories (2371 uniq)
> a part of the list :
> {code}
> java19076 root 2140r  DIR   8,17  143360 4386718705 
> /opt/stage2/pod-cassandra-aci-cassandra/rootfs/data/keyspaces/email_logs_query/emails-2d4abd00e9ea11e591199d740e07bd95
> java19076 root 2141r  DIR   8,17  143360 4386718705 
> /opt/stage2/pod-cassandra-aci-cassandra/rootfs/data/keyspaces/email_logs_query/emails-2d4abd00e9ea11e591199d740e07bd95
> java19076 root 2142r  DIR   8,17  143360 4386718705 
> /opt/stage2/pod-cassandra-aci-cassandra/rootfs/data/keyspaces/email_logs_query/emails-2d4abd00e9ea11e591199d740e07bd95
> java19076 root 2143r  DIR   8,17  143360 4386718705 
> /opt/stage2/pod-cassandra-aci-cassandra/rootfs/data/keyspaces/email_logs_query/emails-2d4abd00e9ea11e591199d740e07bd95
> java19076 root 2144r  DIR   8,17  143360 4386718705 
> /opt/stage2/pod-cassandra-aci-cassandra/rootfs/data/keyspaces/email_logs_query/emails-2d4abd00e9ea11e591199d740e07bd95
> java19076 root 2145r  DIR   8,17  143360 4386718705 
> /opt/stage2/pod-cassandra-aci-cassandra/rootfs/data/keyspaces/email_logs_query/emails-2d4abd00e9ea11e591199d740e07bd95
> java19076 root 2146r  DIR   8,17  143360 4386718705 
> /opt/stage2/pod-cassandra-aci-cassandra/rootfs/data/keyspaces/email_logs_query/emails-2d4abd00e9ea11e591199d740e07bd95
> java19076 root 2147r  DIR   8,17  143360 4386718705 
> /opt/stage2/pod-cassandra-aci-cassandra/rootfs/data/keyspaces/email_logs_query/emails-2d4abd00e9ea11e591199d740e07bd95
> java19076 root 2148r  DIR   8,17  143360 4386718705 
> /opt/stage2/pod-cassandra-aci-cassandra/rootfs/data/keyspaces/email_logs_query/emails-2d4abd00e9ea11e591199d740e07bd95
> java19076 root 2149r  DIR   8,17  143360 4386718705 
> /opt/stage2/pod-cassandra-aci-cassandra/rootfs/data/keyspaces/email_logs_query/emails-2d4abd00e9ea11e591199d740e07bd95
> java19076 root 2150r  DIR   8,17  143360 4386718705 
> /opt/stage2/pod-cassandra-aci-cassandra/rootfs/data/keyspaces/email_logs_query/emails-2d4abd00e9ea11e591199d740e07bd95
> java19076 root 2151r  DIR   8,17  143360 4386718705 
> /opt/stage2/pod-cassandra-aci-cassandra/rootfs/data/keyspaces/email_logs_query/emails-2d4abd00e9ea11e591199d740e07bd95
> java19076 root 2152r  DIR   8,17  143360 4386718705 
> /opt/stage2/pod-cassandra-aci-cassandra/rootfs/data/keyspaces/email_logs_query/emails-2d4abd00e9ea11e591199d740e07bd95
> java19076 root 2153r  DIR   8,17  143360 4386718705 
> /opt/stage2/pod-cassandra-aci-cassandra/rootfs/data/keyspaces/email_logs_query/emails-2d4abd00e9ea11e591199d740e07bd95
> java19076 root 2154r  DIR   8,17  143360 4386718705 
> /opt/stage2/pod-cassandra-aci-cassandra/rootfs/data/keyspaces/email_logs_query/emails-2d4abd00e9ea11e591199d740e07bd95
> java19076 root 2155r  DIR   8,17  143360 4386718705 
> /opt/stage2/pod-cassandra-aci-cassandra/rootfs/data/keyspaces/email_logs_query/emails-2d4abd00e9ea11e591199d740e07bd95
> {code}
> The 3 others nodes crashes 4 hours later



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-11521) Implement streaming for bulk read requests

2016-08-08 Thread DOAN DuyHai (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15411391#comment-15411391
 ] 

DOAN DuyHai commented on CASSANDRA-11521:
-

Thanks Stefania, great works!




> Implement streaming for bulk read requests
> --
>
> Key: CASSANDRA-11521
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11521
> Project: Cassandra
>  Issue Type: Sub-task
>  Components: Local Write-Read Paths
>Reporter: Stefania
>Assignee: Stefania
>  Labels: client-impacting, protocolv5
> Fix For: 3.x
>
> Attachments: final-patch-jfr-profiles-1.zip
>
>
> Allow clients to stream data from a C* host, bypassing the coordination layer 
> and eliminating the need to query individual pages one by one.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-11990) Address rows rather than partitions in SASI

2016-08-01 Thread DOAN DuyHai (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15402741#comment-15402741
 ] 

DOAN DuyHai commented on CASSANDRA-11990:
-

bq. backward-compatibility support

Can you clarify this backward-compatibility? Between what and what?

> Address rows rather than partitions in SASI
> ---
>
> Key: CASSANDRA-11990
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11990
> Project: Cassandra
>  Issue Type: Improvement
>  Components: CQL
>Reporter: Alex Petrov
>Assignee: Alex Petrov
> Attachments: perf.pdf, size_comparison.png
>
>
> Currently, the lookup in SASI index would return the key position of the 
> partition. After the partition lookup, the rows are iterated and the 
> operators are applied in order to filter out ones that do not match.
> bq. TokenTree which accepts variable size keys (such would enable different 
> partitioners, collections support, primary key indexing etc.), 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (CASSANDRA-12268) Make MV Index creation robust for wide referent rows

2016-07-21 Thread DOAN DuyHai (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15388481#comment-15388481
 ] 

DOAN DuyHai edited comment on CASSANDRA-12268 at 7/21/16 9:41 PM:
--

Additional remark:

 We can allow the user to define the thresholds:

 * either a maximum number of mutations per logged batch to send to the view 
paired-replica
 * and/or a maximum size (in bytes) of mutations per logged batch to send to the 
view paired-replica

At runtime, whichever threshold is hit first will apply, depending on the situation 
(see the sketch below)
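
A minimal sketch of what this sub-batching could look like (the names {{maxMutationsPerBatch}}, {{maxBytesPerBatch}} and {{sizeOf}} are hypothetical placeholders, not existing Cassandra API):

{code:java}
// Hedged sketch only: split the view mutations into logged batches, closing the
// current batch as soon as either hypothetical threshold would be exceeded.
// Needs java.util.ArrayList, java.util.List and java.util.function.ToLongFunction.
static <M> List<List<M>> chunkForLoggedBatches(Iterable<M> mutations,
                                               int maxMutationsPerBatch,
                                               long maxBytesPerBatch,
                                               ToLongFunction<M> sizeOf)
{
    List<List<M>> batches = new ArrayList<>();
    List<M> current = new ArrayList<>();
    long currentBytes = 0;

    for (M mutation : mutations)
    {
        long size = sizeOf.applyAsLong(mutation);
        boolean tooMany = current.size() >= maxMutationsPerBatch;
        boolean tooBig  = currentBytes + size > maxBytesPerBatch;
        if (!current.isEmpty() && (tooMany || tooBig))
        {
            batches.add(current);        // close the current logged batch
            current = new ArrayList<>();
            currentBytes = 0;
        }
        current.add(mutation);
        currentBytes += size;
    }
    if (!current.isEmpty())
        batches.add(current);
    return batches;
}
{code}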

/cc [~carlyeks]


was (Author: doanduyhai):
Additional remark:

 We can allow the user define the thresholds:

 * either a maximum number of mutation per logged batch to send to view 
paired-replica
 * or/and the maximum size (in bytes) of the mutation per logged batch to send 
to view paired-replica

At runtime, either threshold will apply depending on the situation

> Make MV Index creation robust for wide referent rows
> 
>
> Key: CASSANDRA-12268
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12268
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core
>Reporter: Jonathan Shook
>
> When creating an index for a materialized view for extant data, heap pressure 
> is very dependent on the cardinality of rows associated with each index 
> value. With the way that per-index value rows are created within the index, 
> this can cause unbounded heap pressure, which can cause OOM. This appears to 
> be a side-effect of how each index row is applied atomically as with batches.
> The commit logs can accumulate enough during the process to prevent the node 
> from being restarted. Given that this occurs during global index creation, 
> this can happen on multiple nodes, making stable recovery of a node set 
> difficult, as co-replicas become unavailable to assist in back-filling data 
> from commitlogs.
> While it is understandable that you want to avoid having relatively wide rows 
>  even in materialized views, this scenario represent a particularly difficult 
> scenario for triage.
> The basic recommendation for improving this is to sub-group the index 
> creation into smaller chunks internally, providing a maximal bound against 
> the heap pressure when it is needed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-12268) Make MV Index creation robust for wide referent rows

2016-07-21 Thread DOAN DuyHai (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15388481#comment-15388481
 ] 

DOAN DuyHai commented on CASSANDRA-12268:
-

Additional remark:

 We can allow the user to define the thresholds:

 * either a maximum number of mutations per logged batch to send to the view 
paired-replica
 * and/or a maximum size (in bytes) of mutations per logged batch to send to the 
view paired-replica

At runtime, whichever threshold is hit first will apply, depending on the situation

> Make MV Index creation robust for wide referent rows
> 
>
> Key: CASSANDRA-12268
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12268
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core
>Reporter: Jonathan Shook
>
> When creating an index for a materialized view for extant data, heap pressure 
> is very dependent on the cardinality of rows associated with each index 
> value. With the way that per-index value rows are created within the index, 
> this can cause unbounded heap pressure, which can cause OOM. This appears to 
> be a side-effect of how each index row is applied atomically as with batches.
> The commit logs can accumulate enough during the process to prevent the node 
> from being restarted. Given that this occurs during global index creation, 
> this can happen on multiple nodes, making stable recovery of a node set 
> difficult, as co-replicas become unavailable to assist in back-filling data 
> from commitlogs.
> While it is understandable that you want to avoid having relatively wide rows 
>  even in materialized views, this scenario represents a particularly difficult 
> scenario for triage.
> The basic recommendation for improving this is to sub-group the index 
> creation into smaller chunks internally, providing a maximal bound against 
> the heap pressure when it is needed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-12149) NullPointerException on SELECT with SASI index

2016-07-11 Thread DOAN DuyHai (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15371429#comment-15371429
 ] 

DOAN DuyHai commented on CASSANDRA-12149:
-

bq. How could I partition SELECT results hitting a single large Cassandra 
partition, when I do not know values of clustering columns?

 Use server-side paging:

 "SELECT * FROM mytable WHERE partition = 'xxx'"

 Then set a fetchSize on the statement using the driver, retrieve an 
Iterator from the ResultSet, and iterate with {{while(iterator.hasNext())}}.
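
For example, a minimal sketch with the DataStax Java driver 3.x (the contact point, table name and fetch size below are illustrative, following the placeholder query above):

{code:java}
import java.util.Iterator;

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;
import com.datastax.driver.core.Statement;

public class PagedPartitionRead
{
    public static void main(String[] args)
    {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect())
        {
            Statement stmt = new SimpleStatement("SELECT * FROM mytable WHERE partition = 'xxx'");
            stmt.setFetchSize(1000); // rows fetched per page from the coordinator

            ResultSet rs = session.execute(stmt);
            Iterator<Row> iterator = rs.iterator(); // next pages are fetched transparently as needed
            while (iterator.hasNext())
            {
                Row row = iterator.next();
                // process the row here
            }
        }
    }
}
{code}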

> NullPointerException on SELECT with SASI index
> --
>
> Key: CASSANDRA-12149
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12149
> Project: Cassandra
>  Issue Type: Bug
>  Components: sasi
>Reporter: Andrey Konstantinov
> Attachments: CASSANDRA-12149.txt
>
>
> If I execute the sequence of queries (see the attached file), Cassandra 
> aborts a connection reporting NPE on server side. SELECT query without token 
> range filter works, but does not work when token range filter is specified. 
> My intent was to issue multiple SELECT queries targeting the same single 
> partition, filtered by a column indexed by SASI, partitioning results by 
> different token ranges.
> Output from cqlsh on SELECT is the following:
> cqlsh> SELECT namespace, entity, timestamp, feature1, feature2 FROM 
> mykeyspace.myrecordtable WHERE namespace = 'ns2' AND entity = 'entity2' AND 
> feature1 > 11 AND feature1 < 31  AND token(namespace, entity) <= 
> 9223372036854775807;
> ServerError:  message="java.lang.NullPointerException">



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-12149) NullPointerException on SELECT with SASI index

2016-07-09 Thread DOAN DuyHai (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15369264#comment-15369264
 ] 

DOAN DuyHai commented on CASSANDRA-12149:
-

The query used = {{SELECT namespace, entity, timestamp, feature1, feature2 FROM 
mykeyspace.myrecordtable WHERE namespace = 'ns2' AND entity = 'entity2' AND 
feature1= 11 AND token(namespace, entity) <= 9223372036854775807;}}

Ok, after successfully reproducing the NPE in debug mode, it is *not* a SASI 
bug but is rather related to how Cassandra optimises the *WHERE* clause in 
general.

 NPE root cause is {{slice.bound(b) == null}} in 
{{TokenRestriction.bounds(Bound b, QueryOptions options)}}:

{code:java}
public List<ByteBuffer> bounds(Bound b, QueryOptions options) throws InvalidRequestException
{
    return Collections.singletonList(slice.bound(b).bindAndGet(options));
}
{code}

Going up the call stack, the culprit is at 
{{StatementRestrictions.getPartitionKeyBounds(IPartitioner p, QueryOptions 
options)}}:

{code:java}
private AbstractBounds<PartitionPosition> getPartitionKeyBounds(IPartitioner p,
                                                                QueryOptions options)
{
    ByteBuffer startKeyBytes = getPartitionKeyBound(Bound.START, options);
    ByteBuffer finishKeyBytes = getPartitionKeyBound(Bound.END, options);
    ...
{code}

 Since we have no lower bound restriction on the {{token(namespace, entity)}}, 
the call to {{getPartitionKeyBound(Bound.START, options)}} generates the NPE.

 I tried another query without using the secondary index {{SELECT namespace, entity, 
timestamp, feature1, feature2 FROM mykeyspace.myrecordtable WHERE namespace = 
'ns2' AND entity = 'entity2' AND token(namespace, entity) <= 
9223372036854775807;}} and this time, no NPE.

 The place in the call stack where both queries diverge is at 
{{SelectStatement.getQuery(QueryOptions options, int nowInSec, int userLimit, 
int perPartitionLimit)}}:

{code:java}
public ReadQuery getQuery(QueryOptions options, int nowInSec, int userLimit, int perPartitionLimit) throws RequestValidationException
{
    DataLimits limit = getDataLimits(userLimit, perPartitionLimit);
    if (restrictions.isKeyRange() || restrictions.usesSecondaryIndexing())
        return getRangeCommand(options, limit, nowInSec);

    return getSliceCommands(options, limit, nowInSec);
}
{code}

 When using the secondary index, we obviously fall into the *if* branch. When NOT 
using the secondary index, as in my 2nd SELECT statement, the *if* condition is not 
satisfied, so the code path is {{return getSliceCommands(options, limit, 
nowInSec);}}.

 Strangely enough, even with the 2nd SELECT, {{restrictions.isKeyRange()}} 
should be *true*, so why is it *false*?

 The culprit is at 
{{StatementRestrictions.processPartitionKeyRestrictions(boolean 
hasQueriableIndex)}}

{code:java}
...
if (partitionKeyRestrictions.isOnToken())
    isKeyRange = true;

if (hasUnrestrictedPartitionKeyComponents())
{
    if (!partitionKeyRestrictions.isEmpty())
    {
        if (!hasQueriableIndex)
            throw invalidRequest("Partition key parts: %s must be restricted as other parts are",
                                 Joiner.on(", ").join(getPartitionKeyUnrestrictedComponents()));
    }

    isKeyRange = true;
    usesSecondaryIndexing = hasQueriableIndex;
}
}
{code}

 The condition {{if (partitionKeyRestrictions.isOnToken())}} evaluates to *false*, 
so the variable _isKeyRange_ is never set to *true*.

 It evaluates to *false* because of {{TokenFilter.isOnToken()}}:

{code:java}
public boolean isOnToken()
{
    // if all partition key columns have non-token restrictions, we can simply use the token range to filter
    // those restrictions and then ignore the token range
    return restrictions.size() < tokenRestriction.size();
}
{code}

 Here, since there are *more* conditions on primary key than on token 
restriction, the SELECT is not considered to be a token restriction, which is 
sensible.

 By the way, if we look at the original WHERE clause {{WHERE namespace = 'ns2' 
AND entity = 'entity2' AND feature1= 11 AND token(namespace, entity) <= 
9223372036854775807;}}, the restriction using the *token* function is *useless* 
because we already provide the complete partition key restriction.

 And I tried the original query by removing {{token(namespace, entity) <= 
9223372036854775807}} and it works like a charm.

 I would consider this JIRA to be *not an issue*

cc [~xedin]  [~beobal] and [~avkonst]
 




> NullPointerException on SELECT with SASI index
> --
>
> Key: CASSANDRA-12149
> 

[jira] [Commented] (CASSANDRA-12149) NullPointerException on SELECT with SASI index

2016-07-08 Thread DOAN DuyHai (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15368141#comment-15368141
 ] 

DOAN DuyHai commented on CASSANDRA-12149:
-

I'll debug this issue this weekend using the given sample dataset

> NullPointerException on SELECT with SASI index
> --
>
> Key: CASSANDRA-12149
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12149
> Project: Cassandra
>  Issue Type: Bug
>  Components: sasi
>Reporter: Andrey Konstantinov
> Attachments: CASSANDRA-12149.txt
>
>
> If I execute the sequence of queries (see the attached file), Cassandra 
> aborts a connection reporting NPE on server side. SELECT query without token 
> range filter works, but does not work when token range filter is specified. 
> My intent was to issue multiple SELECT queries targeting the same single 
> partition, filtered by a column indexed by SASI, partitioning results by 
> different token ranges.
> Output from cqlsh on SELECT is the following:
> cqlsh> SELECT namespace, entity, timestamp, feature1, feature2 FROM 
> mykeyspace.myrecordtable WHERE namespace = 'ns2' AND entity = 'entity2' AND 
> feature1 > 11 AND feature1 < 31  AND token(namespace, entity) <= 
> 9223372036854775807;
> ServerError:  message="java.lang.NullPointerException">



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-12149) NullPointerException on SELECT with SASI index

2016-07-08 Thread DOAN DuyHai (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15367302#comment-15367302
 ] 

DOAN DuyHai commented on CASSANDRA-12149:
-

I can try to test and reproduce

> NullPointerException on SELECT with SASI index
> --
>
> Key: CASSANDRA-12149
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12149
> Project: Cassandra
>  Issue Type: Bug
>  Components: sasi
>Reporter: Andrey Konstantinov
> Attachments: CASSANDRA-12149.txt
>
>
> If I execute the sequence of queries (see the attached file), Cassandra 
> aborts a connection reporting NPE on server side. SELECT query without token 
> range filter works, but does not work when token range filter is specified. 
> My intent was to issue multiple SELECT queries targeting the same single 
> partition, filtered by a column indexed by SASI, partitioning results by 
> different token ranges.
> Output from cqlsh on SELECT is the following:
> cqlsh> SELECT namespace, entity, timestamp, feature1, feature2 FROM 
> mykeyspace.myrecordtable WHERE namespace = 'ns2' AND entity = 'entity2' AND 
> feature1 > 11 AND feature1 < 31  AND token(namespace, entity) <= 
> 9223372036854775807;
> ServerError:  message="java.lang.NullPointerException">



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-12149) NullPointerException on SELECT with SASI index

2016-07-07 Thread DOAN DuyHai (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15367255#comment-15367255
 ] 

DOAN DuyHai commented on CASSANDRA-12149:
-

Which version of Cassandra are you using?

> NullPointerException on SELECT with SASI index
> --
>
> Key: CASSANDRA-12149
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12149
> Project: Cassandra
>  Issue Type: Bug
>  Components: sasi
>Reporter: Andrey Konstantinov
> Attachments: CASSANDRA-12149.txt
>
>
> If I execute the sequence of queries (see the attached file), Cassandra 
> aborts a connection reporting NPE on server side. SELECT query without token 
> range filter works, but does not work when token range filter is specified. 
> My intent was to issue multiple SELECT queries targeting the same single 
> partition, filtered by a column indexed by SASI, partitioning results by 
> different token ranges.
> Output from cqlsh on SELECT is the following:
> cqlsh> SELECT namespace, entity, timestamp, feature1, feature2 FROM 
> mykeyspace.myrecordtable WHERE namespace = 'ns2' AND entity = 'entity2' AND 
> feature1 > 11 AND feature1 < 31  AND token(namespace, entity) <= 
> 9223372036854775807;
> ServerError:  message="java.lang.NullPointerException">



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-12073) [SASI] PREFIX search on CONTAINS/NonTokenizer mode returns only partial results

2016-07-04 Thread DOAN DuyHai (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15361102#comment-15361102
 ] 

DOAN DuyHai commented on CASSANDRA-12073:
-

[~xedin]  new patch attached (V2) with your inline suggestion + 1 extra unit 
test for {{OnDiskIndex}}

> [SASI] PREFIX search on CONTAINS/NonTokenizer mode returns only partial 
> results
> ---
>
> Key: CASSANDRA-12073
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12073
> Project: Cassandra
>  Issue Type: Bug
>  Components: CQL
> Environment: Cassandra 3.7
>Reporter: DOAN DuyHai
>Assignee: DOAN DuyHai
> Fix For: 3.x
>
> Attachments: patch_PREFIX_search_with_CONTAINS_mode.txt, 
> patch_PREFIX_search_with_CONTAINS_mode_V2.txt
>
>
> {noformat}
> cqlsh:music> CREATE TABLE music.albums (
> id uuid PRIMARY KEY,
> artist text,
> country text,
> quality text,
> status text,
> title text,
> year int
> );
> cqlsh:music> CREATE CUSTOM INDEX albums_artist_idx ON music.albums (artist) 
> USING 'org.apache.cassandra.index.sasi.SASIIndex' WITH OPTIONS = {'mode': 
> 'CONTAINS', 'analyzer_class': 
> 'org.apache.cassandra.index.sasi.analyzer.NonTokenizingAnalyzer', 
> 'case_sensitive': 'false'};
> cqlsh:music> SELECT * FROM albums WHERE artist like 'lady%'  LIMIT 100;
>  id   | artist| country| quality 
> | status| title | year
> --+---++-+---+---+--
>  372bb0ab-3263-41bc-baad-bb520ddfa787 | Lady Gaga |USA |  normal 
> |  Official |   Red and Blue EP | 2006
>  1a4abbcd-b5de-4c69-a578-31231e01ff09 | Lady Gaga |Unknown |  normal 
> | Promotion |Poker Face | 2008
>  31f4a0dc-9efc-48bf-9f5e-bfc09af42b82 | Lady Gaga |USA |  normal 
> |  Official |   The Cherrytree Sessions | 2009
>  8ebfaebd-28d0-477d-b735-469661ce6873 | Lady Gaga |Unknown |  normal 
> |  Official |Poker Face | 2009
>  98107d82-e0dd-46bc-a273-1577578984c7 | Lady Gaga |USA |  normal 
> |  Official |   Just Dance: The Remixes | 2008
>  a76af0f2-f5c5-4306-974a-e3c17158e6c6 | Lady Gaga |  Italy |  normal 
> |  Official |  The Fame | 2008
>  849ee019-8b15-4767-8660-537ab9710459 | Lady Gaga |USA |  normal 
> |  Official |Christmas Tree | 2008
>  4bad59ac-913f-43da-9d48-89adc65453d2 | Lady Gaga |  Australia |  normal 
> |  Official | Eh Eh | 2009
>  80327731-c450-457f-bc12-0a8c21fd9c5d | Lady Gaga |USA |  normal 
> |  Official | Just Dance Remixes Part 2 | 2008
>  3ad33659-e932-4d31-a040-acab0e23c3d4 | Lady Gaga |Unknown |  normal 
> |  null |Just Dance | 2008
>  9adce7f6-6a1d-49fd-b8bd-8f6fac73558b | Lady Gaga | United Kingdom |  normal 
> |  Official |Just Dance | 2009
> (11 rows)
> {noformat}
> *SASI* says that there are only 11 artists whose name starts with {{lady}}.
> However, in the data set, there are:
> * Lady Pank
> * Lady Saw
> * Lady Saw
> * Ladyhawke
> * Ladytron
> * Ladysmith Black Mambazo
> * Lady Gaga
> * Lady Sovereign
> etc ...
> By debugging the source code, the issue is in 
> {{OnDiskIndex.TermIterator::computeNext()}}
> {code:java}
> for (;;)
> {
> if (currentBlock == null)
> return endOfData();
> if (offset >= 0 && offset < currentBlock.termCount())
> {
> DataTerm currentTerm = currentBlock.getTerm(nextOffset());
> if (checkLower && !e.isLowerSatisfiedBy(currentTerm))
> continue;
> // flip the flag right on the first bounds match
> // to avoid expensive comparisons
> checkLower = false;
> if (checkUpper && !e.isUpperSatisfiedBy(currentTerm))
> return endOfData();
> return currentTerm;
> }
> nextBlock();
> }
> {code}
>  So the {{endOfData()}} conditions are:
> * currentBlock == null
> * checkUpper && !e.isUpperSatisfiedBy(currentTerm)
> The problem is that {{e::isUpperSatisfiedBy}} is checking not only whether 
> the term matches but also returns *false* when it's a *partial term*!
> {code:java}
> public boolean isUpperSatisfiedBy(OnDiskIndex.DataTerm term)
> {
> if (!hasUpper())
> return true;
> if (nonMatchingPartial(term))
> return false;
> int cmp = term.compareTo(validator, upper.value, false);
> return cmp < 0 || cmp == 0 && 

[jira] [Updated] (CASSANDRA-12073) [SASI] PREFIX search on CONTAINS/NonTokenizer mode returns only partial results

2016-07-04 Thread DOAN DuyHai (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-12073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DOAN DuyHai updated CASSANDRA-12073:

Attachment: patch_PREFIX_search_with_CONTAINS_mode_V2.txt

> [SASI] PREFIX search on CONTAINS/NonTokenizer mode returns only partial 
> results
> ---
>
> Key: CASSANDRA-12073
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12073
> Project: Cassandra
>  Issue Type: Bug
>  Components: CQL
> Environment: Cassandra 3.7
>Reporter: DOAN DuyHai
>Assignee: DOAN DuyHai
> Fix For: 3.x
>
> Attachments: patch_PREFIX_search_with_CONTAINS_mode.txt, 
> patch_PREFIX_search_with_CONTAINS_mode_V2.txt
>
>
> {noformat}
> cqlsh:music> CREATE TABLE music.albums (
> id uuid PRIMARY KEY,
> artist text,
> country text,
> quality text,
> status text,
> title text,
> year int
> );
> cqlsh:music> CREATE CUSTOM INDEX albums_artist_idx ON music.albums (artist) 
> USING 'org.apache.cassandra.index.sasi.SASIIndex' WITH OPTIONS = {'mode': 
> 'CONTAINS', 'analyzer_class': 
> 'org.apache.cassandra.index.sasi.analyzer.NonTokenizingAnalyzer', 
> 'case_sensitive': 'false'};
> cqlsh:music> SELECT * FROM albums WHERE artist like 'lady%'  LIMIT 100;
>  id   | artist| country| quality 
> | status| title | year
> --+---++-+---+---+--
>  372bb0ab-3263-41bc-baad-bb520ddfa787 | Lady Gaga |USA |  normal 
> |  Official |   Red and Blue EP | 2006
>  1a4abbcd-b5de-4c69-a578-31231e01ff09 | Lady Gaga |Unknown |  normal 
> | Promotion |Poker Face | 2008
>  31f4a0dc-9efc-48bf-9f5e-bfc09af42b82 | Lady Gaga |USA |  normal 
> |  Official |   The Cherrytree Sessions | 2009
>  8ebfaebd-28d0-477d-b735-469661ce6873 | Lady Gaga |Unknown |  normal 
> |  Official |Poker Face | 2009
>  98107d82-e0dd-46bc-a273-1577578984c7 | Lady Gaga |USA |  normal 
> |  Official |   Just Dance: The Remixes | 2008
>  a76af0f2-f5c5-4306-974a-e3c17158e6c6 | Lady Gaga |  Italy |  normal 
> |  Official |  The Fame | 2008
>  849ee019-8b15-4767-8660-537ab9710459 | Lady Gaga |USA |  normal 
> |  Official |Christmas Tree | 2008
>  4bad59ac-913f-43da-9d48-89adc65453d2 | Lady Gaga |  Australia |  normal 
> |  Official | Eh Eh | 2009
>  80327731-c450-457f-bc12-0a8c21fd9c5d | Lady Gaga |USA |  normal 
> |  Official | Just Dance Remixes Part 2 | 2008
>  3ad33659-e932-4d31-a040-acab0e23c3d4 | Lady Gaga |Unknown |  normal 
> |  null |Just Dance | 2008
>  9adce7f6-6a1d-49fd-b8bd-8f6fac73558b | Lady Gaga | United Kingdom |  normal 
> |  Official |Just Dance | 2009
> (11 rows)
> {noformat}
> *SASI* says that there are only 11 artists whose name starts with {{lady}}.
> However, in the data set, there are:
> * Lady Pank
> * Lady Saw
> * Lady Saw
> * Ladyhawke
> * Ladytron
> * Ladysmith Black Mambazo
> * Lady Gaga
> * Lady Sovereign
> etc ...
> By debugging the source code, the issue is in 
> {{OnDiskIndex.TermIterator::computeNext()}}
> {code:java}
> for (;;)
> {
> if (currentBlock == null)
> return endOfData();
> if (offset >= 0 && offset < currentBlock.termCount())
> {
> DataTerm currentTerm = currentBlock.getTerm(nextOffset());
> if (checkLower && !e.isLowerSatisfiedBy(currentTerm))
> continue;
> // flip the flag right on the first bounds match
> // to avoid expensive comparisons
> checkLower = false;
> if (checkUpper && !e.isUpperSatisfiedBy(currentTerm))
> return endOfData();
> return currentTerm;
> }
> nextBlock();
> }
> {code}
>  So the {{endOfData()}} conditions are:
> * currentBlock == null
> * checkUpper && !e.isUpperSatisfiedBy(currentTerm)
> The problem is that {{e::isUpperSatisfiedBy}} is checking not only whether 
> the term matches but also returns *false* when it's a *partial term*!
> {code:java}
> public boolean isUpperSatisfiedBy(OnDiskIndex.DataTerm term)
> {
> if (!hasUpper())
> return true;
> if (nonMatchingPartial(term))
> return false;
> int cmp = term.compareTo(validator, upper.value, false);
> return cmp < 0 || cmp == 0 && upper.inclusive;
> }
> {code}
> By debugging the OnDiskIndex data, I've 

[jira] [Commented] (CASSANDRA-11990) Address rows rather than partitions in SASI

2016-07-03 Thread DOAN DuyHai (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15360655#comment-15360655
 ] 

DOAN DuyHai commented on CASSANDRA-11990:
-

[~ifesdjeen]  There is also a big change in {{SSTableReader}} that is required 
to implement this JIRA. Currently this class only exposes the method 
{{DecoratedKey  SSTableReader.keyAt(long indexPosition)}}. We'll need to 
introduce a new method {{Unfiltered SSTableReader.rowAt(long rowOffset)}}.

 Modifying the {{SSTableReader}} has a big impact on the overall storage engine 

> Address rows rather than partitions in SASI
> ---
>
> Key: CASSANDRA-11990
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11990
> Project: Cassandra
>  Issue Type: Improvement
>  Components: CQL
>Reporter: Alex Petrov
>Assignee: Alex Petrov
>
> Currently, the lookup in SASI index would return the key position of the 
> partition. After the partition lookup, the rows are iterated and the 
> operators are applied in order to filter out ones that do not match.
> bq. TokenTree which accepts variable size keys (such would enable different 
> partitioners, collections support, primary key indexing etc.), 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-12073) [SASI] PREFIX search on CONTAINS/NonTokenizer mode returns only partial results

2016-06-27 Thread DOAN DuyHai (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15352099#comment-15352099
 ] 

DOAN DuyHai commented on CASSANDRA-12073:
-

OK, I'll update the patch with your remarks [~xedin]

> [SASI] PREFIX search on CONTAINS/NonTokenizer mode returns only partial 
> results
> ---
>
> Key: CASSANDRA-12073
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12073
> Project: Cassandra
>  Issue Type: Bug
>  Components: CQL
> Environment: Cassandra 3.7
>Reporter: DOAN DuyHai
>Assignee: DOAN DuyHai
> Fix For: 3.x
>
> Attachments: patch_PREFIX_search_with_CONTAINS_mode.txt
>
>
> {noformat}
> cqlsh:music> CREATE TABLE music.albums (
> id uuid PRIMARY KEY,
> artist text,
> country text,
> quality text,
> status text,
> title text,
> year int
> );
> cqlsh:music> CREATE CUSTOM INDEX albums_artist_idx ON music.albums (artist) 
> USING 'org.apache.cassandra.index.sasi.SASIIndex' WITH OPTIONS = {'mode': 
> 'CONTAINS', 'analyzer_class': 
> 'org.apache.cassandra.index.sasi.analyzer.NonTokenizingAnalyzer', 
> 'case_sensitive': 'false'};
> cqlsh:music> SELECT * FROM albums WHERE artist like 'lady%'  LIMIT 100;
>  id   | artist| country| quality 
> | status| title | year
> --+---++-+---+---+--
>  372bb0ab-3263-41bc-baad-bb520ddfa787 | Lady Gaga |USA |  normal 
> |  Official |   Red and Blue EP | 2006
>  1a4abbcd-b5de-4c69-a578-31231e01ff09 | Lady Gaga |Unknown |  normal 
> | Promotion |Poker Face | 2008
>  31f4a0dc-9efc-48bf-9f5e-bfc09af42b82 | Lady Gaga |USA |  normal 
> |  Official |   The Cherrytree Sessions | 2009
>  8ebfaebd-28d0-477d-b735-469661ce6873 | Lady Gaga |Unknown |  normal 
> |  Official |Poker Face | 2009
>  98107d82-e0dd-46bc-a273-1577578984c7 | Lady Gaga |USA |  normal 
> |  Official |   Just Dance: The Remixes | 2008
>  a76af0f2-f5c5-4306-974a-e3c17158e6c6 | Lady Gaga |  Italy |  normal 
> |  Official |  The Fame | 2008
>  849ee019-8b15-4767-8660-537ab9710459 | Lady Gaga |USA |  normal 
> |  Official |Christmas Tree | 2008
>  4bad59ac-913f-43da-9d48-89adc65453d2 | Lady Gaga |  Australia |  normal 
> |  Official | Eh Eh | 2009
>  80327731-c450-457f-bc12-0a8c21fd9c5d | Lady Gaga |USA |  normal 
> |  Official | Just Dance Remixes Part 2 | 2008
>  3ad33659-e932-4d31-a040-acab0e23c3d4 | Lady Gaga |Unknown |  normal 
> |  null |Just Dance | 2008
>  9adce7f6-6a1d-49fd-b8bd-8f6fac73558b | Lady Gaga | United Kingdom |  normal 
> |  Official |Just Dance | 2009
> (11 rows)
> {noformat}
> *SASI* says that there are only 11 artists whose name starts with {{lady}}.
> However, in the data set, there are:
> * Lady Pank
> * Lady Saw
> * Lady Saw
> * Ladyhawke
> * Ladytron
> * Ladysmith Black Mambazo
> * Lady Gaga
> * Lady Sovereign
> etc ...
> By debugging the source code, the issue is in 
> {{OnDiskIndex.TermIterator::computeNext()}}
> {code:java}
> for (;;)
> {
> if (currentBlock == null)
> return endOfData();
> if (offset >= 0 && offset < currentBlock.termCount())
> {
> DataTerm currentTerm = currentBlock.getTerm(nextOffset());
> if (checkLower && !e.isLowerSatisfiedBy(currentTerm))
> continue;
> // flip the flag right on the first bounds match
> // to avoid expensive comparisons
> checkLower = false;
> if (checkUpper && !e.isUpperSatisfiedBy(currentTerm))
> return endOfData();
> return currentTerm;
> }
> nextBlock();
> }
> {code}
>  So the {{endOfData()}} conditions are:
> * currentBlock == null
> * checkUpper && !e.isUpperSatisfiedBy(currentTerm)
> The problem is that {{e::isUpperSatisfiedBy}} is checking not only whether 
> the term matches but also returns *false* when it's a *partial term*!
> {code:java}
> public boolean isUpperSatisfiedBy(OnDiskIndex.DataTerm term)
> {
> if (!hasUpper())
> return true;
> if (nonMatchingPartial(term))
> return false;
> int cmp = term.compareTo(validator, upper.value, false);
> return cmp < 0 || cmp == 0 && upper.inclusive;
> }
> {code}
> By debugging the OnDiskIndex data, I've found:
> {noformat}
> 

[jira] [Updated] (CASSANDRA-12073) [SASI] PREFIX search on CONTAINS/NonTokenizer mode returns only partial results

2016-06-26 Thread DOAN DuyHai (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-12073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DOAN DuyHai updated CASSANDRA-12073:

Attachment: patch_PREFIX_search_with_CONTAINS_mode.txt

Patch attached

> [SASI] PREFIX search on CONTAINS/NonTokenizer mode returns only partial 
> results
> ---
>
> Key: CASSANDRA-12073
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12073
> Project: Cassandra
>  Issue Type: Bug
>  Components: CQL
> Environment: Cassandra 3.7
>Reporter: DOAN DuyHai
>Assignee: DOAN DuyHai
> Fix For: 3.8
>
> Attachments: patch_PREFIX_search_with_CONTAINS_mode.txt
>
>
> {noformat}
> cqlsh:music> CREATE TABLE music.albums (
> id uuid PRIMARY KEY,
> artist text,
> country text,
> quality text,
> status text,
> title text,
> year int
> );
> cqlsh:music> CREATE CUSTOM INDEX albums_artist_idx ON music.albums (artist) 
> USING 'org.apache.cassandra.index.sasi.SASIIndex' WITH OPTIONS = {'mode': 
> 'CONTAINS', 'analyzer_class': 
> 'org.apache.cassandra.index.sasi.analyzer.NonTokenizingAnalyzer', 
> 'case_sensitive': 'false'};
> cqlsh:music> SELECT * FROM albums WHERE artist like 'lady%'  LIMIT 100;
>  id   | artist| country| quality 
> | status| title | year
> --+---++-+---+---+--
>  372bb0ab-3263-41bc-baad-bb520ddfa787 | Lady Gaga |USA |  normal 
> |  Official |   Red and Blue EP | 2006
>  1a4abbcd-b5de-4c69-a578-31231e01ff09 | Lady Gaga |Unknown |  normal 
> | Promotion |Poker Face | 2008
>  31f4a0dc-9efc-48bf-9f5e-bfc09af42b82 | Lady Gaga |USA |  normal 
> |  Official |   The Cherrytree Sessions | 2009
>  8ebfaebd-28d0-477d-b735-469661ce6873 | Lady Gaga |Unknown |  normal 
> |  Official |Poker Face | 2009
>  98107d82-e0dd-46bc-a273-1577578984c7 | Lady Gaga |USA |  normal 
> |  Official |   Just Dance: The Remixes | 2008
>  a76af0f2-f5c5-4306-974a-e3c17158e6c6 | Lady Gaga |  Italy |  normal 
> |  Official |  The Fame | 2008
>  849ee019-8b15-4767-8660-537ab9710459 | Lady Gaga |USA |  normal 
> |  Official |Christmas Tree | 2008
>  4bad59ac-913f-43da-9d48-89adc65453d2 | Lady Gaga |  Australia |  normal 
> |  Official | Eh Eh | 2009
>  80327731-c450-457f-bc12-0a8c21fd9c5d | Lady Gaga |USA |  normal 
> |  Official | Just Dance Remixes Part 2 | 2008
>  3ad33659-e932-4d31-a040-acab0e23c3d4 | Lady Gaga |Unknown |  normal 
> |  null |Just Dance | 2008
>  9adce7f6-6a1d-49fd-b8bd-8f6fac73558b | Lady Gaga | United Kingdom |  normal 
> |  Official |Just Dance | 2009
> (11 rows)
> {noformat}
> *SASI* says that there are only 11 artists whose name starts with {{lady}}.
> However, in the data set, there are:
> * Lady Pank
> * Lady Saw
> * Lady Saw
> * Ladyhawke
> * Ladytron
> * Ladysmith Black Mambazo
> * Lady Gaga
> * Lady Sovereign
> etc ...
> By debugging the source code, the issue is in 
> {{OnDiskIndex.TermIterator::computeNext()}}
> {code:java}
> for (;;)
> {
> if (currentBlock == null)
> return endOfData();
> if (offset >= 0 && offset < currentBlock.termCount())
> {
> DataTerm currentTerm = currentBlock.getTerm(nextOffset());
> if (checkLower && !e.isLowerSatisfiedBy(currentTerm))
> continue;
> // flip the flag right on the first bounds match
> // to avoid expensive comparisons
> checkLower = false;
> if (checkUpper && !e.isUpperSatisfiedBy(currentTerm))
> return endOfData();
> return currentTerm;
> }
> nextBlock();
> }
> {code}
>  So the {{endOfData()}} conditions are:
> * currentBlock == null
> * checkUpper && !e.isUpperSatisfiedBy(currentTerm)
> The problem is that {{e::isUpperSatisfiedBy}} is checking not only whether 
> the term matches but also returns *false* when it's a *partial term*!
> {code:java}
> public boolean isUpperSatisfiedBy(OnDiskIndex.DataTerm term)
> {
> if (!hasUpper())
> return true;
> if (nonMatchingPartial(term))
> return false;
> int cmp = term.compareTo(validator, upper.value, false);
> return cmp < 0 || cmp == 0 && upper.inclusive;
> }
> {code}
> By debugging the OnDiskIndex data, I've found:
> {noformat}
> ...
> Data Term 

[jira] [Updated] (CASSANDRA-12073) [SASI] PREFIX search on CONTAINS/NonTokenizer mode returns only partial results

2016-06-26 Thread DOAN DuyHai (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-12073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DOAN DuyHai updated CASSANDRA-12073:

Fix Version/s: (was: 3.x)
   3.8
Reproduced In: 3.7, 3.8  (was: 3.7)
   Status: Patch Available  (was: Open)

> [SASI] PREFIX search on CONTAINS/NonTokenizer mode returns only partial 
> results
> ---
>
> Key: CASSANDRA-12073
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12073
> Project: Cassandra
>  Issue Type: Bug
>  Components: CQL
> Environment: Cassandra 3.7
>Reporter: DOAN DuyHai
>Assignee: DOAN DuyHai
> Fix For: 3.8
>
>
> {noformat}
> cqlsh:music> CREATE TABLE music.albums (
> id uuid PRIMARY KEY,
> artist text,
> country text,
> quality text,
> status text,
> title text,
> year int
> );
> cqlsh:music> CREATE CUSTOM INDEX albums_artist_idx ON music.albums (artist) 
> USING 'org.apache.cassandra.index.sasi.SASIIndex' WITH OPTIONS = {'mode': 
> 'CONTAINS', 'analyzer_class': 
> 'org.apache.cassandra.index.sasi.analyzer.NonTokenizingAnalyzer', 
> 'case_sensitive': 'false'};
> cqlsh:music> SELECT * FROM albums WHERE artist like 'lady%'  LIMIT 100;
>  id   | artist| country| quality 
> | status| title | year
> --+---++-+---+---+--
>  372bb0ab-3263-41bc-baad-bb520ddfa787 | Lady Gaga |USA |  normal 
> |  Official |   Red and Blue EP | 2006
>  1a4abbcd-b5de-4c69-a578-31231e01ff09 | Lady Gaga |Unknown |  normal 
> | Promotion |Poker Face | 2008
>  31f4a0dc-9efc-48bf-9f5e-bfc09af42b82 | Lady Gaga |USA |  normal 
> |  Official |   The Cherrytree Sessions | 2009
>  8ebfaebd-28d0-477d-b735-469661ce6873 | Lady Gaga |Unknown |  normal 
> |  Official |Poker Face | 2009
>  98107d82-e0dd-46bc-a273-1577578984c7 | Lady Gaga |USA |  normal 
> |  Official |   Just Dance: The Remixes | 2008
>  a76af0f2-f5c5-4306-974a-e3c17158e6c6 | Lady Gaga |  Italy |  normal 
> |  Official |  The Fame | 2008
>  849ee019-8b15-4767-8660-537ab9710459 | Lady Gaga |USA |  normal 
> |  Official |Christmas Tree | 2008
>  4bad59ac-913f-43da-9d48-89adc65453d2 | Lady Gaga |  Australia |  normal 
> |  Official | Eh Eh | 2009
>  80327731-c450-457f-bc12-0a8c21fd9c5d | Lady Gaga |USA |  normal 
> |  Official | Just Dance Remixes Part 2 | 2008
>  3ad33659-e932-4d31-a040-acab0e23c3d4 | Lady Gaga |Unknown |  normal 
> |  null |Just Dance | 2008
>  9adce7f6-6a1d-49fd-b8bd-8f6fac73558b | Lady Gaga | United Kingdom |  normal 
> |  Official |Just Dance | 2009
> (11 rows)
> {noformat}
> *SASI* says that there are only 11 artists whose name starts with {{lady}}.
> However, in the data set, there are:
> * Lady Pank
> * Lady Saw
> * Lady Saw
> * Ladyhawke
> * Ladytron
> * Ladysmith Black Mambazo
> * Lady Gaga
> * Lady Sovereign
> etc ...
> By debugging the source code, the issue is in 
> {{OnDiskIndex.TermIterator::computeNext()}}
> {code:java}
> for (;;)
> {
> if (currentBlock == null)
> return endOfData();
> if (offset >= 0 && offset < currentBlock.termCount())
> {
> DataTerm currentTerm = currentBlock.getTerm(nextOffset());
> if (checkLower && !e.isLowerSatisfiedBy(currentTerm))
> continue;
> // flip the flag right on the first bounds match
> // to avoid expensive comparisons
> checkLower = false;
> if (checkUpper && !e.isUpperSatisfiedBy(currentTerm))
> return endOfData();
> return currentTerm;
> }
> nextBlock();
> }
> {code}
>  So the {{endOfData()}} conditions are:
> * currentBlock == null
> * checkUpper && !e.isUpperSatisfiedBy(currentTerm)
> The problem is that {{e::isUpperSatisfiedBy}} is checking not only whether 
> the term matches but also returns *false* when it's a *partial term*!
> {code:java}
> public boolean isUpperSatisfiedBy(OnDiskIndex.DataTerm term)
> {
> if (!hasUpper())
> return true;
> if (nonMatchingPartial(term))
> return false;
> int cmp = term.compareTo(validator, upper.value, false);
> return cmp < 0 || cmp == 0 && upper.inclusive;
> }
> {code}
> By debugging the OnDiskIndex data, I've found:
> {noformat}
> ...
> Data Term 

[jira] [Comment Edited] (CASSANDRA-12073) [SASI] PREFIX search on CONTAINS/NonTokenizer mode returns only partial results

2016-06-26 Thread DOAN DuyHai (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15350113#comment-15350113
 ] 

DOAN DuyHai edited comment on CASSANDRA-12073 at 6/26/16 1:32 PM:
--

Ok I've dug further into the code and the bug has wider implications

1) Premature termination by calling {{endOfData()}} as mentioned above in 

{code:java}
if (checkUpper && !e.isUpperSatisfiedBy(currentTerm))
  return endOfData();
{code}

2) Incorrect next block skipping in {{OnDiskIndex.nextBlock()}}

{code:java}
protected void nextBlock()
{
currentBlock = null;

if (blockIndex < 0 || blockIndex >= dataLevel.blockCount)
return;

currentBlock = dataLevel.getBlock(nextBlockIndex());
offset = checkLower ? order.startAt(currentBlock, e) : 
currentBlock.minOffset(order);

// let's check the last term of the new block right away
// if expression's upper bound is satisfied by it such means that we can 
avoid
// doing any expensive upper bound checks for that block.
checkUpper = e.hasUpper() &&
!e.isUpperSatisfiedBy(currentBlock.getTerm(currentBlock.maxOffset(order)));
}
{code}

Above, if we are unlucky and the last term of next block is partial, the entire 
next block will be skipped incorrectly

Proposed change:

* remove the {{if (nonMatchingPartial(term))}} check from current 
{{e::isLowerSatisfiedBy}} and {{e::isUpperSatisfiedBy}}
* rename the method {{Expression::nonMatchingPartial}} to 
{{Expression::partialTermAndPrefixOperation}} which describes exactly which 
condition it checks, much clearer name from my pov. Also make the method access 
*public* 
* add an extra condition in {{OnDiskIndex::computeNext()}} to skip the current 
term and process the next one

{code:java}
if (offset >= 0 && offset < currentBlock.termCount())
{
DataTerm currentTerm = currentBlock.getTerm(nextOffset());

if (e.partialTermAndPrefixOperation(currentTerm) || 
(checkLower && !e.isLowerSatisfiedBy(currentTerm)))
continue;

// flip the flag right on the first bounds match
// to avoid expensive comparisons
checkLower = false;

if (checkUpper && !e.isUpperSatisfiedBy(currentTerm))
return endOfData();

return currentTerm;
}
{code}

I'll also add an extra unit-test to cover the premature {{endOfData()}} case. 
To test the incorrect next block skipping we'll need to generate enough data so 
that the last term of next block is partial. I'm not sure how feasible it is.

What do you think @xedin, [~jrwest] ?


was (Author: doanduyhai):
Ok I've dug further into the code and the bug has wider implication

1) Premature termination by calling {{endOfData()}} as mentioned above in 

{code:java}
if (checkUpper && !e.isUpperSatisfiedBy(currentTerm))
  return endOfData();
{code}

2) Incorrect next block skipping in {{OnDiskIndex.nextBlock()}}

{code:java}
protected void nextBlock()
{
currentBlock = null;

if (blockIndex < 0 || blockIndex >= dataLevel.blockCount)
return;

currentBlock = dataLevel.getBlock(nextBlockIndex());
offset = checkLower ? order.startAt(currentBlock, e) : 
currentBlock.minOffset(order);

// let's check the last term of the new block right away
// if expression's upper bound is satisfied by it such means that we can 
avoid
// doing any expensive upper bound checks for that block.
checkUpper = e.hasUpper() &&
!e.isUpperSatisfiedBy(currentBlock.getTerm(currentBlock.maxOffset(order)));
}
{code}

Above, if we are unlucky and the last term of next block is partial, the entire 
next block will be skipped 

Proposed change:

* remove the {{if (nonMatchingPartial(term))}} check from current 
{{e::isLowerSatisfiedBy}} and {{e::isUpperSatisfiedBy}}
* rename the method {{Expression::nonMatchingPartial}} to 
{{Expression::partialTermAndPrefixOperation}} which describes exactly which 
condition it checks, much clearer name from my pov. Also make the method access 
*public* 
* add an extra condition in {{OnDiskIndex::computeNext()}} to skip the current 
term and process the next one

{code:java}
if (offset >= 0 && offset < currentBlock.termCount())
{
DataTerm currentTerm = currentBlock.getTerm(nextOffset());

if (e.partialTermAndPrefixOperation(currentTerm) || 
(checkLower && !e.isLowerSatisfiedBy(currentTerm)))
continue;

// flip the flag right on the first bounds match
// to avoid expensive comparisons
checkLower = false;

if (checkUpper && !e.isUpperSatisfiedBy(currentTerm))
   

[jira] [Comment Edited] (CASSANDRA-12073) [SASI] PREFIX search on CONTAINS/NonTokenizer mode returns only partial results

2016-06-26 Thread DOAN DuyHai (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15350113#comment-15350113
 ] 

DOAN DuyHai edited comment on CASSANDRA-12073 at 6/26/16 1:33 PM:
--

Ok I've dug further into the code and the bug has wider implications

1) Premature termination by calling {{endOfData()}} as mentioned above in 

{code:java}
if (checkUpper && !e.isUpperSatisfiedBy(currentTerm))
  return endOfData();
{code}

2) Incorrect next block skipping in {{OnDiskIndex.nextBlock()}}

{code:java}
protected void nextBlock()
{
currentBlock = null;

if (blockIndex < 0 || blockIndex >= dataLevel.blockCount)
return;

currentBlock = dataLevel.getBlock(nextBlockIndex());
offset = checkLower ? order.startAt(currentBlock, e) : 
currentBlock.minOffset(order);

// let's check the last term of the new block right away
// if expression's upper bound is satisfied by it such means that we can 
avoid
// doing any expensive upper bound checks for that block.
checkUpper = e.hasUpper() &&
!e.isUpperSatisfiedBy(currentBlock.getTerm(currentBlock.maxOffset(order)));
}
{code}

Above, if we are unlucky and the last term of next block is partial, the entire 
next block will be skipped incorrectly

Proposed change:

* remove the {{if (nonMatchingPartial(term))}} check from current 
{{Expression::isLowerSatisfiedBy}} and {{Expression::isUpperSatisfiedBy}}
* rename the method {{Expression::nonMatchingPartial}} to 
{{Expression::partialTermAndPrefixOperation}} which describes exactly which 
condition it checks, much clearer name from my pov. Also make the method access 
*public* 
* add an extra condition in {{OnDiskIndex::computeNext()}} to skip the current 
term and process the next one

{code:java}
if (offset >= 0 && offset < currentBlock.termCount())
{
DataTerm currentTerm = currentBlock.getTerm(nextOffset());

if (e.partialTermAndPrefixOperation(currentTerm) || 
(checkLower && !e.isLowerSatisfiedBy(currentTerm)))
continue;

// flip the flag right on the first bounds match
// to avoid expensive comparisons
checkLower = false;

if (checkUpper && !e.isUpperSatisfiedBy(currentTerm))
return endOfData();

return currentTerm;
}
{code}

I'll also add an extra unit-test to cover the premature {{endOfData()}} case. 
To test the incorrect next block skipping we'll need to generate enough data so 
that the last term of next block is partial. I'm not sure how feasible it is.

What do you think @xedin, [~jrwest] ?


was (Author: doanduyhai):
Ok I've dug further into the code and the bug has wider implication

1) Premature termination by calling {{endOfData()}} as mentioned above in 

{code:java}
if (checkUpper && !e.isUpperSatisfiedBy(currentTerm))
  return endOfData();
{code}

2) Incorrect next block skipping in {{OnDiskIndex.nextBlock()}}

{code:java}
protected void nextBlock()
{
currentBlock = null;

if (blockIndex < 0 || blockIndex >= dataLevel.blockCount)
return;

currentBlock = dataLevel.getBlock(nextBlockIndex());
offset = checkLower ? order.startAt(currentBlock, e) : 
currentBlock.minOffset(order);

// let's check the last term of the new block right away
// if expression's upper bound is satisfied by it such means that we can 
avoid
// doing any expensive upper bound checks for that block.
checkUpper = e.hasUpper() &&
!e.isUpperSatisfiedBy(currentBlock.getTerm(currentBlock.maxOffset(order)));
}
{code}

Above, if we are unlucky and the last term of next block is partial, the entire 
next block will be skipped incorrectly

Proposed change:

* remove the {{if (nonMatchingPartial(term))}} check from current 
{{e::isLowerSatisfiedBy}} and {{e::isUpperSatisfiedBy}}
* rename the method {{Expression::nonMatchingPartial}} to 
{{Expression::partialTermAndPrefixOperation}} which describes exactly which 
condition it checks, much clearer name from my pov. Also make the method access 
*public* 
* add an extra condition in {{OnDiskIndex::computeNext()}} to skip the current 
term and process the next one

{code:java}
if (offset >= 0 && offset < currentBlock.termCount())
{
DataTerm currentTerm = currentBlock.getTerm(nextOffset());

if (e.partialTermAndPrefixOperation(currentTerm) || 
(checkLower && !e.isLowerSatisfiedBy(currentTerm)))
continue;

// flip the flag right on the first bounds match
// to avoid expensive comparisons
checkLower = false;

if (checkUpper && 

[jira] [Commented] (CASSANDRA-12073) [SASI] PREFIX search on CONTAINS/NonTokenizer mode returns only partial results

2016-06-26 Thread DOAN DuyHai (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15350113#comment-15350113
 ] 

DOAN DuyHai commented on CASSANDRA-12073:
-

Ok I've dug further into the code and the bug has wider implications

1) Premature termination by calling {{endOfData()}} as mentioned above in 

{code:java}
if (checkUpper && !e.isUpperSatisfiedBy(currentTerm))
  return endOfData();
{code}

2) Incorrect next block skipping in {{OnDiskIndex.nextBlock()}}

{code:java}
protected void nextBlock()
{
currentBlock = null;

if (blockIndex < 0 || blockIndex >= dataLevel.blockCount)
return;

currentBlock = dataLevel.getBlock(nextBlockIndex());
offset = checkLower ? order.startAt(currentBlock, e) : 
currentBlock.minOffset(order);

// let's check the last term of the new block right away
// if expression's upper bound is satisfied by it such means that we can 
avoid
// doing any expensive upper bound checks for that block.
checkUpper = e.hasUpper() &&
!e.isUpperSatisfiedBy(currentBlock.getTerm(currentBlock.maxOffset(order)));
}
{code}

Above, if we are unlucky and the last term of next block is partial, the entire 
next block will be skipped 

Proposed change:

* remove the {{if (nonMatchingPartial(term))}} check from current 
{{e::isLowerSatisfiedBy}} and {{e::isUpperSatisfiedBy}}
* rename the method {{Expression::nonMatchingPartial}} to 
{{Expression::partialTermAndPrefixOperation}} which describes exactly which 
condition it checks, much clearer name from my pov. Also make the method access 
*public* 
* add an extra condition in {{OnDiskIndex::computeNext()}} to skip the current 
term and process the next one

{code:java}
if (offset >= 0 && offset < currentBlock.termCount())
{
DataTerm currentTerm = currentBlock.getTerm(nextOffset());

if (e.partialTermAndPrefixOperation(currentTerm) || 
(checkLower && !e.isLowerSatisfiedBy(currentTerm)))
continue;

// flip the flag right on the first bounds match
// to avoid expensive comparisons
checkLower = false;

if (checkUpper && !e.isUpperSatisfiedBy(currentTerm))
return endOfData();

return currentTerm;
}
{code}

I'll also add an extra unit-test to cover the premature {{endOfData()}} case. 
To test the incorrect next block skipping we'll need to generate enough data so 
that the last term of next block is partial. I'm not sure how feasible it is.

What do you think @xedin, [~jrwest] ?

> [SASI] PREFIX search on CONTAINS/NonTokenizer mode returns only partial 
> results
> ---
>
> Key: CASSANDRA-12073
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12073
> Project: Cassandra
>  Issue Type: Bug
>  Components: CQL
> Environment: Cassandra 3.7
>Reporter: DOAN DuyHai
>Assignee: DOAN DuyHai
> Fix For: 3.x
>
>
> {noformat}
> cqlsh:music> CREATE TABLE music.albums (
> id uuid PRIMARY KEY,
> artist text,
> country text,
> quality text,
> status text,
> title text,
> year int
> );
> cqlsh:music> CREATE CUSTOM INDEX albums_artist_idx ON music.albums (artist) 
> USING 'org.apache.cassandra.index.sasi.SASIIndex' WITH OPTIONS = {'mode': 
> 'CONTAINS', 'analyzer_class': 
> 'org.apache.cassandra.index.sasi.analyzer.NonTokenizingAnalyzer', 
> 'case_sensitive': 'false'};
> cqlsh:music> SELECT * FROM albums WHERE artist like 'lady%'  LIMIT 100;
>  id   | artist| country| quality 
> | status| title | year
> --+---++-+---+---+--
>  372bb0ab-3263-41bc-baad-bb520ddfa787 | Lady Gaga |USA |  normal 
> |  Official |   Red and Blue EP | 2006
>  1a4abbcd-b5de-4c69-a578-31231e01ff09 | Lady Gaga |Unknown |  normal 
> | Promotion |Poker Face | 2008
>  31f4a0dc-9efc-48bf-9f5e-bfc09af42b82 | Lady Gaga |USA |  normal 
> |  Official |   The Cherrytree Sessions | 2009
>  8ebfaebd-28d0-477d-b735-469661ce6873 | Lady Gaga |Unknown |  normal 
> |  Official |Poker Face | 2009
>  98107d82-e0dd-46bc-a273-1577578984c7 | Lady Gaga |USA |  normal 
> |  Official |   Just Dance: The Remixes | 2008
>  a76af0f2-f5c5-4306-974a-e3c17158e6c6 | Lady Gaga |  Italy |  normal 
> |  Official |  The Fame | 2008
>  849ee019-8b15-4767-8660-537ab9710459 | Lady Gaga |USA |  normal 
> |  Official |Christmas Tree | 2008
> 

[jira] [Updated] (CASSANDRA-12078) [SASI] Move skip_stop_words filter BEFORE stemming

2016-06-26 Thread DOAN DuyHai (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-12078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DOAN DuyHai updated CASSANDRA-12078:

Attachment: patch_V2.txt

Attached is {{patch_V2.txt}}

> [SASI] Move skip_stop_words filter BEFORE stemming
> --
>
> Key: CASSANDRA-12078
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12078
> Project: Cassandra
>  Issue Type: Bug
>  Components: sasi
> Environment: Cassandra 3.7, Cassandra 3.8
>Reporter: DOAN DuyHai
>Assignee: DOAN DuyHai
> Fix For: 3.8
>
> Attachments: patch.txt, patch_V2.txt
>
>
> Right now, if skip stop words and stemming are enabled, SASI will put 
> stemming in the filter pipeline BEFORE skip_stop_words:
> {code:java}
> private FilterPipelineTask getFilterPipeline()
> {
> FilterPipelineBuilder builder = new FilterPipelineBuilder(new 
> BasicResultFilters.NoOperation());
>  ...
> if (options.shouldStemTerms())
> builder = builder.add("term_stemming", new 
> StemmingFilters.DefaultStemmingFilter(options.getLocale()));
> if (options.shouldIgnoreStopTerms())
> builder = builder.add("skip_stop_words", new 
> StopWordFilters.DefaultStopWordFilter(options.getLocale()));
> return builder.build();
> }
> {code}
> The problem is that stemming before removing stop words can yield wrong 
> results.
> I have an example:
> {code:sql}
> SELECT * FROM music.albums WHERE country='France' AND title LIKE 'danse' 
> ALLOW FILTERING;
> {code}
> Because of stemming *danse* ( *dance* in English) becomes *dans* (the final 
> vowel is removed). Then skip stop words is applied. Unfortunately *dans* 
> (*in* in English) is a stop word in French so it is removed completely.
> In the end the query is equivalent to {{SELECT * FROM music.albums WHERE 
> country='France'}} and of course the results are wrong.
> Attached is a trivial patch to move the skip_stop_words filter BEFORE 
> stemming filter
> /cc [~xedin] [~jrwest] [~beobal]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (CASSANDRA-12078) [SASI] Move skip_stop_words filter BEFORE stemming

2016-06-26 Thread DOAN DuyHai (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-12078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DOAN DuyHai updated CASSANDRA-12078:

Status: Patch Available  (was: Reopened)

> [SASI] Move skip_stop_words filter BEFORE stemming
> --
>
> Key: CASSANDRA-12078
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12078
> Project: Cassandra
>  Issue Type: Bug
>  Components: sasi
> Environment: Cassandra 3.7, Cassandra 3.8
>Reporter: DOAN DuyHai
>Assignee: DOAN DuyHai
> Fix For: 3.8
>
> Attachments: patch.txt
>
>
> Right now, if skip stop words and stemming are enabled, SASI will put 
> stemming in the filter pipeline BEFORE skip_stop_words:
> {code:java}
> private FilterPipelineTask getFilterPipeline()
> {
> FilterPipelineBuilder builder = new FilterPipelineBuilder(new 
> BasicResultFilters.NoOperation());
>  ...
> if (options.shouldStemTerms())
> builder = builder.add("term_stemming", new 
> StemmingFilters.DefaultStemmingFilter(options.getLocale()));
> if (options.shouldIgnoreStopTerms())
> builder = builder.add("skip_stop_words", new 
> StopWordFilters.DefaultStopWordFilter(options.getLocale()));
> return builder.build();
> }
> {code}
> The problem is that stemming before removing stop words can yield wrong 
> results.
> I have an example:
> {code:sql}
> SELECT * FROM music.albums WHERE country='France' AND title LIKE 'danse' 
> ALLOW FILTERING;
> {code}
> Because of stemming *danse* ( *dance* in English) becomes *dans* (the final 
> vowel is removed). Then skip stop words is applied. Unfortunately *dans* 
> (*in* in English) is a stop word in French so it is removed completely.
> In the end the query is equivalent to {{SELECT * FROM music.albums WHERE 
> country='France'}} and of course the results are wrong.
> Attached is a trivial patch to move the skip_stop_words filter BEFORE 
> stemming filter
> /cc [~xedin] [~jrwest] [~beobal]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-12078) [SASI] Move skip_stop_words filter BEFORE stemming

2016-06-26 Thread DOAN DuyHai (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15350028#comment-15350028
 ] 

DOAN DuyHai commented on CASSANDRA-12078:
-

[~xedin]

 I have been able to reproduce the unit test failure locally. The error comes 
from the test {{testTokenizationAdventuresOfHuckFinn}}. After moving skip stop 
words before stemming, the expected token count is *37739* and not *40249*.

 There is also a {{NullPointerException}} when skip stop words runs before 
stemming. Indeed, in some cases the token is removed by the stop words filter, so the 
input of the stemming filter is null. I've added an extra null check in 
{{DefaultStemmingFilter}}:

{code:java}
public String process(String input) throws Exception
{
    if (input == null || stemmer == null)
        return input;
    stemmer.setCurrent(input);
    return (stemmer.stem()) ? stemmer.getCurrent() : input;
}
{code}
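
For reference, this is roughly what the reordered pipeline looks like once the two filters are swapped (a sketch based on the {{getFilterPipeline()}} shown in the description, not the attached patch verbatim):

{code:java}
private FilterPipelineTask getFilterPipeline()
{
    FilterPipelineBuilder builder = new FilterPipelineBuilder(new BasicResultFilters.NoOperation());
    // ...
    // Stop words are removed BEFORE stemming, so a term like "danse" is never
    // stemmed into the French stop word "dans" and silently dropped.
    if (options.shouldIgnoreStopTerms())
        builder = builder.add("skip_stop_words", new StopWordFilters.DefaultStopWordFilter(options.getLocale()));
    if (options.shouldStemTerms())
        builder = builder.add("term_stemming", new StemmingFilters.DefaultStemmingFilter(options.getLocale()));
    return builder.build();
}
{code}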

 I have also added a new unit test in {{StandardAnalyzerTest}} to cover the French 
issue mentioned above:

{code:java}
@Test
public void testSkipStopWordBeforeStemmingFrench() throws Exception
{
    InputStream is = StandardAnalyzerTest.class.getClassLoader()
                     .getResourceAsStream("tokenization/french_skip_stop_words_before_stemming.txt");

    StandardTokenizerOptions options = new StandardTokenizerOptions.OptionsBuilder()
                                       .stemTerms(true)
                                       .ignoreStopTerms(true)
                                       .useLocale(Locale.FRENCH)
                                       .alwaysLowerCaseTerms(true)
                                       .build();
    StandardAnalyzer tokenizer = new StandardAnalyzer();
    tokenizer.init(options);

    List<ByteBuffer> tokens = new ArrayList<>();
    List<String> words = new ArrayList<>();
    tokenizer.reset(is);
    while (tokenizer.hasNext())
    {
        final ByteBuffer nextToken = tokenizer.next();
        tokens.add(nextToken);
        words.add(UTF8Serializer.instance.deserialize(nextToken.duplicate()));
    }

    assertEquals(4, tokens.size());
    assertEquals("dans", words.get(0));
    assertEquals("plui", words.get(1));
    assertEquals("chanson", words.get(2));
    assertEquals("connu", words.get(3));
}
{code}

> [SASI] Move skip_stop_words filter BEFORE stemming
> --
>
> Key: CASSANDRA-12078
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12078
> Project: Cassandra
>  Issue Type: Bug
>  Components: sasi
> Environment: Cassandra 3.7, Cassandra 3.8
>Reporter: DOAN DuyHai
>Assignee: DOAN DuyHai
> Fix For: 3.8
>
> Attachments: patch.txt
>
>
> Right now, if skip stop words and stemming are enabled, SASI will put 
> stemming in the filter pipeline BEFORE skip_stop_words:
> {code:java}
> private FilterPipelineTask getFilterPipeline()
> {
> FilterPipelineBuilder builder = new FilterPipelineBuilder(new 
> BasicResultFilters.NoOperation());
>  ...
> if (options.shouldStemTerms())
> builder = builder.add("term_stemming", new 
> StemmingFilters.DefaultStemmingFilter(options.getLocale()));
> if (options.shouldIgnoreStopTerms())
> builder = builder.add("skip_stop_words", new 
> StopWordFilters.DefaultStopWordFilter(options.getLocale()));
> return builder.build();
> }
> {code}
> The problem is that stemming before removing stop words can yield wrong 
> results.
> I have an example:
> {code:sql}
> SELECT * FROM music.albums WHERE country='France' AND title LIKE 'danse' 
> ALLOW FILTERING;
> {code}
> Because of stemming *danse* ( *dance* in English) becomes *dans* (the final 
> vowel is removed). Then skip stop words is applied. Unfortunately *dans* 
> (*in* in English) is a stop word in French so it is removed completely.
> In the end the query is equivalent to {{SELECT * FROM music.albums WHERE 
> country='France'}} and of course the results are wrong.
> Attached is a trivial patch to move the skip_stop_words filter BEFORE 
> stemming filter
> /cc [~xedin] [~jrwest] [~beobal]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-12078) [SASI] Move skip_stop_words filter BEFORE stemming

2016-06-24 Thread DOAN DuyHai (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15348240#comment-15348240
 ] 

DOAN DuyHai commented on CASSANDRA-12078:
-

I'll work on it this weekend and propose a proper patch with updated unit tests.

> [SASI] Move skip_stop_words filter BEFORE stemming
> --
>
> Key: CASSANDRA-12078
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12078
> Project: Cassandra
>  Issue Type: Bug
>  Components: sasi
> Environment: Cassandra 3.7, Cassandra 3.8
>Reporter: DOAN DuyHai
>Assignee: DOAN DuyHai
> Fix For: 3.8
>
> Attachments: patch.txt
>
>
> Right now, if skip stop words and stemming are enabled, SASI will put 
> stemming in the filter pipeline BEFORE skip_stop_words:
> {code:java}
> private FilterPipelineTask getFilterPipeline()
> {
> FilterPipelineBuilder builder = new FilterPipelineBuilder(new 
> BasicResultFilters.NoOperation());
>  ...
> if (options.shouldStemTerms())
> builder = builder.add("term_stemming", new 
> StemmingFilters.DefaultStemmingFilter(options.getLocale()));
> if (options.shouldIgnoreStopTerms())
> builder = builder.add("skip_stop_words", new 
> StopWordFilters.DefaultStopWordFilter(options.getLocale()));
> return builder.build();
> }
> {code}
> The problem is that stemming before removing stop words can yield wrong 
> results.
> I have an example:
> {code:sql}
> SELECT * FROM music.albums WHERE country='France' AND title LIKE 'danse' 
> ALLOW FILTERING;
> {code}
> Because of stemming *danse* ( *dance* in English) becomes *dans* (the final 
> vowel is removed). Then skip stop words is applied. Unfortunately *dans* 
> (*in* in English) is a stop word in French so it is removed completely.
> In the end the query is equivalent to {{SELECT * FROM music.albums WHERE 
> country='France'}} and of course the results are wrong.
> Attached is a trivial patch to move the skip_stop_words filter BEFORE 
> stemming filter
> /cc [~xedin] [~jrwest] [~beobal]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (CASSANDRA-12078) [SASI] Move skip_stop_words filter BEFORE stemming

2016-06-23 Thread DOAN DuyHai (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-12078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DOAN DuyHai updated CASSANDRA-12078:

Status: Patch Available  (was: Open)

> [SASI] Move skip_stop_words filter BEFORE stemming
> --
>
> Key: CASSANDRA-12078
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12078
> Project: Cassandra
>  Issue Type: Improvement
>  Components: CQL
> Environment: Cassandra 3.7, Cassandra 3.8
>Reporter: DOAN DuyHai
>Assignee: DOAN DuyHai
> Attachments: patch.txt
>
>
> Right now, if skip stop words and stemming are enabled, SASI will put 
> stemming in the filter pipeline BEFORE skip_stop_words:
> {code:java}
> private FilterPipelineTask getFilterPipeline()
> {
> FilterPipelineBuilder builder = new FilterPipelineBuilder(new 
> BasicResultFilters.NoOperation());
>  ...
> if (options.shouldStemTerms())
> builder = builder.add("term_stemming", new 
> StemmingFilters.DefaultStemmingFilter(options.getLocale()));
> if (options.shouldIgnoreStopTerms())
> builder = builder.add("skip_stop_words", new 
> StopWordFilters.DefaultStopWordFilter(options.getLocale()));
> return builder.build();
> }
> {code}
> The problem is that stemming before removing stop words can yield wrong 
> results.
> I have an example:
> {code:sql}
> SELECT * FROM music.albums WHERE country='France' AND title LIKE 'danse' 
> ALLOW FILTERING;
> {code}
> Because of stemming *danse* ( *dance* in English) becomes *dans* (the final 
> vowel is removed). Then skip stop words is applied. Unfortunately *dans* 
> (*in* in English) is a stop word in French so it is removed completely.
> In the end the query is equivalent to {{SELECT * FROM music.albums WHERE 
> country='France'}} and of course the results are wrong.
> Attached is a trivial patch to move the skip_stop_words filter BEFORE 
> stemming filter
> /cc [~xedin] [~jrwest] [~beobal]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (CASSANDRA-12078) [SASI] Move skip_stop_words filter BEFORE stemming

2016-06-23 Thread DOAN DuyHai (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-12078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DOAN DuyHai updated CASSANDRA-12078:

Description: 
Right now, if skip stop words and stemming are enabled, SASI will put stemming 
in the filter pipeline BEFORE skip_stop_words:

{code:java}

private FilterPipelineTask getFilterPipeline()
{
FilterPipelineBuilder builder = new FilterPipelineBuilder(new 
BasicResultFilters.NoOperation());
 ...
if (options.shouldStemTerms())
builder = builder.add("term_stemming", new 
StemmingFilters.DefaultStemmingFilter(options.getLocale()));
if (options.shouldIgnoreStopTerms())
builder = builder.add("skip_stop_words", new 
StopWordFilters.DefaultStopWordFilter(options.getLocale()));
return builder.build();
}
{code}

The problem is that stemming before removing stop words can yield wrong results.

I have an example:

{code:sql}
SELECT * FROM music.albums WHERE country='France' AND title LIKE 'danse' ALLOW 
FILTERING;
{code}

Because of stemming *danse* ( *dance* in English) becomes *dans* (the final 
vowel is removed). Then skip stop words is applied. Unfortunately *dans* (*in* 
in English) is a stop word in French so it is removed completely.

In the end the query is equivalent to {{SELECT * FROM music.albums WHERE 
country='France'}} and of course the results are wrong.

Attached is a trivial patch to move the skip_stop_words filter BEFORE stemming 
filter

/cc [~xedin] [~jrwest] [~beobal]

  was:
Right now, if skip stop words and stemming are enabled, SASI will put stemming 
in the filter pipeline BEFORE skip_stop_words:

{code:java}

private FilterPipelineTask getFilterPipeline()
{
FilterPipelineBuilder builder = new FilterPipelineBuilder(new 
BasicResultFilters.NoOperation());
 ...
if (options.shouldStemTerms())
builder = builder.add("term_stemming", new 
StemmingFilters.DefaultStemmingFilter(options.getLocale()));
if (options.shouldIgnoreStopTerms())
builder = builder.add("skip_stop_words", new 
StopWordFilters.DefaultStopWordFilter(options.getLocale()));
return builder.build();
}
{code}

The problem is that stemming before removing stop words can yield wrong results.

I have an example:

{code:sql}
SELECT * FROM music.albums WHERE country='France' AND title LIKE 'danse' ALLOW 
FILTERING;
{code}

*danse* = *dance* in English, and because of stemming, it becomes *dans* (the 
final vowel is removed). Then skip stop words is applied. Unfortunately *dans* 
= *in* in English, a stop word in French so it is removed completely.

In the end the query is equivalent to {{SELECT * FROM music.albums WHERE 
country='France'}} and of course the results are wrong.

Attached is a trivial patch to move the skip_stop_words filter BEFORE stemming 
filter


> [SASI] Move skip_stop_words filter BEFORE stemming
> --
>
> Key: CASSANDRA-12078
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12078
> Project: Cassandra
>  Issue Type: Improvement
>  Components: CQL
> Environment: Cassandra 3.7, Cassandra 3.8
>Reporter: DOAN DuyHai
>Assignee: DOAN DuyHai
> Attachments: patch.txt
>
>
> Right now, if skip stop words and stemming are enabled, SASI will put 
> stemming in the filter pipeline BEFORE skip_stop_words:
> {code:java}
> private FilterPipelineTask getFilterPipeline()
> {
> FilterPipelineBuilder builder = new FilterPipelineBuilder(new 
> BasicResultFilters.NoOperation());
>  ...
> if (options.shouldStemTerms())
> builder = builder.add("term_stemming", new 
> StemmingFilters.DefaultStemmingFilter(options.getLocale()));
> if (options.shouldIgnoreStopTerms())
> builder = builder.add("skip_stop_words", new 
> StopWordFilters.DefaultStopWordFilter(options.getLocale()));
> return builder.build();
> }
> {code}
> The problem is that stemming before removing stop words can yield wrong 
> results.
> I have an example:
> {code:sql}
> SELECT * FROM music.albums WHERE country='France' AND title LIKE 'danse' 
> ALLOW FILTERING;
> {code}
> Because of stemming *danse* ( *dance* in English) becomes *dans* (the final 
> vowel is removed). Then skip stop words is applied. Unfortunately *dans* 
> (*in* in English) is a stop word in French so it is removed completely.
> In the end the query is equivalent to {{SELECT * FROM music.albums WHERE 
> country='France'}} and of course the results are wrong.
> Attached is a trivial patch to move the skip_stop_words filter BEFORE 
> stemming filter
> /cc [~xedin] [~jrwest] [~beobal]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (CASSANDRA-12078) [SASI] Move skip_stop_words filter BEFORE stemming

2016-06-23 Thread DOAN DuyHai (JIRA)
DOAN DuyHai created CASSANDRA-12078:
---

 Summary: [SASI] Move skip_stop_words filter BEFORE stemming
 Key: CASSANDRA-12078
 URL: https://issues.apache.org/jira/browse/CASSANDRA-12078
 Project: Cassandra
  Issue Type: Improvement
  Components: CQL
 Environment: Cassandra 3.7, Cassandra 3.8
Reporter: DOAN DuyHai
Assignee: DOAN DuyHai
 Attachments: patch.txt

Right now, if skip stop words and stemming are enabled, SASI will put stemming 
in the filter pipeline BEFORE skip_stop_words:

{code:java}

private FilterPipelineTask getFilterPipeline()
{
FilterPipelineBuilder builder = new FilterPipelineBuilder(new 
BasicResultFilters.NoOperation());
 ...
if (options.shouldStemTerms())
builder = builder.add("term_stemming", new 
StemmingFilters.DefaultStemmingFilter(options.getLocale()));
if (options.shouldIgnoreStopTerms())
builder = builder.add("skip_stop_words", new 
StopWordFilters.DefaultStopWordFilter(options.getLocale()));
return builder.build();
}
{code}

The problem is that stemming before removing stop words can yield wrong results.

I have an example:

{code:sql}
SELECT * FROM music.albums WHERE country='France' AND title LIKE 'danse' ALLOW 
FILTERING;
{code}

*danse* = *dance* in English, and because of stemming, it becomes *dans* (the 
final vowel is removed). Then skip stop words is applied. Unfortunately *dans* 
= *in* in English, a stop word in French so it is removed completely.

In the end the query is equivalent to {{SELECT * FROM music.albums WHERE 
country='France'}} and of course the results are wrong.

Attached is a trivial patch to move the skip_stop_words filter BEFORE stemming 
filter



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-10661) Integrate SASI to Cassandra

2016-06-23 Thread DOAN DuyHai (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15345995#comment-15345995
 ] 

DOAN DuyHai commented on CASSANDRA-10661:
-

[~dbrosius] Can you expand further on why 
{{someMapEntry.equals(someAbstractTrie)}} will always be false?

According to the contract of {{Map.Entry::equals}}, as long as the key and 
value are equal, the equality holds.

I've tried a unit test and it works:

{code:java}
public class AbstractTrieTest
{
    @Test
    public void should_test_equality() throws Exception
    {
        Map<String, Long> map = new HashMap<>();
        map.put("10", 10L);

        final Map.Entry<String, Long> mapEntry = map.entrySet().iterator().next();

        final AbstractTrie.BasicEntry<String, Long> trieEntry = new AbstractPatriciaTrie.TrieEntry<>("10", 10L, 0);

        Assert.assertTrue("mapEntry.equals(trieEntry)", mapEntry.equals(trieEntry));
    }
}
{code}

{noformat}
% ant testsome 
-Dtest.name=org.apache.cassandra.index.sasi.utils.AbstractTrieTest 
-Dtest.methods=should_test_equality   
...
testsome:
[junit] WARNING: multiple versions of ant detected in path for junit
[junit]  
jar:file:/usr/local/Cellar/ant/1.9.4/libexec/lib/ant.jar!/org/apache/tools/ant/Project.class
[junit]  and 
jar:file:/Users/archinnovinfo/perso/cassandra/build/lib/jars/ant-1.9.4.jar!/org/apache/tools/ant/Project.class
[junit] Testsuite: org.apache.cassandra.index.sasi.utils.AbstractTrieTest
[junit] Testsuite: org.apache.cassandra.index.sasi.utils.AbstractTrieTest 
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.058 sec
{noformat}

> Integrate SASI to Cassandra
> ---
>
> Key: CASSANDRA-10661
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10661
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Local Write-Read Paths
>Reporter: Pavel Yaskevich
>Assignee: Pavel Yaskevich
>  Labels: sasi
> Fix For: 3.4
>
>
> We have recently released new secondary index engine 
> (https://github.com/xedin/sasi) build using SecondaryIndex API, there are 
> still couple of things to work out regarding 3.x since it's currently 
> targeted on 2.0 released. I want to make this an umbrella issue to all of the 
> things related to integration of SASI, which are also tracked in 
> [sasi_issues|https://github.com/xedin/sasi/issues], into mainline Cassandra 
> 3.x release.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (CASSANDRA-12073) [SASI] PREFIX search on CONTAINS/NonTokenizer mode returns only partial results

2016-06-22 Thread DOAN DuyHai (JIRA)
DOAN DuyHai created CASSANDRA-12073:
---

 Summary: [SASI] PREFIX search on CONTAINS/NonTokenizer mode 
returns only partial results
 Key: CASSANDRA-12073
 URL: https://issues.apache.org/jira/browse/CASSANDRA-12073
 Project: Cassandra
  Issue Type: Bug
  Components: CQL
 Environment: Cassandra 3.7
Reporter: DOAN DuyHai


{noformat}
cqlsh:music> CREATE TABLE music.albums (
id uuid PRIMARY KEY,
artist text,
country text,
quality text,
status text,
title text,
year int
);

cqlsh:music> CREATE CUSTOM INDEX albums_artist_idx ON music.albums (artist) 
USING 'org.apache.cassandra.index.sasi.SASIIndex' WITH OPTIONS = {'mode': 
'CONTAINS', 'analyzer_class': 
'org.apache.cassandra.index.sasi.analyzer.NonTokenizingAnalyzer', 
'case_sensitive': 'false'};

cqlsh:music> SELECT * FROM albums WHERE artist like 'lady%'  LIMIT 100;

 id                                   | artist    | country        | quality | status    | title                     | year
--------------------------------------+-----------+----------------+---------+-----------+---------------------------+------
 372bb0ab-3263-41bc-baad-bb520ddfa787 | Lady Gaga | USA            | normal  | Official  | Red and Blue EP           | 2006
 1a4abbcd-b5de-4c69-a578-31231e01ff09 | Lady Gaga | Unknown        | normal  | Promotion | Poker Face                | 2008
 31f4a0dc-9efc-48bf-9f5e-bfc09af42b82 | Lady Gaga | USA            | normal  | Official  | The Cherrytree Sessions   | 2009
 8ebfaebd-28d0-477d-b735-469661ce6873 | Lady Gaga | Unknown        | normal  | Official  | Poker Face                | 2009
 98107d82-e0dd-46bc-a273-1577578984c7 | Lady Gaga | USA            | normal  | Official  | Just Dance: The Remixes   | 2008
 a76af0f2-f5c5-4306-974a-e3c17158e6c6 | Lady Gaga | Italy          | normal  | Official  | The Fame                  | 2008
 849ee019-8b15-4767-8660-537ab9710459 | Lady Gaga | USA            | normal  | Official  | Christmas Tree            | 2008
 4bad59ac-913f-43da-9d48-89adc65453d2 | Lady Gaga | Australia      | normal  | Official  | Eh Eh                     | 2009
 80327731-c450-457f-bc12-0a8c21fd9c5d | Lady Gaga | USA            | normal  | Official  | Just Dance Remixes Part 2 | 2008
 3ad33659-e932-4d31-a040-acab0e23c3d4 | Lady Gaga | Unknown        | normal  | null      | Just Dance                | 2008
 9adce7f6-6a1d-49fd-b8bd-8f6fac73558b | Lady Gaga | United Kingdom | normal  | Official  | Just Dance                | 2009

(11 rows)
{noformat}

*SASI* returns only these 11 rows (all *Lady Gaga*) for artists whose name starts with {{lady}}.

However, in the data set, there are:

* Lady Pank
* Lady Saw
* Lady Saw
* Ladyhawke
* Ladytron
* Ladysmith Black Mambazo
* Lady Gaga
* Lady Sovereign

etc ...

By debugging the source code, the issue is in 
{{OnDiskIndex.TermIterator::computeNext()}}

{code:java}
for (;;)
{
    if (currentBlock == null)
        return endOfData();

    if (offset >= 0 && offset < currentBlock.termCount())
    {
        DataTerm currentTerm = currentBlock.getTerm(nextOffset());
        if (checkLower && !e.isLowerSatisfiedBy(currentTerm))
            continue;

        // flip the flag right on the first bounds match
        // to avoid expensive comparisons
        checkLower = false;

        if (checkUpper && !e.isUpperSatisfiedBy(currentTerm))
            return endOfData();
        return currentTerm;
    }
    nextBlock();
}
{code}

 So the {{endOfData()}} conditions are:

* currentBlock == null
* checkUpper && !e.isUpperSatisfiedBy(currentTerm)

The problem is that {{e::isUpperSatisfiedBy}} does not only check whether the term 
matches: it also returns *false* when the term is a *partial term*!

{code:java}
public boolean isUpperSatisfiedBy(OnDiskIndex.DataTerm term)
{
    if (!hasUpper())
        return true;

    if (nonMatchingPartial(term))
        return false;

    int cmp = term.compareTo(validator, upper.value, false);
    return cmp < 0 || cmp == 0 && upper.inclusive;
}
{code}

By debugging the OnDiskIndex data, I've found:

{noformat}
...
Data Term (partial ? false) : lady gaga. 0x0, TokenTree offset : 21120
Data Term (partial ? true) : lady of bells. 0x0, TokenTree offset : 21360
Data Term (partial ? false) : lady pank. 0x0, TokenTree offset : 21440
...
{noformat}

*lady gaga* did match, but then *lady of bells* fails because it is a partial 
match. SASI then returns {{endOfData()}} instead of continuing on to check *lady pank*.

One quick & dirty fix I've found is to add 
{{!e.nonMatchingPartial(currentTerm)}} to the *if* condition:

{code:java}
 if (checkUpper && !e.isUpperSatisfiedBy(currentTerm) && !e.nonMatchingPartial(currentTerm))
     return endOfData();
{code}

[jira] [Commented] (CASSANDRA-12015) Rebuilding from another DC should use different sources

2016-06-16 Thread DOAN DuyHai (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15334609#comment-15334609
 ] 

DOAN DuyHai commented on CASSANDRA-12015:
-

+1 on the paired replica approach used for MVs.

 However, beware of different RFs in different DCs. You may have RF=3 in the source 
DC and RF=5 in the target DC: what will be the paired replica of the 4th replica in 
the target DC? Maybe use some modulo function (sketched below). The same kind of 
issue arises whenever the target DC RF differs from the source DC RF.
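
For illustration only, a minimal sketch of the modulo idea (hypothetical helper, not Cassandra API):

{code:java}
import java.net.InetAddress;
import java.util.List;

final class PairedReplicaSketch
{
    // Pair the i-th replica of the target DC with a source-DC replica, wrapping
    // around with modulo when the target DC has more replicas than the source DC
    // (e.g. the 4th target replica maps back to the 1st source replica for RF 5 vs 3).
    static InetAddress pairedSource(List<InetAddress> sourceReplicas, int targetReplicaIndex)
    {
        return sourceReplicas.get(targetReplicaIndex % sourceReplicas.size());
    }
}
{code}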

> Rebuilding from another DC should use different sources
> ---
>
> Key: CASSANDRA-12015
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12015
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Fabien Rousseau
>
> Currently, when adding a new DC (ex: DC2) and rebuilding it from an existing 
> DC (ex: DC1), only the closest replica is used as a "source of data".
> It works but is not optimal, because in case of an RF=3 and 3 nodes cluster, 
> only one node in DC1 is streaming the data to DC2. 
> To build the new DC in a reasonable time, it would be better, in that case, 
> to stream from multiple sources, thus distributing more evenly the load.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-12015) Rebuilding from another DC should use different sources

2016-06-16 Thread DOAN DuyHai (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15334412#comment-15334412
 ] 

DOAN DuyHai commented on CASSANDRA-12015:
-

I have checked the Git history and it seems that the 
{{snitch.getSortedListByProximity(address, rangeAddresses.get(range))}} line 
has been there since 2012. It looks like nobody has noticed that DynamicSnitch 
can create this kind of hotspot since then, so yes, we may update this.

However, it does not help the original concern of this JIRA, which is to have 
some sort of randomization/round-robin selection of the source replica to 
stream data from.

If we replace the dynamic snitch by {{AbstractEndpointSnitch.sortByProximity}}, 
it will also always pick the *same* replica for a given token range, whereas 
the idea is to pick randomly or to round-robin over the replicas. Round-robin 
also means that we need to keep *state* to know which replica was selected 
previously for a given token range so that we can move on to the next one, and 
I'm not sure whether keeping *state* is desirable or technically feasible.
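
To make the alternative concrete, here is a purely illustrative, stateless variant (hypothetical helper, not Cassandra code): instead of keeping per-range state for round-robin, the candidates for each range could simply be shuffled before picking one.

{code:java}
import java.net.InetAddress;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class SourceSelectionSketch
{
    // Pick a streaming source for a token range without keeping state:
    // shuffle the candidate replicas instead of always taking the snitch's closest one.
    static InetAddress pickSource(List<InetAddress> candidatesForRange)
    {
        List<InetAddress> shuffled = new ArrayList<>(candidatesForRange);
        Collections.shuffle(shuffled);
        return shuffled.get(0);
    }
}
{code}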


> Rebuilding from another DC should use different sources
> ---
>
> Key: CASSANDRA-12015
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12015
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Fabien Rousseau
>
> Currently, when adding a new DC (ex: DC2) and rebuilding it from an existing 
> DC (ex: DC1), only the closest replica is used as a "source of data".
> It works but is not optimal, because in case of an RF=3 and 3 nodes cluster, 
> only one node in DC1 is streaming the data to DC2. 
> To build the new DC in a reasonable time, it would be better, in that case, 
> to stream from multiple sources, thus distributing more evenly the load.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-12015) Rebuilding from another DC should use different sources

2016-06-16 Thread DOAN DuyHai (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15334081#comment-15334081
 ] 

DOAN DuyHai commented on CASSANDRA-12015:
-

[~pauloricardomg] The problem is that this call to 
{{snitch.getSortedListByProximity(address, rangeAddresses.get(range))}} is 
inside the {{RangeStreamer}} class, which is also used for bootstrap (and maybe 
for other operations too; I did not do a comprehensive check).

> Rebuilding from another DC should use different sources
> ---
>
> Key: CASSANDRA-12015
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12015
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Fabien Rousseau
>
> Currently, when adding a new DC (ex: DC2) and rebuilding it from an existing 
> DC (ex: DC1), only the closest replica is used as a "source of data".
> It works but is not optimal, because in case of an RF=3 and 3 nodes cluster, 
> only one node in DC1 is streaming the data to DC2. 
> To build the new DC in a reasonable time, it would be better, in that case, 
> to stream from multiple sources, thus distributing more evenly the load.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-12015) Rebuilding from another DC should use different sources

2016-06-16 Thread DOAN DuyHai (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15333698#comment-15333698
 ] 

DOAN DuyHai commented on CASSANDRA-12015:
-

Here is the important code path:

1) org.apache.cassandra.tools.nodetool.Rebuild::execute()
2) StorageService::rebuild(String sourceDc, String keyspace, String tokens)
3) RangeStreamer::getAllRangesWithSourcesFor(String keyspaceName, Collection<Range<Token>> desiredRanges)

Inside the last method, we call the snitch to sort the replicas:

  List<InetAddress> preferred = snitch.getSortedListByProximity(address, rangeAddresses.get(range));

If you're rebuilding the nodes of a new DC with the "nodetool rebuild" command in 
quick succession, it may happen that one replica has better latency than the 
others, so it will be picked up by the DynamicSnitch each time.



> Rebuilding from another DC should use different sources
> ---
>
> Key: CASSANDRA-12015
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12015
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Fabien Rousseau
>
> Currently, when adding a new DC (ex: DC2) and rebuilding it from an existing 
> DC (ex: DC1), only the closest replica is used as a "source of data".
> It works but is not optimal, because in case of an RF=3 and 3 nodes cluster, 
> only one node in DC1 is streaming the data to DC2. 
> To build the new DC in a reasonable time, it would be better, in that case, 
> to stream from multiple sources, thus distributing more evenly the load.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-11734) Enable partition component index for SASI

2016-05-10 Thread DOAN DuyHai (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15278394#comment-15278394
 ] 

DOAN DuyHai commented on CASSANDRA-11734:
-

bq. {{QueryController#hasIndexFor(ColumnDefinition)}} on every run of the 
results checking logic is very inefficient

Agree

So we'll postpone this JIRA until [CASSANDRA-10765] is done then

> Enable partition component index for SASI
> -
>
> Key: CASSANDRA-11734
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11734
> Project: Cassandra
>  Issue Type: Improvement
>  Components: CQL
>Reporter: DOAN DuyHai
>Assignee: DOAN DuyHai
>  Labels: doc-impacting, sasi, secondaryIndex
> Fix For: 3.8
>
> Attachments: patch.txt
>
>
> Enable partition component index for SASI
> For the given schema:
> {code:sql}
> CREATE TABLE test.comp (
> pk1 int,
> pk2 text,
> val text,
> PRIMARY KEY ((pk1, pk2))
> );
> CREATE CUSTOM INDEX comp_val_idx ON test.comp (val) USING 
> 'org.apache.cassandra.index.sasi.SASIIndex';
> CREATE CUSTOM INDEX comp_pk2_idx ON test.comp (pk2) USING 
> 'org.apache.cassandra.index.sasi.SASIIndex' WITH OPTIONS = {'mode': 'PREFIX', 
> 'analyzer_class': 
> 'org.apache.cassandra.index.sasi.analyzer.NonTokenizingAnalyzer', 
> 'case_sensitive': 'false'};
> CREATE CUSTOM INDEX comp_pk1_idx ON test.comp (pk1) USING 
> 'org.apache.cassandra.index.sasi.SASIIndex';
> {code}
> The following queries are possible:
> {code:sql}
> SELECT * FROM test.comp WHERE pk1=1;
> SELECT * FROM test.comp WHERE pk1>=1 AND pk1<=5;
> SELECT * FROM test.comp WHERE pk1=1 AND val='xxx' ALLOW FILTERING;
> SELECT * FROM test.comp WHERE pk1>=1 AND pk1<=5 AND val='xxx' ALLOW FILTERING;
> SELECT * FROM test.comp WHERE pk2='some text';
> SELECT * FROM test.comp WHERE pk2 LIKE 'prefix%';
> SELECT * FROM test.comp WHERE pk2='some text' AND val='xxx' ALLOW FILTERING;
> SELECT * FROM test.comp WHERE pk2 LIKE 'prefix%' AND val='xxx' ALLOW 
> FILTERING;
> //Without using SASI
> SELECT * FROM test.comp WHERE pk1 = 1 AND pk2='some text';
> SELECT * FROM test.comp WHERE pk1 IN(1,2,3) AND pk2='some text';
> SELECT * FROM test.comp WHERE pk1 = 1 AND pk2 IN ('text1','text2');
> SELECT * FROM test.comp WHERE pk1 IN(1,2,3) AND pk2 IN ('text1','text2');
> {code}
> However, the following queries *are not possible*
> {code:sql}
> SELECT * FROM test.comp WHERE pk1=1 AND pk2 LIKE 'prefix%';
> SELECT * FROM test.comp WHERE pk1>=1 AND pk1<=5 AND pk2 = 'some text';
> SELECT * FROM test.comp WHERE pk1>=1 AND pk1<=5 AND pk2 LIKE 'prefix%';
> {code}
> All of them are throwing the following exception
> {noformat}
> ava.lang.UnsupportedOperationException: null
>   at 
> org.apache.cassandra.cql3.restrictions.SingleColumnRestriction$LikeRestriction.appendTo(SingleColumnRestriction.java:715)
>  ~[main/:na]
>   at 
> org.apache.cassandra.cql3.restrictions.PartitionKeySingleRestrictionSet.values(PartitionKeySingleRestrictionSet.java:86)
>  ~[main/:na]
>   at 
> org.apache.cassandra.cql3.restrictions.StatementRestrictions.getPartitionKeys(StatementRestrictions.java:585)
>  ~[main/:na]
>   at 
> org.apache.cassandra.cql3.statements.SelectStatement.getSliceCommands(SelectStatement.java:473)
>  ~[main/:na]
>   at 
> org.apache.cassandra.cql3.statements.SelectStatement.getQuery(SelectStatement.java:265)
>  ~[main/:na]
>   at 
> org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:230)
>  ~[main/:na]
>   at 
> org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:79)
>  ~[main/:na]
>   at 
> org.apache.cassandra.cql3.QueryProcessor.processStatement(QueryProcessor.java:208)
>  ~[main/:na]
>   at 
> org.apache.cassandra.cql3.QueryProcessor.process(QueryProcessor.java:239) 
> ~[main/:na]
>   at 
> org.apache.cassandra.cql3.QueryProcessor.process(QueryProcessor.java:224) 
> ~[main/:na]
>   at 
> org.apache.cassandra.transport.messages.QueryMessage.execute(QueryMessage.java:115)
>  ~[main/:na]
>   at 
> org.apache.cassandra.transport.Message$Dispatcher.channelRead0(Message.java:507)
>  [main/:na]
>   at 
> org.apache.cassandra.transport.Message$Dispatcher.channelRead0(Message.java:401)
>  [main/:na]
>   at 
> io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
>  [netty-all-4.0.36.Final.jar:4.0.36.Final]
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:292)
>  [netty-all-4.0.36.Final.jar:4.0.36.Final]
>   at 
> io.netty.channel.AbstractChannelHandlerContext.access$600(AbstractChannelHandlerContext.java:32)
>  [netty-all-4.0.36.Final.jar:4.0.36.Final]
>   at 
> 

[jira] [Updated] (CASSANDRA-11734) Enable partition component index for SASI

2016-05-08 Thread DOAN DuyHai (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-11734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DOAN DuyHai updated CASSANDRA-11734:

Status: Patch Available  (was: Open)

> Enable partition component index for SASI
> -
>
> Key: CASSANDRA-11734
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11734
> Project: Cassandra
>  Issue Type: Improvement
>  Components: CQL
>Reporter: DOAN DuyHai
>Assignee: DOAN DuyHai
>  Labels: doc-impacting, sasi, secondaryIndex
> Fix For: 3.8
>
> Attachments: patch.txt
>
>
> Enable partition component index for SASI
> For the given schema:
> {code:sql}
> CREATE TABLE test.comp (
> pk1 int,
> pk2 text,
> val text,
> PRIMARY KEY ((pk1, pk2))
> );
> CREATE CUSTOM INDEX comp_val_idx ON test.comp (val) USING 
> 'org.apache.cassandra.index.sasi.SASIIndex';
> CREATE CUSTOM INDEX comp_pk2_idx ON test.comp (pk2) USING 
> 'org.apache.cassandra.index.sasi.SASIIndex' WITH OPTIONS = {'mode': 'PREFIX', 
> 'analyzer_class': 
> 'org.apache.cassandra.index.sasi.analyzer.NonTokenizingAnalyzer', 
> 'case_sensitive': 'false'};
> CREATE CUSTOM INDEX comp_pk1_idx ON test.comp (pk1) USING 
> 'org.apache.cassandra.index.sasi.SASIIndex';
> {code}
> The following queries are possible:
> {code:sql}
> SELECT * FROM test.comp WHERE pk1=1;
> SELECT * FROM test.comp WHERE pk1>=1 AND pk1<=5;
> SELECT * FROM test.comp WHERE pk1=1 AND val='xxx' ALLOW FILTERING;
> SELECT * FROM test.comp WHERE pk1>=1 AND pk1<=5 AND val='xxx' ALLOW FILTERING;
> SELECT * FROM test.comp WHERE pk2='some text';
> SELECT * FROM test.comp WHERE pk2 LIKE 'prefix%';
> SELECT * FROM test.comp WHERE pk2='some text' AND val='xxx' ALLOW FILTERING;
> SELECT * FROM test.comp WHERE pk2 LIKE 'prefix%' AND val='xxx' ALLOW 
> FILTERING;
> //Without using SASI
> SELECT * FROM test.comp WHERE pk1 = 1 AND pk2='some text';
> SELECT * FROM test.comp WHERE pk1 IN(1,2,3) AND pk2='some text';
> SELECT * FROM test.comp WHERE pk1 = 1 AND pk2 IN ('text1','text2');
> SELECT * FROM test.comp WHERE pk1 IN(1,2,3) AND pk2 IN ('text1','text2');
> {code}
> However, the following queries *are not possible*
> {code:sql}
> SELECT * FROM test.comp WHERE pk1=1 AND pk2 LIKE 'prefix%';
> SELECT * FROM test.comp WHERE pk1>=1 AND pk1<=5 AND pk2 = 'some text';
> SELECT * FROM test.comp WHERE pk1>=1 AND pk1<=5 AND pk2 LIKE 'prefix%';
> {code}
> All of them are throwing the following exception
> {noformat}
> ava.lang.UnsupportedOperationException: null
>   at 
> org.apache.cassandra.cql3.restrictions.SingleColumnRestriction$LikeRestriction.appendTo(SingleColumnRestriction.java:715)
>  ~[main/:na]
>   at 
> org.apache.cassandra.cql3.restrictions.PartitionKeySingleRestrictionSet.values(PartitionKeySingleRestrictionSet.java:86)
>  ~[main/:na]
>   at 
> org.apache.cassandra.cql3.restrictions.StatementRestrictions.getPartitionKeys(StatementRestrictions.java:585)
>  ~[main/:na]
>   at 
> org.apache.cassandra.cql3.statements.SelectStatement.getSliceCommands(SelectStatement.java:473)
>  ~[main/:na]
>   at 
> org.apache.cassandra.cql3.statements.SelectStatement.getQuery(SelectStatement.java:265)
>  ~[main/:na]
>   at 
> org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:230)
>  ~[main/:na]
>   at 
> org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:79)
>  ~[main/:na]
>   at 
> org.apache.cassandra.cql3.QueryProcessor.processStatement(QueryProcessor.java:208)
>  ~[main/:na]
>   at 
> org.apache.cassandra.cql3.QueryProcessor.process(QueryProcessor.java:239) 
> ~[main/:na]
>   at 
> org.apache.cassandra.cql3.QueryProcessor.process(QueryProcessor.java:224) 
> ~[main/:na]
>   at 
> org.apache.cassandra.transport.messages.QueryMessage.execute(QueryMessage.java:115)
>  ~[main/:na]
>   at 
> org.apache.cassandra.transport.Message$Dispatcher.channelRead0(Message.java:507)
>  [main/:na]
>   at 
> org.apache.cassandra.transport.Message$Dispatcher.channelRead0(Message.java:401)
>  [main/:na]
>   at 
> io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
>  [netty-all-4.0.36.Final.jar:4.0.36.Final]
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:292)
>  [netty-all-4.0.36.Final.jar:4.0.36.Final]
>   at 
> io.netty.channel.AbstractChannelHandlerContext.access$600(AbstractChannelHandlerContext.java:32)
>  [netty-all-4.0.36.Final.jar:4.0.36.Final]
>   at 
> io.netty.channel.AbstractChannelHandlerContext$7.run(AbstractChannelHandlerContext.java:283)
>  [netty-all-4.0.36.Final.jar:4.0.36.Final]
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) 
> [na:1.8.0_45]
>   at 
> 

[jira] [Created] (CASSANDRA-11734) Enable partition component index for SASI

2016-05-08 Thread DOAN DuyHai (JIRA)
DOAN DuyHai created CASSANDRA-11734:
---

 Summary: Enable partition component index for SASI
 Key: CASSANDRA-11734
 URL: https://issues.apache.org/jira/browse/CASSANDRA-11734
 Project: Cassandra
  Issue Type: Improvement
  Components: CQL
Reporter: DOAN DuyHai
Assignee: DOAN DuyHai
 Fix For: 3.8
 Attachments: patch.txt

Enable partition component index for SASI

For the given schema:

{code:sql}
CREATE TABLE test.comp (
pk1 int,
pk2 text,
val text,
PRIMARY KEY ((pk1, pk2))
);

CREATE CUSTOM INDEX comp_val_idx ON test.comp (val) USING 
'org.apache.cassandra.index.sasi.SASIIndex';
CREATE CUSTOM INDEX comp_pk2_idx ON test.comp (pk2) USING 
'org.apache.cassandra.index.sasi.SASIIndex' WITH OPTIONS = {'mode': 'PREFIX', 
'analyzer_class': 
'org.apache.cassandra.index.sasi.analyzer.NonTokenizingAnalyzer', 
'case_sensitive': 'false'};
CREATE CUSTOM INDEX comp_pk1_idx ON test.comp (pk1) USING 
'org.apache.cassandra.index.sasi.SASIIndex';
{code}

The following queries are possible:

{code:sql}
SELECT * FROM test.comp WHERE pk1=1;
SELECT * FROM test.comp WHERE pk1>=1 AND pk1<=5;

SELECT * FROM test.comp WHERE pk1=1 AND val='xxx' ALLOW FILTERING;
SELECT * FROM test.comp WHERE pk1>=1 AND pk1<=5 AND val='xxx' ALLOW FILTERING;

SELECT * FROM test.comp WHERE pk2='some text';
SELECT * FROM test.comp WHERE pk2 LIKE 'prefix%';

SELECT * FROM test.comp WHERE pk2='some text' AND val='xxx' ALLOW FILTERING;
SELECT * FROM test.comp WHERE pk2 LIKE 'prefix%' AND val='xxx' ALLOW FILTERING;

//Without using SASI
SELECT * FROM test.comp WHERE pk1 = 1 AND pk2='some text';
SELECT * FROM test.comp WHERE pk1 IN(1,2,3) AND pk2='some text';

SELECT * FROM test.comp WHERE pk1 = 1 AND pk2 IN ('text1','text2');
SELECT * FROM test.comp WHERE pk1 IN(1,2,3) AND pk2 IN ('text1','text2');
{code}

However, the following queries *are not possible*

{code:sql}
SELECT * FROM test.comp WHERE pk1=1 AND pk2 LIKE 'prefix%';

SELECT * FROM test.comp WHERE pk1>=1 AND pk1<=5 AND pk2 = 'some text';

SELECT * FROM test.comp WHERE pk1>=1 AND pk1<=5 AND pk2 LIKE 'prefix%';
{code}

All of them are throwing the following exception

{noformat}
ava.lang.UnsupportedOperationException: null
at 
org.apache.cassandra.cql3.restrictions.SingleColumnRestriction$LikeRestriction.appendTo(SingleColumnRestriction.java:715)
 ~[main/:na]
at 
org.apache.cassandra.cql3.restrictions.PartitionKeySingleRestrictionSet.values(PartitionKeySingleRestrictionSet.java:86)
 ~[main/:na]
at 
org.apache.cassandra.cql3.restrictions.StatementRestrictions.getPartitionKeys(StatementRestrictions.java:585)
 ~[main/:na]
at 
org.apache.cassandra.cql3.statements.SelectStatement.getSliceCommands(SelectStatement.java:473)
 ~[main/:na]
at 
org.apache.cassandra.cql3.statements.SelectStatement.getQuery(SelectStatement.java:265)
 ~[main/:na]
at 
org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:230)
 ~[main/:na]
at 
org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:79)
 ~[main/:na]
at 
org.apache.cassandra.cql3.QueryProcessor.processStatement(QueryProcessor.java:208)
 ~[main/:na]
at 
org.apache.cassandra.cql3.QueryProcessor.process(QueryProcessor.java:239) 
~[main/:na]
at 
org.apache.cassandra.cql3.QueryProcessor.process(QueryProcessor.java:224) 
~[main/:na]
at 
org.apache.cassandra.transport.messages.QueryMessage.execute(QueryMessage.java:115)
 ~[main/:na]
at 
org.apache.cassandra.transport.Message$Dispatcher.channelRead0(Message.java:507)
 [main/:na]
at 
org.apache.cassandra.transport.Message$Dispatcher.channelRead0(Message.java:401)
 [main/:na]
at 
io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
 [netty-all-4.0.36.Final.jar:4.0.36.Final]
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:292)
 [netty-all-4.0.36.Final.jar:4.0.36.Final]
at 
io.netty.channel.AbstractChannelHandlerContext.access$600(AbstractChannelHandlerContext.java:32)
 [netty-all-4.0.36.Final.jar:4.0.36.Final]
at 
io.netty.channel.AbstractChannelHandlerContext$7.run(AbstractChannelHandlerContext.java:283)
 [netty-all-4.0.36.Final.jar:4.0.36.Final]
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) 
[na:1.8.0_45]
at 
org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$FutureTask.run(AbstractLocalAwareExecutorService.java:164)
 [main/:na]
at org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:106) 
[main/:na]
{noformat}

Indeed, to allow the above queries we'll need to update the class 
{{StatementRestrictions}} and introduce a new class 
{{PartitionKeyMultipleRestrictionsSet}} and I don't feel like doing it, 

[jira] [Commented] (CASSANDRA-8844) Change Data Capture (CDC)

2016-04-29 Thread DOAN DuyHai (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15264753#comment-15264753
 ] 

DOAN DuyHai commented on CASSANDRA-8844:


> I don't see reject all being viable since it'll shut down your cluster, and 
> reject none on a slow consumer scenario would just lead to disk space 
> exhaustion and lost CDC data.

 Didn't we agree in the past on the idea of having a parameter in the YAML that 
works similarly to disk_failure_policy and lets users decide which behavior they 
want on CDC overflow (reject writes / stop creating CDC / ...)?
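
For illustration only, a hypothetical shape for such a parameter (names invented, not an actual Cassandra setting):

{code:java}
// Hypothetical sketch of a cdc_overflow_policy option in the spirit of disk_failure_policy,
// letting operators choose the behavior when the CDC log runs out of space.
enum CdcOverflowPolicy
{
    REJECT_WRITES,   // fail writes to CDC-enabled tables until space is reclaimed
    STOP_CDC,        // keep accepting writes but stop producing new CDC data
    IGNORE           // keep writing CDC data and accept the risk of disk exhaustion
}
{code}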

> Change Data Capture (CDC)
> -
>
> Key: CASSANDRA-8844
> URL: https://issues.apache.org/jira/browse/CASSANDRA-8844
> Project: Cassandra
>  Issue Type: New Feature
>  Components: Coordination, Local Write-Read Paths
>Reporter: Tupshin Harper
>Assignee: Joshua McKenzie
>Priority: Critical
> Fix For: 3.x
>
>
> "In databases, change data capture (CDC) is a set of software design patterns 
> used to determine (and track) the data that has changed so that action can be 
> taken using the changed data. Also, Change data capture (CDC) is an approach 
> to data integration that is based on the identification, capture and delivery 
> of the changes made to enterprise data sources."
> -Wikipedia
> As Cassandra is increasingly being used as the Source of Record (SoR) for 
> mission critical data in large enterprises, it is increasingly being called 
> upon to act as the central hub of traffic and data flow to other systems. In 
> order to try to address the general need, we (cc [~brianmhess]), propose 
> implementing a simple data logging mechanism to enable per-table CDC patterns.
> h2. The goals:
> # Use CQL as the primary ingestion mechanism, in order to leverage its 
> Consistency Level semantics, and in order to treat it as the single 
> reliable/durable SoR for the data.
> # To provide a mechanism for implementing good and reliable 
> (deliver-at-least-once with possible mechanisms for deliver-exactly-once ) 
> continuous semi-realtime feeds of mutations going into a Cassandra cluster.
> # To eliminate the developmental and operational burden of users so that they 
> don't have to do dual writes to other systems.
> # For users that are currently doing batch export from a Cassandra system, 
> give them the opportunity to make that realtime with a minimum of coding.
> h2. The mechanism:
> We propose a durable logging mechanism that functions similar to a commitlog, 
> with the following nuances:
> - Takes place on every node, not just the coordinator, so RF number of copies 
> are logged.
> - Separate log per table.
> - Per-table configuration. Only tables that are specified as CDC_LOG would do 
> any logging.
> - Per DC. We are trying to keep the complexity to a minimum to make this an 
> easy enhancement, but most likely use cases would prefer to only implement 
> CDC logging in one (or a subset) of the DCs that are being replicated to
> - In the critical path of ConsistencyLevel acknowledgment. Just as with the 
> commitlog, failure to write to the CDC log should fail that node's write. If 
> that means the requested consistency level was not met, then clients *should* 
> experience UnavailableExceptions.
> - Be written in a Row-centric manner such that it is easy for consumers to 
> reconstitute rows atomically.
> - Written in a simple format designed to be consumed *directly* by daemons 
> written in non JVM languages
> h2. Nice-to-haves
> I strongly suspect that the following features will be asked for, but I also 
> believe that they can be deferred for a subsequent release, and to guage 
> actual interest.
> - Multiple logs per table. This would make it easy to have multiple 
> "subscribers" to a single table's changes. A workaround would be to create a 
> forking daemon listener, but that's not a great answer.
> - Log filtering. Being able to apply filters, including UDF-based filters 
> would make Casandra a much more versatile feeder into other systems, and 
> again, reduce complexity that would otherwise need to be built into the 
> daemons.
> h2. Format and Consumption
> - Cassandra would only write to the CDC log, and never delete from it. 
> - Cleaning up consumed logfiles would be the client daemon's responibility
> - Logfile size should probably be configurable.
> - Logfiles should be named with a predictable naming schema, making it 
> triivial to process them in order.
> - Daemons should be able to checkpoint their work, and resume from where they 
> left off. This means they would have to leave some file artifact in the CDC 
> log's directory.
> - A sophisticated daemon should be able to be written that could 
> -- Catch up, in written-order, even when it is multiple logfiles behind in 
> processing
> -- Be able to 

[jira] [Comment Edited] (CASSANDRA-8844) Change Data Capture (CDC)

2016-04-29 Thread DOAN DuyHai (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15264753#comment-15264753
 ] 

DOAN DuyHai edited comment on CASSANDRA-8844 at 4/29/16 9:11 PM:
-

bq. I don't see reject all being viable since it'll shut down your cluster, and 
reject none on a slow consumer scenario would just lead to disk space 
exhaustion and lost CDC data.

 Didn't we agree in the past on the idea of having a parameter in the YAML that 
works similarly to disk_failure_policy and lets users decide which behavior they 
want on CDC overflow (reject writes / stop creating CDC / ...)?


was (Author: doanduyhai):
> I don't see reject all being viable since it'll shut down your cluster, and 
> reject none on a slow consumer scenario would just lead to disk space 
> exhaustion and lost CDC data.

 Didn't we agreed in the past on the idea of having a parameter in Yaml that 
works similar to disk_failure_policy and let users decide which behavior they 
want on CDC overflow (reject writes/stop creating CDC/ ) ?

> Change Data Capture (CDC)
> -
>
> Key: CASSANDRA-8844
> URL: https://issues.apache.org/jira/browse/CASSANDRA-8844
> Project: Cassandra
>  Issue Type: New Feature
>  Components: Coordination, Local Write-Read Paths
>Reporter: Tupshin Harper
>Assignee: Joshua McKenzie
>Priority: Critical
> Fix For: 3.x
>
>
> "In databases, change data capture (CDC) is a set of software design patterns 
> used to determine (and track) the data that has changed so that action can be 
> taken using the changed data. Also, Change data capture (CDC) is an approach 
> to data integration that is based on the identification, capture and delivery 
> of the changes made to enterprise data sources."
> -Wikipedia
> As Cassandra is increasingly being used as the Source of Record (SoR) for 
> mission critical data in large enterprises, it is increasingly being called 
> upon to act as the central hub of traffic and data flow to other systems. In 
> order to try to address the general need, we (cc [~brianmhess]), propose 
> implementing a simple data logging mechanism to enable per-table CDC patterns.
> h2. The goals:
> # Use CQL as the primary ingestion mechanism, in order to leverage its 
> Consistency Level semantics, and in order to treat it as the single 
> reliable/durable SoR for the data.
> # To provide a mechanism for implementing good and reliable 
> (deliver-at-least-once with possible mechanisms for deliver-exactly-once ) 
> continuous semi-realtime feeds of mutations going into a Cassandra cluster.
> # To eliminate the developmental and operational burden of users so that they 
> don't have to do dual writes to other systems.
> # For users that are currently doing batch export from a Cassandra system, 
> give them the opportunity to make that realtime with a minimum of coding.
> h2. The mechanism:
> We propose a durable logging mechanism that functions similar to a commitlog, 
> with the following nuances:
> - Takes place on every node, not just the coordinator, so RF number of copies 
> are logged.
> - Separate log per table.
> - Per-table configuration. Only tables that are specified as CDC_LOG would do 
> any logging.
> - Per DC. We are trying to keep the complexity to a minimum to make this an 
> easy enhancement, but most likely use cases would prefer to only implement 
> CDC logging in one (or a subset) of the DCs that are being replicated to
> - In the critical path of ConsistencyLevel acknowledgment. Just as with the 
> commitlog, failure to write to the CDC log should fail that node's write. If 
> that means the requested consistency level was not met, then clients *should* 
> experience UnavailableExceptions.
> - Be written in a Row-centric manner such that it is easy for consumers to 
> reconstitute rows atomically.
> - Written in a simple format designed to be consumed *directly* by daemons 
> written in non JVM languages
> h2. Nice-to-haves
> I strongly suspect that the following features will be asked for, but I also 
> believe that they can be deferred for a subsequent release, and to guage 
> actual interest.
> - Multiple logs per table. This would make it easy to have multiple 
> "subscribers" to a single table's changes. A workaround would be to create a 
> forking daemon listener, but that's not a great answer.
> - Log filtering. Being able to apply filters, including UDF-based filters 
> would make Casandra a much more versatile feeder into other systems, and 
> again, reduce complexity that would otherwise need to be built into the 
> daemons.
> h2. Format and Consumption
> - Cassandra would only write to the CDC log, and never delete from it. 
> - Cleaning up consumed logfiles would be the client daemon's responibility
> - Logfile size should probably be 

[jira] [Commented] (CASSANDRA-11538) Give secondary index on partition column the same treatment as static column

2016-04-29 Thread DOAN DuyHai (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15263801#comment-15263801
 ] 

DOAN DuyHai commented on CASSANDRA-11538:
-

OK, to make it clearer:

For the static column index (*RegularColumnIndex*):

{code:java}
public CBuilder buildIndexClusteringPrefix(ByteBuffer partitionKey,
                                           ClusteringPrefix prefix,
                                           CellPath path)
{
    CBuilder builder = CBuilder.create(getIndexComparator());
    builder.add(partitionKey);
    for (int i = 0; i < prefix.size(); i++)
        builder.add(prefix.get(i));

    // Note: if indexing a static column, prefix will be Clustering.STATIC_CLUSTERING
    // so the Clustering obtained from builder::build will contain a value for only
    // the partition key. At query time though, this is all that's needed as the entire
    // base table partition should be returned for any matching index entry.
    return builder;
}
{code}

For the partition component index (*PartitionKeyIndex*):

{code:java}
public CBuilder buildIndexClusteringPrefix(ByteBuffer partitionKey,
                                           ClusteringPrefix prefix,
                                           CellPath path)
{
    CBuilder builder = CBuilder.create(getIndexComparator());
    builder.add(partitionKey);
    for (int i = 0; i < prefix.size(); i++)
        builder.add(prefix.get(i));
    return builder;
}
{code}

As you can see, for the static column index we only store the singleton 
*Clustering.STATIC_CLUSTERING* because it is sufficient. The partition component 
index, however, stores all the clustering values of every row, which is not 
necessary since the full partition key alone is enough to access the complete 
partition.
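
Under that reasoning, a minimal sketch of what the {{PartitionKeyIndex}} version could look like (not an actual patch):

{code:java}
public CBuilder buildIndexClusteringPrefix(ByteBuffer partitionKey,
                                           ClusteringPrefix prefix,
                                           CellPath path)
{
    // Only the base table's full partition key is needed to locate the whole
    // partition at query time, so the per-row clustering values are not added.
    CBuilder builder = CBuilder.create(getIndexComparator());
    builder.add(partitionKey);
    return builder;
}
{code}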

> Give secondary index on partition column the same treatment as static column
> 
>
> Key: CASSANDRA-11538
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11538
> Project: Cassandra
>  Issue Type: Improvement
>  Components: CQL
> Environment: Cassandra 3.4
>Reporter: DOAN DuyHai
>Priority: Minor
>
> For index on static column, in the index table we store:
> - partition key = base table static column value
> - clustering =  base table complete partition key (as a single value)
> The way we handle index on partition column is different, we store:
> - partition key = base table partition column value
> - clustering 1 = base table complete partition key (as a single value)
> - clustering 2 ... N+1 = N clustering values of the base table
> It is more consistent to give partition column index the same treatment as 
> the one for static column



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-10783) Allow literal value as parameter of UDF & UDA

2016-04-27 Thread DOAN DuyHai (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15259922#comment-15259922
 ] 

DOAN DuyHai commented on CASSANDRA-10783:
-

Ok so it'll be in 3.8 then

> Allow literal value as parameter of UDF & UDA
> -
>
> Key: CASSANDRA-10783
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10783
> Project: Cassandra
>  Issue Type: Improvement
>  Components: CQL
>Reporter: DOAN DuyHai
>Assignee: Robert Stupp
>Priority: Minor
>  Labels: CQL3, UDF, client-impacting, doc-impacting
> Fix For: 3.x
>
>
> I have defined the following UDF
> {code:sql}
> CREATE OR REPLACE FUNCTION  maxOf(current int, testValue int) RETURNS NULL ON 
> NULL INPUT 
> RETURNS int 
> LANGUAGE java 
> AS  'return Math.max(current,testValue);'
> CREATE TABLE maxValue(id int primary key, val int);
> INSERT INTO maxValue(id, val) VALUES(1, 100);
> SELECT maxOf(val, 101) FROM maxValue WHERE id=1;
> {code}
> I got the following error message:
> {code}
> SyntaxException:  message="line 1:19 no viable alternative at input '101' (SELECT maxOf(val1, 
> [101]...)">
> {code}
>  It would be nice to allow literal value as parameter of UDF and UDA too.
>  I was thinking about an use-case for an UDA groupBy() function where the end 
> user can *inject* at runtime a literal value to select which aggregation he 
> want to display, something similar to GROUP BY ... HAVING 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-11434) Support EQ/PREFIX queries in CONTAINS mode without tokenization by augmenting SA metadata per term

2016-04-25 Thread DOAN DuyHai (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15256419#comment-15256419
 ] 

DOAN DuyHai commented on CASSANDRA-11434:
-

[~xedin] [~jrwest]

Curious question: can we make this EQ/PREFIX support work for CONTAINS mode with 
the StandardAnalyzer? If not, why not?

> Support EQ/PREFIX queries in CONTAINS mode without tokenization by augmenting 
> SA metadata per term
> --
>
> Key: CASSANDRA-11434
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11434
> Project: Cassandra
>  Issue Type: Improvement
>  Components: sasi
>Reporter: Pavel Yaskevich
>Assignee: Jordan West
> Fix For: 3.6
>
>
> We can support EQ/PREFIX requests to CONTAINS indexes by tracking 
> "partiality" of the data stored in the OnDiskIndex and IndexMemtable, if we 
> know exactly if current match represents part of the term or it's original 
> form it would be trivial to support EQ/PREFIX since PREFIX is subset of 
> SUFFIX matches.
> Since we attach uint16 size to each term stored we can take advantage of sign 
> bit so size of the index is not impacted at all.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (CASSANDRA-11632) Commitlog corruption

2016-04-22 Thread DOAN DuyHai (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-11632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DOAN DuyHai resolved CASSANDRA-11632.
-
Resolution: Cannot Reproduce

> Commitlog corruption
> 
>
> Key: CASSANDRA-11632
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11632
> Project: Cassandra
>  Issue Type: Bug
> Environment: Cassandra 3.6 SNAPSHOT
>Reporter: DOAN DuyHai
>
> {noformat}
> INFO  10:00:08 Replaying 
> /Users/archinnovinfo/perso/cassandra/data/commitlog/CommitLog-6-1461260041762.log,
>  
> /Users/archinnovinfo/perso/cassandra/data/commitlog/CommitLog-6-1461260041763.log
> ERROR 10:00:08 Exiting due to error while processing commit log during 
> initialization.
> org.apache.cassandra.db.commitlog.CommitLogReplayer$CommitLogReplayException: 
> Unexpected error deserializing mutation; saved to 
> /var/folders/9s/gkchxg6x7qb0vh0k6_cmdl54gn/T/mutation916280897052665587dat.
>   This may be caused by replaying a mutation against a table with the same 
> name but incompatible schema.  Exception follows: 
> org.apache.cassandra.serializers.MarshalException: Not enough bytes to read 
> 0th field java.nio.HeapByteBuffer[pos=0 lim=6 cap=6]
>   at 
> org.apache.cassandra.db.commitlog.CommitLogReplayer.handleReplayError(CommitLogReplayer.java:611)
>  [main/:na]
>   at 
> org.apache.cassandra.db.commitlog.CommitLogReplayer.replayMutation(CommitLogReplayer.java:568)
>  [main/:na]
>   at 
> org.apache.cassandra.db.commitlog.CommitLogReplayer.replaySyncSection(CommitLogReplayer.java:521)
>  [main/:na]
>   at 
> org.apache.cassandra.db.commitlog.CommitLogReplayer.recover(CommitLogReplayer.java:407)
>  [main/:na]
>   at 
> org.apache.cassandra.db.commitlog.CommitLogReplayer.recover(CommitLogReplayer.java:236)
>  [main/:na]
>   at 
> org.apache.cassandra.db.commitlog.CommitLog.recover(CommitLog.java:193) 
> [main/:na]
>   at 
> org.apache.cassandra.db.commitlog.CommitLog.recover(CommitLog.java:173) 
> [main/:na]
>   at 
> org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:292) 
> [main/:na]
>   at 
> org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:558)
>  [main/:na]
>   at 
> org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:687) 
> [main/:na]
> {noformat}
> I'm using the 3.6_SNAPSHOT at _ae063e8 JSON datetime formatting needs 
> timezone_
> The commitlog files are downloadable here: 
> https://drive.google.com/file/d/0B6wR2aj4Cb6wUXpZc1dQcmZvb1U/view?usp=sharing



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-11632) Commitlog corruption

2016-04-22 Thread DOAN DuyHai (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15254113#comment-15254113
 ] 

DOAN DuyHai commented on CASSANDRA-11632:
-

Ok, I can't reproduce it anymore; it's probably an issue with 
dropping/recreating the schema. Closing it now as not reproducible.
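
For illustration, a hedged sketch of the kind of sequence the error message and the comment above point to: mutations written against the old table layout remain in the commitlog, and a restart replays them against a recreated, incompatible schema. Keyspace, table and column names are made up:

{code:sql}
CREATE TABLE ks.t (id int PRIMARY KEY, val text);
INSERT INTO ks.t (id, val) VALUES (1, 'a');        -- mutation sits in the commitlog
DROP TABLE ks.t;
CREATE TABLE ks.t (id int PRIMARY KEY, val int);   -- same name, incompatible type
-- A restart before the old segment is discarded may then fail replay with a
-- MarshalException like the one in the description above.
{code}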

> Commitlog corruption
> 
>
> Key: CASSANDRA-11632
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11632
> Project: Cassandra
>  Issue Type: Bug
> Environment: Cassandra 3.6 SNAPSHOT
>Reporter: DOAN DuyHai
>
> {noformat}
> INFO  10:00:08 Replaying 
> /Users/archinnovinfo/perso/cassandra/data/commitlog/CommitLog-6-1461260041762.log,
>  
> /Users/archinnovinfo/perso/cassandra/data/commitlog/CommitLog-6-1461260041763.log
> ERROR 10:00:08 Exiting due to error while processing commit log during 
> initialization.
> org.apache.cassandra.db.commitlog.CommitLogReplayer$CommitLogReplayException: 
> Unexpected error deserializing mutation; saved to 
> /var/folders/9s/gkchxg6x7qb0vh0k6_cmdl54gn/T/mutation916280897052665587dat.
>   This may be caused by replaying a mutation against a table with the same 
> name but incompatible schema.  Exception follows: 
> org.apache.cassandra.serializers.MarshalException: Not enough bytes to read 
> 0th field java.nio.HeapByteBuffer[pos=0 lim=6 cap=6]
>   at 
> org.apache.cassandra.db.commitlog.CommitLogReplayer.handleReplayError(CommitLogReplayer.java:611)
>  [main/:na]
>   at 
> org.apache.cassandra.db.commitlog.CommitLogReplayer.replayMutation(CommitLogReplayer.java:568)
>  [main/:na]
>   at 
> org.apache.cassandra.db.commitlog.CommitLogReplayer.replaySyncSection(CommitLogReplayer.java:521)
>  [main/:na]
>   at 
> org.apache.cassandra.db.commitlog.CommitLogReplayer.recover(CommitLogReplayer.java:407)
>  [main/:na]
>   at 
> org.apache.cassandra.db.commitlog.CommitLogReplayer.recover(CommitLogReplayer.java:236)
>  [main/:na]
>   at 
> org.apache.cassandra.db.commitlog.CommitLog.recover(CommitLog.java:193) 
> [main/:na]
>   at 
> org.apache.cassandra.db.commitlog.CommitLog.recover(CommitLog.java:173) 
> [main/:na]
>   at 
> org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:292) 
> [main/:na]
>   at 
> org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:558)
>  [main/:na]
>   at 
> org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:687) 
> [main/:na]
> {noformat}
> I'm using the 3.6_SNAPSHOT at _ae063e8 JSON datetime formatting needs 
> timezone_
> The commitlog files are downloadable here: 
> https://drive.google.com/file/d/0B6wR2aj4Cb6wUXpZc1dQcmZvb1U/view?usp=sharing



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (CASSANDRA-11632) Commitlog corruption

2016-04-22 Thread DOAN DuyHai (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-11632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DOAN DuyHai updated CASSANDRA-11632:

Description: 
{noformat}
INFO  10:00:08 Replaying 
/Users/archinnovinfo/perso/cassandra/data/commitlog/CommitLog-6-1461260041762.log,
 
/Users/archinnovinfo/perso/cassandra/data/commitlog/CommitLog-6-1461260041763.log
ERROR 10:00:08 Exiting due to error while processing commit log during 
initialization.
org.apache.cassandra.db.commitlog.CommitLogReplayer$CommitLogReplayException: 
Unexpected error deserializing mutation; saved to 
/var/folders/9s/gkchxg6x7qb0vh0k6_cmdl54gn/T/mutation916280897052665587dat. 
 This may be caused by replaying a mutation against a table with the same name 
but incompatible schema.  Exception follows: 
org.apache.cassandra.serializers.MarshalException: Not enough bytes to read 0th 
field java.nio.HeapByteBuffer[pos=0 lim=6 cap=6]
at 
org.apache.cassandra.db.commitlog.CommitLogReplayer.handleReplayError(CommitLogReplayer.java:611)
 [main/:na]
at 
org.apache.cassandra.db.commitlog.CommitLogReplayer.replayMutation(CommitLogReplayer.java:568)
 [main/:na]
at 
org.apache.cassandra.db.commitlog.CommitLogReplayer.replaySyncSection(CommitLogReplayer.java:521)
 [main/:na]
at 
org.apache.cassandra.db.commitlog.CommitLogReplayer.recover(CommitLogReplayer.java:407)
 [main/:na]
at 
org.apache.cassandra.db.commitlog.CommitLogReplayer.recover(CommitLogReplayer.java:236)
 [main/:na]
at 
org.apache.cassandra.db.commitlog.CommitLog.recover(CommitLog.java:193) 
[main/:na]
at 
org.apache.cassandra.db.commitlog.CommitLog.recover(CommitLog.java:173) 
[main/:na]
at 
org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:292) 
[main/:na]
at 
org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:558) 
[main/:na]
at 
org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:687) 
[main/:na]
{noformat}

I'm using the 3.6_SNAPSHOT at _ae063e8 JSON datetime formatting needs timezone_

The commitlog files are downloadable here: 
https://drive.google.com/file/d/0B6wR2aj4Cb6wUXpZc1dQcmZvb1U/view?usp=sharing


  was:
{noformat}
INFO  10:00:08 Replaying 
/Users/archinnovinfo/perso/cassandra/data/commitlog/CommitLog-6-1461260041762.log,
 
/Users/archinnovinfo/perso/cassandra/data/commitlog/CommitLog-6-1461260041763.log
ERROR 10:00:08 Exiting due to error while processing commit log during 
initialization.
org.apache.cassandra.db.commitlog.CommitLogReplayer$CommitLogReplayException: 
Unexpected error deserializing mutation; saved to 
/var/folders/9s/gkchxg6x7qb0vh0k6_cmdl54gn/T/mutation916280897052665587dat. 
 This may be caused by replaying a mutation against a table with the same name 
but incompatible schema.  Exception follows: 
org.apache.cassandra.serializers.MarshalException: Not enough bytes to read 0th 
field java.nio.HeapByteBuffer[pos=0 lim=6 cap=6]
at 
org.apache.cassandra.db.commitlog.CommitLogReplayer.handleReplayError(CommitLogReplayer.java:611)
 [main/:na]
at 
org.apache.cassandra.db.commitlog.CommitLogReplayer.replayMutation(CommitLogReplayer.java:568)
 [main/:na]
at 
org.apache.cassandra.db.commitlog.CommitLogReplayer.replaySyncSection(CommitLogReplayer.java:521)
 [main/:na]
at 
org.apache.cassandra.db.commitlog.CommitLogReplayer.recover(CommitLogReplayer.java:407)
 [main/:na]
at 
org.apache.cassandra.db.commitlog.CommitLogReplayer.recover(CommitLogReplayer.java:236)
 [main/:na]
at 
org.apache.cassandra.db.commitlog.CommitLog.recover(CommitLog.java:193) 
[main/:na]
at 
org.apache.cassandra.db.commitlog.CommitLog.recover(CommitLog.java:173) 
[main/:na]
at 
org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:292) 
[main/:na]
at 
org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:558) 
[main/:na]
at 
org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:687) 
[main/:na]
{noformat}

I'm using the 3.6_SNAPSHOT at `ae063e8 JSON datetime formatting needs timezone`

The commitlog files are downloadable here: 
https://drive.google.com/file/d/0B6wR2aj4Cb6wUXpZc1dQcmZvb1U/view?usp=sharing



> Commitlog corruption
> 
>
> Key: CASSANDRA-11632
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11632
> Project: Cassandra
>  Issue Type: Bug
> Environment: Cassandra 3.6 SNAPSHOT
>Reporter: DOAN DuyHai
>
> {noformat}
> INFO  10:00:08 Replaying 
> /Users/archinnovinfo/perso/cassandra/data/commitlog/CommitLog-6-1461260041762.log,
>  
> /Users/archinnovinfo/perso/cassandra/data/commitlog/CommitLog-6-1461260041763.log
> ERROR 10:00:08 Exiting due to error while processing commit log during 
> initialization.
> 

[jira] [Updated] (CASSANDRA-11632) Commitlog corruption

2016-04-22 Thread DOAN DuyHai (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-11632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DOAN DuyHai updated CASSANDRA-11632:

Description: 
{noformat}
INFO  10:00:08 Replaying 
/Users/archinnovinfo/perso/cassandra/data/commitlog/CommitLog-6-1461260041762.log,
 
/Users/archinnovinfo/perso/cassandra/data/commitlog/CommitLog-6-1461260041763.log
ERROR 10:00:08 Exiting due to error while processing commit log during 
initialization.
org.apache.cassandra.db.commitlog.CommitLogReplayer$CommitLogReplayException: 
Unexpected error deserializing mutation; saved to 
/var/folders/9s/gkchxg6x7qb0vh0k6_cmdl54gn/T/mutation916280897052665587dat. 
 This may be caused by replaying a mutation against a table with the same name 
but incompatible schema.  Exception follows: 
org.apache.cassandra.serializers.MarshalException: Not enough bytes to read 0th 
field java.nio.HeapByteBuffer[pos=0 lim=6 cap=6]
at 
org.apache.cassandra.db.commitlog.CommitLogReplayer.handleReplayError(CommitLogReplayer.java:611)
 [main/:na]
at 
org.apache.cassandra.db.commitlog.CommitLogReplayer.replayMutation(CommitLogReplayer.java:568)
 [main/:na]
at 
org.apache.cassandra.db.commitlog.CommitLogReplayer.replaySyncSection(CommitLogReplayer.java:521)
 [main/:na]
at 
org.apache.cassandra.db.commitlog.CommitLogReplayer.recover(CommitLogReplayer.java:407)
 [main/:na]
at 
org.apache.cassandra.db.commitlog.CommitLogReplayer.recover(CommitLogReplayer.java:236)
 [main/:na]
at 
org.apache.cassandra.db.commitlog.CommitLog.recover(CommitLog.java:193) 
[main/:na]
at 
org.apache.cassandra.db.commitlog.CommitLog.recover(CommitLog.java:173) 
[main/:na]
at 
org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:292) 
[main/:na]
at 
org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:558) 
[main/:na]
at 
org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:687) 
[main/:na]
{noformat}

I'm using the 3.6_SNAPSHOT at `ae063e8 JSON datetime formatting needs timezone`

The commitlog files are downloadable here: 
https://drive.google.com/file/d/0B6wR2aj4Cb6wUXpZc1dQcmZvb1U/view?usp=sharing


  was:
{noformat}
INFO  10:00:08 Replaying 
/Users/archinnovinfo/perso/cassandra/data/commitlog/CommitLog-6-1461260041762.log,
 
/Users/archinnovinfo/perso/cassandra/data/commitlog/CommitLog-6-1461260041763.log
ERROR 10:00:08 Exiting due to error while processing commit log during 
initialization.
org.apache.cassandra.db.commitlog.CommitLogReplayer$CommitLogReplayException: 
Unexpected error deserializing mutation; saved to 
/var/folders/9s/gkchxg6x7qb0vh0k6_cmdl54gn/T/mutation916280897052665587dat. 
 This may be caused by replaying a mutation against a table with the same name 
but incompatible schema.  Exception follows: 
org.apache.cassandra.serializers.MarshalException: Not enough bytes to read 0th 
field java.nio.HeapByteBuffer[pos=0 lim=6 cap=6]
at 
org.apache.cassandra.db.commitlog.CommitLogReplayer.handleReplayError(CommitLogReplayer.java:611)
 [main/:na]
at 
org.apache.cassandra.db.commitlog.CommitLogReplayer.replayMutation(CommitLogReplayer.java:568)
 [main/:na]
at 
org.apache.cassandra.db.commitlog.CommitLogReplayer.replaySyncSection(CommitLogReplayer.java:521)
 [main/:na]
at 
org.apache.cassandra.db.commitlog.CommitLogReplayer.recover(CommitLogReplayer.java:407)
 [main/:na]
at 
org.apache.cassandra.db.commitlog.CommitLogReplayer.recover(CommitLogReplayer.java:236)
 [main/:na]
at 
org.apache.cassandra.db.commitlog.CommitLog.recover(CommitLog.java:193) 
[main/:na]
at 
org.apache.cassandra.db.commitlog.CommitLog.recover(CommitLog.java:173) 
[main/:na]
at 
org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:292) 
[main/:na]
at 
org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:558) 
[main/:na]
at 
org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:687) 
[main/:na]
{noformat}

The commitlog files are downloadable here: 
https://drive.google.com/file/d/0B6wR2aj4Cb6wUXpZc1dQcmZvb1U/view?usp=sharing



> Commitlog corruption
> 
>
> Key: CASSANDRA-11632
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11632
> Project: Cassandra
>  Issue Type: Bug
> Environment: Cassandra 3.6 SNAPSHOT
>Reporter: DOAN DuyHai
>
> {noformat}
> INFO  10:00:08 Replaying 
> /Users/archinnovinfo/perso/cassandra/data/commitlog/CommitLog-6-1461260041762.log,
>  
> /Users/archinnovinfo/perso/cassandra/data/commitlog/CommitLog-6-1461260041763.log
> ERROR 10:00:08 Exiting due to error while processing commit log during 
> initialization.
> org.apache.cassandra.db.commitlog.CommitLogReplayer$CommitLogReplayException: 
> Unexpected error deserializing 

[jira] [Created] (CASSANDRA-11632) Commitlog corruption

2016-04-22 Thread DOAN DuyHai (JIRA)
DOAN DuyHai created CASSANDRA-11632:
---

 Summary: Commitlog corruption
 Key: CASSANDRA-11632
 URL: https://issues.apache.org/jira/browse/CASSANDRA-11632
 Project: Cassandra
  Issue Type: Bug
 Environment: Cassandra 3.6 SNAPSHOT
Reporter: DOAN DuyHai


{noformat}
INFO  10:00:08 Replaying 
/Users/archinnovinfo/perso/cassandra/data/commitlog/CommitLog-6-1461260041762.log,
 
/Users/archinnovinfo/perso/cassandra/data/commitlog/CommitLog-6-1461260041763.log
ERROR 10:00:08 Exiting due to error while processing commit log during 
initialization.
org.apache.cassandra.db.commitlog.CommitLogReplayer$CommitLogReplayException: 
Unexpected error deserializing mutation; saved to 
/var/folders/9s/gkchxg6x7qb0vh0k6_cmdl54gn/T/mutation916280897052665587dat. 
 This may be caused by replaying a mutation against a table with the same name 
but incompatible schema.  Exception follows: 
org.apache.cassandra.serializers.MarshalException: Not enough bytes to read 0th 
field java.nio.HeapByteBuffer[pos=0 lim=6 cap=6]
at 
org.apache.cassandra.db.commitlog.CommitLogReplayer.handleReplayError(CommitLogReplayer.java:611)
 [main/:na]
at 
org.apache.cassandra.db.commitlog.CommitLogReplayer.replayMutation(CommitLogReplayer.java:568)
 [main/:na]
at 
org.apache.cassandra.db.commitlog.CommitLogReplayer.replaySyncSection(CommitLogReplayer.java:521)
 [main/:na]
at 
org.apache.cassandra.db.commitlog.CommitLogReplayer.recover(CommitLogReplayer.java:407)
 [main/:na]
at 
org.apache.cassandra.db.commitlog.CommitLogReplayer.recover(CommitLogReplayer.java:236)
 [main/:na]
at 
org.apache.cassandra.db.commitlog.CommitLog.recover(CommitLog.java:193) 
[main/:na]
at 
org.apache.cassandra.db.commitlog.CommitLog.recover(CommitLog.java:173) 
[main/:na]
at 
org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:292) 
[main/:na]
at 
org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:558) 
[main/:na]
at 
org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:687) 
[main/:na]
{noformat}

The commitlog files are downloadable here: 
https://drive.google.com/file/d/0B6wR2aj4Cb6wUXpZc1dQcmZvb1U/view?usp=sharing




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-11595) Cassandra cannot start because of empty commitlog

2016-04-19 Thread DOAN DuyHai (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15248361#comment-15248361
 ] 

DOAN DuyHai commented on CASSANDRA-11595:
-

Ok so it's like a vicious circle.

 But then all those things are just *consequences* of another problem, which is 
*too many open files*. Sure we can fix this side effect, but we need to find 
the root cause of why there are too many open files.

> Cassandra cannot start because of empty commitlog
> -
>
> Key: CASSANDRA-11595
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11595
> Project: Cassandra
>  Issue Type: Bug
>Reporter: n0rad
>Assignee: Benjamin Lerer
>
> After the crash of CASSANDRA-11594.
> Cassandra try to restart and fail because of commit log replay.
> Same on 4 of the crashed nodes out of 6.
> ```
> org.apache.cassandra.db.commitlog.CommitLogReplayer$CommitLogReplayException: 
> Could not read commit log descriptor in file 
> /data/commitlog/CommitLog-6-1460632496764.log
> at 
> org.apache.cassandra.db.commitlog.CommitLogReplayer.handleReplayError(CommitLogReplayer.java:644)
>  [apache-cassandra-3.0.5.jar:3.0.5]
> ```
> This file is empty and is not the commitlog with the latest date.
> ```
> ...
> -rw-r--r-- 1 root root 32M Apr 16 21:46 CommitLog-6-1460632496761.log
> -rw-r--r-- 1 root root 32M Apr 16 21:47 CommitLog-6-1460632496762.log
> -rw-r--r-- 1 root root 32M Apr 16 21:47 CommitLog-6-1460632496763.log
> -rw-r--r-- 1 root root   0 Apr 16 21:47 CommitLog-6-1460632496764.log
> -rw-r--r-- 1 root root 32M Apr 16 21:50 CommitLog-6-1460843401097.log
> -rw-r--r-- 1 root root 32M Apr 16 21:51 CommitLog-6-1460843513346.log
> -rw-r--r-- 1 root root 32M Apr 16 21:53 CommitLog-6-1460843619271.log
> -rw-r--r-- 1 root root 32M Apr 16 21:55 CommitLog-6-1460843730533.log
> -rw-r--r-- 1 root root 32M Apr 16 21:57 CommitLog-6-1460843834129.log
> -rw-r--r-- 1 root root 32M Apr 16 21:58 CommitLog-6-1460843935094.log
> -rw-r--r-- 1 root root 32M Apr 16 22:00 CommitLog-6-1460844038543.log
> -rw-r--r-- 1 root root 32M Apr 16 22:02 CommitLog-6-1460844141003.log
> ...
> ```



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-11594) Too many open files on directories

2016-04-18 Thread DOAN DuyHai (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15246449#comment-15246449
 ] 

DOAN DuyHai commented on CASSANDRA-11594:
-

Can you also upload some logs with the full exception stack trace to narrow down 
the root cause?

> Too many open files on directories
> --
>
> Key: CASSANDRA-11594
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11594
> Project: Cassandra
>  Issue Type: Bug
>  Components: Core
>Reporter: n0rad
>Priority: Critical
> Attachments: openfiles.zip, screenshot.png
>
>
> I have a 6-node cluster in prod across 3 racks.
> each node:
> - 4Gb commitlogs on 343 files
> - 275Gb data on 504 files 
> On Saturday, 1 node in each rack crashed with too many open files (it seems 
> to be the same node in each rack).
> {code}
> lsof -n -p $PID give me 66899 out of 65826 max
> {code}
> it contains 64527 open directories (2371 uniq)
> a part of the list :
> {code}
> java19076 root 2140r  DIR   8,17  143360 4386718705 
> /opt/stage2/pod-cassandra-aci-cassandra/rootfs/data/keyspaces/email_logs_query/emails-2d4abd00e9ea11e591199d740e07bd95
> java19076 root 2141r  DIR   8,17  143360 4386718705 
> /opt/stage2/pod-cassandra-aci-cassandra/rootfs/data/keyspaces/email_logs_query/emails-2d4abd00e9ea11e591199d740e07bd95
> java19076 root 2142r  DIR   8,17  143360 4386718705 
> /opt/stage2/pod-cassandra-aci-cassandra/rootfs/data/keyspaces/email_logs_query/emails-2d4abd00e9ea11e591199d740e07bd95
> java19076 root 2143r  DIR   8,17  143360 4386718705 
> /opt/stage2/pod-cassandra-aci-cassandra/rootfs/data/keyspaces/email_logs_query/emails-2d4abd00e9ea11e591199d740e07bd95
> java19076 root 2144r  DIR   8,17  143360 4386718705 
> /opt/stage2/pod-cassandra-aci-cassandra/rootfs/data/keyspaces/email_logs_query/emails-2d4abd00e9ea11e591199d740e07bd95
> java19076 root 2145r  DIR   8,17  143360 4386718705 
> /opt/stage2/pod-cassandra-aci-cassandra/rootfs/data/keyspaces/email_logs_query/emails-2d4abd00e9ea11e591199d740e07bd95
> java19076 root 2146r  DIR   8,17  143360 4386718705 
> /opt/stage2/pod-cassandra-aci-cassandra/rootfs/data/keyspaces/email_logs_query/emails-2d4abd00e9ea11e591199d740e07bd95
> java19076 root 2147r  DIR   8,17  143360 4386718705 
> /opt/stage2/pod-cassandra-aci-cassandra/rootfs/data/keyspaces/email_logs_query/emails-2d4abd00e9ea11e591199d740e07bd95
> java19076 root 2148r  DIR   8,17  143360 4386718705 
> /opt/stage2/pod-cassandra-aci-cassandra/rootfs/data/keyspaces/email_logs_query/emails-2d4abd00e9ea11e591199d740e07bd95
> java19076 root 2149r  DIR   8,17  143360 4386718705 
> /opt/stage2/pod-cassandra-aci-cassandra/rootfs/data/keyspaces/email_logs_query/emails-2d4abd00e9ea11e591199d740e07bd95
> java19076 root 2150r  DIR   8,17  143360 4386718705 
> /opt/stage2/pod-cassandra-aci-cassandra/rootfs/data/keyspaces/email_logs_query/emails-2d4abd00e9ea11e591199d740e07bd95
> java19076 root 2151r  DIR   8,17  143360 4386718705 
> /opt/stage2/pod-cassandra-aci-cassandra/rootfs/data/keyspaces/email_logs_query/emails-2d4abd00e9ea11e591199d740e07bd95
> java19076 root 2152r  DIR   8,17  143360 4386718705 
> /opt/stage2/pod-cassandra-aci-cassandra/rootfs/data/keyspaces/email_logs_query/emails-2d4abd00e9ea11e591199d740e07bd95
> java19076 root 2153r  DIR   8,17  143360 4386718705 
> /opt/stage2/pod-cassandra-aci-cassandra/rootfs/data/keyspaces/email_logs_query/emails-2d4abd00e9ea11e591199d740e07bd95
> java19076 root 2154r  DIR   8,17  143360 4386718705 
> /opt/stage2/pod-cassandra-aci-cassandra/rootfs/data/keyspaces/email_logs_query/emails-2d4abd00e9ea11e591199d740e07bd95
> java19076 root 2155r  DIR   8,17  143360 4386718705 
> /opt/stage2/pod-cassandra-aci-cassandra/rootfs/data/keyspaces/email_logs_query/emails-2d4abd00e9ea11e591199d740e07bd95
> {code}
> The 3 other nodes crashed 4 hours later



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-11594) Too many open files on directories

2016-04-18 Thread DOAN DuyHai (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15246388#comment-15246388
 ] 

DOAN DuyHai commented on CASSANDRA-11594:
-

[~n0rad] Do you know which event triggers the spikes in open folders/files? 
Compaction? Repair? 

> Too many open files on directories
> --
>
> Key: CASSANDRA-11594
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11594
> Project: Cassandra
>  Issue Type: Bug
>  Components: Core
>Reporter: n0rad
>Priority: Critical
> Attachments: screenshot.png
>
>
> I have a 6-node cluster in prod across 3 racks.
> each node:
> - 4Gb commitlogs on 343 files
> - 275Gb data on 504 files 
> On Saturday, 1 node in each rack crashed with too many open files (it seems 
> to be the same node in each rack).
> {code}
> lsof -n -p $PID give me 66899 out of 65826 max
> {code}
> it contains 64527 open directories (2371 uniq)
> a part of the list :
> {code}
> java19076 root 2140r  DIR   8,17  143360 4386718705 
> /opt/stage2/pod-cassandra-aci-cassandra/rootfs/data/keyspaces/email_logs_query/emails-2d4abd00e9ea11e591199d740e07bd95
> java19076 root 2141r  DIR   8,17  143360 4386718705 
> /opt/stage2/pod-cassandra-aci-cassandra/rootfs/data/keyspaces/email_logs_query/emails-2d4abd00e9ea11e591199d740e07bd95
> java19076 root 2142r  DIR   8,17  143360 4386718705 
> /opt/stage2/pod-cassandra-aci-cassandra/rootfs/data/keyspaces/email_logs_query/emails-2d4abd00e9ea11e591199d740e07bd95
> java19076 root 2143r  DIR   8,17  143360 4386718705 
> /opt/stage2/pod-cassandra-aci-cassandra/rootfs/data/keyspaces/email_logs_query/emails-2d4abd00e9ea11e591199d740e07bd95
> java19076 root 2144r  DIR   8,17  143360 4386718705 
> /opt/stage2/pod-cassandra-aci-cassandra/rootfs/data/keyspaces/email_logs_query/emails-2d4abd00e9ea11e591199d740e07bd95
> java19076 root 2145r  DIR   8,17  143360 4386718705 
> /opt/stage2/pod-cassandra-aci-cassandra/rootfs/data/keyspaces/email_logs_query/emails-2d4abd00e9ea11e591199d740e07bd95
> java19076 root 2146r  DIR   8,17  143360 4386718705 
> /opt/stage2/pod-cassandra-aci-cassandra/rootfs/data/keyspaces/email_logs_query/emails-2d4abd00e9ea11e591199d740e07bd95
> java19076 root 2147r  DIR   8,17  143360 4386718705 
> /opt/stage2/pod-cassandra-aci-cassandra/rootfs/data/keyspaces/email_logs_query/emails-2d4abd00e9ea11e591199d740e07bd95
> java19076 root 2148r  DIR   8,17  143360 4386718705 
> /opt/stage2/pod-cassandra-aci-cassandra/rootfs/data/keyspaces/email_logs_query/emails-2d4abd00e9ea11e591199d740e07bd95
> java19076 root 2149r  DIR   8,17  143360 4386718705 
> /opt/stage2/pod-cassandra-aci-cassandra/rootfs/data/keyspaces/email_logs_query/emails-2d4abd00e9ea11e591199d740e07bd95
> java19076 root 2150r  DIR   8,17  143360 4386718705 
> /opt/stage2/pod-cassandra-aci-cassandra/rootfs/data/keyspaces/email_logs_query/emails-2d4abd00e9ea11e591199d740e07bd95
> java19076 root 2151r  DIR   8,17  143360 4386718705 
> /opt/stage2/pod-cassandra-aci-cassandra/rootfs/data/keyspaces/email_logs_query/emails-2d4abd00e9ea11e591199d740e07bd95
> java19076 root 2152r  DIR   8,17  143360 4386718705 
> /opt/stage2/pod-cassandra-aci-cassandra/rootfs/data/keyspaces/email_logs_query/emails-2d4abd00e9ea11e591199d740e07bd95
> java19076 root 2153r  DIR   8,17  143360 4386718705 
> /opt/stage2/pod-cassandra-aci-cassandra/rootfs/data/keyspaces/email_logs_query/emails-2d4abd00e9ea11e591199d740e07bd95
> java19076 root 2154r  DIR   8,17  143360 4386718705 
> /opt/stage2/pod-cassandra-aci-cassandra/rootfs/data/keyspaces/email_logs_query/emails-2d4abd00e9ea11e591199d740e07bd95
> java19076 root 2155r  DIR   8,17  143360 4386718705 
> /opt/stage2/pod-cassandra-aci-cassandra/rootfs/data/keyspaces/email_logs_query/emails-2d4abd00e9ea11e591199d740e07bd95
> {code}
> The 3 other nodes crashed 4 hours later



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (CASSANDRA-11586) Avoid Silent Insert or Update Failure In Clusters With Time Skew

2016-04-16 Thread DOAN DuyHai (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15244325#comment-15244325
 ] 

DOAN DuyHai edited comment on CASSANDRA-11586 at 4/16/16 5:53 PM:
--

bq. The client receives no error when the last statement is executed even 
though the request was dropped.

 To be able to detect that the existing cell (in memtable *AND* on disk) has a 
timestamp greater than the timestamp of the current mutation, Cassandra will 
necessarily need to read data before validating the mutation.

 That's what we call *read-before-write* and this is an anti-pattern because it 
just kills the write throughput.

 I think this JIRA is not an issue because synchronizing system time is out of 
Cassandra's scope. This is the system administrator's responsibility.



was (Author: doanduyhai):
bq. The client receives no error when the last statement is executed even 
though the request was dropped.

 To be able to detect that the existing cell (in memtable **AND** on disk) has 
a timestamp greater than the timestamp of the current mutation, Cassandra will 
necessarily need to read data before validating the mutation.

 That's what we call **read-before-write** and this is an anti-pattern because 
it just kills the write throughput.

 I think this JIRA is not an issue because synchronizing system time is out of 
Cassandra scope. This is the system administration responsibility


> Avoid Silent Insert or Update Failure In Clusters With Time Skew
> 
>
> Key: CASSANDRA-11586
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11586
> Project: Cassandra
>  Issue Type: Bug
>  Components: Core, CQL
>Reporter: Mukil Kesavan
>
> It isn't uncommon to have a cluster of Cassandra servers with clock skew 
> ranging from a few milliseconds to seconds or even minutes even with NTP 
> configured on them. We use the coordinator's timestamp for all insert/update 
> requests. Currently, an update to an already existing row with an older 
> timestamp (because the request coordinator's clock is lagging behind) results 
> in a successful response to the client even though the update was dropped. 
> Here's a sample sequence of requests:
> * Consider 3 Cassandra servers with times, T+10, T+5 and T respectively
> * INSERT INTO TABLE1 (id, data) VALUES (1, "one"); is coordinated by server 1 
> with timestamp (T+10)
> * UPDATE TABLE1 SET data='One' where id=1; is coordinated by server 3 with 
> timestamp T
> The client receives no error when the last statement is executed even though 
> the request was dropped.
> It would be really helpful if we could return an error or a response to the 
> client indicating that the request was dropped. This gives the client an 
> option to handle this situation gracefully. If this is useful, I can work on 
> a patch.
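
As an aside, one client-side mitigation for the scenario above is to pin write ordering with explicit timestamps instead of relying on each coordinator's clock; a minimal sketch, reusing TABLE1 from the description and with made-up microsecond timestamps:

{code:sql}
-- Both statements carry a client-chosen timestamp, so the UPDATE wins over the
-- INSERT regardless of which coordinator handles it:
INSERT INTO TABLE1 (id, data) VALUES (1, 'one') USING TIMESTAMP 1460800000000000;
UPDATE TABLE1 USING TIMESTAMP 1460800000000001 SET data = 'One' WHERE id = 1;
{code}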



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-11586) Avoid Silent Insert or Update Failure In Clusters With Time Skew

2016-04-16 Thread DOAN DuyHai (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15244325#comment-15244325
 ] 

DOAN DuyHai commented on CASSANDRA-11586:
-

bq. The client receives no error when the last statement is executed even 
though the request was dropped.

 To be able to detect that the existing cell (in memtable **AND** on disk) has 
a timestamp greater than the timestamp of the current mutation, Cassandra will 
necessarily need to read data before validating the mutation.

 That's what we call **read-before-write** and this is an anti-pattern because 
it just kills the write throughput.

 I think this JIRA is not an issue because synchronizing system time is out of 
Cassandra's scope. This is the system administrator's responsibility.


> Avoid Silent Insert or Update Failure In Clusters With Time Skew
> 
>
> Key: CASSANDRA-11586
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11586
> Project: Cassandra
>  Issue Type: Bug
>  Components: Core, CQL
>Reporter: Mukil Kesavan
>
> It isn't uncommon to have a cluster of Cassandra servers with clock skew 
> ranging from a few milliseconds to seconds or even minutes even with NTP 
> configured on them. We use the coordinator's timestamp for all insert/update 
> requests. Currently, an update to an already existing row with an older 
> timestamp (because the request coordinator's clock is lagging behind) results 
> in a successful response to the client even though the update was dropped. 
> Here's a sample sequence of requests:
> * Consider 3 Cassandra servers with times, T+10, T+5 and T respectively
> * INSERT INTO TABLE1 (id, data) VALUES (1, "one"); is coordinated by server 1 
> with timestamp (T+10)
> * UPDATE TABLE1 SET data='One' where id=1; is coordinated by server 3 with 
> timestamp T
> The client receives no error when the last statement is executed even though 
> the request was dropped.
> It would be really helpful if we could return an error or a response to the 
> client indicating that the request was dropped. This gives the client an 
> option to handle this situation gracefully. If this is useful, I can work on 
> a patch.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

