[jira] [Updated] (CASSANDRA-12078) [SASI] Move skip_stop_words filter BEFORE stemming
[ https://issues.apache.org/jira/browse/CASSANDRA-12078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pavel Yaskevich updated CASSANDRA-12078:

    Resolution: Fixed
        Status: Resolved  (was: Patch Available)

Committed, thanks [~doanduyhai]! I've removed the already committed part from the patch and included only the change for {{StemmingFilters}} and the tests.

> [SASI] Move skip_stop_words filter BEFORE stemming
> --------------------------------------------------
>
>                 Key: CASSANDRA-12078
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-12078
>             Project: Cassandra
>          Issue Type: Bug
>          Components: sasi
>         Environment: Cassandra 3.7, Cassandra 3.8
>            Reporter: DOAN DuyHai
>            Assignee: DOAN DuyHai
>             Fix For: 3.8
>
>         Attachments: patch.txt, patch_V2.txt
>
>
> Right now, if skip stop words and stemming are enabled, SASI will put
> stemming in the filter pipeline BEFORE skip_stop_words:
> {code:java}
> private FilterPipelineTask getFilterPipeline()
> {
>     FilterPipelineBuilder builder = new FilterPipelineBuilder(new BasicResultFilters.NoOperation());
>     ...
>     if (options.shouldStemTerms())
>         builder = builder.add("term_stemming", new StemmingFilters.DefaultStemmingFilter(options.getLocale()));
>     if (options.shouldIgnoreStopTerms())
>         builder = builder.add("skip_stop_words", new StopWordFilters.DefaultStopWordFilter(options.getLocale()));
>     return builder.build();
> }
> {code}
> The problem is that stemming before removing stop words can yield wrong
> results.
> I have an example:
> {code:sql}
> SELECT * FROM music.albums WHERE country='France' AND title LIKE 'danse' ALLOW FILTERING;
> {code}
> Because of stemming *danse* ( *dance* in English) becomes *dans* (the final
> vowel is removed). Then skip stop words is applied. Unfortunately *dans*
> (*in* in English) is a stop word in French so it is removed completely.
> In the end the query is equivalent to {{SELECT * FROM music.albums WHERE
> country='France'}} and of course the results are wrong.
> Attached is a trivial patch to move the skip_stop_words filter BEFORE
> stemming filter
> /cc [~xedin] [~jrwest] [~beobal]

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
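The ordering problem described in the issue can be reproduced with a small, self-contained sketch. This is not the SASI code: the class name {{FilterOrderDemo}}, the one-entry stop-word list, and the "drop a trailing e" stemmer are all hypothetical stand-ins chosen only to mirror the *danse* / *dans* example from the report.

```java
import java.util.Collections;
import java.util.Set;

// Toy model of the filter-ordering problem; NOT the SASI implementation.
final class FilterOrderDemo
{
    // Hypothetical one-entry French stop-word list, for illustration only.
    static final Set<String> STOP_WORDS = Collections.singleton("dans");

    // Hypothetical stand-in for a stemmer: drops a single trailing 'e'.
    static String stem(String term)
    {
        return term.endsWith("e") ? term.substring(0, term.length() - 1) : term;
    }

    // Returns null when the term is a stop word, i.e. the term is dropped.
    static String skipStopWords(String term)
    {
        return STOP_WORDS.contains(term) ? null : term;
    }

    // Order described in the report: stemming first, then stop-word removal.
    static String stemThenSkip(String term)
    {
        return skipStopWords(stem(term));
    }

    // Order proposed by the patch: stop-word removal first, then stemming.
    static String skipThenStem(String term)
    {
        String kept = skipStopWords(term);
        return kept == null ? null : stem(kept);
    }

    public static void main(String[] args)
    {
        // "danse" is stemmed to "dans", which is then dropped as a stop word.
        System.out.println("stem then skip: " + stemThenSkip("danse")); // prints "stem then skip: null"
        // "danse" is not a stop word, so it survives and is stemmed afterwards.
        System.out.println("skip then stem: " + skipThenStem("danse")); // prints "skip then stem: dans"
    }
}
```

With stemming first, the search term disappears and the {{LIKE}} restriction has nothing left to match on. With stop-word removal first, the stemmed term *dans* is kept.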
[jira] [Updated] (CASSANDRA-12078) [SASI] Move skip_stop_words filter BEFORE stemming
DOAN DuyHai updated CASSANDRA-12078:

    Attachment: patch_V2.txt

Attached is {{patch_V2.txt}}
[jira] [Updated] (CASSANDRA-12078) [SASI] Move skip_stop_words filter BEFORE stemming
DOAN DuyHai updated CASSANDRA-12078:

    Status: Patch Available  (was: Reopened)
[jira] [Updated] (CASSANDRA-12078) [SASI] Move skip_stop_words filter BEFORE stemming
Pavel Yaskevich updated CASSANDRA-12078:

       Resolution: Fixed
    Fix Version/s:     (was: 3.7)
                   3.8
           Status: Resolved  (was: Patch Available)

Committed.
[jira] [Updated] (CASSANDRA-12078) [SASI] Move skip_stop_words filter BEFORE stemming
Pavel Yaskevich updated CASSANDRA-12078:

    Issue Type: Bug  (was: Improvement)
[jira] [Updated] (CASSANDRA-12078) [SASI] Move skip_stop_words filter BEFORE stemming
Pavel Yaskevich updated CASSANDRA-12078:

    Component/s:     (was: CQL)
                 sasi
[jira] [Updated] (CASSANDRA-12078) [SASI] Move skip_stop_words filter BEFORE stemming
Pavel Yaskevich updated CASSANDRA-12078:

         Reviewer: Pavel Yaskevich
    Fix Version/s: 3.7
[jira] [Updated] (CASSANDRA-12078) [SASI] Move skip_stop_words filter BEFORE stemming
DOAN DuyHai updated CASSANDRA-12078:

    Status: Patch Available  (was: Open)
[jira] [Updated] (CASSANDRA-12078) [SASI] Move skip_stop_words filter BEFORE stemming
DOAN DuyHai updated CASSANDRA-12078:

    Description: 
Right now, if skip stop words and stemming are enabled, SASI will put stemming in the filter pipeline BEFORE skip_stop_words:

{code:java}
private FilterPipelineTask getFilterPipeline()
{
    FilterPipelineBuilder builder = new FilterPipelineBuilder(new BasicResultFilters.NoOperation());
    ...
    if (options.shouldStemTerms())
        builder = builder.add("term_stemming", new StemmingFilters.DefaultStemmingFilter(options.getLocale()));
    if (options.shouldIgnoreStopTerms())
        builder = builder.add("skip_stop_words", new StopWordFilters.DefaultStopWordFilter(options.getLocale()));
    return builder.build();
}
{code}

The problem is that stemming before removing stop words can yield wrong results.

I have an example:

{code:sql}
SELECT * FROM music.albums WHERE country='France' AND title LIKE 'danse' ALLOW FILTERING;
{code}

Because of stemming *danse* ( *dance* in English) becomes *dans* (the final vowel is removed). Then skip stop words is applied. Unfortunately *dans* (*in* in English) is a stop word in French so it is removed completely.

In the end the query is equivalent to {{SELECT * FROM music.albums WHERE country='France'}} and of course the results are wrong.

Attached is a trivial patch to move the skip_stop_words filter BEFORE stemming filter

/cc [~xedin] [~jrwest] [~beobal]

  was:
Right now, if skip stop words and stemming are enabled, SASI will put stemming in the filter pipeline BEFORE skip_stop_words:

{code:java}
private FilterPipelineTask getFilterPipeline()
{
    FilterPipelineBuilder builder = new FilterPipelineBuilder(new BasicResultFilters.NoOperation());
    ...
    if (options.shouldStemTerms())
        builder = builder.add("term_stemming", new StemmingFilters.DefaultStemmingFilter(options.getLocale()));
    if (options.shouldIgnoreStopTerms())
        builder = builder.add("skip_stop_words", new StopWordFilters.DefaultStopWordFilter(options.getLocale()));
    return builder.build();
}
{code}

The problem is that stemming before removing stop words can yield wrong results.

I have an example:

{code:sql}
SELECT * FROM music.albums WHERE country='France' AND title LIKE 'danse' ALLOW FILTERING;
{code}

*danse* = *dance* in English, and because of stemming, it becomes *dans* (the final vowel is removed). Then skip stop words is applied. Unfortunately *dans* = *in* in English, a stop word in French so it is removed completely.

In the end the query is equivalent to {{SELECT * FROM music.albums WHERE country='France'}} and of course the results are wrong.

Attached is a trivial patch to move the skip_stop_words filter BEFORE stemming filter