[jira] [Commented] (LUCENE-8344) TokenStreamToAutomaton doesn't ignore trailing posInc when preservePositionIncrements=false
[ https://issues.apache.org/jira/browse/LUCENE-8344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16511879#comment-16511879 ]

Jim Ferenczi commented on LUCENE-8344:
--------------------------------------

+1 to the patch and +1 to backport to 7.4 especially if ConcatenateGraphFilter is released in this version.

> TokenStreamToAutomaton doesn't ignore trailing posInc when preservePositionIncrements=false
> -------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-8344
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8344
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/suggest
>            Reporter: David Smiley
>            Priority: Major
>         Attachments: LUCENE-8344.patch, LUCENE-8344.patch, LUCENE-8344.patch
>
>
> TokenStreamToAutomaton in Lucene core is used by the AnalyzingSuggester
> (incl. FuzzySuggester subclass) and NRT Document Suggester and soon the
> SolrTextTagger. It has a setting {{preservePositionIncrements}} defaulting
> to true. If it's set to false (e.g. to ignore stopwords) and if there is a
> _trailing_ position increment greater than 1, TS2A will _still_ add position
> increments (holes) into the automata even though it was configured not to.
> I'm filing this issue separate from LUCENE-8332 where I first found it. The
> fix is very simple but I'm concerned about back-compat ramifications so I'm
> filing it separately. I'll attach a patch to show the problem.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
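
The behaviour described in the issue can be pictured with a toy model. The sketch below is not Lucene's TokenStreamToAutomaton: the class name, the printable "_" and "*" stand-ins for the separator and hole labels, and the "fixed" switch are all assumptions made for illustration. It only shows how a trailing position increment (assumed here to be what a removed trailing stopword leaves behind) can still end up as a trailing hole when increments are supposedly not preserved.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model only: this is not Lucene's TokenStreamToAutomaton. */
final class TrailingHoleModel {
  // Printable stand-ins chosen for this sketch.
  static final String POS_SEP = "_"; // separates adjacent positions
  static final String HOLE = "*";    // a skipped position, e.g. a removed stopword

  /**
   * Flattens tokens into one path. posIncs[i] is the position increment of
   * terms[i] (assumed to be >= 1 here); trailingPosInc is the increment
   * reported after the last token. "fixed" selects the behaviour the issue
   * asks for.
   */
  static String toPath(String[] terms, int[] posIncs, int trailingPosInc,
                       boolean preservePosInc, boolean fixed) {
    List<String> positions = new ArrayList<>();
    for (int i = 0; i < terms.length; i++) {
      // Holes in front of a token are dropped when increments are not preserved.
      int holes = preservePosInc ? posIncs[i] - 1 : 0;
      for (int h = 0; h < holes; h++) {
        positions.add(HOLE);
      }
      positions.add(terms[i]);
    }
    // The reported bug: trailing holes were still added when preservePosInc == false.
    int trailingHoles = (!preservePosInc && fixed) ? 0 : trailingPosInc;
    for (int h = 0; h < trailingHoles; h++) {
      positions.add(HOLE);
    }
    return String.join(POS_SEP, positions);
  }

  public static void main(String[] args) {
    String[] terms = {"baz"};
    int[] posIncs = {1};
    // "baz the" with "the" removed is modelled as a trailing increment of 1.
    System.out.println(toPath(terms, posIncs, 1, false, false)); // buggy model
    System.out.println(toPath(terms, posIncs, 1, false, true));  // fixed model
  }
}
```

In this model the buggy variant keeps a trailing hole after "baz" while the fixed variant emits "baz" alone, which is the mismatch the rest of the thread discusses.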
[jira] [Commented] (LUCENE-8344) TokenStreamToAutomaton doesn't ignore trailing posInc when preservePositionIncrements=false
[ https://issues.apache.org/jira/browse/LUCENE-8344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16511547#comment-16511547 ]

David Smiley commented on LUCENE-8344:
--------------------------------------

What about [~areek] who AFAIK wrote this stuff originally?
[jira] [Commented] (LUCENE-8344) TokenStreamToAutomaton doesn't ignore trailing posInc when preservePositionIncrements=false
[ https://issues.apache.org/jira/browse/LUCENE-8344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16511445#comment-16511445 ]

Adrien Grand commented on LUCENE-8344:
--------------------------------------

FYI Jim is on vacation for a couple weeks.
[jira] [Commented] (LUCENE-8344) TokenStreamToAutomaton doesn't ignore trailing posInc when preservePositionIncrements=false
[ https://issues.apache.org/jira/browse/LUCENE-8344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16511415#comment-16511415 ]

David Smiley commented on LUCENE-8344:
--------------------------------------

Any thoughts on this one [~jim.ferenczi]?
[jira] [Commented] (LUCENE-8344) TokenStreamToAutomaton doesn't ignore trailing posInc when preservePositionIncrements=false
[ https://issues.apache.org/jira/browse/LUCENE-8344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16506649#comment-16506649 ]

David Smiley commented on LUCENE-8344:
--------------------------------------

The patch may be hard to review as a diff. There are 3 tests now in TestPrefixCompletionQuery that are the same in data and queries but differ in expected results based on different CompletionAnalyzer settings. I think it may be hard to maintain this as such... it ought to be one test so we don't have so much duplication, and it may then become easier to understand how a change in settings adjusts the expectations. But hopefully you all think it's fine as is.

After some reflection, I figured that if preserveSep=false, then preservePositionIncrements is irrelevant, and so that's why we have one fewer test method than 2x2 would suggest. This ought to throw an exception to the user. Perhaps 3 factory methods would be better than the one constructor with two booleans? There's likely an analogous situation with AnalyzingSuggester's long constructor. Anyway this proposal doesn't belong in this issue.

Suggested CHANGES.txt notes:

* LUCENE-8344: TokenStreamToAutomaton (used by some suggesters) was not ignoring a trailing position increment when the preservePositionIncrements setting is false. (David Smiley, Jim Ferenczi)

Upgrading _(a new section)_

* LUCENE-8344: If you are using the AnalyzingSuggester or FuzzySuggester subclass, and if you explicitly use the preservePositionIncrements=false setting (not the default), then you ought to rebuild your suggester index. If you don't, queries or indexed data with trailing position gaps (e.g. stop words) may not work correctly.
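
The "3 factory methods" idea from the comment above could look roughly like the following. This is a sketch of the proposal only, not the actual CompletionAnalyzer API: the class name CompletionSettings and all three method names are hypothetical. The point it illustrates is that the fourth combination (no separators, but preserved position increments) simply cannot be constructed, so no exception is needed for it.

```java
/** Hypothetical settings holder; not the real CompletionAnalyzer API. */
final class CompletionSettings {
  private final boolean preserveSep;
  private final boolean preservePositionIncrements;

  // Private: callers must go through one of the three factory methods.
  private CompletionSettings(boolean preserveSep, boolean preservePositionIncrements) {
    this.preserveSep = preserveSep;
    this.preservePositionIncrements = preservePositionIncrements;
  }

  /** No separators between tokens; position increments are then irrelevant. */
  static CompletionSettings withoutSeparators() {
    return new CompletionSettings(false, false);
  }

  /** Separators are kept, holes from position increments are dropped. */
  static CompletionSettings withSeparators() {
    return new CompletionSettings(true, false);
  }

  /** Separators and position increments are both kept. */
  static CompletionSettings withSeparatorsAndPositionIncrements() {
    return new CompletionSettings(true, true);
  }

  boolean preserveSep() {
    return preserveSep;
  }

  boolean preservePositionIncrements() {
    return preservePositionIncrements;
  }

  public static void main(String[] args) {
    CompletionSettings s = CompletionSettings.withSeparators();
    System.out.println(s.preserveSep() + " " + s.preservePositionIncrements());
  }
}
```

The three factories line up with the three test methods mentioned in the comment, one per meaningful combination of the two booleans.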
[jira] [Commented] (LUCENE-8344) TokenStreamToAutomaton doesn't ignore trailing posInc when preservePositionIncrements=false
[ https://issues.apache.org/jira/browse/LUCENE-8344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16503745#comment-16503745 ]

ASF subversion and git services commented on LUCENE-8344:
---------------------------------------------------------

Commit 33b1c1d1416ed3b8dbce4066ad4b982a15e1b0d0 in lucene-solr's branch refs/heads/branch_7x from [~dsmiley]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=33b1c1d ]

SOLR-12376: AwaitsFix testStopWords pending LUCENE-8344

(cherry picked from commit 7c6d743)
[jira] [Commented] (LUCENE-8344) TokenStreamToAutomaton doesn't ignore trailing posInc when preservePositionIncrements=false
[ https://issues.apache.org/jira/browse/LUCENE-8344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16503741#comment-16503741 ]

ASF subversion and git services commented on LUCENE-8344:
---------------------------------------------------------

Commit 7c6d74376a784224963b57cb8380a07279fd7608 in lucene-solr's branch refs/heads/master from [~dsmiley]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=7c6d743 ]

SOLR-12376: AwaitsFix testStopWords pending LUCENE-8344
[jira] [Commented] (LUCENE-8344) TokenStreamToAutomaton doesn't ignore trailing posInc when preservePositionIncrements=false
[ https://issues.apache.org/jira/browse/LUCENE-8344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16501717#comment-16501717 ]

Michael McCandless commented on LUCENE-8344:
--------------------------------------------

{quote}we could simply consider the fix a breaking change and discuss if it's acceptable to backport to 7x ?
{quote}
+1
[jira] [Commented] (LUCENE-8344) TokenStreamToAutomaton doesn't ignore trailing posInc when preservePositionIncrements=false
[ https://issues.apache.org/jira/browse/LUCENE-8344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16501217#comment-16501217 ]

David Smiley commented on LUCENE-8344:
--------------------------------------

bq. Considering that re-build should be trivial in the AnalyzingSuggester, we could simply consider the fix a breaking change and discuss if it's acceptable to backport to 7x ?

That gets my vote!
[jira] [Commented] (LUCENE-8344) TokenStreamToAutomaton doesn't ignore trailing posInc when preservePositionIncrements=false
[ https://issues.apache.org/jira/browse/LUCENE-8344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16500922#comment-16500922 ]

Jim Ferenczi commented on LUCENE-8344:
--------------------------------------

The exact match pass filters prefix paths that don't end with END_BYTE, so we'd have to change it to ignore trailing POS_SEPs (lines 709 and 727). However, we have no way to infer the value of preservePositionIncrements for an indexed suggestion, so I am not even sure that we can handle the BWC safely. Considering that a rebuild should be trivial in the AnalyzingSuggester, we could simply consider the fix a breaking change and discuss whether it's acceptable to backport to 7x?
[jira] [Commented] (LUCENE-8344) TokenStreamToAutomaton doesn't ignore trailing posInc when preservePositionIncrements=false
[ https://issues.apache.org/jira/browse/LUCENE-8344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16500848#comment-16500848 ]

David Smiley commented on LUCENE-8344:
--------------------------------------

RE NRT Doc Suggester: "This is expected". Okay, I see what you mean. I guess if any user (past/present/future) wants to use preservePositionIncrements=false effectively then they need to be using CompletionAnalyzer/CompletionTokenStream both at index _and_ query time. The existing tests are not doing that; they use the input analyzer at query time. The two queries the test did use, "fo" and "foob", didn't demonstrate something important this test should be testing for: position increments (stopwords) in the _query_. Ditto for some similar test methods here (positive and negative assertions). I'll try and improve this some.

RE AnalyzingSuggester: Hmmm. What if the first phase of the "exactFirst" logic captured the "output2" lookup results in a place that could be examined by the second pass? I think this would be more robust, and wouldn't even need to invoke sameSurfaceForm in the second phase. If the FST was built with the bug (7.3 or prior) then an exact match of a trailing stopword with this setting wouldn't be recognized as an exact match, but I think that's a minor loss easily fixed with reindexing?
[jira] [Commented] (LUCENE-8344) TokenStreamToAutomaton doesn't ignore trailing posInc when preservePositionIncrements=false
[ https://issues.apache.org/jira/browse/LUCENE-8344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16500766#comment-16500766 ]

Jim Ferenczi commented on LUCENE-8344:
--------------------------------------

{quote}
org.apache.lucene.search.suggest.document.TestPrefixCompletionQuery#testAnalyzerWithSepAndNoPreservePos
see "test trailing stopword with a new document"
{quote}

If you index with preservePositionIncrements=false you cannot match a query that preserves the position increments and contains a stop word. This is expected. "baz the" indexed with preservePositionIncrements=false cannot match the query "baz the" if you preserve the position increments. However it should work if you query "baz", with or without preserving the position increments. This is why I said that the completion field (and all the related queries) should be fine with this change. It works without reindexing.

{quote}
org.apache.lucene.search.suggest.analyzing.AnalyzingSuggesterTest#testStandard
see the "round trip" test
With BUG==true: fails (bad for back-compat)
With BUG==false: passes (therefore a reindex fixes)
{quote}

This one is trickier because it tries to find an exact match first, so the indexed version and the query version should be the same; otherwise the assertion at line 789 of the AnalyzingSuggester fails. We can probably fix the discrepancy by adding a BWC layer that removes the trailing POS_SEP of the indexed version when sameSurfaceForm is called and preservePosInc is false? WDYT? This would remove the need to rebuild the FST on a version that contains the fix.
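
The BWC layer suggested above amounts to ignoring a trailing separator on the indexed side before comparing. The following is a minimal sketch of that idea under stated assumptions: it is not code from AnalyzingSuggester, the class and method names are hypothetical, and the separator value 0x1f is used for illustration only. It also compares plain byte arrays rather than whatever the real sameSurfaceForm receives.

```java
import java.util.Arrays;

/** Illustrative only; not taken from AnalyzingSuggester. */
final class TrailingSepBwc {
  // Separator value assumed for this sketch.
  static final byte POS_SEP = 0x1f;

  /** Returns a copy of the input with any trailing POS_SEP bytes removed. */
  static byte[] stripTrailingSeps(byte[] analyzed) {
    int end = analyzed.length;
    while (end > 0 && analyzed[end - 1] == POS_SEP) {
      end--;
    }
    return Arrays.copyOf(analyzed, end);
  }

  /**
   * Compares an indexed form with a query form. When position increments are
   * not preserved, a trailing separator left in the indexed form by an older
   * build is ignored; otherwise the forms must be byte-for-byte equal.
   */
  static boolean sameForm(byte[] indexed, byte[] query, boolean preservePosInc) {
    if (preservePosInc) {
      return Arrays.equals(indexed, query);
    }
    return Arrays.equals(stripTrailingSeps(indexed), query);
  }

  public static void main(String[] args) {
    byte[] indexedWithBug = {'b', 'a', 'z', POS_SEP};
    byte[] query = {'b', 'a', 'z'};
    System.out.println(sameForm(indexedWithBug, query, false));
    System.out.println(sameForm(indexedWithBug, query, true));
  }
}
```

As the follow-up in the thread points out, the setting used when a suggestion was indexed cannot be inferred afterwards, so a layer like this can only rely on the current preservePosInc value.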
[jira] [Commented] (LUCENE-8344) TokenStreamToAutomaton doesn't ignore trailing posInc when preservePositionIncrements=false
[ https://issues.apache.org/jira/browse/LUCENE-8344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16500343#comment-16500343 ]

David Smiley commented on LUCENE-8344:
--------------------------------------

To demonstrate the issue, the patch adds a TokenStreamToAutomaton.BUG boolean flag so a test can see what happens when the suggest index was built with trailing holes but the query side no longer produces them.

org.apache.lucene.search.suggest.analyzing.AnalyzingSuggesterTest#testStandard
see the "round trip" test
With BUG==true: fails (bad for back-compat)
With BUG==false: passes (therefore a reindex fixes)

org.apache.lucene.search.suggest.document.TestPrefixCompletionQuery#testAnalyzerWithSepAndNoPreservePos
see "test trailing stopword with a new document"
With BUG==true: passes (good for back-compat)
With BUG==false: fails(*)

(*): however if you flip the analyzer passed to the PrefixCompletionQuery constructor to the "completionAnalyzer" (instead of the plain/original "analyzer"), then it passes. So apparently this may require users to change how it's used? (ouch)

CC [~areek]
[jira] [Commented] (LUCENE-8344) TokenStreamToAutomaton doesn't ignore trailing posInc when preservePositionIncrements=false
[ https://issues.apache.org/jira/browse/LUCENE-8344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16498455#comment-16498455 ]

Jim Ferenczi commented on LUCENE-8344:
--------------------------------------

I don't think a reindex is needed. With the fix we'll have an automaton that doesn't contain the hole, which should still match the indexed version without the fix since it's a prefix of it. Did I miss something?
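
The prefix argument in the comment above can be checked on plain byte sequences. The sketch below assumes the indexed form built without the fix is simply the query form plus a trailing separator; the class name, the helper, and the 0x1f separator value are illustrative and not Lucene code. It shows that a prefix lookup still succeeds, while a strict equality check on the same two sequences does not.

```java
import java.util.Arrays;

/** Illustrative only; models suggester keys as raw byte sequences. */
final class PrefixMatchSketch {
  // Separator value assumed for this sketch.
  static final byte POS_SEP = 0x1f;

  /** True if "prefix" is a leading subsequence of "indexed". */
  static boolean startsWith(byte[] indexed, byte[] prefix) {
    if (prefix.length > indexed.length) {
      return false;
    }
    for (int i = 0; i < prefix.length; i++) {
      if (indexed[i] != prefix[i]) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    byte[] indexedWithoutFix = {'b', 'a', 'z', POS_SEP}; // trailing hole kept as a separator
    byte[] queryWithFix = {'b', 'a', 'z'};               // no trailing hole
    System.out.println(startsWith(indexedWithoutFix, queryWithFix));    // prefix lookup
    System.out.println(Arrays.equals(indexedWithoutFix, queryWithFix)); // exact comparison
  }
}
```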
[jira] [Commented] (LUCENE-8344) TokenStreamToAutomaton doesn't ignore trailing posInc when preservePositionIncrements=false
[ https://issues.apache.org/jira/browse/LUCENE-8344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16498266#comment-16498266 ]

David Smiley commented on LUCENE-8344:
--------------------------------------

The patch fixes the issue and has a couple of tests. It's a bit WIP; I just want to get this up here for you all to see. I don't intend to work on it any more today.

I think we need to identify back-compat issues with this. Is it okay to tell people on a minor release that they need to rebuild their suggester index to avoid this edge case bug? For the NRT Doc Suggester, that may be a lot to ask as it's not a side-car index.