[jira] [Commented] (LUCENE-8344) TokenStreamToAutomaton doesn't ignore trailing posInc when preservePositionIncrements=false

2018-06-13 Thread Jim Ferenczi (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16511879#comment-16511879
 ] 

Jim Ferenczi commented on LUCENE-8344:
--

+1 to the patch and +1 to backport to 7.4 especially if ConcatenateGraphFilter 
is released in this version.

> TokenStreamToAutomaton doesn't ignore trailing posInc when 
> preservePositionIncrements=false
> ---
>
> Key: LUCENE-8344
> URL: https://issues.apache.org/jira/browse/LUCENE-8344
> Project: Lucene - Core
> Issue Type: Bug
> Components: modules/suggest
> Reporter: David Smiley
> Priority: Major
> Attachments: LUCENE-8344.patch, LUCENE-8344.patch, LUCENE-8344.patch
>
>
> TokenStreamToAutomaton (TS2A) in Lucene core is used by the AnalyzingSuggester
> (incl. its FuzzySuggester subclass), the NRT Document Suggester, and soon the
> SolrTextTagger.  It has a setting {{preservePositionIncrements}} defaulting
> to true.  If it's set to false (e.g. to ignore stopwords) and if there is a
> _trailing_ position increment greater than 1, TS2A will _still_ add position
> increments (holes) into the automaton even though it was configured not to.
> I'm filing this issue separate from LUCENE-8332 where I first found it.  The 
> fix is very simple but I'm concerned about back-compat ramifications so I'm 
> filing it separately.  I'll attach a patch to show the problem.
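
The behavior described above can be illustrated with a toy model. This is a minimal sketch in Python, not Lucene's implementation: the function `to_path`, the `SEP` and `HOLE` symbols, and the `fixed` flag are all hypothetical names chosen for illustration, and the automaton is simplified to a flat list of symbols.

```python
SEP = "_"   # hypothetical stand-in for the separator between positions
HOLE = "*"  # hypothetical stand-in for a position with no token (a hole)


def to_path(tokens, trailing_holes, preserve_pos_inc, fixed):
    """Flatten an analyzed token stream into a list of symbols.

    tokens: list of (term, position_increment) pairs.
    trailing_holes: positions left empty after the last token,
        e.g. 1 when a single trailing stopword was removed.
    fixed: False models the reported behavior, True the proposed one.
    """
    path = []
    for term, pos_inc in tokens:
        if not preserve_pos_inc:
            pos_inc = min(pos_inc, 1)  # collapse gaps between tokens
        if path:
            path.append(SEP)
        for _ in range(pos_inc - 1):
            path += [HOLE, SEP]
        path.append(term)
    if fixed and not preserve_pos_inc:
        trailing_holes = 0  # proposed behavior: drop trailing gaps as well
    for _ in range(trailing_holes):
        path += [SEP, HOLE]
    return path


# "baz the" where the analyzer removed the stopword "the"
reported = to_path([("baz", 1)], 1, preserve_pos_inc=False, fixed=False)
proposed = to_path([("baz", 1)], 1, preserve_pos_inc=False, fixed=True)
```

In this model, gaps between tokens are already collapsed when `preserve_pos_inc` is false; only the trailing gap survives in the reported behavior, which is the discrepancy this issue is about.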



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8344) TokenStreamToAutomaton doesn't ignore trailing posInc when preservePositionIncrements=false

2018-06-13 Thread David Smiley (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16511547#comment-16511547
 ] 

David Smiley commented on LUCENE-8344:
--

What about [~areek] who AFAIK wrote this stuff originally?




[jira] [Commented] (LUCENE-8344) TokenStreamToAutomaton doesn't ignore trailing posInc when preservePositionIncrements=false

2018-06-13 Thread Adrien Grand (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16511445#comment-16511445
 ] 

Adrien Grand commented on LUCENE-8344:
--

FYI Jim is on vacation for a couple weeks.




[jira] [Commented] (LUCENE-8344) TokenStreamToAutomaton doesn't ignore trailing posInc when preservePositionIncrements=false

2018-06-13 Thread David Smiley (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16511415#comment-16511415
 ] 

David Smiley commented on LUCENE-8344:
--

Any thoughts on this one [~jim.ferenczi]?




[jira] [Commented] (LUCENE-8344) TokenStreamToAutomaton doesn't ignore trailing posInc when preservePositionIncrements=false

2018-06-08 Thread David Smiley (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16506649#comment-16506649
 ] 

David Smiley commented on LUCENE-8344:
--

The patch may be hard to review as a diff.  There are now 3 tests in
TestPrefixCompletionQuery that use the same data and queries but differ in
expected results based on different CompletionAnalyzer settings.  I think this
may be hard to maintain as such; it ought to be a single test so that there is
less duplication and it becomes easier to see how a change in settings adjusts
the expectations.  But hopefully you all think it's fine as is.

After some reflection, I figured that if preserveSep=false, then
preservePositionIncrements is irrelevant, and so that's why we have one fewer
test method than 2x2 would suggest.  This ought to throw an exception to the
user.  Perhaps 3 factory methods would be better than the one constructor with
two booleans?  There's likely an analogous situation with AnalyzingSuggester's
long constructor.  Anyway, this proposal doesn't belong in this issue.
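
To make the "3 factory methods" idea concrete, here is a rough sketch in Python rather than Java. The class `CompletionSettings` and its method names are hypothetical and do not exist in Lucene; the sketch only shows how named factories expose the three meaningful states and hide the fourth, redundant boolean combination.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CompletionSettings:
    """Hypothetical settings holder; not a Lucene class."""
    preserve_sep: bool
    preserve_position_increments: bool

    @classmethod
    def with_sep_and_position_increments(cls):
        return cls(True, True)

    @classmethod
    def with_sep_only(cls):
        return cls(True, False)

    @classmethod
    def without_sep(cls):
        # Position increments are irrelevant without separators,
        # so callers never get to choose a value for them here.
        return cls(False, False)


all_settings = {
    CompletionSettings.with_sep_and_position_increments(),
    CompletionSettings.with_sep_only(),
    CompletionSettings.without_sep(),
}
```

With this shape, a test suite would naturally have one case per factory, which matches the three test methods mentioned above.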

Suggested CHANGES.txt notes:
* LUCENE-8344: TokenStreamToAutomaton (used by some suggesters) was not
ignoring a trailing position increment when the preservePositionIncrements
setting is false.  (David Smiley, Jim Ferenczi)

Upgrading _(a new section)_
* LUCENE-8344: If you are using the AnalyzingSuggester or its FuzzySuggester
subclass, and if you explicitly use the preservePositionIncrements=false
setting (not the default), then you ought to rebuild your suggester index.  If
you don't, queries or indexed data with trailing position gaps (e.g. stop
words) may not work correctly.




[jira] [Commented] (LUCENE-8344) TokenStreamToAutomaton doesn't ignore trailing posInc when preservePositionIncrements=false

2018-06-06 Thread ASF subversion and git services (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16503745#comment-16503745
 ] 

ASF subversion and git services commented on LUCENE-8344:
-

Commit 33b1c1d1416ed3b8dbce4066ad4b982a15e1b0d0 in lucene-solr's branch 
refs/heads/branch_7x from [~dsmiley]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=33b1c1d ]

SOLR-12376: AwaitsFix testStopWords pending LUCENE-8344

(cherry picked from commit 7c6d743)





[jira] [Commented] (LUCENE-8344) TokenStreamToAutomaton doesn't ignore trailing posInc when preservePositionIncrements=false

2018-06-06 Thread ASF subversion and git services (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16503741#comment-16503741
 ] 

ASF subversion and git services commented on LUCENE-8344:
-

Commit 7c6d74376a784224963b57cb8380a07279fd7608 in lucene-solr's branch 
refs/heads/master from [~dsmiley]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=7c6d743 ]

SOLR-12376: AwaitsFix testStopWords pending LUCENE-8344





[jira] [Commented] (LUCENE-8344) TokenStreamToAutomaton doesn't ignore trailing posInc when preservePositionIncrements=false

2018-06-05 Thread Michael McCandless (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16501717#comment-16501717
 ] 

Michael McCandless commented on LUCENE-8344:


{quote}we could simply consider the fix a breaking change and discuss if it's 
acceptable to backport to 7x ?
{quote}
+1




[jira] [Commented] (LUCENE-8344) TokenStreamToAutomaton doesn't ignore trailing posInc when preservePositionIncrements=false

2018-06-04 Thread David Smiley (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16501217#comment-16501217
 ] 

David Smiley commented on LUCENE-8344:
--

bq. Considering that re-build should be trivial in the AnalyzingSuggester, we 
could simply consider the fix a breaking change and discuss if it's acceptable 
to backport to 7x ?

That gets my vote!




[jira] [Commented] (LUCENE-8344) TokenStreamToAutomaton doesn't ignore trailing posInc when preservePositionIncrements=false

2018-06-04 Thread Jim Ferenczi (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16500922#comment-16500922
 ] 

Jim Ferenczi commented on LUCENE-8344:
--

The exact match pass filters out prefix paths that don't end with END_BYTE, so
we'd have to change it to ignore trailing POS_SEPs (lines 709 and 727).
However, we have no way to infer the value of preservePositionIncrements for an
indexed suggestion, so I am not even sure that we can handle the BWC safely.
Considering that re-build should be trivial in the AnalyzingSuggester, we
could simply consider the fix a breaking change and discuss if it's acceptable
to backport to 7x ?




[jira] [Commented] (LUCENE-8344) TokenStreamToAutomaton doesn't ignore trailing posInc when preservePositionIncrements=false

2018-06-04 Thread David Smiley (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16500848#comment-16500848
 ] 

David Smiley commented on LUCENE-8344:
--

RE NRT Doc Suggester:  "This is expected": okay, I see what you mean.  I guess
if any user (past/present/future) wants to use preservePositionIncrements=false
effectively, then they need to be using CompletionAnalyzer/CompletionTokenStream
at both index _and_ query time.  The existing tests are not doing that; they
use the input analyzer at query time.  The two queries used in the test, "fo"
and "foob", didn't demonstrate something important this test should be testing
for: position increments (stopwords) in the _query_.  Ditto for some similar
test methods here (positive and negative assertions).  I'll try to improve this
some.

RE AnalyzingSuggester:  Hmmm.  What if the first phase of the "exactFirst"
logic captured the "output2" lookup results in a place that could be examined
by the second pass?  I think this would be more robust, and wouldn't even need
to invoke sameSurfaceForm in the second phase.  If the FST was built with the
bug (7.3 or prior), then an exact match of a trailing stopword with this
setting wouldn't be recognized as an exact match, but I think that's a minor
loss easily fixed with reindexing?




[jira] [Commented] (LUCENE-8344) TokenStreamToAutomaton doesn't ignore trailing posInc when preservePositionIncrements=false

2018-06-04 Thread Jim Ferenczi (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16500766#comment-16500766
 ] 

Jim Ferenczi commented on LUCENE-8344:
--

{quote}
org.apache.lucene.search.suggest.document.TestPrefixCompletionQuery#testAnalyzerWithSepAndNoPreservePos
 see "test trailing stopword with a new document"
{quote}

If you index with preservePositionIncrements=false you cannot match a query 
that preserves the position increments and contains a stop word. This is 
expected. "baz the" indexed with preservePositionIncrements=false cannot match 
the query "baz the" if you preserve the position increments. However it should 
work if you query "baz" with and without preserving the pos increment. This is 
why I said that the completion field (and all the related queries) should be 
fine with this change. It works without reindexing.

{quote}
org.apache.lucene.search.suggest.analyzing.AnalyzingSuggesterTest#testStandard 
see the "round trip" test
With BUG==true: fails (bad for back-compat)
With BUG==false: passes (therefore a reindex fixes)
{quote}

This one is trickier because it tries to find an exact match first, so the
indexed version and the query version must be the same; otherwise the
assertion at line 789 of the AnalyzingSuggester fails. We can probably fix the
discrepancy by adding a BWC layer that removes the trailing POS_SEP of the
indexed version when sameSurfaceForm is called and preservePosInc is false?
WDYT?
This would remove the need to rebuild the FST on a version that contains the
fix.
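
As a rough illustration of the BWC idea, the sketch below normalizes an indexed path before comparing it for an exact match. It is a toy model in Python, not the AnalyzingSuggester code: `strip_trailing_holes`, `same_analyzed_form`, and the `SEP`/`HOLE` symbols are hypothetical names, and the analyzed form is simplified to a flat list of symbols. It also assumes the caller knows the value of the setting, which other comments in this thread point out may not be knowable for an already indexed suggestion.

```python
SEP = "_"   # hypothetical stand-in for the position separator
HOLE = "*"  # hypothetical stand-in for a hole left by a removed token


def strip_trailing_holes(path):
    """Drop every trailing (SEP, HOLE) pair from a symbol list."""
    end = len(path)
    while end >= 2 and path[end - 2] == SEP and path[end - 1] == HOLE:
        end -= 2
    return path[:end]


def same_analyzed_form(indexed, query, preserve_pos_inc):
    """Exact comparison that tolerates trailing holes in old indexed data."""
    if not preserve_pos_inc:
        indexed = strip_trailing_holes(indexed)
    return indexed == query


old_indexed = ["baz", SEP, HOLE]  # built before the fix
new_query = ["baz"]               # analyzed after the fix
matches = same_analyzed_form(old_indexed, new_query, preserve_pos_inc=False)
```

The normalization is applied only on the indexed side and only when position increments are not preserved, so paths built with the default setting are compared unchanged.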






[jira] [Commented] (LUCENE-8344) TokenStreamToAutomaton doesn't ignore trailing posInc when preservePositionIncrements=false

2018-06-04 Thread David Smiley (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16500343#comment-16500343
 ] 

David Smiley commented on LUCENE-8344:
--

To demonstrate the issue in the patch I added a
TokenStreamToAutomaton.BUG boolean flag so a test can see what happens when
the suggest index was built with trailing holes but query-time analysis no
longer produces them.

org.apache.lucene.search.suggest.analyzing.AnalyzingSuggesterTest#testStandard
see the "round trip" test
With BUG==true: fails (bad for back-compat)
With BUG==false: passes (therefore a reindex fixes)

org.apache.lucene.search.suggest.document.TestPrefixCompletionQuery#testAnalyzerWithSepAndNoPreservePos
see "test trailing stopword with a new document"
With BUG==true: passes (good for back-compat)
With BUG==false: fails (*)
(*): however, if you flip the analyzer passed to the PrefixCompletionQuery
constructor to the "completionAnalyzer" (instead of the plain/original
"analyzer"), then it passes.  So apparently this may require users to change
how it's used? (ouch)

CC [~areek]




[jira] [Commented] (LUCENE-8344) TokenStreamToAutomaton doesn't ignore trailing posInc when preservePositionIncrements=false

2018-06-01 Thread Jim Ferenczi (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16498455#comment-16498455
 ] 

Jim Ferenczi commented on LUCENE-8344:
--

I don't think a reindex is needed. With the fix we'll have a query automaton
that doesn't contain the hole, which should still match the version indexed
without the fix since the former is a prefix of the latter. Did I miss
something ?
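
The prefix argument can be checked with a small toy model. This is only a sketch in Python under simplifying assumptions: analyzed forms are flat symbol lists, `"_"` and `"*"` are hypothetical stand-ins for the separator and a hole, and `is_prefix` is an illustrative helper, not a Lucene method. It covers prefix matching only, not the exact-match pass of the AnalyzingSuggester.

```python
def is_prefix(query_path, indexed_path):
    """True when the query symbols are a prefix of the indexed symbols."""
    return indexed_path[:len(query_path)] == query_path


indexed_without_fix = ["baz", "_", "*"]  # trailing hole was indexed
query_with_fix = ["baz"]                 # trailing hole is dropped

forward = is_prefix(query_with_fix, indexed_without_fix)

# The reverse pairing does not hold: a query that still carries the
# trailing hole is not a prefix of data indexed without it.
reverse = is_prefix(["baz", "_", "*"], ["baz"])
```

So in this model a fixed query still finds old data by prefix, while a query that keeps the hole cannot find data indexed without it.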




[jira] [Commented] (LUCENE-8344) TokenStreamToAutomaton doesn't ignore trailing posInc when preservePositionIncrements=false

2018-06-01 Thread David Smiley (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16498266#comment-16498266
 ] 

David Smiley commented on LUCENE-8344:
--

The patch fixes the issue and has a couple of tests.  It's a bit WIP; I just
want to get this up here for you all to see.  I don't intend to work more on it
today.  I think we need to identify back-compat issues with this.  Is it okay
to tell people in a minor release that they need to rebuild their suggester
index to avoid this edge case bug?  For the NRT Doc Suggester, that may be a
lot to ask as it's not a side-car index.
