[ 
https://issues.apache.org/jira/browse/LUCENE-8509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16951982#comment-16951982
 ] 

David Wayne Smiley commented on LUCENE-8509:
--------------------------------------------

[~romseygeek] why was this option not added as a new configuration flag?  This 
is sort of an internal implementation detail, so it's not a big deal, but if it 
were a flag, it'd be easier for the tests to toggle it.  It's also 
disappointing to see yet another constructor arg when we already have a bit 
field for booleans.

Also:
* the CHANGES.txt claims offset adjusting defaults to false, but actually it 
defaults to true.
* there was no documentation change.  At least the javadocs of this class, which 
show all the other options, should mention it.
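
To illustrate the configuration-flag point, here is a minimal, self-contained sketch of how a boolean option can live in an existing int bit field instead of a separate constructor argument. Every name in it (FlagSketch, ADJUST_OFFSETS, and the stand-in constants) is hypothetical and chosen for illustration only; none of it is taken from the Lucene source.

```java
// Hypothetical sketch: these names are illustrative, not Lucene's actual API.
final class FlagSketch {
    // Stand-ins for options that already live in the bit field.
    static final int GENERATE_WORD_PARTS = 1;
    static final int PRESERVE_ORIGINAL = 1 << 1;
    // Hypothetical new flag for the offset-adjusting behaviour.
    static final int ADJUST_OFFSETS = 1 << 2;

    // True if the given flag bit is set in the bit field.
    static boolean has(int flags, int flag) {
        return (flags & flag) != 0;
    }

    // Returns a copy of the bit field with the given flag set or cleared.
    static int with(int flags, int flag, boolean enabled) {
        return enabled ? (flags | flag) : (flags & ~flag);
    }

    public static void main(String[] args) {
        int flags = GENERATE_WORD_PARTS | ADJUST_OFFSETS;
        // A test could toggle the behaviour without a new constructor overload.
        int toggled = with(flags, ADJUST_OFFSETS, false);
        System.out.println(has(flags, ADJUST_OFFSETS));
        System.out.println(has(toggled, ADJUST_OFFSETS));
    }
}
```

With this shape, a randomized test can flip the bit alongside the other options rather than threading an extra boolean through each constructor call.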

> NGramTokenizer, TrimFilter and WordDelimiterGraphFilter in combination can 
> produce backwards offsets
> ----------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-8509
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8509
>             Project: Lucene - Core
>          Issue Type: Task
>            Reporter: Alan Woodward
>            Assignee: Alan Woodward
>            Priority: Major
>             Fix For: 8.0
>
>         Attachments: LUCENE-8509.patch, LUCENE-8509.patch
>
>
> Discovered by an elasticsearch user and described here: 
> https://github.com/elastic/elasticsearch/issues/33710
> The ngram tokenizer produces tokens "a b" and " bb" (note the space at the 
> beginning of the second token).  The WDGF takes the first token and splits it 
> into two, adjusting the offsets of the second token, so we get "a"[0,1] and 
> "b"[2,3].  The trim filter removes the leading space from the second token, 
> leaving offsets unchanged, so WDGF sees "bb"[1,4]; because the leading space 
> has already been stripped, WDGF sees no need to adjust offsets, and emits the 
> token as-is, resulting in the start offsets of the tokenstream being [0, 2, 
> 1], and the IndexWriter rejecting it.
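
The offset sequence in the description above can be checked with a small standalone sketch. It does not use any Lucene classes; OffsetOrderSketch and firstBackwardsOffset are hypothetical names, and the check only mirrors the rule that start offsets in a token stream must not go backwards.

```java
// Hypothetical sketch: plain Java, no Lucene dependency.
final class OffsetOrderSketch {
    /**
     * Returns the index of the first start offset that is smaller than
     * the one before it, or -1 if the offsets never go backwards.
     */
    static int firstBackwardsOffset(int[] startOffsets) {
        for (int i = 1; i < startOffsets.length; i++) {
            if (startOffsets[i] < startOffsets[i - 1]) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Start offsets from the issue: "a"[0,1], "b"[2,3], "bb"[1,4].
        int[] starts = {0, 2, 1};
        System.out.println(firstBackwardsOffset(starts)); // prints 2
    }
}
```

The third token, "bb", starts at 1 after a token that started at 2, which is the ordering the IndexWriter rejects.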


