[ https://issues.apache.org/jira/browse/LUCENE-8509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16951982#comment-16951982 ]
David Wayne Smiley commented on LUCENE-8509:
--------------------------------------------

[~romseygeek] why was this option not added as a new configuration flag? This is sort of an internal implementation detail, so it's not a big deal, but if it were a flag, it would be easier for the tests to toggle it. It's also disappointing to see yet another constructor arg when we already have a bit field for booleans.

Also:
* The CHANGES.txt claims offset adjusting is false, but it actually defaults to true.
* There was no documentation change, not even to the javadocs of this class, which list all the other options.

> NGramTokenizer, TrimFilter and WordDelimiterGraphFilter in combination can produce backwards offsets
> ----------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-8509
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8509
>             Project: Lucene - Core
>          Issue Type: Task
>            Reporter: Alan Woodward
>            Assignee: Alan Woodward
>            Priority: Major
>             Fix For: 8.0
>
>         Attachments: LUCENE-8509.patch, LUCENE-8509.patch
>
>
> Discovered by an elasticsearch user and described here:
> https://github.com/elastic/elasticsearch/issues/33710
>
> The ngram tokenizer produces tokens "a b" and " bb" (note the space at the
> beginning of the second token). The WDGF takes the first token and splits it
> into two, adjusting the offsets of the second token, so we get "a"[0,1] and
> "b"[2,3]. The trim filter removes the leading space from the second token,
> leaving offsets unchanged, so WDGF sees "bb"[1,4]; because the leading space
> has already been stripped, WDGF sees no need to adjust offsets, and emits the
> token as-is, resulting in the start offsets of the tokenstream being [0, 2,
> 1], and the IndexWriter rejecting it.
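
For readers following the offsets in the quoted description, here is a minimal standalone sketch of the ordering problem. It does not use Lucene's API: the class `OffsetOrderCheck` and the method `firstBackwardsOffset` are hypothetical names invented for illustration, and the check only mirrors the rule implied above, that start offsets must not go backwards from one token to the next.

```java
final class OffsetOrderCheck {

    /**
     * Returns the index of the first token whose start offset is smaller
     * than the start offset of the token before it, or -1 if the start
     * offsets never go backwards. Hypothetical helper, not Lucene code.
     */
    static int firstBackwardsOffset(int[] startOffsets) {
        int last = 0; // offsets are non-negative, so 0 is a safe floor
        for (int i = 0; i < startOffsets.length; i++) {
            if (startOffsets[i] < last) {
                return i;
            }
            last = startOffsets[i];
        }
        return -1;
    }

    public static void main(String[] args) {
        // Start offsets from the description: "a"[0,1], "b"[2,3], "bb"[1,4]
        int[] starts = {0, 2, 1};
        System.out.println(firstBackwardsOffset(starts)); // prints 2
    }
}
```

The third token, "bb", starts at 1 after a token that started at 2, which is the backwards offset that makes the stream invalid.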