David Wayne Smiley created LUCENE-9006:
------------------------------------------

             Summary: Ensure WordDelimiterGraphFilter always emits catenateAll token early
                 Key: LUCENE-9006
                 URL: https://issues.apache.org/jira/browse/LUCENE-9006
             Project: Lucene - Core
          Issue Type: Improvement
          Components: modules/analysis
            Reporter: David Wayne Smiley
            Assignee: David Wayne Smiley


Ideally, the first token emitted by WordDelimiterGraphFilter (WDGF) is the
preserveOriginal token (if configured to emit it), and the second is the
catenateAll token (if configured to emit it).  The deprecated
WordDelimiterFilter (WDF) does this, but WDGF can sometimes emit the first of
the other tokens ahead of the catenateAll token when there is a non-emitted
candidate sub-token.

Example: the input "8-other" with only generateWordParts and catenateAll
enabled -- *not* generateNumberParts.  WDGF internally sees the '8' but, since
it is not configured to emit it, moves on.  Ultimately, the "other" token and
the catenated "8other" token end up at the same internal position, which by
luck leads the sorter to emit "other" first.
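
The ordering problem in the example above can be illustrated with a small
stand-alone model. This is only a sketch and is not the WDGF source: the Tok
class, the order() method, the countSkipped switch, and the "sort by start
ascending, then end descending, stable" rule are all assumptions made for
illustration. It shows how a skipped sub-token that does not advance the
internal position leaves "other" and "8other" tied, so buffer order decides,
and how counting the skipped sub-token (one conceivable remedy, not necessarily
the change made for this issue) breaks the tie in favor of the catenated token.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Simplified, hypothetical model of the token ordering; not Lucene code.
final class TokenOrderSketch {

    // Hypothetical buffered token: term text plus internal start/end positions.
    static final class Tok {
        final String term;
        final int start;
        final int end;

        Tok(String term, int start, int end) {
            this.term = term;
            this.start = start;
            this.end = end;
        }
    }

    // parts: candidate sub-tokens in input order.
    // emit[i]: whether sub-token i is configured to be emitted.
    // countSkipped: whether a non-emitted sub-token still advances the position.
    static List<String> order(String[] parts, boolean[] emit, boolean countSkipped) {
        List<Tok> buffer = new ArrayList<>();
        StringBuilder all = new StringBuilder();
        int pos = 0;
        for (int i = 0; i < parts.length; i++) {
            all.append(parts[i]);
            if (emit[i]) {
                buffer.add(new Tok(parts[i], pos, pos + 1));
                pos++;
            } else if (countSkipped) {
                pos++; // the skipped part still occupies a position
            }
        }
        // The catenateAll token is buffered last and spans every counted position.
        buffer.add(new Tok(all.toString(), 0, Math.max(pos, 1)));

        // Assumed sort rule: start ascending, then end descending.
        // List.sort is stable, so exact ties keep their buffer order.
        Comparator<Tok> byStart = Comparator.comparingInt(t -> t.start);
        Comparator<Tok> byEndDesc = Comparator.comparingInt((Tok t) -> t.end).reversed();
        buffer.sort(byStart.thenComparing(byEndDesc));

        List<String> terms = new ArrayList<>();
        for (Tok t : buffer) {
            terms.add(t.term);
        }
        return terms;
    }

    public static void main(String[] args) {
        String[] parts = {"8", "other"};
        boolean[] emit = {false, true}; // generateWordParts only, no generateNumberParts
        System.out.println(order(parts, emit, false)); // tie: buffer order wins
        System.out.println(order(parts, emit, true));  // catenated token sorts first
    }
}
```

In this model the first call returns "other" before "8other" because both
tokens occupy the same position span, while the second call returns "8other"
first because it starts earlier than "other".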



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
