[ https://issues.apache.org/jira/browse/LUCENE-7181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Christian Moen reassigned LUCENE-7181:
--------------------------------------

    Assignee: Christian Moen

> JapaneseTokenizer: Validate segmentation of User Dictionary entries on creation
> -------------------------------------------------------------------------------
>
>                 Key: LUCENE-7181
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7181
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Tomás Fernández Löbbe
>            Assignee: Christian Moen
>         Attachments: LUCENE-7181.patch
>
>
> From the [conversation on the dev list|http://mail-archives.apache.org/mod_mbox/lucene-dev/201604.mbox/%3CCAMJgJxR8gLnXi7WXkN3KFfxHu=posevxxarbbg+chce1tzh...@mail.gmail.com%3E]
> The user dictionary in the {{JapaneseTokenizer}} lets users customize how a stream is broken into tokens by providing segmentation rules like:
> AABBBCC -> AA BBB CC
> It does not allow users to change any of the token characters, like:
> (1) AABBBCC -> DD BBB CC (this will just tokenize to "AA", "BBB", "CC"; it seems to only care about positions)
> It also doesn't let a character be part of more than one token, like:
> (2) AABBBCC -> AAB BBB BCC (this will throw an ArrayIndexOutOfBoundsException, AIOOBE)
> ...or make the output tokens longer than the input text:
> (3) AA -> AAA (also AIOOBE)
> Currently there is no validation for these cases. Case 1 doesn't fail but produces unexpected tokens. Cases 2 and 3 fail only later, when the input text is analyzed. We should add validation to the {{UserDictionary}} creation.
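To make the three failure cases in the description concrete, here is a minimal sketch of the kind of check that could run when a user dictionary entry is read. It is not the attached LUCENE-7181.patch and does not use any Lucene API: the class name `UserDictionaryEntryValidator` and the method `validate` are hypothetical, and the sketch assumes the segmentation is given as whitespace-separated tokens. The idea is simply that the concatenated segments must reproduce the surface form exactly, which rejects changed characters (case 1), overlapping tokens (case 2), and output longer than the input (case 3).

```java
// Hypothetical helper, not part of Lucene: illustrates one possible validation rule.
final class UserDictionaryEntryValidator {

    private UserDictionaryEntryValidator() {}

    /**
     * Splits the segmentation on whitespace and verifies that the segments,
     * joined back together, are identical to the surface form.
     *
     * @param surface      the text the entry matches, e.g. "AABBBCC"
     * @param segmentation whitespace-separated tokens, e.g. "AA BBB CC"
     * @return the individual segments if the entry is consistent
     * @throws IllegalArgumentException if the segments do not rebuild the surface form
     */
    static String[] validate(String surface, String segmentation) {
        String[] segments = segmentation.trim().split("\\s+");
        String joined = String.join("", segments);
        if (!joined.equals(surface)) {
            throw new IllegalArgumentException(
                "Illegal user dictionary entry '" + surface
                    + "': concatenated segmentation '" + joined
                    + "' does not match the surface form");
        }
        return segments;
    }
}
```

With this check, `validate("AABBBCC", "AA BBB CC")` returns the three segments, while the entries from cases 1 to 3 (`"DD BBB CC"`, `"AAB BBB BCC"`, and `"AAA"` for the surface `"AA"`) each raise an `IllegalArgumentException` when the dictionary is built rather than failing later during analysis.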