[ https://issues.apache.org/jira/browse/LUCENE-8527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Steve Rowe reassigned LUCENE-8527:
----------------------------------

    Assignee: Steve Rowe
  Attachment: LUCENE-8527.patch

Patch, passes most Lucene/Solr tests (see below), including the test built with Unicode 9.0's word break test data: {{WordBreakTestUnicode_9_0_0}}.

{quote}So the stuff in the ICU tokenizer uses some properties to tag the "stuff between breaks" as emoji token type versus something else. I looked at latest jflex, it seems it would need those props?
{quote}

Yes, JFlex 1.7.0 doesn't have the Emoji props it needs to properly tokenize and type as emoji, since these props' definitions are not included with release-specific data. For Lucene's use it should be possible to script pulling in Unicode data to augment the scanner specs, which would allow proper emoji tokenization/typing to work. (I've made a note to add these properties to future JFlex releases.)

Failing tests with the patch:

{{ant test -Dtestcase=TestStandardAnalyzer -Dtests.method=testRandomHugeStringsGraphAfter -Dtests.seed=B33609C22A50A253 -Dtests.slow=true -Dtests.badapples=true -Dtests.locale=es-VE -Dtests.timezone=Africa/Blantyre -Dtests.asserts=true -Dtests.file.encoding=UTF-8}}

{{ant test -Dtestcase=TestStandardAnalyzer -Dtests.method=testRandomHugeStrings -Dtests.seed=DA01A0705C379738 -Dtests.slow=true -Dtests.badapples=true -Dtests.locale=ru-RU -Dtests.timezone=Europe/Sarajevo -Dtests.asserts=true -Dtests.file.encoding=ISO-8859-1}}

In both of these cases, {{BaseTokenStreamTestCase.checkAnalysisConsistency()}} fails with unexpected tokenization after randomly choosing to wrap the input in a spoon-feeding reader: {{MockReaderWrapper}}. If I disable the wrapping with those seeds, the tests pass. I'll work on a simplified test case that demonstrates the problem; I'm not sure yet what's going wrong.
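For anyone unfamiliar with the term, here is a minimal sketch of what a spoon-feeding reader does: it hands back fewer chars per {{read()}} call than the caller asked for, so a scanner has to cope with input arriving in small, arbitrary pieces (possibly in the middle of a surrogate pair). {{SpoonFeedReader}} and {{drain()}} are hypothetical names used only for illustration. This is not Lucene's {{MockReaderWrapper}}, and it makes no attempt to reproduce the failures above.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

/** Hypothetical illustration of a spoon-feeding reader; NOT Lucene's MockReaderWrapper. */
class SpoonFeedReader extends FilterReader {
  private final int maxChunk;

  SpoonFeedReader(Reader in, int maxChunk) {
    super(in);
    if (maxChunk < 1) {
      throw new IllegalArgumentException("maxChunk must be >= 1");
    }
    this.maxChunk = maxChunk;
  }

  @Override
  public int read(char[] buf, int off, int len) throws IOException {
    // Return at most maxChunk chars per call, even when the caller asked for more.
    return super.read(buf, off, Math.min(len, maxChunk));
  }

  /** Reads everything from r through a buffer of bufSize chars and returns it as a String. */
  static String drain(Reader r, int bufSize) throws IOException {
    StringBuilder sb = new StringBuilder();
    char[] buf = new char[bufSize];
    int n;
    while ((n = r.read(buf, 0, buf.length)) != -1) {
      sb.append(buf, 0, n);
    }
    return sb.toString();
  }

  public static void main(String[] args) throws IOException {
    // U+1F600 is a surrogate pair; with maxChunk = 1 its two halves arrive in separate reads.
    String text = "emoji \uD83D\uDE00 text";
    String viaSpoon = drain(new SpoonFeedReader(new StringReader(text), 1), 16);
    System.out.println(text.equals(viaSpoon));
  }
}
```

A correct consumer should see the same char sequence whether or not the wrapper is in place; only the chunk boundaries differ.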
> Upgrade JFlex to 1.7.0
> ----------------------
>
>                 Key: LUCENE-8527
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8527
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: general/build, modules/analysis
>            Reporter: Steve Rowe
>            Assignee: Steve Rowe
>            Priority: Minor
>         Attachments: LUCENE-8527.patch
>
>
> JFlex 1.7.0, supporting Unicode 9.0, was released recently: [http://jflex.de/changelog.html#jflex-1.7.0]. We should upgrade.