[ 
https://issues.apache.org/jira/browse/LUCENE-8527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Rowe reassigned LUCENE-8527:
----------------------------------

      Assignee: Steve Rowe
    Attachment: LUCENE-8527.patch

Patch, passes most Lucene/Solr tests (see below), including the test built with 
Unicode 9.0's word break test data: {{WordBreakTestUnicode_9_0_0}}.
{quote}So the stuff in the ICU tokenizer uses some properties to tag the "stuff 
between breaks" as emoji token type versus something else. I looked at latest 
jflex, it seems it would need those props?
{quote}
Yes, JFlex 1.7.0 doesn't have the Emoji props it needs to properly tokenize and 
type as emoji, since these props' definitions are not included with 
release-specific data. For Lucene's use it should be possible to script pulling 
in Unicode data to augment the scanner specs, which would allow proper emoji 
tokenization/typing to work. (I've made a note to add these properties to 
future JFlex releases.)
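
A minimal sketch of the data-extraction half of such a script, assuming the input is in the semicolon-separated format used by Unicode's {{emoji-data.txt}} ({{start..end ; Property # comment}}). The class and method names here are hypothetical, not part of the patch, and turning the resulting ranges into scanner spec macros is left out:

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: extracts code point ranges for one property. */
class EmojiDataRanges {

  /** Returns inclusive [start, end] code point ranges tagged with the given property. */
  static List<int[]> parse(List<String> lines, String property) {
    List<int[]> ranges = new ArrayList<>();
    for (String line : lines) {
      // Everything after '#' is a comment in the Unicode data file format.
      int hash = line.indexOf('#');
      String data = (hash >= 0 ? line.substring(0, hash) : line).trim();
      if (data.isEmpty()) {
        continue; // blank line or comment-only line
      }
      String[] fields = data.split(";");
      if (fields.length < 2 || !fields[1].trim().equals(property)) {
        continue; // malformed line, or a different property
      }
      // The first field is either a single code point or "start..end", in hex.
      String[] bounds = fields[0].trim().split("\\.\\.");
      int start = Integer.parseInt(bounds[0], 16);
      int end = bounds.length > 1 ? Integer.parseInt(bounds[1], 16) : start;
      ranges.add(new int[] {start, end});
    }
    return ranges;
  }
}
```

Calling {{EmojiDataRanges.parse(lines, "Emoji")}} on the lines of the data file would give the ranges that a generator could then write into the scanner specs.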

Failing tests with the patch:

{{ant test -Dtestcase=TestStandardAnalyzer 
-Dtests.method=testRandomHugeStringsGraphAfter -Dtests.seed=B33609C22A50A253 
-Dtests.slow=true -Dtests.badapples=true -Dtests.locale=es-VE 
-Dtests.timezone=Africa/Blantyre -Dtests.asserts=true 
-Dtests.file.encoding=UTF-8}}

{{ant test -Dtestcase=TestStandardAnalyzer -Dtests.method=testRandomHugeStrings 
-Dtests.seed=DA01A0705C379738 -Dtests.slow=true -Dtests.badapples=true 
-Dtests.locale=ru-RU -Dtests.timezone=Europe/Sarajevo -Dtests.asserts=true 
-Dtests.file.encoding=ISO-8859-1}}

In both of these cases, 
{{BaseTokenStreamTestCase.checkAnalysisConsistency()}} fails with unexpected 
tokenization after randomly choosing to use a spoon-feed reader wrapper: 
{{MockReaderWrapper}}. If I disable the wrapping with those seeds, the tests 
pass. I'll work on making a simplified test case demonstrating the problem; I'm 
not sure what's going wrong.
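
For readers unfamiliar with the term, a spoon-feed reader hands back only a few chars per {{read()}} call, so the consumer has to cope with input arriving in small pieces (including a surrogate pair split across two reads). The following is an illustrative sketch only, not Lucene's {{MockReaderWrapper}}; the class name, the fixed chunk size and the {{readAll}} helper are assumptions made for the example:

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;

/** Hypothetical wrapper that returns at most maxChunk chars per read() call. */
class SpoonFeedReader extends Reader {
  private final Reader in;
  private final int maxChunk;

  SpoonFeedReader(Reader in, int maxChunk) {
    if (maxChunk < 1) {
      throw new IllegalArgumentException("maxChunk must be >= 1");
    }
    this.in = in;
    this.maxChunk = maxChunk;
  }

  @Override
  public int read(char[] cbuf, int off, int len) throws IOException {
    // Cap the request so the caller never gets more than maxChunk chars at once.
    return in.read(cbuf, off, Math.min(len, maxChunk));
  }

  @Override
  public void close() throws IOException {
    in.close();
  }

  /** Drains a reader into a String, asking for far more than it will deliver per call. */
  static String readAll(Reader r) {
    StringBuilder sb = new StringBuilder();
    char[] buf = new char[1024];
    try {
      int n;
      while ((n = r.read(buf, 0, buf.length)) != -1) {
        sb.append(buf, 0, n);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return sb.toString();
  }
}
```

A consumer that buffers correctly should see the same content whether it reads from the original reader or from {{new SpoonFeedReader(reader, 1)}}; a difference in output between the two points at a buffer-refill problem in the consumer.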

> Upgrade JFlex to 1.7.0
> ----------------------
>
>                 Key: LUCENE-8527
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8527
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: general/build, modules/analysis
>            Reporter: Steve Rowe
>            Assignee: Steve Rowe
>            Priority: Minor
>         Attachments: LUCENE-8527.patch
>
>
> JFlex 1.7.0, supporting Unicode 9.0, was released recently: 
> [http://jflex.de/changelog.html#jflex-1.7.0].  We should upgrade.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

