[jira] [Commented] (LUCENE-8527) Upgrade JFlex to 1.7.0
[ https://issues.apache.org/jira/browse/LUCENE-8527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16737386#comment-16737386 ]

ASF subversion and git services commented on LUCENE-8527:
---------------------------------------------------------

Commit 283b19a8da6ab9e0b7e9a75b132d3067218d5502 in lucene-solr's branch refs/heads/master from Steven Rowe
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=283b19a ]

LUCENE-8527: Upgrade JFlex to 1.7.0. StandardTokenizer and UAX29URLEmailTokenizer now support Unicode 9.0, and provide UTS#51 v11.0 Emoji tokenization with the '' token type.

> Upgrade JFlex to 1.7.0
> ----------------------
>
>         Key: LUCENE-8527
>         URL: https://issues.apache.org/jira/browse/LUCENE-8527
>     Project: Lucene - Core
>  Issue Type: Improvement
>  Components: general/build, modules/analysis
>    Reporter: Steve Rowe
>    Assignee: Steve Rowe
>    Priority: Minor
> Attachments: LUCENE-8527.patch, LUCENE-8527.patch, LUCENE-8527.patch
>
>
> JFlex 1.7.0, supporting Unicode 9.0, was released recently:
> [http://jflex.de/changelog.html#jflex-1.7.0]. We should upgrade.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8527) Upgrade JFlex to 1.7.0
[ https://issues.apache.org/jira/browse/LUCENE-8527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16737384#comment-16737384 ]

ASF subversion and git services commented on LUCENE-8527:
---------------------------------------------------------

Commit e8c65da6bb8be626242cfba18989e497180e82aa in lucene-solr's branch refs/heads/branch_7x from Steven Rowe
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=e8c65da ]

LUCENE-8527: Upgrade JFlex to 1.7.0. StandardTokenizer and UAX29URLEmailTokenizer now support Unicode 9.0, and provide UTS#51 v11.0 Emoji tokenization with the '' token type.
[jira] [Commented] (LUCENE-8527) Upgrade JFlex to 1.7.0
[ https://issues.apache.org/jira/browse/LUCENE-8527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16737385#comment-16737385 ]

ASF subversion and git services commented on LUCENE-8527:
---------------------------------------------------------

Commit 0e903cab47e98c75d4fe0bb2a33a84e8f3c648ff in lucene-solr's branch refs/heads/branch_8x from Steven Rowe
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=0e903ca ]

LUCENE-8527: Upgrade JFlex to 1.7.0. StandardTokenizer and UAX29URLEmailTokenizer now support Unicode 9.0, and provide UTS#51 v11.0 Emoji tokenization with the '' token type.
[jira] [Commented] (LUCENE-8527) Upgrade JFlex to 1.7.0
[ https://issues.apache.org/jira/browse/LUCENE-8527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16732330#comment-16732330 ]

Steve Rowe commented on LUCENE-8527:
------------------------------------

bq. [W]ith the default skeleton, JFlex 1.7.0 generates scanners that misbehave when given a spoon-feeding reader (i.e. a reader that returns at least one char but fewer than the requested number of chars) [...] I'll make a JFlex issue for this bug.

I created an issue: https://github.com/jflex-de/jflex/issues/538

bq. [I]nvocations of JFlex's Ant target inherit options used by previous invocations in the same Ant session. I'll make a JFlex issue for this bug.

There is an ongoing effort to fix this (see https://github.com/jflex-de/jflex/pull/258), which will likely be included in the next JFlex release.
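For readers unfamiliar with the term, here is a minimal sketch of what a "spoon-feeding reader" could look like, using only the JDK. The class name {{SpoonFeedingReader}} and the {{drain}} helper are made up for illustration; this is not the reader used in Lucene's tests, and it does not reproduce the JFlex bug itself. It only shows the legal-but-awkward {{Reader}} behaviour that a scanner's buffer refill logic has to tolerate.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;

/**
 * Hypothetical illustration: a Reader that always returns at least one char
 * but fewer than the number requested (here, exactly one per call).
 */
final class SpoonFeedingReader extends FilterReader {

  SpoonFeedingReader(Reader in) {
    super(in);
  }

  @Override
  public int read(char[] cbuf, int off, int len) throws IOException {
    if (len <= 1) {
      // Nothing to withhold: pass the request through unchanged.
      return super.read(cbuf, off, len);
    }
    // Deliver a single char even though the caller asked for more.
    // This is allowed by the Reader contract; callers must loop.
    return super.read(cbuf, off, 1);
  }

  /** Reads the reader to exhaustion, looping over short reads as a correct consumer must. */
  static String drain(Reader reader, int bufSize) {
    StringBuilder sb = new StringBuilder();
    char[] buf = new char[bufSize];
    try {
      int n;
      while ((n = reader.read(buf, 0, buf.length)) != -1) {
        sb.append(buf, 0, n);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return sb.toString();
  }
}
```

A consumer that loops until {{read}} returns -1, as {{drain}} does, reassembles the input correctly even when a surrogate pair is split across two calls. A consumer that assumes one {{read}} call fills its buffer does not.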
[jira] [Commented] (LUCENE-8527) Upgrade JFlex to 1.7.0
[ https://issues.apache.org/jira/browse/LUCENE-8527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16714304#comment-16714304 ]

Steve Rowe commented on LUCENE-8527:
------------------------------------

FYI the patch does not include the generated files, since that would make it much larger. To regenerate them, run {{ant jflex}} in {{lucene/core}} and in {{lucene/analysis/common}}.
[jira] [Commented] (LUCENE-8527) Upgrade JFlex to 1.7.0
[ https://issues.apache.org/jira/browse/LUCENE-8527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16713510#comment-16713510 ]

Robert Muir commented on LUCENE-8527:
-------------------------------------

It would be really nice. I don't think the tricky part is really segmentation at all (as far as finding breaks), but instead the problem of assigning the proper "label" to the token (tagging it as an emoji type). So the code in the ICU tokenizer uses some properties to tag the "stuff between breaks" as the emoji token type versus something else. I looked at the latest JFlex; it seems it would need those properties?

And it's a little tricky, e.g. the ordinary ASCII digit 7 is [:Emoji:] in Unicode. So that's why the isEmoji there is a bit crazy.
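To make the digit problem concrete, here is a rough sketch of the kind of check described above. It is not the {{icu}} module's actual code: the class and method names are hypothetical, and {{hasEmojiProperty}} is a tiny hand-written stand-in for the real [:Emoji:] property data. The assumption is that a keycap base (a digit, # or *) only counts as emoji when an emoji presentation selector (U+FE0F) or a combining enclosing keycap (U+20E3) follows it.

```java
/** Hypothetical sketch of labelling the text between two breaks as emoji or not. */
final class EmojiLabelSketch {

  static final int EMOJI_PRESENTATION_SELECTOR = 0xFE0F;
  static final int COMBINING_ENCLOSING_KEYCAP = 0x20E3;

  private EmojiLabelSketch() {}

  /** Digits, '#' and '*' carry the Emoji property but are usually plain text. */
  static boolean isKeycapBase(int cp) {
    return (cp >= '0' && cp <= '9') || cp == '#' || cp == '*';
  }

  /**
   * Stand-in for the [:Emoji:] property. Assumption: only the keycap bases and
   * U+1F600 (grinning face) are listed here; real code would consult Unicode data.
   */
  static boolean hasEmojiProperty(int cp) {
    return isKeycapBase(cp) || cp == 0x1F600;
  }

  /** Returns true if the token should get an emoji label under the assumptions above. */
  static boolean looksLikeEmoji(String token) {
    if (token.isEmpty()) {
      return false;
    }
    int first = token.codePointAt(0);
    if (!hasEmojiProperty(first)) {
      return false;
    }
    if (isKeycapBase(first)) {
      // Require evidence of an emoji sequence: a trailing selector or keycap mark.
      int next = Character.charCount(first);
      if (next >= token.length()) {
        return false;
      }
      int trailer = token.codePointAt(next);
      return trailer == EMOJI_PRESENTATION_SELECTOR || trailer == COMBINING_ENCLOSING_KEYCAP;
    }
    return true;
  }
}
```

Under these assumptions a bare "7" is not labelled as emoji, while "7" followed by U+FE0F and U+20E3 (the keycap sequence) is, which is the asymmetry that makes a plain property lookup insufficient.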
[jira] [Commented] (LUCENE-8527) Upgrade JFlex to 1.7.0
[ https://issues.apache.org/jira/browse/LUCENE-8527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16713437#comment-16713437 ]

Steve Rowe commented on LUCENE-8527:
------------------------------------

[~rcmuir] mentioned on LUCENE-8125 that StandardTokenizer should give emoji sequences the {{}} token type - see the logic in the {{icu}} module's {{BreakIteratorWrapper}}.

JFlex 1.7.0 supports Unicode 9.0, which, if I'm interpreting the discussion at http://www.unicode.org/L2/L2016/16315r-handling-seg-emoji.pdf properly, does not (fully) include Emoji sequence support (though customized rules that would do that properly in Unicode 9.0 are listed in that doc).

Should we include the (post-9.0) customized rules for Unicode 9.0?
[jira] [Commented] (LUCENE-8527) Upgrade JFlex to 1.7.0
[ https://issues.apache.org/jira/browse/LUCENE-8527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16713295#comment-16713295 ]

Uwe Schindler commented on LUCENE-8527:
---------------------------------------

+1