[jira] [Commented] (LUCENE-8972) CharFilter version of ICUTransformFilter, to better support dictionary-based tokenization

2019-09-20 Thread Michael Gibney (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16934721#comment-16934721
 ] 

Michael Gibney commented on LUCENE-8972:


I have pushed  [PR #892|https://github.com/apache/lucene-solr/pull/892], with 
the proposed new classes, tests, and docs. Initially I've mostly just used 
modified versions of the tests for {{ICUTransformFilter*}} ... (btw, 
{{testRandomStrings()}} is great!).

Most of the code complexity is due to the need to incrementally process one 
input character at a time, in order to get offset correction as accurate as 
possible, and to implement "rollback" (following the same approach that the 
ICU Transliterator code uses internally, but does not expose via its public 
API).
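As a rough illustration of the one-char-at-a-time-with-rollback idea (this is not the actual PR code; the toy context-sensitive rule "sh" -> "S" stands in for a real Transliterator):

```java
/**
 * Toy sketch of incremental transliteration with rollback. The rule
 * "sh" -> "S" is context-sensitive: after seeing a bare 's' we cannot
 * commit any output until the next character arrives, so the uncommitted
 * tail is re-transliterated ("rolled back") on every insertion.
 */
public class RollbackSketch {

    /** Batch-mode reference: apply the toy rule to a complete string. */
    static String batchTransform(String s) {
        return s.replace("sh", "S");
    }

    /**
     * Incremental mode: feed one char at a time, committing only output
     * that no future input can change.
     */
    static String incrementalTransform(String input) {
        StringBuilder committed = new StringBuilder();
        StringBuilder pending = new StringBuilder();
        for (int i = 0; i < input.length(); i++) {
            pending.append(input.charAt(i));
            // Rollback: re-run the rule over the whole uncommitted tail.
            String out = batchTransform(pending.toString());
            // A trailing 's' could still combine with a future 'h', so it
            // must stay pending; everything before it is safe to commit.
            int safe = out.endsWith("s") ? out.length() - 1 : out.length();
            committed.append(out, 0, safe);
            pending.setLength(0);
            pending.append(out, safe, out.length());
        }
        return committed.append(pending).toString(); // flush at end of input
    }
}
```

The commit heuristic here is trivially hand-rolled for one rule; the real implementation has to rely on the transliterator's own notion of incomplete context, which is what makes the rollback bookkeeping (and accurate offset correction on top of it) complex.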

The following discusses "rollback" in a little more depth, including some of 
the performance implications and an idea for future performance improvement:

Regarding "rollback", see comments "To understand the need for rollback" [in 
source 
code|https://github.com/unicode-org/icu/blob/a075ac9c/icu4j/main/classes/translit/src/com/ibm/icu/text/Transliterator.java#L1137]
 for private method {{Transliterator#filteredTransliterate(Replaceable, 
Position, boolean, boolean)}}. 
[CompoundTransliterator|https://github.com/unicode-org/icu/blob/a075ac9c/icu4j/main/classes/translit/src/com/ibm/icu/text/CompoundTransliterator.java]'s
 compliance with the extant top-level Transliterator abstraction here induces 
some serious performance hits: for some not-uncommon cases (like trailing NFC 
in the "Cyrillic-Latin" transliteration), character blocks get shifted around 
on _every_ incremental character insertion. (FWIW, "incremental character 
insertion and rollback" is essentially how ICU handles this situation in the 
source code referenced above.)

For future consideration (absent a change in the ICU API), I'm thinking that it 
might be possible to reimplement the essence of CompoundTransliterator in 
external (Lucene) application code, with a separately tracked "position" for 
each "leaf" Transliterator in the Transliterator tree. This would allow 
positions that were blocked partway through depth-first traversal of the 
Transliterator tree to avoid:
 # being double-processed by (potentially non-idempotent) leading 
Transliterators, and/or
 # being skipped by trailing Transliterators on account of higher-level filters 
that block the partially-processed character

My sense is that the performance gain could be significant.
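A minimal sketch of the shape I have in mind (hypothetical names; nothing like this exists in ICU's public API, and context-sensitive buffering per leaf is omitted): each leaf in the chain is only ever handed the delta it has not yet seen, so no character is fed twice to a non-idempotent leaf and none bypasses a trailing leaf:

```java
import java.util.List;
import java.util.function.UnaryOperator;

/**
 * Hypothetical sketch of per-leaf position tracking for a compound
 * transform. ICU's CompoundTransliterator shares one Position across the
 * whole chain; here each leaf implicitly keeps its own position by only
 * ever receiving input it has not processed before.
 */
class PerLeafPipeline {
    private final List<UnaryOperator<String>> leaves;

    PerLeafPipeline(List<UnaryOperator<String>> leaves) {
        this.leaves = leaves;
    }

    /** Feed newly committed text; each leaf sees each character exactly once. */
    String feed(String delta) {
        for (UnaryOperator<String> leaf : leaves) {
            delta = leaf.apply(delta);
        }
        return delta;
    }
}
```

With a non-idempotent leading leaf (say, one that doubles every 'a'), feeding the same committed text twice through a shared-position design could double it twice; here each feed only ever touches new input, which is the property the per-leaf positions are meant to guarantee.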

> CharFilter version of ICUTransformFilter, to better support dictionary-based 
> tokenization
> -
>
> Key: LUCENE-8972
> URL: https://issues.apache.org/jira/browse/LUCENE-8972
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/analysis
>Affects Versions: master (9.0), 8.2
>Reporter: Michael Gibney
>Priority: Minor
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The ICU Transliteration API is currently exposed through Lucene only 
> post-tokenizer, via ICUTransformFilter. Some tokenizers (particularly 
> dictionary-based) may assume pre-normalized input (e.g., for Chinese 
> characters, there may be an assumption of traditional-only or simplified-only 
> input characters, at the level of either all input, or 
> per-dictionary-defined-token).
> The potential usefulness of a CharFilter that exposes the ICU Transliteration 
> API was suggested in a [thread on the Solr mailing 
> list|https://mail-archives.apache.org/mod_mbox/lucene-solr-user/201807.mbox/%3C4DAB7BA7-42A8-4009-8B49-60822B00DE7D%40wunderwood.org%3E],
>  and my hope is that this issue can facilitate more detailed discussion of 
> the proposed addition.
> Concrete examples of mixed traditional/simplified characters that are 
> currently tokenized differently by the ICUTokenizer:
>  * 红楼梦 (SSS)
>  * 紅樓夢 (TTT)
>  * 紅楼夢 (TST)
> The first two tokens (simplified-only and traditional-only, respectively) are 
> included in the [CJ dictionary that backs 
> ICUTokenizer|https://raw.githubusercontent.com/unicode-org/icu/release-62-1/icu4c/source/data/brkitr/dictionaries/cjdict.txt],
>  but the last (a mixture of traditional and simplified characters) is not, 
> and is not recognized as a token. Even _if_ we assume this to be an 
> intentional omission from the dictionary that results in behavior that could 
> be desirable for some use cases, there are surely some use cases that would 
> benefit from a more permissive dictionary-based tokenization strategy (such 
> as could be supported by pre-tokenizer transliteration).
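To make the example above concrete, here is a toy stand-in for pre-tokenizer Traditional-to-Simplified normalization (a real setup would apply ICU's "Traditional-Simplified" transliterator via the proposed CharFilter; the map below covers only the three characters from the example), showing that all three variants collapse to the same dictionary form:

```java
import java.util.Map;

/**
 * Toy stand-in for pre-tokenizer Traditional->Simplified normalization.
 * Only the three characters from the example above are mapped; everything
 * else passes through unchanged.
 */
class TradSimpDemo {
    static final Map<Character, Character> T2S =
            Map.of('紅', '红', '樓', '楼', '夢', '梦');

    static String normalize(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            // Replace mapped traditional characters; pass others through.
            sb.append(T2S.getOrDefault(s.charAt(i), s.charAt(i)));
        }
        return sb.toString();
    }
}
```

All of 红楼梦 (SSS), 紅樓夢 (TTT), and 紅楼夢 (TST) normalize to 红楼梦, the simplified-only form present in the CJ dictionary, so a dictionary-based tokenizer downstream would see a known token in every case.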



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[jira] [Commented] (LUCENE-8972) CharFilter version of ICUTransformFilter, to better support dictionary-based tokenization

2019-09-16 Thread Robert Muir (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16930759#comment-16930759
 ] 

Robert Muir commented on LUCENE-8972:
-

Yes, this would be another thing, good one for tests. But the whole idea is 
sound, I think you should be able to make it work!




[jira] [Commented] (LUCENE-8972) CharFilter version of ICUTransformFilter, to better support dictionary-based tokenization

2019-09-16 Thread Michael Gibney (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16930664#comment-16930664
 ] 

Michael Gibney commented on LUCENE-8972:


Thanks for the feedback/advice, [~rcmuir]. Along the same lines as what you 
mention, I think some attention also needs to be paid to the 
resolution/accuracy of offset correction. I'm going to take a crack at this and 
hope to have something shortly.
