Re: ICUTokenizer acting very strangely with oriental characters

Shawn Heisey Tue, 12 Aug 2014 18:02:29 -0700

On 8/12/2014 6:29 PM, Steve Rowe wrote:
> Shawn,
> 
> ICUTokenizer is operating as designed here.  
> 
> The key to understanding this is 
> o.a.l.analysis.icu.segmentation.ScriptIterator.isSameScript(), called from 
> ScriptIterator.next() with the scripts of two consecutive characters; these 
> methods together find script boundaries.  Here’s 
> ScriptIterator.isSameScript():
> 
>   /** Determine if two scripts are compatible. */
>   private static boolean isSameScript(int scriptOne, int scriptTwo) {
>     return scriptOne <= UScript.INHERITED || scriptTwo <= UScript.INHERITED
>         || scriptOne == scriptTwo;
>   }
> 
> ASCII digits are in the Unicode script named “Common” (see 
> <http://www.unicode.org/Public/6.3.0/ucd/Scripts.txt>), and UScript.COMMON 
> (0) is less than UScript.INHERITED (1) (see 
> <http://www.icu-project.org/~mow/ICU4JCodeCoverage/Current/com/ibm/icu/lang/UScript.html>),
> so there will be no script boundary detected between a character from an 
> oriental script followed by an ASCII digit, or vice versa - the ASCII digit 
> will be assigned the same script as the preceding character.
> 
> See UAX#24 for more info: <http://www.unicode.org/reports/tr24/tr24-21.html> 
> (that’s the Unicode 6.3.0 version, which is supported by Lucene/Solr 4.9).
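
To make the quoted rule concrete, here is a minimal sketch of script-run detection under the same compatibility test. It is not the Lucene ScriptIterator: it assumes the JDK's java.lang.Character.UnicodeScript enum in place of ICU4J's integer UScript codes, and the class ScriptRuns and the method scriptRuns are hypothetical names made up for illustration. It models only script boundaries, not the word-break rules that ICUTokenizer applies within each run, so it says nothing about how punctuation is handled.

```java
import java.lang.Character.UnicodeScript;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class; not part of Lucene or ICU4J.
final class ScriptRuns {

    // Same idea as the quoted isSameScript(): COMMON and INHERITED are
    // compatible with every script, otherwise the scripts must be equal.
    static boolean isSameScript(UnicodeScript a, UnicodeScript b) {
        return a == UnicodeScript.COMMON || a == UnicodeScript.INHERITED
            || b == UnicodeScript.COMMON || b == UnicodeScript.INHERITED
            || a == b;
    }

    // Split text into runs, starting a new run only where two
    // incompatible scripts meet (simplified; an assumption of this sketch).
    static List<String> scriptRuns(String text) {
        List<String> runs = new ArrayList<>();
        int start = 0;
        UnicodeScript runScript = UnicodeScript.COMMON;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            UnicodeScript sc = UnicodeScript.of(cp);
            if (!isSameScript(runScript, sc)) {
                // Two different "real" scripts: close the current run.
                runs.add(text.substring(start, i));
                start = i;
                runScript = sc;
            } else if (sc != UnicodeScript.COMMON && sc != UnicodeScript.INHERITED) {
                // The run takes on the first real script it encounters.
                runScript = sc;
            }
            i += Character.charCount(cp);
        }
        if (start < text.length()) {
            runs.add(text.substring(start));
        }
        return runs;
    }

    public static void main(String[] args) {
        // The digits are COMMON, so they stay attached to the Han run.
        System.out.println(scriptRuns("政治100foo")); // prints [政治100, foo]
    }
}
```

In this sketch the only boundary in "政治100foo" falls between the digit 0 and the letter f, because that is the first place where two different non-Common scripts (Han and Latin) meet.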


So the punctuation isn't considered break-worthy?

This input:

[政 治],100foo

Becomes 政 治, 100, and foo.

Thanks,
Shawn
