Re: ICUTokenizer acting very strangely with oriental characters

Shawn Heisey Wed, 13 Aug 2014 10:55:39 -0700

On 8/12/2014 9:13 PM, Steve Rowe wrote:
> In the table below, the "IsSameS" (is same script) and "SBreak?" (script
> break = not IsSameS) decisions are based on what I mentioned in my previous
> message, and the "WBreak" (word break) decision is based on UAX#29 word
> break rules:
>
> Char  Code Point  Script  IsSameS?  SBreak?  WBreak?
> ----  ----------  ------  --------  -------  -------
> 治    U+6CBB      Han     Yes       No       Yes
> ]     U+005D      Common  Yes       No       Yes
> ,     U+002C      Common  Yes       No       Yes
> 1     U+0031      Common  --        --       --
>
> First, script boundaries are found and used as token boundaries - in the
> above case, no script boundary is found between "治" and "1" - and then
> UAX#29 word break rules are used to find token boundaries in between script
> boundaries - in the above case, there are word boundaries between each
> character, but ICUTokenizer throws away punctuation-only sequences between
> token boundaries.
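The script-boundary step described above can be sketched with nothing but the JDK. This is a simplified approximation, not ICUTokenizer's actual code: it assumes the rule is that Common and Inherited characters (digits, punctuation, combining marks) never open a new script run, and the class and method names (ScriptRuns, scriptRuns) are made up for illustration. It does not attempt the second stage (UAX#29 word breaks) or the dropping of punctuation-only sequences.

```java
import java.lang.Character.UnicodeScript;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene or ICU.
class ScriptRuns {

    // Splits text into runs that share one script. COMMON and INHERITED
    // code points are treated as neutral: they stay in the current run
    // and never trigger a boundary (an assumption about the rule, see above).
    static List<String> scriptRuns(String text) {
        List<String> runs = new ArrayList<>();
        UnicodeScript runScript = null; // null until a non-neutral script is seen
        int start = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            UnicodeScript script = UnicodeScript.of(cp);
            boolean neutral = script == UnicodeScript.COMMON
                    || script == UnicodeScript.INHERITED;
            if (!neutral) {
                if (runScript != null && script != runScript) {
                    // A different specific script begins here: close the run.
                    runs.add(text.substring(start, i));
                    start = i;
                }
                runScript = script;
            }
            i += Character.charCount(cp);
        }
        if (start < text.length()) {
            runs.add(text.substring(start));
        }
        return runs;
    }

    public static void main(String[] args) {
        // Han followed only by Common characters: a single run.
        System.out.println(scriptRuns("治],1").size());
        // Latin followed by Han: two runs, split before the Han character.
        System.out.println(scriptRuns("abc治],1").size());
    }
}
```

Under that assumption, "治],1" stays in one script run because "]", "," and "1" are all Common, which matches the table: the only boundaries left to find inside the run are the UAX#29 word breaks.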


What should we use as a dividing character for situations like this? 
Should we tell our customer that they can't start keywords like this
(for searching/filtering) with a number?

Thanks,
Shawn
