On 8/12/2014 9:13 PM, Steve Rowe wrote:
> In the table below, the "IsSameS" (is same script) and "SBreak?" (script
> break = not IsSameS) decisions are based on what I mentioned in my previous
> message, and the "WBreak" (word break) decision is based on UAX#29 word
> break rules:
>
> Char  Code Point  Script  IsSameS?  SBreak?  WBreak?
> ----  ----------  ------  --------  -------  -------
> 治    U+6CBB      Han     Yes       No       Yes
> ]     U+005D      Common  Yes       No       Yes
> ,     U+002C      Common  Yes       No       Yes
> 1     U+0031      Common  --        --       --
>
> First, script boundaries are found and used as token boundaries - in the
> above case, no script boundary is found between "治" and "1" - and then
> UAX#29 word break rules are used to find token boundaries inbetween script
> boundaries - in the above case, there are word boundaries between each
> character, but ICUTokenizer throws away punctuation-only sequences between
> token boundaries.
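
To check my reading of the two-pass description above, here is a rough Python sketch. Everything in it is an assumption for illustration: `toy_script`, `script_runs`, `crude_word_segments` and `tokenize` are made-up names, the script lookup only knows Han, Latin and Common, and the word-break step is a crude stand-in, not real UAX#29 and not ICUTokenizer's actual code.

```python
def toy_script(ch):
    """Toy script lookup (assumption: only Han, Latin and Common matter here)."""
    cp = ord(ch)
    if 0x4E00 <= cp <= 0x9FFF:          # CJK Unified Ideographs block
        return "Han"
    if ch.isascii() and ch.isalpha():
        return "Latin"
    return "Common"                     # digits, punctuation, everything else


def script_runs(text):
    """Pass 1: split on script changes; Common chars stay in the current run."""
    runs = []
    current = None
    start = 0
    for i, ch in enumerate(text):
        script = toy_script(ch)
        if script == "Common":
            continue                    # "IsSameS? = Yes": no script break
        if current is None:
            current = script
        elif script != current:
            runs.append(text[start:i])  # script break before this character
            start = i
            current = script
    if start < len(text):
        runs.append(text[start:])
    return runs


def crude_word_segments(run):
    """Pass 2 stand-in (NOT UAX#29): keep ASCII letter/digit runs together,
    break around every other character."""
    segments = []
    buf = ""
    for ch in run:
        if ch.isascii() and ch.isalnum():
            buf += ch
        else:
            if buf:
                segments.append(buf)
                buf = ""
            segments.append(ch)
    if buf:
        segments.append(buf)
    return segments


def tokenize(text):
    """Run both passes, then drop punctuation-only segments."""
    tokens = []
    for run in script_runs(text):
        for seg in crude_word_segments(run):
            if any(c.isalnum() for c in seg):
                tokens.append(seg)
    return tokens


print(script_runs("治],1"))
print(tokenize("治],1"))
```

In this sketch "治],1" stays a single script run, because "]", "," and "1" are all Common, and the second pass then keeps only "治" and "1" as tokens.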

What should we use as a dividing character for situations like this? Should
we tell our customer that they can't start keywords like this (for
searching/filtering) with a number?

Thanks,
Shawn