On 8/12/2014 6:29 PM, Steve Rowe wrote:
> Shawn,
>
> ICUTokenizer is operating as designed here.
>
> The key to understanding this is
> o.a.l.analysis.icu.segmentation.ScriptIterator.isSameScript(), called from
> ScriptIterator.next() with the scripts of two consecutive characters; these
> methods together find script boundaries. Here’s
> ScriptIterator.isSameScript():
>
>   /** Determine if two scripts are compatible. */
>   private static boolean isSameScript(int scriptOne, int scriptTwo) {
>     return scriptOne <= UScript.INHERITED || scriptTwo <= UScript.INHERITED
>         || scriptOne == scriptTwo;
>   }
>
> ASCII digits are in the Unicode script named “Common” (see
> <http://www.unicode.org/Public/6.3.0/ucd/Scripts.txt>), and UScript.COMMON
> (0) is less than UScript.INHERITED (1) (see
> <http://www.icu-project.org/~mow/ICU4JCodeCoverage/Current/com/ibm/icu/lang/UScript.html>),
> so there will be no script boundary detected between a character from an
> oriental script followed by an ASCII digit, or vice versa - the ASCII digit
> will be assigned the same script as the preceding character.
>
> See UAX#24 for more info: <http://www.unicode.org/reports/tr24/tr24-21.html>
> (that’s the Unicode 6.3.0 version, which is supported by Lucene/Solr 4.9).
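
To make the rule above concrete, here is a minimal sketch of a script-run splitter built on the same comparison. It is not Lucene's ScriptIterator: the class name ScriptRunSketch and its methods are made up for illustration, it uses the JDK's Character.UnicodeScript in place of ICU's UScript integer codes, and it assumes a run simply adopts the script of the first non-Common, non-Inherited character it sees. Any other handling the real iterator does (combining marks, for instance) is left out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not the Lucene/ICU implementation.
final class ScriptRunSketch {

    // Common and Inherited are treated as compatible with any script,
    // which is what "<= UScript.INHERITED" expresses in the quoted method.
    static boolean isNeutral(Character.UnicodeScript s) {
        return s == Character.UnicodeScript.COMMON
            || s == Character.UnicodeScript.INHERITED;
    }

    // Same rule as the quoted isSameScript(), on JDK enum values.
    static boolean isSameScript(Character.UnicodeScript a, Character.UnicodeScript b) {
        return isNeutral(a) || isNeutral(b) || a == b;
    }

    // Splits text into maximal runs of compatible scripts.
    static List<String> scriptRuns(String text) {
        List<String> runs = new ArrayList<>();
        int index = 0;
        while (index < text.length()) {
            int runStart = index;
            // Each run starts out neutral, so its first character always fits.
            Character.UnicodeScript runScript = Character.UnicodeScript.COMMON;
            while (index < text.length()) {
                int cp = text.codePointAt(index);
                Character.UnicodeScript sc = Character.UnicodeScript.of(cp);
                if (!isSameScript(runScript, sc)) {
                    break; // script boundary found
                }
                index += Character.charCount(cp);
                // Assumption: the run takes on the first concrete script it meets.
                if (isNeutral(runScript) && !isNeutral(sc)) {
                    runScript = sc;
                }
            }
            runs.add(text.substring(runStart, index));
        }
        return runs;
    }

    public static void main(String[] args) {
        // "\u653F \u6CBB" is the two Han characters from the example input.
        for (String run : scriptRuns("[\u653F \u6CBB],100foo")) {
            System.out.println(run);
        }
    }
}
```

With the input [政 治],100foo this sketch yields two runs, "[政 治],100" and "foo". The brackets, space, comma, and digits are all Common, so none of them starts a new run; only the Latin "f" following a Han run does.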

So the punctuation isn't considered break-worthy? This input:

[政 治],100foo

Becomes 政 治, 100, and foo.

Thanks,
Shawn