Re: ICUTokenizer acting very strangely with oriental characters

2014-08-14 Thread Steve Rowe
On Aug 13, 2014, at 1:53 PM, Shawn Heisey s...@elyograg.org wrote: On 8/12/2014 9:13 PM, Steve Rowe wrote: In the table below, the IsSameS (is same script) and SBreak? (script break = not IsSameS) decisions are based on what I mentioned in my previous message, and the WBreak (word break)

Re: ICUTokenizer acting very strangely with oriental characters

2014-08-13 Thread Shawn Heisey
On 8/12/2014 9:13 PM, Steve Rowe wrote: In the table below, the IsSameS (is same script) and SBreak? (script break = not IsSameS) decisions are based on what I mentioned in my previous message, and the WBreak (word break) decision is based on UAX#29 word break rules: Char  Code Point

Re: ICUTokenizer acting very strangely with oriental characters

2014-08-12 Thread Shawn Heisey
See the original message on this thread for full details. Some additional information: This happens on version 4.6.1, 4.7.2, and 4.9.0. Here is a screenshot showing the analysis problem in more detail. The first line you can see is the ICUTokenizer.

Re: ICUTokenizer acting very strangely with oriental characters

2014-08-12 Thread Rik Tamm-Daniels
mmn jnbbbjb)n9nooon Sent from my HTC - Reply message - From: Shawn Heisey s...@elyograg.org To: solr-user@lucene.apache.org solr-user@lucene.apache.org Subject: ICUTokenizer acting very strangely with oriental characters Date: Tue, Aug 12, 2014 19:00 See the original message on

Re: ICUTokenizer acting very strangely with oriental characters

2014-08-12 Thread Steve Rowe
Shawn, ICUTokenizer is operating as designed here. The key to understanding this is o.a.l.analysis.icu.segmentation.ScriptIterator.isSameScript(), called from ScriptIterator.next() with the scripts of two consecutive characters; these methods together find script boundaries. Here’s
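
To make the script-boundary idea above concrete, here is a minimal sketch, not the Lucene source. It uses the JDK's Character.UnicodeScript in place of the ICU4J script data that ICUTokenizer relies on, and it assumes that characters in the Common or Inherited scripts (spaces, punctuation, digits, combining marks) never force a break on their own. The class ScriptBoundarySketch and the method splitByScript are hypothetical names used only for illustration; isSameScript borrows the name from the message but is a stand-in written for this example.

```java
import java.lang.Character.UnicodeScript;
import java.util.ArrayList;
import java.util.List;

final class ScriptBoundarySketch {

    // Assumption: COMMON and INHERITED are treated as compatible with any
    // script, so only two different "real" scripts produce a script break.
    static boolean isSameScript(UnicodeScript one, UnicodeScript two) {
        return one == UnicodeScript.COMMON || one == UnicodeScript.INHERITED
            || two == UnicodeScript.COMMON || two == UnicodeScript.INHERITED
            || one == two;
    }

    // Splits text into runs; a new run starts wherever isSameScript is false
    // for the current run's script and the next character's script.
    static List<String> splitByScript(String text) {
        List<String> runs = new ArrayList<>();
        if (text.isEmpty()) {
            return runs;
        }
        int runStart = 0;
        UnicodeScript runScript = UnicodeScript.COMMON; // no real script seen yet
        int i = 0;
        while (i < text.length()) {
            int codePoint = text.codePointAt(i);
            UnicodeScript script = UnicodeScript.of(codePoint);
            if (!isSameScript(runScript, script)) {
                // Script break: close the current run and start a new one here.
                runs.add(text.substring(runStart, i));
                runStart = i;
                runScript = script;
            } else if (script != UnicodeScript.COMMON
                    && script != UnicodeScript.INHERITED) {
                // The run adopts the first real script it encounters.
                runScript = script;
            }
            i += Character.charCount(codePoint); // step over surrogate pairs
        }
        runs.add(text.substring(runStart));
        return runs;
    }

    public static void main(String[] args) {
        // Latin text, punctuation, two Han characters, then Latin again.
        for (String run : splitByScript("abc, \u4e2d\u6587 xyz")) {
            System.out.println("[" + run + "]");
        }
    }
}
```

In this sketch a script break only marks where one run ends and the next begins; splitting words inside each run would be a separate step (UAX#29 word break rules in the real tokenizer).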

Re: ICUTokenizer acting very strangely with oriental characters

2014-08-12 Thread Shawn Heisey
On 8/12/2014 6:29 PM, Steve Rowe wrote: Shawn, ICUTokenizer is operating as designed here. The key to understanding this is o.a.l.analysis.icu.segmentation.ScriptIterator.isSameScript(), called from ScriptIterator.next() with the scripts of two consecutive characters; these methods

Re: ICUTokenizer acting very strangely with oriental characters

2014-08-12 Thread Steve Rowe
In the table below, the IsSameS (is same script) and SBreak? (script break = not IsSameS) decisions are based on what I mentioned in my previous message, and the WBreak (word break) decision is based on UAX#29 word break rules: Char  Code Point  Script  IsSameS?  SBreak?  WBreak?