On Aug 13, 2014, at 1:53 PM, Shawn Heisey s...@elyograg.org wrote:
On 8/12/2014 9:13 PM, Steve Rowe wrote:
In the table below, the IsSameS (is same script) and SBreak? (script
break = not IsSameS) decisions are based on what I mentioned in my previous
message, and the WBreak (word break) decision is based on UAX#29 word
break rules:
Char  Code Point
See the original message on this thread for full details. Some
additional information:
This happens on versions 4.6.1, 4.7.2, and 4.9.0. Here is a screenshot
showing the analysis problem in more detail. The first line you can see
is the ICUTokenizer.
- Reply message -
From: Shawn Heisey s...@elyograg.org
To: solr-user@lucene.apache.org solr-user@lucene.apache.org
Subject: ICUTokenizer acting very strangely with oriental characters
Date: Tue, Aug 12, 2014 19:00
See the original message on
Shawn,
ICUTokenizer is operating as designed here.
The key to understanding this is
o.a.l.analysis.icu.segmentation.ScriptIterator.isSameScript(), called from
ScriptIterator.next() with the scripts of two consecutive characters; these
methods together find script boundaries. Here’s
On 8/12/2014 6:29 PM, Steve Rowe wrote:
Shawn,
ICUTokenizer is operating as designed here.
The key to understanding this is
o.a.l.analysis.icu.segmentation.ScriptIterator.isSameScript(), called from
ScriptIterator.next() with the scripts of two consecutive characters; these
methods
In the table below, the IsSameS (is same script) and SBreak? (script
break = not IsSameS) decisions are based on what I mentioned in my previous
message, and the WBreak (word break) decision is based on UAX#29 word
break rules:
Char  Code Point  Script  IsSameS?  SBreak?  WBreak?
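
To make the script-break idea concrete, here is a minimal sketch of comparing the scripts of consecutive characters and breaking where they differ. It is an illustration only, not the Lucene ScriptIterator code: it assumes the JDK's Character.UnicodeScript in place of ICU's script codes, assumes that COMMON and INHERITED characters never force a break, and the class and method names (ScriptBreakSketch, splitByScript) are made up for this example. It does not apply the UAX#29 word break rules at all.

```java
import java.lang.Character.UnicodeScript;
import java.util.ArrayList;
import java.util.List;

final class ScriptBreakSketch {

    // Assumption: COMMON and INHERITED are "neutral" and match any script.
    static boolean isNeutral(UnicodeScript s) {
        return s == UnicodeScript.COMMON || s == UnicodeScript.INHERITED;
    }

    // True when two scripts should stay in the same run (no script break).
    static boolean isSameScript(UnicodeScript a, UnicodeScript b) {
        return isNeutral(a) || isNeutral(b) || a == b;
    }

    // Splits text into runs, breaking wherever a character's script
    // differs from the script of the run it would otherwise join.
    static List<String> splitByScript(String text) {
        List<String> runs = new ArrayList<>();
        UnicodeScript runScript = null;
        int start = 0;
        int i = 0;
        while (i < text.length()) {
            int cp = text.codePointAt(i);
            UnicodeScript cur = UnicodeScript.of(cp);
            if (runScript == null) {
                runScript = cur;
            } else if (!isSameScript(runScript, cur)) {
                // Script break: close the current run and start a new one.
                runs.add(text.substring(start, i));
                start = i;
                runScript = cur;
            } else if (isNeutral(runScript)) {
                // The run so far was only neutral characters; adopt the real script.
                runScript = cur;
            }
            i += Character.charCount(cp);
        }
        if (start < text.length()) {
            runs.add(text.substring(start));
        }
        return runs;
    }

    public static void main(String[] args) {
        // Latin letters followed by two Han characters (U+4E2D, U+6587).
        System.out.println(splitByScript("abc\u4e2d\u6587"));
    }
}
```

Because neutral characters inherit the script of the run they sit in, a space or digit after Latin text stays with the Latin run, while the first Han character after it starts a new run.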