Re: ICUTokenizer acting very strangely with oriental characters
On Aug 13, 2014, at 1:53 PM, Shawn Heisey s...@elyograg.org wrote:

On 8/12/2014 9:13 PM, Steve Rowe wrote:

In the table below, the IsSameS? (is same script) and SBreak? (script break = not IsSameS) decisions are based on what I mentioned in my previous message, and the WBreak? (word break) decision is based on UAX#29 word break rules:

Char  Code Point  Script  IsSameS?  SBreak?  WBreak?
----  ----------  ------  --------  -------  -------
治    U+6CBB      Han     Yes       No       Yes
]     U+005D      Common  Yes       No       Yes
,     U+002C      Common  Yes       No       Yes
1     U+0031      Common  --        --       --

First, script boundaries are found and used as token boundaries - in the above case, no script boundary is found between 治 and 1 - and then UAX#29 word break rules are used to find token boundaries in between script boundaries. In the above case there are word boundaries between each character, but ICUTokenizer throws away punctuation-only sequences between token boundaries.

What should we use as a dividing character for situations like this? Should we tell our customer that they can't start keywords like this (for searching/filtering) with a number?

Assuming you don't want to add new features to ICUTokenizer (like treating Common script characters in ASCII as if they were in the Latin script):

1. Yes, you could tell the customer to start all Latin-script-containing keywords with a Latin script character (which ASCII digits are not; as described above, they are in the Common script).

2. You could use a separator that forces the script to become Latin, e.g. ";BBllAAhh;" (without the quotes), and then use a stop filter to remove the resulting tokens ("BBllAAhh" in this case). You'll want to choose something that won't ever occur as a meaningful token.

That's all I can think of at the moment.

Steve
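Option 2 above could look roughly like the following Solr analyzer configuration. This is a sketch only, not something from the thread: the field type name "text_icu" and the stopwords file name "separator-stopwords.txt" are illustrative, and the file would contain the chosen separator token.

```xml
<!-- Sketch of option 2: ICU tokenization followed by a stop filter
     that removes the artificial Latin-script separator token.
     Assumes separator-stopwords.txt contains the single line: bbllaahh -->
<fieldType name="text_icu" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="separator-stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

With this chain, input such as "[政 治];BBllAAhh;100foo" would carry a Latin-script run between the CJK text and the digits, forcing a script boundary, and the separator token itself would then be dropped by the stop filter.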
Re: ICUTokenizer acting very strangely with oriental characters
On 8/12/2014 9:13 PM, Steve Rowe wrote:

In the table below, the IsSameS? (is same script) and SBreak? (script break = not IsSameS) decisions are based on what I mentioned in my previous message, and the WBreak? (word break) decision is based on UAX#29 word break rules:

Char  Code Point  Script  IsSameS?  SBreak?  WBreak?
----  ----------  ------  --------  -------  -------
治    U+6CBB      Han     Yes       No       Yes
]     U+005D      Common  Yes       No       Yes
,     U+002C      Common  Yes       No       Yes
1     U+0031      Common  --        --       --

First, script boundaries are found and used as token boundaries - in the above case, no script boundary is found between 治 and 1 - and then UAX#29 word break rules are used to find token boundaries in between script boundaries. In the above case there are word boundaries between each character, but ICUTokenizer throws away punctuation-only sequences between token boundaries.

What should we use as a dividing character for situations like this? Should we tell our customer that they can't start keywords like this (for searching/filtering) with a number?

Thanks,
Shawn
Re: ICUTokenizer acting very strangely with oriental characters
See the original message on this thread for full details. Some additional information: this happens on versions 4.6.1, 4.7.2, and 4.9.0.

Here is a screenshot showing the analysis problem in more detail; the first line you can see is the ICUTokenizer:

https://www.dropbox.com/s/9wbi7lz77ivya9j/ICUTokenizer-wrong-analysis.png

The original field value was:

20世紀の100人;ポートレートアーカイブス;政治家・軍人;政治家・指導 者・軍人;[政 治],100peopeof20century,pploftwentycentury,pploftwentycentury

Thanks,
Shawn
Re: ICUTokenizer acting very strangely with oriental characters
Shawn,

ICUTokenizer is operating as designed here. The key to understanding this is o.a.l.analysis.icu.segmentation.ScriptIterator.isSameScript(), called from ScriptIterator.next() with the scripts of two consecutive characters; these methods together find script boundaries. Here's ScriptIterator.isSameScript():

  /** Determine if two scripts are compatible. */
  private static boolean isSameScript(int scriptOne, int scriptTwo) {
    return scriptOne <= UScript.INHERITED || scriptTwo <= UScript.INHERITED
        || scriptOne == scriptTwo;
  }

ASCII digits are in the Unicode script named "Common" (see http://www.unicode.org/Public/6.3.0/ucd/Scripts.txt), and UScript.COMMON (0) is less than or equal to UScript.INHERITED (1) (see http://www.icu-project.org/~mow/ICU4JCodeCoverage/Current/com/ibm/icu/lang/UScript.html), so there will be no script boundary detected between a character from an oriental script followed by an ASCII digit, or vice versa - the ASCII digit will be assigned the same script as the preceding character.

See UAX#24 for more info: http://www.unicode.org/reports/tr24/tr24-21.html (that's the Unicode 6.3.0 version, which is supported by Lucene/Solr 4.9).

Steve

On Aug 12, 2014, at 7:00 PM, Shawn Heisey s...@elyograg.org wrote:

See the original message on this thread for full details. Some additional information: this happens on versions 4.6.1, 4.7.2, and 4.9.0. Here is a screenshot showing the analysis problem in more detail; the first line you can see is the ICUTokenizer: https://www.dropbox.com/s/9wbi7lz77ivya9j/ICUTokenizer-wrong-analysis.png

The original field value was: 20世紀の100人;ポートレートアーカイブス;政治家・軍人;政治家・指導 者・軍人;[政 治],100peopeof20century,pploftwentycentury,pploftwentycentury

Thanks,
Shawn
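The effect of isSameScript() on a Han-then-digit sequence can be sketched with a small self-contained simulation. This is not ICUTokenizer itself: the COMMON and INHERITED constants mirror the ICU4J UScript values cited above (0 and 1), and HAN = 17 is the ICU4J UScript.HAN value (an assumption here, hard-coded to avoid the icu4j dependency).

```java
// Minimal self-contained sketch of ScriptIterator.isSameScript() semantics.
public class ScriptBreakSketch {
    static final int COMMON = 0;    // UScript.COMMON
    static final int INHERITED = 1; // UScript.INHERITED
    static final int HAN = 17;      // UScript.HAN (assumed ICU4J value)

    /** Common/Inherited characters are compatible with any neighboring script. */
    static boolean isSameScript(int scriptOne, int scriptTwo) {
        return scriptOne <= INHERITED || scriptTwo <= INHERITED
                || scriptOne == scriptTwo;
    }

    public static void main(String[] args) {
        // Scripts of 治 ] , 1 - Han followed by three Common characters.
        // Every adjacent pair is "same script", so no script boundary is
        // found anywhere in the run: "治],1" stays one script span.
        int[] scripts = { HAN, COMMON, COMMON, COMMON };
        for (int i = 1; i < scripts.length; i++) {
            System.out.println("script break before char " + i + ": "
                    + !isSameScript(scripts[i - 1], scripts[i]));
        }
    }
}
```

Running this prints "false" for each pair, which is why the digit 1 ends up in the same script run as the Han character even though punctuation sits between them.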
Re: ICUTokenizer acting very strangely with oriental characters
On 8/12/2014 6:29 PM, Steve Rowe wrote:

Shawn,

ICUTokenizer is operating as designed here. The key to understanding this is o.a.l.analysis.icu.segmentation.ScriptIterator.isSameScript(), called from ScriptIterator.next() with the scripts of two consecutive characters; these methods together find script boundaries. Here's ScriptIterator.isSameScript():

  /** Determine if two scripts are compatible. */
  private static boolean isSameScript(int scriptOne, int scriptTwo) {
    return scriptOne <= UScript.INHERITED || scriptTwo <= UScript.INHERITED
        || scriptOne == scriptTwo;
  }

ASCII digits are in the Unicode script named "Common" (see http://www.unicode.org/Public/6.3.0/ucd/Scripts.txt), and UScript.COMMON (0) is less than or equal to UScript.INHERITED (1) (see http://www.icu-project.org/~mow/ICU4JCodeCoverage/Current/com/ibm/icu/lang/UScript.html), so there will be no script boundary detected between a character from an oriental script followed by an ASCII digit, or vice versa - the ASCII digit will be assigned the same script as the preceding character. See UAX#24 for more info: http://www.unicode.org/reports/tr24/tr24-21.html (that's the Unicode 6.3.0 version, which is supported by Lucene/Solr 4.9).

So the punctuation isn't considered break-worthy? This input:

[政 治],100foo

becomes 政 治, 100, and foo.

Thanks,
Shawn
Re: ICUTokenizer acting very strangely with oriental characters
In the table below, the IsSameS? (is same script) and SBreak? (script break = not IsSameS) decisions are based on what I mentioned in my previous message, and the WBreak? (word break) decision is based on UAX#29 word break rules:

Char  Code Point  Script  IsSameS?  SBreak?  WBreak?
----  ----------  ------  --------  -------  -------
治    U+6CBB      Han     Yes       No       Yes
]     U+005D      Common  Yes       No       Yes
,     U+002C      Common  Yes       No       Yes
1     U+0031      Common  --        --       --

First, script boundaries are found and used as token boundaries - in the above case, no script boundary is found between 治 and 1 - and then UAX#29 word break rules are used to find token boundaries in between script boundaries. In the above case there are word boundaries between each character, but ICUTokenizer throws away punctuation-only sequences between token boundaries.

Steve
www.lucidworks.com

On Tue, Aug 12, 2014 at 9:01 PM, Shawn Heisey s...@elyograg.org wrote:

On 8/12/2014 6:29 PM, Steve Rowe wrote:

Shawn,

ICUTokenizer is operating as designed here. The key to understanding this is o.a.l.analysis.icu.segmentation.ScriptIterator.isSameScript(), called from ScriptIterator.next() with the scripts of two consecutive characters; these methods together find script boundaries. Here's ScriptIterator.isSameScript():

  /** Determine if two scripts are compatible. */
  private static boolean isSameScript(int scriptOne, int scriptTwo) {
    return scriptOne <= UScript.INHERITED || scriptTwo <= UScript.INHERITED
        || scriptOne == scriptTwo;
  }

ASCII digits are in the Unicode script named "Common" (see http://www.unicode.org/Public/6.3.0/ucd/Scripts.txt), and UScript.COMMON (0) is less than or equal to UScript.INHERITED (1) (see http://www.icu-project.org/~mow/ICU4JCodeCoverage/Current/com/ibm/icu/lang/UScript.html), so there will be no script boundary detected between a character from an oriental script followed by an ASCII digit, or vice versa - the ASCII digit will be assigned the same script as the preceding character. See UAX#24 for more info: http://www.unicode.org/reports/tr24/tr24-21.html (that's the Unicode 6.3.0 version, which is supported by Lucene/Solr 4.9).

So the punctuation isn't considered break-worthy? This input:

[政 治],100foo

becomes 政 治, 100, and foo.

Thanks,
Shawn
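The UAX#29 word-break pass described in the table above can be approximated with the JDK's own java.text.BreakIterator, which implements the same UAX#29 word boundary rules (though not necessarily against the same Unicode version as ICU4J). This is a stand-in for illustration, not ICUTokenizer's actual code path:

```java
import java.text.BreakIterator;
import java.util.Locale;

public class WordBreakSketch {
    public static void main(String[] args) {
        // No script boundary was found inside "治],1", so UAX#29 word-break
        // rules run over the whole span. BreakIterator shows those boundaries:
        // a word boundary falls between each of the four characters.
        String text = "治],1";
        BreakIterator it = BreakIterator.getWordInstance(Locale.ROOT);
        it.setText(text);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE;
                start = end, end = it.next()) {
            System.out.println("segment: \"" + text.substring(start, end) + "\"");
        }
        // ICUTokenizer then discards the punctuation-only segments ("]" and ","),
        // keeping only segments containing letters or digits.
    }
}
```

This makes the two-pass behavior concrete: the script pass decided "治],1" is one span, and the word-break pass then segments it, with punctuation-only segments thrown away.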