Re: ICUTokenizer acting very strangely with oriental characters

2014-08-14 Thread Steve Rowe

On Aug 13, 2014, at 1:53 PM, Shawn Heisey s...@elyograg.org wrote:

 On 8/12/2014 9:13 PM, Steve Rowe wrote:
 In the table below, the IsSameS (is same script) and SBreak? (script
 break = not IsSameS) decisions are based on what I mentioned in my previous
 message, and the WBreak (word break) decision is based on UAX#29 word
 break rules:
 
 Char   Code Point   Script   IsSameS?   SBreak?   WBreak?
 ----   ----------   ------   --------   -------   -------
 治     U+6CBB       Han      Yes        No        Yes
 ]      U+005D       Common   Yes        No        Yes
 ,      U+002C       Common   Yes        No        Yes
 1      U+0031       Common   --         --        --
 
 First, script boundaries are found and used as token boundaries; in the
 above case, no script boundary is found between 治 and 1. Then UAX#29 word
 break rules are used to find token boundaries between the script
 boundaries; in the above case, there are word boundaries between each
 character, but ICUTokenizer throws away punctuation-only sequences between
 token boundaries.
 
 What should we use as a dividing character for situations like this? 
 Should we tell our customer that they can't start keywords like this
 (for searching/filtering) with a number?

Assuming you don’t want to add new features to ICUTokenizer (like maybe 
treating Common script chars in ASCII as if they were in the Latin script):

1. Yes, you could tell the customer to start all Latin-script-containing 
keywords with a Latin script character (which ASCII digits are not; as 
described above, they are in the Common script).

2. You could use a separator that forces the script to become Latin, e.g. 
“;BBllAAhh;” (w/o the quotes), and then use a stop filter to remove the 
resulting token (“BBllAAhh” in this case) - you’ll want to choose something 
that won’t ever occur as a meaningful token.
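
A rough sketch of option 2 at the Lucene level (untested; it assumes the 4.x-era 
API where ICUTokenizer takes a Reader and StopFilter/CharArraySet still take a 
Version, and the class name and the lower-cased placeholder "bbllaahh" are just 
for illustration - in a Solr schema you would do the same thing with 
solr.StopFilterFactory and a stopwords file):

import java.io.StringReader;
import java.util.Arrays;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.icu.segmentation.ICUTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.util.CharArraySet;
import org.apache.lucene.util.Version;

public class SeparatorWorkaroundSketch {
  public static void main(String[] args) throws Exception {
    // Hypothetical input: the placeholder ";bbllaahh;" sits between the
    // CJK keyword and the digit-initial keyword, so the digits no longer
    // attach to the preceding Han script run.
    String input = "[政 治];bbllaahh;100foo";

    // 4.x-era API: ICUTokenizer still takes a Reader directly.
    ICUTokenizer tokenizer = new ICUTokenizer(new StringReader(input));

    // Remove the placeholder token after tokenization.  The 4.x-era
    // constructors take a Version argument; newer Lucene drops it.
    CharArraySet stopWords =
        new CharArraySet(Version.LUCENE_CURRENT, Arrays.asList("bbllaahh"), true);
    TokenStream stream = new StopFilter(Version.LUCENE_CURRENT, tokenizer, stopWords);

    CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
    stream.reset();
    while (stream.incrementToken()) {
      System.out.println(term.toString());
    }
    stream.end();
    stream.close();
  }
}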

That’s all I can think of ATM.

Steve

Re: ICUTokenizer acting very strangely with oriental characters

2014-08-13 Thread Shawn Heisey
On 8/12/2014 9:13 PM, Steve Rowe wrote:
 In the table below, the IsSameS (is same script) and SBreak? (script
 break = not IsSameS) decisions are based on what I mentioned in my previous
 message, and the WBreak (word break) decision is based on UAX#29 word
 break rules:

 Char   Code Point   Script   IsSameS?   SBreak?   WBreak?
 ----   ----------   ------   --------   -------   -------
 治     U+6CBB       Han      Yes        No        Yes
 ]      U+005D       Common   Yes        No        Yes
 ,      U+002C       Common   Yes        No        Yes
 1      U+0031       Common   --         --        --

 First, script boundaries are found and used as token boundaries; in the
 above case, no script boundary is found between 治 and 1. Then UAX#29 word
 break rules are used to find token boundaries between the script
 boundaries; in the above case, there are word boundaries between each
 character, but ICUTokenizer throws away punctuation-only sequences between
 token boundaries.

What should we use as a dividing character for situations like this? 
Should we tell our customer that they can't start keywords like this
(for searching/filtering) with a number?

Thanks,
Shawn



Re: ICUTokenizer acting very strangely with oriental characters

2014-08-12 Thread Shawn Heisey
See the original message on this thread for full details.  Some
additional information:

This happens on version 4.6.1, 4.7.2, and 4.9.0.  Here is a screenshot
showing the analysis problem in more detail.  The first line you can see
is the ICUTokenizer.

https://www.dropbox.com/s/9wbi7lz77ivya9j/ICUTokenizer-wrong-analysis.png

The original field value was:

20世紀の100人;ポートレートアーカイブス;政治家・軍人;政治家・指導
者・軍人;[政 治],100peopeof20century,pploftwentycentury,pploftwentycentury

Thanks,
Shawn



Re: ICUTokenizer acting very strangely with oriental characters

2014-08-12 Thread Steve Rowe
Shawn,

ICUTokenizer is operating as designed here.  

The key to understanding this is 
o.a.l.analysis.icu.segmentation.ScriptIterator.isSameScript(), called from 
ScriptIterator.next() with the scripts of two consecutive characters; these 
methods together find script boundaries.  Here’s ScriptIterator.isSameScript():

  /** Determine if two scripts are compatible. */
  private static boolean isSameScript(int scriptOne, int scriptTwo) {
    return scriptOne <= UScript.INHERITED || scriptTwo <= UScript.INHERITED
        || scriptOne == scriptTwo;
  }

ASCII digits are in the Unicode script named “Common” (see 
http://www.unicode.org/Public/6.3.0/ucd/Scripts.txt), and UScript.COMMON (0) 
is less than UScript.INHERITED (1) (see 
http://www.icu-project.org/~mow/ICU4JCodeCoverage/Current/com/ibm/icu/lang/UScript.html),
 so there will be no script boundary detected between a character from an 
oriental script followed by an ASCII digit, or vice versa - the ASCII digit 
will be assigned the same script as the preceding character.

See UAX#24 for more info: http://www.unicode.org/reports/tr24/tr24-21.html 
(that’s the Unicode 6.3.0 version, which is supported by Lucene/Solr 4.9).
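
If it helps to see those script assignments directly, here’s a small ICU4J 
sketch (the class name is just for illustration; it only needs 
com.ibm.icu.lang.UScript, which the ICU analysis module already depends on):

import com.ibm.icu.lang.UScript;

public class ScriptCheck {
  public static void main(String[] args) {
    // The four characters from the example: a Han ideograph, ']', ',',
    // and an ASCII digit.
    int[] codePoints = { 0x6CBB, 0x005D, 0x002C, 0x0031 };
    for (int cp : codePoints) {
      int script = UScript.getScript(cp);
      System.out.printf("U+%04X  %s  script=%s (%d)%n",
          cp, new String(Character.toChars(cp)), UScript.getName(script), script);
    }
    // Mirrors ScriptIterator.isSameScript(): scripts numbered at or below
    // UScript.INHERITED never force a script break on their own.
    System.out.println("UScript.COMMON=" + UScript.COMMON
        + "  UScript.INHERITED=" + UScript.INHERITED);
  }
}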

Steve
 

On Aug 12, 2014, at 7:00 PM, Shawn Heisey s...@elyograg.org wrote:

 See the original message on this thread for full details.  Some
 additional information:
 
 This happens on version 4.6.1, 4.7.2, and 4.9.0.  Here is a screenshot
 showing the analysis problem in more detail.  The first line you can see
 is the ICUTokenizer.
 
 https://www.dropbox.com/s/9wbi7lz77ivya9j/ICUTokenizer-wrong-analysis.png
 
 The original field value was:
 
 20世紀の100人;ポートレートアーカイブス;政治家・軍人;政治家・指導
 者・軍人;[政 治],100peopeof20century,pploftwentycentury,pploftwentycentury
 
 Thanks,
 Shawn
 



Re: ICUTokenizer acting very strangely with oriental characters

2014-08-12 Thread Shawn Heisey
On 8/12/2014 6:29 PM, Steve Rowe wrote:
 Shawn,
 
 ICUTokenizer is operating as designed here.  
 
 The key to understanding this is 
 o.a.l.analysis.icu.segmentation.ScriptIterator.isSameScript(), called from 
 ScriptIterator.next() with the scripts of two consecutive characters; these 
 methods together find script boundaries.  Here’s 
 ScriptIterator.isSameScript():
 
   /** Determine if two scripts are compatible. */
   private static boolean isSameScript(int scriptOne, int scriptTwo) {
     return scriptOne <= UScript.INHERITED || scriptTwo <= UScript.INHERITED
         || scriptOne == scriptTwo;
   }
 
 ASCII digits are in the Unicode script named “Common” (see 
 http://www.unicode.org/Public/6.3.0/ucd/Scripts.txt), and UScript.COMMON 
 (0) is less than UScript.INHERITED (1) (see 
 http://www.icu-project.org/~mow/ICU4JCodeCoverage/Current/com/ibm/icu/lang/UScript.html),
  so there will be no script boundary detected between a character from an 
 oriental script followed by an ASCII digit, or vice versa - the ASCII digit 
 will be assigned the same script as the preceding character.
 
 See UAX#24 for more info: http://www.unicode.org/reports/tr24/tr24-21.html 
 (that’s the Unicode 6.3.0 version, which is supported by Lucene/Solr 4.9).

So the punctuation isn't considered break-worthy?

This input:

[政 治],100foo

Becomes 政 治, 100, and foo.

Thanks,
Shawn



Re: ICUTokenizer acting very strangely with oriental characters

2014-08-12 Thread Steve Rowe
In the table below, the IsSameS (is same script) and SBreak? (script
break = not IsSameS) decisions are based on what I mentioned in my previous
message, and the WBreak (word break) decision is based on UAX#29 word
break rules:

Char   Code Point   Script   IsSameS?   SBreak?   WBreak?
----   ----------   ------   --------   -------   -------
治     U+6CBB       Han      Yes        No        Yes
]      U+005D       Common   Yes        No        Yes
,      U+002C       Common   Yes        No        Yes
1      U+0031       Common   --         --        --

First, script boundaries are found and used as token boundaries; in the
above case, no script boundary is found between 治 and 1. Then UAX#29 word
break rules are used to find token boundaries between the script
boundaries; in the above case, there are word boundaries between each
character, but ICUTokenizer throws away punctuation-only sequences between
token boundaries.
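
For anyone following along at the Lucene level, here’s a small sketch 
(untested, assuming the 4.x ICUTokenizer(Reader) constructor; the class name 
is just for illustration) that prints exactly which tokens ICUTokenizer emits 
for Shawn’s test input, with offsets and token types, so the script/word-break 
behavior described above can be inspected directly:

import java.io.StringReader;

import org.apache.lucene.analysis.icu.segmentation.ICUTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;

public class InspectBreaks {
  public static void main(String[] args) throws Exception {
    // Shawn's test input from earlier in the thread.
    ICUTokenizer tokenizer = new ICUTokenizer(new StringReader("[政 治],100foo"));

    CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
    OffsetAttribute offset = tokenizer.addAttribute(OffsetAttribute.class);
    TypeAttribute type = tokenizer.addAttribute(TypeAttribute.class);

    tokenizer.reset();
    while (tokenizer.incrementToken()) {
      // One line per emitted token: text, character offsets, and token type.
      // Punctuation-only sequences never show up here because they are
      // discarded, as described above.
      System.out.printf("%s [%d,%d] %s%n",
          term.toString(), offset.startOffset(), offset.endOffset(), type.type());
    }
    tokenizer.end();
    tokenizer.close();
  }
}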

Steve
www.lucidworks.com


On Tue, Aug 12, 2014 at 9:01 PM, Shawn Heisey s...@elyograg.org wrote:

 On 8/12/2014 6:29 PM, Steve Rowe wrote:
  Shawn,
 
  ICUTokenizer is operating as designed here.
 
  The key to understanding this is
 o.a.l.analysis.icu.segmentation.ScriptIterator.isSameScript(), called from
 ScriptIterator.next() with the scripts of two consecutive characters; these
 methods together find script boundaries.  Here’s
 ScriptIterator.isSameScript():
 
    /** Determine if two scripts are compatible. */
    private static boolean isSameScript(int scriptOne, int scriptTwo) {
      return scriptOne <= UScript.INHERITED || scriptTwo <= UScript.INHERITED
          || scriptOne == scriptTwo;
    }
 
  ASCII digits are in the Unicode script named “Common” (see 
 http://www.unicode.org/Public/6.3.0/ucd/Scripts.txt), and UScript.COMMON
 (0) is less than UScript.INHERITED (1) (see 
 http://www.icu-project.org/~mow/ICU4JCodeCoverage/Current/com/ibm/icu/lang/UScript.html),
 so there will be no script boundary detected between a character from an
 oriental script followed by an ASCII digit, or vice versa - the ASCII digit
 will be assigned the same script as the preceding character.
 
  See UAX#24 for more info: 
 http://www.unicode.org/reports/tr24/tr24-21.html (that’s the Unicode
 6.3.0 version, which is supported by Lucene/Solr 4.9).

 So the punctuation isn't considered break-worthy?

 This input:

 [政 治],100foo

 Becomes 政 治, 100, and foo.

 Thanks,
 Shawn