Re: Problem with Han character in ICUFoldingFilter

2016-10-30 Thread Steve Rowe
Among several other foldings, ICUFoldingFilter performs the Unicode NFC 
transform, which consists of canonical decomposition (NFD) followed by 
canonical composition.  NFD transforms U+FA04 to U+5B85, and canonical 
composition leaves U+5B85 as-is.
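
The canonical mapping described above can be checked outside Lucene.  The 
following is a minimal sketch, assuming Python's standard unicodedata module 
as a stand-in for ICU's normalizer.  It only exercises the NFD/NFC step, not 
the other foldings ICUFoldingFilter applies, and the variable names are 
illustrative.

```python
import unicodedata

compat = "\uFA04"   # CJK COMPATIBILITY IDEOGRAPH-FA04
unified = "\u5B85"  # the unified ideograph it decomposes to

# NFD applies the canonical decomposition; NFC decomposes and then recomposes.
nfd = unicodedata.normalize("NFD", compat)
nfc = unicodedata.normalize("NFC", compat)

print(f"NFD: U+{ord(nfd):04X}")  # NFD: U+5B85
print(f"NFC: U+{ord(nfc):04X}")  # NFC: U+5B85
```

Because U+5B85 has no decomposition of its own, the composition step cannot 
map it back to U+FA04, so the distinction is lost after normalization.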

U+FA04 is in the “Pronunciation variants from KS X 1001:1998” sub-block (KS X 
1001 is a Korean encoding standard) of the “CJK Compatibility Ideographs” 
block.  I don’t know why these 
variants were included in Unicode, but the NFD transform includes the 
compatibility->canonical transform, so it’s likely many other compatibility 
characters in your data will be affected, not just this one.  If the 
compatibility->canonical transform is problematic, why are you using 
ICUFoldingFilter?
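
To get a rough idea of how many other characters are affected, one could 
scan the CJK Compatibility Ideographs block (U+F900..U+FAFF) and list the 
code points that NFC changes.  This is a sketch only: it again assumes 
Python's unicodedata module rather than ICU, the helper name changed_by_nfc 
is made up for illustration, and the exact count depends on the Unicode 
version bundled with the interpreter.

```python
import unicodedata

def changed_by_nfc(start, end):
    """Return the characters in [start, end] that NFC maps to something else."""
    return [
        chr(cp)
        for cp in range(start, end + 1)
        if unicodedata.normalize("NFC", chr(cp)) != chr(cp)
    ]

# CJK Compatibility Ideographs block
affected = changed_by_nfc(0xF900, 0xFAFF)
print(len(affected), "code points in U+F900..U+FAFF are changed by NFC")

# Show a few of the mappings as code points.
for ch in affected[:5]:
    target = unicodedata.normalize("NFC", ch)
    mapped = " ".join(f"U+{ord(c):04X}" for c in target)
    print(f"U+{ord(ch):04X} -> {mapped}")
```

Not every code point in the block is affected: a handful, such as U+FA0E, 
are treated as unified ideographs and have no decomposition.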

If you like some of the foldings included in ICUFoldingFilter but not others, 
check out the “gennorm2” and “gen-utr30-data-files” targets in the Lucene/Solr 
source code at lucene/analysis/icu/build.xml.  You could build and use a 
modified binary transform data file; the stock file is distributed as part of 
the lucene-analyzers-icu jar at org/apache/lucene/analysis/icu/utr30.nrm.
 
--
Steve
www.lucidworks.com

> On Oct 30, 2016, at 10:29 AM, Ahmet Arslan wrote:
> 
> Hi Eyal,
> 
> ICUFoldingFilter uses http://site.icu-project.org under the hood.
> If you think there is a bug, it is better to ask its mailing list.
> 
> Ahmet
> 
> 
> 
> On Sunday, October 30, 2016 3:41 PM, "eyal.naam...@exlibrisgroup.com" wrote:
> Hi,
> 
> I was wondering if anyone ran into the following issue, or a similar one:
> In Han script there are two separate characters - 宅 (FA04) and 宅 (5B85).
> It seems that ICUFoldingFilter converts FA04 to 5B85, which results in the 
> wrong character being indexed.
> Does anyone have any idea if and how this can be resolved? Is there an option 
> to add an exception rule to ICUFoldingFilter?
> Thanks,
> Eyal



Re: Problem with Han character in ICUFoldingFilter

2016-10-30 Thread Ahmet Arslan
Hi Eyal,

ICUFoldingFilter uses http://site.icu-project.org under the hood.
If you think there is a bug, it is better to ask its mailing list.

Ahmet



On Sunday, October 30, 2016 3:41 PM, "eyal.naam...@exlibrisgroup.com" wrote:
Hi,

I was wondering if anyone ran into the following issue, or a similar one:
In Han script there are two separate characters - 宅 (FA04) and 宅 (5B85).
It seems that ICUFoldingFilter converts FA04 to 5B85, which results in the 
wrong character being indexed.
Does anyone have any idea if and how this can be resolved? Is there an option 
to add an exception rule to ICUFoldingFilter?
Thanks,
Eyal


Problem with Han character in ICUFoldingFilter

2016-10-30 Thread eyal.naam...@exlibrisgroup.com
Hi,

I was wondering if anyone ran into the following issue, or a similar one:
In Han script there are two separate characters - 宅 (FA04) and 宅 (5B85).
It seems that ICUFoldingFilter converts FA04 to 5B85, which results in the 
wrong character being indexed.
Does anyone have any idea if and how this can be resolved? Is there an option 
to add an exception rule to ICUFoldingFilter?
Thanks,
Eyal