RE: ICUTransformFilter with traditional to simplified Chinese

2017-12-19 Thread Eyal Naamati
Thanks!
I actually did read the Stanford posts when we implemented our index; it was 
very helpful!

-Original Message-
From: Shawn Heisey [mailto:apa...@elyograg.org] 
Sent: Tuesday, December 19, 2017 1:31 AM
To: solr-user@lucene.apache.org
Subject: Re: ICUTransformFilter with traditional to simplified Chinese

On 12/18/2017 9:49 AM, Eyal Naamati wrote:
> We are using the ICUTransformFilter to normalize traditional Chinese text to 
> simplified Chinese.
> We received feedback from some of our Chinese customers that there are some 
> traditional characters that are not converted to their simplified variants.
> For example:
> "眞" should be converted to "真"
> "硏" should be converted to "研"
> "夲" should be converted to "本"
>
> Does anyone know if this is indeed a problem with the filter?
> Or if there are other options to use instead of this filter that handle more 
> characters?

I have one index for a website we built for a customer in Japan.  While 
researching how to effectively handle CJK characters, I came across an entire 
series of blog posts.  Here's the first post; you can check the same blog for 
the rest of the series.  There are a lot of them:

http://discovery-grindstone.blogspot.com/2013/10/cjk-with-solr-for-libraries-part-1.html

One of the filters that Stanford utilized (and we also implemented) is a custom 
filter that they wrote, apparently specifically because there are things that 
the ICU filters included with Lucene do not catch.

https://github.com/sul-dlss/CJKFoldingFilter

Looking into the code for the custom filter and checking your first example, 
this filter actually seems to go in the reverse direction -- it converts 真 to 
眞.  I did not look into the other examples, and I'm completely clueless about 
CJK characters, so I don't know what those characters are or what the correct 
action would be.

That third-party custom filter would probably be helpful to you.  Even though 
it goes in the reverse direction for your first example, as long as the 
behavior at index time and query time is the same, you should still get 
matches.  End users would most likely never see the results of the analysis.

Whether or not the behavior you've noticed is a bug with ICUTransformFilter is 
a question that I cannot answer.  If it is, then the bug will be in ICU, not 
Lucene.

http://lucene.apache.org/core/7_1_0/analyzers-icu/org/apache/lucene/analysis/icu/ICUTransformFilter.html

Thanks,
Shawn



ICUTransformFilter with traditional to simplified Chinese

2017-12-18 Thread Eyal Naamati
Hi All,
We are using the ICUTransformFilter to normalize traditional Chinese text to 
simplified Chinese.
We received feedback from some of our Chinese customers that there are some 
traditional characters that are not converted to their simplified variants.
For example:
"眞" should be converted to "真"
"硏" should be converted to "研"
"夲" should be converted to "本"

Does anyone know if this is indeed a problem with the filter?
Or if there are other options to use instead of this filter that handle more 
characters?

Thanks for any feedback
Eyal
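
For reference, a field type using this transform typically looks something 
like the sketch below. The field and tokenizer choices here are assumptions 
for illustration; the "Traditional-Simplified" id is a standard ICU transform:

```xml
<fieldType name="text_zh" class="solr.TextField">
  <analyzer>
    <!-- ICUTokenizerFactory handles CJK word segmentation -->
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <!-- Fold traditional Chinese to simplified via the ICU transform -->
    <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
  </analyzer>
</fieldType>
```

The same chain must run at both index and query time so that terms match 
regardless of which script the user types.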



RE: multi term analyzer error

2015-12-31 Thread Eyal Naamati
Hi Erick,

Thanks for the detailed response!
My use case is exactly as you described in your 'Eric*' example. Our index and 
query analyzers replace a dash "-" with an underscore "_". So when a user tries 
to search for something that has a dash in it, and the query has a wildcard 
(for example Eyal-Naa*), he doesn't find anything even if the term exists. The 
reason is, as you said, that solr uses a different analyzer to analyze wildcard 
queries. 
Our solution was to add a multiterm analyzer that does the same thing as the 
query analyzer - replace the dash with an underscore. This does solve the 
issue, even though PatternReplaceCharFilterFactory does not implement the 
MultiTermAwareComponent interface.
But adding the new analyzer causes a new problem, and I don't think it is 
related to the PatternReplaceCharFilterFactory.
When an empty wildcard query is sent, such as just "*" to query the whole 
index, there is a failure with "analyzer returned no terms for multiTerm term 
*".
These queries do work for the default analyzer so I guess there is a way to 
handle them.
Thanks!
Eyal

Eyal Naamati
Alma Developer
Tel: +972-2-6499313
Mobile: +972-547915255
eyal.naam...@exlibrisgroup.com

www.exlibrisgroup.com
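
The dash-to-underscore rule described above can be sketched outside Solr (a 
minimal illustration of the normalization, not the actual filter code; the 
function name is made up):

```python
import re

def normalize(term: str) -> str:
    # Mirror the dash -> underscore replacement applied by the analyzers
    return re.sub(r"-", "_", term)

# Index-time and query-time analysis must agree, including the literal
# prefix of a wildcard query such as "Eyal-Naa*":
print(normalize("Eyal-Naamati"))    # Eyal_Naamati
print(normalize("Eyal-Naa") + "*")  # Eyal_Naa*
```

If the wildcard (multiterm) chain skips this step while the index chain 
applies it, the indexed term "Eyal_Naamati" can never match the prefix 
"Eyal-Naa*", which is exactly the mismatch described above.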

-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: Wednesday, December 30, 2015 6:42 PM
To: solr-user
Subject: Re: multi term analyzer error

Right, you may be one of the few people to actually implement your own 
multiTerm analyzer function despite the fact that this has been in the code for 
years!

If you look at the factories, you can see which ones implement the 
"MultiTermAwareComponent" interface, and PatternReplaceCharFilterFactory does 
_not_. Thus it can't be used in a multiTerm analysis chain.

A bit of background here. The whole "MultiTermAwareComponent" was implemented 
to handle simple cases that were causing endless questions. For instance, 
anything with a wildcard would do no analysis. Thus people would define a field 
with, say, LowerCaseFilterFactory and then ask "Why don't we find 'Eric*'  when 
Erick is in the field?" The answer was that "wildcard terms are not sent 
through the analysis chain, you have to do those kinds of transformations in 
the client." This was not terribly satisfactory...

There are various sound reasons why "doing the right thing" with wildcards is 
very hard in the general case for a filter that breaks a single token into two 
or more tokens. Any filter that generates two or more tokens is impossible to 
get right. Should both tokens be wildcards? The first? The second? Neither? 
Any decision is the wrong decision. And don't even get me started on something 
like Ngrams or Shingles.

OK, finally answering your question. The only filters that are multi-term aware 
are ones that are _guaranteed_ to produce one and only one token from any input 
token.
PatternReplaceFilterFactory cannot honor that contract, so I'm pretty sure 
that's what's causing your error. Assuming the substitutions you're doing 
would work on the whole string, you might be able to use 
PatternReplaceCharFilterFactory instead, since that operates on the whole 
input string rather than on the tokens and thus could be used.
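
A multiterm chain along those lines might look like the following sketch; the 
tokenizer and lowercase filter are assumptions, while the char filter and its 
pattern come from the configuration discussed in this thread:

```xml
<analyzer type="multiterm">
  <!-- Runs on the whole input string, before tokenization -->
  <charFilter class="solr.PatternReplaceCharFilterFactory"
              pattern="\-" replacement="\_"/>
  <tokenizer class="solr.KeywordTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
```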

But I have to ask: why are you implementing a multiTerm analyzer? What is the 
use case you're trying to solve? Because from your example, it looks like 
you're trying to search over a string-type (untokenized) input, and if so this 
is not the right approach at all.

Best,
Erick

On Tue, Dec 29, 2015 at 10:16 PM, Eyal  Naamati 
<eyal.naam...@exlibrisgroup.com> wrote:
> Hi Ahmet,
> Yes there is a space in my example.
> This is my multiterm analyzer:
>
> <analyzer type="multiterm">
>   <charFilter class="solr.PatternReplaceCharFilterFactory"
>     pattern="\-" replacement="\_" />
> </analyzer>
>
> Thanks!
>
> Eyal Naamati
> Alma Developer
> Tel: +972-2-6499313
> Mobile: +972-547915255
> eyal.naam...@exlibrisgroup.com
>
> www.exlibrisgroup.com
>
> -Original Message-
> From: Ahmet Arslan [mailto:iori...@yahoo.com.INVALID]
> Sent: Tuesday, December 29, 2015 5:18 PM
> To: solr-user@lucene.apache.org
> Subject: Re: multi term analyzer error
>
> Hi Eyal,
>
> What is your analyzer definition for multi-term?
> In your example, is the star character separated from the term by a space?
>
>
> Ahmet
>
> On Tuesday, December 29, 2015 3:26 PM, Eyal Naamati 
> <eyal.naam...@exlibrisgroup.com> wrote:
>
>
>
>
> Hi,
>
I defined a multi-term analyzer to my analysis chain, and it works as I 
expect. However, for some queries (for example '*' or 'term *') I get an 
exception "analyzer returned no terms for multiTerm term". These queries work 
when I don't customize a multi-term analyzer.
My question: is there a way to handle this in the analyzer configuration (in 
my schema.xml)? I realize that I can also change the query I am sending the 
analyzer, but that is difficult for me since there are many places in our 
program that use this.
Thanks!

RE: multi term analyzer error

2015-12-29 Thread Eyal Naamati
Hi Ahmet,
Yes there is a space in my example.
This is my multiterm analyzer:

<analyzer type="multiterm">
  <charFilter class="solr.PatternReplaceCharFilterFactory"
    pattern="\-" replacement="\_" />
</analyzer>

Thanks!

Eyal Naamati
Alma Developer
Tel: +972-2-6499313
Mobile: +972-547915255
eyal.naam...@exlibrisgroup.com

www.exlibrisgroup.com

-Original Message-
From: Ahmet Arslan [mailto:iori...@yahoo.com.INVALID] 
Sent: Tuesday, December 29, 2015 5:18 PM
To: solr-user@lucene.apache.org
Subject: Re: multi term analyzer error

Hi Eyal,

What is your analyzer definition for multi-term?
In your example, is the star character separated from the term by a space?


Ahmet

On Tuesday, December 29, 2015 3:26 PM, Eyal Naamati 
<eyal.naam...@exlibrisgroup.com> wrote:




Hi,
 
I defined a multi-term analyzer to my analysis chain, and it works as I expect. 
However, for some queries (for example '*' or 'term *') I get an exception 
"analyzer returned no terms for multiTerm term". These queries work when I 
don't customize a multi-term analyzer.
My question: is there a way to handle this in the analyzer configuration (in my 
schema.xml)? I realize that I can also change the query I am sending the 
analyzer, but that is difficult for me since there are many places in our 
program that use this.
Thanks!
 
Eyal Naamati
Alma Developer
Tel: +972-2-6499313
Mobile: +972-547915255
eyal.naam...@exlibrisgroup.com

www.exlibrisgroup.com


multi term analyzer error

2015-12-29 Thread Eyal Naamati
Hi,

I defined a multi-term analyzer to my analysis chain, and it works as I expect. 
However, for some queries (for example '*' or 'term *') I get an exception 
"analyzer returned no terms for multiTerm term". These queries work when I 
don't customize a multi-term analyzer.
My question: is there a way to handle this in the analyzer configuration (in my 
schema.xml)? I realize that I can also change the query I am sending the 
analyzer, but that is difficult for me since there are many places in our 
program that use this.
Thanks!

Eyal Naamati
Alma Developer
Tel: +972-2-6499313
Mobile: +972-547915255
eyal.naam...@exlibrisgroup.com
www.exlibrisgroup.com



RE: Korean script conversion

2015-04-15 Thread Eyal Naamati
Trying again since I don't have an answer yet.
Thanks!

Eyal Naamati
Alma Developer
Tel: +972-2-6499313
Mobile: +972-547915255
eyal.naam...@exlibrisgroup.com
www.exlibrisgroup.com

From: Eyal Naamati
Sent: Sunday, March 29, 2015 7:52 AM
To: solr-user@lucene.apache.org
Subject: Korean script conversion

Hi,

We are starting to index records in Korean. Korean text can be written in two 
scripts: Han characters (Chinese) and Hangul characters (Korean).
We are looking for some solr filter or another built in solr component that 
converts between Han and Hangul characters (transliteration).
I know there is the ICUTransformFilterFactory that can convert between Japanese 
or chinese scripts, for example:
<filter class="solr.ICUTransformFilterFactory" id="Katakana-Hiragana"/> for 
Japanese script conversions
So far I couldn't find anything readymade for Korean scripts, but perhaps 
someone knows of one?

Thanks!
Eyal Naamati
Alma Developer
Tel: +972-2-6499313
Mobile: +972-547915255
eyal.naam...@exlibrisgroup.com
www.exlibrisgroup.com



RE: Korean script conversion

2015-03-30 Thread Eyal Naamati
We only want the Hanja-to-Hangul conversion: for each Hanja character there 
exists only one Hangul character that can replace it in a Korean text.
The other direction is not convertible.
We want to allow searching in both scripts and find matches in both scripts.
Thanks

Eyal Naamati
Alma Developer
Tel: +972-2-6499313
Mobile: +972-547915255
eyal.naam...@exlibrisgroup.com

www.exlibrisgroup.com
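
Given the one-way, one-to-one mapping described above, a Hanja-to-Hangul 
folding step (applied identically at index and query time) could be sketched 
as a lookup table. The table entries and function name below are illustrative 
assumptions, not a shipped Solr component; a real table would cover the full 
set of Hanja used in Korean text:

```python
# Minimal one-way Hanja -> Hangul folding via a lookup table.
HANJA_TO_HANGUL = {
    "韓": "한",  # han
    "國": "국",  # guk
}

def fold_hanja(text: str) -> str:
    # Replace each Hanja character with its single Hangul reading;
    # pass all other characters through unchanged.
    return "".join(HANJA_TO_HANGUL.get(ch, ch) for ch in text)

print(fold_hanja("韓國"))  # 한국
```

Because folding only ever runs Hanja -> Hangul, text already written in 
Hangul passes through untouched, so both scripts normalize to the same 
indexed form and match each other.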

-Original Message-
From: Benson Margulies [mailto:bimargul...@gmail.com] 
Sent: Monday, March 30, 2015 1:58 PM
To: solr-user
Subject: Re: Korean script conversion

Why do you think that this is a good idea? Hanja are used for special 
purposes; they are not trivially convertible to Hangul due to ambiguity, and 
it's not at all clear that a typical search user wants to treat them as 
equivalent.

On Sun, Mar 29, 2015 at 1:52 AM, Eyal Naamati <eyal.naam...@exlibrisgroup.com> 
wrote:

  Hi,



 We are starting to index records in Korean. Korean text can be written 
 in two scripts: Han characters (Chinese) and Hangul characters (Korean).

 We are looking for some solr filter or another built in solr component 
 that converts between Han and Hangul characters (transliteration).

 I know there is the ICUTransformFilterFactory that can convert between 
 Japanese or chinese scripts, for example:

 <filter class="solr.ICUTransformFilterFactory" id="Katakana-Hiragana"/> for 
 Japanese script conversions

 So far I couldn't find anything readymade for Korean scripts, but 
 perhaps someone knows of one?



 Thanks!

 Eyal Naamati
 Alma Developer
 Tel: +972-2-6499313
 Mobile: +972-547915255
 eyal.naam...@exlibrisgroup.com
 www.exlibrisgroup.com





Korean script conversion

2015-03-28 Thread Eyal Naamati
Hi,

We are starting to index records in Korean. Korean text can be written in two 
scripts: Han characters (Chinese) and Hangul characters (Korean).
We are looking for some solr filter or another built in solr component that 
converts between Han and Hangul characters (transliteration).
I know there is the ICUTransformFilterFactory that can convert between Japanese 
or chinese scripts, for example:
<filter class="solr.ICUTransformFilterFactory" id="Katakana-Hiragana"/> for 
Japanese script conversions
So far I couldn't find anything readymade for Korean scripts, but perhaps 
someone knows of one?

Thanks!
Eyal Naamati
Alma Developer
Tel: +972-2-6499313
Mobile: +972-547915255
eyal.naam...@exlibrisgroup.com
www.exlibrisgroup.com