Re: How are people using the ICUTokenizer?

2017-06-20 Thread Alexandre Rafalovitch
I used it in a demo where I searched for Thai words using approximate English sound equivalents: https://github.com/arafalov/solr-thai-test/blob/master/collection1/conf/schema.xml#L34 I thought that was pretty cool and unexpectedly powerful :-) Regards, Alex. http://www.solr-start.com/

Re: How are people using the ICUTokenizer?

2017-06-20 Thread Joel Bernstein
el.da...@nih.gov] > Sent: Tuesday, June 20, 2017 12:02 PM > To: solr-user@lucene.apache.org > Subject: RE: How are people using the ICUTokenizer? > > Joel, > > I think the issue is doing word-breaking according to ICU rules. So, if > you are trying to make sure your index breaks words

RE: How are people using the ICUTokenizer?

2017-06-20 Thread Allison, Timothy B.
g in 6.6. > use the ICUNormalizer I could not agree with this more. -Original Message- From: Davis, Daniel (NIH/NLM) [C] [mailto:daniel.da...@nih.gov] Sent: Tuesday, June 20, 2017 12:02 PM To: solr-user@lucene.apache.org Subject: RE: How are people using the ICUTokenizer? Joel, I think

RE: How are people using the ICUTokenizer?

2017-06-20 Thread Davis, Daniel (NIH/NLM) [C]
knows more than I do. -Original Message- From: David Hastings [mailto:hastings.recurs...@gmail.com] Sent: Tuesday, June 20, 2017 12:13 PM To: solr-user@lucene.apache.org Subject: Re: How are people using the ICUTokenizer? Have you successfully used the shingles with the MoreLikeThis query

Re: How are people using the ICUTokenizer?

2017-06-20 Thread David Hastings
Have you successfully used the shingles with the MoreLikeThis query? Really curious whether this would return the "interesting Phrases". On Tue, Jun 20, 2017 at 12:01 PM, Davis, Daniel (NIH/NLM) [C] < daniel.da...@nih.gov> wrote: > Joel, > > I think the issue is doing word-breaking according

RE: How are people using the ICUTokenizer?

2017-06-20 Thread Davis, Daniel (NIH/NLM) [C]
Joel, I think the issue is doing word-breaking according to ICU rules. So, if you are trying to make sure your index breaks words properly on eastern languages, just use the ICUTokenizer. Unless your text is already in an ICU normal form, you should always use the ICUNormalizer character
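
For anyone reading the archive, the setup described in this message might look roughly like the schema fragment below. This is only a sketch, not something posted in the thread: it assumes the Lucene ICU analysis module (shipped with Solr's analysis-extras contrib) is on the classpath, and the field type name text_icu is made up for illustration.

```xml
<!-- Hypothetical field type; "text_icu" is an illustrative name, not from the thread. -->
<fieldType name="text_icu" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Normalize the character stream before tokenizing; factory defaults are used here. -->
    <charFilter class="solr.ICUNormalizer2CharFilterFactory"/>
    <!-- Break words according to ICU rules. -->
    <tokenizer class="solr.ICUTokenizerFactory"/>
  </analyzer>
</fieldType>
```

Putting the normalizer in as a char filter means the tokenizer only ever sees normalized text, which matches the advice above to normalize unless the input is already in a normal form.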