RE: ICUTokenizer or StandardTokenizer or ??? for text_all type field that might include non-whitespace langs

2014-06-26 Thread Allison, Timothy B.
Thank you, Alex, Kuro and Simon. I've had a chance to look into this a bit more. I was under the (mistaken) belief that the ICUTokenizer splits Chinese text into individual characters, as the StandardAnalyzer does, after (mis)reading these two sources

Re: ICUTokenizer or StandardTokenizer or ??? for text_all type field that might include non-whitespace langs

2014-06-26 Thread Shawn Heisey
On 6/26/2014 7:27 AM, Allison, Timothy B. wrote: So, I'm left with this as a candidate for the text_all field (I'll probably add a stop filter, too): <fieldType name="text_all" class="solr.TextField" positionIncrementGap="100"> <analyzer> <tokenizer class="solr.ICUTokenizerFactory"/>
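Reassembled from the archive-flattened snippet above, the candidate fieldType would look roughly like this. The StopFilterFactory line and its stopwords file name are assumptions, based only on Tim's "I'll probably add a stop filter, too":

```xml
<!-- Sketch of the proposed text_all type; the stop filter and its
     stopwords.txt file are assumed, not taken from the thread -->
<fieldType name="text_all" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- ICU tokenizer: script-aware word segmentation, incl. CJK -->
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
  </analyzer>
</fieldType>
```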

RE: ICUTokenizer or StandardTokenizer or ??? for text_all type field that might include non-whitespace langs

2014-06-20 Thread Allison, Timothy B.
Alex, Thank you for the quick response. Apologies for my delay. Yes, we'll use edismax. That won't solve the issue of multilingual documents... I don't think... unless we index every document as every language. Let's say a predominantly English document contains a Chinese sentence. If the

Re: ICUTokenizer or StandardTokenizer or ??? for text_all type field that might include non-whitespace langs

2014-06-20 Thread T. Kuro Kurosaka
On 06/20/2014 04:04 AM, Allison, Timothy B. wrote: Let's say a predominantly English document contains a Chinese sentence. If the English field uses the WhitespaceTokenizer with a basic WordDelimiterFilter, the Chinese sentence could be tokenized as one big token (if it doesn't have any
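The analyzer chain being discussed here — the one that would leave an unspaced Chinese sentence as a single oversized token — might be sketched like this (the field name and filter options are illustrative, not taken from the thread):

```xml
<!-- Illustrative English fieldType of the kind described above -->
<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Splits only on whitespace, so a run of Chinese characters
         containing no spaces survives as one big token -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Splits on case/alphanumeric transitions and punctuation,
         but knows nothing about CJK word boundaries -->
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1"/>
  </analyzer>
</fieldType>
```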

Re: ICUTokenizer or StandardTokenizer or ??? for text_all type field that might include non-whitespace langs

2014-06-20 Thread Simon Cheng
Hi Tim, I'm working on a similar project with some differences, and maybe we can share our knowledge in this area: 1) I have no problem with the Chinese characters. You can try this link: http://123.100.239.158:8983/solr/collection1/browse?q=%E4%B8%AD%E5%9B%BD Solr can find the record even

ICUTokenizer or StandardTokenizer or ??? for text_all type field that might include non-whitespace langs

2014-06-18 Thread Allison, Timothy B.
All, In one index I’m working with, the setup is the typical langid mapping to language specific fields. There is also a text_all field that everything is copied to. The documents can contain a wide variety of languages including non-whitespace languages. We’ll be using the ICUTokenFilter
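The "typical langid mapping" setup described above can be sketched as a language-identification update chain plus a catch-all copyField. The field names and the choice of the Tika-based detector are assumptions for illustration; the thread does not specify them:

```xml
<!-- solrconfig.xml: detect language and map content to per-language fields
     (detector choice and field names are illustrative) -->
<updateRequestProcessorChain name="langid">
  <processor class="org.apache.solr.update.processor.TikaLanguageIdentifierUpdateProcessorFactory">
    <str name="langid.fl">content</str>
    <str name="langid.langField">language</str>
    <bool name="langid.map">true</bool>
  </processor>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

<!-- schema.xml: everything is also copied into the catch-all field -->
<copyField source="content_*" dest="text_all"/>
```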

Re: ICUTokenizer or StandardTokenizer or ??? for text_all type field that might include non-whitespace langs

2014-06-18 Thread Alexandre Rafalovitch
I don't think the text_all field would work too well for multilingual setup. Any reason you cannot use edismax to search over a bunch of fields instead? Regards, Alex. Personal website: http://www.outerthoughts.com/ Current project: http://www.solr-start.com/ - Accelerating your Solr
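Alex's suggestion — searching a bunch of per-language fields with edismax rather than one catch-all field — might look like this in solrconfig.xml. The field names and boosts are illustrative assumptions:

```xml
<!-- Sketch of an edismax handler over per-language fields;
     the qf list and boosts are assumed, not from the thread -->
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <str name="qf">content_en^2 content_zh content_ja content_general</str>
  </lst>
</requestHandler>
```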