Thank you, Alex, Kuro and Simon. I've had a chance to look into this a bit
more.
I was under the (wrong) impression that the ICUTokenizer splits Chinese text
into individual characters, the way the StandardAnalyzer does, after
(mis)reading these two sources
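If anyone else wants to see the difference, a quick way is to drop two throwaway
field types like these into a scratch schema and compare them in the Analysis
screen (the names are just examples; the ICU factories need the analysis-extras
contrib jars on the classpath). With the default config, the StandardTokenizer
emits one token per Han character, while the ICUTokenizer uses ICU's
dictionary-based segmentation and produces word-level tokens for Chinese and
Japanese:

<fieldType name="test_standard" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
  </analyzer>
</fieldType>
<fieldType name="test_icu" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.ICUTokenizerFactory"/>
  </analyzer>
</fieldType>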
On 6/26/2014 7:27 AM, Allison, Timothy B. wrote:
So, I'm left with this as a candidate for the text_all field (I'll probably
add a stop filter, too):
<fieldType name="text_all" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.ICUTokenizerFactory"/>
  </analyzer>
</fieldType>
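With the stop filter added it would look something like this (stopwords.txt is
just a placeholder; what a sensible stopword list for a catch-all multilingual
field should contain is a separate question):

<fieldType name="text_all" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
  </analyzer>
</fieldType>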
Alex,
Thank you for the quick response. Apologies for my delay.
Yes, we'll use edismax. That won't solve the issue of multilingual documents,
I don't think, unless we index every document as every language.
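To make that concrete, the handler I have in mind is roughly the following
(the field names text_en, text_zh and text_all are made up). edismax will fan
the query out over all of the listed fields, but a document that langid tagged
as English only ever received English analysis for its body, so a Chinese
sentence inside it is really only reachable through text_all:

<requestHandler name="/search" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <str name="qf">text_en text_zh text_all</str>
  </lst>
</requestHandler>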
On 06/20/2014 04:04 AM, Allison, Timothy B. wrote:
Let's say a predominantly English document contains a Chinese sentence. If the
English field uses the WhitespaceTokenizer with a basic WordDelimiterFilter,
the Chinese sentence could be tokenized as one big token (if it doesn't have
any whitespace).
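For reference, the kind of English field type described above is roughly the
following (the name is hypothetical). A run of Chinese characters contains no
whitespace for the tokenizer to split on, and the WordDelimiterFilter only
splits on things like punctuation, case changes and letter-digit boundaries,
so the whole run survives as a single token:

<fieldType name="text_en_ws" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"/>
  </analyzer>
</fieldType>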
Hi Tim,
I'm working on a similar project with some differences, and maybe we can
share our knowledge in this area:
1) I have no problem with the Chinese characters. You can try this link:
http://123.100.239.158:8983/solr/collection1/browse?q=%E4%B8%AD%E5%9B%BD
Solr can find the record even
All,
In one index I’m working with, the setup is the typical langid mapping to
language-specific fields. There is also a text_all field that everything is
copied to. The documents can contain a wide variety of languages, including
non-whitespace languages. We’ll be using the ICUTokenizer
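Roughly, the relevant pieces look like this (field names are illustrative, and
either the Tika-based or the LangDetect-based langid processor takes the same
langid.* parameters):

In solrconfig.xml:

<updateRequestProcessorChain name="langid">
  <processor class="org.apache.solr.update.processor.TikaLanguageIdentifierUpdateProcessorFactory">
    <str name="langid.fl">text</str>
    <str name="langid.langField">language_s</str>
    <bool name="langid.map">true</bool>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

In schema.xml:

<copyField source="text_*" dest="text_all"/>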
I don't think the text_all field would work too well for a multilingual
setup. Any reason you cannot use edismax to search over a bunch of
fields instead?
Regards,
Alex.
Personal website: http://www.outerthoughts.com/
Current project: http://www.solr-start.com/ - Accelerating your Solr