Hi All,

For our use case we don't really need to do a lot of manipulation of incoming text at index time. At most, removal of common stop words and, if possible, tokenizing emails/filenames. We get text documents from our end users, which can be in any language (sometimes a combination), and we cannot determine the language of the incoming text. Language detection at index time is not necessary.
Which analyzer is recommended to achieve basic multilingual search capability for a use case like this? I have read a bunch of posts about using a combination of the StandardTokenizer or ICUTokenizer, the LowerCaseFilter, and the ReversedWildcardFilterFactory, but I'm looking for ideas, suggestions, and best practices.

http://lucene.472066.n3.nabble.com/ICUTokenizer-or-StandardTokenizer-or-for-quot-text-all-quot-type-field-that-might-include-non-whitess-td4142727.html#a4144236
http://lucene.472066.n3.nabble.com/How-to-implement-multilingual-word-components-fields-schema-td4157140.html#a4158923
https://issues.apache.org/jira/browse/SOLR-6492

Thanks,
Rishi.
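P.S. For reference, here is a rough sketch of the kind of fieldType those posts point toward, in schema.xml. This is only an assumption of how the pieces might fit together, not a tested config: the field name, the stopwords.txt file, and the filter ordering are all illustrative, and the ICU classes require the analysis-extras contrib to be on the classpath.

```xml
<fieldType name="text_multilingual" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <!-- ICUTokenizer handles mixed scripts (CJK, etc.) better than StandardTokenizer -->
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- stopwords.txt is illustrative; a truly multilingual field often skips stopwords
         entirely, since one language's stop word can be another's content word -->
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <!-- index-time only: also indexes reversed tokens so leading-wildcard
         queries (e.g. *name) don't have to scan the whole term dictionary -->
    <filter class="solr.ReversedWildcardFilterFactory" withOriginal="true"/>
  </analyzer>
  <analyzer type="query">
    <!-- no ReversedWildcardFilterFactory at query time; the query parser
         detects leading wildcards and uses the reversed terms itself -->
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
  </analyzer>
</fieldType>
```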