Hi Rishi,

I don't generally recommend a language-insensitive approach except for
really simple multilingual use cases (for most of the reasons Walter
mentioned), but the ICUTokenizer is probably your best bet if you really
want to go that route and only need exact matching on the tokens it
produces. It won't work well for all languages (CJK languages, for
example), but it will work fine for many.
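
For concreteness, here is a minimal sketch of what such a field type might
look like in schema.xml (the field type name is just a placeholder, and it
assumes the ICU analysis jars from the analysis-extras contrib are on your
classpath):

  <!-- language-insensitive analysis: Unicode-aware tokenization plus
       case/diacritic folding, with no language-specific stemming -->
  <fieldType name="text_icu" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.ICUTokenizerFactory"/>
      <filter class="solr.ICUFoldingFilterFactory"/>
    </analyzer>
  </fieldType>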

It is also possible to handle multi-lingual content in a more intelligent
(i.e. per-language configuration) way in your search index, of course.
There are three primary strategies (i.e. ways that actually work in the
real world) to do this:
1) create a separate field for each language and search across all of them
at query time (a minimal sketch of this follows the list)
2) create a separate core per language combination and search across all of
them at query time
3) invoke multiple language-specific analyzers within a single field's
analyzer and index/query using one or more of those languages' analyzers
for each document/query.
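
To make the first option concrete, a sketch of per-language fields in
schema.xml might look like this (field and type names here are
hypothetical; text_en ships with the example schema, and you would define
analogous types for each additional language):

  <field name="content_en" type="text_en" indexed="true" stored="true"/>
  <field name="content_de" type="text_de" indexed="true" stored="true"/>
  <field name="content_ru" type="text_ru" indexed="true" stored="true"/>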

These are listed in ascending order of complexity, and each can be valid
based upon your use case. For at least the first and third cases, you can
use index-time language detection to map to the appropriate
fields/analyzers if you are otherwise unaware of the languages of the
content from your application layer. The third option requires custom code
(included in the large Multilingual Search chapter of Solr in Action
<http://solrinaction.com> and soon to be contributed back to Solr via
SOLR-6492 <https://issues.apache.org/jira/browse/SOLR-6492>), but it
enables you to index an arbitrarily large number of languages into the same
field if needed, while preserving language-specific analysis for each
language.
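
As a rough sketch of the index-time language detection piece, Solr's langid
contrib can both detect the language and map fields to per-language
variants like those shown above (the processor and parameter names below
are from the stock LangDetect-based implementation; verify them against
your Solr version):

  <updateRequestProcessorChain name="langid">
    <processor class="org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory">
      <str name="langid.fl">content</str>
      <str name="langid.langField">language</str>
      <!-- rewrite the content field to content_en, content_de, etc.
           based on the detected language -->
      <bool name="langid.map">true</bool>
      <str name="langid.whitelist">en,de,ru</str>
      <str name="langid.fallback">en</str>
    </processor>
    <processor class="solr.LogUpdateProcessorFactory"/>
    <processor class="solr.RunUpdateProcessorFactory"/>
  </updateRequestProcessorChain>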

I presented in detail on the above strategies at Lucene/Solr Revolution
last November, so you may want to check out the presentation and/or slides
to assess whether one of these strategies will work for your use case:
http://www.treygrainger.com/posts/presentations/semantic-multilingual-strategies-in-lucenesolr/

For the record, I'd highly recommend going with the first strategy (a
separate field per language) if you can, as it is certainly the simplest of
the approaches (albeit the one that scales least well once you add more
than a few languages to your queries). If you want to keep things simple
and stick with the ICUTokenizer, it will work up to a point, but some of
the problems Walter mentioned may eventually bite you if you are supporting
certain groups of languages.
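
At query time with the first strategy, you would simply search across all
of the per-language fields, e.g. with edismax (reusing the hypothetical
field names from above):

  q=hello&defType=edismax&qf=content_en+content_de+content_ru

Each field analyzes the query terms with its own language's analysis
chain, so the field matching the document's detected language at index
time is the one that produces the match.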

All the best,

Trey Grainger
Co-author, Solr in Action
Director of Engineering, Search & Recommendations @ CareerBuilder

On Mon, Feb 23, 2015 at 11:14 PM, Walter Underwood <wun...@wunderwood.org>
wrote:

> It isn’t just complicated; it can be impossible.
>
> Do you have content in Chinese or Japanese? Those languages (and some
> others) do not separate words with spaces. You cannot even do word search
> without a language-specific, dictionary-based parser.
>
> German is space separated, but many noun compounds are written without
> spaces.
>
> Do you have Finnish content? Entire prepositional phrases turn into word
> endings.
>
> Do you have Arabic content? That is even harder.
>
> If all your content is in space-separated languages that are not heavily
> inflected, you can kind of do OK with a language-insensitive approach. But
> it hits the wall pretty fast.
>
> One thing that does work pretty well is trademarked names (LaserJet, Coke,
> etc). Those are spelled the same in all languages and usually not inflected.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
> On Feb 23, 2015, at 8:00 PM, Rishi Easwaran <rishi.easwa...@aol.com>
> wrote:
>
> > Hi Alex,
> >
> > There is no specific language list.
> > For example: the documents that need to be indexed are emails or other
> > messages for a global customer base. The messages back and forth could
> > be in any language or a mix of languages.
> >
> > I understand relevancy, stemming, etc. become extremely complicated with
> > multilingual support, but our first goal is to be able to tokenize and
> > provide basic search capability for any language. Ex: when a document
> > contains hello or здравствуйте, the analyzer creates tokens and provides
> > exact-match search results.
> >
> > Now it would be great if it had the capability to tokenize email
> > addresses (ex: he...@aol.com - I think StandardTokenizer already does
> > this) and filenames (здравствуйте.pdf), but maybe we can use filters to
> > accomplish that.
> >
> > Thanks,
> > Rishi.
> >
> > -----Original Message-----
> > From: Alexandre Rafalovitch <arafa...@gmail.com>
> > To: solr-user <solr-user@lucene.apache.org>
> > Sent: Mon, Feb 23, 2015 5:49 pm
> > Subject: Re: Basic Multilingual search capability
> >
> >
> > Which languages are you expecting to deal with? Multilingual support
> > is a complex issue. Even if you think you don't need much, it is
> > usually a lot more complex than expected, especially around relevancy.
> >
> > Regards,
> >   Alex.
> > ----
> > Sign up for my Solr resources newsletter at http://www.solr-start.com/
> >
> >
> > On 23 February 2015 at 16:19, Rishi Easwaran <rishi.easwa...@aol.com>
> wrote:
> >> Hi All,
> >>
> >> For our use case we don't really need to do a lot of manipulation of
> >> incoming text during index time. At most removal of common stop words,
> >> tokenizing emails/filenames etc. if possible. We get text documents
> >> from our end users, which can be in any language (sometimes a
> >> combination), and we cannot determine the language of the incoming
> >> text. Language detection at index time is not necessary.
> >>
> >> Which analyzer is recommended to achieve basic multilingual search
> >> capability for a use case like this?
> >> I have read a bunch of posts about using a combination of
> >> StandardTokenizer or ICUTokenizer, LowerCaseFilter, and
> >> ReversedWildcardFilterFactory, but I am looking for ideas, suggestions,
> >> and best practices.
> >>
> >> http://lucene.472066.n3.nabble.com/ICUTokenizer-or-StandardTokenizer-or-for-quot-text-all-quot-type-field-that-might-include-non-whitess-td4142727.html#a4144236
> >> http://lucene.472066.n3.nabble.com/How-to-implement-multilingual-word-components-fields-schema-td4157140.html#a4158923
> >> https://issues.apache.org/jira/browse/SOLR-6492
> >>
> >>
> >> Thanks,
> >> Rishi.
> >>
> >
> >
>
>
