Jan's point that keeping separate per-language fields makes the term
statistics more correct is sound.

The basic idea is that a common word in a rare language should be treated
as a common word if you are working in that language.  The simplest way to
make that happen is to have a separate field for each language.  Since
Lucene itself doesn't much care which fields you have, this is a nice
option.  It leads to complicated-looking queries when you don't know what
language the query is in, since you need as many versions of the query as
you have languages in your corpus, but performance should be about the
same as if all languages went into a single field.  If you have a
moderately sharp idea of which languages you might have, the query doesn't
have to be all that huge.
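
To make that concrete, here is a rough sketch of the query side in Java.
The field names (text_en, text_de, text_fr) are invented for illustration,
and a real version would run the term through each field's own analyzer
rather than using the raw string; recent Lucene uses BooleanQuery.Builder,
while the 3.x API of the day used a mutable BooleanQuery:

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;

    public class CrossLanguageQuery {
      // One SHOULD clause per language field: a match in any field counts.
      static Query expand(String term, String... languageFields) {
        BooleanQuery.Builder builder = new BooleanQuery.Builder();
        for (String field : languageFields) {
          builder.add(new TermQuery(new Term(field, term)),
                      BooleanClause.Occur.SHOULD);
        }
        return builder.build();
      }

      public static void main(String[] args) {
        // Prints something like: text_en:lucene text_de:lucene text_fr:lucene
        System.out.println(expand("lucene", "text_en", "text_de", "text_fr"));
      }
    }

Narrowing the candidate languages just means passing fewer fields, which
is why the query doesn't have to be huge.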

The index size should also be about the same for either approach.

As such, I think I would go with what Jan suggested.
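
For reference, the index side of that might route each document into a
per-language field along these lines.  The detectLanguage() method is a
stand-in for whatever language identifier you plug in, and the field
naming is again invented; a PerFieldAnalyzerWrapper can then map each
field name to the matching analyzer:

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.TextField;

    public class LanguageRouting {
      static Document toDocument(String text) {
        String lang = detectLanguage(text);  // stand-in for a real detector
        Document doc = new Document();
        // One field per language (text_en, text_de, text_ja, ...) so each
        // language keeps its own term statistics; index with an analyzer
        // chosen per field, e.g. via PerFieldAnalyzerWrapper.
        doc.add(new TextField("text_" + lang, text, Field.Store.NO));
        return doc;
      }

      // Placeholder only: plug in a language identification library and
      // return a language code here.
      private static String detectLanguage(String text) {
        return "en";
      }
    }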


On Sun, Jan 22, 2012 at 5:45 PM, Erick Erickson <erickerick...@gmail.com> wrote:

> Would "doing the right thing" include firing the results at different
> fields based on the language detected? Your answer to Jan
> seems to indicate not, in which case my original comments
> stand. The main point is that mixing all the *results* of the
> analysis chains for multiple languages into a single field
> will likely result in "interesting" behavior. Not to say it won't
> be satisfactory in your situation, but there are edge cases.
>
> Best
> Erick
>
> On Fri, Jan 20, 2012 at 9:15 AM, Ted Dunning <ted.dunn...@gmail.com>
> wrote:
> > I think you misunderstood what I am suggesting.
> >
> > I am suggesting an analyzer that detects the language and then "does the
> > right thing" according to the language it finds.   As such, it would
> > tokenize and stem English according to English rules, German by German
> > rules and would probably do a sliding bigram window in Japanese and
> > Chinese.
> >
> > On Fri, Jan 20, 2012 at 8:54 AM, Erick Erickson <erickerick...@gmail.com> wrote:
> >
> >> bq: Why not have a polyglot analyzer
> >>
> >> That could work, but it makes some compromises and assumes that your
> >> languages are "close enough".  I have absolutely no clue how that
> >> would work for, say, English and Chinese.
> >>
> >> But it also introduces inconsistencies. Take stemming. Even though you
> >> could easily stem in the correct language, throwing all those stems
> >> into the same field can produce interesting results at search time since
> >> you run the risk of hitting something produced by one of the other
> >> analysis chains.
> >>
>
