Re: Tika0.10 language identifier in Solr3.5.0

Ted Dunning Fri, 20 Jan 2012 18:49:46 -0800

The TF-IDF argument is a reasonable one.
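
A rough illustration of why, using a made-up toy corpus (the documents, the `idf` helper, and the plain log(N/df) formula are assumptions for illustration, not Lucene's exact scoring): "die" is a rare content word in English but a very common article in German, so mixing both languages in one field drags its IDF down for English queries.

```python
import math

def idf(num_docs, doc_freq):
    # Plain textbook IDF; Lucene's similarity uses a smoothed variant.
    return math.log(num_docs / doc_freq)

def doc_freq(docs, term):
    # Number of documents that contain the term at least once.
    return sum(1 for doc in docs if term in doc)

# Hypothetical, already-tokenized documents.
english_docs = [
    ["the", "patient", "may", "die"],
    ["the", "cat", "sat"],
    ["a", "dog", "ran"],
    ["the", "sun", "rose"],
]
german_docs = [
    ["die", "katze", "schläft"],
    ["die", "sonne", "scheint"],
    ["der", "hund", "und", "die", "maus"],
    ["die", "stadt"],
]

# Separate field per language: stats come from English documents only.
idf_english_field = idf(len(english_docs), doc_freq(english_docs, "die"))

# One shared field: German articles inflate the document frequency.
mixed_docs = english_docs + german_docs
idf_shared_field = idf(len(mixed_docs), doc_freq(mixed_docs, "die"))

print("IDF of 'die' in English-only field:", round(idf_english_field, 3))
print("IDF of 'die' in shared field:      ", round(idf_shared_field, 3))
```

In the shared field the English sense of "die" is weighted almost like a stop word, which is the distortion a per-language field avoids.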

On Fri, Jan 20, 2012 at 5:33 PM, Jan Høydahl <jan....@cominvent.com> wrote:


> Another benefit of a separate field per language is that the TF/IDF stats
> are correct for each individual language.
> Also if you KNOW the query language, you can target THAT field alone, but
> if you don't know, you can throw the query at multiple fields, which will
> each get proper analysis (at the risk of lower precision)
>
> The only case where I would prefer having one single field for all
> languages is if my search app needs to support a large number of languages,
> such as a wide web crawl with 100 languages crawled. The way FAST supported
> this was to do lemmatization by index expansion instead of reduction or
> stemming - then you can easily support full linguistics for 100 languages,
> indexed in the same field.
>
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
> Solr Training - www.solrtraining.com
>
> On 20. jan. 2012, at 18:15, Ted Dunning wrote:
>
> > I think you misunderstood what I am suggesting.
> >
> > I am suggesting an analyzer that detects the language and then "does the
> > right thing" according to the language it finds. As such, it would
> > tokenize and stem English according to English rules, German by German
> > rules and would probably do a sliding bigram window in Japanese and
> > Chinese.
> >
> > On Fri, Jan 20, 2012 at 8:54 AM, Erick Erickson <erickerick...@gmail.com>
> > wrote:
> >
> >> bq: Why not have a polyglot analyzer
> >>
> >> That could work, but it makes some compromises and assumes that your
> >> languages are "close enough". I have absolutely no clue how that would
> >> work for, say, English and Chinese.
> >>
> >> But it also introduces inconsistencies. Take stemming. Even though you
> >> could easily stem in the correct language, throwing all those stems
> >> into the same field can produce interesting results at search time since
> >> you run the risk of hitting something produced by one of the other
> >> analysis chains.
> >>
>
>
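
For reference, the language-detecting analyzer discussed in the quoted thread could be sketched roughly as below. Everything here is a toy stand-in and an assumption: `detect_language` is a hypothetical heuristic (not Tika's language identifier), the suffix lists are not real stemmers, and `analyze` is not a Lucene or Solr API. It only shows the dispatch idea: detect first, then stem per language or fall back to a sliding bigram window for Japanese and Chinese.

```python
import re

# Hiragana, katakana and common CJK ideographs.
CJK_CHARS = re.compile(r"[\u3040-\u30ff\u4e00-\u9fff]")

def detect_language(text):
    # Hypothetical toy detector; a real setup would use a proper identifier.
    if CJK_CHARS.search(text):
        return "cjk"
    words = set(text.lower().split())
    if words & {"der", "die", "das", "und"}:
        return "de"
    return "en"

def sliding_bigrams(text):
    # Overlapping two-character tokens, ignoring whitespace.
    chars = [c for c in text if not c.isspace()]
    return ["".join(chars[i:i + 2]) for i in range(len(chars) - 1)]

def strip_suffix(word, suffixes):
    # Crude suffix stripping standing in for a real stemmer.
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

# Assumed, deliberately tiny suffix lists per language.
SUFFIXES = {"en": ("ing", "es", "s"), "de": ("en", "er", "e")}

def analyze(text):
    # Returns (detected language, tokens) using language-specific rules.
    lang = detect_language(text)
    if lang == "cjk":
        return lang, sliding_bigrams(text)
    words = re.findall(r"\w+", text.lower())
    return lang, [strip_suffix(w, SUFFIXES[lang]) for w in words]

for sample in ["running dogs", "die Katzen und Hunde", "東京都庁"]:
    print(analyze(sample))
```

Note that all three token streams would land in the same field, so a token produced by one chain can match a query analyzed by another, which is the inconsistency raised in the quoted message.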
