The TF-IDF argument is a reasonable one.

On Fri, Jan 20, 2012 at 5:33 PM, Jan Høydahl <jan....@cominvent.com> wrote:

> Another benefit of a separate field per language is that the TF/IDF stats
> are correct for each individual language.
> Also, if you KNOW the query language, you can target THAT field alone, but
> if you don't know, you can throw the query at multiple fields, which will
> each get proper analysis (at the risk of lower precision).
>
> The only case where I would prefer having one single field for all
> languages is if my search app needs to support a large number of languages,
> such as a wide web crawl with 100 languages crawled. The way FAST supported
> this was to do lemmatization by index expansion instead of reduction or
> stemming - then you can easily support full linguistics for 100 languages,
> indexed in the same field.
>
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
> Solr Training - www.solrtraining.com
>
> On 20. jan. 2012, at 18:15, Ted Dunning wrote:
>
> > I think you misunderstood what I am suggesting.
> >
> > I am suggesting an analyzer that detects the language and then "does the
> > right thing" according to the language it finds. As such, it would
> > tokenize and stem English according to English rules, German by German
> > rules and would probably do a sliding bigram window in Japanese and
> > Chinese.
> >
> > On Fri, Jan 20, 2012 at 8:54 AM, Erick Erickson <erickerick...@gmail.com>
> > wrote:
> >
> >> bq: Why not have a polyglot analyzer
> >>
> >> That could work, but it makes some compromises and assumes that your
> >> languages are "close enough". I have absolutely no clue how that would
> >> work for English and Chinese, say.
> >>
> >> But it also introduces inconsistencies. Take stemming. Even though you
> >> could easily stem in the correct language, throwing all those stems
> >> into the same field can produce interesting results at search time since
> >> you run the risk of hitting something produced by one of the other
> >> analysis chains.
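
To make the TF/IDF point above concrete, here is a minimal sketch in plain Python. It is not Solr code: the toy corpora, the `idf` helper, and the smoothed formula are all assumptions made up for illustration, and the formula is a textbook IDF, not necessarily what a given Solr version computes. It only shows how the weight of a token such as "die" (a rare verb in English, a very common article in German) shifts when both languages share one field.

```python
import math


def idf(term, docs):
    """Textbook smoothed IDF: 1 + ln(N / (df + 1)).

    Illustrative only; not claimed to be Solr's exact scoring formula.
    """
    n_docs = len(docs)
    doc_freq = sum(1 for doc in docs if term in doc)
    return 1.0 + math.log(n_docs / (doc_freq + 1))


# Hypothetical, already-tokenized toy documents (one list per document).
english_docs = [
    ["the", "actors", "die", "in", "the", "end"],
    ["the", "cat", "sat"],
    ["a", "dog", "ran"],
    ["the", "end"],
]
german_docs = [
    ["die", "katze", "schläft"],
    ["die", "hunde", "laufen"],
    ["die", "stadt", "ist", "alt"],
    ["der", "hund"],
]

# One field per language: statistics are computed per language.
idf_en = idf("die", english_docs)
idf_de = idf("die", german_docs)

# One shared field: statistics are computed over the mixed corpus.
idf_shared = idf("die", english_docs + german_docs)

print(f"english field: {idf_en:.3f}")
print(f"german field:  {idf_de:.3f}")
print(f"shared field:  {idf_shared:.3f}")
```

In the shared field the weight of "die" lands between the two per-language values, so it is under-weighted for English queries and over-weighted for German ones. The same mixing is what lets a stem from one analysis chain match tokens produced by another.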