That's what I was going to look up :) The Nutch language identifier works reasonably well. It ships with training data for a number of languages, though some of those files had UTF-8 encoding problems. The trick is to train on a balanced volume of text for every language, so that one language's patterns do not overwhelm the others.
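To make the "balanced volume" point concrete, here is a minimal sketch of character-trigram profiles built from equally sized samples. It is not how Nutch or ngramj actually train; the function names (`trigrams`, `build_profiles`, `identify`) and the toy corpora are assumptions made up for illustration.

```python
from collections import Counter


def trigrams(text):
    """Count character trigrams in lower-cased, whitespace-normalized text."""
    text = " ".join(text.lower().split())
    return Counter(text[i:i + 3] for i in range(len(text) - 2))


def build_profiles(corpora):
    """Build one trigram profile per language from equally sized samples.

    `corpora` maps a language code to raw training text (hypothetical input).
    Every corpus is truncated to the length of the shortest one, so no
    language contributes more training text than any other.
    """
    limit = min(len(text) for text in corpora.values())
    profiles = {}
    for lang, text in corpora.items():
        counts = trigrams(text[:limit])
        total = sum(counts.values()) or 1  # avoid dividing by zero on tiny input
        # Relative frequencies keep the profiles comparable across languages.
        profiles[lang] = {gram: n / total for gram, n in counts.items()}
    return profiles


def identify(text, profiles):
    """Return the language whose profile overlaps most with the text."""
    counts = trigrams(text)

    def score(profile):
        return sum(n * profile.get(gram, 0.0) for gram, n in counts.items())

    return max(profiles, key=lambda lang: score(profiles[lang]))


# Toy training data, far too small for real use.
corpora = {
    "en": "the quick brown fox jumps over the lazy dog and the cat sat on "
          "the mat with the other dog",
    "es": "el rápido zorro marrón salta sobre el perro perezoso y el gato "
          "se sentó en la alfombra con el otro perro",
}
profiles = build_profiles(corpora)
guess = identify("the dog and the cat", profiles)
```

Truncating to the shortest corpus is the crudest way to balance; sampling a fixed number of sentences per language would waste less data.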
Thanks for the pointer to ngramj (LGPL license), which then leads to another contender, http://tcatng.sourceforge.net/ (BSD license). The latter would make a great DIH Transformer that could go into contrib/ (hint hint).

On Tue, Feb 9, 2010 at 7:21 AM, Jan Høydahl / Cominvent <jan....@cominvent.com> wrote:
> Much more efficient to tag documents with language at index time. Look for
> language identification tools such as
> http://www.sematext.com/products/language-identifier/index.html or
> http://ngramj.sourceforge.net/ or
> http://lucene.apache.org/nutch/apidocs-1.0/org/apache/nutch/analysis/lang/LanguageIdentifier.html
>
> --
> Jan Høydahl - search architect
> Cominvent AS - www.cominvent.com
>
> On 9. feb. 2010, at 05.19, Lance Norskog wrote:
>
>> There is
>>
>> On Thu, Feb 4, 2010 at 10:07 AM, Raimon Bosch <raimon.bo...@gmail.com> wrote:
>>>
>>> Yes, It's true that we could do it in index time if we had a way to know. I
>>> was thinking in some solution in search time, maybe measuring the % of
>>> stopwords of each document. Normally, a document of another language won't
>>> have any stopword of its main language.
>>>
>>> If you know some external software to detect the language of a source text,
>>> it would be useful too.
>>>
>>> Thanks,
>>> Raimon Bosch.
>>>
>>> Ahmet Arslan wrote:
>>>>
>>>>> In our indexes, sometimes we have some documents written in
>>>>> other languages
>>>>> different to the most common index's language. Is there any
>>>>> way to give less
>>>>> boosting to this documents?
>>>>
>>>> If you are aware of those documents, at index time you can boost those
>>>> documents with a value less than 1.0:
>>>>
>>>> <add>
>>>>   <doc boost="0.5">
>>>>     <!-- document written in other languages -->
>>>>     <field name="...">...</field>
>>>>     <field name="...">...</field>
>>>>   </doc>
>>>> </add>
>>>>
>>>> http://wiki.apache.org/solr/UpdateXmlMessages#Optional_attributes_on_.22doc.22
>>>>
>>>
>>> --
>>> View this message in context:
>>> http://old.nabble.com/Is-it-posible-to-exclude-results-from-other-languages--tp27455759p27457165.html
>>> Sent from the Solr - User mailing list archive at Nabble.com.
>>>
>>
>> --
>> Lance Norskog
>> goks...@gmail.com
>

--
Lance Norskog
goks...@gmail.com
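
For what it's worth, the stopword-percentage idea from earlier in the thread can be applied at index time and combined with the boost="0.5" suggestion. The following is only a rough sketch under stated assumptions: the stopword list is a tiny illustrative one, the 0.1 threshold and the field names "id" and "text" are made up, and `stopword_ratio` and `to_add_message` are hypothetical helpers rather than anything shipped with Solr.

```python
import re
import xml.etree.ElementTree as ET

# Tiny illustrative list; a real stopword list would be much longer.
ENGLISH_STOPWORDS = {"the", "and", "of", "to", "a", "in", "is", "it", "that", "on"}


def stopword_ratio(text, stopwords):
    """Return the fraction of word tokens in `text` that are stopwords."""
    tokens = re.findall(r"\w+", text.lower())
    if not tokens:
        return 0.0
    return sum(1 for token in tokens if token in stopwords) / len(tokens)


def to_add_message(fields, text_field, stopwords, threshold=0.1, low_boost=0.5):
    """Build a Solr <add> message for one document.

    If the share of main-language stopwords in `fields[text_field]` is below
    `threshold` (an assumed cut-off), the document is treated as probably
    written in another language and gets boost=`low_boost`.
    """
    doc = ET.Element("doc")
    if stopword_ratio(fields[text_field], stopwords) < threshold:
        doc.set("boost", str(low_boost))
    for name, value in fields.items():
        field = ET.SubElement(doc, "field", name=name)
        field.text = value
    add = ET.Element("add")
    add.append(doc)
    return ET.tostring(add, encoding="unicode")


english = to_add_message({"id": "1", "text": "the cat is on the mat"},
                         "text", ENGLISH_STOPWORDS)
spanish = to_add_message({"id": "2", "text": "el gato está en la alfombra"},
                         "text", ENGLISH_STOPWORDS)
```

Short documents and documents that mix languages will fool a ratio like this, so a real language identifier is probably still the better signal when one is available.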