That's what I was going to look up :)

The Nutch language identifier works reasonably well. It ships with
training data for a number of languages, although some of those files
had UTF-8 problems. The trick is to train on a balanced volume of text
for every language, so that one language's patterns do not overwhelm
the others.
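
To make the balancing point concrete, here is a rough sketch in Python of
character-trigram language profiles. It is not the Nutch or ngramj code;
the function names, the trigram size, and the cosine scoring are all
assumptions chosen for illustration. The only part that addresses balance
is the truncation in build_profiles, which cuts every training sample to
the length of the shortest one.

```python
import math
from collections import Counter


def trigrams(text):
    """Yield character trigrams from lower-cased, space-padded text."""
    padded = " " + " ".join(text.lower().split()) + " "
    return (padded[i:i + 3] for i in range(len(padded) - 2))


def build_profiles(samples):
    """Build one trigram profile per language.

    samples maps a language code to its training text (hypothetical
    input format). Every text is truncated to the length of the
    shortest one so each profile is built from the same volume.
    """
    limit = min(len(text) for text in samples.values())
    return {
        lang: Counter(trigrams(text[:limit]))
        for lang, text in samples.items()
    }


def cosine(a, b):
    """Cosine similarity between two trigram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def identify(text, profiles):
    """Return the language whose profile is most similar to the text."""
    doc = Counter(trigrams(text))
    return max(profiles, key=lambda lang: cosine(doc, profiles[lang]))
```

A real identifier would need far more training text per language than a
toy example, and the character encoding of the training files has to be
read correctly before any of this is meaningful.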

Thanks for the pointer to ngramj (LGPL license), which then leads to
another contender, http://tcatng.sourceforge.net/ (BSD license). The
latter would make a great DIH Transformer that could go into contrib/
(hint hint).

On Tue, Feb 9, 2010 at 7:21 AM, Jan Høydahl / Cominvent
<jan....@cominvent.com> wrote:
> Much more efficient to tag documents with language at index time. Look for 
> language identification tools such as 
> http://www.sematext.com/products/language-identifier/index.html or 
> http://ngramj.sourceforge.net/ or 
> http://lucene.apache.org/nutch/apidocs-1.0/org/apache/nutch/analysis/lang/LanguageIdentifier.html
>
> --
> Jan Høydahl  - search architect
> Cominvent AS - www.cominvent.com
>
> On 9. feb. 2010, at 05.19, Lance Norskog wrote:
>
>> There is
>>
>> On Thu, Feb 4, 2010 at 10:07 AM, Raimon Bosch <raimon.bo...@gmail.com> wrote:
>>>
>>>
>>> Yes, it's true that we could do it at index time if we had a way to know. I
>>> was thinking of some solution at search time, maybe measuring the % of
>>> stopwords in each document. Normally, a document in another language won't
>>> have any stopwords from the index's main language.
>>>
>>> If you know of some external software to detect the language of a source
>>> text, that would be useful too.
>>>
>>> Thanks,
>>> Raimon Bosch.
>>>
>>>
>>>
>>> Ahmet Arslan wrote:
>>>>
>>>>
>>>>> In our indexes, we sometimes have documents written in languages
>>>>> other than the index's most common language. Is there any way to
>>>>> give these documents a lower boost?
>>>>
>>>> If you are aware of those documents, at index time you can boost those
>>>> documents with a value less than 1.0:
>>>>
>>>> <add>
>>>>   <doc boost="0.5">
>>>>     <!-- document written in another language -->
>>>>     <field name="...">...</field>
>>>>     <field name="...">...</field>
>>>>   </doc>
>>>> </add>
>>>>
>>>> http://wiki.apache.org/solr/UpdateXmlMessages#Optional_attributes_on_.22doc.22
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>> --
>>> View this message in context: 
>>> http://old.nabble.com/Is-it-posible-to-exclude-results-from-other-languages--tp27455759p27457165.html
>>> Sent from the Solr - User mailing list archive at Nabble.com.
>>>
>>>
>>
>>
>>
>> --
>> Lance Norskog
>> goks...@gmail.com
>
>



-- 
Lance Norskog
goks...@gmail.com
