Re: Adding languages to LanguageIdentifier

Jan Høydahl / Cominvent Tue, 24 Aug 2010 10:13:04 -0700

Hi,

Thanks for the answer. That's easy enough.


I cannot find any documentation of what the original training texts were.
Shouldn't those be in svn, so that the profiles can be rebuilt if the
algorithm or format changes?

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Training in Europe - www.solrtraining.com

On 24. aug. 2010, at 16.57, Jukka Zitting wrote:

> Hi,
> 
> On Tue, Aug 24, 2010 at 4:50 PM, Jan Høydahl / Cominvent
> <[email protected]> wrote:
>> Does anyone have an answer to this question that I posted last week?
>> I know how to generate profiles for Nutch, but not for Tika.
> 
> It's the same thing, you just need to postprocess the Nutch profile
> files to only contain three-letter ngrams as that's what Tika
> currently uses as the standard ngram size.
> 
> Any sufficiently representative corpus of text should be good enough
> for the language profiles. It would also be good to include some
> simple test cases that we can use to verify that future updates to the
> language profiles won't break things.
> 
> BR,
> 
> Jukka Zitting
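
For reference, the postprocessing step described above (keeping only the
three-letter ngrams from a Nutch profile) could be sketched roughly as
follows. This is only an illustration, not the actual Tika or Nutch tooling.
It assumes the profile is a plain-text file with one "ngram count" pair per
line and "#" comment lines; the function name keep_trigrams is made up for
this example.

```python
def keep_trigrams(lines, n=3):
    """Return (ngram, count) pairs whose ngram is exactly n characters long.

    Assumed input format (not verified against the real profile files):
    one "ngram count" pair per line, with "#" marking comment lines.
    """
    kept = []
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # skip blank lines and comments
        parts = stripped.rsplit(None, 1)  # split off the trailing count
        if len(parts) != 2 or not parts[1].isdigit():
            continue  # skip lines that do not look like "ngram count"
        ngram, count = parts
        if len(ngram) == n:
            kept.append((ngram, int(count)))
    return kept


def format_profile(entries):
    """Render the filtered entries back into "ngram count" lines."""
    return "\n".join(f"{ngram} {count}" for ngram, count in entries)
```

Reading the Nutch profile line by line, passing the lines through
keep_trigrams, and writing the result of format_profile to a new file would
give a trigram-only profile under these assumptions.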
