Hi,

On Tue, Aug 24, 2010 at 4:50 PM, Jan Høydahl / Cominvent <[email protected]> wrote:
> Do anyone have an answer to this question that I posted last week?
> I know how to generate profiles for Nutch, but not for Tika.

It's the same thing, you just need to postprocess the Nutch profile
files so that they contain only three-letter ngrams, as that's what Tika
currently uses as the standard ngram size.

Any sufficiently representative corpus of text should be good enough
for the language profiles. It would also be good to include some simple
test cases that we can use to verify that future updates to the language
profiles won't break things.

BR,

Jukka Zitting
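
PS. A rough sketch of the postprocessing step mentioned above, in Python. It assumes that each non-comment line of a Nutch profile file has the form "<ngram> <count>" and that lines starting with "#" are comments; please check that against your actual files before relying on it. The function name keep_trigrams is made up for this example and is not part of Nutch or Tika.

```python
def keep_trigrams(lines, size=3):
    """Keep comment lines and entries whose ngram is exactly `size` characters.

    Assumed (not verified) line format: "<ngram> <count>", with "#" starting
    a comment line. Blank or malformed lines are dropped.
    """
    kept = []
    for line in lines:
        stripped = line.rstrip("\n")
        if not stripped:
            continue  # drop blank lines
        if stripped.startswith("#"):
            kept.append(stripped)  # pass header/comment lines through
            continue
        # Split on the last space so the count is always the final field.
        ngram, _, count = stripped.rpartition(" ")
        if len(ngram) == size and count.isdigit():
            kept.append(stripped)
    return kept


if __name__ == "__main__":
    sample = ["# example header", "_th 120", "the 95", "th 300", "_the 40"]
    for entry in keep_trigrams(sample):
        print(entry)
```

To use it on a file, read the profile as UTF-8, pass the lines through keep_trigrams, and write the result to the new profile file.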
