Hi, Thanks for the answer. That's easy enough.
I cannot find any documentation of what the original training texts were. Shouldn't those be in svn, so that the profiles could be rebuilt if the algorithm or format changes?

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Training in Europe - www.solrtraining.com

On 24. aug. 2010, at 16.57, Jukka Zitting wrote:

> Hi,
>
> On Tue, Aug 24, 2010 at 4:50 PM, Jan Høydahl / Cominvent
> <[email protected]> wrote:
>> Does anyone have an answer to this question that I posted last week?
>> I know how to generate profiles for Nutch, but not for Tika.
>
> It's the same thing, you just need to postprocess the Nutch profile
> files to only contain three-letter ngrams as that's what Tika
> currently uses as the standard ngram size.
>
> Any sufficiently representative corpus of text should be good enough
> for the language profiles. It would also be good to include some
> simple test cases that we can use to verify that future updates to the
> language profiles won't break things.
>
> BR,
>
> Jukka Zitting
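
For reference, the postprocessing step described above could look roughly like the sketch below. It is only a sketch under assumptions: it assumes each profile line has the form "<ngram> <count>" separated by a single space, and that lines starting with "#" are comments. The function names and the file names in the usage note are hypothetical and are not taken from Nutch or Tika.

```python
from pathlib import Path


def keep_ngrams_of_size(lines, size=3):
    """Yield comment lines plus the entries whose n-gram is exactly `size` characters.

    Assumed line format: "<ngram> <count>"; lines starting with "#" are comments.
    """
    for line in lines:
        stripped = line.rstrip("\n")
        if not stripped:
            continue  # drop blank lines
        if stripped.startswith("#"):
            yield stripped  # keep header/comment lines unchanged
            continue
        # Split on the last space so the count is separated from the n-gram.
        ngram, _, _count = stripped.rpartition(" ")
        if len(ngram) == size:
            yield stripped


def filter_profile(src, dst, size=3):
    """Read a profile file, keep only n-grams of the given size, write the result."""
    lines = Path(src).read_text(encoding="utf-8").splitlines()
    kept = list(keep_ngrams_of_size(lines, size))
    Path(dst).write_text("\n".join(kept) + "\n", encoding="utf-8")
```

With hypothetical file names, filter_profile("en.nutch.ngp", "en.ngp") would write a copy of the profile that contains only three-letter ngrams. The output should still be checked against a few known sample texts, as suggested above.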
