Hi,

Picking up on this thread again.

I created TIKA-546 "Add ability to create language profiles to tika-app". Do you think this is a viable route?

But when I try to find the class org.apache.nutch.analysis.lang.NGramProfile in trunk, it is gone. Is there already an existing initiative to port language profile creation over to Tika?

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

On 24. aug. 2010, at 16.57, Jukka Zitting wrote:

> Hi,
>
> On Tue, Aug 24, 2010 at 4:50 PM, Jan Høydahl / Cominvent
> <[email protected]> wrote:
>> Do anyone have an answer to this question that I posted last week?
>> I know how to generate profiles for Nutch, but not for Tika.
>
> It's the same thing, you just need to postprocess the Nutch profile
> files to only contain three-letter ngrams as that's what Tika
> currently uses as the standard ngram size.
>
> Any sufficiently representative corpus of text should be good enough
> for the language profiles. It would also be good to include some
> simple test cases that we can use to verify that future updates to the
> language profiles won't break things.
>
> BR,
>
> Jukka Zitting
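
For reference, the postprocessing step described in the quoted reply (keeping only three-letter ngrams from a Nutch profile) could be sketched roughly as below. This is only an illustration, not an existing Tika or Nutch tool. It assumes the profile is a plain-text file with one "ngram count" entry per line and comment lines starting with "#"; that layout and the function name filter_trigrams are my assumptions and should be checked against real .ngp files.

```python
def filter_trigrams(lines, n=3):
    """Keep only entries whose ngram is exactly n characters long.

    Assumed (unverified) profile layout: one "<ngram> <count>" entry per
    line, with comment lines starting with "#". Comments are passed
    through unchanged; blank or malformed lines are dropped.
    """
    kept = []
    for line in lines:
        entry = line.rstrip("\n")
        if not entry:
            continue  # skip blank lines
        if entry.startswith("#"):
            kept.append(entry)  # preserve header/comment lines
            continue
        # Split on the last space so the count is always the final field.
        ngram, _, count = entry.rpartition(" ")
        if len(ngram) == n and count.isdigit():
            kept.append(f"{ngram} {count}")
    return kept


# Small in-memory example; a real run would read the lines from a profile file.
sample = ["# example profile", "_th 50", "the 120", "th 300", "then 10", ""]
result = filter_trigrams(sample)
```

With the sample above, only the comment line and the two three-character entries remain. The same function could be applied to the lines of a real profile file and the result written out with "\n".join(...).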
