Nick,

I've tracked down the issue, but I'm afraid it does not help much:
https://issues.apache.org/jira/browse/TIKA-546

Converting the 4-grams to 3-grams and dropping the 1- and 2-grams crossed my mind, but I'm probably better off creating a new profile from a fresh, large corpus anyway.
The best solution would be for Tika to read the Nutch profile format :-) But I don't understand the code well enough to tell whether this would be easy to do.

Best,
Cedric

On 18 January 2013 16:14, Nick Burch <[email protected]> wrote:
> gram profiler
