Dear all,
I am trying to upgrade from Nutch 0.9 to Tika 1.2. My current n-gram
profiles contain 1- to 4-grams and therefore cannot be read by Tika, which
only supports 3-gram profile files. I have two questions:
A) Why does Tika only support 3-gram profiles? In the code, the legacy
format is even referenced in comments (LanguageProfileBuilder):
    /** The minimum length allowed for a ngram. */
    final static int ABSOLUTE_MIN_NGRAM_LENGTH = 3; /* was 1 */
    /** The maximum length allowed for a ngram. */
    final static int ABSOLUTE_MAX_NGRAM_LENGTH = 3; /* was 4 */
B) I am not a linguistics expert. Is there a way to convert the legacy
profiles into the 3-gram files expected by Tika 1.2?
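To make question B concrete, the naive conversion I have in mind is sketched
below. It only filters lines and assumes the profile files are plain text with
one "<ngram> <count>" entry per line and "#" for comment lines. The class and
method names (NgramProfileFilter, keepTrigrams) are made up for this example and
are not part of Nutch or Tika. I do not know whether counts taken from a 1- to
4-gram profile are a valid substitute for a profile trained on 3-grams only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper, not part of Nutch or Tika. */
public class NgramProfileFilter {

    /** Target n-gram length; 3 matches the constants quoted above. */
    static final int TARGET_LENGTH = 3;

    /**
     * Keeps comment lines and entries whose n-gram has exactly TARGET_LENGTH
     * characters. Assumes each entry looks like "<ngram><whitespace><count>";
     * anything else (blank lines, malformed entries) is dropped.
     */
    static List<String> keepTrigrams(List<String> lines) {
        List<String> kept = new ArrayList<String>();
        for (String line : lines) {
            if (line.startsWith("#")) {
                kept.add(line); // preserve comment lines unchanged
                continue;
            }
            String[] parts = line.trim().split("\\s+");
            if (parts.length == 2 && parts[0].length() == TARGET_LENGTH) {
                kept.add(line);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Made-up sample entries mixing 1-, 2-, 3- and 4-grams.
        List<String> legacy = Arrays.asList(
                "# sample profile", "_ 100", "th 50", "the 40", "_th 30", "then 5");
        for (String line : keepTrigrams(legacy)) {
            System.out.println(line);
        }
    }
}
```

Reading and writing the actual files (for example with java.nio.file.Files)
is left out here, since the encoding of the legacy files is another thing I am
unsure about.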
Best,
Cedric