I just observed that training a tokenizer model on a text file without any
<SPLIT> tags fails with the following error message:

Performing 100 iterations.
  1:  ... loglikelihood=0.0 1.0
  2:  ... loglikelihood=0.0 1.0
Exception in thread "main" java.lang.IllegalArgumentException: opennlp.tools.util.InvalidFormatException: The maxent model is not compatible with the tokenizer!
    at opennlp.tools.util.model.BaseModel.checkArtifactMap(BaseModel.java:476)
    at opennlp.tools.tokenize.TokenizerModel.<init>(TokenizerModel.java:63)
    at opennlp.tools.tokenize.TokenizerME.train(TokenizerME.java:253)
    at opennlp.tools.cmdline.tokenizer.TokenizerTrainerTool.run(TokenizerTrainerTool.java:89)
    at opennlp.tools.cmdline.CLI.main(CLI.java:222)
Caused by: opennlp.tools.util.InvalidFormatException: The maxent model is not compatible with the tokenizer!
    at opennlp.tools.tokenize.TokenizerModel.validateArtifactMap(TokenizerModel.java:155)
    at opennlp.tools.util.model.BaseModel.checkArtifactMap(BaseModel.java:474)
    ... 4 more

When I added just one <SPLIT> tag, it worked. But the documentation states:
"... The OpenNLP format contains one sentence per line. Tokens are either
separated by a *whitespace* *or* by a special <SPLIT> tag. The following
sample shows the sample from above in the correct format."

I understand that to mean the training data does not necessarily have to
contain <SPLIT> tags; whitespace alone should be enough.
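
For anyone who wants to check a training file before hitting this, here is a
minimal sketch in plain Java that counts <SPLIT> tags in a list of training
lines. It uses no OpenNLP classes; the class name SplitTagCheck, the method
countSplitTags, and the two sample sentences are made up for illustration.
Based on what I saw, a count of zero is the case that triggers the exception.

```java
import java.util.List;

// Hypothetical helper, not part of OpenNLP: counts <SPLIT> tags in tokenizer
// training lines so a whitespace-only file can be spotted before training.
final class SplitTagCheck {
    static final String SPLIT_TAG = "<SPLIT>";

    // Returns the total number of <SPLIT> occurrences across all lines.
    static int countSplitTags(List<String> lines) {
        int count = 0;
        for (String line : lines) {
            int from = 0;
            int idx;
            // Scan each line left to right, skipping past every tag found.
            while ((idx = line.indexOf(SPLIT_TAG, from)) >= 0) {
                count++;
                from = idx + SPLIT_TAG.length();
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Made-up sample sentences, one sentence per line.
        List<String> whitespaceOnly = List.of("This is a short sentence .");
        List<String> withTag = List.of("This is a short sentence<SPLIT>.");

        System.out.println("whitespace only: " + countSplitTags(whitespaceOnly));
        System.out.println("with tag: " + countSplitTags(withTag));
    }
}
```

The match is case-sensitive, so a tag written as <Split> would not be counted.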

brgds,
Peter Thygesen
