I just observed that training a tokenizer model from a text file that contains no <SPLIT> tags fails with the following error message:

Performing 100 iterations.
  1: ... loglikelihood=0.0 1.0
  2: ... loglikelihood=0.0 1.0
Exception in thread "main" java.lang.IllegalArgumentException: opennlp.tools.util.InvalidFormatException: The maxent model is not compatible with the tokenizer!
    at opennlp.tools.util.model.BaseModel.checkArtifactMap(BaseModel.java:476)
    at opennlp.tools.tokenize.TokenizerModel.<init>(TokenizerModel.java:63)
    at opennlp.tools.tokenize.TokenizerME.train(TokenizerME.java:253)
    at opennlp.tools.cmdline.tokenizer.TokenizerTrainerTool.run(TokenizerTrainerTool.java:89)
    at opennlp.tools.cmdline.CLI.main(CLI.java:222)
Caused by: opennlp.tools.util.InvalidFormatException: The maxent model is not compatible with the tokenizer!
    at opennlp.tools.tokenize.TokenizerModel.validateArtifactMap(TokenizerModel.java:155)
    at opennlp.tools.util.model.BaseModel.checkArtifactMap(BaseModel.java:474)
    ... 4 more

When I added just a single <SPLIT> tag, training worked.

But the documentation states:

"... The OpenNLP format contains one sentence per line. Tokens are either separated by a *whitespace* *or* by a special <SPLIT> tag. The following sample shows the sample from above in the correct format."

I read that as saying the training data does not necessarily have to contain <SPLIT> tags; whitespace alone should be enough.

Best regards,
Peter Thygesen
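
For anyone who runs into the same exception, a small pre-check on the training file may help to confirm whether missing <SPLIT> tags are the cause. The following is only a rough sketch: the class name SplitTagCheck and the method countSplitTags are made up for illustration and are not part of OpenNLP, and it assumes the training file is UTF-8 encoded. It simply counts <SPLIT> tags; per the observation above, a count of zero appears to be what leads to the "not compatible with the tokenizer" failure.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

// Hypothetical helper, not part of OpenNLP.
class SplitTagCheck {

    // Tag the OpenNLP tokenizer training format uses to mark a token
    // boundary that is not a whitespace.
    static final String SPLIT_TAG = "<SPLIT>";

    // Counts all occurrences of the tag across the given training lines.
    static int countSplitTags(List<String> lines) {
        int count = 0;
        for (String line : lines) {
            int from = 0;
            while ((from = line.indexOf(SPLIT_TAG, from)) >= 0) {
                count++;
                from += SPLIT_TAG.length();
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java SplitTagCheck <training-file>");
            return;
        }
        // Assumption: the training file is UTF-8 encoded.
        List<String> lines =
                Files.readAllLines(Paths.get(args[0]), StandardCharsets.UTF_8);
        int count = countSplitTags(lines);
        System.out.println("Found " + count + " " + SPLIT_TAG + " tag(s).");
        if (count == 0) {
            System.out.println("No " + SPLIT_TAG
                    + " tags found; tokenizer training may fail as described above.");
        }
    }
}
```

Running it as "java SplitTagCheck <training-file>" before invoking the trainer shows whether the file contains at least one tag.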
