[ https://issues.apache.org/jira/browse/OPENNLP-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17618295#comment-17618295 ]
Martin Wiesner commented on OPENNLP-1366:
-----------------------------------------

For completeness, this also affects the latest OpenNLP release, {*}2.0.0{*}. I have recently written a patch for this, with conceptual help from [~struberg] in a joint session, and might provide a PR towards 2.0.1 soon. The build, including all tests, passes just fine, and the problem as reported above and on Stack Overflow is gone with the patch. @[~jzemerick] Should I upload a patch file upfront, so you can read/check it first? The same question applies for Mark.

> Training of MaxEnt Model with large corpora fails with java.io.UTFDataFormatException
> --------------------------------------------------------------------------------------
>
>                 Key: OPENNLP-1366
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-1366
>             Project: OpenNLP
>          Issue Type: Improvement
>          Components: Lemmatizer
>    Affects Versions: 1.9.4
>            Reporter: Richard Zowalla
>            Assignee: Jeff Zemerick
>            Priority: Major
>
> As written on [dev@opennlp.a.o|https://lists.apache.org/thread/vc5lfzj81tco703noqxpvy8sfj8fw8b1], we are working on training a large OpenNLP maxent model for lemmatizing German texts. We use a Wikipedia treebank from Tübingen.
> This consumes > 2 TB of RAM during training but finishes after some time. However, writing this model results in a java.io.UTFDataFormatException.
> Training such a big model isn't feasible for debugging. Gladly, a similar case with a smaller dataset is found on Stack Overflow: https://stackoverflow.com/questions/70064477/opennlp-lemmatizertrainer-utfdataformatexception-encoded-string-too-long
> It contains the OpenNLP CLI command to train a lemmatizer on a much smaller dataset.
> The stacktrace is raised while writing a String as UTF in DataOutputStream, which has a hard-coded size limitation in the JDK (for reasons beyond my knowledge ;)).
> Stacktrace:
> {code:java}
> java.io.UTFDataFormatException: encoded string too long: 383769 bytes
>     at java.base/java.io.DataOutputStream.writeUTF(DataOutputStream.java:364)
>     at java.base/java.io.DataOutputStream.writeUTF(DataOutputStream.java:323)
>     at opennlp.tools.ml.maxent.io.BinaryGISModelWriter.writeUTF(BinaryGISModelWriter.java:71)
>     at opennlp.tools.ml.maxent.io.GISModelWriter.persist(GISModelWriter.java:97)
>     at opennlp.tools.ml.model.GenericModelWriter.persist(GenericModelWriter.java:75)
>     at opennlp.tools.util.model.ModelUtil.writeModel(ModelUtil.java:71)
>     at opennlp.tools.util.model.GenericModelSerializer.serialize(GenericModelSerializer.java:36)
>     at opennlp.tools.util.model.GenericModelSerializer.serialize(GenericModelSerializer.java:29)
>     at opennlp.tools.util.model.BaseModel.serialize(BaseModel.java:597)
>     at opennlp.tools.cmdline.CmdLineUtil.writeModel(CmdLineUtil.java:182)
>     at opennlp.tools.cmdline.lemmatizer.LemmatizerTrainerTool.run(LemmatizerTrainerTool.java:77)
>     at opennlp.tools.cmdline.CLI.main(CLI.java:256)
> {code}

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
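The hard-coded limit mentioned in the description is that DataOutputStream.writeUTF stores the encoded length in an unsigned 16-bit field, so any string whose modified-UTF-8 encoding exceeds 65535 bytes throws UTFDataFormatException. The following is a minimal sketch illustrating the limit and one possible length-prefixed workaround; the helper writeLongUTF is hypothetical and is not the actual OpenNLP patch:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UTFDataFormatException;
import java.nio.charset.StandardCharsets;

public class WriteUtfLimitDemo {

  // Hypothetical workaround: write the UTF-8 byte length as a 4-byte int
  // (no 16-bit cap), followed by the raw bytes. A matching reader would
  // call readInt() and then readFully().
  static void writeLongUTF(DataOutputStream out, String s) throws IOException {
    byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
    out.writeInt(bytes.length);
    out.write(bytes);
  }

  public static void main(String[] args) throws IOException {
    // Same size as in the reported stack trace; well over the 65535-byte cap.
    String big = "x".repeat(383769);

    DataOutputStream out = new DataOutputStream(new ByteArrayOutputStream());
    try {
      out.writeUTF(big); // length does not fit in the 16-bit length field
      System.out.println("no exception");
    } catch (UTFDataFormatException e) {
      System.out.println("writeUTF failed: " + e.getMessage());
    }

    writeLongUTF(out, big); // length-prefixed write has no such cap
    System.out.println("writeLongUTF ok");
  }
}
```

A fix along these lines changes the on-disk format, so reader and writer must agree on the length-prefixed layout; the actual format chosen for the OpenNLP patch may differ.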