Hi all, this is my first post to the list. I’ve tried to gather some info from the documentation and by googling around, but I haven’t found a satisfying answer to the following questions. Please tell me where to RTFM if any of these questions belong in a FAQ or are off-topic.
It seems there’s no way to incrementally train the POS tagger, nor to parallelize this task. Is this correct?

If the only way to train the POS tagger is in one single shot, how can I estimate the memory requirements for the JVM? In other words, given, say, a 1GB training corpus, is there a way to estimate how much RAM would be needed?

Finally, I have tried to use the `-ngram` switch:

> opennlp POSTaggerTrainer.conllx -type maxent -ngram 3 ... <other options as usual: -lang -model -data -encoding>

but I get this error:

> Building ngram dictionary ... IO error while building NGram Dictionary: Stream not marked
> Stream not marked
> java.io.IOException: Stream not marked
>     at java.io.BufferedReader.reset(BufferedReader.java:485)
>     at opennlp.tools.util.PlainTextByLineStream.reset(PlainTextByLineStream.java:79)
>     at opennlp.tools.util.FilterObjectStream.reset(FilterObjectStream.java:43)
>     at opennlp.tools.util.FilterObjectStream.reset(FilterObjectStream.java:43)
>     at opennlp.tools.cmdline.postag.POSTaggerTrainerTool.run(POSTaggerTrainerTool.java:80)
>     at opennlp.tools.cmdline.CLI.main(CLI.java:222)

I can’t figure out what I’m doing wrong. Any help is really appreciated.

--
Giorgio Valoti
