On 10/30/2013 10:25 AM, Yakov Keranchuk wrote:
Hi all!


I encountered what I think is a small problem with POSTaggerTrainer.

The training file contains Russian and English words, e.g. "бежать_action", in UTF-8
encoding. During training (with or without the -encoding UTF-8 option) I get the
following:

opennlp.tools.postag.WordTagSampleStream read
WARNING: Error during parsing, ignoring sentence: ъєяшы_action ....(the
rest of sentence)

Where could the problem be?

The training file format assumes that a token and its POS tag are always separated by an underscore.
Since your data contains underscores, this does not work anymore; that's what the error message is
trying to tell you.

One way to solve this is to somehow get rid of the underscores in your text data.
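
A rough sketch of such a cleanup step, in case it helps. It assumes that the last underscore in each
whitespace-separated token is the tag separator and that any earlier underscore belongs to the word
itself. The class name UnderscoreCleaner and the '-' replacement character are made up for this
example and are not part of OpenNLP; pick whatever replacement suits your data.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of OpenNLP.
class UnderscoreCleaner {

    // Assumed replacement for underscores that occur inside a word.
    static final char REPLACEMENT = '-';

    // Keeps the last underscore as the word/tag separator and
    // replaces every earlier underscore in the word part.
    static String cleanToken(String token) {
        int split = token.lastIndexOf('_');
        if (split <= 0) {
            return token; // no usable separator, leave the token untouched
        }
        String word = token.substring(0, split).replace('_', REPLACEMENT);
        String tag = token.substring(split + 1);
        return word + "_" + tag;
    }

    // Cleans one sentence (one line of the training file).
    static String cleanLine(String line) {
        String trimmed = line.trim();
        if (trimmed.isEmpty()) {
            return "";
        }
        StringBuilder out = new StringBuilder();
        for (String token : trimmed.split("\\s+")) {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(cleanToken(token));
        }
        return out.toString();
    }

    // Usage: java UnderscoreCleaner <input file> <output file>
    public static void main(String[] args) throws IOException {
        Path in = Paths.get(args[0]);
        Path out = Paths.get(args[1]);
        List<String> cleaned = new ArrayList<>();
        // Read and write explicitly as UTF-8 so non-ASCII words survive.
        for (String line : Files.readAllLines(in, StandardCharsets.UTF_8)) {
            cleaned.add(cleanLine(line));
        }
        Files.write(out, cleaned, StandardCharsets.UTF_8);
    }
}
```

With this, a token such as "new_york_NNP" would become "new-york_NNP", while "бежать_action" is left
as it is.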

We have an open Jira issue to make the character used to separate a token and a tag configurable;
this would probably solve your problem.

I don't think implementing this will be much work, and a contribution would be very welcome.

HTH,
Jörn



