Re: POSTaggerTrainer encoding

Jörn Kottmann Wed, 30 Oct 2013 03:09:26 -0700

On 10/30/2013 10:25 AM, Yakov Keranchuk wrote:

Hi all!



I ran into a small problem (I think) with POSTaggerTrainer.

The training file contains Russian and English words, e.g. "бежать_action", in UTF-8
encoding. During training (with or without the -encoding UTF-8 option) I get the
following:

opennlp.tools.postag.WordTagSampleStream read
WARNING: Error during parsing, ignoring sentence: ъєяшы_action ....(the
rest of sentence)

Where could the problem be?

The training file format assumes that a token and its POS tag are always separated by an underscore. Since your data contains underscores, this does not work anymore; that's what the error message is trying to tell you ...

One way to solve this is to somehow get rid of the underscores in your text data.
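
A rough sketch of such a cleanup step in Python, in case it helps. It assumes the last underscore in each whitespace-separated item is the token/tag separator and replaces any earlier underscore with a hyphen. The function names, the hyphen as replacement character, and the file paths are all made up for illustration and are not part of OpenNLP.

```python
def clean_line(line, sep="_", replacement="-"):
    """Remove extra separators from the word part of each word_tag item.

    Assumption: the LAST separator in an item splits word from tag.
    """
    cleaned = []
    for item in line.split():
        word, found, tag = item.rpartition(sep)
        if not found or not word or not tag:
            # No separator at all, or an empty word or tag: report it
            # instead of silently writing a broken item.
            raise ValueError(f"no usable word{sep}tag pair in {item!r}")
        cleaned.append(word.replace(sep, replacement) + sep + tag)
    return " ".join(cleaned)


def clean_file(src_path, dst_path):
    """Rewrite a training file line by line, reading and writing UTF-8."""
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            if line.strip():
                dst.write(clean_line(line) + "\n")
            else:
                dst.write("\n")  # keep empty lines as they are
```

For example, clean_line("snake_case_NN бежать_action") gives "snake-case_NN бежать_action", so items that already contain exactly one underscore pass through unchanged.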

We have an open Jira issue to make the character used to separate a token and a tag configurable;

this would probably solve your problem.

I don't think implementing this will be much work; a contribution would be very welcome.


HTH,
Jörn
