Hi everyone,

I need to train the POS tagger on multi-word terms. In other words,
some of my lexical units are made up of several tokens separated by
whitespace characters (like "traffic light", "feu rouge", "in order
to", ...).

I think the training API [1] can handle that, but the command line
tools cannot. The former takes the words of a sentence as an array of
strings. The latter assumes that the whitespace character is the
lexical unit separator.
A convention such as concatenating all the words that are part of a
multi-word term is not a solution, since in that case the models built
by the command line and by the API would be different.

It would be great if we could set, via parameters, both the lexical
unit separator and the POS tag separator.
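
To make the idea concrete, here is a rough sketch of what I have in
mind. It is plain Java and does not use any OpenNLP class: the names
SeparatorSketch, TaggedUnit and parse, as well as the two separator
parameters, are hypothetical and only illustrate the proposal. It
splits one training line into (lexical unit, tag) pairs using
separators supplied by the caller instead of hard-coded ones.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch, not part of OpenNLP: parses one training line
// with caller-supplied separators.
final class SeparatorSketch {

    /** One lexical unit (possibly several words) with its POS tag. */
    static final class TaggedUnit {
        final String unit;
        final String tag;

        TaggedUnit(String unit, String tag) {
            this.unit = unit;
            this.tag = tag;
        }
    }

    /**
     * Splits a line into lexical units on unitSeparator, then splits each
     * unit from its tag on the LAST occurrence of tagSeparator, so the
     * unit itself may contain whitespace or the tag separator.
     */
    static List<TaggedUnit> parse(String line, String unitSeparator, String tagSeparator) {
        List<TaggedUnit> result = new ArrayList<>();
        // Pattern.quote treats the separator literally, not as a regex.
        for (String item : line.split(Pattern.quote(unitSeparator))) {
            if (item.isEmpty()) {
                continue; // skip empty items produced by repeated separators
            }
            int cut = item.lastIndexOf(tagSeparator);
            if (cut <= 0) {
                throw new IllegalArgumentException("Missing unit or tag in: " + item);
            }
            result.add(new TaggedUnit(item.substring(0, cut),
                                      item.substring(cut + tagSeparator.length())));
        }
        return result;
    }
}
```

With a tab as the lexical unit separator and "|" as the tag separator,
a line such as "the|DT<TAB>traffic light|NN<TAB>is|VBZ" would keep
"traffic light" as a single unit, so the command line tools would see
the same units as the array of strings passed to the API.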

What do you think?

/Nicolas

[1] http://incubator.apache.org/opennlp/documentation/manual/opennlp.html#tools.postagger.tagging.api
