To make this reliable, the tokenization of new, unseen
text must be done correctly. For the Spanish data we had
a special chunker to put the multi-word units into one token.
Do you use something like that?
What do you think about outputting a special POS tag to indicate
that it is a multi-word unit?
Jörn
On 06/21/2012 04:47 PM, Nicolas Hernandez wrote:
Hi Jörn
On Thu, Jun 21, 2012 at 9:50 AM, Jörn Kottmann <[email protected]> wrote:
Hello,
the lexical unit in the POS Tagger is a token. For the
Spanish POS models, multi-token chunks were joined
into one token with a "_" separator.
To what would you set the lexical unit separator in your case?
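That underscore convention can be sketched as follows. This is only an illustration of the pre/post-processing it implies; `joinMwu` and `splitMwu` are hypothetical helper names, not part of the OpenNLP API:

```java
// Sketch of the "_" convention used for the Spanish models:
// a multi-word term is joined into a single token before training,
// and the surface form is recovered after tagging.
public class MwuConvention {

    // "feu rouge" -> "feu_rouge": one lexical unit for the tagger
    static String joinMwu(String[] words) {
        return String.join("_", words);
    }

    // "feu_rouge" -> "feu rouge": restore the original surface form
    static String splitMwu(String token) {
        return token.replace('_', ' ');
    }

    public static void main(String[] args) {
        System.out.println(joinMwu(new String[]{"feu", "rouge"}));
        System.out.println(splitMwu("traffic_light"));
    }
}
```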
I do the same, but ... I am a bit reluctant to do that because:
1. I do not like to pre- and post-process my data (here, to add/remove an
underscore in the multi-word terms)
2. A model trained with the API, which allows you not to preprocess
your data, will be different from the model trained with the cli on
the same data
3. Finally, when you get a model you do not know which segmentation it
assumes or how the multi-word terms are represented
Since it is often convenient to use the cli, it would be nice to be
able to set the token separator, at least to be able to build the same
models as with the API.
The POS tag separator can already be configured in the class
which reads the input, but this parameter is not set by the cli
tool.
+1 to make both configurable from the command line.
Nice.
At least the idea has been proposed. If I have time...
Jörn
On 06/20/2012 03:02 PM, Nicolas Hernandez wrote:
Hi Everyone
I need to train the POS tagger on multi-word terms. In other words,
some of my lexical units are made of several tokens separated by
whitespace characters (like "traffic light", "feu rouge", "in order
to", ...).
I think the training API allows one to handle that, but the command-line
tools cannot. The former takes the words of a sentence as an array of
strings. The latter assumes that the whitespace character is the
lexical unit separator.
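A minimal sketch of that mismatch (the sentence and the term are made-up examples, not taken from any actual OpenNLP corpus): the command-line tool splits on whitespace, while the API lets the caller pass each lexical unit as one array element:

```java
// Two segmentations of "le feu rouge":
// whitespace splitting (CLI behavior) yields 3 lexical units, while a
// caller-built array (API behavior) can keep "feu rouge" as one unit.
public class SeparatorMismatch {
    public static void main(String[] args) {
        String line = "le feu rouge";

        // CLI-style: whitespace is the lexical unit separator
        String[] cliUnits = line.split("\\s+");

        // API-style: the caller controls segmentation, so a
        // multi-word term can be a single element
        String[] apiUnits = {"le", "feu rouge"};

        System.out.println(cliUnits.length + " units via whitespace");
        System.out.println(apiUnits.length + " units via the API array");
    }
}
```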
A convention like concatenating all the words which are part of a
multi-word term is not a solution, since in that case models built by
the command line and by the API will be different.
It would be great if we could set, via a parameter, the lexical
unit separator as well as the POS tag separator.
What do you think ?
/Nicolas
[1]
http://incubator.apache.org/opennlp/documentation/manual/opennlp.html#tools.postagger.tagging.api