Hello, the lexical unit in the POS Tagger is a token. For the Spanish POS models, multi-token chunks were converted into one token, with the words joined by "_".
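To make the Spanish-model convention concrete, here is a minimal sketch (plain Java, no OpenNLP dependency; the class and method names are invented for illustration):

```java
public class UnderscoreConvention {

    // The convention described above: the words of a multi-token chunk
    // are concatenated into a single token, joined by "_".
    static String toSingleToken(String multiWordTerm) {
        return multiWordTerm.replace(' ', '_');
    }

    public static void main(String[] args) {
        System.out.println(toSingleToken("feu rouge"));      // feu_rouge
        System.out.println(toSingleToken("traffic light"));  // traffic_light
    }
}
```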
To what would you set the lexical unit separator in your case? The POS tag separator can already be configured in the class which reads the input, but this parameter cannot be set from the CLI tool. +1 to making both configurable from the command line.

Jörn

On 06/20/2012 03:02 PM, Nicolas Hernandez wrote:
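As a sketch of what configurable separators could look like, here is a small parser in plain Java (no OpenNLP dependency; the class, method, and parameter names are invented, not the actual reader class mentioned above). With a tab as the lexical unit separator, tokens are free to contain spaces:

```java
import java.util.Arrays;
import java.util.regex.Pattern;

public class SeparatorDemo {

    // Hypothetical parser: splits one training line into words and tags
    // using a configurable lexical unit separator and tag separator.
    static String[][] parse(String line, String unitSep, String tagSep) {
        String[] units = line.split(Pattern.quote(unitSep));
        String[] words = new String[units.length];
        String[] tags = new String[units.length];
        for (int i = 0; i < units.length; i++) {
            // The tag separator is the last occurrence, so the word
            // itself may contain spaces or even the tag separator.
            int idx = units[i].lastIndexOf(tagSep);
            words[i] = units[i].substring(0, idx);
            tags[i] = units[i].substring(idx + tagSep.length());
        }
        return new String[][] { words, tags };
    }

    public static void main(String[] args) {
        String[][] parsed = parse("traffic light_NN\tahead_RB", "\t", "_");
        System.out.println(Arrays.toString(parsed[0])); // [traffic light, ahead]
        System.out.println(Arrays.toString(parsed[1])); // [NN, RB]
    }
}
```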
Hi Everyone,

I need to train the POS tagger on multi-word terms. In other words, some of my lexical units are made of several tokens separated by whitespace characters (like "traffic light", "feu rouge", "in order to", ...). I think the training API [1] allows handling that, but the command line tools cannot: the former takes the words of a sentence as an array of strings, while the latter assumes that the whitespace character is the lexical unit separator.

A convention like concatenating all the words which are part of a multi-word term is not a solution, since in that case the models built by the command line and by the API would be different. It would be great if we could set by parameter what the lexical unit separator is, as well as the POS tag separator.

What do you think?

/Nicolas

[1] http://incubator.apache.org/opennlp/documentation/manual/opennlp.html#tools.postagger.tagging.api
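The API-vs-CLI mismatch above can be sketched in plain Java (no OpenNLP dependency; the token and tag values are invented examples). Via an array, a multi-word term stays one element; a whitespace split of the same sentence no longer aligns with the tags:

```java
public class LexicalUnitDemo {
    public static void main(String[] args) {
        // Through an API taking arrays, "traffic light" can be one unit:
        String[] tokens = { "stop", "at", "the", "traffic light" };
        String[] tags   = { "VB", "IN", "DT", "NN" };
        System.out.println(tokens.length == tags.length); // true

        // A CLI that splits lines on whitespace breaks the same sentence
        // into 5 tokens, which no longer aligns with the 4 tags:
        String line = "stop at the traffic light";
        String[] cliTokens = line.split("\\s+");
        System.out.println(cliTokens.length); // 5
    }
}
```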
