On Wed, Jun 27, 2012 at 10:03 AM, Jörn Kottmann <[email protected]> wrote:
> On 06/22/2012 11:13 AM, Nicolas Hernandez wrote:
>>
>> On Thu, Jun 21, 2012 at 5:55 PM, Jörn Kottmann <[email protected]> wrote:
>>>
>>> To make this reliable, the tokenization of new unseen
>>> text must be done correctly. For the Spanish data we had
>>> a special chunker to put the multi-word units into one token.
>>>
>>> Do you use something like that?
>>
>> Kind of.
>>
>> A POS tagger should always be used with the tokenizer that was used
>> to train it.
>> The more you manage to automatically recognize multi-word units, the
>> more you should consider them in your training; it makes the
>> subsequent syntax analysis easier.
>> (Interaction can exist between the two processes, i.e. recognizing
>> multi-word units can require a pos tagging of simple words, but I
>> won't speak about that here.)
>
>
> At some place we need multi-word-unit detection. If you do that
> in our tokenizer, then it will affect all components which rely on
> the tokenization, e.g. also NER.
>
>
>
>>> What do you think about outputting a special pos tag to indicate
>>> that it is a multi-word unit?
>>
>> I do not see the point at the pos analysis level. We must try to
>> re-use what exists to preserve compatibility.
>> Here I should mention that [Green:2011:MEI:2145432.2145516] (see
>> below) introduced multi-word tags (one for each pos label: mw noun,
>> mw verb, mw preposition...) at the chunk and syntax levels. This is
>> probably a nice idea.
>
>
> That would move the responsibility to detect multi-word-units
> to the POS Tagger.
>
> A simple transformation step could convert the output to the
> non-multi-word-tags format.
>
> This has the advantage that a user can detect a multi-word unit.

The "special pos tags" would be for the whole multi-word expressions
(MWE) or for each word of the MWE ?

Anyway, in my opinion, the pos tagger trainer should not be aware of
that, and we must not force users to change their tagset to use the
OpenNLP tools.
If the data is annotated with pos tags which carry information about
MWEs, then they will be handled like any tags over simple words.
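
For instance, with the OpenNLP API (assuming the POSSample signature
from the 1.5 line, and reusing the example sentence further down), an
MWE token carrying an ordinary tag is just another token/tag pair, so
the trainer needs no special support:

    import opennlp.tools.postag.POSSample;

    // The MWE token "in order to" carries an ordinary IN tag; the
    // trainer does not need to know it is a multi-word unit.
    String[] sentence = {"Nico", "wants", "to", "get", "to", "bed",
                         "earlier", "in order to", "sleep", "longer"};
    String[] tags = {"NNP", "VBP", "TO", "VB", "TO", "NN",
                     "RB", "IN", "VB", "RB"};
    POSSample sample = new POSSample(sentence, tags);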


>
>>
>> The cli should offer a way to specify what the multi-word
>> expressions in the data are.
>> This can be done by using a parameter to set the token separator
>> character.
>>
>> Models built from the cli or the API should be the same.
>> One way to do that is to use a parameter to set the multi-word
>> separator character, and to turn this separator character into
>> whitespace before training the model.
>> For example, with " " as the token separator character, "_" as the
>> multi-word separator character and "/" as the pos tag separator, the
>> following sentence
>> Nico/NNP wants/VBP to/TO get/VB to/TO bed/NN earlier/RB in_order_to/IN
>> sleep/VB longer/RB
>> should be turned into
>> String[] sentence = {"Nico", "wants", "to", "get", "to", "bed",
>> "earlier", "in order to", "sleep", "longer"};
>> (note "in order to")
>> What do you think?
>>
>>
>
> I think that could be problematic if your training or test data
> contains the multi-word separator character. In this case you might
> treat something as a multi-word unit which should not be one.
> What do you think about using SGML-style tags as we do in the NER
> training format?
> For example: <MW>in order to</MW>/IN.

I do not like mixing annotation systems: either everything is in XML
(<MW pos="IN">in order to</MW>) or nothing is.
The multi-word separator can be a string rather than a single
character (e.g. "_##_", which is quite rare). The point is that the
user should be informed about the problem you mention, and since it is
up to him to set the string by parameter, he will make an informed and
sensible choice.
If you mix both annotation systems, the ambiguity problem also remains
for the start and end tags you use.
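
To make the idea concrete, here is a minimal sketch of that
separator-based transformation (plain Java; the class and method names
are mine, not an existing OpenNLP API). It splits a training line on
the token separator, splits off the tag at the last "/", and turns the
multi-word separator into whitespace inside each token:

    import java.util.ArrayList;
    import java.util.List;

    public class MweTrainingLineParser {

        // E.g. "_##_": a user-chosen string unlikely to occur in the data.
        private final String mweSeparator;

        public MweTrainingLineParser(String mweSeparator) {
            this.mweSeparator = mweSeparator;
        }

        // Parses a line like
        //   "Nico/NNP ... in_##_order_##_to/IN sleep/VB longer/RB"
        // into parallel token/tag arrays, replacing the multi-word
        // separator with a plain space inside each token.
        public String[][] parse(String line) {
            List<String> tokens = new ArrayList<String>();
            List<String> tags = new ArrayList<String>();
            for (String item : line.split(" ")) {
                int slash = item.lastIndexOf('/'); // "/" is the tag separator
                tokens.add(item.substring(0, slash).replace(mweSeparator, " "));
                tags.add(item.substring(slash + 1));
            }
            return new String[][] { tokens.toArray(new String[0]),
                                    tags.toArray(new String[0]) };
        }
    }

With "_##_" as the separator, "in_##_order_##_to/IN" yields the token
"in order to" with the tag "IN", and no SGML-style markup is needed.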

> Would you prefer dealing with multi-word units at the tokenizer level
> or at the POS Tagger level?
> Or do we need support for both?

I don't think that currently (considering the state of the art in
"(simple/multi-)word" tokenization) there is a unique answer to the
question of at which level(s) the MWE detection should occur. I can
see it being dealt with at several levels.

MWEs are hard to define. They can be defined by regular forms, by
syntactic composition, by semantic composition, or by none of these
criteria. Consequently, many approaches are possible to recognize
them: by dictionary (like DBpedia), by regex (e.g. numbers, emails),
by pos pattern (Adj Noun, Noun Prep Noun), by (B-I-O) chunking...
So some of these approaches can be used at the word tokenization
stage, some others after a first pos tagging stage.
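
As one concrete (and purely illustrative) instance of the dictionary
approach, a greedy longest-match lookup over the token stream is
enough to merge known MWEs into single tokens before tagging; the
class below is a sketch, not an OpenNLP component:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Set;

    public class DictionaryMweMerger {

        // Known MWEs stored as token sequences, e.g. ["in", "order", "to"].
        private final Set<List<String>> dictionary;
        private final int maxMweLength;

        public DictionaryMweMerger(Set<List<String>> dictionary,
                                   int maxMweLength) {
            this.dictionary = dictionary;
            this.maxMweLength = maxMweLength;
        }

        // Merges dictionary MWEs into single whitespace-joined tokens,
        // trying the longest candidate first (greedy longest match).
        public List<String> merge(List<String> tokens) {
            List<String> out = new ArrayList<String>();
            int i = 0;
            while (i < tokens.size()) {
                int matched = 1;
                for (int len = Math.min(maxMweLength, tokens.size() - i);
                     len > 1; len--) {
                    if (dictionary.contains(tokens.subList(i, i + len))) {
                        matched = len;
                        break;
                    }
                }
                out.add(join(tokens.subList(i, i + matched)));
                i += matched;
            }
            return out;
        }

        private static String join(List<String> words) {
            StringBuilder sb = new StringBuilder();
            for (String w : words) {
                if (sb.length() > 0) sb.append(' ');
                sb.append(w);
            }
            return sb.toString();
        }
    }

With ["in", "order", "to"] in the dictionary, the tokens
["sleep", "in", "order", "to", "rest"] become
["sleep", "in order to", "rest"], matching the tokenization proposed
in the training-format example above.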

I think we agree that whatever analyzers we use (POS, Chunk, Parser,
NER...), all of them should have been built on data word-tokenized in
the same way.

Personally, I do not use the OpenNLP word tokenizer. Actually, I did
use it, but I also use dictionaries and regular expressions, which
lead me to a richer concept of "word" (what is called a "lexical
unit"). I take these as the input of my processing chain.

I also use UIMA, and UIMA processes annotations. If the annotation
stands for "lexical unit", then whether a given unit is an MWE or a
simple word is transparent for the user.

So I would like the OpenNLP pos tagger/trainer cli to offer me a way
to build models I can use with UIMA without pre/post-processing.

In my opinion, an OpenNLP labeller/trainer should offer users the
possibility of adapting its input/output to their data, and not the
opposite.

/Nicolas

>
> Jörn
>
