On Wed, Jun 27, 2012 at 3:40 PM, Jörn Kottmann <[email protected]> wrote:
> On 06/27/2012 03:07 PM, Nicolas Hernandez wrote:
>>
>> On Wed, Jun 27, 2012 at 10:03 AM, Jörn Kottmann <[email protected]> wrote:
>>
>>>
>>> That would move the responsibility to detect multi-word units
>>> to the POS Tagger.
>>>
>>> A simple transformation step could convert the output to the
>>> non-multi-word-tags format.
>>>
>>> This has the advantage that a user can detect a multi-word unit.
>>
>> Would the "special POS tags" be for the whole multi-word expression
>> (MWE) or for each word of the MWE?
>>
>> Anyway, in my opinion the POS tagger trainer should not be aware of
>> that, and we must not force users to change their tagset to use the
>> OpenNLP tools.
>> If the data is annotated with POS tags that carry MWE information,
>> then those will be handled like any tags over simple words.
>
>
> Yes, that should just work with our current way of handling the data.
> A user even has the option with the current implementation to customize
> the POS Tagger to handle it differently; even including a custom
> MWE detector model in the POS model package would be possible.
>
> That should be fine as it is.
>
>>>> The CLI should offer a way to specify what the multi-word expressions
>>>> in the data are.
>>>> This can be done by using a parameter to set the token
>>>> separator character.
>>>>
>>>> Models built from the CLI or the API should be the same.
>>>> One way to do that is to use a parameter to set the multi-word
>>>> separator character and to turn it into whitespace before
>>>> training the model.
>>>> For example, with " " as the token separator character, "_" as the
>>>> multi-word separator character, and "/" as the POS tag separator, the
>>>> following sentence
>>>> Nico/NNP wants/VBP to/TO get/VB to/TO bed/NN earlier/RB in_order_to/IN sleep/VB longer/RB
>>>> should be turned into
>>>> String[] sentence = {"Nico", "wants", "to", "get", "to", "bed", "earlier", "in order to", "sleep", "longer"};
>>>> (note "in order to")
>>>> What do you think about that?
>>>>
>>>>
>>> I think that could be problematic if your training or test data contains
>>> the multi-word separator character. In that case you might consider
>>> something a multi-word unit which should not be one.
>>> What do you think about using SGML-style tags as we do in the NER
>>> training format?
>>> For example: <MW>in order to</MW>/IN.
>>
>> I do not like mixing annotation systems: either everything is in XML
>> (<MW pos="IN">in order to</MW>) or nothing is.
>> The multi-word separator can be a string rather than a single character
>> (e.g. "_##_", which is quite rare). The point is that the user should be
>> informed about the problem you mention, and since it is up to him to set
>> the string by parameter, he will make an informed and wise choice.
>> If you mix both annotation systems, then the ambiguity problem also
>> remains for the start and end tags you use.
>>
>>> Would you prefer dealing with multi-word units at the tokenizer level
>>> or at the POS Tagger level?
>>> Or do we need support for both?
>>
>>
>> I think we agree that whichever analyzers we use (POS, Chunk,
>> Parser, NER...), all should have been built on data word-tokenized in
>> the same way.
>
>
> If a token can be an MWE then we need to fix all our formats to
> support the MWE separator char sequence. Currently we use the whitespace
> tokenizer in most places to process our training data.
> That we should change, and use one which is sensitive to MWEs separated
> by a char sequence.
>
> We should specify a default MWE separator which is used when serializing
> data with MWEs, and offer an option to specify it.
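A minimal sketch of the separator-to-whitespace conversion Nicolas proposes above, assuming "/" as the tag separator and a configurable MWE separator string; MweSampleParser and its method are hypothetical names for illustration, not part of the OpenNLP API:

    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical helper, not part of the OpenNLP API.
    public class MweSampleParser {

        private final String mweSeparator; // e.g. "_", or a rarer string such as "_##_"

        public MweSampleParser(String mweSeparator) {
            this.mweSeparator = mweSeparator;
        }

        // Turns "... in_order_to/IN sleep/VB ..." into the tokens
        // {"in order to", "sleep"} and the tags {"IN", "VB"}.
        public String[][] parse(String line) {
            List<String> tokens = new ArrayList<>();
            List<String> tags = new ArrayList<>();
            for (String part : line.split(" ")) {
                int sep = part.lastIndexOf('/');
                String token = part.substring(0, sep);
                // Replace the MWE separator with whitespace before training,
                // so "in_order_to" becomes the single token "in order to".
                tokens.add(token.replace(mweSeparator, " "));
                tags.add(part.substring(sep + 1));
            }
            return new String[][] {tokens.toArray(new String[0]),
                                   tags.toArray(new String[0])};
        }
    }

Splitting on the last "/" keeps tokens that themselves contain a slash intact, and choosing a rare separator string such as "_##_" mitigates the ambiguity Jörn raises above.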
>> Personally I do not use the OpenNLP word tokenizer. Actually I used it,
>> but I also use dictionaries and regular expressions, which lead me to a
>> richer concept of "word" (what is called a "lexical unit"). I take
>> them as input of my processing chain.
>>
>> And I also use UIMA, and UIMA processes annotations. If the annotation
>> stands for "lexical unit", some can be MWEs and some simple words;
>> it is transparent for the user.
>>
>> So I would like the OpenNLP POS tagger/trainer CLI to offer me
>> a way to build models I can use with UIMA without pre/post-processing.
>>
>> In my opinion, an OpenNLP labeller/trainer should offer users the
>> possibility of adapting its input/output to their data,
>> and not the opposite.
>
>
> Yes, I see this the same way.
>
> Would you mind opening a JIRA issue to request MWE support in our
> training formats?
>
> Jörn
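And a rough sketch of the "simple transformation step" Jörn mentions at the top of the thread, spreading an MWE's tag over its component words so the tagged output matches the non-multi-word-tags format; MweFlattener is likewise a hypothetical name:

    // Hypothetical post-processing step, not an existing OpenNLP class.
    public class MweFlattener {

        // flatten({"in order to", "sleep"}, {"IN", "VB"})
        // yields "in/IN order/IN to/IN sleep/VB".
        public static String flatten(String[] tokens, String[] tags) {
            StringBuilder out = new StringBuilder();
            for (int i = 0; i < tokens.length; i++) {
                // Every word of a multi-word token inherits the MWE's tag.
                for (String word : tokens[i].split(" ")) {
                    if (out.length() > 0) {
                        out.append(' ');
                    }
                    out.append(word).append('/').append(tags[i]);
                }
            }
            return out.toString();
        }
    }

A variant could use B-/I-style tag prefixes instead, so the MWE boundaries remain recoverable after flattening.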
done https://issues.apache.org/jira/browse/OPENNLP-515
