On 06/22/2012 11:13 AM, Nicolas Hernandez wrote:
On Thu, Jun 21, 2012 at 5:55 PM, Jörn Kottmann <[email protected]> wrote:
To make this reliable the tokenization on new unseen
text must be done correctly. For the spanish data we had
a special chunker to put the multi word units into one token.

Do you use something like that?
Kind of, yes.

A POS tagger should always be used with the same tokenizer that was used
to train it.
The more multi-word units you manage to recognize automatically, the
more you should consider them in your training data. It makes the
subsequent syntactic analysis easier.
(The two processes can interact, i.e. recognizing multi-word units can
itself require POS tagging of the simple words, but I won't speak about
that here.)

At some place we need multi-word-unit detection. If we do that
in our tokenizer, then it will affect all components which rely on the
tokenization, e.g. also NER.


What do you think about outputting a special pos tag to indicate
that it is a multi-word tag?
I do not see the point at the POS analysis level. We should try to
re-use what already exists to preserve compatibility.
Here I should mention that [Green:2011:MEI:2145432.2145516] (see
below) introduced multi-word tags (one for each POS label: mw noun, mw
verb, mw preposition, ...) at the chunk and syntax levels. This is
probably a nice idea.

That would move the responsibility to detect multi-word-units
to the POS Tagger.

A simple transformation step could convert the output to the
non-multi-word-tags format.

This has the advantage that a user can detect a multi-word unit.
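To make the transformation step concrete, here is a minimal sketch, assuming a hypothetical "MW_" tag prefix (e.g. "MW_IN" on the single token "in order to"); the class and method names are just for illustration, not an existing OpenNLP API:

```java
import java.util.ArrayList;
import java.util.List;

public class MwTagFlattenDemo {

  // Converts a (token, tag) sequence that uses hypothetical "MW_"-prefixed
  // multi-word tags into the plain non-multi-word format: a multi-word token
  // is split on whitespace and every part receives the underlying tag.
  static List<String[]> flatten(String[] tokens, String[] tags) {
    List<String[]> out = new ArrayList<>();
    for (int i = 0; i < tokens.length; i++) {
      if (tags[i].startsWith("MW_")) {
        String baseTag = tags[i].substring(3); // drop the "MW_" prefix
        for (String part : tokens[i].split(" ")) {
          out.add(new String[] { part, baseTag });
        }
      } else {
        out.add(new String[] { tokens[i], tags[i] });
      }
    }
    return out;
  }

  public static void main(String[] args) {
    List<String[]> result = flatten(
        new String[] { "in order to", "sleep" },
        new String[] { "MW_IN", "VB" });
    for (String[] pair : result) {
      System.out.println(pair[0] + "/" + pair[1]);
      // in/IN  order/IN  to/IN  sleep/VB
    }
  }
}
```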


The CLI should offer a way to specify which sequences in the data are
multi-word expressions.
This can be done with a parameter that sets the token separator
character.

Models built from the CLI or the API should be the same.
One way to achieve that is to use a parameter that sets the multi-word
separator character and to turn this separator character into
whitespace before training the model.
For example with " " as the token separator character, "_" as the
multi-word separator character and "/" as the pos tag separator, the
following sentence
Nico/NNP wants/VBP to/TO get/VB to/TO bed/NN earlier/RB in_order_to/IN
sleep/VB longer/RB
should be turned into
String[] sentence = {"Nico", "wants", "to", "get", "to", "bed",
"earlier", "in order to", "sleep", "longer"};
(note "in order to")
What do you think?



I think that could be problematic if your training or test data contains
the multi-word separator character. In that case you might consider
something to be a multi-word unit which should not be one.
What do you think about using SGML style tags as we do in the NER training format?
For example: <MW>in order to</MW>/IN.
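A sketch of how such markup could be read (a regex-based illustration only, not an existing OpenNLP format parser; the class name and pattern are assumptions):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SgmlMwDemo {

  // Matches SGML-style multi-word markup such as "<MW>in order to</MW>/IN",
  // capturing the multi-word token and its POS tag.
  private static final Pattern MW = Pattern.compile("<MW>(.*?)</MW>/(\\S+)");

  // Returns the (token, tag) pairs of all marked multi-word units in a line.
  static List<String[]> extract(String line) {
    List<String[]> pairs = new ArrayList<>();
    Matcher m = MW.matcher(line);
    while (m.find()) {
      pairs.add(new String[] { m.group(1), m.group(2) });
    }
    return pairs;
  }

  public static void main(String[] args) {
    List<String[]> pairs =
        extract("earlier/RB <MW>in order to</MW>/IN sleep/VB");
    System.out.println(pairs.get(0)[0] + " -> " + pairs.get(0)[1]);
    // in order to -> IN
  }
}
```

Since the markup carries its own delimiters, no reserved separator character is needed inside tokens, which avoids the clash described above.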

Would you prefer, dealing with multi-word-units at the tokenizer level
or at the POS Tagger level?
Or do we need support for both?

Jörn
