On 06/21/2012 05:31 PM, Nicolas Hernandez wrote:
On Thu, Jun 21, 2012 at 10:00 AM, Jörn Kottmann <[email protected]> wrote:
BTW, if you are training on the FrenchTreebank.
We have dedicated format support for it in the trunk, would be
easy to do the POS training with it.
=)
I didn't know. That is nice for OpenNLP. I am sorry I did not answer
you about your invitation to integrate my own code. My approach was
not dedicated to parsing the FrenchTreebank, so I could not
integrate it easily.

I've tried the converter, but I am not sure how to use it.
[2] gives one sentence per line with no POS tag associated with the tokens.

Anyway, it is very tricky to choose what to consider as tokens (either
compounds or only simple words) and what POS tag to give to the tokens.
I am not sure I understand the choices that were made in [1].
As soon as I manage to make the converter work, it will be easier
to see them.

The current implementation uses the same tag for all tokens of a
multi-word unit.
For example: in_order_to/IN becomes in/IN order/IN to/IN.
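
Roughly it does something like this (just a sketch to illustrate, the class and method names are made up):

import java.util.ArrayList;
import java.util.List;

public class MwuSplitter {

    // Splits one tagged multi-word unit into word/tag pairs which all
    // share the tag of the original unit.
    // split("in_order_to", "IN") -> [in/IN, order/IN, to/IN]
    static List<String> split(String unit, String tag) {
        List<String> parts = new ArrayList<String>();
        for (String word : unit.split("_")) {
            parts.add(word + "/" + tag);
        }
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(split("in_order_to", "IN"));
    }
}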

Would you use the POS tags as they are in the data? Maybe it would be useful
to add support for POS tag mappings; this would make it easy to experiment with
different tag sets.
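
Such a mapping could look roughly like this (only POSSample is our API; the mapping table and the French Treebank tags in it are just examples):

import java.util.HashMap;
import java.util.Map;

import opennlp.tools.postag.POSSample;

public class TagMapper {

    private final Map<String, String> mapping = new HashMap<String, String>();

    public TagMapper() {
        // Illustrative mapping from French Treebank tags to a coarser tag set.
        mapping.put("NC", "N");
        mapping.put("NPP", "N");
        mapping.put("ADJ", "A");
    }

    // Returns a copy of the sample with every tag replaced by its mapped
    // value; tags without a mapping are kept unchanged.
    public POSSample map(POSSample sample) {
        String[] tags = sample.getTags();
        String[] mapped = new String[tags.length];
        for (int i = 0; i < tags.length; i++) {
            String target = mapping.get(tags[i]);
            mapped[i] = target != null ? target : tags[i];
        }
        return new POSSample(sample.getSentence(), mapped);
    }
}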

Looks like we want to give the user some options on how these cases will be handled.

We need to add direct support for training the POS Tagger on this data.
Then you can do:
bin/opennlp POSTaggerConverter frenchtreebank ...
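
Until that is in place, training through the Java API could look roughly like this (only a sketch; it assumes the 1.5-style POSTaggerME.train signature and word_tag formatted input, and the file names are placeholders):

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.InputStreamReader;

import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSSample;
import opennlp.tools.postag.POSTaggerME;
import opennlp.tools.postag.WordTagSampleStream;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.TrainingParameters;

public class FtbPosTrainer {

    public static void main(String[] args) throws Exception {
        // word_tag formatted training data, e.g. written out by the converter.
        ObjectStream<POSSample> samples = new WordTagSampleStream(
                new InputStreamReader(new FileInputStream("ftb-train.pos"), "UTF-8"));

        TrainingParameters params = new TrainingParameters();
        params.put(TrainingParameters.ITERATIONS_PARAM, "100");
        params.put(TrainingParameters.CUTOFF_PARAM, "5");

        // 1.5-style signature; tag dictionary and ngram dictionary left out.
        POSModel model = POSTaggerME.train("fr", samples, params, null, null);

        model.serialize(new FileOutputStream("fr-pos-maxent.bin"));
        samples.close();
    }
}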

There is a problem with the Parse object and the way the command line tools
are built. The CLI tools assume that they can serialize a sample object into
training data via toString, but that does not work for the Parse object yet.
To fix that we need to make a breaking API change and refactor some code in
the coref component.
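
To illustrate, the converter tools basically do something like this (ObjectStream is the real interface, the rest is just a sketch):

import java.io.IOException;

import opennlp.tools.util.ObjectStream;

public class ToStringConverter {

    // Writes every sample as one line of training data, relying on the
    // sample type's toString() to produce that format, which Parse does
    // not provide yet.
    static <T> void write(ObjectStream<T> samples) throws IOException {
        T sample;
        while ((sample = samples.read()) != null) {
            System.out.println(sample.toString());
        }
    }
}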
Anyway it would be nice to get the parser trained on it as well.

Jörn
