On 06/21/2012 05:31 PM, Nicolas Hernandez wrote:
On Thu, Jun 21, 2012 at 10:00 AM, Jörn Kottmann <[email protected]> wrote:
BTW, if you are training on the FrenchTreebank.
We have dedicated format support for it in the trunk, would be
easy to do the POS training with it.
=)
I didn't know. That is nice for OpenNLP. I am sorry I did not answer
you about your invitation to integrate my own code. My approach was
not dedicated to parsing the FrenchTreebank, so I could not
integrate it easily.

I've tried the converter, but I am not sure how to use it.
[2] gives one sentence per line with no POS tag associated with the tokens.

Anyway, it is very tricky to choose what to consider as tokens (either
compounds or only simple words) and what POS tag to give to the tokens.
I am not sure I understand the choices that were made in [1].
As soon as I manage to make the converter work, it will be easier
to see them.

The current implementation uses the same tag for all tokens of a
multi-word unit.
For example: in_order_to/IN becomes in/IN order/IN to/IN.
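
Roughly it does something like this (just a sketch to illustrate, the class and method names are made up):

import java.util.ArrayList;
import java.util.List;

public class MwuSplitter {

    // Splits one tagged multi-word unit into word/tag pairs which all
    // share the tag of the original unit.
    // split("in_order_to", "IN") -> [in/IN, order/IN, to/IN]
    static List<String> split(String unit, String tag) {
        List<String> parts = new ArrayList<String>();
        for (String word : unit.split("_")) {
            parts.add(word + "/" + tag);
        }
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(split("in_order_to", "IN"));
    }
}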

Would you use the POS tags as they are in the data? Maybe it would be useful
to add support for POS tag mappings; this would make it easy to experiment with
different tag sets.
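
Such a mapping could look roughly like this (only POSSample is our API; the mapping table and the French Treebank tags in it are just examples):

import java.util.HashMap;
import java.util.Map;

import opennlp.tools.postag.POSSample;

public class TagMapper {

    private final Map<String, String> mapping = new HashMap<String, String>();

    public TagMapper() {
        // Illustrative mapping from French Treebank tags to a coarser tag set.
        mapping.put("NC", "N");
        mapping.put("NPP", "N");
        mapping.put("ADJ", "A");
    }

    // Returns a copy of the sample with every tag replaced by its mapped
    // value; tags without a mapping are kept unchanged.
    public POSSample map(POSSample sample) {
        String[] tags = sample.getTags();
        String[] mapped = new String[tags.length];
        for (int i = 0; i < tags.length; i++) {
            String target = mapping.get(tags[i]);
            mapped[i] = target != null ? target : tags[i];
        }
        return new POSSample(sample.getSentence(), mapped);
    }
}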

Looks like we want to give the user some options on how these cases will be handled.

We need to add direct support for training the POS Tagger on this data.
Then you can do:
bin/opennlp POSTaggerConverter frenchtreebank ...
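
Until that is in place, training through the Java API could look roughly like this (only a sketch; it assumes the 1.5-style POSTaggerME.train signature and word_tag formatted input, and the file names are placeholders):

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.InputStreamReader;

import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSSample;
import opennlp.tools.postag.POSTaggerME;
import opennlp.tools.postag.WordTagSampleStream;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.TrainingParameters;

public class FtbPosTrainer {

    public static void main(String[] args) throws Exception {
        // word_tag formatted training data, e.g. written out by the converter.
        ObjectStream<POSSample> samples = new WordTagSampleStream(
                new InputStreamReader(new FileInputStream("ftb-train.pos"), "UTF-8"));

        TrainingParameters params = new TrainingParameters();
        params.put(TrainingParameters.ITERATIONS_PARAM, "100");
        params.put(TrainingParameters.CUTOFF_PARAM, "5");

        // 1.5-style signature; tag dictionary and ngram dictionary left out.
        POSModel model = POSTaggerME.train("fr", samples, params, null, null);

        model.serialize(new FileOutputStream("fr-pos-maxent.bin"));
        samples.close();
    }
}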

There is a problem with the Parse object and the way the command line tools
are built. The CLI tools assume that they can serialize a sample object into
training data via toString, but that does not work for the Parse object yet.
To fix that we need to make a breaking API change and refactor some code in
the coref component.
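
To illustrate, the converter tools basically do something like this (ObjectStream is the real interface, the rest is just a sketch):

import java.io.IOException;

import opennlp.tools.util.ObjectStream;

public class ToStringConverter {

    // Writes every sample as one line of training data, relying on the
    // sample type's toString() to produce that format, which Parse does
    // not provide yet.
    static <T> void write(ObjectStream<T> samples) throws IOException {
        T sample;
        while ((sample = samples.read()) != null) {
            System.out.println(sample.toString());
        }
    }
}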
Anyway it would be nice to get the parser trained on it as well.

Jörn
