Re: Training a POS tagger model

Aliaksandr Autayeu Fri, 27 Jul 2012 00:49:27 -0700

Hi Alessandra,

> I would like to provide (train) a POS tagger model for the Italian language.
> I have some questions:
> > - may I use a token_tag pair list in place of sentence list? Something
> like:
> > casa_NOUN
> > e_CON (conjunction)
>
This way you lose context. The POS tagger uses a window (a few tokens
around the target token) as a feature, and that window is used in training.
By formatting your dataset this way, you lose this feature.
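
To make the point about the window concrete, here is a minimal sketch in
Python. It is only an illustration of the idea, not OpenNLP's actual feature
generator: the function name, the feature keys, the window size of 2 and the
<BOS>/<EOS> padding markers are all my assumptions.

```python
def window_features(tokens, i, size=2):
    """Collect the tokens around position i as simple context features.

    Hypothetical illustration only; this is not how OpenNLP names or
    computes its features.
    """
    feats = {"w0": tokens[i]}
    for off in range(1, size + 1):
        # Pad with boundary markers when the window runs off the sentence.
        feats[f"w-{off}"] = tokens[i - off] if i - off >= 0 else "<BOS>"
        feats[f"w+{off}"] = tokens[i + off] if i + off < len(tokens) else "<EOS>"
    return feats


# "casa" inside a sentence: the neighbours carry information.
in_context = window_features(["la", "casa", "e", "la", "strada"], 1)

# "casa" as a one-token "sentence": every neighbour is just padding.
isolated = window_features(["casa"], 0)
```

With one token per line, every training example looks like the second case,
so the neighbouring-token features have nothing to learn from.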




> > ...
> > in place of
> >
> > la_ART casa_NOUN e_CON la_ART strada_NOUN
> > ...
> > because I have found an Italian word list.
>
Well, if it is a word list (arbitrary, unconnected words, e.g. as in a
dictionary), then it is not a text and it does not make a lot of sense to
train a model on it. But from your example it looks like you have tagged
sentences, they are just formatted in a different way. So, you have two
options: 1) reformat your dataset into a format OpenNLP supports, or 2) write
a Java class to support your format in OpenNLP. Your format looks quite
simple (do you have sentence delimiters?), so 1) might be feasible with
something like awk or sed.
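
If awk or sed is not at hand, option 1) can also be sketched in a few lines
of Python. This assumes that your file has one token_TAG pair per line and
that a blank line separates sentences; the blank-line delimiter is my
assumption, so adjust it to whatever your data actually uses. The function
name is hypothetical too.

```python
def pairs_to_sentences(lines):
    """Join one-pair-per-line input into one sentence per line.

    Assumes a blank line marks a sentence boundary (an assumption,
    not something the dataset is known to contain).
    """
    sentences, current = [], []
    for line in lines:
        pair = line.strip()
        if pair:
            current.append(pair)
        elif current:
            # Blank line: close the sentence collected so far.
            sentences.append(" ".join(current))
            current = []
    if current:
        # Flush the last sentence if the input has no trailing blank line.
        sentences.append(" ".join(current))
    return sentences


sample = ["la_ART", "casa_NOUN", "", "e_CON", "la_ART", "strada_NOUN"]
result = pairs_to_sentences(sample)
```

Each element of the result is one line in the "la_ART casa_NOUN ..." layout
from your example. If the file has no sentence delimiters at all, the
boundaries cannot be recovered this way.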



> > - Do I need to provide a tag dictionary? Is there a default tag
> dictionary?
>
A tag dictionary improves the performance of the model, but it is optional.
AFAIK, there is no default tag dictionary for Italian in OpenNLP.

Aliaksandr
