Hi Alessandra,

> I would like to provide (train) a POS tagger model for the Italian language.
> I have some questions:
> > - may I use a token_tag pair list in place of sentence list? Something
> like:
> > casa_NOUN
> > e_CON (conjunction)
>
This way you lose context. The POS tagger uses a window (a few tokens
around the target token) as a feature during training. If you format your
dataset this way, you lose that feature.
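
To make the point concrete, here is a minimal Python sketch of a
window-style feature. It is only an illustration: the function name, the
window size and the <BOS>/<EOS> padding markers are my assumptions, not
OpenNLP's actual feature set.

```python
def window_features(tokens, i, size=2):
    """Collect the tokens around position i as simple string features.

    Illustrative only: size and the <BOS>/<EOS> markers are assumptions.
    """
    features = {"w0": tokens[i]}
    for offset in range(1, size + 1):
        left = i - offset
        right = i + offset
        # Positions outside the sentence are replaced by boundary markers.
        features[f"w-{offset}"] = tokens[left] if left >= 0 else "<BOS>"
        features[f"w+{offset}"] = tokens[right] if right < len(tokens) else "<EOS>"
    return features


# The same word, once inside a sentence and once as a one-token "sentence".
in_context = window_features(["la", "casa", "e", "la", "strada"], 1)
isolated = window_features(["casa"], 0)
```

With the full sentence, the neighbours of "casa" are real words ("la",
"e"). With one token per line, every neighbour is a boundary marker, so
the window carries no information.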



> > ...
> > in place of
> >
> > la_ART casa_NOUN e_CON la_ART strada_NOUN
> > ...
> > because I have found an Italian word list.
>
Well, if it is a word list (arbitrary, unconnected words, like a
dictionary), then it is not a text, and it does not make much sense to
train a model on it. But from your example it looks like you have tagged
sentences that are just formatted differently. So you have two options:
1) reformat your dataset into a format OpenNLP supports, or 2) write a
Java class to support your format in OpenNLP. Your format looks quite
simple (do you have sentence delimiters?), so 1) might be feasible with
something like awk or sed.
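
If you would rather script option 1) than use awk or sed, a rough Python
sketch could look like the one below. It assumes that your file has one
token_TAG pair per line and that an empty line separates sentences. That
delimiter is a guess on my side, so adjust it to whatever your data
really uses. The output is one sentence per line with space-separated
token_TAG pairs, as in your second example.

```python
def lines_to_sentences(lines):
    """Join one-pair-per-line input into one sentence per line.

    Assumption: an empty line marks the end of a sentence.
    """
    sentences, current = [], []
    for raw in lines:
        line = raw.strip()
        if line:
            current.append(line)
        elif current:
            # Empty line: close the sentence collected so far.
            sentences.append(" ".join(current))
            current = []
    if current:
        # The last sentence may not be followed by an empty line.
        sentences.append(" ".join(current))
    return sentences


token_lines = [
    "la_ART", "casa_NOUN", "e_CON", "la_ART", "strada_NOUN",
    "",
    "la_ART", "strada_NOUN",
]
converted = lines_to_sentences(token_lines)
```

Writing each element of `converted` on its own line gives you the
sentence-per-line training file.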



> > - Do I need to provide a tag dictionary? Is there a default tag
> dictionary?
>
A tag dictionary improves the performance of the model, but it is
optional. AFAIK, there is no default tag dictionary for Italian in
OpenNLP.
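
Conceptually, a tag dictionary is just a mapping from each word to the
tags it is allowed to take. If you want to derive one from your own
tagged sentences, the idea can be sketched in Python as below. The
function name is made up for illustration, and this only builds the
mapping in memory; it does not write OpenNLP's dictionary file format.

```python
from collections import defaultdict


def build_tag_dictionary(tagged_sentences):
    """Map each word to the set of tags seen with it in the corpus."""
    allowed = defaultdict(set)
    for sentence in tagged_sentences:
        for pair in sentence.split():
            # Split on the last underscore, so words that themselves
            # contain "_" keep their full form.
            word, _, tag = pair.rpartition("_")
            allowed[word].add(tag)
    return dict(allowed)
```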

Aliaksandr
