Hi Alessandra,

> I would like to provide (train) a POS tagger model for Italian language.
> I have some questions:
>
> - may I use a token_tag pair list in place of sentence list? Something
> like:
>
> casa_NOUN
>
> e_CON (conjunction)
>

This way you lose context. The POS tagger uses a window (a few tokens around the target token) as a feature, and that window is used in training. If you format your dataset this way, you lose that feature.
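
To make the point about the window concrete, here is a rough sketch in Python. It is only an illustration and not OpenNLP's actual feature set: the function name, the window size of 2, and the `<BOS>`/`<EOS>` padding markers are all made up for this example.

```python
def window_features(tokens, i, size=2):
    """Collect the words around tokens[i], padding past the sentence edges.

    Hypothetical helper for illustration; not part of OpenNLP.
    """
    feats = {"w0": tokens[i]}
    for off in range(1, size + 1):
        left, right = i - off, i + off
        feats[f"w-{off}"] = tokens[left] if left >= 0 else "<BOS>"
        feats[f"w+{off}"] = tokens[right] if right < len(tokens) else "<EOS>"
    return feats


# Whole sentence: the neighbours of "casa" carry real information.
in_sentence = window_features(["la", "casa", "e", "la", "strada"], 1)

# One token per "sentence": every neighbour is just padding.
isolated = window_features(["casa"], 0)
```

With the full sentence the tagger can see that "casa" follows "la"; with one token per line it sees only padding on both sides.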

> ...
>
> in place of
>
> la_ART casa_NOUN e_CON la_ART strada_NOUN
>
> ...
>
> because I have found an Italian word list.
>

Well, if it is a word list (arbitrary words that are not connected to each other, e.g. like a dictionary), then it is not a text, and it does not make much sense to train a model on it. But from your example it looks like you have tagged sentences that are just formatted in a different way. So you have two options:

1) Reformat your dataset into a format OpenNLP supports.
2) Write a Java class to support your format in OpenNLP.

Your format looks quite simple (do you have sentence delimiters?), so 1) might be feasible with something like awk or sed.

> - Do I need to provide a tag dictionary? Is there a default tag
> dictionary?
>

A tag dictionary improves the performance of the model, but it is optional. AFAIK, there is no default tag dictionary for Italian in OpenNLP.

Aliaksandr
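
P.S. If awk or sed feels awkward, option 1) can also be sketched in a few lines of Python. This is only a sketch under one big assumption: that your file has one token_TAG pair per line and a blank line between sentences. I do not know whether your data actually has such delimiters, so adjust accordingly. The function name is made up for this example.

```python
def to_sentence_lines(lines):
    """Join token_TAG lines into one whitespace-separated sentence per line.

    Assumption: a blank line marks the end of a sentence.
    """
    sentences, current = [], []
    for line in lines:
        token = line.strip()
        if token:
            current.append(token)
        elif current:
            # Blank line: close the sentence collected so far.
            sentences.append(" ".join(current))
            current = []
    if current:
        # Last sentence without a trailing blank line.
        sentences.append(" ".join(current))
    return sentences


sample = [
    "la_ART", "casa_NOUN", "e_CON", "la_ART", "strada_NOUN",
    "",
    "la_ART", "strada_NOUN",
]
converted = to_sentence_lines(sample)
for sentence in converted:
    print(sentence)
```

Each printed line is one sentence of word_TAG pairs, like the "la_ART casa_NOUN e_CON la_ART strada_NOUN" line in your own example.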
