Thanks a lot.
On 27 Jul 2012, at 09:48, Aliaksandr Autayeu wrote:

> Hi Alessandra,
>
>> I would like to provide (train) a POS tagger model for the Italian language.
>> I have some questions:
>> - May I use a token_tag pair list in place of a sentence list? Something
>> like:
>> casa_NOUN
>> e_CON (conjunction)
>
> This way you lose context. There is a window (a few tokens around the target
> token) which is a feature for the POS tagger, and it is used in training. By
> formatting your dataset this way, you lose this feature.
>
>> ...
>> in place of
>>
>> la_ART casa_NOUN e_CON la_ART strada_NOUN
>> ...
>> because I have found an Italian word list.
>
> Well, if it is a word list (arbitrary words, not connected to one another,
> e.g. as in a dictionary), then it is not a text and it does not make a lot of
> sense to train a model on it. But from your example it looks like you have
> tagged sentences; they are just formatted in a different way. So, you have
> two options: 1) reformat your dataset into a format OpenNLP supports, or
> 2) write a Java class to support your format in OpenNLP. Your format looks
> quite simple (do you have sentence delimiters?), so 1) might be feasible with
> something like awk or sed.
>
>> - Do I need to provide a tag dictionary? Is there a default tag
>> dictionary?
>
> A tag dictionary improves the performance of the model, but it is not
> required; it is optional. AFAIK, for Italian there is no default tag
> dictionary in OpenNLP.
>
> Aliaksandr
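
For option 1, here is a minimal sketch in Python rather than awk or sed. It assumes the one-token-per-line file marks sentence boundaries with blank lines, which the actual data may not do. The function name `join_tagged_tokens` is hypothetical. The output is one sentence per line with whitespace-separated word_tag tokens, which is the layout OpenNLP's POS tagger training expects.

```python
def join_tagged_tokens(lines):
    """Group one-token-per-line input into one-sentence-per-line output.

    Assumption: a blank line separates sentences in the input.
    """
    sentences = []
    current = []
    for raw in lines:
        token = raw.strip()
        if token:
            # Still inside a sentence: collect the word_tag pair.
            current.append(token)
        elif current:
            # Blank line: close the sentence collected so far.
            sentences.append(" ".join(current))
            current = []
    if current:
        # Input did not end with a blank line; flush the last sentence.
        sentences.append(" ".join(current))
    return sentences


example = ["la_ART", "casa_NOUN", "e_CON", "la_ART", "strada_NOUN", ""]
for sentence in join_tagged_tokens(example):
    print(sentence)  # prints: la_ART casa_NOUN e_CON la_ART strada_NOUN
```

If the data has no sentence delimiters at all, the sentences cannot be recovered this way, and the context window mentioned above is lost.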
