Thanks a lot.




On 27 Jul 2012, at 09:48, Aliaksandr Autayeu wrote:

> Hi Alessandra,
> 
>> I would like to provide (train) a POS tagger model for the Italian language.
>> I have some questions:
>>> - May I use a token_tag pair list in place of a sentence list? Something
>> like:
>>> casa_NOUN
>>> e_CON (conjunction)
>> 
> This way you lose context. There is a window (a few tokens around the target
> token) which is a feature for the POS tagger, and it is used in training. By
> formatting your dataset this way, you lose this feature.
> 
> 
> 
>>> ...
>>> in place of
>>> 
>>> la_ART casa_NOUN e_CON la_ART strada_NOUN
>>> ...
>>> because I have found an Italian word list.
>> 
> Well, if it is a word list (arbitrary, unconnected words, as in a
> dictionary), then it is not a text and it does not make much sense to
> train a model on it. But from your example it looks like you have tagged
> sentences that are just formatted in a different way. So, you have two
> options: 1) reformat your dataset into a format OpenNLP supports, or 2) write
> a Java class to support your format in OpenNLP. Your format looks quite
> simple (do you have sentence delimiters?), so 1) might be feasible with
> something like awk or sed.
> 
> 
> 
>>> - Do I need to provide a tag dictionary? Is there a default tag
>> dictionary?
>> 
> A tag dictionary improves the performance of the model, but it is
> optional. AFAIK, for Italian there is no default tag dictionary in
> OpenNLP.
> 
> Aliaksandr
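
For option 1 above, here is a minimal sketch in Python rather than awk or sed. It joins a one-pair-per-line file into one sentence per line, with token_TAG pairs separated by spaces, as in the "la_ART casa_NOUN ..." example. It assumes that a blank line marks each sentence boundary; that delimiter is an assumption, since the thread does not say whether the data has one. The function name pairs_to_sentences is made up for illustration.

```python
def pairs_to_sentences(lines):
    """Join token_TAG pairs (one per line) into one sentence per line.

    Assumption: a blank line marks a sentence boundary. Adjust the
    boundary test if the real data uses a different delimiter.
    """
    sentences = []
    current = []
    for line in lines:
        pair = line.strip()
        if pair:
            # Still inside a sentence: collect the token_TAG pair.
            current.append(pair)
        elif current:
            # Blank line: close the sentence collected so far.
            sentences.append(" ".join(current))
            current = []
    if current:
        # Flush the last sentence if the input does not end with a blank line.
        sentences.append(" ".join(current))
    return sentences


if __name__ == "__main__":
    sample = [
        "la_ART", "casa_NOUN", "e_CON", "la_ART", "strada_NOUN",
        "",
        "la_ART", "casa_NOUN",
    ]
    for sentence in pairs_to_sentences(sample):
        print(sentence)
```

Without some sentence delimiter in the input, the pairs cannot be regrouped reliably, and the context window mentioned above is lost.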
