Hi, Riccardo,

On Tue, Jul 31, 2012 at 8:51 AM, Riccardo Tasso <[email protected]> wrote:

> Hi all,
>     I was asking myself which features are extracted for each token or
> which context is used in the default POS tagger.
>

The easiest way to find out is to check the source code:
http://svn.apache.org/viewvc/opennlp/trunk/opennlp-tools/src/main/java/opennlp/tools/postag/DefaultPOSContextGenerator.java?view=markup


> Since I'm training models for other languages than English (e.g. Italian),
> do you think I would have any benefit using a non standard POS Context
> Generator?
>

Yes, it would help! I've been doing it for Portuguese and added a few
language-specific features. For example, one feature looks up a
transitivity dictionary to help the tagger decide whether the token next
to a verb is likely to be a preposition or an article.
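
As a rough illustration of that kind of feature, here is a minimal sketch in
plain Java. It is not the OpenNLP context generator API: the class name, the
feature strings and the tiny dictionary are all made up for the example, and
it assumes that "next to a verb" means the token right after the verb.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not OpenNLP code: all names and feature strings are illustrative.
class TransitivityFeatures {

    // Toy transitivity dictionary: verb form -> "direct" (takes a direct object)
    // or "indirect" (usually followed by a preposition).
    static final Map<String, String> TRANSITIVITY = Map.of(
        "gosta", "indirect",   // e.g. "gosta de ..."
        "comprou", "direct"    // e.g. "comprou a casa"
    );

    // Builds the context features for the token at `index`.
    static List<String> features(int index, String[] tokens) {
        List<String> feats = new ArrayList<>();
        feats.add("w=" + tokens[index].toLowerCase());
        if (index > 0) {
            String prev = tokens[index - 1].toLowerCase();
            feats.add("p=" + prev);
            // Extra feature: transitivity of the previous token, if it is a known verb.
            String trans = TRANSITIVITY.get(prev);
            if (trans != null) {
                feats.add("prevtrans=" + trans);
            }
        }
        return feats;
    }
}
```

In a real tagger, strings like these would be appended to the features the
default context generator already produces, so the model can learn how much
weight to give them.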

It also helped a lot to use a tag dictionary in the sequence validator,
but this is tricky: if your dictionary is incomplete (for example, the
token A can be classified as X, Y and Z, but the dictionary is missing Z),
the dictionary will make your results worse.
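
To make the incomplete-dictionary problem concrete, here is a small sketch in
plain Java. It is not the OpenNLP sequence validator: the class, the method
and the dictionary contents are hypothetical and only mirror the A / X, Y, Z
example above.

```java
import java.util.Map;
import java.util.Set;

// Hypothetical sketch, not OpenNLP code: a tag filter backed by a toy dictionary.
class TagDictionaryFilter {

    // Toy tag dictionary: token -> tags it is allowed to take.
    // The entry for "a" is incomplete on purpose: tag Z is missing.
    static final Map<String, Set<String>> DICT = Map.of(
        "a", Set.of("X", "Y")
    );

    // A tag is accepted if the token is not in the dictionary,
    // or if the dictionary lists that tag for the token.
    static boolean validTag(String token, String tag) {
        Set<String> allowed = DICT.get(token.toLowerCase());
        return allowed == null || allowed.contains(tag);
    }
}
```

With this dictionary, validTag("A", "Z") returns false, so the tagger can
never output Z for that token even when Z is the correct tag. A token that is
absent from the dictionary is not restricted at all.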

What helped me a lot in deciding where to focus when adding new features
was the report generated by the POS Tagger CV when you include the
argument -reportOutputFile report.txt (available only in 1.5.3-SNAPSHOT).

It will show you which tags and tokens have poor accuracy, so you can
focus on improving those first.
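
If you want the same kind of per-tag breakdown on your own evaluation data,
it can be computed along these lines. This is a plain Java sketch with
hypothetical names. It is not how the OpenNLP report is produced, and it only
covers accuracy per gold tag.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch, not OpenNLP code: accuracy broken down by gold tag.
class PerTagAccuracy {

    // Returns gold tag -> fraction of tokens with that gold tag that were predicted correctly.
    // Assumes gold and predicted have the same length and are aligned token by token.
    static Map<String, Double> compute(String[] gold, String[] predicted) {
        Map<String, int[]> counts = new TreeMap<>(); // tag -> {correct, total}
        for (int i = 0; i < gold.length; i++) {
            int[] c = counts.computeIfAbsent(gold[i], k -> new int[2]);
            if (gold[i].equals(predicted[i])) {
                c[0]++;
            }
            c[1]++;
        }
        Map<String, Double> accuracy = new TreeMap<>();
        counts.forEach((tag, c) -> accuracy.put(tag, (double) c[0] / c[1]));
        return accuracy;
    }
}
```

Sorting the result by accuracy gives a quick list of the tags that deserve
new features first.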

Regards,
William
