Hi, Riccardo,

On Tue, Jul 31, 2012 at 8:51 AM, Riccardo Tasso <[email protected]> wrote:
> Hi all,
> I was asking myself which features are extracted for each token or
> which context is used in the default POS tagger.
>

It would be easier to check the source code:
http://svn.apache.org/viewvc/opennlp/trunk/opennlp-tools/src/main/java/opennlp/tools/postag/DefaultPOSContextGenerator.java?view=markup

> Since I'm training models for other languages than English (e.g. Italian),
> do you think I would have any benefit using a non standard POS Context
> Generator?
>

Yes, it would help! I've been doing it for Portuguese, and added a few specific features. For example, one of them checks a transitivity dictionary to help the tagger decide whether the token next to a verb is likely to be a preposition or an article.

It also helped a lot to have a tag dictionary in the sequence validator, but it is tricky: if your dictionary is not complete (for example, the token A can be classified as X, Y and Z, but the dictionary misses Z), the dictionary will make your results worse.

What helped me a lot in deciding where to focus while adding new features was the report generated by the POS Tagger CV if you include the argument -reportOutputFile report.txt (only in 1.5.3-SNAPSHOT). It will show you which tags and tokens have poor accuracy, so you can focus on improving those first.

Regards,
William
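
For illustration only, here is a minimal sketch of what a transitivity-dictionary feature like the one described above might look like. It is not the code used for Portuguese and does not depend on OpenNLP: the class name `TransitivityFeatures`, the dictionary entries, and the feature strings (`w=`, `p=`, `ptrans=`) are all assumptions made up for this example. Wiring such features into OpenNLP would mean implementing its POSContextGenerator interface, which is left out here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch of an extra context feature; not OpenNLP's own generator.
final class TransitivityFeatures {

    // Assumed toy transitivity dictionary: verb form -> transitivity class.
    // A real dictionary would be much larger and loaded from a resource.
    static final Map<String, String> TRANSITIVITY = Map.of(
        "gosta", "indirect",   // "gostar de": typically followed by a preposition
        "comprou", "direct"    // "comprar": typically followed by an article or noun
    );

    // Builds the feature strings for the token at position index.
    static List<String> contextFor(String[] tokens, int index) {
        List<String> features = new ArrayList<>();
        features.add("w=" + tokens[index].toLowerCase(Locale.ROOT));

        if (index > 0) {
            String previous = tokens[index - 1].toLowerCase(Locale.ROOT);
            features.add("p=" + previous);

            // Extra feature: transitivity class of the previous token,
            // emitted only when that token is in the dictionary.
            String transitivity = TRANSITIVITY.get(previous);
            if (transitivity != null) {
                features.add("ptrans=" + transitivity);
            }
        }
        return features;
    }
}
```

Calling `TransitivityFeatures.contextFor(new String[] {"Ela", "gosta", "de", "praia"}, 2)` would add `ptrans=indirect` to the features for "de", giving the model a hint that a preposition is likely at that position. Tokens that follow a word missing from the dictionary simply get no such feature.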
