Hi
I wondered how good or bad the quality of OpenNLP models is for various types 
of languages (Latin alphabet, Cyrillic alphabet, abjads, ideographic scripts).
I wrote a program that downloads a Universal Dependencies treebank 
(https://universaldependencies.org/) and trains and evaluates OpenNLP models for 
a language (sentence detector, tokenizer, POS tagger, lemmatizer).
The program and evaluation results are available at 
https://github.com/abzif/babzel
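
For anyone curious what the training step looks like, below is a minimal sketch 
of the sentence-detector part using OpenNLP's standard training API. The file 
names and the "pl" language code are placeholders, and it assumes the UD CoNLL-U 
data has already been converted to OpenNLP's one-sentence-per-line format (the 
actual conversion and evaluation logic lives in the repository linked above).

import java.io.File;
import java.io.FileOutputStream;
import java.nio.charset.StandardCharsets;

import opennlp.tools.sentdetect.SentenceDetectorFactory;
import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;
import opennlp.tools.sentdetect.SentenceSample;
import opennlp.tools.sentdetect.SentenceSampleStream;
import opennlp.tools.util.MarkableFileInputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;

public class TrainSentenceDetector {
    public static void main(String[] args) throws Exception {
        // Placeholder input: one sentence per line, extracted beforehand
        // from the UD CoNLL-U training file.
        MarkableFileInputStreamFactory inFactory =
                new MarkableFileInputStreamFactory(new File("pl-ud-train-sentences.txt"));
        try (ObjectStream<String> lines =
                     new PlainTextByLineStream(inFactory, StandardCharsets.UTF_8);
             ObjectStream<SentenceSample> samples = new SentenceSampleStream(lines)) {
            // Train with default maxent parameters; "pl" is just an example code.
            SentenceModel model = SentenceDetectorME.train(
                    "pl", samples, new SentenceDetectorFactory(),
                    TrainingParameters.defaultParams());
            try (FileOutputStream out = new FileOutputStream("pl-sent.bin")) {
                model.serialize(out);
            }
        }
    }
}

The tokenizer, POS tagger and lemmatizer follow the same train/evaluate pattern 
with their respective sample streams and trainer classes.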
This program may be useful for somebody who wants to train generic models for a 
desired language with little effort. Universal Dependencies supports a lot of 
languages, so it is well suited for this purpose.
The evaluation results show that models trained for alphabetic languages 
(Latin, Cyrillic, abjads) seem to have really good quality. 
Chinese/Japanese/Korean models are not that good. Also, the lemmatizer fails 
with an exception for some languages.
Maybe the results can serve as inspiration for improvements.
Thanks
Leszek
