Hi
I wondered how good (or bad) the quality of OpenNLP models is for various types
of languages (Latin alphabet, Cyrillic alphabet, abjads, ideographic scripts).
I wrote a program that downloads a Universal Dependencies treebank
(https://universaldependencies.org/) and then trains and evaluates OpenNLP models
for a given language (sentence detector, tokenizer, POS tagger, lemmatizer).
The program and evaluation results are available at
https://github.com/abzif/babzel
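
For readers who have not looked at the repository yet, below is a minimal sketch
of what such a training step can look like with the plain OpenNLP API (babzel
itself may do this differently). The treebank file name, language code and the
tiny CoNLL-U reader are only illustrative; the reader keeps just the FORM and
UPOS columns and skips comments, multiword tokens and empty nodes.

import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSSample;
import opennlp.tools.postag.POSTaggerFactory;
import opennlp.tools.postag.POSTaggerME;
import opennlp.tools.util.CollectionObjectStream;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.TrainingParameters;

public class UdPosTrainingSketch {

    public static void main(String[] args) throws IOException {
        // Placeholder file name and language code - any UD training split works.
        List<POSSample> samples = readConllu("pl_pdb-ud-train.conllu");
        ObjectStream<POSSample> stream = new CollectionObjectStream<>(samples);
        POSModel model = POSTaggerME.train(
                "pl", stream, TrainingParameters.defaultParams(), new POSTaggerFactory());
        try (OutputStream out = Files.newOutputStream(Paths.get("pl-pos-model.bin"))) {
            model.serialize(out);
        }
    }

    // Very simplified CoNLL-U reader: keeps only the FORM and UPOS columns,
    // skips comment lines, multiword token ranges (e.g. "1-2") and empty nodes.
    static List<POSSample> readConllu(String path) throws IOException {
        List<POSSample> samples = new ArrayList<>();
        List<String> tokens = new ArrayList<>();
        List<String> tags = new ArrayList<>();
        for (String line : Files.readAllLines(Paths.get(path), StandardCharsets.UTF_8)) {
            if (line.isEmpty()) {
                if (!tokens.isEmpty()) {
                    samples.add(new POSSample(tokens.toArray(new String[0]), tags.toArray(new String[0])));
                    tokens.clear();
                    tags.clear();
                }
            } else if (!line.startsWith("#")) {
                String[] cols = line.split("\t");
                if (cols[0].matches("\\d+")) {   // plain word line
                    tokens.add(cols[1]);         // FORM
                    tags.add(cols[3]);           // UPOS
                }
            }
        }
        if (!tokens.isEmpty()) {
            samples.add(new POSSample(tokens.toArray(new String[0]), tags.toArray(new String[0])));
        }
        return samples;
    }
}

The other model types follow the same pattern with different sample types
(SentenceDetectorME.train, TokenizerME.train, LemmatizerME.train).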
The program may be useful for somebody who wants to train generic models for a
desired language with little effort. Universal Dependencies supports a lot of
languages, so it is well suited for this purpose.
The evaluation results show that models trained for alphabetic languages
(Latin, Cyrillic, abjads) seem to have really good quality.
Chinese-Japanese-Korean models are not that good. Also, the lemmatizer fails
with an exception for some languages.
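
I am not claiming this is exactly how babzel computes its numbers, but for the
POS tagger an accuracy figure of this kind can be obtained with OpenNLP's
built-in POSEvaluator; a small sketch is below. The model and treebank file
names are placeholders, and readConllu is the simplified reader from the
training sketch above.

import java.io.File;
import java.io.IOException;

import opennlp.tools.postag.POSEvaluator;
import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSTaggerME;
import opennlp.tools.util.CollectionObjectStream;

public class UdPosEvaluationSketch {

    public static void main(String[] args) throws IOException {
        // Load the model produced by the training sketch (placeholder file name).
        POSModel model = new POSModel(new File("pl-pos-model.bin"));
        POSEvaluator evaluator = new POSEvaluator(new POSTaggerME(model));
        // Tag the held-out UD test split and compare against the gold UPOS tags.
        evaluator.evaluate(new CollectionObjectStream<>(
                UdPosTrainingSketch.readConllu("pl_pdb-ud-test.conllu")));
        System.out.println("UPOS word accuracy: " + evaluator.getWordAccuracy());
    }
}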
Maybe the results can be an inspiration for improvements.
Thanks
Leszek