The Apache OpenNLP team is pleased to announce the release of pre-trained models for 32 languages, based on Universal Dependencies (UD) treebanks.

The Apache OpenNLP library is a machine-learning-based toolkit for processing natural language text.

Changes in this version:

- New pre-trained sentence detection, tokenization, part-of-speech tagging, and lemmatization models are now available for 9 languages: Armenian, Basque, Catalan, Georgian, Greek, Kazakh, Korean, Icelandic, and Turkish.
- The existing sentence detection, tokenization, and part-of-speech tagging models for the 23 languages published with models release 1.1 have been re-trained.
- In addition, new lemmatization models have been trained and added for all languages.

All models, for a total of 32 languages, were trained with OpenNLP 2.5.0 based on the latest UD release, 2.15.

The models are compatible with Apache OpenNLP >=1.0.0.

Apache OpenNLP models and reports are available for download from our model download page:
https://opennlp.apache.org/models.html

More information about this release can be found in the README at:
https://dist.apache.org/repos/dist/release/opennlp/models/ud-models-1.2/README

Details about the effectiveness of these models can be found in the following report:
https://dist.apache.org/repos/dist/release/opennlp/models/ud-models-1.2/opennlp-training-eval-logs-1.2-2.5.0.zip

The Apache OpenNLP Team