Dear Daniel,

You can find a Spanish lemmatizer corpus (medical domain) at:
https://github.com/PlanTL-SANIDAD/SPACCC_TOKEN

Under the corpus folder (validation).

Regards,
Martin

On Tue, Jul 9, 2019 at 15:11, Dan Russ (<danrus...@gmail.com>) wrote:

> Hello,
> It looks like the GitHub repo has files in CoNLL-U format, which is
> readable by OpenNLP.
>
> es_gsd-ud-dev.conllu
>
> Daniel
>
> > On Jul 8, 2019, at 9:04 PM, T. Kuro Kurosaka <k...@bhlab.com> wrote:
> >
> > I downloaded OpenNLP hoping that I could use it to lemmatize Spanish text.
> >
> > But https://opennlp.apache.org/docs/1.9.1/manual/opennlp.html#tools.lemmatizer
> > seems to say I have to train a model first to use the lemmatizer.
> >
> > Although it says "The Universal Dependencies Treebank and the CoNLL 2009
> > datasets distribute training data for many languages.", I am having
> > difficulty finding one.
> >
> > I thought
> > https://github.com/UniversalDependencies/UD_Spanish-GSD
> > might be it, but the files there are in an XML format.
> >
> > Can someone point me to open-source lemmatizer training data in a format
> > the OpenNLP UIMA Lemmatizer can use?
> >
> > Thank you in advance.
> >
> > --
> > T. "Kuro" Kurosaka, Berkeley, California, USA

--
=======================================
Martin Krallinger, Dr.
--------------------------------------------------------------------
Head of Biological Text Mining Unit
Barcelona Supercomputing Center (BSC-CNS)
--------------------------------------------------------------------
General Technical Office (OTG) of the Plan TL, Biomedicine area,
Secretaría de Estado para el Avance Digital
=======================================
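A side note for the archives: once the data is in the three-column format the OpenNLP manual describes (word, POS tag, lemma; one token per line, blank line between sentences), training the statistical lemmatizer is a few lines of the opennlp-tools API. The sketch below assumes OpenNLP 1.9.x and a placeholder file name es-lemmatizer.train (e.g. the FORM, UPOS and LEMMA columns cut out of a .conllu file); it is meant as a starting point, not an official recipe.

import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

import opennlp.tools.lemmatizer.LemmaSample;
import opennlp.tools.lemmatizer.LemmaSampleStream;
import opennlp.tools.lemmatizer.LemmatizerFactory;
import opennlp.tools.lemmatizer.LemmatizerME;
import opennlp.tools.lemmatizer.LemmatizerModel;
import opennlp.tools.util.InputStreamFactory;
import opennlp.tools.util.MarkableFileInputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;

public class TrainSpanishLemmatizer {

    public static void main(String[] args) throws Exception {
        // Placeholder training file: word<TAB>postag<TAB>lemma, one token per
        // line, blank line between sentences (check the manual's lemmatizer
        // training section for the exact delimiter your version expects).
        InputStreamFactory in =
                new MarkableFileInputStreamFactory(new File("es-lemmatizer.train"));

        ObjectStream<String> lines =
                new PlainTextByLineStream(in, StandardCharsets.UTF_8);
        ObjectStream<LemmaSample> samples = new LemmaSampleStream(lines);

        // Train a Spanish lemmatizer model with default parameters.
        LemmatizerModel model = LemmatizerME.train(
                "es", samples, TrainingParameters.defaultParams(), new LemmatizerFactory());

        // Persist the model so it can be loaded by LemmatizerME or the UIMA annotator.
        try (OutputStream out = new FileOutputStream("es-lemmatizer.bin")) {
            model.serialize(out);
        }
    }
}

The same training can also be run from the command line with the LemmatizerTrainerME tool shipped in the OpenNLP distribution, if you prefer not to write Java.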