Dear Daniel,

You can find a Spanish lemmatizer corpus (medical domain) at:
https://github.com/PlanTL-SANIDAD/SPACCC_TOKEN

Under the corpus folder (validation).

Regards,
Martin

On Tue, Jul 9, 2019 at 15:11, Dan Russ (<danrus...@gmail.com>) wrote:

> Hello,
> It looks like the GitHub repo has files in CoNLL-U format, which is
> readable by OpenNLP.
>
> es_gsd-ud-dev.conllu
>
> Daniel
>
> > On Jul 8, 2019, at 9:04 PM, T. Kuro Kurosaka <k...@bhlab.com> wrote:
> >
> > I downloaded OpenNLP hoping that I could use it to lemmatize Spanish text.
> >
> > But https://opennlp.apache.org/docs/1.9.1/manual/opennlp.html#tools.lemmatizer
> > seems to say I have to train a model first to use the lemmatizer.
> >
> > Although it says "The Universal Dependencies Treebank and the CoNLL 2009
> > datasets distribute training data for many languages.", I am having
> > difficulty finding one.
> >
> > I thought
> > https://github.com/UniversalDependencies/UD_Spanish-GSD
> > might be it, but the files there are in an XML format.
> >
> > Can someone point me to open-source lemmatizer training data in a format
> > the OpenNLP UIMA Lemmatizer can use?
> >
> > Thank you in advance.
> >
> > --
> > T. "Kuro" Kurosaka, Berkeley, California, USA

--
=======================================
Martin Krallinger, Dr.
--------------------------------------------------------------------
Head of Biological Text Mining Unit
Barcelona Supercomputing Center (BSC-CNS)
--------------------------------------------------------------------
General Technical Office (OTG) of the Plan TL, Biomedicine area,
Secretaría de Estado para el Avance Digital
=======================================
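A side note for the archives: once the data is in the three-column format the OpenNLP manual describes (word, POS tag, lemma; one token per line, blank line between sentences), training the statistical lemmatizer is a few lines of the opennlp-tools API. The sketch below assumes OpenNLP 1.9.x and a placeholder file name es-lemmatizer.train (e.g. the FORM, UPOS and LEMMA columns cut out of a .conllu file); it is meant as a starting point, not an official recipe.

import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

import opennlp.tools.lemmatizer.LemmaSample;
import opennlp.tools.lemmatizer.LemmaSampleStream;
import opennlp.tools.lemmatizer.LemmatizerFactory;
import opennlp.tools.lemmatizer.LemmatizerME;
import opennlp.tools.lemmatizer.LemmatizerModel;
import opennlp.tools.util.InputStreamFactory;
import opennlp.tools.util.MarkableFileInputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;

public class TrainSpanishLemmatizer {

    public static void main(String[] args) throws Exception {
        // Placeholder training file: word<TAB>postag<TAB>lemma, one token per
        // line, blank line between sentences (check the manual's lemmatizer
        // training section for the exact delimiter your version expects).
        InputStreamFactory in =
                new MarkableFileInputStreamFactory(new File("es-lemmatizer.train"));

        ObjectStream<String> lines =
                new PlainTextByLineStream(in, StandardCharsets.UTF_8);
        ObjectStream<LemmaSample> samples = new LemmaSampleStream(lines);

        // Train a Spanish lemmatizer model with default parameters.
        LemmatizerModel model = LemmatizerME.train(
                "es", samples, TrainingParameters.defaultParams(), new LemmatizerFactory());

        // Persist the model so it can be loaded by LemmatizerME or the UIMA annotator.
        try (OutputStream out = new FileOutputStream("es-lemmatizer.bin")) {
            model.serialize(out);
        }
    }
}

The same training can also be run from the command line with the LemmatizerTrainerME tool shipped in the OpenNLP distribution, if you prefer not to write Java.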