Re: Is this a typical OpenNLP tokenization issue?

Suneel Marthi Thu, 29 Jun 2017 17:39:58 -0700

On Thu, Jun 29, 2017 at 8:36 PM, Ling <lingv...@gmail.com> wrote:

> Hi, Suneel, that's great. The reason was that I wanted to do something in
> Deeplearning4j and happened to find that OpenNLP was already integrated into
> it. So I just used their API to call OpenNLP.
>
> Is there a set date for the next release? Also, are the 1.5 models the same as
> the models to be included in the 1.8.1 release?
>


Should be some time next week.

If by 'models being the same' you mean how they are used, then yes: nothing
changes in how you invoke the model from your code.

>
> Thanks.
> Ling
>
> On Thu, Jun 29, 2017 at 5:30 PM, Suneel Marthi <smar...@apache.org> wrote:
>
> > On Thu, Jun 29, 2017 at 8:07 PM, Ling <lingv...@gmail.com> wrote:
> >
> > > Hi, Jörn:
> > >
> > > I want to use OpenNLP directly, instead of going through Deeplearning4j
> > > and UIMA. I added the 1.8 Maven dependency to my POM file; do I still need
> > > to download the models separately? I can't find those model files. For
> > > example, to do a simple test of the tokenization model,
> > >
> >
> > DL4J is for deep learning; OpenNLP is for text processing. I'm not sure why
> > you would go to DL4J first and then fall back to OpenNLP if all you want to
> > do is basic text processing.
> >
> > The model files (1.5 models) are presently at:
> > http://opennlp.sourceforge.net/models-1.5/
> >
> >
> >
> > >
> > > InputStream is = new FileInputStream("en-token.bin");
> > >
> > > Do I have to download the en-token.bin separately? I am working in a
> > > Maven project. Thank you
> >
> >
> > Yes, the models need to be downloaded separately.
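
Since the model is not pulled in by the Maven dependency, one option is to fetch it once and reuse the local copy. The following is only a minimal sketch using the JDK alone. The class and method names (`ModelFetcher`, `modelUri`, `fetchIfMissing`) are hypothetical, and it assumes each model file is served directly under the models-1.5 URL given above by its file name (for example en-token.bin), which may not hold if the files have moved.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class ModelFetcher {
    // Base URL quoted in this thread for the 1.5 models.
    static final String BASE_URL = "http://opennlp.sourceforge.net/models-1.5/";

    // Assumption: a model is reachable at BASE_URL followed by its file name.
    static URI modelUri(String fileName) {
        return URI.create(BASE_URL + fileName);
    }

    // Returns the local path of the model, downloading it only if it is absent.
    static Path fetchIfMissing(String fileName, Path dir) throws IOException {
        Path target = dir.resolve(fileName);
        if (Files.exists(target)) {
            return target; // already downloaded, nothing to do
        }
        Files.createDirectories(dir);
        // Download to a temporary file first so that an interrupted transfer
        // never leaves a truncated model under the final name.
        Path tmp = Files.createTempFile(dir, fileName, ".part");
        try (InputStream in = modelUri(fileName).toURL().openStream()) {
            Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
            Files.move(tmp, target, StandardCopyOption.REPLACE_EXISTING);
        } finally {
            Files.deleteIfExists(tmp);
        }
        return target;
    }

    public static void main(String[] args) {
        // Only prints the address that would be fetched; no network access here.
        System.out.println(modelUri("en-token.bin"));
    }
}
```

The path returned by `fetchIfMissing("en-token.bin", dir)` could then be opened with a `FileInputStream`, as in the line quoted above.
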
> >
> > We finally got approval from the Apache Foundation to distribute OpenNLP
> > models through Apache. Following the upcoming 1.8.1 release, we should be
> > distributing updated 1.8.1 models too, once we hash out the details for
> > doing that.
> >
> >
> > > .
> > >
> > > Ling
> > >
> > >
> > > On Thu, Jun 29, 2017 at 10:42 AM, Joern Kottmann <kottm...@gmail.com>
> > > wrote:
> > >
> > > > Long chain, yes, then you probably use the SourceForge tokenization
> > > > model that was trained on some old news.
> > > >
> > > > We usually don't consider mistakes the models make as bugs, because we
> > > > can't do much about them other than suggest using models that fit
> > > > your data very well, and even then models can be wrong
> > > > sometimes.
> > > >
> > > > If there is something we can do here to reduce the error rate, then we
> > > > are very happy to get that as a contribution or just have it pointed out.
> > > >
> > > > Jörn
> > > >
> > > > On Thu, Jun 29, 2017 at 6:54 PM, Ling <lingv...@gmail.com> wrote:
> > > > > Hi, Jörn:
> > > > >
> > > > > I am using Deeplearning4j, which I think uses the org.apache.uima
> > > > > library, and UIMA in turn uses OpenNLP. That is probably what happens.
> > > > >
> > > > > So it isn't originally OpenNLP's problem? Thank you.
> > > > >
> > > > > Ling
> > > > >
> > > > > On Thu, Jun 29, 2017 at 12:30 AM, Joern Kottmann <kottm...@gmail.com>
> > > > > wrote:
> > > > >
> > > > >> Hello,
> > > > >>
> > > > >> which model are you using? Did you train it yourself?
> > > > >>
> > > > >> Jörn
> > > > >>
> > > > >> On Thu, Jun 29, 2017 at 4:04 AM, Ling <lingv...@gmail.com> wrote:
> > > > >> > Hi, all:
> > > > >> >
> > > > > >> > I am testing OpenNLP and found a significant tokenization issue
> > > > > >> > involving punctuation.
> > > > >> >
> > > > >> > Thank you Costco!
> > > > >> > i love costco!
> > > > >> > I love Costco!!
> > > > >> > FUCK IKEA.
> > > > >> >
> > > > > >> > In all these cases, the final punctuation is not split off, so
> > > > > >> > "Costco!" and "IKEA." are each treated as one token. This looks
> > > > > >> > like a systematic problem. Before I file an issue on the OpenNLP
> > > > > >> > project, I want to make sure this issue truly comes from the
> > > > > >> > library.
> > > > > >> >
> > > > > >> > Has any of you encountered a similar problem? Thanks.
> > > > >>
> > > >
> > >
> >
>
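
To check whether a given tokenizer and model show the symptom from the original post (final punctuation left attached, as in "Costco!"), the returned tokens can be scanned with a small helper. This is a rough sketch using only the JDK; the class and method names are hypothetical, and the token arrays in `main` are written by hand to stand in for real tokenizer output rather than produced by OpenNLP.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class TrailingPunctuationCheck {
    // A letter or digit followed by one or more ASCII punctuation marks
    // at the very end of the token.
    private static final Pattern ATTACHED =
            Pattern.compile(".*[\\p{L}\\p{N}]\\p{Punct}+");

    // Returns the tokens whose final punctuation was not split off.
    static List<String> attachedPunctuation(String[] tokens) {
        List<String> flagged = new ArrayList<>();
        for (String token : tokens) {
            if (ATTACHED.matcher(token).matches()) {
                flagged.add(token);
            }
        }
        return flagged;
    }

    public static void main(String[] args) {
        // Hand-written stand-ins for tokenizer output (assumption, not real output).
        String[] unsplit = {"Thank", "you", "Costco!"};
        String[] split = {"Thank", "you", "Costco", "!"};
        System.out.println(attachedPunctuation(unsplit)); // prints [Costco!]
        System.out.println(attachedPunctuation(split));   // prints []
    }
}
```

The check is deliberately crude: abbreviations such as "U.S." would be flagged as well, so it is meant for spotting a systematic pattern across many sentences, not for judging individual tokens.
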
