RE: Re: Integrating NLP into Lucene Analysis Chain

2022-11-22 Thread Lucas Kot-Zaniewski
Hi Benoit, Thanks for the reply and link! My application is English-focused, so I have the benefit of working with a language with little inflection. This, along with a few other reasons, pushed me towards an index-heavy approach which doesn't have the complexities involved with synonyms of different

RE: RE: Integrating NLP into Lucene Analysis Chain

2022-11-22 Thread Lucas Kot-Zaniewski
> To: java-user@lucene.apache.org > Subject: Integrating NLP into Lucene Analysis Chain > > External Email - Use Caution > > Greetings, > I would greatly appreciate anyone sharing their experience doing NLP/lemmatization and am also very curious to gauge the opinion of the lucene comm

RE: Integrating NLP into Lucene Analysis Chain

2022-11-21 Thread Wang, Guan
-Zaniewski (BLOOMBERG/ 919 3RD A) Sent: Saturday, November 19, 2022 10:27 PM To: java-user@lucene.apache.org Subject: Integrating NLP into Lucene Analysis Chain External Email - Use Caution Greetings, I would greatly appreciate anyone sharing their experience doing NLP/lemmatization and am

Re: Integrating NLP into Lucene Analysis Chain

2022-11-21 Thread Mikhail Khludnev
Hello, Benoit. I just came across https://lucene.apache.org/core/8_0_0/analyzers-common/org/apache/lucene/analysis/miscellaneous/TypeAsSynonymFilterFactory.html It sounds similar to what you are asking, but it watches TypeAttribute only. Also, spans are superseded by intervals

Re: Integrating NLP into Lucene Analysis Chain

2022-11-21 Thread Benoit Mercier
Hi Luke, Thank you for your work and information sharing. From my point of view lemmatization is just a use case of text token annotation. I have been working with Lucene since 2006 to index lexicographic and linguistic data and I always miss the fact that (1) token attributes are not

Re: Integrating NLP into Lucene Analysis Chain

2022-11-19 Thread Robert Muir
https://github.com/apache/lucene/pull/11955 On Sat, Nov 19, 2022 at 10:43 PM Robert Muir wrote: > > Hi, > > Is this 'synchronized' really needed? > > 1. Lucene tokenstreams are only used by a single thread. If you index > with 10 threads, 10 tokenstreams are used. > 2. These OpenNLP Factories

Re: Integrating NLP into Lucene Analysis Chain

2022-11-19 Thread Robert Muir
Hi, Is this 'synchronized' really needed? 1. Lucene tokenstreams are only used by a single thread. If you index with 10 threads, 10 tokenstreams are used. 2. These OpenNLP Factories make a new *Op for each tokenstream that they create, so there's no thread hazard. 3. If I remove 'synchronized'
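The first two points in the message above (one tokenstream per thread, and a fresh op per tokenstream) can be illustrated with a minimal sketch in plain Java. It does not use the Lucene or OpenNLP APIs: `LemmatizerOp`, `StreamFactory`, `TokenStreamSketch` and the toy strip-trailing-"s" rule are all hypothetical stand-ins, chosen only to show why unsynchronized mutable state is safe when each thread gets its own instance.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class PerStreamStateSketch {

    // Hypothetical stand-in for a stateful, non-thread-safe NLP op.
    // It reuses a mutable buffer and has no 'synchronized' anywhere.
    static final class LemmatizerOp {
        private final StringBuilder scratch = new StringBuilder();

        String lemmatize(String token) {
            scratch.setLength(0);
            scratch.append(token.toLowerCase());
            int last = scratch.length() - 1;
            // Toy rule for illustration only: strip a trailing "s".
            if (last > 0 && scratch.charAt(last) == 's') {
                scratch.setLength(last);
            }
            return scratch.toString();
        }
    }

    // Hypothetical stand-in for a token stream that owns exactly one op.
    static final class TokenStreamSketch {
        private final LemmatizerOp op;

        TokenStreamSketch(LemmatizerOp op) {
            this.op = op;
        }

        List<String> analyze(String text) {
            List<String> out = new ArrayList<>();
            for (String token : text.split("\\s+")) {
                if (!token.isEmpty()) {
                    out.add(op.lemmatize(token));
                }
            }
            return out;
        }
    }

    // Hypothetical factory: every stream it creates gets a brand-new op,
    // so no op instance is ever shared between streams.
    static final class StreamFactory {
        TokenStreamSketch create() {
            return new TokenStreamSketch(new LemmatizerOp());
        }
    }

    // Each task builds and uses its own stream, so each op is only ever
    // touched by the one thread that runs that task.
    static List<List<String>> indexWithThreads(int threads, String text) throws Exception {
        StreamFactory factory = new StreamFactory();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<List<String>>> futures = new ArrayList<>();
            for (int i = 0; i < threads; i++) {
                Callable<List<String>> task = () -> factory.create().analyze(text);
                futures.add(pool.submit(task));
            }
            List<List<String>> results = new ArrayList<>();
            for (Future<List<String>> future : futures) {
                results.add(future.get());
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(indexWithThreads(10, "Dogs chase cats"));
    }
}
```

In this sketch the only shared object is the factory, which holds no mutable state. Whether the same reasoning covers every code path in the real OpenNLP integration is what the thread is discussing, so treat this as an illustration of the argument rather than a verification of it.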

Integrating NLP into Lucene Analysis Chain

2022-11-19 Thread Luke Kot-Zaniewski (BLOOMBERG/ 919 3RD A)
Greetings, I would greatly appreciate anyone sharing their experience doing NLP/lemmatization and am also very curious to gauge the opinion of the lucene community regarding open-nlp. I know there are a few other libraries out there, some of which can’t be directly included in the lucene