RE: Re: Integrating NLP into Lucene Analysis Chain

2022-11-22 Thread Lucas Kot-Zaniewski
Hi Benoit, Thanks for the reply and link! My application is English-focused so I have the benefit of having a language with little inflection. This along with a few other reasons pushed me towards an index-heavy approach which doesn't have the complexities involved with synonyms of different posit

RE: RE: Integrating NLP into Lucene Analysis Chain

2022-11-22 Thread Lucas Kot-Zaniewski
Hi Guan, I think I've confused everyone a little bit, including myself. When I initially went down the rabbit hole of understanding the synchronization of these wrapping methods I kept an eye out for all potential thread safety issues within OpenNLP. I ended up finding issues unrelated to the sy

RE: Integrating NLP into Lucene Analysis Chain

2022-11-21 Thread Wang, Guan
Hi Luke, For what you've described as a "bug" for NLPPOSTaggerOp, I do agree with you that there could be a more elegant solution than simply synchronizing the entire method. That said, IMHO, I don't see a thread-safety issue. Lucene TokenFilters are not supposed to be shared am

Re: Integrating NLP into Lucene Analysis Chain

2022-11-21 Thread Mikhail Khludnev
Hello, Benoit. I just came across https://lucene.apache.org/core/8_0_0/analyzers-common/org/apache/lucene/analysis/miscellaneous/TypeAsSynonymFilterFactory.html It sounds similar to what you are asking, but it watches TypeAttribute only. Also, spans are superseded by intervals https://lucene.apache

Re: Integrating NLP into Lucene Analysis Chain

2022-11-21 Thread Benoit Mercier
Hi Luke, Thank you for your work and information sharing. From my point of view, lemmatization is just a use case of text token annotation. I have been working with Lucene since 2006 to index lexicographic and linguistic data and I always miss the fact that (1) token attributes are not search

Re: Integrating NLP into Lucene Analysis Chain

2022-11-19 Thread Robert Muir
https://github.com/apache/lucene/pull/11955 On Sat, Nov 19, 2022 at 10:43 PM Robert Muir wrote: > > Hi, > > Is this 'synchronized' really needed? > > 1. Lucene tokenstreams are only used by a single thread. If you index > with 10 threads, 10 tokenstreams are used. > 2. These OpenNLP Factories mak

Re: Integrating NLP into Lucene Analysis Chain

2022-11-19 Thread Robert Muir
Hi, Is this 'synchronized' really needed? 1. Lucene tokenstreams are only used by a single thread. If you index with 10 threads, 10 tokenstreams are used. 2. These OpenNLP Factories make a new *Op for each tokenstream that they create, so there's no thread hazard. 3. If I remove 'synchronized' ke
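
The thread-confinement argument in points 1 and 2 above can be illustrated with a minimal, JDK-only sketch. It does not use the Lucene or OpenNLP APIs: `TaggerOp`, `newOp`, and `runWorkers` are hypothetical names invented for this illustration, and the "tagging" is a placeholder. It only shows the general pattern: if every worker thread gets its own op instance, the op's mutable state is never shared, so an unsynchronized method is not a hazard under that assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class PerStreamOpSketch {

    /** Hypothetical stand-in for an *Op wrapper: mutable state, no locking. */
    static final class TaggerOp {
        // Reused between calls; unsafe if two threads shared one instance.
        private final List<String> buffer = new ArrayList<>();

        String[] tag(String[] tokens) {
            buffer.clear();
            for (String token : tokens) {
                buffer.add(token + "/TAG"); // placeholder for a real model call
            }
            return buffer.toArray(new String[0]);
        }
    }

    /** Hypothetical factory method: a fresh op per stream, never a shared one. */
    static TaggerOp newOp() {
        return new TaggerOp();
    }

    /** Runs the workers, each owning its own op, and counts correct results. */
    static int runWorkers(int threads, int iterations) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Integer>> futures = new ArrayList<>();
            for (int t = 0; t < threads; t++) {
                final String word = "w" + t;
                Callable<Integer> task = () -> {
                    TaggerOp op = newOp(); // confined to this worker thread
                    int ok = 0;
                    for (int i = 0; i < iterations; i++) {
                        String[] out = op.tag(new String[] {word, word});
                        if (out.length == 2
                                && out[0].equals(word + "/TAG")
                                && out[1].equals(word + "/TAG")) {
                            ok++;
                        }
                    }
                    return ok;
                };
                futures.add(pool.submit(task));
            }
            int total = 0;
            for (Future<Integer> f : futures) {
                total += f.get(); // rethrows any failure from a worker
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        // Every call should be correct because no op is shared across threads.
        System.out.println(runWorkers(10, 1000));
    }
}
```

Whether this reasoning carries over to the real wrappers depends on what the ops themselves share (for example a common model object), which this sketch deliberately does not model.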