Re: Help to find the RC of incompatible analyzers
Thanks to Chris and Patrick for the help. Chris is right. I don't have much knowledge of analyzers, so I missed many details. Following Chris's advice, I dug deeper into the related tokenizer code and fixed the problem. Thanks for the help.

On Fri, Apr 28, 2023 at 4:44 PM Chris Hostetter wrote:
> All of which suggests that the discrepancy you are seeing is in your
> tokenizer.
Re: Help to find the RC of incompatible analyzers
You provided a list of TokenFilters that you use in your Analyzer, but you didn't mention anything about what Tokenizer you are using.

You also mentioned seeing a difference in the "tokenization result", and the example output you gave does in fact seem to be the output of the tokenizer -- not the output of the TokenFilters you mentioned -- since ShingleFilter would be producing more output tokens than you listed.

All of which suggests that the discrepancy you are seeing is in your tokenizer.

Generally speaking: the best way to ensure folks on the mailing list can make sense of your situation and offer assistance is to provide reproducible snippets of code with hardcoded input (à la unit tests) that demonstrate what you're seeing.

: Our current code is based on Lucene 7.
: In one analyzer test case, given the string "Google's biologist’s", the
: tokenization result is ["google", "biologist"].
:
: But after migrating the codebase to Lucene 9,
: the result becomes ["googles", "biologist’s"].

: The analyzer uses the following three Lucene classes:
:
: org.apache.lucene.analysis.core.FlattenGraphFilter;
: org.apache.lucene.analysis.shingle.ShingleFilter;
: org.apache.lucene.analysis.synonym.SynonymGraphFilter;

-Hoss
http://www.lucidworks.com/
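Chris's point that ShingleFilter would emit more tokens than the two listed can be illustrated with a rough plain-Java sketch. This is not Lucene's ShingleFilter: `ShingleSketch` and `shingle` are hypothetical names, and the sketch assumes two-word shingles, unigrams kept, and a single space as the separator, which should correspond to the filter's default-style configuration.

```java
import java.util.ArrayList;
import java.util.List;

public class ShingleSketch {
    // Simplified stand-in for shingling (NOT Lucene's ShingleFilter):
    // emits each unigram, followed by the two-word shingle starting at it.
    static List<String> shingle(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            out.add(tokens.get(i));
            if (i + 1 < tokens.size()) {
                out.add(tokens.get(i) + " " + tokens.get(i + 1));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [google, google biologist, biologist]
        System.out.println(shingle(List.of("google", "biologist")));
    }
}
```

Under these assumptions, two input tokens already yield three output tokens, so a two-token result is more consistent with raw tokenizer output than with the end of a chain that includes shingling.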
Re: Help to find the RC of incompatible analyzers
It sounds like an EnglishPossessiveFilter is missing, and I don't think that is related to the filters you listed. Are there other Lucene filters you're using? Also, what exact versions are you upgrading from and to?

On Fri, Apr 28, 2023 at 10:20 AM MyCoy Z wrote:
> In one analyzer test case, given the string "Google's biologist’s", the
> tokenization result is ["google", "biologist"].
>
> But after migrating the codebase to Lucene 9,
> the result becomes ["googles", "biologist’s"].
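To make Patrick's suggestion concrete, here is a small plain-Java sketch of what possessive stripping does to the quoted input. It is only an approximation, not Lucene's EnglishPossessiveFilter: `PossessiveSketch`, `stripPossessive`, and `analyze` are hypothetical names, and whitespace splitting stands in for whatever tokenizer the real analyzer uses. Note that the input as quoted in this thread mixes a straight apostrophe (U+0027) and a typographic one (U+2019), so the sketch handles both.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class PossessiveSketch {
    // Simplified stand-in for possessive stripping (NOT Lucene's filter):
    // drops a trailing 's when the apostrophe is U+0027 or U+2019.
    static String stripPossessive(String token) {
        int n = token.length();
        if (n >= 2) {
            char apostrophe = token.charAt(n - 2);
            char last = token.charAt(n - 1);
            boolean isApostrophe = apostrophe == '\'' || apostrophe == '\u2019';
            if (isApostrophe && (last == 's' || last == 'S')) {
                return token.substring(0, n - 2);
            }
        }
        return token;
    }

    // Assumption: whitespace splitting is a placeholder for the real tokenizer.
    static List<String> analyze(String text) {
        List<String> out = new ArrayList<>();
        for (String raw : text.trim().split("\\s+")) {
            out.add(stripPossessive(raw).toLowerCase(Locale.ROOT));
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [google, biologist]
        System.out.println(analyze("Google's biologist\u2019s"));
    }
}
```

With possessive stripping in the chain, the sketch reproduces the result reported for Lucene 7. Without it, the tokens keep their possessive endings, which is closer to the result reported for Lucene 9.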
Help to find the RC of incompatible analyzers
Hi, Lucene dev community:

Our current code is based on Lucene 7. In one analyzer test case, given the string "Google's biologist’s", the tokenization result is ["google", "biologist"].

But after migrating the codebase to Lucene 9, the result becomes ["googles", "biologist’s"].

It looks like some behavior has changed between the major versions, but I cannot find the exact root cause (RC). Could someone please provide some clues? Maybe some grammar has changed?

The analyzer uses the following three Lucene classes:

org.apache.lucene.analysis.core.FlattenGraphFilter;
org.apache.lucene.analysis.shingle.ShingleFilter;
org.apache.lucene.analysis.synonym.SynonymGraphFilter;

Thanks