Re: Help to find the RC of incompatible analyzers

2023-04-29 Thread MyCoy Z

Thanks to Chris and Patrick for the help.

Chris is right. I don't have much knowledge of analyzers, so I
missed many details.
Following Chris's advice, I dug deeper into the related tokenizer code
and fixed the problem.

Thanks for the help.

On Fri, Apr 28, 2023 at 4:44 PM Chris Hostetter wrote:

>
> You provided a list of TokenFilters that you use in your Analyzer,
> but you didn't mention anything about what Tokenizer you are using.
>
> You also mentioned seeing a difference in the "tokenization result" and
> the example output you gave does in fact seem to be the output of the
> tokenizer -- not the output of the TokenFilters you mentioned -- since
> ShingleFilter would be producing more output tokens than you listed.
>
> All of which suggests that the discrepancy you are seeing is in your
> tokenizer.
>
> Generally speaking: the best way to ensure folks on the mailing list can
> make sense of your situation and offer assistance is if you can provide
> reproducible snippets of code w/hardcoded input (ala unit tests) that
> demonstrate what you're seeing.
>
> : Our current code is based on Lucene 7.
> : In an analyzer test case, given the string "Google's biologist’s", the
> : tokenization result is ["google", "biologist"].
> :
> : But after migrating the codebase to Lucene 9,
> : the result becomes ["googles", "biologist’s"].
>
>
> : The analyzer uses the following three Lucene libraries:
> :
> : org.apache.lucene.analysis.core.FlattenGraphFilter;
> :
> : org.apache.lucene.analysis.shingle.ShingleFilter;
> :
> : org.apache.lucene.analysis.synonym.SynonymGraphFilter;
>
>
> -Hoss
> http://www.lucidworks.com/
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org

Re: Help to find the RC of incompatible analyzers

2023-04-28 Thread Chris Hostetter


You provided a list of TokenFilters that you use in your Analyzer, 
but you didn't mention anything about what Tokenizer you are using.

You also mentioned seeing a difference in the "tokenization result" and 
the example output you gave does in fact seem to be the output of the 
tokenizer -- not the output of the TokenFilters you mentioned -- since 
ShingleFilter would be producing more output tokens than you listed.

All of which suggests that the discrepancy you are seeing is in your
tokenizer.
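
To illustrate the ShingleFilter point: with its default settings (shingles of two tokens, unigrams emitted as well), two input tokens would be expected to yield three output tokens rather than two. The following is a minimal plain-Java sketch of that behavior under those assumptions. It is not Lucene code, the class and method names are hypothetical, and it ignores details such as position increments and filler tokens.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for illustration only; not part of Lucene.
class ShingleSketch {
    // For each token, emit the token itself (unigram) followed by every
    // shingle of 2..maxSize tokens that starts at that token.
    static List<String> shingles(List<String> tokens, int maxSize) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            StringBuilder sb = new StringBuilder(tokens.get(i));
            out.add(sb.toString());
            for (int n = 2; n <= maxSize && i + n <= tokens.size(); n++) {
                // Join with a single space, assumed here as the token separator.
                sb.append(' ').append(tokens.get(i + n - 1));
                out.add(sb.toString());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Prints [google, google biologist, biologist]
        System.out.println(shingles(List.of("google", "biologist"), 2));
    }
}
```

An output of exactly two tokens is therefore more consistent with looking at the tokenizer's output than at the end of a chain that includes a shingle step.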

Generally speaking: the best way to ensure folks on the mailing list can 
make sense of your situation and offer assistance is if you can provide 
reproducible snippets of code w/hardcoded input (ala unit tests) that
demonstrate what you're seeing.

: Our current code is based on Lucene 7.
: In an analyzer test case, given the string "Google's biologist’s", the
: tokenization result is ["google", "biologist"].
:
: But after migrating the codebase to Lucene 9,
: the result becomes ["googles", "biologist’s"].


: The analyzer uses the following three Lucene libraries:
: 
: org.apache.lucene.analysis.core.FlattenGraphFilter;
: 
: org.apache.lucene.analysis.shingle.ShingleFilter;
: 
: org.apache.lucene.analysis.synonym.SynonymGraphFilter;


-Hoss
http://www.lucidworks.com/


Re: Help to find the RC of incompatible analyzers

2023-04-28 Thread Patrick Zhai

It sounds like an EnglishPossessiveFilter is missing, which I think is
unrelated to the filters you listed.
Are there other Lucene filters you're using?

Also, what exact versions are you upgrading from and to?
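
For context on why a possessive-stripping step would matter here, below is a minimal plain-Java sketch of the kind of transformation such a filter performs: dropping a trailing 's when the apostrophe is an ASCII, right-single-quote, or fullwidth apostrophe. It is a simplification written for illustration, not the Lucene implementation, and the class and method names are hypothetical.

```java
// Hypothetical stand-in for illustration only; not part of Lucene.
class PossessiveSketch {
    // Removes a trailing possessive ('s or 'S) from a single token.
    static String stripPossessive(String token) {
        int len = token.length();
        if (len >= 2) {
            char mark = token.charAt(len - 2);
            char last = token.charAt(len - 1);
            // ASCII apostrophe, right single quotation mark, fullwidth apostrophe.
            boolean isApostrophe = mark == '\'' || mark == '\u2019' || mark == '\uFF07';
            if (isApostrophe && (last == 's' || last == 'S')) {
                return token.substring(0, len - 2);
            }
        }
        return token;
    }

    public static void main(String[] args) {
        System.out.println(stripPossessive("google's"));          // google
        System.out.println(stripPossessive("biologist\u2019s"));  // biologist
    }
}
```

Applied to the lowercased tokens "google's" and "biologist’s", this yields the Lucene 7 result from the original report, so a missing step of this kind is one plausible place to look.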

On Fri, Apr 28, 2023 at 10:20 AM MyCoy Z wrote:

> Hi, Lucene dev community:
>
> Our current code is based on Lucene 7.
> In an analyzer test case, given the string "Google's biologist’s", the
> tokenization result is ["google", "biologist"].
>
> But after migrating the codebase to Lucene 9,
> the result becomes ["googles", "biologist’s"].
>
> It looks like some behavior has changed between the major versions.
>
> But I cannot find the root cause (RC) of this change.
> Could someone please provide some clues? Maybe some grammar has changed?
>
> The analyzer uses the following three Lucene libraries:
>
> org.apache.lucene.analysis.core.FlattenGraphFilter;
>
> org.apache.lucene.analysis.shingle.ShingleFilter;
>
> org.apache.lucene.analysis.synonym.SynonymGraphFilter;
>
>
> Thanks
>
>

Help to find the RC of incompatible analyzers

2023-04-28 Thread MyCoy Z

Hi, Lucene dev community:

Our current code is based on Lucene 7.
In an analyzer test case, given the string "Google's biologist’s", the
tokenization result is ["google", "biologist"].

But after migrating the codebase to Lucene 9,
the result becomes ["googles", "biologist’s"].

It looks like some behavior has changed between the major versions.

But I cannot find the root cause (RC) of this change.
Could someone please provide some clues? Maybe some grammar has changed?

The analyzer uses the following three Lucene libraries:

org.apache.lucene.analysis.core.FlattenGraphFilter;

org.apache.lucene.analysis.shingle.ShingleFilter;

org.apache.lucene.analysis.synonym.SynonymGraphFilter;


Thanks
