Re: Question about the light and minimal French stemmers

2019-07-28 Thread Adrien Gallou
Hi Tomoko,

Thanks for your answer.

So, following that, I have opened an issue with a patch attached:
https://issues.apache.org/jira/browse/LUCENE-8937
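
As a rough illustration of the behaviour discussed in this thread, here is a
standalone sketch of the "drop the last character if the last two are
identical" step, with and without a letter check. It is not the Lucene source
and not necessarily what the attached patch does: the class name, the method
names and the sample token "appell" are made up for the example, and guarding
with Character.isLetter is only one plausible way to add the check.

```java
// Standalone sketch; names are hypothetical and this is not Lucene code.
final class DoubleCharStepSketch {

    // Behaviour described in the thread for the minimal stemmer:
    // drop the final character whenever the last two characters match,
    // without looking at what kind of character it is.
    static int stripUnguarded(char[] s, int len) {
        if (len >= 2 && s[len - 1] == s[len - 2]) {
            len--;
        }
        return len;
    }

    // Assumed guarded variant: only drop the final character when the
    // repeated character is a letter, so numeric tokens are left alone.
    static int stripGuarded(char[] s, int len) {
        if (len >= 2 && s[len - 1] == s[len - 2] && Character.isLetter(s[len - 1])) {
            len--;
        }
        return len;
    }

    public static void main(String[] args) {
        for (String token : new String[] {"123455", "appell"}) {
            char[] s = token.toCharArray();
            String unguarded = new String(s, 0, stripUnguarded(s, s.length));
            String guarded = new String(s, 0, stripGuarded(s, s.length));
            System.out.println(token + " -> unguarded: " + unguarded + ", guarded: " + guarded);
        }
    }
}
```

With the unguarded step, "123455" loses its last digit, which is the case
raised in the thread; with the guard, the numeric token keeps its length while
a token ending in a doubled letter is still shortened.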

Adrien

On Sun, Jul 28, 2019 at 13:51, Michael Sokolov wrote:

> Oh sorry for jumping in with my irrelevant comment, you are right, of
> course!
>
> On Sat, Jul 27, 2019, 10:36 PM Tomoko Uchida wrote:
>
> > Let me just clarify things a bit...
> > I think the concern here is that FrenchMinimalStemmer would remove the
> > last "digit" from a token because it does not check whether the
> > character is a letter or not.
> > e.g., "123455" is trimmed to "12345" by FrenchMinimalStemmer.
> >
> > To me, this behaviour is beyond stemming.
> >
> > Tomoko
> >
> > On Sun, Jul 28, 2019 at 4:55, Michael Sokolov wrote:
> > >
> > > I'm not so sure. I think the whole idea of having both stemmers is
> > > that the minimal one does less than the light one.
> > >
> > > Removing the final character of a double letter suffix is going to
> > > sacrifice some precision. For example mes/mess, ne/née, I'm sure
> > > there are others.
> > >
> > > So having both options is helpful, I don't think it's a bug on the
> > > face of it. However I didn't look closely at the code, so I'm not
> > > sure what the intent is exactly.
> > >
> > > On Sat, Jul 27, 2019, 7:30 AM Tomoko Uchida <tomoko.uchida.1...@gmail.com> wrote:
> > >
> > > > Hi Adrien,
> > > >
> > > > To me, it simply sounds like a bug. Can you please open a JIRA issue
> > > > (with a patch if possible)?
> > > >
> > > > Tomoko
> > > >
> > > > On Tue, Jul 23, 2019 at 22:05, Adrien Gallou wrote:
> > > > >
> > > > > Hi,
> > > > >
> > > > > I'm using both light and minimal French stemmers and encountered an
> > > > > issue when using the minimal stemmer.
> > > > >
> > > > > The light stemmer removes the last character of a word if the last
> > > > > two characters are identical.
> > > > > We can see that here:
> > > > > https://github.com/apache/lucene-solr/blob/master/lucene/analysis/common/src/java/org/apache/lucene/analysis/fr/FrenchLightStemmer.java#L263
> > > > > In this light stemmer, there is a check to avoid altering the token
> > > > > if the token is a number.
> > > > >
> > > > > The minimal stemmer also removes the last character of a word if the
> > > > > last two characters are identical.
> > > > > We can see that here:
> > > > > https://github.com/apache/lucene-solr/blob/master/lucene/analysis/common/src/java/org/apache/lucene/analysis/fr/FrenchMinimalStemmer.java#L77
> > > > >
> > > > > But in this minimal stemmer there is no check to see if the
> > > > > character is a letter or not.
> > > > > So when we have numeric tokens with the last two characters
> > > > > identical, they are altered.
> > > > >
> > > > > Is there a reason for this?
> > > > > Should I file an issue on Jira to add this check?
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Adrien Gallou
> > > >
> > > > -
> > > > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > > > For additional commands, e-mail: java-user-h...@lucene.apache.org
> > > >
> > > >


Re: Question about the light and minimal French stemmers

2019-07-28 Thread Michael Sokolov
Oh sorry for jumping in with my irrelevant comment, you are right, of
course!
