Re: Question about the light and minimal French stemmers

2019-07-27 Thread Tomoko Uchida
Let me clarify a bit.
I think the concern here is that FrenchMinimalStemmer removes the
last digit from a token, because it does not check whether the
character is a letter.
E.g., "123455" is trimmed to "12345" by FrenchMinimalStemmer.

To me, this behaviour is beyond stemming.
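
For illustration, here is a minimal sketch of that rule in isolation. It
is not the Lucene source: the class and method names are hypothetical,
and only the "drop the last character when the last two are identical"
step is modelled.

```java
// Hypothetical demo class, not part of Lucene.
class UnguardedDoubleCharRule {

    // Returns the new length after dropping the final character when the
    // last two characters are identical. There is deliberately no check
    // that the character is a letter, which is the behaviour under discussion.
    static int trimDoubledEnding(char[] s, int len) {
        if (len >= 2 && s[len - 1] == s[len - 2]) {
            return len - 1;
        }
        return len;
    }

    // Convenience wrapper so the rule can be tried on plain strings.
    static String apply(String token) {
        char[] buf = token.toCharArray();
        return new String(buf, 0, trimDoubledEnding(buf, buf.length));
    }

    public static void main(String[] args) {
        System.out.println(apply("123455")); // the numeric token loses its last digit
        System.out.println(apply("12345"));  // no doubled ending, left unchanged
    }
}
```

With this rule alone, the tokens "123455" and "12345" become
indistinguishable after analysis.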

Tomoko

On Sun, Jul 28, 2019, 4:55 Michael Sokolov wrote:
>
> I'm not so sure. I think the whole idea of having both stemmers is that the
> minimal one does less than the light one.
>
> Removing the final character of a double letter suffix is going to
> sacrifice some precision. For example mes/mess, ne/née, I'm sure there are
> others.
>
> So having both options is helpful, I don't think it's a bug on the face of
> it. However I didn't look closely at the code, so I'm not sure what the
> intent is exactly.
>
> On Sat, Jul 27, 2019, 7:30 AM Tomoko Uchida wrote:
>
> > Hi Adrien,
> >
> > To me, it simply sounds like a bug. Can you please open a JIRA (with a
> > patch if possible)?
> >
> > Tomoko
> >
> > On Tue, Jul 23, 2019, 22:05 Adrien Gallou wrote:
> > >
> > > Hi,
> > >
> > > I'm using both light and minimal French stemmers and encountered an issue
> > > when using the minimal stemmer.
> > >
> > > The light stemmer removes the last character of a word if the last two
> > > characters are identical.
> > > We can see that here:
> > > https://github.com/apache/lucene-solr/blob/master/lucene/analysis/common/src/java/org/apache/lucene/analysis/fr/FrenchLightStemmer.java#L263
> > > In this light stemmer, there is a check to avoid altering the token if the
> > > token is a number.
> > >
> > > The minimal stemmer also removes the last character of a word if the last
> > > two characters are identical.
> > > We can see that here:
> > > https://github.com/apache/lucene-solr/blob/master/lucene/analysis/common/src/java/org/apache/lucene/analysis/fr/FrenchMinimalStemmer.java#L77
> > >
> > > But in this minimal stemmer there is no check to see if the character is a
> > > letter or not.
> > > So when we have numeric tokens with the last two characters identical they
> > > are altered.
> > >
> > > Is there a reason for this?
> > > Should I file an issue on Jira to add this check?
> > >
> > > Thanks,
> > >
> > > Adrien Gallou
> >
> > -
> > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > For additional commands, e-mail: java-user-h...@lucene.apache.org
> >
> >




Re: Question about the light and minimal French stemmers

2019-07-27 Thread Michael Sokolov
I'm not so sure. I think the whole idea of having both stemmers is that the
minimal one does less than the light one.

Removing the final character of a double letter suffix is going to
sacrifice some precision. For example mes/mess, ne/née, I'm sure there are
others.

So having both options is helpful, I don't think it's a bug on the face of
it. However I didn't look closely at the code, so I'm not sure what the
intent is exactly.

On Sat, Jul 27, 2019, 7:30 AM Tomoko Uchida wrote:

> Hi Adrien,
>
> To me, it simply sounds like a bug. Can you please open a JIRA (with a
> patch if possible)?
>
> Tomoko
>
> On Tue, Jul 23, 2019, 22:05 Adrien Gallou wrote:
> >
> > Hi,
> >
> > I'm using both light and minimal French stemmers and encountered an issue
> > when using the minimal stemmer.
> >
> > The light stemmer removes the last character of a word if the last two
> > characters are identical.
> > We can see that here:
> > https://github.com/apache/lucene-solr/blob/master/lucene/analysis/common/src/java/org/apache/lucene/analysis/fr/FrenchLightStemmer.java#L263
> > In this light stemmer, there is a check to avoid altering the token if the
> > token is a number.
> >
> > The minimal stemmer also removes the last character of a word if the last
> > two characters are identical.
> > We can see that here:
> > https://github.com/apache/lucene-solr/blob/master/lucene/analysis/common/src/java/org/apache/lucene/analysis/fr/FrenchMinimalStemmer.java#L77
> >
> > But in this minimal stemmer there is no check to see if the character is a
> > letter or not.
> > So when we have numeric tokens with the last two characters identical they
> > are altered.
> >
> > Is there a reason for this?
> > Should I file an issue on Jira to add this check?
> >
> > Thanks,
> >
> > Adrien Gallou


Re: Question about the light and minimal French stemmers

2019-07-27 Thread Tomoko Uchida
I found an issue that added the isLetter() check to FrenchLightStemmer:
https://issues.apache.org/jira/browse/LUCENE-4063

It seems the same change was never applied to FrenchMinimalStemmer.
Would it be a good idea to add the same check there, to avoid overly
aggressive stemming?
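
A minimal sketch of what such a guard could look like, assuming the
doubled-character rule is the only step of interest. The class and
method names are hypothetical and this is not the actual
FrenchMinimalStemmer code; Character.isLetter is the standard JDK
method.

```java
// Hypothetical demo class, not part of Lucene.
class GuardedDoubleCharRule {

    // Drops the final character only when the last two characters are
    // identical AND that character is a letter, so digits and punctuation
    // are left alone.
    static int trimDoubledLetter(char[] s, int len) {
        if (len >= 2
                && s[len - 1] == s[len - 2]
                && Character.isLetter(s[len - 1])) {
            return len - 1;
        }
        return len;
    }

    // Convenience wrapper so the rule can be tried on plain strings.
    static String apply(String token) {
        char[] buf = token.toCharArray();
        return new String(buf, 0, trimDoubledLetter(buf, buf.length));
    }

    public static void main(String[] args) {
        System.out.println(apply("123455")); // numeric token is preserved
        System.out.println(apply("appell")); // doubled letter is still trimmed
    }
}
```

The guard only narrows when the rule fires; tokens ending in a doubled
letter are handled exactly as before.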

Tomoko

On Sat, Jul 27, 2019, 20:29 Tomoko Uchida wrote:
>
> Hi Adrien,
>
> To me, it simply sounds like a bug. Can you please open a JIRA (with a
> patch if possible)?
>
> Tomoko
>
> On Tue, Jul 23, 2019, 22:05 Adrien Gallou wrote:
> >
> > Hi,
> >
> > I'm using both light and minimal French stemmers and encountered an issue
> > when using the minimal stemmer.
> >
> > The light stemmer removes the last character of a word if the last two
> > characters are identical.
> > We can see that here:
> > https://github.com/apache/lucene-solr/blob/master/lucene/analysis/common/src/java/org/apache/lucene/analysis/fr/FrenchLightStemmer.java#L263
> > In this light stemmer, there is a check to avoid altering the token if the
> > token is a number.
> >
> > The minimal stemmer also removes the last character of a word if the last
> > two characters are identical.
> > We can see that here:
> > https://github.com/apache/lucene-solr/blob/master/lucene/analysis/common/src/java/org/apache/lucene/analysis/fr/FrenchMinimalStemmer.java#L77
> >
> > But in this minimal stemmer there is no check to see if the character is a
> > letter or not.
> > So when we have numeric tokens with the last two characters identical they
> > are altered.
> >
> > Is there a reason for this?
> > Should I file an issue on Jira to add this check?
> >
> > Thanks,
> >
> > Adrien Gallou




Re: Question about the light and minimal French stemmers

2019-07-27 Thread Tomoko Uchida
Hi Adrien,

To me, it simply sounds like a bug. Can you please open a JIRA (with a
patch if possible)?

Tomoko

On Tue, Jul 23, 2019, 22:05 Adrien Gallou wrote:
>
> Hi,
>
> I'm using both light and minimal French stemmers and encountered an issue
> when using the minimal stemmer.
>
> The light stemmer removes the last character of a word if the last two
> characters are identical.
> We can see that here:
> https://github.com/apache/lucene-solr/blob/master/lucene/analysis/common/src/java/org/apache/lucene/analysis/fr/FrenchLightStemmer.java#L263
> In this light stemmer, there is a check to avoid altering the token if the
> token is a number.
>
> The minimal stemmer also removes the last character of a word if the last
> two characters are identical.
> We can see that here:
> https://github.com/apache/lucene-solr/blob/master/lucene/analysis/common/src/java/org/apache/lucene/analysis/fr/FrenchMinimalStemmer.java#L77
>
> But in this minimal stemmer there is no check to see if the character is a
> letter or not.
> So when we have numeric tokens with the last two characters identical they
> are altered.
>
> Is there a reason for this?
> Should I file an issue on Jira to add this check?
>
> Thanks,
>
> Adrien Gallou
