The Porter/Snowball stemmer is an evolved version of a forty-year-old hack.
It is neat that it works at all, but don’t expect too much. I think it is too
aggressive for search use.

What does KStem do with this? KStem is based on better linguistic models.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On May 1, 2020, at 8:45 AM, Mike Drob <md...@apache.org> wrote:
> 
> This is how things get stemmed *now*, but I believe there is an open
> question as to whether that is how they *should* be stemmed. Specifically,
> the issue appears to be that -ify words do not stem to the same token as
> their -ification forms. This applies to much more than
> identify/identification: also justify, fortify, notify, and many others.
> 
> $ grep ification /usr/share/dict/words | wc -l
>     328
> 
> I am by no means an expert on stemming, and if the folks at snowball decide
> to tell us that this change is bad or hard because it would overstem some
> other words, then I'll happily accept that. But I definitely want to use
> their expertise rather than relying on my own.
> 
> Mike
> 
> On Fri, May 1, 2020 at 10:35 AM Audrey Lorberfeld -
> audrey.lorberf...@ibm.com <audrey.lorberf...@ibm.com> wrote:
> 
>> Unless I'm misunderstanding the bug in question, there is no bug. What you
>> are observing is simply how things get stemmed...
>> 
>> Best,
>> Audrey
>> 
>> On 4/30/20, 6:37 PM, "Jhonny Lopez" <jhonny.lo...@publicismedia.com>
>> wrote:
>> 
>>    Yes, sounds like it's worth it.
>> 
>>    Thanks guys!
>> 
>>    -----Original Message-----
>>    From: Mike Drob <md...@apache.org>
>>    Sent: Thursday, April 30, 2020 5:30 PM
>>    To: solr-user@lucene.apache.org
>>    Subject: Re: Possible issue with Stemming and nouns ended with suffix
>> 'ion'
>> 
>>    This email has been sent from a source external to Publicis Groupe.
>> Please use caution when clicking links or opening attachments.
>> 
>> 
>> 
>>    Is this worth filing a bug/suggestion to the folks over at
>> snowballstem.org?
>> 
>>    On Thu, Apr 30, 2020 at 4:08 PM Audrey Lorberfeld -
>> audrey.lorberf...@ibm.com <audrey.lorberf...@ibm.com> wrote:
>> 
>>> I agree with Erick. I think that's just how the cookie crumbles when
>>> stemming. If you have some time on your hands, you can integrate
>>> OpenNLP with your Solr instance and start using the lemmas of tokens
>>> instead of the stems. In this case, I believe if you were to lemmatize
>>> both "identify" and "identification," they would both condense to
>>> "identify."
>>> 
>>> Best,
>>> Audrey
>>> 
>>> On 4/30/20, 3:54 PM, "Erick Erickson" <erickerick...@gmail.com> wrote:
>>> 
>>>    They are being stemmed to two different tokens, “identif” and
>>> “identifi”. Stemming is algorithmic and imperfect, and in this case
>>> you’re getting bitten by that algorithm. It looks like you’re using
>>> PorterStemFilter; if you want, you can look up the exact algorithm, but
>>> I don’t think it’s a bug, just one of those little joys of English...
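
For anyone who wants to see where the two tokens diverge, here is a minimal sketch. It assumes the rules of the original published Porter algorithm and encodes only the four rules that fire for these words, so it is not a full stemmer and not the code Lucene ships. The names `measure`, `has_vowel`, `RULES` and `trace_stem` are made up for illustration.

```python
# Partial trace of the original Porter rules; NOT a complete stemmer.
VOWELS = "aeiou"

def is_consonant(word, i):
    """Porter's definition: 'y' is a consonant only at the start or after a vowel."""
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem):
    """Count vowel-to-consonant transitions (Porter's m)."""
    m, prev_vowel = 0, False
    for i in range(len(stem)):
        cons = is_consonant(stem, i)
        if cons and prev_vowel:
            m += 1
        prev_vowel = not cons
    return m

def has_vowel(stem):
    """True if the stem contains at least one vowel."""
    return any(not is_consonant(stem, i) for i in range(len(stem)))

# (suffix, replacement, condition on the remaining stem), in rule order.
# Assumption: only the rules relevant to -ify / -ification are listed.
RULES = [
    ("y", "i", has_vowel),                        # step 1c
    ("ation", "ate", lambda s: measure(s) > 0),   # step 2
    ("icate", "ic", lambda s: measure(s) > 0),    # step 3
    ("ic", "", lambda s: measure(s) > 1),         # step 4
]

def trace_stem(word):
    """Apply each listed rule once, in order, when its condition holds."""
    for suffix, repl, cond in RULES:
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            if cond(stem):
                word = stem + repl
    return word

for w in ("identify", "identification", "notify", "notification"):
    print(w, "->", trace_stem(w))
```

With these rules, "identify" only goes through the y-to-i rule and ends as "identifi", while "identification" passes through "identificate" and "identific" before losing the "ic" and ending as "identif". The -ify form never reaches the rule that strips the final "i", which is why the two never meet.
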
>>> 
>>>    To get a clearer picture of exactly what’s being searched, try
>>> adding &debug=query to your query, in particular looking at the parsed
>>> query that’s returned. That’ll tell you a bunch. In this particular
>>> case I don’t think it’ll tell you anything more, but for future…
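
A small sketch of what that request could look like, built with the Python standard library only. The host, the collection name `mycollection`, and the choice of `sectionbody_t_en` as the stemmed field are assumptions for illustration; substitute your own. The helper name `debug_query_url` is hypothetical.

```python
from urllib.parse import urlencode

def debug_query_url(base_url, collection, field, term):
    """Build a /select URL that asks Solr to echo the parsed query."""
    params = {"q": f"{field}:{term}", "debug": "query", "rows": 0}
    return f"{base_url}/{collection}/select?{urlencode(params)}"

# Host and collection name are placeholders, not taken from the thread.
for term in ("identifying", "identification"):
    print(debug_query_url("http://localhost:8983/solr", "mycollection",
                          "sectionbody_t_en", term))
```

Fetching each URL and comparing the `parsedquery` entry in the `debug` section of the two responses should show the two different stemmed terms side by side.
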
>>> 
>>>    Best,
>>>    Erick
>>> 
>>>    Oh, and un-checking the ‘verbose’ box on the analysis page removes
>>> a lot of distraction; the detailed information is often TMI ;)
>>> 
>>>> On Apr 30, 2020, at 2:51 PM, Jhonny Lopez <jhonny.lo...@publicismedia.com> wrote:
>>>> 
>>>> Sure, rewriting the message with links for images:
>>>> 
>>>> 
>>>> We’re facing an issue with stemming in Solr. Most cases work
>>>> correctly; for example, if we search for “bidding”, Solr returns
>>>> results for bidding, bid, bids, etc. However, for nouns ending with the
>>>> ‘ion’ suffix, stemming is not working. Even when the analyzer seems to
>>>> stem the word correctly, the results do not reflect that. One example:
>>>> if I search for ‘identifying’, this is the output:
>>>> 
>>>> Analyzer (image link):
>>>> 
>>> 
>> https://urldefense.proofpoint.com/v2/url?u=https-3A__1drv.ms_u_s-21AlRTlFq8tQbShd4-2DCp40Cmc0QioS0A-3Fe-3D1f3GJp&d=DwIFaQ&c=jf_iaSHvJObTbx-siA1ZOg&r=_8ViuZIeSRdQjONA8yHWPZIBlhj291HU3JpNIx5a55M&m=8Xt1N2A4ODj--DlLb242c8JMnJr6nWQIwcKjiDiA__s&s=U-Wmu118X5bfNxDnADO_6ompf9kUxZYHj1DZM2lG4jo&e=
>>>> 
>>>> A clip of results:
>>>> "haschildren_b":false,
>>>>       "isbucket_text_s":"0",
>>>>       "sectionbody_t":"\n\n\nIn order to identify 1st price auctions, leverage the proprietary tools available or manually pull a log file report to understand the trends and gauge auction spread overtime to assess the impact of variable auction dynamics.\n\n\n\n\n\n\n",
>>>>       "parsedupdatedby_s":"sitecorecarvaini",
>>>>       "sectionbody_t_en":"\n\n\nIn order to identify 1st price auctions, leverage the proprietary tools available or manually pull a log file report to understand the trends and gauge auction spread overtime to assess the impact of variable auction dynamics.\n\n\n\n\n\n\n",
>>>>       "hide_section_b":false
>>>> 
>>>> 
>>>> As you can see, stemming has been applied correctly and returns
>>>> results for other words based on the root, in this case “Identify”.
>>>> 
>>>> However, if I search for “Identification”, this is the output:
>>>> 
>>>> Analyzer (image link):
>>>> 
>>> 
>> https://urldefense.proofpoint.com/v2/url?u=https-3A__1drv.ms_u_s-21AlRTlFq8tQbShd49RpiQObzMgSjVhA&d=DwIFaQ&c=jf_iaSHvJObTbx-siA1ZOg&r=_8ViuZIeSRdQjONA8yHWPZIBlhj291HU3JpNIx5a55M&m=8Xt1N2A4ODj--DlLb242c8JMnJr6nWQIwcKjiDiA__s&s=5RlkLH-90sYc4nyIgnPO9MsBlyh7iWSOphEVdjUvTIE&e=
>>>> 
>>>> 
>>>> Even with proper stemming, Solr only returns results for the
>>>> word identification (or identifications) but nothing else.
>>>> 
>>>> The queries run against the same field, which has the Porter Stemming
>>>> Filter applied at both query and index time. This behavior is consistent
>>>> with other nouns ending in ‘ion’: representation, modification, etc.
>>>> 
>>>> Solr version: 8.1. Does anyone know why this is happening? Is it a bug?
>>>> 
>>>> Thanks.
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> -----Original Message-----
>>>> From: Erick Erickson <erickerick...@gmail.com>
>>>> Sent: Thursday, April 30, 2020 1:47 PM
>>>> To: solr-user@lucene.apache.org
>>>> Subject: Re: Possible issue with Stemming and nouns ended with suffix 'ion'
>>>> 
>>>> The mail server is pretty aggressive about stripping links, so
>>>> we can’t see the images.
>>>> 
>>>> Could you put them somewhere and paste a link?
>>>> 
>>>> Best,
>>>> Erick
>>> 
>>>> ------------------------------------------------------------------------
>>>> Disclaimer: The information in this email and any attachments may
>>>> contain proprietary and confidential information that is intended for
>>>> the addressee(s) only. If you are not the intended recipient, you are
>>>> hereby notified that any disclosure, copying, distribution, retention
>>>> or use of the contents of this information is prohibited. When
>>>> addressed to our clients or vendors, any information contained in this
>>>> e-mail or any attachments is subject to the terms and conditions in
>>>> any governing contract. If you have received this e-mail in error,
>>>> please immediately contact the sender and delete the e-mail.
