The Porter/Snowball stemmer is an evolved version of a forty-year-old hack. It is neat that it works at all, but don’t expect too much. I think it is too aggressive for search use.
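To make the mismatch in the thread concrete, here is a rough Python sketch that traces only the four Porter rules that fire for the identify/identification pair (step 1c Y to I, step 2 ATION to ATE, step 3 ICATE to IC, step 4 dropping IC). It is not a full stemmer and it is not the code behind Lucene's PorterStemFilter; the function names (is_consonant, measure, has_vowel, trace) are made up for illustration.

```python
def is_consonant(word, i):
    """Porter's definition: a, e, i, o, u are vowels; y is a vowel after a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    """Porter's m: the number of vowel-to-consonant transitions in the stem."""
    m = 0
    prev_vowel = False
    for i in range(len(stem)):
        cons = is_consonant(stem, i)
        if cons and prev_vowel:
            m += 1
        prev_vowel = not cons
    return m


def has_vowel(stem):
    return any(not is_consonant(stem, i) for i in range(len(stem)))


def trace(word):
    """Apply only the rules relevant to -ify / -ification words (illustrative subset)."""
    w = word.lower()
    # Step 1c: Y -> I when the stem contains a vowel
    if w.endswith("y") and has_vowel(w[:-1]):
        w = w[:-1] + "i"
    # Step 2: ATION -> ATE when m > 0
    if w.endswith("ation") and measure(w[:-5]) > 0:
        w = w[:-5] + "ate"
    # Step 3: ICATE -> IC when m > 0
    if w.endswith("icate") and measure(w[:-5]) > 0:
        w = w[:-5] + "ic"
    # Step 4: drop IC when m > 1
    if w.endswith("ic") and measure(w[:-2]) > 1:
        w = w[:-2]
    return w


for word in ("identify", "identification"):
    print(word, "->", trace(word))
# identify -> identifi
# identification -> identif
```

Under these rules the verb keeps a trailing "i" and the noun loses it, which is the "identifi" versus "identif" split Erick describes further down.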
What does KStem do with this? That is based on better linguistic models.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On May 1, 2020, at 8:45 AM, Mike Drob <md...@apache.org> wrote:
>
> This is how things get stemmed *now*, but I believe there is an open
> question as to whether that is how they *should* be stemmed. Specifically,
> the case appears to be -ify words not stemming to the same as -ification -
> this applies to much more than identify/identification. Also, justify,
> fortify, notify, many many others.
>
> $ grep ification /usr/share/dict/words | wc -l
> 328
>
> I am by no means an expert on stemming, and if the folks at snowball decide
> to tell us that this change is bad or hard because it would overstem some
> other words, then I'll happily accept that. But I definitely want to use
> their expertise rather than relying on my own.
>
> Mike
>
> On Fri, May 1, 2020 at 10:35 AM Audrey Lorberfeld -
> audrey.lorberf...@ibm.com <audrey.lorberf...@ibm.com> wrote:
>
>> Unless I'm misunderstanding the bug in question, there is no bug. What you
>> are observing is simply just how things get stemmed...
>>
>> Best,
>> Audrey
>>
>> On 4/30/20, 6:37 PM, "Jhonny Lopez" <jhonny.lo...@publicismedia.com>
>> wrote:
>>
>> Yes, sounds like worth it.
>>
>> Thanks guys!
>>
>> -----Original Message-----
>> From: Mike Drob <md...@apache.org>
>> Sent: Thursday, April 30, 2020 5:30 PM
>> To: solr-user@lucene.apache.org
>> Subject: Re: Possible issue with Stemming and nouns ended with suffix 'ion'
>>
>> This email has been sent from a source external to Publicis Groupe.
>> Please use caution when clicking links or opening attachments.
>>
>> Is this worth filing a bug/suggestion to the folks over at
>> snowballstem.org?
>>
>> On Thu, Apr 30, 2020 at 4:08 PM Audrey Lorberfeld -
>> audrey.lorberf...@ibm.com <audrey.lorberf...@ibm.com> wrote:
>>
>>> I agree with Erick. I think that's just how the cookie crumbles when
>>> stemming. If you have some time on your hands, you can integrate
>>> OpenNLP with your Solr instance and start using the lemmas of tokens
>>> instead of the stems. In this case, I believe if you were to lemmatize
>>> both "identify" and "identification," they would both condense to
>>> "identify."
>>>
>>> Best,
>>> Audrey
>>>
>>> On 4/30/20, 3:54 PM, "Erick Erickson" <erickerick...@gmail.com> wrote:
>>>
>>> They are being stemmed to two different tokens, “identif” and
>>> “identifi”. Stemming is algorithmic and imperfect and in this case
>>> you’re getting bitten by that algorithm. It looks like you’re using
>>> PorterStemFilter, if you want you can look up the exact algorithm, but
>>> I don’t think it’s a bug, just one of those little joys of English...
>>>
>>> To get a clearer picture of exactly what’s being searched, try
>>> adding &debug=query to your query, in particular looking at the parsed
>>> query that’s returned. That’ll tell you a bunch. In this particular
>>> case I don’t think it’ll tell you anything more, but for future…
>>>
>>> Best,
>>> Erick
>>>
>>> Oh, and un-checking the ‘verbose’ box on the analysis page removes
>>> a lot of distraction, the detailed information is often TMI ;)
>>>
>>>> On Apr 30, 2020, at 2:51 PM, Jhonny Lopez
>>>> <jhonny.lo...@publicismedia.com> wrote:
>>>>
>>>> Sure, rewriting the message with links for images:
>>>>
>>>> We’re facing an issue with stemming in solr. Most of the cases
>>>> are working correctly, for example, if we search for bidding, solr
>>>> brings results for bidding, bid, bids, etc. However, with nouns
>>>> ended with ‘ion’ suffix, stemming is not working. Even when
>>>> analyzers seems to have correct stemming of the word, the results
>>>> are not reflecting that. One example. If I search ‘identifying’,
>>>> this is the output:
>>>>
>>>> Analyzer (image link):
>>>> https://urldefense.proofpoint.com/v2/url?u=https-3A__1drv.ms_u_s-21AlRTlFq8tQbShd4-2DCp40Cmc0QioS0A-3Fe-3D1f3GJp&d=DwIFaQ&c=jf_iaSHvJObTbx-siA1ZOg&r=_8ViuZIeSRdQjONA8yHWPZIBlhj291HU3JpNIx5a55M&m=8Xt1N2A4ODj--DlLb242c8JMnJr6nWQIwcKjiDiA__s&s=U-Wmu118X5bfNxDnADO_6ompf9kUxZYHj1DZM2lG4jo&e=
>>>>
>>>> A clip of results:
>>>> "haschildren_b":false,
>>>> "isbucket_text_s":"0",
>>>> "sectionbody_t":"\n\n\nIn order to identify 1st price auctions, leverage the proprietary tools available or manually pull a log file report to understand the trends and gauge auction spread overtime to assess the impact of variable auction dynamics.\n\n\n\n\n\n\n",
>>>> "parsedupdatedby_s":"sitecorecarvaini",
>>>> "sectionbody_t_en":"\n\n\nIn order to identify 1st price auctions, leverage the proprietary tools available or manually pull a log file report to understand the trends and gauge auction spread overtime to assess the impact of variable auction dynamics.\n\n\n\n\n\n\n",
>>>> "hide_section_b":false
>>>>
>>>> As you can see, it has used the stemming correctly and brings
>>>> results for other words based in the root, in this case “Identify”.
>>>>
>>>> However, if I search for “Identification”, this is the output:
>>>>
>>>> Analyzer (image link):
>>>> https://urldefense.proofpoint.com/v2/url?u=https-3A__1drv.ms_u_s-21AlRTlFq8tQbShd49RpiQObzMgSjVhA&d=DwIFaQ&c=jf_iaSHvJObTbx-siA1ZOg&r=_8ViuZIeSRdQjONA8yHWPZIBlhj291HU3JpNIx5a55M&m=8Xt1N2A4ODj--DlLb242c8JMnJr6nWQIwcKjiDiA__s&s=5RlkLH-90sYc4nyIgnPO9MsBlyh7iWSOphEVdjUvTIE&e=
>>>>
>>>> Even with proper stemming, solr is only bringing results for the
>>>> word identification (or identifications) but nothing else.
>>>>
>>>> The queries are over the same field that has the Porter Stemming
>>>> Filter applied for both, query and index. This behavior is consistent
>>>> with other ‘ion’ ended nouns: representation, modification, etc.
>>>>
>>>> Solr Version: 8.1. Does anyone know why is it happening? Is it a bug?
>>>>
>>>> Thanks.
>>>>
>>>> -----Original Message-----
>>>> From: Erick Erickson <erickerick...@gmail.com>
>>>> Sent: Thursday, April 30, 2020 1:47 PM
>>>> To: solr-user@lucene.apache.org
>>>> Subject: Re: Possible issue with Stemming and nouns ended with suffix 'ion'
>>>>
>>>> The mail server is pretty aggressive about stripping links, so
>>>> we can’t see the images.
>>>>
>>>> Could you put them somewhere and paste a link?
>>>>
>>>> Best,
>>>> Erick
>>>>
>>>> ------------------------------------------------------------------------
>>>> Disclaimer The information in this email and any attachments may
>>>> contain proprietary and confidential information that is intended for
>>>> the addressee(s) only. If you are not the intended recipient, you are
>>>> hereby notified that any disclosure, copying, distribution, retention
>>>> or use of the contents of this information is prohibited. When
>>>> addressed to our clients or vendors, any information contained in this
>>>> e-mail or any attachments is subject to the terms and conditions in
>>>> any governing contract. If you have received this e-mail in error,
>>>> please immediately contact the sender and delete the e-mail.