Is this worth filing a bug/suggestion to the folks over at snowballstem.org?
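Before filing anything, it may help to reproduce the two stems outside Solr. Below is a rough Python sketch that applies only the handful of rules from the published Porter algorithm that these words hit. It is not Lucene's PorterStemFilter code and not a complete stemmer, and the names (stem_subset, measure, and so on) are made up for illustration.

```python
VOWELS = "aeiou"


def is_consonant(word, i):
    """Porter's definition: 'y' counts as a vowel when it follows a consonant."""
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def has_vowel(stem):
    return any(not is_consonant(stem, i) for i in range(len(stem)))


def measure(stem):
    """Count vowel-to-consonant transitions (the 'm' of the Porter paper)."""
    m = 0
    prev_vowel = False
    for i in range(len(stem)):
        cons = is_consonant(stem, i)
        if cons and prev_vowel:
            m += 1
        prev_vowel = not cons
    return m


def stem_subset(word):
    """Apply a SUBSET of Porter rules (assumption: enough for these examples only)."""
    w = word.lower()
    # Step 1b (subset): drop "ing" when the remaining stem contains a vowel.
    if w.endswith("ing") and has_vowel(w[:-3]):
        w = w[:-3]
    # Step 1c: final "y" becomes "i" when the stem contains a vowel.
    if w.endswith("y") and has_vowel(w[:-1]):
        w = w[:-1] + "i"
    # Step 2 (subset): "ation" -> "ate" when measure(stem) > 0.
    if w.endswith("ation") and measure(w[:-5]) > 0:
        w = w[:-5] + "ate"
    # Step 3 (subset): "icate" -> "ic" when measure(stem) > 0.
    if w.endswith("icate") and measure(w[:-5]) > 0:
        w = w[:-5] + "ic"
    # Step 4 (subset): drop "ic" when measure(stem) > 1.
    if w.endswith("ic") and measure(w[:-2]) > 1:
        w = w[:-2]
    return w


for term in ("identifying", "identify", "identification"):
    print(term, "->", stem_subset(term))
# identifying -> identifi
# identify -> identifi
# identification -> identif
```

The "ation" -> "ate" -> "ic" -> "" chain strips "identification" down to "identif", while "identify" only gets its final "y" turned into "i". That matches the two tokens Erick mentions in the quoted thread, so the mismatch appears to come from the algorithm as specified rather than from anything in the Solr configuration.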
On Thu, Apr 30, 2020 at 4:08 PM Audrey Lorberfeld - audrey.lorberf...@ibm.com <audrey.lorberf...@ibm.com> wrote:

> I agree with Erick. I think that's just how the cookie crumbles when
> stemming. If you have some time on your hands, you can integrate OpenNLP
> with your Solr instance and start using the lemmas of tokens instead of
> the stems. In this case, I believe if you were to lemmatize both
> "identify" and "identification," they would both condense to "identify."
>
> Best,
> Audrey
>
> On 4/30/20, 3:54 PM, "Erick Erickson" <erickerick...@gmail.com> wrote:
>
> > They are being stemmed to two different tokens, “identif” and
> > “identifi”. Stemming is algorithmic and imperfect, and in this case
> > you’re getting bitten by that algorithm. It looks like you’re using
> > PorterStemFilter; if you want, you can look up the exact algorithm,
> > but I don’t think it’s a bug, just one of those little joys of
> > English...
> >
> > To get a clearer picture of exactly what’s being searched, try adding
> > &debug=query to your query, in particular looking at the parsed query
> > that’s returned. That’ll tell you a bunch. In this particular case I
> > don’t think it’ll tell you anything more, but for future…
> >
> > Best,
> > Erick
> >
> > Oh, and un-checking the ‘verbose’ box on the analysis page removes a
> > lot of distraction; the detailed information is often TMI ;)
> >
> > > On Apr 30, 2020, at 2:51 PM, Jhonny Lopez <jhonny.lo...@publicismedia.com> wrote:
> > >
> > > Sure, rewriting the message with links for images:
> > >
> > > We’re facing an issue with stemming in Solr. Most cases work
> > > correctly; for example, if we search for bidding, Solr brings
> > > results for bidding, bid, bids, etc. However, with nouns ending in
> > > the ‘ion’ suffix, stemming is not working. Even when the analyzer
> > > seems to stem the word correctly, the results do not reflect that.
> > > One example: if I search ‘identifying’, this is the output:
> > >
> > > Analyzer (image link):
> > > https://urldefense.proofpoint.com/v2/url?u=https-3A__1drv.ms_u_s-21AlRTlFq8tQbShd4-2DCp40Cmc0QioS0A-3Fe-3D1f3GJp&d=DwIFaQ&c=jf_iaSHvJObTbx-siA1ZOg&r=_8ViuZIeSRdQjONA8yHWPZIBlhj291HU3JpNIx5a55M&m=8Xt1N2A4ODj--DlLb242c8JMnJr6nWQIwcKjiDiA__s&s=U-Wmu118X5bfNxDnADO_6ompf9kUxZYHj1DZM2lG4jo&e=
> > >
> > > A clip of results:
> > >     "haschildren_b":false,
> > >     "isbucket_text_s":"0",
> > >     "sectionbody_t":"\n\n\nIn order to identify 1st price auctions, leverage the proprietary tools available or manually pull a log file report to understand the trends and gauge auction spread overtime to assess the impact of variable auction dynamics.\n\n\n\n\n\n\n",
> > >     "parsedupdatedby_s":"sitecorecarvaini",
> > >     "sectionbody_t_en":"\n\n\nIn order to identify 1st price auctions, leverage the proprietary tools available or manually pull a log file report to understand the trends and gauge auction spread overtime to assess the impact of variable auction dynamics.\n\n\n\n\n\n\n",
> > >     "hide_section_b":false
> > >
> > > As you can see, it has used the stemming correctly and brings
> > > results for other words based on the root, in this case “Identify”.
> > >
> > > However, if I search for “Identification”, this is the output:
> > >
> > > Analyzer (image link):
> > > https://urldefense.proofpoint.com/v2/url?u=https-3A__1drv.ms_u_s-21AlRTlFq8tQbShd49RpiQObzMgSjVhA&d=DwIFaQ&c=jf_iaSHvJObTbx-siA1ZOg&r=_8ViuZIeSRdQjONA8yHWPZIBlhj291HU3JpNIx5a55M&m=8Xt1N2A4ODj--DlLb242c8JMnJr6nWQIwcKjiDiA__s&s=5RlkLH-90sYc4nyIgnPO9MsBlyh7iWSOphEVdjUvTIE&e=
> > >
> > > Even with proper stemming, Solr is only bringing results for the
> > > word identification (or identifications) but nothing else.
> > >
> > > The queries are over the same field, which has the Porter Stemming
> > > Filter applied for both query and index. This behavior is
> > > consistent with other nouns ending in ‘ion’: representation,
> > > modification, etc.
> > >
> > > Solr Version: 8.1. Does anyone know why this is happening? Is it a
> > > bug?
> > >
> > > Thanks.
> > >
> > > -----Original Message-----
> > > From: Erick Erickson <erickerick...@gmail.com>
> > > Sent: Thursday, April 30, 2020 1:47 PM
> > > To: solr-user@lucene.apache.org
> > > Subject: Re: Possible issue with Stemming and nouns ended with suffix 'ion'
> > >
> > > The mail server is pretty aggressive about stripping links, so we
> > > can’t see the images.
> > >
> > > Could you put them somewhere and paste a link?
> > >
> > > Best,
> > > Erick
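
Erick's &debug=query suggestion in the thread above can also be tried from a script. A minimal Python sketch follows; the host (localhost:8983), the collection name (mycollection) and the choice of sectionbody_t_en as the queried field are placeholders, since the thread does not say which URL or field was actually used.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumptions: placeholder host and collection, not taken from the thread.
SOLR_BASE = "http://localhost:8983/solr/mycollection"


def build_debug_url(field, term, base=SOLR_BASE):
    """Build a /select URL that asks Solr to include query debug output."""
    params = {"q": f"{field}:{term}", "debug": "query", "wt": "json"}
    return f"{base}/select?{urlencode(params)}"


def fetch_parsed_query(url):
    """Fetch the response and return the parsed query string, if present."""
    with urlopen(url) as resp:
        body = json.load(resp)
    return body.get("debug", {}).get("parsedquery")


url = build_debug_url("sectionbody_t_en", "identification")
print(url)
# Against a running Solr instance one could then call:
# print(fetch_parsed_query(url))
```

Comparing the parsed query for "identification" with the one for "identifying" should show the two different stemmed terms directly, without going through the analysis page.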