Is this worth filing as a bug/suggestion with the folks over at snowballstem.org?

On Thu, Apr 30, 2020 at 4:08 PM Audrey Lorberfeld -
audrey.lorberf...@ibm.com <audrey.lorberf...@ibm.com> wrote:

> I agree with Erick. I think that's just how the cookie crumbles when
> stemming. If you have some time on your hands, you can integrate OpenNLP
> with your Solr instance and start using the lemmas of tokens instead of the
> stems. In this case, I believe if you were to lemmatize both "identify" and
> "identification," they would both condense to "identify."
>
> Best,
> Audrey
>
> On 4/30/20, 3:54 PM, "Erick Erickson" <erickerick...@gmail.com> wrote:
>
>     They are being stemmed to two different tokens, “identif” and
> “identifi”. Stemming is algorithmic and imperfect, and in this case you’re
> getting bitten by that algorithm. It looks like you’re using
> PorterStemFilter; you can look up the exact algorithm if you want, but I
> don’t think it’s a bug, just one of those little joys of English...
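
A minimal sketch of why two different stems produce no match, assuming a toy inverted index rather than Solr itself. The stems are hard-coded from the two tokens reported above ("identif" and "identifi"); the assignment of each word to its stem follows my understanding of the Porter algorithm, and the function names and sample documents are hypothetical.

```python
from collections import defaultdict

# Stems hard-coded for illustration; this is a lookup table, not a stemmer.
# Assumption: "identify"/"identifying" -> "identifi", "identification" -> "identif".
REPORTED_STEMS = {
    "identify": "identifi",
    "identifying": "identifi",
    "identification": "identif",
}


def analyze(text):
    """Lowercase, split on whitespace, and replace known words with their stem."""
    return [REPORTED_STEMS.get(token, token) for token in text.lower().split()]


def build_index(docs):
    """Map each analyzed term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents sharing at least one analyzed term with the query."""
    hits = set()
    for term in analyze(query):
        hits |= index.get(term, set())
    return hits


docs = {
    1: "in order to identify 1st price auctions",
    2: "identification of auction trends",
}
index = build_index(docs)
print(search(index, "identifying"))     # matches only the "identify" document
print(search(index, "identification"))  # matches only the "identification" document
```

The same analysis runs at index and query time, yet the two word families never meet because matching is exact on the stemmed term.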
>
>     To get a clearer picture of exactly what’s being searched, try adding
> &debug=query to your query, in particular looking at the parsed query
> that’s returned. That’ll tell you a bunch. In this particular case I don’t
> think it’ll tell you anything more, but for future…
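
A small sketch of building such a request URL with the debug=query parameter. The host, port, collection name, and the choice of sectionbody_t_en as the queried field are assumptions for illustration only; adjust them to the actual setup.

```python
from urllib.parse import urlencode


def debug_query_url(base_url, collection, field, term):
    """Build a select URL that asks Solr to include query debug output."""
    params = {"q": f"{field}:{term}", "debug": "query"}
    return f"{base_url}/{collection}/select?{urlencode(params)}"


# Hypothetical host and collection; the field name is borrowed from the result clip.
url = debug_query_url(
    "http://localhost:8983/solr", "mycollection", "sectionbody_t_en", "identification"
)
print(url)
```

The parsed query should then appear in the debug section of the response, where the stemmed term can be compared with what the analysis page shows.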
>
>     Best,
>     Erick
>
>     Oh, and un-checking the ‘verbose’ box on the analysis page removes a
> lot of distraction; the detailed information is often TMI ;)
>
>     > On Apr 30, 2020, at 2:51 PM, Jhonny Lopez <
> jhonny.lo...@publicismedia.com> wrote:
>     >
>     > Sure, rewriting the message with links for images:
>     >
>     >
>     > We’re facing an issue with stemming in Solr. Most cases work
> correctly; for example, if we search for bidding, Solr brings
> results for bidding, bid, bids, etc. However, with nouns ending in the ‘ion’
> suffix, stemming is not working. Even when the analyzer seems to stem
> the word correctly, the results do not reflect that. One example: if
> I search for ‘identifying’, this is the output:
>     >
>     > Analyzer (image link):
>     >
> https://urldefense.proofpoint.com/v2/url?u=https-3A__1drv.ms_u_s-21AlRTlFq8tQbShd4-2DCp40Cmc0QioS0A-3Fe-3D1f3GJp&d=DwIFaQ&c=jf_iaSHvJObTbx-siA1ZOg&r=_8ViuZIeSRdQjONA8yHWPZIBlhj291HU3JpNIx5a55M&m=8Xt1N2A4ODj--DlLb242c8JMnJr6nWQIwcKjiDiA__s&s=U-Wmu118X5bfNxDnADO_6ompf9kUxZYHj1DZM2lG4jo&e=
>     >
>     > A clip of results:
>     > "haschildren_b":false,
>     >        "isbucket_text_s":"0",
>     >        "sectionbody_t":"\n\n\nIn order to identify 1st price
> auctions, leverage the proprietary tools available or manually pull a log
> file report to understand the trends and gauge auction spread overtime to
> assess the impact of variable auction dynamics.\n\n\n\n\n\n\n",
>     >        "parsedupdatedby_s":"sitecorecarvaini",
>     >        "sectionbody_t_en":"\n\n\nIn order to identify 1st price
> auctions, leverage the proprietary tools available or manually pull a log
> file report to understand the trends and gauge auction spread overtime to
> assess the impact of variable auction dynamics.\n\n\n\n\n\n\n",
>     >        "hide_section_b":false
>     >
>     >
>     > As you can see, it has applied the stemming correctly and brings
> results for other words based on the root, in this case “Identify”.
>     >
>     > However, if I search for “Identification”, this is the output:
>     >
>     > Analyzer (image link):
>     >
> https://urldefense.proofpoint.com/v2/url?u=https-3A__1drv.ms_u_s-21AlRTlFq8tQbShd49RpiQObzMgSjVhA&d=DwIFaQ&c=jf_iaSHvJObTbx-siA1ZOg&r=_8ViuZIeSRdQjONA8yHWPZIBlhj291HU3JpNIx5a55M&m=8Xt1N2A4ODj--DlLb242c8JMnJr6nWQIwcKjiDiA__s&s=5RlkLH-90sYc4nyIgnPO9MsBlyh7iWSOphEVdjUvTIE&e=
>     >
>     >
>     > Even with proper stemming, Solr is only bringing results for the
> word identification (or identifications) but nothing else.
>     >
>     > The queries run against the same field, which has the Porter Stemming
> Filter applied at both query and index time. This behavior is consistent with
> other nouns ending in ‘ion’: representation, modification, etc.
>     >
>     > Solr version: 8.1. Does anyone know why this is happening? Is it a bug?
>     >
>     > Thanks.
>     >
>     >
>     >
>     >
>     >
>     > -----Original Message-----
>     > From: Erick Erickson <erickerick...@gmail.com>
>     > Sent: Thursday, April 30, 2020 1:47 PM
>     > To: solr-user@lucene.apache.org
>     > Subject: Re: Possible issue with Stemming and nouns ended with
> suffix 'ion'
>     >
>     > The mail server is pretty aggressive about stripping links, so we
> can’t see the images.
>     >
>     > Could you put them somewhere and paste a link?
>     >
>     > Best,
>     > Erick
>     >
>     >> On Apr 30, 2020, at 2:40 PM, Jhonny Lopez <
> jhonny.lo...@publicismedia.com> wrote:
>     >>  Jhonny Lopez
>     >>  Technical Architect
>     >>  Avenida Calle 26 No. 92 - 32, Edificio BTS3
>     >>  APDO. 128-1255 Bogota
>     >>  T: +573006805461
>     >>  jhonny.lo...@publicismedia.com
>     >>  www.prodigious.com
>     >>
> ----------------------------------------------------------------------
>     >
>     >> -- Disclaimer The information in this email and any attachments may
>     >
>     >> contain proprietary and confidential information that is intended
> for the addressee(s) only. If you are not the intended recipient, you are
> hereby notified that any disclosure, copying, distribution, retention or
> use of the contents of this information is prohibited. When addressed to
> our clients or vendors, any information contained in this e-mail or any
> attachments is subject to the terms and conditions in any governing
> contract. If you have received this e-mail in error, please immediately
> contact the sender and delete the e-mail.
>
>
>
>
