I agree with Erick. I think that's just how the cookie crumbles when stemming. 
If you have some time on your hands, you can integrate OpenNLP with your Solr 
instance and start using the lemmas of tokens instead of the stems. In this 
case, I believe if you were to lemmatize both "identify" and "identification," 
they would both condense to "identify."

Best,
Audrey

On 4/30/20, 3:54 PM, "Erick Erickson" <erickerick...@gmail.com> wrote:

    They are being stemmed to two different tokens, “identif” and “identifi”. 
Stemming is algorithmic and imperfect, and in this case you’re getting bitten by 
that algorithm. It looks like you’re using PorterStemFilter. If you want, you 
can look up the exact algorithm, but I don’t think it’s a bug, just one of 
those little joys of English...
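
To make the two tokens above concrete, here is a rough Python sketch that applies only the handful of Porter rules that fire for these particular words. It is not the full Porter algorithm and it is not Lucene's PorterStemFilter code: the names (`partial_porter`, `measure`, `RULES`) are made up for illustration, and the cleanup sub-rules of step 1b are left out.

```python
VOWELS = "aeiou"

def is_consonant(word, i):
    """Porter's definition: 'y' counts as a vowel when it follows a consonant."""
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem):
    """Porter's m: the number of vowel-run -> consonant-run transitions."""
    m, prev_vowel = 0, False
    for i in range(len(stem)):
        cons = is_consonant(stem, i)
        if cons and prev_vowel:
            m += 1
        prev_vowel = not cons
    return m

def has_vowel(stem):
    return any(not is_consonant(stem, i) for i in range(len(stem)))

# (suffix, replacement, condition on the remaining stem), tried in this order.
# Only the rules relevant to the words in this thread are included.
RULES = [
    ("ing",   "",    has_vowel),                 # step 1b (cleanup sub-rules omitted)
    ("y",     "i",   has_vowel),                 # step 1c
    ("ation", "ate", lambda s: measure(s) > 0),  # step 2
    ("icate", "ic",  lambda s: measure(s) > 0),  # step 3
    ("ic",    "",    lambda s: measure(s) > 1),  # step 4
]

def partial_porter(word):
    """Apply the rule subset above; an illustration, not a general stemmer."""
    word = word.lower()
    for suffix, replacement, condition in RULES:
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            if condition(stem):
                word = stem + replacement
    return word

for w in ("identify", "identifying", "identification"):
    print(w, "->", partial_porter(w))
# identify -> identifi
# identifying -> identifi
# identification -> identif
```

Both "identify" and "identifying" end in "identifi" because step 1c turns the final y into i, while "identification" goes ation -> ate, icate -> ic, and then drops the "ic", landing on "identif". The two tokens never match, which is why the search for "identification" does not find "identify".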
    
    To get a clearer picture of exactly what’s being searched, try adding 
&debug=query to your query, in particular looking at the parsed query that’s 
returned. That’ll tell you a bunch. In this particular case I don’t think it’ll 
tell you anything more, but for future…
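
As a small illustration of the &debug=query suggestion, the sketch below builds such a request URL in Python. The host, port, and collection name (`mycollection`) are placeholders and not values from this thread; the field name is borrowed from the result clip quoted in this thread, and `debug_query_url` is a hypothetical helper.

```python
from urllib.parse import urlencode

def debug_query_url(base_url, collection, field, term):
    """Build a Solr select URL that asks for query debug output."""
    params = {
        "q": f"{field}:{term}",
        "debug": "query",  # ask Solr to include the parsed query in the response
    }
    return f"{base_url}/solr/{collection}/select?{urlencode(params)}"

# Placeholder host and collection; adjust to your own installation.
url = debug_query_url(
    "http://localhost:8983", "mycollection", "sectionbody_t_en", "identification"
)
print(url)
```

In the response, the parsed query should appear in the debug section (the `parsedquery` entry), which is where the stemmed form of the search term becomes visible.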
    
    Best,
    Erick
    
    Oh, and unchecking the ‘verbose’ box on the analysis page removes a lot of 
distraction; the detailed information is often TMI ;)
    
    > On Apr 30, 2020, at 2:51 PM, Jhonny Lopez 
<jhonny.lo...@publicismedia.com> wrote:
    > 
    > Sure, rewriting the message with links for images:
    > 
    > 
    > We’re facing an issue with stemming in Solr. Most cases work 
correctly; for example, if we search for bidding, Solr brings back results 
for bidding, bid, bids, etc. However, with nouns ending in the ‘ion’ suffix, 
stemming is not working. Even when the analyzer seems to stem the word 
correctly, the results do not reflect that. One example: if I search for 
‘identifying’, this is the output:
    > 
    > Analyzer (image link):
    > 
https://urldefense.proofpoint.com/v2/url?u=https-3A__1drv.ms_u_s-21AlRTlFq8tQbShd4-2DCp40Cmc0QioS0A-3Fe-3D1f3GJp&d=DwIFaQ&c=jf_iaSHvJObTbx-siA1ZOg&r=_8ViuZIeSRdQjONA8yHWPZIBlhj291HU3JpNIx5a55M&m=8Xt1N2A4ODj--DlLb242c8JMnJr6nWQIwcKjiDiA__s&s=U-Wmu118X5bfNxDnADO_6ompf9kUxZYHj1DZM2lG4jo&e=
 
    > 
    > A clip of results:
    > "haschildren_b":false,
    >        "isbucket_text_s":"0",
    >        "sectionbody_t":"\n\n\nIn order to identify 1st price auctions, 
leverage the proprietary tools available or manually pull a log file report to 
understand the trends and gauge auction spread overtime to assess the impact of 
variable auction dynamics.\n\n\n\n\n\n\n",
    >        "parsedupdatedby_s":"sitecorecarvaini",
    >        "sectionbody_t_en":"\n\n\nIn order to identify 1st price auctions, 
leverage the proprietary tools available or manually pull a log file report to 
understand the trends and gauge auction spread overtime to assess the impact of 
variable auction dynamics.\n\n\n\n\n\n\n",
    >        "hide_section_b":false
    > 
    > 
    > As you can see, it has applied the stemming correctly and brings back results 
for other words based on the root, in this case “identify”.
    > 
    > However, if I search for “Identification”, this is the output:
    > 
    > Analyzer (image link):
    > 
https://urldefense.proofpoint.com/v2/url?u=https-3A__1drv.ms_u_s-21AlRTlFq8tQbShd49RpiQObzMgSjVhA&d=DwIFaQ&c=jf_iaSHvJObTbx-siA1ZOg&r=_8ViuZIeSRdQjONA8yHWPZIBlhj291HU3JpNIx5a55M&m=8Xt1N2A4ODj--DlLb242c8JMnJr6nWQIwcKjiDiA__s&s=5RlkLH-90sYc4nyIgnPO9MsBlyh7iWSOphEVdjUvTIE&e=
 
    > 
    > 
    > Even with proper stemming, Solr only brings back results for the word 
identification (or identifications) and nothing else.
    > 
    > The queries run against the same field, which has the Porter Stemming Filter 
applied at both query and index time. This behavior is consistent for other nouns 
ending in ‘ion’: representation, modification, etc.
    > 
    > Solr version: 8.1. Does anyone know why this is happening? Is it a bug?
    > 
    > Thanks.
    > 
    > -----Original Message-----
    > From: Erick Erickson <erickerick...@gmail.com>
    > Sent: Thursday, April 30, 2020 1:47 PM
    > To: solr-user@lucene.apache.org
    > Subject: Re: Possible issue with Stemming and nouns ended with suffix 'ion'
    > 
    > The mail server is pretty aggressive about stripping links, so we can’t 
see the images.
    > 
    > Could you put them somewhere and paste a link?
    > 
    > Best,
    > Erick
    > 
    >> On Apr 30, 2020, at 2:40 PM, Jhonny Lopez 
<jhonny.lo...@publicismedia.com> wrote:
    >> 
    >>  Jhonny Lopez
    >>  Technical Architect
    >>  Avenida Calle 26 No. 92 - 32, Edificio BTS3
    >>  APDO. 128-1255 Bogota
    >>  T: +573006805461
    >>  jhonny.lo...@publicismedia.com
    >>  www.prodigious.com
    >> 
    >> ------------------------------------------------------------------------
    >> Disclaimer: The information in this email and any attachments may contain 
proprietary and confidential information that is intended for the addressee(s) 
only. If you are not the intended recipient, you are hereby notified that any 
disclosure, copying, distribution, retention or use of the contents of this 
information is prohibited. When addressed to our clients or vendors, any 
information contained in this e-mail or any attachments is subject to the terms 
and conditions in any governing contract. If you have received this e-mail in 
error, please immediately contact the sender and delete the e-mail.
    
    
