RE: Possible issue with Stemming and nouns ended with suffix 'ion'

2020-05-01 Thread Jhonny Lopez
I tried KStem but some other scenarios were broken, so I reverted it. However,
that may have been caused by a misconfiguration. I will try it once more.

Thanks.

  Jhonny Lopez
  Technical Architect
  Avenida Calle 26 No. 92 - 32, Edificio BTS3
  APDO. 128-1255 Bogota
  T: +57 300 6805461
  jhonny.lo...@prodigious.com
  www.prodigious.com



-Original Message-
From: Walter Underwood
Sent: Friday, May 1, 2020 11:24 a.m.
To: solr-user@lucene.apache.org
Subject: Re: Possible issue with Stemming and nouns ended with suffix 'ion'


The Porter/Snowball stemmer is an evolved version of a forty year old hack.
It is neat that it works at all, but don’t expect too much. I think it is too 
aggressive for search use.

What does KStem do with this? That is based on better linguistic models.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


Re: Possible issue with Stemming and nouns ended with suffix 'ion'

2020-05-01 Thread Walter Underwood
The Porter/Snowball stemmer is an evolved version of a forty year old hack.
It is neat that it works at all, but don’t expect too much. I think it is too
aggressive for search use.

What does KStem do with this? That is based on better linguistic models.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


Re: Possible issue with Stemming and nouns ended with suffix 'ion'

2020-05-01 Thread Mike Drob
This is how things get stemmed *now*, but I believe there is an open
question as to whether that is how they *should* be stemmed. Specifically,
the case appears to be -ify words not stemming to the same as -ification -
this applies to much more than identify/identification. Also, justify,
fortify, notify, many many others.

$ grep ification /usr/share/dict/words | wc -l
 328
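
To make the divergence concrete, here is a rough Python sketch that applies
only the handful of classic Porter rules that fire for these particular words
(the -ing rule, y -> i, ation -> ate, icate -> ic, and dropping a final ic).
It is not the Lucene or Snowball implementation and it will mis-stem most
other words; the function names are made up for illustration.

```python
VOWELS = set("aeiou")

def is_consonant(word, i):
    """Porter's definition: 'y' counts as a vowel only when it follows a consonant."""
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True

def has_vowel(stem):
    return any(not is_consonant(stem, i) for i in range(len(stem)))

def measure(stem):
    """Porter's m: how many times a vowel run is followed by a consonant run."""
    m = 0
    prev_vowel = False
    for i in range(len(stem)):
        vowel = not is_consonant(stem, i)
        if prev_vowel and not vowel:
            m += 1
        prev_vowel = vowel
    return m

def partial_porter(word):
    """Illustrative subset of Porter rules; NOT a complete stemmer."""
    w = word.lower()
    # Step 1b (only the -ing rule, without the usual fix-ups afterwards)
    if w.endswith("ing") and has_vowel(w[:-3]):
        w = w[:-3]
    # Step 1c: final y -> i when the stem contains a vowel
    if w.endswith("y") and has_vowel(w[:-1]):
        w = w[:-1] + "i"
    # Step 2: ation -> ate when m > 0
    if w.endswith("ation") and measure(w[:-5]) > 0:
        w = w[:-5] + "ate"
    # Step 3: icate -> ic when m > 0
    if w.endswith("icate") and measure(w[:-5]) > 0:
        w = w[:-5] + "ic"
    # Step 4: drop a final ic when m > 1
    if w.endswith("ic") and measure(w[:-2]) > 1:
        w = w[:-2]
    return w

for word in ("identify", "identifying", "identification",
             "justify", "justification", "notify", "notification"):
    print(word, "->", partial_porter(word))
# identify and identifying both give "identifi"; identification gives "identif"
```

The -ify forms stop after the y -> i rule, while the -ification forms lose
"ation", then "icate", then "ic", so the two families never land on the same
token.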

I am by no means an expert on stemming, and if the folks at snowball decide
to tell us that this change is bad or hard because it would overstem some
other words, then I'll happily accept that. But I definitely want to use
their expertise rather than relying on my own.

Mike


RE: Possible issue with Stemming and nouns ended with suffix 'ion'

2020-05-01 Thread Audrey Lorberfeld - audrey.lorberf...@ibm.com
Unless I'm misunderstanding the bug in question, there is no bug. What you are
observing is simply how things get stemmed...

Best,
Audrey


Re: Possible issue with Stemming and nouns ended with suffix 'ion'

2020-05-01 Thread Mike Drob
Jhonny,

Are you planning on reporting the issue to snowball, or would you prefer
one of us take care of it?
If you do report it, please share the link to the issue or mail archive
back here so that we know when it is resolved and can update our
dependencies.

Thanks,
Mike


RE: Possible issue with Stemming and nouns ended with suffix 'ion'

2020-04-30 Thread Jhonny Lopez
Yes, sounds like it is worth it.

Thanks guys!


Re: Possible issue with Stemming and nouns ended with suffix 'ion'

2020-04-30 Thread Mike Drob
Is this worth filing a bug/suggestion to the folks over at snowballstem.org?

On Thu, Apr 30, 2020 at 4:08 PM Audrey Lorberfeld -
audrey.lorberf...@ibm.com  wrote:

> I agree with Erick. I think that's just how the cookie crumbles when
> stemming. If you have some time on your hands, you can integrate OpenNLP
> with your Solr instance and start using the lemmas of tokens instead of the
> stems. In this case, I believe if you were to lemmatize both "identify" and
> "identification," they would both condense to "identify."
>
> Best,
> Audrey
>
> On 4/30/20, 3:54 PM, "Erick Erickson"  wrote:
>
> They are being stemmed to two different tokens, “identif” and
> “identifi”. Stemming is algorithmic and imperfect and in this case you’re
> getting bitten by that algorithm. It looks like you’re using
> PorterStemFilter, if you want you can look up the exact algorithm, but I
> don’t think it’s a bug, just one of those little joys of English...
>
> To get a clearer picture of exactly what’s being searched, try adding
> &debug=query to your query, in particular looking at the parsed query
> that’s returned. That’ll tell you a bunch. In this particular case I don’t
> think it’ll tell you anything more, but for future…
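
For reference, the suggestion above relies on Solr's debug=query request
parameter. A minimal Python sketch of building such a request and pulling the
parsed query out of a JSON response follows. The host, the collection name
"mycollection", and the helper names are assumptions for illustration; only
the field name sectionbody_t_en comes from the results clip in this thread.

```python
from urllib.parse import urlencode

def build_debug_url(base_url, collection, query):
    """Build a /select URL that asks Solr to include query debug information."""
    params = {"q": query, "debug": "query", "wt": "json"}
    return f"{base_url}/{collection}/select?{urlencode(params)}"

def parsed_query(response):
    """Return the parsed query string from a decoded JSON response, if present."""
    return response.get("debug", {}).get("parsedquery")

# Hypothetical host and collection; adjust to your own installation.
url = build_debug_url("http://localhost:8983/solr", "mycollection",
                      "sectionbody_t_en:identification")
print(url)
```

Fetching that URL and passing the decoded JSON to parsed_query() should show
the stemmed token that is actually being searched.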
>
> Best,
> Erick
>
> Oh, and un-checking the ‘verbose’ box on the analysis page removes a
> lot of distraction, the detailed information is often TMI ;)
>
> > On Apr 30, 2020, at 2:51 PM, Jhonny Lopez <jhonny.lo...@publicismedia.com> wrote:
> >
> > Sure, rewriting the message with links for images:
> >
> >
> > We’re facing an issue with stemming in Solr. Most cases work
> > correctly: for example, if we search for bidding, Solr brings results
> > for bidding, bid, bids, etc. However, for nouns ending in the ‘ion’
> > suffix, stemming is not working. Even when the analyzer seems to stem
> > the word correctly, the results do not reflect that. One example: if
> > I search for ‘identifying’, this is the output:
> >
> > Analyzer (image link):
> >
> https://urldefense.proofpoint.com/v2/url?u=https-3A__1drv.ms_u_s-21AlRTlFq8tQbShd4-2DCp40Cmc0QioS0A-3Fe-3D1f3GJp=DwIFaQ=jf_iaSHvJObTbx-siA1ZOg=_8ViuZIeSRdQjONA8yHWPZIBlhj291HU3JpNIx5a55M=8Xt1N2A4ODj--DlLb242c8JMnJr6nWQIwcKjiDiA__s=U-Wmu118X5bfNxDnADO_6ompf9kUxZYHj1DZM2lG4jo=
> >
> > A clip of results:
> >   "haschildren_b":false,
> >   "isbucket_text_s":"0",
> >   "sectionbody_t":"\n\n\nIn order to identify 1st price auctions, leverage the proprietary tools available or manually pull a log file report to understand the trends and gauge auction spread overtime to assess the impact of variable auction dynamics.\n\n\n\n\n\n\n",
> >   "parsedupdatedby_s":"sitecorecarvaini",
> >   "sectionbody_t_en":"\n\n\nIn order to identify 1st price auctions, leverage the proprietary tools available or manually pull a log file report to understand the trends and gauge auction spread overtime to assess the impact of variable auction dynamics.\n\n\n\n\n\n\n",
> >   "hide_section_b":false
> >
> >
> > As you can see, it has used the stemming correctly and brings
> > results for other words based on the root, in this case “Identify”.
> >
> > However, if I search for “Identification”, this is the output:
> >
> > Analyzer (imagelink):
> >
> https://urldefense.proofpoint.com/v2/url?u=https-3A__1drv.ms_u_s-21AlRTlFq8tQbShd49RpiQObzMgSjVhA=DwIFaQ=jf_iaSHvJObTbx-siA1ZOg=_8ViuZIeSRdQjONA8yHWPZIBlhj291HU3JpNIx5a55M=8Xt1N2A4ODj--DlLb242c8JMnJr6nWQIwcKjiDiA__s=5RlkLH-90sYc4nyIgnPO9MsBlyh7iWSOphEVdjUvTIE=
> >
> >
> > Even with proper stemming, solr is only bringing results for the
> word identification (or identifications) but nothing else.
> >
> > The queries are over the same field that has the Porter Stemming
> Filter applied for both, query and index. This behavior is consistent with
> other ‘ion’ ended nouns: representation, modification, etc.
> >
> > Solr Version: 8.1. Does anyone know why is it happening? Is it a bug?
> >
> > Thanks.
> >
> >
> >
> >
> >
> > -Original Message-
> >
> > From: Erick Erickson 
> >
> > Sent: jueves, 30 de abril de 2020 1:47 p. m.
> >
> > To: solr-user@lucene.apache.org
> >
> > Subject: Re: Possible issue with Stemming and nouns ended with
> suffix 'ion'
> >
> >
> >
> > This email has been sent from a source external to Publicis Groupe.
> Please use caution when clicking links or opening attachments.
> >
> > Cet email a été envoyé depuis une source externe à Publicis Groupe.
> Veuillez faire preuve de prudence lorsque vous cliquez sur des liens ou
> lorsque vous ouvrez des pièces jointes.
> >
> >
> >
> >
> >
> >
> >
> > The mail server is pretty aggressive about stripping links, so we
> can’t see the images.
> >
> >
> >
> > Could you put 

RE: Possible issue with Stemming and nouns ended with suffix 'ion'

2020-04-30 Thread Audrey Lorberfeld - audrey.lorberf...@ibm.com
I agree with Erick. I think that's just how the cookie crumbles when stemming. 
If you have some time on your hands, you can integrate OpenNLP with your Solr 
instance and start using the lemmas of tokens instead of the stems. In this 
case, I believe if you were to lemmatize both "identify" and "identification," 
they would both condense to "identify."

Best,
Audrey

On 4/30/20, 3:54 PM, "Erick Erickson"  wrote:

They are being stemmed to two different tokens, “identif” and “identifi”. 
Stemming is algorithmic and imperfect and in this case you’re getting bitten by 
that algorithm. It looks like you’re using PorterStemFilter, if you want you 
can look up the exact algorithm, but I don’t think it’s a bug, just one of 
those little joys of English...

To get a clearer picture of exactly what’s being searched, try adding 
&debug=query to your query, in particular looking at the parsed query that’s
returned. That’ll tell you a bunch. In this particular case I don’t think it’ll 
tell you anything more, but for future…

Best,
Erick

Oh, and un-checking the ‘verbose’ box on the analysis page removes a lot of
distraction; the detailed information is often TMI ;)


Re: Possible issue with Stemming and nouns ended with suffix 'ion'

2020-04-30 Thread matthew sporleder
If you use the stemmer in your query analysis it should act the same, right?
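
Using the same stemmer at index and query time keeps both sides consistent, but a match still requires that the related words reduce to the same token. A minimal sketch in Python, with the stems hard-coded to the values reported in this thread (“identifi” and “identif”); the `analyze` helper, the tiny inverted index, and the second document are hypothetical and only for illustration, not Solr's implementation:

```python
# Stems hard-coded from the values reported in this thread; a real setup
# would get them from the field's analysis chain (e.g. PorterStemFilter).
REPORTED_STEMS = {
    "identify": "identifi",
    "identifying": "identifi",
    "identification": "identif",
}


def analyze(text):
    """Hypothetical stand-in for the analyzer: lowercase, split, look up the stem."""
    return [REPORTED_STEMS.get(token, token) for token in text.lower().split()]


def build_index(docs):
    """Build a toy inverted index: term -> set of document ids."""
    index = {}
    for doc_id, text in docs.items():
        for term in analyze(text):
            index.setdefault(term, set()).add(doc_id)
    return index


def search(index, query):
    """Analyze the query with the SAME analyzer and collect matching documents."""
    hits = set()
    for term in analyze(query):
        hits |= index.get(term, set())
    return hits


# Document 1 is taken from the result clip in this thread; document 2 is made up.
docs = {
    1: "in order to identify 1st price auctions",
    2: "identification of auction trends",
}
index = build_index(docs)

print(search(index, "identifying"))     # matches document 1 only
print(search(index, "identification"))  # matches document 2 only
```

Both sides go through the same analysis, yet “identification” never reaches the document that contains “identify”, because the two indexed terms differ.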

On Thu, Apr 30, 2020 at 3:54 PM Erick Erickson  wrote:
>
> They are being stemmed to two different tokens, “identif” and “identifi”. 
> Stemming is algorithmic and imperfect and in this case you’re getting bitten 
> by that algorithm. It looks like you’re using PorterStemFilter, if you want 
> you can look up the exact algorithm, but I don’t think it’s a bug, just one 
> of those little joys of English...
>
> To get a clearer picture of exactly what’s being searched, try adding 
> &debug=query to your query, in particular looking at the parsed query that’s
> returned. That’ll tell you a bunch. In this particular case I don’t think 
> it’ll tell you anything more, but for future…
>
> Best,
> Erick
>
> Oh, and un-checking the ‘verbose’ box on the analysis page removes a lot of
> distraction; the detailed information is often TMI ;)
>

Re: Possible issue with Stemming and nouns ended with suffix 'ion'

2020-04-30 Thread Erick Erickson
They are being stemmed to two different tokens, “identif” and “identifi”. 
Stemming is algorithmic and imperfect and in this case you’re getting bitten by 
that algorithm. It looks like you’re using PorterStemFilter, if you want you 
can look up the exact algorithm, but I don’t think it’s a bug, just one of 
those little joys of English...
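
To make the two stems concrete, here is a deliberately reduced sketch in Python of only the Porter rules that fire for these words (parts of steps 1b, 1c, 2, 3 and 4 of the published algorithm). It is an illustration under that assumption, not the PorterStemFilter code and not a complete stemmer; the function names are made up, and it will give wrong answers for many other words.

```python
VOWELS = "aeiou"


def is_consonant(word, i):
    """Porter's definition: 'y' counts as a consonant unless it follows a consonant."""
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def has_vowel(stem):
    return any(not is_consonant(stem, i) for i in range(len(stem)))


def measure(stem):
    """Number of vowel-run followed by consonant-run sequences (Porter's m)."""
    m = 0
    prev_was_vowel = False
    for i in range(len(stem)):
        consonant = is_consonant(stem, i)
        if consonant and prev_was_vowel:
            m += 1
        prev_was_vowel = not consonant
    return m


def stem_sketch(word):
    w = word.lower()
    # Step 1b (partial): drop -ing if the remaining stem contains a vowel.
    # The real step also has follow-up clean-up rules, omitted here.
    if w.endswith("ing") and has_vowel(w[:-3]):
        w = w[:-3]
    # Step 1c: final y -> i if the stem contains a vowel.
    if w.endswith("y") and has_vowel(w[:-1]):
        w = w[:-1] + "i"
    # Step 2 (one rule of many): -ation -> -ate when m > 0.
    if w.endswith("ation") and measure(w[:-5]) > 0:
        w = w[:-5] + "ate"
    # Step 3 (one rule of many): -icate -> -ic when m > 0.
    if w.endswith("icate") and measure(w[:-5]) > 0:
        w = w[:-5] + "ic"
    # Step 4 (one rule of many): drop -ic when m > 1.
    if w.endswith("ic") and measure(w[:-2]) > 1:
        w = w[:-2]
    return w


for word in ("identify", "identifying", "identification"):
    print(word, "->", stem_sketch(word))
# identify -> identifi
# identifying -> identifi
# identification -> identif
```

The verb forms stop at “identifi” because no later rule strips a bare trailing “i”, while the noun loses “-ation”, then “-ate”, then “-ic”, ending one letter shorter.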

To get a clearer picture of exactly what’s being searched, try adding 
&debug=query to your query, in particular looking at the parsed query that’s
returned. That’ll tell you a bunch. In this particular case I don’t think it’ll 
tell you anything more, but for future…
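
For reference, a small Python sketch of building such a request, assuming the parameter meant here is Solr's `debug=query`. The host, port and collection name are placeholders, and the field name is taken from the result clip in this thread; the parsed query would then appear in the debug section of the response.

```python
from urllib.parse import urlencode

# Hypothetical host and collection name; adjust for your installation.
SOLR_SELECT_URL = "http://localhost:8983/solr/mycollection/select"


def debug_query_url(field, term):
    """Build a select URL that asks Solr to return query debug information."""
    params = {
        "q": f"{field}:{term}",
        "debug": "query",  # ask for the parsed query in the response
    }
    return f"{SOLR_SELECT_URL}?{urlencode(params)}"


print(debug_query_url("sectionbody_t_en", "identification"))
```

Comparing the parsed query for “identifying” against the one for “identification” shows the actual terms being searched after analysis.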

Best,
Erick

Oh, and un-checking the ‘verbose’ box on the analysis page removes a lot of
distraction; the detailed information is often TMI ;)


RE: Possible issue with Stemming and nouns ended with suffix 'ion'

2020-04-30 Thread Jhonny Lopez
Sure, rewriting the message with links for images:


We’re facing an issue with stemming in Solr. Most cases work correctly: for
example, if we search for bidding, Solr brings back results for bidding, bid,
bids, etc. However, for nouns ending in the ‘ion’ suffix, stemming is not
working. Even when the analyzer seems to stem the word correctly, the results
do not reflect that. For example, if I search for ‘identifying’, this is the
output:

Analyzer (image link):
https://1drv.ms/u/s!AlRTlFq8tQbShd4-Cp40Cmc0QioS0A?e=1f3GJp

A clip of results:
"haschildren_b":false,
"isbucket_text_s":"0",
"sectionbody_t":"\n\n\nIn order to identify 1st price auctions, 
leverage the proprietary tools available or manually pull a log file report to 
understand the trends and gauge auction spread overtime to assess the impact of 
variable auction dynamics.\n\n\n\n\n\n\n",
"parsedupdatedby_s":"sitecorecarvaini",
"sectionbody_t_en":"\n\n\nIn order to identify 1st price auctions, 
leverage the proprietary tools available or manually pull a log file report to 
understand the trends and gauge auction spread overtime to assess the impact of 
variable auction dynamics.\n\n\n\n\n\n\n",
"hide_section_b":false


As you can see, it has applied the stemming correctly and brings back results
for other words based on the root, in this case “identify”.

However, if I search for “Identification”, this is the output:

Analyzer (image link):
https://1drv.ms/u/s!AlRTlFq8tQbShd49RpiQObzMgSjVhA


Even with proper stemming, Solr only brings back results for the word
identification (or identifications) but nothing else.

The queries run over the same field, which has the Porter Stemming Filter
applied at both query and index time. This behavior is consistent for other
nouns ending in ‘ion’: representation, modification, etc.

Solr version: 8.1. Does anyone know why this is happening? Is it a bug?

Thanks.






Re: Possible issue with Stemming and nouns ended with suffix 'ion'

2020-04-30 Thread Erick Erickson
The mail server is pretty aggressive about stripping links, so we can’t see the 
images.

Could you put them somewhere and paste a link?

Best,
Erick

> On Apr 30, 2020, at 2:40 PM, Jhonny Lopez  
> wrote:
> 
> We’re facing an issue with stemming in Solr. Most cases work correctly: for
> example, if we search for bidding, Solr brings back results for bidding, bid,
> bids, etc. However, for nouns ending in the ‘ion’ suffix, stemming is not
> working. Even when the analyzer seems to stem the word correctly, the results
> do not reflect that. For example, if I search for ‘identifying’, this is the
> output:
>  
> Analyzer (image):
> 
> A clip of results:
> "haschildren_b":false,
> "isbucket_text_s":"0",
> "sectionbody_t":"\n\n\nIn order to identify 1st price auctions, 
> leverage the proprietary tools available or manually pull a log file report 
> to understand the trends and gauge auction spread overtime to assess the 
> impact of variable auction dynamics.\n\n\n\n\n\n\n",
> "parsedupdatedby_s":"sitecorecarvaini",
> "sectionbody_t_en":"\n\n\nIn order to identify 1st price auctions, 
> leverage the proprietary tools available or manually pull a log file report 
> to understand the trends and gauge auction spread overtime to assess the 
> impact of variable auction dynamics.\n\n\n\n\n\n\n",
> "hide_section_b":false
>  
>  
> As you can see, it has applied the stemming correctly and brings back results
> for other words based on the root, in this case “identify”.
>  
> However, if I search for “Identification”, this is the output:
>  
> Analyzer (image):
> 
> Even with proper stemming, Solr only brings back results for the word
> identification (or identifications) but nothing else.
>
> The queries run over the same field, which has the Porter Stemming Filter
> applied at both query and index time. This behavior is consistent for other
> nouns ending in ‘ion’: representation, modification, etc.
>
> Solr version: 8.1. Does anyone know why this is happening? Is it a bug?
>  
> Thanks.
>  
>  
>  
> 
>   Jhonny Lopez
>   Technical Architect 
>   Avenida Calle 26 No. 92 - 32, Edificio BTS3
>   APDO. 128-1255 Bogota
>   T: +573006805461
>   jhonny.lo...@publicismedia.com
>   www.prodigious.com
>  
>  
> 
> 
> 
> 
>  