Re: Changing the order of stemming and stopwords cleaning in techproducts config

Walter Underwood Mon, 08 Nov 2021 09:50:14 -0800

Please, please, please remove the stopwords filter from ALL the sample configs. 
I’m sure this has been suggested before. Try searching for “vitamin a” with 
stopwords removed.
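
A minimal sketch of the "vitamin a" problem, assuming a hand-rolled filter and a made-up stopword list (this is not Solr's StopFilterFactory, and the all-stopword title is chosen for illustration, not taken from any list in this thread):

```python
# Toy illustration only: a hand-rolled stopword filter, not Solr's analysis chain.
# STOPWORDS is an assumed sample list, not the one shipped with any config.
STOPWORDS = frozenset({"a", "an", "the", "of", "to", "be", "or", "not"})


def analyze(text, stopwords=frozenset()):
    """Lowercase, split on whitespace, and drop every token found in `stopwords`."""
    return [tok for tok in text.lower().split() if tok not in stopwords]


# With the filter, the query loses the token that made it specific.
with_stop = analyze("vitamin a", STOPWORDS)

# Without the filter, both tokens survive and can be matched.
without_stop = analyze("vitamin a")

# A title made up entirely of stopwords leaves nothing to index or search.
all_stop_title = analyze("To Be or Not to Be", STOPWORDS)

print(with_stop, without_stop, all_stop_title)
```

With the filter on, "vitamin a" is reduced to the single token "vitamin", and the all-stopword title is reduced to an empty token list.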


I pulled the stopword filter from the Solr 1.3 config at Netflix when I 
discovered it was completely removing several movie titles, making them 
unsearchable. Here is that list, from 2007.

https://observer.wunderwood.org/2007/05/31/do-all-stopword-queries-matter/

Infoseek retained stopwords 25 years ago. And yes, we had better relevance than 
Google. The stopword filter should not have been in the default config in 
version 1.2.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Nov 8, 2021, at 9:33 AM, Eric Pugh <ep...@opensourceconnections.com> wrote:
> 
> I wonder if it would make sense to open a JIRA to remove stopwords from tech 
> products config?   
> 
> While tech products isn’t meant to be a “This is what your index should look 
> like”, I bet a LOT of people use it that way so maybe we remove stop words?
> 
> I never use them either!
> 
>> On Nov 8, 2021, at 12:30 PM, Davis, Daniel (NIH/NLM) [C] 
>> <daniel.da...@nih.gov.INVALID> wrote:
>> 
>> I cannot agree more.  On the product provided by www.indexengines.com, we 
>> stopped using stopwords when we noticed that first names, which Named Entity 
>> Recognition would flag as names, were also categorized as stopwords in some 
>> language.  Case in point: the key developers, Ben and Dan (speaking).
>> 
>> On 11/8/21, 10:58 AM, "Markus Jelsma" <markus.jel...@openindex.io> wrote:
>> 
>>   Hello Güven,
>> 
>>   You should consider not using stopwords at all. The filter is useless or
>>   problematic in almost all cases. If you want to avoid trouble, drop the
>>   filter, because:
>> 
>>   * Due to modern compression rates, the memory/disk space the filter clears
>>   up is negligible.
>>   * The scoring, tf*idf, gives low scores for high frequency terms.
>>   * At some point, a product's name or specification/type/brand will contain
>>   one or more stopwords. This is inevitable!
>> 
>>   Regards,
>>   Markus
>> 
>>   Op ma 8 nov. 2021 om 16:31 schreef H. Güven Candoğan <guv...@gmail.com>:
>> 
>>> Hi all,
>>> 
>>> We are experimenting with the sample techproducts schema
>>> <https://github.com/apache/solr/blob/1fffc52103e77563a30fd307df1eb0b7a79a3377/solr/server/solr/configsets/sample_techproducts_configs/conf/managed-schema#L459>
>>> from the Apache Solr master repo.
>>> 
>>> We realized that having the stemming filter (PorterStemFilterFactory) after
>>> the stopword filter (StopFilterFactory) seems to create issues.
>>> 
>>> For example, we added “what” to the stopword list and noticed that for
>>> the input “what’s in the box”, we end up with “what box” after stemming.
>>> However, we want only the word “box” at the end of this process. That
>>> result can only be achieved when the stopwords filter is placed after
>>> stemming. Additionally, placing the stopwords filter after lowercasing
>>> and stemming seems to make the stop filter work better. In the end, we
>>> settled on the following order in our configuration:
>>> 
>>> 
>>>  1. LowerCaseFilterFactory
>>>  2. PorterStemFilterFactory
>>>  3. StopFilterFactory
>>> 
>>> 
>>> Since we are new to Apache Solr and are using what seems to be a
>>> “default” configuration, we fear that we might be missing some important
>>> context here. Is there a justification for the default ordering, which I
>>> assume most people will use as-is, that we might be missing? Do you see
>>> any issues with placing the stopwords filter after stemming? Do you see
>>> any issues with placing lowercasing before the stopwords filter and stemming?
>>> 
>>> Regards,
>>> Guven
>>> 
>> 
> 
> _______________________
> Eric Pugh | Founder & CEO | OpenSource Connections, LLC | 434.466.1467 | 
> http://www.opensourceconnections.com <http://www.opensourceconnections.com/> 
> | My Free/Busy <http://tinyurl.com/eric-cal>  
> Co-Author: Apache Solr Enterprise Search Server, 3rd Ed 
> <https://www.packtpub.com/big-data-and-business-intelligence/apache-solr-enterprise-search-server-third-edition-raw>
>   
> This e-mail and all contents, including attachments, is considered to be 
> Company Confidential unless explicitly stated otherwise, regardless of 
> whether attachments are marked as such.
>
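
The ordering question in the original post can be sketched with a toy pipeline. Everything here is an assumption made for illustration: the stage functions are plain Python stand-ins rather than Solr filter factories, and `toy_stem` only strips a trailing possessive, standing in for whichever stage of the real chain reduces "what's" to "what".

```python
# Toy stand-ins for analysis stages; none of these are Solr or Lucene classes.
STOPWORDS = frozenset({"what", "in", "the"})  # assumed list, with "what" added


def lowercase(tokens):
    """Lowercase every token."""
    return [t.lower() for t in tokens]


def toy_stem(tokens):
    """Strip a trailing possessive ('s or ’s); a crude stand-in for stemming."""
    out = []
    for t in tokens:
        if t.endswith("'s") or t.endswith("\u2019s"):
            t = t[:-2]
        out.append(t)
    return out


def stop(tokens):
    """Drop tokens that appear verbatim in STOPWORDS."""
    return [t for t in tokens if t not in STOPWORDS]


def run(text, stages):
    """Split on whitespace, then apply each stage in the given order."""
    tokens = text.split()
    for stage in stages:
        tokens = stage(tokens)
    return tokens


# Stop filter before stemming: "what's" is not in the list, so it slips through
# and only becomes "what" after the stop filter has already run.
stop_first = run("What's in the box", [lowercase, stop, toy_stem])

# Stemming before the stop filter: "what's" becomes "what" in time to be dropped.
stem_first = run("What's in the box", [lowercase, toy_stem, stop])

print(stop_first, stem_first)
```

One trade-off of running the stop filter last is that the stopword list then has to match the stemmer's output rather than the raw words, so the list may need to be written in stemmed form.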
