Re: Changing the order of stemming and stopwords cleaning in techproducts config

Eric Pugh Tue, 09 Nov 2021 04:31:04 -0800

https://issues.apache.org/jira/browse/SOLR-15779


Feel free to weigh in!

> On Nov 8, 2021, at 12:30 PM, Davis, Daniel (NIH/NLM) [C] 
> <daniel.da...@nih.gov.INVALID> wrote:
> 
> I cannot agree more.  On the product provided by www.indexengines.com, we 
> stopped using stopwords when we noted that first names that would be flagged 
> as such by Named Entity Recognition would also be categorized as stopwords in 
> some language.  Namely, the key developers Ben and Dan (speaking).
> 
> On 11/8/21, 10:58 AM, "Markus Jelsma" <markus.jel...@openindex.io> wrote:
> 
>    Hello Güven,
> 
>    You should consider not using stopwords at all. The filter is useless or
>    problematic in almost all cases. If you want to avoid trouble, drop the
>    filter, because:
> 
>    * Due to modern compression rates, the memory/disk space the filter clears
>    up is negligible.
>    * The scoring, tf*idf, gives low scores for high-frequency terms.
>    * At some point, a product's name or specification/type/brand will contain
>    one or more stopwords. This is inevitable!
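The tf*idf point in the list above can be illustrated with a toy calculation. This is only a sketch: it assumes the textbook formula idf = ln(N / df) and a made-up four-document corpus (`DOCS` and `idf` are hypothetical names). Solr's similarity classes use smoothed variants, so only the trend carries over, not the numbers.

```python
import math

# Hypothetical toy corpus; each string stands for one indexed document.
DOCS = [
    "the box is in the hall",
    "the cable for the monitor",
    "the hard drive",
    "what is in the box",
]


def idf(term, docs):
    """Textbook inverse document frequency: ln(N / df)."""
    df = sum(1 for doc in docs if term in doc.split())
    if df == 0:
        raise ValueError(f"term {term!r} does not occur in the corpus")
    return math.log(len(docs) / df)


# "the" occurs in every document, "box" in only two of them.
idf_the = idf("the", DOCS)
idf_box = idf("box", DOCS)
print(f"idf(the) = {idf_the:.3f}, idf(box) = {idf_box:.3f}")
```

A term that appears in every document contributes next to nothing to the score, which is the effect the stop filter is usually meant to achieve.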
> 
>    Regards,
>    Markus
> 
>    On Mon, Nov 8, 2021 at 16:31, H. Güven Candoğan <guv...@gmail.com> wrote:
> 
>> Hi all,
>> 
>> We are experimenting with the sample techproducts schema
>> <https://github.com/apache/solr/blob/1fffc52103e77563a30fd307df1eb0b7a79a3377/solr/server/solr/configsets/sample_techproducts_configs/conf/managed-schema#L459>
>> from the Apache Solr master repo.
>> 
>> We realized that having the stemming filter (PorterStemFilterFactory) after
>> the stopword filter (StopFilterFactory) seems to create issues.
>> 
>> For example, we added “what” to the stopword list and noticed that for
>> the input “what’s in the box”, we end up with “what box” after stemming.
>> However, we want only the word “box” to remain at the end of this
>> process. This desired result can only be achieved when the stopwords
>> filter is placed after the stemming. Additionally, placing the stopwords
>> filter after lowercasing and stemming seems to make the stop filter
>> work better. In the end, we settled on the following order in our
>> configuration:
>> 
>> 
>>   1. LowerCaseFilterFactory
>>   2. PorterStemFilterFactory
>>   3. StopFilterFactory
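The effect of the two orderings can be reproduced with a toy model. This is a minimal sketch, not Solr's analyzers: the tokenizer is a plain whitespace split, `toy_stem` is a hypothetical stand-in that only strips a trailing 's (the real Porter stemmer does much more, and which filter strips the possessive depends on the configured chain), and the stopword set is made up for the example.

```python
# Hypothetical stopword list for the example.
STOPWORDS = {"what", "in", "the"}


def lowercase(tokens):
    return [t.lower() for t in tokens]


def toy_stem(tokens):
    # Stand-in for the stemming step: only strips a trailing 's.
    return [t[:-2] if t.endswith("'s") else t for t in tokens]


def stop(tokens):
    return [t for t in tokens if t not in STOPWORDS]


def analyze(text, filters):
    """Split on whitespace, then apply each filter in the given order."""
    tokens = text.split()
    for token_filter in filters:
        tokens = token_filter(tokens)
    return tokens


# Stop filter first: "what's" is not in the stopword list yet, so it survives
# and only becomes "what" after the stop filter has already run.
stop_first = analyze("what's in the box", [stop, lowercase, toy_stem])

# Stop filter last: "what's" is reduced to "what" first and is then removed.
stop_last = analyze("what's in the box", [lowercase, toy_stem, stop])

print(stop_first)  # ['what', 'box']
print(stop_last)   # ['box']
```

The same reasoning applies to any filter that rewrites tokens: whatever runs after the stop filter can produce a stopword that is never removed.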
>> 
>> 
>> Since we are new to Apache Solr and are using what seems to be a
>> “default” configuration, we fear that we might be missing some important
>> context here. Is there a justification for the default ordering, which I
>> assume most people will use as-is, that we might be missing? Do you see
>> any issues with placing the stopwords filter after stemming? Do you see any
>> issues with placing the lowercasing before the stopwords filter and stemming?
>> 
>> Regards,
>> Guven
>> 
> 

_______________________
Eric Pugh | Founder & CEO | OpenSource Connections, LLC | 434.466.1467 | 
http://www.opensourceconnections.com <http://www.opensourceconnections.com/> | 
My Free/Busy <http://tinyurl.com/eric-cal>  
Co-Author: Apache Solr Enterprise Search Server, 3rd Ed 
<https://www.packtpub.com/big-data-and-business-intelligence/apache-solr-enterprise-search-server-third-edition-raw>
    
This e-mail and all contents, including attachments, is considered to be 
Company Confidential unless explicitly stated otherwise, regardless of whether 
attachments are marked as such.
