Re: Changing the order of stemming and stopwords cleaning in techproducts config

Davis, Daniel (NIH/NLM) [C] Mon, 08 Nov 2021 09:31:26 -0800

I cannot agree more.  On the product provided by www.indexengines.com, we 
stopped using stopwords when we noticed that first names, which Named Entity 
Recognition would flag as names, were also categorized as stopwords in some 
language.  Namely, the key developers Ben and Dan (speaking).


On 11/8/21, 10:58 AM, "Markus Jelsma" <markus.jel...@openindex.io> wrote:

    Hello Güven,

    You should consider not using stopwords at all. The filter is useless or
    problematic in almost all cases. If you want to avoid trouble, drop the
    filter, because:

    * Thanks to modern compression, the memory/disk space the filter frees
    up is negligible.
    * The scoring, tf*idf, gives low scores for high frequency terms.
    * At some point, a product's name or specification/type/brand will contain
    one or more stopwords. This is inevitable!

    Regards,
    Markus

    On Mon, 8 Nov 2021 at 16:31, H. Güven Candoğan <guv...@gmail.com> wrote:

    > Hi all,
    >
    > We are experimenting with the sample techproducts schema
    > <https://github.com/apache/solr/blob/1fffc52103e77563a30fd307df1eb0b7a79a3377/solr/server/solr/configsets/sample_techproducts_configs/conf/managed-schema#L459>
    > from the Apache Solr master repo.
    >
    > We realized that having the stemming filter (PorterStemFilterFactory)
    > after the stopword filter (StopFilterFactory) seems to create issues.
    >
    > For example, we added “what” to the stopword list and we noticed that for
    > the input “what’s in the box”, we end up with “what box” after stemming.
    > However, we would want only the word “box” to remain at the end of this
    > process. This desired result “box” can only be achieved when the stopwords
    > filter is placed after the stemming. Additionally, placing the stopwords
    > filter after lowercasing and stemming seems to make the stop filter catch
    > more of the words it is meant to remove. In the end, we settled on the
    > following order in our configuration:
    >
    >
    >    1. LowerCaseFilterFactory
    >    2. PorterStemFilterFactory
    >    3. StopFilterFactory
    >
    >
    > Since we are new to Apache Solr and are using what seems to be a
    > “default” configuration, we fear that we might be missing some important
    > context here. Is there a justification for the default ordering, which I
    > assume most people will use as-is, that we might be missing? Do you see
    > any issues with placing the stopwords filter after stemming? Do you see
    > any issues with placing the lowercasing before the stopwords filter and
    > stemming?
    >
    > Regards,
    > Guven
    >
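
The ordering effect described in the question can be reproduced with a small
toy pipeline. This is only a sketch under stated assumptions, not Solr's actual
analysis chain: the whitespace tokenizer, the `toy_stem` function (which merely
strips a trailing possessive "'s" as a stand-in for stemming), and the
three-word stopword list are all hypothetical and exist only for illustration.

```python
# Hypothetical stopword list; "what" was added as described in the question.
STOPWORDS = {"what", "in", "the"}

def lowercase(tokens):
    """Lowercase every token."""
    return [t.lower() for t in tokens]

def toy_stem(tokens):
    """Stand-in for stemming: only strips a trailing possessive 's."""
    out = []
    for t in tokens:
        for suffix in ("'s", "\u2019s"):  # straight and curly apostrophe
            if t.endswith(suffix):
                t = t[: -len(suffix)]
                break
        out.append(t)
    return out

def stop(tokens):
    """Drop tokens that exactly match an entry in STOPWORDS."""
    return [t for t in tokens if t not in STOPWORDS]

def analyze(text, filters):
    """Split on whitespace, then apply the filters in the given order."""
    tokens = text.split()
    for f in filters:
        tokens = f(tokens)
    return tokens

# Stop filter first: "What's" does not match "what" yet, so it survives.
stop_first = analyze("What's in the box", [stop, lowercase, toy_stem])
# Stop filter last: "What's" has already become "what" and is removed.
stop_last = analyze("What's in the box", [lowercase, toy_stem, stop])

print(stop_first)  # ['what', 'box']
print(stop_last)   # ['box']
```

One caveat with running the stop filter last: the stopword list is then
compared against stemmed tokens, so its entries need to be in stemmed form too.
A real stemmer may rewrite short function words in unexpected ways, so it is
worth checking the list against the stemmer's actual output.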
