Changing the order of stemming and stopwords cleaning in techproducts config

H. Güven Candoğan Mon, 08 Nov 2021 07:31:33 -0800

Hi all,

We are experimenting with the sample techproducts schema
<https://github.com/apache/solr/blob/1fffc52103e77563a30fd307df1eb0b7a79a3377/solr/server/solr/configsets/sample_techproducts_configs/conf/managed-schema#L459>
from the Apache Solr master repo.


We noticed that placing the stemming filter (PorterStemFilterFactory) after
the stopword filter (StopFilterFactory) seems to create issues.

For example, we added “what” to the stopword list and noticed that for
the input “what’s in the box”, we end up with “what box” after stemming.
However, we want only the word “box” to remain at the end of this
process. We could only get this result by placing the stopwords filter
after the stemming filter. Additionally, running the stopwords filter
after lowercasing and stemming seems to make the stop filter more
effective. In the end, we settled on the following order in our
configuration:


   1. LowerCaseFilterFactory
   2. PorterStemFilterFactory
   3. StopFilterFactory
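
For reference, the ordering above could be written as an analyzer chain
roughly like the sketch below. The fieldType name, the choice of tokenizer,
and the stopwords.txt file name are assumptions for illustration and are not
copied from the techproducts schema.

```xml
<!-- Sketch only: the fieldType name and tokenizer are assumptions -->
<fieldType name="text_en_stem_then_stop" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
    <!-- Runs last, so it sees lowercased, stemmed tokens -->
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
  </analyzer>
</fieldType>
```

With this order, the entries in stopwords.txt are compared against stemmed
tokens, so an entry only matches if it equals the stemmed form of the word.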


Since we are new to Apache Solr and are using what seems to be a
“default” configuration, we fear that we might be missing some important
context here. Is there a justification for the default ordering, which I
assume most people will use as-is, that we might be missing? Do you see
any issues with placing the stopwords filter after stemming? Do you see any
issues with placing lowercasing before the stopwords filter and stemming?

Regards,
Guven
