Hello Güven,

Consider not using stopwords at all. The filter is useless or problematic
in almost all cases, so the simplest way to avoid trouble is to drop it,
because:

* With modern index compression, the memory/disk space the filter saves is
negligible.
* The scoring, tf*idf, already gives low scores to high-frequency terms.
* At some point, a product's name or specification/type/brand will contain
one or more stopwords. This is inevitable!
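
To illustrate the scoring point, here is a minimal sketch of a smoothed idf
in the style of classic tf*idf. The formula, the function name and the
document counts are assumptions chosen for illustration; the similarity
actually configured in your Solr version may use a different formula.

```python
import math


def idf(doc_freq: int, doc_count: int) -> float:
    """Smoothed inverse document frequency (illustrative formula)."""
    return 1.0 + math.log((doc_count + 1) / (doc_freq + 1))


# Hypothetical index of one million documents.
doc_count = 1_000_000

# A stopword-like term that occurs in 90% of the documents.
common = idf(900_000, doc_count)

# A rare, specific term that occurs in only 50 documents.
rare = idf(50, doc_count)

print(f"idf(common term) = {common:.2f}")
print(f"idf(rare term)   = {rare:.2f}")
```

The common term ends up with a weight close to the minimum, so it adds
little to the score even when it stays in the index.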

Regards,
Markus

On Mon, 8 Nov 2021 at 16:31, H. Güven Candoğan <guv...@gmail.com> wrote:

> Hi all,
>
> We are experimenting with the sample techproducts schema
> <https://github.com/apache/solr/blob/1fffc52103e77563a30fd307df1eb0b7a79a3377/solr/server/solr/configsets/sample_techproducts_configs/conf/managed-schema#L459>
> from the Apache Solr master repo.
>
> We realized that having the stemming filter (PorterStemFilterFactory)
> after the stopword filter (StopFilterFactory) seems to create issues.
>
> For example, we added “what” to the stopword list and noticed that for
> the input “what’s in the box”, we end up with “what box” after stemming.
> However, we want only the word “box” at the end of this process. This
> desired result can only be achieved when the stopwords filter is placed
> after the stemming. Additionally, placing the stopwords filter after
> lowercasing and stemming seems to make the stop filter catch more
> stopwords. In the end, we settled on the following order in our
> configuration:
>
>
>    1. LowerCaseFilterFactory
>    2. PorterStemFilterFactory
>    3. StopFilterFactory
>
>
> Since we are new to Apache Solr and are using what seems to be a
> “default” configuration, we fear that we might be missing some important
> context here. Is there a justification for the default ordering, which I
> assume most people will use as-is, that we might be missing? Do you see
> any issues with placing the stopwords filter after stemming? Do you see
> any issues with placing the lowercasing before the stopwords filter and
> stemming?
>
> Regards,
> Guven
>
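
For reference, the ordering effect described in the quoted message can be
reproduced with a toy pipeline. This is only a sketch: the functions below
are hypothetical stand-ins, not the Solr filter factories, and `normalize`
merely strips a trailing 's to imitate whatever step in the real chain
reduces "what's" to "what".

```python
# Hypothetical stopword list matching the example in the quoted message.
STOPWORDS = {"what", "in", "the"}


def tokenize(text):
    # Whitespace split; a real tokenizer is more elaborate.
    return text.replace("’", "'").split()


def lowercase(tokens):
    return [t.lower() for t in tokens]


def normalize(tokens):
    # Toy stand-in for possessive stripping/stemming: drop a trailing 's.
    return [t[:-2] if t.endswith("'s") else t for t in tokens]


def stop(tokens):
    # Remove tokens that exactly match an entry in the stopword list.
    return [t for t in tokens if t not in STOPWORDS]


def run(text, filters):
    tokens = tokenize(text)
    for f in filters:
        tokens = f(tokens)
    return tokens


# Stop filter first: "what's" is not in the list yet, so it survives.
stop_first = run("what's in the box", [stop, lowercase, normalize])

# Stop filter last: "what's" has already become "what" and is removed.
stop_last = run("what's in the box", [lowercase, normalize, stop])

print(stop_first)  # ['what', 'box']
print(stop_last)   # ['box']
```

One caveat with running the stop filter last: the list is then matched
against stemmed tokens, and Porter stemming turns, for example, "this" into
"thi" and "was" into "wa", so the stopword list would have to contain the
stemmed forms to keep matching.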
