I could not agree more. In the product provided by www.indexengines.com, we stopped using stopwords when we noticed that first names, which Named Entity Recognition would flag as names, were also classified as stopwords in some languages. Case in point: the key developers, Ben and Dan (speaking).

On 11/8/21, 10:58 AM, "Markus Jelsma" <markus.jel...@openindex.io> wrote:

Hello Güven,

You should consider not using stopwords at all. The filter is useless or problematic in almost all cases. If you want to avoid trouble, drop the filter, because:

* Due to modern compression rates, the memory/disk space the filter clears up is negligible.
* The scoring, tf*idf, gives low scores for high frequency terms.
* At some point, a product's name or specification/type/brand will contain one or more stopwords. This is inevitable!

Regards,
Markus

On Mon, 8 Nov 2021 at 16:31, H. Güven Candoğan <guv...@gmail.com> wrote:

> Hi all,
>
> We are experimenting with the sample techproducts schema
> <https://github.com/apache/solr/blob/1fffc52103e77563a30fd307df1eb0b7a79a3377/solr/server/solr/configsets/sample_techproducts_configs/conf/managed-schema#L459>
> from the Apache Solr master repo.
>
> We realized that having the stemming (PorterStemFilterFactory) filter after
> the stopword filter (StopFilterFactory) seems to create issues.
>
> For example, we added “what” to the stopword list and we noticed that for
> the input “what’s in the box”, we end up with “what box” after stemming.
> However, we would want to have only the word “box” at the end of this
> process. This desired result “box” can only be achieved when the stopwords
> filter is placed after the stemming. Additionally, having the stopwords
> filter after lowercasing and stemming seems to create better stopfilter
> performance. In the end, we ended up with the following order in our
> configuration:
>
> 1. LowerCaseFilterFactory
> 2. PorterStemFilterFactory
> 3. StopFilterFactory
>
> Since we are new to Apache Solr and we are using what seems to be a
> “default” configuration, we fear that we might be missing some important
> context here. Is there a justification for the default ordering, which I
> assume most people will use as-is, and that we might be missing? Do you see
> any issues placing the stopwords filter after stemming? Do you see any
> issues placing the lowercasing before stopwords filter and stemming?
>
> Regards,
> Guven
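
To make the ordering effect in the quoted question concrete, here is a minimal Python sketch of the two filter orders. It is not Solr or Lucene code. The stopword set, the tokenizer regex, and every function name are assumptions made up for illustration, and `normalize` is only a toy stand-in for whichever step in the real chain reduces “what’s” to “what”.

```python
import re

# Assumed stopword list for illustration (the thread only says "what" was added).
STOPWORDS = {"what", "in", "the"}

def tokenize(text):
    # Toy tokenizer: keeps a word and an apostrophe suffix together ("what’s").
    return re.findall(r"[A-Za-z]+(?:['’][A-Za-z]+)?", text)

def lowercase(tokens):
    return [t.lower() for t in tokens]

def remove_stopwords(tokens):
    # Exact-match removal, so "what’s" does NOT match the stopword "what".
    return [t for t in tokens if t not in STOPWORDS]

def normalize(tokens):
    # Toy stand-in for the stemming step: only strips a trailing 's.
    return [re.sub(r"['’]s$", "", t) for t in tokens]

def run(text, stages):
    tokens = tokenize(text)
    for stage in stages:
        tokens = stage(tokens)
    return tokens

query = "what’s in the box"

# Stop filter before normalization: "what’s" survives, then becomes "what".
stop_first = run(query, [lowercase, remove_stopwords, normalize])

# Stop filter after normalization: "what" is produced first, then removed.
stop_last = run(query, [lowercase, normalize, remove_stopwords])

print(stop_first)  # ['what', 'box']
print(stop_last)   # ['box']
```

One consequence of running the stop filter last is that it compares against already-normalized tokens, so the stopword list would need to contain the stemmed forms of its entries.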