I wonder if it would make sense to open a JIRA to remove stopwords from tech products config?
While tech products isn’t meant to be a “This is what your index should look like”, I bet a LOT of people use it that way, so maybe we remove stop words? I never use them either!

> On Nov 8, 2021, at 12:30 PM, Davis, Daniel (NIH/NLM) [C]
> <daniel.da...@nih.gov.INVALID> wrote:
>
> I cannot agree more. On the product provided by www.indexengines.com, we
> stopped using stopwords when we noted that first names that would be flagged
> as such by Named Entity Recognition would also be categorized as stopwords in
> some language. Namely - the key developers Ben and Dan (speaking).
>
> On 11/8/21, 10:58 AM, "Markus Jelsma" <markus.jel...@openindex.io> wrote:
>
> Hello Güven,
>
> You should consider not using stopwords at all. The filter is useless or
> problematic in almost all cases. If you want to avoid trouble, drop the
> filter, because:
>
> * Due to modern compression rates, the memory/disk space the filter clears
>   up is negligible.
> * The scoring, tf*idf, gives low scores for high frequency terms.
> * At some point, a product's name or specification/type/brand will contain
>   one or more stopwords. This is inevitable!
>
> Regards,
> Markus
>
> On Mon, Nov 8, 2021 at 16:31, H. Güven Candoğan <guv...@gmail.com> wrote:
>
>> Hi all,
>>
>> We are experimenting with the sample techproducts schema
>> <https://github.com/apache/solr/blob/1fffc52103e77563a30fd307df1eb0b7a79a3377/solr/server/solr/configsets/sample_techproducts_configs/conf/managed-schema#L459>
>> from the Apache Solr master repo.
>>
>> We realized that having the stemming(PorterStemFilterFactory) filter after
>> the stopword filter(StopFilterFactory) seems to create issues.
>>
>> For example, we added “what” to the stopword list and we noticed that for
>> the input “what’s in the box”, we end up with “what box” after stemming.
>> However, we would want to have only the word “box” at the end of this
>> process.
>> This desired result “box” can only be achieved when the stopwords
>> filter is placed after the stemming. Additionally, having the stopwords
>> filter after lowercasing and stemming seems to create better stopfilter
>> performance. At the end, we ended up with the following order in our
>> configuration:
>>
>> 1. LowerCaseFilterFactory
>> 2. PorterStemFilterFactory
>> 3. StopFilterFactory
>>
>> Since we are new to the Apache Solr and we are using what it seems a
>> “default” configuration, we fear that we might be missing some important
>> context here. Is there a justification for the default ordering, which I
>> assume most people will use as-is, and that we might be missing? Do you see
>> any issues placing the stopwords filter after stemming? Do you see any
>> issues placing the lowercasing before stopwords filter and stemming?
>>
>> Regards,
>> Guven

_______________________
Eric Pugh | Founder & CEO | OpenSource Connections, LLC | 434.466.1467 | http://www.opensourceconnections.com | My Free/Busy <http://tinyurl.com/eric-cal>
Co-Author: Apache Solr Enterprise Search Server, 3rd Ed <https://www.packtpub.com/big-data-and-business-intelligence/apache-solr-enterprise-search-server-third-edition-raw>
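
For anyone who wants to see the ordering effect from the quoted message without spinning up Solr, here is a minimal stand-alone Python sketch. It is not Solr or Lucene code: `tokenize`, `lowercase`, `toy_stem`, `stop` and `analyze` are hypothetical helpers, the stopword set is made up for the example (the thread only says “what” was added), and `toy_stem` is a toy stand-in for whichever step reduces “what's” to “what”, not the Porter algorithm.

```python
import re

# Hypothetical stopword list; the thread only states that "what" was added.
STOPWORDS = {"what", "in", "the"}


def tokenize(text):
    # Assumption: apostrophes inside a word are kept, so "what's" is one token.
    return re.findall(r"[A-Za-z0-9]+(?:['’][A-Za-z0-9]+)*", text)


def lowercase(tokens):
    return [t.lower() for t in tokens]


def toy_stem(tokens):
    # Toy stand-in for a stemming step (NOT Porter): strips a trailing 's.
    return [re.sub(r"['’]s$", "", t) for t in tokens]


def stop(tokens):
    # Drops tokens that exactly match an entry in the stopword list.
    return [t for t in tokens if t not in STOPWORDS]


def analyze(text, filters):
    # Runs the token stream through each filter in the given order.
    tokens = tokenize(text)
    for token_filter in filters:
        tokens = token_filter(tokens)
    return tokens


# Stop filter first: "what's" is not in the list yet, so it survives
# and only becomes "what" afterwards.
stop_first = analyze("what's in the box", [stop, lowercase, toy_stem])

# Lowercase, stem, then stop: "what" is produced before the stop filter runs.
stop_last = analyze("what's in the box", [lowercase, toy_stem, stop])

print(stop_first)  # ['what', 'box']
print(stop_last)   # ['box']
```

The point is only that a stop filter matches whatever token form reaches it, so any filter that rewrites tokens after it can reintroduce a stopword. Dropping the stop filter altogether, as suggested earlier in the thread, sidesteps the ordering question.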