Feel free to read through https://cwiki.apache.org/confluence/display/SOLR/HowToContribute and when you create a PR, please tag me and I’ll review ;-)
> On Nov 10, 2021, at 5:03 AM, H. Güven Candoğan <guv...@gmail.com> wrote:
>
> Thanks all for the answers, appreciate it!
>
> I am happy to contribute. Feel free to assign the ticket to me.
>
> Best
> Guven
>
> On Tue, Nov 9, 2021 at 12:31 PM Eric Pugh <ep...@opensourceconnections.com>
> wrote:
>
>> https://issues.apache.org/jira/browse/SOLR-15779
>>
>> Feel free to weigh in!
>>
>>> On Nov 8, 2021, at 12:30 PM, Davis, Daniel (NIH/NLM) [C]
>>> <daniel.da...@nih.gov.INVALID> wrote:
>>>
>>> I cannot agree more. On the product provided by www.indexengines.com,
>>> we stopped using stopwords when we noted that first names that would be
>>> flagged as such by Named Entity Recognition would also be categorized as
>>> stopwords in some language. Namely - the key developers Ben and Dan
>>> (speaking).
>>>
>>> On 11/8/21, 10:58 AM, "Markus Jelsma" <markus.jel...@openindex.io>
>>> wrote:
>>>
>>> Hello Güven,
>>>
>>> You should consider not using stopwords at all. The filter is useless or
>>> problematic in almost all cases. If you want to avoid trouble, drop the
>>> filter, because:
>>>
>>> * Due to modern compression rates, the memory/disk space the filter
>>>   clears up is negligible.
>>> * The scoring, tf*idf, gives low scores for high frequency terms.
>>> * At some point, a product's name or specification/type/brand will
>>>   contain one or more stopwords. This is inevitable!
>>>
>>> Regards,
>>> Markus
>>>
>>> On Mon, Nov 8, 2021 at 16:31, H. Güven Candoğan <guv...@gmail.com>
>>> wrote:
>>>
>>>> Hi all,
>>>>
>>>> We are experimenting with the sample techproducts schema
>>>> <https://github.com/apache/solr/blob/1fffc52103e77563a30fd307df1eb0b7a79a3377/solr/server/solr/configsets/sample_techproducts_configs/conf/managed-schema#L459>
>>>> from the Apache Solr master repo.
>>>>
>>>> We realized that having the stemming filter (PorterStemFilterFactory)
>>>> after the stopword filter (StopFilterFactory) seems to create issues.
>>>>
>>>> For example, we added “what” to the stopword list and we noticed that
>>>> for the input “what’s in the box”, we end up with “what box” after
>>>> stemming. However, we would want to have only the word “box” at the end
>>>> of this process. This desired result “box” can only be achieved when the
>>>> stopwords filter is placed after the stemming. Additionally, having the
>>>> stopwords filter after lowercasing and stemming seems to make the stop
>>>> filter perform better. In the end, we settled on the following order in
>>>> our configuration:
>>>>
>>>> 1. LowerCaseFilterFactory
>>>> 2. PorterStemFilterFactory
>>>> 3. StopFilterFactory
>>>>
>>>> Since we are new to Apache Solr and we are using what seems to be a
>>>> “default” configuration, we fear that we might be missing some important
>>>> context here. Is there a justification for the default ordering, which I
>>>> assume most people will use as-is, and that we might be missing? Do you
>>>> see any issues placing the stopwords filter after stemming? Do you see
>>>> any issues placing the lowercasing before the stopwords filter and
>>>> stemming?
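To make the ordering question above concrete, here is a minimal Python sketch of a filter chain. It is only an illustration under loud assumptions: `lowercase`, `toy_stem`, `stop`, and `analyze` are hypothetical stand-ins and not Solr's LowerCaseFilterFactory, PorterStemFilterFactory, or StopFilterFactory; the tokenizer is a plain whitespace split; `toy_stem` only strips a trailing 's; and the stopword list is assumed to contain "what", "in", and "the".

```python
# Toy stand-ins for an analysis chain; these are NOT Solr's real filters.
STOPWORDS = {"what", "in", "the"}  # assumed stopword list for this example


def lowercase(tokens):
    """Lowercase every token."""
    return [t.lower() for t in tokens]


def toy_stem(tokens):
    """Hypothetical stemmer: only strips a trailing 's (ASCII or curly apostrophe)."""
    stemmed = []
    for t in tokens:
        for suffix in ("'s", "\u2019s"):
            if t.endswith(suffix):
                t = t[: -len(suffix)]
                break
        stemmed.append(t)
    return stemmed


def stop(tokens):
    """Drop tokens that exactly match an entry in STOPWORDS."""
    return [t for t in tokens if t not in STOPWORDS]


def analyze(text, filters):
    """Split on whitespace, then apply each filter in the given order."""
    tokens = text.split()
    for f in filters:
        tokens = f(tokens)
    return tokens


text = "What's in the box"

# Stop filter first: "What's" does not match "what", so it survives.
stop_first = analyze(text, [stop, lowercase, toy_stem])

# Stop filter last: the token has already been reduced to "what" and is removed.
stop_last = analyze(text, [lowercase, toy_stem, stop])

print(stop_first)  # ['what', 'box']
print(stop_last)   # ['box']
```

In this toy model, the stop filter can only remove a token whose form at that point in the chain exactly matches a stopword entry, which is why moving it after the other two filters changes the result.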
>>>>
>>>> Regards,
>>>> Guven
>>>>
>>>
>>
>> _______________________
>> Eric Pugh | Founder & CEO | OpenSource Connections, LLC | 434.466.1467 |
>> http://www.opensourceconnections.com | My Free/Busy
>> <http://tinyurl.com/eric-cal>
>> Co-Author: Apache Solr Enterprise Search Server, 3rd Ed
>> <https://www.packtpub.com/big-data-and-business-intelligence/apache-solr-enterprise-search-server-third-edition-raw>
>>
>> This e-mail and all contents, including attachments, is considered to be
>> Company Confidential unless explicitly stated otherwise, regardless of
>> whether attachments are marked as such.
>>
>
> --
> H. Güven Candoğan

_______________________
Eric Pugh | Founder & CEO | OpenSource Connections, LLC | 434.466.1467 |
http://www.opensourceconnections.com | My Free/Busy <http://tinyurl.com/eric-cal>