Feel free to read through 
https://cwiki.apache.org/confluence/display/SOLR/HowToContribute and when you 
create a PR, please tag me and I’ll review ;-)



> On Nov 10, 2021, at 5:03 AM, H. Güven Candoğan <guv...@gmail.com> wrote:
> 
> Thanks all for the answers, appreciate it!
> 
> I am happy to contribute. Feel free to assign the ticket to me.
> 
> Best
> Guven
> 
> On Tue, Nov 9, 2021 at 12:31 PM Eric Pugh <ep...@opensourceconnections.com>
> wrote:
> 
>> https://issues.apache.org/jira/browse/SOLR-15779
>> 
>> Feel free to weigh in!
>> 
>>> On Nov 8, 2021, at 12:30 PM, Davis, Daniel (NIH/NLM) [C]
>>> <daniel.da...@nih.gov.INVALID> wrote:
>>> 
>>> I could not agree more. On the product provided by www.indexengines.com,
>>> we stopped using stopwords when we noticed that first names, which Named
>>> Entity Recognition would flag as names, were also categorized as
>>> stopwords in some language. Namely, those of the key developers Ben and
>>> Dan (speaking).
>>> 
>>> On 11/8/21, 10:58 AM, "Markus Jelsma" <markus.jel...@openindex.io> wrote:
>>> 
>>>   Hello Güven,
>>> 
>>>   You should consider not using stopwords at all. The filter is useless
>>>   or problematic in almost all cases. If you want to avoid trouble, drop
>>>   the filter, because:
>>> 
>>>   * Due to modern compression rates, the memory/disk space the filter
>>>   frees up is negligible.
>>>   * The scoring, tf*idf, gives low scores for high-frequency terms.
>>>   * At some point, a product's name or specification/type/brand will
>>>   contain one or more stopwords. This is inevitable!
>>> 
>>>   Regards,
>>>   Markus
>>> 
>>>   On Mon, Nov 8, 2021 at 16:31, H. Güven Candoğan <guv...@gmail.com> wrote:
>>> 
>>>> Hi all,
>>>> 
>>>> We are experimenting with the sample techproducts schema
>>>> <https://github.com/apache/solr/blob/1fffc52103e77563a30fd307df1eb0b7a79a3377/solr/server/solr/configsets/sample_techproducts_configs/conf/managed-schema#L459>
>>>> from the Apache Solr master repo.
>>>> 
>>>> We realized that having the stemming filter (PorterStemFilterFactory)
>>>> after the stopword filter (StopFilterFactory) seems to create issues.
>>>> 
>>>> For example, we added “what” to the stopword list and noticed that for
>>>> the input “what’s in the box”, we end up with “what box” after stemming.
>>>> However, we want only the word “box” to remain at the end of this
>>>> process. This desired result can only be achieved when the stopwords
>>>> filter is placed after stemming. Additionally, placing the stopwords
>>>> filter after lowercasing and stemming seems to make the stop filter
>>>> more effective. In the end, we settled on the following order in our
>>>> configuration:
>>>> 
>>>> 
>>>>  1. LowerCaseFilterFactory
>>>>  2. PorterStemFilterFactory
>>>>  3. StopFilterFactory
>>>> 
>>>> 
>>>> Since we are new to Apache Solr and are using what seems to be a
>>>> “default” configuration, we fear that we might be missing some important
>>>> context here. Is there a justification for the default ordering, which I
>>>> assume most people will use as-is, that we might be missing? Do you see
>>>> any issues with placing the stopwords filter after stemming? Do you see
>>>> any issues with placing lowercasing before the stopwords filter and
>>>> stemming?
>>>> 
>>>> Regards,
>>>> Guven
>>>> 
>>> 
>> 
>> _______________________
>> Eric Pugh | Founder & CEO | OpenSource Connections, LLC | 434.466.1467 |
>> http://www.opensourceconnections.com | My Free/Busy
>> <http://tinyurl.com/eric-cal>
>> Co-Author: Apache Solr Enterprise Search Server, 3rd Ed
>> <https://www.packtpub.com/big-data-and-business-intelligence/apache-solr-enterprise-search-server-third-edition-raw>
>> 
>> This e-mail and all contents, including attachments, is considered to be
>> Company Confidential unless explicitly stated otherwise, regardless of
>> whether attachments are marked as such.
>> 
>> 
> 
> -- 
> H. Güven Candoğan

_______________________
Eric Pugh | Founder & CEO | OpenSource Connections, LLC | 434.466.1467 | 
http://www.opensourceconnections.com <http://www.opensourceconnections.com/> | 
My Free/Busy <http://tinyurl.com/eric-cal>  
Co-Author: Apache Solr Enterprise Search Server, 3rd Ed 
<https://www.packtpub.com/big-data-and-business-intelligence/apache-solr-enterprise-search-server-third-edition-raw>
    
This e-mail and all contents, including attachments, is considered to be 
Company Confidential unless explicitly stated otherwise, regardless of whether 
attachments are marked as such.
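
For anyone following the thread, the ordering effect Güven describes in the quoted message can be reproduced with a toy sketch. This is not Solr or Lucene code: the filters below are plain Python stand-ins, the stopword set is an assumed list containing the added "what", and toy_stem only strips a trailing "'s" as a placeholder for whatever step in the real chain reduces "what's" to "what".

```python
from typing import Callable, List

# A filter takes a token list and returns a new token list.
Filter = Callable[[List[str]], List[str]]

# Assumed stopword list for illustration; "what" was added per the thread.
STOPWORDS = {"what", "in", "the"}


def tokenize(text: str) -> List[str]:
    # Normalize the curly apostrophe, then split on whitespace.
    return text.replace("\u2019", "'").split()


def lowercase(tokens: List[str]) -> List[str]:
    return [t.lower() for t in tokens]


def toy_stem(tokens: List[str]) -> List[str]:
    # Hypothetical stand-in for stemming: only strips a trailing "'s".
    return [t[:-2] if t.endswith("'s") else t for t in tokens]


def stop(tokens: List[str]) -> List[str]:
    # Exact-match removal, so "what's" is NOT removed by the entry "what".
    return [t for t in tokens if t not in STOPWORDS]


def analyze(text: str, filters: List[Filter]) -> List[str]:
    tokens = tokenize(text)
    for f in filters:
        tokens = f(tokens)
    return tokens


text = "what\u2019s in the box"

# Stop filter first: "what's" slips past the stop filter, then becomes "what".
stop_first = analyze(text, [stop, lowercase, toy_stem])

# Stop filter last: "what's" is reduced to "what" before the stop filter runs.
stop_last = analyze(text, [lowercase, toy_stem, stop])

print(stop_first)
print(stop_last)
```

The only point of the sketch is that an exact-match stop filter sees different tokens depending on where it sits in the chain. It says nothing about whether stopwords should be used at all, which is the separate question Markus and Daniel raise.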
