Do not remove stopwords. Stopword removal was a hack invented for 16-bit machines and multi-megabyte disks. That hack is not needed now.
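For the text_en type quoted below, one way to follow this advice is to delete the two solr.StopFilterFactory lines and leave the rest of the chain as posted. This is only a sketch of that change and has not been tested against the original index. With no stop filter after the synonym filter, it may also avoid the stopword/synonym interaction described in the question.

```xml
<!-- Sketch: same analyzer chains as posted, minus StopFilterFactory. Untested assumption. -->
<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100" docValues="false" multiValued="false">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory" />
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.ASCIIFoldingFilterFactory" preserveOriginal="true" />
    <filter class="solr.PorterStemFilterFactory" />
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory" />
    <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms_en.txt" ignoreCase="true" expand="true" />
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.PorterStemFilterFactory" />
  </analyzer>
</fieldType>
```

Because the index-time analyzer changes, existing documents would need to be reindexed before the title field behaves consistently.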
tf.idf addresses the same problem as stopwords with a much better algorithm. Removing stopwords is an on/off decision for a guess at common words. tf.idf is a proportional weighting of common words based on the statistics of your documents.

Do not remove stopwords.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Mar 20, 2020, at 7:52 AM, Vikas Kumar <hers...@gmail.com> wrote:
>
> I have a field title in my solr schema:
>
> <field name="title" type="text_en" termVectors="true" indexed="true" required="true" stored="true" />
>
> text_en is defined as follows:
>
> <fieldType name="text_en" class="solr.TextField" positionIncrementGap="100" docValues="false" multiValued="false">
>   <analyzer type="index">
>     <tokenizer class="solr.StandardTokenizerFactory" />
>     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_en.txt" />
>     <filter class="solr.LowerCaseFilterFactory" />
>     <filter class="solr.ASCIIFoldingFilterFactory" preserveOriginal="true" />
>     <filter class="solr.PorterStemFilterFactory"/>
>   </analyzer>
>   <analyzer type="query">
>     <tokenizer class="solr.StandardTokenizerFactory" />
>     <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms_en.txt" ignoreCase="true" expand="true" />
>     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_en.txt" />
>     <filter class="solr.LowerCaseFilterFactory" />
>     <filter class="solr.PorterStemFilterFactory" />
>   </analyzer>
> </fieldType>
>
> I'm encountering strange behaviour when using multi-word synonyms which
> contain stopwords.
>
> If the stopwords appear in the middle, it works fine. For example, if I
> have the following in my synonyms file (where i is a stopword):
>
> iphone, apple i phone
>
> And if I query: /select?q=iphone&qf=title&defType=edismax
>
> The parsed query is: +DisjunctionMaxQuery(((((+title:appl +title:phone) title:iphon))))
>
> Same for query: /select?q=apple i phone&qf=title&defType=edismax
>
> But if stopwords appear at the start or end, then behaviour is
> unpredictable.
>
> In most of the cases, the entire synonym is dropped. For example, if I
> change my synonyms file to:
>
> iphone, i phone
>
> and do the same query again (with iphone), I get:
>
> +DisjunctionMaxQuery(((title:iphon)))
>
> I was expecting iphon and phone (as i would be dropped) in my dismax query.
>
> In some cases, behaviour is even more weird.
>
> For example, if my synonyms file is:
>
> between two ferns,netflix comedy,zach galifianakis show,netflix 2019 best
>
> and I have ferns and best as my stopwords. If I do the following query:
>
> /select?q=netflix comedy&qf=title&defType=edismax
>
> I get this:
>
> +DisjunctionMaxQuery((((+title:between +title:two +title:galifianaki +title:show) (+title:netflix +title:2019 +title:comedi))))
>
> which is kind of a very weird combinations.
>
> I'm not able to understand this behaviour and have not found anything
> related to this in documentation or internet. Maybe I'm missing something.
> Any help/pointers is highly appreciated.
>
> Solr version: 8.4.1