Stopwords are fairly common in movie titles. There are even titles made entirely of stopwords. The first one I noticed was "Being There". I posted more of them here:
http://wunderwood.org/most_casual_observer/2007/05/invisible_titles.html

wunder
==
Search Guy
Netflix

On 11/2/07 3:53 PM, "Sundling, Paul" <[EMAIL PROTECTED]> wrote:

> I noticed very unexpected results when using stop words with and without
> conditions using the default text type.
>
> A normal query with a stop word returns no results as expected:
>
> For example with 'an' being a stopword
>
> movieName:an (results: 0 since it's a stop word)
> movieName:another (results 237)
>
> rating:PG-13 (results: 76095)
>
>
> but if I put them together with AND, for normal non stop words like
> 'another' the result is less than or equal to the smaller results being
> ANDed. So adding another AND clause with a stop word query should have
> 0 results.
>
> rating:PG-13 AND movieName:another (results 46)
>
> rating:PG-13 AND movieName:an (results 76095 should be 0)
>
> Commenting out the stop word filter from the text type for query will
> correct this behavior, although I'm not sure that's a real solution. So
> instead of anding the stop word clause it seems to ignore it. Even if
> the actual problem is at the Lucene level, perhaps it would be worth
> considering changes to the default to get around it.
>
> Workaround:
>
> <fieldType name="text" class="solr.TextField"
>     positionIncrementGap="100">
>   <analyzer type="index">
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>     <!-- in this example, we will only use synonyms at query time
>     <filter class="solr.SynonymFilterFactory"
>         synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
>     -->
>     <filter class="solr.StopFilterFactory" ignoreCase="true"
>         words="stopwords.txt"/>
>     <filter class="solr.WordDelimiterFilterFactory"
>         generateWordParts="1" generateNumberParts="1" catenateWords="1"
>         catenateNumbers="1" catenateAll="0"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.EnglishPorterFilterFactory"
>         protected="protwords.txt"/>
>     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>   </analyzer>
>   <analyzer type="query">
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>     <filter class="solr.SynonymFilterFactory"
>         synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
>     <!-- comment out to prevent strange behavior <filter
>         class="solr.StopFilterFactory" ignoreCase="true"
>         words="stopwords.txt"/>-->
>     <filter class="solr.WordDelimiterFilterFactory"
>         generateWordParts="1" generateNumberParts="1" catenateWords="0"
>         catenateNumbers="0" catenateAll="0"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.EnglishPorterFilterFactory"
>         protected="protwords.txt"/>
>     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>   </analyzer>
> </fieldType>
>
> Paul Sundling
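
One way to picture the behavior Paul describes is a toy Python sketch. It is not Lucene or Solr code. It assumes that a clause whose text analyzes to zero tokens is dropped from the boolean query, rather than kept as a clause that matches nothing. The stopword list, the sample titles, and every function name here are made up for illustration.

```python
# Toy model of the reported behavior; this is NOT Lucene or Solr code.
STOPWORDS = {"a", "an", "and", "the", "of"}  # hypothetical stopword list


def analyze(text, remove_stopwords=True):
    """Lowercase, split on whitespace, and optionally drop stopwords."""
    tokens = text.lower().split()
    if remove_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    return tokens


def parse_and_query(clauses, analyzed_fields, remove_stopwords=True):
    """Build the required (field, tokens) clauses for an AND query.

    Assumption being illustrated: a clause on an analyzed field that
    analyzes to zero tokens is silently dropped.
    """
    required = []
    for field, text in clauses:
        if field in analyzed_fields:
            tokens = analyze(text, remove_stopwords)
            if not tokens:
                continue  # nothing left to match, so the clause vanishes
        else:
            tokens = [text]  # non-analyzed field: match the raw value
        required.append((field, tokens))
    return required


def search(docs, required):
    """Return the docs that satisfy every required clause."""
    if not required:
        return []  # a query with no clauses left matches nothing
    return [
        doc for doc in docs
        if all(all(tok in doc[field] for tok in tokens)
               for field, tokens in required)
    ]


# Made-up sample index; stopwords are removed at index time in both cases.
docs = [
    {"rating": ["PG-13"], "movieName": analyze("Another Day")},
    {"rating": ["PG-13"], "movieName": analyze("An Education")},
    {"rating": ["R"], "movieName": analyze("Another Year")},
]

query = [("rating", "PG-13"), ("movieName", "an")]

# Stop filter on at query time: the movieName clause disappears.
with_stop = search(docs, parse_and_query(query, {"movieName"}, True))
# Stop filter off at query time (the workaround): 'an' is required but
# was never indexed, so nothing matches.
without_stop = search(docs, parse_and_query(query, {"movieName"}, False))

print(len(with_stop))     # 2
print(len(without_stop))  # 0
```

Under that assumption, the stopword-only query on its own still returns nothing, the ANDed query falls back to the rating clause alone, and turning off the query-time stop filter gives the zero Paul expected, at the price of never matching stopwords that were stripped at index time.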
