Hi everyone,

I'm migrating from SOLR 3.x to 4.x and I'm required to keep the results as close as possible to what they were before. So I've been running some tests and have found some differences.
My query is:

*title_search_pt:(geladeira/refrigerador)*

And the parsed query becomes:

*MultiPhraseQuery(title_search_pt:"(refriger geladeir) (refriger geladeir)")*

This is identical in both instances (3.x and 4.x), so that's not the problem.

My document is:

*balcão refrigerado e geladeira frigorifica*

Which, after analysis, becomes:

*balca refriger geladeir frigorif*

That is also identical in both versions, *except for the token positions*. Notice how 'e' disappears because it is a stopword.

In SOLR 3.x the positions are: 1, 2, *3*, 4
In SOLR 4.x the positions are: 1, 2, *4*, 5

Could that be the problem?

I posted a related question earlier: phrase queries on punctuation <http://stackoverflow.com/questions/15314460/solr-generates-phrase-queries-on-punctuation>. I believe that behavior, combined with the token position difference, is what causes the discrepancies.

I couldn't find any documentation or changelog entry about token positions with stopwords; hell, I can barely google SOLR 4-specific things.

Can this be solved? I wish I could fix the original problem from the Stack Overflow question (prevent phrase query generation on punctuation), but I could live with fixing at least the token positions (if things work as before, I will be able to upgrade to 4.x).
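To make the position difference concrete, here is a small Python sketch (plain Python, not Solr or Lucene code). It assumes the parsed MultiPhraseQuery behaves like an exact phrase with slop 0, where each of the two slots accepts either stem. The helper name `multi_phrase_match` and the dict layout are made up for illustration; only the tokens and positions come from my analysis output above.

```python
def multi_phrase_match(doc_positions, phrase_slots):
    """Return True if the slots match at consecutive positions (slop 0).

    doc_positions: dict mapping an analyzed term to the set of its positions.
    phrase_slots: list of sets; slot i holds the alternative terms allowed
                  at offset i from the start of the phrase.
    Assumption: this is a simplified model of a multi-phrase match,
    not the actual Lucene implementation.
    """
    # Candidate start positions: anywhere a term from the first slot occurs.
    starts = set()
    for term in phrase_slots[0]:
        starts |= doc_positions.get(term, set())

    for start in starts:
        # Every slot must have at least one alternative at start + offset.
        if all(
            any(start + offset in doc_positions.get(term, set()) for term in slot)
            for offset, slot in enumerate(phrase_slots)
        ):
            return True
    return False


# Analyzed document "balca refriger geladeir frigorif" with the positions
# reported by each version (the stopword 'e' is removed in both).
solr3_doc = {"balca": {1}, "refriger": {2}, "geladeir": {3}, "frigorif": {4}}
solr4_doc = {"balca": {1}, "refriger": {2}, "geladeir": {4}, "frigorif": {5}}

# title_search_pt:"(refriger geladeir) (refriger geladeir)"
query = [{"refriger", "geladeir"}, {"refriger", "geladeir"}]

print(multi_phrase_match(solr3_doc, query))  # True: positions 2 and 3 are adjacent
print(multi_phrase_match(solr4_doc, query))  # False: the gap at position 3 breaks the phrase
```

If this model is right, the document matches in 3.x only because the removed stopword leaves no gap there, which would explain the different results.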
Thank you in advance

PS: just in case, I'm adding the relevant schema part (version="1.5"):

<fieldtype name="text_pt" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
  <analyzer type="index">
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="-" replacement="IIIHYPHENIII"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="IIIHYPHENIII" replacement="-"/>
    <filter class="solr.ASCIIFoldingFilterFactory" />
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" preserveOriginal="1" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="false" words="portugueseStopWords.txt"/>
    <filter class="solr.BrazilianStemFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="-" replacement="IIIHYPHENIII"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="IIIHYPHENIII" replacement="-"/>
    <filter class="solr.ASCIIFoldingFilterFactory" />
    <filter class="solr.SynonymFilterFactory" ignoreCase="true" synonyms="portugueseSynonyms.txt" expand="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" preserveOriginal="1" catenateNumbers="0" catenateAll="0" protected="protwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="false" words="portugueseStopWords.txt"/>
    <filter class="solr.BrazilianStemFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldtype>