[ https://issues.apache.org/jira/browse/SOLR-908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12748639#action_12748639 ]
Jason Rutherglen commented on SOLR-908:
---------------------------------------

There seems to be a bug related to HTMLStripStandardTokenizerFactory: a single-word query fails to generate a token with the following chain. With a StandardTokenizer in its place, a token is returned as expected. When SOLR-908.patch was tested with rev 799698, HTMLStripStandardTokenizerFactory (HTMLSSTF) worked.

{code}
<tokenizer class="solr.HTMLStripStandardTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EnglishPorterFilterFactory" protected="stopwords.txt"/>
<filter class="solr.CommonGramsQueryFilterFactory" words="stopwords.txt"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
{code}

> Port of Nutch CommonGrams filter to Solr
> -----------------------------------------
>
>                 Key: SOLR-908
>                 URL: https://issues.apache.org/jira/browse/SOLR-908
>             Project: Solr
>          Issue Type: Wish
>          Components: Analysis
>            Reporter: Tom Burton-West
>            Assignee: Shalin Shekhar Mangar
>            Priority: Minor
>         Attachments: CommonGramsPort.zip, SOLR-908.patch, SOLR-908.patch, SOLR-908.patch, SOLR-908.patch, SOLR-908.patch, SOLR-908.patch, SOLR-908.patch
>
>
> Phrase queries containing common words are extremely slow. We are reluctant
> to just use stop words due to various problems with false hits and some
> things becoming impossible to search with stop words turned on. (For example
> "to be or not to be", "the who", "man in the moon" vs "man on the moon" etc.)
>
> Several postings regarding slow phrase queries have suggested using the
> approach used by Nutch. Perhaps someone with more Java/Solr experience might
> take this on.
> It should be possible to port the Nutch CommonGrams code to Solr and create
> a suitable Solr FilterFactory so that it could be used in Solr by listing it
> in the Solr schema.xml.
> "Construct n-grams for frequently occurring terms and phrases while indexing.
> Optimize phrase queries to use the n-grams. Single terms are still indexed
> too, with n-grams overlaid."
> http://lucene.apache.org/nutch/apidocs-0.8.x/org/apache/nutch/analysis/CommonGrams.html
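
For readers unfamiliar with the approach quoted above, here is a minimal sketch of the index-time idea: every single term is kept, and a two-word gram is overlaid wherever either neighbouring term is a common word. This is not the Nutch CommonGrams code and not the filter in SOLR-908.patch. The class name, the `overlay` helper, and the `_` separator are assumptions made for illustration, and token positions are ignored.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class CommonGramsSketch {

    // Hypothetical helper (not the Nutch or Solr API): emits every single
    // term, plus a bigram "a_b" whenever a or b is in the common-word set.
    static List<String> overlay(List<String> tokens, Set<String> common) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            String cur = tokens.get(i);
            out.add(cur); // single terms are still indexed
            if (i + 1 < tokens.size()) {
                String next = tokens.get(i + 1);
                if (common.contains(cur) || common.contains(next)) {
                    out.add(cur + "_" + next); // gram overlaid on the single term
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Assumed common-word list, standing in for stopwords.txt.
        Set<String> common = new HashSet<>(Arrays.asList("the", "in", "on"));
        System.out.println(overlay(Arrays.asList("man", "in", "the", "moon"), common));
        // prints [man, man_in, in, in_the, the, the_moon, moon]
    }
}
```

With grams like `in_the` in the index, a phrase query such as "man in the moon" can match on the much rarer grams instead of intersecting the long posting lists of "in" and "the", which is presumably where the speed-up for phrase queries comes from.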