[jira] Commented: (SOLR-908) Port of Nutch CommonGrams filter to Solr

Jason Rutherglen (JIRA) Thu, 27 Aug 2009 18:41:23 -0700

    [ 
https://issues.apache.org/jira/browse/SOLR-908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12748639#action_12748639
 ]


Jason Rutherglen commented on SOLR-908:
---------------------------------------

There is a bug that seems to be related to
HTMLStripStandardTokenizerFactory where a single word query
fails to generate a token using the following chain. However a
StandardTokenizer in it's place returns a token as expected.
When SOLR-908.patch was tested with rev 799698, HTMLSSTF worked.

{code}
<tokenizer class="solr.HTMLStripStandardTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EnglishPorterFilterFactory" protected="stopwords.txt"/> 
<filter class="solr.CommonGramsQueryFilterFactory" words="stopwords.txt"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" 
enablePositionIncrements="true"/>
{code}

> Port of Nutch  CommonGrams filter to Solr
> -----------------------------------------
>
>                 Key: SOLR-908
>                 URL: https://issues.apache.org/jira/browse/SOLR-908
>             Project: Solr
>          Issue Type: Wish
>          Components: Analysis
>            Reporter: Tom Burton-West
>            Assignee: Shalin Shekhar Mangar
>            Priority: Minor
>         Attachments: CommonGramsPort.zip, SOLR-908.patch, SOLR-908.patch, 
> SOLR-908.patch, SOLR-908.patch, SOLR-908.patch, SOLR-908.patch, SOLR-908.patch
>
>
> Phrase queries containing common words are extremely slow.  We are reluctant 
> to just use stop words due to various problems with false hits and some 
> things becoming impossible to search with stop words turned on. (For example 
> "to be or not to be", "the who", "man in the moon" vs "man on the moon" etc.) 
>  
> Several postings regarding slow phrase queries have suggested using the 
> approach used by Nutch.  Perhaps someone with more Java/Solr experience might 
> take this on.
> It should be possible to port the Nutch CommonGrams code to Solr  and create 
> a suitable Solr FilterFactory so that it could be used in Solr by listing it 
> in the Solr schema.xml.
> "Construct n-grams for frequently occuring terms and phrases while indexing. 
> Optimize phrase queries to use the n-grams. Single terms are still indexed 
> too, with n-grams overlaid."
> http://lucene.apache.org/nutch/apidocs-0.8.x/org/apache/nutch/analysis/CommonGrams.html

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (SOLR-908) Port of Nutch CommonGrams filter to Solr

Reply via email to