[jira] Commented: (SOLR-908) Port of Nutch CommonGrams filter to Solr

Jason Rutherglen (JIRA) Sun, 20 Sep 2009 21:58:55 -0700

    [ https://issues.apache.org/jira/browse/SOLR-908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12757820#action_12757820 ]


Jason Rutherglen commented on SOLR-908:
---------------------------------------

Yeah, unfortunately it's going to be hard to upgrade. Folks feel
a bit burned at this point and are reverting to Solr trunk 8/31
plus the old HTMLStripReader, which seems to be more stable than
the latest Solr builds. I need to reproduce our seemingly random
query truncations, and haven't managed to yet. I'll probably try
running completely randomized queries in multiple threads to
see what happens. Without reproducing the problem and showing it
fixed, upgrading will be difficult to justify. Logically the
threadlocal reusableTokenStream is the problem; however, the
perception is that things got way too broken.
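
A rough sketch of the kind of randomized multi-threaded check described
above. It is written in Python purely for illustration and does not use any
Solr or Lucene API: `analyze`, `random_query`, `worker`, `run_stress` and
the `WORDS` list are hypothetical names, and `analyze` is only a stand-in
for the real analysis chain, with a per-thread reused buffer loosely
mimicking a thread-local reusable token stream.

```python
import random
import threading
from concurrent.futures import ThreadPoolExecutor

# Hypothetical vocabulary used to build random queries.
WORDS = ["the", "who", "man", "in", "on", "moon", "to", "be", "or", "not"]

# Per-thread reusable state (assumption: stands in for a thread-local token stream).
_local = threading.local()


def analyze(query):
    """Hypothetical stand-in for the analysis chain under test."""
    buf = getattr(_local, "buf", None)
    if buf is None:
        buf = _local.buf = []
    buf.clear()                      # reset the reused buffer before each query
    buf.extend(query.lower().split())
    return list(buf)                 # copy so callers never share the reused buffer


def random_query(rng, max_terms=8):
    """Build a query of 1..max_terms random words."""
    return " ".join(rng.choice(WORDS) for _ in range(rng.randint(1, max_terms)))


def worker(seed, iterations):
    """Run random queries and record any whose analyzed form lost or changed terms."""
    rng = random.Random(seed)        # seeded, so a failing run can be replayed
    failures = []
    for _ in range(iterations):
        query = random_query(rng)
        expected = query.lower().split()
        got = analyze(query)
        if got != expected:
            failures.append((query, got))
    return failures


def run_stress(threads=8, iterations=2000):
    """Run the workers concurrently and return every mismatch found."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        results = pool.map(worker, range(threads), [iterations] * threads)
    return [failure for failures in results for failure in failures]


if __name__ == "__main__":
    print(len(run_stress()), "mismatched queries")
```

Seeding each worker keeps a failing run replayable, which matters when the
goal is to show the truncation both before and after a fix.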

Also I need to upgrade the patch to use the new tokenizing API.
I think this belongs in Lucene analyzers rather than in Solr
anyway, and BufferedTokenStream totally changes with the new
tokenizing API. Hacking ShingleFilter to only include certain
words seemed like too much of a rewrite of it. So porting is the
next task here, hopefully after reproducing the problem.



> Port of Nutch CommonGrams filter to Solr
> ----------------------------------------
>
>                 Key: SOLR-908
>                 URL: https://issues.apache.org/jira/browse/SOLR-908
>             Project: Solr
>          Issue Type: Wish
>          Components: Analysis
>            Reporter: Tom Burton-West
>            Priority: Minor
>         Attachments: CommonGramsPort.zip, SOLR-908.patch, SOLR-908.patch, 
> SOLR-908.patch, SOLR-908.patch, SOLR-908.patch, SOLR-908.patch, 
> SOLR-908.patch, SOLR-908.patch, SOLR-908.patch
>
>
> Phrase queries containing common words are extremely slow.  We are reluctant 
> to just use stop words due to various problems with false hits and some 
> things becoming impossible to search with stop words turned on. (For example 
> "to be or not to be", "the who", "man in the moon" vs "man on the moon" etc.) 
>  
> Several postings regarding slow phrase queries have suggested using the 
> approach used by Nutch.  Perhaps someone with more Java/Solr experience might 
> take this on.
> It should be possible to port the Nutch CommonGrams code to Solr  and create 
> a suitable Solr FilterFactory so that it could be used in Solr by listing it 
> in the Solr schema.xml.
> "Construct n-grams for frequently occurring terms and phrases while indexing. 
> Optimize phrase queries to use the n-grams. Single terms are still indexed 
> too, with n-grams overlaid."
> http://lucene.apache.org/nutch/apidocs-0.8.x/org/apache/nutch/analysis/CommonGrams.html
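
The quoted description can be illustrated with a small sketch. This is one
plausible reading of the idea, not the Nutch code or the attached patch:
every single term is kept, and a bigram is overlaid at the same position
whenever either neighbouring term is a common word. The function name
`common_grams`, the `_` separator and the common-word sets are assumptions
made for this example.

```python
def common_grams(tokens, common_words, sep="_"):
    """Return (position, term) pairs: all unigrams, plus a bigram overlaid
    at the same position when either adjacent token is a common word."""
    out = []
    for pos, tok in enumerate(tokens):
        out.append((pos, tok))                    # single terms are still indexed
        if pos + 1 < len(tokens):
            nxt = tokens[pos + 1]
            if tok in common_words or nxt in common_words:
                out.append((pos, tok + sep + nxt))  # bigram shares the position
    return out


if __name__ == "__main__":
    common = {"in", "on", "the"}
    for phrase in ("man in the moon", "man on the moon"):
        print(common_grams(phrase.split(), common))
```

With this scheme "man in the moon" and "man on the moon" produce different
bigrams (`man_in` vs `man_on`), so a phrase query can match on a few rarer
bigram terms instead of intersecting long postings lists for "in", "on" and
"the".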

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
