[jira] Commented: (SOLR-908) Port of Nutch CommonGrams filter to Solr

Robert Muir (JIRA) Fri, 18 Sep 2009 14:11:41 -0700

    [ 
https://issues.apache.org/jira/browse/SOLR-908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12757432#action_12757432
 ]


Robert Muir commented on SOLR-908:
----------------------------------

{quote}
It seems if BTS is caching tokens, then being reused, and isn't
reset, then there would be excess tokens instead of deletions?
{quote}

Right, that's what the test case I added for BufferedTokenStream showed.
This would be more of a corner case, as I think most BufferedTokenStreams would
have empty lists anyway by the time they are reset(), so it's likely not causing
your problem (though it should be fixed!).
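
To make the corner case concrete, here is a minimal sketch in plain Java. It is a toy model, not the real BufferedTokenStream: the class ToyBufferedStream, its two-token lookahead, and the drain() helper are all assumptions made up for illustration. It only shows how tokens left in a buffer come back as excess tokens when the stream is reused without clearing that buffer.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;

// Toy model only: NOT the real Solr BufferedTokenStream API.
class ToyBufferedStream {
    // Tokens read ahead from the input but not yet returned.
    private final Deque<String> buffered = new ArrayDeque<>();
    private Iterator<String> input = Collections.emptyIterator();

    // Simulates reuse: point the same instance at a new input.
    void setInput(List<String> tokens) {
        input = tokens.iterator();
    }

    // Returns the next token, or null at end of stream.
    String next() {
        if (buffered.isEmpty()) {
            // Hypothetical lookahead of up to two tokens.
            for (int i = 0; i < 2 && input.hasNext(); i++) {
                buffered.add(input.next());
            }
        }
        return buffered.poll();
    }

    // Drops anything still buffered from the previous input.
    void reset() {
        buffered.clear();
    }

    // Helper: consume the stream until it is exhausted.
    static List<String> drain(ToyBufferedStream s) {
        List<String> out = new ArrayList<>();
        for (String t = s.next(); t != null; t = s.next()) {
            out.add(t);
        }
        return out;
    }
}
```

If a consumer reads only one token of "a b c" and the instance is then reused for "x y" without reset(), the leftover "b" comes out ahead of the new tokens. A consumer that always reads to end of stream leaves the buffer empty, which is why this should rarely matter in practice.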

Your problem, again, is probably the internal state kept in
CommonGramsQueryFilter. As you can see, CommonGramsQueryFilter has hairy logic
involving the buffered token 'prev', and a lot of this logic has to do with what
happens at end of stream.

Unfortunately, there is no reset() for CommonGramsQueryFilter to set 'prev' back
to its initial state, so when something like QueryParser tries to reuse it, it
is probably not behaving correctly.
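
As a rough illustration of the kind of fix I mean, here is a sketch in plain Java. It is not the actual CommonGramsQueryFilter logic and not the Lucene TokenFilter API: ToyPrevFilter, its bigram-only output, and the run() helper are hypothetical, and the real end-of-stream handling is more involved. The only point is that a buffered 'prev' needs a reset() that restores the initial state before reuse.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

// Toy model only: NOT the real CommonGramsQueryFilter.
class ToyPrevFilter {
    private Iterator<String> input = Collections.emptyIterator();
    // Buffered previous token; null is the initial state.
    private String prev;

    // Simulates reuse: point the same instance at a new input.
    void setInput(List<String> tokens) {
        input = tokens.iterator();
    }

    // Emits "prev_current" bigrams, or null at end of stream.
    String next() {
        while (input.hasNext()) {
            String cur = input.next();
            String p = prev;
            prev = cur;
            if (p != null) {
                return p + "_" + cur;
            }
        }
        return null;
    }

    // The missing piece: put 'prev' back to its initial state.
    void reset() {
        prev = null;
    }

    // Helper: optionally reset, then consume a new input fully.
    static List<String> run(ToyPrevFilter f, List<String> tokens, boolean resetFirst) {
        if (resetFirst) {
            f.reset();
        }
        f.setInput(tokens);
        List<String> out = new ArrayList<>();
        for (String t = f.next(); t != null; t = f.next()) {
            out.add(t);
        }
        return out;
    }
}
```

In this toy, running "the who" and then "man moon" on the same instance without reset() yields a stale "who_man" before "man_moon"; with reset() the second run yields only "man_moon".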

> Port of Nutch  CommonGrams filter to Solr
> -----------------------------------------
>
>                 Key: SOLR-908
>                 URL: https://issues.apache.org/jira/browse/SOLR-908
>             Project: Solr
>          Issue Type: Wish
>          Components: Analysis
>            Reporter: Tom Burton-West
>            Priority: Minor
>         Attachments: CommonGramsPort.zip, SOLR-908.patch, SOLR-908.patch, 
> SOLR-908.patch, SOLR-908.patch, SOLR-908.patch, SOLR-908.patch, SOLR-908.patch
>
>
> Phrase queries containing common words are extremely slow.  We are reluctant 
> to just use stop words due to various problems with false hits and some 
> things becoming impossible to search with stop words turned on. (For example 
> "to be or not to be", "the who", "man in the moon" vs "man on the moon" etc.) 
>  
> Several postings regarding slow phrase queries have suggested using the 
> approach used by Nutch.  Perhaps someone with more Java/Solr experience might 
> take this on.
> It should be possible to port the Nutch CommonGrams code to Solr  and create 
> a suitable Solr FilterFactory so that it could be used in Solr by listing it 
> in the Solr schema.xml.
> "Construct n-grams for frequently occurring terms and phrases while indexing. 
> Optimize phrase queries to use the n-grams. Single terms are still indexed 
> too, with n-grams overlaid."
> http://lucene.apache.org/nutch/apidocs-0.8.x/org/apache/nutch/analysis/CommonGrams.html

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
