[ https://issues.apache.org/jira/browse/SOLR-199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12492106 ]

Adam Hiatt commented on SOLR-199:
---------------------------------

Quoted:
Adam, I think Yonik is just saying that the n-gram stuff I added to Lucene's 
contrib/analyzers was added after 2.1 was released, so we'd need a version of 
that jar from the trunk at this time. I see mentions of Solr 1.2, so perhaps we 
can grab the 2.2-dev version of that jar and add it to Solr starting with 
release 1.2?

Understood. I talked with Yonik and he mentioned possibly upgrading to Lucene 
2.2-dev in the future. I'm not sure he intended that to happen in time for Solr 
1.2, however. I suppose if it came to it, we could probably use the 2.2-dev 
analyzers with the 2.1 core. I'm guessing the API is stable, but I'm not sure 
we want to complicate things that much.

Quoted:
Question: How is the spellchecker you are writing, or considering writing, 
going to be different/better than the one in contrib/spellchecker? 

The initial use case was actually to support autocomplete functionality, i.e., 
using the edge (start) n-gram functionality to build tokens against which term 
fragments can be matched. 
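For illustration, here is a minimal sketch of the edge (start) n-gramming this enables; the class and method names are hypothetical, not part of the patch:

```java
import java.util.ArrayList;
import java.util.List;

public class EdgeNgramSketch {
    // Front-anchored n-grams of a term, from minGram characters up to
    // the full term: "apple", minGram=1 -> a, ap, app, appl, apple.
    static List<String> edgeNgrams(String term, int minGram) {
        List<String> grams = new ArrayList<>();
        for (int len = minGram; len <= term.length(); len++) {
            grams.add(term.substring(0, len));
        }
        return grams;
    }

    public static void main(String[] args) {
        List<String> grams = edgeNgrams("apple", 1);
        System.out.println(grams); // [a, ap, app, appl, apple]
        // A fragment typed so far ("app") matches because it is one of
        // the indexed edge n-grams of the full term.
        System.out.println(grams.contains("app")); // true
    }
}
```

Indexing these grams for each term lets a query on the fragment typed so far match the full term directly, which is the autocomplete use case.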

However, I do still plan to write a native Solr spell checker based on this 
same patch sometime in the future. The major improvements with a native system 
are several. First, it allows truly native use of a Solr-configurable Lucene 
index. Second, we will be able to take advantage of native Solr caching. 
Third, we will be able to boost on arbitrary aspects. For example, take the 
misspelling 'ipad' and the indexed terms 'ipod' and 'ipaq'. Both indexed terms 
are the same edit distance away from the misspelling. They also share the same 
number of 2-grams with it (though not 3-grams). If we find that 'ipod' is the 
more valuable term, we can boost slightly based on its popularity and let it 
pull ahead. The final big win is the ability to spell check individual input 
tokens. For example, assume that we have the term 'ipod' indexed in our spell 
checker, but not the term 'apple ipod', and the misspelling 'apple ipdo' is 
entered. The overlap between 'ipod' and 'apple ipdo' is slight enough not to 
warrant a suggestion. However, if we tokenize on whitespace and spell correct 
each token, we would be able to catch the 'ipdo' misspelling. I'm sure there 
are other use cases, but those are the ones I've identified.
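As an illustrative sketch (not Solr code), the bigram-overlap tie and the per-token correction described above can be reproduced like this; the boundary-marker padding and the dictionary are assumptions for the example:

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class NgramSpellSketch {
    // Bigrams of a term padded with boundary markers, so start/end
    // characters contribute their own grams.
    static Set<String> bigrams(String term) {
        String padded = "_" + term + "_";
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + 2 <= padded.length(); i++) {
            grams.add(padded.substring(i, i + 2));
        }
        return grams;
    }

    // Number of bigrams shared by two terms.
    static int overlap(String a, String b) {
        Set<String> shared = bigrams(a);
        shared.retainAll(bigrams(b));
        return shared.size();
    }

    // Pick the dictionary term sharing the most bigrams with the token;
    // keep the token itself when nothing overlaps.
    static String suggest(String token, List<String> dictionary) {
        String best = token;
        int bestScore = 0;
        for (String term : dictionary) {
            int score = overlap(token, term);
            if (score > bestScore) {
                bestScore = score;
                best = term;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // 'ipod' and 'ipaq' tie on bigram overlap with 'ipad' (3 each),
        // so only a popularity boost can break the tie.
        System.out.println(overlap("ipad", "ipod")); // 3
        System.out.println(overlap("ipad", "ipaq")); // 3

        // Tokenizing on whitespace lets each token be corrected on its own:
        List<String> dict = List.of("apple", "ipod");
        StringBuilder out = new StringBuilder();
        for (String token : "apple ipdo".split("\\s+")) {
            if (out.length() > 0) out.append(' ');
            out.append(suggest(token, dict));
        }
        System.out.println(out); // apple ipod
    }
}
```

Treating the whole string 'apple ipdo' as one term would dilute the overlap with 'ipod' below a useful threshold, which is why the per-token pass catches the misspelling while a whole-string pass would not.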



> N-gram
> ------
>
>                 Key: SOLR-199
>                 URL: https://issues.apache.org/jira/browse/SOLR-199
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Adam Hiatt
>            Priority: Trivial
>         Attachments: SOLR-81-ngram.patch
>
>
> This tracks the creation of a patch that adds the n-gram/edge n-gram 
> tokenizing functionality that was initially part of SOLR-81 (spell checking). 
> This was taken out because the Lucene SpellChecker class removed this 
> dependency. Nonetheless, I think this is useful functionality and the 
> addition is trivial. How does everyone feel about such an addition?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
