[jira] Updated: (SOLR-606) spellcheck.colate doesn't handle multiple tokens properly

Stefan Oestreicher (JIRA) Thu, 14 Aug 2008 03:40:09 -0700

     [ 
https://issues.apache.org/jira/browse/SOLR-606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Stefan Oestreicher updated SOLR-606:
------------------------------------

    Attachment: handler.component.SpellCheckComponent-collate-patch.txt

I recently ran into this exact issue and found the cause.
The collation is built by replacing each misspelled token in the original 
query with its suggestion, using a StringBuilder:

{noformat}
for (Iterator<Map.Entry<Token, String>> bestIter = best.entrySet().iterator(); bestIter.hasNext();) {
    Map.Entry<Token, String> entry = bestIter.next();
    Token tok = entry.getKey();
    collation.replace(tok.startOffset(), tok.endOffset(), entry.getValue());
}
{noformat}

As you can see, it just replaces the relevant tokens in the original query. 
However, if the length of a suggestion differs from the length of the original 
token, every offset used after that replacement is no longer valid, so later 
replacements land in the wrong place and the results look randomly incorrect.
I fixed that by keeping track of the accumulated length difference and adding 
it to the token offsets. For this to work I had to change the HashMap to a 
LinkedHashMap, since the solution depends on the Tokens being iterated in the 
order in which they occur in the string.
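
To illustrate the idea, here is a minimal, self-contained sketch of the offset 
tracking. It is not the attached patch: the class CollationSketch, the method 
collate and the Span class (a stand-in for Lucene's Token that carries only 
the two offsets) are hypothetical names chosen for this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class CollationSketch {
    /** Hypothetical stand-in for Lucene's Token: only the offsets matter here. */
    static final class Span {
        final int start;
        final int end;
        Span(int start, int end) { this.start = start; this.end = end; }
    }

    /**
     * Replaces each span of the query with its suggestion. The map must
     * iterate in the order the spans occur in the query, hence LinkedHashMap.
     */
    static String collate(String query, LinkedHashMap<Span, String> best) {
        StringBuilder collation = new StringBuilder(query);
        int offset = 0; // accumulated length difference of earlier replacements
        for (Map.Entry<Span, String> entry : best.entrySet()) {
            Span tok = entry.getKey();
            String suggestion = entry.getValue();
            // Shift the original offsets by what earlier replacements added or removed.
            collation.replace(tok.start + offset, tok.end + offset, suggestion);
            offset += suggestion.length() - (tok.end - tok.start);
        }
        return collation.toString();
    }

    public static void main(String[] args) {
        LinkedHashMap<Span, String> best = new LinkedHashMap<>();
        best.put(new Span(0, 7), "redbelly"); // "redbull"
        best.put(new Span(12, 16), "shot");   // "show"
        System.out.println(collate("redbull air show", best)); // prints "redbelly air shot"
    }
}
```

With the offset left at 0 for every token, the same input gives 
"redbelly airshotw", which is the collation quoted in the report below.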

> spellcheck.colate doesn't handle multiple tokens properly
> ---------------------------------------------------------
>
>                 Key: SOLR-606
>                 URL: https://issues.apache.org/jira/browse/SOLR-606
>             Project: Solr
>          Issue Type: Bug
>          Components: spellchecker
>    Affects Versions: 1.3
>         Environment: tomcat
>            Reporter: Geoffrey Young
>            Assignee: Grant Ingersoll
>            Priority: Minor
>         Attachments: handler.component.SpellCheckComponent-collate-patch.txt, 
> SOLR-606.patch
>
>
> originally posted as part of SOLR-572:
>   
> https://issues.apache.org/jira/browse/SOLR-572?focusedCommentId=12608487#action_12608487
> the new spellcheck.collate feature seems to exhibit some strange behaviors 
> when handed a query with multiple tokens.
> {noformat}
> {
>  "responseHeader":{
>   "params":{
>       "q":"redbull air show"}},
>   "spellcheck":{
>    "suggestions":[
>       "redbull",[
>        "suggestion",["redbelly"]],
>       "show",[
>        "suggestion",["shot"]],
>       "collation","redbelly airshotw"]}}
> {noformat}
> In this case, note that the tokens are incorrectly concatenated (no space 
> between "air" and "shot", and a leftover 'w' from the input string).
> {noformat}
> {
>  "responseHeader":{
>   "params":{
>       "q":"redbull air show",
>       "spellcheck.q":"redbull air show"}},
>  "spellcheck":{
>   "suggestions":[
>       "redbull air show",[
>        "suggestion",["redbull singers"]],
>       "collation","redbull singersredbull air show"]}}
> {noformat}
> This is slightly different: the collation is the suggestion followed 
> directly, without a space, by the entire original query.
> --Geoff

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
