[ 
https://issues.apache.org/jira/browse/SOLR-572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12598733#action_12598733
 ] 

Otis Gospodnetic commented on SOLR-572:
---------------------------------------

Just got an idea.  File-based dictionaries don't have word frequency 
information, and without that we lose certain functionality (e.g. 
onlyMorePopular cannot be used).  What if we (also) accepted plain-text field 
dictionaries that included word frequency information?
e.g.
ball,100
boil,44
bowl,77
...
I'm not looking at sources now, but could we not feed this word frequency 
information into Lucene SC, so it makes use of that when figuring out top-N 
best words to suggest?
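
To make the idea concrete, here is a rough sketch of reading such a 
"word,freq" file and using the frequencies for an onlyMorePopular-style 
filter.  The FreqDictionary class and its methods are made up for illustration; 
nothing like this exists in Lucene SC or Solr as far as this sketch assumes, 
and treating a word with no frequency as 0 is just one possible choice.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Lucene or Solr.
final class FreqDictionary {
    private final Map<String, Integer> freqs = new LinkedHashMap<>();

    /** Parses lines of the form "word,freq"; blank lines are skipped. */
    static FreqDictionary parse(List<String> lines) {
        FreqDictionary dict = new FreqDictionary();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            int comma = trimmed.lastIndexOf(',');
            if (comma < 0) {
                // Assumption: a plain word without a frequency counts as 0.
                dict.freqs.put(trimmed, 0);
                continue;
            }
            String word = trimmed.substring(0, comma).trim();
            int freq = Integer.parseInt(trimmed.substring(comma + 1).trim());
            dict.freqs.put(word, freq);
        }
        return dict;
    }

    /** Returns the recorded frequency, or 0 for unknown words. */
    int freq(String word) {
        return freqs.getOrDefault(word, 0);
    }

    /** Keeps only candidates more frequent than the original word, most frequent first. */
    List<String> onlyMorePopular(String original, List<String> candidates) {
        int base = freq(original);
        List<String> result = new ArrayList<>();
        for (String candidate : candidates) {
            if (freq(candidate) > base) {
                result.add(candidate);
            }
        }
        result.sort((a, b) -> Integer.compare(freq(b), freq(a)));
        return result;
    }
}
```

With the three sample entries above, asking for words more popular than "boil" 
among the candidates ball, bowl and boil would keep ball and bowl, in that 
order.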

And how would we figure out the frequency of each word to begin with?  I 
imagine we can have a tool/class that, given a path to a dictionary file with 
words and a path to a Lucene/Solr index, looks up each dictionary word's 
frequency in the given index and outputs "<word>,<freq>" for each word.  This 
class could live in Lucene SC, but could be used by SCRH when rebuilding the SC 
index for example.
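
A minimal sketch of what such a tool could look like follows.  The 
FreqDictionaryBuilder name and the lookup parameter are hypothetical; the 
lookup stands in for whatever actually consults the index.  With Lucene it 
could presumably be backed by IndexReader.docFreq(Term) for a chosen field, 
though that gives document frequency rather than total occurrences, so which 
number we want is an open question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.ToIntFunction;

// Hypothetical tool, not an existing Lucene or Solr class.
final class FreqDictionaryBuilder {

    /**
     * Emits one "<word>,<freq>" line per non-blank dictionary word.
     * The lookup function is a stand-in for an index frequency lookup.
     */
    static List<String> build(List<String> words, ToIntFunction<String> lookup) {
        List<String> lines = new ArrayList<>();
        for (String raw : words) {
            String word = raw.trim();
            if (word.isEmpty()) {
                continue;
            }
            lines.add(word + "," + lookup.applyAsInt(word));
        }
        return lines;
    }
}
```

Keeping the index access behind a plain function means the same code could be 
run as a standalone tool or called from SCRH during a rebuild.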

Does this sound useful and implementable?


> Spell Checker as a Search Component
> -----------------------------------
>
>                 Key: SOLR-572
>                 URL: https://issues.apache.org/jira/browse/SOLR-572
>             Project: Solr
>          Issue Type: New Feature
>          Components: spellchecker
>    Affects Versions: 1.3
>            Reporter: Shalin Shekhar Mangar
>             Fix For: 1.3
>
>         Attachments: SOLR-572.patch, SOLR-572.patch, SOLR-572.patch, 
> SOLR-572.patch
>
>
> Expose the Lucene contrib SpellChecker as a Search Component. Provide the 
> following features:
> * Allow creating a spell index on a given field and make it possible to have 
> multiple spell indices -- one for each field
> * Give suggestions on a per-field basis
> * Given a multi-word query, give only one consistent suggestion
> * Process the query with the same analyzer specified for the source field and 
> process each token separately
> * Allow the user to specify minimum length for a token (optional)
> Consistency criteria for a multi-word query can consist of the following:
> * Preserve the correct words in the original query as they are
> * Never give duplicate words in a suggestion

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
