[
https://issues.apache.org/jira/browse/SOLR-81?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12479670
]
Hoss Man commented on SOLR-81:
------------------------------
Okay, assuming what we are talking about is adding the existing Lucene
SpellChecker as a hook into a Solr instance, where the dictionary may be built
externally or built from the main source index, here are my comments based on
the most recent patch (from Adam ... as I recall it already incorporates most
of Otis's stuff)
1) we should definitely move the *NGramTokenizerFactories into a separate
issue since they don't come into play here.
2) when configuring where the SpellChecker Directory lives, we should probably
support three options: a path relative to dataDir (so the regular replication
scripts can copy a SpellChecker index from a master to a slave right along with
the main index), an absolute path, or a RAMDirectory. A rough sketch follows.
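For (2), something roughly like this in the handler's init code could resolve the
configured location (just a sketch: the "spellcheckerIndexDir" param name and the
SolrCore.getSolrCore().getDataDir() lookup are illustrative, not in the current patch):
    // illustrative only: turn the configured location into a Lucene Directory
    // "args" is the NamedList passed to init()
    String dirSpec = (String) args.get("spellcheckerIndexDir");
    Directory spellDir;
    if (dirSpec == null) {
      spellDir = new RAMDirectory();                       // no path configured: in-memory index
    } else {
      File f = new File(dirSpec);
      if (!f.isAbsolute()) {
        // relative paths resolve against dataDir so replication scripts pick it up
        f = new File(SolrCore.getSolrCore().getDataDir(), dirSpec);
      }
      spellDir = FSDirectory.getDirectory(f, !f.exists()); // on-disk index
    }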
3) it seems like the functionality present in SpellCheckerRequestHandler and
SpellCheckerCommitRequestHandler should all be in one request handler (where
special query-time input triggers the rebuild); that way no redundant
configuration is required. There should also be an option for "reloading" the
SpellChecker instance from disk (ie: reopening its IndexReader) without
rebuilding -- which would be useful for people who are (re)building the
SpellChecker index externally from Solr and need a way to tell Solr to start
using the new one.
A key use case I'm imagining is that a master could have a postCommit listener
configured to ping "qt=spell&rebuild=true" after each commit, while a slave
could ping "qt=spell&reload=true" to pick up the changed SpellChecker index.
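Inside the combined handler for (3) I'm picturing something along these lines
(sketch only; the "rebuild"/"reload" param names are illustrative, and the
SpellChecker calls are the existing Lucene contrib API):
    // illustrative: rebuild/reload triggered by query params, otherwise just suggest
    SolrParams params = req.getParams();
    if (params.getBool("rebuild", false)) {
      // re-derive the dictionary from the source field of the main index
      spellChecker.indexDictionary(
          new LuceneDictionary(req.getSearcher().getReader(), termSourceField));
    } else if (params.getBool("reload", false)) {
      // reopen a spell index that was (re)built outside of Solr and copied in
      spellChecker.setSpellIndex(spellDir);   // spellDir: the configured Directory
    }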
4) it's not really safe to trust this...
+   IndexReader indexReader = req.getSearcher().getReader();
+   Dictionary dictionary = new LuceneDictionary(indexReader, termSourceField);
...the source field might be encoded in some way. We really need a subclass of
LuceneDictionary that knows about the IndexSchema/FieldType of termSourceField
to extract the readable value from the indexed terms (either that, or we go
ahead and feed SpellChecker the raw terms, and then at query time run the
user's (misspelled) input through the query-time analyzer for termSourceField
before passing it to spellChecker.suggestSimilar).
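The second option in (4) would look roughly like this (sketch; "q" and
"numSuggestions" are made-up names, the TokenStream/Token usage is the current
Lucene API):
    // illustrative: analyze the user's raw (possibly misspelled) input with the
    // query-time analyzer for termSourceField, then ask for suggestions per token
    Analyzer analyzer = req.getSchema().getQueryAnalyzer();
    TokenStream tokens = analyzer.tokenStream(termSourceField, new StringReader(q));
    for (Token t = tokens.next(); t != null; t = tokens.next()) {
      String[] suggestions = spellChecker.suggestSimilar(t.termText(), numSuggestions);
      // ... add the suggestions for this token to the response ...
    }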
5) we definitely shouldn't have a "private static SpellChecker"; it should be
possible to have multiple SpellChecker instances (from different dictionaries)
just by registering multiple instances of the handler .. at first glance this
seems like it might make adding SpellChecking functionality to the other
request handlers hard .. except that they can call core.getRequestHandler(name)
... so we can still add code to other request handlers so that they can be
configured to ask for a SpellCheckerRequestHandler by name, and delegate some
spell checking functionality to it.
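i.e. from inside another handler, something like (sketch; "spellchecker" is just
whatever name the instance was registered under in solrconfig.xml):
    // illustrative: delegate to a named SpellCheckerRequestHandler instance
    SolrRequestHandler spell = SolrCore.getSolrCore().getRequestHandler("spellchecker");
    if (spell != null) {
      spell.handleRequest(req, rsp);   // or a synthetic sub-request, merging results
    }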
6) as far as configuring things like spellcheckerIndexDir and termSourceField,
there's no reason to do that in the "invariants" list .. that's really for
things that the code allows to be query-time params, but that the person
configuring Solr doesn't want query clients to be able to specify. Separate
init params can be used and accessed directly from the init method (just like
XSLTResponseWriter)...
+  <requestHandler name="spellchecker" class="solr.SpellCheckerRequestHandler">
+    <!-- default values for query parameters -->
+    <lst name="defaults">
+      <str name="echoParams">explicit</str>
+      <int name="suggestionCount">1</int>
+      <float name="accuracy">0.5</float>
+    </lst>
+
+    <!-- main init params for handler -->
+    <str name="spellcheckerIndexDir">/tmp/spellchecker</str>
+    <str name="termSourceField">word</str>
+  </requestHandler>
+
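...and then the handler's init() just reads those top-level params directly,
something like (sketch; field names are illustrative):
    // illustrative: handler-level options come straight from the init args,
    // while "defaults" (and "invariants"/"appends") stay reserved for query params
    public void init(NamedList args) {
      spellcheckerIndexDir = (String) args.get("spellcheckerIndexDir");
      termSourceField = (String) args.get("termSourceField");
      defaults = (NamedList) args.get("defaults");
    }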
7) adding a special field to the example schema (and example docs) to
demonstrate building the SpellChecker index is a bit confusing ... we should
just build the dictionary off of an existing field that contains data (like
"name" or "text") to demonstrate the common use case.
> Add Query Spellchecker functionality
> ------------------------------------
>
> Key: SOLR-81
> URL: https://issues.apache.org/jira/browse/SOLR-81
> Project: Solr
> Issue Type: New Feature
> Components: search
> Reporter: Otis Gospodnetic
> Priority: Minor
> Attachments: SOLR-81-edgengram-ngram.patch,
> SOLR-81-ngram-schema.patch, SOLR-81-ngram.patch, SOLR-81-ngram.patch,
> SOLR-81-ngram.patch, SOLR-81-ngram.patch, SOLR-81-spellchecker.patch,
> SOLR-81-spellchecker.patch
>
>
> Use the simple approach of n-gramming outside of Solr and indexing n-gram
> documents. For example:
> <doc>
> <field name="word">lettuce</field>
> <field name="start3">let</field>
> <field name="gram3">let ett ttu tuc uce</field>
> <field name="end3">uce</field>
> <field name="start4">lett</field>
> <field name="gram4">lett ettu ttuc tuce</field>
> <field name="end4">tuce</field>
> </doc>
> See:
> http://www.mail-archive.com/[email protected]/msg01254.html
> Java clients: SOLR-20 (add delete commit optimize), SOLR-30 (search)
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.