Re: Spell checking ?'s

Grant Ingersoll Fri, 22 Feb 2008 14:12:07 -0800

Yeah, context can play a role, but that is up to the Analyzer used to determine. I will open a JIRA issue to address the problem as it exists now and a fix to do the analysis before submitting the terms.


-Grant


On Feb 22, 2008, at 4:03 PM, Sean Timm wrote:

Sometimes context can play into the correct spelling of a term. I haven't looked at the 1.3 spell check stuff, but it would be nice to do term n-gramming in order to check the terms in context.
Since Otis brought up Google, here is an example of putting the term into context.
http://www.google.com/search?q=choudhury
http://www.google.com/search?q=abdur+choudhury
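
As a rough illustration of checking a term in context, the sketch below ranks candidate corrections by how often each one follows the preceding term. It is only a sketch under stated assumptions: the bigram counts, the candidate list, and the function name `pick_in_context` are all hypothetical and are not part of the Solr 1.3 spell checker.

```python
from collections import Counter

# Hypothetical bigram counts, e.g. gathered from indexed text or query logs.
BIGRAM_COUNTS = Counter({
    ("apache", "lucene"): 40,
    ("apache", "license"): 25,
    ("software", "license"): 60,
})

def pick_in_context(previous_term, candidates):
    """Return the candidate that most often follows previous_term.

    Unseen bigrams count as 0. On ties, max() keeps the first candidate,
    so the incoming candidate order acts as the fallback ranking.
    """
    return max(candidates, key=lambda c: BIGRAM_COUNTS[(previous_term, c)])

# Assumed output of a per-term spell check for the misspelling "lucense".
candidates = ["license", "lucene"]
print(pick_in_context("apache", candidates))
print(pick_in_context("software", candidates))
```

The same misspelled term resolves differently depending on its neighbor, which is the effect that checking each term in isolation cannot reproduce.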

-Sean

Otis Gospodnetic wrote:
Haven't used SCRH in a while, but what you are describing sounds right (thinking about how Google does it): each word should be checked separately and we shouldn't assume splitting on whitespace. I'm trying to think if there are cases where you'd want to look at the surrounding terms instead of looking at each term in isolation... can't think of anything exciting... maybe ensure that words with dashes are properly handled.
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
From: Grant Ingersoll <[EMAIL PROTECTED]>
To: [email protected]
Sent: Thursday, February 21, 2008 3:13:20 PM
Subject: Spell checking ?'s

Hi,
I've been looking a bit at the spell checker and the implementation in the SpellCheckerRequestHandler and I have some questions.
In looking at the code and the wiki, the SpellChecker seems to treat multiword queries differently depending on whether extendedResults is true or not. Is the use case a multiword query or a single word query? It seems like one would want to pass the whole query to the spell checker and have it come back with results for each word, by default. Otherwise, the application would need to do the tokenization and send each term one by one to the spell checker. However, the app likely doesn't have access to the spell check tokenizer, so this is difficult.
Which leads me to the next question: in the extendedResults, shouldn't it use the Query analyzer for the spellcheck field to tokenize the terms instead of splitting on the space character?
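
To make the difference concrete, here is a small sketch comparing a split on the space character with an analyzer-style tokenization. The function `analyzer_tokens` is a hypothetical stand-in (lowercasing plus splitting on non-alphanumerics); a real query analyzer for the spellcheck field may behave differently.

```python
import re

def whitespace_tokens(query):
    # Splitting on the space character: punctuation and hyphens stay attached.
    return query.split(" ")

def analyzer_tokens(query):
    # Hypothetical stand-in for a query analyzer: lowercase the input, then
    # keep runs of letters/digits, treating hyphens and punctuation as breaks.
    return re.findall(r"[a-z0-9]+", query.lower())

query = 'Spell-checking "Lucene," quickly'
print(whitespace_tokens(query))
print(analyzer_tokens(query))
```

With the space split, a token such as `"Lucene,"` keeps its quotes and comma and would be looked up in that form, so it is unlikely to match any term that was analyzed at index time.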
Would it make sense to, for extendedResults anyway, do the following:
Tokenize the query using the query analyzer for the spelling field
for each token
   spell check the token
   add the results
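
The steps above could be sketched roughly as follows. This is not the SpellCheckerRequestHandler code: `DICTIONARY`, `analyze`, and `spell_check` are hypothetical names, the analyzer is a simple regex stand-in, and Python's `difflib.get_close_matches` is swapped in for the actual spell checker purely for illustration.

```python
import difflib
import re

# Hypothetical dictionary standing in for the terms in the spelling field.
DICTIONARY = ["lucene", "solr", "spell", "checker", "query"]

def analyze(query):
    # Stand-in for the spelling field's query analyzer (an assumption).
    return re.findall(r"[a-z0-9]+", query.lower())

def spell_check(query, max_suggestions=3):
    """Tokenize the whole query, then spell check each token separately."""
    results = {}
    for token in analyze(query):
        if token in DICTIONARY:
            results[token] = []  # known term, no suggestions needed
        else:
            # Closest dictionary terms, best match first.
            results[token] = difflib.get_close_matches(
                token, DICTIONARY, n=max_suggestions)
    return results

print(spell_check("Lucen spel-cheker"))
```

The caller passes the whole query once and gets back suggestions keyed by token, so the application never needs access to the tokenizer itself.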
I see that extendedResults is a 1.3 addition, so we would be fine to change it, if it makes sense.
Perhaps, for back compatibility, we keep the existing way for non-extendedResults. However, it seems like multiword queries should be split even in the non-extended results, but I am not sure. How are others using it?
Thanks,
Grant


--------------------------
Grant Ingersoll
http://www.lucenebootcamp.com
Next Training: April 7, 2008 at ApacheCon Europe in Amsterdam

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ
