Yeah, context can play a role, but it is up to the Analyzer in use to
determine that. I will open a JIRA issue to address the problem as it
exists now, with a fix that runs the analysis before submitting the terms.
-Grant
On Feb 22, 2008, at 4:03 PM, Sean Timm wrote:
Sometimes context can play into the correct spelling of a term. I
haven't looked at the 1.3 spell check stuff, but it would be nice to
do term n-gramming in order to check the terms in context.
Since Otis brought up Google, here is an example of putting the term
into context.
http://www.google.com/search?q=choudhury
http://www.google.com/search?q=abdur+choudhury
-Sean
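As a rough illustration of checking a term in context with term n-grams, here is a minimal Python sketch. It is not how Solr or Google does it. The function names (`bigram_counts`, `pick_in_context`) and the toy corpus are hypothetical, and it assumes candidate corrections have already been produced by some per-term spell checker. It simply prefers the candidate that most often follows the preceding term in the corpus.

```python
from collections import Counter
from typing import Iterable, List


def bigram_counts(corpus: Iterable[List[str]]) -> Counter:
    """Count adjacent term pairs (bigrams) in a tokenized corpus."""
    counts: Counter = Counter()
    for sentence in corpus:
        counts.update(zip(sentence, sentence[1:]))
    return counts


def pick_in_context(prev_term: str, candidates: List[str], counts: Counter) -> str:
    """Pick the candidate seen most often right after prev_term.

    Ties (including all-zero counts) keep the earliest candidate, so the
    per-term ranking still decides when the context says nothing.
    """
    return max(candidates, key=lambda c: counts[(prev_term, c)])


# Toy corpus, purely illustrative.
corpus = [
    ["new", "york", "city"],
    ["new", "york", "times"],
    ["old", "work", "horse"],
]
counts = bigram_counts(corpus)
best = pick_in_context("new", ["work", "york"], counts)
```

Here `best` is chosen from the pair counts alone; a real implementation would likely combine this score with edit distance and term frequency.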
Otis Gospodnetic wrote:
Haven't used SCRH in a while, but what you are describing sounds
right (thinking about how Google does it) - each word should be
checked separately and we shouldn't assume splitting on
whitespace. I'm trying to think if there are cases where you'd
want to look at the surrounding terms instead of looking at each
term in isolation... can't think of anything exciting... maybe
ensure that words with dashes are properly handled.
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
----- Original Message ----
From: Grant Ingersoll <[EMAIL PROTECTED]>
To: [email protected]
Sent: Thursday, February 21, 2008 3:13:20 PM
Subject: Spell checking ?'s
Hi,
I've been looking a bit at the spell checker and the
implementation in the SpellCheckerRequestHandler and I have some
questions.
In looking at the code and the wiki, the SpellChecker seems to
treat multiword queries differently depending on whether
extendedResults is true or not. Is the use case a multiword
query or a single word query? It seems like one would want to
pass the whole query to the spell checker and have it come back
with results for each word, by default. Otherwise, the
application would need to do the tokenization and send each term
one by one to the spell checker. However, the app likely doesn't
have access to the spell check tokenizer, so this is difficult.
Which leads me to the next question: for extendedResults,
shouldn't it use the query analyzer for the spellcheck field to
tokenize the terms instead of splitting on the space character?
Would it make sense to, for extendedResults anyway, do the
following:
Tokenize the query using the query analyzer for the spelling field
for each token
    spell check the token
    add the results
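The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the SpellCheckerRequestHandler code: `analyze` is a hypothetical stand-in for the spelling field's query analyzer (here it lowercases and keeps dashed words together), and `suggest` is a hypothetical edit-distance lookup against an in-memory dictionary rather than the Lucene SpellChecker index.

```python
import re
from typing import Callable, Dict, List, Set


def analyze(query: str) -> List[str]:
    """Hypothetical query analyzer: lowercase, keep dashed words whole."""
    pattern = r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*"
    return [t.lower() for t in re.findall(pattern, query)]


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via row-by-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def suggest(token: str, dictionary: Set[str], max_distance: int = 2) -> List[str]:
    """Hypothetical spell check: dictionary words within max_distance edits."""
    scored = [(edit_distance(token, w), w) for w in dictionary]
    return [w for d, w in sorted(scored) if d <= max_distance]


def spell_check_query(
    query: str,
    dictionary: Set[str],
    analyzer: Callable[[str], List[str]] = analyze,
) -> Dict[str, List[str]]:
    """Tokenize with the analyzer, spell check each token, collect results."""
    results: Dict[str, List[str]] = {}
    for token in analyzer(query):
        # A token found in the dictionary gets an empty suggestion list.
        results[token] = [] if token in dictionary else suggest(token, dictionary)
    return results


results = spell_check_query("Lucene serch", {"lucene", "search", "solr"})
```

The point of passing `analyzer` in as a parameter is that the caller never splits on whitespace itself; whatever tokenization the spelling field uses is applied before any term reaches the spell checker.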
I see that extendedResults is a 1.3 addition, so we would be fine
to change it, if it makes sense.
Perhaps, for back compatibility, we keep the existing behavior when
extendedResults is false. However, it seems like multiword queries
should be split even in the non-extended results, but I am not sure.
How are others using it?
Thanks,
Grant
--------------------------
Grant Ingersoll
http://www.lucenebootcamp.com
Next Training: April 7, 2008 at ApacheCon Europe in Amsterdam
Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ