RE: frequent terms - Re: combining open office spellchecker with Lucene

2004-09-16 Thread Aad Nales
Also,

You can use an alternative spellchecker for the 'checking' part and
use the Ngram algorithm for the 'suggestion' part. Only if the spell
check declares a word illegal would the 'suggestion' part perform its
magic.
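
A minimal sketch of that two-stage arrangement. It assumes a plain word set stands in for the alternative spellchecker's dictionary and a character-trigram overlap score stands in for the Ngram suggester; all names here are illustrative and none of this is the OpenOffice or Lucene code:

```python
def ngrams(word, n=3):
    """Return the set of character n-grams of word, padded at both ends."""
    padded = "_" * (n - 1) + word + "_" * (n - 1)
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def suggest(word, vocabulary, limit=3):
    """Rank vocabulary words by n-gram overlap (Jaccard score) with word."""
    target = ngrams(word)
    scored = []
    for candidate in vocabulary:
        grams = ngrams(candidate)
        score = len(target & grams) / len(target | grams)
        if score > 0:
            scored.append((score, candidate))
    # Best score first; ties are broken alphabetically for stable output.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [candidate for _, candidate in scored[:limit]]


def check_then_suggest(word, dictionary, vocabulary):
    """Run the suggester only when the checker rejects the word."""
    if word in dictionary:
        return []
    return suggest(word, vocabulary)
```

With a dictionary of "recursive", "descent" and "parser", a call for "descent" returns nothing because the checker accepts it, while "recursize" falls through to the suggester.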


cheers,
Aad

Doug Cutting wrote:

 David Spencer wrote:
 
 [1] The user enters a query like:
 recursize descent parser

 [2] The search code parses this and sees that the 1st word is not a
 term in the index, but the next 2 are. So it ignores the last 2 terms
 (descent and parser) and suggests alternatives to
 recursize...thus if any term is in the index, regardless of
 frequency, it is left as-is.

 I guess you're saying that, if the user enters a term that appears in
 the index and thus is sort of spelled correctly (as it exists in some
 doc), then we use the heuristic that any sufficiently large doc
 collection will have tons of misspellings, so we assume that rare
 terms in the query might be misspelled (i.e. not what the user
 intended) and we suggest alternatives to these words too (in addition
 to the words in the query that are not in the index at all).
 
 
 Almost.
 
 If the user enters "a recursize purser", then: "a", which is in, say,
 50% of the documents, is probably spelled correctly and "recursize",
 which is in zero documents, is probably misspelled.  But what about
 "purser"?  If we run the spell check algorithm on "purser" and generate
 "parser", should we show it to the user?  If "purser" occurs in 1% of
 documents and "parser" occurs in 5%, then we probably should, since
 "parser" is a more common word than "purser".  But if "parser" only
 occurs in 1% of the documents and "purser" occurs in 5%, then we probably
 shouldn't bother suggesting "parser".
 
 If you wanted to get really fancy then you could check how frequently
 combinations of query terms occur, i.e., does "purser" or "parser" occur
 more frequently near "descent".  But that gets expensive.

I updated the code to have an optional popularity filter - if true then 
it only returns matches more popular (frequent) than the word that is 
passed in for spelling correction.

If true (default) then for common words like "remove", no results are
returned now, as expected:

http://www.searchmorph.com/kat/spell.jsp?s=remove

But if you set it to false (bottom slot in the form at the bottom of the
page) then the algorithm happily looks for alternatives:

http://www.searchmorph.com/kat/spell.jsp?s=removemin=2max=5maxd=5maxr=10bstart=2.0bend=1.0btranspose=1.0popular=0

TBD: I need to update the javadoc and repost the code, I guess. Also, as per
an earlier post, I also store simple transpositions for words in the
ngram-index.

-- Dave

 
 Doug
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 





Re: frequent terms - Re: combining open office spellchecker with Lucene

2004-09-15 Thread David Spencer
Doug Cutting wrote:
David Spencer wrote:
[1] The user enters a query like:
recursize descent parser
[2] The search code parses this and sees that the 1st word is not a
term in the index, but the next 2 are. So it ignores the last 2 terms
(descent and parser) and suggests alternatives to
recursize...thus if any term is in the index, regardless of
frequency, it is left as-is.

I guess you're saying that, if the user enters a term that appears in
the index and thus is sort of spelled correctly (as it exists in some
doc), then we use the heuristic that any sufficiently large doc
collection will have tons of misspellings, so we assume that rare
terms in the query might be misspelled (i.e. not what the user
intended) and we suggest alternatives to these words too (in addition
to the words in the query that are not in the index at all).

Almost.
If the user enters "a recursize purser", then: "a", which is in, say,
50% of the documents, is probably spelled correctly and "recursize",
which is in zero documents, is probably misspelled.  But what about
"purser"?  If we run the spell check algorithm on "purser" and generate
"parser", should we show it to the user?  If "purser" occurs in 1% of
documents and "parser" occurs in 5%, then we probably should, since
"parser" is a more common word than "purser".  But if "parser" only
occurs in 1% of the documents and "purser" occurs in 5%, then we probably
shouldn't bother suggesting "parser".

If you wanted to get really fancy then you could check how frequently
combinations of query terms occur, i.e., does "purser" or "parser" occur
more frequently near "descent".  But that gets expensive.

I updated the code to have an optional popularity filter - if true then
it only returns matches that are more popular (frequent) than the word
that is passed in for spelling correction.

If true (default) then for common words like "remove", no results are
returned now, as expected:

http://www.searchmorph.com/kat/spell.jsp?s=remove

But if you set it to false (bottom slot in the form at the bottom of the
page) then the algorithm happily looks for alternatives:

http://www.searchmorph.com/kat/spell.jsp?s=removemin=2max=5maxd=5maxr=10bstart=2.0bend=1.0btranspose=1.0popular=0
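
For readers following along, the popularity filter described above might look roughly like this. It is only a sketch, not the posted spelling code: the function name, the flag name and the document frequencies in the example are all made up for illustration.

```python
def popularity_filter(word, matches, doc_freq, more_popular_only=True):
    """Optionally drop matches that are not more frequent than word.

    doc_freq maps a term to the number of documents containing it;
    terms missing from the map are treated as occurring in zero documents.
    """
    if not more_popular_only:
        return list(matches)
    base = doc_freq.get(word, 0)
    return [m for m in matches if doc_freq.get(m, 0) > base]


# Hypothetical frequencies: "remove" is already the most common spelling.
doc_freq = {"remove": 500, "removed": 120, "remote": 80}
filtered = popularity_filter("remove", ["removed", "remote"], doc_freq)
unfiltered = popularity_filter("remove", ["removed", "remote"], doc_freq,
                               more_popular_only=False)
```

Here filtered is empty, matching the "no results for common words" behaviour, and unfiltered keeps both alternatives.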
TBD: I need to update the javadoc and repost the code, I guess. Also, as per
an earlier post, I also store simple transpositions for words in the
ngram-index.
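
One way to picture the transposition idea is below. This is a guess at the general technique rather than what the posted code does: the thread does not say how the transpositions are stored, so the lookup table and both function names are assumptions.

```python
def simple_transpositions(word):
    """Return every string made by swapping one adjacent pair of letters.

    Swaps of two identical letters are skipped, since they would just
    reproduce the original word.
    """
    variants = []
    for i in range(len(word) - 1):
        if word[i] != word[i + 1]:
            variants.append(word[:i] + word[i + 1] + word[i] + word[i + 2:])
    return variants


def build_transposition_index(words):
    """Map each transposed form to the set of real words it came from."""
    index = {}
    for word in words:
        for variant in simple_transpositions(word):
            index.setdefault(variant, set()).add(word)
    return index
```

A lookup of a typo such as "teh" in the resulting table then leads straight back to "the".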

-- Dave
Doug


Re: frequent terms - Re: combining open office spellchecker with Lucene

2004-09-14 Thread Doug Cutting
David Spencer wrote:
[1] The user enters a query like:
recursize descent parser
[2] The search code parses this and sees that the 1st word is not a term
in the index, but the next 2 are. So it ignores the last 2 terms
(descent and parser) and suggests alternatives to
recursize...thus if any term is in the index, regardless of frequency,
it is left as-is.

I guess you're saying that, if the user enters a term that appears in
the index and thus is sort of spelled correctly (as it exists in some
doc), then we use the heuristic that any sufficiently large doc
collection will have tons of misspellings, so we assume that rare terms
in the query might be misspelled (i.e. not what the user intended) and
we suggest alternatives to these words too (in addition to the words in
the query that are not in the index at all).
Almost.
If the user enters "a recursize purser", then: "a", which is in, say,
50% of the documents, is probably spelled correctly and "recursize",
which is in zero documents, is probably misspelled.  But what about
"purser"?  If we run the spell check algorithm on "purser" and generate
"parser", should we show it to the user?  If "purser" occurs in 1% of
documents and "parser" occurs in 5%, then we probably should, since
"parser" is a more common word than "purser".  But if "parser" only
occurs in 1% of the documents and "purser" occurs in 5%, then we probably
shouldn't bother suggesting "parser".
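
The example above can be walked through in a few lines. This is a rough sketch under stated assumptions: the candidate respellings are handed in as a plain map rather than produced by a real spell-check algorithm, the function name is invented, and the 2% figure for "recursive" is made up (the other fractions are the ones from the example).

```python
def suggestions_for_query(query, doc_fraction, candidates):
    """Map each query term to the respellings worth showing the user.

    doc_fraction: term -> fraction of documents containing it (0.0 if absent).
    candidates:   term -> respellings proposed by some spell-check algorithm.
    A respelling is shown only if it is more common than the entered term.
    """
    shown = {}
    for term in query.split():
        term_fraction = doc_fraction.get(term, 0.0)
        better = [c for c in candidates.get(term, [])
                  if doc_fraction.get(c, 0.0) > term_fraction]
        if better:
            shown[term] = better
    return shown


candidates = {"recursize": ["recursive"], "purser": ["parser"]}

# Case 1: "purser" in 1% of documents, "parser" in 5%.
case1 = suggestions_for_query(
    "a recursize purser",
    {"a": 0.50, "purser": 0.01, "parser": 0.05, "recursive": 0.02},
    candidates)

# Case 2: the frequencies of "purser" and "parser" are swapped.
case2 = suggestions_for_query(
    "a recursize purser",
    {"a": 0.50, "purser": 0.05, "parser": 0.01, "recursive": 0.02},
    candidates)
```

In case 1 both "recursize" and "purser" get a suggestion; in case 2 only "recursize" does.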

If you wanted to get really fancy then you could check how frequently
combinations of query terms occur, i.e., does "purser" or "parser" occur
more frequently near "descent".  But that gets expensive.
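
A toy version of that proximity check, assuming the documents are available as token lists and scanned directly. The window size, function names and sample documents are all invented for illustration; a real engine would use positional index data instead, and the full scan also shows why this gets expensive.

```python
def near_count(docs, candidate, context, window=3):
    """Count occurrences of candidate within `window` tokens of context."""
    count = 0
    for tokens in docs:
        context_positions = [i for i, t in enumerate(tokens) if t == context]
        for i, token in enumerate(tokens):
            if token == candidate and any(
                    abs(i - j) <= window for j in context_positions):
                count += 1
    return count


def pick_by_context(docs, candidates, context, window=3):
    """Return the candidate that appears near context most often."""
    return max(candidates, key=lambda c: near_count(docs, c, context, window))


docs = [
    ["recursive", "descent", "parser"],
    ["the", "purser", "greeted", "passengers"],
    ["a", "descent", "parser", "example"],
]
best = pick_by_context(docs, ["purser", "parser"], "descent")
```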

Doug


Re: frequent terms - Re: combining open office spellchecker with Lucene

2004-09-14 Thread David Spencer
Doug Cutting wrote:
David Spencer wrote:
[1] The user enters a query like:
recursize descent parser
[2] The search code parses this and sees that the 1st word is not a
term in the index, but the next 2 are. So it ignores the last 2 terms
(descent and parser) and suggests alternatives to
recursize...thus if any term is in the index, regardless of
frequency, it is left as-is.

I guess you're saying that, if the user enters a term that appears in
the index and thus is sort of spelled correctly (as it exists in some
doc), then we use the heuristic that any sufficiently large doc
collection will have tons of misspellings, so we assume that rare
terms in the query might be misspelled (i.e. not what the user
intended) and we suggest alternatives to these words too (in addition
to the words in the query that are not in the index at all).

Almost.
If the user enters "a recursize purser", then: "a", which is in, say,
50% of the documents, is probably spelled correctly and "recursize",
which is in zero documents, is probably misspelled.  But what about
"purser"?  If we run the spell check algorithm on "purser" and generate
"parser", should we show it to the user?  If "purser" occurs in 1% of
documents and "parser" occurs in 5%, then we probably should, since
"parser" is a more common word than "purser".  But if "parser" only
occurs in 1% of the documents and "purser" occurs in 5%, then we probably
shouldn't bother suggesting "parser".
OK, sure, got it.
I'll give it a think and try to add this option to my just-submitted
spelling code.


If you wanted to get really fancy then you could check how frequently
combinations of query terms occur, i.e., does "purser" or "parser" occur
more frequently near "descent".  But that gets expensive.
Yeah, expensive for a large-scale search engine, but probably
appropriate for a desktop engine.

Doug


Re: frequent terms - Re: combining open office spellchecker with Lucene

2004-09-10 Thread Doug Cutting
David Spencer wrote:
Doug Cutting wrote:
And one should not try correction at all for terms which occur in a 
large proportion of the collection.

I keep thinking over this one and I don't understand it. If a user
misspells a word and the "did you mean" spelling correction algorithm
determines that a frequent term is a good suggestion, why not suggest
it? The very fact that it's common could mean that it's more likely that
the user wanted this word (well, the heuristic here is that users
frequently search for frequent terms, which is probably wrong, but
anyway..).
I think you misunderstood me.  What I meant to say was that if the term 
the user enters is very common then spell correction may be skipped. 
Very common words which are similar to the term the user entered should 
of course be shown.  But if the user's term is very common one need not 
even attempt to find similarly-spelled words.  Is that any better?
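
In code form, the "skip correction for very common terms" rule could be as small as this. A sketch only: the 25% cutoff and the function name are assumptions, since no specific threshold is given in this thread.

```python
def should_attempt_correction(term, doc_freq, num_docs, common_fraction=0.25):
    """Return False when the entered term is common enough to skip correction.

    doc_freq maps a term to the number of documents containing it.
    common_fraction is an assumed cutoff, not a value from this thread.
    """
    if num_docs <= 0:
        raise ValueError("num_docs must be positive")
    return doc_freq.get(term, 0) / num_docs < common_fraction
```

Note that the test applies only to the term the user typed. Common words that turn up as suggestions for some other, rarer term are still shown.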

Doug


Re: frequent terms - Re: combining open office spellchecker with Lucene

2004-09-10 Thread David Spencer
Doug Cutting wrote:
David Spencer wrote:
Doug Cutting wrote:
And one should not try correction at all for terms which occur in a 
large proportion of the collection.

I keep thinking over this one and I don't understand it. If a user
misspells a word and the "did you mean" spelling correction algorithm
determines that a frequent term is a good suggestion, why not suggest
it? The very fact that it's common could mean that it's more likely
that the user wanted this word (well, the heuristic here is that users
frequently search for frequent terms, which is probably wrong, but
anyway..).

I think you misunderstood me.  What I meant to say was that if the term 
the user enters is very common then spell correction may be skipped. 
Very common words which are similar to the term the user entered should 
of course be shown.  But if the user's term is very common one need not 
even attempt to find similarly-spelled words.  Is that any better?
Yes, sure, thx, I understand now - but maybe not - the context I was
thinking of was something like this:

[1] The user enters a query like:
recursize descent parser
[2] The search code parses this and sees that the 1st word is not a term
in the index, but the next 2 are. So it ignores the last 2 terms
(descent and parser) and suggests alternatives to
recursize...thus if any term is in the index, regardless of frequency,
it is left as-is.
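
Steps [1] and [2] boil down to a membership test against the index. A minimal sketch, assuming the index vocabulary is available as a set of lowercased terms and the query is split on whitespace (a real query parser would do more); the function name is invented:

```python
def terms_needing_suggestions(query, index_terms):
    """Return the query terms, in order, that do not occur in the index."""
    return [term for term in query.lower().split()
            if term not in index_terms]


index_terms = {"recursive", "descent", "parser"}
missing = terms_needing_suggestions("recursize descent parser", index_terms)
```

Only "recursize" ends up in missing, so it is the only term that would get alternatives under this scheme.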

I guess you're saying that, if the user enters a term that appears in
the index and thus is sort of spelled correctly (as it exists in some
doc), then we use the heuristic that any sufficiently large doc
collection will have tons of misspellings, so we assume that rare terms
in the query might be misspelled (i.e. not what the user intended) and
we suggest alternatives to these words too (in addition to the words in
the query that are not in the index at all).


Doug