RE: frequent terms - Re: combining open office spellchecker with Lucene
Also, you can use an alternative spellchecker for the 'checking' part and the n-gram algorithm for the 'suggestion' part. Only if the spell 'check' declares a word illegal would the 'suggestion' part perform its magic.

cheers,
Aad

> Doug Cutting wrote:
>> David Spencer wrote:
>>> [1] The user enters a query like: recursize descent parser
>>>
>>> [2] The search code parses this and sees that the 1st word is not a term in the index, but the next 2 are. So it ignores the last 2 terms (descent and parser) and suggests alternatives to recursize... thus if any term is in the index, regardless of frequency, it is left as-is.
>>>
>>> I guess you're saying that, if the user enters a term that appears in the index and thus is sort of spelled correctly (as it exists in some doc), then we use the heuristic that any sufficiently large doc collection will have tons of misspellings, so we assume that rare terms in the query might be misspelled (i.e. not what the user intended) and we suggest alternatives to these words too (in addition to the words in the query that are not in the index at all).
>>
>> Almost.
>>
>> If the user enters 'a recursize purser', then: 'a', which is in, say, 50% of the documents, is probably spelled correctly and 'recursize', which is in zero documents, is probably misspelled. But what about 'purser'? If we run the spell check algorithm on 'purser' and generate 'parser', should we show it to the user? If 'purser' occurs in 1% of documents and 'parser' occurs in 5%, then we probably should, since 'parser' is a more common word than 'purser'. But if 'parser' only occurs in 1% of the documents and 'purser' occurs in 5%, then we probably shouldn't bother suggesting 'parser'.
>>
>> If you wanted to get really fancy then you could check how frequently combinations of query terms occur, i.e., does 'purser' or 'parser' occur more frequently near 'descent'. But that gets expensive.
>
> I updated the code to have an optional popularity filter - if true then it only returns matches more popular (frequent) than the word that is passed in for spelling correction.
>
> If true (default) then for common words like 'remove', no results are returned now, as expected:
>
> http://www.searchmorph.com/kat/spell.jsp?s=remove
>
> But if you set it to false (bottom slot in the form at the bottom of the page) then the algorithm happily looks for alternatives:
>
> http://www.searchmorph.com/kat/spell.jsp?s=remove&min=2&max=5&maxd=5&maxr=10&bstart=2.0&bend=1.0&btranspose=1.0&popular=0
>
> TBD: I need to update the javadoc and repost the code, I guess. Also, as per an earlier post, I store simple transpositions for words in the ngram-index.
>
> -- Dave

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
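A minimal sketch of the two-stage idea in this message: a 'check' step decides whether a word is legal, and the n-gram 'suggestion' step runs only when the check fails. The function names, the `^`/`$` padding, the Jaccard scoring, and the tiny word lists are all assumptions made for illustration; this is not the code posted to the list.

```python
def ngrams(word, n=2):
    """Return the set of character n-grams of a word, padded at both ends."""
    padded = "^" + word + "$"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def suggest(word, vocabulary, max_results=3):
    """Rank vocabulary words by n-gram overlap (Jaccard similarity) with `word`."""
    target = ngrams(word)
    scored = []
    for candidate in vocabulary:
        grams = ngrams(candidate)
        score = len(target & grams) / len(target | grams)
        if score > 0:
            scored.append((score, candidate))
    # Highest score first; ties broken alphabetically for stable output.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [candidate for _, candidate in scored[:max_results]]


def check_then_suggest(word, dictionary, vocabulary):
    """Run the suggestion step only when the checker rejects the word."""
    if word in dictionary:
        return []
    return suggest(word, vocabulary)


# Hypothetical data: `dictionary` stands in for the external spellchecker,
# `vocabulary` for the terms held in the n-gram index.
dictionary = {"recursive", "descent", "parser", "purser"}
vocabulary = ["recursive", "descent", "parser", "purser", "remove"]

print(check_then_suggest("parser", dictionary, vocabulary))
print(check_then_suggest("recursize", dictionary, vocabulary))
```

Keeping the check separate means the comparatively costly n-gram lookup is skipped for every word the checker already accepts.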
Re: frequent terms - Re: combining open office spellchecker with Lucene
Doug Cutting wrote:
> David Spencer wrote:
>> [1] The user enters a query like: recursize descent parser
>>
>> [2] The search code parses this and sees that the 1st word is not a term in the index, but the next 2 are. So it ignores the last 2 terms (descent and parser) and suggests alternatives to recursize... thus if any term is in the index, regardless of frequency, it is left as-is.
>>
>> I guess you're saying that, if the user enters a term that appears in the index and thus is sort of spelled correctly (as it exists in some doc), then we use the heuristic that any sufficiently large doc collection will have tons of misspellings, so we assume that rare terms in the query might be misspelled (i.e. not what the user intended) and we suggest alternatives to these words too (in addition to the words in the query that are not in the index at all).
>
> Almost.
>
> If the user enters 'a recursize purser', then: 'a', which is in, say, 50% of the documents, is probably spelled correctly and 'recursize', which is in zero documents, is probably misspelled. But what about 'purser'? If we run the spell check algorithm on 'purser' and generate 'parser', should we show it to the user? If 'purser' occurs in 1% of documents and 'parser' occurs in 5%, then we probably should, since 'parser' is a more common word than 'purser'. But if 'parser' only occurs in 1% of the documents and 'purser' occurs in 5%, then we probably shouldn't bother suggesting 'parser'.
>
> If you wanted to get really fancy then you could check how frequently combinations of query terms occur, i.e., does 'purser' or 'parser' occur more frequently near 'descent'. But that gets expensive.

I updated the code to have an optional popularity filter - if true then it only returns matches more popular (frequent) than the word that is passed in for spelling correction.

If true (default) then for common words like 'remove', no results are returned now, as expected:

http://www.searchmorph.com/kat/spell.jsp?s=remove

But if you set it to false (bottom slot in the form at the bottom of the page) then the algorithm happily looks for alternatives:

http://www.searchmorph.com/kat/spell.jsp?s=remove&min=2&max=5&maxd=5&maxr=10&bstart=2.0&bend=1.0&btranspose=1.0&popular=0

TBD: I need to update the javadoc and repost the code, I guess. Also, as per an earlier post, I store simple transpositions for words in the ngram-index.

-- Dave
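The popularity filter described in this message can be sketched roughly as follows. The flag name `more_popular` mirrors the `popular` URL parameter but is otherwise an assumption, as are the helper names and the made-up document counts. The `transpositions` helper only shows what "simple transpositions" of a word might look like; how they are actually stored in the n-gram index is not specified in the thread.

```python
def filter_by_popularity(word, candidates, doc_freq, more_popular=True):
    """Keep only candidates that occur in more documents than `word`.

    With more_popular=False the candidates are returned unfiltered.
    """
    if not more_popular:
        return list(candidates)
    baseline = doc_freq.get(word, 0)
    return [c for c in candidates if doc_freq.get(c, 0) > baseline]


def transpositions(word):
    """Return every string formed by swapping one pair of adjacent letters."""
    swapped = set()
    for i in range(len(word) - 1):
        if word[i] != word[i + 1]:  # swapping equal letters changes nothing
            swapped.add(word[:i] + word[i + 1] + word[i] + word[i + 2:])
    return swapped


# Hypothetical document frequencies (number of documents containing the term).
doc_freq = {"remove": 900, "removed": 400, "remote": 300}
candidates = ["removed", "remote"]

print(filter_by_popularity("remove", candidates, doc_freq))
print(filter_by_popularity("remove", candidates, doc_freq, more_popular=False))
print(sorted(transpositions("teh")))
```

With the filter on, a common word such as 'remove' yields no suggestions because none of its neighbours is more frequent; with it off, the neighbours come back regardless of frequency.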
Re: frequent terms - Re: combining open office spellchecker with Lucene
David Spencer wrote:
> [1] The user enters a query like: recursize descent parser
>
> [2] The search code parses this and sees that the 1st word is not a term in the index, but the next 2 are. So it ignores the last 2 terms (descent and parser) and suggests alternatives to recursize... thus if any term is in the index, regardless of frequency, it is left as-is.
>
> I guess you're saying that, if the user enters a term that appears in the index and thus is sort of spelled correctly (as it exists in some doc), then we use the heuristic that any sufficiently large doc collection will have tons of misspellings, so we assume that rare terms in the query might be misspelled (i.e. not what the user intended) and we suggest alternatives to these words too (in addition to the words in the query that are not in the index at all).

Almost.

If the user enters 'a recursize purser', then: 'a', which is in, say, 50% of the documents, is probably spelled correctly and 'recursize', which is in zero documents, is probably misspelled. But what about 'purser'? If we run the spell check algorithm on 'purser' and generate 'parser', should we show it to the user? If 'purser' occurs in 1% of documents and 'parser' occurs in 5%, then we probably should, since 'parser' is a more common word than 'purser'. But if 'parser' only occurs in 1% of the documents and 'purser' occurs in 5%, then we probably shouldn't bother suggesting 'parser'.

If you wanted to get really fancy then you could check how frequently combinations of query terms occur, i.e., does 'purser' or 'parser' occur more frequently near 'descent'. But that gets expensive.

Doug
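The 'a recursize purser' example can be worked through in a few lines. This is only a sketch of the frequency comparison described above: the function name, the candidate table, and the 3% figure for 'recursive' are assumptions, while the 50%, 0%, 1% and 5% figures come from the message.

```python
def did_you_mean(query, doc_fraction, candidates):
    """Rewrite a query term by term.

    A term is replaced only when one of its spelling candidates occurs in a
    larger fraction of documents than the term itself. Returns None when
    nothing would change.
    """
    rewritten = []
    for term in query.split():
        best = term
        for candidate in candidates.get(term, []):
            if doc_fraction.get(candidate, 0.0) > doc_fraction.get(best, 0.0):
                best = candidate
        rewritten.append(best)
    suggestion = " ".join(rewritten)
    return suggestion if suggestion != query else None


# Candidates as a spell check algorithm might produce them (hypothetical).
candidates = {"recursize": ["recursive"], "purser": ["parser"]}

# Case 1: 'purser' in 1% of documents, 'parser' in 5%.
case1 = {"a": 0.50, "recursize": 0.0, "recursive": 0.03,
         "purser": 0.01, "parser": 0.05}
# Case 2: the frequencies of 'purser' and 'parser' are swapped.
case2 = dict(case1, purser=0.05, parser=0.01)

print(did_you_mean("a recursize purser", case1, candidates))
print(did_you_mean("a recursize purser", case2, candidates))
```

In the first case 'purser' is replaced by 'parser'; in the second it is left alone, and only 'recursize' is corrected.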
Re: frequent terms - Re: combining open office spellchecker with Lucene
Doug Cutting wrote:
> David Spencer wrote:
>> [1] The user enters a query like: recursize descent parser
>>
>> [2] The search code parses this and sees that the 1st word is not a term in the index, but the next 2 are. So it ignores the last 2 terms (descent and parser) and suggests alternatives to recursize... thus if any term is in the index, regardless of frequency, it is left as-is.
>>
>> I guess you're saying that, if the user enters a term that appears in the index and thus is sort of spelled correctly (as it exists in some doc), then we use the heuristic that any sufficiently large doc collection will have tons of misspellings, so we assume that rare terms in the query might be misspelled (i.e. not what the user intended) and we suggest alternatives to these words too (in addition to the words in the query that are not in the index at all).
>
> Almost.
>
> If the user enters 'a recursize purser', then: 'a', which is in, say, 50% of the documents, is probably spelled correctly and 'recursize', which is in zero documents, is probably misspelled. But what about 'purser'? If we run the spell check algorithm on 'purser' and generate 'parser', should we show it to the user? If 'purser' occurs in 1% of documents and 'parser' occurs in 5%, then we probably should, since 'parser' is a more common word than 'purser'. But if 'parser' only occurs in 1% of the documents and 'purser' occurs in 5%, then we probably shouldn't bother suggesting 'parser'.

OK, sure, got it. I'll give it a think and try to add this option to my just-submitted spelling code.

> If you wanted to get really fancy then you could check how frequently combinations of query terms occur, i.e., does 'purser' or 'parser' occur more frequently near 'descent'. But that gets expensive.

Yeah, expensive for a large-scale search engine, but probably appropriate for a desktop engine.
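For the "really fancy" variant, a rough sketch of counting how often a candidate occurs near another query term. The window size, the whitespace tokenizer, and the three sample documents are assumptions for illustration only.

```python
def cooccurrence_count(docs, word, context, window=3):
    """Count occurrences of `word` that have `context` within `window` tokens."""
    count = 0
    for doc in docs:
        tokens = doc.lower().split()
        for i, token in enumerate(tokens):
            if token != word:
                continue
            # Tokens up to `window` positions before and after position i.
            nearby = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
            if context in nearby:
                count += 1
    return count


# Hypothetical miniature collection.
docs = [
    "a recursive descent parser reads tokens",
    "the descent parser calls itself",
    "the ship purser kept the accounts",
]

print(cooccurrence_count(docs, "parser", "descent"))
print(cooccurrence_count(docs, "purser", "descent"))
```

This version scans every token of every document for each comparison, which illustrates why the approach gets expensive on a large collection and may be more tolerable on a desktop-sized one.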
Re: frequent terms - Re: combining open office spellchecker with Lucene
David Spencer wrote:
> Doug Cutting wrote:
>> And one should not try correction at all for terms which occur in a large proportion of the collection.
>
> I keep thinking over this one and I don't understand it. If a user misspells a word and the 'did you mean' spelling correction algorithm determines that a frequent term is a good suggestion, why not suggest it? The very fact that it's common could mean that it's more likely that the user wanted this word (well, the heuristic here is that users frequently search for frequent terms, which is probably wrong, but anyway..).

I think you misunderstood me. What I meant to say was that if the term the user enters is very common then spell correction may be skipped. Very common words which are similar to the term the user entered should of course be shown. But if the user's term is very common one need not even attempt to find similarly-spelled words.

Is that any better?

Doug
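The clarified rule, skipping the similar-word lookup entirely when the user's own term is very common, might look like this. The 25% threshold, the function names, and the stub lookup are assumptions; the thread does not say what counts as "very common".

```python
def maybe_suggest(term, doc_fraction, find_similar, common_threshold=0.25):
    """Look up similar words only when the entered term is not very common."""
    if doc_fraction.get(term, 0.0) >= common_threshold:
        return []  # very common term: do not even attempt correction
    return find_similar(term)


# A stub lookup that records which terms it was asked about.
calls = []


def find_similar(term):
    calls.append(term)
    return {"recursize": ["recursive"]}.get(term, [])


# Hypothetical fractions of documents containing each term.
doc_fraction = {"a": 0.50, "recursize": 0.0}

print(maybe_suggest("a", doc_fraction, find_similar))
print(maybe_suggest("recursize", doc_fraction, find_similar))
print(calls)
```

The gate applies to the term the user typed, not to the suggestions: a very common word can still be returned as a suggestion for some other, rarer term.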
Re: frequent terms - Re: combining open office spellchecker with Lucene
Doug Cutting wrote:
> David Spencer wrote:
>> Doug Cutting wrote:
>>> And one should not try correction at all for terms which occur in a large proportion of the collection.
>>
>> I keep thinking over this one and I don't understand it. If a user misspells a word and the 'did you mean' spelling correction algorithm determines that a frequent term is a good suggestion, why not suggest it? The very fact that it's common could mean that it's more likely that the user wanted this word (well, the heuristic here is that users frequently search for frequent terms, which is probably wrong, but anyway..).
>
> I think you misunderstood me. What I meant to say was that if the term the user enters is very common then spell correction may be skipped. Very common words which are similar to the term the user entered should of course be shown. But if the user's term is very common one need not even attempt to find similarly-spelled words.
>
> Is that any better?

Yes, sure, thx, I understand now - but maybe not - the context I had in mind was something like this:

[1] The user enters a query like: recursize descent parser

[2] The search code parses this and sees that the 1st word is not a term in the index, but the next 2 are. So it ignores the last 2 terms (descent and parser) and suggests alternatives to recursize... thus if any term is in the index, regardless of frequency, it is left as-is.

I guess you're saying that, if the user enters a term that appears in the index and thus is sort of spelled correctly (as it exists in some doc), then we use the heuristic that any sufficiently large doc collection will have tons of misspellings, so we assume that rare terms in the query might be misspelled (i.e. not what the user intended) and we suggest alternatives to these words too (in addition to the words in the query that are not in the index at all).