Hi,

As you might have noticed, I have been fixing quite a few typos in the man pages lately. That is because I was testing the new spell(1) implementation I have been working on: https://github.com/abhinav-upadhyay/nbspell :-)
A portion of this work I did while working on apropos(1), but I didn't commit it because it was not quite in shape at the time, and also because I wanted it to live outside apropos(1). Then, a couple of years back, dholland@ filed a PR (http://gnats.netbsd.org/48684) documenting the shortcomings of the existing spell(1). I volunteered to work on it at that point, but didn't get around to doing it until a couple of months ago.

The original spell(1) uses inflection rules: it removes prefixes/suffixes from the input word until it is reduced to a root word, and then looks that root up in the dictionary. As noted in the PR, these rules are easily fooled and badly hurt its accuracy. The original reason for the rule-based technique was to keep the dictionary small and to avoid the pain of adding every possible variation of each word to it. Nowadays we don't need to worry about disk space, so we can spend a few more MBs to avoid those complex and inaccurate rules.

As part of this project, I have expanded the original Webster's dictionary by adding the plural, verb and adjective forms of words where possible. Checking whether a given word is misspelled is now a simple lookup in this dictionary. The expanded dictionary is just over 5MB, compared to 2.4MB for the original Webster's dictionary.

But a spell checker is no good if it tells you that you misspelled a word without telling you the correct spelling. To generate the possible corrections I have used the Levenshtein distance (edit distance) technique (for details on how this works, see my old blog post: http://abhinav-upadhyay.blogspot.in/2011/10/spell-corrector-for-apropos.html). To improve the accuracy of the corrections further, I also incorporated the soundex algorithm, which is a phonetics-based technique: it generates a code for every word such that similar-sounding words end up with the same code. This handles cases where people type words based on how they sound rather than how they are actually spelled.

After finding the list of possible corrections, it is also important to sort them in decreasing order of probability of being the intended word. For this I generated word frequency counts from a 12GB dataset of English-language books obtained from Project Gutenberg. The basic premise is that a more frequent word is more likely to be the right correction.

Right now, to find possible corrections for a given word, I am taking the following approach:

1. Try to find corrections at edit distance 1.
2. If no corrections are found at distance 1, go to edit distance 2.
3. If no corrections are found at distance 2, fall back to soundex.
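To make the edit-distance part concrete, here is the textbook dynamic-programming computation in C. Treat it as a sketch of the basic idea only -- the actual implementation weights the individual operations, as described in the list of tricks below:

#include <string.h>

/*
 * Textbook edit distance: the minimum number of insertions, deletions
 * and substitutions needed to turn `a' into `b'.
 */
static size_t
levenshtein(const char *a, const char *b)
{
	size_t la = strlen(a), lb = strlen(b);
	size_t prev[lb + 1], cur[lb + 1];
	size_t i, j, best;

	for (j = 0; j <= lb; j++)
		prev[j] = j;
	for (i = 1; i <= la; i++) {
		cur[0] = i;
		for (j = 1; j <= lb; j++) {
			/* substitution (free when the characters match) */
			best = prev[j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1);
			if (prev[j] + 1 < best)		/* deletion */
				best = prev[j] + 1;
			if (cur[j - 1] + 1 < best)	/* insertion */
				best = cur[j - 1] + 1;
			cur[j] = best;
		}
		memcpy(prev, cur, sizeof(cur));
	}
	return prev[lb];
}

For example, this gives 1 for (``acused'', ``accused'') and 2 for (``acused'', ``caused'').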
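Similarly, for anyone not familiar with soundex, here is a minimal sketch of the classic encoding in its simplified form; it illustrates the idea rather than the exact variant used in the code:

#include <ctype.h>
#include <stddef.h>

/* Soundex digit for a letter; '0' means the letter is ignored. */
static char
sdigit(int c)
{
	switch (tolower(c)) {
	case 'b': case 'f': case 'p': case 'v':
		return '1';
	case 'c': case 'g': case 'j': case 'k':
	case 'q': case 's': case 'x': case 'z':
		return '2';
	case 'd': case 't':
		return '3';
	case 'l':
		return '4';
	case 'm': case 'n':
		return '5';
	case 'r':
		return '6';
	default:
		return '0';
	}
}

/*
 * Encode `word' into a 4-character soundex code: keep the first letter,
 * turn the remaining letters into digits, skip the ignored ones,
 * collapse runs of the same digit and pad with zeros.
 */
static void
soundex(const char *word, char code[5])
{
	size_t i, n;
	char prev, d;

	if (word[0] == '\0') {
		code[0] = '\0';
		return;
	}
	code[0] = toupper((unsigned char)word[0]);
	prev = sdigit((unsigned char)word[0]);
	n = 1;
	for (i = 1; word[i] != '\0' && n < 4; i++) {
		d = sdigit((unsigned char)word[i]);
		if (d != '0' && d != prev)
			code[n++] = d;
		prev = d;
	}
	while (n < 4)
		code[n++] = '0';
	code[4] = '\0';
}

With this encoding, both ``acused'' and ``accused'' come out as A223, while ``caused'' becomes C230 -- which is the property the first trick below relies on.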
I am also using some tricks to further improve the accuracy:

- While generating corrections at a given edit distance, if a correction shares the soundex code of the misspelled word, give it a higher weight (a word at a short edit distance from the original and with the same soundex code is more likely to be the correct spelling than any other alternative).
- It is highly unlikely for someone to misspell the first character of a word, so corrections which change the first character are heavily penalized; for example, ``acused'' is more likely a misspelling of ``accused'' than of ``caused''.
- Based on the test data, I noticed it is more common for people to drop a character than to type an extra one, so giving a higher weight to corrections involving an insertion helps.
- Some implementations of Levenshtein distance give a lower weight to corrections involving substitutions; doing that has also been a bit helpful.

With all of this in place, I compared the performance of this implementation with that of GNU aspell on the test data they provide (http://aspell.net/test/common-all/). The results, measured as the accuracy of finding the correct suggestion at a given position (total number of tests: 3945):

First place:        91.33% (aspell 0.60: 74%)
In positions 1-5:   95.26% (aspell 0.60: 96.6%)
In positions 1-10:  95.59% (aspell 0.60: 98.2%)
In positions 1-25:  95.77% (aspell 0.60: 99.0%)
In positions 1-50:  95.84% (aspell 0.60: 99.2%)
In positions 1-100: 95.92% (aspell 0.60: 99.2%)

It should be noted that many of the tests in the test data are quite unreasonable; for example, ``assassination'' was expected as the correction for ``assosication'', while my implementation suggested ``association''. Also, the aspell results only consider words which were present in all three dictionaries, and they don't provide the exact set of tests they used, so these numbers should be taken with a grain of salt. Their results are based on 3836 test cases, while I used the complete test data (minus the words which were not in our dictionary).

I have also tried playing with n-gram techniques to get more accurate corrections by taking the context of the surrounding words into account, but I have struggled with the API design. Right now the API is as simple as calling spell_get_suggestions(word), which returns the list of possible corrections. With an n-gram based implementation it becomes much more complicated: does the caller pass one sentence at a time, or does it tokenize and pass the n-grams itself? If spell(1) needs to tokenize, how does it decide what a valid tokenization scheme for the caller is? Questions like these are why I have not gone in that direction yet. I do have a bigram-based implementation checked in (but it is currently broken).

I am looking forward to helpful feedback and suggestions on this work :-)

- Abhinav
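P.S. To give a concrete picture of the interface as it stands, here is roughly what using it looks like. Only spell_get_suggestions() is the real entry point; the header name, the NULL-terminated char ** return convention and the spell_free_suggestions() cleanup call are just illustrative shorthand, not the exact prototypes:

#include <stdio.h>

#include "spell.h"	/* header name is shorthand */

int
main(void)
{
	char **suggestions;
	int i;

	/* Corrections come back sorted by decreasing probability. */
	suggestions = spell_get_suggestions("acused");
	for (i = 0; suggestions != NULL && suggestions[i] != NULL; i++)
		printf("%d. %s\n", i + 1, suggestions[i]);
	spell_free_suggestions(suggestions);	/* shorthand cleanup */
	return 0;
}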