Chris Santerre wrote:
> I remember that paper. I was impressed and sceptical at the same time. I
> could see it FPing a lot. One person in the crowd brought up Niagara vs. the
> V-drug word :)
>
> Cialis vs. Dial-Lisa
> etc.

That was MailFrontier, using the term "lexigraphical distancing" rather than "edit distance". The presenter mentioned that the words run against the algorithm were chosen by humans, and that the stopwords were also hand-picked to avoid false positives.

A quick Google search for 'edit distance' led me to a talk on string matching algorithms, with links to several edit distance implementations on CPAN:

http://cs.haifa.ac.il/~shlomo/talks/edit_distance/slides/all_in_one.html

A plugin for SA to catch text substitutions would need to be fast, or only get invoked for strings likely to produce a match, or both. The current version of String::Approx might be a good starting point.

--d
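
P.S. To make the idea concrete, here is a minimal sketch of edit distance matching with a hand-picked stopword list and a cheap length check, so the expensive comparison only runs on plausible candidates. It is in Python purely for illustration. It is not String::Approx, not an SA plugin, and not MailFrontier's method. The names TARGETS, STOPWORDS, match_target and max_dist are all made up for this example.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance using the classic two-row dynamic program."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


# Hypothetical, hand-picked lists (assumptions for this sketch only).
TARGETS = ("viagra", "cialis")
STOPWORDS = {"niagara"}  # legitimate words that sit close to a target


def match_target(token: str, max_dist: int = 2):
    """Return the target word that token is within max_dist of, else None."""
    word = token.lower()
    if word in STOPWORDS:
        return None
    for target in TARGETS:
        # Cheap prefilter: the length difference is a lower bound on the
        # edit distance, so skip the full computation when it is too large.
        if abs(len(word) - len(target)) > max_dist:
            continue
        if edit_distance(word, target) <= max_dist:
            return target
    return None
```

With these lists, match_target("v1agra") returns "viagra", while "Niagara" is let through only because it is on the stopword list: its edit distance to "viagra" is 2, which is exactly the false positive Chris was worried about.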