On Mon, 28 Feb 2005 15:34:13 +0000, [EMAIL PROTECTED] (Justin Mason) writes:
> A paper at the spam conference suggested using an Edit Distance algorithm
> with very good results; the idea being, the edit distance from "cialis" to
> "C 1 a l | s" isn't as far as it is to "specialized" or so on.
>
> if I recall correctly, someone submitted an implementation quite a while
> ago on our BZ, but I think the FP rates were too high. Given the
> recent paper's published results, though, it may be there are good ways
> to tweak it to get FPs at a tolerable rate.

I did an implementation of it some time ago, but I didn't get a chance to
take it far enough to test its effectiveness. I have heard remarks that
naively applying edit distance is too slow.

To keep the FP rate from getting too high, the edit-distance costs are
parameterized, so some edits are much cheaper than others. E.g.:

    # Cost of replacing a character with a punctuation in the obfu.
    setreps ("bcdfghijklmnpqrstvwxyz","*?.-",.08);
    setreps ("aeiou","*?.-",.03);

    # Cost to insert these into the obfuscated string is cheap
    setins ("/\|()=-'!*`;:?+[]\"^",.01);
    setins ("_,.",.01);

So, 'v.agr.' and 'v..ia...gra' both cost <.10.

Got a bugzilla# that I can attach the prototype code to? (Also, is it
possible to report a bug/attach the code without creating a bugzilla
account?)

Scott
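
For illustration only, the parameterized costs described above could be sketched roughly as follows. This is not the prototype mentioned in the message. The class name WeightedEditDistance, the default cost of 1.0 for any edit without a configured price, the rule that a later setreps call overrides an earlier one, and the comparison target "viagra" are all assumptions made for the sketch; only the setreps/setins calls and their cost values come from the message.

```python
# Hypothetical sketch of an edit distance with per-character costs.
# Names and the 1.0 default are assumptions, not the original prototype.
DEFAULT_COST = 1.0


class WeightedEditDistance:
    def __init__(self):
        self.rep_cost = {}  # (plain_char, obfuscated_char) -> replacement cost
        self.ins_cost = {}  # obfuscated_char -> insertion cost

    def setreps(self, plain_chars, obfu_chars, cost):
        """Price replacing any of plain_chars with any of obfu_chars."""
        for p in plain_chars:
            for o in obfu_chars:
                self.rep_cost[(p, o)] = cost  # later calls override earlier ones

    def setins(self, chars, cost):
        """Price inserting any of chars into the obfuscated string."""
        for c in chars:
            self.ins_cost[c] = cost

    def distance(self, plain, obfu):
        """Cheapest way to turn `plain` into `obfu` (row-by-row dynamic programming)."""
        # Row for the empty prefix of `plain`: everything in `obfu` is inserted.
        prev = [0.0]
        for o in obfu:
            prev.append(prev[-1] + self.ins_cost.get(o, DEFAULT_COST))
        for p in plain:
            cur = [prev[0] + DEFAULT_COST]  # `p` dropped entirely
            for j, o in enumerate(obfu, start=1):
                sub = 0.0 if p == o else self.rep_cost.get((p, o), DEFAULT_COST)
                cur.append(min(
                    prev[j - 1] + sub,                               # match or replace
                    prev[j] + DEFAULT_COST,                          # delete p
                    cur[j - 1] + self.ins_cost.get(o, DEFAULT_COST), # insert o
                ))
            prev = cur
        return prev[-1]


wed = WeightedEditDistance()
# Cost values taken from the message.
wed.setreps("bcdfghijklmnpqrstvwxyz", "*?.-", 0.08)
wed.setreps("aeiou", "*?.-", 0.03)
wed.setins("/\\|()=-'!*`;:?+[]\"^", 0.01)
wed.setins("_,.", 0.01)

for candidate in ("v.agr.", "v..ia...gra"):
    print(candidate, wed.distance("viagra", candidate))
```

Under these assumptions both candidates score below 0.10 against "viagra", in line with the figure quoted above, while an unrelated word such as "specialized" stays several whole units away from "cialis". The table is quadratic in the two string lengths, which fits the remark that a naive application may be too slow on full message bodies.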