Feature Requests item #1206796, was opened at 2005-05-23 00:12
Message generated for change (Comment added) made by matthew_levine
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=498106&aid=1206796&group_id=61702

Please note that this message will contain a full copy of the comment thread, including the initial issue submission, for this request, not just the latest update.

Category: None
Group: None
Status: Closed
Priority: 5
Submitted By: Matt (matthew_levine)
Assigned to: Nobody/Anonymous (nobody)
Summary: Catch intentional mispellings

Initial Comment:
Most of the spam I receive has many of its key words intentionally misspelled to throw off spam filters. If spammers always used the same misspellings, SpamBayes would catch them just fine, but spammers are smart enough to vary the way they misspell words. In addition, if the spam database holds many different versions of the same word, the word's spam association is spread across those versions and greatly weakened.

I think it would help if SpamBayes could recognize words as versions of other words and count them as the same token. One way to do this, for words composed primarily of ASCII characters, might be to replace zeros and ones with 'o's and 'i's, and to replace any accented characters or other symbols with the normal letters that they resemble. Then, instead of requiring the letters of the word to be in order, count the quantity of each letter in the word. If the letter count is more than a certain percentage similar to that of a known spam token, count the email as having that token.

Misspellings may be more common in subject lines than in bodies, so this feature could possibly be applied only to the subject line and not to the body of the email.

Here's another kind of misspelling that would be even tougher to decipher:

"Do u Want M:or:eInt:ense:Org:as:ms&3"inWe:eks?"

To tackle this, we'd need to detect the breaks between words, which are marked either by capitalization or by the insertion of symbols or punctuation marks.

These features might be tricky to implement or resource-intensive to run, but I think they could greatly improve functionality.
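
The normalization and letter-count comparison proposed above could be sketched roughly as follows. This is only an illustration of the idea, not SpamBayes code: the function names, the look-alike table (the request names only 0 and 1; the other entries are assumptions), and the 0.8 threshold are all hypothetical.

```python
import unicodedata
from collections import Counter

# Hypothetical look-alike table. The request only names 0 -> o and 1 -> i;
# the remaining entries are illustrative assumptions.
LOOKALIKES = str.maketrans({"0": "o", "1": "i", "!": "i", "@": "a", "$": "s"})


def normalize(word):
    """Lower-case, strip accents, and map look-alike symbols to letters."""
    # NFKD splits an accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", word.lower())
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    mapped = stripped.translate(LOOKALIKES)
    # Drop whatever is still not a letter (inserted punctuation, other digits).
    return "".join(ch for ch in mapped if ch.isalpha())


def letter_similarity(a, b):
    """Fraction of letters two words share, ignoring letter order."""
    counts_a, counts_b = Counter(a), Counter(b)
    shared = sum((counts_a & counts_b).values())  # multiset intersection
    longest = max(len(a), len(b))
    return shared / longest if longest else 0.0


def match_known_token(word, known_tokens, threshold=0.8):
    """Return the best-matching known token, or None if below the threshold."""
    candidate = normalize(word)
    best, best_score = None, 0.0
    for token in known_tokens:
        score = letter_similarity(candidate, token)
        if score > best_score:
            best, best_score = token, score
    return best if best_score >= threshold else None
```

With these assumptions, `match_known_token("C!al1s", ["cialis", "viagra"])` maps the obfuscated word back to `"cialis"`. Because letter order is ignored, an ordinary word with the same letters (for example "silica") would match as well, so the threshold and the choice of which tokens to compare against would need care.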

----------------------------------------------------------------------

>Comment By: Matt (matthew_levine)
Date: 2005-05-23 00:45

Message:
Logged In: YES
user_id=1283553

I don't think it's quite the same as the other feature request. That one says that the presence of misspelled or non-dictionary words should be a sign of spam. I'm saying that misspelled words should be treated as if they were spelled correctly, so that SpamBayes will know that "C!al1s" is not a new word, but a word that's been in 500 spam messages.

----------------------------------------------------------------------

Comment By: Tony Meyer (anadelonbrin)
Date: 2005-05-23 00:24

Message:
Logged In: YES
user_id=552329

Dupe of [ 817813 ] Consider bad spelling a sign of spam
<https://sourceforge.net/tracker/?group_id=61702&atid=498106&func=detail&aid=817813>

----------------------------------------------------------------------

_______________________________________________
Spambayes-bugs mailing list
http://mail.python.org/mailman/listinfo/spambayes-bugs
