Feature Requests item #854705, was opened at 2003-12-06 01:58
Message generated for change (Comment added) made by anadelonbrin
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=498106&aid=854705&group_id=61702

Category: None
Group: None
>Status: Closed
Priority: 5
Submitted By: Julian Morrison (julianm)
Assigned to: Nobody/Anonymous (nobody)
Summary: Detect "line noise" in subject and body

Initial Comment:
Spell check words in the message subject and body, and generate tokens for the count of misspellings in each. Perhaps also generate tokens for the ratio of incorrect/correct spellings? This could be chunked to make it easier to train, e.g.: all, more than half, about half, less than half, none. These should be separate for subject and for body, since garble in the header is very predictive of spam.

Also, there has to be some way to look for words with "impossible to pronounce" consonant clusters such as "dvgkbm". Could spambayes be made to look for "syllables"? E.g. by parsing words into syllables and generating tokens for each? I'm not sure there's a parsing technique that's sufficiently internationalized. Perhaps even just generating tokens for ASCII consonant clusters would be better than nothing.

----------------------------------------------------------------------

>Comment By: Tony Meyer (anadelonbrin)
Date: 2005-05-13 15:57

Message:
Logged In: YES
user_id=552329

I tried generating tokens if a token wasn't in a dictionary (more-or-less the same as spell checking), and that didn't help. See the wiki http://entrian.com/sbwiki for more details and the patch, in case anyone else wants to try it. Unless anyone can show that this helps, it won't be added.

The unpronounceable suggestion is unlikely to help if dictionary words didn't. I don't see how it would work outside English, anyway.

----------------------------------------------------------------------

Comment By: Julian Morrison (julianm)
Date: 2003-12-06 03:41

Message:
Logged In: YES
user_id=21754

Yeah you're right about "unpronounceable:xmlrpc", oops, my bad. Sorry, ignore that bit.

The hack I suggested for misspellings can be extended to unpronounceability counts, or anything similar. If it's a known token and a statistical ham indicator, then never count it as "unpronounceable" or "misspelled". That approach would quickly enough learn tech-speak or whatever, but it would still catch a high incidence of garble.

----------------------------------------------------------------------

Comment By: Richie Hindle (richiehindle)
Date: 2003-12-06 03:11

Message:
Logged In: YES
user_id=85414

What's the difference between the tokeniser spitting out "xmlrpc" and spitting out "unpronounceable:xmlrpc"? That doesn't make any difference. The difference is when you "generate tokens for the count of misspellings" (or unpronounceables) - then your system starts to decide that high unpronounceable counts are spammy, and techie messages get more spammy. (Unless the tech-speak outweighs the spam garbage, but even we're not *that* techie!)

----------------------------------------------------------------------

Comment By: Julian Morrison (julianm)
Date: 2003-12-06 03:04

Message:
Logged In: YES
user_id=21754

Hmm, would it not merely learn token "unpronounceable:xmlrpc" as a ham indicator?

Also, as a spellcheck hack: words that are already recognised tokens, and are ham indicators, should not count as misspelled even if the spell check rejects them. This would then quickly learn not to add "xmlrpc" into the misspelled-words count and ratio.

----------------------------------------------------------------------

Comment By: Richie Hindle (richiehindle)
Date: 2003-12-06 02:53

Message:
Logged In: YES
user_id=85414

We spambayes developers spend a lot of time talking about smtp, pop3, cdo, mapi, tcpip, http, html, py2exe, rfc822, chi2, kmail, ie, oe, xmlrpc, bsddb... Now those things would be trained as ham clues, but your scheme would dilute them. I'm not saying it's a bad idea, but just because something is unpronounceable and not in the dictionary doesn't make it the same class of thing as all the other tokens which are unpronounceable and not in the dictionary.
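
For anyone who wants to experiment with the idea discussed in this thread, here is a minimal sketch of the proposed scheme. It is not the patch from the wiki and it does not use the spambayes tokenizer API. The names noise_tokens, bucket, DICTIONARY and ham_tokens are hypothetical, the tiny word set stands in for a real spell-check dictionary, and the 0.4/0.6 bucket boundaries and the "four or more ASCII consonants" rule are assumptions chosen only for illustration.

```python
import re

# Hypothetical stand-in for a real spell-check dictionary.
DICTIONARY = {"buy", "cheap", "now", "the", "meeting", "is", "at", "noon"}

WORD_RE = re.compile(r"[A-Za-z0-9]+")
# Assumption: a run of 4+ ASCII consonants counts as "unpronounceable"
# ('y' is treated as a vowel here).
CLUSTER_RE = re.compile(r"[bcdfghjklmnpqrstvwxz]{4,}")


def bucket(ratio):
    """Map a misspelling ratio in [0, 1] to one of the five chunks."""
    if ratio == 0:
        return "none"
    if ratio == 1:
        return "all"
    if ratio < 0.4:          # assumed boundary
        return "less-than-half"
    if ratio <= 0.6:         # assumed boundary
        return "about-half"
    return "more-than-half"


def noise_tokens(text, section, ham_tokens=frozenset()):
    """Yield synthetic "line noise" tokens for one section of a message.

    section is a prefix such as "subject" or "body", so the two are
    trained separately. Words in ham_tokens (already known as ham
    clues) are skipped entirely, as suggested in the 03:04 comment.
    """
    words = [w.lower() for w in WORD_RE.findall(text)]
    words = [w for w in words if w not in ham_tokens]
    if not words:
        return
    misspelled = sum(1 for w in words if w not in DICTIONARY)
    yield "%s:misspelled-ratio:%s" % (section, bucket(misspelled / len(words)))
    for w in words:
        match = CLUSTER_RE.search(w)
        if match:
            yield "%s:consonant-cluster:%s" % (section, match.group())


if __name__ == "__main__":
    print(list(noise_tokens("dvgkbm qxzt buy now", "subject")))
    print(list(noise_tokens("xmlrpc meeting is at noon", "body")))
    print(list(noise_tokens("xmlrpc meeting is at noon", "body",
                            ham_tokens={"xmlrpc"})))
```

The second and third calls illustrate the concern raised above: without the ham-token exclusion, "xmlrpc" is counted as a misspelling and as a consonant cluster; with it, the message produces only the "none" ratio token. Whether any of these tokens improve classification would have to be tested; the one reported experiment in this thread found that dictionary-based tokens did not help.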
