Feature Requests item #1242708, was opened at 07/21/05 17:11
Message generated for change (Comment added) made by sf-robot
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=498106&aid=1242708&group_id=61702

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.

Category: None
Group: None
>Status: Closed
Priority: 5
Submitted By: Mark Storer (mstorer3772)
Assigned to: Nobody/Anonymous (nobody)
Summary: Counter-counter-spam filtering suggestions

Initial Comment:
My experience is that the majority of spam that gets around filtering
involves lots of deliberate misspellings, either by add1ng or ins^ertin*g
non-le++er [EMAIL PROTECTED], thro wing in sp aces wher e t hey do n't
belon g, or ByUsingTitleCaseToSeperateWordsRatherThanSpaces (ditching
spaces).

There are several possible workarounds to this:

1) Drop all non-letters and spaces, evaluating the resulting monolithic
string.
Downside: More computationally expensive, as the list of possible matches
increases dramatically for each segment of the monolith, and you have to
test each segment against multiple lengths. O(n^2) might be generous.

2) Attempt to merge adjacent tokens to see if they qualify as spam (or ham
I suppose). This sounds more like an O(n) operation, but would only stamp
out the "additional spaces" method.
Downside: Again, more CPU time, but to a lesser extent than #1. It is also
defeated by spam that does not use the "add spaces" technique.

3) Treat all new words as having a low positive spam rating of some sort.
Each newly encountered misspelling would be initially biased towards spam.

4) Add a spelling checker. New misspelled words have a slightly-spam rating
(outside training).
Downside: Big data file tacked onto your otherwise light-weight
plugin/app/thingy.

One concern with #3 and #4 is how they would react to an email containing
source code of whatever language. Variable and function names are
infrequently found in a dictionary (as you're no doubt aware).

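For illustration only, suggestion #2 (combined with the non-letter stripping
from #1) could be sketched roughly as below. This is not SpamBayes code: the
function names `strip_non_letters` and `merged_candidates` and the `max_span`
parameter are hypothetical, and the sketch assumes plain ASCII text. With a
fixed `max_span` the work grows linearly with the number of tokens, which
matches the O(n) guess above.

```python
import re


def strip_non_letters(token):
    """Remove everything except ASCII letters and lowercase the result.

    Hypothetical helper: undoes inserted punctuation such as "ins^ertin*g".
    Digits are dropped rather than mapped back to letters.
    """
    return re.sub(r"[^A-Za-z]", "", token).lower()


def merged_candidates(tokens, max_span=3):
    """Yield each cleaned token plus joins of up to max_span adjacent tokens.

    Hypothetical helper: the joined candidates could then be looked up in
    the classifier's token database alongside the original tokens.
    """
    cleaned = [strip_non_letters(t) for t in tokens]
    cleaned = [t for t in cleaned if t]  # skip tokens with no letters at all
    for i in range(len(cleaned)):
        for span in range(1, max_span + 1):
            if i + span > len(cleaned):
                break
            yield "".join(cleaned[i:i + span])


if __name__ == "__main__":
    for candidate in merged_candidates("thro wing in sp aces".split(), max_span=2):
        print(candidate)
```

Most of the joined candidates will be nonsense ("wingin", "insp"), so a
scheme like this only helps if unknown tokens are treated as neutral.
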
----------------------------------------------------------------------

>Comment By: SourceForge Robot (sf-robot)
Date: 08/07/05 19:00

Message:
Logged In: YES
user_id=1312539

This Tracker item was closed automatically by the system. It was
previously set to a Pending status, and the original submitter
did not respond within 14 days (the time period specified by
the administrator of this Tracker).

----------------------------------------------------------------------

Comment By: Tony Meyer (anadelonbrin)
Date: 07/24/05 18:34

Message:
Logged In: YES
user_id=552329

Something like #2 (but better) is done by the use_bigrams
option. This is an experimental option in 1.0.x, and a
regular option in 1.1.x. You can enable it and see how you
like it.

You can change the value an unknown token is assigned. This
is the unknown_word_prob option. Experimental testing
indicated that the current value of 0.5 gives the best results.

Various tests have been done with spell checking and with adding
tokens for words not in a dictionary. None have shown any
improvement.

I don't understand what you mean by #1. If you drop all
spaces, you are left with one token per email body. This
will only match for identical mail, which will certainly not
help. Or are you planning on splitting up the token somehow?

----------------------------------------------------------------------

_______________________________________________
Spambayes-bugs mailing list
[email protected]
http://mail.python.org/mailman/listinfo/spambayes-bugs
