on Wed, Sep 15, 2004 at 09:38:56AM -0700, Rod Roark ([EMAIL PROTECTED]) wrote:
> On Tuesday 14 September 2004 11:38 pm, Karsten M. Self wrote:
> > on Mon, Sep 13, 2004 at 09:39:15AM -0700, Rod Roark ([EMAIL PROTECTED]) wrote:
> > [...]
> > > Mostly this is of interest to the officers, as the mailing lists
> > > already require registration in order to post; however spammers
> > > might easily forge the FROM header to abuse this.
> >
> > Note that the greylisting is based on a tuple of which at least one
> > element (immediate upstream IP) is difficult or impossible to
> > reliably forge.
>
> Not sure if we are on the same page here.  I was referring to the fact
> that (not considering spam filtering) it's trivial to post to one of
> the mailing lists by forging the "from:" header.
Sure.  Which is why a split content/context filter's more reliable.
I've received a number of spams (or virms) in the past couple of months
from known, whitelisted addresses.  I'm pretty sure, say, that Don
Marti hasn't taken up spamming and isn't running an MS OS.

> > > (2) Mail from first-time posters, or from those who post less
> > > frequently than once per month, would likely be delayed by an
> > > hour or so.
> >
> > Possibly.
>
> Currently I'm experimenting with a 15-second period for greylisting.
> So far it appears that most MTA clients are set to retry after either
> 1 minute or 1 hour.  The really busy ones are quite unpredictable;
> worst case I've seen is about 3 hours.

Check the retry interval in your MTA.  Exim, if typical, uses the
following:

    # This single retry rule applies to all domains and all errors.  It
    # specifies retries every 15 minutes for 2 hours, then increasing
    # retry intervals, starting at 2 hours and increasing each time by
    # a factor of 1.5, up to 16 hours, then retries every 8 hours until
    # 4 days have passed since the first

...so 15 _minutes_ might be a better value.  I haven't empirically
tested this, however.

> [insightful but long analysis of aggregation snipped]
>
> > Which suggests a very cheap mode of cutting into spam volumes
> > markedly by employing ASNs, CIDRs, or similar IP aggregates (though
> > I'm aware of no system doing so) in generating reputation data, and
> > effecting firewalling, probabilistic rejection (you reject traffic
> > from an ASN in direct proportion to the probability it's spam),
> > rate-limiting, etc.  Backing off from a black-and-white allow/deny
> > mode gives legit mail a fighting chance....
>
> So this "probability" would necessarily only be part of a
> SpamAssassin-style weighting system.  Most of us hate to lose any
> legitimate mail at all, so rejecting all mail from some IP block
> solely because, say, 75% of that block's mail is spam, would be quite
> unacceptable.

It's data.  How you use it is up to you.
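For illustration, the greylisting scheme described above (first contact
deferred, retry accepted once the delay window has passed, keyed on a
tuple whose IP element is hard to forge) could be sketched roughly as
follows.  This is a toy in-memory sketch, not any particular MTA's
implementation; the names and the 15-minute delay are assumptions taken
from the retry discussion above.

```python
import time

GREYLIST_DELAY = 15 * 60   # 15 minutes, per the Exim retry-rule note above
seen = {}                  # (ip, sender, recipient) -> first-seen timestamp

def greylist_check(client_ip, sender, recipient, now=None):
    """Defer mail on first contact; accept once the same tuple
    retries after the delay window.  The client IP is the element
    of the tuple that's difficult to forge."""
    now = time.time() if now is None else now
    key = (client_ip, sender, recipient)
    first_seen = seen.setdefault(key, now)
    if now - first_seen < GREYLIST_DELAY:
        return "defer"     # i.e. a 4xx temporary failure to the client MTA
    return "accept"
```

A legitimate MTA retries and eventually gets "accept"; most spamware
circa this thread never retried, which is the entire trick.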
Point being that for, say, Kornet, the Bayes probability associated
with it was IIRC ~98%+ (and most of the non-spam was likely admin
bounce messages from attempts to deliver to abuse/reporting
addresses).  For _many_ of the high-spam-originating ASNs / CIDRs
you'll find similar stats, and if I understand SA's Bayesian rules
database correctly, the data should be available to you.  I'm having a
little trouble with this at present, but 'sa-learn --dump <option>'
should give you the current tokenset.

An alternative is to block _all_ mail from some points of origin, as
I'd recommend doing for the top spam sources.  They are simply so
badly managed, or so overtly and intentionally promoting spammers,
that they have no business serving legitimate traffic.  The Internet
Death Penalty has been applied in the past; it's a harsh, blunt tool.
It's also highly effective.

Or you could use an in-between option:  explicitly whitelist known
good point sources, and throttle or rate-limit other known addresses.

While the "don't lose a single good email" mantra is popular, it's
unrealistic.  Example:  I've recently recovered from a basically
unmediated mail experience:  every single email received was being
dumped into a single folder (the result of systems issues and of not
having any mail filtering, let alone spam filtering, in place).  Over
the course of some six weeks, over 28k mails piled up.

Think about that.

On recovering my systems, I ran the 28k+ mails through procmail for
filtering, spam assessment, etc.  I run some intensive checks and
numerous remote lookups, resulting in a rather slow processing chain.
It took over six days for that to complete (my daily mail-processing
capacity would appear to be about 2-4k messages).  I found myself
responding to several messages, including some from known addresses,
sent during that interval, many of them several weeks old.  The
senders had no idea whether their mail was lost, in transit, ignored,
or what.
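The probabilistic-rejection idea discussed above (reject traffic from a
block in direct proportion to the probability it's spam) reduces to a
one-liner.  A minimal sketch, with an injectable random source for
testability; the function name and interface are mine, not anything in
SA or any MTA:

```python
import random

def should_reject(spam_ratio, rng=random.random):
    """Reject a connection with probability equal to the originating
    block's observed spam ratio.  A 75%-spam netblock still gets ~25%
    of its mail through on average; a ~98%-spam source (the Kornet
    figure above) is almost always rejected."""
    return rng() < spam_ratio
```

This is the "fighting chance" property:  unlike a binary blocklist,
legitimate mail from a dirty block isn't dead on arrival, merely
disadvantaged, and a retrying MTA may well get through.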
This is what's known in the biz as "silent failure mode".  A Very Bad
Thing[tm].

Even where it's annoying as all hell, explicit IP (or CIDR or ASN)
rejection serves two useful functions:

  - It's immediate.

  - For well-constructed, standards-based mail clients, it results in
    a well-defined error message.

While the basis for rejection may not be appropriate, it's very clear
that mail was, in fact, rejected.  This allows the sender to attempt,
in a timely fashion, some other means of contacting you.  A
partially-effective but explicit system is better than none at all, or
than one which is partially effective but has soft failure modes
(e.g.:  challenge-response).

> > Which all sounds well and good.
> >
> > The question, though, is how much spam are you getting?
>
> It varies a *lot* from day to day.  Stats for yesterday:
>
>   917 incoming messages
>   706 of these blocked via DNSBLs and custom blacklists
>    45 blocked by the newly-implemented greylisting (never re-sent)
>    85 delayed by greylisting and later delivered
>    81 delivered without delay
>
> I have not inspected all of the delivered messages, as many of them
> are not mine to view.  But based on my own portion of these I
> estimate that about 5% are spam.  Without the greylisting it would
> have been about 21% (and without any filtering at all, 82%).

Sounds like 81.9% of incoming mail rejected as spam and 18.1%
delivered (751 and 166 of 917 messages, respectively), with about a 5%
false-negative rate on the spam filtering.  Since you're already using
DNSBLs pretty extensively, I suspect we're largely in violent
agreement here.

> [...]
>
> > On the other hand, content/context-based filtering gets expensive
> > both CPU- and time-wise, particularly if you're making extensive
> > use of DNSBLs (they're useful data sources, but they're
> > time-intensive).  It takes me 10-20 seconds to determine spam or
> > ham on my own system, on a high-speed line, via SpamAssassin.  I'm
> > faster doing it manually, but I'm not going to sit here hour after
> > hour, day in and day out.  So the machine does it.
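The percentages follow directly from the quoted counts; a quick
arithmetic check (numbers taken verbatim from the stats above):

```python
incoming          = 917
dnsbl_blocked     = 706   # DNSBLs and custom blacklists
greylist_blocked  = 45    # greylisted, never re-sent
delayed_delivered = 85    # greylisted, retried, delivered
delivered_direct  = 81    # delivered without delay

rejected  = dnsbl_blocked + greylist_blocked        # 751
delivered = delayed_delivered + delivered_direct    # 166

print(round(100 * rejected / incoming, 1))    # 81.9 -- % rejected
print(round(100 * delivered / incoming, 1))   # 18.1 -- % delivered
```

Note the counts sum cleanly:  751 + 166 = 917, so every incoming
message is accounted for.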
> Actually I find that use of DNSBLs is very fast, on the order of a
> second or so per message.  This is probably helped greatly by the
> fact that I run DNS on the same machine as the mail server.

Interesting.  I've got caching DNS here, but get about 6-15 seconds
per message in SpamAssassin.  Could be that the large volume of mail
you reject up-front would otherwise take longer to run through the
DNSBL checks in SA.

Peace.

-- 
Karsten M. Self <[EMAIL PROTECTED]>    http://kmself.home.netcom.com/
 What Part of "Gestalt" don't you understand?
   gconf-editor:  reimplementation of the MS Windows Registry for
   GNU/Linux, with the concomitant problems of undocumented settings,
   cryptic keys, inability to comment settings, and use of a single,
   specialized application to access the configuration settings.
_______________________________________________
vox-tech mailing list
[EMAIL PROTECTED]
http://lists.lugod.org/mailman/listinfo/vox-tech