On Monday, March 29, 2004, 2:57:53 AM, Daniel Quinlan wrote:

> Jeff Chan <[EMAIL PROTECTED]> writes:
>> For name-based URIs that's very different from my intended use for
>> SURBL, so I may have been partially in error in suggesting that an
>> unmodified URIDNSBL use SURBL directly.

> Yeah, I didn't expect it would work based on the explanation on the
> SURBL web page, but I figured I'd give it a try anyway. No harm, no
> foul. I think we'll need to add another method to the URIDNSBL plugin
> to support direct RHS query blacklists like SURBL.

Sounds like a plan, and I certainly appreciate your giving it a try
with URIDNSBL as it is now.

>> Presently the RBL only has about 250 records; perhaps that's on the
>> small side.

> 250 seems small relative to the number of domains I see in spam each
> day (very roughly about 4 domains mentioned per email, with an
> average of 2 domains in each spam unique to a week-long period).

An interesting thing is that the data seems pretty "normal" in a
statistical sense: halving the threshold approximately doubles the
size of the resulting list, at least in the range of thresholds I
looked at (roughly 5 to 25 report counts). Lengthening the expiration
period should also increase the size of the list for a given
threshold, and the additional data gained from doing so could be
pretty valid.

One thing I did notice from top-sites.html is that there is a
persistent pharmaspammer, hosted in China or Brazil, that almost
always seems to be near the top of the list. They had used domain
names like medz4cheap.com, among others. Currently they're using
medicalfhtjk.com:

  http://spamcheck.freeapp.net/top-sites.html

What's interesting is that their domains only last a week or so before
they switch to a new one, with very similar-style spams referencing
all of them. In their case at least, that kind of argues for a one
week or so expiration, but that's only one anecdotal example and not
really a basis for a policy. Perhaps it's no coincidence that 7 days
is also a typical minimum zone file expire time, i.e. a length of
time the spam domain's zone file might be cached on name servers.

>> One improvement might be to encode the frequency data in the RBL so
>> that more frequently reported domains could be used to give higher
>> scores.

> We could do that, but let's see where we are once we start doing
> direct lookups and if perhaps you increase your timeout and lower
> your threshold to increase the number of records somewhat.

Agreed. And I'm not even sure most RBL code would know what to do with
information other than "yep, it resolves, so it's a match, and I'm
done." That said, it would be easy to add that information to an RBL
resource record, for example as a TXT record.

> The key thing with the threshold is that we want SURBL to be
> accurate as a spam rule.

Joe jobs are something you want to think about now as opposed to
later, and they're why I want to start with a somewhat high threshold
and an effective whitelist.

> One way you could reduce the possibility of joe jobs is to remove
> old domains, ones that have been around a while.

That's an interesting idea, though it assumes spam body domains
eventually go away. My current code expires all domains equally, but
it could be modified to look for persistent ones and treat them
differently. The averaging effect seems to be very strong, however,
and very few FPs seem to get in. The fact that the manual SpamCop
reports can be, and probably are, mostly hand-tuned by every SC user
seems to help; i.e. most SC users probably make an effort to uncheck
legitimate domains to prevent false reporting.

> Stuff like amazon.com, ebay.com, etc. have been around for a long
> time. SenderBase has easily accessed data for this (first email from
> domain was initialized long enough ago to be useful now) and there
> are also the whois records. You could also build up a whitelist for
> repeated joe jobs.

Certainly the existing SURBL whitelist could be used for that. I've
already added some of the common domains like yahoo, hotmail, etc.,
and have just added ebay and amazon thanks to your reminder. None of
those has actually appeared above the threshold yet, however, so the
law of averages and careful reporting seem to be on our side so far.

I'm not too familiar with SenderBase. Do they have a web site or
domain whitelist? For that matter, does anyone know of any such
whitelists that we could incorporate? Basically it would just be a
list of known, legitimate, popular sites or domains. I would assume
such whitelists exist, but I'm somewhat new to working on anti-spam
technologies.

> You might also want to increase the timeout on domains that appear
> again and again.

Interesting idea. Would that be for spam domains or legitimate ones?
Either way, the idea of variable expiration is interesting.

>> As another example of differences in my views on the use of the
>> SURBL data, off-list Sidney brought up the question of processing
>> deliberately randomized host names that spammers sometimes use, and
>> how that could confuse or defeat a spam message body domain RBL. He
>> implied that such deliberate attempts at randomization might be a
>> reason my data was not working too well with URIDNSBL, and I
>> partially agree. This observation points out potential differences
>> in how the data might best be used.

> Yes, but the SBL rule works pretty well, so I don't think randomized
> host names are a problem yet.

We've seen quite a few randomized or customized (to a username, for
example) host names in some of the top pharmaspam sites. The idea is
exactly as others have mentioned: add chaos to the names to throw off
message body checkers. It doesn't throw us off, though; we thrive on
it, as long as their main domain is behind it!

>> My take on the randomized host or subdomain problem highlights a
>> different viewpoint we took into consideration when designing our
>> data structure.

> I *think* we also currently only do queries of the domain itself, so
> it shouldn't be an issue.

If so, great. If not, the approach I outlined could be worth a try.

>> Instead of checking every randomized FQDN against the RBL, we
>> prefer to try to strip off the random portion and pass only the
>> basic, unchanging domain. The SURBL data only gets the parent of
>> these randomized FQDNs since it builds its (inverted) tree from the
>> root (TLD) direction down toward the leaves. (It actually starts
>> counting reports from the second level, not the top level, which
>> would be way too broad.) It accumulates a count of the children
>> under the second level so that:
>>
>>   lkjhlkjh.random.com
>>   089yokhl.random.com
>>   asdsdfsd.random.com
>>
>> gives one entry for each FQDN, but gives the useful and desirable
>> count of *3* for random.com. The randomizers *cannot hide* from
>> this approach. The non-random child portion of their domains shows
>> up clearly and conspicuously as a parent domain with an increased
>> count (3 is greater than 1). Every time a spammer gets reported
>> using a randomized host or subdomain name, it increases the count
>> of their parent domain. In the words of the original, Apple II
>> version of Castle Wolfenstein, "You're caught."

> This is a good idea.

Thanks!
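In case it helps make the counting concrete, here's a rough,
illustrative sketch in Python. It naively treats the last two labels
as the "second level" (ignoring two-part TLDs like .co.uk, which real
code would have to handle), and the function names are mine, not
anything from the actual SURBL scripts:

  def second_level_domain(fqdn):
      # Naively reduce a (possibly randomized) FQDN to its last two
      # labels, e.g. "lkjhlkjh.random.com" -> "random.com".
      labels = fqdn.lower().rstrip(".").split(".")
      return ".".join(labels[-2:])

  def count_reports(reported_fqdns):
      # Accumulate one report count on the parent domain for each
      # child FQDN reported.
      counts = {}
      for fqdn in reported_fqdns:
          parent = second_level_domain(fqdn)
          counts[parent] = counts.get(parent, 0) + 1
      return counts

  reports = ["lkjhlkjh.random.com",
             "089yokhl.random.com",
             "asdsdfsd.random.com"]
  print(count_reports(reports))   # {'random.com': 3} -- caught!

With counts kept on the parent like this, the thresholding and
expiration logic can operate on random.com itself rather than on each
throwaway child name.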
>> My suggested alternative approach to parsing spam URIs would be to
>> start with the second level domains, compare those against SURBL,
>> then try the third levels next, up to some limit. (Levels 1 with 2,
>> then 1 through 3 are probably enough, i.e. two DNS queries into the
>> SURBL domain.) Since the DNS RBL lookups are all cached and very
>> fast, there should not be too much of a performance penalty for
>> this.

> Whatever we do, we really want to do all the queries at once as
> early as possible in the message check for performance reasons.

Agreed, though local DNS caching helps quite a bit...

>> Probably it's less of a penalty than trying to resolve spam body
>> FQDNs into numeric addresses, then do reverse lookups or name
>> server record checks on the addresses, etc.

> Definitely.

Agreed. It's *definitely* quicker to do DNS lookups against the
single, cached SURBL domain than DNS lookups on all the random domains
appearing in spam (and in legitimate messages).

>> Implementing this approach may require a new code branch off of
>> URIDNSBL to be started. But I'm convinced my approach may have some
>> definite merit if implemented.

> I think it belongs in the URIDNSBL code, but another plugin would
> perhaps be okay.

If it can be done in the existing code, I'm all for that! If not, we
could consider forking it off.

>> I've never written any SA code, so could I convince someone to
>> consider implementing this approach or give me a pointer to learn
>> how to do it?

> It sounds like Justin is thinking about it, or perhaps Sidney is
> interested, or my advice if you want to do it would be to check out
> the SVN tree and start hacking. :-)
>
> Daniel

Someone please try it; I think it could rock! :) (There's a rough
sketch in the P.S. below.) Eric Kolve, if you're reading this, would
you care to try, per my previously suggested design? If no one else
will, I may give it a hack or two. It would probably be immensely
faster for someone already familiar with SA to give it a try,
though... ;)

Thanks for your feedback!

Jeff C.

--
Jeff Chan
mailto:[EMAIL PROTECTED]
http://sc.surbl.org/
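P.S. For anyone inclined to experiment, here's a rough, untested
Python sketch of the two-query lookup described above (levels 1-2
first, then levels 1-3). It assumes the usual RHS-style query, i.e.
appending the candidate domain to the list zone and treating any
A-record answer as a hit; the helper names and the three-level limit
are just illustrative, not actual SpamAssassin code:

  import socket

  SURBL_ZONE = "sc.surbl.org"

  def listed(domain):
      # For an RHS blacklist, a hit is simply "the name resolves".
      try:
          socket.gethostbyname(domain + "." + SURBL_ZONE)
          return True
      except socket.gaierror:
          return False

  def check_fqdn(fqdn, max_levels=3):
      # Try "random.com" first, then "host.random.com", up to a limit.
      labels = fqdn.lower().rstrip(".").split(".")
      for n in range(2, min(max_levels, len(labels)) + 1):
          candidate = ".".join(labels[-n:])
          if listed(candidate):
              return candidate   # the listed (parent) domain
      return None                # no match

Since both queries go into the single SURBL zone, the local resolver's
cache should keep the second lookup cheap.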
