On Sunday, March 28, 2004, 10:00:11 PM, Justin Mason wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Daniel Quinlan writes:
>> No FPs, but the SPAM% is rather low. I suspect the problem is that
>> SURBL is a direct listing of URIs whereas URIBL does the NS->A->RBL
>> mapping.

> It's also *very* new -- I suspect it could do with more data ;) Anyone got
> an address for the operator? I can send on over a partial spamtrap feed
> from our server (100MBytes of spam per day), or similar.

> IMO, expiring after 4 days is *way* too early. At least a month would
> be better -- otherwise it allows spammers to "recycle" old domains very
> quickly after their use in spam.

Hi All,

I'm the person behind SURBL. First I'd like to thank Sidney for relaying my announcement to you guys and also for letting me know about this developer forum. I'd also like to thank the SA developer community for building a great tool for fighting spam. In that spirit, I'm trying to make a contribution to the efforts in a way that perhaps has not been tried before. I'll try to explain what I hope SURBL can help accomplish.

First, I think there may be some misunderstanding about the intended use of the SURBL data, partially caused by my somewhat shallow understanding of what URIDNSBL currently does, and also because my own ideas on how SURBL should be used apparently differ somewhat from how URIDNSBL appears to work. It seems that URIDNSBL wants to do address resolution on domain names found in message bodies and compare the resulting addresses against numeric RBLs. For name-based URIs that's very different from my intended use for SURBL, so I may have been partially in error in suggesting that an unmodified URIDNSBL use SURBL directly.

Second, we can make the expiration period for records any arbitrary number of days. Four days was chosen because I felt it was a good match for the freshness of the SpamCop (SC) Spamvertised site data. It was also chosen to keep the amount of data reasonably small. If more of a historical record would be useful, we can keep data for a week or month. The shortness was partially meant to ensure that the RBL data tracked current SC data fairly tightly and also did not result in too large an RBL.
Presently the RBL only has about 250 records; perhaps that's on the small side. I'm not too worried about Joe Jobs and other problems in the data due to some of the averaging effects explained further on.

More fundamentally, the question of the number of days may somewhat miss the idea of what I'm trying to accomplish with SURBL. SURBL is not trained on spam in the sense of Bayesian rules, etc. It is simply meant to be a record of the most frequently reported domains in spam message bodies that SpamCop users choose to report. In this sense it's like a broadly-based, hand-tuned blacklist of domains commonly found in spam. Because quite a few reports need to be received for a domain to get added to SURBL, it effectively represents a consensus voting system on what body domains are spammy. One improvement might be to encode the frequency data in the RBL so that more frequently reported domains could be used to give higher scores.

About the only tuning of the data I see as necessary or possible is in the number of expiration days and the report count threshold for inclusion in the list (with the caveats about how those counts are generated, as mentioned in the documentation). Some statistical analysis could help with the thresholding question.

http://sc.surbl.org/

As another example of the difference in my views on the use of the SURBL data, off-list Sidney brought up the question of processing deliberately randomized host names that spammers sometimes use and how that could confuse or defeat a spam message body domain RBL. He implied that such deliberate attempts at randomization might be a reason my data was not working too well with URIDNSBL, and I partially agree. This observation points out potential differences in how the data might best be used.

My take on the randomized host or subdomain problem highlights a different viewpoint we took into consideration when designing our data structure.
Instead of checking every randomized FQDN against the RBL, we prefer to try to strip off the random portion and pass only the basic, unchanging domain. The SURBL data only gets the parent of these randomized FQDNs since it builds its (inverted) tree from the root (TLD) direction down toward the leaves. (It actually starts counting reports from the second level, not the top level, which would be way too broad.) It accumulates a count of the children under the second level so that:

  lkjhlkjh.random.com
  089yokhl.random.com
  asdsdfsd.random.com

gives one entry for each FQDN, but gives the useful and desirable count of *3* for random.com.

The randomizers *cannot hide* from this approach. The non-random parent portion of their domains shows up clearly and conspicuously as a parent domain with an increased count (3 is greater than 1). Every time a spammer gets reported using a randomized host or subdomain name, it increases the count of their parent domain. In the words of the original, Apple II version of Castle Wolfenstein, "You're caught."

So a technique to defeat the randomizers is to look at the higher levels of the domain, under which SURBL will always count the randomized children of the "bad" parent. In this case the URI diversity created through randomization hurts the spammer by increasing the number of unique reports and increasing the report count of their parent domain, making them more likely to be added to SURBL. (D'oh, this paragraph is redundant...)

A quick look at the data will confirm that almost all of the most often reported domains have just two levels (a few have three levels):

http://spamcheck.freeapp.net/top-sites-domains

This simply reflects the nature of the data, including the positive and constructive handling of randomizers.

The real strength of SURBL is that the listed domains are very strongly spam domains. This approach would be prone to failure if the FP rate of these base domains were significantly above zero.
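
As a rough illustration of the parent-domain counting described above, here is a minimal Python sketch. It is not SURBL's actual data engine. The function names are made up for this example, and taking "the last two labels" as the parent domain is a simplifying assumption that ignores country-code registries such as co.uk.

```python
from collections import Counter


def parent_domain(fqdn, levels=2):
    """Return the last `levels` labels of a host name, e.g. random.com."""
    labels = fqdn.lower().rstrip(".").split(".")
    return ".".join(labels[-levels:])


def count_by_parent(reported_hosts, levels=2):
    """Count each distinct reported FQDN once under its parent domain."""
    counts = Counter()
    unique_hosts = {h.lower().rstrip(".") for h in reported_hosts}
    for host in unique_hosts:
        counts[parent_domain(host, levels)] += 1
    return counts


# The three randomized hosts from the example above.
reports = [
    "lkjhlkjh.random.com",
    "089yokhl.random.com",
    "asdsdfsd.random.com",
]
counts = count_by_parent(reports)
print(counts["random.com"])  # 3
```

Each randomized child adds one more distinct report to the same parent, so randomization raises the parent's count rather than hiding it.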
Due to the law of averages and fairly careful SpamCop reporters, that seldom seems to happen.

My suggested alternative approach to parsing spam URIs would be to start with the second-level domains, compare those against SURBL, then try the third levels next, up to some limit. (Levels 1 with 2, then 1 through 3 are probably enough, i.e. two DNS queries into the SURBL domain.) Since the DNS RBL lookups are all cached and very fast, there should not be too much of a performance penalty for this. Probably it's less of a penalty than trying to resolve spam body FQDNs into numeric addresses, then doing reverse lookups or name server record checks on the addresses, etc. Some of the three-level domains are supersets of two-level domains, for example to.discreetvaluepills.com and discreetvaluepills.com are both listed, so the two-level comparison may be the best place to start.

Implementing this approach may require starting a new code branch off of URIDNSBL. But I'm convinced my approach may have some definite merit if implemented. The results of feeding SURBL directly into URIDNSBL may not be too strong because the two approaches seem to have fairly different background assumptions and designs in mind. I now believe my data may work better when used as I describe above than when fed directly into unmodified URIDNSBL.

I've never written any SA code, so could I convince someone to consider implementing this approach or give me a pointer to learn how to do it?

> And finally, I think we should add a new rule eval fn to URIBL, to
> allow URIs to be looked up against an RHSBL-style list. That should
> be faster, as it'd mean no need for the NS and A sets of lookups.

Justin's last comment would seem to include a key part of the puzzle. As I mention above, I believe the SURBL data could and quite possibly should be compared without any DNS resolution of any domains in the message body.
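
The lookup order suggested above (two-level domain first, then three-level) could be sketched in Python roughly as follows. This is an illustration under stated assumptions, not SpamAssassin code: the zone name `sc.surbl.org` is assumed from the site URL, the function names are hypothetical, "listed" is taken to mean "the query name resolves to any address" as is conventional for DNS blocklists, and numeric addresses in URIs are not handled.

```python
import socket

SURBL_ZONE = "sc.surbl.org"  # assumption: zone name inferred from the site URL


def candidate_domains(host, max_levels=3):
    """Yield the 2-level parent of `host` first, then the 3-level one."""
    labels = host.lower().rstrip(".").split(".")
    for n in range(2, min(max_levels, len(labels)) + 1):
        yield ".".join(labels[-n:])


def dns_listed(query_name):
    """True if `query_name` resolves; a failed lookup means not listed."""
    try:
        socket.gethostbyname(query_name)
        return True
    except OSError:
        return False


def surbl_match(host, zone=SURBL_ZONE, is_listed=dns_listed):
    """Return the first listed parent domain of `host`, or None."""
    for domain in candidate_domains(host):
        if is_listed(domain + "." + zone):
            return domain
    return None


# Offline demonstration with a stand-in for the DNS zone (no network needed).
fake_zone = {"discreetvaluepills.com.sc.surbl.org"}
print(surbl_match("to.discreetvaluepills.com", is_listed=fake_zone.__contains__))
# discreetvaluepills.com
```

Only the URI's own host name is used to build the query. Nothing in the message body is resolved to an address first, so a host with three or more labels costs at most two lookups into the list.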
If the domain (or numeric address) in the spam URI matches SURBL, you almost certainly hold a gen-u-ine spam.

This also ties in with Daniel's earlier observation after testing SURBL using URIDNSBL:

> No FPs, but the SPAM% is rather low. I suspect the problem is that
> SURBL is a direct listing of URIs whereas URIBL does the NS->A->RBL
> mapping.

He's exactly right about the intention of SURBL. It is a direct list of spam URI domains, intended for direct comparison against domains in incoming message URIs without resorting to any DNS resolution. I consider that a feature rather than a bug. ;)

TIA and Cheers,

Jeff C.
-- 
Jeff Chan
mailto:[EMAIL PROTECTED]
http://sc.surbl.org/
