On Tuesday, July 20, 2004, 6:58:15 AM, David Hooton wrote:
> On Tue, 20 Jul 2004 15:27:52 +0200, Marc Kool <[EMAIL PROTECTED]> wrote:

>> I did a quick check on a few domains and I do not share your conclusion.

I think we have a slight case of culture clash here.  This
adult data is meant to be used in a proxy server where
the data is apparently matched literally against URI data
from web requests, etc.

SURBLs are designed to be used with specific email message body
scanning programs that attempt to reduce the domains found in
message body URIs to their registrar (base) domain so that
subdomains like "models.home.att.net" are reduced to the
base domain "att.net" before being included in a SURBL
or checked against a SURBL.
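
As a rough illustration of that reduction step, here is a
minimal Python sketch.  It is not the code any SURBL client
actually runs; the function name base_domain and the tiny
TWO_LEVEL_SUFFIXES set are assumptions made up for this
example, and a real implementation would need a much fuller
suffix list.

```python
# Hypothetical, deliberately tiny set of two-level suffixes under which
# domains are registered one level further down. Illustration only.
TWO_LEVEL_SUFFIXES = {"co.uk", "com.au", "co.jp"}


def base_domain(host: str) -> str:
    """Reduce a hostname to its base domain by keeping the last labels."""
    labels = host.lower().rstrip(".").split(".")
    if len(labels) <= 2:
        # Already a bare domain (or a single label); nothing to strip.
        return ".".join(labels)
    last_two = ".".join(labels[-2:])
    # Keep three labels under a known two-level suffix, otherwise two.
    keep = 3 if last_two in TWO_LEVEL_SUFFIXES else 2
    return ".".join(labels[-keep:])


if __name__ == "__main__":
    for host in ("models.home.att.net", "abc1.xyz.spammerdomain.com"):
        print(host, "->", base_domain(host))
```

Note that the same reduction turns "register.oscar.aol.com"
into "aol.com", which is why literal proxy-style data cannot
simply be fed through it.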

The main reason we did this was to defeat the "random
subdomain" spammers who generate random subdomains to
try to defeat simple URI pattern matching or to key
their spams to confirm the recipient addresses.  Examples
might be "abc1.xyz.spammerdomain.com" and
"abc2.xyz.spammerdomain.com".  Those we want to reduce
to just "spammerdomain.com" since the randomized/keyed
versions may occur only once and the sc.surbl.org data
engine tries to increase the likelihood of inclusion
in the list with an increasing number of reports.
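
To see why the reduction matters for report counting, here
is a small sketch.  It is only an assumption-laden toy, not
the sc.surbl.org data engine: the sample reports, the
naive_base helper (which just keeps the last two labels) and
the THRESHOLD value are all invented for illustration.

```python
from collections import Counter


def naive_base(host: str) -> str:
    """Toy reduction: keep only the last two labels of the hostname."""
    return ".".join(host.lower().split(".")[-2:])


# Invented sample reports: three randomized hosts on one spammer domain.
reports = [
    "abc1.xyz.spammerdomain.com",
    "abc2.xyz.spammerdomain.com",
    "abc3.xyz.spammerdomain.com",
    "www.example.org",
]

# Counting literal hostnames: every randomized host is seen only once.
literal_counts = Counter(reports)

# Counting after reduction: the reports accumulate on the base domain.
reduced_counts = Counter(naive_base(h) for h in reports)

THRESHOLD = 3  # hypothetical number of reports needed for listing
listed = sorted(d for d, n in reduced_counts.items() if n >= THRESHOLD)

if __name__ == "__main__":
    print("literal:", dict(literal_counts))
    print("reduced:", dict(reduced_counts))
    print("listed:", listed)
```

Counted literally, no single randomized hostname ever gets
past one report; counted by base domain, they add up.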

It may be useful to read about the sc.surbl.org data:

  http://www.surbl.org/data.html

and the related Implementation Guidelines:

  http://www.surbl.org/implementation.html

to gain a clearer understanding of some of our design
decisions.

So both Marc's and David's comments make sense in those
differing contexts.  The two contexts differ mainly in their
handling of subdomains:

>> # grep aol.com domains
>> adultaol.com
>> register.oscar.aol.com
>> sex-aol.com
>> sexonaol.com
>> usaol.com

> register.oscar.aol.com is the server used by AOL messenger and ICQ to
> login - how on earth does this count as an Adult Website, much less a
> sex site?!!

And more importantly in my first try at processing the data for
use as a SURBL, "register.oscar.aol.com" got reduced to "aol.com".  :-(

>> # grep att.net domains
>> adultonly.home.att.net
>> borderjumper.home.att.net
[...]

> Ahh the plot thickens...  Subdomains..

>> # grep -w au.com domains
>> aotoys.au.com
>> condoms.au.com
[...]

>> For au.com and att.net there are only adult subdomains in the blacklist.  
>> This is ok.

> However SURBL's in general don't use subdomains, I've just run a test
> on my personal SURBL and SpamCopURI doesn't currently look at
> subdomains.  I suspect because of the requirement for a lookup per
> domain level which would obviously both make things inefficient and
> also leave room for a denial of service.
[...]

>> I assume that something went wrong when you verified the quality of the 
>> database.

> I think the levels of understanding of what was in the DB and what
> SURBL was able to do were what went wrong.

> Given my very quick testing I think it would probably be worth giving
> this data a try, we would most likely need to work out how to remove
> the subdomained entries - the list is huge, and efficiency we can gain
> by removing excess data would obviously be useful.

Good suggestion, but perhaps slightly tricky to implement,
depending on the data.

I can easily use a regex to delete entries with subdomains
like "xxxmovies.home.att.net" so that "att.net" does not
get on the list.  But that would only be effective if the
deliberately randomized domains like "abc.xyz.spammerdomain.com"
were already reduced to "spammerdomain.com" in the source data.
Otherwise we would lose both kinds of entries.

In other words, if the data is a literal transcription of
everything found in spams, including randomized URIs like
"abc.xyz.spammerdomain.com", then we will lose the latter if I
discard all subdomains.
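
A quick sketch of that trade-off, assuming a simple regex
filter that keeps only two-label entries.  The pattern, the
function name drop_subdomain_entries and the sample lists
are made up for this example (and the pattern would also
treat entries like "condoms.au.com" as subdomains).

```python
import re

# Matches entries with exactly two labels, e.g. "sex-aol.com".
BARE_DOMAIN = re.compile(r"^[a-z0-9-]+\.[a-z0-9-]+$")


def drop_subdomain_entries(entries):
    """Keep only entries that have no subdomain part."""
    return [e for e in entries if BARE_DOMAIN.match(e.lower())]


# Case 1: the source data is a literal transcription of spam URIs.
literal = [
    "xxxmovies.home.att.net",
    "abc.xyz.spammerdomain.com",
    "sex-aol.com",
]

# Case 2: randomized spammer hosts were already reduced upstream.
pre_reduced = [
    "xxxmovies.home.att.net",
    "spammerdomain.com",
    "sex-aol.com",
]

if __name__ == "__main__":
    # Case 1 loses the spammer domain along with the att.net entry.
    print(drop_subdomain_entries(literal))
    # Case 2 drops only the att.net entry and keeps the spammer domain.
    print(drop_subdomain_entries(pre_reduced))
```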

So Marc, can you tell us if the randomized domains that spammers
frequently use are reduced to the base domains in the adult
data, i.e. "spammerdomain.com" and not "abc.xyz.spammerdomain.com"?

Jeff C.
