What might be interesting to do is run a regexp on JUST the domain names.

Similar to how MT-Blacklist works (the spam plugin for MovableType weblogs):
http://www.jayallen.org/comment_spam/

I've seen that most of these sites follow patterns similar to the ones Jay Allen has pointed out. Why not perform these tests on the spam URLs themselves?

He's got a nice base to work off of, and his top 15 rules seem to catch most of the spammers:

([\w\-_.]+\.)?(l(so|os)tr)\.[a-z]{2,}     # Catchall regex for lsotr.xxx and lostr.xxx with or without a subdomain
(blow)[\w\-_.]*job[\w\-_.]*\.[a-z]{2,}
(buy)[\w\-_.]*online[\w\-_.]*\.[a-z]{2,}     # Catchall regexp for many spam sites
(diet|penis)[\w\-_.]*(pills|enlargement)[\w\-_.]*\.[a-z]{2,}     # Catchall regexp for many spam sites
(i|la)-sonneries?[\w\-_.]*\.[a-z]{2,}
(levitra|lolita|phentermine|viagra|vig-?rx|zyban|valtex|xenical|adipex|meridia\b)[\w\-_.]*\.[a-z]{2,}     # Super regexp for domains containing levitra, lolita, phentermine, viagra, vigrx, vig-rx, zyban, valtex, xenical, adipex and meridia
(magazine)[\w\-_.]*(finder|netfirms)[\w\-_.]*\.[a-z]{2,}
(mike)[\w\-_.]*apartment[\w\-_.]*\.[a-z]{2,}     # Catchall regexp for Mike's Apartment variations
(milf)[\w\-_.]*(hunter|moms|fucking)[\w\-_.]*\.[a-z]{2,}
(online)[\w\-_.]*casino[\w\-_.]*\.[a-z]{2,}     # Catchall regexp for a hundred online casino sites
(prozac|zoloft|xanax|valium|hydrocodone|vicodin|paxil|vioxx)[\w\-_.]*\.[a-z]{2,}     # Super regexp for domains containing prozac, zoloft, xanax, valium, hydrocodone, vicodin, paxil, vioxx
(ragazze)-?\w+\.[a-z]{2,}     # Catchall regexp for many spam sites
(ultram\b|\btenuate|tramadol|pheromones|phendimetrazine|ionamin|ortho.?tricyclen|retin.?a)[\w\-_.]*\.[a-z]{2,}     # Third drug super regexp
(valtrex|zyrtec|\bhgh\b|ambien\b|flonase|allegra|didrex|renova\b|bontril|nexium)[\w\-_.]*\.[a-z]{2,}     # Fourth drug super regexp


That covers a ton of spam URLs.
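
To make the idea concrete, here is a rough Python sketch of running a few of the rules above against only the host names found in a message body. The function names (hosts_in, spammy_hosts) and the URL-finding regexp are my own assumptions for illustration, not anything MT-Blacklist or SpamAssassin actually ships; the three rules are copied from the list above.

```python
import re
from urllib.parse import urlsplit

# Three of the rules quoted above, copied verbatim.
DOMAIN_RULES = [re.compile(p, re.IGNORECASE) for p in (
    r"(buy)[\w\-_.]*online[\w\-_.]*\.[a-z]{2,}",
    r"(online)[\w\-_.]*casino[\w\-_.]*\.[a-z]{2,}",
    r"(mike)[\w\-_.]*apartment[\w\-_.]*\.[a-z]{2,}",
)]

# Assumption: a deliberately simple URL finder, good enough for a sketch.
URL_RE = re.compile(r'https?://[^\s<>"]+', re.IGNORECASE)

def hosts_in(body):
    """Return the set of host names used by http(s) URLs in the text."""
    hosts = set()
    for url in URL_RE.findall(body):
        try:
            host = urlsplit(url).hostname
        except ValueError:
            continue  # malformed URL, skip it
        if host:
            hosts.add(host)
    return hosts

def spammy_hosts(body):
    """Return the hosts that match at least one domain rule, sorted."""
    return sorted(h for h in hosts_in(body)
                  if any(rule.search(h) for rule in DOMAIN_RULES))
```

Because the rules only ever see the host name, a path such as /buy-online.html on an otherwise innocent site does not trigger a match.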

His excellent plugin works off of the URLs that spammers specify. It's a great solution for that situation. Email is a bit tougher, but perhaps we could harness this capacity?



Gary Funck wrote:

As a follow-up to, but off-topic from the bug report ...

------- Additional Comments From [EMAIL PROTECTED]  2004-01-25 02:18 -------
I don't like the idea of having to run mass-checks manually and
extracting domain names to check from that -- mostly because most
people won't do it.

How about this:

- Extract registerable domain part using reportedly existing heuristics
 (hostpart.spammer.co.uk -> spammer.co.uk)
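
One possible shape for such a heuristic, sketched in Python. The suffix set below is a tiny hand-picked sample and purely an assumption for illustration; a usable version would need a complete list of multi-label suffixes, and the function name is hypothetical.

```python
# Assumption: a small illustrative sample of two-label suffixes under which
# domains are registered. A real list would be far longer.
TWO_LABEL_SUFFIXES = {"co.uk", "org.uk", "com.au", "co.jp"}

def registrable_domain(host):
    """Reduce a host name to the part that would be registered.

    Returns None when the input is too short to contain a registrable name.
    """
    labels = host.lower().strip(".").split(".")
    if len(labels) < 2:
        return None
    # Keep three labels under a known two-label suffix, otherwise two.
    keep = 3 if ".".join(labels[-2:]) in TWO_LABEL_SUFFIXES else 2
    if len(labels) < keep:
        return None
    return ".".join(labels[-keep:])
```

With this sample list, hostpart.spammer.co.uk reduces to spammer.co.uk and lane.nerbq.wealthnation.net reduces to wealthnation.net.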




Over the weekend, I collected 3600 host names associated with 16,300 URLs extracted from about 80,000 spam messages going back to August of this year. They're sorted in reverse dot order, for example:

trimtram.net
trinketreach.net
www.try4free.net
www.ultrastats.net
umbrellacover.net
www.usagov.net
www.usaskylink.net
ns.usenetsolution.net
www.vacationpromo.net
mysite.verizon.net
viva-x.net
www.vivato.net
bradford.hfwnflvzxb.wealthnation.net
lane.nerbq.wealthnation.net
www.whitephantom.net
www.whitetrashsluts.net
www.whoringfor-college.net
www.wideep.net

As you can see, for example, the wealthnation.net entries are together, but
the host name prefixes are different.
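
For anyone wanting to reproduce that ordering, a minimal Python sketch follows. It assumes "reverse dot order" means comparing the dot-separated labels from right to left, which is consistent with the sample above; the function name is made up for illustration.

```python
def reverse_dot_sort(hosts):
    """Sort host names by their labels read right to left (TLD first)."""
    return sorted(hosts, key=lambda h: h.lower().split(".")[::-1])
```

Sorting this way is what puts every host under the same domain next to each other, regardless of its prefix.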

Question: is there a Perl package that can be used to boil these down
to their domain name part, suitable for a whois lookup? Where I'm going
with this is to try to build a database of domains that share the same
registrar, technical point of contact, and so on. One approach I thought
of was to try a whois on the fully qualified host names above and, if it
doesn't succeed, remove the first component and try again, and so on,
but that's not very elegant.
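
The strip-and-retry approach described above might look roughly like this in Python. The lookup argument is a stand-in for whatever whois client is used (an assumption, since none is named here); it is expected to return something truthy when a record exists. Both function names are hypothetical.

```python
def candidate_names(host):
    """Yield the host name, then each shorter suffix, stopping before the bare TLD."""
    labels = host.lower().strip(".").split(".")
    return [".".join(labels[i:]) for i in range(len(labels) - 1)]

def first_registered(host, lookup):
    """Try lookup() on each candidate, longest first.

    Returns (name, record) for the first candidate with a record, or None.
    """
    for name in candidate_names(host):
        record = lookup(name)
        if record:
            return name, record
    return None
```

For bradford.hfwnflvzxb.wealthnation.net this tries the full name, then hfwnflvzxb.wealthnation.net, then wealthnation.net, so the cost is one query per label stripped.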

Regarding whois, I tried a few of the domains in the list and noticed
that whois turned up empty. Is there a database somewhere that relates
domain names to their registrar, or to a server that will reply with their
whois info?








-- Robert J. Accettura [EMAIL PROTECTED]

