I think I know a bit about extracting URLs from spam ;)
It is pretty damn complicated. They play a lot of tricks, like www.amazon.com.buy-my-drugs-com.optelnd.net, where the domain you actually care about is optelnd.net.
Then you have hex and decimal links to deal with. And yeah, they do pepper the spam with legit URLs. What about Akamai image links? It was common to see 20 links in a spam, and only one was the evil one you wanted.
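To make the first two tricks concrete, here is a minimal Python sketch of the normalization you need before you can even compare hosts. The name link_target is hypothetical, and the "last two labels" rule is a loud assumption: it is wrong for suffixes like co.uk, and a real tool would consult the Public Suffix List. It also only handles a host written as one single integer, not dotted hex or octal. This is not what xurl001.pl or URIBL does, just an illustration.

```python
import ipaddress
from urllib.parse import urlsplit


def link_target(url):
    """Return the decoded IP or the naive registered domain a URL points at.

    Hypothetical helper, for illustration only.
    """
    # urlsplit lowercases the hostname; drop any trailing root dot.
    host = (urlsplit(url).hostname or "").rstrip(".")

    # A host written as one integer ("3232235777" or "0xC0A80101") is an
    # obfuscated IPv4 address. int(..., 0) accepts decimal and 0x-prefixed hex.
    try:
        return str(ipaddress.IPv4Address(int(host, 0)))
    except ValueError:
        pass

    # An ordinary dotted-quad (or IPv6) address is returned as-is.
    try:
        return str(ipaddress.ip_address(host))
    except ValueError:
        pass

    # ASSUMPTION: the registered domain is the last two labels.
    # This is wrong for suffixes such as co.uk.
    return ".".join(host.split(".")[-2:])


print(link_target("http://www.amazon.com.buy-my-drugs-com.optelnd.net/x"))  # optelnd.net
print(link_target("http://3232235777/"))   # 192.168.1.1
print(link_target("http://0xC0A80101/"))   # 192.168.1.1
```

Note that the brand name at the front of the hostname plays no part in the result; only the rightmost labels do.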
Automation without a LOT of checks and balances = FPs.
You have to have a LOT more auto-researched evidence than just that they are contained in a spam. But hey! A+ for effort! It's a start, and it will always get better.
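As a rough idea of the very first layer of checks and balances, here is a Python sketch that only proposes a domain once it has shown up in several different spams and is not on a known-good list. Everything named here is an assumption for illustration: the WHITELIST entries, the MIN_MESSAGES threshold, and the candidate_domains function are all hypothetical, and counting appearances is still nowhere near enough evidence on its own.

```python
from collections import defaultdict

# ASSUMPTION: a hand-maintained list of known-good domains (hypothetical entries).
WHITELIST = {"amazon.com", "akamai.net"}

# ASSUMPTION: arbitrary threshold; tune it against your own corpus.
MIN_MESSAGES = 3


def candidate_domains(messages, whitelist=WHITELIST, min_messages=MIN_MESSAGES):
    """Return domains seen in at least min_messages spams, minus the whitelist.

    messages is an iterable with one collection of domains per spam.
    """
    seen_in = defaultdict(int)
    for domains in messages:
        # set() so a domain repeated inside one spam only counts once.
        for domain in set(domains):
            if domain not in whitelist:
                seen_in[domain] += 1
    return sorted(d for d, n in seen_in.items() if n >= min_messages)


spams = [
    {"amazon.com", "optelnd.net"},
    {"optelnd.net", "example.org"},
    {"optelnd.net", "amazon.com"},
]
print(candidate_domains(spams))  # ['optelnd.net']
```

Anything this returns is still only a candidate for a human, or for further automated research, to look at before it becomes a rule.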
Chris Santerre
SysAdmin and SARE/URIBL ninja
http://www.uribl.com
http://www.rulesemporium.com
> -----Original Message-----
> From: Kristopher Austin [mailto:[EMAIL PROTECTED]]
> Sent: Friday, February 10, 2006 11:04 AM
> To: [EMAIL PROTECTED]; spamassassin-users@incubator.apache.org
> Subject: RE: Xtracting urls from saved spams & making SA rules -
> xurl001.pl
>
>
> I would recommend caution when using such a program. I see lots of spam
> that have legitimate URLs sprayed in them as well.
>
> I do think this would be very useful though. Just need to make sure you
> look through the rules and remove the good guys.
>
> Kris
>
> > -----Original Message-----
> > From: Michael W Cocke [mailto:[EMAIL PROTECTED]]
> > Sent: Friday, February 10, 2006 8:57 AM
> > To: spamassassin-users@incubator.apache.org
> > Subject: Xtracting urls from saved spams & making SA rules - xurl001.pl
> >
> > It's absolutely not finished, but attached is a quick perl hack I'm
> > using to read through a directory of saved spam (text files), extract
> > URLs and automatically build SA rules for them. It's not debugged
> > thoroughly and I have a few more things to add, but I know I'm not the
> > only person who can use this.
> >
> > Mike-
> > --
> > If you're not confused, you're not trying hard enough.
> > --
> > Please note - Due to the intense volume of spam, we have installed
> > site-wide spam filters at catherders.com. If email from you bounces,
> > try non-HTML, non-encoded, non-attachments,
>