On Wed, Sep 15, 2004 at 02:17:15AM -0700, Jeff Chan wrote:
> On Wednesday, September 15, 2004, 1:38:30 AM, Julian Field wrote:
> > ... Is it possible to detect where
> > <A HREF="foo">bar</A>
> > and foo and bar are unrelated domains?
>
> That could be a good idea for a rule.  It would be nice if it
> could be determined canonically, without actually resolving
> either location.

IMHO this is nearly impossible. A trivial string back-reference check
can never determine whether 'foo' and 'bar' are un*related*. It can only
tell whether the text *in* the HREF differs from the text shown to the
user as the highlighted link.

Whenever the HREF is only 'semantically' *related* to the link text that
follows it, a string check will assume 'spam', while real spam/scam will
sooner or later just obfuscate the text portion with JavaScript or
encoding tricks.

E.g.:

  <a HREF="www.eplus.de">imail.de</a>

is 'related' (even if 'mis'constructed), because you reach the
'imail.de' mail via the 'www.eplus.de' webserver.

Also, the many mail texts of the kind

  ... to reach FOO click <a HREF="somedomain">here</a>

would be very difficult to 'analyze correctly'.

So I believe it to be an interesting idea for AI specialists, but alas
not for inclusion in SpamAssassin as it works now.

Stucki
(postmaster at mi.fu-berlin.de, using SpamAssassin 2.63)
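
To make the limitation concrete, here is a minimal sketch of such a naive string check. It is written in Python for illustration only and is not SpamAssassin code; the names `AnchorCollector`, `host_of` and `classify`, the rough domain regex, and the stripping of a leading "www." are all assumptions of this sketch. It compares only the host in the HREF with a host-like link text, so it can report "unequal" but never "unrelated".

```python
import re
from html.parser import HTMLParser
from urllib.parse import urlparse

# Rough "looks like a domain" test (an assumption of this sketch).
DOMAIN_RE = re.compile(r"^[a-z0-9-]+(\.[a-z0-9-]+)+$")


class AnchorCollector(HTMLParser):
    """Collect (href, visible text) pairs for every <a> element."""

    def __init__(self):
        super().__init__()
        self.anchors = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "a":
            self._href = dict(attrs).get("href") or ""
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.anchors.append((self._href, "".join(self._text).strip()))
            self._href = None


def host_of(value):
    """Return a lowercase host if value looks like a URL or bare domain, else None."""
    value = value.strip()
    if "//" not in value:
        value = "//" + value  # let urlparse treat a bare domain as the host part
    try:
        host = (urlparse(value).hostname or "").lower()
    except ValueError:
        return None
    if host.startswith("www."):
        host = host[4:]
    return host if DOMAIN_RE.match(host) else None


def classify(html):
    """Return (href, text, verdict) for each link; purely a string comparison."""
    parser = AnchorCollector()
    parser.feed(html)
    parser.close()
    results = []
    for href, text in parser.anchors:
        href_host, text_host = host_of(href), host_of(text)
        if href_host is None or text_host is None:
            verdict = "undecidable"  # e.g. link text "here": nothing to compare
        elif href_host == text_host:
            verdict = "match"
        else:
            verdict = "mismatch"  # unequal strings, NOT proof of unrelated domains
        results.append((href, text, verdict))
    return results


print(classify('<a HREF="www.eplus.de">imail.de</a>'))
print(classify('to reach FOO click <a HREF="somedomain">here</a>'))
```

With the two examples from this mail, the first link comes out as "mismatch" although the domains are in fact related (a false positive), and the second comes out as "undecidable" because the link text carries no domain at all.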