On 09/26/2012 12:02 PM, Martin Gregorie wrote:
On Tue, 2012-09-25 at 22:12 -0700, John Hardin wrote:
I'm thinking something like this, using what you presented as an example:
Generated internal pseudo-header:
X-Spam-URL: http://www.probono.fr/95280_pdf|http://www.youtube.com/watch?v=3VvOFqaHbL5&feature=g-vrec&feature=g-vrec
...basically, URL-part|displayed-text-part
(Suggestions for a more appropriate delimiter than "|" are solicited...)
Repeat the header for each URL found that has displayed text; only
include those where the displayed text is not the same as the URL.
Then you could write a simple bounded rule like:
header YT_LINK_SPOOF X-Spam-URL m,\|https?://[^/]*youtube\.com/watch,i
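For illustration only, here is a rough Python sketch of how such pseudo-headers might be generated from an HTML body part. It is not how SpamAssassin's own parser works; the names AnchorCollector and pseudo_headers are hypothetical, and it assumes each anchor carries a single href and that nested anchors do not occur.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect (href, displayed text) pairs from <a> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.pairs = []      # finished (href, text) pairs
        self._href = None    # href of the <a> currently open, if any
        self._text = []      # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.pairs.append((self._href, "".join(self._text).strip()))
            self._href = None


def pseudo_headers(html, delimiter="|"):
    """Return one 'X-Spam-URL: url|text' line per link whose text differs from its URL."""
    collector = AnchorCollector()
    collector.feed(html)
    collector.close()
    return [
        f"X-Spam-URL: {href}{delimiter}{text}"
        for href, text in collector.pairs
        if text and text != href   # skip links with no text or identical text
    ]
```

Links whose displayed text is empty or identical to the href produce no header, which matches the "only include those where the displayed text is not the same as the URL" condition above.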
As long as you're already capturing the data, it might be useful to
generate _two_ pseudo-headers per URL; add X-Spam-URL-DomainOnly with
URL-domain|displayed-text-domain only if the displayed text looks like a
URL. For example:
X-Spam-URL-DomainOnly: www.probono.fr|www.youtube.com
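A minimal Python sketch of the domain-only variant, under the assumption that "looks like a URL" means the displayed text starts with http:// or https://. The function name domain_only_header is hypothetical and only illustrates the idea.

```python
import re
from urllib.parse import urlsplit

# Assumption: displayed text "looks like a URL" if it starts with http:// or https://
URL_LIKE = re.compile(r"^https?://", re.IGNORECASE)


def domain_only_header(href, text):
    """Return 'X-Spam-URL-DomainOnly: href-host|text-host', or None if not applicable."""
    text = text.strip()
    if not URL_LIKE.match(text):
        return None                      # displayed text is not URL-like
    href_host = urlsplit(href).hostname  # lower-cased host name, or None
    text_host = urlsplit(text).hostname
    if not href_host or not text_host:
        return None
    return f"X-Spam-URL-DomainOnly: {href_host}|{text_host}"
```

Called with the probono.fr href and the youtube.com displayed text from the example, this yields the header line shown above; for displayed text such as "My website" it returns None.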
Of the two, I would prefer the second apart from the problem of matching
the two halves if/when there is more than one URL in a message. Either
way, this type of test would only work on an HTML body part. I don't see
how it can help with plain text parts. That said, I have another
suggestion: if the HTML parser can build an associative array, using the
URL as the element's key and the text half as the value, it would be
easy to use a plugin to compare the key and value. Alternatively,
a new type of rule could be added to handle the comparison. Note that a
simple match/nomatch comparison would not hack it because tags of the
form
<a href="http://www.example.com">My website</a>
should always be accepted and so should
<a href="http://www.example.com">www.example.com</a>
IOW, the comparison should only generate a hit if both halves of the tag
hold a URI and the two differ, bearing in mind that the second half may
legitimately omit the http:// or https:// prefix.
Have you looked at the URIDetail plugin?