On Tue, 2012-09-25 at 22:12 -0700, John Hardin wrote:
> I'm thinking something like this, using what you presented as an example:
>
> Generated internal pseudo-header:
> X-Spam-URL:
> http://www.probono.fr/95280_pdf|http://www.youtube.com/watch?v=3VvOFqaHbL5&feature=g-vrec&feature=g-vrec
>
> ...basically, URL-part|displayed-text-part
>
> (Suggestions for a more appropriate delimiter than "|" are solicited...)
>
> Repeat the header for each URL found that has displayed text; only
> include those where the displayed text is not the same as the URL.
>
> Then you could write a simple bounded rule like:
>
> header YT_LINK_SPOOF X-Spam-URL m,\|https?://[^/]*youtube\.com/watch,i
>
> As long as you're already capturing the data, it might be useful to
> generate _two_ pseudo-headers per URL; add X-Spam-URL-DomainOnly with
> URL-domain|displayed-text-domain only if the displayed text looks like a
> URL. For example:
>
> X-Spam-URL-DomainOnly: www.probono.fr|www.youtube.com
>
Of the two, I would prefer the second, apart from the problem of matching
the two halves if/when there is more than one URL in a message.

Either way, this type of test would only work on an HTML body part. I
don't see how it can help with plain text parts.

That said, I have another suggestion: if the HTML parser can build an
associative array, using the URL as the element's key and the text half
as the value, it would be easy to use a plugin to compare the key and
value. Alternatively, a new type of rule could be added to handle the
comparison.

Note that a simple match/nomatch comparison would not hack it, because
tags of the form

<a href="http://www.example.com">My website</a>

should always be accepted, and so should

<a href="http://www.example.com">www.example.com</a>

IOW, the comparison should only generate a hit if both halves of the tag
hold a URI, bearing in mind that the second half may legitimately omit
the http:// or https:// prefix.

Martin
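
For illustration only, the comparison described above could be sketched
roughly as follows. This is Python rather than a SpamAssassin plugin, and
every name in it (AnchorCollector, host_of, spoofed_links, the URI-like
pattern) is a hypothetical stand-in, not part of SpamAssassin. It assumes
that "holds a URI" can be approximated by "optional http:// or https://
prefix followed by a dotted host name", and that two halves agree when
their host names are identical.

```python
import re
from html.parser import HTMLParser

# Assumption: text "looks like a URI" if it is an optional http(s) scheme,
# a dotted host name, and an optional port/path/query/fragment, no spaces.
_URI_LIKE = re.compile(
    r'^(?:https?://)?([a-z0-9-]+(?:\.[a-z0-9-]+)+)(?:[:/?#]\S*)?$', re.I)

def host_of(text):
    """Return the lowercased host if text looks like a URI, else None."""
    m = _URI_LIKE.match(text.strip())
    return m.group(1).lower() if m else None

class AnchorCollector(HTMLParser):
    """Build an associative array: href -> displayed text of each <a> tag."""
    def __init__(self):
        super().__init__()
        self.links = {}      # a repeated href keeps only its last text
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links[self._href] = "".join(self._text).strip()
            self._href = None

def spoofed_links(html):
    """Return (href, text) pairs where both halves hold a URI on different hosts."""
    parser = AnchorCollector()
    parser.feed(html)
    parser.close()
    hits = []
    for href, text in parser.links.items():
        href_host, text_host = host_of(href), host_of(text)
        # Hit only if BOTH halves are URI-like and the hosts disagree.
        if href_host and text_host and href_host != text_host:
            hits.append((href, text))
    return hits
```

With this sketch, "My website" is never a hit because it is not URI-like,
and "www.example.com" is accepted because the missing prefix is optional.
Exact host comparison is a simplification: it would flag example.com
against www.example.com, so a real test would probably compare registered
domains instead.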