Re: HTML link regex

Axb Wed, 26 Sep 2012 03:06:20 -0700

On 09/26/2012 12:02 PM, Martin Gregorie wrote:

On Tue, 2012-09-25 at 22:12 -0700, John Hardin wrote:

I'm thinking something like this, using what you presented as an example:


Generated internal pseudo-header:
     X-Spam-URL: http://www.probono.fr/95280_pdf|http://www.youtube.com/watch?v=3VvOFqaHbL5&feature=g-vrec&feature=g-vrec

...basically, URL-part|displayed-text-part

(Suggestions for a more appropriate delimiter than "|" are solicited...)

Repeat the header for each URL found that has displayed text; only
include those where the displayed text is not the same as the URL.

Then you could write a simple bounded rule like:

header  YT_LINK_SPOOF  X-Spam-URL  m,\|https?://[^/]*youtube\.com/watch,i
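
To make the proposal concrete, here is a rough Python sketch of how such a pseudo-header could be built and tested. It is not SpamAssassin code: the names AnchorCollector and pseudo_headers are hypothetical, the "|" delimiter is the provisional one suggested above, and the regex is just the YT_LINK_SPOOF pattern translated to Python.

```python
import re
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect (href, displayed text) pairs from <a> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.pairs = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an anchor with an href.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.pairs.append((self._href, "".join(self._text).strip()))
            self._href = None


def pseudo_headers(html):
    """Return one 'href|text' value per anchor whose text differs from its href."""
    parser = AnchorCollector()
    parser.feed(html)
    parser.close()
    return [f"{href}|{text}" for href, text in parser.pairs if text and text != href]


# Python translation of the YT_LINK_SPOOF pattern from the rule above.
YT_LINK_SPOOF = re.compile(r"\|https?://[^/]*youtube\.com/watch", re.I)
```

Feeding pseudo_headers() a body containing the probono.fr/youtube.com anchor should yield one value that YT_LINK_SPOOF matches, while an anchor whose text equals its href produces no pseudo-header at all.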

As long as you're already capturing the data, it might be useful to
generate _two_ pseudo-headers per URL; add X-Spam-URL-DomainOnly with
URL-domain|displayed-text-domain only if the displayed text looks like a
URL. For example:

      X-Spam-URL-DomainOnly:   www.probono.fr|www.youtube.com
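
A similarly rough sketch of the domain-only variant follows. The helper names host_of and domain_only_header, and the URL_LIKE heuristic for "looks like a URL", are assumptions made for illustration only; a real implementation would presumably reuse the URI parsing already present in the HTML parser.

```python
import re
from urllib.parse import urlsplit

# Heuristic (an assumption): optional http(s):// followed by a dotted host name.
URL_LIKE = re.compile(r"^(?:https?://)?[a-z0-9-]+(?:\.[a-z0-9-]+)+(?:[/:?#]|$)", re.I)


def host_of(value):
    """Return the lower-cased host if value looks like a URL, else None."""
    value = value.strip()
    if not URL_LIKE.match(value):
        return None
    if not re.match(r"https?://", value, re.I):
        value = "http://" + value  # displayed text may omit the scheme
    return urlsplit(value).hostname


def domain_only_header(href, text):
    """Build 'URL-domain|displayed-text-domain', or None if the text is not URL-like."""
    href_host = host_of(href)
    text_host = host_of(text)
    if href_host is None or text_host is None:
        return None
    return f"{href_host}|{text_host}"
```

With the example above, domain_only_header() gives "www.probono.fr|www.youtube.com", and it gives None for displayed text such as "My website".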

Of the two, I would prefer the second apart from the problem of matching
the two halves if/when there is more than one URL in a message. Either
way, this type of test would only work on an HTML body part. I don't see
how it can help with plain text parts. That said, I have another
suggestion: if the HTML parser can build an associative array, using the
URL as the element's key and the text half as the value, it would be
easy to use a plugin to compare the key and value. Alternatively,
a new type of rule could be added to handle the comparison. Note that a
simple match/nomatch comparison would not hack it because tags of the
form
<a href="http://www.example.com">My website</a>
should always be accepted and so should
<a href="http://www.example.com">www.example.com</a>

IOW, the comparison should only generate a hit if both halves of the tag
hold a URI and the two differ, bearing in mind that the second half may
legitimately omit the http:// or https:// prefix.


Have you looked at the URIDetail plugin?
