Re: HTML link regex

Martin Gregorie Wed, 26 Sep 2012 03:03:23 -0700

On Tue, 2012-09-25 at 22:12 -0700, John Hardin wrote:
> I'm thinking something like this, using what you presented as an example:
> 
> Generated internal pseudo-header:
>     X-Spam-URL: http://www.probono.fr/95280_pdf|http://www.youtube.com/watch?v=3VvOFqaHbL5&feature=g-vrec&feature=g-vrec
> 
> ...basically, URL-part|displayed-text-part
> 
> (Suggestions for a more appropriate delimiter than "|" are solicited...)
> 
> Repeat the header for each URL found that has displayed text; only 
> include those where the displayed text is not the same as the URL.
> 
> Then you could write a simple bounded rule like:
> 
> header  YT_LINK_SPOOF  X-Spam-URL  m,\|https?://[^/]*youtube\.com/watch,i
> 
> As long as you're already capturing the data, it might be useful to 
> generate _two_ pseudo-headers per URL; add X-Spam-URL-DomainOnly with 
> URL-domain|displayed-text-domain only if the displayed text looks like a 
> URL. For example:
> 
>      X-Spam-URL-DomainOnly:   www.probono.fr|www.youtube.com
> 
Of the two, I would prefer the second apart from the problem of matching
the two halves if/when there is more than one URL in a message. Either
way, this type of test would only work on an HTML body part. I don't see
how it can help with plain text parts. That said, I have another
suggestion: if the HTML parser can build an associative array, using the
URL as the element's key and the text half as the value, it would be
easy to use a plugin to compare the key and value. Alternatively,
a new type of rule could be added to handle the comparison. Note that a
simple match/nomatch comparison would not hack it because tags of the
form 
<a href="http://www.example.com">My website</a>
should always be accepted and so should
<a href="http://www.example.com">www.example.com</a>


IOW, the comparison should only generate a hit if both halves of the tag
hold a URI and the two point at different hosts. The second half may
legitimately omit the http:// or https:// prefix.
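
To make the comparison concrete, here is a rough sketch in Python rather
than as a SpamAssassin plugin. Everything in it is an assumption made for
illustration: the names LinkCollector, host_of and spoofed_links are
hypothetical, the URI_LIKE pattern is only one guess at what "looks like a
URI" should mean, and only host names are compared. It collects
(href, text) pairs instead of a true associative array so that the same
URL appearing twice with different text is not lost.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urlsplit

# Assumed heuristic: displayed text "looks like a URI" if it is an optional
# http(s):// prefix followed by a dotted host name and an optional remainder.
URI_LIKE = re.compile(
    r'^(?:https?://)?[a-z0-9-]+(?:\.[a-z0-9-]+)+(?:[/:?#]\S*)?$', re.I)


class LinkCollector(HTMLParser):
    """Collect (href, displayed text) pairs for every <a href=...> element."""

    def __init__(self):
        super().__init__()
        self.links = []    # (href, text) pairs in document order
        self._href = None  # href of the <a> currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def host_of(uri):
    """Return the lower-cased host, tolerating a missing http(s):// prefix."""
    if "://" not in uri:
        uri = "http://" + uri
    return (urlsplit(uri).hostname or "").lower()


def spoofed_links(html):
    """Return the (href, text) pairs where both halves are URIs on different hosts."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    hits = []
    for href, text in parser.links:
        if not URI_LIKE.match(text):
            continue  # plain text such as "My website" is always accepted
        if host_of(href) != host_of(text):
            hits.append((href, text))
    return hits
```

Called on an HTML body part, spoofed_links() returns an empty list for
both of the tags shown above and returns the pair when the href points at
www.probono.fr while the displayed text is a youtube.com URL. A stricter
or looser notion of "same site" (for example treating example.com and
www.example.com as equal) would need a different host_of().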

  
Martin
