Re: HTML link regex

Martin Gregorie Wed, 26 Sep 2012 08:39:32 -0700

On Wed, 2012-09-26 at 07:42 -0700, John Hardin wrote:
> On Wed, 26 Sep 2012, Martin Gregorie wrote:
> 
> > apart from the problem of matching the two halves if/when there is more 
> > than one URL in a message.
> 
> I'm not following what you mean here, could you explain that in a bit more 
> detail? I'm not proposing a hash by URL. The same URL with different 
> descriptions could appear multiple times in the message, and each one 
> would generate URL headers (modulo suppression of exact duplicates).
> 
I think we're in general agreement here. I was visualising two
correlated lists of pseudo-header content, one for the URL part of the
tag and a second for the visible text, because this would allow easy
cross-comparisons between headers and HTML body text, which would add
flexibility to rule composition. It's sometimes useful to compare URIs
across headers, e.g. in most (all?) ham the domain name in the From:
header, the Message-ID: and at least one of the Received: headers is the
same, so if the sender domain isn't in any of the Received: headers, it
could be forged.
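
A minimal sketch of the cross-header check described above, written in
Python purely for illustration (SpamAssassin plugins are Perl). The
function names sender_domain and domain_in_received are hypothetical,
and only the From: and Received: headers are compared; the Message-ID:
is left out for brevity.

```python
import re
from email import message_from_string
from email.utils import parseaddr

def sender_domain(msg):
    """Return the lower-cased domain of the From: address, or '' if absent."""
    _, addr = parseaddr(msg.get("From", ""))
    return addr.rpartition("@")[2].lower()

def domain_in_received(msg):
    """True if the From: domain (or a subdomain of it) appears in any Received: header."""
    domain = sender_domain(msg)
    if not domain:
        return False
    for header in msg.get_all("Received", []):
        # Pull out anything that looks like a dotted host name or address.
        for host in re.findall(r"[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+", header):
            host = host.lower()
            if host == domain or host.endswith("." + domain):
                return True
    return False

raw = (
    "From: Alice <alice@example.com>\n"
    "Received: from mail.example.com (mail.example.com [192.0.2.1])"
    " by mx.example.org; Wed, 26 Sep 2012 08:39:32 -0700\n"
    "\n"
    "body\n"
)
msg = message_from_string(raw)
suspicious = not domain_in_received(msg)
```

A missing match is only a hint of forgery, not proof, since plenty of
legitimate mail is relayed through third-party hosts.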
 
> > Either way, this type of test would only work on an HTML body part.
> 
> That was an unstated assumption on my part, as that's the only context 
> where this sort of "obfuscation" is even possible.
> 
Yep. I just wanted to make this explicit.


> > That said, I have another suggestion: if the HTML parser can build an 
> > associative array, using the URL as the element's key and the text 
> > half as the value, it would be easy to use a plugin to compare the 
> > key and value.
> 
> Rules are easier to write on an ad-hoc basis than are plugins. My thinking 
> was to let the plugin do the difficult part of extracting the data from 
> the HTML and quoted-unreadable markup and present it to the rules in a 
> standardized, easy-to-use form. Then a header rule (bounded) can be 
> written to perform whatever further analysis is desired or suggested by 
> spammer practices.
> 
Yes, agreed, but there are some types of tests, often cross-header
tests as described above, that would be extraordinarily difficult to
write as a set of rules but easy to write as a plugin, simply because
you can store message fragments in variables and do comparisons with
these, rather than with [lists of] constants in a rule. At first I
thought this could be handled by having a set of variables that were
accessible from rules, i.e. more pseudo-headers, but then realised that
SA makes no guarantees about the order in which rules are executed,
other than metarules being executed after all the (sub)rules they
reference. Obviously this prevents the use of variables in rules, since
a rule reading a variable could run before the rule that sets it.
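
The extraction step discussed above, pulling the URL and the visible
text out of each anchor tag, could look roughly like the following. It
is a Python sketch only, not SpamAssassin code; LinkCollector and
extract_links are hypothetical names. It keeps a list of pairs rather
than an associative array so that the same URL with different
descriptions is not collapsed.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect (href, visible text) pairs for every <a href=...> element."""

    def __init__(self):
        super().__init__()
        self.links = []      # pairs in document order
        self._href = None    # href of the anchor currently open, if any
        self._text = []      # text fragments seen inside that anchor

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Normalise runs of whitespace in the visible text.
            text = " ".join("".join(self._text).split())
            self.links.append((self._href, text))
            self._href = None

def extract_links(html):
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links

sample = (
    '<p><a href="http://evil.test/x">www.bank.example</a> and '
    '<a href="http://evil.test/x">Click <b>here</b></a></p>'
)
pairs = extract_links(sample)
```

Each pair could then be exposed to rules as a pseudo-header, or compared
directly inside the plugin.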

> > IOW, the comparison should only generate a hit if both halves of the tag
> > hold a URI and the second half may legitimately omit the http:// or 
> > https:// prefix.
> 
> That prohibits possibly useful analysis of the description applied to a 
> URL. I don't want to predict that will never be useful data.
> 
Agreed, but that would only be a problem for some comparison operators:
I was thinking that, say, '~=' would only fire if both sides held a URI
rather than a regex. Other comparison operators could react differently.
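
One possible reading of such an operator, sketched in Python: the
comparison fires only when both halves look like URIs and their hosts
differ, and the visible half may omit the http:// or https:// prefix.
The names host_of and uri_mismatch are hypothetical, and treating "fire"
as "hosts differ" is an assumption, not something settled in this
thread.

```python
import re
from urllib.parse import urlsplit

# Very rough test for "looks like a host name": dotted labels plus a TLD.
_HOST_RE = re.compile(r"^(?:[a-z0-9-]+\.)+[a-z]{2,}$")

def host_of(value):
    """Return the lower-cased host if value looks like a URI, else None."""
    value = value.strip()
    if not re.match(r"^https?://", value, re.I):
        value = "http://" + value  # visible text may omit the scheme
    try:
        host = urlsplit(value).hostname
    except ValueError:
        return None
    if host and _HOST_RE.match(host):
        return host
    return None

def uri_mismatch(href, text):
    """Fire only if both halves hold a URI and their hosts differ."""
    a, b = host_of(href), host_of(text)
    return a is not None and b is not None and a != b

hit = uri_mismatch("http://evil.test/login", "www.bank.example")
same = uri_mismatch("https://www.bank.example/login", "www.bank.example")
plain = uri_mismatch("http://evil.test/login", "Click here")
```

Because plain descriptions such as "Click here" never produce a host,
they are left for other operators or ordinary regex rules to analyse.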
 
Martin
