On Wed, 2012-09-26 at 07:42 -0700, John Hardin wrote:
> On Wed, 26 Sep 2012, Martin Gregorie wrote:
>
> > apart from the problem of matching the two halves if/when there is more
> > than one URL in a message.
>
> I'm not following what you mean here, could you explain that in a bit more
> detail? I'm not proposing a hash by URL. The same URL with different
> descriptions could appear multiple times in the message, and each one
> would generate URL headers (modulo suppression of exact duplicates).
>
I think we're in general agreement here. I was visualising two
correlated lists of pseudo-header content, one for the URL part of the
tag and a second for the visible text, because this would allow easy
cross-comparisons between headers and HTML body text, which would add
flexibility to rule composition.

It's sometimes useful to compare URIs across headers, e.g. in most
(all?) ham the domain name in the From: header, the Message-ID: and at
least one of the Received: headers is the same, so if the sender domain
isn't in any of the Received: headers, it could be forged.

> > Either way, this type of test would only work on an HTML body part.
>
> That was an unstated assumption on my part, as that's the only context
> where this sort of "obfuscation" is even possible.
>
Yep. I just wanted to make this explicit.
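
To make the cross-header comparison above concrete, here is a minimal
sketch in Python rather than as an SA plugin. It is only an illustration
under stated assumptions: the function names (domain_of,
same_or_subdomain, sender_domain_checks) are made up for this example,
and "same domain" is assumed to mean an exact or subdomain match, which
is cruder than anything a real plugin would want.

```python
import re
from email import message_from_string
from email.utils import parseaddr

# Rough hostname matcher: dot-separated labels (this also matches IP addresses).
HOST_RE = re.compile(r"[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")


def domain_of(addr):
    """Return the lower-cased part after '@' in an address or Message-ID, or ''."""
    addr = addr.strip().strip("<>")
    return addr.rsplit("@", 1)[1].lower() if "@" in addr else ""


def same_or_subdomain(host, domain):
    """Assumption for this sketch: mail.example.org counts as example.org."""
    return host == domain or host.endswith("." + domain)


def sender_domain_checks(raw_message):
    """Report whether the From: domain also shows up in Message-ID: and Received:."""
    msg = message_from_string(raw_message)
    from_domain = domain_of(parseaddr(msg.get("From", ""))[1])
    mid_domain = domain_of(msg.get("Message-ID", ""))
    # Collect every hostname-looking token from all Received: headers.
    received_hosts = {
        host.lower()
        for header in msg.get_all("Received", [])
        for host in HOST_RE.findall(header)
    }
    return {
        "from_domain": from_domain,
        "in_message_id": bool(from_domain)
        and same_or_subdomain(mid_domain, from_domain),
        "in_received": bool(from_domain)
        and any(same_or_subdomain(h, from_domain) for h in received_hosts),
    }
```

A message whose From: domain is missing from every Received: header
would come back with "in_received" set to False, which is the "could be
forged" hint; it is a hint only, since plenty of legitimate mail is
relayed through hosts in other domains.
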
> > That said, I have another suggestion: if the HTML parser can build an
> > associative array, using the URL as the element's key and the text
> > half as the value it would be easy to either use a plugin to compare the
> > key and value.
>
> Rules are easier to write on an ad-hoc basis than are plugins. My thinking
> was to let the plugin do the difficult part of extracting the data from
> the HTML and quoted-unreadable markup and present it to the rules in a
> standardized, easy-to-use form. Then a header rule (bounded) can be
> written to perform whatever further analysis is desired or suggested by
> spammer practices.
>
Yes, agreed, but there are some types of tests, often cross-header tests
as described above, that would be extraordinarily difficult to write as
a set of rules but easy to write as a plugin, simply because a plugin
can store message fragments in variables and do comparisons with these,
rather than with [lists of] constants in a rule.

At first I thought this could be handled by having a set of variables
that were accessible from rules, i.e. more pseudo-headers, but then
realised that SA makes no guarantees about the order in which rules are
executed, other than metarules being executed after all the (sub)rules
they reference. A rule that reads a variable could therefore run before
the rule that sets it, which prevents the use of variables in rules.

> > IOW, the comparison should only generate a hit if both halves of the tag
> > hold a URI and the second half may legitimately omit the http:// or
> > https:// prefix.
>
> That prohibits possibly useful analysis of the description applied to a
> URL. I don't want to predict that will never be useful data.
>
Agreed, but that would only be a problem for some comparison operators:
I was thinking that, say, '~=' would only fire if both sides held a URI
rather than a regex. Other comparison operators could react differently.

Martin
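
PS: a rough sketch of the comparison I have in mind, again in Python
purely for illustration and not as SA plugin code. The names
(AnchorCollector, host_of, mismatched_links) and the URI-matching regex
are assumptions of this sketch. It keeps a list of (href, text) pairs
rather than an array keyed on the URL, so that the same URL with
different descriptions is not lost, and it only reports a hit when both
halves of the tag look like a URI, with the http:// or https:// prefix
being optional.

```python
import re
from html.parser import HTMLParser

# Assumed notion of "looks like a URI": optional http(s):// prefix, a dotted
# hostname, then an optional port/path/query/fragment with no whitespace.
URI_LIKE = re.compile(
    r"^(?:https?://)?([A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+)(?:[:/?#]\S*)?$",
    re.IGNORECASE,
)


class AnchorCollector(HTMLParser):
    """Collect (href, visible text) pairs for every <a href=...> element."""

    def __init__(self):
        super().__init__()
        self.pairs = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.pairs.append((self._href, "".join(self._text).strip()))
            self._href = None


def host_of(text):
    """Return the lower-cased hostname if text looks like a URI, else None."""
    match = URI_LIKE.match(text.strip())
    return match.group(1).lower() if match else None


def mismatched_links(html):
    """Return (href, text) pairs where both halves are URIs with different hosts."""
    parser = AnchorCollector()
    parser.feed(html)
    parser.close()
    hits = []
    for href, text in parser.pairs:
        href_host, text_host = host_of(href), host_of(text)
        if href_host and text_host and href_host != text_host:
            hits.append((href, text))
    return hits
```

Note that this compares hostnames exactly, so www.example.org against
example.org would count as a mismatch; a real test would need a more
forgiving idea of "same site". Links whose text is an ordinary
description are simply skipped here, which leaves them available for
other kinds of analysis.
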