Re: Get count of URLs in message

Karsten Bräckelmann Fri, 06 Dec 2013 14:39:18 -0800

On Fri, 2013-12-06 at 16:02 -0500, Joe Quinn wrote:
> The file 10_hasbase.cf has the following rule:
>      uri     __HAS_URI /./
> 
> Is there a similar rule anywhere (or a way to write one), which could 
> match against emails containing many URIs?


"tflags multiple" is the general answer to counting. There are a few
caveats and issues to consider in this particular case, though.

  uri    __HAS_N_URIS  /^./
  tflags __HAS_N_URIS  multiple

  meta     HAS_4_URIS  __HAS_N_URIS >= 4

The non-scoring sub-rule does the counting, while the meta defines an
actual rule based on the number of occurrences.


Important notes regarding that rule-set:

* It is NOT sufficient to simply add tflags multiple to the __HAS_URI
rule. With tflags multiple, RE evaluation continues where the previous
match ended. Thus, /./ would hit once per character, counting the chars
in all URIs rather than the URIs themselves.

The RE /^./ prevents this by anchoring at the beginning of the string,
thus the beginning of a URI.

* The above requires SA 3.3; it does NOT work with 3.2.

The reason is the URI parser re-design for 3.3. Previous versions
matched uri rules against multiple cleaned, canonicalized versions in
some cases. A plain example.net in text without protocol results in a
duplicate "http://example.net" in the list of URIs -- impossible to
filter out.
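
To illustrate the first note: the char-counting effect can be reproduced
outside of SA. This is only a rough Python model of "evaluation continues
where the previous match ended", not SpamAssassin's actual code. The
count_hits helper and the sample URI list are made up for the example.

```python
import re

def count_hits(pattern, uris):
    """Count matches over all URIs, resuming after each previous match."""
    regex = re.compile(pattern)
    # finditer() keeps scanning from where the previous match ended,
    # similar in spirit to repeated evaluation under tflags multiple.
    return sum(1 for uri in uris for _ in regex.finditer(uri))

# Hypothetical list of URIs found in a message.
uris = ["http://example.net", "http://example.org/a"]

print(count_hits(r".", uris))   # one hit per character in every URI
print(count_hits(r"^.", uris))  # one hit per URI, thanks to the anchor
```

The unanchored pattern reports the total number of characters, the
anchored one reports the number of URIs.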


For development, or to verify what the rule matches exactly, a slightly
modified version can be used:

  uri    __HAS_N_URIS  /^.+/

Greedy matching like that consumes the whole URI, which is handy with
the -D debug output listing the actual match -- for each of the multiple
rule matches.

  dbg: rules: ran uri rule __HAS_N_URIS ======> got hit: "http://example.net"
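
The same effect, what each hit actually contains, can be sketched in
Python. Again this is just an approximation for illustration, not SA
code. matched_text and the sample URIs are hypothetical.

```python
import re

def matched_text(pattern, uris):
    """Return the text of every match, one entry per hit."""
    regex = re.compile(pattern)
    return [m.group(0) for uri in uris for m in regex.finditer(uri)]

# Hypothetical list of URIs found in a message.
uris = ["http://example.net", "http://example.org/a"]

print(matched_text(r"^.", uris))   # only the first character of each URI
print(matched_text(r"^.+", uris))  # each URI in full, still one hit per URI
```

Both patterns yield the same number of hits. Only the matched text
differs, which is what makes the greedy variant useful while debugging.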


-- 
char *t="\10pse\0r\0dtu\0.@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1:
(c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}
