On Fri, 9 Nov 2018, Amir Caspi wrote:
On Nov 9, 2018, at 8:49 AM, John Hardin <jhar...@impsec.org> wrote:
rawbody HTML_ENC_ASCII /(?:&\#(?:(?:\d{1,2}|1[01]\d|12[0-7])|x[0-7][0-9a-f])\s*;\s*){10}/i
I'll add that too so that we can compare the results.
Per my reply a few minutes ago, I think this will be too restrictive. While
the current batch may rely on pure ASCII encoding, it's only a matter of time
until they start to throw unicode lookalikes in there. I don't think there's
any legitimate reason for a long string of encoded chars, so using either of
the two rules I proposed yesterday would catch ALL HTML-encoded characters (in
the full Unicode range).
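
To illustrate the concern about lookalikes, here is a rough Python sketch. ASCII_ONLY is a direct transcription of the HTML_ENC_ASCII pattern quoted above. ANY_CODEPOINT is a hypothetical generalization written only for this example and is NOT either of the two rules proposed yesterday. The helper encode_entities and the sample strings are likewise made up for illustration.

```python
import re

# Transcription of the HTML_ENC_ASCII pattern quoted in this thread:
# ten or more consecutive numeric entities whose value is in 0..127.
ASCII_ONLY = re.compile(
    r"(?:&#(?:(?:\d{1,2}|1[01]\d|12[0-7])|x[0-7][0-9a-f])\s*;\s*){10}",
    re.IGNORECASE,
)

# Hypothetical generalization (an assumption, not a rule from this thread):
# ten or more consecutive numeric entities of any code point.
ANY_CODEPOINT = re.compile(
    r"(?:&#(?:\d{1,7}|x[0-9a-f]{1,6})\s*;\s*){10}",
    re.IGNORECASE,
)


def encode_entities(text: str) -> str:
    """Encode every character as a decimal HTML numeric entity."""
    return "".join(f"&#{ord(ch)};" for ch in text)


# Eleven characters, all plain ASCII.
ascii_run = encode_entities("Hello World")

# Same text with Cyrillic 'o' (U+043E) swapped in for Latin 'o',
# which breaks up every run of ten ASCII-range entities.
lookalike_run = encode_entities("Hell\u043e W\u043erld")

# Ordinary HTML with only a couple of scattered entities.
ordinary_html = "Caf&#233; &amp; cr&#232;me"

for label, sample in [
    ("ascii", ascii_run),
    ("lookalike", lookalike_run),
    ("ordinary", ordinary_html),
]:
    print(
        label,
        bool(ASCII_ONLY.search(sample)),
        bool(ANY_CODEPOINT.search(sample)),
    )
```

In this sketch the ASCII-only pattern stops matching as soon as a few non-ASCII code points are mixed into the run, while the broader pattern still matches; neither fires on the short, ordinary sample.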
Early results (not all corpora are in yet) look *very* promising:
3% of spam, S/O .958 and almost all spam hits are <5 points.
Cool! Though it looks like results are slightly down now, later in the day...
only ~1% of spam and S/O 0.931. Looks like it does hit a few hams, and on a
few corpora, hits ONLY ham.
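
For anyone reading along, here is a back-of-the-envelope Python sketch of how the spam percentage and S/O relate. It assumes S/O is the spam hit rate divided by the sum of the spam and ham hit rates; that definition is my understanding of the rule QA reports, not something stated in this thread. The function names are made up for this example.

```python
def so_ratio(spam_pct: float, ham_pct: float) -> float:
    """S/O under the assumed definition: spam% / (spam% + ham%)."""
    total = spam_pct + ham_pct
    if total == 0:
        raise ValueError("rule hit nothing; S/O is undefined")
    return spam_pct / total


def implied_ham_pct(spam_pct: float, so: float) -> float:
    """Ham hit rate implied by a spam hit rate and an S/O value.

    Rearranges so = spam / (spam + ham) to ham = spam * (1 - so) / so.
    """
    if not 0 < so <= 1:
        raise ValueError("S/O must be in (0, 1]")
    return spam_pct * (1 - so) / so


# Figures quoted in this thread: early results, then later in the day.
for spam_pct, so in [(3.0, 0.958), (1.0, 0.931)]:
    ham_pct = implied_ham_pct(spam_pct, so)
    print(f"spam {spam_pct}%  S/O {so}  implied ham {ham_pct:.3f}%")
```

Under that assumption, both sets of figures imply a ham hit rate well below 1%, which is consistent with the rule hitting only a few hams.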
I'd be interested to know if there's a performance difference between my two
proposed rules. I suspect the second should run (slightly) faster. I think
they'll both catch exactly the same number of spams (barring case sensitivity,
where the first rule needs to be corrected), and I don't foresee a significant
FP danger on the second rule despite its relative generality.
I think we have a winner. Thanks, Amir (and possibly RW)!
My pleasure. Please keep us posted on which version of the two rules performs
best.
I shall.
What's the recommendation on score? Or meta rules?
I'd have to look at the overlaps to decide what best to meta it with. For
a spam rule, you look at spam with no-ham overlaps, and for FP exclusion
you look at ham with no-spam overlaps.
What would be the timeline to distribute the rule via sa-update?
Potentially Monday.
--
John Hardin KA7OHZ http://www.impsec.org/~jhardin/
jhar...@impsec.org FALaholic #11174 pgpk -a jhar...@impsec.org
key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
Drugs will always be around. Politicians are therefore making an
active decision to distribute them through violent gangs. --twitter
-----------------------------------------------------------------------
2 days until Veterans Day