On Fri, 9 Nov 2018, Amir Caspi wrote:

On Nov 9, 2018, at 8:49 AM, John Hardin <jhar...@impsec.org> wrote:

rawbody   HTML_ENC_ASCII   /(?:&\#(?:(?:\d{1,2}|1[01]\d|12[0-7])|x[0-7][0-9a-f])\s*;\s*){10}/i

I'll add that too so that we can compare the results.
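
For anyone who wants to poke at the HTML_ENC_ASCII pattern quoted above outside of SpamAssassin, here is a minimal Python sketch. It only ports the regex (the /i flag becomes re.IGNORECASE); it does not reproduce how rawbody rules are fed message content. The helper names encode_decimal, encode_hex and hits are made up for illustration.

```python
import re

# Python port of the HTML_ENC_ASCII regex; /i becomes re.IGNORECASE.
HTML_ENC_ASCII = re.compile(
    r"(?:&\#(?:(?:\d{1,2}|1[01]\d|12[0-7])|x[0-7][0-9a-f])\s*;\s*){10}",
    re.IGNORECASE,
)


def encode_decimal(text):
    """Hypothetical helper: write every character as a decimal entity."""
    return "".join(f"&#{ord(ch)};" for ch in text)


def encode_hex(text):
    """Hypothetical helper: write every character as a hex entity."""
    return "".join(f"&#x{ord(ch):02x};" for ch in text)


def hits(body):
    """True if the body contains 10 consecutive ASCII-range entities."""
    return HTML_ENC_ASCII.search(body) is not None


twelve_ascii = "Hello World!"   # 12 characters, all below 128
nine_ascii = "Hello Wor"        # only 9 characters, one short of the {10}
cyrillic = "".join(chr(c) for c in range(0x410, 0x41C))  # 12 non-ASCII letters

print(hits(encode_decimal(twelve_ascii)))
print(hits(encode_hex(twelve_ascii)))
print(hits(encode_decimal(nine_ascii)))
print(hits(encode_decimal(cyrillic)))
```

The first two calls should report a hit and the last two should not, since the pattern needs ten entities in a row and only accepts values 0-127 (decimal) or 00-7f (hex).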

Per my reply a few minutes ago, I think this will be too restrictive.  While 
the current batch may rely on pure ASCII encoding, it's only a matter of time 
until they start to throw unicode lookalikes in there.  I don't think there's 
any legitimate reason for a long string of encoded chars, so using either of 
the two rules I proposed yesterday would catch ALL HTML-encoded characters (in 
the full UTF-16 set).
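
To illustrate the lookalike concern, here is a rough Python sketch. The two rules proposed yesterday are not quoted in this message, so ANY_CODEPOINT below is a hypothetical stand-in that accepts any numeric entity, not either of those rules. The sample text and helper name are also invented for the example.

```python
import re

# The ASCII-restricted pattern from the HTML_ENC_ASCII rule.
ASCII_ONLY = re.compile(
    r"(?:&\#(?:(?:\d{1,2}|1[01]\d|12[0-7])|x[0-7][0-9a-f])\s*;\s*){10}",
    re.IGNORECASE,
)

# ASSUMPTION: a generic "any numeric entity" pattern, used only for contrast.
ANY_CODEPOINT = re.compile(
    r"(?:&\#(?:\d+|x[0-9a-f]+)\s*;\s*){10}",
    re.IGNORECASE,
)


def encode_decimal(text):
    """Hypothetical helper: write every character as a decimal entity."""
    return "".join(f"&#{ord(ch)};" for ch in text)


# U+0456 (Cyrillic "i") replaces the Latin "i", splitting the text into
# ASCII runs of 7 and 8 characters, both shorter than the required 10.
lookalike = "Cheap v\u0456agra now"
encoded = encode_decimal(lookalike)

ascii_hit = ASCII_ONLY.search(encoded) is not None
general_hit = ANY_CODEPOINT.search(encoded) is not None
print(ascii_hit, general_hit)
```

A single non-ASCII entity in the middle is enough to break the run for the ASCII-only pattern, while the generic pattern still sees 16 entities in a row.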

Early results (not all corpora are in yet) look *very* promising:
3% of spam, S/O .958 and almost all spam hits are <5 points.

Cool!  Though it looks like results are slightly down now, later in the day... 
only ~1% of spam and S/O 0.931.  Looks like it does hit a few hams, and on a 
few corpora, hits ONLY ham.
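
For readers unfamiliar with the S/O figure, here is a small Python sketch of the arithmetic. It assumes the usual reading of S/O as spam% / (spam% + ham%); the function names are invented, and the ham percentages it derives are back-of-the-envelope values implied by that assumption, not numbers reported in this thread.

```python
def s_over_o(spam_pct, ham_pct):
    """S/O under the assumed definition: spam% / (spam% + ham%)."""
    total = spam_pct + ham_pct
    if total == 0:
        raise ValueError("rule hit nothing, S/O is undefined")
    return spam_pct / total


def implied_ham_pct(spam_pct, so):
    """Invert the formula above to get the ham% implied by a spam% and S/O."""
    return spam_pct * (1.0 - so) / so


# Figures quoted in the thread: 3% spam at S/O .958, later ~1% at S/O 0.931.
early_ham = implied_ham_pct(3.0, 0.958)
later_ham = implied_ham_pct(1.0, 0.931)
print(f"early: ~{early_ham:.2f}% of ham, later: ~{later_ham:.2f}% of ham")
```

Under that assumption both sets of figures imply a ham hit rate on the order of a tenth of a percent, which fits the observation that the rule hits a few hams.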

I'd be interested to know if there's a performance difference between my two 
proposed rules.  I suspect the second should run (slightly) faster.  I think 
they'll both catch exactly the same number of spams (barring case sensitivity, 
where the first rule needs to be corrected), and I don't foresee a significant 
FP danger on the second rule despite its relative generality.

I think we have a winner. Thanks, Amir (and possibly RW)!

My pleasure. Please keep us posted on which version of the two rules performs 
best.

I shall.

What's the recommendation on score?  Or meta rules?

I'd have to look at the overlaps to decide what best to meta it with. For a spam rule, you look at spam with no-ham overlaps, and for FP exclusion, you look at ham with no-spam overlaps.

What would be the timeline to distribute the rule via sa-update?

Potentially Monday.


--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174     pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
  Drugs will always be around. Politicians are therefore making an
  active decision to distribute them through violent gangs. --twitter
-----------------------------------------------------------------------
 2 days until Veterans Day
