On Thu, 30 Jan 2014, Amir Caspi wrote:

On Jan 30, 2014, at 10:28 AM, Kevin A. McGrail <kmcgr...@pccc.com> wrote:

      If you want to share the complete rule, I can throw it into my sandbox 
and see what masscheck thinks as well.


The complete rule would be something like this, assuming Andy implemented it as 
I wrote it:

rawbody HTML_NONSENSE_TAGS /(?:<[A-Za-z0-9]{4,}>\s*){10,}/
describe HTML_NONSENSE_TAGS Many consecutive multi-letter HTML tags, likely nonsense/spam
score HTML_NONSENSE_TAGS 0.001

Actually, that unbounded {10,} repeat can be written as an explicit {10} without
reducing the effectiveness of the rule, and doing so makes it more CPU efficient.
I.e., once you've found at least 10 consecutive pseudo-tags, do you care if there
are more than 10? You're not looking for anything specific after the match, nor
doing anything with the exact number of them.
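
A quick way to sanity-check that claim is to run both forms of the pattern over a few bodies and compare the hit/no-hit verdicts. This is only a rough sketch in Python rather than SpamAssassin itself: the sample bodies and the hits() helper are made up for illustration, and it assumes Python's re module treats these quantifiers the same way Perl does for this pattern.

```python
import re

# Patterns copied from the rule under discussion; only the final quantifier differs.
UNBOUNDED = re.compile(r"(?:<[A-Za-z0-9]{4,}>\s*){10,}")
BOUNDED = re.compile(r"(?:<[A-Za-z0-9]{4,}>\s*){10}")


def hits(pattern, text):
    """Return True if the pattern matches anywhere in text (a rule 'hit')."""
    return pattern.search(text) is not None


# Hypothetical sample bodies for illustration, not real spam samples.
samples = {
    "nine tags": "<abcd> " * 9,
    "ten tags": "<abcd> " * 10,
    "fifty tags": "<abcd>\n" * 50,
    "short tags": "<b><i><u>" * 10,
}

for name, body in samples.items():
    # Both columns should agree: the rule only cares whether there is a match.
    print(name, hits(UNBOUNDED, body), hits(BOUNDED, body))
```

The verdicts should agree on every sample. The only difference is how much text each match consumes: the {10} form stops after the tenth pseudo-tag, while the {10,} form keeps going to the end of the run.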


--
Dave Funk                                  University of Iowa
<dbfunk (at) engineering.uiowa.edu>        College of Engineering
319/335-5751   FAX: 319/384-0549           1256 Seamans Center
Sys_admin/Postmaster/cell_admin            Iowa City, IA 52242-1527
#include <std_disclaimer.h>
Better is not better, 'standard' is better. B{
