Re: Help with a regex to catch spam with gibberish html tags

John Hardin Wed, 29 Jan 2014 10:36:55 -0800

On Wed, 29 Jan 2014, Joe Quinn wrote:

On 1/29/2014 11:53 AM, Andy Jezierski wrote:

 I've been noticing a lot of spam getting through with the same traits, a
 bunch of random words within brackets.  They all seem to come after the
 </body> or the </html> tag.  Anyone much more knowledgeable than me care
 to assist with a rule to detect them?


 Example:

 </html>

 </body>
 <style>
 <geehrter>
 <convaincre>
 <eingerichtet>
 <piuttosto>
 <meny>


...etc snipped.

I've been seeing that as well. They seem to all begin with <style> as well, to keep that crap from going through mail client HTML parsers.
You can probably exploit the fact that nobody is ever going to write a style block that doesn't match /[{}]/, but I haven't been able to experiment yet with any rules.
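
A minimal sketch of that brace check, written in Python rather than as a SpamAssassin rule so it can be tried outside of SA. The names STYLE_BLOCK and has_braceless_style are made up for illustration, and the sketch assumes the raw HTML part is available as one string. It treats an unclosed <style> as running to the end of the message, as in the example above.

```python
import re

# Hypothetical helper, not part of SpamAssassin: capture the body of each
# <style> block; an unclosed block is taken to run to the end of the text.
STYLE_BLOCK = re.compile(
    r"<style\b[^>]*>(.*?)(?:</style\s*>|\Z)",
    re.IGNORECASE | re.DOTALL,
)

def has_braceless_style(raw_html: str) -> bool:
    """Return True if any non-empty <style> block contains no '{' or '}'."""
    for match in STYLE_BLOCK.finditer(raw_html):
        body = match.group(1)
        if body.strip() and not re.search(r"[{}]", body):
            return True
    return False
```

One caveat with this approach: a legitimate style block that holds only an @import line has no braces either, so a check like this would probably work better as one input to a scored rule than as a verdict on its own.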


There is already a style gibberish rule.

http://ruleqa.spamassassin.org/20140128-r1562007-n/STYLE_GIBBERISH/detail

I wouldn't recommend going the more general route of counting invalid HTML tags, simply due to the enormity of trying to maintain such a rule over time.

Not in a rule, certainly. That would be more proper in a plugin. Agreed that maintenance of the list of valid HTML tags would be an ongoing issue unless the list is available in machine-parseable form somewhere and a code generator based on that is used to support the plugin.
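
For illustration only, here is roughly what the counting side of such a plugin might look like, sketched in Python rather than as an actual SpamAssassin plugin. KNOWN_TAGS is a small hand-written subset standing in for the generated list, and TAG_NAME and unknown_tags are hypothetical names. It scans the raw text with a regex instead of an HTML parser, on the assumption that a parser would treat everything after <style> as stylesheet text and never report the bogus tags.

```python
import re

# Illustrative subset only (an assumption): a real plugin would generate this
# set from a machine-parseable list of valid HTML elements.
KNOWN_TAGS = frozenset({
    "html", "head", "body", "style", "title", "meta", "p", "a", "div",
    "span", "br", "img", "table", "tr", "td", "b", "i", "font",
})

# Opening tags only: '<', a name starting with a letter, optional attributes.
TAG_NAME = re.compile(r"<([A-Za-z][A-Za-z0-9]*)\b[^<>]*>")

def unknown_tags(raw_html: str) -> list[str]:
    """Return the lowercased names of opening tags not found in KNOWN_TAGS."""
    names = (m.group(1).lower() for m in TAG_NAME.finditer(raw_html))
    return [name for name in names if name not in KNOWN_TAGS]
```

The length of the returned list could then be compared against a threshold. Choosing that threshold, and keeping the tag list current, is the maintenance burden discussed above.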


--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174     pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
  Maxim IX: Never turn your back on an enemy.
-----------------------------------------------------------------------
 3 days until the 11th anniversary of the loss of STS-107 Columbia
