Re: Bayes underperforming, HTML entities?

2018-12-07 Thread John Hardin

On Fri, 7 Dec 2018, Amir Caspi wrote:


On Dec 6, 2018, at 12:14 PM, John Hardin  wrote:


Runaway backtracking that was killing masscheck for several people.


Hrm, that is disconcerting.  I'm not sure where any backtracking might be 
occurring...


This sort of thing is risky, especially in a rawbody rule:

   (?:\w|\s|[.,!?:'"()$])*


I don't see where there is backtracking, and I tested this on spamples prior to 
suggesting it... but clearly I must have missed something.  Any help is 
appreciated.


It was apparently only going insane on specific messages, your corpus 
probably didn't have a poisonous one.
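To see the failure mode, here is a much-reduced analogue (illustrative Perl 
only, not the actual rule): nesting one quantifier inside another, over 
sub-patterns that can match the same text, lets the engine try exponentially 
many partitions of the input before it admits failure.

  use strict;
  use warnings;
  use Time::HiRes qw(time);

  my $text = ('a' x 28) . 'b';       # near-miss input: the trailing 'b'
                                     # forces the overall match to fail
  my $t0 = time();
  my $matched = $text =~ /^(a+)+$/;  # classic catastrophic backtracking;
                                     # each extra 'a' roughly doubles the
                                     # work (timing varies by perl version)
  printf "matched=%d elapsed=%.2fs\n", $matched ? 1 : 0, time() - $t0;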


(John: is it worth sandboxing the other proposed ZW rule, 
AC_HTML_ZEROWIDTH_BONANZA, or would that be duplicated by the unicode ZW 
obfuscation rules?  The difference is that this is a rawbody rule.)


I'll take a look, I didn't focus on that one.


--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174     pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
---
  Public Education: the bureaucratic process of replacing
  an empty mind with a closed one.  -- Thorax
---
 Today: The 77th anniversary of Pearl Harbor


Re: Bayes underperforming, HTML entities?

2018-12-07 Thread Amir Caspi
On Dec 6, 2018, at 12:14 PM, John Hardin  wrote:
> 
> Runaway backtracking that was killing masscheck for several people.

Hrm, that is disconcerting.  I'm not sure where any backtracking might be 
occurring...

Can anyone help improve this suggested rule?

rawbody  AC_HTML_ENTITY_BONANZA_NEW  (?:(?:\w|\s|[.,!?:'"()$])*(?:&(?:[A-Za-z0-9]{2,}|#(?:[0-9]{2,5}|x[0-9A-F]{2,4}));\s*)+){10,}
describe AC_HTML_ENTITY_BONANZA_NEW  Lots of HTML entities, possibly interspersed within words

I don't see where there is backtracking, and I tested this on spamples prior to 
suggesting it... but clearly I must have missed something.  Any help is 
appreciated.
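One possible direction (an untested sketch, not a drop-in fix): make the 
filler-text prefix atomic with (?>...) and bound its length, so the engine 
cannot re-partition long stretches of plain text while hunting for the 
entity run.  Note the atomic group also narrows what the rule can match:

rawbody AC_HTML_ENTITY_BONANZA_NEW  /(?:(?>(?:\w|\s|[.,!?:'"()$]){0,40})(?:&(?:[A-Za-z0-9]{2,}|#(?:[0-9]{2,5}|x[0-9A-F]{2,4}));\s*)+){10,}/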

(John: is it worth sandboxing the other proposed ZW rule, 
AC_HTML_ZEROWIDTH_BONANZA, or would that be duplicated by the unicode ZW 
obfuscation rules?  The difference is that this is a rawbody rule.)

Thanks!

--- Amir



Re: Bayes underperforming, HTML entities?

2018-12-06 Thread John Hardin

On Tue, 4 Dec 2018, Amir Caspi wrote:


On Dec 1, 2018, at 10:31 AM, John Hardin  wrote:



On Thu, 29 Nov 2018, Amir Caspi wrote:


A) Could you sandbox the proposed rule change (AC_HTML_ENTITY_BONANZA_NEW) and 
see how it performs, including possible FPs?


Done.


Any preliminary results?


Runaway backtracking that was killing masscheck for several people.

I tried to limit it but was unsuccessful. The test rule has been disabled.

It will need more work to be usable.

--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174     pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
---
  It is not the business of government to make men virtuous
  or religious, or to preserve the fool from the consequences
  of his own folly.   -- Henry George
---
 Tomorrow: The 77th anniversary of Pearl Harbor


Re: Bayes underperforming, HTML entities?

2018-12-04 Thread John Hardin

On Tue, 4 Dec 2018, Amir Caspi wrote:


On Dec 1, 2018, at 10:31 AM, John Hardin  wrote:



On Thu, 29 Nov 2018, Amir Caspi wrote:


A) Could you sandbox the proposed rule change (AC_HTML_ENTITY_BONANZA_NEW) and 
see how it performs, including possible FPs?


Done.


Any preliminary results?


Not that are really usable yet. There's something strange going on with 
people's masschecks that is interfering with everybody getting results in 
a timely manner.



--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174     pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
---
  We should endeavour to teach our children to be gun-proof
  rather than trying to design our guns to be child-proof
---
 3 days until The 77th anniversary of Pearl Harbor


Re: Bayes underperforming, HTML entities?

2018-12-04 Thread Amir Caspi
On Dec 1, 2018, at 10:31 AM, John Hardin  wrote:
> 
>> On Thu, 29 Nov 2018, Amir Caspi wrote:
>> 
>>> A) Could you sandbox the proposed rule change (AC_HTML_ENTITY_BONANZA_NEW) 
>>> and see how it performs, including possible FPs?
> 
> Done.

Any preliminary results?

Looks like we have a couple other HTML-related things that need to be added.  
See spample:
https://pastebin.com/Few8fVfF 

1) Looks like &nbsp; is now being used instead of regular spaces to join some 
highly spammy words.  Are these turned into "regular" spaces by the HTML 
interpreter prior to body rules?  Or do they get turned into non-breaking space 
characters, which are different from regular spaces?  Like all the ZW stuff, 
this seems like it should get "normalized" so it can be available both in raw 
and normal form for Bayes to pick up...

2) This particular spample has its "Bayes poison" text within a div with 
line-height:0, but there does not appear to be a rule to capture this.  That 
same div uses font-size:1px, so I would have thought this would trigger a "tiny 
fonts" rule, but apparently not.

It would seem our tiny font and/or other "trying to make this invisible" rules 
should be updated to capture these attempts.

I also saw another spample which had opacity:0 set on its "Bayes poison" text, 
but the "low contrast" rule didn't pop.

Cheers.

--- Amir



Re: Bayes underperforming, HTML entities?

2018-12-01 Thread John Hardin

On Thu, 29 Nov 2018, John Hardin wrote:


On Thu, 29 Nov 2018, Amir Caspi wrote:


On Nov 29, 2018, at 3:27 PM, John Hardin  wrote:


I'll see whether those can be incorporated into the existing 
UNICODE_OBFU_ZW rule (which of course will no longer actually be UNICODE 
:) )


Great. Maybe rename the rule. ;-)

What are your thoughts on item #2?  Specifically:

A) Could you sandbox the proposed rule change (AC_HTML_ENTITY_BONANZA_NEW) 
and see how it performs, including possible FPs?


Sure.


Done.

--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174     pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
---
  Judicial Activism (n): interpreting the Constitution to grant the
  government powers that are popularly felt to be "needed" but that
  are not explicitly provided for therein (common definition);
  interpreting the Constitution as it is written (Brady definition)
---
 611 days since the first commercial re-flight of an orbital booster (SpaceX)


Re: Bayes underperforming, HTML entities?

2018-11-30 Thread RW
On Fri, 30 Nov 2018 15:49:31 -0700
Amir Caspi wrote:

> > It makes it harder to write rules detecting these tricks, but it may
> > happen eventually. As far as Bayes is concerned, it would be a
> > shame to lose the information.  
> 
> I'm not sure I see how Bayes can take decent advantage out of these
> zero-width chars.  If they are interspersed randomly within words,
> then Bayes has to tokenize each and every permutation (or, at least,
> very many permutations) of each word in order to be decently
> effective.  But if the zero-width chars are stripped out, then Bayes
> only has to tokenize the regular, displayable word.  Am I missing
> something?

Yes, you need something in between. A tokenization that avoids
learning the hundreds of obfuscation variants, but doesn't throw away
the existence of obfuscation. 

> But offering both converted and non-converted options is likely the
> best option, and then having Bayes work on the normalized version
> resolves the above.

Not simply on the normalized text; that way you lose information. In the
example I gave, the word:
 
   has[ZW]   (i.e. 'has' with an invisible character attached)

would get tokenized twice, once through the body and once through
the list of obfuscated words in the pseudo-header, producing the tokens:

   'has'
   'HX-Obfuscated-Norm:has'

The former token would likely be neutral and drop out, but the second
would probably only appear in spam. 

The upshot of this is that invisible obfuscation:

- no longer breaks body rules
- is easier for Bayes to learn than non-obfuscated text
- can still be tested via X-Obfuscated-Orig without the
  complexity of rawbody


Re: Bayes underperforming, HTML entities?

2018-11-30 Thread Bill Cole

On 30 Nov 2018, at 17:49, Amir Caspi wrote:

On Nov 30, 2018, at 7:00 AM, Bill Cole 
 wrote:


Since HTML is already getting rendered to text, then perhaps the 
conversion code should strip (literally, just delete) any zero-width 
characters during this conversion? That should make normal body 
rules, and Bayes, function properly, no?


Not if they are *looking for* those characters.


But AFAIK we're only looking for those characters with rawbody rules,


Not so.

because it's really hard to search for them in regular body rules... 
no?


No.

See the relevant rule cluster (all with 'ZW' in their names) in KAM.cf 
and __UNICODE_OBFU_ZW in the standard ruleset.


Also see my more generic (but still useful!) __SCC_SHORT_WORDS and 
derivatives in KAM.cf: it is a body rule that takes advantage of the 
fact that zero-width typographical control characters create logical 
word breaks as far as Perl is concerned.
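A quick standalone illustration of that word-break behavior (plain Perl, not 
taken from KAM.cf):

  use strict;
  use warnings;
  # U+200B (zero-width space) is not a word character to Perl, so it
  # splits "malware" into two short "words".
  my $s = "ma\x{200B}lware";
  print join('|', split /\W+/, $s), "\n";   # prints: ma|lware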





--
Bill Cole


Re: Bayes underperforming, HTML entities?

2018-11-30 Thread Amir Caspi
On Nov 30, 2018, at 7:00 AM, Bill Cole 
 wrote:
> 
>> Since HTML is already getting rendered to text, then perhaps the conversion 
>> code should strip (literally, just delete) any zero-width characters during 
>> this conversion? That should make normal body rules, and Bayes, function 
>> properly, no?
> 
> Not if they are *looking for* those characters.

But AFAIK we're only looking for those characters with rawbody rules, because 
it's really hard to search for them in regular body rules... no?  I'm not 
trying to advocate for removal of rawbody rules, but rather making it easier 
for normal body rules to work.

But RW's suggestion is probably a good one: offer both paths:

On Nov 30, 2018, at 7:46 AM, RW  wrote:
> 
> It makes it harder to write rules detecting these tricks, but it may
> happen eventually. As far as Bayes is concerned, it would be a shame to
> lose the information.

I'm not sure I see how Bayes can take decent advantage out of these zero-width 
chars.  If they are interspersed randomly within words, then Bayes has to 
tokenize each and every permutation (or, at least, very many permutations) of 
each word in order to be decently effective.  But if the zero-width chars are 
stripped out, then Bayes only has to tokenize the regular, displayable word.  
Am I missing something?

But offering both converted and non-converted options is likely the best 
option, and then having Bayes work on the normalized version resolves the above.

--- Amir



Re: Bayes underperforming, HTML entities?

2018-11-30 Thread RW
On Fri, 30 Nov 2018 06:29:31 -0700
Amir Caspi wrote:

> On Nov 30, 2018, at 6:09 AM, RW  wrote:
> > 
> > The most substantial problem here is that these invisible characters
> > make it very hard to write ordinary body rules.  
> 
> Thanks for the clarification on my confusion. Since HTML is already
> getting rendered to text, then perhaps the conversion code should
> strip (literally, just delete) any zero-width characters during this
> conversion? That should make normal body rules, and Bayes, function
> properly, no?
> 
> Is there a reason not to strip out zero-width characters? That is, is
> there any benefit or reason to maintain invisible chars versus
> throwing them out?

It makes it harder to write rules detecting these tricks, but it may
happen eventually. As far as Bayes is concerned, it would be a shame to
lose the information.

What I think might be a good compromise is to normalize out all
invisible and high-quality obfuscations, but add the original and
normalized words to two metadata headers.

So, if [a] stands for a homoglyph of 'a' and [ZW] for an invisible
zero-width character, then the text

   my m[a]lware has[ZW] copied your [ZW]address book

would be converted to 

   my malware has copied your address book

with the generation of

X-Obfuscated-Orig: m[a]lware has[ZW] [ZW]address
X-Obfuscated-Norm: malware has address

It would be possible to run header rules against either pseudo-header.

Bayes would ignore X-Obfuscated-Orig and tokenize X-Obfuscated-Norm with
a dedicated prefix. Most common English words from that header would be
strongly spammy.
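A rough sketch of how that generation could work (hypothetical code; nothing 
like this exists in SA today, and the homoglyph table and header names are 
just this thread's invention):

  use strict;
  use warnings;
  binmode STDOUT, ':encoding(UTF-8)';

  my %homoglyph = ("\x{0430}" => 'a');   # Cyrillic 'a' -> Latin 'a'
  my $invisible = qr/[\x{200B}\x{200C}\x{200D}\x{FEFF}]/;  # ZWSP/ZWNJ/ZWJ/BOM

  sub normalize_word {
      my ($w) = @_;
      $w =~ s/$invisible//g;                       # drop invisible chars
      $w =~ s{(.)}{ $homoglyph{$1} // $1 }ge;      # map known homoglyphs
      return $w;
  }

  my $body = "my m\x{0430}lware has\x{200B} copied your \x{200B}address book";
  my (@orig, @norm);
  for my $word (split / /, $body) {
      my $n = normalize_word($word);
      next if $n eq $word;                         # record obfuscated words only
      push @orig, $word;
      push @norm, $n;
  }
  print "X-Obfuscated-Orig: @orig\n";
  print "X-Obfuscated-Norm: @norm\n";              # malware has address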


Re: Bayes underperforming, HTML entities?

2018-11-30 Thread Bill Cole

On 30 Nov 2018, at 8:29, Amir Caspi wrote:


On Nov 30, 2018, at 6:09 AM, RW  wrote:


The most substantial problem here is that these invisible characters
make it very hard to write ordinary body rules.


Thanks for the clarification on my confusion. Since HTML is already 
getting rendered to text, then perhaps the conversion code should 
strip (literally, just delete) any zero-width characters during this 
conversion? That should make normal body rules, and Bayes, function 
properly, no?


Not if they are *looking for* those characters.

Is there a reason not to strip out zero-width characters? That is, is 
there any benefit or reason to maintain invisible chars versus 
throwing them out?


The presence of zero-width characters is a very strong spam indicator. 
It isn't quite perfect however, since at least one procedurally 
legitimate and rather popular US entity is sending mail that people 
affirmatively want to receive like this: 
https://www.scconsult.com/atkspam.txt


--
Bill Cole
b...@scconsult.com or billc...@apache.org
(AKA @grumpybozo and many *@billmail.scconsult.com addresses)
Available For Hire: https://linkedin.com/in/billcole


Re: Bayes underperforming, HTML entities?

2018-11-30 Thread Amir Caspi
On Nov 30, 2018, at 6:09 AM, RW  wrote:
> 
> The most substantial problem here is that these invisible characters
> make it very hard to write ordinary body rules.

Thanks for the clarification on my confusion. Since HTML is already getting 
rendered to text, then perhaps the conversion code should strip (literally, 
just delete) any zero-width characters during this conversion? That should make 
normal body rules, and Bayes, function properly, no?

Is there a reason not to strip out zero-width characters? That is, is there any 
benefit or reason to maintain invisible chars versus throwing them out?

Thanks!

--- Amir


Re: Bayes underperforming, HTML entities?

2018-11-30 Thread RW
On Thu, 29 Nov 2018 22:33:12 -0700
Amir Caspi wrote:

> On Nov 29, 2018, at 10:11 PM, Bill Cole
>  wrote:
> > 
> > I have no issue with adding a new rule type to act on the output of
> > a partial well-defined HTML parsing, something in between 'rawbody'
> > and 'body' types, but overloading normalize_charset with that and
> > so affecting every existing rule of all body-oriented rule types
> > would be a bad design.  
> 
> The problem as I see it is that spammers are using HTML encoding as
> effectively another charset, and as a way of obfuscating like they
> did/do with Unicode lookalikes... but unless those HTML characters
> are translated there is no way to catch this obfuscation.

normalize_charset is about converting text from whatever character set
it's in to UTF-8, and nothing else. SpamAssassin should already decode
HTML to text for body rules. Rules matching the HTML entities use
rawbody specifically to avoid having them converted to plain text.

The most substantial problem here is that these invisible characters
make it very hard to write ordinary body rules.
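Sketched as a pair of hypothetical rules (names invented), the split looks 
like this; the first sees the literal entity in the HTML source, the second 
sees at most the decoded character in the rendered text:

rawbody __SEES_RAW_ENTITY  /&\#8203;/
body    __SEES_RENDERED    /\x{200B}/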


Re: Bayes underperforming, HTML entities?

2018-11-29 Thread Amir Caspi
On Nov 29, 2018, at 10:11 PM, Bill Cole 
 wrote:
> 
> I have no issue with adding a new rule type to act on the output of a partial 
> well-defined HTML parsing, something in between 'rawbody' and 'body' types, 
> but overloading normalize_charset with that and so affecting every existing 
> rule of all body-oriented rule types would be a bad design.

The problem as I see it is that spammers are using HTML encoding as effectively 
another charset, and as a way of obfuscating like they did/do with Unicode 
lookalikes... but unless those HTML characters are translated there is no way 
to catch this obfuscation.

In other words -- the encoded entities DISPLAY as something different from the 
content over which rules run... and because encoding is cumbersome and not 
human-readable, it also makes writing rules to catch these MUCH harder. Worse 
yet, they evade Bayes almost completely because the encoded words don't 
tokenize well.

Maybe normalize_charset isn’t the right place to do it, but it seems like there 
should be some way of converting HTML-encoded entities into their 
single-character ASCII or Unicode equivalents before body rules and especially 
before Bayes tokenization, so that we can tokenize and run our rules on the 
-displayed- text and not the encoded text...

How best to achieve this?
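One conceivable building block (a sketch using the stock HTML::Entities 
module; this is not anything SA currently does in normalize_charset):

  use strict;
  use warnings;
  use HTML::Entities qw(decode_entities);

  my $html = 'Fr&#101;&#101; m&#111;n&#101;y';   # entity-obfuscated text
  my $text = decode_entities($html);             # yields 'Free money'
  print "$text\n";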

--- Amir


Re: Bayes underperforming, HTML entities?

2018-11-29 Thread Bill Cole

On 29 Nov 2018, at 17:32, Amir Caspi wrote:

B) Do you think that normalize_charsets could evolve to handle HTML 
entities?


That would be a mess. The normalize_charset option acts on the decoded 
text of text/* MIME parts before that text is parsed into meaningful 
tokens.


I have no issue with adding a new rule type to act on the output of a 
partial well-defined HTML parsing, something in between 'rawbody' and 
'body' types, but overloading normalize_charset with that and so 
affecting every existing rule of all body-oriented rule types would be a 
bad design.




--
Bill Cole
b...@scconsult.com or billc...@apache.org
(AKA @grumpybozo and many *@billmail.scconsult.com addresses)
Available For Hire: https://linkedin.com/in/billcole


Re: Bayes underperforming, HTML entities?

2018-11-29 Thread John Hardin

On Thu, 29 Nov 2018, Amir Caspi wrote:


On Nov 29, 2018, at 3:27 PM, John Hardin  wrote:


I'll see whether those can be incorporated into the existing UNICODE_OBFU_ZW 
rule (which of course will no longer actually be UNICODE :) )


Great. Maybe rename the rule. ;-)

What are your thoughts on item #2?  Specifically:

A) Could you sandbox the proposed rule change (AC_HTML_ENTITY_BONANZA_NEW) and 
see how it performs, including possible FPs?


Sure.


B) Do you think that normalize_charsets could evolve to handle HTML entities?


Potentially. I'm not familiar with that part of the code.


--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174     pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
---
  Gun Control laws aren't enacted to control guns, they are enacted
  to control people: catholics (1500s), japanese peasants (1600s),
  blacks (1860s), italian immigrants (1911), armenians (1911),
  the irish (1920s), jews (1930s), blacks (1960s), the poor (always)
---
 609 days since the first commercial re-flight of an orbital booster (SpaceX)


Re: Bayes underperforming, HTML entities?

2018-11-29 Thread Amir Caspi
On Nov 29, 2018, at 3:27 PM, John Hardin  wrote:
> 
> I'll see whether those can be incorporated into the existing UNICODE_OBFU_ZW 
> rule (which of course will no longer actually be UNICODE :) )

Great. Maybe rename the rule. ;-)

What are your thoughts on item #2?  Specifically:

A) Could you sandbox the proposed rule change (AC_HTML_ENTITY_BONANZA_NEW) and 
see how it performs, including possible FPs?

B) Do you think that normalize_charsets could evolve to handle HTML entities?

Cheers.

--- Amir



Re: Bayes underperforming, HTML entities?

2018-11-29 Thread John Hardin

On Thu, 29 Nov 2018, Amir Caspi wrote:


1) A new variant is showing up lately, with liberal use of zero-width 
spaces/joiners. See spample:
https://pastebin.com/zBVWaiew 

This uses the &zwj; (zero-width joiner) HTML entity, interspersed within words. I 
don't see any legitimate reason that these should be present for Roman charsets and 
other non-complex scripts that don't require it.  Later in the spample there is similar 
usage of the &#8203; (zero-width space) entity. I've seen a few other examples 
with other zero-width entities, as well.


I'll see whether those can be incorporated into the existing 
UNICODE_OBFU_ZW rule (which of course will no longer actually be UNICODE :) )


--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174     pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
---
  Windows Genuine Advantage (WGA) means that now you use your
  computer at the sufferance of Microsoft Corporation. They can
  kill it remotely without your consent at any time for any reason;
  it also shuts down in sympathy when the servers at Microsoft crash.
---
 609 days since the first commercial re-flight of an orbital booster (SpaceX)


Re: Bayes underperforming, HTML entities?

2018-11-29 Thread Amir Caspi
On Nov 10, 2018, at 11:30 AM, John Hardin  wrote:
> 
> Initial results (again, all corpora aren't in yet)...
> 
> The rawbody rules perform much better (unsurprising), and the ASCII-only one 
> has a better raw S/O:
> 
> https://ruleqa.spamassassin.org/20181110-r1846283-n/__RW_HTML_ENTITY_ASCII_RAW/detail
>  
> 
> https://ruleqa.spamassassin.org/20181110-r1846283-n/__AC_HTML_ENTITY_BONANZA_SHRT_RAW/detail
>  
> 
> 
> The body one is still getting hits:
> 
> https://ruleqa.spamassassin.org/20181110-r1846283-n/__AC_HTML_ENTITY_BONANZA_SHRT_BODY/detail
>  
> 
> 
> ...but it's 99-100% overlap with the RAW version so it looks like it's purely 
> due to misformatting of the message.

Two new complications on this -- one could be solved by a new rule, but both 
could be solved by an evolution of the main rule.  The problem is that evolving 
the rule could pose an FP risk.  See spamples and discussion below.  Ultimately 
I think we need to consider HTML entities to be a form of character set, and 
have normalize_charsets convert them...

1) A new variant is showing up lately, with liberal use of zero-width 
spaces/joiners. See spample:
https://pastebin.com/zBVWaiew 

This uses the &zwj; (zero-width joiner) HTML entity, interspersed within words. 
I don't see any legitimate reason that these should be present for Roman 
charsets and other non-complex scripts that don't require it.  Later in the 
spample there is similar usage of the &#8203; (zero-width space) entity. I've 
seen a few other examples with other zero-width entities, as well.

A proposed rule to catch these zero-width entities (and variants) within Roman 
script:

rawbody AC_HTML_ZEROWIDTH_BONANZA   
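For illustration, a rule of that shape might look like the following (a 
guessed sketch only, not the actual proposed pattern):

rawbody AC_HTML_ZEROWIDTH_BONANZA  /(?:\w{1,15}&(?:zwn?j|\#(?:8203|x200[BCD]));){5}/i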

Re: Bayes underperforming, HTML entities?

2018-11-15 Thread John Hardin

On Thu, 15 Nov 2018, Amir Caspi wrote:


On Nov 15, 2018, at 2:36 PM, John Hardin  wrote:



It doesn't seem to have a very high score just yet... I'm still getting FNs 
with the rule hitting (due to those messages hitting BAYES_00/05).


Manually train those messages as spam and that should repair itself...


Actually... right now it looks like the rule still has the test score of 0.001 
when network tests are enabled... is that correct or a bug?


That will clean up after the network masscheck this weekend.


72_scores.cf: score HTML_ENTITY_ASCII       3.000 0.001 3.000 0.001
72_scores.cf: score HTML_ENTITY_ASCII_TINY  2.000 0.001 2.000 0.001

Seems like the score shouldn't depend on network tests, for this...


The score generation has to take into account the effect of hits from 
other rules that *are* network-based.


--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174     pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
---
  One unexpected benefit of time passing more quickly as you get older
  is the perceived increase in the frequency of paychecks.
---
 595 days since the first commercial re-flight of an orbital booster (SpaceX)


Re: Bayes underperforming, HTML entities?

2018-11-15 Thread John Hardin

On Thu, 15 Nov 2018, Amir Caspi wrote:


On Nov 15, 2018, at 2:36 PM, John Hardin  wrote:


That and its resistance to FP avoidance.


Despite the generality, I don't see a significant FP risk on the general unicode version. 
 I don't see ANY legitimate reason why an email would hard-encode long sequences of 
human-readable text, in any language or character set, using HTML entities like this.  
Legitimate emails can be sent with a character encoding intended for the target language 
and then the content doesn't need to be entity-encoded, it can just be included 
"properly" in the email.

My recollection is there were few to no FPs in the corpora test, right?  Or am 
I misremembering?


Fairly low; I asked the corpora owner for a review and they were all 
apparently legit.


I'll reenable the base rules so we can watch their performance. I don't 
think a subrule that isn't used gets published unless it's pushed with a 
tflag...


--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174     pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
---
  One unexpected benefit of time passing more quickly as you get older
  is the perceived increase in the frequency of paychecks.
---
 595 days since the first commercial re-flight of an orbital booster (SpaceX)


Re: Bayes underperforming, HTML entities?

2018-11-15 Thread Amir Caspi
On Nov 15, 2018, at 2:36 PM, John Hardin  wrote:
> 
>> It doesn't seem to have a very high score just yet... I'm still getting FNs 
>> with the rule hitting (due to those messages hitting BAYES_00/05).
> 
> Manually train those messages as spam and that should repair itself...

Actually... right now it looks like the rule still has the test score of 0.001 
when network tests are enabled... is that correct or a bug?

72_scores.cf: score HTML_ENTITY_ASCII       3.000 0.001 3.000 0.001
72_scores.cf: score HTML_ENTITY_ASCII_TINY  2.000 0.001 2.000 0.001

Seems like the score shouldn't depend on network tests, for this...

--- Amir



Re: Bayes underperforming, HTML entities?

2018-11-15 Thread Amir Caspi
On Nov 15, 2018, at 2:36 PM, John Hardin  wrote:
> 
> That and its resistance to FP avoidance.

Despite the generality, I don't see a significant FP risk on the general 
unicode version.  I don't see ANY legitimate reason why an email would 
hard-encode long sequences of human-readable text, in any language or character 
set, using HTML entities like this.  Legitimate emails can be sent with a 
character encoding intended for the target language and then the content 
doesn't need to be entity-encoded, it can just be included "properly" in the 
email.

My recollection is there were few to no FPs in the corpora test, right?  Or am 
I misremembering?

> Manually train those messages as spam and that should repair itself...

I've been... I think it's helping but there are still some FNs slipping 
through.  It'll probably take a while to get this caught up.

Cheers.

--- Amir



Re: Bayes underperforming, HTML entities?

2018-11-15 Thread John Hardin

On Thu, 15 Nov 2018, Amir Caspi wrote:


On Nov 10, 2018, at 11:30 AM, John Hardin  wrote:


The rawbody rules perform much better (unsurprising), and the ASCII-only one 
has a better raw S/O:


It looks like HTML_ENTITY_ASCII has been rolled out -- did you decide 
against the more general unicode due to S/O score?


That and its resistance to FP avoidance.


I predict we will need it sooner rather than later.


I can easily reenable the subrules so that we can see how they perform in 
masscheck.


It doesn't seem to have a very high score just yet... I'm still getting 
FNs with the rule hitting (due to those messages hitting BAYES_00/05).


Manually train those messages as spam and that should repair itself...

--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174     pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
---
  How can you reason with someone who thinks we're on a glidepath to
  a police state and yet their solution is to grant the government a
  monopoly on force? They are insane.
---
 595 days since the first commercial re-flight of an orbital booster (SpaceX)


Re: Bayes underperforming, HTML entities?

2018-11-15 Thread Amir Caspi
On Nov 10, 2018, at 11:30 AM, John Hardin  wrote:
> 
> The rawbody rules perform much better (unsurprising), and the ASCII-only one 
> has a better raw S/O:

It looks like HTML_ENTITY_ASCII has been rolled out -- did you decide against 
the more general unicode due to S/O score?  I predict we will need it sooner 
rather than later.

It doesn't seem to have a very high score just yet... I'm still getting FNs 
with the rule hitting (due to those messages hitting BAYES_00/05).

Cheers.

--- Amir




Re: Bayes underperforming, HTML entities?

2018-11-10 Thread John Hardin

On Fri, 9 Nov 2018, John Hardin wrote:


On Fri, 9 Nov 2018, John Hardin wrote:


On Fri, 9 Nov 2018, Amir Caspi wrote:

I'd be interested to know if there's a performance difference between my 
two proposed rules.  I suspect the second should run (slightly) faster.


It looks that way - only .0001s difference on *some* messages.

Re body vs. rawbody:

I fixed the MIME boundaries and the body version stopped working (as 
expected), so I added rawbody versions.


I do note that the first version of the rule checked in was a body rule, and 
it did hit on a bunch of spam... Any speculation as to why?


revisions checked in for side-by-side tests:
Sending        svn/trunk/rulesrc/sandbox/jhardin/20_misc_testing.cf
Transmitting file data .done
Committing transaction...
Committed revision 1846277.

Note the rule name changes - that's temporary, the survivor's name will be 
cleaned up a bit.


Initial results (again, all corpora aren't in yet)...

The rawbody rules perform much better (unsurprising), and the ASCII-only 
one has a better raw S/O:


https://ruleqa.spamassassin.org/20181110-r1846283-n/__RW_HTML_ENTITY_ASCII_RAW/detail
https://ruleqa.spamassassin.org/20181110-r1846283-n/__AC_HTML_ENTITY_BONANZA_SHRT_RAW/detail

The body one is still getting hits:

https://ruleqa.spamassassin.org/20181110-r1846283-n/__AC_HTML_ENTITY_BONANZA_SHRT_BODY/detail

...but it's 99-100% overlap with the RAW version so it looks like it's 
purely due to misformatting of the message.



--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174     pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
---
  Perfect Security and Absolute Safety are unattainable; beware
  those who would try to sell them to you, regardless of the cost,
  for they are trying to sell you your own slavery.
---
 Tomorrow: Veterans Day


Re: Bayes underperforming, HTML entities?

2018-11-09 Thread John Hardin

On Fri, 9 Nov 2018, John Hardin wrote:


On Fri, 9 Nov 2018, Amir Caspi wrote:

I'd be interested to know if there's a performance difference between my 
two proposed rules.  I suspect the second should run (slightly) faster.


It looks that way - only .0001s difference on *some* messages.

Re body vs. rawbody:

I fixed the MIME boundaries and the body version stopped working (as 
expected), so I added rawbody versions.


I do note that the first version of the rule checked in was a body 
rule, and it did hit on a bunch of spam... Any speculation as to why?


revisions checked in for side-by-side tests:
Sending        svn/trunk/rulesrc/sandbox/jhardin/20_misc_testing.cf
Transmitting file data .done
Committing transaction...
Committed revision 1846277.

Note the rule name changes - that's temporary, the survivor's name will be 
cleaned up a bit.


--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174     pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
---
  Campuses today are a theatrical mashup of
  1984 and Lord of the Flies, performed by people
  who don't understand these references.   -- David Burge
---
 2 days until Veterans Day


Re: Bayes underperforming, HTML entities?

2018-11-09 Thread RW
On Fri, 9 Nov 2018 15:34:47 -0500
Kris Deugau wrote:

> Amir Caspi wrote:
> > On Nov 9, 2018, at 8:10 AM, Matus UHLAR - fantomas
> >  wrote:  
> >>
> >> how many spams and hams did you train then?  
> > 
> > As of right now:
> > 0.000  0 258427  0  non-token data: nspam
> > 0.000  0 106813  0  non-token data: nham
> > 0.000  0 438310  0  non-token data: ntokens
> >   
> >> I have increased to this number, on some servers even to double of
> >> that number.  
> > 
> > I increased to your recommendation, so per above, am now storing
> > more tokens... hopefully this helps.  
> 
> My target for tweaking bayes_expiry_max_db_size at work has been to
> try to hit no more than 5-10% daily churn in tokens;  IIRC I've asked
> once or twice but nobody else has spoken up with any of their own
> rules of thumb.  Right now it's probably a bit high at 245 (given
> that every so often, there are a couple of days with no tokens
> expired), but the default of 250K was far too low.

The default is actually 150,000. IIRC his retention was 64 days which
isn't too bad. I'd take it up to 300,000 and see how it goes.

The standard expiry algorithm isn't designed to handle very long
retention and it may stop working altogether. IIRC the retention at the
target size of 0.75*bayes_expiry_max_db_size should be less than 256
days.
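In local.cf terms, that suggestion is just the following (bayes_auto_expire 
shown at its default of 1):

bayes_expiry_max_db_size  300000
bayes_auto_expire         1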


Re: Bayes underperforming, HTML entities?

2018-11-09 Thread John Hardin

On Fri, 9 Nov 2018, Amir Caspi wrote:


On Nov 9, 2018, at 8:49 AM, John Hardin  wrote:



rawbody   HTML_ENC_ASCII   /(?:&\#(?:(?:\d{1,2}|1[01]\d|12[0-7])|x[0-7][0-9a-f])\s*;\s*){10}/i


I'll add that too so that we can compare the results.


Per my reply a few minutes ago, I think this will be too restrictive.  While 
the current batch may rely on pure ASCII encoding, it's only a matter of time 
until they start to throw unicode lookalikes in there.  I don't think there's 
any legitimate reason for a long string of encoded chars, so using either of 
the two rules I proposed yesterday would catch ALL HTML-encoded characters (in 
the full UTF-16 set).


Early results (not all corpora are in yet) look *very* promising:
3% of spam, S/O .958 and almost all spam hits are <5 points.


Cool!  Though it looks like results are slightly down now, later in the day... 
only ~1% of spam and S/O 0.931.  Looks like it does hit a few hams, and on a 
few corpora, hits ONLY ham.

I'd be interested to know if there's a performance difference between my two 
proposed rules.  I suspect the second should run (slightly) faster.  I think 
they'll both catch exactly the same number of spams (barring case sensitivity, 
where the first rule needs to be corrected), and I don't foresee a significant 
FP danger on the second rule despite its relative generality.


I think we have a winner. Thanks, Amir (and possibly RW)!


My pleasure. Please keep us posted on which version of the two rules performs 
best.


I shall.


What's the recommendation on score?  Or meta rules?


I'd have to look at the overlaps to decide what best to meta it with. For 
a spam rule you look at spam with no-ham overlaps, and for FP exclusion 
you look at ham with no-spam overlaps.
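For instance (a hypothetical shape, not a committed rule), a meta built on 
the entity subrule might look like:

meta  HTML_ENTITY_SPAMSIGN  __AC_HTML_ENTITY_BONANZA && HTML_MESSAGE
score HTML_ENTITY_SPAMSIGN  2.0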



What would be the timeline to distribute the rule via sa-update?


Potentially Monday.


--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174     pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
---
  Drugs will always be around. Politicians are therefore making an
  active decision to distribute them through violent gangs. --twitter
---
 2 days until Veterans Day


Re: Bayes underperforming, HTML entities?

2018-11-09 Thread Kris Deugau

Amir Caspi wrote:

On Nov 9, 2018, at 8:10 AM, Matus UHLAR - fantomas  wrote:


how many spams and hams did you train then?


As of right now:
0.000  0 258427  0  non-token data: nspam
0.000  0 106813  0  non-token data: nham
0.000  0 438310  0  non-token data: ntokens


I have increased to this number, on some servers even to double of that
number.


I increased to your recommendation, so per above, am now storing more tokens... 
hopefully this helps.


My target for tweaking bayes_expiry_max_db_size at work has been to try 
to hit no more than 5-10% daily churn in tokens;  IIRC I've asked once 
or twice but nobody else has spoken up with any of their own rules of 
thumb.  Right now it's probably a bit high at 245 (given that every 
so often, there are a couple of days with no tokens expired), but the 
default of 250K was far too low.


-kgd


Re: Bayes underperforming, HTML entities?

2018-11-09 Thread Amir Caspi
On Nov 9, 2018, at 8:49 AM, John Hardin  wrote:
> 
>> rawbody   HTML_ENC_ASCII   /(?:&\#(?:(?:\d{1,2}|1[01]\d|12[0-7])|x[0-7][0-9a-f])\s*;\s*){10}/i
> 
> I'll add that too so that we can compare the results.

Per my reply a few minutes ago, I think this will be too restrictive.  While 
the current batch may rely on pure ASCII encoding, it's only a matter of time 
until they start to throw unicode lookalikes in there.  I don't think there's 
any legitimate reason for a long string of encoded chars, so using either of 
the two rules I proposed yesterday would catch ALL HTML-encoded characters (in 
the full UTF-16 set).

> Early results (not all corpora are in yet) look *very* promising:
> 3% of spam, S/O .958 and almost all spam hits are <5 points.

Cool!  Though it looks like results are slightly down now, later in the day... 
only ~1% of spam and S/O 0.931.  Looks like it does hit a few hams, and on a 
few corpora, hits ONLY ham.

I'd be interested to know if there's a performance difference between my two 
proposed rules.  I suspect the second should run (slightly) faster.  I think 
they'll both catch exactly the same number of spams (barring case sensitivity, 
where the first rule needs to be corrected), and I don't foresee a significant 
FP danger on the second rule despite its relative generality.

> I think we have a winner. Thanks, Amir (and possibly RW)!

My pleasure. Please keep us posted on which version of the two rules performs 
best.

What's the recommendation on score?  Or meta rules?

What would be the timeline to distribute the rule via sa-update?

Cheers!

-- Amir



Re: Bayes underperforming, HTML entities?

2018-11-09 Thread Amir Caspi
On Nov 9, 2018, at 8:10 AM, Matus UHLAR - fantomas  wrote:
> 
> how many spams and hams did you train then?

As of right now:
0.000  0 258427  0  non-token data: nspam
0.000  0 106813  0  non-token data: nham
0.000  0 438310  0  non-token data: ntokens

> I have increased to this number, on some servers even to double of that
> number.

I increased to your recommendation, so per above, am now storing more tokens... 
hopefully this helps.

Thanks!

--- Amir



Re: Bayes underperforming, HTML entities?

2018-11-09 Thread Amir Caspi
On Nov 9, 2018, at 7:41 AM, RW  wrote:
> 
> I was really referring to the fact that it's pure ASCII text that's
> being encoded rather than long runs per se

That is true for the current batch of messages, but as we've seen, spammers 
love to use unicode obfuscation to try to foil Bayes and other filters... hence 
why I figured we should be proactive and catch all HTML encoding.

> but you may well be right that long runs are inherently suspicious, I'm
> not very familiar with HTML practices.

AFAIK there is no good or sane reason to ever encode readable Roman-character 
text in this way, or likely any character set.  If characters can be embedded 
"as is" (i.e., without encoding) in the email, then they will be properly 
interpreted and rendered in HTML without any need for encoding.  Encoding only 
adds bulk and obfuscation, and I can see no legitimate reason to encode 
language in this way.

Apparently John's masscheck run last night would seem to agree. =)

Cheers!

--- Amir



Re: Bayes underperforming, HTML entities?

2018-11-09 Thread John Hardin

On Fri, 9 Nov 2018, RW wrote:


On Thu, 8 Nov 2018 19:24:47 -0700
Amir Caspi wrote:


On Nov 8, 2018, at 4:51 PM, RW  wrote:


Unnecessary encoding is fairly common, but long runs of ASCII
characters encoded like this seem extreme.


Right, that was a question I had asked in my email this morning...
whether we have a rule to detect long sequences of HTML entities.



I was really referring to the fact that it's pure ASCII text that's
being encoded rather than long runs per se, so I'm trying:

rawbody   HTML_ENC_ASCII   /(?:&\#(?:(?:\d{1,2}|1[01]\d|12[0-7])|x[0-7][0-9a-f])\s*;\s*){10}/i


I'll add that too so that we can compare the results.


but you may well be right that long runs are inherently suspicious, I'm
not very familiar with HTML practices.




Proposed rule:
body        AC_HTML_ENTITY_BONANZA  (?:&(?:[A-Za-z0-9]{2,}|#(?:[0-9]{2,5}|x[0-9A-F]{2,4}));\s*){20}
describe    AC_HTML_ENTITY_BONANZA  Long run of HTML-encoded characters
score       AC_HTML_ENTITY_BONANZA  0.001



Early results (not all corpora are in yet) look *very* promising:

https://ruleqa.spamassassin.org/20181109-r1846219-n/__AC_HTML_ENTITY_BONANZA/detail

3% of spam, S/O .958 and almost all spam hits are <5 points.

I think we have a winner. Thanks, Amir (and possibly RW)!


--
 John Hardin KA7OHZhttp://www.impsec.org/~jhardin/
 jhar...@impsec.orgFALaholic #11174 pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
---
  Activist: Someone who gets involved.
  Unregistered Lobbyist: Someone who gets involved
   with something the MSM doesn't approve of. -- WizardPC
---
 2 days until Veterans Day


Re: Bayes underperforming, HTML entities?

2018-11-09 Thread John Hardin

On Thu, 8 Nov 2018, Bill Cole wrote:


On 8 Nov 2018, at 21:55, John Hardin wrote:


On Thu, 8 Nov 2018, Amir Caspi wrote:


On Nov 8, 2018, at 7:41 PM, John Hardin  wrote:


Sure, but I'd also prefer to have a sample to test on before committing. 
I'll see if I can get the pastebin to work (i.e. fix the boundary)


I can send you some new spamples via attachment, privately.


No, the pastebinned ones work unaltered.


The problem with that: they are mangled in a way that prevents HTML 
interpretation of the HTML part. Hence a 'body' rule will match the 
uninterpreted entities. For the real world, i.e. with proper MIME structure, 
I think you need a 'rawbody' rule to match against the uninterpreted 
entities.


Okay, I will confirm that today too. Thanks for pointing it out.

I have confirmed that de-munging the boundaries fixes them to allow proper 
MIME interpretation. In both cases, the boundary line between the 2 MIME 
parts is apparently unchanged, so you can just use it to fix the other 3 
places that it needs to match.


Okay.

--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174     pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
---
  Activist: Someone who gets involved.
  Unregistered Lobbyist: Someone who gets involved
   with something the MSM doesn't approve of. -- WizardPC
---
 2 days until Veterans Day


Re: Bayes underperforming, HTML entities?

2018-11-09 Thread Matus UHLAR - fantomas

On Nov 8, 2018, at 2:30 AM, Matus UHLAR - fantomas  wrote:

Do you use autolearn? There are a few rules to detect ham (score
negatively), many of them based on default whitelists and DNS whitelists,
where many mails come from grey area companies, not necessarily spam, but
training their mail as ham can lower the detection rate of real spams.


On 08.11.18 12:06, Amir Caspi wrote:

autolearn is technically enabled, but every single message in ham (inbox)
has autolearn=no, and the same is true for my spam store.  So, none of my
tokens were autolearned, and all (should have) resulted only from my
manual training.


how many spams and hams did you train then?


I found this number of tokens low, and have increased it.

bayes_expiry_max_db_size  262144



Are you recommending increasing TO this number, or FROM this number?  It
looks like my spam tokens are approaching this number, so I am assuming
you think I should go higher?  Any recommended number?


I have increased to this number, on some servers even to double of that
number.


--
Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
There's a long-standing bug relating to the x86 architecture that
allows you to install Windows.   -- Matthew D. Fuller


Re: Bayes underperforming, HTML entities?

2018-11-09 Thread RW
On Thu, 8 Nov 2018 19:24:47 -0700
Amir Caspi wrote:

> On Nov 8, 2018, at 4:51 PM, RW  wrote:
> > 
> > Unnecessary encoding is fairly common, but long runs of ASCII
> > characters encoded like this seem extreme.  
> 
> Right, that was a question I had asked in my email this morning...
> whether we have a rule to detect long sequences of HTML entities.  


I was really referring to the fact that it's pure ASCII text that's
being encoded rather than long runs per se, so I'm trying:

rawbody   HTML_ENC_ASCII   /(?:&\#(?:(?:\d{1,2}|1[01]\d|12[0-7])|x[0-7][0-9a-f])\s*;\s*){10}/i

but you may well be right that long runs are inherently suspicious, I'm
not very familiar with HTML practices.



> Proposed rule:
> body      AC_HTML_ENTITY_BONANZA  (?:&(?:[A-Za-z0-9]{2,}|#(?:[0-9]{2,5}|x[0-9A-F]{2,4}));\s*){20}
> describe  AC_HTML_ENTITY_BONANZA  Long run of HTML-encoded characters
> score     AC_HTML_ENTITY_BONANZA  0.001


Re: Bayes underperforming, HTML entities?

2018-11-08 Thread Bill Cole

On 8 Nov 2018, at 21:55, John Hardin wrote:


On Thu, 8 Nov 2018, Amir Caspi wrote:


On Nov 8, 2018, at 7:41 PM, John Hardin  wrote:


Sure, but I'd also prefer to have a sample to test on before 
committing. I'll see if I can get the pastebin to work (i.e. fix the 
boundary)


I can send you some new spamples via attachment, privately.


No, the pastebinned ones work unaltered.


The problem with that: they are mangled in a way that prevents HTML 
interpretation of the HTML part. Hence a 'body' rule will match the 
uninterpreted entities. For the real world, i.e. with proper MIME 
structure, I think you need a 'rawbody' rule to match against the 
uninterpreted entities.


I have confirmed that de-munging the boundaries fixes them to allow 
proper MIME interpretation. In both cases, the boundary line between the 
2 MIME parts is apparently unchanged, so you can just use it to fix the 
other 3 places that it needs to match.



--
Bill Cole
b...@scconsult.com or billc...@apache.org
(AKA @grumpybozo and many *@billmail.scconsult.com addresses)
Available For Hire: https://linkedin.com/in/billcole


Re: Bayes underperforming, HTML entities?

2018-11-08 Thread John Hardin

On Thu, 8 Nov 2018, Amir Caspi wrote:


On Nov 8, 2018, at 7:55 PM, John Hardin  wrote:


I left it case-sensitive; is there some reason the entities cannot be coded as 
(e.g.)  ? I kinda doubt it, so it should *probably* be case-insensitive 
to avoid trivial bypass.


I think it should be insensitive, sorry for that oversight on my part.  You 
could just use the second, cleaner regex instead.  That is case-insensitive by 
design.  I think they should perform identically (except for case), and the 
second one is more compact and may scan faster (fewer 'or' tests).


OK, I'll do some comparisons here, I'm instrumented for performance stats. 
I won't change it until tomorrow, though.



--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174     pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
---
  Perfect Security and Absolute Safety are unattainable; beware
  those who would try to sell them to you, regardless of the cost,
  for they are trying to sell you your own slavery.
---
 3 days until Veterans Day


Re: Bayes underperforming, HTML entities?

2018-11-08 Thread Amir Caspi
On Nov 8, 2018, at 7:55 PM, John Hardin  wrote:
> 
> I left it case-sensitive; is there some reason the entities cannot be coded 
> as (e.g.)  ? I kinda doubt it, so it should *probably* be 
> case-insensitive to avoid trivial bypass.

I think it should be insensitive, sorry for that oversight on my part.  You 
could just use the second, cleaner regex instead.  That is case-insensitive by 
design.  I think they should perform identically (except for case), and the 
second one is more compact and may scan faster (fewer 'or' tests).

Cheers.

--- Amir



Re: Bayes underperforming, HTML entities?

2018-11-08 Thread John Hardin

On Thu, 8 Nov 2018, Amir Caspi wrote:


On Nov 8, 2018, at 7:41 PM, John Hardin  wrote:


Sure, but I'd also prefer to have a sample to test on before committing. I'll 
see if I can get the pastebin to work (i.e. fix the boundary)


I can send you some new spamples via attachment, privately.


No, the pastebinned ones work unaltered.

Added to my sandbox with some minor tweaks:

Sending        svn/trunk/rulesrc/sandbox/jhardin/20_misc_testing.cf
Transmitting file data .done
Committing transaction...
Committed revision 1846207.

I left it case-sensitive; is there some reason the entities cannot be 
coded as (e.g.)  ? I kinda doubt it, so it should *probably* be 
case-insensitive to avoid trivial bypass.


We'll see how it does tomorrow morning...


--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174     pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
---
 3 days until Veterans Day


Re: Bayes underperforming, HTML entities?

2018-11-08 Thread Amir Caspi
On Nov 8, 2018, at 7:41 PM, John Hardin  wrote:
> 
> Sure, but I'd also prefer to have a sample to test on before committing. I'll 
> see if I can get the pastebin to work (i.e. fix the boundary)
> 

I can send you some new spamples via attachment, privately.  Unfortunately I 
lost those particular spamples in an IMAP snafu today (connection barfed while 
moving those messages so they got deleted from the original folder and never 
written to the new folder)... but I've got plenty more.

Cheers.

--- Amir




Re: Bayes underperforming, HTML entities?

2018-11-08 Thread John Hardin

On Thu, 8 Nov 2018, Amir Caspi wrote:


On Nov 8, 2018, at 4:51 PM, RW  wrote:


Unnecessary encoding is fairly common, but long runs of ASCII
characters encoded like this seem extreme.


Right, that was a question I had asked in my email this morning... whether we 
have a rule to detect long sequences of HTML entities.  It would seem not.

John, is that something we can test in a sandbox and see how it performs in 
masscheck?


Sure, but I'd also prefer to have a sample to test on before committing. 
I'll see if I can get the pastebin to work (i.e. fix the boundary)



Proposed rule:
body        AC_HTML_ENTITY_BONANZA  (?:&(?:[A-Za-z0-9]{2,}|#(?:[0-9]{2,5}|x[0-9A-F]{2,4}));\s*){20}
describe    AC_HTML_ENTITY_BONANZA  Long run of HTML-encoded characters
score       AC_HTML_ENTITY_BONANZA  0.001

This should catch either decimal or hex encoding, or named entities, and allows 
the characters to be separated by variable-length whitespace (in case they use 
actual whitespace instead of encoded whitespace).

If the regexp above is too complex, we could just match on the entity 
boundaries, restricting to allowable characters inside:

body        AC_HTML_ENTITY_BONANZA  (?:&[A-Za-z0-9#]{2,};\s*){20}

Either should work, I believe.

Cheers.

--- Amir



--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174     pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
---
 3 days until Veterans Day


Re: Bayes underperforming, HTML entities?

2018-11-08 Thread Amir Caspi
On Nov 8, 2018, at 4:51 PM, RW  wrote:
> 
> Unnecessary encoding is fairly common, but long runs of ASCII
> characters encoded like this seem extreme.

Right, that was a question I had asked in my email this morning... whether we 
have a rule to detect long sequences of HTML entities.  It would seem not.

John, is that something we can test in a sandbox and see how it performs in 
masscheck?

Proposed rule:
body        AC_HTML_ENTITY_BONANZA  (?:&(?:[A-Za-z0-9]{2,}|#(?:[0-9]{2,5}|x[0-9A-F]{2,4}));\s*){20}
describe    AC_HTML_ENTITY_BONANZA  Long run of HTML-encoded characters
score       AC_HTML_ENTITY_BONANZA  0.001

This should catch either decimal or hex encoding, or named entities, and allows 
the characters to be separated by variable-length whitespace (in case they use 
actual whitespace instead of encoded whitespace).

If the regexp above is too complex, we could just match on the entity 
boundaries, restricting to allowable characters inside:

body        AC_HTML_ENTITY_BONANZA  (?:&[A-Za-z0-9#]{2,};\s*){20}

Either should work, I believe.
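A standalone sanity check of the first pattern (plain Perl, outside SA):

  use strict;
  use warnings;
  my $re  = qr/(?:&(?:[A-Za-z0-9]{2,}|#(?:[0-9]{2,5}|x[0-9A-F]{2,4}));\s*){20}/;
  my $run = '&#72;' x 20;                     # twenty decimal-encoded 'H's
  print(($run =~ $re) ? "hit\n" : "miss\n");  # prints "hit"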

Cheers.

--- Amir



Re: Bayes underperforming, HTML entities?

2018-11-08 Thread RW
On Thu, 8 Nov 2018 23:30:42 +
RW wrote:

> On Thu, 8 Nov 2018 13:14:13 -0700
> Amir Caspi wrote:
> 
> 
> > If the HTML section is valid, as it appears to be ... then the HTML
> > should be decoded.  And yet, these emails are hitting BAYES_00 or
> > BAYES_05 despite the spammy HTML text.   
> 
> In the two examples there isn't really much in the html text. I think
> they probably get away with it by selling a wide range of innocuous
> sounding products with a terse low-key sales pitch.

I wonder if something could be made of this:



 

Unnecessary encoding is fairly common, but long runs of ASCII
characters encoded like this seem extreme.


Re: Bayes underperforming, HTML entities?

2018-11-08 Thread RW
On Thu, 8 Nov 2018 13:14:13 -0700
Amir Caspi wrote:


> If the HTML section is valid, as it appears to be ... then the HTML
> should be decoded.  And yet, these emails are hitting BAYES_00 or
> BAYES_05 despite the spammy HTML text. 

In the two examples there isn't really much in the html text. I think
they probably get away with it by selling a wide range of innocuous
sounding products with a terse low-key sales pitch.


Re: Bayes underperforming, HTML entities?

2018-11-08 Thread Amir Caspi
On Nov 8, 2018, at 2:19 PM, Bill Cole  
wrote:
> 
> [Resending because it looks like my first send went into a black hole...]

All SA messages appear to be coming with significant delays today... not sure 
why.  I got RW's first message, sent at 8am today, only about an hour ago, 
AFTER the message in which they said they'd already answered the question.

>> Do I need to completely trash and rebuild my DB, or am I missing something 
>> obvious?
> 
> No and no.
> 
> Although it is perhaps helpful to recognize that Bayes is inherently 
> imperfect and will always be wrong about some messages.

I definitely recognize that Bayes is imperfect.  My worry is that very few of 
my spams in recent days/weeks seem to be getting BAYES_99*, while some of them 
are getting BAYES_00/05.  I used to get lots of BAYES_99s, but not so much now, 
despite training.  Hence my concern that something is borked.

> Assuming that you did that breakage yourself, intentionally: Stop doing that. 
> It is pointless and hampers any attempt to assist you. The only things that 
> could ever be private about spam are the target address and internally-added 
> headers.

I did do it intentionally, and thanks for the warning -- I will stop doing that 
in the future and sanitize only the target address and internal headers.

Thanks!

--- Amir



Re: Bayes underperforming, HTML entities?

2018-11-08 Thread Bill Cole
[Resending because it looks like my first send went into a black 
hole...]



On 7 Nov 2018, at 14:33, Amir Caspi wrote:


Hi all,

	In the past couple of weeks I've gotten a number of clearly-spam 
messages that slipped past SA, and the only reason was because they 
were getting low Bayes scores (BAYES_50 or even down to BAYES_00 or 
BAYES_05).  I do my Bayes training manually on both ham and spam so 
there should not be any mis-categorizations... and things worked fine 
until a few weeks ago, so I don't know what's going on now.


Here's the magic dump:

-bash-3.2$ sa-learn --dump magic
0.000  0          3  0  non-token data: bayes db version
0.000  0     253112  0  non-token data: nspam
0.000  0     106767  0  non-token data: nham
0.000  0     150434  0  non-token data: ntokens
0.000  0 1536087614  0  non-token data: oldest atime
0.000  0 1541617125  0  non-token data: newest atime
0.000  0 1541614751  0  non-token data: last journal sync atime
0.000  0 1541614749  0  non-token data: last expiry atime
0.000  0    5529600  0  non-token data: last expire atime delta
0.000  0       1173  0  non-token data: last expire reduction count



I don't see any obvious problem but I'm not an expert at interpreting 
these...


The only useful info is that the number of spams and hams scanned (nham 
and nspam) is well above the usage threshold and the fact that the 
various timestamps (other than 'oldest atime') are reasonably recent. If 
you happen not to live in Unix epoch time, the conversion is not hard:


   # date -j -f %s 1541617125
   Wed Nov  7 13:58:45 EST 2018
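
That's the BSD date syntax; with GNU coreutils the equivalent should be:

   # date -d @1541617125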


Do I need to completely trash and rebuild my DB, or am I missing 
something obvious?


No and no.

Although it is perhaps helpful to recognize that Bayes is inherently 
imperfect and will always be wrong about some messages.


In many cases, it would appear that these spams have either very 
little (real) text (besides the usual attempt at Bayes poisoning) 
and/or are using HTML-entity encoding to try to bypass Bayes.  Here 
are a couple of spamples:


https://pastebin.com/peiXZivJ
https://pastebin.com/3h3r7r7j


Those both have broken MIME structure, so SA can't treat the HTML part 
as HTML. No MUA would render and display them correctly.


Assuming that you did that breakage yourself, intentionally: Stop doing 
that. It is pointless and hampers any attempt to assist you. The only 
things that could ever be private about spam are the target address and 
internally-added headers.


Does SA decode HTML entities as part of normalize_charset?  If not ... 
can this be added?


I'm not entirely certain, but the documentation of bayes_token_sources 
in Mail::SpamAssassin::Conf implies that HTML is rendered to text to the 
point where SA can tell whether it is visible, which makes me suspect 
that the entities get decoded. But that IS just a guess: I haven't 
traced the code.


Empirically, I had SA learn a message with regular text in an HTML part 
encoded as entities and then scanned a message with the same text as 
text, and I got a 1.000 Bayes score (BAYES_999) for the second one. YMMV
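
Roughly, the test was just this, on top of the already-trained DB
(filenames made up):

   sa-learn --spam entities-as-html.eml     # same text, &#NNN;-encoded
   spamassassin -t < plain-text-twin.eml | grep -i bayes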


--
Bill Cole
b...@scconsult.com or billc...@apache.org
(AKA @grumpybozo and many *@billmail.scconsult.com addresses)
Available For Hire: https://linkedin.com/in/billcole


Re: Bayes underperforming, HTML entities?

2018-11-08 Thread RW
On Wed, 7 Nov 2018 12:33:35 -0700
Amir Caspi wrote:

> In many cases, it would appear that these spams have either very
> little (real) text (besides the usual attempt at Bayes poisoning)
> and/or are using HTML-entity encoding to try to bypass Bayes.  Here
> are a couple of spamples:
> 
> https://pastebin.com/peiXZivJ
> https://pastebin.com/3h3r7r7j
> 
> Does SA decode HTML entities as part of normalize_charset?  If
> not ... can this be added?

Ordinarily yes, but these don't actually have a separate html part
because of the broken mime - the separators don't match. Presumably the
raw HTML is being treated as plain text.
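
Schematically (boundary strings made up), the failure looks like:

   Content-Type: multipart/alternative; boundary="sep_1234"

   --sep_1234
   Content-Type: text/plain

   (filler text)

   --sep_5678
   Content-Type: text/html

   &#83;&#112;&#97;&#109;&#109;&#121;...

Since "sep_5678" was never declared, a strict MIME parser never closes
the text/plain part, and everything below the bogus separator is just
more plain text.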



Re: Bayes underperforming, HTML entities?

2018-11-08 Thread Amir Caspi
On Nov 8, 2018, at 12:20 PM, RW  wrote:
> 
> these emails don't contain a valid HTML mime section. They contain a bogus 
> html section that doesn't start with the separator defined in the 
> top-level Content-Type header.

Sorry, that is totally my fault.  In the spample, I was trying to sanitize any 
possible identifying information and I ended up over-sanitizing.  I sanitized 
the separator string for text/plain and at the end, but I missed the one for 
text/html.

So, bottom line -- the HTML mime section is actually valid in the original 
email.  The spample is invalid because of my overzealousness/paranoia/idiocy.

If the HTML section is valid, as it appears to be ... then the HTML should be 
decoded.  And yet, these emails are hitting BAYES_00 or BAYES_05 despite the 
spammy HTML text.  So, does this mean my Bayes DB is borked?  Or does it mean 
something else?

In looking through my recent spams, almost all of them are hitting either 
BAYES_50 or lower... almost none are hitting BAYES_99 (this includes the ones 
identified as spam for other scoring reasons).  This is despite the training.  
So I'm thinking maybe my Bayes DB is not working properly... unless somehow the 
Bayes poison is actually working.  Though I doubt the latter since discussions 
on here have asserted many times that "poison" doesn't work.  But, I don't know 
why the DB would stop scoring properly all of a sudden, after working fine for 
years...

Thanks.

--- Amir



Re: Bayes underperforming, HTML entities?

2018-11-08 Thread Amir Caspi
On Nov 8, 2018, at 12:20 PM, RW  wrote:
> 
> I've already explained this.

Sorry, I don't recall this discussion, my apologies.

> Do these actually display on any email client?

Yes.  For example, for the first spample (https://pastebin.com/peiXZivJ), Apple 
Mail (OS X) displays the decoded HTML entities, as does Mail on the iPhone.  
RoundCube and SquirrelMail web clients also display it.  I didn't try 
Thunderbird or other clients, but those four definitely display the 
HTML.

Cheers.

--- Amir



Re: Bayes underperforming, HTML entities?

2018-11-08 Thread RW
On Thu, 8 Nov 2018 10:09:21 -0700
Amir Caspi wrote:

 
> (2) Does normalize_charset decode HTML entities?  If not, is this
> something that can be included?  Do I need to file a bugzilla?

I've already explained this. Ordinarily html is decoded (whether
normalize_charset is set or not), but these emails don't contain a
valid HTML mime section. They contain a bogus html section that doesn't
start with the separator defined in the top-level Content-Type header.

Do these actually display on any email client?


Re: Bayes underperforming, HTML entities?

2018-11-08 Thread Amir Caspi
On Nov 8, 2018, at 2:30 AM, Matus UHLAR - fantomas  wrote:
> 
> Do you use autolearn? There are a few rules to detect ham (score
> negatively), many of them based on default whitelists and DNS whitelists,
> where many mails come from grey area companies, not necessarily spam, but
> training their mail as ham can lower the detection rate of real spams.

autolearn is technically enabled, but every single message in ham (inbox) has 
autolearn=no, and the same is true for my spam store.  So, none of my tokens 
were autolearned, and all (should have) resulted only from my manual training.
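
(If I ever want to rule autolearn out completely, my understanding is
that it's one line in local.cf:

   bayes_auto_learn 0
)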

> I found this number of tokens low, and have increased it.
> 
> bayes_expiry_max_db_size 262144

Are you recommending increasing TO this number, or FROM this number?  It looks 
like my spam tokens are approaching this number, so I am assuming you think I 
should go higher?  Any recommended number?

Thanks!

-- Amir




Re: Bayes underperforming, HTML entities?

2018-11-08 Thread Amir Caspi
> do you regularly perform sa-update on that box?

Yes, it is run every night.  However, I am still running 3.4.1, so if the sha1 
access has already been disabled, my updates have likely been failing recently. 
I'm working on updating to 3.4.2 but this is an ancient box and I haven't yet 
had the bandwidth to work on the SRPM.
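
The nightly job is just the usual idiom, something like this (a sketch;
the exact restart command depends on the setup):

   # sa-update exits 0 only when new rules were actually installed
   10 3 * * * sa-update && /sbin/service spamassassin restart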

> Furthermore I saw two hits on KAM rules [1]

I have KAM and it is updated once per day through wget.  Not sure why it didn't 
hit, unless the KAM rules were updated in the day(s) since the spample was 
received.  I don't have IVM, unfortunately (can't justify the cost for my usage).

Cheers.

--- Amir



Re: Bayes underperforming, HTML entities?

2018-11-08 Thread Matus UHLAR - fantomas

On 07.11.18 12:33, Amir Caspi wrote:

In the past couple of weeks I've gotten a number of clearly-spam messages
that slipped past SA, and the only reason was because they were getting
low Bayes scores (BAYES_50 or even down to BAYES_00 or BAYES_05).  I do my
Bayes training manually on both ham and spam so there should not be any
mis-categorizations...  and things worked fine until a few weeks ago, so I
don't know what's going on now.


I've had a similar experience after running SA in some places.

Do you use autolearn? There are a few rules to detect ham (score
negatively), many of them based on default whitelists and DNS whitelists,
where many mails come from grey area companies, not necessarily spam, but
training their mail as ham can lower the detection rate of real spams.


Here's the magic dump:

-bash-3.2$ sa-learn --dump magic
0.000  0  3  0  non-token data: bayes db version
0.000  0 253112  0  non-token data: nspam
0.000  0 106767  0  non-token data: nham
0.000  0 150434  0  non-token data: ntokens


I found this number of tokens low, and have increased it.

bayes_expiry_max_db_size 262144

could help in the long run.
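
A sketch, assuming your site config lives in the usual local.cf:

   # raise the token cap above the 150000 default; leave auto-expiry on
   bayes_expiry_max_db_size  262144
   bayes_auto_expire         1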


0.000  0 1536087614  0  non-token data: oldest atime
0.000  0 1541617125  0  non-token data: newest atime
0.000  0 1541614751  0  non-token data: last journal sync atime
0.000  0 1541614749  0  non-token data: last expiry atime
0.000  0    5529600  0  non-token data: last expire atime delta
0.000  0       1173  0  non-token data: last expire reduction count


--
Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
Support bacteria - they're the only culture some people have. 


Re: Bayes underperforming, HTML entities?

2018-11-08 Thread Tobi
Hi

I checked the first message on my SA and found multiple hits on the
__SCC_SHORT_WORDS rule, which resulted in hits on these metas:

*  1.0 SCC_10_SHORT_WORD_LINES 10 lines with many short words
*  1.0 SCC_5_SHORT_WORD_LINES 5 lines with many short words
*  1.0 SCC_20_SHORT_WORD_LINES 20 lines with many short words

do you regularly perform sa-update on that box?
My bayes hit on BAYES_50

Furthermore I saw two hits on KAM rules [1]

I also saw several DNSBL lookup hits (but it's possible that these
listings are younger than the msg you received). On my SA the lookups
from IVM [2] (not free) hit very nicely (on both IPs and URIs), as did
razor2.


[1] https://www.pccc.com/downloads/SpamAssassin/contrib/KAM.cf
[2] https://www.invaluement.com/

Am 07.11.18 um 20:33 schrieb Amir Caspi:
> Hi all,
> 
>   In the past couple of weeks I've gotten a number of clearly-spam 
> messages that slipped past SA, and the only reason was because they were 
> getting low Bayes scores (BAYES_50 or even down to BAYES_00 or BAYES_05).  I 
> do my Bayes training manually on both ham and spam so there should not be any 
> mis-categorizations... and things worked fine until a few weeks ago, so I 
> don't know what's going on now.
> 
> Here's the magic dump:
> 
> -bash-3.2$ sa-learn --dump magic
> 0.000  0  3  0  non-token data: bayes db version
> 0.000  0 253112  0  non-token data: nspam
> 0.000  0 106767  0  non-token data: nham
> 0.000  0 150434  0  non-token data: ntokens
> 0.000  0 1536087614  0  non-token data: oldest atime
> 0.000  0 1541617125  0  non-token data: newest atime
> 0.000  0 1541614751  0  non-token data: last journal sync atime
> 0.000  0 1541614749  0  non-token data: last expiry atime
> 0.000  0    5529600  0  non-token data: last expire atime delta
> 0.000  0       1173  0  non-token data: last expire reduction count
> 
> 
> I don't see any obvious problem but I'm not an expert at interpreting these...
> 
> Do I need to completely trash and rebuild my DB, or am I missing something 
> obvious?
> 
> In many cases, it would appear that these spams have either very little 
> (real) text (besides the usual attempt at Bayes poisoning) and/or are using 
> HTML-entity encoding to try to bypass Bayes.  Here are a couple of spamples:
> 
> https://pastebin.com/peiXZivJ
> https://pastebin.com/3h3r7r7j
> 
> Does SA decode HTML entities as part of normalize_charset?  If not ... can 
> this be added?
> 
> I'm using SA 3.4.1 (working on upgrading to 3.4.2 but have not had time to 
> build it yet).
> 
> Thanks!
> 
> --- Amir
> 


Re: Bayes underperforming, HTML entities?

2018-11-08 Thread Amir Caspi
On Nov 7, 2018, at 12:33 PM, Amir Caspi  wrote:
> 
> In many cases, it would appear that these spams have either very little 
> (real) text (besides the usual attempt at Bayes poisoning) and/or are using 
> HTML-entity encoding to try to bypass Bayes.  Here are a couple of spamples:
> 
> https://pastebin.com/peiXZivJ
> https://pastebin.com/3h3r7r7j
> 
> Does SA decode HTML entities as part of normalize_charset?  If not ... can 
> this be added?

I'm getting a bunch more of these this morning -- all of them are using HTML 
entities to encode the spammy text.  There is a bunch of "Bayes poison" 
cleartext but all the spamminess is contained in HTML entities.

(1) Do we have any rules to detect long sequences of HTML entities?  That by 
itself seems spammy, but not definitive.  (A rough manual check is sketched 
below.)

(2) Does normalize_charset decode HTML entities?  If not, is this something 
that can be included?  Do I need to file a bugzilla?
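
The rough manual check I mean for (1) is just counting numeric entities
from the shell (filename made up):

   grep -oE '&#[0-9]{2,5};' spample.eml | wc -l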

Thanks.

--- Amir