Re: Bayes underperforming, HTML entities?

2018-12-07 Thread John Hardin
On Fri, 7 Dec 2018, Amir Caspi wrote: On Dec 6, 2018, at 12:14 PM, John Hardin wrote: Runaway backtracking that was killing masscheck for several people. Hrm, that is disconcerting. I'm not sure where any backtracking might be occurring... This sort of thing is risky, especially in a

Re: Bayes underperforming, HTML entities?

2018-12-07 Thread Amir Caspi
On Dec 6, 2018, at 12:14 PM, John Hardin wrote: > > Runaway backtracking that was killing masscheck for several people. Hrm, that is disconcerting. I'm not sure where any backtracking might be occurring... Can anyone help improve this suggested rule? rawbody AC_HTML_ENTITY_BONANZA_NEW

Re: Bayes underperforming, HTML entities?

2018-12-06 Thread John Hardin
On Tue, 4 Dec 2018, Amir Caspi wrote: On Dec 1, 2018, at 10:31 AM, John Hardin wrote: On Thu, 29 Nov 2018, Amir Caspi wrote: A) Could you sandbox the proposed rule change (AC_HTML_ENTITY_BONANZA_NEW) and see how it performs, including possible FPs? Done. Any preliminary results?

Re: Bayes underperforming, HTML entities?

2018-12-04 Thread John Hardin
On Tue, 4 Dec 2018, Amir Caspi wrote: On Dec 1, 2018, at 10:31 AM, John Hardin wrote: On Thu, 29 Nov 2018, Amir Caspi wrote: A) Could you sandbox the proposed rule change (AC_HTML_ENTITY_BONANZA_NEW) and see how it performs, including possible FPs? Done. Any preliminary results?

Re: Bayes underperforming, HTML entities?

2018-12-04 Thread Amir Caspi
On Dec 1, 2018, at 10:31 AM, John Hardin wrote: > >> On Thu, 29 Nov 2018, Amir Caspi wrote: >> >>> A) Could you sandbox the proposed rule change (AC_HTML_ENTITY_BONANZA_NEW) >>> and see how it performs, including possible FPs? > > Done. Any preliminary results? Looks like we have a couple

Re: Bayes underperforming, HTML entities?

2018-12-01 Thread John Hardin
On Thu, 29 Nov 2018, John Hardin wrote: On Thu, 29 Nov 2018, Amir Caspi wrote: On Nov 29, 2018, at 3:27 PM, John Hardin wrote: I'll see whether those can be incorporated into the existing UNICODE_OBFU_ZW rule (which of course will no longer actually be UNICODE :) ) Great. Maybe rename

Re: Bayes underperforming, HTML entities?

2018-11-30 Thread RW
On Fri, 30 Nov 2018 15:49:31 -0700 Amir Caspi wrote: > > It makes it harder to write rules detecting these tricks, but it may > > happen eventually. As far as Bayes is concerned, it would be a > > shame to lose the information. > > I'm not sure I see how Bayes can take decent advantage out of

Re: Bayes underperforming, HTML entities?

2018-11-30 Thread Bill Cole
On 30 Nov 2018, at 17:49, Amir Caspi wrote: On Nov 30, 2018, at 7:00 AM, Bill Cole wrote: Since HTML is already getting rendered to text, then perhaps the conversion code should strip (literally, just delete) any zero-width characters during this conversion? That should make normal body
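
The idea quoted above (delete zero-width characters while HTML is rendered to text) can be sketched roughly as follows. This is Python for illustration only, not SpamAssassin's actual Perl rendering code; the function names and the particular set of code points are assumptions, since the thread does not list which characters would be stripped.

```python
import html

# Assumed set of zero-width code points (not taken from the thread).
# Mapping a code point to None makes str.translate() delete it.
ZERO_WIDTH = {
    0x200B: None,  # ZERO WIDTH SPACE
    0x200C: None,  # ZERO WIDTH NON-JOINER
    0x200D: None,  # ZERO WIDTH JOINER
    0x2060: None,  # WORD JOINER
    0xFEFF: None,  # ZERO WIDTH NO-BREAK SPACE (BOM)
}

def strip_zero_width(rendered_text: str) -> str:
    """Delete zero-width characters from already-decoded text."""
    return rendered_text.translate(ZERO_WIDTH)

def render_fragment(fragment: str) -> str:
    """Decode HTML entities, then drop zero-width characters.

    Hypothetical helper: a real renderer would also remove tags.
    """
    return strip_zero_width(html.unescape(fragment))
```

With something like this in the rendering path, a word split up by zero-width joiners would reach body rules and the Bayes tokenizer as one ordinary word, which is the effect the proposal is after. RW's objection elsewhere in the thread (Bayes loses the information that the trick was used) applies to this sketch as well.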

Re: Bayes underperforming, HTML entities?

2018-11-30 Thread Amir Caspi
On Nov 30, 2018, at 7:00 AM, Bill Cole wrote: > >> Since HTML is already getting rendered to text, then perhaps the conversion >> code should strip (literally, just delete) any zero-width characters during >> this conversion? That should make normal body rules, and Bayes, function >>

Re: Bayes underperforming, HTML entities?

2018-11-30 Thread RW
On Fri, 30 Nov 2018 06:29:31 -0700 Amir Caspi wrote: > On Nov 30, 2018, at 6:09 AM, RW wrote: > > > > The most substantial problem here is that these invisible characters > > make it very hard to write ordinary body rules. > > Thanks for the clarification on my confusion. Since HTML is

Re: Bayes underperforming, HTML entities?

2018-11-30 Thread Bill Cole
On 30 Nov 2018, at 8:29, Amir Caspi wrote: On Nov 30, 2018, at 6:09 AM, RW wrote: The most substantial problem here is that these invisible characters make it very hard to write ordinary body rules. Thanks for the clarification on my confusion. Since HTML is already getting rendered to

Re: Bayes underperforming, HTML entities?

2018-11-30 Thread Amir Caspi
On Nov 30, 2018, at 6:09 AM, RW wrote: > > The most substantial problem here is that these invisible characters > make it very hard to write ordinary body rules. Thanks for the clarification on my confusion. Since HTML is already getting rendered to text, then perhaps the conversion code

Re: Bayes underperforming, HTML entities?

2018-11-30 Thread RW
On Thu, 29 Nov 2018 22:33:12 -0700 Amir Caspi wrote: > On Nov 29, 2018, at 10:11 PM, Bill Cole > wrote: > > > > I have no issue with adding a new rule type to act on the output of > > a partial well-defined HTML parsing, something in between 'rawbody' > > and 'body' types, but overloading

Re: Bayes underperforming, HTML entities?

2018-11-29 Thread Amir Caspi
On Nov 29, 2018, at 10:11 PM, Bill Cole wrote: > > I have no issue with adding a new rule type to act on the output of a partial > well-defined HTML parsing, something in between 'rawbody' and 'body' types, > but overloading normalize_charset with that and so affecting every existing > rule

Re: Bayes underperforming, HTML entities?

2018-11-29 Thread Bill Cole
On 29 Nov 2018, at 17:32, Amir Caspi wrote: B) Do you think that normalize_charsets could evolve to handle HTML entities? That would be a mess. The normalize_charset option acts on the decoded text of text/* MIME parts before that text is parsed into meaningful tokens. I have no issue

Re: Bayes underperforming, HTML entities?

2018-11-29 Thread John Hardin
On Thu, 29 Nov 2018, Amir Caspi wrote: On Nov 29, 2018, at 3:27 PM, John Hardin wrote: I'll see whether those can be incorporated into the existing UNICODE_OBFU_ZW rule (which of course will no longer actually be UNICODE :) ) Great. Maybe rename the rule. ;-) What are your thoughts on

Re: Bayes underperforming, HTML entities?

2018-11-29 Thread Amir Caspi
On Nov 29, 2018, at 3:27 PM, John Hardin wrote: > > I'll see whether those can be incorporated into the existing UNICODE_OBFU_ZW > rule (which of course will no longer actually be UNICODE :) ) Great. Maybe rename the rule. ;-) What are your thoughts on item #2? Specifically: A) Could you

Re: Bayes underperforming, HTML entities?

2018-11-29 Thread John Hardin
On Thu, 29 Nov 2018, Amir Caspi wrote: 1) A new variant is showing up lately, with liberal use of zero-width spaces/joiners. See spample: https://pastebin.com/zBVWaiew This uses the (zero-width joiner) HTML entity, interspersed within words. I don't see any

Re: Bayes underperforming, HTML entities?

2018-11-29 Thread Amir Caspi
On Nov 10, 2018, at 11:30 AM, John Hardin wrote: > > Initial results (again, all corpora aren't in yet)... > > The rawbody rules perform much better (unsurprising), and the ASCII-only one > has a better raw S/O: > >

Re: Bayes underperforming, HTML entities?

2018-11-15 Thread John Hardin
On Thu, 15 Nov 2018, Amir Caspi wrote: On Nov 15, 2018, at 2:36 PM, John Hardin wrote: It doesn't seem to have a very high score just yet... I'm still getting FNs with the rule hitting (due to those messages hitting BAYES_00/05). Manually train those messages as spam and that should

Re: Bayes underperforming, HTML entities?

2018-11-15 Thread John Hardin
On Thu, 15 Nov 2018, Amir Caspi wrote: On Nov 15, 2018, at 2:36 PM, John Hardin wrote: That and its resistance to FP avoidance. Despite the generality, I don't see a significant FP risk on the general unicode version. I don't see ANY legitimate reason why an email would hard-encode long

Re: Bayes underperforming, HTML entities?

2018-11-15 Thread Amir Caspi
On Nov 15, 2018, at 2:36 PM, John Hardin wrote: > >> It doesn't seem to have a very high score just yet... I'm still getting FNs >> with the rule hitting (due to those messages hitting BAYES_00/05). > > Manually train those messages as spam and that should repair itself... Actually... right

Re: Bayes underperforming, HTML entities?

2018-11-15 Thread Amir Caspi
On Nov 15, 2018, at 2:36 PM, John Hardin wrote: > > That and its resistance to FP avoidance. Despite the generality, I don't see a significant FP risk on the general unicode version. I don't see ANY legitimate reason why an email would hard-encode long sequences of human-readable text, in

Re: Bayes underperforming, HTML entities?

2018-11-15 Thread John Hardin
On Thu, 15 Nov 2018, Amir Caspi wrote: On Nov 10, 2018, at 11:30 AM, John Hardin wrote: The rawbody rules perform much better (unsurprising), and the ASCII-only one has a better raw S/O: It looks like HTML_ENTITY_ASCII has been rolled out -- did you decide against the more general

Re: Bayes underperforming, HTML entities?

2018-11-15 Thread Amir Caspi
On Nov 10, 2018, at 11:30 AM, John Hardin wrote: > > The rawbody rules perform much better (unsurprising), and the ASCII-only one > has a better raw S/O: It looks like HTML_ENTITY_ASCII has been rolled out -- did you decide against the more general unicode due to S/O score? I predict we will

Re: Bayes underperforming, HTML entities?

2018-11-10 Thread John Hardin
On Fri, 9 Nov 2018, John Hardin wrote: On Fri, 9 Nov 2018, John Hardin wrote: On Fri, 9 Nov 2018, Amir Caspi wrote: I'd be interested to know if there's a performance difference between my two proposed rules. I suspect the second should run (slightly) faster. It looks that way - only

Re: Bayes underperforming, HTML entities?

2018-11-09 Thread John Hardin
On Fri, 9 Nov 2018, John Hardin wrote: On Fri, 9 Nov 2018, Amir Caspi wrote: I'd be interested to know if there's a performance difference between my two proposed rules. I suspect the second should run (slightly) faster. It looks that way - only .0001s difference on *some* messages. Re

Re: Bayes underperforming, HTML entities?

2018-11-09 Thread RW
On Fri, 9 Nov 2018 15:34:47 -0500 Kris Deugau wrote: > Amir Caspi wrote: > > On Nov 9, 2018, at 8:10 AM, Matus UHLAR - fantomas > > wrote: > >> > >> how many spams and hams did you train then? > > > > As of right now: > > 0.000 0 258427 0 non-token data: nspam > >

Re: Bayes underperforming, HTML entities?

2018-11-09 Thread John Hardin
On Fri, 9 Nov 2018, Amir Caspi wrote: On Nov 9, 2018, at 8:49 AM, John Hardin wrote: rawbody HTML_ENC_ASCII /(?:&\#(?:(?:\d{1,2}|1[01]\d|12[0-7])|x[0-7][0-9a-f])\s*;\s*){10}/i I'll add that too so that we can compare the results. Per my reply a few minutes ago, I think this will be
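
For anyone who wants to try the HTML_ENC_ASCII pattern quoted above outside SpamAssassin, here is a minimal sketch in Python. The pattern body is copied from the rule; the function names and sample strings are made up for illustration, and Python's `re` engine is not identical to Perl's, so treat this as a rough check rather than a substitute for masscheck.

```python
import re

# Pattern body copied from the HTML_ENC_ASCII rule quoted above.
# The rule writes "\#" because "#" starts a comment in SpamAssassin
# config files; a plain "#" is enough inside a Python raw string.
HTML_ENC_ASCII = re.compile(
    r"(?:&#(?:(?:\d{1,2}|1[01]\d|12[0-7])|x[0-7][0-9a-f])\s*;\s*){10}",
    re.IGNORECASE,
)

def has_encoded_ascii_run(raw_body: str) -> bool:
    """True if raw_body contains at least 10 consecutive numeric
    entities whose values fall in the ASCII range (0-127)."""
    return HTML_ENC_ASCII.search(raw_body) is not None

def encode_decimal(text: str) -> str:
    """Hypothetical helper: encode every character as &#NNN;."""
    return "".join(f"&#{ord(ch)};" for ch in text)

def encode_hex(text: str) -> str:
    """Hypothetical helper: encode every character as &#xHH;."""
    return "".join(f"&#x{ord(ch):02x};" for ch in text)
```

Calling `has_encoded_ascii_run(encode_decimal("Hello World!"))` should report a hit, while a run of non-ASCII entities such as `&#8205;` should not, which is the restriction Amir objects to in his reply.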

Re: Bayes underperforming, HTML entities?

2018-11-09 Thread Kris Deugau
Amir Caspi wrote: On Nov 9, 2018, at 8:10 AM, Matus UHLAR - fantomas wrote: how many spams and hams did you train then? As of right now: 0.000 0 258427 0 non-token data: nspam 0.000 0 106813 0 non-token data: nham 0.000 0 438310

Re: Bayes underperforming, HTML entities?

2018-11-09 Thread Amir Caspi
On Nov 9, 2018, at 8:49 AM, John Hardin wrote: > >> rawbody HTML_ENC_ASCII >> /(?:&\#(?:(?:\d{1,2}|1[01]\d|12[0-7])|x[0-7][0-9a-f])\s*;\s*){10}/i > > I'll add that too so that we can compare the results. Per my reply a few minutes ago, I think this will be too restrictive. While the

Re: Bayes underperforming, HTML entities?

2018-11-09 Thread Amir Caspi
On Nov 9, 2018, at 8:10 AM, Matus UHLAR - fantomas wrote: > > how many spams and hams did you train then? As of right now: 0.000 0 258427 0 non-token data: nspam 0.000 0 106813 0 non-token data: nham 0.000 0 438310 0 non-token
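
As a quick way to read the counts quoted above, the following Python sketch parses lines of that shape and computes the spam-to-ham ratio. It assumes the five-column layout shown in the message (the count is the third field); the function name and regular expression are illustrative, not part of any SpamAssassin tool.

```python
import re

# Matches lines such as "0.000 0 258427 0 non-token data: nspam".
# Assumption: the value of interest is always the third column.
MAGIC_LINE = re.compile(
    r"^\s*[\d.]+\s+\d+\s+(\d+)\s+\d+\s+non-token data:\s+(.+?)\s*$"
)

def parse_magic(dump: str) -> dict:
    """Return {name: value} for every 'non-token data' line in dump."""
    counts = {}
    for line in dump.splitlines():
        match = MAGIC_LINE.match(line)
        if match:
            counts[match.group(2)] = int(match.group(1))
    return counts

# The two complete lines from the message above.
sample = (
    "0.000 0 258427 0 non-token data: nspam\n"
    "0.000 0 106813 0 non-token data: nham\n"
)
counts = parse_magic(sample)
spam_to_ham = counts["nspam"] / counts["nham"]
```

On these figures the database has seen a bit more than twice as many spam messages as ham messages.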

Re: Bayes underperforming, HTML entities?

2018-11-09 Thread Amir Caspi
On Nov 9, 2018, at 7:41 AM, RW wrote: > > I was really referring to the fact that it's pure ASCII text that's > being encoded rather than long runs per se That is true for the current batch of messages, but as we've seen, spammers love to use unicode obfuscation to try to foil Bayes and other

Re: Bayes underperforming, HTML entities?

2018-11-09 Thread John Hardin
On Fri, 9 Nov 2018, RW wrote: On Thu, 8 Nov 2018 19:24:47 -0700 Amir Caspi wrote: On Nov 8, 2018, at 4:51 PM, RW wrote: Unnecessary encoding is fairly common, but a long run of ASCII characters encoded like this seems extreme. Right, that was a question I had asked in my email this

Re: Bayes underperforming, HTML entities?

2018-11-09 Thread John Hardin
On Thu, 8 Nov 2018, Bill Cole wrote: On 8 Nov 2018, at 21:55, John Hardin wrote: On Thu, 8 Nov 2018, Amir Caspi wrote: On Nov 8, 2018, at 7:41 PM, John Hardin wrote: Sure, but I'd also prefer to have a sample to test on before committing. I'll see if I can get the pastebin to work (i.e.

Re: Bayes underperforming, HTML entities?

2018-11-09 Thread Matus UHLAR - fantomas
On Nov 8, 2018, at 2:30 AM, Matus UHLAR - fantomas wrote: Do you use autolearn? There are a few rules to detect ham (score negatively), many of them based on default whitelists and DNS whitelists, where many mails come from grey area companies, not necessarily spam, but training their mail as

Re: Bayes underperforming, HTML entities?

2018-11-09 Thread RW
On Thu, 8 Nov 2018 19:24:47 -0700 Amir Caspi wrote: > On Nov 8, 2018, at 4:51 PM, RW wrote: > > > > Unnecessary encoding is fairly common, but a long run of ASCII > > characters encoded like this seems extreme. > > Right, that was a question I had asked in my email this morning... > whether

Re: Bayes underperforming, HTML entities?

2018-11-08 Thread Bill Cole
On 8 Nov 2018, at 21:55, John Hardin wrote: On Thu, 8 Nov 2018, Amir Caspi wrote: On Nov 8, 2018, at 7:41 PM, John Hardin wrote: Sure, but I'd also prefer to have a sample to test on before committing. I'll see if I can get the pastebin to work (i.e. fix the boundary) I can send you

Re: Bayes underperforming, HTML entities?

2018-11-08 Thread John Hardin
On Thu, 8 Nov 2018, Amir Caspi wrote: On Nov 8, 2018, at 7:55 PM, John Hardin wrote: I left it case-sensitive; is there some reason the entities cannot be coded as (e.g.) ? I kinda doubt it, so it should *probably* be case-insensitive to avoid trivial bypass. I think it should be
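
To illustrate the case-sensitivity concern above: HTML numeric character references accept either `x` or `X` as the hex marker, and hex digits in either case, so a case-sensitive rule can be sidestepped by changing case. The pattern below is a deliberately simplified, hypothetical stand-in (three hex entities in a row), not the actual rule under discussion.

```python
import re

# Simplified, hypothetical pattern: three ASCII-range hex entities in a row.
PATTERN = r"(?:&#x[0-7][0-9a-f];){3}"

case_sensitive = re.compile(PATTERN)
case_insensitive = re.compile(PATTERN, re.IGNORECASE)

lower_sample = "&#x4a;&#x4b;&#x4c;"  # "JKL" with lowercase marker and digits
upper_sample = "&#X4A;&#X4B;&#X4C;"  # the same text with uppercase marker and digits

# Whether each variant is caught by each compiled pattern.
hits = {
    ("sensitive", "lower"): bool(case_sensitive.search(lower_sample)),
    ("sensitive", "upper"): bool(case_sensitive.search(upper_sample)),
    ("insensitive", "lower"): bool(case_insensitive.search(lower_sample)),
    ("insensitive", "upper"): bool(case_insensitive.search(upper_sample)),
}
```

Only the case-insensitive pattern catches both variants, which is the "trivial bypass" being described; in a SpamAssassin rule the equivalent is the trailing `/i` modifier.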

Re: Bayes underperforming, HTML entities?

2018-11-08 Thread Amir Caspi
On Nov 8, 2018, at 7:55 PM, John Hardin wrote: > > I left it case-sensitive; is there some reason the entities cannot be coded > as (e.g.) ? I kinda doubt it, so it should *probably* be > case-insensitive to avoid trivial bypass. I think it should be insensitive, sorry for that oversight on

Re: Bayes underperforming, HTML entities?

2018-11-08 Thread John Hardin
On Thu, 8 Nov 2018, Amir Caspi wrote: On Nov 8, 2018, at 7:41 PM, John Hardin wrote: Sure, but I'd also prefer to have a sample to test on before committing. I'll see if I can get the pastebin to work (i.e. fix the boundary) I can send you some new spamples via attachment, privately.

Re: Bayes underperforming, HTML entities?

2018-11-08 Thread Amir Caspi
On Nov 8, 2018, at 7:41 PM, John Hardin wrote: > > Sure, but I'd also prefer to have a sample to test on before committing. I'll > see if I can get the pastebin to work (i.e. fix the boundary) > I can send you some new spamples via attachment, privately. Unfortunately I lost those

Re: Bayes underperforming, HTML entities?

2018-11-08 Thread John Hardin
On Thu, 8 Nov 2018, Amir Caspi wrote: On Nov 8, 2018, at 4:51 PM, RW wrote: Unnecessary encoding is fairly common, but a long run of ASCII characters encoded like this seems extreme. Right, that was a question I had asked in my email this morning... whether we have a rule to detect long

Re: Bayes underperforming, HTML entities?

2018-11-08 Thread Amir Caspi
On Nov 8, 2018, at 4:51 PM, RW wrote: > > Unnecessary encoding is fairly common, but a long run of ASCII > characters encoded like this seems extreme. Right, that was a question I had asked in my email this morning... whether we have a rule to detect long sequences of HTML entities. It would

Re: Bayes underperforming, HTML entities?

2018-11-08 Thread RW
On Thu, 8 Nov 2018 23:30:42 + RW wrote: > On Thu, 8 Nov 2018 13:14:13 -0700 > Amir Caspi wrote: > > > > If the HTML section is valid, as it appears to be ... then the HTML > > should be decoded. And yet, these emails are hitting BAYES_00 or > > BAYES_05 despite the spammy HTML text. >

Re: Bayes underperforming, HTML entities?

2018-11-08 Thread RW
On Thu, 8 Nov 2018 13:14:13 -0700 Amir Caspi wrote: > If the HTML section is valid, as it appears to be ... then the HTML > should be decoded. And yet, these emails are hitting BAYES_00 or > BAYES_05 despite the spammy HTML text. In the two examples there isn't really much in the html text. I

Re: Bayes underperforming, HTML entities?

2018-11-08 Thread Amir Caspi
On Nov 8, 2018, at 2:19 PM, Bill Cole wrote: > > [Resending because it looks like my first send went into a black hole...] All SA messages appear to be coming with significant delays today... not sure why. I got RW's first message, sent at 8am today, only about an hour ago, AFTER the

Re: Bayes underperforming, HTML entities?

2018-11-08 Thread Bill Cole
[Resending because it looks like my first send went into a black hole...] On 7 Nov 2018, at 14:33, Amir Caspi wrote: Hi all, In the past couple of weeks I've gotten a number of clearly-spam messages that slipped past SA, and the only reason was because they were getting low Bayes scores

Re: Bayes underperforming, HTML entities?

2018-11-08 Thread Bill Cole
On 7 Nov 2018, at 14:33, Amir Caspi wrote: Hi all, In the past couple of weeks I've gotten a number of clearly-spam messages that slipped past SA, and the only reason was because they were getting low Bayes scores (BAYES_50 or even down to BAYES_00 or BAYES_05). I do my Bayes training

Re: Bayes underperforming, HTML entities?

2018-11-08 Thread RW
On Wed, 7 Nov 2018 12:33:35 -0700 Amir Caspi wrote: > In many cases, it would appear that these spams have either very > little (real) text (besides the usual attempt at Bayes poisoning) > and/or are using HTML-entity encoding to try to bypass Bayes. Here > are a couple of spamples: > >

Re: Bayes underperforming, HTML entities?

2018-11-08 Thread Amir Caspi
On Nov 8, 2018, at 12:20 PM, RW wrote: > > these emails don't contain a valid HTML mime section. They contain a bogus > html section that doesn't > start with the separator defined in the top-level Content-Type header. Sorry, that is totally my fault. In the spample, I was trying to sanitize

Re: Bayes underperforming, HTML entities?

2018-11-08 Thread Amir Caspi
On Nov 8, 2018, at 12:20 PM, RW wrote: > > I've already explained this. Sorry, I don't recall this discussion, my apologies. > Do these actually display on any email client? Yes. For example, for the first spample (https://pastebin.com/peiXZivJ), Apple Mail (OS X) displays the decoded HTML

Re: Bayes underperforming, HTML entities?

2018-11-08 Thread RW
On Thu, 8 Nov 2018 10:09:21 -0700 Amir Caspi wrote: > (2) Does normalize_charset decode HTML entities? If not, is this > something that can be included? Do I need to file a bugzilla? I've already explained this. Ordinarily html is decoded (whether normalize_charset is set or not), but these

Re: Bayes underperforming, HTML entities?

2018-11-08 Thread Amir Caspi
On Nov 8, 2018, at 2:30 AM, Matus UHLAR - fantomas wrote: > > Do you use autolearn? There are a few rules to detect ham (score > negatively), many of them based on default whitelists and DNS whitelists, > where many mails come from grey area companies, not necessarily spam, but > training their

Re: Bayes underperforming, HTML entities?

2018-11-08 Thread Amir Caspi
> do you regularly perform sa-update on that box? Yes, it is run every night. However, I am still running 3.4.1, so if the sha1 access has already been disabled, my updates are likely failing as of recently. I'm working on updating to 3.4.2 but this is an ancient box and I haven't yet had the

Re: Bayes underperforming, HTML entities?

2018-11-08 Thread Matus UHLAR - fantomas
On 07.11.18 12:33, Amir Caspi wrote: In the past couple of weeks I've gotten a number of clearly-spam messages that slipped past SA, and the only reason was because they were getting low Bayes scores (BAYES_50 or even down to BAYES_00 or BAYES_05). I do my Bayes training manually on both ham

Re: Bayes underperforming, HTML entities?

2018-11-08 Thread Tobi
Hi I checked the first message on my SA and found multiple hits on __SCC_SHORT_WORDS rule which resulted in hits on the metas * 1.0 SCC_10_SHORT_WORD_LINES 10 lines with many short words * 1.0 SCC_5_SHORT_WORD_LINES 5 lines with many short words * 1.0

Re: Bayes underperforming, HTML entities?

2018-11-08 Thread Amir Caspi
On Nov 7, 2018, at 12:33 PM, Amir Caspi wrote: > > In many cases, it would appear that these spams have either very little > (real) text (besides the usual attempt at Bayes poisoning) and/or are using > HTML-entity encoding to try to bypass Bayes. Here are a couple of spamples: > >