Re: Bayes underperforming, HTML entities?

2018-11-09 Thread John Hardin
On Fri, 9 Nov 2018, John Hardin wrote: On Fri, 9 Nov 2018, Amir Caspi wrote: I'd be interested to know if there's a performance difference between my two proposed rules. I suspect the second should run (slightly) faster. It looks that way - only .0001s difference on *some* messages. Re

Re: Bayes underperforming, HTML entities?

2018-11-09 Thread John Hardin
On Fri, 9 Nov 2018, Amir Caspi wrote: On Nov 9, 2018, at 8:49 AM, John Hardin wrote: rawbody HTML_ENC_ASCII /(?:&\#(?:(?:\d{1,2}|1[01]\d|12[0-7])|x[0-7][0-9a-f])\s*;\s*){10}/i I'll add that too so that we can compare the results. Per my reply a few minutes ago, I think this will be

Re: Bayes underperforming, HTML entities?

2018-11-09 Thread RW
On Fri, 9 Nov 2018 15:34:47 -0500 Kris Deugau wrote: > Amir Caspi wrote: > > On Nov 9, 2018, at 8:10 AM, Matus UHLAR - fantomas > > wrote: > >> > >> how many spams and hams did you train then? > > > > As of right now: > > 0.000 0 258427 0 non-token data: nspam > >

Re: Bayes underperforming, HTML entities?

2018-11-09 Thread Kris Deugau
Amir Caspi wrote: On Nov 9, 2018, at 8:10 AM, Matus UHLAR - fantomas wrote: how many spams and hams did you train then? As of right now: 0.000 0 258427 0 non-token data: nspam 0.000 0 106813 0 non-token data: nham 0.000 0 438310

Re: Bayes underperforming, HTML entities?

2018-11-09 Thread RW
On Thu, 8 Nov 2018 19:24:47 -0700 Amir Caspi wrote: > On Nov 8, 2018, at 4:51 PM, RW wrote: > > > > Unnecessary encoding is fairly common, but long runs of ASCII > > characters encoded like this seem extreme. > > Right, that was a question I had asked in my email this morning... > whether

Re: Bayes underperforming, HTML entities?

2018-11-09 Thread Matus UHLAR - fantomas
On Nov 8, 2018, at 2:30 AM, Matus UHLAR - fantomas wrote: Do you use autolearn? There are a few rules to detect ham (score negatively), many of them based on default whitelists and DNS whitelists, where many mails come from grey area companies, not necessarily spam, but training their mail as

Re: Bayes underperforming, HTML entities?

2018-11-09 Thread John Hardin
On Thu, 8 Nov 2018, Bill Cole wrote: On 8 Nov 2018, at 21:55, John Hardin wrote: On Thu, 8 Nov 2018, Amir Caspi wrote: On Nov 8, 2018, at 7:41 PM, John Hardin wrote: Sure, but I'd also prefer to have a sample to test on before committing. I'll see if I can get the pastebin to work (i.e.

Re: Bayes underperforming, HTML entities?

2018-11-09 Thread John Hardin
describe AC_HTML_ENTITY_BONANZA Long run of HTML-encoded characters score AC_HTML_ENTITY_BONANZA Early results (not all corpora are in yet) look *very* promising: https://ruleqa.spamassassin.org/20181109-r1846219-n/__AC_HTML_ENTITY_BONANZA/detail 3% of spam, S/O .958 and almost all spam hit

Re: Bayes underperforming, HTML entities?

2018-11-09 Thread Amir Caspi
On Nov 9, 2018, at 7:41 AM, RW wrote: > > I was really referring to the fact that it's pure ASCII text that's > being encoded rather than long runs per se That is true for the current batch of messages, but as we've seen, spammers love to use unicode obfuscation to try to foil Bayes and other

Re: Bayes underperforming, HTML entities?

2018-11-09 Thread Amir Caspi
On Nov 9, 2018, at 8:10 AM, Matus UHLAR - fantomas wrote: > > how many spams and hams did you train then? As of right now: 0.000 0 258427 0 non-token data: nspam 0.000 0 106813 0 non-token data: nham 0.000 0 438310 0 non-token

Re: Bayes underperforming, HTML entities?

2018-11-09 Thread Amir Caspi
On Nov 9, 2018, at 8:49 AM, John Hardin wrote: > >> rawbody HTML_ENC_ASCII >> /(?:&\#(?:(?:\d{1,2}|1[01]\d|12[0-7])|x[0-7][0-9a-f])\s*;\s*){10}/i > > I'll add that too so that we can compare the results. Per my reply a few minutes ago, I think this will be too restrictive. While the
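
The HTML_ENC_ASCII pattern quoted above can be tried outside SpamAssassin to see what it does and does not catch. The following is only a rough sketch in Python, not the rule as SpamAssassin runs it: it assumes Python's re module treats this particular pattern the same way Perl does, and the helper names (encode_decimal, encode_hex, hits) are hypothetical ones introduced for illustration. It suggests why the pattern may be "too restrictive": it only counts entities for code points 0-127, so a run of encoded non-ASCII characters does not match.

```python
import re

# Pattern copied from the quoted rawbody rule. The "\#" in the rule file
# only escapes the config comment character, so a plain "#" is used here.
HTML_ENC_ASCII = re.compile(
    r"(?:&#(?:(?:\d{1,2}|1[01]\d|12[0-7])|x[0-7][0-9a-f])\s*;\s*){10}",
    re.IGNORECASE,
)


def encode_decimal(text: str) -> str:
    """Hypothetical helper: encode every character as a decimal entity."""
    return "".join(f"&#{ord(c)};" for c in text)


def encode_hex(text: str) -> str:
    """Hypothetical helper: encode every character as a hex entity."""
    return "".join(f"&#x{ord(c):02x};" for c in text)


def hits(body: str) -> bool:
    """True if the body contains 10 consecutive ASCII-range entities."""
    return HTML_ENC_ASCII.search(body) is not None


if __name__ == "__main__":
    samples = {
        "12 ASCII chars, decimal": encode_decimal("Hello, world"),
        "12 ASCII chars, hex": encode_hex("Hello, world"),
        "8 ASCII chars, decimal": encode_decimal("Hi there"),
        "12 non-ASCII chars, decimal": encode_decimal("\u00e9" * 12),
        "plain text": "Hello, world",
    }
    for label, body in samples.items():
        print(f"{label}: {hits(body)}")
```

Only the first two samples match. The short run falls under the {10} repeat count, and the encoded non-ASCII run is rejected because 233 is outside the 0-127 range that the decimal alternatives accept.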