Re: Machine learning with or vs. Bayes?

2019-06-27 Thread Shreyansh Shrivastava.
On Fri, 28 Jun 2019, 07:42 Amir Caspi wrote:

> Hi all,
>
> I don't suppose anyone has a neural-net-based SA Machine Learning plugin
> or external program, to complement or replace Bayes?  There are a number of
> fairly compact Python ML packages that would greatly ease this task
> nowadays, like TensorFlow.  It looks like rspamd has a neural net module...
> I wonder if it would be relatively portable.
>
Hi Amir, I am developing a plugin with two or three statistical
classifiers (including SVM and neural nets) under the Google Summer of Code
programme, with Kevin McGrail as my mentor.

> I guess there's a bunch of ML in use for QA/masscheck and auto-scoring...
> but is there anything for actual rule generation, not just scoring?  Or,
> like Bayes, where the "rule generation" is embedded in the neural net, and
> it just kicks out a spamminess indicator/probability?
>
> Of course, Gmail and the other big providers have their own ML solutions
> that seem to be pretty good, though they have an enormous user base and
> near-infinite resources...
>
> Granted, reliance on python means it's not embedded in SA, but SA already
> calls other external programs like pyzor/razor/DCC, so that wouldn't seem
> to necessarily be a big knock against it.
>
With Python, as you said, it won't be embedded into SA, and hence I'm worried
about plugin integration. We (my mentors and I) have come up with a couple of
feasible solutions. I will post any updates on the list soon.

Note: if you have any information that you think might help with this issue,
please let me know.


> Cheers.
>
> --- Amir
>

Regards,
Shreyansh Shrivastava



Re: Machine learning with or vs. Bayes?

2019-06-27 Thread Olivier
> Of course, Gmail and the other big providers have their own ML solutions that 
> seem to be pretty good, though they have an enormous user base and 
> near-infinite resources...

I would argue, on the contrary, that Gmail performs rather poorly: I get at
least one FP a day, and that is a big no-no. A couple of FNs are not a
problem, but if I miss an important message because it was classified as
spam, I would be really unhappy. As a result, I have to check the spam
folder manually. That is not efficient!

Olivier


Machine learning with or vs. Bayes?

2019-06-27 Thread Amir Caspi
Hi all,

I don't suppose anyone has a neural-net-based SA Machine Learning plugin or 
external program, to complement or replace Bayes?  There are a number of fairly 
compact Python ML packages that would greatly ease this task nowadays, like 
TensorFlow.  It looks like rspamd has a neural net module... I wonder if it 
would be relatively portable.

I guess there's a bunch of ML in use for QA/masscheck and auto-scoring... but 
is there anything for actual rule generation, not just scoring?  Or, like 
Bayes, where the "rule generation" is embedded in the neural net, and it just 
kicks out a spamminess indicator/probability?

Of course, Gmail and the other big providers have their own ML solutions that 
seem to be pretty good, though they have an enormous user base and 
near-infinite resources...

Granted, reliance on python means it's not embedded in SA, but SA already calls 
other external programs like pyzor/razor/DCC, so that wouldn't seem to 
necessarily be a big knock against it.

Cheers.

--- Amir



Re: Zero-width rules?

2019-06-27 Thread RW
On Wed, 26 Jun 2019 18:13:02 -0400
Kevin A. McGrail wrote:

> On 6/26/2019 6:09 PM, Amir Caspi wrote:
> > On Jun 26, 2019, at 4:04 PM, Kevin A. McGrail wrote:

> > That HTML portion should have been picked up by any ZWJ/ZWS/etc.
> > rules, no?

> >  
> I don't know charset="UTF-8" is in the email and for the ZWNJ at
> least, that was in windows-1256.  Is  anything in UTF-8?

UTF-8 isn't particularly relevant, as all the characters in &#x200B; are
in ASCII. In the HTML section it represents the codepoint U+200B, which,
as I pointed out, is a Zero Width Space.
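
The point above can be checked with a few lines of Python. This is only a
sketch: the sample word is made up, and it assumes the entity appears in the
usual numeric form (ampersand, #x200B, semicolon). It shows that the raw
markup is pure ASCII and that the zero-width character only exists after
HTML entity decoding.

```python
import html

# Hypothetical fragment of a raw HTML part; the word is invented and the
# entity form is an assumption about how the spam encoded U+200B.
raw = "mort&#x200B;gage"

# The raw markup is plain ASCII, so the declared charset does not change
# how the entity itself travels on the wire.
raw_is_ascii = raw.isascii()

# Entity decoding is what produces the actual zero-width character.
decoded = html.unescape(raw)
non_ascii = [f"U+{ord(ch):04X}" for ch in decoded if not ch.isascii()]

print(raw_is_ascii, non_ascii)
```

A rule that matches on the rendered text therefore has to look for the
decoded character, while a rule on the raw body would have to look for the
entity string.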


Re: Zero-width rules?

2019-06-27 Thread Amir Caspi
On Jun 27, 2019, at 12:04 PM, John Hardin wrote:
> 
>> There's still not enough of that to trigger a scored rule, though. It may 
>> need some review of the masscheck results, and tuning.
> 
> OK, retuned.

FWIW, the x200b entity occurs only in my spam; I see it nowhere in my ham inbox 
or non-spam trash except for the messages to this mailing list.  Unlike ZWJ 
which joins ligatures in various languages, I suspect the ZWS is a fairly 
spammy indicator.

Cheers.

--- Amir



Re: Zero-width rules?

2019-06-27 Thread John Hardin

On Thu, 27 Jun 2019, John Hardin wrote:


> On Thu, 27 Jun 2019, John Hardin wrote:
>
>> On Wed, 26 Jun 2019, Amir Caspi wrote:
>>
>>> Any idea why this spample didn't hit the ZWJ obfuscation rules?
>>
>> They were looking for multiple obfuscations in a *single* word. I've 
>> loosened that a bit.
>
> There's still not enough of that to trigger a scored rule, though. It may 
> need some review of the masscheck results, and tuning.


OK, retuned.

--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174     pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
---
  Back in 1969 the technology to fake a Moon landing didn't exist,
  but the technology to actually land there did.
  Today, it is the opposite.   -- unknown
---
 7 days until the 243rd anniversary of the Declaration of Independence


Re: Zero-width rules?

2019-06-27 Thread John Hardin

On Thu, 27 Jun 2019, John Hardin wrote:


> On Wed, 26 Jun 2019, Amir Caspi wrote:
>
>> Any idea why this spample didn't hit the ZWJ obfuscation rules?
>
> They were looking for multiple obfuscations in a *single* word. I've loosened 
> that a bit.


There's still not enough of that to trigger a scored rule, though. It may 
need some review of the masscheck results, and tuning.


--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174     pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
---
  ...much of our country's counterterrorism security spending is not
  designed to protect us from the terrorists, but instead to protect
  our public officials from criticism when another attack occurs.
-- Bruce Schneier
---
 7 days until the 243rd anniversary of the Declaration of Independence


Re: Zero-width rules?

2019-06-27 Thread John Hardin

On Wed, 26 Jun 2019, Amir Caspi wrote:


> Any idea why this spample didn't hit the ZWJ obfuscation rules?


They were looking for multiple obfuscations in a *single* word. I've 
loosened that a bit.
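
The difference between "multiple obfuscations in a single word" and a looser
check can be sketched as follows. This is not the actual rule or its regular
expression; the character set, the function name and the sample text are all
illustrative assumptions.

```python
import re

# Illustrative set of zero-width characters (not the list used by the real
# rules): ZWSP, ZWNJ, ZWJ, WORD JOINER, BOM/ZWNBSP.
ZERO_WIDTH = "\u200b\u200c\u200d\u2060\ufeff"
ZW_RE = re.compile(f"[{ZERO_WIDTH}]")

def obfuscated_words(text, min_per_word=2):
    """Return words with at least min_per_word zero-width characters
    sitting between their visible characters."""
    hits = []
    for word in text.split():
        inner = word.strip(ZERO_WIDTH)  # ignore leading/trailing ones
        if len(ZW_RE.findall(inner)) >= min_per_word:
            hits.append(inner)
    return hits

# Made-up sample: two words carry one zero-width space, one word carries two.
sample = "cheap me\u200bds and lo\u200bw pr\u200bi\u200bces"

strict = obfuscated_words(sample)                  # several in ONE word
loose = obfuscated_words(sample, min_per_word=1)   # any word with one

print(len(strict), len(loose))
```

With the strict setting only one word in the sample qualifies, while the
loosened setting catches three, which is the kind of spam that spreads single
obfuscations over many words.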


--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174     pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
---
 Warning Labels we'd like to see #1: "If you are a stupid idiot while
 using this product you may hurt yourself. And it won't be our fault."
---
 7 days until the 243rd anniversary of the Declaration of Independence


Re: Zero-width rules?

2019-06-27 Thread John Hardin

On Wed, 26 Jun 2019, Amir Caspi wrote:


> John et al,
>
> I recall from a prior thread last year that there were supposed to be some 
> rules to check for zero-width joiner characters... but I'm seeing spams 
> recently that have these, but don't hit any such rules.
>
> Here's one spample, where the ZWJ entity #x200B is being used to try to 
> sidestep Bayes detection of highly spammy words.
> https://pastebin.com/kx0jVBtZ


I'll take a look. It's possible that there are some ZWJ the RE isn't 
looking for.
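
Why a zero-width character can sidestep token-based detection is easy to show
with a toy tokenizer. The sketch below is an assumption-laden illustration:
the tokenizer is a crude stand-in and not SpamAssassin's Bayes tokenizer, and
the sample words are invented.

```python
import html
import re

# Illustrative set of zero-width characters to strip.
ZERO_WIDTH_RE = re.compile("[\u200b\u200c\u200d\u2060\ufeff]")

def tokens(text):
    """Crude stand-in tokenizer: split on runs of non-word characters."""
    return [t.lower() for t in re.split(r"\W+", text) if t]

def normalize(raw_html_text):
    """Decode HTML entities, then drop zero-width characters."""
    return ZERO_WIDTH_RE.sub("", html.unescape(raw_html_text))

# Hypothetical obfuscated text as it might appear in an HTML part.
sample = "refi&#x200B;nance your mort&#x200B;gage"

split_tokens = tokens(html.unescape(sample))  # obfuscation breaks the words
clean_tokens = tokens(normalize(sample))      # normalization restores them

print(split_tokens)
print(clean_tokens)
```

Without normalization the classifier only ever sees harmless fragments. Note
that blindly stripping ZWJ/ZWNJ can damage legitimate text in languages that
rely on them, so a real implementation would need to be more selective.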



--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174     pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
---
 Warning Labels we'd like to see #1: "If you are a stupid idiot while
 using this product you may hurt yourself. And it won't be our fault."
---
 7 days until the 243rd anniversary of the Declaration of Independence


Re: How to create my personal RBL

2019-06-27 Thread David Jones
On 6/26/19 3:43 AM, hg user wrote:
> Thank you everybody for your really interesting answers. At the moment 
> I'm just collecting information.
> 
> I have one main problem: one of the engines used by our commercial 
> antispam solution returns too many FPs. I'm gradually introducing 
> spamassassin (included in zimbra) and I'd like to mitigate the FPs with 
> some other checks... using a proven, well-known technology like AskDNS 
> seems a quick and viable solution to me.
> 
> Unfortunately a personal RBL may not cover all the use cases I'm 
> thinking about and looking at the source code of a plugin that queries a 
> sql or redis server can be interesting.

Before you start working on a custom plugin, have you tuned your MTA 
and SpamAssassin?  From my personal experience, I set up an edge MTA as 
the MX, sent filtered mail to Zimbra, and smarthosted from Zimbra back 
to the edge MTA.  This provides the most flexibility to upgrade Perl and 
SpamAssassin to the latest version along with many other benefits.

Tuning the MTA:
- Set up Postfix with Postscreen.
- Enable weighted RBLs in Postscreen, lots of them.  See the SA mailing 
list archives for "postscreen_dnsbl_sites".
   __This alone will block 80% or more of spam/junk.__
- Set up postfwd to give extra control to add headers based on SMTP 
conversation time so SA can use those headers later.  For example, I set 
headers based on the number of recipients, which is very useful when 
email has been BCC'd.
- Set up sqlgrey and slowly phase it in so users won't even notice it.
- Set up policyd-spf, OpenDMARC, and OpenDKIM.
- Set up fail2ban for repeat spammers/bots.
- Set up Postwhite to whitelist trusted senders by their SPF record. 
This allows for turning up other Postfix config settings.
- Set up TLS with a Let's Encrypt certificate.
- Set up rate limiting, then put exceptions in 
smtpd_client_event_limit_exceptions.
- Tune Postfix header_checks, body_checks, smtpd_client_restrictions, 
smtpd_helo_restrictions, smtpd_sender_restrictions, 
smtpd_relay_restrictions, smtpd_recipient_restrictions, and 
smtpd_data_restrictions in main.cf over time.
- Enable reject_unverified_recipient in smtpd_recipient_restrictions so 
Postfix will "look ahead" to Zimbra and not accept invalid recipients.
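
The idea behind weighted RBLs in Postscreen is that each list contributes a
weight and the client is blocked once the combined score reaches a threshold.
The sketch below only models that arithmetic. The list names, weights and
threshold are hypothetical, and no DNS lookups are performed.

```python
# Hypothetical list names and weights, for illustration only.
DNSBL_WEIGHTS = {
    "bl.example.org": 3,
    "dnsbl.example.net": 2,
    "wl.example.com": -4,  # a whitelist carries a negative weight
}
THRESHOLD = 3  # assumed blocking threshold (inclusive lower bound)

def dnsbl_score(listed_on):
    """Sum the weights of every list that returned a match."""
    return sum(DNSBL_WEIGHTS.get(name, 0) for name in listed_on)

def should_block(listed_on, threshold=THRESHOLD):
    """Block when the combined score reaches the threshold."""
    return dnsbl_score(listed_on) >= threshold

# A client on one weak list passes; one on two lists is blocked; a
# whitelisted client is let through even though blocklists match.
print(should_block(["dnsbl.example.net"]))
print(should_block(["bl.example.org", "dnsbl.example.net"]))
print(should_block(["bl.example.org", "dnsbl.example.net", "wl.example.com"]))
```

Weighting lets many small, individually unreliable lists add up to a
confident decision, while a single listing on a weak list is not enough to
reject mail.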

Tuning SpamAssassin:
- Make sure your internal_networks and trusted_networks are correct so 
RBL checks will happen correctly for the last external IP.  I have 
extended this out to Google, Office 365, and other major platforms to 
detect the X-Originating-IP of the web/mail client.
- Install KAM.cf and KAMonly.cf.
- Install DCC, Razor, and Pyzor.
- Install ClamAV unofficial (extra) signatures.
- Add local rules to use the headers from OpenDMARC.
- Enable extra RBLs that aren't in the stock SA.
- I use the ShortCircuit plugin heavily, disable the ALL_TRUSTED 
shortcircuit, and enable shortcircuit on a number of the USER_IN_* rules.
- I have created a massive list of whitelist_auth entries that are 
mostly subdomain senders from trusted senders.
- Set up a way to train your Bayes easily by dragging email into a Spam 
and a Ham folder as things are misclassified, to keep the Bayesian DB tuned 
correctly.
- Get on the latest version of Perl, even if you have to compile it 
because your OS is older.
- Install the latest stable version of SpamAssassin.
- Many more things covered on this list over the years.
- I set up local DBLs and DWLs for brand new Office 365 senders and other 
common sources of spam like secureserver.net, unifiedlayer.com, 
websitewelcome.com, myregisteredsite.com, etc. to add a couple of points 
for new senders.  Then I add good senders on those bad hosting platforms 
to a DWL that subtracts a couple of points and excludes them from other 
meta rules that amplify certain scores for the spam.

Note that a lot of this can be found by setting up a quick VM and 
installing iRedMail to check out the Postfix configuration for the 
milters mentioned above and the TLS configuration.  It uses amavisd-new, so 
that might be different from how you want to "glue" SpamAssassin into 
the MTA.

I use MailScanner, which has a few extra features of its own in addition 
to processing emails in batches for high-volume mail flow.

After I did all of that work above over many years, my mail filtering 
accuracy is very good for about 80,000 mailboxes.  The more mailboxes 
and domains you filter, the more time it takes to tune everything properly.


> 
> Thank you
> Francesco
> 
> On Tue, Jun 25, 2019 at 10:20 PM Matus UHLAR - fantomas 
> <uh...@fantomas.sk> wrote:
> 
>  >On Tue, 2019-06-25 at 11:09 -0500, David B Funk wrote:
>  >> that's way overthinking it.
> 
> On 25.06.19 17:55, Martin Gregorie wrote:
>  >I agree, now that there's a configurable OSS dnsbl server available,
>  >that using it is the obvious choice for dealing with a standalone list,
>  >but the  OP did ask specifically about using database queries to
>  >implement a 

Re: spamass-milter reject?

2019-06-27 Thread Matus UHLAR - fantomas

> On 27 Jun 2019, at 9:33, Matus UHLAR - fantomas wrote:
>
>> for mail received from the net I use amavisd-new with amavisd-milter.
>>
>> Content filter accepts message, I don't want to drop it, send bounce or send
>> it to anyone.  I use content filter for mail sent from internal network or
>> through alternative ports.

On 27.06.19 10:50, Matt Anton wrote:

> Have you many false positives by rejecting outright mails marked as spam by 
> amavisd-new?


I haven't seen any false positives.

Apparently because they were all rejected and thus never delivered to me.

Or because the reject score is different from the tag score:

$sa_tag_level_deflt  = undef;  # always add status
$sa_tag2_level_deflt = 5;

$policy_bank{'AM.PDP-SOCK'} = {
   protocol => 'AM.PDP',   # select Amavis policy delegation protocol
   spam_kill_level_maps=> 8,
   final_spam_destiny  => D_REJECT,
   final_virus_destiny => D_REJECT,
   final_banned_destiny=> D_REJECT,
   spam_quarantine_to_maps => undef,
};
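
How the two thresholds interact can be modelled in a few lines. This is a
simplified sketch of the decision implied by the settings above, not amavisd
code. The function name and the result labels are made up for illustration.

```python
TAG2_LEVEL = 5.0  # mirrors $sa_tag2_level_deflt above
KILL_LEVEL = 8.0  # mirrors spam_kill_level_maps above

def disposition(score):
    """Simplified model: reject at the kill level, mark as spam at the
    tag2 level, otherwise deliver untouched (status headers aside)."""
    if score >= KILL_LEVEL:
        return "reject"
    if score >= TAG2_LEVEL:
        return "tag"
    return "pass"

for score in (4.9, 5.0, 7.9, 8.0):
    print(score, disposition(score))
```

Mail scoring between the two levels is still delivered, only marked, so a
borderline false positive remains visible to the recipient instead of being
rejected.
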
--
Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
You have the right to remain silent. Anything you say will be misquoted,
then used against you. 


Re: spamass-milter reject?

2019-06-27 Thread Matt Anton
On 27 Jun 2019, at 9:33, Matus UHLAR - fantomas wrote:

> for mail received from the net I use amavisd-new with amavisd-milter.
>
> Content filter accepts message, I don't want to drop it, send bounce or send
> it to anyone.  I use content filter for mail sent from internal network or
> through alternative ports.

Have you many false positives by rejecting outright mails marked as spam by 
amavisd-new?

-- 
matt [at] lv223.org
GPG key ID: 7D91A8CA



Re: spamass-milter reject?

2019-06-27 Thread Matus UHLAR - fantomas

> On 26 Jun 2019, at 9:02, @lbutlr wrote:
>
>> Well, I want spam MARKED at 5.0, but I want it REJECTED at 10.0.  It is a
>> subtle difference, but the majority of spam being delivered to users is
>> in the 10-100 range.

On 26.06.19 22:19, Matt Anton wrote:

> I achieve that with amavisd-new being configured as an after queue content
> filter, thus the required_score in local.cf only applies to
> spamass-milter/spamd for rejecting outright before it is queued.


for mail received from the net I use amavisd-new with amavisd-milter.

Content filter accepts message, I don't want to drop it, send bounce or send
it to anyone.  I use content filter for mail sent from internal network or
through alternative ports.

--
Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
I feel like I'm diagonally parked in a parallel universe.