Re: No rule updates since 1/1/17

2018-08-26 Thread Kevin A. McGrail
Thanks, Tom.  I apologize: I didn't know, and had asked Dave if he would
follow up on some old masscheck requests I found squirreled away in a folder!

--
Kevin A. McGrail
VP Fundraising, Apache Software Foundation
Chair Emeritus Apache SpamAssassin Project
https://www.linkedin.com/in/kmcgrail - 703.798.0171

On Sun, Aug 26, 2018 at 7:01 AM, Tom Hendrikx  wrote:

>
> Hi David,
>
> I'm already running masschecks since feb 2017, results labeled
> 'thendrikx' are mine. :)
>
> I'm not adding massive volumes though, mostly because I'm running a
> small '3 men and a dog' setup. But I think it's important that I can
> contribute sample data in my locale (nl_NL), so I would invite others to
> set it up too: It's not a lot of work and it mostly runs without any
> manual intervention (I was already manually sorting ham and spam).
>
> To give a bit of an idea of how I do it: I run a postfix server on
> ubuntu, with spamassassin as a milter. I redirect all possible spam into
> my Junk folder, and check that daily.
>
> The masscheck is run using a simple wrapper script that takes the
> following steps (from daily cron):
> - Copy all spam in $workdir from Spamtraps and Junk folders (only
> IMAP-seen emails) and not older than 2 months
> - Copy all ham into $workdir from several IMAP folders that are known to
> be sorted by hand, and not older than 6 years
> - Run masscheck on the copied messages
> - Print a list of the subjects of the lowest scoring spam samples, and
> the highest scoring ham samples
> - Cleanup all copied email
> - Mail all output to myself
>
> I spent less than a day in setting this up, and it has been running
> without issues ever since. When you're interested, read up on
> https://wiki.apache.org/spamassassin/NightlyMassCheck and try to set it
> up. If you run into issues, other masscheckers can probably help you out.
>
> Kind regards,
> Tom
>
> On 25-08-18 16:12, David Jones wrote:
> > Tom,
> >
> > Let me know if you are still interested in setting up a masschecker.
> > That goes for anyone on this list as well.  I have worked out the
> > sorting issue pretty well now and my ena-weekX masscheckers are now the
> > largest contributions to the RuleQA corpus keeping the nightly rule
> > scoring updating regularly the past year.
> >
> > http://ruleqa.spamassassin.org/  (see the ena-weekX in the green box)
> >
> > New/more masscheckers are always welcome and will help you learn the
> > best way to tune your SA platform to get every last drop of accuracy
> > from your local meta rules.  We could really use masscheckers with
> > primary languages not English to add/improve core SA rules.
> >
> > Here's my setup:
> >
> > - I have an iRedmail server that I split copies of most of my email to
> > an internal-only email domain "sa.ena.net."
> >
> > - The iRedmail server has Sieve rules (easily managed by RoundCube)
> > based on certain rule hits and scores from my main Internet edge
> > MailScanner filtering that move them into Ham and Spam folders as
> > unread.  Mail scoring in the middle -- not high enough for obvious Spam
> > or low enough for obvious Ham are left in the main Inbox.
> >
> > - I spend a few minutes each day visually scanning the Subjects of the
> > unread email then mark them as Read.
> >
> > - If I find a zero-hour email in the main Inbox, then I move it to a
> > SpamCop folder.  A script that runs every 5 minutes to check the SpamCop
> > folder, strips off some extra Received headers from my internal hops,
> > then submits it as an attachment to my SpamCop account.
> >
> > - A script moves the Maildir email to 4 other masschecker VMs to split
> > out the load so they will be able to submit their results quickly.
> > Ena-week0 is the last week of ham/spam that is still on the iRedMail
> > server.  Ena-week1-4 are running on the other 4 masschecker VMs to give
> > a total of 5 weeks of recent corpus.  I currently have 100,939 Ham and
> > 292,001 Spam in ena-week0-4.
> >
> > - I run a local Bayesian train on the ena-week0 Ham and Spam folder to
> > my Redis-based Bayes storage shared across my 8 MailScanner nodes and my
> > iRedMail/amavis server.  This method has shown to keep my Bayes scores
> > very accurate.
> >
> > Hope someone finds this information helpful.
> >
> > Dave
> >
> >
> > On 01/20/2017 01:02 PM, Tom Hendrikx wrote:
> >> On 20-01-17 19:46, David Jones wrote:
> >>>> From: Kevin Golding 
> >>>> Sent: Friday, January 20, 2017 11:59 AM
> >>>> To: users@spamassassin.apache.org
> >

Re: No rule updates since 1/1/17

2018-08-26 Thread Tom Hendrikx

Hi David,

I have been running masschecks since Feb 2017; the results labeled
'thendrikx' are mine. :)

I'm not adding massive volumes though, mostly because I'm running a
small '3 men and a dog' setup. But I think it's important that I can
contribute sample data in my locale (nl_NL), so I would invite others to
set it up too: It's not a lot of work and it mostly runs without any
manual intervention (I was already manually sorting ham and spam).

To give a bit of an idea of how I do it: I run a postfix server on
ubuntu, with spamassassin as a milter. I redirect all possible spam into
my Junk folder, and check that daily.

The masscheck is run using a simple wrapper script that takes the
following steps (from daily cron):
- Copy all spam into $workdir from Spamtraps and Junk folders (only
IMAP-seen emails), not older than 2 months
- Copy all ham into $workdir from several IMAP folders that are known to
be sorted by hand, and not older than 6 years
- Run masscheck on the copied messages
- Print a list of the subjects of the lowest scoring spam samples, and
the highest scoring ham samples
- Cleanup all copied email
- Mail all output to myself
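
A minimal sketch of the selection step in the list above, not the actual wrapper script. It assumes the mail is stored as Maildir files (where the seen state is the `S` flag after `:2,` in the filename), that the caller supplies each message's received time, and that "2 months" can be approximated as 60 days. The function names are hypothetical.

```python
from datetime import datetime, timedelta


def is_seen(maildir_filename):
    """Return True if a Maildir filename carries the 'S' (seen) flag."""
    _, sep, info = maildir_filename.rpartition(":2,")
    return bool(sep) and "S" in info


def select_for_masscheck(messages, max_age_days, now, require_seen=False):
    """Pick the filenames that are recent enough (and seen, if required).

    messages: iterable of (filename, received datetime) pairs.
    """
    cutoff = now - timedelta(days=max_age_days)
    return [
        name
        for name, received in messages
        if received >= cutoff and (not require_seen or is_seen(name))
    ]


if __name__ == "__main__":
    now = datetime(2018, 8, 26)
    junk = [
        ("1.host:2,S", datetime(2018, 8, 1)),   # seen, recent
        ("2.host:2,", datetime(2018, 8, 1)),    # unseen, recent
        ("3.host:2,RS", datetime(2018, 1, 1)),  # seen, too old for spam
    ]
    # Spam: seen only, roughly 2 months (60 days is an assumption).
    print(select_for_masscheck(junk, 60, now, require_seen=True))
```

Ham would go through the same function with a much larger age limit and without the seen requirement.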

I spent less than a day in setting this up, and it has been running
without issues ever since. If you're interested, read up on
https://wiki.apache.org/spamassassin/NightlyMassCheck and try to set it
up. If you run into issues, other masscheckers can probably help you out.
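
The reporting step from the list above (lowest-scoring spam, highest-scoring ham) could be sketched roughly as follows. This assumes each mass-check result line has the form `Y|. score path tests metadata`, separated by whitespace, and that paths contain no spaces; both are assumptions to verify against your own ham.log and spam.log. Looking up the Subject of each path is left out.

```python
def parse_masscheck_line(line):
    """Return (score, path) for a result line, or None for comments/blanks."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    fields = line.split(None, 3)
    if len(fields) < 3:
        return None
    return float(fields[1]), fields[2]


def extremes(log_lines, count, highest):
    """Return the `count` highest- or lowest-scoring (score, path) rows."""
    rows = [row for row in map(parse_masscheck_line, log_lines) if row]
    rows.sort(key=lambda row: row[0], reverse=highest)
    return rows[:count]


if __name__ == "__main__":
    # Hypothetical sample lines, not real masscheck output.
    spam_log = [
        "# comment line",
        "Y 12 /work/spam/1 RULE_A,RULE_B time=1",
        ". 1 /work/spam/2 RULE_C time=2",
    ]
    # Spam that scored low is worth a second look.
    for score, path in extremes(spam_log, 5, highest=False):
        print(score, path)
```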

Kind regards,
Tom

On 25-08-18 16:12, David Jones wrote:
> Tom,
> 
> Let me know if you are still interested in setting up a masschecker. 
> That goes for anyone on this list as well.  I have worked out the
> sorting issue pretty well now and my ena-weekX masscheckers are now the
> largest contributions to the RuleQA corpus keeping the nightly rule
> scoring updating regularly the past year.
> 
> http://ruleqa.spamassassin.org/  (see the ena-weekX in the green box)
> 
> New/more masscheckers are always welcome and will help you learn the
> best way to tune your SA platform to get every last drop of accuracy
> from your local meta rules.  We could really use masscheckers with
> primary languages not English to add/improve core SA rules.
> 
> Here's my setup:
> 
> - I have an iRedmail server that I split copies of most of my email to
> an internal-only email domain "sa.ena.net."
> 
> - The iRedmail server has Sieve rules (easily managed by RoundCube)
> based on certain rule hits and scores from my main Internet edge
> MailScanner filtering that move them into Ham and Spam folders as
> unread.  Mail scoring in the middle -- not high enough for obvious Spam
> or low enough for obvious Ham are left in the main Inbox.
> 
> - I spend a few minutes each day visually scanning the Subjects of the
> unread email then mark them as Read.
> 
> - If I find a zero-hour email in the main Inbox, then I move it to a
> SpamCop folder.  A script that runs every 5 minutes to check the SpamCop
> folder, strips off some extra Received headers from my internal hops,
> then submits it as an attachment to my SpamCop account.
> 
> - A script moves the Maildir email to 4 other masschecker VMs to split
> out the load so they will be able to submit their results quickly. 
> Ena-week0 is the last week of ham/spam that is still on the iRedMail
> server.  Ena-week1-4 are running on the other 4 masschecker VMs to give
> a total of 5 weeks of recent corpus.  I currently have 100,939 Ham and
> 292,001 Spam in ena-week0-4.
> 
> - I run a local Bayesian train on the ena-week0 Ham and Spam folder to
> my Redis-based Bayes storage shared across my 8 MailScanner nodes and my
> iRedMail/amavis server.  This method has shown to keep my Bayes scores
> very accurate.
> 
> Hope someone finds this information helpful.
> 
> Dave
> 
> 
> On 01/20/2017 01:02 PM, Tom Hendrikx wrote:
>> On 20-01-17 19:46, David Jones wrote:
>>>> From: Kevin Golding 
>>>> Sent: Friday, January 20, 2017 11:59 AM
>>>> To: users@spamassassin.apache.org
>>>> Subject: Re: No rule updates since 1/1/17
>>> 
>>>> On Fri, 20 Jan 2017 17:26:01 -, Bill Keenan  
>>>>  wrote:
>>>>> What is the fix needed so /usr/bin/sa-update starts getting updates? I  
>>>>> too have not received an update from updates.spamassassin.org  
>>>>> <http://updates.spamassassin.org/> since 1-Jan-17.
>>>>>
>>>>> Besides updates.spamassassin.org <http://updates.spamassassin.org/>, 
>>>>> what other rule sets are commonly used? Hundreds of spam messages are  
>>>>> getting through with only updates.spamassassin.org  
>>>>> <http://updates.spamassassin.org/> rules.
>>>> This seems

Re: Re: No rule updates since 1/1/17

2018-08-25 Thread David Jones

Tom,

Let me know if you are still interested in setting up a masschecker.  
That goes for anyone on this list as well.  I have worked out the 
sorting issue pretty well now and my ena-weekX masscheckers are now the 
largest contributions to the RuleQA corpus, keeping the nightly rule 
scoring updating regularly over the past year.


http://ruleqa.spamassassin.org/ (see the ena-weekX in the green box)

New/more masscheckers are always welcome and will help you learn the 
best way to tune your SA platform to get every last drop of accuracy 
from your local meta rules.  We could really use masscheckers with 
primary languages not English to add/improve core SA rules.


Here's my setup:

- I have an iRedmail server that I split copies of most of my email to 
an internal-only email domain "sa.ena.net."


- The iRedmail server has Sieve rules (easily managed by RoundCube) 
based on certain rule hits and scores from my main Internet edge 
MailScanner filtering that move them into Ham and Spam folders as 
unread.  Mail scoring in the middle -- not high enough for obvious Spam 
or low enough for obvious Ham -- is left in the main Inbox.
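
The three-way split described above can be illustrated with a small sketch. It is not the actual Sieve rule set: it looks only at the score (the real rules also use specific rule hits), and the two thresholds are hypothetical values chosen for the example.

```python
def route_by_score(score, ham_max=0.0, spam_min=10.0):
    """Pick a folder from the edge filter's score.

    ham_max and spam_min are hypothetical thresholds; anything in
    between stays in the main Inbox for manual review.
    """
    if score >= spam_min:
        return "Spam"
    if score <= ham_max:
        return "Ham"
    return "INBOX"


if __name__ == "__main__":
    for score in (-2.5, 4.0, 17.3):
        print(score, route_by_score(score))
```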


- I spend a few minutes each day visually scanning the Subjects of the 
unread email then mark them as Read.


- If I find a zero-hour email in the main Inbox, then I move it to a 
SpamCop folder.  A script that runs every 5 minutes to check the SpamCop 
folder, strips off some extra Received headers from my internal hops, 
then submits it as an attachment to my SpamCop account.
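
As an illustration of the header-stripping step, here is a minimal sketch using Python's standard email package. It is not the script described above. It assumes the internal hops are the top-most Received headers (each MTA prepends its own, so the newest hops come first) and that the number of internal hops is known; the function name is hypothetical.

```python
from email import message_from_string


def strip_internal_received(raw_message, internal_hops):
    """Drop the first `internal_hops` Received headers, keep everything else."""
    msg = message_from_string(raw_message)
    kept = []
    skipped = 0
    for name, value in msg.items():
        if name.lower() == "received" and skipped < internal_hops:
            skipped += 1
            continue
        kept.append((name, value))
    # Remove all headers, then re-add the kept ones in their original order.
    for name in set(msg.keys()):
        del msg[name]
    for name, value in kept:
        msg[name] = value
    return msg.as_string()


if __name__ == "__main__":
    raw = (
        "Received: from internal2.example\n"
        "Received: from internal1.example\n"
        "Received: from outside.example\n"
        "Subject: test\n"
        "\n"
        "body\n"
    )
    print(strip_internal_received(raw, 2))
```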


- A script moves the Maildir email to 4 other masschecker VMs to split 
out the load so they will be able to submit their results quickly.  
Ena-week0 is the last week of ham/spam that is still on the iRedMail 
server.  Ena-week1-4 are running on the other 4 masschecker VMs to give 
a total of 5 weeks of recent corpus.  I currently have 100,939 Ham and 
292,001 Spam in ena-week0-4.
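
The weekly split could be expressed as a small age-to-bucket mapping. This is a sketch under the assumption that a message's bucket is determined only by its age in whole weeks; the function name is hypothetical, and the actual script that moves the Maildir files is not shown in this thread.

```python
from datetime import datetime, timedelta


def week_bucket(received, now, weeks=5):
    """Map a message's age to 'ena-week0' .. 'ena-week4', or None if too old."""
    age = now - received
    if age < timedelta(0):
        return None  # timestamp in the future, ignore
    index = age.days // 7
    return f"ena-week{index}" if index < weeks else None


if __name__ == "__main__":
    now = datetime(2018, 8, 25)
    for days in (3, 10, 34, 35):
        print(days, week_bucket(now - timedelta(days=days), now))
```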


- I run a local Bayesian train on the ena-week0 Ham and Spam folder to 
my Redis-based Bayes storage shared across my 8 MailScanner nodes and my 
iRedMail/amavis server.  This method has been shown to keep my Bayes scores 
very accurate.


Hope someone finds this information helpful.

Dave


On 01/20/2017 01:02 PM, Tom Hendrikx wrote:
> On 20-01-17 19:46, David Jones wrote:
>>> From: Kevin Golding 
>>> Sent: Friday, January 20, 2017 11:59 AM
>>> To: users@spamassassin.apache.org
>>> Subject: Re: No rule updates since 1/1/17
>>
>>> On Fri, 20 Jan 2017 17:26:01 -, Bill Keenan
>>>  wrote:
>>>> What is the fix needed so /usr/bin/sa-update starts getting updates? I
>>>> too have not received an update from updates.spamassassin.org
>>>> <http://updates.spamassassin.org/> since 1-Jan-17.
>>>>
>>>> Besides updates.spamassassin.org <http://updates.spamassassin.org/>,
>>>> what other rule sets are commonly used? Hundreds of spam messages are
>>>> getting through with only updates.spamassassin.org
>>>> <http://updates.spamassassin.org/> rules.
>>>
>>> This seems like a good time to mention
>>> https://wiki.apache.org/spamassassin/NightlyMassCheck
>>> If more people can contribute, even just a small corpus of mail, then
>>> updates will be published more frequently. At the moment a very small
>>> number of people provide data, meaning there is very little margin for
>>> error.
>>
>> I would like to help with the nightly masscheck but I don't have the
>> resources to manually check ham and spam.  This also gets into the
>> grey area of how people define spam.  I also have a very good MTA
>> setup with RBLs and DNS checks that block most of the spam before
>> it reaches SA in MailScanner.  My SA only has to block a very small
>> percentage of my definition of spam so I am not sure how helpful
>> my mail filtering platform can be even though it's very accurate.
>>
>> Dave
>
> I think I can say the same about my platform, but since this issue keeps
> popping up I just applied for an account just to find out if my
> contribution could help. I can't speculate so I'm just gonna try if it
> helps :)
>
> Kind regards,
> Tom



--
David Jones



Re: No rule updates since 1/1/17

2017-01-24 Thread David Jones
>> I think the "barrier to entry" is too difficult for most.  I would
>> have to setup a new MX on a domain without MTA checks (DNS and RBL)

I set this up and it was much easier than I had thought.  The wiki
documentation was helpful but very confusing at first.  Start with:

https://wiki.apache.org/spamassassin/NightlyMassCheck

The automasscheck-minimal.sh script does all of the work: pulling
down the latest rules, running the masscheck, then uploading
the results (not your email) to the RuleQA server.

I set up an iRedMail server, leaving the default Postfix MTA
postscreen settings to do basic DNS and RBL checks, which
use zen.spamhaus.org and b.barracudacentral.org.  The
default iRedmail spam filtering works pretty well so I would
recommend anyone interested take a look at it.

NOTE: I disabled all RBLs and DNS checks and saw way too
much spam to even start sorting through so I enabled it
again.  I did disable greylisting done by the iredapd service
to allow more spam in.

Here are my install notes:
- Create new VM and install iRedMail
    Pick a domain that doesn't matter or conflict with real mail flow
    Setup a catchall address to direct mail to the postmaster mailbox.
    Don't worry about DNS PTR, SPF, and DKIM since this
    server will not be sending outbound
- Run sa-update to get the rules in place without waiting on cron
- Disable the iredapd daemon doing greylisting and content filtering
    /etc/postfix/main.cf
      Comment lines with ''
    systemctl stop iredapd.service
    systemctl disable iredapd.service
- v320.pre: loadplugin Mail::SpamAssassin::Plugin::Shortcircuit
- v310.pre: loadplugin Mail::SpamAssassin::Plugin::DCC
    Install DCC
- Install Pyzor
- Change amavisd-new to always add X-Spam-Status tag with rule details
    /etc/amavisd/amavisd.conf
      $sa_tag_level_deflt  = -999;
- Setup MX record and A record pointing to the iRedMail server
- If you don't start seeing mail immediately, test mail flow from
  the Internet using Wormly SMTP test
- I setup rules to help sort the mail into folders before I manually
  sort them into the Ham and Spam folders
- Continue with step 4 from:
    https://wiki.apache.org/spamassassin/NightlyMassCheck
    I used the vmail user with home dir /var/vmail.
    /var/vmail/bin/automasscheck-minimal.sh
    /var/vmail/.auto-mass-check.cf - wiki step 6 needs to be
      corrected or the script updated with correct filename
      MAILDIR="/var/vmail/vmail1/[domain]/p/o/s/postmaster-[date]/Maildir"
      Line 52:
        run_masscheck single-corpus \
          ham:dir:$MAILDIR/.Ham/ \
          spam:dir:$MAILDIR/.Spam/
- Put a few ham and spam into their folder in iRedMail Roundcube
  web interface and test the automasscheck-minimal.sh script
  without the rsync creds to get familiar with the process
  and check your ham.log and spam.log
- I added the following lines to the automasscheck-minimal.sh
  to allow easy running of the hit-frequency script:
    (line 95 - end of run_masscheck function)
    ln -s ham-${LOGNAME}.log ham.log
    ln -s spam-${LOGNAME}.log spam.log
- Check out the tools like hit-frequency for interesting info:
    (as user vmail) ~/masscheckwork/nightly_mass_check/masses
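
To get a feel for what ham.log and spam.log contain before running the real tools, a rough per-rule hit count can be sketched as below. This is not the hit-frequency script and does not reproduce its output. It assumes each result line has the form `Y|. score path tests metadata`, with the tests as a comma-separated list and no spaces in paths; check that against your own logs.

```python
from collections import Counter


def rule_hits(log_lines):
    """Count how many messages hit each rule in a mass-check log."""
    hits = Counter()
    for line in log_lines:
        if line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) < 4:
            continue
        tests = fields[3]
        # Rule names never contain '='; if this field does, the message hit
        # no rules and we are looking at the metadata column instead.
        if "=" in tests:
            continue
        hits.update(rule for rule in tests.split(",") if rule)
    return hits


if __name__ == "__main__":
    # Hypothetical sample lines, not real masscheck output.
    ham_log = [
        "# mass-check results",
        ". 0 /ham/1 RULE_A,RULE_B time=1",
        ". 1 /ham/2 RULE_A time=2",
    ]
    for rule, count in rule_hits(ham_log).most_common():
        print(rule, count)
```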

Hope this helps get more people involved in the masschecking.
If I have accidentally missed something above, please correct.

Dave

Re: No rule updates since 1/1/17

2017-01-21 Thread John Hardin

On Sat, 21 Jan 2017, Kevin Golding wrote:

> On Sat, 21 Jan 2017 19:08:39 -, Jari Fredriksson  wrote:
>
>> -BEGIN PGP SIGNED MESSAGE-
>> Hash: SHA1
>>
>> John Hardin wrote on 20.1.2017 22:38:
>>
>>> Collecting spam after RBL filtering is much less helpful to masscheck.
>>> Ideally your spam corpus is from a totally unfiltered feed.
>>>
>>> However, even if it is filtered and small, it helps, *especially* if
>>> the ham is not in English - masscheck is perennially starved for
>>> non-English ham and rule scoring is thus biased against non-English
>>> languages to a degree.
>>
>> This is NOT what I have learned from SA lists. I used to do this, but
>> learned in SA discussions that it is *harmful* to pass such spam to
>> masscheck. That it harms the SA users doing proper pre SA filtering.
>>
>> We do *need* an official policy! What are we going to do with mixed
>> messages like this??
>
> It was written down once. I saw the unfiltered thing again when I looked
> earlier today, but I can't spot it just now. I believe I was also told by
> someone who knows this stuff that it wasn't a requirement, more an ideal.

I apologize if there's empirical evidence that including spam that would 
be blocked by RBLs causes poorer masscheck results. That seems strongly 
counterintuitive to me, especially for sites where such filtering is *not* 
done at the MTA level - there are such.

> However looking for that comment again just now I registered another
> discrepancy on the wiki:
>
> https://wiki.apache.org/spamassassin/CorpusCleaning - no spam older than 2
> months
>
> https://wiki.apache.org/spamassassin/HandClassifiedCorpora - no spam older
> than 6 months
>
> I don't think either are actually strict rules.

There is age filtering in the masscheck code, but I don't remember off the 
top of my head what the cutoff actually is. I agree that the discrepancies 
in the wiki should be corrected...



--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174     pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
---
 An operating system design that requires a system reboot in order to
 install a document viewing utility does not earn my respect.
---
 2 days until John Moses Browning's 162nd Birthday


Re: No rule updates since 1/1/17

2017-01-21 Thread John Hardin

On Sat, 21 Jan 2017, Kevin Golding wrote:

> On Sat, 21 Jan 2017 16:35:12 -, David Jones  wrote:
>
>> I think the "barrier to entry" is too difficult for most.  I would have
>> to setup a new MX on a domain without MTA checks (DNS and RBL) then
>> create a honeypot email address to attract spam if I didn't have
>> established recipient addresses/mailboxes.
>
> I may be wrong but I don't believe the majority of the current masscheckers
> have honeypots in place. I also believe that at least some have some form of
> filtering in place - in fact the most common filtering in place is the manual
> classification since I bet most of us come across the odd message that we
> second guess and just put to one side.

Likely true. What I contribute is what gets through Zen to my personal 
mailbox, for example. I did liberally sprinkle "ideally" through my 
description... :)

>> Then I would have to setup an SA development
>> environment with scripts to keep it up-to-date from SVN and compiled
>> regularly.
>
> I forget the exact steps involved for running the checks because basically I
> set it up and largely forget about it, but essentially it was grab an svn
> copy of SpamAssassin, pick one of the various helper scripts, create a config
> and let cron deal with the daily workload of updating/checking/submitting -
> it's all done in the helper scripts.

Right, that's pretty much all there is: install and schedule the local 
masscheck script. It's not quite totally black box, but pretty close. You 
do, however, have to get all the bits working initially.

I like the idea of a VM image.



Re: No rule updates since 1/1/17

2017-01-21 Thread Jari Fredriksson
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Kevin Golding wrote on 21.1.2017 21:22:
> On Sat, 21 Jan 2017 19:08:39 -, Jari Fredriksson  wrote:
> 
>> -BEGIN PGP SIGNED MESSAGE-
>> Hash: SHA1
>> 
>> John Hardin wrote on 20.1.2017 22:38:
>> 
>>> Collecting spam after RBL filtering is much less helpful to masscheck.
>>> Ideally your spam corpus is from a totally unfiltered feed.
>>> 
>>> However, even if it is filtered and small, it helps, *especially* if
>>> the ham is not in English - masscheck is perennially starved for
>>> non-English ham and rule scoring is thus biased against non-English
>>> languages to a degree.
>> 
>> This is NOT what I have learned from SA lists. I used to do this, but
>> learned in SA discussions that it is *harmful* to pass such spam to
>> masscheck. That it harms the SA users doing proper pre SA filtering.
>> 
>> We do *need* an official policy! What are we going to do with mixed
>> messages like this??
> 
> It was written down once. I saw the unfiltered thing again when I
> looked  earlier today, but I can't spot it just now. I believe I was
> also told by  someone who knows this stuff that it wasn't a
> requirement, more an ideal.
> 
> However looking for that comment again just now I registered another
> discrepancy on the wiki:
> 
> https://wiki.apache.org/spamassassin/CorpusCleaning - no spam older
> than 2  months
> 
> https://wiki.apache.org/spamassassin/HandClassifiedCorpora - no spam
> older  than 6 months
> 
> I don't think either are actually strict rules. It will help lower the
>  barrier to entry if we can make this stuff more uniform. It could
> also be  argued that having two such similar pages is somewhat
> redundant actually.

What has CorpusCleaning from garbage to do with this? Really confused now.

- -- 
ja...@iki.fi


Re: No rule updates since 1/1/17

2017-01-21 Thread Kevin Golding

On Sat, 21 Jan 2017 19:08:39 -, Jari Fredriksson  wrote:

> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA1
>
> John Hardin wrote on 20.1.2017 22:38:
>
>> Collecting spam after RBL filtering is much less helpful to masscheck.
>> Ideally your spam corpus is from a totally unfiltered feed.
>>
>> However, even if it is filtered and small, it helps, *especially* if
>> the ham is not in English - masscheck is perennially starved for
>> non-English ham and rule scoring is thus biased against non-English
>> languages to a degree.
>
> This is NOT what I have learned from SA lists. I used to do this, but
> learned in SA discussions that it is *harmful* to pass such spam to
> masscheck. That it harms the SA users doing proper pre SA filtering.
>
> We do *need* an official policy! What are we going to do with mixed
> messages like this??


It was written down once. I saw the unfiltered thing again when I looked  
earlier today, but I can't spot it just now. I believe I was also told by  
someone who knows this stuff that it wasn't a requirement, more an ideal.


However looking for that comment again just now I registered another  
discrepancy on the wiki:


https://wiki.apache.org/spamassassin/CorpusCleaning - no spam older than 2  
months


https://wiki.apache.org/spamassassin/HandClassifiedCorpora - no spam older  
than 6 months


I don't think either are actually strict rules. It will help lower the  
barrier to entry if we can make this stuff more uniform. It could also be  
argued that having two such similar pages is somewhat redundant actually.


Re: No rule updates since 1/1/17

2017-01-21 Thread Jari Fredriksson
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

John Hardin wrote on 20.1.2017 22:38:

> Collecting spam after RBL filtering is much less helpful to masscheck.
> Ideally your spam corpus is from a totally unfiltered feed.
> 
> However, even if it is filtered and small, it helps, *especially* if
> the ham is not in English - masscheck is perennially starved for
> non-English ham and rule scoring is thus biased against non-English
> languages to a degree.

This is NOT what I have learned from SA lists. I used to do this, but
learned in SA discussions that it is *harmful* to pass such spam to
masscheck. That it harms the SA users doing proper pre SA filtering.

We do *need* an official policy! What are we going to do with mixed
messages like this??


- -- 
ja...@iki.fi


Re: No rule updates since 1/1/17

2017-01-21 Thread David Jones
>On Sat, 21 Jan 2017 16:35:12 +
>David Jones wrote:

>> I think the "barrier to entry" is too difficult for most.  I would
>> have to setup a new MX on a domain without MTA checks (DNS and RBL)

>I hope it doesn't actually say that anywhere. IMO the corpora should be
>dominated by the spam that gets through to SA in actual production
>environments.

It was implied by a response from John Hardin yesterday and makes
sense.  I have tuned my production mail filters to block > 90% of spam
via DNS checks and RBLs in Postfix so SA only has to block a few
percent of the total potential mail.  That means my production SA is
not going to see the majority of spam.  I only have to deal with the
occasional compromised account sending spam for a short period
before it is either detected and locked or becomes listed on enough
RBLs.

I am currently setting up a new MX and getting mail flowing to a newly
built iRedMail server.  Then I will look at the SVN scripts to get that
part setup.
http://svn.apache.org/viewvc/spamassassin/trunk/masses/contrib/automasscheck-minimal/
I am not familiar with amavisd-new since I have been using MailScanner
so I will research how to setup the SA development environment with
iRedMail's amavisd-new.  I have disabled most of the Postfix settings
to block spam (DNS and RBLs) that iRedMail sets up so SA should see
almost everything sent to a catchall mailbox.  Then I plan to login to
that account regularly and categorize ham and spam.

Re: No rule updates since 1/1/17

2017-01-21 Thread RW
On Sat, 21 Jan 2017 16:35:12 +
David Jones wrote:



> I think the "barrier to entry" is too difficult for most.  I would
> have to setup a new MX on a domain without MTA checks (DNS and RBL)

I hope it doesn't actually say that anywhere. IMO the corpora should be
dominated by the spam that gets through to SA in actual production
environments.


Re: No rule updates since 1/1/17

2017-01-21 Thread Kevin Golding

On Sat, 21 Jan 2017 16:35:12 -, David Jones  wrote:

> I think the "barrier to entry" is too difficult for most.  I would have to
> setup a new MX on a domain without MTA checks (DNS and RBL) then
> create a honeypot email address to attract spam if I didn't have established
> recipient addresses/mailboxes.


I may be wrong but I don't believe the majority of the current  
masscheckers have honeypots in place. I also believe that at least some  
have some form of filtering in place - in fact the most common filtering  
in place is the manual classification since I bet most of us come across  
the odd message that we second guess and just put to one side.



> Then I would have to setup an SA development
> environment with scripts to keep it up-to-date from SVN and compiled
> regularly.


I forget the exact steps involved for running the checks because basically  
I set it up and largely forget about it, but essentially it was grab an  
svn copy of SpamAssassin, pick one of the various helper scripts, create a  
config and let cron deal with the daily workload of  
updating/checking/submitting - it's all done in the helper scripts. You  
can write your own if you like too, but in the various options out there  
one stands a good chance of meeting your needs.


I only really have to think about it when I move my masschecks to a new  
machine.



> Finally I would need to manually categorize the ham and spam.


Okay, I agree this part involves doing stuff regularly. The amount will  
vary depending on how active you are. Personally? If I am confident a mail  
is ham or spam then I am confident that mail could be used for both bayes  
training and masschecking. I was going to do that classification anyway so  
it's not really any extra for me.


Sure, I could spend more time working to get a few extra samples etc. but  
I have found my personal happy balance in terms of input vs output. You're  
not expected to neglect your pets to make it perfect. Just, if anyone has  
the ability to help out (even a little bit) it might be handy.


What could be just as helpful as actually running masschecks might be  
looking at the current documentation and poking at it with a stick. Maybe  
it does need tweaking to sound less complicated (I think it's improved  
over the years but maybe not enough). It sounds as if there are a couple  
of things that could be looked at, perhaps there are more.


Re: No rule updates since 1/1/17

2017-01-21 Thread Axb

On 01/21/2017 05:35 PM, David Jones wrote:

>> On Fri, 20 Jan 2017 19:02:09 -, Tom Hendrikx  wrote:
>
>> As John has said, diversity makes the rules more accurate for more people.
>> Also many hands make light work. With more people involved there's not
>> such a requirement to contribute thousands of messages per person.
>
> I think the "barrier to entry" is too difficult for most.  I would have to
> setup a new MX on a domain without MTA checks (DNS and RBL) then
> create a honeypot email address to attract spam if I didn't have established
> recipient addresses/mailboxes.  Then I would have to setup an SA development
> environment with scripts to keep it up-to-date from SVN and compiled regularly.
> Finally I would need to manually categorize the ham and spam.
>
> I am capable of doing everything above and really want to help the mass-
> checking but it would be better if the "barrier to entry" is lower for everyone.
> If there were setup scripts (maybe there already is) or a VM that could be
> downloaded ready to use, that would help get more masscheckers going easily.
>
> For example, would it be possible to setup an iRedMail server VM (only takes
> a few minutes) that could be turned into this SA development environment for
> masschecking?  This would quickly provide a separate mail server with an IMAP
> server and webmail interface for easy categorization of spam and ham.  If so,
> then could someone point me to some documentation on setting up an SA
> development environment for masschecking?
>
> I have a domain that is about to be retired with a lot of addresses on spam
> lists that would be very good to attract spam and help the masscheck corpus.



/trunk/masses/contrib/automasscheck-minimal

The Wiki also has a lot of detailed info.




Re: No rule updates since 1/1/17

2017-01-21 Thread David Jones
>On Fri, 20 Jan 2017 19:02:09 -, Tom Hendrikx  wrote:

>As John has said, diversity makes the rules more accurate for more people.
>Also many hands make light work. With more people involved there's not
>such a requirement to contribute thousands of messages per person.

I think the "barrier to entry" is too difficult for most.  I would have to
setup a new MX on a domain without MTA checks (DNS and RBL) then
create a honeypot email address to attract spam if I didn't have established
recipient addresses/mailboxes.  Then I would have to setup an SA development
environment with scripts to keep it up-to-date from SVN and compiled regularly.
Finally I would need to manually categorize the ham and spam.

I am capable of doing everything above and really want to help the mass-
checking but it would be better if the "barrier to entry" is lower for everyone.
If there were setup scripts (maybe there already is) or a VM that could be
downloaded ready to use, that would help get more masscheckers going easily.

For example, would it be possible to setup an iRedMail server VM (only takes
a few minutes) that could be turned into this SA development environment for
masschecking?  This would quickly provide a separate mail server with an IMAP
server and webmail interface for easy categorization of spam and ham.  If so,
then could someone point me to some documentation on setting up an SA
development environment for masschecking?

I have a domain that is about to be retired with a lot of addresses on spam
lists that would be very good to attract spam and help the masscheck corpus.


Re: No rule updates since 1/1/17

2017-01-21 Thread Kevin Golding

On Fri, 20 Jan 2017 19:02:09 -, Tom Hendrikx  wrote:


> I think I can say the same about my platform, but since this issue keeps
> popping up I just applied for an account just to find out if my
> contribution could help. I can't speculate so I'm just gonna try if it
> helps :)


Top move, it's definitely worth looking into. The list sees a lot of  
questions about either the scores that are generated or why they haven't  
been generated, and the answer tends to come down to one factor - the  
masscheck team is pretty small.


As John has said, diversity makes the rules more accurate for more people.  
Also many hands make light work. With more people involved there's not  
such a requirement to contribute thousands of messages per person.


Re: No rule updates since 1/1/17

2017-01-20 Thread John Hardin

On Fri, 20 Jan 2017, Bill Keenan wrote:

> I am interested/willing to be part of mass check. However, I use spam
> assassin via amavisd-new.


On Fri, 20 Jan 2017, David Jones wrote:


> I would like to help with the nightly masscheck but I don't have the
> resources to manually check ham and spam.  This also gets into the
> grey area of how people define spam.  I also have a very good MTA
> setup with RBLs and DNS checks that block most of the spam before
> it reaches SA in MailScanner.  My SA only has to block a very small
> percentage of my definition of spam so I am not sure how helpful
> my mail filtering platform can be even though it's very accurate.



Participating in masscheck is different from merely *using* SpamAssassin. 
The environments will likely not be associated with each other.


You will need to have a complete SA development environment kept 
up-to-date from SVN and compiled regularly so that you're testing the 
correct rules. Alternatively, if your corpora are small and you don't have 
concerns about possible leakage, your corpora *can* be uploaded to the SA 
masscheck server for central scanning. However, distributing the load is 
strongly desired, so this shouldn't be the default method of 
participation.


You will need to have manually-vetted ham and (ideally) spam, though if 
you have a honeypot set up (either mailbox(es) or domain(s)) that you 
***know*** will not receive ham then that can be directly fed into your 
masscheck spam corpus.


How much resources (time, etc.) you can devote to corpora maintenance is a 
large determining factor in the quality of your contribution. Your 
masscheck corpora *must* be clean or the rule scoring will be done poorly, 
perhaps even poisonously.


The masscheck corpora also need to be kept fairly fresh, so it's an 
ongoing process.
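
A minimal sketch of one way to keep a corpus fresh, assuming one message per file (Maildir-style) and that a file's modification time is a fair stand-in for the message's age. The function name `fresh_messages` and the age limit are illustrative assumptions; the thread does not prescribe a specific cutoff or tool.

```python
import os
import time


def fresh_messages(directory, max_age_days, now=None):
    """Return sorted paths of files in `directory` modified within max_age_days.

    Assumption: each file holds exactly one message and its mtime
    approximates when the message arrived.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # seconds per day
    fresh = []
    for entry in os.scandir(directory):
        # Skip subdirectories; keep only files newer than the cutoff.
        if entry.is_file() and entry.stat().st_mtime >= cutoff:
            fresh.append(entry.path)
    return sorted(fresh)
```

A nightly job could call this once for the spam folder and once for the ham folder, each with its own limit, and copy only the returned files into the working directory that masscheck scans.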


Collecting spam after RBL filtering is much less helpful to masscheck. 
Ideally your spam corpus is from a totally unfiltered feed.


However, even if it is filtered and small, it helps, *especially* if the 
ham is not in English - masscheck is perennially starved for non-English 
ham and rule scoring is thus biased against non-English languages to a
degree.


(however there are some honeypots in Europe feeding masscheck so that may 
actually be less of a problem than I believe it is...)


--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174     pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
---
  The most glaring example of the cognitive dissonance on the left
  is the concept that human beings are inherently good, yet at the
  same time cannot be trusted with any kind of weapon, unless the
  magic fairy dust of government authority gets sprinkled upon them.
   -- Moshe Ben-David
---
 3 days until John Moses Browning's 162nd Birthday


Re: No rule updates since 1/1/17

2017-01-20 Thread Tom Hendrikx
On 20-01-17 19:46, David Jones wrote:
>> From: Kevin Golding <k...@caomhin.org>
>> Sent: Friday, January 20, 2017 11:59 AM
>> To: users@spamassassin.apache.org
>> Subject: Re: No rule updates since 1/1/17
> 
>> On Fri, 20 Jan 2017 17:26:01 -, Bill Keenan  
>> <developerli...@wjkeenan.org> wrote:
> 
>>> What is the fix needed so /usr/bin/sa-update starts getting updates? I  
>>> too have not received an update from updates.spamassassin.org  
>>> <http://updates.spamassassin.org/> since 1-Jan-17.
>>>
>>> Besides updates.spamassassin.org <http://updates.spamassassin.org/>, 
>>> what other rule sets are commonly used? Hundreds of spam messages are  
>>> getting through with only updates.spamassassin.org  
>>> <http://updates.spamassassin.org/> rules.
> 
>> This seems like a good time to mention  
>> https://wiki.apache.org/spamassassin/NightlyMassCheck
> 
>> If more people can contribute, even just a small corpora of mail, then  
>> updates will be published more frequently. At the moment a very small  
>> number of people provide data, meaning there is very little margin for  
>> error.
> 
> I would like to help with the nightly masscheck but I don't have the
> resources to manually check ham and spam.  This also gets into the
> grey area of how people define spam.  I also have a very good MTA
> setup with RBLs and DNS checks that block most of the spam before
> it reaches SA in MailScanner.  My SA only has to block a very small
> percentage of my definition of spam so I am not sure how helpful
> my mail filtering platform can be even though it's very accurate.
> 
> Dave
> 

I think I can say the same about my platform, but since this issue keeps
popping up I just applied for an account just to find out if my
contribution could help. I can't speculate so I'm just gonna try if it
helps :)

Kind regards,
Tom





Re: No rule updates since 1/1/17

2017-01-20 Thread David Jones
>From: Kevin Golding <k...@caomhin.org>
>Sent: Friday, January 20, 2017 11:59 AM
>To: users@spamassassin.apache.org
>Subject: Re: No rule updates since 1/1/17
    
>On Fri, 20 Jan 2017 17:26:01 -, Bill Keenan  
><developerli...@wjkeenan.org> wrote:

>> What is the fix needed so /usr/bin/sa-update starts getting updates? I  
>> too have not received an update from updates.spamassassin.org  
>> <http://updates.spamassassin.org/> since 1-Jan-17.
>>
>> Besides updates.spamassassin.org <http://updates.spamassassin.org/>, 
>> what other rule sets are commonly used? Hundreds of spam messages are  
>> getting through with only updates.spamassassin.org  
>> <http://updates.spamassassin.org/> rules.

>This seems like a good time to mention  
>https://wiki.apache.org/spamassassin/NightlyMassCheck

>If more people can contribute, even just a small corpora of mail, then  
>updates will be published more frequently. At the moment a very small  
>number of people provide data, meaning there is very little margin for  
>error.

I would like to help with the nightly masscheck but I don't have the
resources to manually check ham and spam.  This also gets into the
grey area of how people define spam.  I also have a very good MTA
setup with RBLs and DNS checks that block most of the spam before
it reaches SA in MailScanner.  My SA only has to block a very small
percentage of my definition of spam so I am not sure how helpful
my mail filtering platform can be even though it's very accurate.

Dave

Re: No rule updates since 1/1/17

2017-01-20 Thread Bill Keenan
Kevin,

I am interested/willing to be part of mass check. However, I use spam assassin
via amavisd-new. The wiki references
http://www.spamtips.org/p/install-procedure.html, which is not my form of
installation.

BillK
 
> On Jan 20, 2017, at 9:59 AM, Kevin Golding  wrote:
> 
> On Fri, 20 Jan 2017 17:26:01 -, Bill Keenan wrote:
> 
>> What is the fix needed so /usr/bin/sa-update starts getting updates? I too
>> have not received an update from updates.spamassassin.org since 1-Jan-17.
>> 
>> Besides updates.spamassassin.org, what other rule sets are commonly used?
>> Hundreds of spam messages are getting through with only
>> updates.spamassassin.org rules.
> 
> This seems like a good time to mention 
> https://wiki.apache.org/spamassassin/NightlyMassCheck
> 
> If more people can contribute, even just a small corpora of mail, then 
> updates will be published more frequently. At the moment a very small number 
> of people provide data, meaning there is very little margin for error.



Re: No rule updates since 1/1/17

2017-01-20 Thread Kevin Golding
On Fri, 20 Jan 2017 17:26:01 -, Bill Keenan wrote:

> What is the fix needed so /usr/bin/sa-update starts getting updates? I
> too have not received an update from updates.spamassassin.org since
> 1-Jan-17.
>
> Besides updates.spamassassin.org, what other rule sets are commonly
> used? Hundreds of spam messages are getting through with only
> updates.spamassassin.org rules.


This seems like a good time to mention  
https://wiki.apache.org/spamassassin/NightlyMassCheck


If more people can contribute, even just a small corpora of mail, then  
updates will be published more frequently. At the moment a very small  
number of people provide data, meaning there is very little margin for  
error.


Re: No rule updates since 1/1/17

2017-01-20 Thread Bill Keenan
What is the fix needed so /usr/bin/sa-update starts getting updates? I too have
not received an update from updates.spamassassin.org since 1-Jan-17.

Besides updates.spamassassin.org, what other rule sets are commonly used?
Hundreds of spam messages are getting through with only
updates.spamassassin.org rules.

BillK
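
When a cron run of sa-update "shows no problems" yet no rules arrive, the exit status is one place to look. The sketch below maps exit codes to meanings as I understand them from the sa-update man page (0: update installed, 1: no fresh updates, 2: lint failure, 3: partial success, 4 or higher: error); treat that mapping as an assumption and check it against the documentation for your installed version. The helper names `describe_exit` and `run_sa_update` are hypothetical.

```python
import subprocess

# Assumed meanings, taken from the sa-update man page as recalled here;
# verify against your installed version before relying on them.
EXIT_MEANINGS = {
    0: "update was available and installed",
    1: "no fresh updates were available",
    2: "update available but lint check failed",
    3: "some channels updated, others failed",
}


def describe_exit(code):
    """Translate an sa-update exit code into a short description."""
    if code >= 4:
        return "error while downloading or extracting updates"
    return EXIT_MEANINGS.get(code, "unknown exit code")


def run_sa_update(path="/usr/bin/sa-update"):
    """Run sa-update with no extra options and describe the result."""
    result = subprocess.run([path], check=False)
    return describe_exit(result.returncode)
```

An exit code of 1 on every run would be consistent with the situation in this thread: the client is working, but nothing new has been published for it to fetch.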

> On Jan 17, 2017, at 3:17 PM, Dave Warren  wrote:
> 
> On Tue, Jan 17, 2017, at 12:51, Axb wrote:
>> On 01/17/2017 09:14 PM, Dave Warren wrote:
>>> On Sun, Jan 15, 2017, at 20:02, Kevin A. McGrail wrote:
>>>> On 1/15/2017 9:21 PM, Chris wrote:
>>>>> The last update of rules I've seen is 1/1/17. The attached cron output
>>>>> seems to show no problems though. Doesn't seem right no updates for two
>>>>> weeks but I guess it's possible.
>>>> 
>>>> It's been noted and I think I have the root issue tracked down. Some of
>>>> the checkers are running the wrong SVN checkout and I don't know why so
>>>> they are skipped.  Then we miss the minimum number of masscheckers to
>>>> publish.
>>> 
>>> Have you reached out to any that weren't reporting correctly? Mine went
>>> offline in November and just came back up 1-3 days ago, could you take a
>>> quick look at "dwarren" to see if everything is okay with my
>>> submissions?
>>> 
>> 
>> Dave,
>> 
>> If you look into
>> http://ruleqa.spamassassin.org/
>>  and unfold [+] the "green" (latest) you should find your "dwarren".
>> If it's not there, could be your submission came in too late.
> 
> I'm in the list on the 15th and 16th, but not the 17th. Not sure what to
> make of that, from what little I can tell, it completed around the same
> time, but I don't have time to dig into it further right now. I was more
> worried about whether I fell into the "wrong SVN checkout" group and was
> ignored for that reason.
> 



Re: No rule updates since 1/1/17

2017-01-17 Thread Dave Warren
On Tue, Jan 17, 2017, at 12:51, Axb wrote:
> On 01/17/2017 09:14 PM, Dave Warren wrote:
> > On Sun, Jan 15, 2017, at 20:02, Kevin A. McGrail wrote:
> >> On 1/15/2017 9:21 PM, Chris wrote:
> >>> The last update of rules I've seen is 1/1/17. The attached cron output
> >>> seems to show no problems though. Doesn't seem right no updates for two
> >>> weeks but I guess it's possible.
> >>
> >> It's been noted and I think I have the root issue tracked down. Some of
> >> the checkers are running the wrong SVN checkout and I don't know why so
> >> they are skipped.  Then we miss the minimum number of masscheckers to
> >> publish.
> >
> > Have you reached out to any that weren't reporting correctly? Mine went
> > offline in November and just came back up 1-3 days ago, could you take a
> > quick look at "dwarren" to see if everything is okay with my
> > submissions?
> >
> 
> Dave,
> 
> If you look into
> http://ruleqa.spamassassin.org/
>   and unfold [+] the "green" (latest) you should find your "dwarren".
> If it's not there, could be your submission came in too late.

I'm in the list on the 15th and 16th, but not the 17th. Not sure what to
make of that, from what little I can tell, it completed around the same
time, but I don't have time to dig into it further right now. I was more
worried about whether I fell into the "wrong SVN checkout" group and was
ignored for that reason.



Re: No rule updates since 1/1/17

2017-01-17 Thread Axb

On 01/17/2017 09:14 PM, Dave Warren wrote:
> On Sun, Jan 15, 2017, at 20:02, Kevin A. McGrail wrote:
>> On 1/15/2017 9:21 PM, Chris wrote:
>>> The last update of rules I've seen is 1/1/17. The attached cron output
>>> seems to show no problems though. Doesn't seem right no updates for two
>>> weeks but I guess it's possible.
>>
>> It's been noted and I think I have the root issue tracked down. Some of
>> the checkers are running the wrong SVN checkout and I don't know why so
>> they are skipped.  Then we miss the minimum number of masscheckers to
>> publish.
>
> Have you reached out to any that weren't reporting correctly? Mine went
> offline in November and just came back up 1-3 days ago, could you take a
> quick look at "dwarren" to see if everything is okay with my
> submissions?



Dave,

If you look into
http://ruleqa.spamassassin.org/
 and unfold [+] the "green" (latest) you should find your "dwarren".
If it's not there, could be your submission came in too late.

Axb




Re: No rule updates since 1/1/17

2017-01-17 Thread Dave Warren
On Sun, Jan 15, 2017, at 20:02, Kevin A. McGrail wrote:
> On 1/15/2017 9:21 PM, Chris wrote:
> > The last update of rules I've seen is 1/1/17. The attached cron output
> > seems to show no problems though. Doesn't seem right no updates for two
> > weeks but I guess it's possible.
> 
> It's been noted and I think I have the root issue tracked down. Some of
> the checkers are running the wrong SVN checkout and I don't know why so 
> they are skipped.  Then we miss the minimum number of masscheckers to 
> publish.

Have you reached out to any that weren't reporting correctly? Mine went
offline in November and just came back up 1-3 days ago, could you take a
quick look at "dwarren" to see if everything is okay with my
submissions?






Re: No rule updates since 1/1/17

2017-01-16 Thread Chris
On Sun, 2017-01-15 at 23:02 -0500, Kevin A. McGrail wrote:
> On 1/15/2017 9:21 PM, Chris wrote:
> > 
> > The last update of rules I've seen is 1/1/17. The attached cron output
> > seems to show no problems though. Doesn't seem right no updates for two
> > weeks but I guess it's possible.
> 
> It's been noted and I think I have the root issue tracked down. Some of
> the checkers are running the wrong SVN checkout and I don't know why so
> they are skipped.  Then we miss the minimum number of masscheckers to
> publish.
> 
Thanks for the quick reply Kevin, appreciate the feedback.

Chris

-- 
Chris
KeyID 0xE372A7DA98E6705C
31.11972; -97.90167 (Elev. 1092 ft)
08:52:21 up 17:36, 1 user, load average: 6.73, 2.48, 1.06
Ubuntu 16.04.1 LTS, kernel 4.4.0-59-generic #80-Ubuntu SMP Fri Jan 6
17:47:47 UTC 2017



Re: No rule updates since 1/1/17

2017-01-15 Thread Kevin A. McGrail

On 1/15/2017 9:21 PM, Chris wrote:

> The last update of rules I've seen is 1/1/17. The attached cron output
> seems to show no problems though. Doesn't seem right no updates for two
> weeks but I guess it's possible.


It's been noted and I think I have the root issue tracked down. Some of
the checkers are running the wrong SVN checkout and I don't know why so 
they are skipped.  Then we miss the minimum number of masscheckers to 
publish.
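
The failure mode described here can be sketched as a simple gate: submissions made against the wrong SVN revision are dropped, and rules are only published if enough distinct checkers remain. This is an illustration of the reasoning only, not the project's actual code; the names `Submission`, `usable_submissions` and `can_publish`, the revision numbers and the threshold are all hypothetical, since the thread gives none of them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Submission:
    checker: str       # masschecker account name, e.g. "dwarren"
    svn_revision: int  # revision of the checkout the checker ran against


def usable_submissions(submissions, expected_revision):
    """Keep only submissions made against the expected SVN revision."""
    return [s for s in submissions if s.svn_revision == expected_revision]


def can_publish(submissions, expected_revision, min_checkers):
    """True when enough distinct checkers ran the expected revision."""
    names = {s.checker
             for s in usable_submissions(submissions, expected_revision)}
    return len(names) >= min_checkers


# Hypothetical example: three checkers report, one on a stale checkout.
subs = [Submission("a", 100), Submission("b", 100), Submission("c", 99)]
print(can_publish(subs, expected_revision=100, min_checkers=3))
```

With a small pool of contributors, a single stale checkout is enough to push the count below the minimum, which is why more masscheckers widen the margin for error.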