Re: Is BAYES filtering working? Having doubts.

2015-12-29 Thread RW
On Mon, 28 Dec 2015 23:42:03 -0500
Bill Cole wrote:


> Using these facts, my learning script that runs as root and reads
> from multiple real users' Maildirs does this to learn ham:
> 
>for AFILE in $HAMS ; do formail < $AFILE ; done| sudo -H -u
> $SAUSER sa-learn --ham --mbox
> 
> Where $HAMS is the list of ham message files and $SAUSER is the user 
> handling the system-wide BayesDB. I use formail there just to give
> each message a leading 'From ' line (i.e. mbox format) so that the
> whole bunch can be piped into a single sa-learn invocation.

IIRC when you do that sa-learn just creates a temporary file and then
runs on that. 

> The alternative without formail would be to pipe each raw message into
> its own sa-learn. 

The alternative is to give it a directory. It can work out for itself
whether it's maildir or just a directory of files. If you need to train
an arbitrary  selection of files, you could symlink them into a
temporary directory. If you run spamd it's also possible to train via
spamc.


Personally I'd avoid the unforced use of mbox around Bayes without
being sure that "From-escaping" is taken into account. The problem is
that formail will replace a "From" at the beginning of a body line with
">From" which changes the msgid hash and prevents the correct
retraining of mail that was trained without going through formail -
e.g. the correction of autotraining.

I just had a quick look and I can't see any support for this in
SpamAssassin. It's not a major problem, but in this case it's an easily
avoidable one.


Re: URI_NO_WWW_INFO_CGI info

2015-12-29 Thread Axb

On 12/29/2015 04:14 PM, ma...@nucleus.it wrote:

Hi,
some customers trigger this rule in their mails

# bug 3896: URIs in various TLDs, other than 3rd level www
uri
URI_NO_WWW_INFO_CGI 
/^(?:https?:\/\/)?[^\/]+(?http://media.mydomain.info/.
and the rule takes a score of 2.299 by itself.

Why is the score so high?
Does anyone know why only .info and .biz domains are in these rules and not
other domains (.com etc.)?


because that is the way the rule was designed - as historically both 
TLDs get massively abused


The score is set by masschecks

score URI_NO_WWW_INFO_CGI 2.299 2.299 0.292 2.071

http://ruleqa.spamassassin.org/20151228-r1721884-n/URI_NO_WWW_INFO_CGI/detail







Re: TxRep Template Tags staying as tags

2015-12-29 Thread Kevin A. McGrail

On 12/29/2015 7:18 AM, Kevin Golding wrote:
On Mon, 28 Dec 2015 18:21:35 -, Kevin A. McGrail wrote:



On 12/24/2015 10:04 AM, Kevin Golding wrote:
I know I'm a bit weird but I like stuffing headers with all kinds of 
data like I'm stuffing a turkey for Christmas, but I've never been 
able to get anything showing up for TxRep (it is referenced in 
_TESTSSCORES_ but none of the TxRep specific ones seem to convert). 
I feel I can't be the only person trying to use these tags so I 
thought I'd see if anyone had any hints.


I actually just use global storage for TxRep and assign a generic 
username for everything but I've tried it with both G and Y 
modifiers. I've added these lines for testing:


add_header all TxRep-Global _TXREP_EMAIL_G_ _TXREP_EMAIL_IP_G_ 
_TXREP_IP_G_ _TXREP_DOMAIN_G_ _TXREP_HELO_G_
add_header all TxRep-User _TXREP_EMAIL_U_ _TXREP_EMAIL_IP_U_ 
_TXREP_IP_U_ _TXREP_DOMAIN_U_ _TXREP_HELO_U_


But all mails show is:

X-Spam-TxRep-Global: _TXREP_EMAIL_G_ _TXREP_EMAIL_IP_G_ _TXREP_IP_G_ 
_TXREP_DOMAIN_G_ _TXREP_HELO_G_
X-Spam-TxRep-User: _TXREP_EMAIL_U_ _TXREP_EMAIL_IP_U_ _TXREP_IP_U_ 
_TXREP_DOMAIN_U_ _TXREP_HELO_U_


I've tried skipping both U and G since I'm not using dual storage 
but no joy there either. I've tried it with the various sub-tags 
like COUNT and MEAN etc. but again, no conversions.


I've scratched my head over this one before and got nowhere and 
today's attempts don't seem to have got me any closer so hopefully 
someone can spot where my mistake is.


Cheers


Sorry, I think you might be unique.  These aren't listed in the 
Mail::SpamAssassin::Conf TAGS section and grepping trunk shows nothing.


Whilst not in Mail::SpamAssassin::Conf they are in 
Mail::SpamAssassin::Plugin::TxRep


https://spamassassin.apache.org/full/3.4.x/doc/Mail_SpamAssassin_Plugin_TxRep.html#template_tags 



The tags don't show up well in grepping the code because the tags are 
defined dynamically as they iterate over the same basic code changing 
just one argument.


Having done a bit more poking I can try to fill in more data because 
I'm still a little puzzled.


The big thing I've confirmed is the _Y part of the tag is redundant 
for me, since I see the following in debug:


dbg: check: tagrun - tag TXREP_EMAIL_IP is now ready, value: 0.0

Since tagrun debug output strips the fore and aft _ I've been running 
with:


add_header TxRep-Email-IP: _TXREP_EMAIL_IP_

I had found the question of needing either _U, _G or neither 
unclear, so it was nice to confirm that neither was the best choice for 
me. The problem is I still get:


X-Spam-TxRep-Email-IP: _TXREP_EMAIL_IP_

I've tried this with every TXREP_FOO option that I can find and 
nothing gets translated. All my other template tags work as expected 
though so I feel there's something awry.


I'm happy to fix the small documentation bugs I've already worked out 
and I'm happy to submit patches to move it to Mail::SpamAssassin::Conf 
if desired (although the number of these tags and variations is likely 
why they were originally left in the plugin documentation and may still be 
preferred) but I'm reluctant to be too gung-ho when I can't get the 
things working myself.


Agreed.  I see the code now at 1485.  Can anyone else replicate the 
issue, at which point this becomes a bug?


We always need help with the code and docs so don't be afraid to try.  
You'd open a bug on our bugzilla instance and attach patches there.


My PERSONAL input is that I wouldn't move the tags to the main conf as 
presumably other plugins have tags but they aren't there.  Adding 
something that says plugins may add additional tags might be good though 
because I've worked with SA a long time and didn't know the tags could be used!


Regards,
KAM


Re: Is BAYES filtering working? Having doubts.

2015-12-29 Thread Bill Cole

On 29 Dec 2015, at 20:02, Ian Zimmerman wrote:


On 2015-12-29 19:44 -0500, Bill Cole wrote:


On 29 Dec 2015, at 18:54, Ian Zimmerman wrote:

In fact sa-learn accepts multiple named arguments on the command line,
so the alternative I use is to go through the spambox N files at a time
in a shell loop.  (I have N=100 but obviously this depends.)


Which successfully ignores the original issue of this thread completely: 
that the user sa-learn must run as cannot read the files being learnt. 
If you pass unreadable filenames as arguments, sa-learn just whines and 
fails. Shockingly, that is not the desired result.


Clearly you can do the su magic if needed.


Um, no.

Neither su nor sudo magically changes the permissions or ownership of 
files. If you pass filenames as arguments they must be readable by the 
user actually running sa-learn, which is the *unprivileged* user 
handling the system-wide BayesDB ("amavis" in the case originating this 
thread, but "spamd" and "defang" are other common ones...) In most 
reasonably well-secured systems using Maildir message stores, the 
Maildirs are all owned by individual users or by one user that handles 
delivery to "virtual users" understood by the MTA and IMAP or POP server 
but not by the OS. That is generally NOT the same user running spamd or 
content filters for a system-wide BayesDB. As a result, relearning has 
to be done as root, shuttling data from files owned by one user into a 
process running as another.




The point is that the
overhead which you fear is reduced N times.


And since the sa-learn processes can't read the files they are given as 
arguments, they run with blinding speed, skipping all that costly 
parsing and learning stuff...


Re: Is BAYES filtering working? Having doubts.

2015-12-29 Thread Ian Zimmerman
On 2015-12-29 20:41 -0500, Bill Cole wrote:

> Neither su nor sudo magically changes the permissions or ownership of
> files. If you pass filenames as arguments they must be readable by the
> user actually running sa-learn, which is the *unprivileged* user
> handling the system-wide BayesDB ("amavis" in the case originating
> this thread, but "spamd" and "defang" are other common ones...) In
> most reasonably well-secured systems using Maildir message stores, the
> Maildirs are all owned by individual users or by one user that handles
> delivery to "virtual users" understood by the MTA and IMAP or POP
> server by not by the OS. That is generally NOT the same user running
> spamd or content filters for a system-wide BayesDB. As a result,
> relearning has to be done as root, shuttling data from files owned by
> one user into a process running as another.

You are right.  The reason it works for me is that I don't use a
systemwide DB.

May I ask that you turn down the sarcasm a bit?

-- 
Please *no* private copies of mailing list or newsgroup messages.
Rule 420: All persons more than eight miles high to leave the court.


Re: Is BAYES filtering working? Having doubts.

2015-12-29 Thread Reindl Harald



On 30.12.2015 at 03:11, Ian Zimmerman wrote:

On 2015-12-29 20:41 -0500, Bill Cole wrote:


Neither su nor sudo magically changes the permissions or ownership of
files. If you pass filenames as arguments they must be readable by the
user actually running sa-learn, which is the *unprivileged* user
handling the system-wide BayesDB ("amavis" in the case originating
this thread, but "spamd" and "defang" are other common ones...) In
most reasonably well-secured systems using Maildir message stores, the
Maildirs are all owned by individual users or by one user that handles
delivery to "virtual users" understood by the MTA and IMAP or POP
server but not by the OS. That is generally NOT the same user running
spamd or content filters for a system-wide BayesDB. As a result,
relearning has to be done as root, shuttling data from files owned by
one user into a process running as another.


You are right.  The reason it works for me is that I don't use a
systemwide DB.

May I ask that you turn down the sarcasm a bit?


no idea where you found a piece of sarcasm in the quote above!

"As a result, relearning has to be done as root" is nonsense anyway; 
under no condition does learning have to be done as root


at least not in "reasonably well-secured systems", because they are no 
longer reasonably secured when you pass junk and possible malware with 
root permissions to sa-learn


there is a difference: collect the files as root, fix permissions and 
invoke sa-learn as a restricted user from that script
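
The split described here (root only collects the files and fixes permissions, an unprivileged user does the learning) could look roughly like the sketch below. It is not a script from anyone in this thread: the helper name stage_messages, the msg.N naming and the paths in the usage comment are made up for illustration, and it relies on sa-learn accepting a directory, as mentioned elsewhere in the thread.

```shell
# stage_messages DEST FILE...
# Copy each message into DEST under a neutral name (msg.1, msg.2, ...) so the
# originals in the users' Maildirs are never handed to the learner directly.
stage_messages() {
    dest=$1
    shift
    n=0
    for f in "$@"; do
        n=$((n + 1))
        cp -- "$f" "$dest/msg.$n" || return 1
    done
    # Owner-only access; ownership is handed over separately (see below).
    chmod -R u+rwX,go-rwx "$dest"
}

# Hypothetical usage, run as root ($SAUSER is the unprivileged Bayes user):
#   stage=$(mktemp -d)
#   stage_messages "$stage" /home/*/Maildir/.Ham/cur/*
#   chown -R "$SAUSER" "$stage"
#   sudo -H -u "$SAUSER" sa-learn --ham "$stage"
#   rm -rf "$stage"
```

Because the copies are plain files, no mbox conversion is involved, and root never runs sa-learn itself.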






Re: Is BAYES filtering working? Having doubts.

2015-12-29 Thread Ian Zimmerman
On 2015-12-29 19:44 -0500, Bill Cole wrote:

> On 29 Dec 2015, at 18:54, Ian Zimmerman wrote:
> 
> >In fact sa-learn accepts multiple named arguments on the command line,
> >so the alternative I use is to go through the spambox N files at a time
> >in a shell loop.  (I have N=100 but obviously this depends.)
> 
> Which successfully ignores the original issue of this thread completely: that 
> the
> user sa-learn must run as cannot read the files being learnt. If you pass 
> unreadable
> filenames as arguments, sa-learn just whines and fails. Shockingly, that is 
> not the
> desired result.

Clearly you can do the su magic if needed.  The point is that the
overhead which you fear is reduced N times.

-- 
Please *no* private copies of mailing list or newsgroup messages.
Rule 420: All persons more than eight miles high to leave the court.


Re: Is BAYES filtering working? Having doubts.

2015-12-29 Thread Bill Cole

On 29 Dec 2015, at 13:24, RW wrote:


On Mon, 28 Dec 2015 23:42:03 -0500
Bill Cole wrote:



Using these facts, my learning script that runs as root and reads
from multiple real users' Maildirs does this to learn ham:

 for AFILE in $HAMS ; do formail < $AFILE ; done | sudo -H -u $SAUSER sa-learn --ham --mbox

Where $HAMS is the list of ham message files and $SAUSER is the user
handling the system-wide BayesDB. I use formail there just to give
each message a leading 'From ' line (i.e. mbox format) so that the
whole bunch can be piped into a single sa-learn invocation.


IIRC when you do that sa-learn just creates a temporary file and then
runs on that.


Yes, with the advantage of using 
Mail::SpamAssassin::Util::secure_tmpfile() rather than whatever I happen 
to roll up in a bit of Q&D shell that I never get around to reviewing 
for edge cases...


The main reason to do something like that is to avoid the heavyweight 
sudo & load of a Perl script for each message.




The alternative without formail would be to pipe each raw message into
its own sa-learn.


The alternative is to give it a directory.


Sure, one can reimplement Mail::SpamAssassin::Util::secure_tmpfile 
and/or Mail::SpamAssassin::Util::secure_tmpdir and use that. One can 
copy files from multiple user Maildirs and maybe error out before 
cleaning up or maybe forget to set perms right or maybe make some 
mistake I haven't thought of.


Or, I could use a tool that's been at least nominally open to review for 
many years across many versions and which stands a strong chance of 
having had at least one set of more competent eyes run across it looking 
for flaws to fix. I'm lazy...



It can work out for itself
whether it's maildir or just a directory of files. If you need to train
an arbitrary selection of files, you could symlink them into a
temporary directory.


Not if the user you want to train as can't read the real files. Symlinks 
don't confer permission to read their targets (that would be very bad.)



If you run spamd it's also possible to train via
spamc.


Yes.  IF you run spamd and it's how your system-wide SA filtering is 
done already, that's arguably the best way to do ad hoc (re)training 
since you can be sure it's hitting the right DB and you can feed it in 
parallel.



Personally I'd avoid the unforced use of mbox around Bayes without
being sure that "From-escaping" is taken into account. The problem is
that formail will replace a "From" at the beginning of a body line with
">From" which changes the msgid hash and prevents the correct
retraining of mail that was trained without going through formail -
e.g. the correction of autotraining.


An excellent point, which I had not considered. I'm mildly surprised 
that sa-learn doesn't apply s/^>From /From / to each message when disassembling 
the mbox, but only mildly. It seems I've got a script to fix...
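
To make the escaping problem concrete, here is a small sketch in which sed stands in for what an mbox writer such as formail does to body lines. The function names are invented for illustration, and nothing here is SpamAssassin's own code.

```shell
# Approximate what an mbox writer does to body lines that start with "From ".
mbox_escape() {
    sed 's/^From />From /'
}

# The reverse substitution mentioned above. It is lossy: a line that
# originally started with ">From " is also rewritten.
mbox_unescape() {
    sed 's/^>From /From /'
}

# Example: the escaped body differs from the original, so any digest
# computed over it differs as well.
#   printf 'Hi\nFrom here on it gets odd\n' | cksum
#   printf 'Hi\nFrom here on it gets odd\n' | mbox_escape | cksum
```

The lossy reverse step is one plausible reason not to unescape blindly: a message that really contained ">From " at the start of a line would come back altered.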



I just had a quick look and I can't see any support for this in
SpamAssassin. It's not a major problem, but in this case it's an easily
avoidable one.


Yes. Only a small fraction of messages need the escaping at all, but 
it's enough to not use formail & mbox.


There's also the option of using inherited ACLs on Maildirs if they are 
supported on the filesystem being used.


Re: Is BAYES filtering working? Having doubts.

2015-12-29 Thread Ian Zimmerman
On 2015-12-29 17:50 -0500, Bill Cole wrote:

> Yes, with the advantage of using Mail::SpamAssassin::Util::secure_tmpfile() 
> rather
> than whatever I happen to roll up in a bit of Q shell that I never get 
> around to
> reviewing for edge cases...
> 
> The main reason to do something like that is to avoid the heavyweight sudo & 
> load of
> a Perl script for each message.
> 
> >
> >>The alternative without formail would be to pipe each raw message into
> >>its own sa-learn.
> >
> >The alternative is to give it a directory.

In fact sa-learn accepts multiple named arguments on the command line,
so the alternative I use is to go through the spambox N files at a time
in a shell loop.  (I have N=100 but obviously this depends.)
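
A batching loop of this kind might be sketched as below. This is only an illustration: the function name learn_in_batches and the example path are hypothetical, find -print0 and xargs -0 are assumed to be available (GNU and BSD both have them), and the caller must be able to read the files.

```shell
# learn_in_batches DIR BATCH_SIZE CMD [ARGS...]
# Runs CMD once per batch of at most BATCH_SIZE regular files found in DIR,
# appending the file names to CMD's arguments.
learn_in_batches() {
    dir=$1
    batch=$2
    shift 2
    find "$dir" -type f -print0 | xargs -0 -n "$batch" "$@"
}

# Hypothetical usage (assumes sa-learn is installed):
#   learn_in_batches "$HOME/Maildir/.Spam/cur" 100 sa-learn --spam
```

With N files and a batch size of 100, the Perl start-up cost is paid roughly N/100 times instead of N times.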

-- 
Please *no* private copies of mailing list or newsgroup messages.
Rule 420: All persons more than eight miles high to leave the court.


Re: Is BAYES filtering working? Having doubts.

2015-12-29 Thread Bill Cole

On 29 Dec 2015, at 18:54, Ian Zimmerman wrote:


In fact sa-learn accepts multiple named arguments on the command line,
so the alternative I use is to go through the spambox N files at a time
in a shell loop.  (I have N=100 but obviously this depends.)


Which successfully ignores the original issue of this thread completely: 
that the user sa-learn must run as cannot read the files being learnt. 
If you pass unreadable filenames as arguments, sa-learn just whines and 
fails. Shockingly, that is not the desired result.


Re: Omitting leading whitespace on headers?

2015-12-29 Thread Reindl Harald


On 29.12.2015 at 21:46, Philip Prindeville wrote:


On Dec 29, 2015, at 1:42 PM, Kevin A. McGrail  wrote:


On 12/29/2015 3:38 PM, Philip Prindeville wrote:

Is there a reason that headers are left with leading spaces?

I’ve noticed that I have to write rules as:

Subject =~ /^ Great [Jj]ob [Oo]pportunity/

because of the leading space…

I'm at a complete loss.  I add plenty of Subject rules with no leading space.  
Never seen this issue.



I had some rules which weren’t firing so I had to change to /^ .../ or else /^ 
?.../ to make them match.

Not sure why.

This is with SA 3.4.1 on Fedora 21


nope, running subject filters from F20 to F23
see example rule from my first answer

what does "spamassassin --lint" say?





Re: Omitting leading whitespace on headers?

2015-12-29 Thread Philip Prindeville

On Dec 29, 2015, at 2:14 PM, Kevin A. McGrail  wrote:

> On 12/29/2015 3:46 PM, Philip Prindeville wrote:
>> On Dec 29, 2015, at 1:42 PM, Kevin A. McGrail  wrote:
>> 
>>> On 12/29/2015 3:38 PM, Philip Prindeville wrote:
>>>> Is there a reason that headers are left with leading spaces?
>>>> 
>>>> I’ve noticed that I have to write rules as:
>>>> 
>>>> Subject =~ /^ Great [Jj]ob [Oo]pportunity/
>>>> 
>>>> because of the leading space…
>>> I'm at a complete loss.  I add plenty of Subject rules with no leading 
>>> space.  Never seen this issue.
>> 
>> I had some rules which weren’t firing so I had to change to /^ .../ or else 
>> /^ ?.../ to make them match.
>> 
>> Not sure why.
>> 
>> This is with SA 3.4.1 on Fedora 21.
>> 
>> -Philip
> What's the original Subject header look like from the original mail?
> 
> Regards,
> KAM


This was a while ago.  I’d have to go back and look.  Maybe this one?

Subject: [IDN][#2056301] CareerBuilder: Open position for you





Re: Omitting leading whitespace on headers?

2015-12-29 Thread Philip Prindeville

On Dec 29, 2015, at 1:42 PM, Kevin A. McGrail  wrote:

> On 12/29/2015 3:38 PM, Philip Prindeville wrote:
>> Is there a reason that headers are left with leading spaces?
>> 
>> I’ve noticed that I have to write rules as:
>> 
>> Subject =~ /^ Great [Jj]ob [Oo]pportunity/
>> 
>> because of the leading space…
> I'm at a complete loss.  I add plenty of Subject rules with no leading space. 
>  Never seen this issue.


I had some rules which weren’t firing so I had to change to /^ .../ or else /^ 
?.../ to make them match.
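
The difference between the anchored pattern and the "^ ?" workaround can be tried outside SpamAssassin. The sketch below uses grep -E rather than Perl regexes (these particular patterns behave the same in both); the helper name and the sample subjects are made up for illustration.

```shell
# subject_matches PATTERN TEXT
# Succeeds when the extended regular expression PATTERN matches TEXT.
subject_matches() {
    printf '%s\n' "$2" | grep -Eq -- "$1"
}

strict='^Great [Jj]ob [Oo]pportunity'
tolerant='^ ?Great [Jj]ob [Oo]pportunity'

# Example:
#   subject_matches "$strict"   ' Great Job Opportunity'   # fails: leading space
#   subject_matches "$tolerant" ' Great Job Opportunity'   # succeeds
#   subject_matches "$tolerant" 'Great job opportunity'    # succeeds
```

This only shows how the two patterns treat a leading space; it says nothing about why the header value would carry one in the first place.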

Not sure why.

This is with SA 3.4.1 on Fedora 21.

-Philip



Re: Omitting leading whitespace on headers?

2015-12-29 Thread Philip Prindeville

On Dec 29, 2015, at 2:39 PM, Kevin A. McGrail  wrote:

> On 12/29/2015 4:29 PM, Philip Prindeville wrote:
>> On Dec 29, 2015, at 2:14 PM, Kevin A. McGrail  wrote:
>> 
>>> On 12/29/2015 3:46 PM, Philip Prindeville wrote:
>>>> On Dec 29, 2015, at 1:42 PM, Kevin A. McGrail  wrote:
>>>> 
>>>>> On 12/29/2015 3:38 PM, Philip Prindeville wrote:
>>>>>> Is there a reason that headers are left with leading spaces?
>>>>>> 
>>>>>> I’ve noticed that I have to write rules as:
>>>>>> 
>>>>>> Subject =~ /^ Great [Jj]ob [Oo]pportunity/
>>>>>> 
>>>>>> because of the leading space…
>>>>> I'm at a complete loss.  I add plenty of Subject rules with no leading 
>>>>> space.  Never seen this issue.
>>>> I had some rules which weren’t firing so I had to change to /^ .../ or 
>>>> else /^ ?.../ to make them match.
>>>> 
>>>> Not sure why.
>>>> 
>>>> This is with SA 3.4.1 on Fedora 21.
>>>> 
>>>> -Philip
>>> What's the original Subject header look like from the original mail?
>>> 
>>> Regards,
>>> KAM
>> 
>> This was a while ago.  I’d have to go back and look.  Maybe this one?
>> 
>> Subject: [IDN][#2056301] CareerBuilder: Open position for you
> OK, I was thinking perhaps an alternate charset or something but never run 
> into this issue.
> 
> If you are anchoring your Subject searches, allowing for whitespace, etc. is 
> a decent idea though from Reindl.
> 
> regards,
> KAM


I did recall that I used the patch here:

https://bz.apache.org/SpamAssassin/show_bug.cgi?id=6360#c4

to be able to debug my rules, using a rule that would match any non-empty 
subject: value to dump out what it was (the “> got hit: “…”” line), and it 
was always showing a leading space…

-Philip



Re: Omitting leading whitespace on headers?

2015-12-29 Thread Kevin A. McGrail

On 12/29/2015 5:12 PM, Philip Prindeville wrote:

I did recall that I used the patch here:

https://bz.apache.org/SpamAssassin/show_bug.cgi?id=6360#c4

to be able to debug my rules, using a rule that would match any non-empty subject: 
value to dump out what it was (the “> got hit: “…”” line), and it was 
always showing a leading space…

-Philip



Thank goodness.  You had me worried we had some foundational 
processing issue!


Regards,
KAM
--
*Kevin A. McGrail*
CEO

Peregrine Computer Consultants Corporation
3927 Old Lee Highway, Suite 102-C
Fairfax, VA 22030-2422

http://www.pccc.com/

703-359-9700 x50 / 800-823-8402 (Toll-Free)
703-798-0171 (wireless)
kmcgr...@pccc.com 



Re: Omitting leading whitespace on headers?

2015-12-29 Thread Philip Prindeville

On Dec 29, 2015, at 3:15 PM, Kevin A. McGrail  wrote:

> On 12/29/2015 5:12 PM, Philip Prindeville wrote:
>> I did recall that I used the patch here:
>> 
>> https://bz.apache.org/SpamAssassin/show_bug.cgi?id=6360#c4
>> 
>> to be able to debug my rules, using a rule that would match any non-empty 
>> subject: value to dump out what it was (the “> got hit: “…”” line), and 
>> it was always showing a leading space…
>> 
>> -Philip
>> 
> 
> Thank goodness.  You had me worried we have something foundational processing 
> issue!
> 
> Regards,
> KAM
> 

No, I eventually added “^ ?” to all of my Subject rules… but I’m thinking I 
shouldn’t have had to.



Re: Omitting leading whitespace on headers?

2015-12-29 Thread Kevin A. McGrail

On 12/29/2015 5:16 PM, Philip Prindeville wrote:


On Dec 29, 2015, at 3:15 PM, Kevin A. McGrail wrote:



On 12/29/2015 5:12 PM, Philip Prindeville wrote:

I did recall that I used the patch here:

https://bz.apache.org/SpamAssassin/show_bug.cgi?id=6360#c4

to be able to debug my rules, using a rule that would match any non-empty subject: 
value to dump out what it was (the “> got hit: “…”” line), and it was 
always showing a leading space…

-Philip



Thank goodness.  You had me worried we had some foundational 
processing issue!


Regards,
KAM



No, I eventually added “^ ?” to all of my Subject rules… but I’m 
thinking I shouldn’t have had to.


You definitely shouldn't have to.  You should try a clean install if you 
can.


Regards,
KAM


Omitting leading whitespace on headers?

2015-12-29 Thread Philip Prindeville
Is there a reason that headers are left with leading spaces?

I’ve noticed that I have to write rules as:

Subject =~ /^ Great [Jj]ob [Oo]pportunity/

because of the leading space… Given the text of RFC-2822:

NO-WS-CTL       =       %d1-8 /         ; US-ASCII control characters
                        %d11 /          ;  that do not include the
                        %d12 /          ;  carriage return, line feed,
                        %d14-31 /       ;  and white space characters
                        %d127

text            =       %d1-9 /         ; Characters excluding CR and LF
                        %d11 /
                        %d12 /
                        %d14-127 /
                        obs-text

FWS             =       ([*WSP CRLF] 1*WSP) /   ; Folding white space
                        obs-FWS

obs-FWS         =       1*WSP *(CRLF 1*WSP)

utext           =       NO-WS-CTL /     ; Non white space controls
                        %d33-126 /      ; The rest of US-ASCII
                        obs-utext

unstructured    =       *([FWS] utext) [FWS]

subject         =       "Subject:" unstructured CRLF


Might we consider dropping the first instance of “FWS” preceding the first 
instance of “utext” in “unstructured”?

-Philip





Re: Omitting leading whitespace on headers?

2015-12-29 Thread Kevin A. McGrail

On 12/29/2015 3:46 PM, Philip Prindeville wrote:

On Dec 29, 2015, at 1:42 PM, Kevin A. McGrail  wrote:


On 12/29/2015 3:38 PM, Philip Prindeville wrote:

Is there a reason that headers are left with leading spaces?

I’ve noticed that I have to write rules as:

Subject =~ /^ Great [Jj]ob [Oo]pportunity/

because of the leading space…

I'm at a complete loss.  I add plenty of Subject rules with no leading space.  
Never seen this issue.


I had some rules which weren’t firing so I had to change to /^ .../ or else /^ 
?.../ to make them match.

Not sure why.

This is with SA 3.4.1 on Fedora 21.

-Philip

What's the original Subject header look like from the original mail?

Regards,
KAM


Re: Omitting leading whitespace on headers?

2015-12-29 Thread Kevin A. McGrail

On 12/29/2015 4:29 PM, Philip Prindeville wrote:

On Dec 29, 2015, at 2:14 PM, Kevin A. McGrail  wrote:


On 12/29/2015 3:46 PM, Philip Prindeville wrote:

On Dec 29, 2015, at 1:42 PM, Kevin A. McGrail  wrote:


On 12/29/2015 3:38 PM, Philip Prindeville wrote:

Is there a reason that headers are left with leading spaces?

I’ve noticed that I have to write rules as:

Subject =~ /^ Great [Jj]ob [Oo]pportunity/

because of the leading space…

I'm at a complete loss.  I add plenty of Subject rules with no leading space.  
Never seen this issue.

I had some rules which weren’t firing so I had to change to /^ .../ or else /^ 
?.../ to make them match.

Not sure why.

This is with SA 3.4.1 on Fedora 21.

-Philip

What's the original Subject header look like from the original mail?

Regards,
KAM


This was a while ago.  I’d have to go back and look.  Maybe this one?

Subject: [IDN][#2056301] CareerBuilder: Open position for you
OK, I was thinking perhaps an alternate charset or something but never 
run into this issue.


If you are anchoring your Subject searches, Reindl's idea of allowing for 
whitespace, etc. is a decent one, though.


regards,
KAM


Re: Omitting leading whitespace on headers?

2015-12-29 Thread Reindl Harald



On 29.12.2015 at 22:14, Kevin A. McGrail wrote:

On 12/29/2015 3:46 PM, Philip Prindeville wrote:

On Dec 29, 2015, at 1:42 PM, Kevin A. McGrail  wrote:


On 12/29/2015 3:38 PM, Philip Prindeville wrote:

Is there a reason that headers are left with leading spaces?

I’ve noticed that I have to write rules as:

Subject =~ /^ Great [Jj]ob [Oo]pportunity/

because of the leading space…

I'm at a complete loss.  I add plenty of Subject rules with no
leading space.  Never seen this issue.


I had some rules which weren’t firing so I had to change to /^ .../ or
else /^ ?.../ to make them match.

Not sure why.

This is with SA 3.4.1 on Fedora 21.


What's the original Subject header look like from the original mail?


with "normalize_charset 1" and whitespace in rules replaced by "\s+", 
one can't trick SA with leading or multiple whitespace in the junk mail


frankly, with body rules i have even seen cases where a line was split into 
two html cells and the SA rule still hit because the junk is normalized 
before it is run against the rules






Re: Omitting leading whitespace on headers?

2015-12-29 Thread Kevin A. McGrail

On 12/29/2015 3:38 PM, Philip Prindeville wrote:

Is there a reason that headers are left with leading spaces?

I’ve noticed that I have to write rules as:

Subject =~ /^ Great [Jj]ob [Oo]pportunity/

because of the leading space…
I'm at a complete loss.  I add plenty of Subject rules with no leading 
space.  Never seen this issue.


Re: Omitting leading whitespace on headers?

2015-12-29 Thread Reindl Harald



On 29.12.2015 at 21:38, Philip Prindeville wrote:

Is there a reason that headers are left with leading spaces?

I’ve noticed that I have to write rules as:

Subject =~ /^ Great [Jj]ob [Oo]pportunity/

because of the leading space… Given the text of RFC-2822


no, we have a ton of subject and body rules and they work fine without a 
leading whitespace


maybe "normalize_charset 1" makes the difference, but AFAIR we had them 
working before enabling that option too


# Subject Begins High
header    __CUST_SUBJ_9   Subject =~ /^(apply\s+for\s+urgent\s+loan|auslieferungsankündigung\s+betrefend\s+ihre\s+sendung|deutsche\s+bank\:\s+sicherheitssperre\s+ihres\s+kontos|festnetz\-rechnung|have\s+you\s+won|ihr\s+konto\s+wurde\s+begrenzt|ihre\s+festnetz\-rechnung|ihre\s+mobilfunk\-rechnung|investment\s+opportunit|investment\s+quest|kredit|loan|mobilfunk\-rechnung|neue\s+festnetz\-rechnung|neue\s+mobilfunk\-rechnung|proposal|sie\s+haben\s+gewonnen|verify\s+your\s+account|you\s+won|your\s+paypal\s+account\s+has\s+been\s+limited|\[paypal\]\s+check\s+the\s+account\s+paypal).*/i

meta  CUST_SUBJ_9 (__HAS_SUBJECT && __CUST_SUBJ_9)
score CUST_SUBJ_9 3.5
describe  CUST_SUBJ_9 Begins High

[root@mail-gw:~]$ cat maillog | grep CUST_SUBJ_9 | wc -l
16





Re: Is BAYES filtering working? Having doubts.

2015-12-29 Thread Bill Cole

On 29 Dec 2015, at 8:28, Jude DaShiell wrote:

With spamassassin, is it possible to have the filter show counts of 
number of messages sent to spam, number of messages sent to ham, and 
total number of messages processed that a user can check?


Since SpamAssassin is a suite of Perl modules and an associated set of 
customizable rules, rather than a single filtering tool, the answer to 
that is both "Yes" and "No."


SA itself does not include substantial record-keeping & reporting 
functionality. However, it is entirely feasible to put together an MTA 
and SA with tools & config that makes it possible to report complex 
stats for the mail SA sees on a per-user basis and make them available 
to each user. I've done that myself multiple times in environments using 
3 different MTAs and 2 radically different modes of plumbing between the 
MTA and SA. As those have all been bespoke implementations (and some I 
don't even have access to any more) I cannot share them.
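
As a very rough illustration of the kind of glue meant here (not one of the bespoke implementations mentioned), the sketch below counts messages in a one-message-per-file directory by looking for an "X-Spam-Flag: YES" header. The function name is hypothetical, and it assumes the local setup adds that header to mail classified as spam; "ham" in its output simply means "no such header", which also covers mail that never went through SA.

```shell
# count_spam_flags DIR
# Prints total, spam and ham counts for the regular files directly in DIR.
count_spam_flags() {
    dir=$1
    total=0
    spam=0
    for f in "$dir"/*; do
        [ -f "$f" ] || continue
        total=$((total + 1))
        # Only look at the header block (everything up to the first empty line).
        if sed '/^$/q' "$f" | grep -qi '^X-Spam-Flag: *YES'; then
            spam=$((spam + 1))
        fi
    done
    echo "total=$total spam=$spam ham=$((total - spam))"
}

# Hypothetical usage:
#   count_spam_flags "$HOME/Maildir/cur"
```

Such counts only cover mail that reached the content filter, which is the limitation discussed in the next paragraph.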


On the other hand, I see such stats as inherently problematic. Good 
anti-spam in any sort of environment is layered in such a way that large 
percentages (usually overwhelming majorities) of would-be-spam never 
even reaches the point of a SMTP RCPT command, much less a chance to run 
it through a content filter using SA. Depending on the size & diversity 
of the users & their legit mail, only ~3-20% of could-be-incoming mail 
is subjected to full content filtering in a mature system and that 
generally means the vast bulk of both spam and ham going around SA, not 
through it. In my experience, when you start telling users useful and 
meaningful details of how & how well their mail filtering works they 
either lose all interest or freak out. This makes the creation of robust 
user-facing reporting tools worse than a waste of time: crafting an 
excuse for chronic user panics. I'm glad the people who put significant 
time into SA development have not wasted it trying to enable such 
things.
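
For anyone who still wants the raw counts asked about above, a minimal sketch of a per-user tally is below. It assumes mail is scanned through spamd and that its log lines have the usual "spamd: result: Y ..." (spam) or "spamd: result: . ..." (ham) shape with a "user=" field; the function name sa_user_counts is made up for illustration, and other plumbing (amavisd, milters) logs differently and would need other patterns. As noted above, such numbers only cover mail that actually reaches SA.

```shell
#!/usr/bin/env bash
# Per-user spam/ham tally from spamd "result:" log lines.
# Assumption: lines look roughly like
#   ... spamd: result: Y 12 - RULES scantime=0.5,size=100,user=alice,...
# where "Y" marks spam and "." marks ham. Adjust if your logs differ.
sa_user_counts() {
    awk '
        /spamd: result: / {
            verdict = ""; user = "unknown"
            # The verdict is the field right after "result:".
            for (i = 1; i <= NF; i++) {
                if ($i == "result:") verdict = $(i + 1)
            }
            # Pull the value of the first user=... field, if present.
            if (match($0, /user=[^, ]*/)) {
                user = substr($0, RSTART + 5, RLENGTH - 5)
            }
            if (verdict == "Y") spam[user]++; else ham[user]++
            seen[user] = 1
        }
        END {
            for (u in seen) {
                printf "%s spam=%d ham=%d total=%d\n", u, spam[u], ham[u], spam[u] + ham[u]
            }
        }
    ' "$@" | sort
}
```

Called as "sa_user_counts /var/log/maillog" (the path is a placeholder), it prints one line per user with spam, ham and total counts.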


Re: Is BAYES filtering working? Having doubts.

2015-12-29 Thread Chalmers
Good question. I'd like to know myself


-
From my iPhone.


> On 29 Dec 2015, at 1:28 pm, Jude DaShiell  wrote:
> 
> With spamassassin, is it possible to have the filter show counts of number of 
> messages sent to spam, number of messages sent to ham, and total number of 
> messages processed that a user can check?
> 
> On Mon, 28 Dec 2015, Bill Cole wrote:
> 
>> Date: Mon, 28 Dec 2015 23:42:03
>> From: Bill Cole 
>> Reply-To: users@spamassassin.apache.org
>> To: users@spamassassin.apache.org
>> Subject: Re: Is BAYES filtering working? Having doubts.
>>> On 28 Dec 2015, at 17:54, Peter L. Berghold wrote:
>>> 
>>> The script that I use to pull the messages out of a
>>> spam bucket invoking sa-learn runs as root which has permissions to read
>>> from anywhere.  The complication is the amavis does not have permissions
>>> to read the Maildir files for trivial users like root does.
>>> That said, I have some thoughts as how to solve that.
>> 
>> In case your ideas don't work out...
>> 
>> Useful facts: sa-learn reads stdin if you don't give it any file arguments 
>> and it can take mbox format as input.
>> 
>> Using these facts, my learning script that runs as root and reads from 
>> multiple real users' Maildirs does this to learn ham:
>> 
>> for AFILE in $HAMS ; do formail < $AFILE ; done| sudo -H -u $SAUSER sa-learn --ham --mbox
>> 
>> Where $HAMS is the list of ham message files and $SAUSER is the user 
>> handling the system-wide BayesDB. I use formail there just to give each 
>> message a leading 'From ' line (i.e. mbox format) so that the whole bunch 
>> can be piped into a single sa-learn invocation. The alternative without 
>> formail would be to pipe each raw message into its own sa-learn.  If you 
>> don't have sudo installed or don't like letting root use it, you can 
>> replicate the same effect with su in an uglier command line.
> 
> -- 
> 


Re: Looking for a script to extract readable text from emails

2015-12-29 Thread Jude DaShiell
If that problem ever gets solved, blind users of the internet could do 
two useful things: first, read things faster, and second, prevent lots of 
images from taking up user quota space.  Those blind users who can hear 
would not want audio content in video or audio files filtered out though.


On Tue, 29 Dec 2015, Bill Cole wrote:


Date: Tue, 29 Dec 2015 01:07:55
From: Bill Cole 
Reply-To: users@spamassassin.apache.org
To: users@spamassassin.apache.org
Subject: Re: Looking for a script to extract readable text from emails

On 28 Dec 2015, at 23:16, Marc Perkel wrote:

I'm looking for a script to extract readable text from emails. I want it 
demimed, ignore html, images, etc. What I'm looking for is just the 
readable text (real words). Mostly just need to extract about the first 200 
characters of real text.


Can someone point me in the right direction?


You might be able to adapt or wrap the mimeprint script from the examples 
included in the Perl MIME-Tools package. It can disassemble and decode all 
parts of a message for you.


Of course, there's no guarantee that a message *has* a meaningful text body, 
or that the text part of a multipart/alternative message resembles what a 
common MUA will show a user by rendering the HTML part.




--



Re: Is BAYES filtering working? Having doubts.

2015-12-29 Thread Reindl Harald



Am 29.12.2015 um 05:42 schrieb Bill Cole:

On 28 Dec 2015, at 17:54, Peter L. Berghold wrote:


The script that I use to pull the messages out of a
spam bucket invoking sa-learn runs as root which has permissions to read
from anywhere.  The complication is the amavis does not have permissions
to read the Maildir files for trivial users like root does.

That said, I have some thoughts as how to solve that.


In case your ideas don't work out...

Useful facts: sa-learn reads stdin if you don't give it any file
arguments and it can take mbox format as input.


better write a script which collects the samples as root in a single 
folder, chown/chmod them and then call "sa-learn" with "su" as the 
correct non-root user
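
a rough sketch of that idea follows; the function name stage_samples, the paths and $SAUSER in the usage comment are placeholders, not anything from a real setup. it only does the "collect into a single folder" part; the chown and the su call are shown as comments because they depend on the local system.

```shell
#!/usr/bin/env bash
# Copy message files from one or more source directories into a staging
# directory that the Bayes-owning user can read.
# Usage: stage_samples DEST SRC_DIR...
# Prints the number of files copied.
stage_samples() {
    local dest=$1; shift
    mkdir -p "$dest" || return 1
    local src f n=0
    for src in "$@"; do
        for f in "$src"/*; do
            [ -f "$f" ] || continue
            # Name copies by a running counter so identical file names
            # from different users cannot overwrite each other.
            n=$((n + 1))
            cp -- "$f" "$dest/$n.eml" || return 1
            chmod 0644 "$dest/$n.eml"
        done
    done
    echo "$n"
}

# Hypothetical usage as root (paths and SAUSER are assumptions):
#   stage_samples /var/tmp/sa-train/ham /home/*/Maildir/.Ham/cur
#   chown -R "$SAUSER" /var/tmp/sa-train
#   su -c 'sa-learn --ham /var/tmp/sa-train/ham' "$SAUSER"
```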



Using these facts, my learning script that runs as root and reads from
multiple real users' Maildirs does this to learn ham:

   for AFILE in $HAMS ; do formail < $AFILE ; done| sudo -H -u $SAUSER sa-learn --ham --mbox

Where $HAMS is the list of ham message files and $SAUSER is the user
handling the system-wide BayesDB. I use formail there just to give each
message a leading 'From ' line (i.e. mbox format) so that the whole
bunch can be piped into a single sa-learn invocation. The alternative
without formail would be to pipe each raw message into its own sa-learn.
 If you don't have sudo installed or don't like letting root use it,
you can replicate the same effect with su in an uglier command line


don't get why "pipe each raw message into its own sa-learn"

tried that and it's terribly slow with no useful progress display
you don't gain anything with "formail" except overhead

sa-learn --max-size=0 --progress --spam /sample-folder/spam/
sa-learn --max-size=0 --progress --ham  /sample-folder/ham/

both folders contain single eml files, which don't need to have a 
leading 'From ' line; sa-learn is able to display progress, including 
estimated time to finish

_

yours:
for SAMPLE_FILE in "$SA_MILTER_HOME"/training/spam/{.,}*; do
/usr/bin/formail < "$SAMPLE_FILE"; done | /usr/bin/sa-learn --dbpath \
"$BAYES_TEMP/bayes" --max-size=0 --no-sync --progress --spam --mbox


mine for a year now:
/usr/bin/sa-learn --dbpath "$BAYES_TEMP/bayes" --max-size=0 --no-sync \
--progress --spam "$SA_MILTER_HOME/training/spam/"

_

additionally, with your version there are warnings like the one below as 
well as "Learned tokens from 16670 message(s) (16670 message(s) examined)", 
while with my version all 57337 messages are correctly learned


Parsing of undecoded UTF-8 will give garbage when decoding entities at 
/usr/share/perl5/vendor_perl/Mail/SpamAssassin/HTML.pm line 260
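
one way to check a mismatch like that is to compare the number of files in the training folder with the number of 'From ' separator lines in the stream that formail produces, and then with what sa-learn reports. a small sketch, with made-up helper names count_files and count_mbox_messages:

```shell
#!/usr/bin/env bash
# Count regular files (including dot-files) directly inside a directory.
count_files() {
    local f n=0
    for f in "$1"/* "$1"/.[!.]*; do
        if [ -f "$f" ]; then
            n=$((n + 1))
        fi
    done
    echo "$n"
}

# Count mbox message separators ("From " at the start of a line) on stdin.
# Body lines that formail escaped as ">From " are not counted.
count_mbox_messages() {
    grep -c '^From '
}
```

for example, "count_files /sample-folder/spam" versus "for f in /sample-folder/spam/*; do formail < "$f"; done | count_mbox_messages" should give the same number if every file became exactly one mbox message.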







URI_NO_WWW_INFO_CGI info

2015-12-29 Thread marco
Hi,
some customers trigger this rule in their mails

# bug 3896: URIs in various TLDs, other than 3rd level www
uri
URI_NO_WWW_INFO_CGI 
/^(?:https?:\/\/)?[^\/]+(?http://media.mydomain.info/.
and the rule takes a score of 2.299 by itself.

Why is the score so high?
Does anyone know why only .info and .biz domains are in these rules and not
other domains (.com etc.)?

Thanks
Marco





Re: Is BAYES filtering working? Having doubts.

2015-12-29 Thread Jude DaShiell
With spamassassin, is it possible to have the filter show counts of number 
of messages sent to spam, number of messages sent to ham, and total number 
of messages processed that a user can check?

On Mon, 28 Dec 2015, Bill Cole wrote:



Date: Mon, 28 Dec 2015 23:42:03
From: Bill Cole 
Reply-To: users@spamassassin.apache.org
To: users@spamassassin.apache.org
Subject: Re: Is BAYES filtering working? Having doubts.

On 28 Dec 2015, at 17:54, Peter L. Berghold wrote:


The script that I use to pull the messages out of a
spam bucket invoking sa-learn runs as root which has permissions to read
from anywhere.  The complication is the amavis does not have permissions
to read the Maildir files for trivial users like root does.

That said, I have some thoughts as how to solve that.


In case your ideas don't work out...

Useful facts: sa-learn reads stdin if you don't give it any file arguments 
and it can take mbox format as input.


Using these facts, my learning script that runs as root and reads from 
multiple real users' Maildirs does this to learn ham:


 for AFILE in $HAMS ; do formail < $AFILE ; done| sudo -H -u $SAUSER sa-learn --ham --mbox


Where $HAMS is the list of ham message files and $SAUSER is the user handling 
the system-wide BayesDB. I use formail there just to give each message a 
leading 'From ' line (i.e. mbox format) so that the whole bunch can be piped 
into a single sa-learn invocation. The alternative without formail would be 
to pipe each raw message into its own sa-learn.  If you don't have sudo 
installed or don't like letting root use it, you can replicate the same 
effect with su in an uglier command line.




--



Re: TxRep Template Tags staying as tags

2015-12-29 Thread Kevin Golding
On Mon, 28 Dec 2015 18:21:35 -, Kevin A. McGrail   
wrote:



On 12/24/2015 10:04 AM, Kevin Golding wrote:
I know I'm a bit weird but I like stuffing headers with all kinds of  
data like I'm stuffing a turkey for Christmas, but I've never been able  
to get anything showing up for TxRep (it is referenced in _TESTSSCORES_  
but none of the TxRep specific ones seem to convert). I feel I can't be  
the only person trying to use these tags so I thought I'd see if anyone  
had any hints.


I actually just use global storage for TxRep and assign a generic  
username for everything but I've tried it with both G and Y modifiers.  
I've added these lines for testing:


add_header all TxRep-Global _TXREP_EMAIL_G_ _TXREP_EMAIL_IP_G_  
_TXREP_IP_G_ _TXREP_DOMAIN_G_ _TXREP_HELO_G_
add_header all TxRep-User _TXREP_EMAIL_U_ _TXREP_EMAIL_IP_U_  
_TXREP_IP_U_ _TXREP_DOMAIN_U_ _TXREP_HELO_U_


But all mails show is:

X-Spam-TxRep-Global: _TXREP_EMAIL_G_ _TXREP_EMAIL_IP_G_ _TXREP_IP_G_  
_TXREP_DOMAIN_G_ _TXREP_HELO_G_
X-Spam-TxRep-User: _TXREP_EMAIL_U_ _TXREP_EMAIL_IP_U_ _TXREP_IP_U_  
_TXREP_DOMAIN_U_ _TXREP_HELO_U_


I've tried skipping both U and G since I'm not using dual storage but  
no joy there either. I've tried it with the various sub-tags like COUNT  
and MEAN etc. but again, no conversions.


I've scratched my head over this one before and got nowhere and today's  
attempts don't seem to have got me any closer so hopefully someone can  
spot where my mistake is.


Cheers

Sorry, I think you might be unique.  These aren't listed in the  
Mail::SpamAssassin::Conf TAGS section and grepping trunk shows nothing.


Whilst not in Mail::SpamAssassin::Conf they are in  
Mail::SpamAssassin::Plugin::TxRep


https://spamassassin.apache.org/full/3.4.x/doc/Mail_SpamAssassin_Plugin_TxRep.html#template_tags

The tags don't show up well in grepping the code because the tags are  
defined dynamically as they iterate over the same basic code changing just  
one argument.


Having done a bit more poking I can try to fill in more data because I'm  
still a little puzzled.


The big thing I've confirmed is the _Y part of the tag is redundant for  
me, since I see the following in debug:


dbg: check: tagrun - tag TXREP_EMAIL_IP is now ready, value: 0.0

Since tagrun debug output strips the fore and aft _ I've been running with:

add_header all TxRep-Email-IP _TXREP_EMAIL_IP_

I had found the question of needing either _U or _G or neither unclear,  
so it was nice to confirm that neither was the best choice for me. The  
problem is I still get:


X-Spam-TxRep-Email-IP: _TXREP_EMAIL_IP_

I've tried this with every TXREP_FOO option that I can find and nothing  
gets translated. All my other template tags work as expected though so I  
feel there's something awry.


I'm happy to fix the small documentation bugs I've already worked out and  
I'm happy to submit patches to move it to Mail::SpamAssassin::Conf if  
desired (although the number of these tags and variations is likely why  
they were originally left in the plugin documentation, and that may still  
be preferred) but I'm reluctant to be too gung-ho when I can't get the  
things working myself.