Re: CONTENT_AFTER_HTML: better not discuss formatting!!

2022-02-08 Thread John Hardin

On Tue, 8 Feb 2022, Loren Wilton wrote:

>>> Are you talking about the use of m'' as the regex delimiter?
>>
>> Yes.
>>
>> It will probably work just fine for the foreseeable future, as long as the
>> input validation of rules files is lenient.
>
> I think you may have a very hard time removing the m matching
> delimiters from SA. I suspect there are at least hundreds of rules like that
> in the release database. I have about a hundred local rules of my own that
> use that.

Indeed.


--
 John Hardin KA7OHZ    http://www.impsec.org/~jhardin/
 jhar...@impsec.org pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
---
  Journalism is about covering important stories.
  With a pillow, until they stop moving.   -- David Burge
---
 74 more days working to pay your (average) annual US tax bill
 before you're finally working for yourself.


Re: CONTENT_AFTER_HTML: better not discuss formatting!!

2022-02-08 Thread Loren Wilton

>> Are you talking about the use of m'' as the regex delimiter?
>
> Yes.
>
> It will probably work just fine for the foreseeable future, as long as the
> input validation of rules files is lenient.


I think you may have a very hard time removing the m matching 
delimiters from SA. I suspect there are at least hundreds of rules like that 
in the release database. I have about a hundred local rules of my own that 
use that.


Any time I have more than one backslash in a pattern, I use an alternate 
delimiter (usually single quote) so that I don't have to escape all the 
backslashes in the rule body. I'm not a fan of obfuscated rule bodies where 
it is impossible to tell what it is intended to match. My experience is that 
any time you have to write  or \\ multiple times in a rule body, you 
are almost guaranteed to get the number of backslashes wrong, and the rule 
won't work. But of course it may work in some cases (like the one you used 
to test it) while not working in general.
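
One reading of the point above, as a minimal Python sketch rather than anything SpamAssassin itself does: the helper below (a hypothetical name, `escape_delimiter`) adds a backslash in front of every unescaped occurrence of the chosen delimiter, which is what a rule author has to do by hand. The URL pattern is an invented example. It only counts how many backslashes each delimiter choice forces into the same rule body.

```python
def escape_delimiter(pattern: str, delim: str) -> str:
    """Backslash-escape every unescaped occurrence of delim in pattern."""
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            # Keep an existing escape pair (e.g. "\.") exactly as written.
            out.append(pattern[i:i + 2])
            i += 2
            continue
        out.append("\\" + ch if ch == delim else ch)
        i += 1
    return "".join(out)


# Invented rule body: a URL with several forward slashes.
body = r"https?://example\.com/a/b"

slash_form = "/" + escape_delimiter(body, "/") + "/"
quote_form = "m'" + escape_delimiter(body, "'") + "'"

print(slash_form, slash_form.count("\\"))
print(quote_form, quote_form.count("\\"))
```

With the slash delimiter every path separator needs its own backslash; with the single quote only the escape that was already part of the pattern remains.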


I don't have time in my life to deal with that sort of thing. It caused me 
enough grief when I started writing rules 20 years ago, which is why I 
started using m'.


BTW, that particular rule dates from RulesEmporium days, which was what, 
2005 or so?


   Loren



Re: CONTENT_AFTER_HTML: better not discuss formatting!!

2022-02-08 Thread Bill Cole
On 2022-02-08 at 13:14:06 UTC-0500 (Tue, 8 Feb 2022 13:14:06 -0500)
Kris Deugau 
is rumored to have said:
[...]
> Are you talking about the use of m'' as the regex delimiter?

Yes.

It will probably work just fine for the foreseeable future, as long as the 
input validation of rules files is lenient.

It isn't beyond the realm of possibility that someday we'll tighten up syntax 
checking. We've had security issues in the past which involved the hypothetical 
potential to sneak in malicious code via rules. I don't expect that we'll have 
another one bad enough to make a rewrite of the config parser justified, but it 
could happen, and I don't think we'd design it today as it was done 20 years 
ago.
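
For anyone wanting to gauge how many local rules would be affected if syntax checking were ever tightened, here is a rough Python sketch. It is not part of SpamAssassin and does not reflect how its config parser works; the two line patterns are a simplified assumption about rule syntax, and the sample rules (including `check_something`) are made up for illustration.

```python
import re

# Simplified assumption: body-like rules end in a regex, header rules use =~ or !~.
BODY_LIKE = re.compile(r"^\s*(?:body|rawbody|full|uri)\s+(\S+)\s+(\S.*)$")
HEADER = re.compile(r"^\s*header\s+(\S+)\s+\S+\s+[=!]~\s+(\S.*)$")


def regex_delimiter(pattern):
    """Return the delimiter of /.../ or m<delim>...; None if not a regex."""
    if pattern.startswith("/"):
        return "/"
    if (len(pattern) > 1 and pattern[0] == "m"
            and not pattern[1].isalnum() and not pattern[1].isspace()):
        return pattern[1]
    return None  # e.g. eval: rules


def rules_with_alternate_delimiters(cf_text):
    """List (rule name, delimiter) for rules not using the plain slash form."""
    found = []
    for line in cf_text.splitlines():
        m = BODY_LIKE.match(line) or HEADER.match(line)
        if not m:
            continue
        name, pattern = m.group(1), m.group(2)
        delim = regex_delimiter(pattern)
        if delim is not None and delim != "/":
            found.append((name, delim))
    return found


# Made-up sample; only __GOODEHTML2 is taken from this thread.
SAMPLE = r"""
body    __A            /foo/i
full    __GOODEHTML2   m'(?:\s|=0A){0,50}(?:$|--|=)'is
header  __B            From:name =~ m{bar}i
header  __C            Subject =~ /baz/
body    __D            eval:check_something()
"""

print(rules_with_alternate_delimiters(SAMPLE))
```

Running something like this over a local rules directory would give a count to set against the "hundreds of rules" estimate elsewhere in the thread.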


-- 
Bill Cole
b...@scconsult.com or billc...@apache.org
(AKA @grumpybozo and many *@billmail.scconsult.com addresses)
Not Currently Available For Hire


Re: CONTENT_AFTER_HTML: better not discuss formatting!!

2022-02-08 Thread Kris Deugau

Bill Cole wrote:
> On 2022-02-08 at 04:28:16 UTC-0500 (Tue, 8 Feb 2022 01:28:16 -0800)
> Loren Wilton
> is rumored to have said:
>
>>> No, I added that after observing multiple spams with random garbage after the
>>> closing HTML tag in the HTML body part. Presumably it was an attempt at Bayes
>>> poison, checksum avoidance, or some other filter evasion technique.
>>>
>>> I'll tighten it up.
>>
>> FWIW, here is the rule I use. It obviously could be better, but I haven't
>> noticed that it misfires.
>>
>> full __GOODEHTML1 m''i
>>
>> full __GOODEHTML2 m'(?:\s|=0A){0,50}(?:$|--|=)'is # stop on mime ending boundary
>
> TANGENTIAL:
>
> I would advise against using such alternative regex syntax in rules. As you
> obviously figured out, you CAN (for now...) use any valid Perl syntax for
> writing a regex match, but I do not believe that we want to bless that as
> something which will never break.


Maybe it's just inexperience with deep regex voodoo, but I'm not seeing 
anything odd in those.


Are you talking about the use of m'' as the regex delimiter?

-kgd


Re: CONTENT_AFTER_HTML: better not discuss formatting!!

2022-02-08 Thread Bill Cole
On 2022-02-08 at 04:28:16 UTC-0500 (Tue, 8 Feb 2022 01:28:16 -0800)
Loren Wilton 
is rumored to have said:

>> No, I added that after observing multiple spams with random garbage after 
>> the closing HTML tag in the HTML body part. Presumably it was an attempt at 
>> Bayes poison, checksum avoidance, or some other filter evasion technique.
>>
>> I'll tighten it up.
>
> FWIW, here is the rule I use. It obviously could be better, but I haven't 
> noticed that it misfires.
>
> full __GOODEHTML1 m''i
>
> full __GOODEHTML2 m'(?:\s|=0A){0,50}(?:$|--|=)'is # stop on mime ending boundary

TANGENTIAL:

I would advise against using such alternative regex syntax in rules. As you 
obviously figured out, you CAN (for now...) use any valid Perl syntax for 
writing a regex match, but I do not believe that we want to bless that as 
something which will never break.




Re: FROM header obfuscation

2022-02-08 Thread Kris Deugau

Frido Otten wrote:

> Hi All,
>
> Recently we're seeing more spam passing our spamfilters using text
> obfuscation in the FROM header. The problem mainly affects users of
> mail clients like iPhone Mail, which show only the display name from
> the FROM header and not the actual email address that was used,
> bypassing DKIM measures. For example:
>
> From: =?UTF-8?B?0KBvc3RubC5ubCDQoGFra2V0?=
>
> This is base64 encoded "Рostnl.nl Рakket" and pretends to come from
> Postnl, a Dutch snailmail company. However, the hexadecimal
> representation of this base64 decoded text differs from that of normal
> ASCII:
>
> Obfuscated:
>
> $ printf "Рostnl.nl Рakket" | od -A n -t x1
>   d0 a0 6f 73 74 6e 6c 2e 6e 6c 20 d0 a0 61 6b 6b
>   65 74
>
> Plain ASCII:
>
> $ printf "Postnl.nl Pakket" | od -A n -t x1
>   50 6f 73 74 6e 6c 2e 6e 6c 20 50 61 6b 6b 65 74
>
> There is no way to tell the difference with the naked eye.


That depends on the font.  Many variations do in fact look different, 
and from some of the FP-approaching "ham" I've seen that abuses this I 
can only conclude that some marketing  person has decided that this 
is Necessary and Required and the tech folks can Go Suck It.


As far as I'm concerned, formatting outside of language accents on 
characters absolutely does NOT belong in either the From: name or 
Subject.  An "a" in the From: name or Subject absolutely MUST be 
presented as a US-ASCII "a", and not some extended UTF8 lookalike 
that's...   oo!  in *italics*!


Naturally the spammers go to various amounts of effort to avoid the ones 
that are clearly different.


> Is there any way to detect this type of obfuscation with a spamassassin
> rule?


I have a longish list of rule groups similar to below for different 
extended UTF8 ASCII-lookalike characters and words.  Some are derived 
from rules discussed on this list within the past year or so.


header   __SUSP_NAME_CHAR_01  From:name =~ /(?:\xd0[\xa0-\xbf])/
tflags   __SUSP_NAME_CHAR_01  multiple maxhits=10
header   __SUSP_NAME_CHAR_02  From:name =~ /(?:\xef\xbc[\x80-\xbf]|\xef\xbd[\x80-\xa0])/
tflags   __SUSP_NAME_CHAR_02  multiple maxhits=10
meta     __SUSP_NAME_CHAR     __SUSP_NAME_CHAR_01 + __SUSP_NAME_CHAR_02
meta     SUSP_NAME_CHAR_5     __SUSP_NAME_CHAR >= 5
describe SUSP_NAME_CHAR_5     5 or more lookalike characters in the From: name
score    SUSP_NAME_CHAR_5     1.5
meta     SUSP_NAME_CHAR_10    __SUSP_NAME_CHAR >= 10
describe SUSP_NAME_CHAR_10    10 or more lookalike characters in the From: name
score    SUSP_NAME_CHAR_10    1.75

I've used this tool:

https://www.utf8-chartable.de/

with a bit of effort to take an example character and locate the full 
a-z list of entries for these rules.  (Convert individual characters to 
hex, then flip pages until you've found the fakes.  There are many groups.)


Single characters are trickier;  depending on context I've added rules 
for individual lookalike characters, or whole words with mixed variants 
(and an exclusion for pure ASCII) as I see new runs of FNs.
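
The counting logic of the rule group in this message can be tried outside SpamAssassin with a small Python sketch. It assumes the display name is matched as raw UTF-8 bytes and that each sub-rule is capped at 10 hits; the function names and the fullwidth test name are made up for illustration, and this is not how SpamAssassin evaluates rules internally.

```python
import re

# Same byte ranges as the two header sub-rules in this message.
CYRILLIC_RANGE = re.compile(rb"\xd0[\xa0-\xbf]")
FULLWIDTH_RANGE = re.compile(rb"\xef\xbc[\x80-\xbf]|\xef\xbd[\x80-\xa0]")
MAXHITS = 10  # assumed per-rule cap


def lookalike_count(name: str) -> int:
    """Sum the capped hit counts of both byte patterns over the UTF-8 name."""
    raw = name.encode("utf-8")
    hits_01 = min(len(CYRILLIC_RANGE.findall(raw)), MAXHITS)
    hits_02 = min(len(FULLWIDTH_RANGE.findall(raw)), MAXHITS)
    return hits_01 + hits_02


def rule_hits(name: str) -> dict:
    """Mirror the two meta thresholds (5 or more, 10 or more)."""
    total = lookalike_count(name)
    return {"SUSP_NAME_CHAR_5": total >= 5, "SUSP_NAME_CHAR_10": total >= 10}


postnl = "\u0420ostnl.nl \u0420akket"                # two Cyrillic capitals
fullwidth = "\uff30\uff41\uff59\uff30\uff41\uff4c"   # six fullwidth Latin letters

print(lookalike_count(postnl), rule_hits(postnl))
print(lookalike_count(fullwidth), rule_hits(fullwidth))
```

Note that the Postnl example from the original question has only two lookalike characters, so it would stay below both thresholds here; that is the "single characters are trickier" case.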


-kgd


Re: Emails from gmail.com bypassing Spamassassin scoring

2022-02-08 Thread Bill Cole
On 2022-02-07 at 13:43:31 UTC-0500 (Mon, 07 Feb 2022 13:43:31 -0500)
Chad 
is rumored to have said:

> I have been getting numerous emails lately from various gmail.com accounts.
> They are spam or phishing emails and today I got one that had a subject of
> RECEIPT 5454 and only a JPG image of an invoice. There was no content in
> the email.
>
> It bypassed Spamassassin scoring.  Do you know why or what setting I need
> to set so EVERY email goes through Spamassassin scoring procedures?
>
> My email server is: mercury2022.mercuryemail.net
[...]
> Received: from mail-pl1-f172.google.com (mail-pl1-f172.google.com [209.85.214.172])
>     by mercury2022.mercuryemail.net (Postfix) with ESMTPS id A5F7E8043D4A
>     for ; Mon,  7 Feb 2022 10:44:18 -0500 (EST)

OK, so we know that your mail server is running Postfix but not how you've 
integrated SpamAssassin. There are many possibilities, with 2 independent 
attributes:


1. Interface to Postfix:
  a. content_filter setting to pipe mail to a bespoke script (maybe 
distro-provided)
  b. milter (amavis, spamass-milter, mimedefang, etc.)
  c. SMTP Proxy (usually amavis)
  d. FILTER action in an access map to a bespoke script.
  e. NONE: Integrated with a downstream delivery agent (e.g. Dovecot LMTP) or 
MUA.

2. Interface to SA:
  a. Load Mail::SpamAssassin Perl modules and use them directly
  b. Use a spamc binary built from the SA distribution to contact a local spamd 
instance
  c. Use a spamc binary built from the SA distribution to contact a remote 
spamd instance
  d. Use a custom implementation of the spamc protocol to contact a local spamd 
instance
  e. Use a custom implementation of the spamc protocol to contact a remote 
spamd instance
  f. Run the spamassassin script and handle its output.

So, yeah: 30 possible combinations. It is hard to say what is broken without 
knowing how you have SA working when it works. This sort of problem is never 
technically in SpamAssassin itself, as SpamAssassin itself doesn't include any 
software that could act as a gatekeeper.
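
The count of 30 is simply 5 Postfix-side options times 6 SpamAssassin-side options. As a quick Python sketch (the short labels are paraphrases of the two lists above, not official names):

```python
from itertools import product

# Paraphrased labels for list 1 (interface to Postfix).
postfix_interfaces = [
    "content_filter script",
    "milter",
    "SMTP proxy",
    "FILTER action in access map",
    "none (downstream delivery agent or MUA)",
]

# Paraphrased labels for list 2 (interface to SA).
sa_interfaces = [
    "Mail::SpamAssassin modules",
    "spamc to local spamd",
    "spamc to remote spamd",
    "custom spamc protocol to local spamd",
    "custom spamc protocol to remote spamd",
    "spamassassin script",
]

# Every pairing of one Postfix-side choice with one SA-side choice.
combos = list(product(postfix_interfaces, sa_interfaces))
print(len(combos))
```

Knowing which single pairing is in use is what narrows the problem down.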




Re: CONTENT_AFTER_HTML: better not discuss formatting!!

2022-02-08 Thread Greg Troxel

John Hardin  writes:

> On Mon, 7 Feb 2022, Greg Troxel wrote:
>
>> and then I got a reply back with the content he was trying to send etc.
>> But, it had:
>>
>>  *  2.5 CONTENT_AFTER_HTML More content after HTML close tag
>>
>> but one was only text/plain and I could see nothing wrong.   reading
>> 72_active.cf I found:
>>
>>  rawbody  __CONTENT_AFTER_HTML  /<\/htnl>\s*[a-z0-9]/i
>> which fires on a text/plain part that discusses html formatting!
>
> Ah, I'll see if I can add something to that so it only fires when
> there's an actual HTML body part. Thanks for the report.
>
> Pity there's not an "htmlbody" rule type...

Agreed - I think the way you are trying to tighten is correct.
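
To make the false positive concrete, here is a small Python sketch using the standard `email` package. It is not the SpamAssassin implementation. It assumes the real rule looks for the closing html tag (the quoted rule spells it differently, presumably so the message would not trip the rule itself), and the two sample messages are invented.

```python
import re
from email import message_from_string

# Assumption: closing html tag followed by optional whitespace and an alphanumeric.
AFTER_CLOSE = re.compile(r"</html>\s*[a-z0-9]", re.I)


def fires_ungated(msg) -> bool:
    """Match against every leaf part, whatever its content type."""
    return any(AFTER_CLOSE.search(part.get_payload())
               for part in msg.walk() if not part.is_multipart())


def fires_html_only(msg) -> bool:
    """Match only against parts declared as text/html."""
    return any(AFTER_CLOSE.search(part.get_payload())
               for part in msg.walk()
               if not part.is_multipart()
               and part.get_content_type() == "text/html")


# Invented plain-text message that merely talks about HTML.
plain = message_from_string(
    "From: a@example.com\n"
    "Content-Type: text/plain\n"
    "\n"
    "Remember to end the page with </html> and nothing after it.\n"
)

# Invented HTML message with junk after the closing tag.
html = message_from_string(
    "From: b@example.com\n"
    "Content-Type: text/html\n"
    "\n"
    "<html><body>hi</body></html>\nxk29fj\n"
)

print(fires_ungated(plain), fires_html_only(plain))
print(fires_ungated(html), fires_html_only(html))
```

The ungated check fires on both messages, while the gated one fires only on the HTML part, which is the behaviour the tightening is aiming for.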





Re: Errors running SpamAssassin

2022-02-08 Thread Bernard

> I'd run "sh -x /etc/cron.daily/spamassassin"
> to see what command in that file failed. I assume it is the sa-compile
> command.


I got some more results.
Here are the steps I took:
1. Remove everything from /var/lib/spamassassin
2. Reinstall spamassassin package
3. Recreate /var/lib/spamassassin/compiled directory with 
debian-spamd:debian-spamd ownership

4. Run sudo sh -x /etc/cron.daily/spamassassin

Here is the output I collected:
+ CRON=0
+ test -f /etc/default/spamassassin
+ . /etc/default/spamassassin
+ SPAMD_HOME=/run/spamd/
+ OPTIONS=--create-prefs --max-children 5 --helper-home-dir /run/spamd/ --listen /run/spamd/spamd.sock --username debian-spamd -s spamd --allow-tell --timeout-child=30
+ PIDFILE=/run/spamd/spamd.pid
+ NICE=--nicelevel 15
+ CRON=1
+ test -x /usr/bin/sa-update
+ test -x /etc/init.d/spamassassin
+ command -v gpg
+ [ 1 = 0 ]
+ [ ! -t 0 ]
+ umask 022
+ env -i LANG=en_GB.UTF-8 PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin http_proxy= start-stop-daemon --chuid debian-spamd:debian-spamd --start --exec /usr/bin/sa-update -- --gpghomedir /var/lib/spamassassin/sa-update-keys
+ env -i LANG=en_GB.UTF-8 PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin start-stop-daemon --chuid debian-spamd:debian-spamd --start --exec /usr/bin/spamassassin -- --lint
+ do_compile
+ [ -x /usr/bin/re2c -a -x /usr/bin/sa-compile ]
+ env -i LANG=en_GB.UTF-8 PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin start-stop-daemon --chuid debian-spamd:debian-spamd --start --exec /usr/bin/sa-compile -- --quiet
chmod: cannot access 'body_0.bs': No such file or directory
make: *** [Makefile:465: body_0.bs] Error 1
command 'make PREFIX=/tmp/.spamassassin11033rJy6vLtmp/ignored INSTALLSITEARCH=/var/lib/spamassassin/compiled/5.032/3.004006 >>/tmp/.spamassassin11033rJy6vLtmp/log' failed: exit 2
+ runuser -u debian-spamd -- chmod -R go-w,go+rX /var/lib/spamassassin/compiled
+ reload
+ which invoke-rc.d
+ invoke-rc.d --quiet spamassassin status
+ invoke-rc.d spamassassin reload
+ [ -d /etc/spamassassin/sa-update-hooks.d ]
+ run-parts --lsbsysinit /etc/spamassassin/sa-update-hooks.d

The error in the middle is the one reported daily through the run of the 
cron job.
Might there be something wrong with my environment? Else, what could be 
wrong/needs checking?


Bernard


FROM header obfuscation

2022-02-08 Thread Frido Otten

Hi All,

Recently we're seeing more spam passing our spamfilters using text
obfuscation in the FROM header. The problem mainly affects users of
mail clients like iPhone Mail, which show only the display name from
the FROM header and not the actual email address that was used,
bypassing DKIM measures. For example:


From: =?UTF-8?B?0KBvc3RubC5ubCDQoGFra2V0?=

This is base64 encoded "Рostnl.nl Рakket" and pretends to come from
Postnl, a Dutch snailmail company. However, the hexadecimal
representation of this base64 decoded text differs from that of normal
ASCII:


Obfuscated:

$ printf "Рostnl.nl Рakket" | od -A n -t x1
 d0 a0 6f 73 74 6e 6c 2e 6e 6c 20 d0 a0 61 6b 6b
 65 74

Plain ASCII:

$ printf "Postnl.nl Pakket" | od -A n -t x1
 50 6f 73 74 6e 6c 2e 6e 6c 20 50 61 6b 6b 65 74

There is no way to tell the difference with the naked eye. You can 
obfuscate text using this online tool: https://obfuscator.uo1.net/
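
The same comparison can be made directly from the encoded header with Python's standard library, as a minimal sketch: decode the encoded word, print the bytes, and name every non-ASCII character. Only the header value comes from the example above; the variable names are arbitrary.

```python
import unicodedata
from email.header import decode_header

raw_name = "=?UTF-8?B?0KBvc3RubC5ubCDQoGFra2V0?="

# decode_header yields (bytes, charset) pairs for RFC 2047 encoded words.
decoded_bytes, charset = decode_header(raw_name)[0]
text = decoded_bytes.decode(charset)

# Collect each non-ASCII character with its code point and Unicode name.
suspects = [(ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
            for ch in text if ord(ch) > 127]

print(decoded_bytes.hex(" "))
for entry in suspects:
    print(entry)
```

The two leading `d0 a0` byte pairs turn out to be Cyrillic capital letters that merely look like a Latin "P".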


Is there any way to detect this type of obfuscation with a spamassassin 
rule?


Best regards,
Frido Otten



Re: CONTENT_AFTER_HTML: better not discuss formatting!!

2022-02-08 Thread Loren Wilton
> No, I added that after observing multiple spams with random garbage after
> the closing HTML tag in the HTML body part. Presumably it was an attempt
> at Bayes poison, checksum avoidance, or some other filter evasion
> technique.
>
> I'll tighten it up.


FWIW, here is the rule I use. It obviously could be better, but I haven't 
noticed that it misfires.


full __GOODEHTML1 m''i

full __GOODEHTML2 m'(?:\s|=0A){0,50}(?:$|--|=)'is # stop on mime ending boundary


meta LW_BADEHTML1 (__GOODEHTML1 && !__GOODEHTML2)

describe LW_BADEHTML1 Bad ending - something after 

score LW_BADEHTML1 1