Re: No longer just embedded =9D characters in blackmail emails.

2019-03-22 Thread Savvas Karagiannidis

On 21/3/2019 18:23, John Hardin wrote:

On Thu, 21 Mar 2019, Savvas Karagiannidis wrote:

What should be considered is the message's language. All messages 
that were false positives had the following mime encoding (messages 
were actually in greek):


Content-Type: text/[plain|html]; charset="windows-1253" or
Content-Type: text/[plain|html]; charset="iso-8859-7"

while all messages that were actual spam and were properly detected had:

Content-Type: text/[plain|html]; charset="utf-8"


It should be fairly easy to add an exclusion based on that 
information. However, that information may well be leveraged by 
spammers who are using that obfuscation...


I think the same applies to the rule itself and to any other rule. As 
long as the rule is out there, any spammer can incorporate a means to 
avoid it. I guess the choice of "e" as the character is also fairly 
arbitrary; avoiding it and applying the same technique to other 
characters not detected by this rule (I've already seen it happening) 
should be no problem for spammers...


--
Savvas Karagiannidis




Re: No longer just embedded =9D characters in blackmail emails.

2019-03-21 Thread Fedor Piecka
Hello Bill

I can show a few messages triggering the rule in our case but only for you
to see the use of accented characters in Czech language. I'm unable to
grant you permission to upload them to the masscheck corpus or to any other
public/semipublic database. The messages contain no classified information,
but they include names and email addresses of our users and our business
partners. It would be difficult to anonymize them and still keep their
value.

To be honest, the rule fires at low frequency even in our system which
deals with Slovak and Czech languages all the time. Most people in here
write emails with ASCII characters only and don't use accents etc. However,
some do (e.g. I do - UTF-8 works in most email clients, so why not). I
found the rule while searching through logs for messages which had scored
too close to the spam threshold and are therefore close to being
misclassified.

I can see 9 messages received in the last 10 days firing the rule and all
of them are Atlassian Jira notifications from our Czech business partner.
After running sa-update, though, they score pretty low, as MIXED_ES now has a
score of 0.5.

Shall I provide you the respective messages?

On Thu, 21 Mar 2019 at 16:34, Bill Cole wrote:

> On 21 Mar 2019, at 10:52, John Wilcock wrote:
>
> > On 21/03/2019 at 14:52, John Wilcock wrote:
> >> On 20/03/2019 at 20:19, Bill Cole wrote:
> >>> I've added these lines to the block that defines MIXED_ES which may
> >>> help some sites:
> >>>
> >>>  lang pl  score MIXED_ES  0.01
> >>>  lang cz  score MIXED_ES  0.01
> >>>  lang sk  score MIXED_ES  0.01
> >>>  lang hr  score MIXED_ES  0.01
> >>>  lang el  score MIXED_ES  0.01
> >>>
> >>> Those should get into the default rules channel within a few days.
> >>
> >> All very well, except [...]
> > Also, there are *lots* of other languages that legitimately use E-like
> > characters that should be added to the list (e.g. there's a Cyrillic
> > "е", so you can add ru, bg, uk, be, bs, sr, kk, ky, mn, tg and
> > others, for a start; ). You'll be fighting a losing battle there...
>
> Actually not a battle I'm fighting...
>
> I have seen direct reports of this rule (which is substantially more
> narrow than just 'has mixed e-like characters') matching ham in the
> above listed languages. I know that on the order of 0.001% of ham in the
> masscheck data submitted to SA Rule QA match the rule and that the bulk
> of that is from a single small corpus (from a Polish source) in which
> ~0.5% of ham matches. It appears that occasionally that match rate
> results in a classification false positive, which is a real but small
> and constrained problem.
>
> I have never seen an actual ham message matching the rule, much less had
> access to a mail stream including a steady stream of such messages. I
> have only ever seen vague reports of classification FPs, all of which
> cite the score as 3.999, which has not been accurate for most of the
> lifetime of the rule. As such, I have no real weapons in this battle and
> a foe who is invisible but noisy, to overstretch your analogy.
>
> Individual sites are always free to kill or redefine rules from the
> default set or peg their scores to limit FPs.
>
> --
> Bill Cole
> b...@scconsult.com or billc...@apache.org
> (AKA @grumpybozo and many *@billmail.scconsult.com addresses)
> Available For Hire: https://linkedin.com/in/billcole
>


Re: No longer just embedded =9D characters in blackmail emails.

2019-03-21 Thread @lbutlr
On 21 Mar 2019, at 14:27, John Hardin wrote:
> On Thu, 21 Mar 2019, Martin Gregorie wrote:
>> On Thu, 2019-03-21 at 12:20 -0700, John Hardin wrote:
>>> 
>>> ...wrong thread? :)

>> Unfortunately so. For some reason my mail reader's editor (I use
>> Evolution) locked up on my first attempt to reply and when I got it to
>> respond again it sent the stupid message containing one blank line.
>> 
>> Then I screwed up by making a second reply attempt to the wrong
>> message.  Must do better  :-/

> ...and it's not even Monday.

It's always Monday somewhere.


-- 
O is for OLIVE run through with an awl
P is for PRUE trampled flat in a brawl





Re: No longer just embedded =9D characters in blackmail emails.

2019-03-21 Thread John Hardin

On Thu, 21 Mar 2019, Martin Gregorie wrote:


On Thu, 2019-03-21 at 12:20 -0700, John Hardin wrote:


...wrong thread? :)


Unfortunately so. For some reason my mail reader's editor (I use
Evolution) locked up on my first attempt to reply and when I got it to
respond again it sent the stupid message containing one blank line.

Then I screwed up by making a second reply attempt to the wrong
message.  Must do better  :-/


...and it's not even Monday.


--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174     pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
---
  How do you argue with people to whom math is an opinion? -- Unknown
---
 721 days since the first commercial re-flight of an orbital booster (SpaceX)


Re: No longer just embedded =9D characters in blackmail emails.

2019-03-21 Thread Martin Gregorie
On Thu, 2019-03-21 at 12:20 -0700, John Hardin wrote:
>
> ...wrong thread? :)
> 
Unfortunately so. For some reason my mail reader's editor (I use
Evolution) locked up on my first attempt to reply and when I got it to
respond again it sent the stupid message containing one blank line.

Then I screwed up by making a second reply attempt to the wrong
message.  Must do better  :-/


Martin




Re: No longer just embedded =9D characters in blackmail emails.

2019-03-21 Thread John Hardin

On Thu, 21 Mar 2019, Martin Gregorie wrote:


On Thu, 2019-03-21 at 09:23 -0700, John Hardin wrote:

On Thu, 21 Mar 2019, Savvas Karagiannidis wrote:


What should be considered is the message's language. All messages
that were
false positives had the following mime encoding (messages were
actually in
greek):

Content-Type: text/[plain|html]; charset="windows-1253" or
Content-Type: text/[plain|html]; charset="iso-8859-7"

while all messages that were actual spam and were properly detected
had:

Content-Type: text/[plain|html]; charset="utf-8"


It should be fairly easy to add an exclusion based on that
information.
However, that information may well be leveraged by spammers who are
using that obfuscation...


FWIW roughly 10% of my spam corpus uses HTML tags to set white text.


...wrong thread? :)

--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174     pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
---
  ...the Constitution and the Bill of Rights exists to protect
  the individual, not the mob.  -- Matt Pickering
---
 721 days since the first commercial re-flight of an orbital booster (SpaceX)


Re: No longer just embedded =9D characters in blackmail emails.

2019-03-21 Thread RW
On Thu, 6 Dec 2018 09:15:59 -0800 (PST)
John Hardin wrote:

> On Wed, 5 Dec 2018, Grant Taylor wrote:

> > Would __UNICODE_TEST_FR run / consume resources even if __LANG_FR
> > evaluates to false?  
> 
> Yes, all the subrules get evaluated. There's no shortcutting because
> a subrule may be used in any number of meta rules.


It's more a case that it's not done because it's not implemented.  __*
rules could, for the most part, be evaluated when they are first needed
in meta rule evaluation and then cached.

IIRC the author of rspamd cites this as the main reason why it's
faster than SpamAssassin. I don't know the details, but I gather it
also does some kind of reordering to minimise the evaluation of
expensive rules.  
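
To make the idea concrete, here is a minimal Python sketch of evaluating subrules only when a meta rule first needs them and then caching the result. It is an illustration only, not SpamAssassin or rspamd code: the LazyRules class and the stand-in rule bodies are hypothetical, and only the rule names come from the example quoted above.

```python
from typing import Callable, Dict


class LazyRules:
    """Hypothetical helper: run a named subrule on first use, then cache it."""

    def __init__(self, rules: Dict[str, Callable[[str], bool]], message: str):
        self.rules = rules
        self.message = message
        self.cache: Dict[str, bool] = {}
        # Count real evaluations so the saving is visible.
        self.calls: Dict[str, int] = {name: 0 for name in rules}

    def hit(self, name: str) -> bool:
        if name not in self.cache:
            self.calls[name] += 1
            self.cache[name] = self.rules[name](self.message)
        return self.cache[name]


# Stand-in rule bodies (assumptions, not real SpamAssassin tests).
rules = {
    "__LANG_FR": lambda msg: "bonjour" in msg.lower(),
    "__UNICODE_TEST_FR": lambda msg: any(ord(ch) > 127 for ch in msg),
}

engine = LazyRules(rules, "Hello, plain English text")

# Equivalent of: meta MY_META_RULE (__LANG_FR && __UNICODE_TEST_FR)
# Python's "and" stops at the first false operand, so the second
# subrule is never evaluated for this message.
my_meta_rule = engine.hit("__LANG_FR") and engine.hit("__UNICODE_TEST_FR")

print(my_meta_rule)
print(engine.calls)
```

Because results are cached per message, a subrule shared by several meta rules would still run at most once, which is the objection to shortcutting raised earlier in the thread.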


Re: No longer just embedded =9D characters in blackmail emails.

2019-03-21 Thread Martin Gregorie
On Thu, 2019-03-21 at 09:23 -0700, John Hardin wrote:
> On Thu, 21 Mar 2019, Savvas Karagiannidis wrote:
> 
> > What should be considered is the message's language. All messages
> > that were 
> > false positives had the following mime encoding (messages were
> > actually in 
> > greek):
> > 
> > Content-Type: text/[plain|html]; charset="windows-1253" or
> > Content-Type: text/[plain|html]; charset="iso-8859-7"
> > 
> > while all messages that were actual spam and were properly detected
> > had:
> > 
> > Content-Type: text/[plain|html]; charset="utf-8"
> 
> It should be fairly easy to add an exclusion based on that
> information. 
> However, that information may well be leveraged by spammers who are
> using that obfuscation...
> 
FWIW roughly 10% of my spam corpus uses HTML tags to set white text.
The ratio of "white" to "#ff" is roughly 1/3 to 2/3. I should say that
some of these messages are quite old - I keep them as test data when
I'm writing new rules: they are NOT used for Bayes training.

My mail archive contains 192540 messages. In theory it contains no spam
apart, that is, from a small amount that has eeled its way in. 145
messages in it contain 'color="white"' and 2293 contain
'color="#ff"'. The combination makes up 1.27% of the archived
messages.

My take is that it may deserve a small score, but it is probably best
used as a subrule.
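
For anyone who wants to check the archive percentage, the arithmetic can be reproduced with a few lines of Python. The counts are the ones given above; the sketch assumes no message contains both attribute forms, which is what simply adding the two counts implies.

```python
archive_size = 192540   # messages in the archive
white_hits = 145        # messages containing color="white"
ff_hits = 2293          # messages containing the "#ff" form

# Assumption: no message matches both patterns, so the counts can be added.
combined = white_hits + ff_hits
share = 100 * combined / archive_size

print(f"{combined} of {archive_size} = {share:.2f}%")
```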
 

Martin




Re: No longer just embedded =9D characters in blackmail emails.

2019-03-21 Thread John Hardin

On Thu, 21 Mar 2019, Savvas Karagiannidis wrote:

What should be considered is the message's language. All messages that were 
false positives had the following mime encoding (messages were actually in 
greek):


Content-Type: text/[plain|html]; charset="windows-1253" or
Content-Type: text/[plain|html]; charset="iso-8859-7"

while all messages that were actual spam and were properly detected had:

Content-Type: text/[plain|html]; charset="utf-8"


It should be fairly easy to add an exclusion based on that information. 
However, that information may well be leveraged by spammers who are using 
that obfuscation...


--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174     pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
---
  USMC Rules of Gunfighting #2: Anything worth shooting
  is worth shooting twice. Ammo is cheap. Your life is expensive.
---
 721 days since the first commercial re-flight of an orbital booster (SpaceX)


Re: No longer just embedded =9D characters in blackmail emails.

2019-03-21 Thread Bill Cole

On 21 Mar 2019, at 10:52, John Wilcock wrote:


On 21/03/2019 at 14:52, John Wilcock wrote:

On 20/03/2019 at 20:19, Bill Cole wrote:
I've added these lines to the block that defines MIXED_ES which may 
help some sites:


 lang pl  score MIXED_ES  0.01
 lang cz  score MIXED_ES  0.01
 lang sk  score MIXED_ES  0.01
 lang hr  score MIXED_ES  0.01
 lang el  score MIXED_ES  0.01

Those should get into the default rules channel within a few days.


All very well, except [...]
Also, there are *lots* of other languages that legitimately use E-like 
characters that should be added to the list (e.g. there's a Cyrillic 
"е", so you can add ru, bg, uk, be, bs, sr, kk, ky, mn, tg and 
others, for a start; ). You'll be fighting a losing battle there...


Actually not a battle I'm fighting...

I have seen direct reports of this rule (which is substantially more 
narrow than just 'has mixed e-like characters') matching ham in the 
above listed languages. I know that on the order of 0.001% of ham in the 
masscheck data submitted to SA Rule QA match the rule and that the bulk 
of that is from a single small corpus (from a Polish source) in which 
~0.5% of ham matches. It appears that occasionally that match rate 
results in a classification false positive, which is a real but small 
and constrained problem.


I have never seen an actual ham message matching the rule, much less had 
access to a mail stream including a steady stream of such messages. I 
have only ever seen vague reports of classification FPs, all of which 
cite the score as 3.999, which has not been accurate for most of the 
lifetime of the rule. As such, I have no real weapons in this battle and 
a foe who is invisible but noisy, to overstretch your analogy.


Individual sites are always free to kill or redefine rules from the 
default set or peg their scores to limit FPs.


--
Bill Cole
b...@scconsult.com or billc...@apache.org
(AKA @grumpybozo and many *@billmail.scconsult.com addresses)
Available For Hire: https://linkedin.com/in/billcole


Re: No longer just embedded =9D characters in blackmail emails.

2019-03-21 Thread Savvas Karagiannidis

Hi all,

I'd like to thank you Bill for looking into this. I was a bit 
disappointed by the way the issue was handled at first on bugzilla.


I must agree that the server's locale could be information to be 
considered but I don't think it solves the issue. I agree that this test 
is effective on catching the type of spam it was intended for. I found a 
number of spam messages caught by this while investigating the issue.


What should be considered is the message's language. All messages that 
were false positives had the following MIME encoding (messages were 
actually in Greek):


Content-Type: text/[plain|html]; charset="windows-1253" or
Content-Type: text/[plain|html]; charset="iso-8859-7"

while all messages that were actual spam and were properly detected had:

Content-Type: text/[plain|html]; charset="utf-8"

I'm afraid I cannot provide any sample of the false positives at the moment.

Hope the above helps. SpamAssassin is a great project and we are trying 
to help improve it.
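
As a rough illustration of the charset observation above, this Python sketch checks whether any text part of a message declares one of the two Greek charsets. It uses only the standard library; the function names and the two sample messages are made up for the example, and it is not how SpamAssassin itself inspects messages.

```python
from email import message_from_string

# The two charsets seen on the Greek false positives.
GREEK_CHARSETS = {"windows-1253", "iso-8859-7"}


def text_charsets(raw_message: str) -> set:
    """Collect the declared charsets of all text/* parts (lower-cased)."""
    msg = message_from_string(raw_message)
    charsets = set()
    for part in msg.walk():
        if part.get_content_maintype() == "text":
            charset = part.get_content_charset()
            if charset:
                charsets.add(charset)
    return charsets


def declares_greek_charset(raw_message: str) -> bool:
    """True if any text part declares windows-1253 or iso-8859-7."""
    return bool(text_charsets(raw_message) & GREEK_CHARSETS)


# Hypothetical minimal messages, just enough to exercise the check.
greek_sample = (
    'Content-Type: text/plain; charset="iso-8859-7"\n'
    "\n"
    "placeholder body\n"
)
utf8_sample = (
    'Content-Type: text/plain; charset="utf-8"\n'
    "\n"
    "placeholder body\n"
)

print(declares_greek_charset(greek_sample))
print(declares_greek_charset(utf8_sample))
```

A check like this only reflects what the sender declares, so a legitimate Greek message sent as UTF-8 would not be covered by it.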


--

Savvas Karagiannidis

On 21/3/2019 16:52, John Wilcock wrote:

On 21/03/2019 at 14:52, John Wilcock wrote:

On 20/03/2019 at 20:19, Bill Cole wrote:
I've added these lines to the block that defines MIXED_ES which may 
help some sites:


 lang pl  score MIXED_ES  0.01
 lang cz  score MIXED_ES  0.01
 lang sk  score MIXED_ES  0.01
 lang hr  score MIXED_ES  0.01
 lang el  score MIXED_ES  0.01

Those should get into the default rules channel within a few days.


All very well, except [...]
Also, there are *lots* of other languages that legitimately use E-like 
characters that should be added to the list (e.g. there's a Cyrillic 
"е", so you can add ru, bg, uk, be, bs, sr, kk, ky, mn, tg and others, 
for a start; ). You'll be fighting a losing battle there...




Re: No longer just embedded =9D characters in blackmail emails.

2019-03-21 Thread John Wilcock

On 21/03/2019 at 14:52, John Wilcock wrote:

On 20/03/2019 at 20:19, Bill Cole wrote:
I've added these lines to the block that defines MIXED_ES which may 
help some sites:


 lang pl  score MIXED_ES  0.01
 lang cz  score MIXED_ES  0.01
 lang sk  score MIXED_ES  0.01
 lang hr  score MIXED_ES  0.01
 lang el  score MIXED_ES  0.01

Those should get into the default rules channel within a few days.


All very well, except [...]
Also, there are *lots* of other languages that legitimately use E-like 
characters that should be added to the list (e.g. there's a Cyrillic 
"е", so you can add ru, bg, uk, be, bs, sr, kk, ky, mn, tg and others, 
for a start; ). You'll be fighting a losing battle there...


--
John


Re: No longer just embedded =9D characters in blackmail emails.

2019-03-21 Thread John Wilcock

On 20/03/2019 at 20:19, Bill Cole wrote:
I've added these lines to the block that defines MIXED_ES which may help 
some sites:


     lang pl  score MIXED_ES  0.01
     lang cz  score MIXED_ES  0.01
     lang sk  score MIXED_ES  0.01
     lang hr  score MIXED_ES  0.01
     lang el  score MIXED_ES  0.01

Those should get into the default rules channel within a few days.


All very well, except that this makes the score depend on the locale of 
the user running SA, which (depending on the glue used) will typically 
NOT match the language(s) understood by the email recipient, let alone 
the language(s) liable to be used by senders. Why should a perfectly 
legitimate message written in Polish be penalised because the server 
locale is set to English?


[That was a rhetorical question, of course – I realise that any rule 
hitting on "foreign" characters used for obfuscation is inherently risky 
for messages written in languages which actually use those characters]


A slightly better solution might be to modify the scores based on the 
value of ok_locales. That would at least allow administrators some 
control over which languages are most acceptable for their users. 
However, AFAICT, that could only be done today using an eval test.


--
John




Re: No longer just embedded =9D characters in blackmail emails.

2019-03-21 Thread Bill Cole

On 20 Mar 2019, at 18:26, Benny Pedersen wrote:


Bill Cole wrote on 2019-03-20 20:19:


lang pl  score MIXED_ES  0.01
lang cz  score MIXED_ES  0.01
lang sk  score MIXED_ES  0.01
lang hr  score MIXED_ES  0.01
lang el  score MIXED_ES  0.01

Those should get into the default rules channel within a few days.


Does lang support rule scores at all?

IMHO lang only supports description.


Opinions are things about which it is good to be humble...

Documented and demonstrable functionality is something else. It is 
neither humble nor proud, it just IS:



# grep -r LOCALE_IS_EN /etc/mail/spamassassin/
/etc/mail/spamassassin/90_lang.cf:lang en describe LOCALE_IS_EN What it says
/etc/mail/spamassassin/90_lang.cf:lang en meta  LOCALE_IS_EN 1
/etc/mail/spamassassin/90_lang.cf:lang en score LOCALE_IS_EN 0.001

# spamassassin -D all < ~/testmail.good 2>&1 | grep LOCALE_IS_EN
Mar 20 19:02:11.266 [59189] dbg: check: tests=AWL,BAYES_40,LOCALE_IS_EN,NO_RECEIVED,NO_RELAYS,SCC_DEBUG,SCC_DEBUG_RAW_LINE,SCC_DEBUG_UP,SCC_DEBUG_WL,T_SCC_BODY_TEXT_LINE
*  0.0 LOCALE_IS_EN What it says
X-Spam-Status: No, score=-1.0 required=5.0 tests=AWL,BAYES_40,LOCALE_IS_EN,





What is the score then for languages that are not listed? Does it default to 1.0?


MIXED_ES is scored by Rule QA. See the message you replied to for its 
current score (as of r1855811.)


'perldoc Mail::SpamAssassin::Conf' is your friend, although I guess it 
might seem friendlier were it in Danish...


Re: No longer just embedded =9D characters in blackmail emails.

2019-03-20 Thread Benny Pedersen

Bill Cole wrote on 2019-03-20 20:19:


lang pl  score MIXED_ES  0.01
lang cz  score MIXED_ES  0.01
lang sk  score MIXED_ES  0.01
lang hr  score MIXED_ES  0.01
lang el  score MIXED_ES  0.01

Those should get into the default rules channel within a few days.


Does lang support rule scores at all?

IMHO lang only supports description.

What is the score then for languages that are not listed? Does it default to 1.0?


Re: No longer just embedded =9D characters in blackmail emails.

2019-03-20 Thread Bill Cole

On 20 Mar 2019, at 9:04, piecka wrote:


Hello

We've encountered a high false positive rate with MIXED_ES rule for emails
written in Czech language. Czech naturally uses all of the e,ě and é.

The situation is similar for Slovak language, which includes e and é.

It seems the same with Greek
(https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7691).

Email messages written in one of the above mentioned (probably even other)
languages have a much higher false positive rate than I would consider
acceptable.


I apologize for this: I am the instigator of MIXED_ES, which has done a 
good job of catching the extortion spam it was designed for and has an 
additional benefit of targeting a generic tactic rather than the moving 
target of phrasing. I would very much like to minimize how often it 
matches on ham.


Unfortunately, I don't have any examples of FPs, only reports of them. 
This makes targeted mitigation very difficult. The Rule QA system has 
masscheck reports of a steady but small number of hits on ham, almost 
all from a single smallish corpus and no more than one message in any 
recent masscheck actually scoring as spam overall.


I've added these lines to the block that defines MIXED_ES which may help 
some sites:


lang pl  score MIXED_ES  0.01
lang cz  score MIXED_ES  0.01
lang sk  score MIXED_ES  0.01
lang hr  score MIXED_ES  0.01
lang el  score MIXED_ES  0.01

Those should get into the default rules channel within a few days.

Additionally, the default score for the rule is 3.999 which is quite high.


The current score quartet (as determined by the Rule QA system) is 
'2.791 2.699 2.791 2.699' and the last time any of those scores was 
3.999 was 3 March. If your system is scoring it at 3.999, you should be 
running sa-update more often.
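
For readers unfamiliar with the four-number form, here is a small Python sketch of how one value of a score quartet is selected. The ordering (no Bayes and no network tests, network tests only, Bayes only, both) follows my reading of the score documentation in Mail::SpamAssassin::Conf; the pick_score helper is hypothetical and exists only for this illustration.

```python
def pick_score(quartet, bayes_enabled, net_enabled):
    """Select the score for the active score set (0..3).

    Assumed ordering, per Mail::SpamAssassin::Conf:
      set 0: Bayes off, network tests off
      set 1: Bayes off, network tests on
      set 2: Bayes on,  network tests off
      set 3: Bayes on,  network tests on
    """
    index = (2 if bayes_enabled else 0) + (1 if net_enabled else 0)
    return quartet[index]


# The quartet quoted above for MIXED_ES.
MIXED_ES_SCORES = (2.791, 2.699, 2.791, 2.699)

print(pick_score(MIXED_ES_SCORES, bayes_enabled=True, net_enabled=True))
print(pick_score(MIXED_ES_SCORES, bayes_enabled=False, net_enabled=False))
```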


Also, I think it should be understood that nearly all SA rules with a 
positive score will match some 'ham' messages. These are "false 
positives" for the individual rule, but usually they are NOT false 
positives for SpamAssassin as a whole.




Re: No longer just embedded =9D characters in blackmail emails.

2019-03-20 Thread Grant Taylor

On 3/20/19 7:04 AM, piecka wrote:

We've encountered a high false positive rate with MIXED_ES rule for emails
written in Czech language … Slovak … Greek …


Do the MIME headers have any indication of the language?

Can you create a __test rule that is then used in a meta rule with 
MIXED_ES?




--
Grant. . . .
unix || die





Re: No longer just embedded =9D characters in blackmail emails.

2019-03-20 Thread Marcin Mirosław
On 20.03.2019 at 15:27, Dominic Raferd wrote:
> On Wed, 20 Mar 2019 at 13:14, piecka wrote:
>>
>> Hello
>>
>> We've encountered a high false positive rate with MIXED_ES rule for emails
>> written in Czech language. Czech naturally uses all of the e,ě and é.
>>
>> The situation is similar for Slovak language, which includes e and é.
>>
>> It seems the same with Greek
>> (https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7691).
>>
>> Email messages written in one of the above mentioned (probably even other)
>> languages have a much higher false positive rate than I would consider
>> acceptable.
>>
>> Additionally, the default score for the rule is 3.999 which is quite high.
>>
>> I don't think the rule is suitable for the default ruleset in the current
>> form.
> 
> I have seen similar problems and agree. I reduced its score with this
> line in /etc/spamassassin/local.cf:
> score MIXED_ES 0.499
> 


MIXED_ES has hits in ham in masscheck:
https://ruleqa.spamassassin.org/20190317-r1855682-n/MIXED_ES/detail
Some of the ham mail in the corpus that triggers MIXED_ES is in Polish.






Re: No longer just embedded =9D characters in blackmail emails.

2019-03-20 Thread Dominic Raferd
On Wed, 20 Mar 2019 at 13:14, piecka wrote:
>
> Hello
>
> We've encountered a high false positive rate with MIXED_ES rule for emails
> written in Czech language. Czech naturally uses all of the e,ě and é.
>
> The situation is similar for Slovak language, which includes e and é.
>
> It seems the same with Greek
> (https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7691).
>
> Email messages written in one of the above mentioned (probably even other)
> languages have a much higher false positive rate than I would consider
> acceptable.
>
> Additionally, the default score for the rule is 3.999 which is quite high.
>
> I don't think the rule is suitable for the default ruleset in the current
> form.

I have seen similar problems and agree. I reduced its score with this
line in /etc/spamassassin/local.cf:
score MIXED_ES 0.499


Re: No longer just embedded =9D characters in blackmail emails.

2019-03-20 Thread piecka
Hello

We've encountered a high false positive rate with MIXED_ES rule for emails
written in Czech language. Czech naturally uses all of the e,ě and é.

The situation is similar for Slovak language, which includes e and é.

It seems the same with Greek
(https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7691).

Email messages written in one of the above mentioned (probably even other)
languages have a much higher false positive rate than I would consider
acceptable.

Additionally, the default score for the rule is 3.999 which is quite high.

I don't think the rule is suitable for the default ruleset in the current
form.
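
To show why ordinary Czech text contains several e-like letters at once, here is a small Python sketch that lists the distinct letters in a string whose Unicode base letter is "e". It is only an illustration: it is not the definition of MIXED_ES, the e_like_letters helper and the sample sentences are made up for the example, and it covers accented Latin letters only.

```python
import unicodedata


def e_like_letters(text):
    """Return the distinct letters whose NFD base character is 'e'."""
    found = set()
    for ch in text:
        # NFD splits an accented letter into base letter + combining marks.
        decomposed = unicodedata.normalize("NFD", ch)
        if decomposed[0].lower() == "e":
            found.add(ch.lower())
    return found


czech = "Děkuji, je to velmi pěkné."
english = "Thank you, that is very nice."

print(sorted(e_like_letters(czech)))
print(sorted(e_like_letters(english)))
```

The Czech sentence yields three distinct e-like letters while the English one yields only the plain "e", so any test keyed on a mix of such letters needs to account for languages like Czech and Slovak.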





Re: No longer just embedded =9D characters in blackmail emails.

2018-12-06 Thread John Hardin

On Wed, 5 Dec 2018, Grant Taylor wrote:


On 12/5/18 5:43 PM, John Hardin wrote:
Potentially, but it's hard to use something like that in regular rule REs. 
That sort of smarts would probably need to be in a plugin.


Maybe (from my naive point of view) if not probably (from your more 
experienced point of view).


I would think that it would be possible to build meta rules that would depend 
on a __LANG_EN or __LANG_FR type rule that would then apply the various logic 
about the different ratios of different permutations.


Hm. Possibly, yes.

Which brings up a question:  Do all (sub)rules get evaluated in a meta rule? 
Or is there an order of operations to them?  I.e.:


meta   MY_META_RULE   (__LANG_FR && __UNICODE_TEST_FR)

Would __UNICODE_TEST_FR run / consume resources even if __LANG_FR evaluates 
to false?


Yes, all the subrules get evaluated. There's no shortcutting because a 
subrule may be used in any number of meta rules.


--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174     pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
---
 Tomorrow: The 77th anniversary of Pearl Harbor


Re: No longer just embedded =9D characters in blackmail emails.

2018-12-05 Thread Bill Cole
On 5 Dec 2018, at 22:29, Grant Taylor wrote:

> On 12/5/18 7:55 PM, Bill Cole wrote:
>> Yes. There is no automatic 'shortcircuiting' of rules.
>
> Okay.
>
> You say "automatic".  Is there a "non-automatic" way?  :-)

perldoc Mail::SpamAssassin::Plugin::Shortcircuit


-- 
Bill Cole




Re: No longer just embedded =9D characters in blackmail emails.

2018-12-05 Thread Grant Taylor

On 12/5/18 7:55 PM, Bill Cole wrote:

Yes. There is no automatic 'shortcircuiting' of rules.


Okay.

You say "automatic".  Is there a "non-automatic" way?  :-)



--
Grant. . . .
unix || die





Re: No longer just embedded =9D characters in blackmail emails.

2018-12-05 Thread Bill Cole
On 5 Dec 2018, at 20:37, Grant Taylor wrote:

> On 12/5/18 5:43 PM, John Hardin wrote:
>> Potentially, but it's hard to use something like that in regular rule REs. 
>> That sort of smarts would probably need to be in a plugin.
>
> Maybe (from my naive point of view) if not probably (from your more 
> experienced point of view).
>
> I would think that it would be possible to build meta rules that would depend 
> on a __LANG_EN or __LANG_FR type rule that would then apply the various logic 
> about the different ratios of different permutations.
>
> Which brings up a question:  Do all (sub)rules get evaluated in a meta rule?

Yes. There is no automatic 'shortcircuiting' of rules.

-- 
Bill Cole




Re: No longer just embedded =9D characters in blackmail emails.

2018-12-05 Thread Grant Taylor

On 12/5/18 5:43 PM, John Hardin wrote:
Potentially, but it's hard to use something like that in regular rule 
REs. That sort of smarts would probably need to be in a plugin.


Maybe (from my naive point of view) if not probably (from your more 
experienced point of view).


I would think that it would be possible to build meta rules that would 
depend on a __LANG_EN or __LANG_FR type rule that would then apply the 
various logic about the different ratios of different permutations.


Which brings up a question:  Do all (sub)rules get evaluated in a meta 
rule?  Or is there an order of operations to them?  I.e.:


meta   MY_META_RULE   (__LANG_FR && __UNICODE_TEST_FR)

Would __UNICODE_TEST_FR run / consume resources even if __LANG_FR 
evaluates to false?




--
Grant. . . .
unix || die





Re: No longer just embedded =9D characters in blackmail emails.

2018-12-05 Thread John Hardin

On Wed, 5 Dec 2018, Grant Taylor wrote:


On 12/05/2018 03:27 PM, John Hardin wrote:

Take a look at replace_rules in the repo (both standard and sandboxes).


Thank you for the reference.  replace_rules look very intriguing.

Link - Mail::SpamAssassin::Plugin::ReplaceTags - tags for SpamAssassin rules -
https://spamassassin.apache.org/full/3.1.x/doc/Mail_SpamAssassin_Plugin_ReplaceTags.html


I could see myself using this for a number of things.  (If / when there was 
sufficient spam to warrant.)


The unicode replacements are fairly stable, it's looking for specific 
obfuscated words (like "bitcoin") that's whack-a-mole.


I'll have to research this.

The problem there is, that's really strongly biased towards English text. 
Spanish and French, for example, would have ASCII, but it would also have a 
fairly high proportion of accented characters.


Fair concern.  I'm going to say that I am (more than) a bit naive about that. 
I thought there was something that included a language in a header (possibly 
one of the MIME headers) that could be used to refine the logic.


Potentially, but it's hard to use something like that in regular rule REs. 
That sort of smarts would probably need to be in a plugin.


--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174     pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
---
  The yardstick you should use when considering whether to support a
  given piece of legislation is "what if my worst enemy is chosen to
  administer this law?"
---
 2 days until The 77th anniversary of Pearl Harbor


Re: No longer just embedded =9D characters in blackmail emails.

2018-12-05 Thread Grant Taylor

On 12/05/2018 03:27 PM, John Hardin wrote:

Take a look at replace_rules in the repo (both standard and sandboxes).


Thank you for the reference.  replace_rules look very intriguing.

Link - Mail::SpamAssassin::Plugin::ReplaceTags - tags for SpamAssassin rules -
https://spamassassin.apache.org/full/3.1.x/doc/Mail_SpamAssassin_Plugin_ReplaceTags.html


I could see myself using this for a number of things.  (If / when there 
was sufficient spam to warrant.)


The unicode replacements are fairly stable, it's looking for specific 
obfuscated words (like "bitcoin") that's whack-a-mole.


I'll have to research this.

The problem there is, that's really strongly biased towards English text. 
Spanish and French, for example, would have ASCII, but it would also 
have a fairly high proportion of accented characters.


Fair concern.  I'm going to say that I am (more than) a bit naive about 
that.  I thought there was something that included a language in a 
header (possibly one of the MIME headers) that could be used to refine 
the logic.




--
Grant. . . .
unix || die





Re: No longer just embedded =9D characters in blackmail emails.

2018-12-05 Thread John Hardin

On Wed, 5 Dec 2018, Bill Cole wrote:


On 5 Dec 2018, at 16:45, John Hardin wrote:

Those aren't zero-width, those are just standard Unicode obfuscations of 
regular ASCII text.


Not precisely. In this case they seem to all be Cyrillic characters which 
happen to look like Latin characters that have ASCII encodings. It's not 
possible to obfuscate actual printables in UTF-8 encoding because the UTF-8 
encoding of any ASCII printable is the same 8 bits as its ASCII encoding.


Sorry, I was sloppy in my wording. What you described is what I was 
referring to - obfuscating ASCII text with non-ASCII glyphs that look 
sufficiently similar to not interrupt the flow of reading.



--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174     pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
---
  USMC Rules of Gunfighting #12: Have a plan.
  USMC Rules of Gunfighting #13: Have a back-up plan, because
  the first one won't work.
---
 2 days until The 77th anniversary of Pearl Harbor


Re: No longer just embedded =9D characters in blackmail emails.

2018-12-05 Thread Bill Cole

On 5 Dec 2018, at 16:45, John Hardin wrote:

Those aren't zero-width, those are just standard Unicode obfuscations 
of regular ASCII text.


Not precisely. In this case they seem to all be Cyrillic characters 
which happen to look like Latin characters that have ASCII encodings. 
It's not possible to obfuscate actual printables in UTF-8 encoding 
because the UTF-8 encoding of any ASCII printable is the same 8 bits as 
its ASCII encoding.
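
A small Python sketch of this point, for illustration only (the helper name utf8_hex is hypothetical): an ASCII letter keeps its single byte under UTF-8, while the Cyrillic lookalike is a different code point with a two-byte encoding, which is what shows up as =D0=B5 in the quoted-printable body.

```python
import unicodedata


def utf8_hex(ch: str) -> str:
    """Return the UTF-8 bytes of a single character as spaced upper-case hex."""
    return " ".join(f"{byte:02X}" for byte in ch.encode("utf-8"))


if __name__ == "__main__":
    # Latin "e" and the Cyrillic letter that merely looks like it.
    for ch in ("e", "\u0435"):
        print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {utf8_hex(ch)}")
```

The two characters render almost identically in most fonts, so the difference is only visible at the code point or byte level.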


--
Bill Cole
Noted Pedant


Re: No longer just embedded =9D characters in blackmail emails.

2018-12-05 Thread John Hardin

On Wed, 5 Dec 2018, Grant Taylor wrote:


On 12/05/2018 02:45 PM, John Hardin wrote:
I've added a "too many [ascii][unicode][ascii]" rule based on that but I 
suspect it will be pretty FP-prone and will be pretty large if we want to 
avoid whack-a-mole syndrome. For this, normalize + bayes is probably the 
best bet.


Is it possible to detect when a Unicode code point is being used in place of 
an ASCII / ANSI character specifically to avoid pattern detection?  I.e. 
multiple Unicode code points that represent or are otherwise a stand in for 
an ASCII / ANSI "a"?


Take a look at replace_rules in the repo (both standard and sandboxes).


Or is keeping up with this list tantamount to whack-a-mole?


The unicode replacements are fairly stable; it's looking for specific 
obfuscated words (like "bitcoin") that's whack-a-mole.
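
To illustrate the split between a stable replacement table and a changing word list, here is a rough Python sketch. It is not how replace_rules or ReplaceTags are implemented; the table LOOKALIKES and both function names are hypothetical, and the table only covers the Cyrillic letters seen in the sample message.

```python
import re

# Hypothetical, deliberately tiny lookalike table (Cyrillic -> Latin).
# A real table would be much larger; this part changes rarely.
LOOKALIKES = str.maketrans({
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
    "\u0440": "p",  # CYRILLIC SMALL LETTER ER
    "\u0441": "c",  # CYRILLIC SMALL LETTER ES
    "\u041e": "O",  # CYRILLIC CAPITAL LETTER O
})

# The word list is the whack-a-mole part: it has to follow the spammers.
BITCOIN = re.compile(r"\bbitcoin\b", re.IGNORECASE)


def fold_lookalikes(text: str) -> str:
    """Map known lookalike characters back to their ASCII counterparts."""
    return text.translate(LOOKALIKES)


def mentions_bitcoin(text: str) -> bool:
    """Match the target word after folding lookalikes to ASCII."""
    return bool(BITCOIN.search(fold_lookalikes(text)))
```

With this sketch, "bitc\u043ein" (Cyrillic o) matches after folding, while the unfolded text would not match the plain pattern.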


I would think that too high of a percentage of Unicode when bog standard 
ASCII / ANSI would suffice would be an indication in and of itself.  I'm not 
seeing how legitimate (non-spam) email would trigger a false positive if the 
percentage was tuned correctly.


The problem there is, that's really strongly biased towards English text. 
Spanish and French, for example, would have ASCII, but it would also have 
a fairly high proportion of accented characters.
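
A quick way to see the bias, as a hedged Python sketch (the function name non_ascii_ratio is hypothetical, and the French sentence is contrived to be accent-heavy):

```python
def non_ascii_ratio(text: str) -> float:
    """Fraction of alphabetic characters that fall outside ASCII."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    return sum(ord(ch) > 127 for ch in letters) / len(letters)


if __name__ == "__main__":
    english = "I will send the payment tomorrow."
    french = "Élève très zélé, déjà été récompensé."
    print(f"english: {non_ascii_ratio(english):.2f}")
    print(f"french:  {non_ascii_ratio(french):.2f}")
```

Ordinary French prose would score lower than this contrived sentence, but still well above English, so any single threshold tuned on English mail is likely to misfire on other languages.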



--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174     pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
---
  The problem is when people look at Yahoo, slashdot, or groklaw and
  jump from obvious and correct observations like "Oh my God, this
  place is teeming with utter morons" to incorrect conclusions like
  "there's nothing of value here".     -- Al Petrofsky, in Y! SCOX
---
 2 days until The 77th anniversary of Pearl Harbor


Re: No longer just embedded =9D characters in blackmail emails.

2018-12-05 Thread Kevin A. McGrail
On 12/5/2018 4:50 PM, Grant Taylor wrote:
> On 12/05/2018 02:45 PM, John Hardin wrote:
>> I've added a "too many [ascii][unicode][ascii]" rule based on that
>> but I suspect it will be pretty FP-prone and will be pretty large if
>> we want to avoid whack-a-mole syndrome. For this, normalize + bayes
>> is probably the best bet.
>
> Is it possible to detect when a Unicode code point is being used in
> place of an ASCII / ANSI character specifically to avoid pattern
> detection?  I.e. multiple Unicode code points that represent or are
> otherwise a stand in for an ASCII / ANSI "a"?
>
> Or is keeping up with this list tantamount to whack-a-mole?
>
> I would think that too high of a percentage of Unicode when bog
> standard ASCII / ANSI would suffice would be an indication in and of
> itself.  I'm not seeing how legitimate (non-spam) email would trigger
> a false positive if the percentage was tuned correctly.
>
Yes, look at KAM.cf and the Replace Tags feature used in KAM_CRIM rules
as well as the SCC_SHORT_WORDS.  Both rules are designed to catch this
exact type of obfuscation.  One is more specific (CRIM) and one more
generic (SHORT_WORDS).


Re: No longer just embedded =9D characters in blackmail emails.

2018-12-05 Thread Grant Taylor

On 12/05/2018 02:45 PM, John Hardin wrote:
I've added a "too many [ascii][unicode][ascii]" rule based on that but I 
suspect it will be pretty FP-prone and will be pretty large if we want 
to avoid whack-a-mole syndrome. For this, normalize + bayes is probably 
the best bet.


Is it possible to detect when a Unicode code point is being used in 
place of an ASCII / ANSI character specifically to avoid pattern 
detection?  I.e. multiple Unicode code points that represent or are 
otherwise a stand in for an ASCII / ANSI "a"?


Or is keeping up with this list tantamount to whack-a-mole?

I would think that too high of a percentage of Unicode when bog standard 
ASCII / ANSI would suffice would be an indication in and of itself.  I'm 
not seeing how legitimate (non-spam) email would trigger a false 
positive if the percentage was tuned correctly.




--
Grant. . . .
unix || die





Re: No longer just embedded =9D characters in blackmail emails.

2018-12-05 Thread John Hardin

On Wed, 5 Dec 2018, Mark London wrote:



No longer just embedded =9D characters.

From: =?utf-8?B?bmlnaHRt0LByZQ==?= 
To: 
Subject: You are my  victim.
Date: Tue, 4 Dec 2018 15:56:36 -0800
MIME-Version: 1.0
Content-Type: multipart/alternative;
boundary="a0d0993ce53319101c19af03d5311b0976b26b"
X-Scanned-By: MIMEDefang 2.79 on 18.18.166.11

--a0d0993ce53319101c19af03d5311b0976b26b
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: quoted-printable

Hi, my pr=D0=B5y.

This is my last warning.

I write you inasmuch as I put a virus on the web page with porno which yo=
u have viewed.
My tr=D0=BEjan c=D0=B0=D1=80tured all y=D0=BEur =D1=80rivat=D0=B5 dat=D0=B0=
=D0=B0nd switched on your c=D0=B0mer=D0=B0 which r=D0=B5=D1=81=D0=BErded=


...etc

Those aren't zero-width, those are just standard Unicode obfuscations of 
regular ASCII text. The _ZW rule isn't intended to catch that.
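
For anyone who wants to check this against the sample, a short Python sketch using only the standard library decodes the first body line and the From display name quoted above. The helper names decode_qp and non_ascii_chars are hypothetical.

```python
import quopri
from email.header import decode_header, make_header


def decode_qp(line: bytes) -> str:
    """Decode one quoted-printable line whose charset is UTF-8."""
    return quopri.decodestring(line).decode("utf-8")


def non_ascii_chars(text: str) -> list:
    """Sorted list of the distinct non-ASCII characters in text."""
    return sorted({ch for ch in text if ord(ch) > 127})


# First body line and the From display name from the sample message.
greeting = decode_qp(b"Hi, my pr=D0=B5y.")
sender = str(make_header(decode_header("=?utf-8?B?bmlnaHRt0LByZQ==?=")))

if __name__ == "__main__":
    for label, text in (("body", greeting), ("from", sender)):
        codes = [f"U+{ord(ch):04X}" for ch in non_ascii_chars(text)]
        print(label, repr(text), codes)
```

The only non-ASCII characters that come out are U+0435 in the body line and U+0430 in the sender name, both visible Cyrillic letters rather than zero-width characters.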


I've added a "too many [ascii][unicode][ascii]" rule based on that but I 
suspect it will be pretty FP-prone and will be pretty large if we want to 
avoid whack-a-mole syndrome. For this, normalize + bayes is probably the 
best bet.
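
As a rough idea of what an "[ascii][unicode][ascii]" count looks like, and why it is FP-prone, here is a Python sketch. It is an assumption-laden stand-in, not the rule that was added; the pattern SANDWICH and the function name are hypothetical.

```python
import re

# One non-ASCII character with an ASCII letter directly on each side.
# Lookarounds are used so that neighbouring hits can share a letter.
SANDWICH = re.compile(r"(?<=[A-Za-z])[^\x00-\x7f](?=[A-Za-z])")


def sandwiched_count(text: str) -> int:
    """Count non-ASCII characters sandwiched between two ASCII letters."""
    return len(SANDWICH.findall(text))
```

The obfuscated "tr\u043ejan" scores 1, but so does the legitimate French word "tr\u00e8s", so a threshold on this count alone would hit ordinary accented text.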


I've added some of the new phrases from that to the bitcoin extort 
components.


--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174     pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
---
  The call to let 16-year-olds vote is a call to amplify the votes
  of teachers' unions. If you think political indoctrination in the
  schools is bad now, wait until it has the direct power to tip
  election results.   -- Robert Tracinski
---
 2 days until The 77th anniversary of Pearl Harbor


Re: No longer just embedded =9D characters in blackmail emails.

2018-12-05 Thread Bill Cole

On 5 Dec 2018, at 11:45, Mark London wrote:

The __UNICODE_OBFU_ZW rule is not being triggered on this email. Maybe 
it needs updating? - Mark


FWIW, I just added a "MIXED_ES" rule to my sandbox which catches 
anything with a suspiciously large number of characters that are 
visually like 'e' but are not ASCII 'e' in mail with a significant 
number of literal 'e' characters. Once we've got the recent masscheck 
trouble ironed out, I expect this could make it to the default 
ruleset...
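
The idea can be sketched in Python as follows. This is only an illustration of the description above, not the sandbox rule itself: the lookalike set, the thresholds, and the name mixed_es are all hypothetical.

```python
# Hypothetical minimal set of characters that look like "e" but are not ASCII "e".
E_LOOKALIKES = {
    "\u0435",  # CYRILLIC SMALL LETTER IE
    "\u0415",  # CYRILLIC CAPITAL LETTER IE
}


def mixed_es(text: str, min_fake: int = 3, min_real: int = 5) -> bool:
    """True when lookalike e's appear alongside plenty of literal ASCII e's.

    Both thresholds are made-up values for illustration.
    """
    fake = sum(ch in E_LOOKALIKES for ch in text)
    real = text.count("e")
    return fake >= min_fake and real >= min_real
```

Requiring a significant number of literal 'e' characters is what separates mixed-script obfuscation from mail that is simply written in another script.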




Re: No longer just embedded =9D characters in blackmail emails.

2018-12-05 Thread John Hardin

On Wed, 5 Dec 2018, Mark London wrote:

The __UNICODE_OBFU_ZW rule is not being triggered on this email. Maybe it 
needs updating? - Mark


Will do, I don't have a zero response time as much as I wish I did... :)


--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174     pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
---
 2 days until The 77th anniversary of Pearl Harbor


Re: No longer just embedded =9D characters in blackmail emails.

2018-12-05 Thread Mark London
The __UNICODE_OBFU_ZW rule is not being triggered on this email. Maybe 
it needs updating? - Mark


On 12/5/2018 11:19 AM, Mark London wrote:

No longer just embedded =9D characters.

From: =?utf-8?B?bmlnaHRt0LByZQ==?= 
To: 
Subject: You are my  victim.
Date: Tue, 4 Dec 2018 15:56:36 -0800
MIME-Version: 1.0
Content-Type: multipart/alternative;
boundary="a0d0993ce53319101c19af03d5311b0976b26b"
X-Scanned-By: MIMEDefang 2.79 on 18.18.166.11

--a0d0993ce53319101c19af03d5311b0976b26b
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: quoted-printable

Hi, my pr=D0=B5y.

This is my last warning.

I write you inasmuch as I put a virus on the web page with porno which yo=
u have viewed.
My tr=D0=BEjan c=D0=B0=D1=80tured all y=D0=BEur =D1=80rivat=D0=B5 dat=D0=B0=
  =D0=B0nd switched on your c=D0=B0mer=D0=B0 which r=D0=B5=D1=81=D0=BErded=
  the =D0=B0=D1=81t of your solit=D0=B0ry s=D0=B5x. Just aft=D0=B5r that t=
h=D0=B5 troj=D0=B0n saved y=D0=BEur =D1=81=D0=BEnt=D0=B0=D1=81t list.
I will =D0=B5r=D0=B0se th=D0=B5 com=D1=80romising vide=D0=BE r=D0=B5c=D0=BE=
rds and inf=D0=BErmati=D0=BEn if you s=D0=B5nd me 444 EURO in bitcoin.
This is addr=D0=B5ss for p=D0=B0yment :=C2=A0 1HpREEx9iJ9gK3Xk5vVs9R1XBEm=
2hrCZp7

I give y=D0=BEu 30 h=D0=BEurs aft=D0=B5r y=D0=BEu =D0=BEpen my m=D0=B5ss=D0=
=B0ge for m=D0=B0king the =D1=80=D0=B0ym=D0=B5nt.
As s=D0=BEon =D0=B0s you read th=D0=B5 mess=D0=B0ge I'll s=D0=B5e it righ=
t aw=D0=B0y.
It is not ne=D1=81=D0=B5ss=D0=B0ry to t=D0=B5ll m=D0=B5 that you h=D0=B0v=
=D0=B5 s=D0=B5nt money to me. This =D0=B0ddress is conn=D0=B5cted t=D0=BE=
  you, my syst=D0=B5m will =D0=B5rased aut=D0=BEmatic=D0=B0lly =D0=B0fter =
tr=D0=B0nsfer confirmati=D0=BEn.
If you n=D0=B5ed 48h just =D0=9Epen the =D1=81alcul=D0=B0tor =D0=BEn y=D0=
=BEur d=D0=B5skto=D1=80 =D0=B0nd =D1=80r=D0=B5ss +++
If y=D0=BEu don't =D1=80=D0=B0y, I'll send dirt t=D0=BE all y=D0=BEur c=D0=
=BEnta=D1=81ts.=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0
L=D0=B5t m=D0=B5 r=D0=B5mind y=D0=BEu-I s=D0=B5=D0=B5 wh=D0=B0t you're do=
ing!
Y=D0=BEu c=D0=B0n visit th=D0=B5 poli=D1=81e =D0=BEffice but =D0=B0nyb=D0=
=BEdy =D1=81an't h=D0=B5lp y=D0=BEu.
If you try t=D0=BE dec=D0=B5iv=D0=B5 me , I'll kn=D0=BEw it immediat=D0=B5=
ly!
I d=D0=BEn't liv=D0=B5 in your c=D0=BEuntry. S=D0=BE anyone =D1=81=D0=B0n=
  n=D0=BEt track my l=D0=BE=D1=81ati=D0=BEn =D0=B5ven f=D0=BEr 9 months.
by=D0=B5. D=D0=BEn't forget about th=D0=B5 sh=D0=B0m=D0=B5 =D0=B0nd t=D0=BE=
  ignor=D0=B5, Y=D0=BEur life =D1=81=D0=B0n be ruined.

_=