Re: Detect Emoticons in Subject

2021-05-21 Thread RW
On Thu, 20 May 2021 19:39:06 +0100
RW wrote:

> /\xF0\x9F(?:\x98[\x80-\xBF]|\x99[\x80-\x8F])|\xF0\x9F(?:[\xA4-\xA6][\x80-\xBF]|\xA7[\x80-\xBF])|\xE2\x98[\xB9-\xBB]/


This includes the block mentioned by Bill Cole and is simplified a bit:


/\xF0\x9F[\x98-\x99\xA4-\xA7\x8C-\x97][\x80-\xBF]|\xE2\x98[\xB9-\xBB]/


However, if you don't expect to get any legitimate mail with Asian
languages in the subject, you can probably get away with including all
4-byte UTF-8. Those code points are dominated by CJK, symbols, emojis
and dead languages.


/[\xF0-\xF7][\x80-\xBF]{3}|\xE2\x98[\xB9-\xBB]/
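
As a rough check of that trade-off, the broad pattern can be tried outside SpamAssassin. The sketch below uses Python's `re` on raw bytes purely as a stand-in for SA's byte-oriented matching; the sample subjects and the names `ANY_4BYTE` and `hits` are made up for illustration.

```python
import re

# Same bytes as the rule above: any 4-byte UTF-8 sequence, or U+2639..U+263B.
ANY_4BYTE = re.compile(rb'[\xF0-\xF7][\x80-\xBF]{3}|\xE2\x98[\xB9-\xBB]')

def hits(subject: str) -> bool:
    """Return True if the UTF-8 encoding of subject matches the pattern."""
    return ANY_4BYTE.search(subject.encode('utf-8')) is not None

samples = {
    'plain ascii': 'Your invoice is attached',
    'emoji': 'Big sale \U0001F389',        # U+1F389 PARTY POPPER, 4 bytes
    'old-style face': 'hello \u263A',      # U+263A WHITE SMILING FACE, 3 bytes
    'common CJK': '\u65E5\u672C\u8A9E',    # 3-byte characters, not matched
    'rare CJK': '\U00020000',              # CJK Extension B, 4 bytes, matched
}

for label, subject in samples.items():
    print(f'{label:15} {hits(subject)}')
```

Everyday CJK text sits in the 3-byte range and is left alone; only the rarer supplementary-plane characters are caught along with the emoji.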


Re: Detect Emoticons in Subject

2021-05-21 Thread Henrik K
On Fri, May 21, 2021 at 09:53:36AM +0200, Tom Hendrikx wrote:
>
> Can someone explain why SA cannot support this type of syntax, or what would
> be needed to get it supported? IMHO it makes it a lot easier for end-users
> to understand a rule, and for rule developers to write or even contribute
> new UTF-8-related rules, so it might be worth the effort to get it
> supported?

Perl strings internally would have to be UTF8.  Mandatory prerequisite would
be normalize_charset 1 in SA.  Could be some cases where SA can't decode
mails properly to UTF8, so it's a question mark what happens then.
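
To illustrate the distinction (in Python rather than Perl, and only as an analogy, not a description of SA internals): a character-range pattern only works once the text has been decoded into characters. Against undecoded bytes, every byte is its own "character" and none of them falls in the emoji range.

```python
import re

subject = 'Hi \U0001F600'            # U+1F600 GRINNING FACE
raw = subject.encode('utf-8')        # the bytes as they arrive

# Character-level pattern: the whole Emoticons block as one range.
char_rule = re.compile('[\U0001F600-\U0001F64F]')

# Decoded text: the emoji is a single character inside the range.
decoded_hit = char_rule.search(raw.decode('utf-8')) is not None

# Undecoded bytes viewed one byte per character (latin-1 maps 1:1):
# the emoji becomes four separate characters, none above U+00FF.
undecoded_hit = char_rule.search(raw.decode('latin-1')) is not None

print(decoded_hit, undecoded_hit)
```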

Some changes are coming already in 4.0, for example normalize_charset 1 will
be default.  But more complex internal/rule changes require a lot of thought
on how to maintain backwards compatibility.  I'm sure some people will still
run 3.4 for years to come.

Sorry to say but there are too few developers right now.  It's up to the
community to pick up the pace.



Re: Detect Emoticons in Subject

2021-05-21 Thread Tom Hendrikx

On 20-05-2021 18:19, RW wrote:
> On Thu, 20 May 2021 11:42:59 -0400
> Clive Jacques wrote:
>
> > Hi,
> >
> > I've been using SA a long time.  Lately, I'm getting more and more
> > spam with emoticons in the subject line.  I'd say about 90% of my
> > emails with emoticons in the subject are spam.  I'd like to create a
> > local rule which scores email with emoticons in the subject.
> >
> > # Local Rule for Emoticons in subject
> > subject        EMOTICON_IN_SUBJECT      Subject =~ /\p{Emoticons}/
>
> The rule should start with "header", that's what's causing the lint
> failure.
>
> However, AFAIK, the rule still won't work because \p{Emoticons}
> isn't supported in spamassassin, which works on byte sequences. You
> need to rewrite it to match UTF-8 bytes.



I'm not a real fan of very complex regular expressions, as they tend to 
get hard to read/understand very quickly. This thread is a perfect 
example: the syntax that the OP proposed (/\p{Emoticons}/) seems 
perfectly readable, and all the actually working alternatives are, with 
all respect to the authors, a nightmare to decipher. Especially for 
users not really proficient in regular expressions, the OP's syntax is 
perfectly understandable and all the alternatives aren't.


I'm not really familiar with the regex engine of perl/SA, so please correct
me if I'm wrong. The /\p{Emoticons}/ syntax seems to me to be a builtin
feature of the regex spec/perl (as opposed to pseudo-code, displaying
something that doesn't actually exist).


Can someone explain why SA cannot support this type of syntax, or what 
would be needed to get it supported? IMHO it makes it a lot easier for 
end-users to understand a rule, and for rule developers to write or even 
contribute new UTF-8-related rules, so it might be worth the effort to 
get it supported?


Thanks in advance,
Tom


Re: Detect Emoticons in Subject: CHAOS

2021-05-20 Thread Benny Pedersen

On 2021-05-20 22:33, Clive Jacques wrote:

> Here is a good example of such an email (attached, stripped of
> identifying info).


This attachment is suspicious because its type doesn't match the type 
declared in the message. If you do not trust the sender, you shouldn't 
open it in the browser because it may contain malicious contents.


Expected: text/plain (.txt); found: message/rfc822 (.eml)

should i ignore roundcube warnings ? :)


Re: Detect Emoticons in Subject: CHAOS

2021-05-20 Thread RW
On Thu, 20 May 2021 15:35:21 -0400
Jared Hall wrote:

> Clive Jacques wrote:

> > # Local Rule for Emoticons in subject
> > subject        EMOTICON_IN_SUBJECT      Subject =~ /\p{Emoticons}/

> 
> The following regex will detect a good amount of Emojis:
> 
> /[\u{1f300}-\u{1f5ff}\u{1f900}-\u{1f9ff}\u{1f600}-\u{1f64f}\u{1f680}-\u{1f6ff}\u{2600}-\u{26ff}\u{2700}-\u{27bf}\u{1f1e6}-\u{1f1ff}\u{1f191}-\u{1f251}\u{1f004}\u{1f0cf}\u{1f170}-\u{1f171}\u{1f17e}-\u{1f17f}\u{1f18e}\u{3030}\u{2b50}\u{2b55}\u{2934}-\u{2935}\u{2b05}-\u{2b07}\u{2b1b}-\u{2b1c}\u{3297}\u{3299}\u{303d}\u{00a9}\u{00ae}\u{2122}\u{23f3}\u{24c2}\u{23e9}-\u{23ef}\u{25b6}\u{23f8}-\u{23fa}]/ug
>
That doesn't work in SA for the same reason that \p{Emoticons}
doesn't work.


Re: Detect Emoticons in Subject: CHAOS

2021-05-20 Thread Jared Hall

Clive Jacques wrote:

> Hi,
>
> I've been using SA a long time.  Lately, I'm getting more and more
> spam with emoticons in the subject line.  I'd say about 90% of my
> emails with emoticons in the subject are spam.  I'd like to create a
> local rule which scores email with emoticons in the subject.  I saw a
> previous discussion on this in the archive, but it was focused on
> whether such emails were /always/ spam.  I think an emoticon rule, in
> combination with other rules, will help my installation.  I've tried
> to match as follows, but it won't lint.  I'm not really a perl
> programmer.  I've written several other more conventional local rules,
> but here I'm a bit out of my depth.  I'd appreciate some guidance.
>
> # Local Rule for Emoticons in subject
> subject        EMOTICON_IN_SUBJECT      Subject =~ /\p{Emoticons}/
> score          EMOTICON_IN_SUBJECT      3.0
> describe       EMOTICON_IN_SUBJECT      Subject Line Has Emoticons
>
> -CJ


The following regex will detect a good amount of Emojis:

/[\u{1f300}-\u{1f5ff}\u{1f900}-\u{1f9ff}\u{1f600}-\u{1f64f}\u{1f680}-\u{1f6ff}\u{2600}-\u{26ff}\u{2700}-\u{27bf}\u{1f1e6}-\u{1f1ff}\u{1f191}-\u{1f251}\u{1f004}\u{1f0cf}\u{1f170}-\u{1f171}\u{1f17e}-\u{1f17f}\u{1f18e}\u{3030}\u{2b50}\u{2b55}\u{2934}-\u{2935}\u{2b05}-\u{2b07}\u{2b1b}-\u{2b1c}\u{3297}\u{3299}\u{303d}\u{00a9}\u{00ae}\u{2122}\u{23f3}\u{24c2}\u{23e9}-\u{23ef}\u{25b6}\u{23f8}-\u{23fa}]/ug



Ref: 
https://stackoverflow.com/questions/43242440/javascript-unicode-emoji-regular-expressions/45138005#45138005


But it is not the greatest thing if you want to get a count out of that.
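
If a count is what is wanted, one rough way to get it outside SA is to count non-overlapping matches of a byte pattern. This is only a sketch in Python: the pattern is the broad "any 4-byte UTF-8" one posted in this thread, and `count_emoji_like` is a hypothetical helper name.

```python
import re

# Any 4-byte UTF-8 sequence, or the three old faces U+2639..U+263B.
PATTERN = re.compile(rb'[\xF0-\xF7][\x80-\xBF]{3}|\xE2\x98[\xB9-\xBB]')

def count_emoji_like(subject: str) -> int:
    """Count non-overlapping pattern matches in the UTF-8 bytes of subject."""
    return len(PATTERN.findall(subject.encode('utf-8')))

# Two U+1F525 FIRE and one U+1F4B0 MONEY BAG.
print(count_emoji_like('\U0001F525\U0001F525 HOT deal \U0001F4B0'))
```

This counts code points rather than what a reader sees as one emoji, so flags and joined sequences are counted more than once.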


However, I may have a solution for you with the CHAOS plugin:

https://github.com/telecom2k3/CHAOS

You can get (but shouldn't) Emojis even in From names, like this actual one:

DHL☺com

CHAOS will also help you with Unicode Character spoofs, via its 
UniBabble rulesets:


ᴀмαzσи ᴘ𝔯𝔦𝔪ё
𝘼𝔪𝔞𝘻𝙤𝘯 𝘾𝘶𝘴𝙩𝙤𝘮𝘦𝘳 𝙎𝔢𝘳𝙫𝘪𝘤𝔢
Amαzoɴ Priⅿë

𝐀𝐦𝐚𝐳𝐨𝐧 𝐍𝐨𝐭𝐢𝐜𝐞
...
...
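
One general way to see through part of this kind of lookalike text is Unicode compatibility normalization (NFKC), which folds the mathematical alphabets back to plain letters. This is shown only as an illustration and is not a description of how CHAOS works internally. A Python sketch using the standard `unicodedata` module:

```python
import unicodedata

# "Amazon Notice" written with MATHEMATICAL BOLD letters (U+1D400 onwards).
spoofed = ('\U0001D400\U0001D426\U0001D41A\U0001D433\U0001D428\U0001D427'
           ' '
           '\U0001D40D\U0001D428\U0001D42D\U0001D422\U0001D41C\U0001D41E')

folded = unicodedata.normalize('NFKC', spoofed)
print(folded)

# Lookalikes borrowed from other scripts are untouched by NFKC,
# e.g. CYRILLIC SMALL LETTER EM stays as it is.
cyrillic_em = unicodedata.normalize('NFKC', '\u043C')
```

Spoofs built from Cyrillic or Greek letters therefore need a separate lookalike mapping; normalization alone will not catch them.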

CHAOS will run on PERL 5.18 and later.




-- Jared Hall



Re: Detect Emoticons in Subject

2021-05-20 Thread Clive Jacques
That's fine - I'm not saying all email containing emojis in the subject (or
elsewhere) *is* spam - just that it's uncommon and right now, about 90% of
the time it is *for me*.  I just want to score it as part of the greater
constellation of factors (just like DKIM, SPF etc.).

On Thu, May 20, 2021 at 2:48 PM Bill Cole <
sausers-20150...@billmail.scconsult.com> wrote:

>
> People send wanted mail with all sorts of weirdness.
>
>


Re: Detect Emoticons in Subject

2021-05-20 Thread Bill Cole

On 2021-05-20 at 13:44:43 UTC-0400 (Thu, 20 May 2021 18:44:43 +0100)
RW is rumored to have said:

> On Thu, 20 May 2021 18:30:03 +0100
> RW wrote:
>
> > Try this:
> >
> > header  EMOTICON_IN_SUBJECT  Subject =~ /\xF0\x9F(?:\x98[\x80-\xFF]|\x99[\x00-\x8F])/
>
> Actually that's only the original block, but it probably works most of
> the time


Not so sure about that...

I regularly get mail from Patreon with emoji in the encoded header which 
don't match that pattern:



# grep '^Subject: ' /tmp/ham |cut -d? -f4 |decode-base64 |hexdump -C
00000000  f0 9f 8e 89 20 50 61 74  72 69 63 6b 20 57 61 72  |.... Patrick War|
00000010  64 6c 65 20 6a 75 73 74  20 73 68 61 72 65 64 20  |dle just shared |
00000020  22 f0 9f 93 9d 20 4e                              |".... N|
00000027
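
For anyone who wants to reproduce this without the mail itself, the bytes from the dump can be fed to a byte regex directly. A small Python sketch, using `re` on bytes as a stand-in for SA's matching; the variable names are just for illustration, and the pattern is the Emoticons-block one proposed earlier in the thread:

```python
import re

# Bytes copied from the hexdump above.
subject_bytes = bytes.fromhex(
    'f09f8e89205061747269636b20576172'
    '646c65206a7573742073686172656420'
    '22f09f939d204e'
)

# Emoticons block only (U+1F600..U+1F64F), as first proposed.
emoticons_only = re.compile(rb'\xF0\x9F(?:\x98[\x80-\xFF]|\x99[\x00-\x8F])')

text = subject_bytes.decode('utf-8')
non_ascii = [f'U+{ord(ch):04X}' for ch in text if ord(ch) > 0x7F]
match = emoticons_only.search(subject_bytes)

print(non_ascii)   # the two emoji as code points
print(match)       # None: neither emoji is in the Emoticons block
```

Both characters (U+1F389 and U+1F4DD) sit in Miscellaneous Symbols and Pictographs, which is why the narrow pattern misses them.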

People send wanted mail with all sorts of weirdness.

Looking at the full set 
(https://www.unicode.org/emoji/charts/full-emoji-list.html) I can 
understand why \p{Emoticons} would be so much better than trying to 
define them all in a regex of hex bytes in UTF-8 form.


--
Bill Cole
b...@scconsult.com or billc...@apache.org
(AKA @grumpybozo and many *@billmail.scconsult.com addresses)
Not Currently Available For Hire


Re: Detect Emoticons in Subject

2021-05-20 Thread RW
On Thu, 20 May 2021 19:26:30 +0100
RW wrote:

> On Thu, 20 May 2021 18:44:43 +0100
> RW wrote:
> 
> > On Thu, 20 May 2021 18:30:03 +0100
> > RW wrote:
> > 
> >   
> > > Try this:
> > > 
> > > 
> > > header  EMOTICON_IN_SUBJECT  Subject =~ /\xF0\x9F(?:\x98[\x80-\xFF]|\x99[\x00-\x8F])/
> > > 
> > 
> > Actually that's only the original block, but it probably works most
> > of the time  
> 
> This extends it to Supplemental Symbols and Pictographs and
> adds the three original faces from Miscellaneous Symbols
> 
> 
> /\xF0\x9F(?:\x98[\x80-\xFF]|\x99[\x80-\x8F])|\xF0\x9F(?:[\xA4-\xA6][\x80-\xFF]|\xA7[\x80-\xBF])|\xE2\x98[\xB9-\xBB]/
> 
> it also fixes a minor problem with the continuation bytes in the
> original.
> 
I still didn't get the continuation bytes right: I forgot that bit 6 is
always 0. It's a long time since I've done this.

/\xF0\x9F(?:\x98[\x80-\xBF]|\x99[\x80-\x8F])|\xF0\x9F(?:[\xA4-\xA6][\x80-\xBF]|\xA7[\x80-\xBF])|\xE2\x98[\xB9-\xBB]/
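
The continuation-byte range is easy to confirm by brute force. The following Python sketch (the helper name `continuation_bytes` is made up) encodes every multi-byte code point and collects all the non-leading bytes:

```python
def continuation_bytes(code_points):
    """Collect every non-leading byte seen in the UTF-8 form of code_points."""
    seen = set()
    for cp in code_points:
        if 0xD800 <= cp <= 0xDFFF:      # surrogates cannot be encoded
            continue
        seen.update(chr(cp).encode('utf-8')[1:])
    return seen

# All code points that need more than one byte.
seen = continuation_bytes(range(0x80, 0x110000))
print(hex(min(seen)), hex(max(seen)))

# Bit 7 is always 1 and bit 6 is always 0, i.e. the form 10xxxxxx.
all_10xxxxxx = all(b & 0xC0 == 0x80 for b in seen)
```

So a continuation byte is always in \x80-\xBF, and a class running up to \xFF is wider than it needs to be.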


Re: Detect Emoticons in Subject

2021-05-20 Thread RW
On Thu, 20 May 2021 18:44:43 +0100
RW wrote:

> On Thu, 20 May 2021 18:30:03 +0100
> RW wrote:
> 
> 
> > Try this:
> > 
> > 
> > header  EMOTICON_IN_SUBJECT  Subject =~ /\xF0\x9F(?:\x98[\x80-\xFF]|\x99[\x00-\x8F])/
> >   
> 
> Actually that's only the original block, but it probably works most of
> the time

This extends it to Supplemental Symbols and Pictographs and
adds the three original faces from Miscellaneous Symbols


/\xF0\x9F(?:\x98[\x80-\xFF]|\x99[\x80-\x8F])|\xF0\x9F(?:[\xA4-\xA6][\x80-\xFF]|\xA7[\x80-\xBF])|\xE2\x98[\xB9-\xBB]/

it also fixes a minor problem with the continuation bytes in the original.



Re: Detect Emoticons in Subject

2021-05-20 Thread RW
On Thu, 20 May 2021 18:30:03 +0100
RW wrote:


> Try this:
> 
> 
> header  EMOTICON_IN_SUBJECT  Subject =~ /\xF0\x9F(?:\x98[\x80-\xFF]|\x99[\x00-\x8F])/
> 

Actually that's only the original block, but it probably works most of
the time


Re: Detect Emoticons in Subject

2021-05-20 Thread RW
On Thu, 20 May 2021 18:34:54 +0200
Bert Van de Poel wrote:

> We've started getting lots of spam with emoji in the subject too the 
> past few weeks, so I've looked into this as well. As mentioned by RW, 
> you would need to create some kind of UTF8 regex header Subject rule.
> As I'm not too excited about writing such a regex, it's way at the
> bottom of my todo list to contemplate whether an SA plugin could be
> written for that and to then reach out to the SA developers to see
> whether that would be something upstream would accept. But honestly,
> I won't be able to any time soon (I don't have the time). Still,
> thought I'd mention it, since it might be relevant to your question.
> If you do end up figuring out a regex that works out and isn't an
> extreme length, I think plenty of people on this list would love to
> know!

Try this:


header  EMOTICON_IN_SUBJECT  Subject =~ /\xF0\x9F(?:\x98[\x80-\xFF]|\x99[\x00-\x8F])/



Re: Detect Emoticons in Subject

2021-05-20 Thread Martin Gregorie
On Thu, 2021-05-20 at 18:34 +0200, Bert Van de Poel wrote:
> We've started getting lots of spam with emoji in the subject too the 
> past few weeks, so I've looked into this as well. As mentioned by RW, 
> you would need to create some kind of UTF8 regex header Subject rule. As
> I'm not too excited about writing such a regex, it's way at the bottom
> of my todo list 
>
Should be easy enough - IsASCII is just a name for [\x00-\x7f] and
IsXDigit is [0-9a-fA-F], so the same logic can be applied to define a
regex that triggers on any character within the three Unicode emoji
ranges. See Wikipedia for more detail:

https://en.wikipedia.org/wiki/Emoticon#Unicode
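
Whatever byte pattern ends up being used, it is easy to get a range slightly wrong, so a brute-force check of which code points it really covers can help. This is a sketch only, in Python rather than Perl; `covered_ranges` is a hypothetical helper, and the pattern is one of the byte regexes posted in this thread as I read it.

```python
import re

def covered_ranges(pattern: bytes):
    """Return (first, last) code point runs whose UTF-8 bytes fully match."""
    rx = re.compile(pattern)
    runs = []
    for cp in range(0x80, 0x110000):
        if 0xD800 <= cp <= 0xDFFF:          # surrogates have no UTF-8 form
            continue
        if rx.fullmatch(chr(cp).encode('utf-8')):
            if runs and runs[-1][1] == cp - 1:
                runs[-1][1] = cp            # extend the current run
            else:
                runs.append([cp, cp])       # start a new run
    return [(first, last) for first, last in runs]

PATTERN = (rb'\xF0\x9F(?:\x98[\x80-\xBF]|\x99[\x80-\x8F])'
           rb'|\xF0\x9F(?:[\xA4-\xA6][\x80-\xBF]|\xA7[\x80-\xBF])'
           rb'|\xE2\x98[\xB9-\xBB]')

ranges = covered_ranges(PATTERN)
for first, last in ranges:
    print(f'U+{first:04X}..U+{last:04X}')
```

For this pattern the runs come out as U+2639..U+263B, U+1F600..U+1F64F and U+1F900..U+1F9FF.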

I haven't yet seen any emojis in Subject lines, regardless of whether
the message was spam or not, or I'd probably have already written such a
rule and given it a minimal score so it can be used in a more spam-
specific meta rule.

Martin





Re: Detect Emoticons in Subject

2021-05-20 Thread Bert Van de Poel
We've started getting lots of spam with emoji in the subject too the 
past few weeks, so I've looked into this as well. As mentioned by RW, 
you would need to create some kind of UTF8 regex header Subject rule. As 
I'm not too excited about writing such a regex, it's way at the bottom 
of my todo list to contemplate whether an SA plugin could be written for 
that and to then reach out to the SA developers to see whether that 
would be something upstream would accept. But honestly, I won't be able 
to any time soon (I don't have the time). Still, thought I'd mention it, 
since it might be relevant to your question. If you do end up figuring 
out a regex that works out and isn't an extreme length, I think plenty 
of people on this list would love to know!


Bert

On 20/05/2021 18:19, RW wrote:
> On Thu, 20 May 2021 11:42:59 -0400
> Clive Jacques wrote:
>
> > Hi,
> >
> > I've been using SA a long time.  Lately, I'm getting more and more
> > spam with emoticons in the subject line.  I'd say about 90% of my
> > emails with emoticons in the subject are spam.  I'd like to create a
> > local rule which scores email with emoticons in the subject.
> >
> > # Local Rule for Emoticons in subject
> > subject        EMOTICON_IN_SUBJECT      Subject =~ /\p{Emoticons}/
>
> The rule should start with "header", that's what's causing the lint
> failure.
>
> However, AFAIK, the rule still won't work because \p{Emoticons}
> isn't supported in spamassassin, which works on byte sequences. You
> need to rewrite it to match UTF-8 bytes.




Re: Detect Emoticons in Subject

2021-05-20 Thread RW
On Thu, 20 May 2021 11:42:59 -0400
Clive Jacques wrote:

> Hi,
> 
> I've been using SA a long time.  Lately, I'm getting more and more
> spam with emoticons in the subject line.  I'd say about 90% of my
> emails with emoticons in the subject are spam.  I'd like to create a
> local rule which scores email with emoticons in the subject. 

> # Local Rule for Emoticons in subject
> subject        EMOTICON_IN_SUBJECT      Subject =~ /\p{Emoticons}/

The rule should start with "header", that's what's causing the lint
failure. 

However, AFAIK, the rule still won't work because \p{Emoticons}
isn't supported in spamassassin, which works on byte sequences. You
need to rewrite it to match UTF-8 bytes.
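
To make "match UTF-8 bytes" concrete: each code point range has to be turned into the byte sequences that encode it. A quick way to see those bytes is to encode the end points, as in this Python sketch (the dictionary of blocks and the helper `utf8_hex` are illustrative only):

```python
# First and last code point of two ranges that contain face emoticons.
blocks = {
    'Emoticons block': (0x1F600, 0x1F64F),
    'Old faces in Miscellaneous Symbols': (0x2639, 0x263B),
}

def utf8_hex(cp: int) -> str:
    """Return the UTF-8 encoding of a code point as spaced hex bytes."""
    return ' '.join(f'{b:02X}' for b in chr(cp).encode('utf-8'))

for name, (first, last) in blocks.items():
    print(f'{name}: U+{first:04X} = {utf8_hex(first)}, '
          f'U+{last:04X} = {utf8_hex(last)}')
```

The Emoticons block runs from F0 9F 98 80 to F0 9F 99 8F, so a byte-level rule has to start with \xF0\x9F and then constrain the last two bytes.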


Detect Emoticons in Subject

2021-05-20 Thread Clive Jacques
Hi,

I've been using SA a long time.  Lately, I'm getting more and more spam
with emoticons in the subject line.  I'd say about 90% of my emails with
emoticons in the subject are spam.  I'd like to create a local rule which
scores email with emoticons in the subject.  I saw a previous discussion on
this in the archive, but it was focused on whether such emails were *always*
spam.  I think an emoticon rule, in combination with other rules, will
help my installation.  I've tried to match as follows, but it won't lint.
I'm not really a perl programmer.  I've written several other more
conventional local rules, but here I'm a bit out of my depth.  I'd
appreciate some guidance.

# Local Rule for Emoticons in subject
subject        EMOTICON_IN_SUBJECT      Subject =~ /\p{Emoticons}/
score          EMOTICON_IN_SUBJECT      3.0
describe       EMOTICON_IN_SUBJECT      Subject Line Has Emoticons

-CJ