Hello Bill,

I can show you a few messages that trigger the rule in our case, but only so that you can see how accented characters are used in Czech. I'm unable to grant you permission to upload them to the masscheck corpus or to any other public or semi-public database. The messages contain no classified information, but they include the names and email addresses of our users and business partners. It would be difficult to anonymize them and still keep their value.

To be honest, the rule fires at low frequency even in our system, which deals with Slovak and Czech all the time. Most people here write emails in ASCII characters only and don't use accents. However, some do (I do, for example: UTF-8 works in most email clients, so why not). I found the rule while searching through logs for messages that had scored close to the spam threshold and were therefore close to being misclassified. I can see 9 messages received in the last 10 days firing the rule, and all of them are Atlassian Jira notifications from our Czech business partner. After running sa-update, though, they score pretty low, as MIXED_ES now has a score of 0.5.

Shall I send you those messages?

On Thu, 21 Mar 2019 at 16:34, Bill Cole <sausers-20150...@billmail.scconsult.com> wrote:

> On 21 Mar 2019, at 10:52, John Wilcock wrote:
>
> > On 21/03/2019 at 14:52, John Wilcock wrote:
> >> On 20/03/2019 at 20:19, Bill Cole wrote:
> >>> I've added these lines to the block that defines MIXED_ES which may
> >>> help some sites:
> >>>
> >>> lang pl score MIXED_ES 0.01
> >>> lang cz score MIXED_ES 0.01
> >>> lang sk score MIXED_ES 0.01
> >>> lang hr score MIXED_ES 0.01
> >>> lang el score MIXED_ES 0.01
> >>>
> >>> Those should get into the default rules channel within a few days.
> >>
> >> All very well, except [...]
> > Also, there are *lots* of other languages that legitimately use E-like
> > characters that should be added to the list (e.g. there's a Cyrillic
> > "е", so you can add ru, bg, uk, be, bs, sr, kk, ky, mn, tg and
> > others, for a start; ). You'll be fighting a losing battle there...
>
> Actually not a battle I'm fighting...
>
> I have seen direct reports of this rule (which is substantially more
> narrow than just 'has mixed e-like characters') matching ham in the
> above listed languages. I know that on the order of 0.001% of ham in the
> masscheck data submitted to SA Rule QA match the rule and that the bulk
> of that is from a single small corpus (from a Polish source) in which
> ~0.5% of ham matches. It appears that occasionally that match rate
> results in a classification false positive, which is a real but small
> and constrained problem.
>
> I have never seen an actual ham message matching the rule, much less had
> access to a mail stream including a steady stream of such messages. I
> have only ever seen vague reports of classification FPs, all of which
> cite the score as 3.999, which has not been accurate for most of the
> lifetime of the rule. As such, I have no real weapons in this battle and
> a foe who is invisible but noisy, to overstretch your analogy.
>
> Individual sites are always free to kill or redefine rules from the
> default set or peg their scores to limit FPs.
>
> --
> Bill Cole
> b...@scconsult.com or billc...@apache.org
> (AKA @grumpybozo and many *@billmail.scconsult.com addresses)
> Available For Hire: https://linkedin.com/in/billcole
>
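
For anyone reading this in the archive: the local override mentioned at the end of the quoted message (killing the rule or pegging its score) could look roughly like the lines below. This is only a sketch. The file path is the common default and is an assumption here, since it varies by install, and the 0.1 value is an arbitrary example, not a figure anyone in this thread recommended.

```
# /etc/mail/spamassassin/local.cf  (assumed location; varies by install)

# Option 1: peg the score low so the rule barely affects classification.
# The value 0.1 is illustrative only.
score MIXED_ES 0.1

# Option 2: disable the rule entirely (a score of 0 turns a rule off).
# score MIXED_ES 0
```

Running `spamassassin --lint` afterwards checks the file for syntax errors; the daemon that calls SpamAssassin then needs a restart or reload to pick up the change.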