On 21 Mar 2019, at 10:52, John Wilcock wrote:
On 21/03/2019 at 14:52, John Wilcock wrote:
On 20/03/2019 at 20:19, Bill Cole wrote:
I've added these lines to the block that defines MIXED_ES which may
help some sites:
lang pl score MIXED_ES 0.01
lang cz score MIXED_ES 0.01
lang sk score MIXED_ES 0.01
lang hr score MIXED_ES 0.01
lang el score MIXED_ES 0.01
Those should get into the default rules channel within a few days.
All very well, except [...]
Also, there are *lots* of other languages that legitimately use E-like
characters that should be added to the list (e.g. there's a Cyrillic
"е", so you can add ru, bg, uk, be, bs, sr, kk, ky, mn, tg and
others, for a start ;) ). You'll be fighting a losing battle there...
Actually not a battle I'm fighting...
I have seen direct reports of this rule (which is substantially more
narrow than just 'has mixed e-like characters') matching ham in the
above listed languages. I know that on the order of 0.001% of ham in the
masscheck data submitted to SA Rule QA match the rule and that the bulk
of that is from a single small corpus (from a Polish source) in which
~0.5% of ham matches. It appears that occasionally that match rate
results in a classification false positive, which is a real but small
and constrained problem.
I have never seen an actual ham message matching the rule, much less had
access to mail traffic that includes a steady stream of such messages. I
have only ever seen vague reports of classification FPs, all of which
cite the score as 3.999, which has not been accurate for most of the
lifetime of the rule. As such, I have no real weapons in this battle and
a foe who is invisible but noisy, to overstretch your analogy.
Individual sites are always free to kill or redefine rules from the
default set or peg their scores to limit FPs.
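
For anyone who wants to do that locally, a minimal sketch of what it might
look like in a site's local.cf follows. The score values and the "ru"
locale are arbitrary illustrations, not recommendations, and only one of
the options would normally be enabled. As far as I understand the
documentation, a "lang xx" prefix keys off the locale SpamAssassin runs
under, not the language of the message being scanned. Redefining the rule
itself is left out because its pattern is not shown in this thread.

```
# local.cf sketch: pick ONE of the options below.

# Option 1: disable the rule entirely (a score of 0 turns a rule off).
score MIXED_ES 0

# Option 2: keep the rule running but peg it to a low fixed score.
# The value 0.5 is an arbitrary example.
# score MIXED_ES 0.5

# Option 3: lower the score only under a given locale, in the same
# form as the lang lines quoted earlier in this thread. "ru" is just
# one of the languages suggested above.
# lang ru score MIXED_ES 0.01
```

After changing local.cf, running "spamassassin --lint" is a quick way to
confirm that the file still parses cleanly.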
--
Bill Cole
b...@scconsult.com or billc...@apache.org
(AKA @grumpybozo and many *@billmail.scconsult.com addresses)
Available For Hire: https://linkedin.com/in/billcole