On Sat, 21 Jan 2017 19:08:39 -0000, Jari Fredriksson <ja...@iki.fi> wrote:

John Hardin wrote on 20.1.2017 22:38:

Collecting spam after RBL filtering is much less helpful to masscheck.
Ideally your spam corpus is from a totally unfiltered feed.

However, even if it is filtered and small, it helps, *especially* if
the ham is not in English - masscheck is perennially starved for
non-English ham and rule scoring is thus biased against non-English
languages to a degree.

This is NOT what I have learned from the SA lists. I used to do this, but
learned in SA discussions that it is *harmful* to pass such spam to
masscheck, because it harms the SA users doing proper pre-SA filtering.

We do *need* an official policy! What are we going to do with mixed
messages like this??

It was written down once. I saw the note about unfiltered feeds again when I looked earlier today, but I can't spot it just now. I believe I was also told by someone who knows this stuff that it wasn't a requirement, more an ideal.

However, while looking for that comment again just now, I noticed another discrepancy on the wiki:

https://wiki.apache.org/spamassassin/CorpusCleaning - no spam older than 2 months

https://wiki.apache.org/spamassassin/HandClassifiedCorpora - no spam older than 6 months

I don't think either is actually a strict rule. It will help lower the barrier to entry if we can make this stuff more uniform. It could also be argued that having two such similar pages is somewhat redundant.
