Re: Corpus of Spam/Ham headers(Source IP) for research

Rob McEwen Wed, 29 Jun 2016 06:51:13 -0700

On 6/29/2016 1:00 AM, Shivram Krishnan wrote:

Thank you so much for your views. I agree that your customers would not
like it if you share information. But as Oliver suggested, I need only the
source IP addresses of the Spam and Ham emails, which can even be
anonymized in the last octet.

Unfortunately, accuracy and credibility go down, since there then isn't any easy way to audit or double-check the root cause of the classification.

For example, some people classify spam as "what our filter said was spam" and ham as "what our filter said was ham". For most well-run systems, that is going to be overall very accurate. But there can still be egregious mistakes. And assuming that the existing filter is 100% accurate leaves no room for improvement. It also has the unfortunate side effect of rubber stamping the most elusive spams, sent by the shrewdest of spammers, as ham.

If an anti-spam blacklist comes along that is very good at blocking messages that are unsolicited and not desired by end users... but sent by the most shrewd spammers, who evade lists like Spamhaus and SURBL (at least for some time)... and where the collateral damage for listing such domains and sending IPs is non-existent... such a blacklist might STILL fare badly in such a rating system... which would then MISTAKENLY assume that such a blacklist has many False Positives.
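
To make that failure mode concrete, here is a minimal Python sketch with invented data. The IPs (taken from documentation address ranges), the labels, and the `false_positive_rate` helper are all hypothetical; it only illustrates how scoring a blacklist against an existing filter's verdicts can count genuine catches as False Positives.

```python
# Hypothetical illustration: every IP and label below is made up.

def false_positive_rate(listed, labels):
    """Fraction of IPs labeled 'ham' that the blacklist lists."""
    ham = [ip for ip, label in labels.items() if label == "ham"]
    if not ham:
        return 0.0
    return sum(1 for ip in ham if ip in listed) / len(ham)

# What the messages really were (not directly visible to a rating system).
truth = {
    "192.0.2.10": "spam",
    "192.0.2.11": "spam",    # elusive spam from a shrewd sender
    "198.51.100.5": "ham",
    "198.51.100.6": "ham",
}

# What the existing filter said: it let the elusive spam through as ham.
filter_verdict = dict(truth)
filter_verdict["192.0.2.11"] = "ham"

# A blacklist that lists both spam sources and no legitimate sender.
blacklist = {"192.0.2.10", "192.0.2.11"}

apparent_fp = false_positive_rate(blacklist, filter_verdict)
actual_fp = false_positive_rate(blacklist, truth)
print(f"apparent FP rate: {apparent_fp:.2f}, actual FP rate: {actual_fp:.2f}")
```

Scored against the filter's own verdicts, the blacklist appears to hit one in three "ham" senders, even though it listed no legitimate sender at all.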

Stats collected from user complaints about False Negatives can also be helpful. However, for snowshoe spam, that is often a lagging indicator... sometimes days behind reality--where the spammer has already moved to new domains/IPs--but such could help such a ratings system to make wise adjustments to past ham/spam stats.
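
One possible way to fold late complaints back into past stats, sketched in Python. The record layout, the `relabel_from_complaints` helper, and the 7-day lag window are assumptions made for illustration, not a description of any existing ratings system.

```python
from datetime import date, timedelta

def relabel_from_complaints(records, complaints, max_lag_days=7):
    """Return a copy of records where 'ham' becomes 'spam' if a user
    complaint about the same IP arrived within max_lag_days of delivery.

    records:    list of (ip, delivery_date, label) tuples
    complaints: list of (ip, complaint_date) tuples
    """
    window = timedelta(days=max_lag_days)  # assumed lag window
    adjusted = []
    for ip, delivered, label in records:
        complained = any(
            c_ip == ip and timedelta(0) <= c_day - delivered <= window
            for c_ip, c_day in complaints
        )
        new_label = "spam" if label == "ham" and complained else label
        adjusted.append((ip, delivered, new_label))
    return adjusted

# Invented data: the filter passed both messages as ham on delivery day.
records = [
    ("192.0.2.11", date(2016, 6, 20), "ham"),
    ("198.51.100.5", date(2016, 6, 20), "ham"),
]
# A complaint about the first sender arrives three days later.
complaints = [("192.0.2.11", date(2016, 6, 23))]

adjusted = relabel_from_complaints(records, complaints)
```

With this adjustment, a blacklist that listed 192.0.2.11 on June 20 would be credited with a correct catch rather than charged with a False Positive, although the correction only lands once the complaint has come in.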

Hijacked IPs and domains are another sticky issue. Over the past several years, this has become epidemic! If the volume of legit usage is relatively low, and the IP or domain has been hijacked by a spammer... then at SOME point, an anti-spam blacklist should not be penalized for listing such. In fact, Spamhaus does this frequently (lists hijacked domains/IPs where the cost/benefit ratio for that listing is well justified). Some other lists also blacklist hijacked domains/IPs... but are often not as good at making proper cost/benefit ratio decisions... where they list somewhat large senders who had a somewhat small and short-lived spam outbreak. Finding a way to penalize or reward the lists that block hijacked domains/IPs that Spamhaus misses, based on whether they do (or don't do) a good job of making overall good decisions about the cost/benefit ratio of a potential listing's collateral damage... is also tricky.

My main point is... how to reward blacklists that are more accurate, but without penalizing them for not being a redundant copy of Zen. It isn't as easy as it sounds in a ratings system (even if real-life usage of such by a hoster or ISP can quickly lead to fewer complaints from customers about FPs and FNs).


--
Rob McEwen
