Dallas L. Engelken wrote:
>> -----Original Message-----
>> From: Matt Kettler [mailto:[EMAIL PROTECTED] 
>> Sent: Thursday, February 16, 2006 22:50
>> To: Chris Santerre
>> Cc: users@spamassassin.apache.org
>> Subject: Re: Over-scoring of SURBL lists...
>>
>> Chris Santerre wrote:
>>> Matt Kettler wrote:
>>>> My FPs fall into two categories:
>>>>
>>>> 1) URIs that would likely never appear outside of a specialty 
>>>> newsletter. I've had lots of hits on things like:
>>>> -Authors of programmer's tools
>>>> -producers of electronic parts
>>>> -producers of embedded computer systems (Note: embedded, not normal
>>>> computers..
>>>> companies like versalogic.com that make parts that only a kiosk 
>>>> manufacturer or extreme geek would use)
>>>
>>> Agreed. And we have seen these be more JoeJobs. But some are not. Some
>>> simply hire mass emailers thinking they are legit, only to find out 
>>> they are not. Just because they are legit for you, doesn't mean they
>>> haven't spammed someone else. You ask, we remove.
>>
>> Yes, the only problem is that I'm getting tired of having to 
>> track down sample emails for FPs so I can find which URI a 
>> URIBL FPed on.
>>
>> But really, how often or not a URIBL FP's isn't really the 
>> point. The point is they DO FP, and it's really quite common 
>> for FP's to be multi-listed. That multi-listing wields some 
>> hefty score biases, way beyond the power of any other rule in 
>> spamassassin other than BLACKLIST_* and GTUBE.
>>
>> I merely find it to be a big problem that URIBLs on the 
>> general whole are rather FP prone, and prone to "cascades" of 
>> FPs which unleashes havoc from the strong scores the 
>> perceptron gave them.
>>
>> I think the reason the perceptron gave them such high scores 
>> is that a lot of URIBL FP problems get fixed fairly quickly, 
>> within a matter of hours. Ditto for a lot of FN problems.
>>
>> By the time the mass-checks are run, the URI's in the corpus 
>> emails are likely well sorted by the reports given to the URIBLs.
>>
>>     
>
> Sounds like someone's having a bad day ;)
>
>
>   

First, a pre-statement:

I'm only presenting evidence of accuracy problems in relation to why the
URIBLs collectively wield a great deal of power in SpamAssassin scoring.
I'm not really complaining about uribl.com, I'm complaining about URIBLs
as a whole. That's both uribl.com and surbl. Whenever I use the term
URIBL in all caps, I mean all URI dns-based blacklists. If you prefer,
I'll retract my uribl.com example, and point out that less than an hour
later, I got a ws.surbl.org FP.

And let me remind you,

1) you control which uribl's you run
2) you control how they score


1) I'm talking about the default setup of SA 3.1.0 and the
perceptron-assigned default scores for the URIBLs it uses, not
customization: a default, stock SA 3.1.0 setup. Note that doesn't really
involve uribl.com, but does involve surbl and sbl.

2) I do have serious concerns about the accuracy problems of both
surbl.org and uribl.com, particularly in light of #1. uribl.com presents
a larger portion of this problem at my site, but surbl has the same
basic problems.

3) I'm even more concerned about the monoculture of the URIBLs.
uribl.com's black, surbl.org's ws, sc, jp, ab and ob are all
more-or-less the same list. Paul argued against that statement, but in
my mind his arguments are weak at best. There IS considerable overlap
between these lists. Contrary to Paul's statements, you only need to be
reported once by a spamcop spamtrap or trusted feed to be on SC. JP
monitors 18,000 domains, not just two people. AB accepts feeds directly
from spamcop and does different analysis on them. Ultimately it is
possible for a single copy of an email to cause a listing in
uribl_black, SC, WS, JP, and OB all at the same time. It might be
possible for that one email to list in AB via spamcop, but I'm not sure
if they have a multi-report requirement or not. Sure it's unlikely, but
there is enough overlap to have it be possible. If that one email is
mis-classified you have a whopper of a FP problem to deal with.


Combining 1-3, you have a serious problem. Due to 2, FPs are relatively
commonplace, and due to 3, any FPs tend to cascade quickly into multiple
URIBLs. Due to 1, these rules wield considerable power (> +12) that even
BAYES_00 can't put a dent in (-2.599).
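
To put rough numbers on that, here's a quick back-of-the-envelope sketch
in Python. It is not anything SA itself runs. The +12 and -2.599 figures
are the ones quoted above (12.0 is used as a lower bound, since the real
cascade is "more than +12"), and 5.0 is the stock required_score. The
variable and function names are my own, purely for illustration.

```python
# Figures from the paragraph above. URIBL_CASCADE_MIN is a lower bound on the
# combined score of a multi-listed URI; names here are illustrative only.
URIBL_CASCADE_MIN = 12.0
BAYES_00 = -2.599
REQUIRED_SCORE = 5.0  # SpamAssassin's default spam threshold


def net_score(*scores):
    """Add up rule scores for one message, as SA does with its rule hits."""
    return round(sum(scores), 3)


net = net_score(URIBL_CASCADE_MIN, BAYES_00)
is_spam = net >= REQUIRED_SCORE
print(f"net score: {net}, tagged as spam: {is_spam}")
```

Even with Bayes being as confident as it can be that the mail is ham, the
message still lands well above the threshold.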

Ultimately my major problem isn't with the URIBLs themselves. My problem
is with the structure of the rules in SA 3.1.0 and the outrageously high
scores they have in SA 3.1.0.

Really, I think Chris S had a good idea earlier when he suggested just
rolling all of surbl into one rule. Ditto for uribl.com, but it's only
got one list worth rolling up. (grey is interesting, but I don't think
you'd want to aggregate grey and black into a single rule. The FP rate
of grey would hurt black's score potential). Collectively, these two
rules should have less than 5.0 as a total score.
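
Here's a minimal sketch of how I picture the roll-up behaving, written in
Python rather than as actual SA rules. Everything in it is hypothetical:
the rule names SURBL_ANY and URIBL_COM_BLACK, the 2.5 and 2.0 scores, and
the list labels are made up for illustration. The only constraint taken
from the above is that the two rules together stay under 5.0.

```python
# Hypothetical model of the roll-up idea. Rule names and scores are invented
# for illustration and do not come from SA 3.1.0's rule files.
SURBL_LISTS = {"ws", "sc", "jp", "ab", "ob"}
ROLLUP_SCORES = {"SURBL_ANY": 2.5, "URIBL_COM_BLACK": 2.0}  # sums to 4.5 < 5.0


def rolled_up_score(hits):
    """Score a message given the set of URI blacklists its URIs hit.

    Any number of surbl.org hits fires the single SURBL_ANY rule once, so a
    multi-listed URI cannot stack several separate scores.
    """
    score = 0.0
    if hits & SURBL_LISTS:
        score += ROLLUP_SCORES["SURBL_ANY"]
    if "uribl_black" in hits:
        score += ROLLUP_SCORES["URIBL_COM_BLACK"]
    return score


# One mis-classified email that cascaded into several lists at once.
cascade = {"ws", "sc", "jp", "ob", "uribl_black"}
print(rolled_up_score(cascade))
```

With this shape, a URI listed on five lists scores the same as one listed
on a single surbl list plus uribl.com black, and a lone FP cascade can't
carry a message over the threshold by itself.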

This is a stark contrast to a default SA 3.1.0, where the URIBLs from
surbl.org collectively total 19.715 points by themselves, and 21.354
when you factor in sbl too.
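
For what it's worth, the arithmetic behind that comparison can be checked
with a few lines of Python. The two totals are the ones just quoted; the
5.0 figure is the ceiling I suggested for the rolled-up rules, and the
variable names are mine.

```python
# Totals quoted above for a default SA 3.1.0 install.
SURBL_TOTAL = 19.715           # all surbl.org rules combined
SURBL_PLUS_SBL_TOTAL = 21.354  # the same, with sbl factored in
PROPOSED_CAP = 5.0             # suggested ceiling for the rolled-up rules

# What sbl adds on top of the surbl.org rules.
sbl_share = round(SURBL_PLUS_SBL_TOTAL - SURBL_TOTAL, 3)

# How many times over the proposed ceiling the stock surbl total is.
ratio = round(SURBL_TOTAL / PROPOSED_CAP, 2)

print(f"sbl adds {sbl_share}; surbl alone is {ratio}x the proposed cap")
```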




