Dallas L. Engelken wrote:
>> -----Original Message-----
>> From: Matt Kettler [mailto:[EMAIL PROTECTED]
>> Sent: Thursday, February 16, 2006 22:50
>> To: Chris Santerre
>> Cc: users@spamassassin.apache.org
>> Subject: Re: Over-scoring of SURBL lists...
>>
>> Chris Santerre wrote:
>>> Matt Kettler wrote:
>>>> My FPs fall into two categories:
>>>>
>>>> 1) URIs that would likely never appear outside of a specialty
>>>> newsletter. I've had lots of hits on things like:
>>>> -Authors of programmer's tools
>>>> -producers of electronic parts
>>>> -producers of embedded computer systems (Note: embedded, not normal
>>>> computers.. companies like versalogic.com that make parts that only
>>>> a kiosk manufacturer or extreme geek would use)
>>>
>>> Agreed. And we have seen these be more JoeJobs. But some are not. Some
>>> simply hire mass emailers thinking they are legit, only to find out
>>> they are not. Just because they are legit for you, doesn't mean they
>>> haven't spammed someone else. You ask, we remove.
>>
>> Yes, the only problem is that I'm getting tired of having to
>> track down sample emails for FPs so I can find which URI a
>> URIBL FPed on.
>>
>> But really, how often or not a URIBL FP's isn't really the
>> point. The point is they DO FP, and it's really quite common
>> for FP's to be multi-listed. That multi-listing wields some
>> hefty score biases, way beyond the power of any other rule in
>> spamassassin other than BLACKLIST_* and GTUBE.
>>
>> I merely find it to be a big problem that URIBLs on the
>> general whole are rather FP prone, and prone to "cascades" of
>> FPs which unleashes havoc from the strong scores the
>> perceptron gave them.
>>
>> I think the reason the perceptron gave them such high scores
>> is that a lot of URIBL FP problems get fixed fairly quickly,
>> within a matter of hours. Ditto for a lot of FN problems.
>>
>> By the time the mass-checks are run, the URI's in the corpus
>> emails are likely well sorted by the reports given to the URIBLs.
>
> Sounds like someone's having a bad day ;)

First, a pre-statement: I'm only presenting evidence of accuracy problems in relation to why the URIBLs collectively wield a great deal of power in SpamAssassin scoring. I'm not really complaining about uribl.com, I'm complaining about URIBLs as a whole. That's both uribl.com and surbl. Whenever I use the term URIBL in all caps, I mean all URI DNS-based blacklists.

If you prefer, I'll retract my uribl.com example, and point out that less than an hour later, I got a ws.surbl.org FP.

And let me remind you.

> Let me remind you,
> 1) you control which uribl's you run
> 2) you control how they score

1) I'm talking about the default setup of SA 3.1.0 and the perceptron-assigned default scores for the URIBLs it uses. Not customization. Default, stock SA 3.1.0 setup. Note that doesn't really involve uribl.com, but does involve surbl and sbl.

2) I do have serious concerns about the accuracy problems of both surbl.org and uribl.com. Particularly in light of #2. uribl.com presents a larger portion of this problem at my site, but surbl has the same basic problems.

3) I'm even more concerned about the monoculture of the URIBLs. uribl.com's black, surbl.org's ws, sc, jp, ab and ob are all more-or-less the same list. Paul argued against that statement, but in my mind his arguments are weak at best.

There IS considerable overlap between these lists. Contrary to Paul's statements, you only need to be reported once by a spamcop spamtrap or trusted feed to be on SC. JP monitors 18,000 domains, not just two people. AB accepts feeds directly from spamcop and does different analysis on them.

Ultimately it is possible for a single copy of an email to cause a listing in uribl_black, SC, WS, JP, and OB all at the same time. It might be possible for that one email to list in AB via spamcop, but I'm not sure if they have a multi-report requirement or not. Sure it's unlikely, but there is enough overlap to have it be possible.
If that one email is mis-classified, you have a whopper of an FP problem to deal with.

Combining 1-3 you have a serious problem. Due to 2, FPs are relatively commonplace, and due to 3, any FPs tend to cascade quickly into multiple URIBLs. Due to 1, these rules wield considerable power (> +12) that even BAYES_00 (-2.599) can't put a dent in.

Ultimately my major problem isn't with the URIBLs themselves. My problem is with the structure of the rules in SA 3.1.0 and the outrageously high scores they have in SA 3.1.0.

Really, I think Chris S had a good idea earlier when he suggested just rolling all of surbl into one rule. Ditto for uribl.com, but it's only got one list worth rolling up. (grey is interesting, but I don't think you'd want to aggregate grey and black into a single rule. The FP rate of grey would hurt black's score potential.)

Collectively, these two rules should have less than 5.0 as a total score. This is a stark contrast to a default SA 3.1.0, where the URIBLs from surbl.org collectively total 19.715 points by themselves, and 21.354 when you factor in sbl too.
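
To put rough numbers on the cascade, here is a minimal Python sketch of the arithmetic. It uses only the totals quoted in this message (19.715, 21.354, and -2.599 for BAYES_00). The 5.0 spam threshold is assumed to be the stock required_score, and the 4.9 "rolled-up" figure is purely hypothetical, picked only because it is less than 5.0. It is not a proposed score.

```python
# Totals as quoted in this message for a default SA 3.1.0 install.
SURBL_TOTAL = 19.715           # all surbl.org URIBL rules hitting at once
SURBL_PLUS_SBL_TOTAL = 21.354  # the same, with the sbl-based rule added
BAYES_00 = -2.599              # Bayes score for "almost certainly ham"

# Assumption: the stock required_score of 5.0 is in use.
REQUIRED_SCORE = 5.0

# Hypothetical combined score for rolled-up rules; the message only
# says the total should be "less than 5.0".
ROLLED_UP_TOTAL = 4.9


def net_score(uribl_points, bayes=BAYES_00):
    """URIBL points plus the Bayes adjustment, rounded to 3 places."""
    return round(uribl_points + bayes, 3)


def is_spam(score, threshold=REQUIRED_SCORE):
    """A message is tagged once its score reaches the threshold."""
    return score >= threshold


# Points contributed by the sbl-based rule alone, by subtraction.
sbl_only = round(SURBL_PLUS_SBL_TOTAL - SURBL_TOTAL, 3)

# Worst case today: every list FPs on the same URI, Bayes says ham.
cascade = net_score(SURBL_PLUS_SBL_TOTAL)

# Same FP with the hypothetical rolled-up total instead.
rolled_up = net_score(ROLLED_UP_TOTAL)

print(f"sbl alone:         {sbl_only}")
print(f"full cascade FP:   {cascade}  spam={is_spam(cascade)}")
print(f"rolled-up rule FP: {rolled_up}  spam={is_spam(rolled_up)}")
```

With the quoted totals, a full cascade still nets 18.755 after BAYES_00, far over the threshold. With the hypothetical rolled-up total, the same FP nets 2.301 and Bayes can pull the message back under 5.0.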