can't type much as i've broken my elbow (oh noes!) -- but we talked in the past about using an LR engine for rescoring. not sure if that got anywhere though.
btw be aware also that there was a perceptron rescorer, but it produced more fragile scores than the ga; see 3.2.0 rescoring ticket for history

--j

On Tue, Nov 17, 2009 at 03:22, Warren Togami <wtog...@redhat.com> wrote:
> On 11/16/2009 07:26 PM, Adam Katz wrote:
>>
>> My hypothesis, which I've anecdotally proven on my own deployment, is
>> that the flaws are repeated as well. Spammers that trigger spamtraps
>> on multiple DNSBLs (and URIBLs) may be sending from (or linking to)
>> servers that also deal with legitimate traffic. This means that
>> thanks to these similar indexing techniques, DNSBL overlap from
>> spammers' abuse of a non-spam-exclusive server can single-handedly
>> mark a ham as spam.
>>
>> My "solution" is to counter-intuitively *remove* points from message
>> that hit too many DNSBLs. They still net quite a positive score, but
>> that score is effectively capped at something not quite high enough to
>> kill a ham with DNSBLs alone.
>>
>> A more elegant version of this, which Karsten and I theorize might
>> even happen automatically (as scored by the GA) if I were to check my
>> adjustor into SVN, would be to reduce most of the points on the DNSBLs
>> and add them back with a meta rule containing a union of the DNSBL
>> rules (without a "multiple" tflag).
>
> I think there is a lot of merit to this approach, and it might even be a
> great idea. But I spoke with a machine learning expert and heard some
> interesting things on this topic.
>
> We held a small workshop yesterday in which she explained Logistic
> Regression and how it might be applied to automated rescoring of
> spamassassin's rules. The most intriguing aspect of her explanation was the
> suggestion of using a logarithmic function in weight scoring. I asked
> specifically about this issue of overlap (like BRBL_LASTEXT with every other
> list) and she suggested this particular method of rescoring wouldn't have an
> issue with overlap.
>
> I believe you mentioned logarithmic scoring in an earlier discussion?
>
> It appears that we have a few very smart people interested in implementing
> an alternative rescorer using Logistic Regression. We plan on using an
> existing library for the bulk of this implementation.
>
> I think we should proceed with our current generated scores for 3.3.0. After
> that we can compare the effectiveness of different approaches including your
> proposal.
>
> Specifically on the issue of overlapping DNSBL's, there might be a few
> possibilities:
>
> * Overlapping DNSBL's really is no problem with any method of scoring.
> * Overlapping DNSBL's is only a slight problem with any method of scoring,
>   but if a host is blacklisted with more than one major DNSBL they have
>   serious issues they need to fix and we shouldn't try to workaround for their
>   benefit.
> * Overlapping DNSBL's is a real problem, but logarithmic scoring avoids it
>   as an issue.
>
> rulesrc/sandbox/jm/20_bug_5984.cf:# score RCVD_IN_BRBL_LASTEXT 2.0
>
> This apparently was set manually. It appears that spamassassin-3.2.x was
> not scored when BRBL existed as a rule. Meanwhile our new GA scores
> resulted in:
>
> score RCVD_IN_BRBL_LASTEXT 0 1.644 0 1.449 # n=0 n=2
>
> This is relatively modest. This combined with one other DNSBL alone will
> not push it clearly above 5 points. I might suggest manually adjusting down
> BRBL or PBL so it requires one additional tiny score to push it over the
> edge. I'm personally comfortable enough to outright reject mail from a
> Spamhaus listed host. Given this bias, it is sufficiently cautious in my
> book to accept PBL + BRBL as insufficient.
>
> Warren Togami
> wtog...@redhat.com
>
>
--
--j.
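
For reference, the capping idea from Adam's quoted message (subtract points once a message hits too many DNSBLs, so the lists alone can't push a ham over the threshold) can be sketched roughly as below. This is only an illustration in Python, not SpamAssassin rule syntax and not anyone's actual adjustor: the function names, the 4.5 cap, and the RCVD_IN_EXAMPLE_A/B rules and their scores are made up for the example. Only the 1.644 score for RCVD_IN_BRBL_LASTEXT comes from the thread.

```python
# Hypothetical per-rule scores. Only RCVD_IN_BRBL_LASTEXT (1.644) is taken
# from the thread; the other rule names and values are invented.
SCORES = {
    "RCVD_IN_BRBL_LASTEXT": 1.644,
    "RCVD_IN_EXAMPLE_A": 2.0,
    "RCVD_IN_EXAMPLE_B": 2.5,
}

# Assumed cap, chosen to sit a little below the usual 5-point threshold.
DNSBL_CAP = 4.5


def dnsbl_adjustment(hits, scores=SCORES, cap=DNSBL_CAP):
    """Return the non-positive points to add so DNSBL hits stay at or below cap."""
    raw = sum(scores[name] for name in hits)
    return min(0.0, cap - raw)


def capped_dnsbl_score(hits, scores=SCORES, cap=DNSBL_CAP):
    """Return the combined DNSBL contribution after the cap is applied."""
    raw = sum(scores[name] for name in hits)
    return min(raw, cap)


if __name__ == "__main__":
    all_hits = list(SCORES)
    print(dnsbl_adjustment(all_hits))    # negative adjustment for three hits
    print(capped_dnsbl_score(all_hits))  # contribution held at the cap
```

With one or two hits the adjustment is zero and the scores add up as usual; the negative adjustment only kicks in once the combined DNSBL score would exceed the cap.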