can't type much as i've broken my elbow (oh noes!) -- but we talked in
the past about using an LR engine for rescoring.  not sure if that got
anywhere though.

btw be aware also that there was a perceptron rescorer, but it
produced more fragile scores than the GA; see the 3.2.0 rescoring
ticket for the history

--j

On Tue, Nov 17, 2009 at 03:22, Warren Togami <wtog...@redhat.com> wrote:
> On 11/16/2009 07:26 PM, Adam Katz wrote:
>>
>> My hypothesis, which I've anecdotally proven on my own deployment, is
>> that the flaws are repeated as well.  Spammers that trigger spamtraps
>> on multiple DNSBLs (and URIBLs) may be sending from (or linking to)
>> servers that also deal with legitimate traffic.  This means that
>> thanks to these similar indexing techniques, DNSBL overlap from
>> spammers' abuse of a non-spam-exclusive server can single-handedly
>> mark a ham as spam.
>>
>> My "solution" is to counter-intuitively *remove* points from messages
>> that hit too many DNSBLs.  They still net quite a positive score, but
>> that score is effectively capped at something not quite high enough to
>> kill a ham with DNSBLs alone.
>>
>> A more elegant version of this, which Karsten and I theorize might
>> even happen automatically (as scored by the GA) if I were to check my
>> adjustor into SVN, would be to reduce most of the points on the DNSBLs
>> and add them back with a meta rule containing a union of the DNSBL
>> rules (without a "multiple" tflag).
>
> I think there is a lot of merit to this approach, and it might even be a
> great idea.  But I spoke with a machine learning expert and heard some
> interesting things on this topic.
>
> We held a small workshop yesterday in which she explained Logistic
> Regression and how it might be applied to automated rescoring of
> spamassassin's rules.  The most intriguing aspect of her explanation was the
> suggestion of using a logarithmic function in weight scoring.  I asked
> specifically about this issue of overlap (like BRBL_LASTEXT with every other
> list) and she suggested this particular method of rescoring wouldn't have an
> issue with overlap.
>
> I believe you mentioned logarithmic scoring in an earlier discussion?
>
> It appears that we have a few very smart people interested in implementing
> an alternative rescorer using Logistic Regression.  We plan on using an
> existing library for the bulk of this implementation.
>
> I think we should proceed with our current generated scores for 3.3.0. After
> that we can compare the effectiveness of different approaches including your
> proposal.
>
> Specifically on the issue of overlapping DNSBLs, there might be a few
> possibilities:
>
> * Overlapping DNSBLs really are no problem with any method of scoring.
> * Overlapping DNSBLs are only a slight problem with any method of scoring,
> but if a host is blacklisted with more than one major DNSBL they have
> serious issues they need to fix and we shouldn't try to work around that
> for their benefit.
> * Overlapping DNSBLs are a real problem, but logarithmic scoring avoids
> the issue.
>
> rulesrc/sandbox/jm/20_bug_5984.cf:# score RCVD_IN_BRBL_LASTEXT 2.0
>
> This apparently was set manually.  It appears that BRBL did not yet exist
> as a rule when spamassassin-3.2.x was scored.  Meanwhile our new GA scores
> resulted in:
>
> score RCVD_IN_BRBL_LASTEXT 0 1.644 0 1.449 # n=0 n=2
>
> This is relatively modest.  This combined with one other DNSBL alone will
> not push it clearly above 5 points.  I might suggest manually adjusting down
> BRBL or PBL so it requires one additional tiny score to push it over the
> edge.  I'm personally comfortable enough to outright reject mail from a
> Spamhaus listed host.  Given this bias, it is sufficiently cautious in my
> book to accept PBL + BRBL as insufficient.
>
> Warren Togami
> wtog...@redhat.com
>
>
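
For anyone following along, the capping idea Adam describes in the quoted text can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not SpamAssassin code: the only score taken from this thread is 1.644 for RCVD_IN_BRBL_LASTEXT, while the other per-list scores, the cap value, and the function name are assumptions made up for the example. The 5-point threshold is the one Warren refers to.

```python
# Sketch of "cap the DNSBL contribution" scoring. Hypothetical values throughout,
# except RCVD_IN_BRBL_LASTEXT = 1.644, which is the GA score quoted in the thread.
DNSBL_SCORES = {
    "RCVD_IN_BRBL_LASTEXT": 1.644,
    "RCVD_IN_PBL": 2.0,        # assumed score
    "RCVD_IN_XBL": 2.0,        # assumed score
    "RCVD_IN_SORBS_DUL": 1.5,  # assumed score
}
DNSBL_CAP = 4.0       # assumed cap, deliberately below the spam threshold
SPAM_THRESHOLD = 5.0  # the 5-point threshold mentioned in the thread


def score_message(dnsbl_hits, other_points=0.0):
    """Sum the DNSBL hits, cap that subtotal, then add all non-DNSBL points."""
    dnsbl_total = sum(DNSBL_SCORES[name] for name in dnsbl_hits)
    return min(dnsbl_total, DNSBL_CAP) + other_points


all_lists = list(DNSBL_SCORES)
capped = score_message(all_lists)                       # DNSBLs alone
with_extra = score_message(all_lists, other_points=1.2)  # plus one other rule

print(capped < SPAM_THRESHOLD)       # DNSBLs alone stay below the threshold
print(with_extra >= SPAM_THRESHOLD)  # one more modest hit tips it over
```

The point of the design is that a host listed everywhere still collects a clearly positive score, but some non-DNSBL evidence is needed before the message is marked as spam.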



-- 
--j.
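
A rough sketch of what a logistic regression rescorer does with overlapping rules, for readers who want to experiment. It is plain Python with hand-written gradient descent and a small L2 penalty, not the library the thread mentions (none is named) and not anything from the SpamAssassin tree. The toy corpus is invented: columns 0 and 1 stand for two hypothetical DNSBL rules that always fire together, and column 2 is an unrelated rule. All names and hyperparameters are assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(rows, labels, lr=0.5, l2=0.01, epochs=2000):
    """Batch gradient descent on L2-regularised logistic loss (bias not penalised)."""
    n_features = len(rows[0])
    weights = [0.0] * n_features
    bias = 0.0
    n = len(rows)
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(rows, labels):
            p = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))
            err = p - y  # derivative of the log loss w.r.t. the logit
            for j, xi in enumerate(x):
                grad_w[j] += err * xi
            grad_b += err
        for j in range(n_features):
            weights[j] -= lr * (grad_w[j] / n + l2 * weights[j])
        bias -= lr * grad_b / n
    return weights, bias


# Invented toy corpus: [DNSBL_A, DNSBL_B, OTHER_RULE]; label 1 = spam, 0 = ham.
ROWS = [
    [1, 1, 0], [1, 1, 1], [1, 1, 0], [1, 1, 0],
    [0, 0, 1], [0, 0, 0], [0, 0, 0], [0, 0, 1],
]
LABELS = [1, 1, 1, 0, 1, 0, 0, 0]

weights, bias = train_logistic(ROWS, LABELS)
print("weights:", weights, "bias:", bias)
```

Because the two DNSBL columns are identical, they receive identical updates and end up with the same weight, so the evidence from the pair is fitted jointly rather than each list being scored as if it were independent. Whether that carries over to real, only partially overlapping lists is exactly what would need testing against a real corpus.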
