Bear in mind, the TCR figure shown to the user in the "fp-fn-statistics"
output is mostly useful for comparison against published algorithms, since
it's the de facto standard measure of effectiveness in the academic
literature on spam filtering.
But we shouldn't use it ourselves internally as an effectiveness metric,
because I don't think it's trustworthy (see below).
To remind us what the lambda values represent in Ion's papers:
lambda=1: filing into a "spam" folder
lambda=9: bouncing back to sender saying "your mail was spam"
lambda=100: silent disposal
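For reference, the TCR computation from the Androutsopoulos papers can be
sketched as follows (function and variable names are illustrative, not
anything in our codebase):

```python
# Sketch of the TCR (total cost ratio) metric from the Androutsopoulos
# papers.  Takes raw counts from a test corpus.

def tcr(n_spam, n_fp, n_fn, lam):
    """TCR = n_spam / (lam * FP + FN).

    The baseline cost of running no filter at all is n_spam (every spam
    gets through); the filter's weighted cost is lam * FP + FN, with each
    FP counted lam times as expensive as an FN.  TCR > 1 means the filter
    beats doing nothing; higher is better.
    """
    return n_spam / (lam * n_fp + n_fn)

# A hypothetical corpus with 1000 spams, 2 FPs and 50 FNs:
print(tcr(1000, 2, 50, 1))    # lambda=1: filing into a spam folder
print(tcr(1000, 2, 50, 9))    # lambda=9: bouncing back to sender
print(tcr(1000, 2, 50, 100))  # lambda=100: silent disposal
```

Note how the same FP/FN counts give wildly different TCRs as lambda
changes, which is why the choice of lambda matters so much.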
We should really be using a lambda of 1, given that; but since SpamAssassin
is also used in other systems (e.g. with a system-wide quarantine,
unavailable to the end user), and because it was getting crazily good
effectiveness figures (like TCR > 100) at l=1, I picked a compromise of l=5.
But if you're keen to change it, I'd say a TCR lambda of 9 would be OK.
For end-user display only though.
Regarding what's used as a balancing factor in the perceptron, use
whatever value works well, but don't consider it a TCR lambda in terms of
the figures output to the end-user. Just keep it inside the perceptron
code. For *that* use, 100 or even higher would be good IMO, because we
really want to avoid FPs -- in other words, *our* perception of the FP
cost is higher than Ion's assumptions.
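To illustrate what I mean by an internal balancing factor (this is a
hypothetical sketch of the idea, not the actual perceptron code -- FP_COST
and the training loop are made up for illustration):

```python
# Cost-sensitive perceptron sketch: mistakes on ham (which would become
# FPs) are weighted FP_COST times more heavily than mistakes on spam.

FP_COST = 100.0  # internal balancing factor; NOT the TCR lambda we display

def train(samples, n_features, epochs=10, lr=0.1):
    """samples: list of (feature_vector, label), label +1 = spam, -1 = ham."""
    w = [0.0] * n_features
    for _ in range(epochs):
        for x, y in samples:
            score = sum(wi * xi for wi, xi in zip(w, x))
            if score * y <= 0:  # misclassified
                # A would-be FP (ham scored as spam) costs FP_COST times
                # more, so push the weights back that much harder.
                step = lr * (FP_COST if y < 0 else 1.0)
                w = [wi + step * y * xi for wi, xi in zip(w, x)]
    return w
```

The point being: that 100 lives entirely inside the optimizer, and never
leaks out into the TCR figures we report.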
To reiterate my previous mail: I *don't* think TCR is a good single-figure
representation of spamfilter efficiency. I used to, but since then,
I've occasionally run into results where the FP/FN figures are lousy,
but the TCR is good; generally when the corpora are out of balance
and the FP figures are high, but the FNs are "good enough" to outweigh
the crappy FP rate.
IMO a better metric would be to pick a desired FP rate, and then use
FN as a single-figure metric given that FP rate. Or vice versa.
Basically lock down a desired FP or FN rate and allow the perceptron
to find its "best" rate for the other figure.
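A rough sketch of that lock-down idea (names and data are illustrative,
not existing "fp-fn-statistics" output): given per-message scores, pick
the lowest threshold whose FP rate stays at or below the target, then
report the FN rate at that threshold as the single figure of merit.

```python
# Given scores for a ham corpus and a spam corpus, lock down a target FP
# rate and report the best achievable FN rate at that constraint.

def fn_at_fp(ham_scores, spam_scores, target_fp_rate):
    # Candidate thresholds are the observed scores, lowest first; the
    # first one meeting the FP constraint minimizes the FN rate.
    for thresh in sorted(set(ham_scores + spam_scores)):
        fp = sum(1 for s in ham_scores if s >= thresh)
        if fp / len(ham_scores) <= target_fp_rate:
            fn = sum(1 for s in spam_scores if s < thresh)
            return thresh, fn / len(spam_scores)
    return None, 1.0  # no threshold meets the target

# E.g. demand zero FPs and see what FN rate that forces on us:
thresh, fn_rate = fn_at_fp([0.0, 1.0, 2.0], [3.0, 4.0, 5.0], 0.0)
```

Either figure could be the locked one; the point is that a single number
at a stated constraint can't be gamed by an out-of-balance corpus the way
TCR can.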
--j.
Daniel Quinlan writes:
> I think a TCR lambda of 5 is too low for us. This means that we
> consider 5 FNs to have about the same "cost" as 1 FP, right (reference:
> http://www.ics.forth.gr/~potamias/mlnia/paper_2.pdf)? I think we have
> managed okay until now with using such a small value because the score
> optimizer hasn't really changed in terms of balancing FPs vs. FNs until
> now.
>
> I think the value should be somewhere between 10 and 500. I'm using 50
> for the moment.
>
> The balance is all wrong in the perceptron (too many FPs per FN), but I
> believe I found a reasonably good way to fix it (having the perceptron
> optimize around a lower threshold than 5.0). Using a lambda of 5.0, I
> can't really prove it, but when I eyeballed these FP/FN numbers, they
> seemed much better to me and *are* better with a TCR lambda of 50 (which
> I think is closer anyway).
>
> Another data point, Craig Hughes used to talk about having a FP-to-FN
> ratio of 100 as a goal. I think a lambda of 100 is closer to what we
> want than 5. I realize the Androutsopoulos papers seem to imply a lower
> number is correct (although I could make a case that they actually don't
> because foldering is actually worse than sending TMDA-style bounces
> **once your accuracy reaches the level we're now at**), but I think we
> need to go with our gut here until someone whips out some real economics
> research. :-)
>
> Daniel