PCR(f) = (#tp/#spam) / ((#fn/#spam) + f(#fp/#ham))
f could simply be a linear transformation (f(Lambda,x) = Lambda*x), though we might find that a nonlinear transformation is more useful. I'd start with a linear one.
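The PCR with a linear function is just a normalized TCR: FN is taken relative to the amount of spam and FP relative to the amount of ham, so the result no longer depends on the spam/ham mix of the corpus. That would solve the training set imbalance problem.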
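To make the imbalance point concrete, here is a rough sketch in Python. The function names, the example counts and the Lambda value of 50 are made up for illustration, and the TCR formula used (#spam / (Lambda*#fp + #fn)) is my assumption about the usual definition, not something taken from the scripts.

```python
def linear(lam):
    """Hypothetical helper: the linear transformation f(x) = lam * x."""
    return lambda x: lam * x


def pcr(tp, fn, fp, n_spam, n_ham, f):
    """PCR(f) = (tp/n_spam) / ((fn/n_spam) + f(fp/n_ham))."""
    denom = fn / n_spam + f(fp / n_ham)
    if denom == 0:
        return float("inf")  # no errors at all
    return (tp / n_spam) / denom


def tcr(fn, fp, n_spam, lam):
    """Assumed TCR definition: n_spam / (lam*fp + fn)."""
    denom = lam * fp + fn
    if denom == 0:
        return float("inf")
    return n_spam / denom


lam = 50  # illustrative weight only

# Same classifier behaviour (95% spam caught, 1% ham misfiled),
# measured on a balanced corpus and on one with 10x as much ham.
balanced = dict(tp=950, fn=50, fp=10, n_spam=1000, n_ham=1000)
ham_heavy = dict(tp=950, fn=50, fp=100, n_spam=1000, n_ham=10000)

pcr_balanced = pcr(f=linear(lam), **balanced)
pcr_ham_heavy = pcr(f=linear(lam), **ham_heavy)

tcr_balanced = tcr(balanced["fn"], balanced["fp"], balanced["n_spam"], lam)
tcr_ham_heavy = tcr(ham_heavy["fn"], ham_heavy["fp"], ham_heavy["n_spam"], lam)

print("PCR:", pcr_balanced, pcr_ham_heavy)  # the two PCR values match
print("TCR:", tcr_balanced, tcr_ham_heavy)  # TCR drops on the ham-heavy corpus
```

With identical error rates the PCR comes out the same for both corpora, while the TCR (as assumed here) falls simply because there is more ham.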
Henry
Justin Mason wrote:
Daniel Quinlan writes:
> [EMAIL PROTECTED] (Justin Mason) writes:
>> eh, why drop the FP/FN from the summary line? and it's missing a newline ;)
>
> Because those FP and FN numbers are the ones relative to the total number of messages rather than the amount of spam or ham.

They should be made relative to spam/ham, and reinstated. otherwise I'm -1 on that change.

Judging effectiveness by TCR alone is *not* a good idea. TCR is sensitive to the relative sizes of the spam/ham corpus if I recall correctly, and also does not give a good idea of overall effectiveness as a single figure. for example, very high FP will get a high TCR if the FNs are low enough, whereas in real-world use, high FP is always to be avoided.

--j.
