The real problem with TCR is that it does not take into account the relative sizes of the corpora.  I'm going to define a new measure, called the proportional cost ratio (PCR).

PCR(f) = (#tp/#spam) / ((#fn/#spam) + f(#fp/#ham))

f could just be a linear transformation with a cost weight Lambda (f(x) = Lambda*x), but we might find that it is more useful to use a nonlinear transformation.  I'd start with a linear one.

The PCR with a linear function is just a normalized TCR.  That would solve the training set imbalance problem.
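
To make the imbalance point concrete, here is a rough sketch in Python.  It assumes the usual TCR definition, #spam / (Lambda*#fp + #fn), which is not spelled out above.  The function names, the Lambda value of 5, and the corpus counts are all made up for illustration.  Both corpora have the same per-class error rates (5% FN, 1% FP) and differ only in how much ham they contain.

```python
def tcr(n_spam, fp, fn, lam=1.0):
    """Total cost ratio, assuming the usual form #spam / (lam*#fp + #fn)."""
    return n_spam / (lam * fp + fn)


def pcr(tp, fn, fp, n_spam, n_ham, f):
    """PCR(f) = (#tp/#spam) / ((#fn/#spam) + f(#fp/#ham)).

    Like TCR, this is undefined (division by zero) when there are no errors.
    """
    return (tp / n_spam) / ((fn / n_spam) + f(fp / n_ham))


def linear(lam):
    """Linear cost transformation f(x) = lam * x."""
    return lambda x: lam * x


LAM = 5.0  # illustrative cost weight for a false positive, not a recommended value

# Corpus A: 1000 spam, 1000 ham.  Corpus B: 1000 spam, 10000 ham.
# Same rates in both: 5% of spam missed, 1% of ham misclassified.
tcr_a = tcr(n_spam=1000, fp=10, fn=50, lam=LAM)
tcr_b = tcr(n_spam=1000, fp=100, fn=50, lam=LAM)

pcr_a = pcr(tp=950, fn=50, fp=10, n_spam=1000, n_ham=1000, f=linear(LAM))
pcr_b = pcr(tp=950, fn=50, fp=100, n_spam=1000, n_ham=10000, f=linear(LAM))

print("TCR:", tcr_a, tcr_b)  # TCR drops when only the ham corpus grows
print("PCR:", pcr_a, pcr_b)  # PCR depends only on the per-class rates
```

With these made-up numbers the TCR changes between the two corpora even though the classifier behaves identically on each class, while the PCR stays the same.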

Henry

Justin Mason wrote:
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Daniel Quinlan writes:
> [EMAIL PROTECTED] (Justin Mason) writes:
> > eh, why drop the FP/FN from the summary line?  and it's missing a
> > newline ;)
>
> Because those FP and FN numbers are the ones relative to the total
> number of messages rather than the amount of spam or ham.

They should be made relative to spam/ham, and reinstated.  Otherwise
I'm -1 on that change.

Judging effectiveness by TCR alone is *not* a good idea.  TCR is sensitive
to the relative sizes of the spam and ham corpora, if I recall correctly,
and also does not give a good idea of overall effectiveness as a single
figure.  For example, very high FP will get a high TCR if the FNs
are low enough, whereas in real-world use, high FP is always to be
avoided.

- --j.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.4 (GNU/Linux)
Comment: Exmh CVS

iD8DBQFAye4bQTcbUG5Y7woRAop8AKCcXoczqQc1Z+XX9sn/C4QtNmVY1wCePd46
fGrcM1oZeyvQ28cFwlbXnWo=
=8yYC
-----END PGP SIGNATURE-----
  
