-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1


Sidney Markowitz writes:
>Justin Mason wrote:
>> They are frequently very strong tokens, too, making them useful and part
>> of the top N tokens (150 btw) included in the calculation
>
>I'm talking about ranking tokens by strength. It would not matter how
>common they are. What percent of all tokens in the db get picked as
>being in the top 15 (or whatever we use) of any of the messages that are
>looked at? How would it affect accuracy by not having the weakest N% of
>tokens in the db available during the calculations?
>
>> the most common tokens are often common in both ham and spam, making
>> them useless for scanning purposes.
>
>Exactly. If they are useless why do we need them in the db that is used
>when we are scanning? They are of course needed during training.

OK, that's an interesting idea. hmm... I've never tested that.

>> Well, we are trying to avoid "batch modes" ;)
>
>Sonic.net already has to do that to some degree to attempt to deal with
>I/O requirements. Messages are tokenized, messages in the form of token
>summaries are written to a spool, and then a separate process does the
>learning. It should be optional, but for scalability it should be easy
>to separate the processes of scanning for spam and doing the training.
>Whether or not you call it a "batch mode".

Yes, maybe for super-high-volume setups a batch mode is unavoidable.

- --j.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.4 (GNU/Linux)
Comment: Exmh CVS

iD8DBQFATmw/QTcbUG5Y7woRAp69AKDYZpo42fk8oR/tX1PVQIx2MNLYlQCgunNC
XEkWtY3I6DMrYkiPIyRH2/A=
=CJcm
-----END PGP SIGNATURE-----
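
For anyone who wants to try the untested idea from the thread (drop the weakest N% of tokens from the db used for scanning, and see how many tokens ever make it into a message's top N), here is a rough Python sketch. It is not SpamAssassin code. The function names are made up for illustration, the db is assumed to be a plain dict of token to spam probability, and "strength" is assumed to mean distance from the neutral probability 0.5, which may differ from what the real implementation uses.

```python
from typing import Dict, Iterable, List, Set


def strength(p: float) -> float:
    """Assumed strength measure: distance of the spam probability from 0.5."""
    return abs(p - 0.5)


def top_tokens(message_tokens: Iterable[str],
               probs: Dict[str, float], n: int) -> List[str]:
    """Return the n strongest tokens of one message that exist in the db."""
    known: Set[str] = {t for t in message_tokens if t in probs}
    # Strongest first; ties broken by token text so the result is stable.
    return sorted(known, key=lambda t: (-strength(probs[t]), t))[:n]


def fraction_ever_used(messages: Iterable[Iterable[str]],
                       probs: Dict[str, float], n: int) -> float:
    """Fraction of db tokens picked into the top n of at least one message."""
    used: Set[str] = set()
    for msg in messages:
        used.update(top_tokens(msg, probs, n))
    return len(used) / len(probs) if probs else 0.0


def prune_weakest(probs: Dict[str, float], fraction: float) -> Dict[str, float]:
    """Return a copy of the db without the weakest `fraction` of its tokens."""
    ranked = sorted(probs, key=lambda t: (strength(probs[t]), t))
    drop = set(ranked[:int(len(ranked) * fraction)])
    return {t: p for t, p in probs.items() if t not in drop}
```

To answer the accuracy question, one could score a test corpus once against the full db and once against the output of `prune_weakest`, then compare the results. The training db would have to stay unpruned, since a token that is weak today may become strong after more learning.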
