Just found this while looking up details of "train on error" --
http://mail.python.org/pipermail/spambayes/2002-December/002440.html :
> Are you hashing tokens? spambayes does not, CRM114 does. Bill
> generates about 16 hash codes per input token, and with just a million
> hash buckets, collision rates zoom quickly if you train on everything.
Understood. We don't hash tokens, and I agree that the sentence you
quoted is misleading; I should have said something like "bogofilter's
current tokenization and the R-F (Robinson-Fisher) classification
method." I didn't try any of bogofilter's other calculation methods.
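
For concreteness, here is a minimal sketch of the kind of scheme being
described: SBPH-style hashing, where every subset of a small token
window gets hashed, so a 5-token window yields 2**4 = 16 hash codes per
input token. The window size, hash function, and bucket count here are
illustrative assumptions, not CRM114's actual code:

import zlib

NUM_BUCKETS = 1_000_000   # "just a million hash buckets", per the quote
WINDOW = 5                # assumed; gives 2**(WINDOW-1) = 16 codes/token

def hashed_features(tokens):
    """Yield one bucket index per (token + subset of next 4 tokens)."""
    for i, tok in enumerate(tokens):
        for mask in range(2 ** (WINDOW - 1)):   # 16 subsets per position
            parts = [tok]
            for d in range(1, WINDOW):
                if mask & (1 << (d - 1)) and i + d < len(tokens):
                    parts.append(tokens[i + d])
            yield zlib.crc32("\x00".join(parts).encode()) % NUM_BUCKETS

tokens = "click here for a great mortgage rate".split()
print(len(list(hashed_features(tokens))), "codes from", len(tokens), "tokens")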
> The experiments spambayes did with CRM114-like schemes were a disaster
> due to this -- we continued to train on everything, with hashing but
> without any bounds on bucket count, and the hash collisions quickly
> caused outrageously bad classification mistakes. Removing the hashing
> cured that, but then the database size goes through the roof (when
> generating ~16 "exact strings" per input token, and training on
> everything).
Yup.
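
To see how quickly that blows up: with m buckets and n hash codes
inserted, the expected number of distinct occupied buckets is
m * (1 - (1 - 1/m)**n), so the fraction of inserts that land on an
already-used bucket climbs fast as n approaches m. Assuming ~300 tokens
per message (my number, not from the thread) and 16 codes per token:

m = 1_000_000                # buckets, per the quote
for msgs in (100, 1_000, 10_000):
    n = msgs * 300 * 16      # assumed 300 tokens/msg * 16 codes/token
    occupied = m * (1 - (1 - 1/m) ** n)
    print(f"{msgs:>6} messages: {n:>11,} codes, ~{1 - occupied/n:.0%} collide")

That's roughly 20% of inserts colliding after only 100 messages, and
nearly all of them after 10,000.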
> Training-on-error helps Bill because it slashes hash collisions,
> simply via producing far fewer hash codes than does training on
> everything.
I didn't mean to imply otherwise, and your correction of my sloppy
wording is appreciated.
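
The training-on-error loop itself is trivial; the point is just that
nothing gets hashed into the database unless the classifier was wrong
about it. A sketch, with classify() and train() as hypothetical
stand-ins for the real filter:

def train_on_error(corpus, classify, train):
    """corpus: iterable of (message, true_label) pairs (hypothetical API).
    Only misclassified messages get trained, so far fewer hash codes
    ever enter the bucket table than with train-on-everything."""
    errors = 0
    for msg, label in corpus:
        if classify(msg) != label:
            train(msg, label)
            errors += 1
    return errors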
So it's worth noting -- CRM-114 has had to adopt special training
strategies (train-on-error rather than train-on-everything) to avoid
hash collisions when using hashed multiword tokens.
--j.