Just found this while looking up details of "train on error" --
http://mail.python.org/pipermail/spambayes/2002-December/002440.html :
> Are you hashing tokens? spambayes does not, CRM114 does. Bill
> generates about 16 hash codes per input token, and with just a million
> hash buckets, collision rates zoom quickly if you train on everything.
Understood. We don't hash tokens, and I agree that the sentence you
quoted is misleading; I should have said something like "bogofilter's
current tokenization and the R-F (Robinson-Fisher) classification
method." I didn't try any of bogofilter's other calculation methods.
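
For concreteness, here is a minimal sketch of the kind of scheme being
described: SBPH-style hashing, where every subset of a small token
window gets hashed, so a 5-token window yields 2**4 = 16 hash codes per
input token. The window size, hash function, and bucket count here are
illustrative assumptions, not CRM114's actual code:

import zlib

NUM_BUCKETS = 1_000_000   # "just a million hash buckets", per the quote
WINDOW = 5                # assumed; gives 2**(WINDOW-1) = 16 codes/token

def hashed_features(tokens):
    """Yield one bucket index per (token + subset of next 4 tokens)."""
    for i, tok in enumerate(tokens):
        for mask in range(2 ** (WINDOW - 1)):   # 16 subsets per position
            parts = [tok]
            for d in range(1, WINDOW):
                if mask & (1 << (d - 1)) and i + d < len(tokens):
                    parts.append(tokens[i + d])
            yield zlib.crc32("\x00".join(parts).encode()) % NUM_BUCKETS

tokens = "click here for a great mortgage rate".split()
print(len(list(hashed_features(tokens))), "codes from", len(tokens), "tokens")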
> The experiments spambayes did with CRM114-like schemes were a disaster
> due to this -- we continued to train on everything, with hashing but
> without any bounds on bucket count, and the hash collisions quickly
> caused outrageously bad classification mistakes. Removing the hashing
> cured that, but then the database size goes through the roof (when
> generating ~16 "exact strings" per input token, and training on
> everything).
Yup.
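
To see how quickly that blows up: with m buckets and n hash codes
inserted, the expected number of distinct occupied buckets is
m * (1 - (1 - 1/m)**n), so the fraction of inserts that land on an
already-used bucket climbs fast as n approaches m. Assuming ~300 tokens
per message (my number, not from the thread) and 16 codes per token:

m = 1_000_000                # buckets, per the quote
for msgs in (100, 1_000, 10_000):
    n = msgs * 300 * 16      # assumed 300 tokens/msg * 16 codes/token
    occupied = m * (1 - (1 - 1/m) ** n)
    print(f"{msgs:>6} messages: {n:>11,} codes, ~{1 - occupied/n:.0%} collide")

That's roughly 20% of inserts colliding after only 100 messages, and
nearly all of them after 10,000.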
> Training-on-error helps Bill because it slashes hash collisions,
> simply via producing far fewer hash codes than does training on
> everything.
I didn't mean to imply otherwise, and your correction of my sloppy
wording is appreciated.
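
The training-on-error loop itself is trivial; the point is just that
nothing gets hashed into the database unless the classifier was wrong
about it. A sketch, with classify() and train() as hypothetical
stand-ins for the real filter:

def train_on_error(corpus, classify, train):
    """corpus: iterable of (message, true_label) pairs (hypothetical API).
    Only misclassified messages get trained, so far fewer hash codes
    ever enter the bucket table than with train-on-everything."""
    errors = 0
    for msg, label in corpus:
        if classify(msg) != label:
            train(msg, label)
            errors += 1
    return errors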
So it's worth noting -- CRM-114 has had to adopt special training
strategies (train-on-error rather than train-on-everything) to avoid
hash collisions when using hashed multiword tokens.
--j.