On Mon, Dec 06, 2004 at 01:28:23AM -0000, Gray, Richard wrote:
> > So, what happens when you take these two overlapping databases and
> > combine them is that certain tokens (those that have overlap) are then
> > double counted. This makes the database, at least according to the
> > bayes model SA is using, statistically invalid.
>
> Using this reasoning, the tokens that overlap are going to be
> identified as being related to the same message based on the same
> hashes. Therefore it should be possible to detect the tokens that are
> being double counted, and to dismiss them when they do.
>
> If you can do this then surely the database remains statistically
> correct and can be safely merged?
>
It is impossible to determine, after the fact, which tokens go with which
message. That information is not tracked, so unless you are talking about
something else, then no, it is not possible.

Michael
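
To illustrate the double counting discussed above, here is a minimal sketch. It assumes a simplified token database that keeps only a per-token count; the names `learn`, `db_a`, `db_b` and the sample messages are hypothetical, and this is not SpamAssassin's actual storage code.

```python
from collections import Counter

def learn(db, seen, msg_id, tokens):
    # Hypothetical learner: count each distinct token once per message
    # and skip messages this database has already learned.
    if msg_id in seen:
        return
    seen.add(msg_id)
    db.update(set(tokens))

# Illustrative messages only; real tokens would be hashes.
messages = {
    "m1": ["free", "pills"],
    "m2": ["free", "offer"],
}

# Database A learned both messages.
db_a, seen_a = Counter(), set()
learn(db_a, seen_a, "m1", messages["m1"])
learn(db_a, seen_a, "m2", messages["m2"])

# Database B learned only m1, so it overlaps with A.
db_b, seen_b = Counter(), set()
learn(db_b, seen_b, "m1", messages["m1"])

# Naive merge: add the per-token counts together.
merged = db_a + db_b

# What a single database that learned each message once would hold.
truth, seen_t = Counter(), set()
for msg_id, tokens in messages.items():
    learn(truth, seen_t, msg_id, tokens)

print(merged["free"], truth["free"])    # 3 2
print(merged["pills"], truth["pills"])  # 2 1
```

In this sketch the merged counts for "free" and "pills" are each one too high because m1 was learned on both sides. The counts themselves carry no record of which message contributed which increment, so there is nothing in them to subtract the overlap from.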