And the question is, when does the poisoning begin? Does anyone have reliable information about the approximate ham:spam ratio at which poisoning would take place?
That is a function of both the ratio AND the spam itself.
Really, I think for "pure" spam and ham, you could have a ratio of 10,000:1 and be fine.
The problem isn't so much self-poisoning as it is leaving yourself open to intentional poisoning by the spammer. If you have a heavily off-balance training ratio and a lot of spam containing intentional bayes poison, you can run into FP problems on the ham side because the poison tokens are going to start drowning everything out. Conversely, if your ratio is heavily off-balance towards the ham side, spam containing poison will be more likely to evade the bayes filter.
Effectively this is a function of the tokens, not the emails, so it's a function of about 100,000 variables, thus it'd be hard to boil it down to anything as simple as a "dangerous ratio".
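To make the per-token point concrete, here is a rough sketch of a Robinson-style smoothed token probability. This is an illustration only, not the code of any particular filter: the function name, the constants s=1.0 and x=0.5, and all of the counts are made up for the example.

```python
def token_spam_prob(spam_hits, ham_hits, n_spam, n_ham, s=1.0, x=0.5):
    """Smoothed estimate that a message containing this token is spam.

    spam_hits / ham_hits: trained spam / ham messages containing the token.
    n_spam / n_ham: total trained spam / ham messages.
    s, x: assumed smoothing strength and prior for unseen tokens
          (illustrative values, not taken from any real filter).
    """
    spam_freq = spam_hits / n_spam
    ham_freq = ham_hits / n_ham
    total = spam_freq + ham_freq
    # Raw probability from per-class frequencies; fall back to the prior
    # when the token has never been seen in training.
    raw = spam_freq / total if total > 0 else x
    n = spam_hits + ham_hits
    # Pull the raw estimate towards the prior when there is little evidence.
    return (s * x + n * raw) / (s + n)


# Same 100:1 spam:ham training ratio, two different poison words (made-up counts).
N_SPAM, N_HAM = 10_000, 100

# A word that is common in this user's ham and also stuffed into 5% of spam.
common_word = token_spam_prob(500, 30, N_SPAM, N_HAM)

# A legitimate word the small ham corpus has not seen yet, stuffed into 5% of spam.
unseen_in_ham = token_spam_prob(500, 0, N_SPAM, N_HAM)

print(common_word, unseen_in_ham)
```

With these made-up numbers the first word still scores well under 0.5 (hammy), while the second scores very close to 1 (strongly spammy), even though the training ratio is identical. That is the sense in which the outcome depends on the individual tokens rather than on the message ratio alone.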
I suppose you could do a measurement for a given pile of spam and ham, but since spam constantly changes its behavior, the "danger" level is going to change constantly as well.
My ballpark guess, based on my experience, is that a bayes DB with a decent volume of training (at least 100 emails a day) would likely start to have noticeable bayes misclassification problems somewhere near spam:ham ratios of 100:1 or 1:50.
