[Skip Montanaro]
> Sure, but constructing a suitable ham/spam corpus
> from scratch is a non-trivial task, as you no doubt
> remember.

Ah - but we had a much subtler task then: trying to construct a classifier that was _useful_. Your current task is much clearer:

> ... I am looking to insure that a Py3 port of SpamBayes
> works the same as the Py2 code.

For _that_ purpose, you can take any pile of email at all; split it into "ham" and "spam" at random, and "just" ensure you get the same results from the older and newer code. Your criterion for success isn't "closeness to human value judgment", but "same output".

For that purpose, you could synthesize gibberish email from random header & sentence generators. Although it would be easier to use real email ;-)

The point is that you don't have to worry at all about whether this or that is "really ham" or "really spam" or "really unsure" - it was making those value judgments that consumed lots of human time when building the old curated data sets.

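A rough sketch of what such a check might look like, under some loud assumptions: the helper names (`split_label`, `split_corpus`, `diff_scores`) are made up for illustration and are not part of SpamBayes, and each message is assumed to be available as raw bytes. The split here is keyed on a hash of the message rather than on the `random` module, since the same seed is not guaranteed to produce the same shuffle under Python 2 and Python 3. How the scores are produced is left out entirely; the idea is only that each interpreter trains on the same arbitrary split, dumps a mapping of message id to score, and the two mappings are compared.

```python
import hashlib


def split_label(raw_message):
    """Assign an arbitrary but reproducible label from the message bytes.

    The label carries no meaning; it only has to be identical on every run
    and on every Python version.
    """
    digest = hashlib.sha256(raw_message).digest()
    # bytearray makes indexing yield an int on both Python 2 and Python 3.
    return "ham" if bytearray(digest)[0] % 2 == 0 else "spam"


def split_corpus(messages):
    """Partition an iterable of raw messages (bytes) into (ham, spam) lists."""
    ham, spam = [], []
    for raw in messages:
        (ham if split_label(raw) == "ham" else spam).append(raw)
    return ham, spam


def diff_scores(old_scores, new_scores, tolerance=0.0):
    """Return the sorted keys whose scores differ between two runs.

    Both arguments map a message id to a numeric score. A key present in
    only one of the mappings is also reported as a difference.
    """
    keys = set(old_scores) | set(new_scores)
    return sorted(
        k for k in keys
        if k not in old_scores
        or k not in new_scores
        or abs(old_scores[k] - new_scores[k]) > tolerance
    )
```

An empty list from `diff_scores` would mean the two runs agree on every message. Whether to demand exact equality or allow a small `tolerance` for floating-point differences is a judgment call this sketch does not settle.
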
_______________________________________________
spambayes-dev mailing list
spambayes-dev@python.org
https://mail.python.org/mailman/listinfo/spambayes-dev