[Tony Meyer, regarding classifying image-based spam]
>> FWIW, I'm both itching and scratching.
[...]
>> (I am documenting the failures, and will put that somewhere at some
>> point).

[Amedee Van Gasse]
> As another Open Source adage goes... do only one thing, and do that
> thing perfectly.
> Please don't overload spambayes with ocr capabilities or other
> kitchensinks.

I think it's pretty clear that I have no intention of doing that. If I did, then I've had plenty of opportunity to do so already, and haven't done so. There is a considerable difference between tokenizing an image and "overloading" spambayes. I have no intention to use any sort of OCR (partly because it doesn't really work well). As Tim said, you can read discussion about that in the past (it was rejected as an idea then, too). I think there's even a sourceforge ticket open regarding image-based spam.

> All spambayes should do (imho) is tokenize an email, and
> give a score to each token.

If that's all you want spambayes to do, then just get the classifier.py and tokenizer.py files, and leave the rest alone. However, for spambayes to be of use to most people, there needs to be more (some sort of persistent database, options handling, a convenient training interface, a non-intrusive classification system, etc).

IAC, I am not talking about anything other than this. However, it seems clear to me that valuable tokens could be generated from images, rather than simply ignoring them as spambayes currently does. If I do find something that appears to work well (across multiple corpora), then it will be checked in as an experimental option, so that users can try it out. If it helps most people, then it will probably make it out of experimental status. If it appears to help everyone, without hurting anyone (or fairly close), then it might get turned on by default. That's a *long* way off, however.

The position of the spambayes development is pretty clear on this.
In the words of our fearless leader:

"""
A subtler point is that you should never keep a change that doesn't *prove* itself a winner: neutral changes bloat your code with proven irrelevancies that will come back to make your life harder later, in part because they'll randomly interfere with future changes in ways that make it harder to recognize a significant change when you stumble into one.
"""
(from TESTING.txt in the distribution).

If you look through the spambayes-dev archives (or [email protected] archives before that list existed), you'll find plenty of changes that were considered and rejected.

> What I think *might* be an interesting approach, is chaining different
> software together. But only for those extremely rare cases when
> spambayes doesn't get enough information from the headers. It's an
> interesting thought experiment - but only that: a thought experiment.

Plenty of other people already do this. If you are interested, there's plenty in the archives about this, too. The most obvious example can be found in the FAQ about adding white/blacklisting, but there's also doing DNS blacklist checks, and so forth.

> After several months of using spambayes I don't feel any itch at
> all...

But after several years, I do, so I'm scratching.

If you see something checked in that you think is some sort of bloat, then feel free to argue that on spambayes-dev. Reverting a change isn't hard, and if someone is willing to take the time to test a change I make to find evidence against it to counter my evidence, I'm more than happy to discuss it and revert it if necessary.

=Tony.Meyer

--
Please always include the list (spambayes at python.org) in your replies (reply-all), and please don't send me personal mail about SpamBayes.
http://www.massey.ac.nz/~tameyer/writing/reply_all.html explains this.
