[Tony Meyer, regarding classifying image-based spam]
>> FWIW, I'm both itching and scratching.
[...]
>> (I am documenting the failures, and will put that somewhere at some
>> point).

[Amedee Van Gasse]
> As another Open Source adage goes... do only one thing, and do that
> thing perfectly.
> Please don't overload spambayes with ocr capabilities or other
> kitchensinks.

I think it's pretty clear that I have no intention of doing that.  If
I had wanted to, I've had plenty of opportunity already, and I
haven't done so.

There is a considerable difference between tokenizing an image and
"overloading" spambayes.  I have no intention of using any sort of OCR
(partly because it doesn't really work well).  As Tim said, you can
read the earlier discussion about that (it was rejected as an idea
then, too).  I think there's even a SourceForge ticket open regarding
image-based spam.

> All spambayes should do (imho) is tokenize an email, and
> give a score to each token.

If that's all you want spambayes to do, then just get the  
classifier.py and tokenizer.py files, and leave the rest alone.   
However, for spambayes to be of use to most people, there needs to be  
more (some sort of persistent database, options handling, a  
convenient training interface, a non-intrusive classification system,  
etc).

In any case, I am not talking about anything other than this.  However, it  
seems clear to me that valuable tokens could be generated from  
images, rather than simply ignoring them as spambayes currently  
does.  If I do find something that appears to work well (across  
multiple corpora), then it will be checked in as an experimental  
option, so that users can try it out.  If it helps most people, then  
it will probably make it out of experimental status.  If it appears  
to help everyone, without hurting anyone (or fairly close), then it  
might get turned on by default.  That's a *long* way off, however.
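
As a rough illustration of what "tokens generated from images" could
mean without any OCR, here is a minimal sketch.  It is not the code
being tested and it is not part of spambayes; the function name
image_tokens and the token strings are made up for this example.  It
uses only the standard library email package and emits the content
subtype and a coarse size bucket for each image part.

```python
from email.message import Message


def image_tokens(msg: Message):
    """Yield crude, hypothetical tokens describing each image part.

    Illustration only: the token names are assumptions, not anything
    that spambayes itself generates.
    """
    for part in msg.walk():
        if part.get_content_maintype() != "image":
            continue
        # The declared subtype (gif, png, jpeg, ...) is a cheap clue.
        yield "image:type:" + part.get_content_subtype()
        payload = part.get_payload(decode=True) or b""
        # Bucket the size by powers of two so that images of roughly
        # the same size share a token.
        bucket = max(len(payload).bit_length() - 1, 0)
        yield "image:size:2**%d" % bucket
```

Whether tokens like these actually help is exactly the sort of thing
that would need testing across multiple corpora before anything was
checked in.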

The position of the spambayes developers is pretty clear on this.
In the words of our fearless leader:

"""
A subtler point is that you should never keep a change that doesn't
*prove* itself a winner:  neutral changes bloat your code with proven
irrelevancies that will come back to make your life harder later, in
part because they'll randomly interfere with future changes in ways
that make it harder to recognize a significant change when you stumble
into one.
"""

(from TESTING.txt in the distribution).

If you look through the spambayes-dev archives (or  
[email protected] archives before that list existed), you'll find  
plenty of changes that were considered and rejected.

> What I think *might* be an interesting approach, is chaining different
> software together. But only for those extremely rare cases when
> spambayes doesn't get enough information from the headers. It's an
> interesting thought experiment - but only that: a thought experiment.

Plenty of other people already do this.  If you are interested,
there's plenty in the archives about this, too.  The most obvious
example can be found in the FAQ about adding white/blacklisting, but
people also do DNS blacklist checks, and so forth.
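
For anyone curious what a DNS blacklist check involves, here is a
minimal sketch, separate from spambayes itself.  The helper names and
the zone dnsbl.example.org are placeholders I've made up; substitute
whichever list you actually use.  The only convention assumed is the
usual one for IPv4 DNSBLs: reverse the octets of the address, append
the zone, and treat any successful lookup as "listed".

```python
import ipaddress
import socket


def dnsbl_query_name(ip: str, zone: str) -> str:
    """Build the hostname to look up for an IPv4 address in a DNSBL zone."""
    addr = ipaddress.IPv4Address(ip)  # raises ValueError on bad input
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return "%s.%s" % (reversed_octets, zone)


def is_listed(ip: str, zone: str) -> bool:
    """Return True if the lookup resolves, i.e. the address is listed.

    Needs network access; a failed lookup is treated as "not listed".
    """
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
    except socket.gaierror:
        return False
    return True
```

For example, dnsbl_query_name("192.0.2.99", "dnsbl.example.org") gives
"99.2.0.192.dnsbl.example.org".  The result of is_listed could then be
used before or after classification, without touching the tokenizer.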

> After several months of using spambayes I don't feel any itch at
> all...

But after several years, I do, so I'm scratching.  If you see
something checked in that you think is some sort of bloat, then feel
free to argue that on spambayes-dev.  Reverting a change isn't hard,
and if someone is willing to take the time to test a change I make
and finds evidence against it that counters mine, I'm more than happy
to discuss it and revert it if necessary.

=Tony.Meyer

-- 
Please always include the list (spambayes at python.org) in your replies
(reply-all), and please don't send me personal mail about SpamBayes.
http://www.massey.ac.nz/~tameyer/writing/reply_all.html explains this.


_______________________________________________
[email protected]
http://mail.python.org/mailman/listinfo/spambayes
Check the FAQ before asking: http://spambayes.sf.net/faq.html
