[EMAIL PROTECTED] wrote on Thursday, February 01, 2007 11:27 AM -0600:

> As to the "creating more synthetic tokens", I'm open to suggestions.
> Ignoring its OCR features, I think SpamBayes currently identifies
> that an image is present, its mime type (distinguishing gif spams
> from Grandma's jpeg photos for example) the log of its size. Maybe
> it could generate clues related to the image's dimensions, the total
> number of images in the email or number of distinct colors. Do you
> have other suggestions?

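To make the idea of synthetic tokens concrete, here is a minimal sketch that uses only Python's standard `email` package. The function name `synthetic_tokens` and every token string it emits are hypothetical, chosen for illustration; they are not SpamBayes' actual tokenizer output. It covers only clues that need no image decoding (MIME type, log-scaled size, image count, transfer encoding, character set). Dimensions or color counts would require an image library and are left out.

```python
import math
from email.message import Message
from email.mime.image import MIMEImage
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def synthetic_tokens(msg: Message):
    """Yield illustrative clue strings for each leaf MIME part of msg.

    Token names here are made up for this sketch.
    """
    image_count = 0
    for part in msg.walk():
        if part.is_multipart():
            continue  # container parts carry no payload of their own
        cte = (part.get("Content-Transfer-Encoding") or "none").strip().lower()
        charset = part.get_content_charset() or "none"
        payload = part.get_payload(decode=True) or b""
        yield f"part:cte:{cte}"
        yield f"part:charset:{charset}"
        # Bucket the decoded size on a log2 scale so that parts of
        # similar size share one token instead of each being unique.
        yield f"part:log2-size:{int(math.log2(len(payload) + 1))}"
        if part.get_content_maintype() == "image":
            image_count += 1
            yield f"image:type:{part.get_content_type()}"
    yield f"image:count:{image_count}"


# Usage: a two-part message with a short text body and a fake GIF payload.
msg = MIMEMultipart()
msg.attach(MIMEText("hello", "plain", "us-ascii"))
msg.attach(MIMEImage(b"GIF89a" + bytes(100), _subtype="gif"))
tokens = list(synthetic_tokens(msg))
print(tokens)
```

Each string would simply be handed to the classifier alongside the ordinary word tokens, leaving it to training to decide which ones carry any weight.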
Exactly which clues are significant is the $64,000 question, just as it is with word frequencies. The approach that SpamBayes took with that problem may be applicable here. Rather than trying to imagine which clues will be definitive, I was wondering out loud whether we might provide a large number of seemingly unrelated clues and let the Bayesian classifier look for correlations. We can't guess in advance which clues will matter, so the idea is to provide as many different ones as possible and hope that SpamBayes finds some of them useful.

Maybe things like animation rate, contrast ratio, color bias, ... any actual piece of information that varies from one image to the next. There are probably a lot of metrics available to people who are expert in image processing. Then there are the email-specific ones, like the content transfer encoding of each MIME part, the total characters in each MIME part, the character set, etc.

--
Seth Goodman
_______________________________________________
[email protected]
http://mail.python.org/mailman/listinfo/spambayes
Check the FAQ before asking: http://spambayes.sf.net/faq.html
