On Wed, 20 Jan 2016, Marc Perkel wrote:

> Maybe I should call it a new plan for spam?

Perhaps FUSSP? (Sorry... You're so rah rah about this I couldn't resist... :) )

> So - how do I get a list of words and phrases never used in spam? I create a list of words and phrases that are used in spam and check to see if it's *not on the list*.

So it still needs to be trained, at least initially, with a manually-vetted corpus. If not, how do you propose to do the initial classification of messages for training?

Do you envision it being self-training past that point? What if it goes off the rails? How would you keep it from going off the rails?

If it's not self-training then you have the same issues with the reliability of the people feeding the training corpus.
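For concreteness, here is a rough sketch of the "not on the list" check as I understand it from the description above. Everything in it is an assumption made for illustration: the function names, the tokenizer (lowercased words plus adjacent two-word phrases), and the tiny hand-vetted spam corpus. It is not Marc's actual implementation.

```python
import re

def tokenize(text):
    """Return lowercased words plus adjacent two-word phrases (assumed tokenizer)."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    phrases = [" ".join(pair) for pair in zip(words, words[1:])]
    return set(words) | set(phrases)

def build_spam_vocabulary(spam_messages):
    """Collect every token seen in a manually vetted spam corpus."""
    vocab = set()
    for msg in spam_messages:
        vocab |= tokenize(msg)
    return vocab

def tokens_never_seen_in_spam(message, spam_vocab):
    """Tokens of the message that are *not on the list* of spam tokens."""
    return tokenize(message) - spam_vocab

# Hypothetical, hand-vetted training data.
spam_corpus = ["Cheap pills online now", "Win money now"]
spam_vocab = build_spam_vocabulary(spam_corpus)
unseen = tokens_never_seen_in_spam("Meeting agenda for Monday now", spam_vocab)
```

Note that the result is only as good as spam_corpus: a token is "never used in spam" only relative to the spam the list was built from, which is why the vetting question matters.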

> So I'm not just tokenizing the subject. Also the first 25 words of the message

OK, good. I was thinking it would be *really* sensitive to "Bayes poisoning". Ignoring all but the first part of the body helps.

I assume you're only considering the portion that would render as visible to the recipient. Of course, that brings in all the logic regarding "what is visible to the recipient?" and all the HTML obfuscation we're already seeing to get around Bayes and "only scan the first part of the message".
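To illustrate the "what is visible?" problem, here is a minimal sketch that takes the subject plus the first 25 body words after dropping markup and the contents of script/style/head elements. The class and function names are made up for this example, and it is deliberately naive: it knows nothing about CSS tricks (display:none, tiny or same-color fonts) or words split across tags, which is exactly where the obfuscation gets in.

```python
from html.parser import HTMLParser

class VisibleText(HTMLParser):
    """Collect text outside elements that never render as body text."""
    SKIP = {"script", "style", "head", "title"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0:
            self.chunks.append(data)

def leading_tokens(subject, html_body, limit=25):
    """Subject words followed by the first `limit` visible body words."""
    parser = VisibleText()
    parser.feed(html_body)
    parser.close()
    body_words = " ".join(parser.chunks).split()[:limit]
    return subject.split() + body_words
```

A spammer who hides 25 innocuous words in a zero-height div ahead of the payload would sail straight through this, so a real implementation would need much of the rendering logic a mail client has.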

--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174     pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
  Insofar as the police deter by their presence, they are very, very
  good. Criminals take great pains not to commit a crime in front of
  them.                                             -- Jeffrey Snyder
-----------------------------------------------------------------------
 3 days until John Moses Browning's 161st Birthday
