Hi,

I'm doing a research project on Bayesian spam filtering and have a few questions about spambayes.

I'm trying to write a script that creates a db containing every word I give it as input, with nham=0 and nspam=0 set in each word's wordinfo. My current plan is to take the set of words and put them in an mbox, in a message with the "To" and "Subject" headers set to some arbitrary value and the body set to the input words. I then pass this mbox to sbmboxtrain as the spam/ham file, which creates the db. Finally, I iterate through the words and set each one's nham and nspam to 0, remembering to get rid of the tokens produced by the arbitrary To and Subject headers.

Would this work? Is there an easier way to do this? I'm pretty sure that using "h", the object returned by hammie.open(), could make this much easier, but tracing through the code is a bit hard. Is there an easy way to create a blank db and add new wordinfos to it?

Further, I'm not sure how headers are tokenized. From the output I usually see, they seem to be tokenized as header:headername:headercontent. If the body of a message contains that same text, header:headername:headercontent, would spambayes treat it as the same token as the one produced by a header with that name and content?

Thanks for your time,
Anthony
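
P.S. For concreteness, the mbox-building step of the plan might look roughly like the sketch below. It uses only the Python standard library (mailbox and email) and no spambayes API; the function name build_word_mbox and the placeholder To/Subject values are made up for illustration. The resulting file would then be handed to sbmboxtrain as described above.

```python
import mailbox
import os
import tempfile
from email.message import Message


def build_word_mbox(words, path,
                    to_addr="placeholder@example.com",
                    subject="placeholder"):
    """Write an mbox at `path` holding one message whose body is `words`.

    `to_addr` and `subject` are arbitrary placeholder values (assumptions);
    any tokens they produce would have to be removed from the db afterwards.
    """
    msg = Message()
    msg["To"] = to_addr
    msg["Subject"] = subject
    # One space-separated line; each word is meant to become one body token.
    msg.set_payload(" ".join(words))

    box = mailbox.mbox(path)
    try:
        box.add(msg)
        box.flush()
    finally:
        box.close()
    return path


if __name__ == "__main__":
    demo_path = os.path.join(tempfile.mkdtemp(), "words.mbox")
    build_word_mbox(["viagra", "meeting", "invoice"], demo_path)
    print(demo_path)
```

This only covers creating the input file. Whether the tokenizer keeps every word exactly as written (case, length limits and so on) is something I have not verified.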
_______________________________________________
spambayes-dev mailing list
spambayes-dev@python.org
http://mail.python.org/mailman/listinfo/spambayes-dev