decoder wrote:
Marc Perkel wrote:

Good work so far, but it sounds like you need to throw more data at it. Also, even though you report "over 99% accuracy", can you break that down in more detail? 99.9% is ten times as accurate as 99%.
What do you mean by more data? Of course, some additional data might help. One should keep in mind that _most_ SA rules are designed to score on spam. For an SVM, you can use more general data like "mail has property XYZ", even if you don't know whether this property indicates ham or spam, or whether it is suitable for classification at all. This is, of course, an advantage.
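As a hypothetical illustration (the property names below are invented, not actual SA rules), such properties simply become entries in the SVM's feature vector, and the SVM learns their weights itself:

    # Hypothetical: generic binary properties as feature-vector entries.
    # Whether a property is hammy, spammy, or neither is left to the SVM.
    features = {
        "has_html_part":    1,  # not indicative of ham or spam by itself
        "subject_all_caps": 0,
        "sa_rule_hit_FOO":  1,  # existing SA rule hits fit the same mold
    }
    x = [features[name] for name in sorted(features)]  # one SVM input row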


With respect to the numbers:

I repeated the experiments today with slight modifications to provide a more solid setup:

The input is again the dataset I used yesterday. In one run, I permute the dataset, then split it (2/3 training vs. 1/3 testing, not stratified). The training set is used to train an SVM, which is then applied to the 1/3 test set and additionally to my false-negatives set.
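For concreteness, a minimal sketch of one such run, assuming scikit-learn's SVC (the actual experiments may use a different SVM package, and the messages are assumed to be feature vectors already; the kernel choice is also an assumption):

    import numpy as np
    from sklearn.svm import SVC

    def run_once(X, y, rng):
        # Permute the dataset, then split 2/3 training vs. 1/3 testing
        # (not stratified).
        idx = rng.permutation(len(y))
        cut = (2 * len(y)) // 3
        train, test = idx[:cut], idx[cut:]
        # Train the SVM on the training split ...
        clf = SVC(kernel="linear")
        clf.fit(X[train], y[train])
        # ... and apply it to the held-out third (the same classifier is
        # also applied to the separate false-negatives set).
        return y[test], clf.predict(X[test])

    # e.g.: y_true, y_pred = run_once(X, y, np.random.default_rng(0))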

The SVM outputs an accuracy value, but I wrote a tool that calculates precision and recall by hand, because these values are more interesting:

1 - precision = false positive rate here (the share of mail flagged as spam that is actually ham; an important factor in SA)
1 - recall = false negative rate (equivalently, recall is the detection rate)
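A minimal sketch of that arithmetic, assuming spam is the positive class (1 = spam, 0 = ham); this is an illustration, not my actual tool:

    def precision_recall(y_true, y_pred):
        # Counts with spam as the positive class.
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
        # 1 - precision: share of flagged mail that is ham.
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        # recall is the detection rate; 1 - recall: missed spam.
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        return precision, recall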


I ran this 5 times; the output is attached as a text file, where you will see the exact numbers :)

Taking the mean over the 5 runs:


False positive rate: 0.37908199952036 %
Detection rate: 99.18104855859372 %

Detection rate on false negatives (my SA detects 0% of this set, by construction): 31.7821782178218 %


One should keep in mind that my dataset might not be 100% accurate. It is compiled from my inbox and my spam folder. My spam folder is unlikely to contain ham, but it is certainly possible that I forgot to delete the odd false negative from my inbox. I'm looking forward to getting Justin's set :)



Also - when it identifies messages, do the spam scores go up and the ham scores go down? If so, that makes it more solid and starves the middle. I'm encouraged that the initial results are good.
What do you mean by that question? I don't really understand it :)

My feeling is that if this works, it will work better if we have more informational tokens. For example: is the From address a freemail address? Does the message contain a freemail address? By themselves these wouldn't score points, but spam coming from Yahoo, Hotmail, Gmail, etc. is a different kind of spam than spam coming from spambots. Maybe country tokens from the Received lines would be useful. Maybe names of banks in the message would be useful. For example, Bank of America + Nigeria = spam.
Yes, this is exactly what I meant above. These tokens are of limited use to SA as it stands, but an SVM might be able to exploit them :)
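A hypothetical sketch of such tokens as binary features; the domain and bank lists below are illustrative placeholders, not actual SA rules:

    import email

    FREEMAIL_DOMAINS = {"yahoo.com", "hotmail.com", "gmail.com"}
    BANK_NAMES = ("bank of america", "wells fargo")

    def informational_features(raw_message: bytes) -> dict:
        msg = email.message_from_bytes(raw_message)
        sender = (msg.get("From") or "").lower()
        payload = msg.get_payload(decode=True) or b""
        text = payload.decode("utf-8", errors="replace").lower()
        received = " ".join(msg.get_all("Received") or []).lower()
        return {
            # None of these score points on their own; the SVM weighs them.
            "from_freemail": any(d in sender for d in FREEMAIL_DOMAINS),
            "body_freemail": any(d in text for d in FREEMAIL_DOMAINS),
            "mentions_bank": any(b in text for b in BANK_NAMES),
            # Crude stand-ins for country tokens from the Received lines.
            "received_mentions_ng": ".ng" in received,
            "mentions_nigeria": "nigeria" in text,
        }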


Cheers,


Chris

I suppose what I was thinking was that you would still use the SA result but add to or subtract from it based on your SVM code, sort of the way Bayes does. Or are you letting the SVM make the final determination?

In my SA processing I'm used to getting numbers back and handling messages differently based on the grade of spam/ham. I was envisioning that this new process would increase the accuracy and starve the middle, pushing results toward larger ham/spam scores.
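A hedged sketch of the two options in question; the weight and the zero threshold are illustrative assumptions, not anyone's actual setup:

    def combined_score(sa_score, svm_margin, weight=2.0):
        # Option (a): keep the SA result and let the SVM add or subtract
        # points, the way Bayes does; confident spam is pushed up and
        # confident ham down, starving the middle.
        return sa_score + weight * svm_margin

    def svm_verdict(svm_margin):
        # Option (b): the SVM makes the final determination on its own.
        return svm_margin > 0.0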
