decoder wrote:
Marc Perkel wrote:

Good work so far, but it sounds like you need to throw more data at it. Also, even though you report "over 99% accuracy", can you break that down in more detail? 99.9% is ten times as accurate as 99%.
What do you mean by more data? Of course, some additional data might help. One should keep in mind that _most_ SA rules are designed to score on spam. For an SVM, you can use more general data like "mail has property XYZ", even if you don't know whether this property indicates ham or spam, or whether it is suitable for classification at all. This is, of course, an advantage.
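As a hypothetical illustration (the property names below are invented, not actual SA rules), such properties simply become entries in the SVM's feature vector, and the SVM learns their weights itself:

    # Hypothetical: generic binary properties as feature-vector entries.
    # Whether a property is hammy, spammy, or neither is left to the SVM.
    features = {
        "has_html_part":    1,  # not indicative of ham or spam by itself
        "subject_all_caps": 0,
        "sa_rule_hit_FOO":  1,  # existing SA rule hits fit the same mold
    }
    x = [features[name] for name in sorted(features)]  # one SVM input row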


With respect to the numbers:

I repeated the experiments today with slight modifications to provide a more solid setup:

The input is again the dataset I used yesterday. In one run, I permute the dataset, then split it (2/3 training vs. 1/3 testing, not stratified). The training set is used to train an SVM, which is then applied to the 1/3 test set and additionally to my false-negatives set.
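For concreteness, a minimal sketch of one such run, assuming scikit-learn's SVC (the actual experiments may use a different SVM package, and the messages are assumed to be feature vectors already; the kernel choice is also an assumption):

    import numpy as np
    from sklearn.svm import SVC

    def run_once(X, y, rng):
        # Permute the dataset, then split 2/3 training vs. 1/3 testing
        # (not stratified).
        idx = rng.permutation(len(y))
        cut = (2 * len(y)) // 3
        train, test = idx[:cut], idx[cut:]
        # Train the SVM on the training split ...
        clf = SVC(kernel="linear")
        clf.fit(X[train], y[train])
        # ... and apply it to the held-out third (the same classifier is
        # also applied to the separate false-negatives set).
        return y[test], clf.predict(X[test])

    # e.g.: y_true, y_pred = run_once(X, y, np.random.default_rng(0))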

The SVM outputs an accuracy value, but I wrote a tool that calculates precision and recall by hand, because these values are more interesting:

1 - precision = false positive rate here (the share of mail flagged as spam that is actually ham; an important factor in SA)
1 - recall = false negative rate (equivalently, recall is the detection rate)
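A minimal sketch of that arithmetic, assuming spam is the positive class (1 = spam, 0 = ham); this is an illustration, not my actual tool:

    def precision_recall(y_true, y_pred):
        # Counts with spam as the positive class.
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
        # 1 - precision: share of flagged mail that is ham.
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        # recall is the detection rate; 1 - recall: missed spam.
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        return precision, recall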


I ran this 5 times; the output is attached as a text file, where you will see the exact numbers :)

Taking the mean over the 5 runs:


False positive rate: 0.37908199952036 %
Detection rate: 99.18104855859372 %

Detection rate on false negatives (my SA detects 0% of this set, by construction): 31.7821782178218 %


One should keep in mind that my dataset might not be 100% accurate. It is compiled from my inbox and my spam folder. My spam folder is unlikely to contain ham, but it is certainly possible that I forgot to delete the odd false negative from my inbox. I'm looking forward to getting Justin's set :)



Also - when it identifies messages, do the spam scores go up and the ham scores go down? If so, that makes it more solid and starves the middle. I'm encouraged that the initial results are good.
What do you mean by that question? I don't really understand it :)

My feeling is that if this works, it will work better if we have more informational tokens. For example: is the From address a freemail address? Does the message contain a freemail address? By themselves these wouldn't score points, but spam coming from Yahoo, Hotmail, Gmail, etc. is a different kind of spam than spam coming from spambots. Maybe country tokens from the Received lines would be useful. Maybe names of banks in the message would be useful. For example, Bank of America + Nigeria = spam.
Yes, this is exactly what I meant above. These tokens are of limited use to SA as it stands, but an SVM might be able to exploit them :)
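A hypothetical sketch of such tokens as binary features; the domain and bank lists below are illustrative placeholders, not actual SA rules:

    import email

    FREEMAIL_DOMAINS = {"yahoo.com", "hotmail.com", "gmail.com"}
    BANK_NAMES = ("bank of america", "wells fargo")

    def informational_features(raw_message: bytes) -> dict:
        msg = email.message_from_bytes(raw_message)
        sender = (msg.get("From") or "").lower()
        payload = msg.get_payload(decode=True) or b""
        text = payload.decode("utf-8", errors="replace").lower()
        received = " ".join(msg.get_all("Received") or []).lower()
        return {
            # None of these score points on their own; the SVM weighs them.
            "from_freemail": any(d in sender for d in FREEMAIL_DOMAINS),
            "body_freemail": any(d in text for d in FREEMAIL_DOMAINS),
            "mentions_bank": any(b in text for b in BANK_NAMES),
            # Crude stand-ins for country tokens from the Received lines.
            "received_mentions_ng": ".ng" in received,
            "mentions_nigeria": "nigeria" in text,
        }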


Cheers,


Chris

I suppose what I was thinking was that you would still use the SA result but add to or subtract from it based on your SVM code, sort of the way Bayes does. Or are you letting the SVM make the final determination?

In my SA processing I'm used to getting numbers back and handling messages differently based on the grade of spam/ham. I was envisioning that this new process would increase the accuracy and starve the middle, pushing results toward larger ham/spam scores.
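A hedged sketch of the two options in question; the weight and the zero threshold are illustrative assumptions, not anyone's actual setup:

    def combined_score(sa_score, svm_margin, weight=2.0):
        # Option (a): keep the SA result and let the SVM add or subtract
        # points, the way Bayes does; confident spam is pushed up and
        # confident ham down, starving the middle.
        return sa_score + weight * svm_margin

    def svm_verdict(svm_margin):
        # Option (b): the SVM makes the final determination on its own.
        return svm_margin > 0.0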
