David Legg wrote:
Hi Josip,
Thanks for your comments.
On 24/10/12 15:42, Josip Almasi wrote:
I think I'll wait till it works with Java 7. (The workaround didn't work for me.)
I didn't know that. I'm OK with Java 6 for the moment as that is the default
with Ubuntu 12.04. Still not quite comfortable with this IcedTea business
though... I prefer 100% Java beans :-)
Well, the new JAXB broke more applications. Right now I can't remember exactly
which ones, but I had to go back to JDK 6.
So my first plan is to make the tokenizer more intelligent. It should
carefully extract far more meta-data from the email.
I wrote some mail parsing code; it parses plain text and HTML and ignores other
MIME types. For those, I guess only the headers should be taken into account.
Malformed MIMEs are a real issue there, so I used heuristics to avoid them -
the number of tokens and the size of tokens.
Also, it's better to ignore numbers, or use them as delimiters.
Of course, all message parts need to be processed. That's not cheap, so it
should be limited by a maximum allowed time and/or number of tokens.
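Roughly the kind of part walking I have in mind, as a sketch - the MAX_TOKENS cap and the tokenize/htmlToText helpers are placeholders here, not the code I actually committed:

import javax.mail.MessagingException;
import javax.mail.Multipart;
import javax.mail.Part;
import java.io.IOException;
import java.util.List;

public class PartWalker {

    // Illustrative cap; a real limit would also watch wall-clock time.
    private static final int MAX_TOKENS = 10000;

    // Recursively visits message parts, keeping only text/plain and text/html.
    static void walk(Part part, List<String> tokens) throws MessagingException, IOException {
        if (tokens.size() >= MAX_TOKENS) {
            return; // crude guard against huge or malformed messages
        }
        if (part.isMimeType("multipart/*")) {
            Multipart mp = (Multipart) part.getContent();
            for (int i = 0; i < mp.getCount(); i++) {
                walk(mp.getBodyPart(i), tokens);
            }
        } else if (part.isMimeType("text/plain")) {
            tokenize((String) part.getContent(), tokens);
        } else if (part.isMimeType("text/html")) {
            tokenize(htmlToText((String) part.getContent()), tokens);
        }
        // Anything else (images, attachments, ...) is skipped; only its headers would matter.
    }

    static void tokenize(String text, List<String> tokens) { /* delimiter split, see further down */ }

    static String htmlToText(String html) { /* HTMLEditorKit extraction, see below */ return html; }
}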
That's very interesting. Did you use the Mime4J library to do the heavy
lifting or did you parse the whole message yourself?
I used javax.mail, starting from a good mail parsing example included with it.
I parsed the HTML with javax.swing.text.html.HTMLEditorKit.
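The extraction itself is roughly this - a sketch of the usual HTMLEditorKit recipe, where the IgnoreCharsetDirective property keeps the parser from bailing out on <meta charset> tags:

import javax.swing.text.Document;
import javax.swing.text.html.HTMLEditorKit;
import java.io.StringReader;

public class HtmlText {

    // Extracts plain text from an HTML body with Swing's HTML parser.
    static String htmlToText(String html) throws Exception {
        HTMLEditorKit kit = new HTMLEditorKit();
        Document doc = kit.createDefaultDocument();
        // Without this the parser throws ChangedCharSetException on <meta charset=...>.
        doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);
        kit.read(new StringReader(html), doc, 0);
        return doc.getText(0, doc.getLength());
    }
}

Malformed HTML can still blow this up, as noted below, so in practice every call gets wrapped defensively.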
It's for my mail archiver, not (yet) having anything to do with JAMES:
http://sf.net/projects/mar
So I did sort of the opposite of what antispam is intended to do: I captured only
'good' keywords.
That's a good point about malformed MIMEs. Even with the relatively small
number of spams I've collected I noticed a number of deviant practices.
Tell me about it; one even managed to produce a StackOverflowError in the HTML
parser :)
Not so sure about ignoring numbers though. We certainly need to capture IP
addresses, HTML and CSS colour settings, and also domain names. I can see there
will be a lot of tweaking involved.
Ah, CSS, I forgot about it completely. True, it has to be analyzed.
Uh, HTML... right, for antispam purposes the tags need to be saved too.
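If tags and CSS colours are worth keeping as tokens, one option - just a sketch of the idea using the callback side of the same Swing parser, not something I have running - would be:

import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class TagTokens {

    // Collects tag names plus style/color attributes as extra tokens.
    static List<String> tagTokens(String html) throws Exception {
        final List<String> tokens = new ArrayList<String>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
                tokens.add(t.toString());                      // e.g. "font", "table"
                Object style = a.getAttribute(HTML.Attribute.STYLE);
                if (style != null) {
                    tokens.add(style.toString());              // inline CSS, colours included
                }
                Object color = a.getAttribute(HTML.Attribute.COLOR);
                if (color != null) {
                    tokens.add(color.toString());              // e.g. "#ff0000"
                }
            }
        };
        new ParserDelegator().parse(new StringReader(html), callback, true);
        return tokens;
    }
}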
The catch with numbers is that I received some CSV files containing database table
dumps - hundreds of thousands of lines, each containing unique codes.
And of course many, many smaller ones, with various server logs etc.
Those are best left alone.
IP addresses and domain names - I don't think so.
Suppose you use the dot as a delimiter. Then each byte of an IP address becomes a
token and gets its own weight. Much the same with domains.
Bayes should take care of the rest.
IP addresses are relatively rare in mails anyway; domains are much more important.
Now, should we tokenize www.spammer.com and then weight www, spammer, and com, or
should we store the domain as it is?
I think - tokenize.
It's just a bit more processing, but possibly much less storage:
- one "www" and one "com" stored instead of a zillion
- two fewer dots
- "spammer" is just another keyword, stored and weighted, which may also occur in
other mails that don't contain the domain www.spammer.com
(this is all about message content of course, headers should not be tokenized)
Anyway, here's my delimiter list:
" ,./<>?`~!@#$%^&*()_+=-{}|[]\\;':\"\r\n\t1234567890"
Though numbers should probably be excluded :)
Watching the parsing time and the keyword count should eliminate problems with numbers.
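In Java that list plugs straight into StringTokenizer; a minimal sketch (whether the digits stay in is the open question above):

import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class Tokens {

    // Delimiters, digits included; drop "1234567890" to keep numbers as tokens.
    private static final String DELIMITERS =
            " ,./<>?`~!@#$%^&*()_+=-{}|[]\\;':\"\r\n\t1234567890";

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<String>();
        StringTokenizer st = new StringTokenizer(text, DELIMITERS);
        while (st.hasMoreTokens()) {
            tokens.add(st.nextToken().toLowerCase());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // "www.spammer.com" falls apart into "www", "spammer", "com" - each weighted on its own.
        System.out.println(tokenize("Visit www.spammer.com now!!!"));
        // -> [visit, www, spammer, com, now]
    }
}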
I'm keen to capture phrases (i.e. two or more sequential words) as I've heard
they improve detection at the expense of a larger token database.
Any pointers?
I don't know... it's quite complicated.
Though some lexical comparison might make sense. I wrote some examples here,
but that got a 7.1 spam score and was returned to me :)
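The simplest thing that could work for phrases, if the storage turns out to be worth it, is just pairing up consecutive tokens - a sketch, nothing more:

import java.util.ArrayList;
import java.util.List;

public class Phrases {

    // Two-word phrases: [buy, cheap, pills] -> ["buy cheap", "cheap pills"].
    static List<String> bigrams(List<String> tokens) {
        List<String> phrases = new ArrayList<String>();
        for (int i = 0; i + 1 < tokens.size(); i++) {
            phrases.add(tokens.get(i) + " " + tokens.get(i + 1));
        }
        return phrases;
    }
}

That roughly doubles the tokens stored per message, which is exactly the database growth you mention.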
Image info needs extracting too. So things like the width, height, bit depth,
type of encoding, Exif data and any tags should all be captured.
I quite often get large (several megabyte) emails from China containing
pictures of products for me, and the current James setup gives up on messages
of that size. Or rather, it creates thousands of random tokens full of base64
segments!
That's interesting; I don't get these, at least not as single-part messages. So
Bayes probably picked up other keywords from the text/html part and from the headers.
So I think that's too much effort for a small gain.
Anyway, what would you use to extract image info?
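For the basic numbers I'd probably reach for javax.imageio myself; a sketch of the idea (Exif and tags would need a separate library such as metadata-extractor, this only reads dimensions and format without decoding pixels):

import javax.imageio.ImageIO;
import javax.imageio.ImageReader;
import javax.imageio.stream.ImageInputStream;
import java.io.InputStream;
import java.util.Iterator;

public class ImageInfo {

    // Reads width, height and format name from an image attachment's stream.
    static String describe(InputStream attachment) throws Exception {
        ImageInputStream iis = ImageIO.createImageInputStream(attachment);
        try {
            Iterator<ImageReader> readers = ImageIO.getImageReaders(iis);
            if (!readers.hasNext()) {
                return "unknown"; // not a format ImageIO recognizes
            }
            ImageReader reader = readers.next();
            try {
                reader.setInput(iis);
                // Header-only reads; the pixel data is never decoded, so big mails stay cheap.
                return reader.getFormatName() + " " + reader.getWidth(0) + "x" + reader.getHeight(0);
            } finally {
                reader.dispose();
            }
        } finally {
            iis.close();
        }
    }
}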
Regards...