David Legg wrote:
Hi Josip,
Thanks for your comments.
On 24/10/12 15:42, Josip Almasi wrote:
I think I'll wait till it works with Java 7. (The workaround didn't work for me.)
I didn't know that. I'm OK with Java 6 for the moment as that is the default
with Ubuntu 12.04. Still not quite comfortable with this IcedTea business
though... I prefer 100% Java beans :-)
Well, the new JAXB broke more applications. Right now I can't remember exactly
which ones, but I had to go back to JDK 6.
So my first plan is to make the tokenizer more intelligent. It should
carefully extract far more meta-data from the email.
I wrote some mail parsing code; it parses plain text and HTML and ignores other
MIME types. For those, I guess only the headers should be taken into account.
Malformed MIMEs are a real issue there, so I used heuristics to avoid them -
the number of tokens and the size of tokens.
Also, it's better to ignore numbers, or use them as delimiters.
Of course, all message parts need to be processed. That's not cheap, so it
should be limited by a maximum allowed time and/or number of tokens.
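Roughly the kind of part walking I have in mind, as a sketch - the MAX_TOKENS cap and the tokenize/htmlToText helpers are placeholders here, not the code I actually committed:

import javax.mail.MessagingException;
import javax.mail.Multipart;
import javax.mail.Part;
import java.io.IOException;
import java.util.List;

public class PartWalker {

    // Illustrative cap; a real limit would also watch wall-clock time.
    private static final int MAX_TOKENS = 10000;

    // Recursively visits message parts, keeping only text/plain and text/html.
    static void walk(Part part, List<String> tokens) throws MessagingException, IOException {
        if (tokens.size() >= MAX_TOKENS) {
            return; // crude guard against huge or malformed messages
        }
        if (part.isMimeType("multipart/*")) {
            Multipart mp = (Multipart) part.getContent();
            for (int i = 0; i < mp.getCount(); i++) {
                walk(mp.getBodyPart(i), tokens);
            }
        } else if (part.isMimeType("text/plain")) {
            tokenize((String) part.getContent(), tokens);
        } else if (part.isMimeType("text/html")) {
            tokenize(htmlToText((String) part.getContent()), tokens);
        }
        // Anything else (images, attachments, ...) is skipped; only its headers would matter.
    }

    static void tokenize(String text, List<String> tokens) { /* delimiter split, see further down */ }

    static String htmlToText(String html) { /* HTMLEditorKit extraction, see below */ return html; }
}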
That's very interesting. Did you use the Mime4J library to do the heavy
lifting or did you parse the whole message yourself?
I used javax.mail, starting from a good mail parsing example included with it.
I parsed the HTML with javax.swing.text.html.HTMLEditorKit.
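The extraction itself is roughly this - a sketch of the usual HTMLEditorKit recipe, where the IgnoreCharsetDirective property keeps the parser from bailing out on <meta charset> tags:

import javax.swing.text.Document;
import javax.swing.text.html.HTMLEditorKit;
import java.io.StringReader;

public class HtmlText {

    // Extracts plain text from an HTML body with Swing's HTML parser.
    static String htmlToText(String html) throws Exception {
        HTMLEditorKit kit = new HTMLEditorKit();
        Document doc = kit.createDefaultDocument();
        // Without this the parser throws ChangedCharSetException on <meta charset=...>.
        doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);
        kit.read(new StringReader(html), doc, 0);
        return doc.getText(0, doc.getLength());
    }
}

Malformed HTML can still blow this up, as noted below, so in practice every call gets wrapped defensively.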
It's for my mail archiver, not (yet) having anything to do with JAMES:
http://sf.net/projects/mar
So I did sort of the opposite of what antispam is intended to do: I captured only
'good' keywords.
That's a good point about malformed MIMEs. Even with the relatively small
number of spams I've collected I noticed a number of deviant practices.
Tell me about it; one even managed to produce a StackOverflowError in the HTML
parser :)
Not so sure about ignoring numbers though. We certainly need to capture IP
addresses, HTML and CSS colour settings, and also domain names. I can see there
will be a lot of tweaking involved.
Ah, CSS, I forgot about it completely. True, it has to be analyzed.
Uh, HTML... right, for antispam purposes the tags need to be saved too.
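If tags and CSS colours are worth keeping as tokens, one option - just a sketch of the idea using the callback side of the same Swing parser, not something I have running - would be:

import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class TagTokens {

    // Collects tag names plus style/color attributes as extra tokens.
    static List<String> tagTokens(String html) throws Exception {
        final List<String> tokens = new ArrayList<String>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
                tokens.add(t.toString());                      // e.g. "font", "table"
                Object style = a.getAttribute(HTML.Attribute.STYLE);
                if (style != null) {
                    tokens.add(style.toString());              // inline CSS, colours included
                }
                Object color = a.getAttribute(HTML.Attribute.COLOR);
                if (color != null) {
                    tokens.add(color.toString());              // e.g. "#ff0000"
                }
            }
        };
        new ParserDelegator().parse(new StringReader(html), callback, true);
        return tokens;
    }
}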
The catch with numbers is that I received some CSV files containing database table
dumps - hundreds of thousands of lines, each containing unique codes.
And of course many, many smaller ones, with various server logs etc.
Those are best left alone.
IP addresses and domain names - I don't think so.
Suppose you use the dot as a delimiter. Then each byte of an IP address becomes a
token and gets its own weight. Much the same with domains.
Bayes should take care of the rest.
IP addresses are relatively rare in mails anyway; domains are much more important.
Now, should we tokenize www.spammer.com and then weight www, spammer, and com, or
should we store the domain as it is?
I think - tokenize.
It's just a bit more processing, but possibly much less storage:
- one "www" and one "com" stored instead of a zillion
- two fewer dots
- "spammer" is just another keyword, stored and weighted, which may also occur in
other mails that don't contain the domain www.spammer.com
(this is all about message content of course, headers should not be tokenized)
Anyway, here's my delimiter list:
" ,./<>?`~!@#$%^&*()_+=-{}|[]\\;':\"\r\n\t1234567890"
Though numbers should probably be excluded :)
Watching the parsing time and the keyword count should eliminate problems with numbers.
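In Java that list plugs straight into StringTokenizer; a minimal sketch (whether the digits stay in is the open question above):

import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class Tokens {

    // Delimiters, digits included; drop "1234567890" to keep numbers as tokens.
    private static final String DELIMITERS =
            " ,./<>?`~!@#$%^&*()_+=-{}|[]\\;':\"\r\n\t1234567890";

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<String>();
        StringTokenizer st = new StringTokenizer(text, DELIMITERS);
        while (st.hasMoreTokens()) {
            tokens.add(st.nextToken().toLowerCase());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // "www.spammer.com" falls apart into "www", "spammer", "com" - each weighted on its own.
        System.out.println(tokenize("Visit www.spammer.com now!!!"));
        // -> [visit, www, spammer, com, now]
    }
}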
I'm keen to capture phrases (i.e. two or more sequential words) as I've heard
they improve detection at the expense of a larger token database.
Any pointers?
I don't know... it's quite complicated.
Though some lexical comparison might make sense. I wrote some examples here,
but that got a 7.1 spam score and was returned to me :)
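The simplest thing that could work for phrases, if the storage turns out to be worth it, is just pairing up consecutive tokens - a sketch, nothing more:

import java.util.ArrayList;
import java.util.List;

public class Phrases {

    // Two-word phrases: [buy, cheap, pills] -> ["buy cheap", "cheap pills"].
    static List<String> bigrams(List<String> tokens) {
        List<String> phrases = new ArrayList<String>();
        for (int i = 0; i + 1 < tokens.size(); i++) {
            phrases.add(tokens.get(i) + " " + tokens.get(i + 1));
        }
        return phrases;
    }
}

That roughly doubles the tokens stored per message, which is exactly the database growth you mention.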
Image info needs extracting too. So things like the width, height, bit depth,
type of encoding, Exif data and any tags should all be captured.
I quite often get large (several megabyte) emails from China containing
pictures of products for me, and the current James setup gives up on messages
of that size. Or rather, it creates thousands of random tokens full of base64
segments!
That's interesting; I don't get these, at least not as single-part messages. So
Bayes probably picked up other keywords from the text/html part and from the headers.
So I think that's too much effort for a small gain.
Anyway, what would you use to extract image info?
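For the basic numbers I'd probably reach for javax.imageio myself; a sketch of the idea (Exif and tags would need a separate library such as metadata-extractor, this only reads dimensions and format without decoding pixels):

import javax.imageio.ImageIO;
import javax.imageio.ImageReader;
import javax.imageio.stream.ImageInputStream;
import java.io.InputStream;
import java.util.Iterator;

public class ImageInfo {

    // Reads width, height and format name from an image attachment's stream.
    static String describe(InputStream attachment) throws Exception {
        ImageInputStream iis = ImageIO.createImageInputStream(attachment);
        try {
            Iterator<ImageReader> readers = ImageIO.getImageReaders(iis);
            if (!readers.hasNext()) {
                return "unknown"; // not a format ImageIO recognizes
            }
            ImageReader reader = readers.next();
            try {
                reader.setInput(iis);
                // Header-only reads; the pixel data is never decoded, so big mails stay cheap.
                return reader.getFormatName() + " " + reader.getWidth(0) + "x" + reader.getHeight(0);
            } finally {
                reader.dispose();
            }
        } finally {
            iis.close();
        }
    }
}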
Regards...