Jason Haar wrote:
>They aren't triggering (enough) network rule matches, contain a
>bayes-killer, and even FuzzyOCR can't manage the swirly image trick
>they pull. Has anyone come up with a way to fight these?

Jason, thanks for the cheerful Subject.  I needed that today. :)

I'm catching all of these, with decent scores (15+).

Here are a few easy things you might score on (up to about 2.5 points each):

1. non-huge image which does _NOT_ have an HTML part
   (this will also help with the "lonely girl" spams; it's highly
    unusual for images to be attached to pure text emails; usually
    only Nerds send pure text, and our most typical image attachment
    is a GIF/PNG screenshot, or a somewhat large JPEG)
2. metas for images that have hit any reliable blocklist
   (I have found Barracuda very helpful, but it definitely has a high
    FP rate, so score it low if you don't have a decent false positives
    pipeline)
3. botnet test
4. metas for images sent from/thru "unusual" nations

These may not be as easy, but they may be of interest to our
resident developers:

5. all of these have a real name in the From header, with most being
   a single word, which is very unusual
   (note also that _NONE_ have a real name in the To header, which I
    do score, but that has a high FP rate so I can not recommend it
    unless you have a solid FP pipeline)
6. size of the JPEG header (this may be easy to add to ImageInfo)
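
For item 5, a minimal Python sketch of counting the words in a header's
real name, using the standard library's `email.utils.parseaddr`. The
function names are hypothetical, and this only looks at the first
address in the header.

```python
from email.utils import parseaddr


def display_name_words(header_value: str) -> int:
    """Number of whitespace-separated words in the real-name part of a header."""
    name, _addr = parseaddr(header_value)
    return len(name.split())


def single_name_from_bare_to(from_hdr: str, to_hdr: str) -> bool:
    """One-word real name in From, and no real name at all in To."""
    return display_name_words(from_hdr) == 1 and display_name_words(to_hdr) == 0


print(display_name_words("Jason <jason@example.com>"))       # 1
print(display_name_words("Jason Haar <jason@example.com>"))  # 2
print(display_name_words("user@example.com"))                # 0
```

Given the FP rate noted for the To half, it may be safer to treat the
two conditions as separate, low-scoring inputs to a meta.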

I just noticed #6 now, after dumping some image properties for wavy vs
non-wavy spam images, and was surprised by it.  It never occurred to me
to export the file hdr size.  By now I should have KNOWN better, and
should have added export of ALL properties to my image properties
test the last time this sort of thing happened.  I'll fix that in the
next version. :)
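
One way a JPEG "header size" could be measured, sketched in Python.
This assumes the header is everything up to and including the first
Start Of Scan (SOS) segment header, with the compressed scan data
counted as the rest of the file. That definition is an assumption, not
necessarily how the hdr numbers in this message were produced, and
`jpeg_header_size` is a made-up name, not part of ImageInfo.

```python
import struct


def jpeg_header_size(data: bytes) -> int:
    """Bytes from the start of the file through the first SOS segment header."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG (missing SOI marker)")
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            raise ValueError(f"expected a marker at offset {pos}")
        marker = data[pos + 1]
        if marker == 0xFF:  # fill byte before a marker, skip it
            pos += 1
            continue
        if marker == 0x01 or 0xD0 <= marker <= 0xD7:  # markers with no length field
            pos += 2
            continue
        (seg_len,) = struct.unpack(">H", data[pos + 2:pos + 4])
        pos += 2 + seg_len  # 2 marker bytes + segment (length field included)
        if marker == 0xDA:  # SOS: scan data starts right after this segment
            return pos
    raise ValueError("no SOS marker found")


# Synthetic file: SOI, one APP0 segment, an SOS header, fake scan data, EOI.
sample = (
    b"\xff\xd8"
    + b"\xff\xe0" + struct.pack(">H", 16) + b"\x00" * 14   # 18 bytes
    + b"\xff\xda" + struct.pack(">H", 12) + b"\x00" * 10   # 14 bytes
    + b"\x12\x34\x56"
    + b"\xff\xd9"
)
hdr = jpeg_header_size(sample)
print(hdr, len(sample) - hdr)  # 34 5
```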

Here are the properties of my last few days of "wavy" images:
  1 MP#2(jpeg): Area=100804 Density=9.85 bytes=10854(hdr:623,dat:10231) (319x316)
  1 MP#2(jpeg): Area=103152 Density=9.32 bytes=11688(hdr:623,dat:11065) (336x307)
  1 MP#2(jpeg): Area=103206 Density=5.05 bytes=21045(hdr:623,dat:20422) (309x334)
  1 MP#2(jpeg): Area=104304 Density=5.33 bytes=20176(hdr:623,dat:19553) (318x328)
  1 MP#2(jpeg): Area=107584 Density=5.58 bytes=19896(hdr:623,dat:19273) (328x328)
  1 MP#2(jpeg): Area=108072 Density=9.51 bytes=11982(hdr:623,dat:11359) (342x316)
  1 MP#2(jpeg): Area=109472 Density=5.24 bytes=21501(hdr:623,dat:20878) (352x311)
  1 MP#2(jpeg): Area= 81104 Density=4.40 bytes=19067(hdr:623,dat:18444) (296x274)
  1 MP#2(jpeg): Area= 87809 Density=5.69 bytes=16064(hdr:623,dat:15441) (317x277)
  1 MP#2(jpeg): Area= 95142 Density=5.41 bytes=18223(hdr:623,dat:17600) (303x314)
  1 MP#2(jpeg): Area= 97148 Density=4.96 bytes=20208(hdr:623,dat:19585) (326x298)
The interesting column is "hdr:623".
If you're using ImageInfo, the other numbers are useful for limiting
your metas to the total size range typical of these.
The first column is the number of occurrences.
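
A small Python sketch for reading dump lines in this format and
flagging the ones that match the "wavy" profile. The regex is written
against the lines shown in this message only. The area bounds (80,000
to 110,000) are simply a loose fit to the range seen above, and
`parse_props` and `looks_wavy` are hypothetical names. In the rows
checked, Area is width times height, bytes is hdr plus dat, and Density
appears to be Area divided by dat.

```python
import re

LINE_RE = re.compile(
    r"(?P<count>\d+)\s+MP#(?P<part>\d+)\((?P<fmt>\w+)\):\s*"
    r"Area=\s*(?P<area>\d+)\s+Density=\s*(?P<density>[\d.]+)\s+"
    r"bytes=(?P<bytes>\d+)\(hdr:(?P<hdr>\d+),dat:(?P<dat>\d+)\)\s*"
    r"\((?P<w>\d+)x(?P<h>\d+)\)"  # \s* above lets the size sit on the next line
)


def parse_props(text: str) -> list:
    """Parse every property record found in text into a dict."""
    rows = []
    for m in LINE_RE.finditer(text):
        row = {k: int(v) for k, v in m.groupdict().items()
               if k not in ("fmt", "density")}
        row["fmt"] = m["fmt"]
        row["density"] = float(m["density"])
        rows.append(row)
    return rows


def looks_wavy(row: dict, hdr: int = 623,
               min_area: int = 80_000, max_area: int = 110_000) -> bool:
    """JPEG with the telltale header size and an area in the observed range."""
    return (row["fmt"] == "jpeg" and row["hdr"] == hdr
            and min_area <= row["area"] <= max_area)


dump = """\
  1 MP#2(jpeg): Area=100804 Density=9.85 bytes=10854(hdr:623,dat:10231)
(319x316)
  3 MP#2(jpeg): Area=115062 Density= 4.36 bytes=27110(hdr:735,dat:26375)
(254x453)
"""
rows = parse_props(dump)
for row in rows:
    # Consistency checks on the columns, as observed in the dump above.
    assert row["w"] * row["h"] == row["area"]
    assert row["hdr"] + row["dat"] == row["bytes"]
    assert round(row["area"] / row["dat"], 2) == row["density"]
    print(row["area"], row["hdr"], looks_wavy(row))
```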

Here are the properties of all NON-wavy spam images from the same period:
  3 MP#2(jpeg): Area=115062 Density= 4.36 bytes=27110(hdr:735,dat:26375) (254x453)
  1 MP#2(jpeg): Area=120300 Density= 6.40 bytes=19185(hdr:387,dat:18798) (300x401)
  2 MP#2(jpeg): Area=166410 Density=11.62 bytes=14700(hdr:383,dat:14317) (430x387)
  1 MP#2(jpeg): Area=166704 Density= 8.55 bytes=19891(hdr:398,dat:19493) (453x368)
  1 MP#2(jpeg): Area=197735 Density=13.10 bytes=15476(hdr:380,dat:15096) (355x557)
  1 MP#2(jpeg): Area=240800 Density=14.59 bytes=16901(hdr:392,dat:16509) (700x344)
  1 MP#3(jpeg): Area=197735 Density=13.10 bytes=15476(hdr:380,dat:15096) (355x557)
 17 MP#3(jpeg): Area=239500 Density= 5.53 bytes=43685(hdr:406,dat:43279) (479x500)

I dumped the last month's worth of ham image properties from my most
diverse domain, and did find a handful which had that same hdr size
("623").  However, they all had vastly different areas and/or occurred
in messages with multiple images.

I'll check a few more domains and months' worth, before using that
for real.  I expect to score this in the 2 to 3 range.


Mike Cardwell wrote: 
>Presently it renders them as plain text. I'm fully aware of the
>potential problems with it. Ideally I'd like to be able to render
>those parts as HTML, but I need to be 100% sure that I've stripped
>out anything dangerous (including embedded remote content by
>default) first. It's on the "ToDo List" page.

Nice job Mike! :)

I wrestled with that same issue when I added direct viewing of HTML
content to my offline analysis/FP-pipeline/MassChecks tool.

Originally, I was using an ActiveX wrapper around IE, which (of
course) made me nervous.  I added some VERY simple, crude tag
stripping (script, iframe, style), but was never happy with it.
I ended up switching to an open source HTML rendering component
which lacked support for all the scary stuff. :)

Whatever you decide to do, please do post more about it, and q'pla!

>I'm also aware of the issues surrounding people potentially
>uploading images and then linking to them from spam websites or
>spam. That's why I've put http referer restrictions in place.

Perhaps redirecting to an image saying something like
"this is spam"? :)

What about requiring registration?  Yes, it's not enough to
stop the most determined, but will whittle it down to the least
stupid.
        - "Chip"

