Re: [Clamav-users] Complexity limit on (custom) signatures?
Kris Deugau wrote:
> From the problems I'm having with supposedly malformed signatures, it looks like there's an effective complexity limit; from the problems in *matching* a signature that's finally been found to be acceptable, it looks like there's a (lower) limit on what Clam can actually use in matching. Any suggestions on what I might be doing wrong?

Just to try to bring this interesting discussion back to the problem I'm having... <g>

- Image-based spam is slipping past the existing spam detection tool. Upgrading said tool is Not Possible due to system load, and the fact that this system was due to be retired about eight months ago.
- I already do virus scans with a fairly stock ClamAV install... the meat of the spams that are getting through is embedded in an image file... So I'll create signatures for these files.
- Due to the variety of hiding techniques used, it's rare to find two identical image files, therefore MD5 sums are mostly useless. (On a *very* large scale, there might be enough duplication for effective use of MD5 sigs.)
- Hex dumps of a collection of these image files show *some* similarity that could be used with the extended signature format.
- Scripts have been created to munge this data into what are supposedly valid signatures.
- These supposedly-valid signatures are either:
  a) Rejected outright by Clam as malformed
  b) Accepted, but don't actually match on any of the files that were used to create them.

As I said originally, it looks like there is a limit somewhere on how complex a signature Clam can accept, and a lower limit on what it can use effectively. Am I just seeing things, or am I triggering an odd corner-case bug in Clam's signature handling? (Or just tripping over a designed limit?)

I would guess that it's rare for viruses to be quite as mutable as these image spams, so where a pair of 30-character hex strings separated by 30-50 unknown characters may easily identify a virus, along with 3 or 4 variants (and continues to do so for the in-the-wild life of the virus), that wouldn't identify very many imagespam images for very long.

-kgd
___ http://lurker.clamav.net/list/clamav-users.html
Re: [Clamav-users] Complexity limit on (custom) signatures?
On Mon, 30 Oct 2006 19:35:13 +0100 aCaB [EMAIL PROTECTED] wrote:
> So, this: 474946383761??(01|00)??0044
> Should really read: 47494638376144

Or even 474946383761??0(0|1)??0044
Re: [Clamav-users] Complexity limit on (custom) signatures?
On Tue, 31 Oct 2006 07:48:46 +1300 Steve Holdoway [EMAIL PROTECTED] wrote:
> On Mon, 30 Oct 2006 19:35:13 +0100 aCaB [EMAIL PROTECTED] wrote:
>> So, this: 474946383761??(01|00)??0044
>> Should really read: 47494638376144
>
> Or even 474946383761??0(0|1)??0044

Sorry, scrap that. No coffee yet this morning (:
Re: [Clamav-users] Complexity limit on (custom) signatures?
On Mon, 30 Oct 2006 19:35:13 +0100 aCaB [EMAIL PROTECTED] wrote:
> Kris Deugau wrote:
>> ImgSpam.Misc.5:0:0:474946383761??(01|00)??00442c??(01|00)??0084(00|48|53)(00|15)(00|30|1c)f0f0f0(f0|e0|c0)f0(e0|b0|f0|d0|c0)f0(00|f0|40)(00|d0|e0|60|70)(f0|90|00|c0)(e0|90|00|b0|70)f0??(00|90|40|7d|10)(f0|ea)??(f0|00|e0|d0|46)
>
> Hi Kris,
> There are a number of problems with your sample sig. The most important rules you should obey are:

A few corrections :-)

> 1) you always need at least 2 static bytes before and after a wildcard (though a series of ?? is fine)

With 0.9x it's enough to have a block of 2 static bytes somewhere in a part of signature (by 'part' I mean a sequence delimited by range wildcards (*, {})).

> 2) a static block must not start with 00

It can start with 00 :-)

> 3) The alt syntax is (aa|bb), not (aa|bb|cc..)

(aa|bb|cc..) is just fine.

> So, just looking at the beginning of your sig:

The above sig looks OK to me.

--
Tomasz Kojm [EMAIL PROTECTED]
http://www.ClamAV.net/gpg/tkojm.gpg
0DCA5A08407D5288279DB43454822DC8985A444B
Mon Oct 30 19:53:57 CET 2006
Re: [Clamav-users] Complexity limit on (custom) signatures?
Steve Holdoway wrote:
> Or even 474946383761??0(0|1)??0044

Nope! Bytes only, no nibbles.
Re: [Clamav-users] Complexity limit on (custom) signatures?
Tomasz Kojm wrote:
> with 0.9x

Indeed! :)
Re: [Clamav-users] Complexity limit on (custom) signatures?
aCaB wrote:
> Kris Deugau wrote:
>> ImgSpam.Misc.5:0:0:474946383761??(01|00)??00442c??(01|00)??0084(00|48|53)(00|15)(00|30|1c)f0f0f0(f0|e0|c0)f0(e0|b0|f0|d0|c0)f0(00|f0|40)(00|d0|e0|60|70)(f0|90|00|c0)(e0|90|00|b0|70)f0??(00|90|40|7d|10)(f0|ea)??(f0|00|e0|d0|46)
>
> Hi Kris,
> There are a number of problems with your sample sig.

Advice appreciated... but I'm not sure you're at all correct. As a matter of fact, that's about a *third* of an active, live sig that Clam seems to be using quite happily right now. ;)

> The most important rules you should obey are:
> 1) you always need at least 2 static bytes before and after a wildcard (though a series of ?? is fine)

Ick. Clam doesn't seem to complain consistently about this, though. Just for clarification, you mean:

Valid: ...0ef2bc34...
Invalid: ...(00|01)f3(89|bc)... or ...??f4(00|01)b4??...

?

> 2) a static block must not start with 00

Ick again. Clam doesn't consistently complain about this, either. Again, for clarity:

Valid: ...??0101ab... or ...(a3|45)ff003400...
Invalid: ...??0001ab... or ...(a3|45)3400...

> 3) The alt syntax is (aa|bb), not (aa|bb|cc..)

Ick cubed. <g> Clam happily *accepts* (aa|bb|cc..), and in fact I think it's working just fine. Except when it doesn't. :(

In short: If it's invalid, why doesn't Clam complain? and: If it's valid, why doesn't it work? <G>

Let's take a new example. This is one I've just pushed out to production now: (This is also one of the simpler ones I've generated!)

test.test:0:*:2c??01??010003ff48badcfe30ca49abbd38ebcdbbff60288e64699e68aaae6cebbe702ccf746ddf78aeef7cefffc0a070482c1a8fc8a472c96c3a9fd0a8744aad5aaf(58|15|ed|e8)(70|2c|0a|7a)(b7|ba|16|05)(de|dc|8b)(6f|2f|2e|af)(97|d8|38|37)(20|4b|fc|2c)(1e|18|25|06)(9f|93|90|13)(cc|4f|cb|ca)(5a|e7|27|e6)(99|ad|34|53){180}ff{37}

Clam complains about this... but once I trim the trailing {180}ff{37} (something like that usually comes up), Clam is happy (and this time, not only accepts the sig as valid, but tags the files I used to generate it - which, for this class of imagespam, are almost disturbingly *regular*).

However, quite often I have to keep trimming bits off the end, with NO pattern I can see (your rules don't seem to apply, if memory serves from past attempts). Eventually I reach a point where a) the sig is accepted as valid by Clam, and b) Clam tags the source files using it. b) is quite often a much shorter sig than a).

Just to thoroughly confuse things, here's a sig that Clam doesn't complain about, which *still* violates all three of your rules above (I think...):

testsig:0:0:0aefbf??(00|01){12}(00|01|02|03)??ff

I don't know if it would actually match on any files though. (Just working on a quick hack to test this now.)

> So, just looking at the beginning of your sig:
> 474946383761 = all fine, static
> ?? = wildcard following a static block, fine
> (01|00) = wildcard not following a static block, bad
> ?? = wildcard not following a static block, bad
> 0044 = static block starting with 00, bad
> So, this: 474946383761??(01|00)??0044
> Should really read: 47494638376144

The problem with doing that is that I end up with something like:

474946383761??01ae{185}(ae|01){200}(0e|f0)

Or worse:

474946383761{400}

(I've come pretty close to that - *big*, *long* strings of anything with maybe one or two solitary static bytes.) If more than two possibilities for any given byte (and that's pretty much normal for these images) have to be turned into ??, I generally end up with a VERY long string of ??, which compresses down to {nn}. Which doesn't make a very useful signature. :/

-kgd
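One way to separate the two failure modes described above (a signature rejected as malformed versus one accepted but never matching) is to check the pattern outside ClamAV. The sketch below translates a signature body into a Python regular expression and runs it over raw bytes. It assumes only the constructs that appear in this thread (`??`, `*`, `{n}`, `{n-m}`, `(aa|bb|...)` and literal byte pairs) and only the `*` and absolute-number offset forms. `sig_to_regex` and `sig_matches` are hypothetical helper names, and nothing here reproduces ClamAV's own matcher or its validation rules.

```python
import re

# One token of a signature body. Only constructs seen in this thread are handled.
TOKEN = re.compile(
    r"(?P<any>\?\?)"                            # ?? : any single byte
    r"|(?P<star>\*)"                            # *  : any number of bytes
    r"|\{(?P<lo>\d*)(?P<dash>-?)(?P<hi>\d*)\}"  # {n}, {n-}, {-m}, {n-m}
    r"|\((?P<alt>[0-9a-fA-F|]+)\)"              # (aa|bb|cc)
    r"|(?P<hex>[0-9a-fA-F]{2})"                 # one literal byte
)

def sig_to_regex(sig_body):
    """Translate a hex signature body into a compiled bytes regex (illustrative only)."""
    out = []
    pos = 0
    while pos < len(sig_body):
        m = TOKEN.match(sig_body, pos)
        if m is None:
            raise ValueError(f"unparsed signature text at offset {pos}: {sig_body[pos:pos + 10]!r}")
        if m.group("any"):
            out.append(b".")
        elif m.group("star"):
            out.append(b".*?")
        elif m.group("alt"):
            options = [re.escape(bytes.fromhex(h)) for h in m.group("alt").split("|")]
            out.append(b"(?:" + b"|".join(options) + b")")
        elif m.group("hex"):
            out.append(re.escape(bytes.fromhex(m.group("hex"))))
        else:
            # Range wildcard: a bare {n} means exactly n bytes.
            lo, dash, hi = m.group("lo"), m.group("dash"), m.group("hi")
            if not dash:
                hi = lo
            out.append(b".{" + (lo or "0").encode() + b"," + hi.encode() + b"}")
        pos = m.end()
    # DOTALL so that "." also matches the newline byte.
    return re.compile(b"".join(out), re.DOTALL)

def sig_matches(sig_line, data):
    """Check a 'name:target:offset:body' line against raw bytes."""
    name, target, offset, body = sig_line.strip().split(":", 3)
    pattern = sig_to_regex(body)
    if offset == "*":
        return pattern.search(data) is not None
    # Assumption: any other offset is a plain absolute byte position.
    return pattern.match(data, int(offset)) is not None
```

If a generated signature matches its source files here but not under clamscan, the generating script is probably sound and the difference lies in what the engine accepts or uses. The nibble form `0(0|1)` raises `ValueError` in this sketch, in line with the "bytes only" remark in the thread.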
Re: [Clamav-users] Complexity limit on (custom) signatures?
Tomasz Kojm wrote:
> A few corrections :-)

Ah! The Voice of Authority! <g>

> aCaB [EMAIL PROTECTED] wrote:
>> 1) you always need at least 2 static bytes before and after a wildcard (though a serie of ?? is fine)
>
> with 0.9x it's enough to have a block of 2 static bytes somewhere in a part of signature (by 'part' I mean a sequence delimited by range wildcards (*, {})).

FWIW, the sig above is working fine, with 0.88.2 (haven't been inspired enough to get 0.88.5 backported to Debian woody).

> The above sig looks OK to me.

And aside from the fact that it's not hitting new traffic any more, Clam has been happy with it too. I'm still curious about what seem to be inconsistencies in what's valid and what's not, though.

-kgd
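The 0.9x rule quoted above can be read literally as a check: split the signature body at range wildcards (`*` and `{...}`) and require two consecutive static bytes somewhere in each resulting part. The sketch below does only that. `parts_without_anchor` is a hypothetical name, the reading of the rule is an assumption, and this is not ClamAV's validator.

```python
import re

# Range wildcards delimit the "parts": '*' or '{n}', '{n-}', '{-m}', '{n-m}'.
RANGE_WILDCARD = re.compile(r"\*|\{\d*-?\d*\}")
# Two consecutive static bytes show up as four hex digits in a row.
TWO_STATIC_BYTES = re.compile(r"(?:[0-9a-fA-F]{2}){2}")

def strip_alternations(part):
    """Replace each (aa|bb|...) group with a marker so its bytes do not count as static."""
    return re.sub(r"\([^)]*\)", "|", part)

def parts_without_anchor(sig_body):
    """Return the parts of a signature body that lack a run of two static bytes."""
    bad = []
    for part in RANGE_WILDCARD.split(sig_body):
        # Single-byte wildcards also interrupt a static run.
        literal_view = strip_alternations(part).replace("??", "|")
        if not TWO_STATIC_BYTES.search(literal_view):
            bad.append(part)
    return bad

# Shortened stand-in for the tail of the test.test signature in this thread.
print(parts_without_anchor("2c??01??010003ff(58|15){180}ff{37}"))
```

Under this reading, the trailing `{180}ff{37}` leaves a part holding the single byte `ff` plus an empty part after `{37}`, and both lack an anchor, which would fit the report that trimming that tail makes the signature acceptable. The thread does not confirm this is the cause. The same check also flags the `testsig` example that 0.88.2 reportedly accepted, so it is evidently not what that version enforces.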
Re: [Clamav-users] Complexity limit on (custom) signatures?
On Sat, Oct 28, 2006 at 04:28:47PM -0700, Dennis Peterson wrote:
>> I don't get it.. unless you have some big honeypot, maybe 5% of traffic contain small images to be OCRd. If your server can't handle that, I guess it's running out of juice anyway. :) You can even easily create separate scanning queue for OCR, so it doesn't interfere with normal traffic.
>
> You may have missed that I'm in the image industry - a great deal of what we do is imagery including imagery with text in it, and as we have to scan all images over a particular size, it would require more cpu than is worth it.

Ok that's fair. But you probably meant: scan everything _under_ SpamAssassin scan size. That's only whole messages less than ~256kB to be scanned by default in most software. I guess if you get images from all over, you can't whitelist etc then.

Cheers, Henrik
Re: [Clamav-users] Complexity limit on (custom) signatures?
Henrik Krohns wrote:
> On Sat, Oct 28, 2006 at 04:28:47PM -0700, Dennis Peterson wrote:
>>> I don't get it.. unless you have some big honeypot, maybe 5% of traffic contain small images to be OCRd. If your server can't handle that, I guess it's running out of juice anyway. :) You can even easily create separate scanning queue for OCR, so it doesn't interfere with normal traffic.
>>
>> You may have missed that I'm in the image industry - a great deal of what we do is imagery including imagery with text in it, and as we have to scan all images over a particular size, it would require more cpu than is worth it.
>
> Ok that's fair. But you probably meant: scan everything _under_ SpamAssassin scan size. That's only whole messages less than ~256kB to be scanned by default in most software. I guess if you get images from all over, you can't whitelist etc then.

Lemme run it past you one more time - images are money in my world. I can't make mistakes. The right image is worth millions of dollars. Blocking such an image is something that's going on my résumé. Nobody knows where the next big image is coming from, so the rule is caution, caution, caution. It does not apply to everyone, certainly. I envy others who can bitch slap image spam vendors with little regard. That would be cool. I can't do it. I know how but don't dare. It's probably why I get pissy :)

dp
Re: [Clamav-users] Complexity limit on (custom) signatures?
On Friday October 27, 2006 at 08:42:34 (PM) Dennis Peterson wrote:
> Not to change the direction on you, but you might want to take advantage of the work Steve Basford is doing at http://www.sanesecurity.com/clamav/ for phishing problems, and also look at http://www.msrbl.com/site/stats for image and spam solutions. Both sites are providing excellent results on systems I'm running. The patterns are downloadable and very up to date. I've not had a single complaint of false positives, and the number of patterns provided is quite large. Steve has also written a very useable how-to for creating these patterns.

Steve has done a remarkable job with his 'sig' files. He is constantly updating them. I know because I use them. They are always catching 'phishing' threats on my PC. He also has two automated installers for downloading and installing his signature files. I wrote the 'script' version. There is also a Perl version available on his site.

--
Gerard

There is nothing wrong with making love with the light on. Just make sure the car door is closed.
George Burns
Re: [Clamav-users] Complexity limit on (custom) signatures?
Kris Deugau wrote:
> The stock and pill spams that I'm trying to tag, however, have images that have *very small* variations message-to-message, but over a larger sample there's really very little that can be seen as common across the whole set - or even a significant part of the set. Automating the process of finding all possible values for the byte at this position is the only way I can usefully get anywhere.

I did a binary diff and md5 checksums on hundreds of the stock and pill images and never found any two to be the same. They use a random noise generator to sprinkle the images with enough debris to prevent analysis, so even splitting the files into 128 and 512 byte slices and checking each of the slices was not helpful. Even when you convert the image to black and white to remove the color element there's still sufficient randomness to prevent go/no-go certainty.

I've explored OCR on both color and de-colorized images and there have been successes, but not enough to warrant turning it on in production. It is very cpu intensive. I attempted to see if there were any digital watermarks in these images and found nothing, although the math for doing this pushes my limits. I work in the image industry so I have to be more careful than most regarding these, so others may have better luck than I, which is another way of saying acceptable risk is site dependent. I'd be very interested in any headway you make.

FWIW, I checked my current logs and found the MSRBL sigs blocked over 6,000 images in a two week period. The Sanesecurity filters stopped an additional 4,000. There were a total of 16383 messages blocked using all ClamAV filters, and many more thousands found by various milters and RBL/SURBL scans. This is on one of the smaller servers I run. The bigger mail farms are orders of magnitude greater for all categories. I mention this only because the out of pocket cost for these successes was $0.00 USD and very little time invested.

Which reminds me, I should send some donation money to all the great folks who made these successes possible.

dp
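The slice comparison described above (cutting each file into 128- or 512-byte pieces and checksumming each piece) can be sketched as follows. The function names are hypothetical and the input is assumed to be a list of raw file contents already read into memory. It only reports digests shared by more than one file and says nothing about how the original test was actually scripted.

```python
import hashlib
from collections import Counter

def slice_digests(data, slice_size):
    """MD5 digest of each consecutive slice_size-byte slice of data."""
    return [
        hashlib.md5(data[i:i + slice_size]).hexdigest()
        for i in range(0, len(data), slice_size)
    ]

def shared_slices(blobs, slice_size):
    """Map slice digest -> number of blobs containing it, for digests seen in 2+ blobs."""
    seen = Counter()
    for blob in blobs:
        # A set, so a slice repeated inside one file is counted once per file.
        seen.update(set(slice_digests(blob, slice_size)))
    return {digest: count for digest, count in seen.items() if count > 1}
```

An empty result for both slice sizes would correspond to the "not helpful" outcome reported above. Fixed slice boundaries also mean that a single inserted byte shifts every later slice, which is one reason this kind of comparison is fragile against deliberately perturbed images.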
Re: [Clamav-users] Complexity limit on (custom) signatures?
On Sat, Oct 28, 2006 at 09:20:55AM -0700, Dennis Peterson wrote:
> I've explored OCR on both color and de-colorized images and there have been successes, but not enough to warrant turning it on in production. It is very cpu intensive.

I don't get it.. unless you have some big honeypot, maybe 5% of traffic contain small images to be OCRd. If your server can't handle that, I guess it's running out of juice anyway. :) You can even easily create separate scanning queue for OCR, so it doesn't interfere with normal traffic.

Cheers, Henrik
Re: [Clamav-users] Complexity limit on (custom) signatures?
Henrik Krohns wrote:
> I don't get it.. unless you have some big honeypot, maybe 5% of traffic contain small images to be OCRd. If your server can't handle that, I guess it's running out of juice anyway. :)

Well... yeah. <g> The basic problem is that all the other garbage (with the occasional inevitable exception) is getting caught by Clam (viruses and most phishes) or SpamAssassin (all but a few text-based spams). I've found *enough* similarities in the raw binary image data to usefully make signatures for a lot of what is otherwise getting through; at the moment this is just a stopgap until these machines can be retired. However, in the long run, OCR to feed the text to SpamAssassin's other rules is a better solution; it's much more flexible.

-kgd
Re: [Clamav-users] Complexity limit on (custom) signatures?
On Sat, 2006-10-28 at 16:54 -0400, Kris Deugau wrote:
> Henrik Krohns wrote:
>> I don't get it.. unless you have some big honeypot, maybe 5% of traffic contain small images to be OCRd. If your server can't handle that, I guess it's running out of juice anyway. :)
>
> Well... yeah. <g> The basic problem is that all the other garbage (with the occasional inevitable exception) is getting caught by Clam (viruses and most phishes) or SpamAssassin (all but a few text-based spams). I've found *enough* similarities in the raw binary image data to usefully make signatures for a lot of what is otherwise getting through; at the moment this is just a stopgap until these machines can be retired. However, in the long run, OCR to feed the text to SpamAssassin's other rules is a better solution; it's much more flexible.

Indeed. For those interested in the topic of OCR to feed SpamAssassin, there's an active project with its own mailing list that does just this. It turns out to be a non-trivial task because many of these image spams are animated gifs, so you need to find the right frame to pass to the OCR program. Start here: http://wiki.apache.org/spamassassin/FuzzyOcrPlugin then subscribe to the Devel-Spam mailing list (there's a link on that page).

-Bill
Re: [Clamav-users] Complexity limit on (custom) signatures?
Henrik Krohns wrote:
> On Sat, Oct 28, 2006 at 09:20:55AM -0700, Dennis Peterson wrote:
>> I've explored OCR on both color and de-colorized images and there have been successes, but not enough to warrant turning it on in production. It is very cpu intensive.
>
> I don't get it.. unless you have some big honeypot, maybe 5% of traffic contain small images to be OCRd. If your server can't handle that, I guess it's running out of juice anyway. :) You can even easily create separate scanning queue for OCR, so it doesn't interfere with normal traffic.

You may have missed that I'm in the image industry - a great deal of what we do is imagery including imagery with text in it, and as we have to scan all images over a particular size, it would require more cpu than is worth it. And when you consider repeating it all at a disaster recovery site it's starting to be a lot of computer power with a high false positive probability. You cannot count on the image spam being gif as png images are showing up now as are jpg, and animated gifs are also out there. OCR isn't practical for me but may be for others for a while - at least until they start to use CAPTCHA technology to get around it.

dp
Re: [Clamav-users] Complexity limit on (custom) signatures?
Bill Randle wrote:
> On Sat, 2006-10-28 at 16:54 -0400, Kris Deugau wrote:
>> Henrik Krohns wrote:
>>> I don't get it.. unless you have some big honeypot, maybe 5% of traffic contain small images to be OCRd. If your server can't handle that, I guess it's running out of juice anyway. :)
>>
>> Well... yeah. <g> The basic problem is that all the other garbage (with the occasional inevitable exception) is getting caught by Clam (viruses and most phishes) or SpamAssassin (all but a few text-based spams). I've found *enough* similarities in the raw binary image data to usefully make signatures for a lot of what is otherwise getting through; at the moment this is just a stopgap until these machines can be retired. However, in the long run, OCR to feed the text to SpamAssassin's other rules is a better solution; it's much more flexible.
>
> Indeed. For those interested in the topic of OCR to feed SpamAssassin, there's an active project with its own mailing list that does just this. It turns out to be a non-trivial task because many of these image spam are animated gifs, so you need to find the right frame to pass to the OCR program. Start here: http://wiki.apache.org/spamassassin/FuzzyOcrPlugin then subscribe to the Devel-Spam mailing list (there's a link on that page).

You might want to consider the next level of image spam before you go too far down the OCR path: http://www.iss.net/threats/Animated%20GIF.html

dp
Re: [Clamav-users] Complexity limit on (custom) signatures?
On Sat, 2006-10-28 at 16:21 -0700, Dennis Peterson wrote:
> Bill Randle wrote:
>> On Sat, 2006-10-28 at 16:54 -0400, Kris Deugau wrote:
>>> However, in the long run, OCR to feed the text to SpamAssassin's other rules is a better solution; it's much more flexible.
>>
>> Indeed. For those interested in the topic of OCR to feed SpamAssassin, there's an active project with its own mailing list that does just this. It turns out to be a non-trivial task because many of these image spam are animated gifs, so you need to find the right frame to pass to the OCR program. Start here: http://wiki.apache.org/spamassassin/FuzzyOcrPlugin then subscribe to the Devel-Spam mailing list (there's a link on that page).
>
> You might want to consider the next level of image spam before you go too far down the OCR path: http://www.iss.net/threats/Animated%20GIF.html

Actually, the FuzzyOCR plugin already handles animated gifs using various techniques to extract the hidden text. It also is able to decode png and jpeg files.

-Bill
Re: [Clamav-users] Complexity limit on (custom) signatures?
Bill Randle wrote:
> On Sat, 2006-10-28 at 16:21 -0700, Dennis Peterson wrote:
>
> Actually, the FuzzyOCR plugin already handles animated gifs using various techniques to extract the hidden text. It also is able to decode png and jpeg files.

Ah - so it does. I hadn't looked at v. 2.3. I'll have another look. Thanks, Bill.

dp
[Clamav-users] Complexity limit on (custom) signatures?
I've been attempting to lighten the load for SpamAssassin a little by creating signatures for the stock and pill spams that are flooding in these days. More specifically, I'm creating signatures for the attached images in the spams. (Upgrading SA, to be able to use OCR plugins and so on, is not really possible, mostly due to system load.) However, I'm having some odd problems with signatures that, so far as I can tell, are *legitimate*, if perhaps a bit long.

Here's what I'm doing to create signatures: I take a set of images, manually sorted for rough similarity, and run them through a script that calls sigtool --hex-dump, and picks out a segment of the data. (I started with just the first 400 characters of hex, and pushed it up to 600; with the current set I'm picking out ~600 characters starting with 2c from anywhere.) I further sort the resulting data by hand to find similar data, and then feed that through another script that splits each line up into octets and notes which octet has been seen in which position for the entire data set. It then constructs what should be a correct signature that will match each line of the input according to the rules for ClamAV signatures. (More than 5 different octets at a position get converted to ??, and finally long segments of ??... get converted to {nn}.)

However, far too often, ClamAV rejects it as a malformed signature. Chopping {nn} bits off the end often fixes that issue, but not always; in some cases I've had to trim further (aa|bb|cc) blocks, along with trailing {nn} and/or ?? segments that may get exposed at the end. That still doesn't make a good signature for my purposes; I often have to trim *further* to get a signature that actually matches on the image files I started with. Manually spreading the data out shows it *should* match fine before I've done any trimming.

From the problems I'm having with supposedly malformed signatures, it looks like there's an effective complexity limit; from the problems in *matching* a signature that's finally been found to be acceptable, it looks like there's a (lower) limit on what Clam can actually use in matching. Any suggestions on what I might be doing wrong? I can post the scripts and some example signatures if needed.

-kgd
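The scripts themselves are not posted in the thread, so the following is only a sketch of the procedure as described: record which octets occur at each position across the hex-dump lines, emit a literal, an alternation or `??` per position, then collapse runs of `??` into `{nn}`. The function name, the `min_run` threshold for what counts as a "long" run, and the sorting of alternatives are assumptions. The limit of 5 alternatives comes from the description above.

```python
def build_signature(hex_lines, max_alts=5, min_run=4):
    """Build a signature body covering every hex line (sketch of the described method)."""
    # Only positions present in every line can be compared.
    length = min(len(h) for h in hex_lines) // 2
    tokens = []
    for pos in range(length):
        seen = sorted({h[2 * pos:2 * pos + 2].lower() for h in hex_lines})
        if len(seen) == 1:
            tokens.append(seen[0])                      # same octet everywhere
        elif len(seen) <= max_alts:
            tokens.append("(" + "|".join(seen) + ")")   # a few alternatives
        else:
            tokens.append("??")                         # too many: wildcard

    # Collapse runs of ?? that reach min_run (assumed threshold) into {nn}.
    out = []
    i = 0
    while i < len(tokens):
        j = i
        while j < len(tokens) and tokens[j] == "??":
            j += 1
        run = j - i
        if run >= min_run:
            out.append("{%d}" % run)
            i = j
        elif run > 0:
            out.extend(tokens[i:j])
            i = j
        else:
            out.append(tokens[i])
            i += 1
    return "".join(out)

print(build_signature(["2c0001ff", "2c0101ff", "2c0201ff"]))
```

Nothing in this construction guarantees that the output has static bytes next to its wildcards, so a generator along these lines can easily emit patterns such as a lone byte between two `{nn}` ranges. A post-processing pass that trims or merges such fragments is a natural extension, but what it should enforce depends on the ClamAV version in use.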
Re: [Clamav-users] Complexity limit on (custom) signatures?
Kris Deugau wrote:
> From the problems I'm having with supposedly malformed signatures, it looks like there's an effective complexity limit; from the problems in *matching* a signature that's finally been found to be acceptable, it looks like there's a (lower) limit on what Clam can actually use in matching. Any suggestions on what I might be doing wrong?

Not to change the direction on you, but you might want to take advantage of the work Steve Basford is doing at http://www.sanesecurity.com/clamav/ for phishing problems, and also look at http://www.msrbl.com/site/stats for image and spam solutions. Both sites are providing excellent results on systems I'm running. The patterns are downloadable and very up to date. I've not had a single complaint of false positives, and the number of patterns provided is quite large. Steve has also written a very useable how-to for creating these patterns.

dp