Re: [sniffer] mini-obfuscation
Pete,

Doesn't Sniffer have a certain level of support for regexes? I know we have had good luck in Declude with regexes like this one, which catches obfuscated forms of viagra. We found it easier to use regexes than to list all of the different variations.

(?:\b|\s)[_\W]{0,3}(?:\\\/|V)[_\W]{0,3}[ij1!|l\xEC\xED\xEE\xEF][_\W]{0,3}[a4 [EMAIL PROTECTED],3}[xyz]?[gj][_\W]{0,3}r[_\W]{0,[EMAIL PROTECTED], 3}x?[_\W]{0,3}(?:\b|\s)

Darrell

Check out http://www.invariantsystems.com for utilities for Declude and IMail: IMail/Declude Overflow Queue Monitoring, SURBL/URI integration, MRTG Integration, and Log Parsers.

Pete McNeil writes:

On Tuesday, March 22, 2005, 8:31:07 PM, Andrew wrote:

<snip/>

CA> How many times have we all been frustrated that a piece of spam ended
CA> up in *OUR* mailbox that was so close in content to spam we whacked
CA> yesterday?

CA> I thought the top n obfuscations might be interesting to look at, and
CA> perhaps a shortcut (temporary, albeit) for spam catching. I thought we
CA> might see whether, for example, broken URLs, fake comments, or high-bit
CA> ASCII character substitutions were the obfuscation technique du jour.

Here you hit it IMHO. The reality appears to be, from my experience, that small domains of obfuscation patterns rise and fall like swells on the ocean. That is, stability tends to arise in one domain of message characteristics and then fall to rise in another domain. Sometimes the domain is well understood and sometimes it is entirely new and forces us to think differently about what a feature really is. By domain I mean things like message structure, word obfuscation techniques, phrase-based swapping, HTML exploitation, etc.

The "du jour" part of your statement is a key element of the problem. Defining and re-defining the conceptual framework that describes feature domains in the spam is the other key element. Put more simply - knowing what to look for is a basic element, but it gets you nowhere on its own.
Knowing (recognizing) when to look for the "what" is the key that makes the problem workable.

CA> A while back curiosity got the better of me (it was raining, and
CA> I had a few days off) and I did a few grep sweeps on a warm spam
CA> corpus.

CA> I was disappointed in my success rate for:

CA> v.?i.?a.?g.?r.?a.?

CA> and similar queries with deliberate substitutions (e.g. using a 1
CA> for i). I started writing a grep-generating-permutation engine and
CA> decided my time was better spent on scritching my cat under his chin.

That is a nifty direction that I wish I had more time for. Perhaps I will some day soon when Sniffer gets slashdotted and sales go through the roof!

--- meantime, back on this planet, I suggested a very similar thing to Paul Graham at the first spam conference at MIT. As I recall he said it was "ambitious" - a description that I have learned has a special meaning in scientific circles. Something having to do with avian swine and snowballs that have successful careers as tour guides in hell. One of these days I think I might do it anyway, just to prove the point, but in the meantime I too prefer to spend more time with my cat. ;-)

Don't get me wrong - I strongly believe it can be done this way, but it requires much more than good technology. It runs right into one of the biggest problems with AI and, perhaps more importantly, people's expectations of AI. No matter how good the pattern learning system might be, it will always lack the human experience. Computers don't date or gain weight - so they have a hard time understanding what much of the spam is about simply by looking at the patterns. That's why the Message Sniffer process is designed with people tightly integrated into the system.

_M

This E-Mail came from the Message Sniffer mailing list. For information and (un)subscription instructions go to http://www.sortmonster.com/MessageSniffer/Help/Help.html
Re[2]: [sniffer] mini-obfuscation
On Wednesday, March 23, 2005, 6:04:10 PM, Darrell wrote:

Dsic> Pete,

Dsic> Doesn't Sniffer have a certain level of support for regexes? I know we have
Dsic> had good luck in Declude with regexes like this one, which catches obfuscated
Dsic> forms of viagra. We found it easier to use regexes than to list all of
Dsic> the different variations.

Dsic> (?:\b|\s)[_\W]{0,3}(?:\\\/|V)[_\W]{0,3}[ij1!|l\xEC\xED\xEE\xEF][_\W]{0,3}[a4
Dsic> [EMAIL PROTECTED],3}[xyz]?[gj][_\W]{0,3}r[_\W]{0,[EMAIL PROTECTED],
Dsic> 3}x?[_\W]{0,3}(?:\b|\s)

The compiler and scanner we use have a limited regex capability. Some of the features you've used here were kept out of the engine on purpose. Later versions of the engine (under development) will have some more of these features - eventually including all of the features found in most regex systems, and then moving beyond them.

Slick regex patterns like the one you have here are often useful for describing patterns, but not always as useful for rapidly developing and modifying dynamic pattern matching schemes.

For example - the regex you have stated here will match a wide range of permutations in a single statement. That is, after all, a strength of regex. However, in practice it is often found that most of the possible patterns simply are never seen in the wild or that some specific variations might be problematic... In these cases it is better to use a small catalog of simpler patterns because they can be implemented and understood incrementally, and they can be very easily excluded on a one-by-one basis if needed.

Adding that kind of flexibility to the regex you have here could make it even more difficult to understand and correctly encode --- since we have a very small staff creating and modifying hundreds of rules per day, seconds count.
I have to admit that it would take me a few minutes to completely understand what the above regex really does - and chances are that if I modified it I would be much more likely to introduce an error than I would using our more simplified coding scheme.

That's not to say that we won't be introducing more complex pattern matching capabilities - we certainly will. However, the syntax for these rules will be less concerned with an economy of keystrokes and more concerned with reliable, rapid generation and modification.

For example, the coding system we have planned will be able to break down the pattern you've represented into a number of functional units that can be mixed and re-used in a hierarchical structure. This will allow both the robots and the humans to understand and manipulate the patterns very easily.

Regex (as written) is a good way to represent some patterns efficiently - but it has the downside that the syntax can be arbitrarily difficult and that it does not naturally represent the conceptual structures that might be found in the patterns to be matched and readily reused.

Best,
_M
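A rough illustration of the "small catalog of simpler patterns" idea from the message above, written in Python. This is only a sketch: the thread does not show Message Sniffer's actual rule syntax, so the `Rule` class, the rule IDs, the example patterns, and the `matching_rules` and `disable` helpers are all hypothetical names chosen for illustration.

```python
import re
from dataclasses import dataclass


@dataclass
class Rule:
    """One simple pattern with its own ID and on/off switch (hypothetical)."""
    rule_id: str
    pattern: str
    enabled: bool = True


# Illustrative catalog: each obfuscated form is listed as its own small rule
# instead of being folded into one dense regex.
CATALOG = [
    Rule("viagra-plain", r"viagra"),
    Rule("viagra-digit-i", r"v1agra"),
    Rule("viagra-dotted", r"v\.i\.a\.g\.r\.a"),
    Rule("viagra-spaced", r"v i a g r a"),
]


def matching_rules(text, catalog=CATALOG):
    """Return the IDs of all enabled rules whose pattern occurs in text."""
    return [
        rule.rule_id
        for rule in catalog
        if rule.enabled and re.search(rule.pattern, text, re.IGNORECASE)
    ]


def disable(rule_id, catalog=CATALOG):
    """Switch off a single rule without touching any of the others."""
    for rule in catalog:
        if rule.rule_id == rule_id:
            rule.enabled = False


print(matching_rules("Cheap V1AGRA here"))
disable("viagra-digit-i")
print(matching_rules("Cheap V1AGRA here"))
```

Because each variant is its own entry, a pattern that turns out to be problematic can be excluded by ID while the rest keep working, which is the trade-off described above against a single dense regex.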
[sniffer] mini-obfuscation
Wow, Pete! Wow.

I didn't feel I could measure up to adding on to that thread, so I started over.

Although the search space is theoretically huge (you pointed out the marketecture of large numbers), in practice, the spammers mostly use the grains quite close to the marble and use the grains over again for a while.

How many times have we all been frustrated that a piece of spam ended up in *OUR* mailbox that was so close in content to spam we whacked yesterday?

I thought the top n obfuscations might be interesting to look at, and perhaps a shortcut (temporary, albeit) for spam catching. I thought we might see whether, for example, broken URLs, fake comments, or high-bit ASCII character substitutions were the obfuscation technique du jour.

A while back curiosity got the better of me (it was raining, and I had a few days off) and I did a few grep sweeps on a warm spam corpus.

I was disappointed in my success rate for:

v.?i.?a.?g.?r.?a.?

and similar queries with deliberate substitutions (e.g. using a 1 for i). I started writing a grep-generating-permutation engine and decided my time was better spent on scritching my cat under his chin.

Of course, I have a lot more time for my cat since I implemented Sniffer.

Andrew 8)

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Pete McNeil
Sent: Tuesday, March 22, 2005 4:37 PM
To: Colbeck, Andrew
Subject: Re: [sniffer] Money, drugs, and sex

On Tuesday, March 22, 2005, 4:47:30 PM, Andrew wrote:

CA> http://www.sophos.com/spaminfo/articles/spamwords.html

CA> Interesting, but a pity they didn't publish a list of, say, their
CA> 1,000 most popular obfuscations.

If you do the math then 1000 wouldn't even scratch it.

One way to attack this (at least one of the ways we do it in Message Sniffer) is to apply some obfuscation algorithms to each word in the list using some generic expansion patterns -- this helps to simplify the problem a bit.
For example, one obfuscation algorithm is to insert a single extra character in the word. If you take the word "obfuscation" and apply this expansion algorithm you get something like:

o~bfuscation
ob~fuscation
obf~uscation
...
obfuscatio~n

where ~ represents any random character.

Then think about adding two characters...

...
ob~fusc~ation
...

Then think about breaking the word with an empty anchor at any of the places where you would insert a character...

...
obfus<a href=http://yo-mama.it;></a>cation
...

and so on...

Of course, you can't simply apply all of the possible obfuscation algorithms, and you can't completely exercise each one that you do try... you have to pick and choose and learn as you go because otherwise you would simply never finish the job.

*** If you iterate through all of the permutations and count them then the numbers become astronomical... as in "viagra can be obfuscated (and detected by their fine software) more than 5,600,000,000 different ways" (ahem). That's market speak for "look how powerful our software is" -whoooah!

This is similar to a lot of other AI problems too and it's probably why I'm involved since I love AI work. In most AI problems if you add up all of the possible solutions to the problem you usually come up with a number you couldn't possibly write down without writing the formula instead. That is, the number would be so large that you would probably die of old age before you actually finished writing all the digits.

In the AI world we talk about this huge sea of possibilities as a "solution space". If you tried to check every possible solution one by one until you found the best answer it would take you forever. This is called a "brute force" attack. It's also what makes the big numbers seem impressive, and what makes most encryption schemes work.###

Since we don't usually have forever, we do something else in the AI world. We use algorithms to search the solution space for the best answer.
That is, rather than just going through the possible solutions one at a time as we come to them (brute force) we try to figure out which ones to look at and which ones to skip. The way we make that decision is to use an algorithm that leverages special rules of thumb (heuristics) to help us search the solution space more efficiently. This effectively reduces the solution space and makes it possible to come up with an answer that is "good enough"+++ within the time we have.

So, when they talk about recognizing more than 5 billion different obfuscated forms of the word viagra they are really just estimating how many of the permutations their heuristics are able to eliminate from the solution space. (A more accurate way to think about it might be that a single heuristic for a particular obfuscated word covers a large amount of the solution space all at once. Since it's already been covered it doesn't have to be searched -- the extra work is eliminated as compared to a brute-force attack.)

For example: Suppose you have a sandbox into which someone has
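The character-insertion expansion described in this message can be sketched in Python. This is a minimal illustration, not Message Sniffer's implementation: the function name `insert_wildcards` and the choice to insert only at interior gaps (matching the `o~bfuscation` ... `obfuscatio~n` examples) are assumptions.

```python
from itertools import combinations

WILDCARD = "~"  # stands for "any single character", as in the examples above


def insert_wildcards(word: str, count: int = 1) -> list[str]:
    """Return every variant of word with count wildcards inserted
    at distinct interior gaps (positions between two adjacent letters)."""
    gaps = range(1, len(word))  # gap i sits between word[i-1] and word[i]
    variants = []
    for chosen in combinations(gaps, count):
        pieces = []
        previous = 0
        for gap in chosen:
            pieces.append(word[previous:gap])
            previous = gap
        pieces.append(word[previous:])
        variants.append(WILDCARD.join(pieces))
    return variants


single = insert_wildcards("obfuscation", 1)
double = insert_wildcards("obfuscation", 2)
print(len(single), single[0], single[-1])
print(len(double), "ob~fusc~ation" in double)
```

An 11-letter word has 10 interior gaps, so there are 10 single-insert variants and 45 two-insert variants. Each `~` can stand for any character, so the number of concrete strings covered multiplies quickly, which is how the counts become astronomical once several expansion algorithms are combined.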
Re: [sniffer] mini-obfuscation
On Tuesday, March 22, 2005, 8:31:07 PM, Andrew wrote:

<snip/>

CA> How many times have we all been frustrated that a piece of spam ended
CA> up in *OUR* mailbox that was so close in content to spam we whacked
CA> yesterday?

CA> I thought the top n obfuscations might be interesting to look at, and
CA> perhaps a shortcut (temporary, albeit) for spam catching. I thought we
CA> might see whether, for example, broken URLs, fake comments, or high-bit
CA> ASCII character substitutions were the obfuscation technique du jour.

Here you hit it IMHO. The reality appears to be, from my experience, that small domains of obfuscation patterns rise and fall like swells on the ocean. That is, stability tends to arise in one domain of message characteristics and then fall to rise in another domain. Sometimes the domain is well understood and sometimes it is entirely new and forces us to think differently about what a feature really is. By domain I mean things like message structure, word obfuscation techniques, phrase-based swapping, HTML exploitation, etc.

The "du jour" part of your statement is a key element of the problem. Defining and re-defining the conceptual framework that describes feature domains in the spam is the other key element. Put more simply - knowing what to look for is a basic element, but it gets you nowhere on its own. Knowing (recognizing) when to look for the "what" is the key that makes the problem workable.

CA> A while back curiosity got the better of me (it was raining, and
CA> I had a few days off) and I did a few grep sweeps on a warm spam
CA> corpus.

CA> I was disappointed in my success rate for:

CA> v.?i.?a.?g.?r.?a.?

CA> and similar queries with deliberate substitutions (e.g. using a 1
CA> for i). I started writing a grep-generating-permutation engine and
CA> decided my time was better spent on scritching my cat under his chin.

That is a nifty direction that I wish I had more time for.
Perhaps I will some day soon when Sniffer gets slashdotted and sales go through the roof!

--- meantime, back on this planet, I suggested a very similar thing to Paul Graham at the first spam conference at MIT. As I recall he said it was "ambitious" - a description that I have learned has a special meaning in scientific circles. Something having to do with avian swine and snowballs that have successful careers as tour guides in hell. One of these days I think I might do it anyway, just to prove the point, but in the meantime I too prefer to spend more time with my cat. ;-)

Don't get me wrong - I strongly believe it can be done this way, but it requires much more than good technology. It runs right into one of the biggest problems with AI and, perhaps more importantly, people's expectations of AI. No matter how good the pattern learning system might be, it will always lack the human experience. Computers don't date or gain weight - so they have a hard time understanding what much of the spam is about simply by looking at the patterns. That's why the Message Sniffer process is designed with people tightly integrated into the system.

_M
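The grep-generating idea quoted above could start from something like the following Python sketch. It is an assumption-laden illustration, not anyone's production rule: the `LOOKALIKES` table and the `build_pattern` function are hypothetical, the substitutions listed are only examples, and the separator class `[_\W]{0,3}` is borrowed from the regex Darrell posted in this thread rather than from the `.?` used in the quoted grep pattern.

```python
import re

# Hypothetical look-alike table: each letter maps to characters that might be
# substituted for it. These entries are examples, not a vetted list.
LOOKALIKES = {
    "a": "a@4",
    "i": "i1!|l",
}


def build_pattern(word: str, max_gap: int = 3) -> re.Pattern:
    """Build a case-insensitive regex that tolerates look-alike characters
    and up to max_gap non-word characters between letters."""
    classes = []
    for letter in word.lower():
        choices = LOOKALIKES.get(letter, letter)
        # One character class per letter; escape so '|' etc. are literal.
        classes.append("[" + "".join(re.escape(c) for c in choices) + "]")
    gap = r"[_\W]{0,%d}" % max_gap
    return re.compile(gap.join(classes), re.IGNORECASE)


pattern = build_pattern("viagra")
for sample in ["viagra", "v1agra", "V.I.A.G.R.A", "v i @ g r a", "vinegar"]:
    print(sample, bool(pattern.search(sample)))
```

Generating the pattern from a word plus a table keeps the substitutions in one place, so trying a new look-alike means editing the table rather than hand-editing each regex. Restricting the gap to non-word characters is narrower than `.?` and is a design choice of this sketch, made to reduce accidental matches on ordinary words.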