Re: [sniffer] mini-obfuscation

2005-03-23 Thread Darrell (supp...@invariantsystems.com)
Pete, 

Doesn't Sniffer have a certain level of support for regex's?  I know we have 
had good luck with regex's like this which catch obfuscation techniques with 
viagra with Declude.  We found it easier to use regex's than to list all of 
the different variations. 

(?:\b|\s)[_\W]{0,3}(?:\\\/|V)[_\W]{0,3}[ij1!|l\xEC\xED\xEE\xEF][_\W]{0,3}[a4 
[EMAIL PROTECTED],3}[xyz]?[gj][_\W]{0,3}r[_\W]{0,[EMAIL PROTECTED], 
3}x?[_\W]{0,3}(?:\b|\s) 

Darrell

Check out http://www.invariantsystems.com for utilities for Declude And 
Imail.  IMail/Declude Overflow Queue Monitoring, SURBL/URI integration, MRTG 
Integration, and Log Parsers. 

Pete McNeil writes: 

On Tuesday, March 22, 2005, 8:31:07 PM, Andrew wrote: 

<snip/>

CA How many times have we all been frustrated by a piece of spam ending
CA up in *OUR* mailbox that was so close in content to spam we whacked
CA yesterday? 

CA I thought the top n obfuscations might be interesting to look at, and
CA perhaps a shortcut  (temporary, albeit) for spam catching.  I thought we
CA might see whether, for example, broken URLs, fake comments, or high-bit
CA ASCII character substitutions were the obfuscation technique du jour. 

Here you hit it IMHO. The reality appears to be, from my experience,
that small domains of obfuscation patterns rise and fall like swells
on the ocean. That is, stability tends to arise in one domain of
message characteristics and then fall to rise in another domain.
Sometimes the domain is well understood and sometimes it is entirely
new and forces us to think differently about what a feature really
is. 

By domain I mean things like message structure, word obfuscation
techniques, phrase based swapping, html exploitation, etc... 

The "du jour" part of your statement is a key element to the problem.
Defining and re-defining the conceptual framework that describes
feature domains in the spam is the other key element. 

Put more simply - knowing what to look for is a basic element, but it
gets you nowhere on its own. Knowing (recognizing) when to look for
the "what" is the key that makes the problem workable. 

CA A while back curiosity got the better of me (it was raining, and
CA I had a few days off) and I did a few grep sweeps on a warm spam
CA corpus. 

CA I was disappointed in my success rate for: 

CA v.?i.?a.?g.?r.?a.? 

CA and similar queries with deliberate substitutions (e.g. using a 1
CA for i).  I started writing a grep-generating-permutation engine and
CA decided my time was better spent on scritching my cat under his chin. 

That is a nifty direction that I wish I had more time for. Perhaps I
will some day soon when Sniffer gets slashdotted and sales go through
the roof! 

--- meantime, back on this planet, I suggested a very similar thing to
Paul Graham at the first spam conference at MIT. As I recall he said
it was "ambitious" - a description that I have learned has a special
meaning in scientific circles. Something having to do with avian swine
and snowballs that have successful careers as tour guides in hell. 

One of these days I think I might do it anyway, just to prove the
point, but in the meantime I too prefer to spend more time with my
cat. ;-) 

Don't get me wrong - I strongly believe it can be done this way, but
it requires much more than good technology. It runs right into one of
the biggest problems with AI and, perhaps more importantly, people's
expectations of AI. No matter how good the pattern learning system
might be it will always lack the human experience. Computers don't
date or gain weight - so they have a hard time understanding what much
of the spam is about simply by looking at the patterns. That's why
the Message Sniffer process is designed with people tightly integrated
into the system. 

_M 

 

This E-Mail came from the Message Sniffer mailing list. For information and (un)subscription instructions go to http://www.sortmonster.com/MessageSniffer/Help/Help.html



Re[2]: [sniffer] mini-obfuscation

2005-03-23 Thread Pete McNeil
On Wednesday, March 23, 2005, 6:04:10 PM, Darrell wrote:

Dsic Pete, 

Dsic Doesn't Sniffer have a certain level of support for regex's?  I know we have
Dsic had good luck with regex's like this which catch obfuscation techniques with
Dsic viagra with Declude.  We found it easier to use regex's than to list all of
Dsic the different variations. 

Dsic (?:\b|\s)[_\W]{0,3}(?:\\\/|V)[_\W]{0,3}[ij1!|l\xEC\xED\xEE\xEF][_\W]{0,3}[a4
Dsic [EMAIL PROTECTED],3}[xyz]?[gj][_\W]{0,3}r[_\W]{0,[EMAIL PROTECTED],
Dsic 3}x?[_\W]{0,3}(?:\b|\s) 

The compiler and scanner we use has a limited regex capability. Some
of the features you've used here were kept out of the engine on
purpose. Later versions of the engine (under development) will have
some more of these features - eventually including all of the features
found on most regex systems, and then moving beyond them.

Slick regex patterns like the one you have here are often useful for
describing patterns, but not always as useful for rapidly developing
and modifying dynamic pattern matching schemes.

For example - the regex you have stated here will match a wide range
of permutations in a single statement. That is, after all, a strength
of regex. However in practice it is often found that most of the
possible patterns simply are never seen in the wild or that some
specific variations might be problematic... In these cases it is
better to use a small catalog of simpler patterns because they can be
implemented and understood incrementally, and they can be very easily
excluded on a one-by-one basis if needed. Adding that kind of
flexibility to the regex you have here could make it even more
difficult to understand and correctly encode --- since we have a very
small staff creating and modifying hundreds of rules per day, seconds
count. I have to admit that it would take me a few minutes to
completely understand what the above regex really does - and chances
are that if I modified it I would be much more likely to introduce an
error than I would using our more simplified coding scheme.
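
A minimal sketch of the "small catalog of simpler patterns" idea, in Python. This is not Message Sniffer's rule format; the rule names, the patterns, and the `scan` function are all assumptions made up for illustration. The point is only that each rule is trivial to read and can be switched off individually.

```python
import re

# Hypothetical catalog: one short, readable pattern per observed variant.
CATALOG = {
    "plain": r"viagra",
    "digit-i": r"v1agra",
    "dotted": r"v\.i\.a\.g\.r\.a",
    "spaced": r"v i a g r a",
}

def scan(text, disabled=()):
    """Return the names of the enabled rules that match the text."""
    hits = []
    for name, pattern in CATALOG.items():
        if name in disabled:
            continue  # rule excluded on a one-by-one basis
        if re.search(pattern, text, re.IGNORECASE):
            hits.append(name)
    return hits

hits_all = scan("cheap V1AGRA today")
hits_without = scan("cheap V1AGRA today", disabled={"digit-i"})
```

If the "digit-i" rule turned out to cause false positives, dropping it touches nothing else in the catalog, which is much harder to arrange inside one large regex.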

That's not to say that we won't be introducing more complex pattern
matching capabilities - we certainly will. However, the syntax for
these rules will be less concerned with an economy of keystrokes and
more concerned with reliable, rapid generation and modification.

For example, the coding system we have planned will be able to break
down the pattern you've represented into a number of functional units
that can be mixed and re-used in a hierarchical structure. This will
allow both the robots and the humans to understand and manipulate the
patterns very easily.
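
One way to picture "functional units that can be mixed and re-used in a hierarchical structure" is sketched below in Python. It is only an illustration built on ordinary regex fragments, not the planned Sniffer coding system. The names `GAP`, `LETTERS`, `word_unit` and `any_of` are hypothetical, the character sets are assumptions (the "a" set in particular, since that part of the quoted regex is garbled), and "pharmacy" is just a second word chosen to show reuse.

```python
import re

# Lowest level: reusable fragments.
GAP = r"[\W_]{0,3}"         # up to three filler characters between letters
LETTERS = {
    "v": r"(?:\\/|v)",      # a literal "\/" drawn to look like V, or v itself
    "i": r"[ij1!|l]",
    "a": r"[a4@]",
    "g": r"[gj]",
    "r": r"r",
}

def word_unit(word):
    """Middle level: a word built from letter units joined by GAP."""
    return GAP.join(LETTERS.get(ch, re.escape(ch)) for ch in word)

def any_of(*units):
    """Top level: a unit that matches any one of the given units."""
    return "(?:" + "|".join(units) + ")"

rule = re.compile(any_of(word_unit("viagra"), word_unit("pharmacy")),
                  re.IGNORECASE)
```

Changing the "a" set in `LETTERS` changes every word that uses it, and each level can be read without decoding the whole expression at once.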

Regex (as written) is a good way to represent some patterns
efficiently - but it has the down side that the syntax can be
arbitrarily difficult and that it does not naturally represent the
conceptual structures that might be found in the patterns to be
matched and readily reused.

Best,

_M


[sniffer] mini-obfuscation

2005-03-22 Thread Colbeck, Andrew
Wow, Pete!  Wow.

I didn't feel I could measure up to adding on to that thread, so I
started over.

Although the search space is theoretically huge (you pointed out the
marketecture of large numbers), in practice, the spammers mostly use the
grains quite close to the marble and use the grains over again for a
while.

How many times have we all been frustrated by a piece of spam ending
up in *OUR* mailbox that was so close in content to spam we whacked
yesterday?

I thought the top n obfuscations might be interesting to look at, and
perhaps a shortcut  (temporary, albeit) for spam catching.  I thought we
might see whether, for example, broken URLs, fake comments, or high-bit
ASCII character substitutions were the obfuscation technique du jour.

A while back curiosity got the better of me (it was raining, and I had
a few days off) and I did a few grep sweeps on a warm spam corpus.

I was disappointed in my success rate for:

v.?i.?a.?g.?r.?a.?

and similar queries with deliberate substitutions (e.g. using a 1
for i).  I started writing a grep-generating-permutation engine and
decided my time was better spent on scritching my cat under his chin.
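
For the curious, a pattern-generating helper along those lines might look like the Python sketch below. It is not the engine described above (that one was never finished); `SUBSTITUTIONS` and `build_pattern` are hypothetical names, and the substitution sets are guesses rather than anything measured on a corpus.

```python
import re

# Hypothetical table: characters that might stand in for each letter.
SUBSTITUTIONS = {
    "v": "v",
    "i": "i1!|l",
    "a": "a4@",
    "g": "g9",
    "r": "r",
}

def build_pattern(word, max_filler=1):
    """Build a regex that allows per-letter substitutions and up to
    max_filler non-alphanumeric filler characters between letters."""
    filler = r"[\W_]{0,%d}" % max_filler
    parts = []
    for ch in word.lower():
        choices = SUBSTITUTIONS.get(ch, ch)
        parts.append("[" + re.escape(choices) + "]")
    return re.compile(filler.join(parts), re.IGNORECASE)

pattern = build_pattern("viagra")
naive = re.compile(r"v.?i.?a.?g.?r.?a.?")
```

The naive query only tolerates extra characters between the literal letters, so it misses "v1agra"; the generated pattern accepts the substituted letter as well as the fillers.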

Of course, I have a lot more time for my cat since I implemented
Sniffer.

Andrew 8)

-----Original Message-----
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of Pete McNeil
Sent: Tuesday, March 22, 2005 4:37 PM
To: Colbeck, Andrew
Subject: Re: [sniffer] Money, drugs, and sex


On Tuesday, March 22, 2005, 4:47:30 PM, Andrew wrote:

CA http://www.sophos.com/spaminfo/articles/spamwords.html

CA Interesting, but a pity they didn't publish a list of, say, their 
CA 1,000 most popular obfuscations.

If you do the math then 1000 wouldn't even scratch it. One way to attack
this ( at least one of the ways we do it in Message Sniffer ) is to
apply some obfuscation algorithms to each word in the list using some
generic expansion patterns -- this helps to simplify the problem a bit.

For example, one obfuscation algorithm is to insert a single extra
character in the word. If you take the word obfuscation and apply this
expansion algorithm you get something like:

o~bfuscation
ob~fuscation
obf~uscation
...
obfuscatio~n

where ~ represents any random character.

Then think about adding two characters...

...
ob~fusc~ation
...
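
The one- and two-character expansions above can be sketched in a few lines of Python. This is an illustration only, not the Message Sniffer implementation; the function name `expand` is made up, and it assumes at most one inserted character per gap between letters.

```python
from itertools import combinations

def expand(word, extra=1, wildcard="~"):
    """Return every variant of word with `extra` wildcards inserted,
    at most one per interior gap between letters."""
    gaps = range(1, len(word))
    variants = []
    for chosen in combinations(gaps, extra):
        pieces, start = [], 0
        for pos in chosen:
            pieces.append(word[start:pos])
            start = pos
        pieces.append(word[start:])
        variants.append(wildcard.join(pieces))
    return variants

one = expand("obfuscation", 1)   # o~bfuscation ... obfuscatio~n
two = expand("obfuscation", 2)   # includes ob~fusc~ation
```

An eleven-letter word has ten gaps, so one insertion gives 10 variants and two insertions give 45, before "~" is expanded into every character it could stand for.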

Then think about breaking the word with an empty anchor at any of the
places where you would insert a character...

...
obfus<a href="http://yo-mama.it"></a>cation
...

and so on...

Of course, you can't simply apply all of the possible obfuscation
algorithms, and you can't completely exercise each one that you do
try... you have to pick and choose and learn as you go because otherwise
you would simply never finish the job. ***

If you iterate through all of the permutations and count them then the
numbers become astronomical... as in "viagra can be obfuscated (and
detected by their fine software) more than 5,600,000,000 different ways"
<ahem>. That's market speak for "look how powerful our software is
-whoooah!"
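
To see how quickly the count grows, here is a back-of-the-envelope calculation in Python. The figures are assumptions picked for illustration and do not reproduce how the 5,600,000,000 number was derived: five possible characters for each of the six letters, and either nothing or one of 32 filler characters in each of the five gaps.

```python
from math import prod

def count_variants(choices_per_letter, fillers_per_gap):
    """Count the spellings when each letter has a number of possible
    characters and each gap holds nothing or one filler character."""
    letters = prod(choices_per_letter)
    gaps = (1 + fillers_per_gap) ** (len(choices_per_letter) - 1)
    return letters * gaps

total = count_variants([5] * 6, 32)
```

Even with these modest assumptions the total runs well past the billions, which is why nobody enumerates the variants one at a time.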

This is similar to a lot of other AI problems too and it's probably why
I'm involved since I love AI work. In most AI problems if you add up all
of the possible solutions to the problem you usually come up with a
number you couldn't possibly write down without writing the formula
instead. That is, the number would be so large that you would probably
die of old age before you actually finished writing all the digits. In
the AI world we talk about this huge sea of possibilities as a solution
space.

If you tried to check every possible solution one by one until you found
the best answer it would take you forever. This is called a brute force
attack. It's also what makes the big numbers seem impressive, and what
makes most encryption schemes work.###

Since we don't usually have forever, we do something else in the AI
world. We use algorithms to search the solution space for the best
answer. That is, rather than just going through the possible solutions
one at a time as we come to them (brute force) we try to figure out
which ones to look at and which ones to skip. The way we make that
decision is to use an algorithm that leverages special rules of thumb
(heuristics) to help us search the solution space more efficiently. This
effectively reduces the solution space and makes it possible to come
up with an answer that is good enough+++ within the time we have.

So, when they talk about recognizing more than 5 billion different
obfuscated forms of the word viagra they are really just estimating how
many of the permutations their heuristics are able to eliminate from the
solution space. (A more accurate way to think about it might be that a
single heuristic for a particular obfuscated word covers a large amount
of the solution space all at once. Since it's already been covered it
doesn't have to be searched -- the extra work is eliminated as compared
to a brute-force attack.)
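
A small Python sketch of that idea, under made-up assumptions: `FILLERS` is an arbitrary set of four filler characters, and `brute_force` lists every spelling of the word with zero or one filler in each gap. A single rule then covers the entire list without ever enumerating it.

```python
import re
from itertools import product

WORD = "viagra"
FILLERS = ".-_*"  # assumed filler characters, for illustration only

def brute_force(word, fillers):
    """Enumerate every variant with zero or one filler in each gap."""
    options = [""] + list(fillers)
    variants = []
    for combo in product(options, repeat=len(word) - 1):
        out = word[0]
        for filler, letter in zip(combo, word[1:]):
            out += filler + letter
        variants.append(out)
    return variants

variants = brute_force(WORD, FILLERS)

# One rule: an optional filler between each pair of letters.
sep = "[" + re.escape(FILLERS) + "]?"
rule = re.compile(sep.join(WORD))

covered = all(rule.fullmatch(v) for v in variants)
```

The brute-force list has 5^5 = 3,125 entries; the rule is one short line and matches all of them.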

For example: Suppose you have a sandbox into which someone has 

Re: [sniffer] mini-obfuscation

2005-03-22 Thread Pete McNeil
On Tuesday, March 22, 2005, 8:31:07 PM, Andrew wrote:

<snip/>

CA How many times have we all been frustrated by a piece of spam ending
CA up in *OUR* mailbox that was so close in content to spam we whacked
CA yesterday?

CA I thought the top n obfuscations might be interesting to look at, and
CA perhaps a shortcut  (temporary, albeit) for spam catching.  I thought we
CA might see whether, for example, broken URLs, fake comments, or high-bit
CA ASCII character substitutions were the obfuscation technique du jour.

Here you hit it IMHO. The reality appears to be, from my experience,
that small domains of obfuscation patterns rise and fall like swells
on the ocean. That is, stability tends to arise in one domain of
message characteristics and then fall to rise in another domain.
Sometimes the domain is well understood and sometimes it is entirely
new and forces us to think differently about what a feature really
is.

By domain I mean things like message structure, word obfuscation
techniques, phrase based swapping, html exploitation, etc...

The "du jour" part of your statement is a key element to the problem.
Defining and re-defining the conceptual framework that describes
feature domains in the spam is the other key element.

Put more simply - knowing what to look for is a basic element, but it
gets you nowhere on its own. Knowing (recognizing) when to look for
the "what" is the key that makes the problem workable.

CA A while back curiosity got the better of me (it was raining, and
CA I had a few days off) and I did a few grep sweeps on a warm spam
CA corpus.

CA I was disappointed in my success rate for:

CA v.?i.?a.?g.?r.?a.?

CA and similar queries with deliberate substitutions (e.g. using a 1
CA for i).  I started writing a grep-generating-permutation engine and
CA decided my time was better spent on scritching my cat under his chin.

That is a nifty direction that I wish I had more time for. Perhaps I
will some day soon when Sniffer gets slashdotted and sales go through
the roof!

--- meantime, back on this planet, I suggested a very similar thing to
Paul Graham at the first spam conference at MIT. As I recall he said
it was "ambitious" - a description that I have learned has a special
meaning in scientific circles. Something having to do with avian swine
and snowballs that have successful careers as tour guides in hell.

One of these days I think I might do it anyway, just to prove the
point, but in the meantime I too prefer to spend more time with my
cat. ;-)

Don't get me wrong - I strongly believe it can be done this way, but
it requires much more than good technology. It runs right into one of
the biggest problems with AI and, perhaps more importantly, people's
expectations of AI. No matter how good the pattern learning system
might be it will always lack the human experience. Computers don't
date or gain weight - so they have a hard time understanding what much
of the spam is about simply by looking at the patterns. That's why
the Message Sniffer process is designed with people tightly integrated
into the system.

_M