Wow, Pete!  Wow.

I didn't feel I could measure up to adding on to that thread, so I
started over.

Although the search space is theoretically huge (you pointed out the
marketecture of large numbers), in practice, the spammers mostly use the
grains quite close to the marble and use the grains over again for a
while.

How many times have we all been frustrated that a piece of spam ending
up in *OUR* mailbox that was soooo close in content to spam we whacked
yesterday?

I thought the top "n" obfuscations might be interesting to look at, and
perhaps a shortcut  (temporary, albeit) for spam catching.  I thought we
might see whether, for example, broken URLs, fake comments, or high-bit
ASCII character substitutions were the obfuscation technique du jour.

I while back curiousity got the better of me (it was raining, and I had
a few days off) and I did a few grep sweeps on a warm spam corpus.

I was disappointed in my success rate for:

v.?i.?a.?g.?r.?a.?

and similar queries with deliberately substitutions (e.g. using a "1"
for "i").  I started writing a grep-generating-permutation engine and
decided my time was better spent on scritching my cat under his chin.

Of course, I have a lot more time for my cat since I implemented
Sniffer.

Andrew 8)

-----Original Message-----
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of Pete McNeil
Sent: Tuesday, March 22, 2005 4:37 PM
To: Colbeck, Andrew
Subject: Re: [sniffer] Money, drugs, and sex


On Tuesday, March 22, 2005, 4:47:30 PM, Andrew wrote:

CA> http://www.sophos.com/spaminfo/articles/spamwords.html

CA> Interesting, but a pity they didn't publish a list of, say, their 
CA> 1,000 most popular obfuscations.

If you do the math then 1000 wouldn't even scratch it. One way to attack
this ( at least one of the ways we do it in Message Sniffer ) is to
apply some obfuscation algorithms to each word in the list using some
generic expansion patterns -- this helps to simplify the problem a bit.

For example, one obfuscation algorithm is to insert a single extra
character in the word. If you take the word obfuscation and apply this
expansion algorithm you get something like:

o~bfuscation
ob~fuscation
obf~uscation
...
obfuscatio~n

where ~ represents any random character.

Then think about adding two characters...

...
ob~fusc~ation
...

Then think about breaking the word with an empty anchor at any of the
places where you would insert a character...

...
obfus<a href="http://yo-mama.it";></a>cation
...

and so on...

Of course, you can't simply apply all of the possible obfuscation
algorithms, and you can't completely exercise each one that you do
try... you have to pick and choose and learn as you go because otherwise
you would simply never finish the job. ***

If you iterate through all of the permutations and count them then the
numbers become astronomical... as in "viagra can be obfuscated (and
detected by their fine software) more than 5,600,000,000 different ways"
<ahem>. That's market speak for "look how powerful our software is
-whoooah!"

This is similar to a lot of other AI problems too and it's probably why
I'm involved since I love AI work. In most AI problems if you add up all
of the possible solutions to the problem you usually come up with a
number you couldn't possibly write down without writing the formula
instead. That is, the number would be so large that you would probably
die of old age before you actually finished writing all the digits. In
the AI world we talk about this huge sea of possibilities as a "solution
space".

If you tried to check every possible solution one by one until you found
the best answer it would take you forever. This is called a brute force
attack. It's also what makes the big numbers seem impressive, and what
makes most encryption schemes work.###

Since we don't usually have "forever", we do something else in the AI
world. We use algorithms to "search the solution space" for the best
answer. That is, rather than just going through the possible solutions
one at a time as we come to them (brute force) we try to figure out
which ones to look at and which ones to skip. The way we make that
decision is to use an algorithm that leverages special "rules of thumb"
(heuristics) to help us search the solution space more efficiently. This
effectively "reduces the solution space" and makes it possible to come
up with an answer that is good enough+++ within the time we have.

So, when they talk about recognizing more than 5 billion different
obfuscated forms of the word viagra they are really just estimating how
many of the permutations their heuristics are able to eliminate from the
solution space. (A more accurate way to think about it might be that a
single heuristic for a particular obfuscated word covers a large amount
of the solution space all at once. Since it's already been covered it
doesn't have to be searched -- the extra work is eliminated as compared
to a brute-force attack.)

For example: Suppose you have a sandbox into which someone has thrown a
marble. If you have to find the marble then you could estimate all of
the grains of sand you would have to examine in order to find it. Let's
see... for a sandbox that is 3 meters on a side and 10 cm deep that
would be... (scratches head, punches on calculator, looks at watch,
gives up...) a "Sagan" of sand grains. (I fondly remember Dr. Carl Sagan
talking about astronomical numbers like this when talking about the
cosmos.)

So, to find the marble in the sandbox without individually picking up
each grain of sand then we'll need some tools (algorithms) to help us
reduce the problem. We could also use a heuristic to help us reduce the
problem further...

Let's do this:

1) Get a bucket, a screen with holes smaller than a marble but larger
than a few grains of sand, and a shovel.

2) Use a stick to draw lines on the sandbox and divide it into a grid
where each square is about the size of our bucket.

3) Skipping all of the squares where the sand is smooth and undisturbed
do the following:

4) Shovel all of the sand from a disturbed square into the bucket, then
sift the bucket of sand back into its' square using the screen.

5) Check the screen for the marble and if you find it stop - otherwise
move on to the next disturbed square.

---

This collection of heuristics and algorithms will find the marble very
quickly - MUCH more quickly than we might be led to believe if we were
to think about evaluating each grain of sand one at a time. (Consider
the bucket, screen, and stick to be functions -- just another kind of
algorithm. Step 3 is a powerful heuristic because that allows us to skip
(reduce) huge sections of the solution space. In this case it is very
likely that only one square will have been disturbed - the one that was
hit by the marble when it was thrown into the sandbox.)

---

All of this is a long winded way of saying that a list of the top 1000
obfuscations would be a lot like having the first 1000 grains of sand in
the square that contains the marble ;-) While interesting to look at,
the 1000 grains of sand would almost certainly have nothing else to do
with the actual marble - save that they are in the same square.

Sorry for the extra bandwidth... I just got on a tear about it and
couldn't stop.

Best,

_M

*** Check out our new Change Rates graph to see how well we are
"learning" which things to try and which ones to skip:
http://www.sortmonster.com/MessageSniffer/Performance/ChangeRates.jsp

Roughly: The dark blue area represents that part of the solution space
that we have searched unsuccessfully at any point in time. Brighter
colors (green, cyan, yellow, red) represent more successful search
attempts. The white colors represent how hard we worked on a given day.

Part of the process we use is to observe which kinds of rules are more
effective over time and to concentrate more on generating that type of
rule and on expanding the algorithms that we used to evolve that rule
type.

+++ Good enough = It is usually close to impossible to say for certain
that you have "THE BEST ANSWER" to a problem like this because you
deliberately don't look at MOST of the solutions that are possible. Some
of the most difficult problems in mathematics are like this --- we're
all really, really sure that we've got the best answer, but we can't
really prove it -- at least not yet ;-)

### What makes strong encryption strong is that the number of possible
keys is very, very large and because there isn't any apparent way to
guess the key without trying every one of them one at a time - a brute
force attack. The result is that if you were to use all the computing
power at your disposal to try every possible key it would take you so
long that whatever is protected by the key isn't worth the effort. The
trick is that there is almost always some way to make better guesses and
skip huge parts of the solution set that must be searched. This is why
quantum computing is getting a lot of buzz lately...

Right now our best encryption keys are based on very large prime
numbers. You take a couple of these numbers and combine them to come up
with a very large composite number. The reason we use these kinds of
keys is that it makes it easy for us to send keys to each other without
using invincible nuclear powered carrier pigeons. This is known as the
key exchange problem - a problem largely solved with public-key
encryption (as in PGP - Pretty Good Privacy).

Using ordinary computers there is no quick easy way to "factor" these
large numbers and determine what the original large prime numbers were.
As a result the best approach (in general) is a brute-force attack where
we simply use large numbers of computers all at once to try large
numbers of keys until we stumble upon the answer.

Quantum mechanics allows for subatomic particles which are in more than
one state (or even place) at a time... Unlike todays computers which
represent numbers as strings of bits where each bit can only be in one
of two states (0 or 1), a quantum computer can use strings of bits where
each bit can be in almost any number of states. With a computer like
that it becomes possible (theoretically) for one computer to represent
all of the possible factors all at once without having to go through
them one at a time... So, if you have a quantum computer you can factor
a PGP key _almost_ as quickly as you can multiply a pair of integers.




This E-Mail came from the Message Sniffer mailing list. For information
and (un)subscription instructions go to
http://www.sortmonster.com/MessageSniffer/Help/Help.html

This E-Mail came from the Message Sniffer mailing list. For information and 
(un)subscription instructions go to 
http://www.sortmonster.com/MessageSniffer/Help/Help.html

Reply via email to