On Tuesday, March 22, 2005, 4:47:30 PM, Andrew wrote:

CA> http://www.sophos.com/spaminfo/articles/spamwords.html

CA> Interesting, but a pity they didn't publish a list of, say, their 1,000
CA> most popular obfuscations.

If you do the math then 1000 wouldn't even scratch it. One way to
attack this ( at least one of the ways we do it in Message Sniffer )
is to apply some obfuscation algorithms to each word in the list using
some generic expansion patterns -- this helps to simplify the problem
a bit.

For example, one obfuscation algorithm is to insert a single extra
character in the word. If you take the word obfuscation and apply this
expansion algorithm you get something like:

o~bfuscation
ob~fuscation
obf~uscation
...
obfuscatio~n

where ~ represents any random character.

Then think about adding two characters...

...
ob~fusc~ation
...

Then think about breaking the word with an empty anchor at any of the
places where you would insert a character...

...
obfus<a href="http://yo-mama.it";></a>cation
...

and so on...

Of course, you can't simply apply all of the possible obfuscation
algorithms, and you can't completely exercise each one that you do
try... you have to pick and choose and learn as you go because
otherwise you would simply never finish the job. ***

If you iterate through all of the permutations and count them then the
numbers become astronomical... as in "viagra can be obfuscated (and
detected by their fine software) more than 5,600,000,000 different
ways" <ahem>. That's market speak for "look how powerful our software
is -whoooah!"

This is similar to a lot of other AI problems too and it's probably
why I'm involved since I love AI work. In most AI problems if you add
up all of the possible solutions to the problem you usually come up
with a number you couldn't possibly write down without writing the
formula instead. That is, the number would be so large that you would
probably die of old age before you actually finished writing all the
digits. In the AI world we talk about this huge sea of possibilities
as a "solution space".

If you tried to check every possible solution one by one until you
found the best answer it would take you forever. This is called a
brute force attack. It's also what makes the big numbers seem
impressive, and what makes most encryption schemes work.###

Since we don't usually have "forever", we do something else in the AI
world. We use algorithms to "search the solution space" for the best
answer. That is, rather than just going through the possible solutions
one at a time as we come to them (brute force) we try to figure out
which ones to look at and which ones to skip. The way we make that
decision is to use an algorithm that leverages special "rules of
thumb" (heuristics) to help us search the solution space more
efficiently. This effectively "reduces the solution space" and makes
it possible to come up with an answer that is good enough+++ within
the time we have.

So, when they talk about recognizing more than 5 billion different
obfuscated forms of the word viagra they are really just estimating
how many of the permutations their heuristics are able to eliminate
from the solution space. (A more accurate way to think about it might
be that a single heuristic for a particular obfuscated word covers a
large amount of the solution space all at once. Since it's already
been covered it doesn't have to be searched -- the extra work is
eliminated as compared to a brute-force attack.)

For example: Suppose you have a sandbox into which someone has thrown
a marble. If you have to find the marble then you could estimate all
of the grains of sand you would have to examine in order to find it.
Let's see... for a sandbox that is 3 meters on a side and 10 cm deep
that would be... (scratches head, punches on calculator, looks at
watch, gives up...) a "Sagan" of sand grains. (I fondly remember Dr.
Carl Sagan talking about astronomical numbers like this when talking
about the cosmos.)

So, to find the marble in the sandbox without individually picking up
each grain of sand then we'll need some tools (algorithms) to help us
reduce the problem. We could also use a heuristic to help us reduce
the problem further...

Let's do this:

1) Get a bucket, a screen with holes smaller than a marble but larger
than a few grains of sand, and a shovel.

2) Use a stick to draw lines on the sandbox and divide it into a grid
where each square is about the size of our bucket.

3) Skipping all of the squares where the sand is smooth and
undisturbed do the following:

4) Shovel all of the sand from a disturbed square into the bucket,
then sift the bucket of sand back into its' square using the screen.

5) Check the screen for the marble and if you find it stop - otherwise
move on to the next disturbed square.

---

This collection of heuristics and algorithms will find the marble very
quickly - MUCH more quickly than we might be led to believe if we were
to think about evaluating each grain of sand one at a time. (Consider
the bucket, screen, and stick to be functions -- just another kind of
algorithm. Step 3 is a powerful heuristic because that allows us to
skip (reduce) huge sections of the solution space. In this case it is
very likely that only one square will have been disturbed - the one
that was hit by the marble when it was thrown into the sandbox.)

---

All of this is a long winded way of saying that a list of the top 1000
obfuscations would be a lot like having the first 1000 grains of sand
in the square that contains the marble ;-) While interesting to look
at, the 1000 grains of sand would almost certainly have nothing else
to do with the actual marble - save that they are in the same square.

Sorry for the extra bandwidth... I just got on a tear about it and
couldn't stop.

Best,

_M

*** Check out our new Change Rates graph to see how well we are
"learning" which things to try and which ones to skip:
http://www.sortmonster.com/MessageSniffer/Performance/ChangeRates.jsp

Roughly: The dark blue area represents that part of the solution space
that we have searched unsuccessfully at any point in time. Brighter
colors (green, cyan, yellow, red) represent more successful search
attempts. The white colors represent how hard we worked on a given
day.

Part of the process we use is to observe which kinds of rules are more
effective over time and to concentrate more on generating that type of
rule and on expanding the algorithms that we used to evolve that rule
type.

+++ Good enough = It is usually close to impossible to say for certain
that you have "THE BEST ANSWER" to a problem like this because you
deliberately don't look at MOST of the solutions that are possible.
Some of the most difficult problems in mathematics are like this ---
we're all really, really sure that we've got the best answer, but we
can't really prove it -- at least not yet ;-)

### What makes strong encryption strong is that the number of possible
keys is very, very large and because there isn't any apparent way to
guess the key without trying every one of them one at a time - a brute
force attack. The result is that if you were to use all the computing
power at your disposal to try every possible key it would take you so
long that whatever is protected by the key isn't worth the effort. The
trick is that there is almost always some way to make better guesses
and skip huge parts of the solution set that must be searched. This is
why quantum computing is getting a lot of buzz lately...

Right now our best encryption keys are based on very large prime
numbers. You take a couple of these numbers and combine them to come
up with a very large composite number. The reason we use these kinds
of keys is that it makes it easy for us to send keys to each other
without using invincible nuclear powered carrier pigeons. This is
known as the key exchange problem - a problem largely solved with
public-key encryption (as in PGP - Pretty Good Privacy).

Using ordinary computers there is no quick easy way to "factor" these
large numbers and determine what the original large prime numbers
were. As a result the best approach (in general) is a brute-force
attack where we simply use large numbers of computers all at once to
try large numbers of keys until we stumble upon the answer.

Quantum mechanics allows for subatomic particles which are in more
than one state (or even place) at a time... Unlike todays computers
which represent numbers as strings of bits where each bit can only be
in one of two states (0 or 1), a quantum computer can use strings of
bits where each bit can be in almost any number of states. With a
computer like that it becomes possible (theoretically) for one
computer to represent all of the possible factors all at once without
having to go through them one at a time... So, if you have a quantum
computer you can factor a PGP key _almost_ as quickly as you can
multiply a pair of integers.




This E-Mail came from the Message Sniffer mailing list. For information and 
(un)subscription instructions go to 
http://www.sortmonster.com/MessageSniffer/Help/Help.html

Reply via email to