You truly are a mad scientist - But we love ya! :) 

Matt

MaxNett Ltd
T.08701 624 989
F.08701 624 889
www.maxnett.co.uk

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
On Behalf Of Pete McNeil
Sent: 23 March 2005 00:37
To: Colbeck, Andrew
Subject: Re: [sniffer] Money, drugs, and sex

On Tuesday, March 22, 2005, 4:47:30 PM, Andrew wrote:

CA> http://www.sophos.com/spaminfo/articles/spamwords.html

CA> Interesting, but a pity they didn't publish a list of, say, their 
CA> 1,000 most popular obfuscations.

If you do the math then 1000 wouldn't even scratch it. One way to attack
this ( at least one of the ways we do it in Message Sniffer ) is to apply
some obfuscation algorithms to each word in the list using some generic
expansion patterns -- this helps to simplify the problem a bit.

For example, one obfuscation algorithm is to insert a single extra character
in the word. If you take the word obfuscation and apply this expansion
algorithm you get something like:

o~bfuscation
ob~fuscation
obf~uscation
...
obfuscatio~n

where ~ represents any random character.

Then think about adding two characters...

...
ob~fusc~ation
...

Then think about breaking the word with an empty anchor at any of the places
where you would insert a character...

...
obfus<a href="http://yo-mama.it";></a>cation
...

and so on...

Of course, you can't simply apply all of the possible obfuscation
algorithms, and you can't completely exercise each one that you do try...
you have to pick and choose and learn as you go because otherwise you would
simply never finish the job. ***

If you iterate through all of the permutations and count them then the
numbers become astronomical... as in "viagra can be obfuscated (and detected
by their fine software) more than 5,600,000,000 different ways" <ahem>.
That's market speak for "look how powerful our software is -whoooah!"

This is similar to a lot of other AI problems too and it's probably why I'm
involved since I love AI work. In most AI problems if you add up all of the
possible solutions to the problem you usually come up with a number you
couldn't possibly write down without writing the formula instead. That is,
the number would be so large that you would probably die of old age before
you actually finished writing all the digits. In the AI world we talk about
this huge sea of possibilities as a "solution space".

If you tried to check every possible solution one by one until you found the
best answer it would take you forever. This is called a brute force attack.
It's also what makes the big numbers seem impressive, and what makes most
encryption schemes work.###

Since we don't usually have "forever", we do something else in the AI world.
We use algorithms to "search the solution space" for the best answer. That
is, rather than just going through the possible solutions one at a time as
we come to them (brute force) we try to figure out which ones to look at and
which ones to skip. The way we make that decision is to use an algorithm
that leverages special "rules of thumb" (heuristics) to help us search the
solution space more efficiently. This effectively "reduces the solution
space" and makes it possible to come up with an answer that is good
enough+++ within the time we have.

So, when they talk about recognizing more than 5 billion different
obfuscated forms of the word viagra they are really just estimating how many
of the permutations their heuristics are able to eliminate from the solution
space. (A more accurate way to think about it might be that a single
heuristic for a particular obfuscated word covers a large amount of the
solution space all at once. Since it's already been covered it doesn't have
to be searched -- the extra work is eliminated as compared to a brute-force
attack.)

For example: Suppose you have a sandbox into which someone has thrown a
marble. If you have to find the marble then you could estimate all of the
grains of sand you would have to examine in order to find it.
Let's see... for a sandbox that is 3 meters on a side and 10 cm deep that
would be... (scratches head, punches on calculator, looks at watch, gives
up...) a "Sagan" of sand grains. (I fondly remember Dr.
Carl Sagan talking about astronomical numbers like this when talking about
the cosmos.)

So, to find the marble in the sandbox without individually picking up each
grain of sand then we'll need some tools (algorithms) to help us reduce the
problem. We could also use a heuristic to help us reduce the problem
further...

Let's do this:

1) Get a bucket, a screen with holes smaller than a marble but larger than a
few grains of sand, and a shovel.

2) Use a stick to draw lines on the sandbox and divide it into a grid where
each square is about the size of our bucket.

3) Skipping all of the squares where the sand is smooth and undisturbed do
the following:

4) Shovel all of the sand from a disturbed square into the bucket, then sift
the bucket of sand back into its' square using the screen.

5) Check the screen for the marble and if you find it stop - otherwise move
on to the next disturbed square.

---

This collection of heuristics and algorithms will find the marble very
quickly - MUCH more quickly than we might be led to believe if we were to
think about evaluating each grain of sand one at a time. (Consider the
bucket, screen, and stick to be functions -- just another kind of algorithm.
Step 3 is a powerful heuristic because that allows us to skip (reduce) huge
sections of the solution space. In this case it is very likely that only one
square will have been disturbed - the one that was hit by the marble when it
was thrown into the sandbox.)

---

All of this is a long winded way of saying that a list of the top 1000
obfuscations would be a lot like having the first 1000 grains of sand in the
square that contains the marble ;-) While interesting to look at, the 1000
grains of sand would almost certainly have nothing else to do with the
actual marble - save that they are in the same square.

Sorry for the extra bandwidth... I just got on a tear about it and couldn't
stop.

Best,

_M

*** Check out our new Change Rates graph to see how well we are "learning"
which things to try and which ones to skip:
http://www.sortmonster.com/MessageSniffer/Performance/ChangeRates.jsp

Roughly: The dark blue area represents that part of the solution space that
we have searched unsuccessfully at any point in time. Brighter colors
(green, cyan, yellow, red) represent more successful search attempts. The
white colors represent how hard we worked on a given day.

Part of the process we use is to observe which kinds of rules are more
effective over time and to concentrate more on generating that type of rule
and on expanding the algorithms that we used to evolve that rule type.

+++ Good enough = It is usually close to impossible to say for certain
that you have "THE BEST ANSWER" to a problem like this because you
deliberately don't look at MOST of the solutions that are possible.
Some of the most difficult problems in mathematics are like this --- we're
all really, really sure that we've got the best answer, but we can't really
prove it -- at least not yet ;-)

### What makes strong encryption strong is that the number of possible keys
is very, very large and because there isn't any apparent way to guess the
key without trying every one of them one at a time - a brute force attack.
The result is that if you were to use all the computing power at your
disposal to try every possible key it would take you so long that whatever
is protected by the key isn't worth the effort. The trick is that there is
almost always some way to make better guesses and skip huge parts of the
solution set that must be searched. This is why quantum computing is getting
a lot of buzz lately...

Right now our best encryption keys are based on very large prime numbers.
You take a couple of these numbers and combine them to come up with a very
large composite number. The reason we use these kinds of keys is that it
makes it easy for us to send keys to each other without using invincible
nuclear powered carrier pigeons. This is known as the key exchange problem -
a problem largely solved with public-key encryption (as in PGP - Pretty Good
Privacy).

Using ordinary computers there is no quick easy way to "factor" these large
numbers and determine what the original large prime numbers were. As a
result the best approach (in general) is a brute-force attack where we
simply use large numbers of computers all at once to try large numbers of
keys until we stumble upon the answer.

Quantum mechanics allows for subatomic particles which are in more than one
state (or even place) at a time... Unlike todays computers which represent
numbers as strings of bits where each bit can only be in one of two states
(0 or 1), a quantum computer can use strings of bits where each bit can be
in almost any number of states. With a computer like that it becomes
possible (theoretically) for one computer to represent all of the possible
factors all at once without having to go through them one at a time... So,
if you have a quantum computer you can factor a PGP key _almost_ as quickly
as you can multiply a pair of integers.




This E-Mail came from the Message Sniffer mailing list. For information and
(un)subscription instructions go to
http://www.sortmonster.com/MessageSniffer/Help/Help.html

Scanned for Spam and Viruses by GetNoSpam.net
          http://www.getnospam.net




This E-Mail came from the Message Sniffer mailing list. For information and 
(un)subscription instructions go to 
http://www.sortmonster.com/MessageSniffer/Help/Help.html

Reply via email to