You truly are a mad scientist - But we love ya! :) Matt
MaxNett Ltd T.08701 624 989 F.08701 624 889 www.maxnett.co.uk -----Original Message----- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Pete McNeil Sent: 23 March 2005 00:37 To: Colbeck, Andrew Subject: Re: [sniffer] Money, drugs, and sex On Tuesday, March 22, 2005, 4:47:30 PM, Andrew wrote: CA> http://www.sophos.com/spaminfo/articles/spamwords.html CA> Interesting, but a pity they didn't publish a list of, say, their CA> 1,000 most popular obfuscations. If you do the math then 1000 wouldn't even scratch it. One way to attack this ( at least one of the ways we do it in Message Sniffer ) is to apply some obfuscation algorithms to each word in the list using some generic expansion patterns -- this helps to simplify the problem a bit. For example, one obfuscation algorithm is to insert a single extra character in the word. If you take the word obfuscation and apply this expansion algorithm you get something like: o~bfuscation ob~fuscation obf~uscation ... obfuscatio~n where ~ represents any random character. Then think about adding two characters... ... ob~fusc~ation ... Then think about breaking the word with an empty anchor at any of the places where you would insert a character... ... obfus<a href="http://yo-mama.it"></a>cation ... and so on... Of course, you can't simply apply all of the possible obfuscation algorithms, and you can't completely exercise each one that you do try... you have to pick and choose and learn as you go because otherwise you would simply never finish the job. *** If you iterate through all of the permutations and count them then the numbers become astronomical... as in "viagra can be obfuscated (and detected by their fine software) more than 5,600,000,000 different ways" <ahem>. That's market speak for "look how powerful our software is -whoooah!" This is similar to a lot of other AI problems too and it's probably why I'm involved since I love AI work. In most AI problems if you add up all of the possible solutions to the problem you usually come up with a number you couldn't possibly write down without writing the formula instead. That is, the number would be so large that you would probably die of old age before you actually finished writing all the digits. In the AI world we talk about this huge sea of possibilities as a "solution space". If you tried to check every possible solution one by one until you found the best answer it would take you forever. This is called a brute force attack. It's also what makes the big numbers seem impressive, and what makes most encryption schemes work.### Since we don't usually have "forever", we do something else in the AI world. We use algorithms to "search the solution space" for the best answer. That is, rather than just going through the possible solutions one at a time as we come to them (brute force) we try to figure out which ones to look at and which ones to skip. The way we make that decision is to use an algorithm that leverages special "rules of thumb" (heuristics) to help us search the solution space more efficiently. This effectively "reduces the solution space" and makes it possible to come up with an answer that is good enough+++ within the time we have. So, when they talk about recognizing more than 5 billion different obfuscated forms of the word viagra they are really just estimating how many of the permutations their heuristics are able to eliminate from the solution space. (A more accurate way to think about it might be that a single heuristic for a particular obfuscated word covers a large amount of the solution space all at once. Since it's already been covered it doesn't have to be searched -- the extra work is eliminated as compared to a brute-force attack.) For example: Suppose you have a sandbox into which someone has thrown a marble. If you have to find the marble then you could estimate all of the grains of sand you would have to examine in order to find it. Let's see... for a sandbox that is 3 meters on a side and 10 cm deep that would be... (scratches head, punches on calculator, looks at watch, gives up...) a "Sagan" of sand grains. (I fondly remember Dr. Carl Sagan talking about astronomical numbers like this when talking about the cosmos.) So, to find the marble in the sandbox without individually picking up each grain of sand then we'll need some tools (algorithms) to help us reduce the problem. We could also use a heuristic to help us reduce the problem further... Let's do this: 1) Get a bucket, a screen with holes smaller than a marble but larger than a few grains of sand, and a shovel. 2) Use a stick to draw lines on the sandbox and divide it into a grid where each square is about the size of our bucket. 3) Skipping all of the squares where the sand is smooth and undisturbed do the following: 4) Shovel all of the sand from a disturbed square into the bucket, then sift the bucket of sand back into its' square using the screen. 5) Check the screen for the marble and if you find it stop - otherwise move on to the next disturbed square. --- This collection of heuristics and algorithms will find the marble very quickly - MUCH more quickly than we might be led to believe if we were to think about evaluating each grain of sand one at a time. (Consider the bucket, screen, and stick to be functions -- just another kind of algorithm. Step 3 is a powerful heuristic because that allows us to skip (reduce) huge sections of the solution space. In this case it is very likely that only one square will have been disturbed - the one that was hit by the marble when it was thrown into the sandbox.) --- All of this is a long winded way of saying that a list of the top 1000 obfuscations would be a lot like having the first 1000 grains of sand in the square that contains the marble ;-) While interesting to look at, the 1000 grains of sand would almost certainly have nothing else to do with the actual marble - save that they are in the same square. Sorry for the extra bandwidth... I just got on a tear about it and couldn't stop. Best, _M *** Check out our new Change Rates graph to see how well we are "learning" which things to try and which ones to skip: http://www.sortmonster.com/MessageSniffer/Performance/ChangeRates.jsp Roughly: The dark blue area represents that part of the solution space that we have searched unsuccessfully at any point in time. Brighter colors (green, cyan, yellow, red) represent more successful search attempts. The white colors represent how hard we worked on a given day. Part of the process we use is to observe which kinds of rules are more effective over time and to concentrate more on generating that type of rule and on expanding the algorithms that we used to evolve that rule type. +++ Good enough = It is usually close to impossible to say for certain that you have "THE BEST ANSWER" to a problem like this because you deliberately don't look at MOST of the solutions that are possible. Some of the most difficult problems in mathematics are like this --- we're all really, really sure that we've got the best answer, but we can't really prove it -- at least not yet ;-) ### What makes strong encryption strong is that the number of possible keys is very, very large and because there isn't any apparent way to guess the key without trying every one of them one at a time - a brute force attack. The result is that if you were to use all the computing power at your disposal to try every possible key it would take you so long that whatever is protected by the key isn't worth the effort. The trick is that there is almost always some way to make better guesses and skip huge parts of the solution set that must be searched. This is why quantum computing is getting a lot of buzz lately... Right now our best encryption keys are based on very large prime numbers. You take a couple of these numbers and combine them to come up with a very large composite number. The reason we use these kinds of keys is that it makes it easy for us to send keys to each other without using invincible nuclear powered carrier pigeons. This is known as the key exchange problem - a problem largely solved with public-key encryption (as in PGP - Pretty Good Privacy). Using ordinary computers there is no quick easy way to "factor" these large numbers and determine what the original large prime numbers were. As a result the best approach (in general) is a brute-force attack where we simply use large numbers of computers all at once to try large numbers of keys until we stumble upon the answer. Quantum mechanics allows for subatomic particles which are in more than one state (or even place) at a time... Unlike todays computers which represent numbers as strings of bits where each bit can only be in one of two states (0 or 1), a quantum computer can use strings of bits where each bit can be in almost any number of states. With a computer like that it becomes possible (theoretically) for one computer to represent all of the possible factors all at once without having to go through them one at a time... So, if you have a quantum computer you can factor a PGP key _almost_ as quickly as you can multiply a pair of integers. This E-Mail came from the Message Sniffer mailing list. For information and (un)subscription instructions go to http://www.sortmonster.com/MessageSniffer/Help/Help.html Scanned for Spam and Viruses by GetNoSpam.net http://www.getnospam.net This E-Mail came from the Message Sniffer mailing list. For information and (un)subscription instructions go to http://www.sortmonster.com/MessageSniffer/Help/Help.html
