First, you are confusing several things here about how SA works. Understanding them better will give you a better chance of deciding whether SA can do what you want it to do.
SA works two ways (well, a lot more, but two of importance here): 1) by hard-coded rules that check for known kinds of patterns, and 2) by Bayesian filtering.

Rules do not require "training". They instead require occasional monitoring to make sure they still perform well, and they have to be designed, tested, and written by hand. SA needs to be restarted when new rules are added.

Bayesian filtering works off tokens, and requires training. It needs matches on a number of tokens to produce a score, unlike a rule, which can match a single pattern and assign a score. You do not really have to understand much about how Bayes works, beyond being able to train it by feeding it things that are good and bad, and telling it which kind you are giving it. (The sa-learn commands at the end of this message show the idea.)

What you want to look for is patterns within a single item. These patterns can contain a lot of punctuation, and Bayes tends to split tokens on punctuation, so the one URL might become quite a few tokens. This may or may not be useful. My guess is that you would be much better off not using Bayes at all.

That means you would have to write rules to catch the sort of things you want to catch. In theory you would have to write at least one obfuscation rule for every domain name you want to check on (a sketch of such a rule is at the end of this message). This would be a manual process, but it could be automated to some extent with the various tools available for building obfuscated-phrase checks. I have no idea how well this would work. My thought is that it would be a lot of overkill, or at least overhead, for what you want.

You might be best off building a process where you can pipe new domain names through an obfuscation rule generator, and then combine the ever-growing output into a perl script. Then pipe test domain names through that script and see which ones it flags as possibly bogus. (A rough sketch of such a script is also at the end of this message.) You could do that with SA, but it might be more work than simply writing a standalone perl script.

Loren

-----Original Message-----
From: "Andrews, Rick" <[EMAIL PROTECTED]>
Sent: Oct 28, 2004 6:15 PM
To: "'users@spamassassin.apache.org'" <users@spamassassin.apache.org>
Subject: Using SpamAssassin, but not for spam

Greetings,

I'm trying to investigate whether SpamAssassin can be used in a non-spam application that we're trying to build. I've read lots of material on the website but I'm still not sure, so I thought I would ask you, the experts.

The application needs to determine whether a certain domain name is "similar" to another domain name. We have a list of known domain names, and occasionally want to compare a "target" domain name to see if it is similar to any of the known domain names. The target might contain replacement characters ("1" instead of "I" or "L", zero instead of "O", gratuitous dots or hyphens, etc.) in much the same way that spammers try to get past spam filters. That's why I thought SpamAssassin might be appropriate. To give an example, we want to automatically detect that "my-d0m.a1n_name.com" is very close to "mydomainname.com".

But from what I've read, I think it may not be appropriate, for several reasons:

1) We probably would have much more ham (known domain names) than spam (close to a known domain name, but not legitimate)
2) We wouldn't have large amounts of ham or spam to feed through SpamAssassin to enable it to learn and improve
3) The "target" domain name would in most cases be a single token as far as SpamAssassin is concerned, unlike an email, which likely contains hundreds of tokens from which to decide if it is spam

What do you think?
Would it take a lot of work to adapt SpamAssassin for this application? Does it seem like an appropriate tool to use?

Thanks in advance,
-Rick
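For concreteness, training Bayes just means handing SA labeled examples. In stock SpamAssassin that is done with sa-learn, roughly like this (the corpus directory names here are made up):

    sa-learn --ham  good-corpus/     # train on things you know are good
    sa-learn --spam bad-corpus/      # train on things you know are bad

As I said above, though, I doubt Bayes is the right fit for single-token items like domain names.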
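Here is the kind of hand-written obfuscation rule I mean, using the mydomainname.com example. The rule name and score are invented for illustration, and the pattern is only a rough cut: [\W_]* allows gratuitous dots, hyphens, and underscores between letters, and the character classes allow the common digit substitutions (0 for o, 1 for i or l, and so on):

    body      LOOKALIKE_MYDOMAINNAME  /m[\W_]*y[\W_]*d[\W_]*[o0][\W_]*m[\W_]*[a4][\W_]*[i1l][\W_]*n[\W_]*n[\W_]*[a4][\W_]*m[\W_]*[e3][\W_]*\.?[\W_]*c[\W_]*[o0][\W_]*m/i
    describe  LOOKALIKE_MYDOMAINNAME  Obfuscated look-alike of mydomainname.com
    score     LOOKALIKE_MYDOMAINNAME  5.0

That single pattern does catch "my-d0m.a1n_name.com", but you would need one like it for every protected domain, which is why generating them mechanically makes more sense than writing them by hand.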
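And here is a rough sketch of the generator-plus-checker idea as a standalone perl script. The substitution table and the list of known domains are invented for illustration; a real version would read the known list from a file and handle many more substitutions:

    #!/usr/bin/perl
    # Sketch: build one look-alike pattern per known domain, then
    # flag candidate names (one per line on stdin) that match a
    # pattern without being an exact match.
    use strict;
    use warnings;

    # Common look-alike substitutions (extend as needed).
    my %subs = (
        'o' => '[o0]',
        'i' => '[i1l]',
        'l' => '[l1i]',
        'a' => '[a4]',
        'e' => '[e3]',
        's' => '[s5]',
    );

    # Known-good domains to protect (hypothetical examples).
    my @known = ('mydomainname.com', 'example.net');

    # Turn a domain into a pattern that tolerates substitutions
    # and gratuitous punctuation between characters.
    sub obfuscation_pattern {
        my ($domain) = @_;
        my @parts;
        for my $ch (split //, lc $domain) {
            if ($ch eq '.') {
                push @parts, '\.?';    # dots may be dropped or moved
            } else {
                push @parts, exists $subs{$ch} ? $subs{$ch} : quotemeta($ch);
            }
        }
        my $re = join '[\W_]*', @parts;    # allow junk between characters
        return qr/^$re$/i;
    }

    my @patterns = map { [ $_, obfuscation_pattern($_) ] } @known;

    while (my $candidate = <STDIN>) {
        chomp $candidate;
        for my $pair (@patterns) {
            my ($domain, $re) = @$pair;
            if ($candidate =~ $re && lc($candidate) ne lc($domain)) {
                print "$candidate looks like $domain\n";
            }
        }
    }

Feed it a file of target names, for example "perl checkdomains.pl < targets.txt", and it prints the ones that resemble a protected domain. That gives you the pipeline I described without dragging in the rest of SA.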