First, you are confusing several things here about how SA works.  If you 
understand this better, you will have a better chance of deciding whether SA 
can do what you want it to do.

SA works two ways (well, a lot more, but two of importance here): 
1) by hard-coded rules that check for known kinds of patterns, and 
2) by using Bayesian filtering.

Rules do not require "training".  They instead require occasional monitoring 
for efficiency, and they have to be designed, tested, and written by hand.  SA 
needs to be restarted when new rules are added.
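
For a sense of what a hand-written rule looks like, here is a minimal 
sketch for a local.cf file.  The rule name, pattern, and score are all 
invented for this example:

    # hypothetical rule -- name, pattern, and score are made up
    body     DEMO_OBFUSCATED_WORD   /\bv[i1]agr+a\b/i
    describe DEMO_OBFUSCATED_WORD   Obfuscated spellings of a common spam word
    score    DEMO_OBFUSCATED_WORD   2.5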

Bayesian filtering works off tokens, and it requires training.  It needs 
matches on a number of tokens to produce a score, unlike a rule, which can 
match a single pattern and give a score.  You do not really have to 
understand much about how Bayes works, beyond being able to train it: you 
feed it examples of good and bad mail and tell it which kind you are giving 
it.
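
Training is done with the sa-learn tool that ships with SA.  A minimal 
sketch, assuming your examples are collected in mbox files (the file 
names here are placeholders):

    sa-learn --ham  known-good.mbox --mbox
    sa-learn --spam known-bad.mbox  --mbox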

What you want to look for is patterns within a single item, and these 
patterns can contain a lot of punctuation.  Bayes tends to split text into 
tokens at punctuation, so the one URL might become quite a few tokens.  This 
may or may not be useful.  My guess is that you would be much better off not 
using Bayes at all.
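
To make that concrete, here is a rough Perl illustration of splitting 
on punctuation.  This is not SA's actual tokenizer, which is more 
involved; it just shows how one obfuscated domain turns into several 
tokens:

    # crude stand-in for tokenization, not SA's real tokenizer
    my @tokens = split /[^A-Za-z0-9]+/, 'my-d0m.a1n_name.com';
    print "@tokens\n";    # prints: my d0m a1n name com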

Which means that you would have to write rules to catch the sort of things you 
want to catch.  In theory you would have to write at least one obfuscation rule 
for every domain name you want to check on.
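
As a sketch, a rule for Rick's example domain might look something like 
the following in local.cf.  The rule name and score are invented, and 
the character classes cover only a handful of look-alike substitutions:

    body     FAKE_MYDOMAINNAME   /m[._-]*y[._-]*d[._-]*[o0][._-]*m[._-]*[a4][._-]*[i1l][._-]*n[._-]*n[._-]*[a4][._-]*m[._-]*[e3][._-]*\.[._-]*c[._-]*[o0][._-]*m/i
    describe FAKE_MYDOMAINNAME   Obfuscated look-alike of mydomainname.com
    score    FAKE_MYDOMAINNAME   5.0

Note that this pattern matches "my-d0m.a1n_name.com", but it also 
matches the legitimate domain itself, so exact matches of the real 
domain would have to be excused somewhere.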

This would be a manual process, but it could be automated to some extent by 
using one of the various tools available for generating obfuscated-phrase 
checks.

I have no idea how well this would work.  My thought is that it would be a 
lot of overkill, or at least overhead, for what you want.  You might be best 
off building a process where you can pipe new domain names through an 
obfuscation rule generator, and then combine the ever-growing output into a 
perl script.  Then pipe test domain names through that script and see which 
ones it flags as possibly bogus.
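
Here is a rough sketch of that idea as one self-contained perl script, 
under a few assumptions: the known domains live one per line in a file 
called domains.txt (a made-up name), candidate domains arrive on stdin, 
and the look-alike table is only a starting point:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Look-alike substitutions; a starting point, not a complete table.
    my %lookalike = (
        o => '[o0]', i => '[i1l]', l => '[l1i]',
        a => '[a4]', e => '[e3]', s => '[s5]',
    );
    my $sep = '[._-]*';    # optional gratuitous dots, hyphens, underscores

    # Turn one clean domain into a pattern that tolerates look-alike
    # characters and junk separators between them.
    sub obfuscation_pattern {
        my ($domain) = @_;
        my @chars = map { $lookalike{lc $_} // quotemeta($_) }
                    split //, $domain;
        my $body  = join $sep, @chars;
        return qr/^$body$/i;
    }

    # Build one pattern per known domain.
    open my $fh, '<', 'domains.txt' or die "domains.txt: $!";
    my @patterns;
    while (my $domain = <$fh>) {
        chomp $domain;
        push @patterns, [ $domain, obfuscation_pattern($domain) ]
            if length $domain;
    }
    close $fh;

    # Check each candidate from stdin against every known pattern.
    while (my $target = <STDIN>) {
        chomp $target;
        for my $p (@patterns) {
            my ($known, $pat) = @$p;
            next if $target eq $known;    # the real domain matches itself
            print "$target: possibly bogus (looks like $known)\n"
                if $target =~ $pat;
        }
    }

Running something like "echo my-d0m.a1n_name.com | perl check-domains.pl" 
(the script name is hypothetical) would then flag that target as possibly 
bogus.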

You could do that with SA, but it might be more work than simply writing and 
running a standalone perl script.

        Loren


-----Original Message-----
From: "Andrews, Rick" <[EMAIL PROTECTED]>
Sent: Oct 28, 2004 6:15 PM
To: "'users@spamassassin.apache.org'" <users@spamassassin.apache.org>
Subject: Using SpamAssassin, but not for spam

Greetings,

I'm trying to investigate whether SpamAssassin can be used in a non-spam
application that we're trying to build. I've read lots of stuff on the
website but I'm still not sure. I thought I would ask you, the experts.

The application needs to determine whether a certain domain name is
"similar" to another domain name. We have a list of known domain names, and
occasionally want to compare a "target" domain name to see if it is similar
to any of the known domain names. The target might contain replacement
characters ("1" instead of "I" or "L", zero instead of "O", gratuitious dots
or hyphens, etc.) in much the same way that spammers try to get past spam
filters. That's why I thought SpamAssassin might be appropriate. To give an
example, we want to automatically detect that "my-d0m.a1n_name.com" is very
close to "mydomainname.com".

But from what I've read, I think it may not be appropriate for several
reasons:

1) We probably would have much more ham (known domain names) than spam
(close to a known domain name, but not legal)

2) We wouldn't have large amounts of ham or spam to feed through
SpamAssassin to enable it to learn and improve

3) The "target" domain name would in most cases be a single token as far as
SpamAssassin is concerned; unlike an email which likely contains hundreds of
tokens from which to decide if it is spam

What do you think? Would it take a lot of work to adapt SpamAssassin for
this application? Does it seem like an appropriate tool to use?

Thanks in advance,

-Rick
