Re: why not doing a test that checks "name"- pairs

Chip M. Sat, 18 Aug 2007 11:39:36 -0700

Alberto, your reasoning is correct, based on my experience of actually
implementing and using such a system, albeit in a small scale environment.
As "sm" points out, it is particularly useful as a "pass" rule for exact
matches to your users' actual email client "real name"s.


I've implemented this as part of a qmail filter that runs after SA.  As
I've mentioned in other posts, I'm in a shared web hosting environment, and
have no control over SA, so designed my filter to complement the great
strengths of SA, and fill in the holes that are created by a limited
environment.  Just over twenty domains use my filter, and we all share
data, so as to improve everyone's killrates.

I have no idea how practical this would be as an SA plugin, and am
Perl-illiterate, so I merely describe how I have approached it.  More than
a year ago, I started using _VERY_ crude general header based (To/Cc
checking) real name "pass" rules, then in March of 2007 I added an explicit
"RealName" virtual header so as to allow more powerful rules, including
"match not" type penalty rules.


* Main Issues: *
        - generating a list of account specific real names (preferably
          automatically)
        - real-time extraction of the correct "real name"
        - some "real names" have been compromised, and should receive MUCH
          lower pass scores
        - some account names are inherently poorly suited to real name pass
          rules (e.g. "jayne.cobb" since all words in the real name also
          appear as words in the account part - "jcobb" is a better form)
        - some senders transpose real name parts (e.g. "Cobb, Jayne" in place
          of "Jayne Cobb")
        - some senders use cutesy nicknames or other tricks (e.g. "Hero of
          Canton" in place of "Jayne Cobb")
        - some senders (particularly Bulkers) use the complete account name
          as the real name, and should not be scored normally
          (e.g. "[EMAIL PROTECTED]" [EMAIL PROTECTED])


* PREP: Semi-automatic Real Name Data Generation: *
I'm just-a-programmer, not a sysadmin, so I don't know how a typical
pipeline works.  However, if it's practical, automatic real name extraction
should be fairly straightforward.  Just write something that you can
temporarily plug in _AFTER_ SA, and which extracts the account & real name
pair from everything which passes SA, accumulates the frequencies, and picks
the most often occurring real name(s) for each account (I usually limit this
to one or two).  Include an option for human inspection, mainly for cases
where there is no clear cut winner.  In my experience, the majority of
accounts can be generated automatically, however it's wise to inspect all
possibilities.  That's manageable for small companies (fewer than 20), and
shouldn't be too bad for low 100s.  The collector app only needs to be run
for a week or so.  New users could be added manually.
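
For anyone who wants to try the collector idea, here is a minimal sketch in
Python (not the code from my filter, and not Perl, so it is not an SA
plugin).  It assumes something upstream already hands over (account, real
name) pairs from mail that passed SA.  The function name, the top_n limit
and the min_share "clear cut winner" cutoff are all made up for
illustration.

```python
from collections import Counter, defaultdict

def collect_real_names(pairs, top_n=2, min_share=0.5):
    """Tally (account, real_name) pairs and pick the most frequent names.

    Returns {account: (chosen_names, needs_review)}.  min_share is a
    hypothetical cutoff: if the top name covers less than that share of
    the account's mail, the account is flagged for human inspection.
    """
    counts = defaultdict(Counter)
    for account, real_name in pairs:
        name = " ".join(real_name.lower().split())  # fold case and spacing
        if name:                                    # ignore empty real names
            counts[account.lower()][name] += 1

    result = {}
    for account, counter in counts.items():
        total = sum(counter.values())
        ranked = counter.most_common(top_n)         # most frequent first
        chosen = [name for name, _ in ranked]
        needs_review = ranked[0][1] / total < min_share
        result[account] = (chosen, needs_review)
    return result

# Sample data only; "jcobb" stands in for a real account name.
pairs = [
    ("jcobb", "Jayne Cobb"),
    ("jcobb", "jayne cobb"),
    ("jcobb", "Hero of Canton"),
    ("jcobb", "Jayne  Cobb"),
]
print(collect_real_names(pairs))
```

Dumping the full per account counter, rather than only the chosen names,
would give the human reviewer the frequencies as well.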

It took me much less than five minutes to generate such a data list AND all
matching rules for the last person to join my Team (18 accounts, one week
of data), and my tool merely dumps the per account RealNames with
frequencies.  A slicker tool could make this VERY practical for larger
userbases.

Maintenance and verification would probably be an utter pain for anything
in the 1000s, so best to let us small and nimble types prove its efficacy. :)

There is anecdotal evidence that Hotmail may be doing something with real
name based rules, though there are reports that it's a somewhat suboptimal
implementation.  I speculate that they could easily pull the real name
straight out of each user's settings.


* Plugin: Real Name Extraction: *
An actual SA plugin would need to use the SMTP Recipient (or most reliable
Delivered-To account name) to pick out the matching account from the To or
Cc headers, then pull out its real name.  There should also be some
facility for associating external aliases with accounts (e.g. a redirected
ISP account).  If it FAILS to find a matching account, _ALL_ other real
name tests should be skipped or return false.
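
A rough Python sketch of that extraction step follows, using the standard
library's email.utils.getaddresses to parse the headers.  The function
name, the alias table layout and the example.com addresses are assumptions
for illustration only.  It returns None when no matching account is found,
which is the signal to skip every other real name test.

```python
from email.utils import getaddresses

def extract_real_name(recipient, to_header="", cc_header="", aliases=None):
    """Find `recipient` in the To/Cc headers and return its real name.

    Returns "" when the address is present without a real name, and None
    when the address (or one of its aliases) is not in either header.
    `aliases` is a hypothetical table: external address -> local account.
    """
    wanted = {recipient.lower()}
    for external, account in (aliases or {}).items():
        if account.lower() == recipient.lower():
            wanted.add(external.lower())

    # Skip empty headers; newer Pythons reject the whole list otherwise.
    headers = [h for h in (to_header, cc_header) if h]
    for real_name, address in getaddresses(headers):
        if address.lower() in wanted:
            return real_name.strip()
    return None  # no matching account: skip all other real name tests

print(extract_real_name(
    "jcobb@example.com",
    to_header='"Jayne Cobb" <jcobb@example.com>, someone@example.com',
))
```

Where the recipient itself comes from (the SMTP envelope or a Delivered-To
header) depends on the mail setup and is left outside the sketch.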


* Plugin: Real Name Testing: *
If it does find a matching account, three main real name based tests can be
performed:  empty, match, match not.

It's probably easier to understand how these work with a sample, so let's
say we have a user whose account is "[EMAIL PROTECTED]", the real
name in his email client is "Jayne Cobb", and an automatic real name
collector has shown that occasionally he receives important email that uses
the real name "Hero of Canton".  Somewhere, we would construct two data
lists specific to his account, that would look something like this:
        realname_full  = jayne cobb, hero of canton
        realname_words = jayne, cobb, hero, canton

The generic real name "match" test would only trigger if the extracted real
name exactly (case-insensitively) matched either "jayne cobb" or "hero of
canton", and the "match not" test would only trigger if NONE of the four
words "jayne, cobb, hero, canton" was found anywhere in the real name.
It's feasible to do "soft" matching, instead of word boundary based
matching (my code allows either).

Here are some examples:
        [EMAIL PROTECTED]
        "Jayne Cobb" [EMAIL PROTECTED]
        "Jayen Cobb" [EMAIL PROTECTED]
        "Peter Petrelli" [EMAIL PROTECTED]
The first triggers an "empty" test, but none of the other types of tests.
The second triggers an exact "match" pass rule.
The third has a misspelling so it fails an exact "match" pass rule, AND it
also fails a "match not" penalty rule because one of the words ("Cobb")
does match.  In other words, it receives ZERO total real name points.
The fourth triggers a "match not" penalty rule, because NO words match.

By using a LIST of acceptable individual words in the "match not" rule,
there's no need to mess about with fuzzy matching.  It is still possible
for a fuzzy misfire to occur, however so far I have not seen any actual FPs
caused by them (in more than half a million human+machine reviewed emails).
Our only FPs have contained words that were wildly off, so fuzzy matching
would have made no difference.  As always, careful scoring is appropriate,
and your mileage may vary.  A fuzzy matching option might be more suited to
a later version of a plugin.


* Scoring Notes: *
I generally score the "empty" test either not at all or fairly low (0.5).
I find it's most helpful as a bonus penalty in compound/meta rules, for
example, I give many attachments (zip, PDF, or any image) a small to medium
score if the real name is empty.

I score the "match" rule between -0.51 and -4.59, depending on whether the
real name has been compromised (one of our users gets a lot of ED spam sent
from Russia with his correct real name), and whether that person has
critical "pass" needs.  I have found it to be an EXCELLENT means of
preventing FPs, particularly during times when I'm tinkering with stuff to
fight an emerging threat, and make a dumb mistake.  :)

I score the "match not" rules typically in the 1.02 to 3.06 range (default
of 2.60).  FPs have been extremely low, with most being unimportant
bulk/junk type mail.
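
As a sketch of how those scores might be wired up in Python: the table
below uses 0.5 for "empty" and my default of 2.60 for "match not".  Using
the two ends of my "match" range as the normal and compromised pass scores
is a simplification for illustration, since in practice the value is picked
per account.  All names are hypothetical.

```python
# Scores taken from the ranges described above; the exact picks are
# illustrative, not a recommendation.
DEFAULT_SCORES = {"empty": 0.5, "match": -4.59, "match_not": 2.60}
COMPROMISED_MATCH_SCORE = -0.51   # much weaker pass for a compromised name

def real_name_score(result, compromised=False, scores=DEFAULT_SCORES):
    """Map a real name test result to a score contribution.

    `result` is "empty", "match", "match_not", or None when no real name
    test fired (for example, no matching account was found).
    """
    if result is None:
        return 0.0
    if result == "match" and compromised:
        return COMPROMISED_MATCH_SCORE
    return scores[result]

print(real_name_score("match"), real_name_score("match", compromised=True))
```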

One weakness in my own filter is the lack of metas.  If an SA Real Name
plugin were developed, it would be more powerful, since it could be used to
reject specific attachment types that also triggered a "match not" test.
That level of control is more suited to a small business, but it sure is
nice to have. :)


* Efficacy and Performance Notes: *
Since I rolled this out last March, these tests alone immediately improved
my users' typical killrates from about 99.40% to 99.75% (three of us are
now at 100.00%), with a significant decrease in FPs.  Those levels have
been maintained, even during a period when many emerging threats have
driven down our SA rates (again, using a very constrained SA setup).

I have no feel for the SA system performance issues.  In my case, I do all
the "simple" (fast) tests first, then exit if the score is high enough, and
only then do DNS tests.  My general impression is that my overall
performance is higher, because on average these tests avoid more tests than
the time they consume.
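
The ordering I describe can be sketched like this in Python.  The function
name and the 5.0 threshold are assumptions for illustration; each test is
just a callable that takes a message and returns a score.

```python
def score_message(message, fast_tests, slow_tests, threshold=5.0):
    """Run the cheap tests first and skip the slow ones if already over.

    Returns (score, ran_slow_tests).  `threshold` is a hypothetical
    cutoff above which the message is treated as spam without running
    the slow (e.g. DNS based) tests.
    """
    score = sum(test(message) for test in fast_tests)
    if score >= threshold:
        return score, False            # early exit: slow tests skipped
    score += sum(test(message) for test in slow_tests)
    return score, True

# Toy tests: a "match not" penalty plus another cheap rule is already enough.
fast = [lambda m: 2.60, lambda m: 3.0]
slow = [lambda m: 1.0]                 # stands in for a DNS lookup
print(score_message("some message", fast, slow))
```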

Bottom line, I think these can be very effective for a smallish
environment.  Granted, I really need to write some code to extract precise
stats.  I am confident of the beneficial effect on FPs, because I check ALL
of those by hand.
        - "Chip"
