Yes - you missed something. It is about intersecting one corpus and NOT intersecting the other.

This is about what doesn't match - not what does.

On 01/20/16 10:26, Shawn Bakhtiar wrote:
Sorry.. how is this different than Naive Bayes filtering??

"Naive Bayes classifiers work by correlating the use of tokens (typically words, or sometimes other things), with spam and non-spam e-mails and then using Bayes' theorem to calculate a probability that an email is or is not spam."
https://en.wikipedia.org/wiki/Naive_Bayes_spam_filtering

"The set of fingerprints of the test message is intersected with the spam and ham corpora, creating subsets of matches. Then you do a set diff both ways (ham - spam) (spam - ham) and whichever side is bigger wins. Generally it will match on only one side or very predominately on one side." - Marc Perkel

You are still looking up words/phrases in a dictionary set, and coming up with a probability factor of which side it falls on (an application of Bayes' theorem).

Or did I miss something?



On Jan 20, 2016, at 9:17 AM, Wrolf <wr...@wrolf.net> wrote:

Good luck with your patent application, it should be in the infinitely elastic queue right after my perpetual motion machine.

Not sure how you will deal with the number of ham tokens in spam messages. Also not sure how much ham will get canned as spam - but then, maybe people shouldn't be sending each other poetry?

haiku by email
blossoms in my inbox
drink morning coffee


;-)


Wrolf
wr...@wrolf.net

On Wed, Jan 20, 2016 at 11:52 AM, Marc Perkel <supp...@junkemailfilter.com> wrote:

    OK - following up on this. I have my provisional patent filed.
    I'm still doing development to improve it and working on a
    licensing contract. But the license will be based on the Creative
    Commons patent with some restrictions added. Basically I want to
    get a license fee from the big guys and my spam filtering
    competitors. So unless you are in the spam filtering business or
    have more than 10,000 email addresses it's not going to cost you
    anything.

    I'm going to describe the concept here. I'm not going to share my
    code because my code is specific to my system and it is a
    combination of bash scripts, Redis, Pascal, PHP, and Exim rules.
    And the open source programmers are likely to implement it better
    than I have. Basically I'm trying not to put myself out of
    business and this new method is a bigger breakthrough than
    Bayesian filtering.

    Maybe I should call it a new plan for spam?

    So - I'm just going to introduce the concept right now about how
    it works. Once you know what I'm doing it should be easy to
    implement; I had it working in a couple of days and I'm not an
    outstanding programmer. One thing to keep in mind is this is a
    paradigm shift. It's not about matching - *it's about NOT
    matching*. And although it is far better at catching spam, its
    best feature is actively identifying good email.

    The secret sauce

    Suppose I get an email with the subject line "Let's get some
    lunch". I know it's a good email because spammers never say
    "Let's get some lunch". In fact there are an infinite number of
    words and phrases that are used in good email that are never ever
    used in spam. And if I'm using words and phrases *never used in
    spam* that are used in ham - it's good email. And similarly - if
    I'm using words and phrases that are used in spam and *never used
    in ham* - it's spam.

    So - how do I get a list of words and phrases never used in spam?
    I create a list of words and phrases that are used in spam and
    check to see if it's *not on the list*.

    What I do is tokenize the spammiest parts of the email, like the
    subject line, into phrases of 1, 2, 3, and 4 words.

    the quick brown fox jumps over the lazy dog - becomes

    "the" "quick" "the quick" "brown" "quick brown" "the quick brown"
    "fox" "brown fox" "quick brown fox" "the quick brown fox" "jumps"
    "fox jumps" "brown fox jumps" "quick brown fox jumps" "over"
    "jumps over" "fox jumps over" "brown fox jumps over" "the" "over
    the" "jumps over the" "fox jumps over the" "lazy" "the lazy"
    "over the lazy" "jumps over the lazy" "dog" "lazy dog" "the lazy
    dog" "over the lazy dog"
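
A minimal sketch of that tokenization in Python, for illustration only. The function name `ngram_tokens` and the `max_n` parameter are hypothetical, and splitting on whitespace is an assumption; the message above does not say how words are delimited or normalized.

```python
def ngram_tokens(text, max_n=4):
    """Return every run of 1..max_n consecutive words.

    Phrases are emitted in the same order as the list above: for each
    word, the phrases that end at that word, shortest first.
    """
    words = text.split()  # assumption: words are separated by whitespace
    tokens = []
    for end in range(1, len(words) + 1):
        for n in range(1, min(max_n, end) + 1):
            tokens.append(" ".join(words[end - n:end]))
    return tokens


tokens = ngram_tokens("the quick brown fox jumps over the lazy dog")
print(len(tokens))   # 30 phrases for this nine-word sentence
print(tokens[:3])    # ['the', 'quick', 'the quick']
```

Note that "the" appears twice in the list; once the tokens are stored in a set, duplicates collapse.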

    These tokens are learned as ham or spam and added to sets. I'm
    using Redis to do this because it has extremely fast set
    operations. I don't know of anything other than Redis that can do
    this. So think about Redis as the way to implement this.
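
One way to picture the learning step, sketched with plain Python sets standing in for the Redis sets. The class and method names are hypothetical and this is not the author's code. In Redis itself, adding members to a set is done with the SADD command.

```python
class TokenStore:
    """In-memory stand-in for two Redis sets, one per corpus (hypothetical)."""

    def __init__(self):
        self.ham = set()
        self.spam = set()

    def learn(self, tokens, is_spam):
        """Add the tokens of one already-labelled message to its corpus."""
        target = self.spam if is_spam else self.ham
        target.update(tokens)


store = TokenStore()
store.learn(["let's", "get", "let's get", "lunch"], is_spam=False)
store.learn(["get", "cheap", "pills", "cheap pills"], is_spam=True)

# "get" has been seen on both sides; "lunch" only in ham.
print(sorted(store.ham & store.spam))
print("lunch" in store.spam)
```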

    A new message comes in. It is tokenized and fingerprinted and
    hundreds of fingerprints are generated. Then it's all set
    operations. The set of fingerprints of the test message is
    intersected with the spam and ham corpora, creating subsets of
    matches. Then you do a set diff both ways (ham - spam) (spam -
    ham) and whichever side is bigger wins. Generally it will match
    on only one side or very predominately on one side.
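
The set operations described above could look roughly like this, using Python sets rather than Redis (where SINTER and SDIFF provide the same operations). This is one reading of the description, not the author's implementation: the function name `classify` and the "unsure" result for a tie are assumptions, since the text only says the bigger side wins.

```python
def classify(message_tokens, ham_corpus, spam_corpus):
    """Return (verdict, ham_only_count, spam_only_count) for one message."""
    fingerprints = set(message_tokens)

    # Intersect the message with each corpus to get the matching subsets.
    ham_hits = fingerprints & ham_corpus
    spam_hits = fingerprints & spam_corpus

    # Set difference both ways: keep only what matched one side exclusively.
    ham_only = ham_hits - spam_hits
    spam_only = spam_hits - ham_hits

    if len(ham_only) > len(spam_only):
        verdict = "ham"
    elif len(spam_only) > len(ham_only):
        verdict = "spam"
    else:
        verdict = "unsure"  # assumption: the source does not define a tie
    return verdict, len(ham_only), len(spam_only)


ham_corpus = {"let's", "get", "some", "lunch", "let's get", "some lunch"}
spam_corpus = {"get", "cheap", "pills", "cheap pills"}

print(classify(["let's", "get", "some", "lunch", "some lunch"], ham_corpus, spam_corpus))
print(classify(["get", "cheap", "pills", "cheap pills"], ham_corpus, spam_corpus))
```

The token "get" is in both corpora, so it cancels out in both differences and does not count for either side.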

    So I'm not just tokenizing the subject. I also tokenize the first
    25 words of the message, the text of links in the message, the
    name part of the from address, the header names, the attachment
    names, the PHP script if there is one, and various behavior
    characteristics (slow, no quit, no RDNS, number of MIME parts,
    multiple recipients, etc.).

    SpamAssassin is all about matching rules. This is all about not
    matching. Not matching allows you to compare to an infinite set
    rather than a finite set. So when spammers start misspelling
    words to not match the rules, my system catches that and makes
    its own rules. The tricks that spammers use now make it easier
    to catch them using this method.


    I will post a link to a better explanation later when I write
    one. But wanted to let you all know this wasn't just a tease from
    some crazy person.

    So - here's what I want to see happen.

    I'd like to see SA implement this. I will provide a license to
    include with it giving most people a free license, sort of like
    how Spamhaus isn't free to everyone, but it's in SA. Then the new
    method will take off and eventually I'll get a little something
    for this.

    This new method (I'm calling it the Evolution Spam Filter because
    the algorithm mimics evolution) doesn't just block spammers,
    it decimates spammers. It's not just a treatment - it's the cure.
    I hate spam and although I could have kept this secret and made
    money having the best spam filter on the planet, I decided I had
    a moral obligation to make this generally available. I think this
    will save the global economy billions of dollars in recovered
    productivity and crime and fraud prevention.

    I'm seeing close to 100% accuracy. It is so accurate it's scary
    and I think my implementation is crude at best. I think if it
    were done right it could even get closer to 100% than I have.
    Once you wrap your brain around the concept it's almost scary how
    well it works.

    A side effect is that this is a very fast and simple recursive
    learner. What happens is that as people converse by email it
    learns more words and phrases about the stuff that people talk
    about that are never used in spam. It doesn't have to know what
    language you are using; it will learn it on its own. It's like
    having SA with 100 million accurate rules where it writes new
    rules itself.

    I will leave you with that and I'll have more later.





--
Marc Perkel - Sales/Support
supp...@junkemailfilter.com
http://www.junkemailfilter.com
Junk Email Filter dot com
415-992-3400
