On Monday 22 August 2016 at 18:00:35, Marc Perkel wrote:

> On 08/22/16 07:37, Antony Stone wrote:
> > 
> > So what makes "cheapest Viagra online" a token, such that "cheapest" and
> > "online" are not tokens?
>
> They would all be tokens. Just pointing out one that would match spam
> and not match ham. "cheapest" and "online" would likely be in both sets
> and would be ignored.

Hm, that doesn't tie up with your earlier reply:

On Monday 22 August 2016 at 16:34:00, Marc Perkel wrote:

> On 08/22/16 07:28, Dianne Skoll wrote:
> > On Mon, 22 Aug 2016 07:16:41 -0700
> > 
> > As far as I understand your algorithm, if an email contains at least one
> > token in the "ham" set and zero tokens in the "spam" set, you classify it
> > as ham.  And conversely, if it contains at least one spam token but zero
> > ham tokens, you classify it as spam.
> 
> YES! YES! YES!

Er, really?  See below.

> Although I look at some thousand "fingerprints" to get a more
> significant result.
> 
> > The other two possibilities (no tokens in either or some tokens in both)
> > are undecidable.
> 
> Exactly!

So, it's not that "if an email contains at least one token in the 'ham' set 
and zero tokens in the 'spam' set, you classify it as ham".

You in fact ignore any tokens in the email which are in both the 'ham' and 
'spam' sets, and then - what - work out which set contains more of the
leftover tokens?
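
To make the difference between the two readings concrete, here is a rough
Python sketch. It is only my guess at what is being described, not Marc's
actual code: the function names, the "undecided" label and the tiny example
token sets are all invented for illustration.

```python
def classify_strict(tokens, ham_set, spam_set):
    """Reading 1 (Dianne's summary): at least one token from one set
    and zero tokens from the other set decides the class."""
    has_ham = any(t in ham_set for t in tokens)
    has_spam = any(t in spam_set for t in tokens)
    if has_ham and not has_spam:
        return "ham"
    if has_spam and not has_ham:
        return "spam"
    return "undecided"  # no tokens in either set, or tokens in both


def classify_ignoring_shared(tokens, ham_set, spam_set):
    """Reading 2 (my guess): drop tokens found in both sets, then see
    which set accounts for more of the leftover tokens."""
    shared = ham_set & spam_set
    ham_only = sum(1 for t in tokens if t in ham_set and t not in shared)
    spam_only = sum(1 for t in tokens if t in spam_set and t not in shared)
    if ham_only > spam_only:
        return "ham"
    if spam_only > ham_only:
        return "spam"
    return "undecided"


# Hypothetical token sets: "cheapest" and "online" appear in both.
ham_set = {"cheapest", "online", "meeting"}
spam_set = {"cheapest", "online", "cheapest viagra online"}
tokens = ["cheapest", "online", "cheapest viagra online"]

strict_result = classify_strict(tokens, ham_set, spam_set)
shared_result = classify_ignoring_shared(tokens, ham_set, spam_set)
```

With these made-up sets the first reading cannot decide, because "cheapest"
and "online" are in the ham set as well, while the second reading calls the
message spam. That is the discrepancy I am asking about.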


Antony.

-- 
Pavlov is in the pub enjoying a pint.
The barman rings for last orders, and Pavlov jumps up exclaiming "Damn!  I 
forgot to feed the dog!"

                                                   Please reply to the list;
                                                         please *don't* CC me.
