> It may come down to my understanding of Bayes and its tokens... Also
> having a bit of a problem explaining this concept on paper...
>
> I see this as adding an extra layer to Bayes:
>
> Consider the following 2 basic emails:
>
> Mail 1:
> Viagra
>
> Mail 2:
> V1agra
>
>
> With Bayes:
>
> Mail 1:
> <token 1>
>
> Mail 2:
> <token 2>
>
> With Concepts & Bayes:
>
> Mail 1:
> <token 1>
> <meds>
>
> Mail 2:
> <token 2>
> <meds>
>
> ---
>
> So without Concepts:
>
> Mail 1 comes into the platform, is tokenized (token1) and is classified
> and learnt as spam.
> Mail 2 comes into the platform, is tokenized (token2) and has no common
> tokens with mail 1 - so no association is made.

Why is mail 1 classified and learned as spam and mail 2 not?
Classification and learning are two separate matters. Learning can be done
automatically (based on other rules) or manually.

In your example both Viagra and V1agra could be auto-learned as spam based
on other rules, or learned by hand. There is basically no difference between
them.
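For illustration, a minimal Python sketch of score-based auto-learning (the
threshold, function name and DB layout are all made up, not SpamAssassin's
actual API). The learning decision hinges on the score from the other rules,
not on how the token is spelled:

    AUTOLEARN_SPAM_THRESHOLD = 12.0   # hypothetical threshold

    def maybe_autolearn(tokens, rule_score, bayes_db):
        # Learn the message as spam when the non-Bayes rules already scored
        # it high enough; the token spelling plays no role in this decision.
        if rule_score >= AUTOLEARN_SPAM_THRESHOLD:
            for tok in tokens:
                bayes_db.setdefault(tok, {"spam": 0, "ham": 0})
                bayes_db[tok]["spam"] += 1

    bayes_db = {}
    maybe_autolearn(["viagra"], rule_score=15.0, bayes_db=bayes_db)   # mail 1
    maybe_autolearn(["v1agra"], rule_score=15.0, bayes_db=bayes_db)   # mail 2
    # Both tokens end up in the DB independently; no shared 'meds' token
    # is needed for either of them to be learned.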

>
> With Concepts
>
> Mail 1 comes into the platform, is tokenized (token1 & meds) and is
> classified and learnt as spam.
> Mail 2 comes into the platform, is tokenized (token2 & meds) and shares
> the common "meds" token associated with Mail 1.
>
> Does this make sense - am I right in my assumptions?

I think it might actually complicate matters for Bayes. You have now
introduced a third token, 'meds'. This token is also considered when Bayes
decides whether a mail is ham or spam. However, 'meds' is not a unique token;
it already exists (because I can write a mail asking about my dad's meds, or
other spam might mention it).

So how does this affect all mail with 'meds' in it? Does it help classify
more spam, but maybe also cause more false positives? Or maybe the opposite,
because 'meds' is used in a lot of legitimate mail?
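As a rough back-of-the-envelope illustration (a simplified Graham-style
per-token probability with made-up counts, not the actual formula any filter
uses), this is why the shared token cuts both ways:

    def token_spamicity(spam_hits, ham_hits):
        # Fraction of learned occurrences that were spam; 0.5 is neutral.
        return spam_hits / (spam_hits + ham_hits)

    print(token_spamicity(spam_hits=200, ham_hits=5))    # 'v1agra': ~0.98, strong spam signal
    print(token_spamicity(spam_hits=300, ham_hits=250))  # 'meds': ~0.55, nearly neutral

If 'meds' is learned mostly from spam it pulls legitimate mail about
medication towards spam; if it also shows up in a lot of ham it sits near 0.5
and adds little signal either way.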

I think it's very tricky to tamper with this by introducing new content.

Then there is the effort needed to maintain your concepts (I assume all
associations are made by hand; I didn't look at the code yet). It is most
likely always outdated. The original Bayes filter would know about V1agra
before you did and before you added it to your concepts. And once Bayes
already knows it, what is the point of creating a concept of it for Bayes?

It would be interesting to see what a new Bayes DB would do which is ONLY
trained with your concepts' keywords. This would be a very small Bayes DB, I
guess. Curious if it could be effective in any way.
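If anyone wants to try that, here is a throwaway sketch of what such a
concepts-only Bayes DB could look like (the concept names, counts and scoring
are all hypothetical, just to show the shape of the experiment):

    from collections import defaultdict
    from math import log

    CONCEPTS = {"meds", "lottery", "phishing"}
    counts = {"spam": defaultdict(int), "ham": defaultdict(int)}
    totals = {"spam": 0, "ham": 0}

    def learn(concept_tokens, label):
        totals[label] += 1
        for tok in concept_tokens & CONCEPTS:
            counts[label][tok] += 1

    def score(concept_tokens):
        # Log-likelihood ratio over concept tokens only; > 0 leans spam.
        llr = 0.0
        for tok in concept_tokens & CONCEPTS:
            p_spam = (counts["spam"][tok] + 1) / (totals["spam"] + 2)
            p_ham = (counts["ham"][tok] + 1) / (totals["ham"] + 2)
            llr += log(p_spam / p_ham)
        return llr

    learn({"meds"}, "spam")
    learn({"meds"}, "ham")       # 'meds' also shows up in legit mail
    learn({"lottery"}, "spam")
    print(score({"meds"}), score({"lottery"}))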

>
> Paul
>
> On 25/05/16 09:02, Merijn van den Kroonenberg wrote:
>>> With David's help I have tracked down the problem(s). Version 0.02 is
>>> up. Would be interested to hear your thoughts - even if just theoretical -
>>> about the effect on the Bayes DB.
>> Just in theory, I am curious what part of the Bayes filter you hope to
>> improve? I think you are not adding any *new* information to the e-mail;
>> your concepts are based purely on the mail content, right?
>>
>> It seems you just overpower some tokens a bit more, but I am not sure if
>> your concepts are useful for a Bayes filter. Especially a Bayes filter
>> would not need this, I would say. Maybe the concepts would be useful to
>> humans or to rules written by humans.
>>
>>> Paul
>>> --
>>> Paul Stead
>>> Systems Engineer
>>> Zen Internet
>>>
>>
>
> --
> Paul Stead
> Systems Engineer
> Zen Internet
>

