Re: SA Concepts - plugin for email semantics

2016-05-31 Thread David Jones
>From: RW <rwmailli...@googlemail.com> >Sent: Tuesday, May 31, 2016 5:20 PM >To: users@spamassassin.apache.org >Subject: Re: SA Concepts - plugin for email semantics >On Tue, 31 May 2016 15:20:56 -0400 >Bill Cole wrote: >> On 29 May 2016, at 11:07, RW wrote

Re: SA Concepts - plugin for email semantics

2016-05-31 Thread RW
On Tue, 31 May 2016 15:20:56 -0400 Bill Cole wrote: > On 29 May 2016, at 11:07, RW wrote: > > > Statistical filters are based on some statistical theory combined > > with pragmatic kludges and assumptions. Practical filters have been > > developed based on what's been found to work, not on

Re: SA Concepts - plugin for email semantics

2016-05-31 Thread Dianne Skoll
On Tue, 31 May 2016 21:23:11 +0100 Paul Stead wrote: > The implementation was undertaken from a personal interest - I asked > the question of what people thought of the implementation and the > impact to Bayes DB. I think what the "concepts" concept ends up doing

Re: SA Concepts - plugin for email semantics

2016-05-31 Thread Paul Stead
On 31/05/16 20:20, Bill Cole wrote: It is no shock that while this implementation has Paul Stead's name on it, it is apparently mostly the product of the anti-spam community's most spectacular case of Dunning-Kruger Syndrome, who has apparently figured out that his personal 'brand' has

Re: SA Concepts - plugin for email semantics

2016-05-31 Thread Bill Cole
On 29 May 2016, at 11:07, RW wrote: On Sat, 28 May 2016 15:37:21 -0400 Bill Cole wrote: More importantly (IMHO) they aren't designed to collide with existing common tokens and be added back into messages that may contain those tokens already in order to influence Bayesian classification.

Re: SA Concepts - plugin for email semantics

2016-05-31 Thread RW
On Tue, 31 May 2016 12:05:39 -0400 Bill Cole wrote: > On 31 May 2016, at 2:21, Henrik K wrote: > > > On Mon, May 30, 2016 at 06:25:08PM -0400, Dianne Skoll wrote: > >> On Mon, 30 May 2016 17:45:52 -0400 > >> "Bill Cole" wrote: > >> > >>> So you could

Re: SA Concepts - plugin for email semantics

2016-05-31 Thread RW
On Mon, 30 May 2016 17:45:52 -0400 Bill Cole wrote: > The "Naive Bayes" classification approach is theoretically moored to > Bayes' Theorem FWIW Bayes hasn't been "Naive Bayes" for a long time.

Re: SA Concepts - plugin for email semantics

2016-05-31 Thread Bill Cole
On 31 May 2016, at 2:21, Henrik K wrote: On Mon, May 30, 2016 at 06:25:08PM -0400, Dianne Skoll wrote: On Mon, 30 May 2016 17:45:52 -0400 "Bill Cole" wrote: So you could have 'sex' and 'meds' and 'watches' tallied up into frequency counts that sum

Re: SA Concepts - plugin for email semantics

2016-05-31 Thread Reindl Harald
On 31.05.2016 at 02:30, Bill Cole wrote: On 30 May 2016, at 18:25, Dianne Skoll wrote: On Mon, 30 May 2016 17:45:52 -0400 "Bill Cole" wrote: So you could have 'sex' and 'meds' and 'watches' tallied up into frequency counts that sum up natural

Re: SA Concepts - plugin for email semantics

2016-05-31 Thread Henrik K
On Mon, May 30, 2016 at 06:25:08PM -0400, Dianne Skoll wrote: > On Mon, 30 May 2016 17:45:52 -0400 > "Bill Cole" wrote: > > > So you could have 'sex' and 'meds' and 'watches' tallied up into > > frequency counts that sum up natural (word) and synthetic

Re: SA Concepts - plugin for email semantics

2016-05-30 Thread Bill Cole
On 30 May 2016, at 18:25, Dianne Skoll wrote: On Mon, 30 May 2016 17:45:52 -0400 "Bill Cole" wrote: So you could have 'sex' and 'meds' and 'watches' tallied up into frequency counts that sum up natural (word) and synthetic (concept) occurrences,

Re: SA Concepts - plugin for email semantics

2016-05-30 Thread Dianne Skoll
On Mon, 30 May 2016 17:45:52 -0400 "Bill Cole" wrote: > So you could have 'sex' and 'meds' and 'watches' tallied up into > frequency counts that sum up natural (word) and synthetic (concept) > occurrences, not just as incompatible types of input
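
The concern quoted here, that a synthetic concept token can land in the same frequency count as a natural word, can be illustrated with a small sketch. This is not SpamAssassin's Bayes code; the `tally` helper and the `concept:` prefix are hypothetical names used only to show the difference between colliding and namespaced tokens.

```python
from collections import Counter


def tally(words, concepts, prefix=""):
    """Count word tokens and concept tokens in one table.

    With an empty prefix, a concept named 'meds' is indistinguishable
    from the natural word 'meds'. A non-empty prefix (hypothetical
    here) keeps the two kinds of token apart.
    """
    counts = Counter(words)
    counts.update(prefix + concept for concept in concepts)
    return counts


words = ["cheap", "meds", "meds"]   # natural tokens from a message
concepts = ["meds"]                 # synthetic token added by a concept layer

merged = tally(words, concepts)                 # 'meds' counted 3 times
separate = tally(words, concepts, "concept:")   # 'meds' 2, 'concept:meds' 1
```

In the merged table the classifier can no longer tell how much of the 'meds' count came from the message text and how much was added back by the concept layer.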

Re: SA Concepts - plugin for email semantics

2016-05-30 Thread Bill Cole
On 28 May 2016, at 17:53, John Hardin wrote: Based on that, do you have an opinion on the proposal to add two-word (or configurable-length) combinations to Bayes? CAVEAT: it has literally been decades since I've worked deep in statistics on a routine basis rather than just using blindly

Re: SA Concepts - plugin for email semantics

2016-05-29 Thread Reindl Harald
On 29.05.2016 at 02:46, Dianne Skoll wrote: And also, two-word phrases can be stronger indicators than the individual words; "hot" and "sex" in isolation may not be strong spam indicators, but "hot sex" probably is stronger. Going from one-word tokens to one+two-word tokens will have a
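
As a rough illustration of the one+two-word tokens discussed here, the sketch below emits every single word plus every adjacent word combination up to a configurable length. It is an assumption-laden toy, not SpamAssassin's tokenizer: the `tokenize` function, its `max_n` parameter, and the word-splitting regex are all made up for this example.

```python
import re


def tokenize(text, max_n=2):
    """Return one-word tokens followed by adjacent-word combinations
    of up to max_n words, in order of increasing length."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    tokens = []
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            tokens.append(" ".join(words[i:i + n]))
    return tokens


tokens = tokenize("Hot sex tonight")
# ['hot', 'sex', 'tonight', 'hot sex', 'sex tonight']
```

A message of k words yields k one-word tokens plus k-1 two-word tokens, so the number of tokens per message roughly doubles, while the number of distinct tokens the database has to track can grow much faster.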

Re: SA Concepts - plugin for email semantics

2016-05-29 Thread RW
On Sat, 28 May 2016 15:37:21 -0400 Bill Cole wrote: > More importantly (IMHO) they aren't designed to collide with existing > common tokens and be added back into messages that may contain those > tokens already in order to influence Bayesian classification. > > There is sound statistical

Re: SA Concepts - plugin for email semantics

2016-05-28 Thread Dianne Skoll
On Sat, 28 May 2016 14:53:15 -0700 (PDT) John Hardin wrote: > Based on that, do you have an opinion on the proposal to add two-word > (or configurable-length) combinations to Bayes? I have an opinion. :) Extending Bayes to look at multiple tokens is a *very* good idea.

Re: SA Concepts - plugin for email semantics

2016-05-28 Thread John Hardin
On Sat, 28 May 2016, Bill Cole wrote: There is sound statistical theory consistent with empirical evidence underpinning the Bayes classifier implementation in SA. While there can be legitimate critiques of the SA implementation specifically and in general how well email word frequency fits

Re: SA Concepts - plugin for email semantics

2016-05-28 Thread Bill Cole
On 25 May 2016, at 13:15, Dianne Skoll wrote: On Wed, 25 May 2016 18:10:57 +0100 Paul Stead wrote: [quoting Dianne] "Concepts" is a lossy process. You are throwing away information. That is by design, similar to fingerprinting emails with iXhash or Razor.

Re: SA Concepts - plugin for email semantics

2016-05-26 Thread Matus UHLAR - fantomas
On Thu, 26 May 2016 12:20:35 +0200 Matus UHLAR - fantomas wrote: you apparently mistook Razor for DCC; DCC is here to measure bulkiness, but not (necessarily) spamminess. On 26.05.16 09:46, Dianne Skoll wrote: Yes, you are correct. Thanks for the clarification! And

Re: SA Concepts - plugin for email semantics

2016-05-26 Thread Dianne Skoll
On Thu, 26 May 2016 12:20:35 +0200 Matus UHLAR - fantomas wrote: > you apparently mistook Razor for DCC; DCC is here to measure > bulkiness, but not (necessarily) spamminess. Yes, you are correct. Thanks for the clarification! And also, just to clarify another thing:

Re: SA Concepts - plugin for email semantics

2016-05-25 Thread Dianne Skoll
On Wed, 25 May 2016 18:10:57 +0100 Paul Stead wrote: > > Yes, except here's the problem. A drug company might legitimately > > talk about Viagra, so that wouldn't be a spam token. V1agra almost > > certainly would be a spam token. Bayes can distinguish between

Re: SA Concepts - plugin for email semantics

2016-05-25 Thread Paul Stead
On 25/05/16 15:21, Dianne Skoll wrote: On Wed, 25 May 2016 15:07:37 +0100 Paul Stead wrote: Consider the following 2 basic emails: Mail 1: Viagra Mail 2: V1agra Yes, except here's the problem. A drug company might legitimately talk about Viagra, so that

Re: SA Concepts - plugin for email semantics

2016-05-25 Thread Dianne Skoll
On Wed, 25 May 2016 15:07:37 +0100 Paul Stead wrote: > Consider the following 2 basic emails: > Mail 1: > Viagra > Mail 2: > V1agra Yes, except here's the problem. A drug company might legitimately talk about Viagra, so that wouldn't be a spam token. V1agra

Re: SA Concepts - plugin for email semantics

2016-05-25 Thread Paul Stead
It may come down to my understanding of Bayes and its tokens. I am also having a bit of a problem explaining this concept on paper... I see this as adding an extra layer to the Bayes: Consider the following 2 basic emails: Mail 1: Viagra Mail 2: V1agra With Bayes: Mail 1: Mail 2: With
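
The "extra layer" idea for the two example mails can be sketched as follows. This is only a guess at the general shape, not the plugin's actual code: the `CONCEPT_PATTERNS` table, the `concept_viagra` token name, and the `tokens_with_concepts` function are hypothetical, and the plugin's real rules are not shown in this thread.

```python
import re

# Hypothetical concept table: one pattern covering a word and its
# common obfuscations. The real plugin's rules may look quite different.
CONCEPT_PATTERNS = {
    "concept_viagra": re.compile(r"\bv[i1!|]agra\b", re.IGNORECASE),
}


def tokens_with_concepts(text):
    """Return the raw word tokens plus one token per matched concept."""
    tokens = text.lower().split()
    for name, pattern in CONCEPT_PATTERNS.items():
        if pattern.search(text):
            tokens.append(name)
    return tokens


mail_1 = tokens_with_concepts("Viagra")   # ['viagra', 'concept_viagra']
mail_2 = tokens_with_concepts("V1agra")   # ['v1agra', 'concept_viagra']
```

Because the raw tokens are kept alongside the concept token in this sketch, the two spellings remain distinguishable while both mails also share one common token. Replacing the raw tokens with the concept token instead would discard that distinction.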

Re: SA Concepts - plugin for email semantics

2016-05-25 Thread Merijn van den Kroonenberg
> With David's help I have tracked down the problem(s). Version 0.02 is > up. Would be interested to hear your thoughts - even if just theoretical > about the effect on the Bayes DB. Just in theory, I am curious what part of the Bayes filter you hope to improve? I think you are not adding any

Re: SA Concepts - plugin for email semantics

2016-05-24 Thread Paul Stead
On 24/05/16 17:09, David Jones wrote: Good idea. I would like to test this out so I put this on my CentOS 6 servers (perl v5.10.1) and got this: May 24 10:59:51.850 [30158] warn: plugin: failed to parse plugin /etc/mail/spamassassin/Concepts.pm: Type of arg 1 to push must be array (not

Re: SA Concepts - plugin for email semantics

2016-05-24 Thread David Jones
>From: Paul Stead >Sent: Tuesday, May 24, 2016 9:55 AM >To: users@spamassassin.apache.org >Subject: SA Concepts - plugin for email semantics >Hi guys, >Based upon some information from others on the list I have put together >a plugin for SA which canonicalises an