Re: [SAtalk] Removing headers etc.. to feed Bayes correctly

2003-06-16 Thread Kai Schaetzl
Tony Earnshaw wrote on Sun, 15 Jun 2003 18:48:20 +0200:

 Course I didn't take offense at *anything* you have said, say now or
 will say in the future. I don't know what this is about; quoting could
 help jog my memory. It's a huge list, and I can't find the bit you're
 talking about here. Anyway, my excuses if I made you feel you had to
 write this.


Hi Tony,

it was your response below. I really didn't quite understand it, but was 
wondering what Gerhard Schröder had to do with this or statistics. And 
so I figured there was a slight chance that this was intended to express 
that you didn't like my reply. Whatever, thanks for clarifying. :-)

Have a nice week!

 That's exactly what I was saying (perhaps I'd misunderstood Tom.) I was
 trying to say that teaching it spam under the level that one has defined
 as being spam - even if it is spam - amounts to defeating one's own
 purpose.

 No, that's exactly what Tom was questioning and after thinking about it for
 two seconds it becomes obvious that he's right. One should teach Bayes
 every spam it doesn't get known otherwise.

 Do you remember the indignant voice of Gerhard Schröder (clip was in
 English) when being told by D. Rumsfeld that he was obliged to take part
 in the war against Iraq?

 Well, that's how I feel about this particular thing. Some balmy
 institute once forced statistics down my throat.




Kai

-- 

Kai Schätzl, Berlin, Germany
Get your web at Conactive Internet Services: http://www.conactive.com
IE-Center: http://ie5.de  http://msie.winware.org





---
This SF.NET email is sponsored by: eBay
Great deals on office technology -- on eBay now! Click here:
http://adfarm.mediaplex.com/ad/ck/711-11697-6916-5
___
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk


Re: [SAtalk] Removing headers etc.. to feed Bayes correctly

2003-06-16 Thread Tony Earnshaw
Kai Schaetzl wrote:

it was your response below. I really didn't quite understand it, but was 
wondering what Gerhard Schröder had to do with this or statistics. And 
so I figured there was a slight chance that this was intended to express 
that you didn't like my reply. Whatever, thanks for clarifying. :-)
Ah. No, it wasn't against you. Rumsfeld had told Schröder he had to do 
what the US said (go to war), because the US said so. Schröder, 
remembering past history, got very annoyed. I don't suppose German 
listeners would have heard or even appreciated how annoyed he was - it 
was a BBC clip, in English. I found it marvelous; the BBC played it over 
and over again for weeks and I found it just as marvelous each time.

What I was trying to say was, that I'd once had to learn statistics as 
part of a business course and what people on the list were saying about 
murdering the Bayes database before it had even reached maturity made me 
feel like Gerhard Schröder.

Have a nice week!
:-) You the same

Best,

Tonni

--
Tony Earnshaw
Working to get a life

http://j-walk.com/blog/docs/conference.htm
http://www.billy.demon.nl
Mail: [EMAIL PROTECTED]




RE: [SAtalk] Removing headers etc.. to feed Bayes correctly

2003-06-16 Thread Tom Meunier
 -----Original Message-----
 From: Tony Earnshaw [mailto:[EMAIL PROTECTED]
 Sent: Monday, June 16, 2003 7:58 AM
 To: [EMAIL PROTECTED]
 Subject: Re: [SAtalk] Removing headers etc.. to feed Bayes correctly
 
 ...people on the list were saying about murdering the Bayes database
 before it had even reached maturity made me feel like Gerhard Schröder.


Tony,

To my mind, it's not murdering, or anything remotely approaching it.  The suggestion 
to let sa-learn do the initial ham and spam seeding is simply not optimal.  
Autolearning above (or below) a threshold established by SpamAssassin is an 
ill-conceived method of establishing an initial Bayes token base.  Pre-selecting a 
corpus through spamassassin directly contradicts the entire basis upon which Bayesian 
theory relies for a token database:  the assumption that there are interesting 
tokens that normal heuristics are missing.  A Bayes database doesn't reach maturity 
by having a certain number of SA-filtered spams over 15 and SA-filtered hams under -2; 
it reaches maturity by having a certain number of confirmed hams and spams, period.  
Therefore, if one organization obtains initial Bayes seeding strictly through 
auto-learning for three weeks and gets 2000 hams and 2000 spams in it, and another does 
theirs in 15 minutes by manually teaching it 2000 hams from this week, and 2000 spams 
from this week (that SpamAssassin has never touched), the LATTER would be the much, 
much more accurate Bayesian seeding procedure.

This is discussed in depth in Paul Graham's writing on the topic, specifically the 
part where he mentions that tokens like "per" and "FL" and "ff" are actually very 
reliable indicators of spammishness.
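
To make the seeding argument concrete, here is a toy sketch in Python. It is not SpamAssassin's Bayes implementation; the messages, the scores, and the simplified per-token estimate are all invented for illustration. It compares a token table seeded from every hand-confirmed spam with one seeded only from spam that scored over the autolearn threshold of 15.

```python
from collections import Counter


def token_spam_probability(spam_msgs, ham_msgs):
    """Simplified per-token estimate of P(spam | token) from message counts."""
    spam_counts = Counter(t for m in spam_msgs for t in set(m.split()))
    ham_counts = Counter(t for m in ham_msgs for t in set(m.split()))
    probs = {}
    for tok in set(spam_counts) | set(ham_counts):
        # Fraction of spam (resp. ham) messages containing the token.
        s = spam_counts[tok] / max(len(spam_msgs), 1)
        h = ham_counts[tok] / max(len(ham_msgs), 1)
        probs[tok] = s / (s + h)
    return probs


# Hypothetical corpus: (text, SpamAssassin score), all hand-confirmed as spam.
confirmed_spam = [
    ("money back guarantee click below", 18.2),
    ("refinance mortgage per month", 4.1),
    ("mortgage rates per FL resident", 3.5),
]
confirmed_ham = [
    "meeting notes attached",
    "lunch on friday",
    "patch attached for review",
]

all_spam = [text for text, _ in confirmed_spam]
autolearned_only = [text for text, score in confirmed_spam if score > 15]

manual = token_spam_probability(all_spam, confirmed_ham)
auto = token_spam_probability(autolearned_only, confirmed_ham)

print("mortgage" in manual, "mortgage" in auto)
```

In this toy corpus the low-scoring spams are exactly the ones carrying tokens the rules missed, so the threshold-filtered table never sees them.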

-tom




Re: [SAtalk] Removing headers etc.. to feed Bayes correctly

2003-06-16 Thread Tony Earnshaw
Tom Meunier wrote:

To my mind, it's not murdering, or anything remotely approaching it.
Bless you for bothering, Tom. Your line wrap is so utterly impossible 
with Mozilla 1.4rc1 (would have been better in Evo 1.2.4, but I fscked 
that up on my machine by compiling and installing gtk+2.2), that I don't 
know whether I can cope with it. We'll see :-)

 The suggestion to let sa-learn do the initial ham and spam seeding is 
simply not optimal.

That's what I keep on saying.

 Autolearning above (or below) a threshold established by SpamAssassin 
is an ill-conceived method of establishing an initial Bayes token base.

That's what I keep on saying.

 Pre-selecting a corpus through spamassassin directly contradicts the 
entire basis upon which Bayesian theory relies for a token database:

Agreed.

 the assumption that there are interesting tokens that normal 
heuristics are missing.

Agreed.

A Bayes database doesn't reach maturity by having a certain number of SA-filtered spams over 15 and SA-filtered hams under -2; it reaches maturity by having a certain number of confirmed hams and spams, period.
Disagree. It never reaches maturity. But just as a kid or a kitten, it 
has to reach maturity. No good teaching it as an adult until then.

 Therefore, if one organization obtains initial Bayes seeding strictly 
through auto-learning for three weeks and gets 2000 hams and 2000 spams 
in it, and another does theirs in 15 minutes by manually teaching it 
2000 hams from this week, and 2000 spams from this week (that 
SpamAssassin has never touched), the LATTER would be the much, much more 
accurate Bayesian seeding procedure.

That last line went on for 1.5 kilometers (gasp). It's the *pattern* of 
tokens that matters. And until that pattern is established, it's useless 
to rely on it. It's useless to expect that a kid of 5 should know the 
difference between play and reality when he's pointing a Colt 45 at 
someone, unless he either shoots him, or you smack his hand and take the 
gun away. I choose for smacking his hand and taking the gun away.

This is discussed in depth in Paul Graham's writing on the topic, specifically the part where he mentions that tokens like "per" and "FL" and "ff" are actually very reliable indicators of spammishness.
Probably. But a: I'm pig-headed and b: you can't reach a statistical 
conclusion with a population of 1. The greater the bias, the greater the 
accuracy. The kid of 5 will probably thank you in later life for 
smacking his hand and taking the gun away. Maybe you're a mathematician, 
I wouldn't know. I hate math. But I've done enough chi squared and other 
analyses to know that.

Best - and I really appreciate your involvement,

Tony

--
Tony Earnshaw
Working to get a life

http://j-walk.com/blog/docs/conference.htm
http://www.billy.demon.nl
Mail: [EMAIL PROTECTED]




Re: [SAtalk] Removing headers etc.. to feed Bayes correctly

2003-06-16 Thread Justin Mason

Tom Meunier said:

 A Bayes database doesn't reach maturity by having a certain number of
 SA-filtered spams over 15 and SA-filtered hams under -2; it reaches
 maturity by having a certain number of confirmed hams and spams, period.
 Therefore, if one organization obtains initial Bayes seeding strictly
 through auto-learning for three weeks and gets 2000 hams and 2000 spams
 in it, and another does theirs in 15 minutes by manually teaching it
 2000 hams from this week, and 2000 spams from this week (that
 SpamAssassin has never touched), the LATTER would be the much, much more
 accurate Bayesian seeding procedure.

Yes, exactly correct.

 This is discussed in depth in Paul Graham's writing on the topic,
 specifically the part where he mentions that tokens like "per" and "FL"
 and "ff" are actually very reliable indicators of spammishness.

Mind you, this is not so correct. ;)

Ignore that part of PG's writings; it indicates only that he does not get
very much HTML email ;)   We tested this, and against our corpora it did
very badly.  So it's one of those things that mean 1 thing for 1 person
and another for others -- which, coincidentally, is where bayes does well
;)

--j.




Re: [SAtalk] Removing headers etc.. to feed Bayes correctly

2003-06-16 Thread Kai Schaetzl
Sorry, clicked the wrong button. My last reply wasn't meant to go to the 
list but only to Tony, my apologies.


Kai

-- 

Kai Schätzl, Berlin, Germany
Get your web at Conactive Internet Services: http://www.conactive.com
IE-Center: http://ie5.de  http://msie.winware.org







Re: [SAtalk] Removing headers etc.. to feed Bayes correctly

2003-06-15 Thread Tony Earnshaw
Kai Schaetzl wrote:

That's exactly what I was saying (perhaps I'd misunderstood Tom.) I was 
trying to say that teaching it spam under the level that one has defined 
as being spam - even if it is spam - amounts to defeating one's own 
purpose.
No, that's exactly what Tom was questioning and after thinking about it for 
two seconds it becomes obvious that he's right. One should teach Bayes 
every spam it doesn't get known otherwise.

Do you remember the indignant voice of Gerhard Schröder (clip was in 
English) when being told by D. Rumsfeld that he was obliged to take part 
in the war against Iraq?

Well, that's how I feel about this particular thing. Some balmy 
institute once forced statistics down my throat.

Best,

Tony

--
Tony Earnshaw
- Deyr fé, deyr frendr
deyr sjálfr 'it sama
- ek veit ein aldrigi deyr
- dómr um dauðan hvern.
From Hávamál - what gods have said

http://j-walk.com/blog/docs/conference.htm
http://www.billy.demon.nl
Mail: [EMAIL PROTECTED]




Re: [SAtalk] Removing headers etc.. to feed Bayes correctly

2003-06-15 Thread Kai Schaetzl
Tony, I don't know what you mean. But if you took any offense from my 
reply it was definitely not meant this way, sorry.


Kai

-- 

Kai Schätzl, Berlin, Germany
Get your web at Conactive Internet Services: http://www.conactive.com
IE-Center: http://ie5.de  http://msie.winware.org







Re: [SAtalk] Removing headers etc.. to feed Bayes correctly

2003-06-15 Thread Tony Earnshaw
Kai Schaetzl wrote:

Tony, I don't know what you mean. But if you took any offense from my 
reply it was definitely not meant this way, sorry.
Kai,

Course I didn't take offense at *anything* you have said, say now or 
will say in the future. I don't know what this is about; quoting could 
help jog my memory. It's a huge list, and I can't find the bit you're 
talking about here. Anyway, my excuses if I made you feel you had to 
write this.

Best,

Tony

--
Tony Earnshaw
- Deyr fé, deyr frendr
deyr sjálfr 'it sama
- ek veit ein aldrigi deyr
- dómr um dauðan hvern.
From Hávamál - what gods have said

http://j-walk.com/blog/docs/conference.htm
http://www.billy.demon.nl
Mail: [EMAIL PROTECTED]






Re: [SAtalk] Removing headers etc.. to feed Bayes correctly

2003-06-14 Thread Tony Earnshaw
Justin Mason wrote:

You'll confuse the whole Bayes database if you do anything different. 
Why in goodness name put a minimum score of 5 in the first place, if 
you're going to contradict yourself?


Actually, Tom's dead right.

If it's spam, feed it to the bayes learner as spam; if it's ham, do
the opposite.  Stuff that the learner got wrong is especially valuable,
as it fixes the tokens that were misleading it in the first place.
That's exactly what I was saying (perhaps I'd misunderstood Tom.) I was 
trying to say that teaching it spam under the level that one has defined 
as being spam - even if it is spam - amounts to defeating one's own 
purpose. One can start doing that *after* one's got one's initial biased 
database, but at least give the whole thing a reasonable base on which 
to begin.

Best,

Tony

--
Tony Earnshaw
- Deyr fé, deyr frendr
deyr sjálfr 'it sama
- ek veit ein aldrigi deyr
- dómr um dauðan hvern.
From Hávamál - the voice of the gods.

http://j-walk.com/blog/docs/conference.htm
http://www.billy.demon.nl
Mail: [EMAIL PROTECTED]




Re: [SAtalk] Removing headers etc.. to feed Bayes correctly

2003-06-14 Thread Tony Earnshaw
Robert Menschel wrote:

TE To get a reasonable base, it's been my understanding that you teach
TE Bayes what is spam and what isn't. ...
Agreed.

TE You don't start contradicting what you've taught it by teaching it
TE low scoring spam until after you've reached your minimum bias of 200.
How is teaching Bayes that this email with a low SA score is actually
spam contradicting what we've taught it? I see this as teaching Bayes
that there is spam which SA doesn't yet have adequate rules for, and IMO,
from my experience, Bayes is a lot more flexible in handling these than
SA is.
I'm simply saying wait until you've got your 200-base bias to do so.

(I don't have the ability to create new SA rules because of my end-user
status. I can change scores, but that has limited application. I can feed
all of my spam into Bayes, and Bayes works wonders at recognizing spam
that SA can't.) 
The system you're using probably reached that minimum bias long ago.

TE You'll confuse the whole Bayes database if you do anything different.
TE Why in goodness name put a minimum score of 5 in the first place, if
TE you're going to contradict yourself?
Bayes doesn't care about the score, and Bayes can't IMO be confused by
seeing ham or spam as long as it's properly identified.
Right. It's the *pattern of tokens* used for Bayes analysis I'm 
concerned about. After 1,000 spams that's o.k., but under 200 it's critical.

So really we're saying the same thing, only you didn't notice my proviso 
of a minimum number of spams/non-spams (the latter are far more numerous 
on my system) before treating the thing as an adult.

Best,

Tony

--
Tony Earnshaw
- Deyr fé, deyr frendr
deyr sjálfr 'it sama
- ek veit ein aldrigi deyr
- dómr um dauðan hvern.
From Hávamál - the voice of the gods.

http://j-walk.com/blog/docs/conference.htm
http://www.billy.demon.nl
Mail: [EMAIL PROTECTED]




Re: [SAtalk] Removing headers etc.. to feed Bayes correctly

2003-06-14 Thread Kai Schaetzl
Tony Earnshaw wrote on Sat, 14 Jun 2003 10:35:50 +0200:

 That's exactly what I was saying (perhaps I'd misunderstood Tom.) I was 
 trying to say that teaching it spam under the level that one has defined 
 as being spam - even if it is spam - amounts to defeating one's own 
 purpose.


No, that's exactly what Tom was questioning and after thinking about it for 
two seconds it becomes obvious that he's right. One should teach Bayes 
every spam it doesn't get known otherwise.


Kai

-- 

Kai Schätzl, Berlin, Germany
Get your web at Conactive Internet Services: http://www.conactive.com
IE-Center: http://ie5.de  http://msie.winware.org







RE: [SAtalk] Removing headers etc.. to feed Bayes correctly

2003-06-13 Thread Tom Meunier
I'm kind of confused here.  The way I see it (which could very well be a 
misunderstanding, mind you) is that the reason it autolearns spam over 15 points by 
default is to make darned sure that it doesn't learn a false positive.  Then one would 
augment its learning by feeding missed spams through sa-learn.  The only reason I can 
think of to NOT feed low-scoring spams through sa-learn is that I've decided that a 
spam that scores 5.x points has no interesting tokens.  Quite the opposite is true; 
that's why we feed it with a corpus of known spam in the first place, rather than 
feeding it a corpus of known spam that has been run through spamassassin manually and 
the under-15 spams weeded out.  Same goes with hand-feeding hams that score 4.x 
points, in the theory that there's a fixed probability that a ham from that source 
will at some point trigger another test and trip it over the threshold.

Perhaps I misunderstand.  If so, I'd appreciate alternate viewpoints and discussion.
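
The split described above, between what gets autolearned and what still has to be hand-fed through sa-learn, can be sketched roughly as follows. This is an illustrative Python sketch, not SpamAssassin code: the function names are hypothetical, and the thresholds (over 15 for spam, under -2 for ham) are simply the figures mentioned in this thread.

```python
# Thresholds as quoted in this thread; treat them as assumptions, not as
# the authoritative defaults for any particular SpamAssassin release.
AUTOLEARN_SPAM_THRESHOLD = 15.0
AUTOLEARN_HAM_THRESHOLD = -2.0


def autolearn_label(score):
    """Label that autolearning would assign, or None if it stays silent."""
    if score > AUTOLEARN_SPAM_THRESHOLD:
        return "spam"
    if score < AUTOLEARN_HAM_THRESHOLD:
        return "ham"
    return None


def needs_manual_training(score, is_spam):
    """True when a human-confirmed message was not autolearned correctly."""
    expected = "spam" if is_spam else "ham"
    return autolearn_label(score) != expected


# A confirmed spam that scored 5.x was never autolearned, so it is a
# candidate for hand-feeding; one that scored 18 was already learned.
print(needs_manual_training(5.3, is_spam=True))
print(needs_manual_training(18.0, is_spam=True))
```

The same check covers the ham case: a confirmed ham scoring 4.x falls between the two thresholds and would also be flagged for hand-feeding.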

-tom

-----Original Message-----
From: Tony Earnshaw [mailto:[EMAIL PROTECTED]
Sent: Tuesday, June 10, 2003 3:10 PM
To: Simon Crowther
Cc: [EMAIL PROTECTED]
Subject: Re: [SAtalk] Removing headers etc.. to feed Bayes correctly


Simon Crowther wrote:

 I wish to start feeding some of these low scoring spams using SA 
 Learn.

Don't. Have patience; trust me.

Tony

-- 
Tony Earnshaw

There's none so daft as them as will not learn

http://j-walk.com/blog/docs/conference.htm
http://www.billy.demon.nl
Mail: [EMAIL PROTECTED]




Re: [SAtalk] Removing headers etc.. to feed Bayes correctly

2003-06-13 Thread Tony Earnshaw
Tom Meunier wrote:

I'm kind of confused here.  The way I see it (which could very well be a 
misunderstanding, mind you) is that the reason it autolearns spam over 15 points by 
default is to make darned sure that it doesn't learn a false positive.  Then one would 
augment its learning by feeding missed spams through sa-learn.  The only reason I can 
think of to NOT feed low-scoring spams through sa-learn is that I've decided that a 
spam that scores 5.x points has no interesting tokens.  Quite the opposite is true; 
that's why we feed it with a corpus of known spam in the first place, rather than 
feeding it a corpus of known spam that has been run through spamassassin manually and 
the under-15 spams weeded out.  Same goes with hand-feeding hams that score 4.x 
points, in the theory that there's a fixed probability that a ham from that source 
will at some point trigger another test and trip it over the threshold.
Perhaps I misunderstand.  If so, I'd appreciate alternate viewpoints and discussion.
Sorry for the late reply - I was doing something else :-)

To get a reasonable base, it's been my understanding that you teach 
Bayes what is spam and what isn't. Your basic spam score's already 
defined (default +5.0) in local.cf. You go on doing that until you've 
got 200 of the things (spam.) To my mind that should be closer to 500 or 
even 1,000, but never mind. You do that to get a reasonably biased base.

You don't start contradicting what you've taught it by teaching it low 
scoring spam until after you've reached your minimum bias of 200.

You'll confuse the whole Bayes database if you do anything different. 
Why in goodness name put a minimum score of 5 in the first place, if 
you're going to contradict yourself?

It's not me that made up the above, it's in the documentation and the 
list archives and what I kept to myself when I first started teaching Bayes.

Tony

--
Tony Earnshaw
- Deyr fé, deyr frendr
deyr sjálfr 'it sama
- ek veit ein aldrigi deyr
- dómr um dauðan hvern.
From Hávamál - the voice of the gods.

http://j-walk.com/blog/docs/conference.htm
http://www.billy.demon.nl
Mail: [EMAIL PROTECTED]




Re: [SAtalk] Removing headers etc.. to feed Bayes correctly

2003-06-13 Thread Justin Mason

Tony Earnshaw said:
 Tom Meunier wrote:
  I'm kind of confused here.  The way I see it (which could very well be a
  misunderstanding, mind you) is that the reason it autolearns spam over 15
  points by default is to make darned sure that it doesn't learn a false
  positive.  Then one would augment its learning by feeding missed spams
  through sa-learn.  The only reason I can think of to NOT feed low-scoring
  spams through sa-learn is that I've decided that a spam that scores 5.x
  points has no interesting tokens.  Quite the opposite is true; that's why
  we feed it with a corpus of known spam in the first place, rather than
  feeding it a corpus of known spam that has been run through spamassassin
  manually and the under-15 spams weeded out.  Same goes with hand-feeding
  hams that score 4.x points, in the theory that there's a fixed probability
  that a ham from that source will at some point trigger another test and
  trip it over the threshold.
  Perhaps I misunderstand.  If so, I'd appreciate alternate viewpoints and
  discussion.
 
 To get a reasonable base, it's been my understanding that you teach 
 Bayes what is spam and what isn't. Your basic spam score's already 
 defined (default +5.0) in local.cf. You go on doing that until you've 
 got 200 of the things (spam.) To my mind that should be closer to 500 or 
 even 1,000, but never mind. You do that to get a reasonably biased base.
 
 You don't start contradicting what you've taught it by teaching it low 
 scoring spam until after you've reached your minimum bias of 200.

 You'll confuse the whole Bayes database if you do anything different. 
 Why in goodness name put a minimum score of 5 in the first place, if 
 you're going to contradict yourself?

Actually, Tom's dead right.

If it's spam, feed it to the bayes learner as spam; if it's ham, do
the opposite.  Stuff that the learner got wrong is especially valuable,
as it fixes the tokens that were misleading it in the first place.
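
A toy illustration of that last point, in Python. This is a deliberately tiny counter-based learner written for this example, not SpamAssassin's Bayes code; the class name, messages, and probability formula are all assumptions. It shows a token that looked purely hammy moving toward spam once a missed spam is fed in.

```python
from collections import Counter


class TinyBayes:
    """Hypothetical minimal learner: per-token message counts only."""

    def __init__(self):
        self.spam_tokens = Counter()
        self.ham_tokens = Counter()
        self.nspam = 0
        self.nham = 0

    def learn(self, text, is_spam):
        tokens = set(text.lower().split())
        if is_spam:
            self.spam_tokens.update(tokens)
            self.nspam += 1
        else:
            self.ham_tokens.update(tokens)
            self.nham += 1

    def spam_probability(self, token):
        """Simplified P(spam | token); 0.5 for tokens never seen."""
        s = self.spam_tokens[token] / self.nspam if self.nspam else 0.0
        h = self.ham_tokens[token] / self.nham if self.nham else 0.0
        if s + h == 0:
            return 0.5
        return s / (s + h)


db = TinyBayes()
db.learn("quarterly mortgage statement enclosed", is_spam=False)
db.learn("cheap pills online", is_spam=True)
before = db.spam_probability("mortgage")

# A spam the learner would have passed as ham, now fed in by hand.
db.learn("refinance your mortgage today", is_spam=True)
after = db.spam_probability("mortgage")

print(before, after)
```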

--j.




Re: [SAtalk] Removing headers etc.. to feed Bayes correctly

2003-06-10 Thread Martin Radford
At Tue Jun 10 09:25:45 2003, Simon Crowther wrote:

 Received: from mailgate.msxi-euro.com
   ([136.140.231.40])
   by msxi-euro.com; Tue, 10 Jun 2003 01:36:56 +0100
 Received: by mailgate.msxi-euro.com (Postfix, from userid 1002)
   id 5EE9F9F25; Tue, 10 Jun 2003 01:36:26 + (GMT)
 Received: from lfallback1.lnd.ops.eu.uu.net
 (lfallback1.lnd.ops.eu.uu.net [62.189.34.30])
   by mailgate.msxi-euro.com (Postfix) with ESMTP id 647139F22
   for [EMAIL PROTECTED]; Tue, 10 Jun 2003 01:36:21 +
 (GMT)
 Received: from modemcable055.158-203-24.mtl.mc.videotron.ca
 ([24.203.158.55] helo=gtei.net)
   by lfallback1.lnd.ops.eu.uu.net with smtp (Exim 3.22 #1)
   id 19PX8l-0006n0-00
   for [EMAIL PROTECTED]; Tue, 10 Jun 2003 00:37:33 +
 Message-ID:
 [EMAIL PROTECTED]
 From: Kristopher Nance [EMAIL PROTECTED]
 To: [EMAIL PROTECTED]
 Subject: no joke d  
 ips9553ds9
 Date: Tue, 10 Jun 2003 15:49:45 +
 MIME-Version: 1.0
 In-Reply-To: [EMAIL PROTECTED]
 Content-Type: text/html
 Content-Transfer-Encoding: 8bit
 X-MIMEOLE: Produced By Microsoft MimeOLE V6.00.2800.1106
 X-Mailer: Microsoft Outlook IMO, Build 9.0.2416 (9.0.2910.0)
 X-Spam-Status: No, hits=3.5 required=5.0
   tests=BANG_GUARANTEE,CLICK_BELOW,DATE_IN_FUTURE_12_24,
 HTML_FONT_BIG,HTML_FONT_BIG_B,HTML_FONT_COLOR_BLUE,
 HTML_FONT_COLOR_RED,HTML_MESSAGE,HTML_TAG_EXISTS_TBODY,
 IN_REP_TO,MAILTO_TO_SPAM_ADDR,MIME_HTML_ONLY,MONEY_BACK,
 MSGID_GOOD_EXCHANGE,PENIS_ENLARGE,PENIS_ENLARGE2,
 ^^^^^^^^^^^^^^^^^^^

Upgrading to 2.55 will deal with this - MSGID_GOOD_EXCHANGE is worth
-5.7 in 2.53 and is one of the reasons for the release of 2.54.
Without that rule, the spam would have got 9.2.

If you can't upgrade immediately, add the following to your local.cf:

score MSGID_GOOD_EXCHANGE 0.0

and restart spamd (if you're using spamd).
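
The arithmetic behind the 9.2 figure, as a small Python sketch. The helper name is made up for illustration, and the sketch assumes a message counts as spam once its hits reach the required value; the numbers are the ones in the headers and text above.

```python
def score_without_rule(total_hits, rule_score):
    """Total the message would get if one rule contributed nothing."""
    return round(total_hits - rule_score, 1)


reported_hits = 3.5          # from X-Spam-Status: hits=3.5
required = 5.0               # from X-Spam-Status: required=5.0
msgid_good_exchange = -5.7   # the rule's score in 2.53, per the text above

adjusted = score_without_rule(reported_hits, msgid_good_exchange)
would_be_tagged = adjusted >= required

print(adjusted, would_be_tagged)
```

Zeroing the rule in local.cf has the same effect on the total as removing its -5.7 contribution here.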

I don't see anything Groupwise-specific in the headers.  sa-learn will
ignore any SpamAssassin markup it finds.

Martin
-- 
Martin Radford  |   Only wimps use tape backup: _real_ 
[EMAIL PROTECTED] | men just upload their important stuff  -o)
Registered Linux user #9257 |  on ftp and let the rest of the world  /\\
- see http://counter.li.org |   mirror it ;)  - Linus Torvalds _\_V




Re: [SAtalk] Removing headers etc.. to feed Bayes correctly

2003-06-10 Thread Tony Earnshaw
Simon Crowther wrote:

I wish to start feeding some of these low scoring spams using SA 
Learn.
Don't. Have patience; trust me.

Tony

--
Tony Earnshaw
There's none so daft as them as will not learn

http://j-walk.com/blog/docs/conference.htm
http://www.billy.demon.nl
Mail: [EMAIL PROTECTED]

