Re[2]: Help with BayesIt tuning

2004-08-16 Thread MikeD (3)
Hello DZ-Jay,

Sunday, August 15, 2004, 9:31:38 AM, you wrote:

DJ Some time around 08/15/2004 09:23:46, I think I heard MikeD (3) say:
 Hello Andre,

 Sunday, August 15, 2004, 6:44:17 AM, you wrote:

AW Have you deleted your spam and non-spam dictionary files when you
AW upgraded?

 Funny, that.  When I first upgraded I did not and it seemed to work
 fine ... until I rebooted.

DJ Strange... rebooting shouldn't affect anything...

Well I am guessing that because I had been running the old version of
Bayesit earlier in the day, that it continued to use that until I
rebooted.  It is the only thing that I can think of that makes sense.

 After that, yes, I deleted all the dict files I could find.
 Apparently there were two sets, one from the old version and one set
 from the new.

DJ I had to do the same thing when upgrading from v0.4gm to
DJ v0.5.4 because I was having problems.

 I then re-trained it on the accumulated spam and ham folders I have
 with about 2,000 messages each.  BTW, If I give Bayesit all 2,000
 messages at once to chew on, it would hang.  If I gave it in
 chunks it seemed to work OK shrug

DJ Hum... after deleting the dict files, I trained normally with
DJ lots of spam/non-spam messages (I'm pretty sure it was more than
DJ 2,000) without a problem.  So I don't know what could have
DJ happened in your case (?)

DJ I personally find BayesIt extremely powerful, accurate, and
DJ fast (I come from POPFile, with an accuracy of 99.6 % which
DJ required a LOT of manual tuning, had quite some false positives,
DJ and was VERY slow...), but what it misses it *really* misses (0%,
DJ as opposed to some mid-way value).

I have used several 0.4 versions and they worked great, so I am
guessing that I just need to 'fix' a setting somewhere ... or at least
I hope that is it <g>

-- 
Best regards,
 MikeD mailto:[EMAIL PROTECTED]
Using The Bat! v2.12.00 on Windows ME 4.90 Build  3000
 



Current version is 2.12.00 | 'Using TBUDL' information:
http://www.silverstones.com/thebat/TBUDLInfo.html


Re[3]: Help with BayesIt tuning

2004-08-16 Thread MikeD (3)
Hello Pete,

Sunday, August 15, 2004, 9:52:14 AM, you wrote:

PH Sunday, August 15, 2004, 7:44:17 AM, you wrote:

AW Hello MikeD,

AW On 14 Aug 2004 at 14:47:24 -0500 GMT [21:47 CEST] you wrote:

AW Have you deleted your spam and non-spam dictionary files when you
AW upgraded?

PH What are their names and where are they?

Originally I had two sets of dictionaries.  One set (I assume for the old
version) was in c:\Program Files\TheBat\bayesit\base.  The current
version is creating the following files here ...

c:\My Documents\BatMail\bayesit\base
transact
spamdict.idx
nspamdict.idx
spamdict.lst
spamdict.bye
nspamdict.lst
selective.txt
nspamdict.bye

-- 
Best regards,
 MikeD mailto:[EMAIL PROTECTED]
Using The Bat! v2.12.00 on Windows ME 4.90 Build  3000
 





Re[2]: Help with BayesIt tuning

2004-08-16 Thread MikeD (3)
Hello DZ-Jay,

Sunday, August 15, 2004, 10:20:52 AM, you wrote:

DJ Some time around 08/15/2004 09:24:56, I think I heard MikeD (3) say:
DJ I was too.  I just upgraded yesterday to 0.5.9 and I haven't
DJ noticed a difference.  It does provide a white/black list, which I
DJ don't care to use because it defeats the purpose of a Bayesian
DJ filter (there's huge discussion -- more like religious wars --
DJ about this on the POPFile list hehe).  Also, the kludges.txt file
DJ doesn't seem to be implemented either (ignore list for headers).

 That's too bad sigh

DJ I just learned (by re-reading a babelfished translation of
DJ the Russian BayesIt page) that the kludges file (whitelist of
DJ kludges) does seem to work, except I misunderstood it.  I thought
DJ it worked like POPFile's ignore list, which ignores the
DJ specified tokens when computing the probability of a message.  But
DJ it is not a list of just tokens, it is a list of header names
DJ that will be ignored, for example, if you put in the list:

DJ message-id
DJ x-mailer
DJ subject

DJ It will ignore the values of headers that start with those
DJ strings.  This is very useful, though.

DJ I wonder, is the ignore list in the black/white list rules
DJ window what I confused the kludges list for? i.e. is it akin to
DJ the POPFile ignore list?  Anybody know?

Hmmm ... does it just ignore those 'lines' in the header?  If so, I
don't think that will be a problem for me.  My Kludges contains:

x-spam-checker-version
x-spam-level
x-spam-report
x-spam-status
x-uidl

And I don't think any of those are causing a problem.

-- 
Best regards,
 MikeD mailto:[EMAIL PROTECTED]
Using The Bat! v2.12.00 on Windows ME 4.90 Build  3000
 





Re[2]: Help with BayesIt tuning

2004-08-16 Thread MikeD (3)
Hello George,

Sunday, August 15, 2004, 11:35:29 AM, you wrote:

GM DZ-Jay wrote:

DJ Some time around 08/15/2004 11:13:49, I think I heard Stuart Cuddy say:

What is Graham?
What is Spam-grade?

DJ AFAIK, spam-grade would be the probability of it being spam, and
DJ Graham, I suppose, means the probability of it being not-spam (I
DJ suppose, non-spam-grade  ham-grade  graham ?)

GM It might be coincidence, but Paul Graham has written much about
GM Bayesian filtering.  I'd guess it has something to do with his
GM methodology.  Even if I'm wrong, there's some interesting reading at:

GM http://www.paulgraham.com/antispam.html


Yes, Paul Graham uses a slightly modified version of the original Bayes
algorithm.  So does that mean BayesIt is calculating with both algorithms
to create two values?

-- 
Best regards,
 MikeD mailto:[EMAIL PROTECTED]
Using The Bat! v2.12.00 on Windows ME 4.90 Build  3000
 





Re[2]: Help with BayesIt tuning

2004-08-16 Thread Stuart Cuddy
Hello DZ-Jay,
Sunday, August 15, 2004, 1:47:45 PM, you wrote:

 It might be coincidence, but Paul Graham has written much about
 Bayesian filtering.  I'd guess it has something to do with his
 methodology.  Even if I'm wrong, there's some interesting reading at:

 http://www.paulgraham.com/antispam.html

DJ Thanx for the info... that would make more sense, although
DJ how come the spam-grade and graham values coincide in all messages
DJ without exception?  I guess I'll ask Alexey about it.  In the
DJ meantime, I'll check out the link you sent :)

Does Alexey not frequent this list?  It would sure be helpful if he
could answer directly.

Does anyone know how we can continue this conversation directly with
him?


-- 
 Stuart mailto:[EMAIL PROTECTED]
Using The Bat! v2.13 Lucky Beta/5 on Windows 98 4.10 Build   A 





Re: Help with BayesIt tuning

2004-08-15 Thread Andre Wichartz
Hello DZ-Jay,

On 14 Aug 2004 at 14:42:17 -0400 GMT [20:42 CEST] you wrote:

DJ That makes sense.  But do you know how the weight is
DJ calculated? I can assume it is the product of its initial
DJ probability by the regarding threshold value, is that true?

I don't program the thing. For specific questions you really should ask
Alexey.

DJ And is it only for tokens that have the same occurrence in spam and
DJ non-spam messages, or is the weight skewed by this threshold on all
DJ tokens to give them an extra non-spammy umph in order to avoid
DJ false positives?

I just made an example. It would of course work regardless how often a
word occurs.

-- 
Cheers,
 Andre

Geh nicht nur die glatten Strassen:
 geh Wege, die vor Dir noch niemand ging,
 damit Du Spuren hinterlässt,und nicht nur Staub.  





Re: Help with BayesIt tuning

2004-08-15 Thread Andre Wichartz
Hello MikeD,

On 14 Aug 2004 at 14:47:24 -0500 GMT [21:47 CEST] you wrote:

M I have been following this thread since I have been having some
M problems too.  I was using the old version (0.4gm) until I upgraded to
M the current version of TB.

M The settings I used to use don't seem to work any more and I either
M get everything filtered as junk or nothing is filtered as junk.  I
M trained it with about 2000 spam and 2000 ham messages and still no
M joy.  I have tried low threshold numbers and high without much
M difference.

Have you deleted your spam and non-spam dictionary files when you
upgraded?

-- 
Cheers,
 Andre

I don't suffer from insanity.
 I enjoy every minute of it.  





Re: Help with BayesIt tuning

2004-08-15 Thread DZ-Jay
Some time around 08/14/2004 22:24:58, I think I heard MikeD (3) say:
 What settings are you using?  Under the old version (0.4gm) I had it
 trained and was getting most spam caught, no false positives with a
 Move message setting of 10.  Now I have gone down as low as 1 and as
 high as 99 without success.

I started with the move message setting at 40 and continued to lower it without 
noticing any effect.  That's when I checked the BAYESIT.LOG file and realized that all 
messages are marked with either 100/99 % or 0% probability, which means that no matter 
how low I set the parameter, it will continue working the same.  I don't understand 
how come there is no gray area, with messages marked with a, say, 30% probability, 
etc.  I do not get any false positives at all, but I do get about 4%  of false 
negatives...
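
One possible explanation for the missing gray area, assuming BayesIt combines per-token probabilities the way Paul Graham's "A Plan for Spam" essay describes (an assumption; the plugin's actual formula is not documented in this thread): the combined score is a ratio of products, so a handful of tokens near 0.01 or 0.99 push it almost all the way to 0 or 1. A minimal Python sketch with made-up token scores:

```python
from math import prod


def combine(probs):
    """Graham-style combination: prod(p) / (prod(p) + prod(1 - p))."""
    spam = prod(probs)
    ham = prod(1.0 - p for p in probs)
    return spam / (spam + ham)


# Hypothetical token scores, not taken from any real BAYESIT.LOG.
mostly_spammy = [0.99] * 10 + [0.01] * 5
mostly_hammy = [0.01] * 10 + [0.99] * 5
evenly_split = [0.99, 0.01, 0.6, 0.4]

print(combine(mostly_spammy))  # very close to 1
print(combine(mostly_hammy))   # very close to 0
print(combine(evenly_split))   # about 0.5, only because the evidence cancels out
```

Under this assumption a mid-range score such as 30% only appears when spammy and non-spammy tokens nearly balance, which would be rare after heavy training.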

 BTW, I am using the 0.5.5 version that came with 2.12.  Should I be
 using the newer version that I saw mentioned?

I was too.  I just upgraded yesterday to 0.5.9 and I haven't noticed a difference.  It 
does provide a white/black list, which I don't care to use because it defeats the 
purpose of a Bayesian filter (there's huge discussion -- more like religious wars -- 
about this on the POPFile list hehe).  Also, the kludges.txt file doesn't seem to be 
implemented either (ignore list for headers).

dZ.

-- 
Powered by The Bat! v.2.12.00,
  Hindered by MS Windows 2000 v.5.0 build 2195 Service Pack 4





Re: Help with BayesIt tuning

2004-08-15 Thread DZ-Jay
Some time around 08/14/2004 23:28:14, I think I heard Thomas Fernandez say:
DJ That makes sense. But do you know how the weight is calculated?

 Check out for a mathematician called Bayes. 19th century, IIRC.

Have you read the entire thread at all, or did you just decide to come in and offer 
your insightful comments at just this point?  I'm talking about the regarding 
threshold value and how it is used, i.e. given the Bayesian probability of a message 
what *ADDITIONAL* computation occurs with that parameter.  Do you know?  Do you think 
Mr. Bayes would have had enough visionary insight to see how this BayesIt-specific 
parameter was used by Alexey in his plugin?

DJ I can assume it is the product of its initial probability by the
DJ regarding threshold value, is that true?

 It's not that simple.

What is not that simple? The bayesian algorithm or how the regarding threshold is 
used by the plugin?  Because, if you have noticed from the context of the comment, I 
am talking about the parameters in the ADVANCED.INI file and how they are implemented.

dZ.

-- 
Powered by The Bat! v.2.12.00,
  Hindered by MS Windows 2000 v.5.0 build 2195 Service Pack 4





Re: Help with BayesIt tuning

2004-08-15 Thread DZ-Jay
Some time around 08/15/2004 07:43:05, I think I heard Andre Wichartz say:
 Hello DZ-Jay,

 On 14 Aug 2004 at 14:42:17 -0400 GMT [20:42 CEST] you wrote:

DJ That makes sense.  But do you know how the weight is
DJ calculated? I can assume it is the product of its initial
DJ probability by the regarding threshold value, is that true?

 I don't program the thing. For specific questions you really should ask
 Alexey.

I thought that with so much traffic in this list there would be someone who knew.  Oh 
well...

DJ And is it only for tokens that have the same occurrence in spam and
DJ non-spam messages, or is the weight skewed by this threshold on all
DJ tokens to give them an extra non-spammy umph in order to avoid
DJ false positives?

 I just made an example. It would of course work regardless how often a
 word occurs.

So you don't know... Ok.  I'll continue looking for info, probably contacting Alexey.

Thanx
dZ.

-- 
Powered by The Bat! v.2.12.00,
  Hindered by MS Windows 2000 v.5.0 build 2195 Service Pack 4





Re[2]: Help with BayesIt tuning

2004-08-15 Thread MikeD (3)
Hello Andre,

Sunday, August 15, 2004, 6:44:17 AM, you wrote:

AW Hello MikeD,

AW On 14 Aug 2004 at 14:47:24 -0500 GMT [21:47 CEST] you wrote:

AW Have you deleted your spam and non-spam dictionary files when you
AW upgraded?

Funny, that.  When I first upgraded I did not and it seemed to work
fine ... until I rebooted.

After that, yes, I deleted all the dict files I could find.
Apparently there were two sets, one from the old version and one set
from the new.

I then re-trained it on the accumulated spam and ham folders I have
with about 2,000 messages each.  BTW, If I give Bayesit all 2,000
messages at once to chew on, it would hang.  If I gave it in
chunks it seemed to work OK <shrug>

-- 
Best regards,
 MikeD mailto:[EMAIL PROTECTED]
Using The Bat! v2.12.00 on Windows ME 4.90 Build  3000
 





Re[2]: Help with BayesIt tuning

2004-08-15 Thread MikeD (3)
Hello DZ-Jay,

Sunday, August 15, 2004, 8:12:23 AM, you wrote:

DJ Some time around 08/14/2004 22:24:58, I think I heard MikeD (3) say:
 What settings are you using?  Under the old version (0.4gm) I had it
 trained and was getting most spam caught, no false positives with a
 Move message setting of 10.  Now I have gone down as low as 1 and as
 high as 99 without success.

DJ I started with the move message setting at 40 and continued
DJ to lower it without noticing any effect.  That's when I checked
DJ the BAYESIT.LOG file and realized that all messages are marked
DJ with either 100/99 % or 0% probability, which means that no matter
DJ how low I set the parameter, it will continue working the same.  I
DJ don't understand how come there is no gray area, with messages
DJ marked with a, say, 30% probability, etc.  I do not get any false
DJ positives at all, but I do get about 4%  of false negatives...

At the moment, everything in the log is .99.  Nothing has any other
value.  Does that sound right?

 BTW, I am using the 0.5.5 version that came with 2.12.  Should I be
 using the newer version that I saw mentioned?

DJ I was too.  I just upgraded yesterday to 0.5.9 and I haven't
DJ noticed a difference.  It does provide a white/black list, which I
DJ don't care to use because it defeats the purpose of a Bayesian
DJ filter (there's huge discussion -- more like religious wars --
DJ about this on the POPFile list hehe).  Also, the kludges.txt file
DJ doesn't seem to be implemented either (ignore list for headers).

That's too bad <sigh>


-- 
Best regards,
 MikeD mailto:[EMAIL PROTECTED]
Using The Bat! v2.12.00 on Windows ME 4.90 Build  3000
 





Re: Help with BayesIt tuning

2004-08-15 Thread Thomas Fernandez
Hello DZ-Jay,

On Sun, 15 Aug 2004 09:20:41 -0400 GMT (15/08/2004, 20:20 +0700 GMT),
DZ-Jay wrote:

DJ That makes sense. But do you know how the weight is calculated?

 Check out for a mathematician called Bayes. 19th century, IIRC.

DJ Have you read the entire thread at all, or did you just
DJ decide to come in and offer your insightful comments at just this
DJ point?

I've read the thread, but nowhere was it mentioned how a Bayesian filter
works. I thought that was your question. Apparently it wasn't, so
sorry for having wasted bandwidth.

 It's not that simple.

DJ What is not that simple? The bayesian algorithm or how the
DJ regarding threshold is used by the plugin?

The Bayesian algorithms. Your question, to which I answered, could be
understood this way, so I don't feel I have to apologise.

-- 

Regards,
Thomas.

Sorry, Officer, I didn't realize my radar detector wasn't plugged
in.

Message reply created with The Bat! 2.12.02
under Chinese Windows 98 4.10 Build  A 







Re: Help with BayesIt tuning

2004-08-15 Thread DZ-Jay
Some time around 08/15/2004 09:28:42, I think I heard Thomas Fernandez say:
DJ What is not that simple? The bayesian algorithm or how the
DJ regarding threshold is used by the plugin?

 The Bayesian algorithms. Your question, to which I answered, could be
 understood this way, so I don't feel I have to apologise.

I guess some people in this list just have to offer an answer -- any answer -- just 
because.

Well then, thank you for your wonderfully insightful answer of "Check out for a 
mathematician called Bayes. 19th century, IIRC."  No need to apologize at all, I have 
such a better grasp on the subject now, thanks!

dZ.

-- 
Powered by The Bat! v.2.12.00,
  Hindered by MS Windows 2000 v.5.0 build 2195 Service Pack 4





Re: Help with BayesIt tuning

2004-08-15 Thread DZ-Jay
Some time around 08/15/2004 09:24:56, I think I heard MikeD (3) say:
DJ I started with the move message setting at 40 and continued
DJ to lower it without noticing any effect.  That's when I checked
DJ the BAYESIT.LOG file and realized that all messages are marked
DJ with either 100/99 % or 0% probability, which means that no matter
DJ how low I set the parameter, it will continue working the same.  I
DJ don't understand how come there is no gray area, with messages
DJ marked with a, say, 30% probability, etc.  I do not get any false
DJ positives at all, but I do get about 4%  of false negatives...

 At the moment, everything in the log is .99.  Nothing has any other
 value.  Does that sound right?

That's more or less what I get, and in my opinion, it doesn't seem to be right.

However, I recently noticed why some obviously spam messages are given a probability 
of 0%:  Apparently the analysis engine is regarding a few empty tokens with a value 
of 0%, which unspamifies the final value, for example, in my log file, I get this in 
some messages:

: ---
15.08.2004 08:13:41 [EMAIL PROTECTED]
Graham:  0
Spam-grade:  0
Value for The Bat!: 0
: ---
LOTS OF SPAM-FULL TOKENS HERE
...
:  0
:  0
:  0
:  0
:  0
:  0
:  0
:  0

As you can see, no matter how many spam tokens are found, all those 0's will end up 
clearing the final probability value.  This seems to me a bug in the tokenizer.  I 
haven't been able to find a common denominator for messages that cause this.
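
The effect of a single 0 can be sketched in a few lines of Python, assuming (not confirmed for BayesIt) that the per-token values are multiplied together as in Paul Graham's "A Plan for Spam". The `clamp` helper and its 0.01/0.99 bounds follow that essay and are not known BayesIt settings; the token scores are hypothetical.

```python
from math import prod


def combine(probs):
    """Graham-style combination: prod(p) / (prod(p) + prod(1 - p))."""
    spam = prod(probs)
    ham = prod(1.0 - p for p in probs)
    return spam / (spam + ham)


def clamp(p, low=0.01, high=0.99):
    """Keep a token score away from the absolute values 0 and 1."""
    return max(low, min(high, p))


spammy = [0.99] * 14         # hypothetical spam-heavy tokens
with_empty = spammy + [0.0]  # plus one empty token scored exactly 0

print(combine(with_empty))                      # 0.0: the single zero wipes out the product
print(combine([clamp(p) for p in with_empty]))  # very close to 1 once the zero is clamped
```

If the tokenizer really emits empty tokens with a score of 0, skipping them or clamping them as above would be two ways to keep them from overriding every other token.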

Does anybody else get empty tokens in their log files?

dZ.

-- 
Powered by The Bat! v.2.12.00,
  Hindered by MS Windows 2000 v.5.0 build 2195 Service Pack 4





Re: Help with BayesIt tuning

2004-08-15 Thread DZ-Jay
Some time around 08/15/2004 09:23:46, I think I heard MikeD (3) say:
 Hello Andre,

 Sunday, August 15, 2004, 6:44:17 AM, you wrote:

AW Have you deleted your spam and non-spam dictionary files when you
AW upgraded?

 Funny, that.  When I first upgraded I did not and it seemed to work
 fine ... until I rebooted.

Strange... rebooting shouldn't affect anything...

 After that, yes, I deleted all the dict files I could find.
 Apparently there were two sets, one from the old version and one set
 from the new.

I had to do the same thing when upgrading from v0.4gm to v0.5.4 because I was having 
problems.

 I then re-trained it on the accumulated spam and ham folders I have
 with about 2,000 messages each.  BTW, If I give Bayesit all 2,000
 messages at once to chew on, it would hang.  If I gave it in
 chunks it seemed to work OK shrug

Hum... after deleting the dict files, I trained normally with lots of spam/non-spam 
messages (I'm pretty sure it was more than 2,000) without a problem.  So I don't know 
what could have happened in your case (?)

I personally find BayesIt extremely powerful, accurate, and fast (I come from POPFile, 
with an accuracy of 99.6 % which required a LOT of manual tuning, had quite some false 
positives, and was VERY slow...), but what it misses it *really* misses (0%, as 
opposed to some mid-way value).

dZ.

-- 
Powered by The Bat! v.2.12.00,
  Hindered by MS Windows 2000 v.5.0 build 2195 Service Pack 4





Re: Help with BayesIt tuning

2004-08-15 Thread Alexander S. Kunz
Hello DZ-Jay,

15-Aug-2004 15:12, you wrote:

 I checked the BAYESIT.LOG file and realized that all messages are
 marked with either 100/99 % or 0% probability, which means that no matter
 how low I set the parameter, it will continue working the same.  I don't
 understand how come there is no gray area, with messages marked with a,
 say, 30% probability, etc.  I do not get any false positives at all, but
 I do get about 4%  of false negatives...

I just checked my POPfile bucket pages and found it very interesting that,
although spam is only 5.8% of my messages (lucky me, huh?), the distinct
word count for those spam messages is by far the highest (only messages
marked as genuine/english come close). I'd interpret that as spam is
*very* recognizable after a certain training period. That could explain
your results with BayesIt - maybe.

In practice, I had similar (odd) results with BayesIt. :-) ...part of the
reason that made me switch to POPfile...

-- 
Best regards,
 Alexander

Bradley's Bromide: If computers get too powerful, we can organize them into
a committee... that will do them in.





Re: Help with BayesIt tuning

2004-08-15 Thread DZ-Jay
Some time around 08/15/2004 10:20:47, I think I heard Alexander S. Kunz say:
 I just checked my POPfile bucket pages and found it very interesting that,
 although spam is only 5.8% of my messages (lucky me, huh?), the distinct
 word count for those spam messages is by far the highest (only messages
 marked as genuine/english come close). I'd interpret that as spam is
 *very* recognizable after a certain training period. That could explain
 your results with BayesIt - maybe.


Yes, I agree that that could be the reason.  However, the messages that are missed 
(roughly 4% of total spam traffic) are marked with a 0%, which would qualify them as 
unambiguously genuine (non-spam), but they obviously are not, as a lot of spam tokens 
are found in them.  This is why I think there might be a problem with the filter 
itself, or with my settings.

 In practice, I had similar (odd) results with BayesIt. :-) ...part of the
 reason that made me switch to POPfile...

Funny, I went the other way... POPfile was very reliable for me (99.6%) but required 
constant manual hacking of the corpus to maintain this accuracy, plus with a 
sufficiently large corpus, it was really slow (took almost a couple of seconds to 
download each message, even very small ones), which with a dial-up connection and 
hundreds of messages a day is almost unbearable.

Plus, there was no way to offer some extra weight to non-spam messages (like with 
regarding threshold in BayesIt), which almost completely eradicates false 
positives.  With POPfile I had to scan my spam box once in a while in order to make 
sure.  With BayesIt, after doing so for a few months without even a single false 
positive, I concluded that it was not necessary anymore to scan the spam folder often. 
 I like that :)
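
How BayesIt applies its regarding threshold is not explained anywhere in this thread. For comparison only, here is a sketch of a different, documented way to favor non-spam: in "A Plan for Spam", Paul Graham doubles a token's count in the good corpus before computing its spam probability. The function name and the `bias` parameter are illustrative, and the counts in the example are made up.

```python
def token_probability(good, bad, ngood, nbad, bias=2):
    """Spam probability of one token, following Graham's published formula.

    good, bad   -- occurrences of the token in the non-spam and spam corpora
    ngood, nbad -- number of messages in each corpus (both assumed non-zero)
    bias        -- multiplier on the good count (2 in Graham's essay)
    """
    g = bias * good
    b = bad
    if g + b < 5:
        return None  # too rare to be scored
    spam_ratio = min(1.0, b / nbad)
    ham_ratio = min(1.0, g / ngood)
    p = spam_ratio / (ham_ratio + spam_ratio)
    return max(0.01, min(0.99, p))


# A token seen equally often in 2,000 spam and 2,000 non-spam messages:
print(token_probability(20, 20, 2000, 2000, bias=1))  # 0.5 without the bias
print(token_probability(20, 20, 2000, 2000))          # about 0.33 with the bias
```

With the bias, a token has to appear noticeably more often in spam than in good mail before it counts as evidence of spam, which is one way false positives get pushed down.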

dZ.

-- 
Powered by The Bat! v.2.12.00,
  Hindered by MS Windows 2000 v.5.0 build 2195 Service Pack 4





Re[2]: Help with BayesIt tuning

2004-08-15 Thread Pete Holsberg
Sunday, August 15, 2004, 7:44:17 AM, you wrote:

AW Hello MikeD,

AW On 14 Aug 2004 at 14:47:24 -0500 GMT [21:47 CEST] you wrote:

AW Have you deleted your spam and non-spam dictionary files when you
AW upgraded?

What are their names and where are they?


-- 





Re: Help with BayesIt tuning

2004-08-15 Thread DZ-Jay
Some time around 08/15/2004 10:52:14, I think I heard Pete Holsberg say:
 Sunday, August 15, 2004, 7:44:17 AM, you wrote:

AW Hello MikeD,

AW On 14 Aug 2004 at 14:47:24 -0500 GMT [21:47 CEST] you wrote:

AW Have you deleted your spam and non-spam dictionary files when you
AW upgraded?

 What are their names and where are they?

Their names are spamdict.* and nspamdict.* and they are located in a directory called 
base within the BayesIt working directory, which is normally either:

<TB! installation dir>\BayesIt\base
or
<TB! installation dir>\MAIL\BayesIt\base

dZ.



-- 
Powered by The Bat! v.2.12.00,
  Hindered by MS Windows 2000 v.5.0 build 2195 Service Pack 4





Re: Help with BayesIt tuning

2004-08-15 Thread DZ-Jay
Some time around 08/15/2004 09:24:56, I think I heard MikeD (3) say:
DJ I was too.  I just upgraded yesterday to 0.5.9 and I haven't
DJ noticed a difference.  It does provide a white/black list, which I
DJ don't care to use because it defeats the purpose of a Bayesian
DJ filter (there's huge discussion -- more like religious wars --
DJ about this on the POPFile list hehe).  Also, the kludges.txt file
DJ doesn't seem to be implemented either (ignore list for headers).

 That's too bad sigh

I just learned (by re-reading a babelfished translation of the Russian BayesIt page) 
that the kludges file (whitelist of kludges) does seem to work, except I 
misunderstood it.  I thought it worked like POPFile's ignore list, which ignores the 
specified tokens when computing the probability of a message.  But it is not a list of 
just tokens, it is a list of header names that will be ignored, for example, if you 
put in the list:

message-id
x-mailer
subject

It will ignore the values of headers that start with those strings.  This is very 
useful, though.
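
The behavior described above can be sketched in Python as a filter that drops any header whose name starts with one of the listed strings before the message is tokenized. This is only an illustration of the idea, not BayesIt's code; the function name is made up, and the case-insensitive comparison is an assumption.

```python
def drop_ignored_headers(headers, kludges):
    """Remove headers whose name starts with any entry of the kludges list.

    headers -- list of (name, value) pairs
    kludges -- list of header-name prefixes, one per line of the kludges file
    """
    prefixes = tuple(k.strip().lower() for k in kludges if k.strip())
    return [(name, value) for name, value in headers
            if not name.lower().startswith(prefixes)]


headers = [
    ("Message-ID", "<abc@example.com>"),
    ("X-Mailer", "The Bat!"),
    ("Subject", "Hello"),
    ("From", "someone@example.com"),
]
kept = drop_ignored_headers(headers, ["message-id", "x-mailer", "subject"])
print(kept)  # only the From header is left for the tokenizer
```

Because the match is on the start of the header name, an entry such as x-spam would cover x-spam-level, x-spam-report and x-spam-status at once.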

I wonder, is the ignore list in the black/white list rules window what I confused 
the kludges list for? i.e. is it akin to the POPFile ignore list?  Anybody know?

dZ.

-- 
Powered by The Bat! v.2.12.00,
  Hindered by MS Windows 2000 v.5.0 build 2195 Service Pack 4





Re[2]: Help with BayesIt tuning

2004-08-15 Thread Stuart Cuddy
Hello DZ-Jay,
Sunday, August 15, 2004, 9:25:14 AM, you wrote:

DJ However, I recently noticed why some obviously spam messages
DJ are given a probability of 0%:  Apparently the analysis engine is
DJ regarding a few empty tokens with a value of 0%, which
DJ unspamifies the final value, for example, in my log file, I get
DJ this in some messages:

I am not seeing the empty tokens, but the following message is being
   received without being caught. I sent it again to myself about 5 or
   6 times and marked it as junk each time. The values do not seem to
   change at all.

   What is Graham?
   What is Spam-grade?
   

[EMAIL PROTECTED]
Graham:  7.59688e-029
Spam-grade:  7.59688e-029
Value for The Bat!: 0
: ---
biz:  0.01
--:  0.0212766
size:  0.01
Advance:  0.01
H this:  0.058463
partners:  0.01
Today:  0.01
H PLease:  0.01
H de:  0.0359281
Career:  0.01
text:  0.01
experience:  0.0133407
aol:  0.01
Verdana:  0.01
past:  0.01
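
As a cross-check on the numbers above: feeding the 15 token values from this log entry into the combining formula from Paul Graham's "A Plan for Spam" gives a result that lines up with the logged Graham and Spam-grade figure. That suggests, but does not prove, that this is the formula behind the log; the Python below is only a sketch for reproducing the arithmetic.

```python
from math import prod

# The 15 token values copied from the log entry above.
token_scores = [
    0.01, 0.0212766, 0.01, 0.01, 0.058463,
    0.01, 0.01, 0.01, 0.0359281, 0.01,
    0.01, 0.0133407, 0.01, 0.01, 0.01,
]


def combine(probs):
    """Graham-style combination: prod(p) / (prod(p) + prod(1 - p))."""
    spam = prod(probs)
    ham = prod(1.0 - p for p in probs)
    return spam / (spam + ham)


result = combine(token_scores)
print(f"{result:.3e}")  # close to the logged 7.59688e-029
```

With every listed token at or near 0.01, marking the same message as junk a few more times would have to move several of those token values substantially before the combined figure changed in any visible way.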

-- 
 Stuart mailto:[EMAIL PROTECTED]
Using The Bat! v2.13 Lucky Beta/5 on Windows 98 4.10 Build   A 





Re: Help with BayesIt tuning

2004-08-15 Thread DZ-Jay
Some time around 08/15/2004 11:13:49, I think I heard Stuart Cuddy say:

What is Graham?
What is Spam-grade?

AFAIK, spam-grade would be the probability of it being spam, and Graham, I suppose, 
means the probability of it being not-spam (I suppose, non-spam-grade  ham-grade  
graham ?)

But in my log I see exactly what you see in yours: that the graham and spam-grade 
values are identical in every case.  This keeps getting fishier and fishier...

dZ.

-- 
Powered by The Bat! v.2.12.00,
  Hindered by MS Windows 2000 v.5.0 build 2195 Service Pack 4





Re: Help with BayesIt tuning

2004-08-15 Thread DZ-Jay
Some time around 08/15/2004 11:13:49, I think I heard Stuart Cuddy say:
 I am not seeing the empty tokens, but the following message is being
received without being caught. I sent it again to myself about 5 or
6 times and marked it as junk each time. The values do not seem to
change at all.

Maybe this is because of your value in the recalculating strategy parameter, which 
governs how often automatic retraining is done.  Try lowering this value and 
re-marking the message as spam and see if the values change.

dZ.

-- 
Powered by The Bat! v.2.12.00,
  Hindered by MS Windows 2000 v.5.0 build 2195 Service Pack 4





Re[2]: Help with BayesIt tuning

2004-08-15 Thread Pete Holsberg
Sunday, August 15, 2004, 11:11:00 AM, you wrote:

DJ Some time around 08/15/2004 10:52:14, I think I heard Pete Holsberg say:
 Sunday, August 15, 2004, 7:44:17 AM, you wrote:

AW Hello MikeD,

AW On 14 Aug 2004 at 14:47:24 -0500 GMT [21:47 CEST] you wrote:

AW Have you deleted your spam and non-spam dictionary files when you
AW upgraded?

 What are their names and where are they?

DJ Their names are spamdict.* and nspamdict.* and they are located in a directory
DJ called base within the BayesIt working directory, which is normally either:

DJ TB! installation dir\BayesIt\base
DJ or
DJ TB! installation dir\MAIL\BayesIt\base


??? Mine are in C:\Documents and Settings\pjh\Application Data\BayesIt\base

TB is in C:\Program Files\The Bat!\thebat.exe and BayesIt is in C:\Program 
Files\BayesIt
under Windows 2000.

Is this significant?

-- 





Re: Help with BayesIt tuning

2004-08-15 Thread DZ-Jay
Some time around 08/15/2004 11:57:15, I think I heard Pete Holsberg say:
 Sunday, August 15, 2004, 11:11:00 AM, you wrote:

DJ Some time around 08/15/2004 10:52:14, I think I heard Pete Holsberg say:

DJ TB! installation dir\BayesIt\base
DJ or
DJ TB! installation dir\MAIL\BayesIt\base


 ??? Mine are in C:\Documents and Settings\pjh\Application Data\BayesIt\base

 TB is in C:\Program Files\The Bat!\thebat.exe and BayesIt is in C:\Program 
 Files\BayesIt
 under Windows 2000.

Well, I guess those are the default installation paths:  The application in the 
Program Files directory and the BayesIt files in your profile directory.  Since I have 
TB! installed in a non-standard directory (i.e. outside the Program Files directory), 
BayesIt was installed within that directory.  I guess then I should have said:

<TB! installation dir>\BayesIt\base
or
<User profile dir>\BayesIt\base

Sorry about that.  I guess that since I don't use the default installation paths I 
don't know where things normally fall.

In any case, the dict files fall within the BayesIt working directory, which is 
specified in the BayesIt options window.

 Is this significant?

Not at all.

dZ.

-- 
Powered by The Bat! v.2.12.00,
  Hindered by MS Windows 2000 v.5.0 build 2195 Service Pack 4





Re: Help with BayesIt tuning

2004-08-15 Thread George Mitchell
DZ-Jay wrote:

DJ Some time around 08/15/2004 11:13:49, I think I heard Stuart Cuddy say:

What is Graham?
What is Spam-grade?

DJ AFAIK, spam-grade would be the probability of it being spam, and
DJ Graham, I suppose, means the probability of it being not-spam (I
DJ suppose, non-spam-grade -> ham-grade -> graham?)

It might be coincidence, but Paul Graham has written much about
Bayesian filtering.  I'd guess it has something to do with his
methodology.  Even if I'm wrong, there's some interesting reading at:

http://www.paulgraham.com/antispam.html

-- 
George

Using The Bat! 2.12.00 on Windows XP Pro 5.1, Build 2600, Service Pack 1.





Re: Help with BayesIt tuning

2004-08-15 Thread DZ-Jay
Some time around 08/15/2004 12:35:29, I think I heard George Mitchell say:
 It might be coincidence, but Paul Graham has written much about
 Bayesian filtering.  I'd guess it has something to do with his
 methodology.  Even if I'm wrong, there's some interesting reading at:

 http://www.paulgraham.com/antispam.html

Thanx for the info... that would make more sense, although how come the spam-grade and 
graham values coincide in all messages without exception?  I guess I'll ask Alexey 
about it.  In the meantime, I'll check out the link you sent :)

dZ.

-- 
Powered by The Bat! v.2.12.00,
  Hindered by MS Windows 2000 v.5.0 build 2195 Service Pack 4





Help with BayesIt tuning

2004-08-14 Thread DZ-Jay
Hello:

I've been running BayesIt for a while and it works beautifully.  My accuracy 
right now is at 96.75%, so I guess I shouldn't complain.  But out of a few hundred 
messages I get a day, it misses about 10 that look like obvious spam but were marked 
as not-spam.  I checked the BAYESIT.LOG file and found that almost all messages are 
valued by BayesIt at either 99, 100 or 0.  It's as if BayesIt thinks all messages are 
absolutely spam, or absolutely not-spam.  In a sense I think this is good, and it's 
because I started training it with a large collection of spam/not-spam messages.  But 
I cannot help but think that there should be more of a gray area for some messages... 
For example, the 10 messages that it misses daily are valued at 0.  I think there 
should be a way for me to tune the configurations in order to make it more accurate.  
On the other hand, I do not get ANY false positives, so that is a very good thing.
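
One possible reason for the all-or-nothing scores, sketched here as an assumption: if BayesIt combines per-token probabilities the way Paul Graham's published method does (the thread does not confirm this), then a handful of tokens that each lean only mildly one way is already enough to push the combined score to 0 or 100.  The function name combine and the sample values below are made up for illustration.

```python
from math import prod

def combine(probs):
    """Graham-style combination of per-token spam probabilities.

    Assumption: this is the formula from Paul Graham's essay, not
    necessarily what BayesIt does internally.
    """
    spam = prod(probs)
    ham = prod(1.0 - p for p in probs)
    return spam / (spam + ham)

# Fifteen tokens that each lean only mildly toward spam or toward ham.
mild_spam = [0.7] * 15
mild_ham = [0.3] * 15

print(round(100 * combine(mild_spam)))  # rounds to 100
print(round(100 * combine(mild_ham)))   # rounds to 0
```

Under that assumption the missing gray area would come from the combining step rather than from any one setting, so a missed spam scored at 0 would mean its most significant tokens all looked ham-like.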

This is what I have in my ADVANCED.INI:

working thread priority=2
onexit thread priority=3
selective download spam threshold=50
export selective download=1
simple digits spam marks=1
no spaces spam marks=1
limit size to hash=19
limit size to hash header=96
temporary dictionary=C:\DOCUME~1\dz\LOCALS~1\Temp
use expiration=0
age to expirate=90
learn from zero=0   ; I changed this one today, was 1
max size of log file=5242880
recalculating strategy=0.0002   ; I changed this one today, was 5
regarding threshold=1.5  ; I changed this today, was 1.8
use autotrain=1
use degeneration=1
number of exclamations=5

Any recommendations?  BTW, I do not understand very well the regarding threshold 
parameter, can someone explain it please?  I use BayesIt 0.5.5

Thanx! :)

-dZ.

-- 
Powered by The Bat! v.2.12.00 times BayesIt v.0.5.5
  Hindered by MS Windows 2000 v.5.0 build 2195 Service Pack 4





Re: Help with BayesIt tuning

2004-08-14 Thread Andre Wichartz
Hello DZ-Jay,

On 14 Aug 2004 at 09:28:34 -0400 GMT [15:28 CEST] you wrote:

DJ BTW, I do not understand very well the regarding threshold
DJ parameter, can someone explain it please?

From advanced.ini:

; this number shows, how much heavier non-spam tokens than spam. It
makes some kind of guard and keeps from false positives. Usual value
is 2, but you can also try others...

-- 
Cheers,
 Andre

Charlie was a Chemist, but Charlie is no more.
 What Charlie thought was H20 was H2SO4.  





Re: Help with BayesIt tuning

2004-08-14 Thread DZ-Jay
Some time around 08/14/2004 10:34:25, I think I heard Andre Wichartz say:
 Hello DZ-Jay,

 On 14 Aug 2004 at 09:28:34 -0400 GMT [15:28 CEST] you wrote:

DJ BTW, I do not understand very well the regarding threshold
DJ parameter, can someone explain it please?

 From advanced.ini:

 ; this number shows, how much heavier non-spam tokens than spam. It
 makes some kind of guard and keeps from false positives. Usual value
 is 2, but you can also try others...

Yes, I am aware of its definition, but what I don't understand is what would be the 
effect of changing it to, say, 1.2 from 1.5 (apart from the academic answer of making 
non-spam tokens a bit less heavy).  How does the plugin use this value?

Thanx
dZ.

-- 
Powered by The Bat! v.2.12.00,
  Hindered by MS Windows 2000 v.5.0 build 2195 Service Pack 4





Re: Help with BayesIt tuning

2004-08-14 Thread Andre Wichartz
Hello DZ-Jay,

On 14 Aug 2004 at 11:30:32 -0400 GMT [17:30 CEST] you wrote:

DJ Yes, I am aware of its definition, but what I don't understand
DJ is what would be the effect of changing it to, say, 1.2 from 1.5
DJ (apart from the academic answer of making non-spam tokens a bit less
DJ heavy).  How does the plugin use this value?

Assume a word occurs equally often in spam and non-spam mails. If you
set the value to 1 the word will get a spam probability of 0.5. If you
set it to a higher value the word will get something lower than 0.5.
Words in non-spam mails just count more and you can set just how much
more.

At least that's my take on it.
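
A minimal sketch of that reading, assuming the regarding threshold simply multiplies the non-spam count before the usual Graham-style ratio is taken.  The function name and its parameters are hypothetical, and the clamping that Graham's essay applies is left out; how BayesIt really uses the value is not stated in this thread.

```python
def token_spam_probability(spam_hits, ham_hits, n_spam, n_ham, ham_weight):
    """Spam probability of one token, with ham occurrences weighted up.

    ham_weight stands in for the 'regarding threshold' setting; that
    mapping is an assumption made for illustration.
    """
    spam_rate = spam_hits / n_spam
    ham_rate = ham_weight * ham_hits / n_ham
    return spam_rate / (spam_rate + ham_rate)

# A word seen equally often in 2000 spam and 2000 non-spam messages.
for weight in (1.0, 1.2, 1.5, 2.0):
    p = token_spam_probability(50, 50, 2000, 2000, weight)
    print(f"weight {weight}: {p:.3f}")
```

With these made-up counts the word scores 0.5 at a weight of 1, about 0.45 at 1.2, 0.4 at 1.5 and one third at 2, so going from 1.5 to 1.2 would nudge every token slightly toward spam.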

-- 
Cheers,
 Andre

1. If it's green or it wiggles, it's biology.
 2. If it stinks, it's chemistry.
 3. If it doesn't work, it's physics.  





Re[2]: Help with BayesIt tuning

2004-08-14 Thread Pete Holsberg
Saturday, August 14, 2004, 12:27:41 PM, you wrote:

AW Hello DZ-Jay,

AW On 14 Aug 2004 at 11:30:32 -0400 GMT [17:30 CEST] you wrote:

DJ Yes, I am aware of its definition, but what I don't understand
DJ is what would be the effect of changing it to, say, 1.2 from 1.5
DJ (apart from the academic answer of making non-spam tokens a bit less
DJ heavy).  How does the plugin use this value?

AW Assume a word occurs equally often in spam and non-spam mails. If you
AW set the value to 1 the word will get a spam probability of 0.5. If you
AW set it to a higher value the word will get something lower than 0.5.
AW Words in non-spam mails just count more and you can set just how much
AW more.

Where do you do the setting???



-- 





Re: Help with BayesIt tuning

2004-08-14 Thread DZ-Jay
Some time around 08/14/2004 12:34:07, I think I heard Pete Holsberg say:

 Where do you do the setting???

In a file called ADVANCED.INI in the BayesIt working directory, or in the TB! 
installation directory.

-- 
Powered by The Bat! v.2.12.00,
  Hindered by MS Windows 2000 v.5.0 build 2195 Service Pack 4





Re: Help with BayesIt tuning

2004-08-14 Thread DZ-Jay
Some time around 08/14/2004 12:27:41, I think I heard Andre Wichartz say:
 Assume a word occurs equally often in spam and non-spam mails. If you
 set the value to 1 the word will get a spam probability of 0.5. If you
 set it to a higher value the word will get something lower than 0.5.
 Words in non-spam mails just count more and you can set just how much
 more.

 At least that's my take on it.

That makes sense.  But do you know how the weight is calculated?  I assume it is 
the product of its initial probability and the regarding threshold value; is that 
true?  And is it only for tokens that have the same occurrence in spam and non-spam 
messages, or is the weight skewed by this threshold on all tokens to give them an 
extra non-spammy oomph in order to avoid false positives?

Thanx
dZ.

-- 
Powered by The Bat! v.2.12.00,
  Hindered by MS Windows 2000 v.5.0 build 2195 Service Pack 4





Re[2]: Help with BayesIt tuning

2004-08-14 Thread Pete Holsberg
Saturday, August 14, 2004, 2:37:03 PM, you wrote:

DJ Some time around 08/14/2004 12:34:07, I think I heard Pete Holsberg say:

 Where do you do the setting???

DJ In a file called ADVANCED.INI in the BayesIt working directory, or 
DJ in the TB! installation directory.

Not found anywhere on either HD!

Can it be created manually??

-- 





Re[2]: Help with BayesIt tuning

2004-08-14 Thread MikeD
Hello All,

I have been following this thread since I have been having some
problems too.  I was using the old version (0.4gm) until I upgraded to
the current version of TB.

The settings I used to use don't seem to work any more and I either
get everything filtered as junk or nothing is filtered as junk.  I
trained it with about 2000 spam and 2000 ham messages and still no
joy.  I have tried low threshold numbers and high without much
difference.

Is there a good getting started file somewhere that I have
just missed?

-- 
Best regards,
 MikeD  mailto:[EMAIL PROTECTED]
Using The Bat! v2.12.00 on Windows ME 4.90 Build  3000





Re: Help with BayesIt tuning

2004-08-14 Thread DZ-Jay
Some time around 08/14/2004 15:47:24, I think I heard MikeD say:
 The settings I used to use don't seem to work any more and I either
 get everything filtered as junk or nothing is filtered as junk.  I
 trained it with about 2000 spam and 2000 ham messages and still no
 joy.  I have tried low threshold numbers and high without much
 difference.

That's pretty much what I get:  messages are either COMPLETELY spam (99 or 100 % 
probability) or COMPLETELY not-spam (0% probability).  Although mine seems to catch 
most (~97%) of spam, out of a few hundred emails daily, so it's not that bad.  And 
that's with the default settings.  I'm trying to tune it to get it a bit higher in 
accuracy, if possible, but can't seem to get much help on this subject :(

dZ.

 Is there a good getting started file somewhere that I have
 just missed?




-- 
Powered by The Bat! v.2.12.00,
  Hindered by MS Windows 2000 v.5.0 build 2195 Service Pack 4





Re: Help with BayesIt tuning

2004-08-14 Thread DZ-Jay
Some time around 08/14/2004 14:54:38, I think I heard Pete Holsberg say:
 Saturday, August 14, 2004, 2:37:03 PM, you wrote:

DJ Some time around 08/14/2004 12:34:07, I think I heard Pete Holsberg say:

 Where do you do the setting???

DJ In a file called ADVANCED.INI in the BayesIt working directory, or 
DJ in the TB! installation directory.

 Not found anywhere on either HD!

 Can it be created manually??

Yes you can... but which version of BayesIt are you using?  Maybe you are using an 
older version...  Here's the default ADVANCED.INI file that came with BayesIt 0.5.9:

working thread priority = 2;
onexit thread priority = 3;
export selective download = 1;
selective download spam threshold = 10;
simple digits spam marks = 1;
no spaces spam marks = 1;
limit size to hash = 19;
limit size to hash header = 96;
temporary dictionary = c:\\temp;
use expiration = 0;
age to expirate = 100;
learn from zero = 1;
max size of log file = 131072;
recalculating strategy = 3;
regarding threshold = 1.5;
use autotrain = 1;
use degeneration = 1;
number of exclamations = 5;
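
To see at a glance which settings differ from these defaults, a small script along the following lines might help.  It is only a sketch: it assumes that a semicolon ends the value and that anything after it is a comment, which matches the two listings in this thread but is not documented behaviour, and the function name parse_advanced_ini is made up.  Only three of the settings are repeated here to keep it short.

```python
def parse_advanced_ini(text):
    """Parse 'key = value;' lines into a dict of strings.

    Assumption: ';' terminates the value and starts a comment, so a
    value that itself contains ';' would be cut short.
    """
    settings = {}
    for line in text.splitlines():
        line = line.split(";", 1)[0].strip()
        if "=" not in line:
            continue  # skip blank lines and comment-only lines
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings

defaults = parse_advanced_ini("""\
learn from zero = 1;
recalculating strategy = 3;
regarding threshold = 1.5;
""")

mine = parse_advanced_ini("""\
learn from zero=0   ; I changed this one today, was 1
recalculating strategy=0.0002   ; I changed this one today, was 5
regarding threshold=1.5  ; I changed this today, was 1.8
""")

# Map each differing key to (default value, current value).
changed = {k: (defaults.get(k), v) for k, v in mine.items() if defaults.get(k) != v}
print(changed)
```

Values are compared as plain strings, so 1.5 and 1.50 would be reported as different.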

dZ.

-- 
Powered by The Bat! v.2.12.00,
  Hindered by MS Windows 2000 v.5.0 build 2195 Service Pack 4





Re[2]: Help with BayesIt tuning

2004-08-14 Thread MikeD (3)
Hello DZ-Jay,

Saturday, August 14, 2004, 5:31:59 PM, you wrote:

DJ Some time around 08/14/2004 15:47:24, I think I heard MikeD say:
 The settings I used to use don't seem to work any more and I either
 get everything filtered as junk or nothing is filtered as junk.  I
 trained it with about 2000 spam and 2000 ham messages and still no
 joy.  I have tried low threshold numbers and high without much
 difference.

DJ That's pretty much what I get:  messages are either
DJ COMPLETELY spam (99 or 100 % probability) or COMPLETELY not-spam
DJ (0% probability).  Although mine seems to catch most (~97%) of
DJ spam, out of a few hundred emails daily, so it's not that bad.  And
DJ that's with the default settings.  I'm trying to tune it to get it
DJ a bit higher in accuracy, if possible, but can't seem to get much
DJ help on this subject :(

What settings are you using?  Under the old version (0.4gm) I had it
trained and was getting most spam caught, no false positives with a
Move message setting of 10.  Now I have gone down as low as 1 and as
high as 99 without success.

BTW, I am using the 0.5.5 version that came with 2.12.  Should I be
using the newer version that I saw mentioned?

-- 
Best regards,
 MikeD  mailto:[EMAIL PROTECTED]
Using The Bat! v2.12.00 on Windows ME 4.90 Build  3000
 





Re: Help with BayesIt tuning

2004-08-14 Thread Thomas Fernandez
Hello DZ-Jay,

On Sat, 14 Aug 2004 14:42:17 -0400 GMT (15/08/2004, 01:42 +0700 GMT),
DZ-Jay wrote:

 Assume a word occurs equally often in spam and non-spam mails. If you
 set the value to 1 the word will get a spam probability of 0.5. If you
 set it to a higher value the word will get something lower than 0.5.
 Words in non-spam mails just count more and you can set just how much
 more.

 At least that's my take on it.

DJ That makes sense. But do you know how the weight is calculated?

Look up a mathematician called Bayes. 18th century, IIRC.

DJ I assume it is the product of its initial probability and the
DJ regarding threshold value; is that true?

It's not that simple.

-- 

Cheers,
Thomas.

24 things you shouldn't say during sex: 8. You're almost as good as
my ex!

Message reply created with The Bat! 2.12.02
under Chinese Windows 98 4.10 Build  A 




