Re[2]: BayesIt filtering: Spam error high

2004-08-09 Thread Peter Kerekes

Hello Alexander,

Sunday, August 8, 2004, 4:36:01 AM, you wrote:

ASK> Hello Peter Kerekes,

ASK> 08-Aug-2004 00:21, you wrote:

 Need some help to improve on Spam filtering.

ASK> Forgive me the stupid question, but do you actually train the BayesIt
ASK> filter if it has a false negative? (right click on the mail, Special, Mark
ASK> as Junk).

Sure. I first trained it on approx. 600 junk mails, all addressed to me, and
have kept training it continuously since then.

 Can anyone suggest a change in any settings to improve filtration?

ASK> Have you trained the BayesIt filter by importing a large Spam database? I
ASK> found its accuracy dropped when I did that. I only used my Spam for
ASK> training BayesIt, and its accuracy was much higher.



-- 
Best regards,

Peter Kerekes (Toronto, Canada) 

When two people in a business always agree, one of them is unnecessary. 
  -- Anon

TB! v2.12.03 and BayesIt! 0.5.9 on Windows 2000 5.0.2195 Service Pack 4



Current version is 2.12.00 | 'Using TBUDL' information:
http://www.silverstones.com/thebat/TBUDLInfo.html


Re: BayesIt filtering: Spam error high

2004-08-08 Thread Alexander S. Kunz
Hello Peter Kerekes,

08-Aug-2004 00:21, you wrote:

 Need some help to improve on Spam filtering.

Forgive me the stupid question, but do you actually train the BayesIt
filter if it has a false negative? (right click on the mail, Special, Mark
as Junk).


 Can anyone suggest a change in any settings to improve filtration?

Have you trained the BayesIt filter by importing a large Spam database? I
found its accuracy dropped when I did that. I only used my Spam for
training BayesIt, and its accuracy was much higher.

-- 
Best regards,
 Alexander

No one listens until you make a mistake.





Re: BayesIt filtering: Spam error high

2004-08-08 Thread Wolffe

A> Have you trained the BayesIt filter by importing a large Spam database? I
A> found its accuracy dropped when I did that. I only used my Spam for
A> training BayesIt, and its accuracy was much higher.

If you've trained BayesIt once, can you untrain it and train it afresh?



-- 
Cheers Yall
\\'

Between saying and doing, many a pair of shoes is worn out.





Re: BayesIt filtering: Spam error high

2004-08-08 Thread DZ-Jay
Some time around 08/07/2004 18:21:10, I think I heard Peter Kerekes say:
!SNIP!
 I haven't much clue what the lines in advance.ini do and therefore do not
 want to experiment with it.

 Can anyone suggest a change in any settings to improve filtration?

 Advance.ini file:

 working thread priority=2
 onexit thread priority=3
 selective download spam threshold=10
 export selective download=1
 simple digits spam marks=1
 no spaces spam marks=1
 limit size to hash=19
 limit size to hash header=96
 temporary dictionary=c:\\temp
 use expiration=0
 age to expirate=100
 learn from zero=1
 max size of log file=131072
 recalculating strategy=3
 regarding threshold=1.5
 use autotrain=1
 use degeneration=1
 number of exclamations=5
!SNIP!

Hello:
I've been looking for help tuning my BayesIt installation for a while, but I 
can't seem to find much, even though I've searched the archives of this list and the 
web.  I used to use PopFile and became pretty proficient at tuning it, even editing 
the corpus by hand, but even though I find BayesIt much more competent (and accurate), 
I don't seem to understand many of its advanced features.

I've read from many that the new Advanced.ini file contains comments from the 
developer explaining the various options, but mine (v5.5) does not.  The only help 
available from the BayesIt site is outdated and refers to an updated version on the 
RitLabs page, but it's in Russian, which I cannot read.  So, with the help of the fish 
(the Babelfish, that is), an online Russian-English dictionary, and a bit of deductive 
reasoning, I translated it as best I could.  It helped me a bit, so I 
thought it might help others too.

Still, some explanations are a bit too technical, and they could use some 
finessing, so if anybody can help further, I (and others, I'm sure) will appreciate it 
immensely.  Technical or not, it's still more understandable to us non-Russian-speaking 
people.

;working thread priority (2)
;Determines the priority of the base retraining process.
;Retraining is carried out by the filter in the background mode
;and it is usually imperceptible to the user. By default, the
;value of this parameter (2) corresponds to the system parameter
;THREAD_PRIORITY_LOWEST.

;onexit thread priority (3)
;If the user clicks the exit button in The Bat! while retraining
;is in progress, the retraining process acquires the indicated
;priority, which is usually higher than the working priority. This
;lets the retraining operation reach, as soon as possible, a point
;where it can be interrupted safely without risk of losing
;important data. By default, the value of this parameter (3)
;corresponds to the system parameter THREAD_PRIORITY_NORMAL.

;export selective download (1)
;When defined, the filter will export the collection of trigger
;lines for the selective download filter. If the parameter is set
;to 1, then the filter will create the file selective.txt in
;the working folder, which will contain the constantly updated
;list of regular expressions encountered in the headers of
;spam-messages. If this parameter is set to 0, then no lists of
;lines will be exported.

;selective download spam threshold (10)
;Determines with what frequency any one token must appear in the
;headers of spam messages in order for it to be included in the
;file selective.txt (see the previous parameter). It is
;recommended that this number is computed so that the size of the
;file selective.txt would not exceed 40Kb-50Kb. With larger lists
;of trigger lines, The Bat! becomes unstable. Words are selected
;into the file selective.txt based on the following criteria: the
;word must exist in the headers of the message and must never be
;encountered in the headers of non-spam messages, and has been
;encountered n number of times in the headers of spam messages;
;where n corresponds to the discussed parameter.
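
The selection rule described above can be sketched in Python. This is only an illustration of the stated criteria, not BayesIt's actual code: the function name, the counter inputs, and the reading of "n number of times" as "at least n times" are all assumptions.

```python
from collections import Counter

def select_trigger_tokens(spam_header_counts, ham_header_counts, threshold=10):
    """Pick header tokens that would qualify for selective.txt.

    A token qualifies if it was seen at least `threshold` times in spam
    headers (assumed meaning of "n times") and never in non-spam headers.
    """
    return sorted(
        token
        for token, count in spam_header_counts.items()
        if count >= threshold and ham_header_counts.get(token, 0) == 0
    )

# Made-up counts, purely for illustration.
spam = Counter({"X-Mailer-Bulk": 14, "viagra": 25, "Re": 40, "cheap": 3})
ham = Counter({"Re": 120})

print(select_trigger_tokens(spam, ham))  # ['X-Mailer-Bulk', 'viagra']
```

Raising the threshold shrinks the list, which is the lever the text suggests for keeping selective.txt under 40Kb-50Kb.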

;simple digits spam marks (1)
;Allows html-comments in the messages of the form <!--2345-->
;(i.e. consisting only of digits) to be treated as special
;generalized technical tokens. Since such comments are encountered
;almost exclusively in spam messages, this special token can
;substantially help during the analysis of some messages.

;no spaces spam marks (1)
;Is analogous to the previous parameter; however, it treats as
;special tokens not only numerical comments, but any comment
;which does not contain whitespace.
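
To make the two "spam marks" options more concrete, here is a rough Python sketch of how HTML comments might be collapsed into generalized tokens. The placeholder token names and the function are hypothetical; the documentation above does not say how BayesIt names or stores these tokens internally.

```python
import re

# Hypothetical placeholder names, not BayesIt's real internal tokens.
DIGITS_TOKEN = "<HTML_COMMENT_DIGITS>"
NOSPACE_TOKEN = "<HTML_COMMENT_NOSPACE>"

COMMENT_RE = re.compile(r"<!--(.*?)-->", re.DOTALL)

def generalize_comments(html, simple_digits=True, no_spaces=True):
    """Replace digit-only or whitespace-free HTML comments with one token each."""
    def replace(match):
        body = match.group(1)
        if simple_digits and body.isdigit():
            return f" {DIGITS_TOKEN} "
        if no_spaces and body and not any(ch.isspace() for ch in body):
            return f" {NOSPACE_TOKEN} "
        return match.group(0)  # ordinary comment: leave untouched
    return COMMENT_RE.sub(replace, html)

print(generalize_comments("vi<!--2345-->ag<!--xq9z-->ra <!-- real note -->"))
```

The point of the generalization is that thousands of different random comments all count as the same token, so the filter can learn from them.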

;limit size to hash (19)
;Allows you to assign a maximum length to the words which will be
;stored in the base unchanged. If any word exceeds the assigned
;length (for example, a pgp-signature), then it will be
;automatically encoded into a hash and saved in the base in that
;form instead of its original form.
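
A minimal sketch of the length limit, under the reading that over-long words are stored as a hash rather than verbatim. The choice of MD5, the "#" prefix, and the 16-character truncation are assumptions made only for illustration; the text does not say which hash BayesIt uses.

```python
import hashlib

def normalize_token(word, limit=19):
    """Return the word unchanged if it fits the limit, else a short hash of it."""
    if len(word) <= limit:
        return word
    # Assumed scheme: MD5 hex digest, truncated and prefixed so that
    # hashed entries are easy to tell apart from ordinary words.
    digest = hashlib.md5(word.encode("utf-8")).hexdigest()
    return "#" + digest[:16]

print(normalize_token("mortgage"))        # short word, stored as-is
print(normalize_token("iQA/AwUBQRbT" * 4))  # long blob, stored as a hash
```

The same long word always maps to the same hash, so the dictionary can still count it without storing the full string.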

;limit size to hash header (96)
;Assigns a similar length to the tokens from 

BayesIt filtering: Spam error high

2004-08-07 Thread Peter Kerekes
Need some help to improve on Spam filtering.

My ISP has a spam filter. With its default setting I lost too much
mail as spam, so I changed it; now 10% of my mail is spam, which is OK since
I use BayesIt as an internal filter.

The problem is that BayesIt designates too many spam mails as non-spam (Spam
error is 20%). I cannot remember any mail that BayesIt designated as spam
in error.

Can I change some setting in the advance.ini file or somewhere else to
improve the filtering of Spam mail?

My advance.ini file and plug-in information (abbreviated) are below.

I haven't much clue what the lines in advance.ini do and therefore do not
want to experiment with it.

Can anyone suggest a change in any settings to improve filtration?


Advance.ini file:

 working thread priority=2
 onexit thread priority=3
 selective download spam threshold=10
 export selective download=1
 simple digits spam marks=1
 no spaces spam marks=1
 limit size to hash=19
 limit size to hash header=96
 temporary dictionary=c:\\temp
 use expiration=0
 age to expirate=100
 learn from zero=1
 max size of log file=131072
 recalculating strategy=3
 regarding threshold=1.5
 use autotrain=1
 use degeneration=1
 number of exclamations=5
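
For anyone who wants to inspect these settings from a script, here is a small Python sketch that reads lines of this shape into a dictionary. It assumes the file is nothing more than `key=value` lines, possibly with `;` comment lines, and no section headers, which is all that is visible in the listing above; the function name is made up.

```python
def parse_advance_ini(text):
    """Parse 'key=value' lines; skip blank lines and ';' comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(";"):
            continue
        key, sep, value = line.partition("=")
        if sep:  # ignore lines without '='
            settings[key.strip()] = value.strip()
    return settings

sample = """
 working thread priority=2
 selective download spam threshold=10
 regarding threshold=1.5
 use autotrain=1
"""

settings = parse_advance_ini(sample)
print(settings["regarding threshold"])  # 1.5
```

Values are kept as strings, since the listing mixes integers, decimals, and paths.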

 Plug-in Information:

 Antispam filtering data:
 
 Spam frequency dictionary:
  Size: 907 letters.
  Capacity: 70713 words.
 Non-spam frequency dictionary:
  Size: 5397 letters.
  Capacity: 127363 words.
 Current active base:
  Active current base contains 32992 words.
  Status: OK
 
 Last month statistic
 General numbers
  Spam traffic (bytes): 2583757
  Spam letters: 337
  NON-spam traffic (bytes): 17897140
  NON-spam letters: 3028
  Total traffic (bytes): 20480897
  Total letters: 3365
  Part of the spam in terms of letters: 10.01%.
  
 Errors
  SPAM errors (letters): 24.04%.
  SPAM errors (traffic): 20.09%.
  NON-spam errors (letters): 0%.
  NON-spam errors (traffic): 0%.
  Totally errors (letters): 2.41%.
  Totally errors (traffic): 2.53%.
  
 Current running version is 0.5.9
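
As a quick sanity check on the figures above, the percentages can be tied together with a few lines of Python. This assumes the "SPAM errors (letters)" figure is the share of spam letters that were passed as non-spam, an interpretation that is consistent with the reported total error rate.

```python
# Figures copied from the "Last month statistic" block.
spam_letters = 337
total_letters = 3365
spam_error_rate = 0.2404  # "SPAM errors (letters): 24.04%"

# Spam letters that were let through as non-spam (false negatives).
missed_spam = round(spam_letters * spam_error_rate)

# NON-spam errors are reported as 0%, so missed spam is the only error source.
total_error_rate = missed_spam / total_letters

print(missed_spam)                 # 81
print(f"{total_error_rate:.2%}")   # 2.41%
```

So roughly 81 of 337 spam letters got through in the month, which matches the 2.41% total error reported by the plug-in.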

-- 
Best regards,
Peter Kerekes (Toronto, Canada) 

If you are never scared, embarrassed or hurt, it means you never take chances. 
  -- Julia Soul   

TB! v2.12.03 and BayesIt! 0.5.9 on Windows 2000 5.0.2195 Service Pack 4


