Re: [Razor-users] Spamassassin's Razor scores

2008-02-07 Thread Jim Hermann - UUN Hostmaster
> -----Original Message-----
> On Tuesday, 5. February 2008 18:15:24 you wrote:
> > I used to have higher SA scores for 95-100% spam confidence.
> >
> > However, I found that I could not increase the score very much.
> > Occasionally, I would get a false positive for a blank email: no text, just
> > a few HTML tags and attachments.  The Razor database regularly contains
> > data indicating that a blank email is 100% known spam.  There was no way
> > to prevent the false positive because the whitelist feature for hash values
> > was removed.  I also tried combining scores for messages with a small
> > amount of text and positive Razor hits, but that let too much spam through.
>
> Hmm, that would be a little show stopper.
>
> What did the other tests of SpamAssassin report for such mails?
> I can imagine they report it as spam, too.
>
> Thomas

Here is an example of the SpamAssassin report for a blank email with a Word
attachment:

 pts rule name                 description
---- ------------------------- --------------------------------------------------
-0.0 SPF_HELO_PASS             SPF: HELO matches SPF record
-0.0 SPF_PASS                  SPF: sender matches SPF record
 0.1 HTML_90_100               BODY: Message is 90% to 100% HTML
-2.6 BAYES_00                  BODY: Bayesian spam probability is 0 to 1%
                               [cf: 100]
 6.1 RAZOR2_CF_RANGE_91_100    Razor2 gives confidence between 91 and 100
                               [cf: 100]
 1.5 RAZOR2_CF_RANGE_E4_51_100 Razor2 gives engine 4 confidence level above 50%
                               [cf: 100]
 1.5 RAZOR2_CHECK              Listed in Razor2 (http://razor.sf.net/)
-1.5 RAZOR2_IGNORE             Message in RAZOR2 and has very little text
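To make the effect of the report above concrete, here is a minimal Python sketch that adds up the listed points. The threshold of 5.0 is an assumption (it is SpamAssassin's stock required_score; the value used on this server is not stated in the thread), and the helper names are made up for illustration.

```python
from decimal import Decimal

# Points copied from the report above, one entry per rule that fired.
report = {
    "SPF_HELO_PASS": "-0.0",
    "SPF_PASS": "-0.0",
    "HTML_90_100": "0.1",
    "BAYES_00": "-2.6",
    "RAZOR2_CF_RANGE_91_100": "6.1",
    "RAZOR2_CF_RANGE_E4_51_100": "1.5",
    "RAZOR2_CHECK": "1.5",
    "RAZOR2_IGNORE": "-1.5",
}

# Assumption: the stock SpamAssassin threshold (required_score 5.0);
# the threshold actually configured on the server is not given in the thread.
REQUIRED_SCORE = Decimal("5.0")


def total_score(points):
    """Sum the per-rule points exactly (Decimal avoids float rounding)."""
    return sum((Decimal(p) for p in points.values()), Decimal("0"))


def razor_points(points):
    """Sum only the points contributed by rules named RAZOR2_*."""
    return sum(
        (Decimal(p) for name, p in points.items() if name.startswith("RAZOR2_")),
        Decimal("0"),
    )


total = total_score(report)
print(f"total={total} razor={razor_points(report)} spam={total >= REQUIRED_SCORE}")
```

Under that assumed threshold the message lands just over the line at 5.1 points, and the Razor rules alone contribute 7.6 of them even after the -1.5 from RAZOR2_IGNORE, which is why a strong BAYES_00 result is not enough to rescue it.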

RAZOR2_CF_RANGE_91_100 and RAZOR2_IGNORE were my custom Razor rules.  I
could not get RAZOR2_IGNORE to recognize consistently when to ignore the
RAZOR2_CF_RANGE_91_100 result.

meta     RAZOR2_IGNORE          RAZOR2_CHECK + HTML_90_100 > 1
describe RAZOR2_IGNORE          Message in RAZOR2 and has very little text
tflags   RAZOR2_IGNORE          net

meta     RAZOR2_IGNORE2         RAZOR2_CHECK + MIME_HTML_MOSTLY > 1
describe RAZOR2_IGNORE2         Message in RAZOR2 and has very little text2
tflags   RAZOR2_IGNORE2         net

full     RAZOR2_CF_RANGE_00_01  eval:check_razor2_range('','00','01')
tflags   RAZOR2_CF_RANGE_00_01  net
describe RAZOR2_CF_RANGE_00_01  Razor2 gives confidence between 00 and 01

full     RAZOR2_CF_RANGE_02_10  eval:check_razor2_range('','02','10')
tflags   RAZOR2_CF_RANGE_02_10  net
describe RAZOR2_CF_RANGE_02_10  Razor2 gives confidence between 02 and 10

full     RAZOR2_CF_RANGE_11_50  eval:check_razor2_range('','11','50')
tflags   RAZOR2_CF_RANGE_11_50  net
describe RAZOR2_CF_RANGE_11_50  Razor2 gives confidence between 11 and 50

full     RAZOR2_CF_RANGE_51_90  eval:check_razor2_range('','51','90')
tflags   RAZOR2_CF_RANGE_51_90  net
describe RAZOR2_CF_RANGE_51_90  Razor2 gives confidence between 51 and 90

full     RAZOR2_CF_RANGE_91_100 eval:check_razor2_range('','91','100')
tflags   RAZOR2_CF_RANGE_91_100 net
describe RAZOR2_CF_RANGE_91_100 Razor2 gives confidence between 91 and 100

Jim


-
This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2008.
http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/
___
Razor-users mailing list
Razor-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/razor-users


Re: [Razor-users] Spamassassin's Razor scores

2008-02-07 Thread Jim Hermann - UUN Hostmaster
 

> -----Original Message-----
> On Thu, Feb 07, 2008 at 01:51:47PM -0600, Jim Hermann - UUN Hostmaster wrote:
> > The RAZOR2_CF_RANGE_91_100 and RAZOR2_IGNORE were my custom RAZOR rules.  I
> > could not get RAZOR2_IGNORE consistently to recognize when to ignore the
> > RAZOR2_CF_RANGE_91_100 results.
> >
> > meta RAZOR2_IGNORE2  RAZOR2_CHECK + MIME_HTML_MOSTLY > 1
>
> a) Eww.  RAZOR2_CHECK && MIME_HTML_MOSTLY
> b) MIME_HTML_MOSTLY probably doesn't help you with non-HTML mails (I'd have to
>    look at the rule to figure out what it does)
> c) this feels like something razor-whitelist could help with, at least if it's
>    a consistent checksum.

Vipul told me that razor-whitelist does not support checksums anymore.

Even if it did, the blank checksum changes with each version of MS Outlook
and each default FONT.

MIME_HTML_MOSTLY works some of the time because the text length is zero or 1.
Here is the content of the blank message from MS Outlook:

end of headers

This is a multi-part message in MIME format.

--=_NextPart_000_0001_01C69F5D.B03731E0
Content-Type: multipart/alternative;
boundary==_NextPart_001_0002_01C69F5D.B03731E0


--=_NextPart_001_0002_01C69F5D.B03731E0
Content-Type: text/plain;
charset=us-ascii
Content-Transfer-Encoding: 7bit


--=_NextPart_001_0002_01C69F5D.B03731E0
Content-Type: text/html;
charset=us-ascii
Content-Transfer-Encoding: quoted-printable

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML><HEAD>
<META HTTP-EQUIV=3D"Content-Type" CONTENT=3D"text/html; charset=3Dus-ascii">
<TITLE>Message</TITLE>

<META content=3D"MSHTML 6.00.2900.2912" name=3DGENERATOR></HEAD>
<BODY><FONT face=3D"Maiandra GD"></FONT></BODY></HTML>

--=_NextPart_001_0002_01C69F5D.B03731E0--

--=_NextPart_000_0001_01C69F5D.B03731E0
Content-Type: application/msword;
[snip]

Jim




Re: [Razor-users] Spamassassin's Razor scores

2008-02-07 Thread Thomas Jarosch
Hello Jim,

On Tuesday, 5. February 2008 18:15:24 you wrote:
> I used to have higher SA scores for 95-100% spam confidence.
>
> However, I found that I could not increase the score very much.
> Occasionally, I would get a false positive for a blank email: no text, just
> a few HTML tags and attachments.  The Razor database regularly contains
> data indicating that a blank email is 100% known spam.  There was no way
> to prevent the false positive because the whitelist feature for hash values
> was removed.  I also tried combining scores for messages with a small
> amount of text and positive Razor hits, but that let too much spam through.

Hmm, that would be a little show stopper.

What did the other tests of SpamAssassin report for such mails?
I can imagine they report it as spam, too.

Thomas



Re: [Razor-users] Spamassassin's Razor scores

2008-02-06 Thread Jim Hermann - UUN Hostmaster
-----Original Message-----
From: 'Thomas Jarosch'; ''
Subject: RE: [Razor-users] Spamassassin's Razor scores

> Hello together,
>
> I'm wondering if it would make sense to add additional rules for 95% to 100%
> spam confidence? Is anybody already using a setup like that? Any drawbacks?
>
> Cheers,
> Thomas

Thomas,

I used to have higher SA scores for 95-100% razor spam confidence.

However, I found that I could not increase the score very much.
Occasionally, I would get a false positive for a blank email: no text, just a
few HTML tags and attachments.  The Razor database regularly contains
data indicating that a blank email is 100% known spam.  There was no way
to prevent the false positive because the whitelist feature for hash values
was removed.  I also tried combining scores for messages with a small amount
of text and positive Razor hits, but that let too much spam through.

Jim




Re: [Razor-users] Spamassassin's Razor scores

2008-02-05 Thread Jim Hermann - UUN Hostmaster
-----Original Message-----
From: 'Thomas Jarosch'; ''
Subject: RE: [Razor-users] Spamassassin's Razor scores

> Hello together,
>
> I'm wondering if it would make sense to add additional rules for 95% to 100%
> spam confidence? Is anybody already using a setup like that? Any drawbacks?
>
> Cheers,
> Thomas

Thomas,

I used to have higher SA scores for 95-100% spam confidence.

However, I found that I could not increase the score very much.
Occasionally, I would get a false positive for a blank email: no text, just a
few HTML tags and attachments.  The Razor database regularly contains
data indicating that a blank email is 100% known spam.  There was no way
to prevent the false positive because the whitelist feature for hash values
was removed.  I also tried combining scores for messages with a small amount
of text and positive Razor hits, but that let too much spam through.

Jim




Re: [Razor-users] Spamassassin's Razor scores

2008-02-04 Thread Thomas Jarosch
Hello Matt,

On Thursday, 31. January 2008 17:18:35 you wrote:
> e4 total hits:  981
> e4 cf 100:      857
> e4 cf 90-99:     12
> e4 cf 80-89:      2
> e4 cf 70-79:     10
> e4 cf 60-69:     21
> e4 cf 50-59:     16
> e4 cf 40-49:      8
> e4 cf 30-39:     30
> e4 cf 20-29:     25
> e4 cf 10-19:      0
> e4 cf 0-9:        0
> ---
> e8 total hits: 1532
> e8 cf 100:     1334
> e8 cf 90-99:     22
> e8 cf 80-89:     16
> e8 cf 70-79:     29
> e8 cf 60-69:     23
> e8 cf 50-59:     33
> e8 cf 40-49:     38
> e8 cf 30-39:     24
> e8 cf 20-29:     13
> e8 cf 10-19:      0
> e8 cf 0-9:        0

Interesting results! A separate category for 100% could improve things, I
guess. Could you make another run with the spam data from the weekend?

We have a busy mail server here; if you send me your patch,
I'll try to gather some data, too.

Cheers,
Thomas



Re: [Razor-users] Spamassassin's Razor scores

2008-01-31 Thread Thomas Jarosch
On Wednesday, 30. January 2008 20:36:09 Theo Van Dinter wrote:
> On Wed, Jan 30, 2008 at 01:11:53PM -0500, Matt Kettler wrote:
> > You can try it if you like. The existing rules are the result of some
> > testing that was done several years ago. I think it was Theo that did
> > it..
>
> Yeah, I wrote the rules + code way back when...  I've been trying to find
> some stats for this stuff, but didn't come up with anything useful.
>
> My recollection was that w/ e8 the cf was either really low or really high,
> and we just took the 51_100 values from the older pre-e8 rules and made it
> all consistent.
>
> I don't recall e4 stats.

Maybe Vipul the great can provide some statistics on whether cf levels such as
80% or 90% actually occur, and whether it's worth expanding the SpamAssassin
rules. As Theo noted, there is probably more diversity for e4 than for e8, if
there is any at all.

Thomas



Re: [Razor-users] Spamassassin's Razor scores

2008-01-31 Thread Matt Kettler
Thomas Jarosch wrote:
> On Wednesday, 30. January 2008 20:36:09 Theo Van Dinter wrote:
> > My recollection was that w/ e8 the cf was either really low or really high,
> > and we just took the 51_100 values from the older pre-e8 rules and made it
> > all consistent.
> >
> > I don't recall e4 stats.
>
> Maybe Vipul the great can provide some statistics if there is such a thing
> like 80% or 90% cf level and if it's worth expanding the SpamAssassin rules.
> As Theo noted there is probably more diversity for e4 than for e8, if at all.


I'm currently seeing both e8 and e4 with 87% of their matches being cf=100,
which matches what I started to see yesterday. My samples are still pretty
small, but I can definitely see a trend.

Based on the numbers I'm seeing below, it *might* be valuable to split SA up
into three cf ranges, i.e. 0-50, 51-99, and 100. I'm not sure whether there are
more FPs in the 51-99 range that may be detracting from 100's performance, but
it seems sensible to me to let the 100 grouping stand by itself, since it is
such a large percentage of hits.

I wrote a quick little grep and wc -l shell script that greps through my
razor-agent.log so I can monitor it really quickly (note: ac is currently 21,
hence the 0's at the low end).

e4 total hits:  981
e4 cf 100:      857
e4 cf 90-99:     12
e4 cf 80-89:      2
e4 cf 70-79:     10
e4 cf 60-69:     21
e4 cf 50-59:     16
e4 cf 40-49:      8
e4 cf 30-39:     30
e4 cf 20-29:     25
e4 cf 10-19:      0
e4 cf 0-9:        0
---
e8 total hits: 1532
e8 cf 100:     1334
e8 cf 90-99:     22
e8 cf 80-89:     16
e8 cf 70-79:     29
e8 cf 60-69:     23
e8 cf 50-59:     33
e8 cf 40-49:     38
e8 cf 30-39:     24
e8 cf 20-29:     13
e8 cf 10-19:      0
e8 cf 0-9:        0
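
As a rough cross-check of the 87% figure and the proposed three-way split, the counts above can be regrouped with a few lines of Python. This is only a sketch: the function names are made up, and because the table uses decade buckets, the middle group here is 50-99 rather than exactly 51-99.

```python
# Bucket counts copied from the table above: label -> hits.
E4 = {"100": 857, "90-99": 12, "80-89": 2, "70-79": 10, "60-69": 21,
      "50-59": 16, "40-49": 8, "30-39": 30, "20-29": 25, "10-19": 0, "0-9": 0}
E8 = {"100": 1334, "90-99": 22, "80-89": 16, "70-79": 29, "60-69": 23,
      "50-59": 33, "40-49": 38, "30-39": 24, "20-29": 13, "10-19": 0, "0-9": 0}


def regroup(buckets):
    """Collapse decade buckets into three coarse ranges: 0-49, 50-99, 100."""
    groups = {"0-49": 0, "50-99": 0, "100": 0}
    for label, hits in buckets.items():
        if label == "100":
            groups["100"] += hits
        elif int(label.split("-")[0]) >= 50:
            groups["50-99"] += hits
        else:
            groups["0-49"] += hits
    return groups


def shares(groups):
    """Percentage of all hits in each group, rounded to one decimal place."""
    total = sum(groups.values())
    return {name: round(100 * hits / total, 1) for name, hits in groups.items()}


for engine, buckets in (("e4", E4), ("e8", E8)):
    groups = regroup(buckets)
    print(engine, groups, shares(groups))
```

The bucket counts add up to the stated totals (981 and 1532), and cf=100 accounts for roughly 87% of hits on both engines, leaving only a few percent in each of the other two groups.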





Re: [Razor-users] Spamassassin's Razor scores

2008-01-30 Thread Matt Kettler
Thomas Jarosch wrote:
> Hello together,
>
> I've been using Razor and SpamAssassin quite a while now
> using the standard rules distributed with SpamAssassin.
>
> SpamAssassin normally evaluates Razor's spam confidence level
> between 51% and 100%. This results in the following score:
>
> RAZOR2_CF_RANGE_51_100=0.5,
> RAZOR2_CF_RANGE_E4_51_100=1.5,
> RAZOR2_CF_RANGE_E8_51_100=1.5,
> RAZOR2_CHECK=0.5
>
> -> 4 points at maximum.
>
> I'm wondering if it would make sense to add additional rules for 95% to 100%
> spam confidence? Is anybody already using a setup like that? Any drawbacks?
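
The 4-point ceiling quoted above is just the sum of the four stock scores. A small Python sketch of that arithmetic follows; the `extra_rule_score` parameter stands in for a hypothetical additional rule (such as a 95-100% range), and the 1.0 used in the second call is an invented value, not one proposed anywhere in this thread.

```python
from decimal import Decimal

# Stock scores as quoted in the message above.
STOCK_RAZOR_SCORES = {
    "RAZOR2_CF_RANGE_51_100": Decimal("0.5"),
    "RAZOR2_CF_RANGE_E4_51_100": Decimal("1.5"),
    "RAZOR2_CF_RANGE_E8_51_100": Decimal("1.5"),
    "RAZOR2_CHECK": Decimal("0.5"),
}


def razor_ceiling(scores, extra_rule_score=Decimal("0")):
    """Maximum points Razor can add when every listed rule fires.

    extra_rule_score models a hypothetical additional rule; it is an
    illustration only and does not correspond to an existing rule.
    """
    return sum(scores.values(), Decimal("0")) + extra_rule_score


print(razor_ceiling(STOCK_RAZOR_SCORES))                  # all four stock rules
print(razor_ceiling(STOCK_RAZOR_SCORES, Decimal("1.0")))  # plus an invented 1.0-point rule
```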

You can try it if you like. The existing rules are the result of some testing 
that was done several years ago. I think it was Theo that did it..

Regardless, last time I looked I found that e8 tends to gravitate strongly
towards 100, if it's listed. There are some hits below 100, but they're
comparatively rare. This is probably due to really fast propagation of reports
for this signature type.

I just re-tweaked my Core.pm to make cf comparison logging a lower-level event 
so I can check if this has changed. So far (only a minute or two) I've gotten 5 
e8's and 1 e4, all cf=100.





Re: [Razor-users] Spamassassin's Razor scores

2008-01-30 Thread Theo Van Dinter
On Wed, Jan 30, 2008 at 01:11:53PM -0500, Matt Kettler wrote:
> You can try it if you like. The existing rules are the result of some testing
> that was done several years ago. I think it was Theo that did it..

Yeah, I wrote the rules + code way back when...  I've been trying to find some
stats for this stuff, but didn't come up with anything useful.

My recollection was that w/ e8 the cf was either really low or really high,
and we just took the 51_100 values from the older pre-e8 rules and made it all
consistent.

I don't recall e4 stats.

> I just re-tweaked my Core.pm to make cf comparison logging a lower-level
> event so I can check if this has changed. So far (only a minute or two) I've
> gotten 5 e8's and 1 e4, all cf=100.

Yeah, unfortunately we don't log the actual cf values anywhere by default, so
it's hard to run some stats w/out rerunning all messages and pounding on the
razor servers.

We have the NetCache plugin which was an initial attempt I was working on to
grab all network-related test results and store them in an X-Spam-* header for
later use via the --reuse option, but a) Razor2 is the only thing in there,
and b) no one enables it by default since nothing actually uses the resulting
header.

-- 
Randomly Selected Tagline:
... are you nuts?  Well, yeah, I got references ... - Prof. Michaelson




Re: [Razor-users] Spamassassin's Razor scores

2008-01-30 Thread Matt Kettler
Theo Van Dinter wrote:
> On Wed, Jan 30, 2008 at 01:11:53PM -0500, Matt Kettler wrote:
> > You can try it if you like. The existing rules are the result of some
> > testing that was done several years ago. I think it was Theo that did it..
>
> Yeah, I wrote the rules + code way back when...  I've been trying to find some
> stats for this stuff, but didn't come up with anything useful.
>
> My recollection was that w/ e8 the cf was either really low or really high,
> and we just took the 51_100 values from the older pre-e8 rules and made it all
> consistent.
>
> I don't recall e4 stats.
>
> > I just re-tweaked my Core.pm to make cf comparison logging a lower-level
> > event so I can check if this has changed. So far (only a minute or two) I've
> > gotten 5 e8's and 1 e4, all cf=100.
>
> Yeah, unfortunately we don't log the actual cf values anywhere by default, so
> it's hard to run some stats w/out rerunning all messages and pounding on the
> razor servers.

If you want some quick stats, I can post you a patch to Razor2's Core.pm that
enables logging of cf levels in your razor-agent.log without flooding you.
That's probably not useful for mass-checks, but can be useful for a little
grep-based statistics gathering.

By default, if you set debuglevel=6, it will log which engines matched, at
what cf level, and with what signature hash. However, there's a lot of other
junk that's completely uninteresting to anyone outside the razor team (e.g.
byte counts on a per-read/write basis come in at debuglevel=4).

The patch I made moves the byte counts up to level 5, and the engine match
lines down to level 4.

But some quick stats for the past hour and 15 mins:

e8 - 140 matches, 16 of which were less than cf 100 (11.4% of hits).

e4 - 92 matches, 12 of which were less than cf 100 (13.0% of hits).

Admittedly the sample is small, but you do get the idea. There is a pretty
strong gravitation towards 100, so splitting 51-100 into 51-95 and 95-100
wouldn't behave much differently from the single 51-100 range.
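
For anyone who wants to redo these percentages on their own counts, a minimal Python sketch follows. The function name is invented for illustration, and the pooled figure is simply both engines added together, which the thread itself does not report.

```python
def pct_below_100(total_matches, below_100):
    """Percentage of matches whose cf was below 100, to one decimal place."""
    return round(100 * below_100 / total_matches, 1)


# Counts from the past hour and 15 minutes, as given above.
e8 = pct_below_100(140, 16)
e4 = pct_below_100(92, 12)

# Both engines pooled together (not a figure quoted in the thread).
pooled = pct_below_100(140 + 92, 16 + 12)

print(f"e8 below 100: {e8}%  e4 below 100: {e4}%  pooled: {pooled}%")
```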


> We have the NetCache plugin which was an initial attempt I was working on to
> grab all network-related test results and store them in an X-Spam-* header for
> later use via the --reuse option, but a) Razor2 is the only thing in there,
> and b) no one enables it by default since nothing actually uses the resulting
> header.
