Re: Trying to understand how bayes works.

2015-12-12 Thread Axb

On 12/12/2015 02:07 AM, Reindl Harald wrote:



On 11.12.2015 at 20:58, Axb wrote:

I hate stale data... that's all


how can bayes data be stale?

a spam message is a spam message now, tomorrow and next year
the same especially for ham


over time...
header patterns change
url patterns change
html templates change
rcvd headers change

all this is what bayes uses when it learns from your mailflow.
What isn't seen/used within a cautious period of time can be defined as 
stale data and therefore safely expired.


or.. how many fresh spams have you seen lately using forged  Netscape 
Communicator headers which were detected by SA rules over a decade ago?

or why do you think SARE rules became useless? (even dangerous).
Why would anyone want to sit on such unused tokens?

While for YOU, Bayes seems to be mission critical, for most setups it's 
the extra bit to push a score to learning threshold.


There are also practical/financial reasons not to sit on old Bayes data 
but these should be pretty obvious.


I think this horse can be buried; let's instead celebrate that Perkel seems 
happy with his new Redis toy.  .-)


Axb


Re: Trying to understand how bayes works.

2015-12-12 Thread Reindl Harald



On 12.12.2015 at 17:13, RW wrote:

On Sat, 12 Dec 2015 13:29:40 +0100
Axb wrote:


On 12/12/2015 01:08 PM, Reindl Harald wrote:



I hate stale data... that's all


But you do keep stale data in the retained tokens, what you are getting
rid of is the contribution from old mails that's least likely to make a
difference to any classifications.  Expiry is about managing database
size; if it were about expiring stale information it would be
implemented differently.


correct


practical reasons?
it's a computer

performance... If I keep accessing X years of stale data my scanning
times go through the roof.


The time taken to look-up n tokens from a database containing m tokens
shouldn't strongly depend on m. There's something wrong if it does.


correct

a message has a fixed number of tokens which are queried against the 
database by its primary key - it doesn't matter if that database has 150 
thousand or 2 million tokens - proven by the automated mass-test passing 
every corpus message against spamd: there is no change in per-message 
performance, the run only takes longer because there are more messages to test


that's how databases work by design


financial reasons?
if you mean performance


no... money.. If I see 15 million msgs/day and keep the Bayes data
which those millions provided over a decade or more, I'd be in the TB
amount of data... I couldn't really justify requesting servers with
TBs RAM. Accounting would put me in the looney house.


The number of tokens depends on how many you train, not on how many you
scan


correct, and to say it clearly:

the need to train goes down when you don't lose data which reappears 
two months later - seasonal data and so on


i only need to train around 15-20 ham messages each week which are not 
BAYES_00 and there is not that much more spam below the milter-reject score


what i currently do is train milter-rejects which are not BAYES_99 by 
passing them through spamc in the feed-script and ignoring anything which 
already has BAYES_99/999 - most likely i could even stop that now after a 
year of training *because* it catches practically everything


so the real *need* for training has gone down to around 50 mails per week 
and it doesn't matter if it's 1000, 1 or 15 million msg/day, the 
number only increases with users, the 75000 samples are covering them 
all on a site-wide setup


what's the difference to a default setup:

* you need to invest time at the beginning
* it catches "new" campaigns from the first message on
  which are in fact not really new, spam topics are
  always the same over the years
* it doesn't get false-positives on seasonal ham because
  you did not lose the ham-tokens from the last season

summary: you can score bayes much higher without false-positives and so 
hit messages before the sending servers are on enough blacklists or 
URIBL hits them - finally: your overall system works much more accurately 
after you have paid the price of initial feeding






Re: Trying to understand how bayes works.

2015-12-12 Thread RW
On Sat, 12 Dec 2015 13:29:40 +0100
Axb wrote:

> On 12/12/2015 01:08 PM, Reindl Harald wrote:
  
> >> I hate stale data... that's all  

But you do keep stale data in the retained tokens, what you are getting
rid of is the contribution from old mails that's least likely to make a
difference to any classifications.  Expiry is about managing database
size; if it were about expiring stale information it would be
implemented differently.

> > practical reasons?
> > it's a computer  
> performance... If I keep accessing X years of stale data my scanning 
> times go to the roof.

The time taken to look-up n tokens from a database containing m tokens
shouldn't strongly depend on m. There's something wrong if it does. 
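
A minimal sketch of this point, assuming a hash-indexed token store (a Python dict stands in for the real backend; the names make_db and lookup_tokens and the sizes are hypothetical). The work done per message is one keyed lookup per token, so it follows n, not m:

```python
def make_db(m):
    # Hypothetical token store: token name -> (spam_count, ham_count).
    return {f"tok{i}": (i % 7, i % 5) for i in range(m)}

def lookup_tokens(db, message_tokens):
    """One keyed lookup per message token; the size of db is never iterated."""
    return {tok: db[tok] for tok in message_tokens if tok in db}

small_db = make_db(1_000)      # m = 1,000 tokens
large_db = make_db(200_000)    # m = 200,000 tokens

# n = 100 tokens extracted from one message (all present in both stores).
message_tokens = [f"tok{i}" for i in range(0, 1_000, 10)]

small_hits = lookup_tokens(small_db, message_tokens)
large_hits = lookup_tokens(large_db, message_tokens)
```

Both calls perform the same 100 lookups. A disk-backed B-tree index would add a cost that grows roughly with log m, which is still a weak dependence.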

> > financial reasons?
> > if you mean performance  
> 
> no... money.. If I see 15 million msgs/day and keep the Bayes data
> which those millions provided over a decade or more, I'd be in the TB
> amount of data... I couldn't really justify requesting servers with
> TBs RAM. Accounting would put me in the looney house.

The number of tokens depends on how many you train, not on how many you
scan. 




Re: Trying to understand how bayes works.

2015-12-12 Thread Axb

On 12/12/2015 05:13 PM, RW wrote:

The number of tokens depends on how many you train, not on how many you
scan.

Obvious...

via autolearn my Bayes gets a constant feed of +500k "forced learn" 
spams/day.


Works for me to expire those after 3 or 7 days, depending on the trap feed.
Production traffic's autolearn spam/ham tokens have a 14-day TTL.
Works for me, very well. To each his own...
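
A rough sketch of TTL-based expiry as described here, assuming each token records which feed it came from and when it was last seen. The data layout and the names TTL_BY_SOURCE and expire are hypothetical; only the 3-day and 14-day values come from the post:

```python
DAY = 86_400  # seconds

# TTL per feed; the dict structure is an assumption for illustration.
TTL_BY_SOURCE = {
    "trap": 3 * DAY,         # forced-learn spam from a trap feed
    "production": 14 * DAY,  # autolearned spam/ham from production traffic
}

def expire(tokens, now):
    """tokens maps name -> (source, last_seen). Return only the survivors."""
    return {
        name: (source, last_seen)
        for name, (source, last_seen) in tokens.items()
        if now - last_seen <= TTL_BY_SOURCE[source]
    }

now = 100 * DAY
tokens = {
    "a": ("trap", 98 * DAY),        # 2 days old, kept
    "b": ("trap", 90 * DAY),        # 10 days old, expired
    "c": ("production", 90 * DAY),  # 10 days old, kept
    "d": ("production", 80 * DAY),  # 20 days old, expired
}
survivors = expire(tokens, now)
```

A store with native key expiry can do this without a sweep; the loop above only makes the rule explicit.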






Re: Trying to understand how bayes works.

2015-12-12 Thread Reindl Harald



On 12.12.2015 at 20:12, Axb wrote:

On 12/12/2015 05:13 PM, RW wrote:

The number of tokens depends on how many you train, not on how many you
scan.

Obvious...

via autolearn my Bayes gets a constant feed of +500k "forced learn"
spams/day.

Works for me to expire those after 3 or 7 days, depending on the trap feed.
Production traffic's autolearn spam/ham tokens have a 14-day TTL.
Works for me, very well. To each his own...


well "autolearn" is the reason that you *require* "autoexpire"

that's a setup which works well enough out-of-the-box for most users / 
admins without maintaining anything - so be it - that does not mean it 
delivers the best possible results







Re: Trying to understand how bayes works.

2015-12-11 Thread Marc Perkel


On 12/11/15 06:58, RW wrote:

On Thu, 10 Dec 2015 13:54:05 -0800
Marc Perkel wrote:


Bayes breaks the message down into some sort of tokens and then does
statistics on those tokens as to tokens found in spam vs. tokens
found in ham.

But what about combinations of tokens? I'm thinking that I'd like to
have something that says when it sees tokens X and Y and Z then
that's spam even though X,Y,Z might be in ham when not combined.

Does bayes do that or is there anything that does?

In general making arbitrary combinations is not practical. What some
filters do is make tokens out of word combinations in a sliding window.
This can be very useful in catching difficult spams that are composed
of common neutral words, although in my experience it's a little more
prone to FPs than Bayes.

I use Bogofilter and DSPAM.

On Thu, 10 Dec 2015 21:28:44 -0800
Marc Perkel wrote:


I'm thinking about incorporating Bogofilter but instead of feeding it
messages I'm thinking about feeding it the SpamAssassin results - the
rule names it hit + other data about the message and then let it
score the rules. That's what I want to experiment with.

I thought of trying something like that myself, but my filtering became
practically perfect before I got around to it, so I never bothered. And
I think there are some problems with it.

The first is that FNs in SpamAssassin tend to come from a lack of
useful information rather than the scoring system failing to combine it
well.

The second is that most rules are either fairly neutral or strongly
spammy. There are few strong ham indicators to balance the rest. You
might be able to balance it with metadata, and reputation information,
but the trick is to do it without getting a high FP rate on new senders.

If you did wish to take account of rule combinations, you'd really have
to do it yourself because sliding-window tokenization wouldn't do it
well.




What I was thinking about doing was creating a string of tokens that 
represented key features of the message. Then run that through a program 
that created new tokens out of every possible combination of 2 tokens 
and adding that to the string. Then running bayes on that. My tokens 
will not be the text of the message but rules hit including a lot of 
rules I create not for points but just for tokens.


For example, I create rules that look for many phrases about a subject 
and the subject becomes a token. For instance:


JESUS
ROYALTY
MONEY

By themselves they are not an indicator of spam. But if you have all 3 then 
it's definitely spam. The idea is to not look at words but look at the 
meaning of phrases. For instance, introductions:


Dear (friend)
I am (someone)
I am contacting you because (some reason)

This says - I don't know you.

I am a member of the (Nigerian royal family|Armed forces in Iraq) etc.

These can all be reduced to tokens and then you just look for 
combination of tokens.
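
The pairing step described above could be sketched like this, assuming the feature tokens are plain strings and a pair is written as A+B (the function name and the "+" separator are made up for illustration):

```python
from itertools import combinations

def add_pair_tokens(tokens):
    """Return the unique tokens followed by one token per unordered pair."""
    unique = sorted(set(tokens))
    pairs = [f"{a}+{b}" for a, b in combinations(unique, 2)]
    return unique + pairs

expanded = add_pair_tokens(["MONEY", "JESUS", "ROYALTY"])
# expanded == ["JESUS", "MONEY", "ROYALTY",
#              "JESUS+MONEY", "JESUS+ROYALTY", "MONEY+ROYALTY"]
```

Sorting first means the same two features always produce the same pair token regardless of order. Note that n features produce n*(n-1)/2 pairs, so the token string grows quickly when many rules hit.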




--
Marc Perkel - Sales/Support
supp...@junkemailfilter.com
http://www.junkemailfilter.com
Junk Email Filter dot com
415-992-3400



Re: Trying to understand how bayes works.

2015-12-11 Thread Benny Pedersen

On December 11, 2015 9:38:49 AM Axb  wrote:


Again.. SA's Redis backend speed and ease of use can't be beat...


mariadb engine=memory

hack the mariadb start/stop scripts to change the engine before stop and 
after start, so the db persists



There's some help in
https://svn.apache.org/repos/asf/spamassassin/trunk/contrib/HOWTO.Bayes-Redis/


thanks anyway


Re: Trying to understand how bayes works.

2015-12-11 Thread Dianne Skoll
On Fri, 11 Dec 2015 09:05:10 -0800
Marc Perkel  wrote:

> What I was thinking about doing was creating a string of tokens that 
> represented key features of the message. Then run that through a
> program that created new tokens out of every possible combination of
> 2 tokens and adding that to the string. Then running bayes on that.
> My tokens will not be the text of the message but rules hit including
> a lot of rules I create not for points but just for tokens.

In our experience, it's generally a bad idea to manipulate what you
feed to Bayes too much.  If you're training correctly, Bayes will have
a far better idea than you do about what's a spam indicator and what
isn't.  It almost never pays to be too clever.

Regards,

Dianne.


Re: Trying to understand how bayes works.

2015-12-11 Thread Kris Deugau
Marc Perkel wrote:
> I've had bayes disabled in SA because it seems to not be able to stay
> working in a high volume situation. The MySQL server can't seem to keep
> up with it even on very fast computers.

I'm curious where you started seeing performance issues (number of
messages, users;  hardware platform stats), because it's something I'm
keeping in mind for "things to watch out for" locally.  Particularly in
light of:

> Yes MariaDB was better than MySQL but not good enough to keep up. I even 
> tried putting the database on ram disk and still didn't work. 

where we switched from MySQL (either ISAM or InnoDB) on disk to ramdisk,
and immediately dropped 90%+ of the performance issues we were having.

We also found that MySQL replication was such an I/O hog that it was
better to single-host the SA Bayes DB, keep reasonable backups, and plan
to restore from backup instead of trying for hot or warm failover.

-kgd


Re: Trying to understand how bayes works.

2015-12-11 Thread Reindl Harald



On 11.12.2015 at 17:40, Kris Deugau wrote:

Marc Perkel wrote:

I've had bayes disabled in SA because it seems to not be able to stay
working in a high volume situation. The MySQL server can't seem to keep
up with it even on very fast computers.


I'm curious where you started seeing performance issues (number of
messages, users;  hardware platform stats), because it's something I'm
keeping in mind for "things to watch out for" locally.  Particularly in
light of:


Yes MariaDB was better than MySQL but not good enough to keep up. I even tried 
putting the database on ram disk and still didn't work.


where we switched from MySQL (either ISAM or InnoDB) on disk to ramdisk,
and immediately dropped 90%+ of the performance issues we were having.

We also found that MySQL replication was such an I/O hog that it was
better to single-host the SA Bayes DB, keep reasonable backups, and plan
to restore from backup instead of trying for hot or warm failover


well, try the default bayes-backend on a tmpfs

whatever is producing load here, it isn't the bayes-db, even though it's 
likely one of the largest bayes databases out there because of no 
auto-expire/auto-learn and not being limited to just 15 tokens


last night it took 4300 seconds to test 75000 samples against the 
current bayes, which means about 17/second, BUT the most expensive part is 
for sure firing up the spamc process 75000 times from a script and not the 
bayes itself


0  54694SPAM
0  2HAM
02379839TOKEN

-rw--- 1 sa-milt sa-milt 10M 2015-12-11 17:39 bayes_seen
-rw--- 1 sa-milt sa-milt 81M 2015-12-11 17:39 bayes_toks
-rw--- 1 sa-milt sa-milt  39 2015-12-07 21:10 user_prefs

"/var/lib/spamass-milter/.spamassassin/", where the site-wide bayes is 
stored, is a read-only tmpfs (read-only for SA) which is restored at boot 
from a persistent folder and rsynced back not only at shutdown but also 
whenever the learning script is called


[spamass-milter@mail-gw:~]$ cat /etc/systemd/system/bayes.service
[Unit]
Description=Bayes RAM-Disk Manager
Before=spamassassin.service

[Service]
Type=oneshot
RemainAfterExit=yes
User=sa-milt
Group=sa-milt
ExecStart=/usr/bin/rsync --quiet --recursive --times /var/lib/bayes-persistent/ /var/lib/spamass-milter/.spamassassin/
ExecStop=/usr/bin/rsync --quiet --recursive --times /var/lib/spamass-milter/.spamassassin/ /var/lib/bayes-persistent/


[Install]
WantedBy=multi-user.target





Re: Trying to understand how bayes works.

2015-12-11 Thread Reindl Harald

summary of what you said below:

that's what bayes already does
just train it properly instead of re-inventing the wheel

On 11.12.2015 at 18:05, Marc Perkel wrote:

What I was thinking about doing was creating a string of tokens that
represented key features of the message. Then run that through a program
that created new tokens out of every possible combination of 2 tokens
and adding that to the string. Then running bayes on that. My tokens
will not be the text of the message but rules hit including a lot of
rules I create not for points but just for tokens.

For example, I create rules that look for many phrases about a subject
and the subject becomes a token. For instance:

JESUS
ROYALTY
MONEY

By themselves they are not an indicator of spam. But if you have all 3 then
it's definitely spam. The idea is to not look at words but look at the
meaning of phrases. For instance, introductions:

Dear (friend)
I am (someone)
I am contacting you because (some reason)

This says - I don't know you.

I am a member of the (Nigerian royal family|Armed forces in Iraq) etc.

These can all be reduced to tokens and then you just look for
combination of tokens







Re: Trying to understand how bayes works.

2015-12-11 Thread RW
On Thu, 10 Dec 2015 13:54:05 -0800
Marc Perkel wrote:

> Bayes breaks the message down into some sort of tokens and then does 
> statistics on those tokens as to tokens found in spam vs. tokens
> found in ham.
> 
> But what about combinations of tokens? I'm thinking that I'd like to 
> have something that says when it sees tokens X and Y and Z then
> that's spam even though X,Y,Z might be in ham when not combined.
> 
> Does bayes do that or is there anything that does?

In general making arbitrary combinations is not practical. What some
filters do is make tokens out of word combinations in a sliding window.
This can be very useful in catching difficult spams that are composed
of common neutral words, although in my experience it's a little more
prone to FPs than Bayes.
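
As an illustration only, a sliding-window tokenizer might look like the sketch below. It is a generic example, not the exact tokenizer of any filter named in this thread, and the name window_tokens and the default window of 2 are assumptions:

```python
def window_tokens(words, window=2):
    """Emit every single word plus each run of `window` adjacent words."""
    tokens = list(words)
    for i in range(len(words) - window + 1):
        tokens.append(" ".join(words[i:i + window]))
    return tokens

tokens = window_tokens(["huge", "discount", "today"])
# tokens == ["huge", "discount", "today", "huge discount", "discount today"]
```

Only words that sit next to each other are combined, which is why this catches phrases built from neutral words but cannot express "X anywhere and Y anywhere" in a message.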

I use Bogofilter and DSPAM.

On Thu, 10 Dec 2015 21:28:44 -0800
Marc Perkel wrote:

> I'm thinking about incorporating Bogofilter but instead of feeding it 
> messages I'm thinking about feeding it the SpamAssassin results - the 
> rule names it hit + other data about the message and then let it
> score the rules. That's what I want to experiment with.

I thought of trying something like that myself, but my filtering became
practically perfect before I got around to it, so I never bothered. And
I think there are some problems with it.

The first is that FNs in SpamAssassin tend to come from a lack of
useful information rather than the scoring system failing to combine it
well.

The second is that most rules are either fairly neutral or strongly
spammy. There are few strong ham indicators to balance the rest. You
might be able to balance it with metadata, and reputation information,
but the trick is to do it without getting a high FP rate on new senders.

If you did wish to take account of rule combinations, you'd really have
to do it yourself because sliding-window tokenization wouldn't do it
well.







Re: Trying to understand how bayes works.

2015-12-11 Thread Reindl Harald



On 11.12.2015 at 20:58, Axb wrote:

I hate stale data... that's all


how can bayes data be stale?

a spam message is a spam message now, tomorrow and next year
the same especially for ham






Re: Trying to understand how bayes works.

2015-12-11 Thread Axb

On 12/11/2015 06:28 AM, Marc Perkel wrote:


On 12/10/15 18:31, Benny Pedersen wrote:

Marc Perkel wrote on 2015-12-10 22:54:

I've had bayes disabled in SA because it seems to not be able to stay
working in a high volume situation. The MySQL server can't seem to
keep up with it even on very fast computers.


i got a palm Zire that can do ocr on handwritten text :=)

pretty good for the kind of cpu it has


But - thinking about trying something interesting - doing my own bayes
in a different way.


i have tried bogofilter with very good success, and i see problems with
bayes here as well, i remember you changed to mariadb ?

at that time you said it worked better than mysql ?

did it fail again ?


Here's my question.

Bayes breaks the message down into some sort of tokens and then does
statistics on those tokens as to tokens found in spam vs. tokens found
in ham.

But what about combinations of tokens? I'm thinking that I'd like to
have something that says when it sees tokens X and Y and Z then that's
spam even though X,Y,Z might be in ham when not combined.

Does bayes do that or is there anything that does?


if z is scored as spam, and x and y are ham, then it's ham, basically that's
how bayes works, but a single mail might be a lot to digest and compare
before it can say spam or not

test bogofilter

put 100 spam mails in a spam folder
put 100 non spam mails in a ham folder

train bogofilter with these 2 folders in one go, not first ham and then
spam; it must be done in one bogofilter training call. then configure the
bogofilter.cf plugin for spamassassin and test it :=)

YMMV




Yes MariaDB was better than MySQL but not good enough to keep up. I even
tried putting the database on ram disk and still didn't work.

I'm thinking about incorporating Bogofilter but instead of feeding it
messages I'm thinking about feeding it the spamassassin results - the
rule names it hit + other data about the message and then let it score
the rules. That's what I want to experiment with.


Bogofilter was designed to be used with a MUA. Shellout for each msg 
can't be very efficient and if you want to share the Bayes DB across 
several boxes, NFS doesn't seem like a fast option either.


Again.. SA's Redis backend speed and ease of use can't be beat...
There's some help in
https://svn.apache.org/repos/asf/spamassassin/trunk/contrib/HOWTO.Bayes-Redis/

Axb



Re: Trying to understand how bayes works.

2015-12-11 Thread Axb

On 12/11/2015 07:29 PM, Joe Quinn wrote:

On 12/11/2015 1:24 PM, Reindl Harald wrote:



On 11.12.2015 at 19:12, Axb wrote:

On 12/11/2015 06:51 PM, Reindl Harald wrote:

well, how many of you trained christmas spam this year while my bayes
already knew it from last year?


I like my Bayes fresh like bread out of the oven, new guitar strings and
clean sheets.


well, i like my bayes to catch spam at any point in time, without letting
things that were already caught slip through again - tell me one reason why
i should let phishing that was already detected pass through to customers

96% of all milter-rejected mails got 3.5-7.5 points from bayes while
at the same time 77% of all scanned mail got -3.5 points - in other
words most ham has BAYES_00, most spam has BAYES_80-BAYES_999 - that's
what the bayes is supposed to do


Last year's turkey doesn't appeal to me.


and what about last year's spam making it through again until relearn?

spammers would have so much more work if they didn't know that in a
few months they can re-use their templates after a large enough break;
as a spammer i would even automate the scheduling of them


Agreed, and adding that we do see a large percentage of repeat seasonal
spam templates. You need at least some of your data to carry over for at
least a year, maybe two in order to stay effective.


We're obviously catering to a different user base, seeing different 
traffic, etc, and have different approaches.
Each method is valid if you're happy with the results and your arsenal 
does what you need it to do.


I hate stale data... that's all.








Re: Trying to understand how bayes works.

2015-12-11 Thread Reindl Harald



On 11.12.2015 at 18:42, Martin Gregorie wrote:

For instance, I have two portmanteau rules, SALE (contains sales
phrases like "huge discount") and PRODUCT (contains phrases like "fur
coat") that are ANDed by a meta called SALESPAM. The nice thing about
this approach is that, once the SALE and PRODUCT lists have grown to a
decent size the SALESPAM meta starts to fire on previously unseen
combinations without generating FPs. The only downside is that, unlike
Bayes, you have to build the lists manually but that's probably no worse
to do than building a hand-crafted Bayes DB like Reindl does


hand crafted bayes?
worse?

what nonsense

what's handcrafted there?
that i don't trust autolearn and don't like autoexpire?

well, how many of you trained christmas spam this year while my bayes 
already knew it from last year?


how many of you train the same spam types again and again because 
spammers are aware of autoexpire and just need to stop using a campaign 
for some weeks until 99% of default setups have forgotten about it


what i do is just KEEP all training messages so that i can rebuild my 
bayes at any point in time without starting to learn from scratch


since "bayes_token_sources all" came with the last release, and 
"normalize_charset 1", enabled later, changed its behavior with the 
latest release, i know why - well, i knew that from the first moment: 
"keep the corpus in case something in the tokenizer changes later"






Re: Trying to understand how bayes works.

2015-12-11 Thread Axb

On 12/11/2015 06:51 PM, Reindl Harald wrote:

well, how many of you trained christmas spam this year while my bayes
already knew it from last year?


I like my Bayes fresh like bread out of the oven, new guitar strings and 
clean sheets.


Last year's turkey doesn't appeal to me.

SCR


Re: Trying to understand how bayes works.

2015-12-11 Thread Martin Gregorie
On Fri, 2015-12-11 at 09:05 -0800, Marc Perkel wrote:
> For example. I create rules that look for many phrases about a
> subject 
> and the subject becomes a token. For examples:
> 
> JESUS
> ROYALTY
> MONEY
> 
> But themselves not an indicator of spam. But if you have all 3 then
> it's 
> definitely spam. The idea is to not look at words but look at the 
> meaning of phrases. 
>
This approach works well for me too, but doesn't need Bayes to make it
perform: just two or more portmanteau rules[*] that are combined by a
meta with a relatively high score. The idea is that the triggering
phrases are not spam indicators by themselves, but that the combination
is something that virtually never occurs in ham but is a reliable spam
indicator.

For instance, I have two portmanteau rules, SALE (contains sales
phrases like "huge discount") and PRODUCT (contains phrases like "fur
coat") that are ANDed by a meta called SALESPAM. The nice thing about
this approach is that, once the SALE and PRODUCT lists have grown to a
decent size the SALESPAM meta starts to fire on previously unseen
combinations without generating FPs. The only downside is that, unlike
Bayes, you have to build the lists manually but that's probably no worse
to do than building a hand-crafted Bayes DB like Reindl does. 

[*] My term: a portmanteau rule is rule with a very long alternate list
and a low score in the range 0.01 - 0.1. These things are hard to read
and maintain, so I have an awk script that generates a syntactically
correct SA rule from a file that names the rule, sets the score and has
all the regexes written one per line.  
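
The generator idea can be sketched as follows. This is not the awk script mentioned above; it is a Python illustration that assumes a made-up spec layout (rule name on the first line, score on the second, then one regex per line). The emitted body, score and meta lines use ordinary SpamAssassin rule syntax, while the function names, the example phrases beyond those quoted, and the meta score of 3.0 are hypothetical:

```python
def build_rule(spec_text):
    """Turn a spec (name, score, one regex per line) into a body rule."""
    lines = [ln.strip() for ln in spec_text.splitlines() if ln.strip()]
    name, score, patterns = lines[0], lines[1], lines[2:]
    # Patterns are joined as-is; a "/" inside a pattern would need escaping.
    alternation = "|".join(patterns)
    return f"body {name} /(?:{alternation})/i\nscore {name} {score}\n"

def build_meta(name, parts, score):
    """AND several portmanteau rules together in one meta rule."""
    expr = " && ".join(parts)
    return f"meta {name} ({expr})\nscore {name} {score}\n"

sale = build_rule("SALE\n0.01\nhuge discount\nprice drop\n")
product = build_rule("PRODUCT\n0.01\nfur coat\nleather bag\n")
salespam = build_meta("SALESPAM", ["SALE", "PRODUCT"], "3.0")
```

Keeping one regex per line in the spec file is what makes the long alternation maintainable; the generated rule itself is never edited by hand.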

Martin



Re: Trying to understand how bayes works.

2015-12-11 Thread Axb

On 12/11/2015 07:24 PM, Reindl Harald wrote:



On 11.12.2015 at 19:12, Axb wrote:

On 12/11/2015 06:51 PM, Reindl Harald wrote:

well, how many of you trained christmas spam this year while my bayes
already knew it from last year?


I like my Bayes fresh like bread out of the oven, new guitar strings and
clean sheets.


well, i like my bayes to catch spam at any point in time, without letting
things that were already caught slip through again - tell me one reason why
i should let phishing that was already detected pass through to customers


In my playpen, phishing is detected by Digests, autogenerated rules, URI 
strings & domain BLs, etc. Don't need to babysit Bayes for that.



96% of all milter-rejected mails got 3.5-7.5 points from bayes while at
the same time 77% of all scanned mail got -3.5 points - in other words
most ham has BAYES_00, most spam has BAYES_80-BAYES_999 - that's what
the bayes is supposed to do


That's what YOU want it to do. Your kitchen, your sauce.
For me, Bayes is just another spice in the rack. And my rack has lots of 
jars.


My trap flow keeps it all well fed and happy.
Without knowing it, even you are using some of my data. I just see it 20 
sec before you can make use of it.


Re: Trying to understand how bayes works.

2015-12-11 Thread Joe Quinn

On 12/11/2015 1:24 PM, Reindl Harald wrote:



On 11.12.2015 at 19:12, Axb wrote:

On 12/11/2015 06:51 PM, Reindl Harald wrote:
well, how many of you trained christmas spam this year while my bayes
already knew it from last year?


I like my Bayes fresh like bread out of the oven, new guitar strings and
clean sheets.


well, i like my bayes to catch spam at any point in time, without letting 
things that were already caught slip through again - tell me one reason why 
i should let phishing that was already detected pass through to customers


96% of all milter-rejected mails got 3.5-7.5 points from bayes while 
at the same time 77% of all scanned mail got -3.5 points - in other 
words most ham has BAYES_00, most spam has BAYES_80-BAYES_999 - that's 
what the bayes is supposed to do



Last year's turkey doesn't appeal to me.


and what about last year's spam making it through again until relearn?

spammers would have so much more work if they didn't know that in a 
few months they can re-use their templates after a large enough break; 
as a spammer i would even automate the scheduling of them


Agreed, and adding that we do see a large percentage of repeat seasonal 
spam templates. You need at least some of your data to carry over for at 
least a year, maybe two in order to stay effective.


Re: Trying to understand how bayes works.

2015-12-11 Thread Reindl Harald



On 11.12.2015 at 19:12, Axb wrote:

On 12/11/2015 06:51 PM, Reindl Harald wrote:

well, how many of you trained christmas spam this year while my bayes
already knew it from last year?


I like my Bayes fresh like bread out of the oven, new guitar strings and
clean sheets.


well, i like my bayes to catch spam at any point in time, without letting 
things that were already caught slip through again - tell me one reason why 
i should let phishing that was already detected pass through to customers


96% of all milter-rejected mails got 3.5-7.5 points from bayes while at 
the same time 77% of all scanned mail got -3.5 points - in other words 
most ham has BAYES_00, most spam has BAYES_80-BAYES_999 - that's what 
the bayes is supposed to do



Last year's turkey doesn't appeal to me.


and what about last year's spam making it through again until relearn?

spammers would have so much more work if they didn't know that in a few 
months they can re-use their templates after a large enough break; as a 
spammer i would even automate the scheduling of them






Re: Trying to understand how bayes works.

2015-12-10 Thread Axb

On 12/10/2015 10:54 PM, Marc Perkel wrote:

I've had bayes disabled in SA because it seems to not be able to stay
working in a high volume situation. The MySQL server can't seem to keep
up with it even on very fast computers.


Redis is your friend.
Redis over the wire is faster than any local SDBM/DB file based backend.
All you need is ram, the more the better

I use site wide autolearn, auto feed spam from traps - atm, token TTL is 4d

# Clients
connected_clients:35
client_longest_output_list:0
client_biggest_input_buf:0
blocked_clients:0

# Memory
used_memory:3454088112
used_memory_human:3.22G
used_memory_rss:3528310784
used_memory_peak:3454661016
used_memory_peak_human:3.22G
used_memory_lua:116736
mem_fragmentation_ratio:1.02
mem_allocator:jemalloc-3.6.0

if I switch Bayes on or off I notice zero SA scan speed change.
Average SA scan time  is 0.8 sec/msg



But - thinking about trying something interesting - doing my own bayes
in a different way.

Here's my question.

Bayes breaks the message down into some sort of tokens and then does
statistics on those tokens as to tokens found in spam vs. tokens found
in ham.

But what about combinations of tokens? I'm thinking that I'd like to
have something that says when it sees tokens X and Y and Z then that's
spam even though X,Y,Z might be in ham when not combined.

Does bayes do that or is there anything that does?


There's tons of Bayes documentation on the net. Different 
implementations, etc.

Enjoy the math... Not really a pure SA topic.






Trying to understand how bayes works.

2015-12-10 Thread Marc Perkel
I've had bayes disabled in SA because it seems to not be able to stay 
working in a high volume situation. The MySQL server can't seem to keep 
up with it even on very fast computers.


But - thinking about trying something interesting - doing my own bayes 
in a different way.


Here's my question.

Bayes breaks the message down into some sort of tokens and then does 
statistics on those tokens as to tokens found in spam vs. tokens found 
in ham.


But what about combinations of tokens? I'm thinking that I'd like to 
have something that says when it sees tokens X and Y and Z then that's 
spam even though X,Y,Z might be in ham when not combined.


Does bayes do that or is there anything that does?


--
Marc Perkel - Sales/Support
supp...@junkemailfilter.com
http://www.junkemailfilter.com
Junk Email Filter dot com
415-992-3400



Re: Trying to understand how bayes works.

2015-12-10 Thread Dianne Skoll
On Thu, 10 Dec 2015 13:54:05 -0800
Marc Perkel  wrote:

> But what about combinations of tokens? I'm thinking that I'd like to 
> have something that says when it sees tokens X and Y and Z then
> that's spam even though X,Y,Z might be in ham when not combined.

The SpamAssassin Bayes implementation does not do that.  Some other
Bayes implementations do.

Regards,

Dianne.


Re: Trying to understand how bayes works.

2015-12-10 Thread Benny Pedersen

Marc Perkel wrote on 2015-12-10 22:54:

> I've had bayes disabled in SA because it seems to not be able to stay
> working in a high volume situation. The MySQL server can't seem to
> keep up with it even on very fast computers.


I have a Palm Zire that can do OCR on handwritten text :=)

Pretty good for the kind of CPU it has.


> But - thinking about trying something interesting - doing my own bayes
> in a different way.


I have tried bogofilter with very good success, and I see problems with
bayes here as well. I remember you changed to MariaDB?


At that time you said it worked better than MySQL?

Did it fail again?


> Here's my question.
>
> Bayes breaks the message down into some sort of tokens and then does
> statistics on those tokens as to tokens found in spam vs. tokens found
> in ham.
>
> But what about combinations of tokens? I'm thinking that I'd like to
> have something that says when it sees tokens X and Y and Z then that's
> spam even though X,Y,Z might be in ham when not combined.
>
> Does bayes do that or is there anything that does?


If Z is scored as spam, and X and Y are ham, then it's ham; basically
that's how bayes works. But a single mail might have lots of digests to
compare for this to say spam or not.


Test bogofilter:

Put 100 spam mails in a spam folder.
Put 100 non-spam mails in a ham folder.

Train bogofilter with these 2 folders in one go, not first ham and then
spam; it must be done in one bogofilter training call. Configure the
bogofilter.cf plugin for SpamAssassin and test it :=)


YMMV


Re: Trying to understand how bayes works.

2015-12-10 Thread Dianne Skoll
On Fri, 11 Dec 2015 03:31:56 +0100
Benny Pedersen  wrote:

> If Z is scored as spam, and X and Y are ham, then it's ham; basically
> that's how bayes works. But a single mail might have lots of digests to
> compare for this to say spam or not.

The thing is, the probability of token Y is not independent of the
previous token, and single-token Bayes misses out on those conditional
probabilities.

The example I like to give is that the tokens "red" and "hot" are
probably neutral to slightly spammy, and "sex" is probably mildly
spammy, but "red hot sex" is way spammier than the individual tokens
far apart as in "Yeah, the red chili peppers are hot.  Oh, by the way,
what was the sex of the baby?"

Regards,

Dianne.
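
The "red hot sex" example can be sketched in Python by generating multi-word tokens from adjacent words. This is only an illustration of the idea of token combinations; the function names and the choice of up to three adjacent words are assumptions and do not describe any particular filter.

```python
import re


def words(text):
    """Split text into lowercase words (a deliberately simple tokenizer)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def ngram_tokens(text, max_n=3):
    """Return single words plus every run of up to max_n adjacent words."""
    w = words(text)
    tokens = set()
    for n in range(1, max_n + 1):
        for i in range(len(w) - n + 1):
            tokens.add(" ".join(w[i:i + n]))
    return tokens


spammy = ngram_tokens("red hot sex")
innocent = ngram_tokens(
    "Yeah, the red chili peppers are hot. "
    "Oh, by the way, what was the sex of the baby?"
)

# Single words appear in both messages, so they cannot tell them apart.
shared = spammy & innocent
# The adjacent-word tokens only appear in the spammy phrase.
only_spammy = spammy - innocent
```

Here `shared` holds the three single words, while `only_spammy` holds "red hot", "hot sex" and "red hot sex". A filter that learns statistics for those multi-word tokens can score the phrase differently from the same words spread far apart, at the cost of a much larger token database.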


Re: Trying to understand how bayes works.

2015-12-10 Thread Marc Perkel


On 12/10/15 18:31, Benny Pedersen wrote:

> Marc Perkel wrote on 2015-12-10 22:54:
>
>> I've had bayes disabled in SA because it seems to not be able to stay
>> working in a high volume situation. The MySQL server can't seem to
>> keep up with it even on very fast computers.
>
> I have a Palm Zire that can do OCR on handwritten text :=)
>
> Pretty good for the kind of CPU it has.
>
>> But - thinking about trying something interesting - doing my own bayes
>> in a different way.
>
> I have tried bogofilter with very good success, and I see problems with
> bayes here as well. I remember you changed to MariaDB?
>
> At that time you said it worked better than MySQL?
>
> Did it fail again?
>
>> Here's my question.
>>
>> Bayes breaks the message down into some sort of tokens and then does
>> statistics on those tokens as to tokens found in spam vs. tokens found
>> in ham.
>>
>> But what about combinations of tokens? I'm thinking that I'd like to
>> have something that says when it sees tokens X and Y and Z then that's
>> spam even though X,Y,Z might be in ham when not combined.
>>
>> Does bayes do that or is there anything that does?
>
> If Z is scored as spam, and X and Y are ham, then it's ham; basically
> that's how bayes works. But a single mail might have lots of digests to
> compare for this to say spam or not.
>
> Test bogofilter:
>
> Put 100 spam mails in a spam folder.
> Put 100 non-spam mails in a ham folder.
>
> Train bogofilter with these 2 folders in one go, not first ham and then
> spam; it must be done in one bogofilter training call. Configure the
> bogofilter.cf plugin for SpamAssassin and test it :=)
>
> YMMV




Yes, MariaDB was better than MySQL, but not good enough to keep up. I even
tried putting the database on a RAM disk and it still didn't work.


I'm thinking about incorporating Bogofilter, but instead of feeding it
messages I'd feed it the SpamAssassin results: the names of the rules
that hit plus other data about the message, and then let it score the
rules. That's what I want to experiment with.
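
One way to picture this experiment is a Python sketch that turns the rule names from an X-Spam-Status header into tokens a text classifier could learn from. It assumes the common `tests=RULE1,RULE2,...` layout of that header, which can differ if the header has been customized. The function name `rule_tokens`, the `rule:` prefix, and the sample header are made up for illustration.

```python
import re


def rule_tokens(spam_status):
    """Extract rule names from an X-Spam-Status value as classifier tokens.

    Assumes the usual 'tests=A,B,C' field; folded header lines break after
    a comma, so whitespace following commas is removed first.
    """
    unfolded = re.sub(r",\s+", ",", spam_status)
    match = re.search(r"\btests=(\S+)", unfolded)
    if not match or match.group(1) == "none":
        return []
    return ["rule:" + name for name in match.group(1).split(",") if name]


# Made-up sample header value, folded across two lines.
sample = (
    "Yes, score=7.3 required=5.0 tests=BAYES_99,HTML_MESSAGE,\n"
    "\tURIBL_BLACK autolearn=no"
)
tokens = rule_tokens(sample)

# One line of pseudo-text per message, ready to hand to a trainable filter.
pseudo_message = " ".join(tokens)
```

The resulting line of rule tokens could then be trained as spam or ham in place of the original message, so the classifier learns which combinations of rule hits tend to appear together.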


--
Marc Perkel - Sales/Support
supp...@junkemailfilter.com
http://www.junkemailfilter.com
Junk Email Filter dot com
415-992-3400