Re: Trying to understand how bayes works.
On 12/12/2015 02:07 AM, Reindl Harald wrote:
> On 11.12.2015 at 20:58, Axb wrote:
>> I hate stale data... that's all
>
> how can bayes data be stale? a spam message is a spam message now, tomorrow and next year the same
>
> especially for ham

over time...
header patterns change
url patterns change
html templates change
rcvd headers change

all this is what bayes uses when it learns from your mailflow.

What isn't seen/used within a cautious period of time can be defined as stale data and therefore safely expired.

or.. how many fresh spams have you seen lately using forged Netscape Communicator headers which were detected by SA rules over a decade ago? or why do you think SARE rules became useless? (even dangerous).

Why would anyone want to sit on such unused tokens?

While for YOU, Bayes seems to be mission critical, for most setups it's the extra bit to push a score to learning threshold.

There's also practical/financial reasons not to sit on old Bayes data but these should be pretty obvious.

I think this horse can be buried and instead celebrate that Perkel seems happy with his new Redis toy. .-)

Axb
Re: Trying to understand how bayes works.
On 12.12.2015 at 17:13, RW wrote:
> On Sat, 12 Dec 2015 13:29:40 +0100 Axb wrote:
>> On 12/12/2015 01:08 PM, Reindl Harald wrote:
>>>> I hate stale data... that's all
>
> But you do keep stale data in the retained tokens, what you are getting rid of is the contribution from old mails that's least likely to make a difference to any classifications. Expiry is about managing database size; if it were about expiring stale information it would be implemented differently.

correct

>>> practical reasons?
>>> it's a computer
>> performance... If I keep accessing X years of stale data my scanning times go to the roof.
>
> The time taken to look-up n tokens from a database containing m tokens shouldn't strongly depend on m. There's something wrong if it does.

correct

a message has a fixed number of tokens which are queried against the database and its primary key - it doesn't matter if that database has 150 thousand or 2 million tokens - proven by the automated mass-test passing every corpus message against spamd: there is no change in performance, it only takes longer because of the number of messages to test

that's how databases work by design

>>> financial reasons?
>>> if you mean performance
>> no... money.. If I see 15 million msgs/day and keep the Bayes data which those millions provided over a decade or more, I'd be in the TB amount of data... I couldn't really justify requesting servers with TBs RAM. Accounting would put me in the looney house.
>
> The number of tokens depends on how many you train, not on how many you scan

correct

and to say it clearly: the need to train goes down when you don't lose data which re-appears two months later - seasonal data and so on

i only need to train around 15-20 ham messages each week which are not BAYES_00, and there is not that much more spam below the milter-reject score

what i currently do is train milter-rejects which are not BAYES_99 by passing them through spamc in the feed-script and ignoring anything which already has BAYES_99/999 - most likely i could even stop that now after a year of training *because* it catches practically anything

so the real *need* of training has gone down to around 50 mails per week, and it doesn't matter if it's 1000, 1 or 15 million msg/day, the number only increases with users, the 75000 samples are covering them all on a site-wide setup

what's the difference to a default setup:

* you need to invest time at the beginning
* it catches "new" campaigns from the first message on, which are in fact not really new, spam topics are always the same over years
* it doesn't get false-positives on seasonal ham because you did not lose the ham-tokens from the last season

summary: you can score bayes much higher without false-positives and so hit messages before the sending servers are on enough blacklists or URIBL hits them - finally: your overall system works much more accurately after you have paid the price of initial feeding
Re: Trying to understand how bayes works.
On Sat, 12 Dec 2015 13:29:40 +0100 Axb wrote:
> On 12/12/2015 01:08 PM, Reindl Harald wrote:
> >> I hate stale data... that's all

But you do keep stale data in the retained tokens, what you are getting rid of is the contribution from old mails that's least likely to make a difference to any classifications. Expiry is about managing database size; if it were about expiring stale information it would be implemented differently.

> > practical reasons?
> > it's a computer
> performance... If I keep accessing X years of stale data my scanning
> times go to the roof.

The time taken to look-up n tokens from a database containing m tokens shouldn't strongly depend on m. There's something wrong if it does.

> > financial reasons?
> > if you mean performance
> no... money.. If I see 15 million msgs/day and keep the Bayes data
> which those millions provided over a decade or more, I'd be in the TB
> amount of data... I couldn't really justify requesting servers with
> TBs RAM. Accounting would put me in the looney house.

The number of tokens depends on how many you train, not on how many you scan.
Re: Trying to understand how bayes works.
On 12/12/2015 05:13 PM, RW wrote:
> The number of tokens depends on how many you train, not on how many you scan.

Obvious... via autolearn my Bayes gets a constant feed of +500k "forced learn" spams/day. Works for me to expire those after 3 or 7 days, depending on the trap feed.

Production traffic's autolearn spam/ham tokens have a 14-day TTL.

Works for me, very well. To each his own...
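As a rough illustration of the TTL-style expiry described above, here is a minimal Python sketch of a token store that drops any token not seen within its TTL. It is not SpamAssassin's or Redis's actual implementation; the class name, the method names and the use of whole days as the time unit are assumptions made purely for this example.

```python
class TtlTokenStore:
    """Toy token store: a token expires once it has gone ttl_days unseen."""

    def __init__(self, ttl_days):
        self.ttl_days = ttl_days
        self.last_seen = {}  # token -> day number on which it was last learned

    def learn(self, tokens, today):
        for token in tokens:
            self.last_seen[token] = today  # seeing a token again resets its TTL

    def expire(self, today):
        # Collect first, then delete, so the dict is not modified while iterating.
        stale = [t for t, seen in self.last_seen.items()
                 if today - seen >= self.ttl_days]
        for token in stale:
            del self.last_seen[token]
        return len(stale)


store = TtlTokenStore(ttl_days=14)
store.learn(["viagra", "invoice"], today=0)
store.learn(["invoice"], today=10)   # "invoice" shows up again and stays fresh
removed = store.expire(today=14)     # "viagra" has now gone 14 days unseen
print(removed, sorted(store.last_seen))
```

If `expire` is never called, every learned token is kept indefinitely, which is the other side of the trade-off being argued in this thread.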
Re: Trying to understand how bayes works.
On 12.12.2015 at 20:12, Axb wrote:
> On 12/12/2015 05:13 PM, RW wrote:
>> The number of tokens depends on how many you train, not on how many you scan.
>
> Obvious... via autolearn my Bayes gets a constant feed of +500k "forced learn" spams/day. Works for me to expire those after 3 or 7 days, depending on the trap feed.
>
> Production traffic's autolearn spam/ham tokens have a 14-day TTL.
>
> Works for me, very well. To each his own...

well, "autolearn" is the reason that you *require* "autoexpire"

that's a setup which works well enough out-of-the-box for most users / admins without maintaining anything - so be it - that does not mean it delivers the best possible results
Re: Trying to understand how bayes works.
On 12/11/15 06:58, RW wrote:
> On Thu, 10 Dec 2015 13:54:05 -0800 Marc Perkel wrote:
>> Bayes breaks the message down into some sort of tokens and then does statistics on those tokens as to tokens found in spam vs. tokens found in ham.
>>
>> But what about combinations of tokens? I'm thinking that I'd like to have something that says when it sees tokens X and Y and Z then that's spam even though X,Y,Z might be in ham when not combined.
>>
>> Does bayes do that or is there anything that does?
>
> In general making arbitrary combinations is not practical. What some filters do is make tokens out of word combinations in a sliding window. This can be very useful in catching difficult spams that are composed of common neutral words, although in my experience it's a little more prone to FPs than Bayes. I use Bogofilter and DSPAM.
>
> On Thu, 10 Dec 2015 21:28:44 -0800 Marc Perkel wrote:
>> I'm thinking about incorporating Bogofilter but instead of feeding it messages I'm thinking about feeding it the SpamAssassin results - the rule names it hit + other data about the message and then let it score the rules. That's what I want to experiment with.
>
> I thought of trying something like that myself, but my filtering became practically perfect before I got around to it, so I never bothered. And I think there are some problems with it. The first is that FNs in SpamAssassin tend to come from a lack of useful information rather than the scoring system failing to combine it well. The second is that most rules are either fairly neutral or strongly spammy. There are few strong ham indicators to balance the rest. You might be able to balance it with metadata, and reputation information, but the trick is to do it without getting a high FP rate on new senders.
>
> If you did wish to take account of rule combinations, you'd really have to do it yourself because sliding-window tokenization wouldn't do it well.

What I was thinking about doing was creating a string of tokens that represented key features of the message. Then run that through a program that created new tokens out of every possible combination of 2 tokens and adding that to the string. Then running bayes on that. My tokens will not be the text of the message but rules hit, including a lot of rules I create not for points but just for tokens.

For example, I create rules that look for many phrases about a subject and the subject becomes a token. For example:

JESUS
ROYALTY
MONEY

By themselves these are not an indicator of spam. But if you have all 3 then it's definitely spam. The idea is to not look at words but look at the meaning of phrases.

For instance, introductions:

Dear (friend)
I am (someone)
I am contacting you because (some reason)

This says - I don't know you.

I am a member of the (Nigerian royal family|Armed forces in Iraq) etc.

These can all be reduced to tokens and then you just look for combinations of tokens.

--
Marc Perkel - Sales/Support
supp...@junkemailfilter.com
http://www.junkemailfilter.com
Junk Email Filter dot com
415-992-3400
Re: Trying to understand how bayes works.
On December 11, 2015 9:38:49 AM Axb wrote:
> Again.. SA's Redis backend speed and ease of use can't be beat...

mariadb engine=memory hack

mariadb start stop to change engine before stop and after start for persistence db

> There's some help in
> https://svn.apache.org/repos/asf/spamassassin/trunk/contrib/HOWTO.Bayes-Redis/

thanks anyway
Re: Trying to understand how bayes works.
On Fri, 11 Dec 2015 09:05:10 -0800 Marc Perkel wrote:
> What I was thinking about doing was creating a string of tokens that
> represented key features of the message. Then run that through a
> program that created new tokens out of every possible combination of
> 2 tokens and adding that to the string. Then running bayes on that.
> My tokens will not be the text of the message but rules hit including
> a lot of rules I create not for points but just for tokens.

In our experience, it's generally a bad idea to manipulate what you feed to Bayes too much. If you're training correctly, Bayes will have a far better idea than you do about what's a spam indicator and what isn't. It almost never pays to be too clever.

Regards,

Dianne.
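For readers trying to picture the pair-token idea quoted above, a minimal Python sketch follows. It only shows how "every possible combination of 2 tokens" could be generated from a list of rule-name tokens; the function name and the "+" separator are assumptions for illustration, and nothing here is part of SpamAssassin.

```python
from itertools import combinations


def add_pair_tokens(tokens):
    """Return the distinct tokens plus one combined token per unordered pair."""
    unique = sorted(set(tokens))  # sorting makes "A+B" and "B+A" the same token
    pairs = [f"{a}+{b}" for a, b in combinations(unique, 2)]
    return unique + pairs


features = add_pair_tokens(["JESUS", "ROYALTY", "MONEY"])
print(features)
```

Note that n tokens produce n*(n-1)/2 pair tokens, so a message hitting 40 rules would add 780 extra tokens; that growth is one practical reason arbitrary combinations get expensive.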
Re: Trying to understand how bayes works.
Marc Perkel wrote:
> I've had bayes disabled in SA because it seems to not be able to stay
> working in a high volume situation. The MySQL server can't seem to keep
> up with it even on very fast computers.

I'm curious where you started seeing performance issues (number of messages, users; hardware platform stats), because it's something I'm keeping in mind for "things to watch out for" locally. Particularly in light of:

> Yes MariaDB was better than MySQL but not good enough to keep up. I even
> tried putting the database on ram disk and still didn't work.

where we switched from MySQL (either ISAM or InnoDB) on disk to ramdisk, and immediately dropped 90%+ of the performance issues we were having.

We also found that MySQL replication was such an I/O hog that it was better to single-host the SA Bayes DB, keep reasonable backups, and plan to restore from backup instead of trying for hot or warm failover.

-kgd
Re: Trying to understand how bayes works.
On 11.12.2015 at 17:40, Kris Deugau wrote:
> Marc Perkel wrote:
>> I've had bayes disabled in SA because it seems to not be able to stay working in a high volume situation. The MySQL server can't seem to keep up with it even on very fast computers.
>
> I'm curious where you started seeing performance issues (number of messages, users; hardware platform stats), because it's something I'm keeping in mind for "things to watch out for" locally. Particularly in light of:
>
>> Yes MariaDB was better than MySQL but not good enough to keep up. I even tried putting the database on ram disk and still didn't work.
>
> where we switched from MySQL (either ISAM or InnoDB) on disk to ramdisk, and immediately dropped 90%+ of the performance issues we were having.
>
> We also found that MySQL replication was such an I/O hog that it was better to single-host the SA Bayes DB, keep reasonable backups, and plan to restore from backup instead of trying for hot or warm failover

well, try the default bayes-backend on a tmpfs

when anything is producing load here, it's not the bayes-db, while it's likely one of the largest bayes databases out there, caused by no auto-expire/auto-learn and not being limited to just 15 tokens

last night it took 4300 seconds to test 75000 samples against the current bayes, which means 17/second, BUT the most expensive part is for sure firing up the spamc process 75000 times from a script and not the bayes itself

0 54694 SPAM
0 2HAM
0 2379839 TOKEN

-rw------- 1 sa-milt sa-milt 10M 2015-12-11 17:39 bayes_seen
-rw------- 1 sa-milt sa-milt 81M 2015-12-11 17:39 bayes_toks
-rw------- 1 sa-milt sa-milt 39 2015-12-07 21:10 user_prefs

"/var/lib/spamass-milter/.spamassassin/", where the site-wide bayes is stored, is a read-only tmpfs (read-only for SA), restored at boot from a persistent folder and rsynced back not only at shutdown but also whenever the learning script is called

[spamass-milter@mail-gw:~]$ cat /etc/systemd/system/bayes.service
[Unit]
Description=Bayes RAM-Disk Manager
Before=spamassassin.service

[Service]
Type=oneshot
RemainAfterExit=yes
User=sa-milt
Group=sa-milt
ExecStart=/usr/bin/rsync --quiet --recursive --times /var/lib/bayes-persistent/ /var/lib/spamass-milter/.spamassassin/
ExecStop=/usr/bin/rsync --quiet --recursive --times /var/lib/spamass-milter/.spamassassin/ /var/lib/bayes-persistent/

[Install]
WantedBy=multi-user.target
Re: Trying to understand how bayes works.
summary to what you said below: that's what bayes already does

just train it properly instead of re-inventing the wheel

On 11.12.2015 at 18:05, Marc Perkel wrote:
> What I was thinking about doing was creating a string of tokens that represented key features of the message. Then run that through a program that created new tokens out of every possible combination of 2 tokens and adding that to the string. Then running bayes on that. My tokens will not be the text of the message but rules hit including a lot of rules I create not for points but just for tokens.
>
> For example. I create rules that look for many phrases about a subject and the subject becomes a token. For examples:
>
> JESUS
> ROYALTY
> MONEY
>
> By themselves not an indicator of spam. But if you have all 3 then it's definitely spam. The idea is to not look at words but look at the meaning of phrases.
>
> For instance, introductions:
>
> Dear (friend)
> I am (someone)
> I am contacting you because (some reason)
>
> This says - I don't know you.
>
> I am a member of the (Nigerian royal family|Armed forces in Iraq) etc.
>
> These can all be reduced to tokens and then you just look for combination of tokens
Re: Trying to understand how bayes works.
On Thu, 10 Dec 2015 13:54:05 -0800 Marc Perkel wrote:
> Bayes breaks the message down into some sort of tokens and then does
> statistics on those tokens as to tokens found in spam vs. tokens
> found in ham.
>
> But what about combinations of tokens? I'm thinking that I'd like to
> have something that says when it sees tokens X and Y and Z then
> that's spam even though X,Y,Z might be in ham when not combined.
>
> Does bayes do that or is there anything that does?

In general making arbitrary combinations is not practical. What some filters do is make tokens out of word combinations in a sliding window. This can be very useful in catching difficult spams that are composed of common neutral words, although in my experience it's a little more prone to FPs than Bayes. I use Bogofilter and DSPAM.

On Thu, 10 Dec 2015 21:28:44 -0800 Marc Perkel wrote:
> I'm thinking about incorporating Bogofilter but instead of feeding it
> messages I'm thinking about feeding it the SpamAssassin results - the
> rule names it hit + other data about the message and then let it
> score the rules. That's what I want to experiment with.

I thought of trying something like that myself, but my filtering became practically perfect before I got around to it, so I never bothered. And I think there are some problems with it. The first is that FNs in SpamAssassin tend to come from a lack of useful information rather than the scoring system failing to combine it well. The second is that most rules are either fairly neutral or strongly spammy. There are few strong ham indicators to balance the rest. You might be able to balance it with metadata, and reputation information, but the trick is to do it without getting a high FP rate on new senders.

If you did wish to take account of rule combinations, you'd really have to do it yourself because sliding-window tokenization wouldn't do it well.
Re: Trying to understand how bayes works.
On 11.12.2015 at 20:58, Axb wrote:
> I hate stale data... that's all

how can bayes data be stale? a spam message is a spam message now, tomorrow and next year the same

especially for ham
Re: Trying to understand how bayes works.
On 12/11/2015 06:28 AM, Marc Perkel wrote: On 12/10/15 18:31, Benny Pedersen wrote: Marc Perkel skrev den 2015-12-10 22:54: I've had bayes disabled in SA because it seems to not be able to stay working in a high volume situation. The MySQL server can't seem to keep up with it even on very fast computers. i got a palm Zire that can do ocr on handwrited text :=) pretty good for the kind of cpu it have But - thinking about trying something interesting - doing my own bayes in a different way. i have tryed bogofilter with very good succes, and i see problems with bayes here aswell, i remember you changed to mariadb ?` at that time you sayed it worked better then mysql ? did it fail again ? Here's my question. Bayes breaks the message down into some sort of tokens and then does statistics on those tokens as to tokens found in spam vs. tokens found in ham. But what about combinations of tokens? I'm thinking that I'd like to have something that says when it sees tokens X and Y and Z then that's spam even though X,Y,Z might be in ham when not combined. Does bayes do that or is there anything that does? if z is scored as spam, and x and y is ham, then its ham basicly that how bayes works, but a single mail might be lots of digest to compare for this to say spam or not test bogofilter put 100 spam mails in a spam folder put 100 non spam mails in a ham folder train bogofilter with this 2 folders in one go, not first ham and then spam, it must be done in one bogofilter call train, configure bogofilter.cf plugin for spamassassin, test it :=) YMMV Yes MariaDB was better than MySQL but not good enough to keep up. I even tried putting the database on ram disk and still didn't work. I'm thinking about incorporating Bogofilter but instead of feeding it messages I'm thinking about feeding it the spamassassin results - the rule names it hit + other data about the message and then let it score the rules. That's what I want to experiment with. 
Bogofilter was designed to be used with a MUA. Shellout for each msg can't be very efficient and if you want to share the Bayes DB across several boxes, NFS doesn't seem like a fast option either.

Again.. SA's Redis backend speed and ease of use can't be beat...

There's some help in
https://svn.apache.org/repos/asf/spamassassin/trunk/contrib/HOWTO.Bayes-Redis/

Axb
Re: Trying to understand how bayes works.
On 12/11/2015 07:29 PM, Joe Quinn wrote:
> On 12/11/2015 1:24 PM, Reindl Harald wrote:
>> On 11.12.2015 at 19:12, Axb wrote:
>>> On 12/11/2015 06:51 PM, Reindl Harald wrote:
>>>> well, how many of you trained christmas spam this year while my bayes did know it from last year?
>>>
>>> I like my Bayes fresh like bread out of the oven, new guitar strings and clean sheets.
>>
>> well, i like my bayes to catch spam at every point in time without letting things slip through again that were once already caught - tell me one reason why i should let phishing which was already detected pass through to customers
>>
>> 96% of all milter-rejected mails got 3.5-7.5 points from bayes while at the same time 77% of all scanned mail got -3.5 points - in other words most ham has BAYES_00, most spam has BAYES_80-BAYES_999 - that's what the bayes is supposed to do
>>
>>> Last year's turkey doesn't appeal to me.
>>
>> and what about last year's spam making it through again now until relearn? spammers would have so much more work if they didn't know that in a few months they can re-use their templates after a large enough break, as a spammer i would even automate the scheduling of them
>
> Agreed, and adding that we do see a large percentage of repeat seasonal spam templates. You need at least some of your data to carry over for at least a year, maybe two in order to stay effective.

We're obviously catering to a different user base, seeing different traffic, etc, and have different approaches.

Each method is valid if you're happy with the results and your arsenal does what you need it to do.

I hate stale data... that's all.
Re: Trying to understand how bayes works.
On 11.12.2015 at 18:42, Martin Gregorie wrote:
> For instance, I have two portmanteau rules, SALE (contains sales phrases like "huge discount") and PRODUCT (contains phrases like "fur coat") that are ANDed by a meta called SALESPAM. The nice thing about this approach is that, once the SALE and PRODUCT lists have grown to a decent size the SALESPAM meta starts to fire on previously unseen combinations without generating FPs. The only downside is that, unlike Bayes, you have to build the lists manually but that's probably no worse to do than building a hand-crafted Bayes DB like Reindl does

hand crafted bayes? worse? what nonsense

what's handcrafted there? that i don't trust autolearn and don't like autoexpire?

well, how many of you trained christmas spam this year while my bayes did know it from last year?

how many of you train the same spam types again and again because spammers are aware of autoexpire and just need to stop using a campaign for some weeks until 99% of default setups have forgotten about it

what i do is just KEEP all training messages so that i can rebuild my bayes at any point in time without starting to learn from scratch

since "bayes_token_sources all" came with the last release, and "normalize_charset 1", enabled later, changed its behavior with the latest release, i know why - well, i did know that from the first moment: "keep the corpus in case something in the tokenizer changes later"
Re: Trying to understand how bayes works.
On 12/11/2015 06:51 PM, Reindl Harald wrote: well, how many of you trained chistmas spam this year while my bayes did know it from last year? I like my Bayes fresh like bread out of the oven, new guitar strings and clean sheets. Last years turkey doesn't appeal to me. SCR
Re: Trying to understand how bayes works.
On Fri, 2015-12-11 at 09:05 -0800, Marc Perkel wrote:
> For example. I create rules that look for many phrases about a
> subject and the subject becomes a token. For examples:
>
> JESUS
> ROYALTY
> MONEY
>
> By themselves not an indicator of spam. But if you have all 3 then
> it's definitely spam. The idea is to not look at words but look at the
> meaning of phrases.

This approach works well for me too, but doesn't need Bayes to make it perform: just two or more portmanteau rules[*] that are combined by a meta with a relatively high score. The idea is that the triggering phrases are not spam indicators by themselves, but that the combination is something that virtually never occurs in ham but is a reliable spam indicator.

For instance, I have two portmanteau rules, SALE (contains sales phrases like "huge discount") and PRODUCT (contains phrases like "fur coat") that are ANDed by a meta called SALESPAM. The nice thing about this approach is that, once the SALE and PRODUCT lists have grown to a decent size the SALESPAM meta starts to fire on previously unseen combinations without generating FPs. The only downside is that, unlike Bayes, you have to build the lists manually but that's probably no worse to do than building a hand-crafted Bayes DB like Reindl does.

[*] My term: a portmanteau rule is a rule with a very long alternate list and a low score in the range 0.01 - 0.1. These things are hard to read and maintain, so I have an awk script that generates a syntactically correct SA rule from a file that names the rule, sets the score and has all the regexes written one per line.

Martin
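Martin's awk script is not shown in the thread, so the following is only a guess at the general shape of such a generator, written in Python rather than awk. The spec layout (rule name on line 1, score on line 2, one regex alternative per following line), the function names and the example phrases and meta score are all assumptions for illustration; the emitted `body`, `score` and `meta` lines use ordinary SpamAssassin rule syntax.

```python
def build_portmanteau_rule(spec_text):
    """Turn a rule spec into SpamAssassin 'body' and 'score' lines.

    Assumed (hypothetical) spec layout: line 1 = rule name, line 2 = score,
    every remaining non-empty line = one regex alternative.
    Alternatives containing '/' would need escaping; this sketch does not do that.
    """
    lines = [ln.strip() for ln in spec_text.splitlines() if ln.strip()]
    name, score, alternatives = lines[0], float(lines[1]), lines[2:]
    if not alternatives:
        raise ValueError(f"rule {name} has no alternatives")
    pattern = "|".join(alternatives)
    return [f"body {name} /(?:{pattern})/i", f"score {name} {score}"]


def build_meta(name, parts, score):
    """AND several sub-rules together in a meta rule."""
    expression = " && ".join(parts)
    return [f"meta {name} ({expression})", f"score {name} {score}"]


sale = build_portmanteau_rule("SALE\n0.01\nhuge discount\nprices slashed\n")
product = build_portmanteau_rule("PRODUCT\n0.01\nfur coat\nleather jacket\n")
meta = build_meta("SALESPAM", ["SALE", "PRODUCT"], 3.0)  # 3.0 is an arbitrary example score
print("\n".join(sale + product + meta))
```

Keeping the phrase lists in plain one-per-line files is what makes the long alternations maintainable; the generated rule line itself is never edited by hand.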
Re: Trying to understand how bayes works.
On 12/11/2015 07:24 PM, Reindl Harald wrote:
> On 11.12.2015 at 19:12, Axb wrote:
>> On 12/11/2015 06:51 PM, Reindl Harald wrote:
>>> well, how many of you trained christmas spam this year while my bayes did know it from last year?
>>
>> I like my Bayes fresh like bread out of the oven, new guitar strings and clean sheets.
>
> well, i like my bayes to catch spam at every point in time without letting things slip through again that were once already caught - tell me one reason why i should let phishing which was already detected pass through to customers

In my playpen, phishing is detected by Digests, autogenerated rules, URI strings & domain BLs, etc. Don't need to babysit Bayes for that.

> 96% of all milter-rejected mails got 3.5-7.5 points from bayes while at the same time 77% of all scanned mail got -3.5 points - in other words most ham has BAYES_00, most spam has BAYES_80-BAYES_999 - that's what the bayes is supposed to do

That's what YOU want it to do. Your kitchen, your sauce.

For me, Bayes is just another spice in the rack. And my rack has lots of jars.

My trap flow keeps it all well fed and happy. Without knowing it, even you are using some of my data. I just see it 20 sec before you can make use of it.
Re: Trying to understand how bayes works.
On 12/11/2015 1:24 PM, Reindl Harald wrote:
> On 11.12.2015 at 19:12, Axb wrote:
>> On 12/11/2015 06:51 PM, Reindl Harald wrote:
>>> well, how many of you trained christmas spam this year while my bayes did know it from last year?
>>
>> I like my Bayes fresh like bread out of the oven, new guitar strings and clean sheets.
>
> well, i like my bayes to catch spam at every point in time without letting things slip through again that were once already caught - tell me one reason why i should let phishing which was already detected pass through to customers
>
> 96% of all milter-rejected mails got 3.5-7.5 points from bayes while at the same time 77% of all scanned mail got -3.5 points - in other words most ham has BAYES_00, most spam has BAYES_80-BAYES_999 - that's what the bayes is supposed to do
>
>> Last year's turkey doesn't appeal to me.
>
> and what about last year's spam making it through again now until relearn? spammers would have so much more work if they didn't know that in a few months they can re-use their templates after a large enough break, as a spammer i would even automate the scheduling of them

Agreed, and adding that we do see a large percentage of repeat seasonal spam templates. You need at least some of your data to carry over for at least a year, maybe two in order to stay effective.
Re: Trying to understand how bayes works.
On 11.12.2015 at 19:12, Axb wrote:
> On 12/11/2015 06:51 PM, Reindl Harald wrote:
>> well, how many of you trained christmas spam this year while my bayes did know it from last year?
>
> I like my Bayes fresh like bread out of the oven, new guitar strings and clean sheets.

well, i like my bayes to catch spam at every point in time without letting things slip through again that were once already caught - tell me one reason why i should let phishing which was already detected pass through to customers

96% of all milter-rejected mails got 3.5-7.5 points from bayes while at the same time 77% of all scanned mail got -3.5 points - in other words most ham has BAYES_00, most spam has BAYES_80-BAYES_999 - that's what the bayes is supposed to do

> Last year's turkey doesn't appeal to me.

and what about last year's spam making it through again now until relearn? spammers would have so much more work if they didn't know that in a few months they can re-use their templates after a large enough break, as a spammer i would even automate the scheduling of them
Re: Trying to understand how bayes works.
On 12/10/2015 10:54 PM, Marc Perkel wrote:
> I've had bayes disabled in SA because it seems to not be able to stay working in a high volume situation. The MySQL server can't seem to keep up with it even on very fast computers.

Redis is your friend. Redis over the wire is faster than any local SDBM/DB file based backend. All you need is ram, the more the better.

I use site wide autolearn, auto feed spam from traps - atm, token TTL is 4d

# Clients
connected_clients:35
client_longest_output_list:0
client_biggest_input_buf:0
blocked_clients:0

# Memory
used_memory:3454088112
used_memory_human:3.22G
used_memory_rss:3528310784
used_memory_peak:3454661016
used_memory_peak_human:3.22G
used_memory_lua:116736
mem_fragmentation_ratio:1.02
mem_allocator:jemalloc-3.6.0

if I switch Bayes on or off I notice zero SA scan speed change. Average SA scan time is 0.8 sec/msg

> But - thinking about trying something interesting - doing my own bayes in a different way.
>
> Here's my question. Bayes breaks the message down into some sort of tokens and then does statistics on those tokens as to tokens found in spam vs. tokens found in ham.
>
> But what about combinations of tokens? I'm thinking that I'd like to have something that says when it sees tokens X and Y and Z then that's spam even though X,Y,Z might be in ham when not combined.
>
> Does bayes do that or is there anything that does?

There's tons of Bayes documentation on the net. Different implementations, etc. Enjoy the math... Not really a pure SA topic.
Trying to understand how bayes works.
I've had bayes disabled in SA because it seems to not be able to stay working in a high volume situation. The MySQL server can't seem to keep up with it even on very fast computers.

But - thinking about trying something interesting - doing my own bayes in a different way.

Here's my question. Bayes breaks the message down into some sort of tokens and then does statistics on those tokens as to tokens found in spam vs. tokens found in ham.

But what about combinations of tokens? I'm thinking that I'd like to have something that says when it sees tokens X and Y and Z then that's spam even though X,Y,Z might be in ham when not combined.

Does bayes do that or is there anything that does?

--
Marc Perkel - Sales/Support
supp...@junkemailfilter.com
http://www.junkemailfilter.com
Junk Email Filter dot com
415-992-3400
Re: Trying to understand how bayes works.
On Thu, 10 Dec 2015 13:54:05 -0800 Marc Perkel wrote:
> But what about combinations of tokens? I'm thinking that I'd like to
> have something that says when it sees tokens X and Y and Z then
> that's spam even though X,Y,Z might be in ham when not combined.

The SpamAssassin Bayes implementation does not do that. Some other Bayes implementations do.

Regards,

Dianne.
Re: Trying to understand how bayes works.
Marc Perkel wrote on 2015-12-10 22:54:
> I've had bayes disabled in SA because it seems to not be able to stay working in a high volume situation. The MySQL server can't seem to keep up with it even on very fast computers.

i got a palm Zire that can do ocr on handwritten text :=)

pretty good for the kind of cpu it has

> But - thinking about trying something interesting - doing my own bayes in a different way.

i have tried bogofilter with very good success, and i see problems with bayes here as well, i remember you changed to mariadb? at that time you said it worked better than mysql? did it fail again?

> Here's my question. Bayes breaks the message down into some sort of tokens and then does statistics on those tokens as to tokens found in spam vs. tokens found in ham.
>
> But what about combinations of tokens? I'm thinking that I'd like to have something that says when it sees tokens X and Y and Z then that's spam even though X,Y,Z might be in ham when not combined.
>
> Does bayes do that or is there anything that does?

if z is scored as spam, and x and y are ham, then it's ham

basically that's how bayes works, but a single mail might be lots of digests to compare for this to say spam or not

test bogofilter

put 100 spam mails in a spam folder
put 100 non spam mails in a ham folder

train bogofilter with these 2 folders in one go, not first ham and then spam, it must be done in one bogofilter call

train, configure bogofilter.cf plugin for spamassassin, test it :=)

YMMV
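To make the "z is spam, x and y are ham, so it's ham" remark concrete, here is a small Python sketch that combines independent per-token spam probabilities with the simple naive-Bayes product formula. This is a textbook-style combination chosen for illustration, not necessarily what SpamAssassin or Bogofilter compute internally, and the per-token probabilities are made-up example values.

```python
from math import prod


def combine(probabilities):
    """Combine per-token spam probabilities, treating tokens as independent."""
    probabilities = list(probabilities)
    spam = prod(probabilities)                 # evidence for spam
    ham = prod(1.0 - p for p in probabilities) # evidence for ham
    return spam / (spam + ham)


# Hypothetical per-token spam probabilities: z looks spammy, x and y look hammy.
token_spamminess = {"x": 0.2, "y": 0.2, "z": 0.9}
score = combine(token_spamminess.values())
print(round(score, 2))
```

With these example numbers the two hammy tokens outweigh the single spammy one and the combined value stays below 0.5. Because each token is scored on its own, nothing in this formula can express that x, y and z together mean something different from x, y and z apart.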
Re: Trying to understand how bayes works.
On Fri, 11 Dec 2015 03:31:56 +0100 Benny Pedersen wrote:
> if z is scored as spam, and x and y is ham, then its ham basicly
> that how bayes works, but a single mail might be lots of digest to
> compare for this to say spam or not

The thing is, the probability of token Y is not independent of the previous token, and single-token Bayes misses out on those conditional probabilities.

The example I like to give is that the tokens "red" and "hot" are probably neutral to slightly spammy, and "sex" is probably mildly spammy, but "red hot sex" is way spammier than the individual tokens far apart as in "Yeah, the red chili peppers are hot. Oh, by the way, what was the sex of the baby?"

Regards,

Dianne.
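The difference between single tokens and adjacent-word tokens in the example above can be sketched in a few lines of Python. This is only an illustration of sliding-window tokenization in general; the function name, the window size of 3 and the crude punctuation stripping are assumptions, and no particular filter's tokenizer is being reproduced.

```python
def sliding_window_tokens(text, window=3):
    """Return single words plus every run of 2..window adjacent words."""
    words = [w.strip(".,?!\"'").lower() for w in text.split()]
    words = [w for w in words if w]  # drop anything that was pure punctuation
    tokens = list(words)
    for size in range(2, window + 1):
        for start in range(len(words) - size + 1):
            tokens.append(" ".join(words[start:start + size]))
    return tokens


spammy = sliding_window_tokens("red hot sex")
innocent = sliding_window_tokens(
    "Yeah, the red chili peppers are hot. Oh, by the way, what was the sex of the baby?"
)
print("red hot sex" in spammy, "red hot sex" in innocent)
```

Both messages yield the single tokens "red", "hot" and "sex", but only the first yields the multi-word token "red hot sex", which is the information a single-token classifier never sees.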
Re: Trying to understand how bayes works.
On 12/10/15 18:31, Benny Pedersen wrote: Marc Perkel skrev den 2015-12-10 22:54: I've had bayes disabled in SA because it seems to not be able to stay working in a high volume situation. The MySQL server can't seem to keep up with it even on very fast computers. i got a palm Zire that can do ocr on handwrited text :=) pretty good for the kind of cpu it have But - thinking about trying something interesting - doing my own bayes in a different way. i have tryed bogofilter with very good succes, and i see problems with bayes here aswell, i remember you changed to mariadb ?` at that time you sayed it worked better then mysql ? did it fail again ? Here's my question. Bayes breaks the message down into some sort of tokens and then does statistics on those tokens as to tokens found in spam vs. tokens found in ham. But what about combinations of tokens? I'm thinking that I'd like to have something that says when it sees tokens X and Y and Z then that's spam even though X,Y,Z might be in ham when not combined. Does bayes do that or is there anything that does? if z is scored as spam, and x and y is ham, then its ham basicly that how bayes works, but a single mail might be lots of digest to compare for this to say spam or not test bogofilter put 100 spam mails in a spam folder put 100 non spam mails in a ham folder train bogofilter with this 2 folders in one go, not first ham and then spam, it must be done in one bogofilter call train, configure bogofilter.cf plugin for spamassassin, test it :=) YMMV Yes MariaDB was better than MySQL but not good enough to keep up. I even tried putting the database on ram disk and still didn't work. I'm thinking about incorporating Bogofilter but instead of feeding it messages I'm thinking about feeding it the spamassassin results - the rule names it hit + other data about the message and then let it score the rules. That's what I want to experiment with. 
-- Marc Perkel - Sales/Support supp...@junkemailfilter.com http://www.junkemailfilter.com Junk Email Filter dot com 415-992-3400