Re: [Dspam-user] training time?
On Fri, 16 Apr 2010 15:41:04 +0800 Michael Alger wrote:
> [...]
> Okay, now I get why you see a big difference between the modes.
> Since I live in a perfect fantasy world where all classification
> errors are corrected, I wasn't seeing the significance. :)
>
The perfect fantasy world is where we all (the ones responsible for DSPAM, aka the admins) live. But lazy users are unfortunately the reality.

> I was browsing through the README looking to see if dspam had any
> nice hooks for helping to build my own corpus,
>
# dspam_admin change preference ds...@mm.quex.org "makeCorpus" "on"

Or as the default for everyone:

# dspam_admin change preference default "makeCorpus" "on"

Or, if you don't use the preference extension, set it in dspam.conf or in the individual user preference files.

> and came across this:
>
>     tum: Train-until-Mature. This training mode is a hybrid
>     between the other two training modes and provides a great
>     balance between volatility and static metadata.
>
> So apparently I'm not the only one that sees TUM as something of a
> combination between TEFT and TOE. However, the explanation of TUM
> in the README doesn't mention TL as affecting whether it learns or
> not.
>
From the learning-method viewpoint it is not a hybrid. TEFT and TUM learn without you telling them to learn; TOE only learns when you tell it to learn. That is the reason why I said that TUM should not be compared to TOE. But looking at how TUM works as a whole, it is indeed a hybrid.

> Is this out of date?
>
No. It is still valid.

> The explanation (abridged from the README version):
>
>     TuM will train on a per-token basis only tokens which have had
>     fewer than 50 "hits" on them, unless an error is being retrained,
>     in which case all tokens are trained.
>
>     NOTE: You should corpus train before using tum.
>
> suggests to me that it actually learns a little differently than
> TEFT (and without regard to TL), in that tokens that already have 50
> hits on them will be ignored.
>
The documentation is not 100% clear in this regard. Only default tokens (BNR tokens don't fall into that category; they are another token type, not default tokens) having ((spam_hits + innocent_hits) < 50) are automatically trained by TUM. But for the classification, every token is used in TUM. The code that does the magic with regard to training is this here:
---
if (ds_term->type == 'D' && (
      CTX->training_mode != DST_TUM ||
      CTX->source == DSS_ERROR ||
      CTX->source == DSS_INOCULATION ||
      ds_term->s.spam_hits + ds_term->s.innocent_hits < 50 ||
      ds_term->key == diction->whitelist_token ||
      CTX->confidence < 0.70))
{
  ds_term->s.status |= TST_DIRTY;
}
---
Translated, that means:
---
if ([current token type] is [default token]) and (
     ([training mode] is not [TUM]) or
     ([current message source] is [ERROR]) or
     ([current message source] is [INOCULATION]) or
     ([current token spam hits] + [current token innocent hits] is less than [50]) or
     ([current token key] is [WHITELIST]) or
     ([current message confidence] is less than [0.70 (aka 70%)])
   )
then
   mark [current token] as [DIRTY]
end if
---
Marking a token as dirty instructs DSPAM to save the updated token data back to the used storage backend.

Let's take an example:
* the training mode is TUM
* the message source is not ERROR
* the message source is not INOCULATION
* for simplicity, assume all default tokens of the message have 20 innocent hits and 20 spam hits
* for simplicity, assume the message has no whitelist token
* the whole message has a confidence of 0.80

Then the above condition would evaluate (for each individual token) to:
(true) and (false or false or false or true or false or false) -> (true) and (true) => true

So each of the tokens would be marked dirty (aka: the token is learned), because ((spam_hits + innocent_hits) < 50) is TRUE.

Now use the same values, but this time each token has 40 spam hits and 40 innocent hits AND the whole message has a confidence of 0.65. Then the above condition would evaluate (for each individual token) to:
(true) and (false or false or false or false or false or true) -> (true) and (true) => true

As you see, the individual tokens would still be trained by TUM because the whole message has a confidence of less than 0.70. The training is performed even though each of the individual tokens has a (spam_hits + innocent_hits) above 50 (80 in our example).

To sum it up: TUM trains a message (respectively parts of a message) if one of the following conditions applies (see the sketch after this list):
* the source is ERROR (aka --source=error)
* the source is INOCULATION (aka --source=inoculation)
* an individual token has (spam_hits + innocent_hits) < 50
* an individual token is a whitelist token
* the whole message has a confidence of less than 0.70
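To make the two walk-throughs above concrete, here is a minimal, self-contained C sketch of the same boolean condition. The struct and field names are illustrative stand-ins, not DSPAM's actual CTX/ds_term structures; only the logic mirrors the quoted code:
---
#include <stdio.h>

/* Simplified stand-ins for DSPAM's CTX and ds_term (names are illustrative). */
struct msg_ctx { int is_tum, src_error, src_inoculation; double confidence; };
struct token   { char type; int spam_hits, innocent_hits, is_whitelist; };

/* Mirrors the dirty-token condition quoted above. */
static int tum_trains_token(const struct msg_ctx *m, const struct token *t)
{
    return t->type == 'D' && (
        !m->is_tum ||
        m->src_error ||
        m->src_inoculation ||
        t->spam_hits + t->innocent_hits < 50 ||
        t->is_whitelist ||
        m->confidence < 0.70);
}

int main(void)
{
    struct token   tok1 = { 'D', 20, 20, 0 };  /* 40 hits total: immature */
    struct token   tok2 = { 'D', 40, 40, 0 };  /* 80 hits total: mature   */
    struct msg_ctx m1   = { 1, 0, 0, 0.80 };   /* example 1               */
    struct msg_ctx m2   = { 1, 0, 0, 0.65 };   /* example 2               */

    /* Example 1: token has 20+20 < 50 hits -> trained despite 0.80 confidence. */
    printf("example 1: %s\n", tum_trains_token(&m1, &tok1) ? "dirty" : "clean");
    /* Example 2: token has 40+40 >= 50 hits, but confidence < 0.70 -> still trained. */
    printf("example 2: %s\n", tum_trains_token(&m2, &tok2) ? "dirty" : "clean");
    return 0;
}
---
Both runs print "dirty": the first because the tokens are still immature, the second because the whole-message confidence is below 0.70.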
Re: [Dspam-user] training time?
On Thu, Apr 15, 2010 at 12:27:47PM +0200, Stevan Bajić wrote:
> On Thu, 15 Apr 2010 17:35:43 +0800 Michael Alger wrote:
> > [...]
>
> Learning really only happens if you tell DSPAM that a message
> needs to be reclassified or corpus-fed. Or when using TEFT
> (regardless of TL) or TUM (only while TL > 0).
>
> But in order to use Bayes' theorem, DSPAM also needs to know how
> many messages it has seen in total. So it is logical that it keeps
> track of that by updating the table dspam_stats and incrementing
> "spam_classified" and/or "innocent_classified".

Thanks. That makes sense. Also thanks for the other explanations of the statistical theory behind it all, which make things a lot clearer for me as well.

> > I think saying "TOE is totally different from {NOTRAIN, TEFT,
> > TUM}" is a little strong. It seems to me that TEFT and TOE are
> > quite different, while TUM is a combination of the two: TEFT
> > until it has enough data, and then TOE. Or have I misunderstood?
>
> Yes, you have misunderstood. TUM and TEFT can learn something
> wrong, while TOE only learns something when you tell it to learn.
> TUM and TEFT learn by themselves: they FIRST learn and then depend
> on you to FIX errors. TOE does not do that. TOE only learns when
> you want it to learn.

Okay, now I get why you see a big difference between the modes. Since I live in a perfect fantasy world where all classification errors are corrected, I wasn't seeing the significance. :)

I was browsing through the README looking to see if dspam had any nice hooks for helping to build my own corpus, and came across this:

    tum: Train-until-Mature. This training mode is a hybrid
    between the other two training modes and provides a great
    balance between volatility and static metadata.

So apparently I'm not the only one who sees TUM as something of a combination of TEFT and TOE. However, the explanation of TUM in the README doesn't mention TL as affecting whether it learns or not. Is this out of date?

The explanation (abridged from the README version):

    TuM will train on a per-token basis only tokens which have had
    fewer than 50 "hits" on them, unless an error is being retrained,
    in which case all tokens are trained.

    NOTE: You should corpus train before using tum.

suggests to me that it actually learns a little differently than TEFT (and without regard to TL), in that tokens that already have 50 hits on them will be ignored.

Thanks again for all your explanations.
Re: [Dspam-user] training time?
On Thu, 15 Apr 2010 17:47:41 +0200 Stevan Bajić wrote:
> On Thu, 15 Apr 2010 17:35:43 +0800 Michael Alger wrote:
> > [...]
> > However, I don't understand why simply classifying a message using
> > TOE decrements the Training Left counter. My understanding is that
> > token statistics are only updated when retraining a misclassified
> > message; classifying a message shouldn't cause any changes here, and
> > thus logically shouldn't be construed as "training" the system.
> >
> > Is this done purely so the statistical sedation is deactivated in
> > TOE mode after 2,500 messages have been processed, or are there
> > other reasons?
> >
> You have the classic problem with statistical thinking. There is an
> example, found in a lot of psychological literature, that
> demonstrates the problem most humans have with it. The problem is
> known in the socio-psychological literature as the "taxi/cab
> problem". Let me quickly show you the example:
>
> Two taxi companies operate in a city. The taxis of company A are
> green, those of company B blue. Company A runs 15% of the taxis,
> company B the remaining 85%. One night there is a hit-and-run
> accident. The fleeing car was a taxi. A witness states that it was
> a green taxi.
>
> The court orders a test of the witness's ability to differentiate
> between green and blue taxis under night viewing conditions. The
> test result: in 80% of the cases the witness identified the correct
> color, and in the remaining 20% of the cases he was wrong.
>
> How high is the probability that the fleeing taxi the witness saw
> that night was a (green) taxi from company A?
>
> Most people spontaneously answer 80%. In fact, a study has shown
> that a majority of the persons asked (among them physicians, judges
> and students of elite universities) answer the question with 80%.
>
> But the correct answer is not 80%. :)
>
> Allow me to explain: the whole city has 1'000 taxis. 150 (green)
> belong to company A and 850 (blue) belong to company B. One of
> those 1'000 taxis is responsible for the accident. The witness says
> he saw a green taxi, and we know that he is correct in 80% of the
> cases. That also means that he calls a blue taxi green in 20% of
> the cases. Of the 850 blue taxis he will thus call 170 green (false
> positives), and of the 150 green taxis he will correctly identify
> 120 as green (true positives). To calculate the probability that he
> actually saw a green taxi when he identifies a taxi (under night
> viewing conditions) as green, you divide all correct answers (TP)
> of "green" by all answers (FP + TP) of "green". The probability is
> therefore: 120 / (170 + 120) = 0.41
>
> So the probability that a green taxi caused the accident, given
> that the witness believes he saw a green taxi, is less than 50%.
> This probability depends crucially on the distribution of the green
> and blue taxis in the city. Were there equal numbers of green and
> blue taxis in the city, the correct answer would indeed be 80%.
>
> Most humans, however, tend to ignore the initial distribution (also
> called the a-priori, base or initial probability). Psychologists
> speak in this connection of "base rate neglect".
>
> Here is a more detailed description of "base rate neglect" from
> Wikipedia: http://en.wikipedia.org/wiki/Base_rate_fallacy
>
> And now back to your original statement:
>
> > However, I don't understand why simply classifying a message using
> > TOE decrements the Training Left counter. My understanding is that
> > token statistics are only updated when retraining a misclassified
> > message; classifying a message shouldn't cause any changes here, and
> > thus logically shouldn't be construed as "training" the system.
> >
> Without DSPAM keeping track of the TP/TN, the whole calculation
> from above would not be possible. DSPAM would not know that there
> are 1'000 taxis. It would only know about 30 green taxis and 170
> blue taxis. You might now ask yourself why 30 green and why 170
> blue. Easy (assuming green = bad/spam and blue = good/ham):
> * 1'000 taxis (processed messages) -> TP + TN
> * 170 taxis identified as green (spam) but actually blue (ham) -> FP
> * 30 taxis identified as blue (ham) but actually green (spam) -> FN
>
> Without knowing TP and TN, the whole Bayes theorem calculation
> would not be possible. So DSPAM must keep track of them. It is
> indeed not a learning thing, but for the computation of the
> probability it is crucial to know those values.
>
> And since the statistical sedation implemented in DSPAM waters down
> the result in order to minimize FP, the whole Training Left (TL)
> value was introduced in DSPAM as a way to limit that watering-down
> phase.
Re: [Dspam-user] training time?
On Thu, 15 Apr 2010 17:35:43 +0800 Michael Alger wrote:
> [...]
> However, I don't understand why simply classifying a message using
> TOE decrements the Training Left counter. My understanding is that
> token statistics are only updated when retraining a misclassified
> message; classifying a message shouldn't cause any changes here, and
> thus logically shouldn't be construed as "training" the system.
>
> Is this done purely so the statistical sedation is deactivated in
> TOE mode after 2,500 messages have been processed, or are there
> other reasons?
>
You have the classic problem with statistical thinking. There is an example, found in a lot of psychological literature, that demonstrates the problem most humans have with it. The problem is known in the socio-psychological literature as the "taxi/cab problem". Let me quickly show you the example:

Two taxi companies operate in a city. The taxis of company A are green, those of company B blue. Company A runs 15% of the taxis, company B the remaining 85%. One night there is a hit-and-run accident. The fleeing car was a taxi. A witness states that it was a green taxi.

The court orders a test of the witness's ability to differentiate between green and blue taxis under night viewing conditions. The test result: in 80% of the cases the witness identified the correct color, and in the remaining 20% of the cases he was wrong.

How high is the probability that the fleeing taxi the witness saw that night was a (green) taxi from company A?

Most people spontaneously answer 80%. In fact, a study has shown that a majority of the persons asked (among them physicians, judges and students of elite universities) answer the question with 80%.

But the correct answer is not 80%. :)

Allow me to explain: the whole city has 1'000 taxis. 150 (green) belong to company A and 850 (blue) belong to company B. One of those 1'000 taxis is responsible for the accident. The witness says he saw a green taxi, and we know that he is correct in 80% of the cases. That also means that he calls a blue taxi green in 20% of the cases. Of the 850 blue taxis he will thus call 170 green (false positives), and of the 150 green taxis he will correctly identify 120 as green (true positives). To calculate the probability that he actually saw a green taxi when he identifies a taxi (under night viewing conditions) as green, you divide all correct answers (TP) of "green" by all answers (FP + TP) of "green". The probability is therefore: 120 / (170 + 120) = 0.41

So the probability that a green taxi caused the accident, given that the witness believes he saw a green taxi, is less than 50%. This probability depends crucially on the distribution of the green and blue taxis in the city. Were there equal numbers of green and blue taxis in the city, the correct answer would indeed be 80%.

Most humans, however, tend to ignore the initial distribution (also called the a-priori, base or initial probability). Psychologists speak in this connection of "base rate neglect".

Here is a more detailed description of "base rate neglect" from Wikipedia: http://en.wikipedia.org/wiki/Base_rate_fallacy

And now back to your original statement:

> However, I don't understand why simply classifying a message using
> TOE decrements the Training Left counter. My understanding is that
> token statistics are only updated when retraining a misclassified
> message; classifying a message shouldn't cause any changes here, and
> thus logically shouldn't be construed as "training" the system.
>
Without DSPAM keeping track of the TP/TN, the whole calculation from above would not be possible. DSPAM would not know that there are 1'000 taxis. It would only know about 30 green taxis and 170 blue taxis. You might now ask yourself why 30 green and why 170 blue. Easy (assuming green = bad/spam and blue = good/ham):
* 1'000 taxis (processed messages) -> TP + TN
* 170 taxis identified as green (spam) but actually blue (ham) -> FP
* 30 taxis identified as blue (ham) but actually green (spam) -> FN

Without knowing TP and TN, the whole Bayes theorem calculation would not be possible. So DSPAM must keep track of them. It is indeed not a learning thing, but for the computation of the probability it is crucial to know those values.

And since the statistical sedation implemented in DSPAM waters down the result in order to minimize FP, the whole Training Left (TL) value was introduced in DSPAM as a way to limit that watering-down phase. So the more positive/negative classifications DSPAM has made, the more mature the tokens are considered to be. After 2'500 TP/TN, the statistical sedation gets automatically disabled. I hope you now understand better why we need to update the statistics even when a message is only classified.
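For anyone who wants to verify the arithmetic of the taxi/cab example, here is a small, self-contained C sketch. The variable names are illustrative; this is just Bayes' theorem applied to the numbers above, not DSPAM code:
---
#include <stdio.h>

int main(void)
{
    /* Taxi/cab problem from the post above. */
    const double total = 1000.0;           /* all taxis (processed messages)   */
    const double green = 0.15 * total;     /* company A, actual "spam"  -> 150 */
    const double blue  = 0.85 * total;     /* company B, actual "ham"   -> 850 */
    const double acc   = 0.80;             /* witness (classifier) accuracy    */

    const double tp = acc * green;         /* green called green: 120          */
    const double fp = (1.0 - acc) * blue;  /* blue called green:  170          */

    /* P(actually green | called green) = TP / (TP + FP) */
    printf("P = %.2f\n", tp / (tp + fp));  /* prints P = 0.41 */
    return 0;
}
---
The computation only works because the totals are known; without them, only the corrected errors (FP/FN) would be visible, which is exactly the point being made about the classified-message counters.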
Re: [Dspam-user] training time?
On Thu, 15 Apr 2010 17:35:43 +0800 Michael Alger wrote:
> [...]
> Thank you for this explanation and after a quick test I see that the
> TL counter does decrement (and TN increments) when I process mail
> using TOE. If I set it to NOTRAIN, then none of the statistics are
> updated when the messages are processed.
>
Right.

> However, I don't understand why simply classifying a message using
> TOE decrements the Training Left counter. My understanding is that
> token statistics are only updated when retraining a misclassified
> message; classifying a message shouldn't cause any changes here, and
> thus logically shouldn't be construed as "training" the system.
>
You are right and wrong. When classifying a message:
1) if using TEFT (regardless of TL) or TUM (only while TL > 0), the table dspam_token_data gets updated and/or new entries are added.
2) if using TOE or NOTRAIN, the table dspam_token_data does NOT get any new entries.
3) if using TOE, existing entries in the table dspam_token_data (aka tokens) will get their "last_hit" updated, but neither "spam_hits" nor "innocent_hits" will be updated.
4) if using TOE or TEFT or TUM, the table dspam_stats will be updated, but only the fields "spam_classified" and/or "innocent_classified".

Learning is another issue. When learning, the stats get updated as well (fields: "spam_learned", "innocent_learned", "spam_misclassified", "innocent_misclassified", "spam_corpusfed", "innocent_corpusfed").

Learning really only happens if you tell DSPAM that a message needs to be reclassified or corpus-fed. Or when using TEFT (regardless of TL) or TUM (only while TL > 0).

But in order to use Bayes' theorem, DSPAM also needs to know how many messages it has seen in total. So it is logical that it keeps track of that by updating the table dspam_stats and incrementing "spam_classified" and/or "innocent_classified".

> Is this done purely so the statistical sedation is deactivated in
> TOE mode after 2,500 messages have been processed, or are there
> other reasons?
>
Yes. It's only for the statistical sedation.

> Does TUM base its decision to learn purely on the value of the TL
> counter (i.e. stops learning once that reaches 0), or is the TL just
> a hint and TUM actually uses some heuristic based on the number of
> tokens available to it and their scores?
>
No. TUM is 100% like TEFT until it reaches TL = 0. So TUM and TEFT are FORCING A LEARNING on each message they see. TOE only really learns if you tell it to learn (no implicit learning, only explicit learning).

To sum it up (see the sketch after this message):
* TEFT (regardless of TL) and TUM (only while TL > 0) LEARN EVERY message they see.
* TOE only learns if you TELL IT TO LEARN.
* TEFT (regardless of TL) and TUM (only while TL > 0) can even LEARN WRONG and depend on you to fix their errors. If you run TEFT, or TUM (until TL = 0), and you DON'T correct errors, then the quality of your tokens can decrease (it can increase as well, but only if no classified message was a FP or a FN).

> Is TL used by anything other than the statistical sedation feature?
>
No.

> I think saying "TOE is totally different from {NOTRAIN, TEFT, TUM}"
> is a little strong. It seems to me that TEFT and TOE are quite
> different, while TUM is a combination of the two: TEFT until it has
> enough data, and then TOE. Or have I misunderstood?
>
Yes, you have misunderstood. TUM and TEFT could possibly learn something wrong, while TOE only learns something when you tell it to learn. TUM and TEFT learn by themselves: they FIRST learn and then depend on you to FIX errors. TOE does not do that. TOE only learns when you want it to learn.

Allow me to illustrate something. Assume you have 1000 tokens in DSPAM. And assume you have a corpus A with 100 messages and a corpus B with 100 messages.

Test case 1) Assume you use TEFT/TUM and you check all those mails from corpus A. And assume you get 100% accuracy.

Test case 2) Assume you use TOE and you check all those mails from corpus A. And assume you get 100% accuracy as well.

So far, so good. Now assume you only CLASSIFY corpus B with test case 1 and with test case 2. And assume we don't care about the result we get by just classifying the mails from corpus B. Now go back and repeat the classification of corpus A, the same way as done above.

With test case 2 you will again get 100%. For sure! With test case 1 you have a high chance of NOT getting 100% again. The reason is that TEFT and TUM changed "spam_learned" and "innocent_learned" while they only CLASSIFIED corpus B. They learned, even though you told them only to classify corpus B. Do you understand what I mean?

-- 
Kind Regards from Switzerland,

Stevan Bajić
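The four cases in the numbered list above can be condensed into a small sketch. This is NOT DSPAM source code; it is an illustrative C outline (all names hypothetical) of which dspam_token_data / dspam_stats fields a mere classification touches per training mode, as described in this thread:
---
#include <stdbool.h>
#include <stdio.h>

enum mode { NOTRAIN, TOE, TUM, TEFT };

/* Which tables/fields a plain classification touches (illustrative only). */
struct effects {
    bool new_tokens;   /* new rows in dspam_token_data                 */
    bool token_hits;   /* spam_hits / innocent_hits updated            */
    bool last_hit;     /* last_hit refreshed on existing tokens        */
    bool classified;   /* spam_classified / innocent_classified bumped */
};

static struct effects on_classify(enum mode m, int training_left)
{
    struct effects e = { false, false, false, false };
    /* TEFT always learns; TUM learns only while TL > 0. */
    bool learns = (m == TEFT) || (m == TUM && training_left > 0);

    if (learns) {
        e.new_tokens = e.token_hits = e.last_hit = true;
    } else if (m == TOE) {
        e.last_hit = true;   /* TOE: only last_hit, no hit counters */
    }
    if (m != NOTRAIN)
        e.classified = true; /* NOTRAIN updates no statistics at all */
    return e;
}

int main(void)
{
    struct effects e = on_classify(TOE, 2500);
    printf("TOE: new_tokens=%d token_hits=%d last_hit=%d classified=%d\n",
           e.new_tokens, e.token_hits, e.last_hit, e.classified);
    return 0;
}
---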
Re: [Dspam-user] training time?
On Mon, Apr 12, 2010 at 09:18:43PM +0200, Stevan Bajić wrote:
> On Sat, 10 Apr 2010 17:59:25 +0800 Michael Alger wrote:
> > On Fri, Apr 09, 2010 at 11:23:16PM -0700, Terry Barnum wrote:
> > >>> I've been running DSPAM for approximately 2 weeks and looking
> > >>> at the output of dspam_stats, I'm curious how long training
> > >>> normally takes.
> > >>>
> > >>> $ cat /usr/local/dspam.conf | grep -v ^# | grep -v ^$
> > >>>
> > >>> TrainingMode toe
> > >>> Preference "trainingMode=TOE"
> >
> > Your default settings are TOE mode. Are you overriding this for any
> > of the users in their preferences? If not, this would explain why
> > it's only learning from errors: because you told it to.
> >
> > Try switching this to TUM or TEFT.
> >
> I think most users here don't understand what training is in the
> context of anti-spam. So I am going to try to explain quickly what
> all those different training modes are.

Thank you for this explanation, and after a quick test I see that the TL counter does decrement (and TN increments) when I process mail using TOE. If I set it to NOTRAIN, then none of the statistics are updated when the messages are processed.

However, I don't understand why simply classifying a message using TOE decrements the Training Left counter. My understanding is that token statistics are only updated when retraining a misclassified message; classifying a message shouldn't cause any changes here, and thus logically shouldn't be construed as "training" the system.

Is this done purely so the statistical sedation is deactivated in TOE mode after 2,500 messages have been processed, or are there other reasons?

> TUM is exactly like TEFT. He takes the test, and after the test he
> also buys a book (+/- 100 pages) about the tested topic and
> reads/learns the book. But as soon as he has successfully passed
> 2'500 tests, he changes his strategy and stops buying books after
> he has passed a test. He only buys and reads/learns a book if he
> has failed a test.

Does TUM base its decision to learn purely on the value of the TL counter (i.e. stops learning once that reaches 0), or is the TL just a hint and TUM actually uses some heuristic based on the number of tokens available to it and their scores?

Is TL used by anything other than the statistical sedation feature?

> TOE is totally different from the above 3. He takes a test, and if
> he fails to pass the test, he goes on and buys a book (+/- 100
> pages) about the tested topic and reads/learns the book. He does
> that forever. For every test he takes, he does the same. If he
> passes the test, he does not buy the book and he does not read
> those +/- 100 pages. He has just passed the test and he knows that
> he has passed, so there is no need for him to invest time in
> reading 100 pages for nothing. He is already knowledgeable in the
> topic he was tested on (remember: he passed the test).

I think saying "TOE is totally different from {NOTRAIN, TEFT, TUM}" is a little strong. It seems to me that TEFT and TOE are quite different, while TUM is a combination of the two: TEFT until it has enough data, and then TOE. Or have I misunderstood?
Re: [Dspam-user] training time?
On Sat, 10 Apr 2010 17:59:25 +0800 Michael Alger wrote:
> On Fri, Apr 09, 2010 at 11:23:16PM -0700, Terry Barnum wrote:
> >>> I've been running DSPAM for approximately 2 weeks and looking
> >>> at the output of dspam_stats, I'm curious how long training
> >>> normally takes.
> >>>
> >>> $ cat /usr/local/dspam.conf | grep -v ^# | grep -v ^$
> >>>
> >>> TrainingMode toe
> >>> Preference "trainingMode=TOE"
>
> Your default settings are TOE mode. Are you overriding this for any
> of the users in their preferences? If not, this would explain why
> it's only learning from errors: because you told it to.
>
> Try switching this to TUM or TEFT.
>
I think most users here don't understand what training is in the context of anti-spam. So I am going to try to explain quickly what all those different training modes are. I will try to avoid the technical/mathematical/statistical mumbo-jumbo and use something else. Sorry if I make too many grammatical errors; I have a hard working day behind me and I am just going to type here without taking much care about proper English. The example I will use is way oversimplified, but good enough to explain the topic.

Okay. DSPAM has the following training modes:
* NOTRAIN => Do not do any training
* TEFT => Train Everything (some say: Train Every F***ing Time)
* TUM => Train Until Mature
* TOE => Train On Error
* UNLEARN => Unlearn the (previous) training

Now my example: let us assume we have a young human who wants to be a specialist in a specific knowledge area/domain. At the beginning, that young human does not know anything about the specific area. Let us assume that this specific area has a lot of material that can be learned. The learning material is immense, infinite; you never stop learning. But let us assume that, in general, a human is considered to be a specialist in that area/domain after he/she has passed 2'500 tests. Now let us assume that each piece of training material is a book with +/- 100 pages, and that you can take a test for each topic.

Now let us assume we have 4 young boys trying to become specialists. They are called (I know, I know, stupid names, but anyway):
* NOTRAIN
* TEFT
* TUM
* TOE

NOTRAIN never trains. He just relies on what he has learned in the past; he takes any test without learning before the test, and he does not learn after the test either. He just takes the test and, regardless of the result, continues with the next test.

TEFT, on the other hand, takes the test like NOTRAIN, but each time after he has taken the test, he buys a book (+/- 100 pages) about the tested topic and reads/learns the book. And he continues this for each and every test. He does not stop after he has successfully passed 2'500 topic tests. He takes test 2'500 and 2'501 and 2'502 and so on. He never ever stops learning (FORCED LEARNING).

TUM is exactly like TEFT. He takes the test, and after the test he also buys a book (+/- 100 pages) about the tested topic and reads/learns it. But as soon as he has successfully passed 2'500 tests, he changes his strategy and stops buying books after he has passed a test. He only buys and reads/learns a book if he has failed a test.

TOE is totally different from the above 3. He takes a test, and if he fails to pass it, he goes on and buys a book (+/- 100 pages) about the tested topic and reads/learns the book. He does that forever; for every test he takes, he does the same. If he passes the test, he does not buy the book and does not read those +/- 100 pages. He has just passed the test and he knows that he has passed, so there is no need for him to invest time in reading 100 pages for nothing. He is already knowledgeable in the topic he was tested on (remember: he passed the test).

So now allow me to glue DSPAM together with the above example. In the DSPAM world, those 2'500 tests would be TL (Training Left). And in the DSPAM world, each of the trainees from above (except NOTRAIN and obviously UNLEARN) would take extra care while they have not yet passed at least 2'500 tests. The extra care is the option called "statisticalSedation". This is a parameter that allows DSPAM to water down the catch rate (the catching of spam). This parameter exists for those out there who are absolutely paranoid about FPs (false positives). I could now go on and explain the mathematical/statistical reasoning behind that parameter, but I'll save myself some time and not explain it here. For now, just accept that the parameter is there and that it allows you to tune how aggressively DSPAM will try to catch spam while it has not yet processed at least 2'500 innocent messages.

Okay. I think that now most of you should +/- understand what those training modes are and how they work in DSPAM. Each of those modes has a reason to be there. A lot of you might now think that some of those modes are useless and others more useful. Right. All of them have their use cases.
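For reference: sedation is an ordinary preference. The snippet below simply mirrors the dspam.conf and dspam_admin syntax that appears elsewhere in this thread (the value 5 is an arbitrary example):
---
# In dspam.conf (value range { 0 - 10 }, default 0):
Preference "statisticalSedation=5"

# Or via the preference extension, as a default for everyone:
# dspam_admin change preference default "statisticalSedation" "5"
---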
Re: [Dspam-user] training time?
On Sat, 10 Apr 2010 11:33:15 -0700 Terry Barnum wrote:
> [...]
> That's what I'm wondering too. Could the train.dspam script somehow
> trigger a reset of those fields?
>
As with everything in life: everything is possible. But quickly looking over the script, I don't see anything that would explain a reset.

> It's very possible I have a stupid misconfiguration problem and I
> very much appreciate the help. This is my first postfix/dovecot
> install and I'm learning something every day.
>
That is possible too. Could you send me your main.cf, your master.cf and your dovecot.conf directly? What are you using to manage users? Any specific tool? Which tool?

> [...]
>
> $ cat dspam_filter_access
> /./ FILTER dspam:dspam
>
Okay. I see.

> [...]
>
> Yes. Is this not a good approach?
>
It is not something one would call a bad approach. I, however, have had issues in the past when using FILTER, especially when piping to DSPAM (or any other application) mails that contain non-Latin characters. Then FILTER in conjunction with pipe breaks very often.

> Also, I'm not sure if this helps the diagnosis, but here's the
> "dspam_admin list preference default" output that shows the change
> you suggested to force signatureLocation into the header.
>
> $ sudo dspam_admin list preference default
> signatureLocation=headers
>
> Thanks,
> -Terry
>
-- 
Kind Regards from Switzerland,

Stevan Bajić
Re: [Dspam-user] training time?
On Apr 10, 2010, at 3:27 AM, Stevan Bajić wrote:
> On Fri, 9 Apr 2010 23:23:16 -0700 Terry Barnum wrote:
>> On Apr 9, 2010, at 7:21 PM, Stevan Bajić wrote:
>>> On Fri, 9 Apr 2010 19:00:54 -0700 Terry Barnum wrote:
>>>> I've been running DSPAM for approximately 2 weeks and looking at
>>>> the output of dspam_stats, I'm curious how long training normally
>>>> takes.
>>>>
>>>> A script is run nightly to check .Junk mailboxes for false
>>>> negatives and .NotJunk mailboxes for false positives, and
>>>> retrains on error. (Richard Valk's
>>>> http://switch.richard5.net/serverinstall/train.dspam)
>>>>
>>>> Here's sample output from dspam_stats -H
>>>> [...]
>>> This all looks to me as if you are not using DSPAM at all. It
>>> seems to me that only the script from
>>> http://switch.richard5.net/serverinstall/train.dspam is feeding
>>> DSPAM with data in your setup.
>>
>> Thank you for your help Stevan. My understanding of how this is
>> supposed to eventually work is: DSPAM analyzes and adds a header to
>> email as Innocent or Spam, and the MUA, which is configured to
>> trust the Spam header, moves mail into the Junk mailbox if DSPAM
>> classified it as Spam. The MUA has its own Junk filtering and moves
>> mail it considers spam into the Junk mailbox too. So the nightly
>> script may run across mail in the Junk mailbox that was
>> misclassified as Innocent but is actually spam, and retrains it as
>> a false negative. Conversely, if DSPAM incorrectly classifies mail
>> as spam, the user moves that email from the Junk mailbox into the
>> NotJunk mailbox so the nightly script can retrain it as a false
>> positive.
>>
> So what it does is basically what the Dovecot anti-spam plugin does.
> The plugin, however, does it in real time, while the script you have
> there does it on a scheduled basis.
>
>> DSPAM appears to be correctly adding headers, but so far I've seen
>> only Whitelisted and Innocent.
>>
> But how is it possible that you have almost everywhere 0 for TN/TP?
> If DSPAM were working properly, then TP/TN would increase every time
> you get a mail.

That's what I'm wondering too. Could the train.dspam script somehow trigger a reset of those fields?

It's very possible I have a stupid misconfiguration problem and I very much appreciate the help. This is my first postfix/dovecot install and I'm learning something every day.
Re: [Dspam-user] training time?
On Sat, 10 Apr 2010 17:59:25 +0800 Michael Alger wrote:
> On Fri, Apr 09, 2010 at 11:23:16PM -0700, Terry Barnum wrote:
> >>> I've been running DSPAM for approximately 2 weeks and looking
> >>> at the output of dspam_stats, I'm curious how long training
> >>> normally takes.
> >>>
> >>> $ cat /usr/local/dspam.conf | grep -v ^# | grep -v ^$
> >>>
> >>> TrainingMode toe
> >>> Preference "trainingMode=TOE"
>
> Your default settings are TOE mode. Are you overriding this for any
> of the users in their preferences? If not, this would explain why
> it's only learning from errors: because you told it to.
>
> Try switching this to TUM or TEFT.
>
I would advise AGAINST going to TEFT. The problem he is describing does not have much to do with the training mode. Even in TOE, the TP/TN counters should increase each time he gets a new mail. So something is fishy in his setup. Those TP/TN numbers should increase with each inbound mail, regardless of the training mode.

-- 
Kind Regards from Switzerland,

Stevan Bajić
Re: [Dspam-user] training time?
On Fri, 9 Apr 2010 23:23:16 -0700 Terry Barnum wrote:
> On Apr 9, 2010, at 7:21 PM, Stevan Bajić wrote:
> > On Fri, 9 Apr 2010 19:00:54 -0700 Terry Barnum wrote:
> >> I've been running DSPAM for approximately 2 weeks and looking at
> >> the output of dspam_stats, I'm curious how long training normally
> >> takes.
> >>
> >> A script is run nightly to check .Junk mailboxes for false
> >> negatives and .NotJunk mailboxes for false positives, and
> >> retrains on error. (Richard Valk's
> >> http://switch.richard5.net/serverinstall/train.dspam)
> >>
> >> Here's sample output from dspam_stats -H
> >> [...]
> > This all looks to me as if you are not using DSPAM at all. It
> > seems to me that only the script from
> > http://switch.richard5.net/serverinstall/train.dspam is feeding
> > DSPAM with data in your setup.
>
> Thank you for your help Stevan. My understanding of how this is
> supposed to eventually work is: DSPAM analyzes and adds a header to
> email as Innocent or Spam, and the MUA, which is configured to trust
> the Spam header, moves mail into the Junk mailbox if DSPAM
> classified it as Spam. The MUA has its own Junk filtering and moves
> mail it considers spam into the Junk mailbox too. So the nightly
> script may run across mail in the Junk mailbox that was
> misclassified as Innocent but is actually spam, and retrains it as a
> false negative. Conversely, if DSPAM incorrectly classifies mail as
> spam, the user moves that email from the Junk mailbox into the
> NotJunk mailbox so the nightly script can retrain it as a false
> positive.
>
So what it does is basically what the Dovecot anti-spam plugin does. The plugin, however, does it in real time, while the script you have there does it on a scheduled basis.

> DSPAM appears to be correctly adding headers, but so far I've seen
> only Whitelisted and Innocent.
>
But how is it possible that you have almost everywhere 0 for TN/TP? If DSPAM were working properly, then TP/TN would increase every time you get a mail.

> >> Is so much "Training Left" normal? Do I have something
> >> misconfigured? Will DSPAM start tagging email as SPAM only after
> >> 2500 successfully classified emails?
> >>
> > No. DSPAM is fully functional from day one. The tagging can be
> > turned on/off inside dspam.conf or with the preference extension.
> > However... turning on/off the tagging has nothing to do with the
> > training left number.
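A pointer for readers: the tagging on/off mentioned in that quote corresponds to the spamAction preference visible in the config dump further down the thread; for example (same syntax as the other preference examples here):
---
# In dspam.conf: { quarantine | tag | deliver } -> default:quarantine
Preference "spamAction=tag"

# Or via the preference extension:
# dspam_admin change preference default "spamAction" "tag"
---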
Re: [Dspam-user] training time?
On Fri, Apr 09, 2010 at 11:23:16PM -0700, Terry Barnum wrote:
>>> I've been running DSPAM for approximately 2 weeks and looking
>>> at the output of dspam_stats, I'm curious how long training
>>> normally takes.
>>>
>>> $ cat /usr/local/dspam.conf | grep -v ^# | grep -v ^$
>>>
>>> TrainingMode toe
>>> Preference "trainingMode=TOE"

Your default settings are TOE mode. Are you overriding this for any of the users in their preferences? If not, this would explain why it's only learning from errors: because you told it to.

Try switching this to TUM or TEFT.
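If you do switch modes, the preference syntax quoted above applies; e.g. (the dspam_admin line assumes the preference extension is in use):
---
# In dspam.conf: { TOE | TUM | TEFT | NOTRAIN } -> default:teft
Preference "trainingMode=TUM"

# Or via the preference extension, as a default for everyone:
# dspam_admin change preference default "trainingMode" "TUM"
---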
Re: [Dspam-user] training time?
On Apr 9, 2010, at 7:21 PM, Stevan Bajić wrote:
> On Fri, 9 Apr 2010 19:00:54 -0700 Terry Barnum wrote:
>> I've been running DSPAM for approximately 2 weeks and looking at
>> the output of dspam_stats, I'm curious how long training normally
>> takes.
>>
>> A script is run nightly to check .Junk mailboxes for false
>> negatives and .NotJunk mailboxes for false positives, and retrains
>> on error. (Richard Valk's
>> http://switch.richard5.net/serverinstall/train.dspam)
>>
>> Here's sample output from dspam_stats -H
>> [...]
> This all looks to me as if you are not using DSPAM at all. It seems
> to me that only the script from
> http://switch.richard5.net/serverinstall/train.dspam is feeding
> DSPAM with data in your setup.

Thank you for your help Stevan. My understanding of how this is supposed to eventually work is: DSPAM analyzes and adds a header to email as Innocent or Spam, and the MUA, which is configured to trust the Spam header, moves mail into the Junk mailbox if DSPAM classified it as Spam. The MUA has its own Junk filtering and moves mail it considers spam into the Junk mailbox too. So the nightly script may run across mail in the Junk mailbox that was misclassified as Innocent but is actually spam, and retrains it as a false negative. Conversely, if DSPAM incorrectly classifies mail as spam, the user moves that email from the Junk mailbox into the NotJunk mailbox so the nightly script can retrain it as a false positive.

DSPAM appears to be correctly adding headers, but so far I've seen only Whitelisted and Innocent.

>> Is so much "Training Left" normal? Do I have something
>> misconfigured? Will DSPAM start tagging email as SPAM only after
>> 2500 successfully classified emails?
>>
> No. DSPAM is fully functional from day one. The tagging can be
> turned on/off inside dspam.conf or with the preference extension.
> However... turning on/off the tagging has nothing to do with the
> training left number.
>
>> $ dspam --version
>> [...]
>> $ cat /usr/local/dspam.conf | grep -v ^# | grep -v ^$
>> [...]
Re: [Dspam-user] training time?
On Fri, 9 Apr 2010 19:00:54 -0700 Terry Barnum wrote:
> I've been running DSPAM for approximately 2 weeks and looking at
> the output of dspam_stats, I'm curious how long training normally
> takes.
>
> A script is run nightly to check .Junk mailboxes for false
> negatives and .NotJunk mailboxes for false positives, and retrains
> on error. (Richard Valk's
> http://switch.richard5.net/serverinstall/train.dspam)
>
> Here's sample output from dspam_stats -H
>
> x...@dop.com:
>     TP  True Positives:                    0
>     TN  True Negatives:                   19
>     FP  False Positives:                   0
>     FN  False Negatives:                 348
>     SC  Spam Corpusfed:                    0
>     NC  Nonspam Corpusfed:                 0
>     TL  Training Left:                  2481
>     SHR Spam Hit Rate:                 0.00%
>     HSR Ham Strike Rate:               0.00%
>     PPV Positive predictive value:   100.00%
>     OCA Overall Accuracy:              5.18%
>
> y...@dop.com:
>     TP  True Positives:                    0
>     TN  True Negatives:                    0
>     FP  False Positives:                   0
>     FN  False Negatives:                3035
>     SC  Spam Corpusfed:                    0
>     NC  Nonspam Corpusfed:                 0
>     TL  Training Left:                  2500
>     SHR Spam Hit Rate:                 0.00%
>     HSR Ham Strike Rate:             100.00%
>     PPV Positive predictive value:   100.00%
>     OCA Overall Accuracy:              0.00%
>
> z...@dop.com:
>     TP  True Positives:                    0
>     TN  True Negatives:                    0
>     FP  False Positives:                   0
>     FN  False Negatives:                 358
>     SC  Spam Corpusfed:                    0
>     NC  Nonspam Corpusfed:                 0
>     TL  Training Left:                  2500
>     SHR Spam Hit Rate:                 0.00%
>     HSR Ham Strike Rate:             100.00%
>     PPV Positive predictive value:   100.00%
>     OCA Overall Accuracy:              0.00%
>
> te...@dop.com:
>     TP  True Positives:                    0
>     TN  True Negatives:                    3
>     FP  False Positives:                   0
>     FN  False Negatives:                5108
>     SC  Spam Corpusfed:                    0
>     NC  Nonspam Corpusfed:                 0
>     TL  Training Left:                  2497
>     SHR Spam Hit Rate:                 0.00%
>     HSR Ham Strike Rate:               0.00%
>     PPV Positive predictive value:   100.00%
>     OCA Overall Accuracy:              0.09%
>
This all looks to me as if you are not using DSPAM at all. It seems to me that only the script from http://switch.richard5.net/serverinstall/train.dspam is feeding DSPAM with data in your setup.

> Is so much "Training Left" normal? Do I have something
> misconfigured? Will DSPAM start tagging email as SPAM only after
> 2500 successfully classified emails?
>
No. DSPAM is fully functional from day one. The tagging can be turned on/off inside dspam.conf or with the preference extension. However... turning on/off the tagging has nothing to do with the training left number.

> $ dspam --version
>
> DSPAM Anti-Spam Suite 3.9.0 (agent/library)
>
> Copyright (c) 2002-2009 DSPAM Project
> http://dspam.sourceforge.net.
>
> DSPAM may be copied only under the terms of the GNU General Public
> License, a copy of which can be found with the DSPAM distribution
> kit.
>
> $ cat /usr/local/dspam.conf | grep -v ^# | grep -v ^$
>
> Home /usr/local/var/dspam
> StorageDriver /usr/local/lib/dspam/libmysql_drv.dylib
> TrustedDeliveryAgent "/usr/bin/procmail"
> DeliveryHost 127.0.0.1
> DeliveryPort 10026
> DeliveryIdent localhost
> DeliveryProto SMTP
> OnFail error
> Trust root
> Trust dspam
> Trust apache
> Trust mail
> Trust mailnull
> Trust smmsp
> Trust daemon
> Trust _dspam
> Trust _postfix
> Trust _www
> TrainingMode toe
> TestConditionalTraining on
> Feature whitelist
> Algorithm graham burton
> Tokenizer osb
> PValue bcr
> WebStats on
> Preference "trainingMode=TOE"        # { TOE | TUM | TEFT | NOTRAIN } -> default:teft
> Preference "spamAction=tag"          # { quarantine | tag | deliver } -> default:quarantine
> Preference "spamSubject=[SPAM]"      # { string } -> default:[SPAM]
> Preference "statisticalSedation=5"   # { 0 - 10 } -> default:0
> Preference "enableBNR=on"            # { on | off } -> default:off
> Preference "enableWhitelist=on"      # { on | off }