Re: Chickenpoxed subjects
On 10/20/11 8:24 PM, Adam Katz wrote:

On 10/19/2011 04:43 AM, Mynabbler wrote:

You are kidding, right? 50% of this crap comes from FREEMAIL addresses, and even more specific: 44% of this crap is delivered by aol.com. The aol deliveries have about 85% unique from@aol addresses, so they pretty much 'own' aol.

We're writing spam filters, not idiot filters. The fact that there is so much overlap is often useful, but the overlap is not complete. There is also a decent amount of overlap between the mostly-computer-illiterate and freemail users. I think this drives your current line of thinking. There are a lot of people that do very spammy things. It is a testament to SA and other filters that such non-spam doesn't so commonly flag as spam.

Sorry to come to the party late on this, was traveling a bit.

It seems to me that if you have lines like:

Subject: T R +A N/N!l :ES, P \0 R N
Subject: S C/H ,O 0=LG)l :R$L$S ) P -0 RN

then the solution is to use agrep. Make deletions of punctuation very low cost, and make the usual transformations low cost as well:

0 = O
1 = l
$ = S
...

(Of course, then you end up with the possibility of a clash between deleting $ and replacing it with 'S', but agrep is good about checking both.) Then you just grep through a dictionary of the usual offenders:

lesbian
cash
meds
porn
...

I'm not familiar with perl-String-Approx. Reading up on it, it uses Levenshtein distances just like agrep does, so it would be ideal for doing approximate matches.

http://search.cpan.org/~jhi/String-Approx-3.26/Approx.pm

-Philip
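The approach Philip describes can be sketched in Python as a weighted edit distance. This is not agrep or String::Approx, only an illustration of the cost model: skipping punctuation or whitespace is cheap, undoing 0 = O, 1 = l, $ = S is cheap, and every other edit is expensive. The function names and the cost values 1 and 10 are assumptions made for this example.

```python
CHEAP, FULL = 1, 10                      # illustrative edit costs (assumed)
LEET = {"0": "o", "1": "l", "$": "s"}    # the substitutions named in the post


def char_cost(pattern_ch, text_ch):
    """Cost of aligning one dictionary letter with one subject character."""
    if pattern_ch == text_ch:
        return 0
    if LEET.get(text_ch) == pattern_ch:
        return CHEAP
    return FULL


def skip_cost(text_ch):
    """Cost of ignoring a subject character: cheap for punctuation and spaces."""
    return FULL if text_ch.isalnum() else CHEAP


def approx_cost(word, text):
    """Cheapest cost of finding `word` in any substring of `text`."""
    word, text = word.lower(), text.lower()
    # prev[i] = cheapest alignment of word[:i] ending at the previous character
    prev = [i * FULL for i in range(len(word) + 1)]
    best = prev[-1]
    for ch in text:
        cur = [0]                        # a match may start at any position
        for i in range(1, len(word) + 1):
            cur.append(min(
                prev[i - 1] + char_cost(word[i - 1], ch),  # match or substitute
                prev[i] + skip_cost(ch),                   # skip a subject char
                cur[i - 1] + FULL,                         # letter missing in subject
            ))
        best = min(best, cur[-1])
        prev = cur
    return best


def flagged_words(subject, words, max_cost=FULL - 1):
    """Dictionary words reachable using cheap edits only."""
    return [w for w in words if approx_cost(w, subject) <= max_cost]


subject = "S C/H ,O 0=LG)l :R$L$S ) P -0 RN"
print(flagged_words(subject, ["lesbian", "cash", "meds", "porn"]))
```

For the second sample subject, "porn" is found at cost 4 (two skipped spaces, one skipped hyphen, and 0 read as o), while the other three words each need at least one expensive edit. Because both the skip and the substitution are tried for `$`, the clash mentioned above resolves itself in the minimum.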
Re: Chickenpoxed subjects
On 10/19/2011 04:43 AM, Mynabbler wrote:

You are kidding, right? 50% of this crap comes from FREEMAIL addresses, and even more specific: 44% of this crap is delivered by aol.com. The aol deliveries have about 85% unique from@aol addresses, so they pretty much 'own' aol.

We're writing spam filters, not idiot filters. The fact that there is so much overlap is often useful, but the overlap is not complete. There is also a decent amount of overlap between the mostly-computer-illiterate and freemail users. I think this drives your current line of thinking. There are a lot of people that do very spammy things. It is a testament to SA and other filters that such non-spam doesn't so commonly flag as spam.
Re: Chickenpoxed subjects
RW-15 wrote:

MN As I explained, even if the rule would have fired, it adds a whopping
MN 0.1 score. It only shows teeth when combined with other findings...
RW So, why isn't it worth scoring if it's a useful rule?

Because mail with odd characters is not per se spam.

RW And why score it so high with FREEMAIL?

You are kidding, right? 50% of this crap comes from FREEMAIL addresses, and even more specific: 44% of this crap is delivered by aol.com. The aol deliveries have about 85% unique from@aol addresses, so they pretty much 'own' aol.

RW The danger here is that you end-up with a lot of FREEMAIL WEAK_RULE metas
RW that are prone to high-scoring FPs that BAYES_00 can't save.

As most spammers try to find something other than BOTNETs at the moment, I think it's only fair to be very critical about FREEMAIL.

RW If FREEMAIL_FROM is a good indicator then score it up, and score other rules
RW on their merits.

Well... in itself FREEMAIL isn't spam a priori. It's just that chances are a lot higher that it is. Hence my method of meta-ing FREEMAIL with fairly low scoring rules, like links to free blogsites, free websites, tumblr, odd punctuation in Subject rules, stuff like that. Interestingly enough, the most used subjects from valid freemail are "Re:" and none.

I don't see a problem with being picky about freemail. The only free email provider successfully fighting _out_going spam is gmail.com.

--
View this message in context: http://old.nabble.com/Chickenpoxed-subjects-tp32644509p32681681.html
Sent from the SpamAssassin - Users mailing list archive at Nabble.com.
Re: Chickenpoxed subjects
On Wed, 19 Oct 2011 04:43:52 -0700 (PDT) Mynabbler wrote:

RW-15 wrote:

MN As I explained, even if the rule would have fired, it adds a
MN whopping 0.1 score. It only shows teeth when combined with other
MN findings...
RW So, why isn't it worth scoring if it's a useful rule?

Because mail with odd characters is not per se spam

But if you really believed your rule had merit, you wouldn't score it at 0.1.

RW And why score it so high with FREEMAIL?

You are kidding, right? 50% of this crap comes from FREEMAIL addresses,

But there should be some logic to it, and there's no real connection between FREEMAIL and Chickenpox. If anything it should be the other way around: your rule FPs most frequently in mailing lists, where freemail addresses are very common.

You'd be much better off using decent chickenpox rules that are worth scoring in their own right.
Re: Chickenpoxed subjects
Adam Katz wrote:

On Mon, 17 Oct 2011, Adam Katz wrote:

Time for F-U-N
I like D&D and rock&roll
/var/spool/mail is full

... those examples don't get a hit with the rule I cooked up (since it needs three different odd characters), and besides, an MN_PUNCTUATION hit only scores in meta combinations. Note I commented out [] and () since they score too easily in valid email.

header __MN_PUNC00 Subject =~ /~/
header __MN_PUNC02 Subject =~ /`/
header __MN_PUNC03 Subject =~ /\#/
header __MN_PUNC04 Subject =~ /\$/
header __MN_PUNC05 Subject =~ /%/
header __MN_PUNC06 Subject =~ /\^/
header __MN_PUNC07 Subject =~ /&/
header __MN_PUNC08 Subject =~ /\*/
# header __MN_PUNC09 Subject =~ /\(|\)/
header __MN_PUNC10 Subject =~ /\?/
header __MN_PUNC11 Subject =~ /\+/
header __MN_PUNC12 Subject =~ /=/
header __MN_PUNC13 Subject =~ /\{|\}/
# header __MN_PUNC14 Subject =~ /\[|\]/
header __MN_PUNC15 Subject =~ /\|/
header __MN_PUNC16 Subject =~ /\\/
header __MN_PUNC17 Subject =~ /\;/
header __MN_PUNC18 Subject =~ /\:/
header __MN_PUNC19 Subject =~ /\//
header __MN_PUNC20 Subject =~ /_/
meta MN_PUNCTUATION (__MN_PUNC01 + __MN_PUNC02 + __MN_PUNC03 + __MN_PUNC04 + __MN_PUNC05 + __MN_PUNC06 + __MN_PUNC07 + __MN_PUNC08 + __MN_PUNC10 + __MN_PUNC11 + __MN_PUNC12 + __MN_PUNC13 + __MN_PUNC15 + __MN_PUNC16 + __MN_PUNC17 + __MN_PUNC18 + __MN_PUNC19 + __MN_PUNC20 >= 3)
score MN_PUNCTUATION 0.1
#
# Now, let's go hunt with this:
meta MN_PUNCS1 (MN_PUNCTUATION && (FREEWEB || HAS_SHORT_URL || MN_TUMBLR))
score MN_PUNCS1 6
describe MN_PUNCS1 Garbled subject with free website or blogsite, SHORT_URL or tumblr link
meta MN_PUNCS2 (MN_PUNCTUATION && FREEMAIL)
score MN_PUNCS2 3
describe MN_PUNCS2 Garbled subject from a free mail address
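For readers who want to try the counting logic outside SpamAssassin, here is a rough Python equivalent of MN_PUNCTUATION: one pattern per uncommented sub-rule, firing when at least three different ones match. It assumes that the two patterns that look damaged in the listing were an ampersand and a backslash, and that "three different odd characters" means three or more. The function names are made up for the example.

```python
import re

# One pattern per uncommented __MN_PUNC sub-rule.
# "&" and the backslash are assumptions about damaged entries in the listing.
ODD_CHARS = [
    r"~", r"`", r"#", r"\$", r"%", r"\^", r"&", r"\*", r"\?",
    r"\+", r"=", r"[{}]", r"\|", r"\\", r";", r":", r"/", r"_",
]


def distinct_odd_chars(subject):
    """Number of sub-rules that match, i.e. how many different odd characters occur."""
    return sum(1 for pattern in ODD_CHARS if re.search(pattern, subject))


def mn_punctuation(subject):
    """True when three or more different odd characters appear in the subject."""
    return distinct_odd_chars(subject) >= 3


for subject in (
    "S C/H ,O 0=LG)l :R$L$S ) P -0 RN",
    "Re: Did you pick-up the dry-cleaning?",
    "/var/spool/mail is full",
):
    print(distinct_odd_chars(subject), mn_punctuation(subject), subject)
```

Each sub-rule counts once no matter how often its character repeats, so a subject full of slashes still contributes only one point. Hyphens, commas and parentheses are deliberately not in the list.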
Re: Chickenpoxed subjects
On Tue, 18 Oct 2011 01:21:36 -0700 (PDT) Mynabbler wrote:

Adam Katz wrote:

On Mon, 17 Oct 2011, Adam Katz wrote:

Time for F-U-N
I like D&D and rock&roll
/var/spool/mail is full

... those examples don't get a hit with the rule I cooked up (since it needs three different odd characters),

It would hit:

Re: Did you pick-up the dry-cleaning?

I think it needs more work, maybe combine it with tests for lots of very short words or adjacent punctuation pairs.
Re: Chickenpoxed subjects
RW-15 wrote:

It would hit:

Re: Did you pick-up the dry-cleaning?

Nope. Scores just two (one ':' and a '?') and the rule needs three different odd characters.

RW-15 wrote:

I think it needs more work, maybe combine it with tests for lots of very short words or adjacent punctuation pairs.

As I explained, even if the rule would have fired, it adds a whopping 0.1 score. It only shows teeth when combined with other findings...
Re: Chickenpoxed subjects
On Tue, 18 Oct 2011 13:07:21 -0700 (PDT) Mynabbler wrote:

RW-15 wrote:

It would hit:

Re: Did you pick-up the dry-cleaning?

Nope. Scores just two (one ':' and a '?') and the rule needs three different odd characters.

OK, the font I'm using makes ~ look very like a -, but the point remains. If a subject starts with FW: or Re: and has a [!?], which is pretty common, you are then triggering on only one extra character. If you look back through this list you will find numerous such replies.

RW-15 wrote:

I think it needs more work, maybe combine it with tests for lots of very short words or adjacent punctuation pairs.

As I explained, even if the rule would have fired, it adds a whopping 0.1 score. It only shows teeth when combined with other findings...

So, why isn't it worth scoring if it's a useful rule? And why score it so high with FREEMAIL?

The danger here is that you end up with a lot of FREEMAIL WEAK_RULE metas that are prone to high-scoring FPs that BAYES_00 can't save. If FREEMAIL_FROM is a good indicator then score it up, and score other rules on their merits.
Re: Chickenpoxed subjects
On 10/15/2011 03:37 PM, John Hardin wrote:

On Thu, 13 Oct 2011, Mynabbler wrote:

Typically the chickenpox rules do not get a lot of love abroad, since they tend to trip over other languages than English. However, does someone have an idea how to use the logic in chickenpox for subjects like these: ... or does someone have a decent rule to tag this kind of crap?

I've got something in local masscheck right now, should commit later today. Check my sandbox tomorrow.

header __SUBJ_OBFU_PUNCT Subject =~ /(?:[-~`!@\#$%^*()_+={}|\\\/?,.:;][a-z][-~`!@\#$%^*()_+={}|\\\/?,.:;\s]|[a-z][~`!@\#$%^*()_+={}|\\\/?,.:;][a-z])/i

How does this differ from a negation, like:

/[^\[\]'\w\s][a-z][^\[\]'\w]|[a-z][^\[\]'\w\s-][a-z]/i

and how does this not FP all over the place with subjects like:

Time for F-U-N
I like D&D and rock&roll
/var/spool/mail is full

I think this would satisfy the original request:

header __SUBJ_LACKS_WORDS Subject !~ /(?!^.{0,15}$)(?:^|\s)[a-z]{3,15}(?:\s|$)/

(I have not checked that in, feel free if you like it.)
Re: Chickenpoxed subjects
On 10/17/2011 02:29 PM, Adam Katz wrote:

I think this would satisfy the original request:

header __SUBJ_LACKS_WORDS Subject !~ /(?!^.{0,15}$)(?:^|\s)[a-z]{3,15}(?:\s|$)/

(I have not checked that in, feel free if you like it.)

Okay, that needed a little work (boo to double-negatives). Also, I hadn't noticed the new thread (sorry). Just checked this in:

header __SUBJ_NOT_SHORT Subject =~ /^.{16}/
header __SUBJ_HAS_WORDS Subject =~ /(?:^|\s)[^\W0-9_]{3,15}(?:\s|$)/
meta SUBJ_LACKS_WORDS __SUBJ_NOT_SHORT && !__SUBJ_HAS_WORDS && !__SUBJECT_ENCODED_B64
describe SUBJ_LACKS_WORDS Non-short subject lacks words

Even this will hit a fair amount of ham, especially with foreign languages (I tried to work around this with [^\W0-9_] instead of [a-z] in the event a locale is in use).
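A small Python sketch of the same two-part test, for experimenting with sample subjects. It reuses the two regexes from the checked-in rule; the function name is invented here, and the `__SUBJECT_ENCODED_B64` exclusion is left out because it depends on the raw header rather than the decoded text.

```python
import re

# Same patterns as __SUBJ_NOT_SHORT and __SUBJ_HAS_WORDS in the rule.
NOT_SHORT = re.compile(r"^.{16}")
HAS_WORDS = re.compile(r"(?:^|\s)[^\W0-9_]{3,15}(?:\s|$)")


def subj_lacks_words(subject):
    """True for a subject of 16+ characters with no plain 3-15 letter word.

    The base64-encoded-subject exclusion from the real meta is not modelled.
    """
    return bool(NOT_SHORT.search(subject)) and not HAS_WORDS.search(subject)


for subject in (
    "S C/H ,O 0=LG)l :R$L$S ) P -0 RN",
    "Re: Did you pick-up the dry-cleaning?",
    "S C/H ,O",
):
    print(subj_lacks_words(subject), subject)
```

The chickenpoxed subject has no run of three or more letters bounded by whitespace, so it is the only one of the three that fires. The ordinary reply is saved by "Did", and the last example is simply too short.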
Re: Chickenpoxed subjects
On Mon, 17 Oct 2011, Adam Katz wrote:

header __SUBJ_OBFU_PUNCT Subject =~ /(?:[-~`!@\#$%^*()_+={}|\\\/?,.:;][a-z][-~`!@\#$%^*()_+={}|\\\/?,.:;\s]|[a-z][~`!@\#$%^*()_+={}|\\\/?,.:;][a-z])/i

How does this differ from a negation, like:

/[^\[\]'\w\s][a-z][^\[\]'\w]|[a-z][^\[\]'\w\s-][a-z]/i

I suppose which you'd choose would be based on how conservative you want to be. Matching on specific types of obfuscation (as mine does), or being less selective (as yours does).

and how does this not FP all over the place with subjects like:

Time for F-U-N
I like D&D and rock&roll
/var/spool/mail is full

It must hit more than a specified number of times. __SUBJ_OBFU_PUNCT isn't scored, SUBJ_OBFU_PUNCT_FEW and SUBJ_OBFU_PUNCT_MANY are.

I think this would satisfy the original request:

header __SUBJ_LACKS_WORDS Subject !~ /(?!^.{0,15}$)(?:^|\s)[a-z]{3,15}(?:\s|$)/

(I have not checked that in, feel free if you like it.)

When I get home tonight.

--
John Hardin KA7OHZ    http://www.impsec.org/~jhardin/
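To see how often __SUBJ_OBFU_PUNCT would fire on a given subject, the regex as quoted in this thread can be counted in Python. This is only a sketch: the thresholds behind SUBJ_OBFU_PUNCT_FEW and SUBJ_OBFU_PUNCT_MANY are not given here, so the caller supplies `min_hits`, and non-overlapping matching in Python is assumed to be close enough to how the real rule counts.

```python
import re

# The __SUBJ_OBFU_PUNCT pattern exactly as quoted in the thread.
SUBJ_OBFU_PUNCT = re.compile(
    r"(?:[-~`!@\#$%^*()_+={}|\\\/?,.:;][a-z][-~`!@\#$%^*()_+={}|\\\/?,.:;\s]"
    r"|[a-z][~`!@\#$%^*()_+={}|\\\/?,.:;][a-z])",
    re.IGNORECASE,
)


def obfu_hits(subject):
    """Number of non-overlapping matches of the pattern in the subject."""
    return sum(1 for _ in SUBJ_OBFU_PUNCT.finditer(subject))


def hits_at_least(subject, min_hits):
    """True when the pattern matches at least `min_hits` times.

    `min_hits` is a caller-supplied stand-in for the real thresholds.
    """
    return obfu_hits(subject) >= min_hits


for subject in (
    "Time for F-U-N",
    "/var/spool/mail is full",
    "S C/H ,O 0=LG)l :R$L$S ) P -0 RN",
):
    print(obfu_hits(subject), subject)
```

Counting this way makes the disagreement concrete: ordinary subjects with hyphens or path separators do produce a hit or two, so whether they trip a scored rule depends entirely on where the thresholds sit.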
Re: Chickenpoxed subjects
On 10/17/2011 04:36 PM, John Hardin wrote:

On Mon, 17 Oct 2011, Adam Katz wrote:

Time for F-U-N
I like D&D and rock&roll
/var/spool/mail is full

It must hit more than a specified number of times. __SUBJ_OBFU_PUNCT isn't scored, SUBJ_OBFU_PUNCT_FEW and SUBJ_OBFU_PUNCT_MANY are.

Each of my examples hits SUBJ_OBFU_PUNCT_FEW, and it wouldn't be hard for them to hit SUBJ_OBFU_PUNCT_MANY either.

I think this would satisfy the original request:

header __SUBJ_LACKS_WORDS Subject !~ /(?!^.{0,15}$)(?:^|\s)[a-z]{3,15}(?:\s|$)/

(I have not checked that in, feel free if you like it.)

When I get home tonight.

See my other email, already checked in :-)
Re: Chickenpoxed subjects
On Thu, 13 Oct 2011, Mynabbler wrote:

Typically the chickenpox rules do not get a lot of love abroad, since they tend to trip over other languages than English. However, does someone have an idea how to use the logic in chickenpox for subjects like these: ... or does someone have a decent rule to tag this kind of crap?

I've got something in local masscheck right now, should commit later today. Check my sandbox tomorrow.

--
John Hardin KA7OHZ    http://www.impsec.org/~jhardin/