Re: Spam Pattern

2014-02-14 Thread Amir Caspi
On Feb 14, 2014, at 1:04 PM, Adam Katz wrote: > Noo, don't do that. (?:\s*\w+)+ is a ReDoS bomb (and you have it ten > times!) which will destroy your Whoops, you're very right. Removing the + after the \w (that is, turning it to (?:\s*\w)+ ) should match the same things but without th
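
As a rough illustration of the point above (not code from the thread), the following Python sketch compares the nested form `(?:\s*\w+)+` with the suggested `(?:\s*\w)+` on a few short inputs. The sample strings and the helper name `same_language` are assumptions made for the example; the inputs are kept short on purpose so that the nested form stays cheap to evaluate.

```python
import re

# Pattern quoted in the thread: the inner \w+ sits inside an outer +, so a
# run of word characters can be divided between iterations in many ways.
NESTED = re.compile(r"(?:\s*\w+)+")

# Suggested fix: one word character per iteration, which leaves only one way
# to divide a run of word characters.
FLAT = re.compile(r"(?:\s*\w)+")

# Hypothetical sample inputs, deliberately short.
SAMPLES = ["abc def", "  a b  c", "abc!", "abc ", "", "a1_b 2"]


def same_language(samples):
    """Return True if both patterns accept or reject each whole sample alike."""
    return all(
        bool(NESTED.fullmatch(s)) == bool(FLAT.fullmatch(s)) for s in samples
    )


if __name__ == "__main__":
    print(same_language(SAMPLES))
```

Both forms accept the same strings here; the difference lies in how many ways the engine can retry a failing match, which is what makes the nested form risky on long input.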

Re: Spam Pattern

2014-02-14 Thread Adam Katz
On 02/14/2014 11:23 AM, Amir Caspi wrote: > To be clear, that wasn't my sample; I am not the originator of this > thread. Whoops, my bad. My point was clear anyway. > What about this, a variant of what I posted earlier? It requires 10 > matches, but I believe it does the same thing as yours exc

Re: Spam Pattern

2014-02-14 Thread Amir Caspi
On Feb 14, 2014, at 11:53 AM, Adam Katz wrote: > some of your sample's strings had an extra character on the end. > To be clear, that wasn't my sample; I am not the originator of this thread. > This version of the rule is more expensive, but is safer to score higher > (maybe 3-4 points): body

Re: Spam Pattern

2014-02-14 Thread Adam Katz
Ha! I checked my mail before sending this; we're on the same wavelength yet our emails are out of sync. You just suggested the same thing I was leaning on. On 02/14/2014 10:53 AM, John Hardin wrote: > S/O is a little surprising: > > http://ruleqa.spamassassin.org/?daterev=20140213-r1567864-n&rul

Re: Spam Pattern

2014-02-14 Thread John Hardin
On Fri, 14 Feb 2014, Adam Katz wrote: Yes, there is an increased FP risk due to the ability to match different hex strings (e.g. a list of checksums). That's probably where the current Rule QA FPs come from. Good point. Perhaps it should be /\s[

Re: Spam Pattern

2014-02-14 Thread Adam Katz
On Feb 14, 2014, at 11:00 AM, Adam Katz wrote: >> >> Given the nature of the content, I'd go the other direction and not >> require the word boundary. This removes the wildcard, though it >> doesn't short circuit as quickly, so one could debate which version >> is more

Re: Spam Pattern

2014-02-14 Thread John Hardin
On Fri, 14 Feb 2014, Amir Caspi wrote: Another problem with the above code is that you require only a short word (1-10 chars) prior to the hex string. Some perfectly legitimate, or even illegitimate, words could be longer than 10 chars. I'd increase the upper limit to something like 15ish

Re: Spam Pattern

2014-02-14 Thread John Hardin
On Fri, 14 Feb 2014, Adam Katz wrote: Given the nature of the content, I'd go the other direction and not require the word boundary. This removes the wildcard, though it doesn't short circuit as quickly, so one could debate which version is more efficient. body __HEXHASHWORD /\b[a-z]{1,

Re: Spam Pattern

2014-02-14 Thread Amir Caspi
On Feb 14, 2014, at 11:00 AM, Adam Katz wrote: > Given the nature of the content, I'd go the other direction and not require > the word boundary. This removes the wildcard, though it doesn't short > circuit as quickly, so one could debate which version is more efficient. > body __HEXHASHW

Re: Spam Pattern

2014-02-14 Thread Adam Katz
On 02/12/2014 01:46 PM, John Hardin wrote: > On Wed, 12 Feb 2014, Axb wrote: >> On 02/12/2014 10:06 PM, John Hardin wrote: >>> Perhaps something like this: >>> >>> body __HEXHASHWORD /\b[0-9a-f]{30,}\s[a-z]{1,10}\b/ >>> tflags __HEXHASHWORD multiple maxhits=5 >>> meta HEXHASH_W

Re: Spam Pattern

2014-02-12 Thread Axb
On 02/12/2014 10:46 PM, John Hardin wrote: On Wed, 12 Feb 2014, Axb wrote: On 02/12/2014 10:06 PM, John Hardin wrote: Perhaps something like this: body __HEXHASHWORD /\b[0-9a-f]{30,}\s[a-z]{1,10}\b/ tflags __HEXHASHWORD multiple maxhits=5 meta HEXHASH_WORD __HEXHASHWO

Re: Spam Pattern

2014-02-12 Thread John Hardin
On Wed, 12 Feb 2014, Axb wrote: On 02/12/2014 10:46 PM, John Hardin wrote: On Wed, 12 Feb 2014, Axb wrote: > On 02/12/2014 10:06 PM, John Hardin wrote: > > > > Perhaps something like this: > > > > body __HEXHASHWORD /\b[0-9a-f]{30,}\s[a-z]{1,10}\b/ > > tflags __HEXHASHWORD

Re: Spam Pattern

2014-02-12 Thread John Hardin
On Wed, 12 Feb 2014, Axb wrote: On 02/12/2014 10:06 PM, John Hardin wrote: Perhaps something like this: body __HEXHASHWORD /\b[0-9a-f]{30,}\s[a-z]{1,10}\b/ tflags __HEXHASHWORD multiple maxhits=5 meta HEXHASH_WORD __HEXHASHWORD > 4 describe HEXHASH_WORD Hexadecima
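
To make the counting logic of the proposed rule concrete, here is a minimal Python sketch of the same idea: count matches of the `__HEXHASHWORD` pattern, stop counting at five (as `maxhits=5` would), and fire only when the count exceeds four. This is an approximation for illustration, not SpamAssassin code: it scans one string instead of SpamAssassin's rendered body, and the function names `hexhash_word_hits` and `hexhash_word` are assumptions.

```python
import re

# Pattern copied from the proposed __HEXHASHWORD sub-rule.
HEXHASHWORD = re.compile(r"\b[0-9a-f]{30,}\s[a-z]{1,10}\b")

MAXHITS = 5      # mirrors "tflags __HEXHASHWORD multiple maxhits=5"
THRESHOLD = 4    # mirrors "meta HEXHASH_WORD __HEXHASHWORD > 4"


def hexhash_word_hits(body: str) -> int:
    """Count pattern matches in body, giving up once MAXHITS is reached."""
    hits = 0
    for _ in HEXHASHWORD.finditer(body):
        hits += 1
        if hits >= MAXHITS:
            break
    return hits


def hexhash_word(body: str) -> bool:
    """Approximate the meta rule: fire when more than THRESHOLD hits occur."""
    return hexhash_word_hits(body) > THRESHOLD


if __name__ == "__main__":
    # Hypothetical test text: a 30-character hex string followed by a word.
    chunk = "a1b2c3d4e5" * 3 + " word"
    print(hexhash_word(" ".join([chunk] * 5)))
    print(hexhash_word(" ".join([chunk] * 4)))
```

Capping the count keeps the work bounded on long messages, since nothing beyond the fifth hit can change the outcome.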

Re: Spam Pattern

2014-02-12 Thread Amir Caspi
On Feb 12, 2014, at 2:13 PM, Axb wrote: > Isn't {30,} (without a limit) dangerously expensive? It has a limit -- the whitespace at the end of the string is required. In this case, it should be fine, the regexp cannot match "infinitely" many characters, and it's also sort of required, because
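
A small Python sketch of the argument above, under the assumption that a capture group is added around the hex run purely for inspection (the group and the helper `hex_run_length` are not part of the rule in the thread). It shows that the open-ended `{30,}` consumes exactly the hex run and that the rule fails outright when the required whitespace and word are missing.

```python
import re

# Same pattern as the proposed rule, with a capture group added around the
# hex run so its length can be inspected (the group is an assumption).
RULE = re.compile(r"\b([0-9a-f]{30,})\s[a-z]{1,10}\b")


def hex_run_length(text):
    """Return the length of the matched hex run, or None if the rule misses."""
    match = RULE.search(text)
    return len(match.group(1)) if match else None


if __name__ == "__main__":
    hex_run = "0123456789abcdef" * 4            # 64 hex characters
    print(hex_run_length(hex_run + " hello"))   # hex run, whitespace, word
    print(hex_run_length(hex_run))              # no trailing whitespace/word
```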

Effectiveness of Bayes poisoning (was Re: Spam Pattern)

2014-02-12 Thread David F. Skoll
On Wed, 12 Feb 2014 13:11:19 -0800 (PST) John Hardin wrote: > That only works if your hammy mail stream contains text that looks > like the random garbage they put in to try to spoof bayes. Indeed. Just for kicks, I ran the OP's pastebin example through our Bayes database and it scored 99.99% l

Re: Spam Pattern

2014-02-12 Thread Axb
On 02/12/2014 10:06 PM, John Hardin wrote: On Wed, 12 Feb 2014, Joe Quinn wrote: On 2/12/2014 3:15 PM, John Hardin wrote: On Wed, 12 Feb 2014, Joe Quinn wrote: > This pattern has been showing up in a good 80% of spam I have looked at > in the past month. > > Spammers take a few paragraphs

Re: Spam Pattern

2014-02-12 Thread John Hardin
On Wed, 12 Feb 2014, Amir Caspi wrote: On Feb 12, 2014, at 1:15 PM, John Hardin wrote: Bayes. Well, yes and no. Bayes isn't very good about detecting this kind of thing per se because it's full of random crap... in fact, they specifically pull text from innocuous things like web reviews,

Re: Spam Pattern

2014-02-12 Thread John Hardin
On Wed, 12 Feb 2014, Joe Quinn wrote: On 2/12/2014 3:15 PM, John Hardin wrote: On Wed, 12 Feb 2014, Joe Quinn wrote: > This pattern has been showing up in a good 80% of spam I have looked at > in the past month. > > Spammers take a few paragraphs out of a large body of text and put it at

Re: Spam Pattern

2014-02-12 Thread Axb
On 02/12/2014 09:02 PM, Joe Quinn wrote: This pattern has been showing up in a good 80% of spam I have looked at in the past month. Spammers take a few paragraphs out of a large body of text and put it at the end of their email. My favorite is one that had the scene where Daisy first meets Jay G

Re: Spam Pattern

2014-02-12 Thread Axb
On 02/12/2014 09:02 PM, Joe Quinn wrote: This pattern has been showing up in a good 80% of spam I have looked at in the past month. Spammers take a few paragraphs out of a large body of text and put it at the end of their email. My favorite is one that had the scene where Daisy first meets Jay G

Re: Spam Pattern

2014-02-12 Thread Amir Caspi
On Feb 12, 2014, at 1:15 PM, John Hardin wrote: > Bayes. Well, yes and no. Bayes isn't very good about detecting this kind of thing per se because it's full of random crap... in fact, they specifically pull text from innocuous things like web reviews, movie reviews, news articles, etc. in th

Re: Spam Pattern

2014-02-12 Thread RW
On Wed, 12 Feb 2014 15:02:20 -0500 Joe Quinn wrote: > This pattern has been showing up in a good 80% of spam I have looked > at in the past month. > > Spammers take a few paragraphs out of a large body of text and put it > at the end of their email. My favorite is one that had the scene > where D

Re: Spam Pattern

2014-02-12 Thread Joe Quinn
On 2/12/2014 3:15 PM, John Hardin wrote: On Wed, 12 Feb 2014, Joe Quinn wrote: This pattern has been showing up in a good 80% of spam I have looked at in the past month. Spammers take a few paragraphs out of a large body of text and put it at the end of their email. My favorite is one that h

Re: Spam Pattern

2014-02-12 Thread John Hardin
On Wed, 12 Feb 2014, Joe Quinn wrote: This pattern has been showing up in a good 80% of spam I have looked at in the past month. Spammers take a few paragraphs out of a large body of text and put it at the end of their email. My favorite is one that had the scene where Daisy first meets Jay