Re: Yahoo/URL spam

2010-04-13 Thread Alex
Hi,

I'm having some additional difficulty with body URI rules and hoped
someone could help.

 rawbody  __BODY_ONLY_URI
  /^[^a-z]{0,10}(http:\/\/|www\.)(\w+\.)+(com|net|org|biz|cn|ru)\/?[^ ]{0,20}[a-z]{0,10}$/msi

This doesn't seem to catch a quoted-printable body, and I can't figure
out how to adapt it to allow for the 'Content-Transfer-Encoding:' line
that precedes the URL, if that's even the right approach. Here's an
example:

http://pastebin.com/NDR1n4sN

Ideas greatly appreciated.
Thanks,
Alex
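
A small sketch of one way the rule could be adapted, in case the text being scanned really does begin with a 'Content-Transfer-Encoding:' line as described above. It uses Python's re module, whose S and I flags behave like Perl's /s and /i for these constructs. The sample text, the optional header prefix, and all names are assumptions made for illustration; none of this is confirmed in the thread or tested against SpamAssassin itself.

```python
import re

# Rule from the thread, with the [^a-z] tail and without the 'm' flag.
URI_PART = (r"[^a-z]{0,10}(http:\/\/|www\.)(\w+\.)+(com|net|org|biz|cn|ru)"
            r"\/?[^ ]{0,20}[^a-z]{0,10}$")

# Hypothetical prefix: up to three 'Content-Something: value' lines.
# The bounds keep the match from running away over a large body.
HEADER_PREFIX = r"(?:Content-[\w-]+:[^\n]{0,80}\n){0,3}"

PLAIN_RULE = re.compile(r"^" + URI_PART, re.S | re.I)
ADAPTED_RULE = re.compile(r"^" + HEADER_PREFIX + URI_PART, re.S | re.I)

# Made-up sample; the real message is behind the pastebin link.
sample = ("Content-Transfer-Encoding: quoted-printable\n"
          "\n"
          "http://www.example.com/\n")

print("plain rule matches:  ", PLAIN_RULE.search(sample) is not None)
print("adapted rule matches:", ADAPTED_RULE.search(sample) is not None)
```

The plain rule fails here because the leading `[^a-z]{0,10}` cannot step over the letters in the header line; the adapted one consumes that line first. Whether this is the right approach depends on what the rule is actually given to match, which is the open question in the message above.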


Re: [sa] Re: Yahoo/URL spam

2010-03-24 Thread Mike Grau

On 3/23/2010 2:49 PM the voices made Charles Gregory write:

> On Tue, 23 Mar 2010, Alex wrote:
>
>> This is what I have:
>> /^[^a-z]{0,10}(http:\/\/|www\.)(\w+\.)+(com|net|org|biz|cn|ru)\/?[^ ]{0,20}[a-z]{0,10}$/msi
>
> My bad. I got an option wrong. Please remove the 'm' above.
> I always get it backwards. According to 'man perlre' (the definitive
> resource for SA regexes!) the 'm' makes '^' match every newline!
> We want it to only match the beginning of the body.
>
> So just remove it, and, as noted by others, add the '^' that was
> missing... like so
>
> ... ]{0,20}[^a-z]{0,10}$/si


Hello,

You might want to change  (\w+\.)+  to  ([\w-]+\.)+  to account for 
domains like polster-jj.de


-- MG
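
A quick way to see the difference, sketched in Python (its `\w` and character classes behave like Perl's here). Only the domain part of the rule is used; the `$` anchor, the added `de` TLD, and the sample hostnames are assumptions for the demo.

```python
import re

# Domain part of the rule as posted, and the variant suggested above.
# 'de' is added to the TLD list only so the example domain can match.
WORD_ONLY = re.compile(
    r"^(http:\/\/|www\.)(\w+\.)+(com|net|org|biz|cn|ru|de)$", re.I)
WITH_HYPHEN = re.compile(
    r"^(http:\/\/|www\.)([\w-]+\.)+(com|net|org|biz|cn|ru|de)$", re.I)


def matches(pattern, host):
    """Return True if the compiled pattern matches the host string."""
    return pattern.search(host) is not None


for host in ("www.example.com", "www.polster-jj.de"):
    print(host, matches(WORD_ONLY, host), matches(WITH_HYPHEN, host))
```

The hyphen is not a `\w` character, so `(\w+\.)+` stops at `polster` and never reaches the dot.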


Re: Yahoo/URL spam

2010-03-23 Thread Alex
Hi Charles,

 /^[^a-z]{0,10}(http:\/\/|www\.)(\w+\.)+(com|net|org|biz|cn|ru)\/?[^ ]{0,20}[^a-z]{0,10}$/msi
 This allows for some amount (up to ten chars?) of text before and
 after the URI if I'm reading that right, correct?

 Nope. With the /ms flags ^ and $ at beginning and end match the *whole* body
 as a single 'string' and permit 'any character' (. or [^x]) matches to also
 match newlines. So the above regex translates to:

This was very helpful, thanks. I might be doing something wrong, or
there's a typo somewhere. It seems to catch situations where there's
more than just a URL in the body, such as just additional text.

This is what I have:

/^[^a-z]{0,10}(http:\/\/|www\.)(\w+\.)+(com|net|org|biz|cn|ru)\/?[^ ]{0,20}[a-z]{0,10}$/msi

Thanks again,
Alex


Re: Yahoo/URL spam

2010-03-23 Thread John Horne
On Tue, 2010-03-23 at 13:18 -0400, Alex wrote:
 Hi Charles,
 
  /^[^a-z]{0,10}(http:\/\/|www\.)(\w+\.)+(com|net|org|biz|cn|ru)\/?[^ ]{0,20}[^a-z]{0,10}$/msi

 
 This is what I have:
 
 /^[^a-z]{0,10}(http:\/\/|www\.)(\w+\.)+(com|net|org|biz|cn|ru)\/?[^ ]{0,20}[a-z]{0,10}$/msi
                                                                             ^

The original had [^a-z]



John.

-- 
John Horne, University of Plymouth, UK
Tel: +44 (0)1752 587287    Fax: +44 (0)1752 587001
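
For anyone comparing the two versions, a minimal Python sketch of what the missing '^' changes. Both patterns use the equivalent of /si so that only the tail differs; the sample bodies and names are made up.

```python
import re

HEAD = (r"^[^a-z]{0,10}(http:\/\/|www\.)(\w+\.)+(com|net|org|biz|cn|ru)"
        r"\/?[^ ]{0,20}")

# Tail as retyped (letters allowed) versus tail as first posted (non-letters).
AS_RETYPED = re.compile(HEAD + r"[a-z]{0,10}$", re.S | re.I)
AS_POSTED = re.compile(HEAD + r"[^a-z]{0,10}$", re.S | re.I)

clean = "http://www.example.com/page\n"
trailing_blanks = "http://www.example.com/page  \n"   # stray spaces after the URI

for name, body in (("clean", clean), ("trailing_blanks", trailing_blanks)):
    print(name,
          AS_RETYPED.search(body) is not None,
          AS_POSTED.search(body) is not None)
```

With `[a-z]` the tail cannot absorb stray spaces or tabs after the URI, so such a body is missed; with `[^a-z]` it still matches.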



Re: [sa] Re: Yahoo/URL spam

2010-03-23 Thread Charles Gregory

On Tue, 23 Mar 2010, Alex wrote:

> This is what I have:
> /^[^a-z]{0,10}(http:\/\/|www\.)(\w+\.)+(com|net|org|biz|cn|ru)\/?[^ ]{0,20}[a-z]{0,10}$/msi


My bad. I got an option wrong. Please remove the 'm' above.
I always get it backwards. According to 'man perlre' (the definitive 
resource for SA regexes!) the 'm' makes '^' match every newline!

We want it to only match the beginning of the body.

So just remove it, and, as noted by others, add the '^' that was 
missing... like so


... ]{0,20}[^a-z]{0,10}$/si

- Charles
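
The effect of the 'm' flag can be checked with a short sketch. This uses Python's re module, where re.M, re.S and re.I correspond to Perl's /m, /s and /i; the two message bodies are invented for the demo.

```python
import re

# The rule from the thread (with the [^a-z] tail); only the flags differ.
RULE = (r"^[^a-z]{0,10}(http:\/\/|www\.)(\w+\.)+(com|net|org|biz|cn|ru)"
        r"\/?[^ ]{0,20}[^a-z]{0,10}$")

WITH_M = re.compile(RULE, re.M | re.S | re.I)   # like /msi
WITHOUT_M = re.compile(RULE, re.S | re.I)       # like /si

uri_only = "\nhttp://www.example.com/page\n\n"
uri_in_text = "Hi Bob,\nhttp://www.example.com/page\nSee you on Monday.\n"

for name, body in (("uri_only", uri_only), ("uri_in_text", uri_in_text)):
    print(name,
          WITH_M.search(body) is not None,
          WITHOUT_M.search(body) is not None)
```

With the 'm' flag, '^' and '$' can anchor at any line, so a URI sitting on its own line inside a longer message still hits. Without it, the anchors refer to the start and end of the whole text, and only the URI-only body matches.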


Re: Yahoo/URL spam

2010-03-22 Thread Charles Gregory

On Mon, 22 Mar 2010, Alex wrote:

> rawbody __BODY_ONLY_URI
> /^[^a-z]{0,10}(http:\/\/|www\.)(\w+\.)+(com|net|org|biz|cn|ru)\/?[^ ]{0,20}[^a-z]{0,10}$/msi
>
> This allows for some amount (up to ten chars?) of text before and
> after the URI if I'm reading that right, correct?


Nope. With the /ms flags ^ and $ at beginning and end match the *whole* 
body as a single 'string' and permit 'any character' (. or [^x]) matches 
to also match newlines. So the above regex translates to:


/^                        - Beginning of body
[^a-z]{0,10}              - match 0-10 non-alpha characters *including* newlines
(http:\/\/|www\.)         - match a uri beginning with http *or* www
(\w+\.)+                  - match multiple occurrences of word followed by .
                            (this will match 'domain.' *or* 'www.domain.')
(com|net|biz|org|cn|ru)   - match TLD (adjust to fit your mail)
\/?                       - match a slash if there is one
[^ ]{0,20}                - match 0-20 non-blank characters (page name, if given)
[^a-z]{0,10}              - match 0-10 non-alpha chars including newlines
                            (did I TYPO in my OP and leave out the '^'?)
$                         - match end of body
/msi


> Is it possible to determine the beginning of the line with a body rule?

Insert '\n' into the above regex where you want to match newline.

> I didn't think that was possible. I believe this is also what this is
> trying to do?


It's possible, but NOT what this regex does. Essentially this regex 
matches against a complete body that consists of nothing more than a 
single URI on a line, with possible blank lines before or after.
Rather than test for newlines, I test for non-alpha so that a stray space 
or tab or LF code does not fail to match.


This simple regex can also be 'dressed up' with elements of the form
(\<[^\<\>]+\> +)+ to match any HTML code inserted before or after the
URI. A regex could also check for a link consisting of text
enclosed by <a href=...> ... </a>


The key is to be sure that you don't use '*' or '+' in any context where
it could 'run away' and try to match large message bodies. This way, as
soon as the body exceeds 40 characters on either side of an unbroken
string of characters, it stops the test. Relatively efficient for a rawbody
test.

- C
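
The piece-by-piece breakdown above can be tried out as a commented pattern. This is a sketch in Python using re.VERBOSE, with the flags set to the equivalent of /si as per the later correction in this thread; the function name and sample bodies are assumptions for illustration.

```python
import re

# Each line mirrors one entry of the breakdown. In verbose mode, whitespace
# is ignored except inside a character class, so '[^ ]' keeps its space.
BODY_ONLY_URI = re.compile(r"""
    ^                          # beginning of the body
    [^a-z]{0,10}               # 0-10 non-alpha characters, newlines included
    (http:\/\/|www\.)          # URI beginning with http:// or www.
    (\w+\.)+                   # one or more 'word.' pieces
    (com|net|org|biz|cn|ru)    # TLD list (adjust to fit your mail)
    \/?                        # a slash if there is one
    [^ ]{0,20}                 # 0-20 non-blank characters (page name)
    [^a-z]{0,10}               # 0-10 non-alpha characters, newlines included
    $                          # end of the body
""", re.VERBOSE | re.S | re.I)


def is_uri_only(body):
    """Return True when the whole body is little more than a single URI."""
    return BODY_ONLY_URI.search(body) is not None


print(is_uri_only("\n\nhttp://www.example.com/page\n\n"))
print(is_uri_only("Hello, have a look at http://www.example.com/page\n"))
print(is_uri_only("http://www.example.com/page\nThanks, Bob\n"))
```

Because every quantifier after the domain is bounded, a body with more text than the bounds allow fails quickly instead of being scanned at length.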


Re: Yahoo/URL spam

2010-03-21 Thread Alex
Hi,

 Lots of ham may contain a URI, but how much ham contains ONLY a URI?

 Rough outline of rule, untested.

 rawbody  __BODY_ONLY_URI
  /^[^a-z]{0,10}(http:\/\/|www\.)(\w+\.)+(com|net|org|biz|cn|ru)\/?[^ ]{0,20}[a-z]{0,10}$/msi

 Combine that with 'frequent abusers' like Yahoo, and you've got something
 you can give a few points

This allows for some amount (up to ten chars?) of text before and
after the URI if I'm reading that right, correct? Is it possible to
determine the beginning of the line with a body rule? I didn't think
that was possible. I believe this is also what this is trying to do?

Thanks,
Alex


Re: Yahoo/URL spam

2010-03-19 Thread Charles Gregory

On Thu, 18 Mar 2010, Ned Slider wrote:
> If that's not an option, how about a meta rule for FROM_YAHOO and
> __HAS_ANY_URI (this rule exists in SA).


Lots of ham may contain a URI, but how much ham contains ONLY a URI?

Rough outline of rule, untested.

rawbody  __BODY_ONLY_URI
  /^[^a-z]{0,10}(http:\/\/|www\.)(\w+\.)+(com|net|org|biz|cn|ru)\/?[^ ]{0,20}[a-z]{0,10}$/msi

Combine that with 'frequent abusers' like Yahoo, and you've got something 
you can give a few points


There will probably need to be a variant on this to account for HTML mail
and/or the 'standard' footers inserted by free mail agents. Which,
incidentally, surprises me here. I thought Yahoo always added a tagline?


- C


Yahoo/URL spam

2010-03-18 Thread Alex
Hi,

I'm having a real problem with this persistent spam that contains just
a URL as the body, and is always from yahoo. I've got an example here:

http://pastebin.com/UqzhDHEu

'example.com' is my change. I'm using SA v3.2.5 with postfix/amavis.
I'm concerned that the bayes score is always low. I can't determine
any other patterns from this message to key on for other rules. Ideas
most welcome.

Thanks!
Best,
Alex


Re: Yahoo/URL spam

2010-03-18 Thread Martin Gregorie
On Thu, 2010-03-18 at 18:05 -0400, Alex wrote:
 Hi,
 
 I'm having a real problem with this persistent spam that contains just
 a URL as the body, and is always from yahoo. I've got an example here:
 
 http://pastebin.com/UqzhDHEu
 
 'example.com' is my change. I'm using SA v3.2.5 with postfix/amavis.
 I'm concerned that the bayes score is always low. I can't determine
 any other patterns from this message to key on for other rules. Ideas
 most welcome.
 
There's something odd about the message as posted: I'm getting hits on
MISSING_SUBJECT and MISSING_DATE (SA 3.3.0).


Martin





Re: Yahoo/URL spam

2010-03-18 Thread RW
On Thu, 18 Mar 2010 22:31:04 +
Martin Gregorie mar...@gregorie.org wrote:


 There's something odd about the message as posted: I'm getting hits on
 MISSING_SUBJECT and MISSING_DATE (SA 3.3.0).
 

Some of the wrapped headers aren't properly indented. Probably happened
on editing.