Re: HTML link regex

Bowie Bailey Thu, 27 Sep 2012 11:34:26 -0700


On 9/27/2012 1:48 PM, Alexandre Boyer wrote:

Alex, from prypiat.
Yes, I recycle.



On 12-09-27 11:09 AM, Bowie Bailey wrote:

On 9/27/2012 10:41 AM, Alexandre Boyer wrote:

Hello all,

Here is a small ruleset that I'm working with. I added it to our
local ruleset in prod:

     # BAD LINKS N-NG ;-) ;
     # Canada Post
     uri_detail   AJB_CANPOST_BADLINK             raw !~ /canadapost\./ text =~ /(?:https?:\/\/(?:www\.)?|www\.)canadapost\./ type =~ /^a$/
     describe     AJB_CANPOST_BADLINK             Found a mismatch between href and anchored text pretending to link to www.canadapost.ca
     score        AJB_CANPOST_BADLINK             1.0
     meta         AJB_CANPOST_PHISH_BADTRACKNUM   Z_CANPOST_BADLINK && !Z_CANPOST_TRACKNUM
     describe     AJB_CANPOST_PHISH_BADTRACKNUM   Mismatch between href and anchored + unofficial tracking number from CanadaPost
     score        AJB_CANPOST_PHISH_BADTRACKNUM   2.0
     #
     # youtube
     uri_detail   AJB_UTUBE_BADLINK   raw !~ /youtube\./ text =~ /(?:https?:\/\/(?:www\.)?|www\.)youtube\./ type =~ /^a$/
     describe     AJB_UTUBE_BADLINK   Found a mismatch between href and anchored text pretending to link to www.youtube.com
     score        AJB_UTUBE_BADLINK   0.5
     # because of link trackers (from massmailer for example), we must
     # meta this with other rulz to be sure we face our fake yutube botnet
     meta      AJB_FK_UTUBE_BOTNET     Z_UTUBE_BADLINK && Z_EMPTY_SUBJ && MIME_HTML_ONLY
     describe  AJB_FK_UTUBE_BOTNET     mismatch between href and anchored + empty subject = botnet
     score     AJB_FK_UTUBE_BOTNET     5.5
     ##
     # TODO: check if we could work with DKIM, exists:List-Unsubscribe,
     #    SPF_PASS, RCVD_IN_RP_SAFE, RCVD_IN_RP_CERTIFIED and others
     #    in order to avoid FPs from MassMailers.

Note the TODO ;-)
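
For readers who don't know uri_detail, here is a rough Python sketch of
the condition the Canada Post rule expresses: the link is an <a> tag, the
visible text looks like a canadapost address, and the real href does not
mention canadapost at all. The function name is_bad_link and the sample
URLs are made up for illustration; this is not how SpamAssassin evaluates
the rule internally, only the same three tests combined.

```python
import re

# Same two patterns as in the uri_detail rule above.
TEXT_LOOKS_LIKE_CANADAPOST = re.compile(r"(?:https?:\/\/(?:www\.)?|www\.)canadapost\.")
HREF_MENTIONS_CANADAPOST = re.compile(r"canadapost\.")


def is_bad_link(href, anchor_text, link_type="a"):
    """Hypothetical helper: True when the anchor text pretends to be
    canadapost but the raw href points somewhere else."""
    return (
        link_type == "a"                                          # type =~ /^a$/
        and not HREF_MENTIONS_CANADAPOST.search(href)             # raw !~ /canadapost\./
        and bool(TEXT_LOOKS_LIKE_CANADAPOST.search(anchor_text))  # text =~ /.../
    )


if __name__ == "__main__":
    # Made-up examples: a spoofed link and a genuine one.
    print(is_bad_link("http://evil.example/track", "www.canadapost.ca"))
    print(is_bad_link("http://www.canadapost.ca/track", "www.canadapost.ca"))
```

The first call reports a mismatch and the second does not, which is the
behaviour the rule description promises.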

Don't know if it makes much difference in this case, but...

(?:https?:\/\/(?:www\.)?|www\.)

Should catch:
http://
https://
http://www.
https://www.
www.

can be simplified to:

(?:https?:\/\/|www\.)

While this catches:
http://
https://
www.

Covering less. It may be overkill, but my regex has one and only one
purpose: to match any kind of "valid" web link as the common user sees
it (i.e. "as seen on TV").

The spammer will try to lure the common user by mimicking what the
common user is used to seeing, no?

Check again. "http://www." and "https://www." are caught by the "www."
pattern. Matching the "https?://" as well is not needed. That's why I
mentioned anchoring. If you were anchoring the front of the regexp, you
would need this match. Since you are not, the extra specificity is not
needed. My regexp matches exactly the same strings as yours.

Since you're not anchoring the front of the regexp or trying to
capture the match, the results will be the same.
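
The equivalence is easy to check outside SpamAssassin. The sketch below
uses Python's re module rather than Perl (the two patterns use no
Perl-only syntax), and the sample strings are made up for the test. It
runs both regexps unanchored, then again with a leading ^ to show where
they would diverge.

```python
import re

# The original and the simplified prefix, each followed by the domain part.
ORIGINAL = re.compile(r"(?:https?:\/\/(?:www\.)?|www\.)canadapost\.")
SIMPLIFIED = re.compile(r"(?:https?:\/\/|www\.)canadapost\.")

# The same two patterns, anchored at the start of the string.
ANCHORED_ORIGINAL = re.compile(r"^(?:https?:\/\/(?:www\.)?|www\.)canadapost\.")
ANCHORED_SIMPLIFIED = re.compile(r"^(?:https?:\/\/|www\.)canadapost\.")

SAMPLES = [
    "http://canadapost.ca",
    "https://canadapost.ca",
    "http://www.canadapost.ca",
    "https://www.canadapost.ca",
    "www.canadapost.ca",
    "canadapost.ca",
    "http://example.com",
]


def hits(pattern, samples):
    """Return the samples in which the pattern is found anywhere."""
    return [s for s in samples if pattern.search(s)]


if __name__ == "__main__":
    # Unanchored: both lists are identical.
    print(hits(ORIGINAL, SAMPLES) == hits(SIMPLIFIED, SAMPLES))
    # Anchored: the simplified form loses the "http(s)://www." samples.
    print(sorted(set(hits(ANCHORED_ORIGINAL, SAMPLES))
                 - set(hits(ANCHORED_SIMPLIFIED, SAMPLES))))
```

Unanchored, the simplified regexp still finds "http://www.canadapost.ca"
because the search starts matching at "www." instead of at "http://".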

I'm not capturing because I don't use the match afterwards. On a small
system, this makes no difference. On large systems (millions+ of emails
filtered a day), it probably does. I'm guessing here; I don't want to
prove this on my own systems :-)

Right. No need to capture here or in most SA rules. I only mentioned it
since there would be a difference between your original regexp and my
suggestion if you were doing some capturing.
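
To make the capturing difference concrete, here is a small Python sketch
in which the non-capturing group (?:...) is swapped for a capturing one.
The capturing variants and the helper name captured_prefix are
assumptions for illustration only; neither regexp in the thread actually
captures.

```python
import re

# Hypothetical capturing variants of the two prefixes discussed above.
ORIGINAL_CAPTURING = re.compile(r"(https?:\/\/(?:www\.)?|www\.)canadapost\.")
SIMPLIFIED_CAPTURING = re.compile(r"(https?:\/\/|www\.)canadapost\.")


def captured_prefix(pattern, text):
    """Return what group 1 captured, or None when there is no match."""
    match = pattern.search(text)
    return match.group(1) if match else None


if __name__ == "__main__":
    url = "http://www.canadapost.ca"
    print(captured_prefix(ORIGINAL_CAPTURING, url))    # http://www.
    print(captured_prefix(SIMPLIFIED_CAPTURING, url))  # www.
```

Both regexps match the URL, but the captured prefix differs, so a rule
that reused the capture would behave differently.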

As I said, it may not make any real difference here, I was simply
pointing out the possible simplification of the regexp.


--
Bowie
