Re: URI Basics

Matt Kettler Mon, 24 Apr 2006 06:22:31 -0700

Dan Patnode wrote:
> Another Newbie question here,
>
> So URIs find links in the body.  I'm trying to get a handle on URI
> syntax and have found several disparate examples:
>
>
> 1) uri HTTP_CTRL_CHARS_HOST       /^https?\:\/\/[^\/\s]*[\x00-\x08\x0b\x0c\x0e-\x1f]/
>
> 2) uri NORMAL_HTTP_TO_IP        m{^https?://\d+\.\d+\.\d+\.\d+}i
>
> 3) uri URI_4YOU            [EMAIL PROTECTED](?:https?://|mailto:)[^\/[EMAIL PROTECTED]
>
> 4) uri HTTP_77            /http:\/\/.{0,2}\%77/
>
> 5) uri BARGAIN_URL        /bargain([sz]|-\S+)?\.(?:com|biz)/
>
> 6) uri URI_OFFERS            m/offer([sz]|-\S+)?\.(?:com|bi?z)/i
>
> 7) uri URI_AFFILIATE        /aff\w+id=/i
>
>
> I have a few questions and welcome other tips.  What do m{, m/, and m@
> mean?  
Those are all forms of Perl's "match" operator, m. Writing the m
explicitly lets you use something other than / to delimit the start and
end of your regex. This is very common for URI rules, because you can
then write http:// instead of having to escape it as http:\/\/, as in
example 4.


Why example 6 uses m/ is beyond me, as / is the default.
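
To see that the delimiter only changes what has to be escaped, and not
what the regex matches, here is a minimal sketch. It is in Python rather
than Perl (Python has no regex delimiters, so / is never special there),
and the sample URIs are invented for illustration.

```python
import re

# Example 4 as written with / delimiters: every literal / is escaped.
escaped = re.compile(r"http:\/\/.{0,2}\%77")
# The same pattern as it could be written inside m{...}: no escaping needed.
unescaped = re.compile(r"http://.{0,2}%77")

# Hypothetical URIs, made up for this illustration.
samples = [
    "http://%77ww.example.com/",
    "http://a%77.example.com/",
    "http://www.example.com/",
]

# For each URI, record whether the escaped and unescaped forms match.
results = [(bool(escaped.search(u)), bool(unescaped.search(u))) for u in samples]
for uri, (a, b) in zip(samples, results):
    print(uri, a, b)
```

Both forms agree on every sample, which is the point: the escapes are
there for the delimiter's sake, not the regex's.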

> Are m||, m(), and m{} interchangeable or does each mean something
> different?  
Interchangeable. The delimiter you pick only determines which character
has to be escaped inside the regex.
> Does it matter if the ^ is on the outside (3) or the inside (1&2) of
> the beginning?
In 3, ^ is the first character of the regex, just as it is in 1 and 2,
and it is inside the delimiters, just as in 1 and 2. Example 3 uses @ as
the delimiter, and ^ is the first character after it. You can't put a ^
outside your delimiters and have it act as an anchor.
> I see the value of URIs with 5-7 so an anchor is not needed,
I don't believe anchors carry any significant performance penalty. If
anything, an anchored rule may run faster than an unanchored one, since
the regex engine only has to try the match at the start of the string.
That said, make your choice about anchors based on accuracy needs, not
performance.
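
As a rough illustration of the accuracy side, here is a small Python
sketch using the pattern from example 2 with and without the ^. It
assumes each URI is tested as its own string, and both URIs are
hypothetical (192.0.2.1 is a documentation-only address).

```python
import re

# Example 2 from the question, anchored to the start of the URI.
anchored = re.compile(r"^https?://\d+\.\d+\.\d+\.\d+", re.IGNORECASE)
# The same pattern without the anchor, for comparison.
unanchored = re.compile(r"https?://\d+\.\d+\.\d+\.\d+", re.IGNORECASE)

# Hypothetical URIs: one points straight at an IP address, the other only
# carries an IP-based URL inside its query string.
direct = "http://192.0.2.1/login"
embedded = "http://www.example.com/redirect?to=http://192.0.2.1/login"

hits = {
    "anchored_direct": bool(anchored.search(direct)),
    "anchored_embedded": bool(anchored.search(embedded)),
    "unanchored_direct": bool(unanchored.search(direct)),
    "unanchored_embedded": bool(unanchored.search(embedded)),
}
print(hits)
```

The anchored form only fires when the URI itself starts with an IP
address; the unanchored form also fires on the embedded one. Which of the
two you want depends on what the rule is meant to catch.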
> is there an improvement over rawbody when http is used as in 1-4? 

There is definitely a VERY significant performance penalty to using
rawbody over URI, for any rule.

Consider the size of the input. A rawbody regex must be run against the
entire text of the body after base64 or quoted-printable decoding. A uri
regex is run only against the text of the URIs that SA found. There is
likely to be at least a 100:1 difference in the size of the input.
There's no extra "penalty" for using a uri rule, as SA will always
extract all the URIs and build the input text, even if you aren't using
it.
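
To get a feel for the difference in input size, here is a rough Python
sketch. The message body is made up, and the one-line regex is only a
naive stand-in for URI extraction, not how SA actually parses URIs; the
ratio it prints says nothing about real mail.

```python
import re

# Hypothetical message body: mostly prose, with two links.
body = (
    "Dear customer, thank you for your recent order. " * 40
    + "Track it at http://www.example.com/track?id=123 "
    + "or write to us via http://www.example.org/contact . "
    + "We hope to see you again soon. " * 40
)

# Naive stand-in for URI extraction; SpamAssassin's real parser is far
# more thorough than this single regex.
uris = re.findall(r"https?://\S+", body)
uri_text = "\n".join(uris)

rawbody_chars = len(body)    # roughly what a rawbody-style regex scans
uri_chars = len(uri_text)    # roughly what a uri-style regex scans
print(rawbody_chars, uri_chars, round(rawbody_chars / uri_chars, 1))
```

Even in this toy message the uri input is a small fraction of the body,
and a regex that scans less text has less work to do.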

However, there are some cases where rawbody is useful, particularly when
you want to examine the formatting of newlines inserted into an HTML tag.

rawbody is also useful when you're looking for a "new trick" that
obfuscates URIs in such a way that SA can't parse them, but Outlook can
still open them. This used to be common enough that most folks used
rawbody for all their URI-type rules. Nowadays, however, most of those
tricks are caught.

>
> Thanks,
> Dan
>
