Dan Patnode wrote:
> Another Newbie question here,
>
> So URIs find links in the body. I'm trying to get a handle on URI
> syntax and have found several disparate examples:
>
> 1) uri HTTP_CTRL_CHARS_HOST /^https?\:\/\/[^\/\s]*[\x00-\x08\x0b\x0c\x0e-\x1f]/
>
> 2) uri NORMAL_HTTP_TO_IP m{^https?://\d+\.\d+\.\d+\.\d+}i
>
> 3) uri URI_4YOU [EMAIL PROTECTED](?:https?://|mailto:)[^\/[EMAIL PROTECTED]
>
> 4) uri HTTP_77 /http:\/\/.{0,2}\%77/
>
> 5) uri BARGAIN_URL /bargain([sz]|-\S+)?\.(?:com|biz)/
>
> 6) uri URI_OFFERS m/offer([sz]|-\S+)?\.(?:com|bi?z)/i
>
> 7) uri URI_AFFILIATE /aff\w+id=/i
>
> I have a few questions and welcome other tips. What do m{, m/, and m@
> mean?

Those are all the "match" operator. The leading m lets you use something
other than / to delimit the start and end of your regex. It is very
common to do this for URIs so you can write http:// instead of having to
escape it as http:\/\/, as in example 4.
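As a rough sketch of the delimiter point, here is the pattern from example 2 written three ways. The rule names are made up for illustration and are not from any shipped ruleset; the three patterns are intended to match the same URIs.

```
# Hypothetical rule names, for illustration only.

# Default / delimiter: every / inside the pattern must be escaped.
uri LOCAL_IP_URI_SLASH  /^https?:\/\/\d+\.\d+\.\d+\.\d+/i

# m{...}: the slashes in http:// need no escaping.
uri LOCAL_IP_URI_BRACE  m{^https?://\d+\.\d+\.\d+\.\d+}i

# m!...!: same idea with a different delimiter character.
uri LOCAL_IP_URI_BANG   m!^https?://\d+\.\d+\.\d+\.\d+!i
```

In each case the ^ anchor sits inside the delimiters, as the first character of the regex.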
Why example 6 uses m/ is beyond me, as / is the default.

> Are m||, m(), and m{} interchangeable or does each mean something
> different?

Interchangeable.

> Does it matter if the ^ is on the outside (3) or the inside (1&2) of
> the beginning?

In 3, ^ is the first character of the regex, just as it is in 1 and 2.
It is also inside the delimiters, just like 1 and 2. In example 3, @ is
being used as the delimiter, and ^ is the first character after it. You
can't put a ^ outside your delimiter and have it act as an anchor.

> I see the value of URIs with 5-7 so an anchor is not needed,

I don't believe anchors carry a significant performance penalty. In
general, an anchored rule may actually run faster than an unanchored
one. That said, make your choice about anchors based on accuracy needs,
not performance.

> is there an improvement over rawbody when http is used as in 1-4?

There is definitely a VERY significant performance penalty to using
rawbody over uri, for any rule. Consider the size of the input. A
rawbody regex must be run against the entire text of the body after QP
decoding. A uri regex is run only against the text of the URIs that SA
found. There is likely to be at least a 100:1 difference in size of
input.

There's no "penalty" for using a uri rule, as SA will always extract all
the URIs and build the input text, even if you aren't using it.

However, there are some cases where rawbody is useful, particularly when
you want to examine the formatting of newlines inserted into an HTML
tag. rawbody is also useful when you're looking for a "new trick" that
obfuscates URIs in such a way that SA can't parse them, but Outlook can
still open them. This used to be common enough that most folks used
rawbody for all their URI-type rules. Nowadays, however, most of those
tricks are caught.

>
> Thanks,
> Dan
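To make the uri versus rawbody comparison concrete, here is a minimal sketch of the same check written both ways. The rule names, the example.com pattern, and the 0.1 scores are all assumptions chosen for illustration, not rules from any real ruleset.

```
# Hypothetical rule names, pattern, and scores, for illustration only.

# uri: matched only against the URIs SpamAssassin extracted from the
# message, so the input is small and the ^ anchor is meaningful.
uri      LOCAL_EXAMPLE_URI      m{^https?://[^/]*example\.com}i
describe LOCAL_EXAMPLE_URI      URI points at example.com
score    LOCAL_EXAMPLE_URI      0.1

# rawbody: matched against the decoded body text instead, which is far
# more input. No ^ anchor, since the link can appear anywhere in it.
rawbody  LOCAL_EXAMPLE_RAWBODY  m{https?://[^/\s]*example\.com}i
describe LOCAL_EXAMPLE_RAWBODY  example.com link text in raw body
score    LOCAL_EXAMPLE_RAWBODY  0.1
```

The rawbody form would normally be kept only for the cases described above, such as an obfuscation that SA's URI parser misses.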