On Sat, 2011-11-26 at 22:11 +0000, Ned Slider wrote:
> On 26/11/11 21:36, Karsten Bräckelmann wrote:
> > On Sat, 2011-11-26 at 19:46 +0000, Ned Slider wrote:
> > > # URIs matching http://some.domain.com/profile/12FirstLastname/
> > > uri               LOCAL_URI_PROFILE
> > > m{https?://.{1,40}/profile/\d\d[A-Z][a-z]{1,20}[A-Z][a-z]{1,20}/}
> >              ^^^^^^^
> >
> > Using [^/]+ for the domain part instead should be slightly more
> > efficient, since it will never result in backtracking. Also stricter,
> > since it anchors the /profile directory at the domain.
> 
> Now you made me go away and read the perl RE manual!

That was not my intention, but a good outcome nonetheless. :)

> Nice - [^/] meaning to match any character that is not a "/", one or 
> more (+) times. I'd not come across negated character classes before so 
> thanks for the tip!

Negated char classes, or more precisely mutually exclusive char classes
often are useful, if they are possible at all. In an example like the
above, one can think of it as full-speed slurp mode. No possible
alternative matches, no backtracking, no ambiguity.

The point is, that any range is greedy by default. So regardless whether
one uses the general + or *, or limited {x,y}, the regex engine first
will attempt to consume as much chars as possible -- gradually trying
with less, if the following sub-RE doesn't match.

So .{1,40}/ will first consume 40 chars, check if the next char is a
slash -- and give up a char, if it is no slash, to try again...

The amount of backtracking possibly involved becomes more obvious with
an example URI like this:
  example.com/profile/foo/bar/baz/

With mutually exclusive char sets, the greedy [^/]+ will consume all
non-slash chars, and without any backtracking or futile matching
attempts swallow exactly the domain part in one go.


> And anchoring the /profile directory at the domain definitely reduces 
> the likelihood of FPs.

-- 
char *t="\10pse\0r\0dtu\0.@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1:
(c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}

Reply via email to