Re: Regexp. Should be easy I think..

Januk Aggarwal Tue, 20 Apr 2004 23:11:06 -0700

Hello Leif,

On Tuesday, April 20, 2004 at 20:13 GMT -0600, a stampede was started
when Leif Gregory hollered:


It just occurred to me, if you're using PHP, can't you get all the
info you need without using regexps?  I seem to recall that there are
built-in functions that can get you pretty much any info you want,
though I can't find them listed in the phpinfo() output...

> I'm pretty sure the HTTP request type will always be first so I can
> anchor with ^, but IIS likes to put the SERVER before the DATE, and
> Sambar is the reverse. So, my thinking was that if I did the HTTP part
> and then moved up to the first : and backed up to the first
> whitespace, I could grab the next chunk (either DATE or SERVER) up to
> the next : (and then back to the first whitespace), and continue that
> until I hit the EOL.

You can do that with the chunk that I wrote. What I don't know is how
PHP handles regexps. I know in VBScript, when you do a regexp, all
possible matches are stored in an array, so it is pretty easy to
get out all the parts you want. In TB, that isn't the case, so I tend
to forget about that option. If PHP will populate an array, then
you're golden. The regexp could be fairly simple.

JA>> (?i)(Date:\s*.*?)((\s\S+:\s)|\z)
JA>> But increasing accuracy has the price of decreasing tolerance for
JA>> errors in the string.

> Exactly.

The way I wrote the above regexp, you should be pretty accurate
without losing any generality.
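
As a rough illustration only, here is how the Date regexp might behave
in Python's re module, which I am using as a stand-in for PHP's PCRE
functions. The sample strings are made up, and they assume the headers
arrive flattened onto one whitespace-separated line. Python's re
spells the end-of-string anchor \Z rather than \z, so that is changed.

```python
import re

# Made-up samples: one with Date last (IIS-style order), one with Date first.
iis = "HTTP/1.1 200 OK Server: Microsoft-IIS/5.0 Date: Wed, 21 Apr 2004 06:11:06 GMT"
sambar = "HTTP/1.1 200 OK Date: Wed, 21 Apr 2004 06:11:06 GMT Server: SAMBAR"

# The Date regexp from the message, with \z written as \Z for Python.
date_re = re.compile(r"(?i)(Date:\s*.*?)((\s\S+:\s)|\Z)")

# Group 1 holds the Date field; it stops at the next "Name: " or at the end.
iis_date = date_re.search(iis).group(1)
sambar_date = date_re.search(sambar).group(1)

print(iis_date)
print(sambar_date)
```

The colons inside the time do not end the match early, because the
terminator needs whitespace right after the colon.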

> I didn't check an Apache server (I'll do that tomorrow) to
> see how it outputs its HTTP headers. I am looking for something
> generic, hence my hoping I could use the : as jump points to back up
> from.

If you really want to do that, you should use a look-ahead assertion.
Something like:
(\S*:\s*.*?)\s(?=\S*:\s)

I haven't tried this in PHP, but in principle it should work.
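
For what it's worth, a quick sketch in Python's re module (again a
stand-in for PHP, on a made-up one-line sample) suggests the pattern
picks up every field that is followed by another "Name: ", but skips
the last one, since nothing follows it for the look-ahead to see. The
second pattern is my own hypothetical variant that also accepts the
end of the string; treat it as an assumption, not a tested PHP recipe.

```python
import re

# Made-up sample: headers flattened onto one whitespace-separated line.
raw = ("HTTP/1.1 200 OK Server: Microsoft-IIS/5.0 "
       "Date: Wed, 21 Apr 2004 06:11:06 GMT Content-Type: text/html")

# The look-ahead pattern as posted: a field ends where whitespace is
# followed by the next "Name: ".
posted = re.compile(r"(\S*:\s*.*?)\s(?=\S*:\s)")
fields = posted.findall(raw)

# Hypothetical variant (not from the thread): the look-ahead also accepts
# the end of the string, so the final field is collected too.
tolerant = re.compile(r"(\S+:\s.*?)(?=\s\S+:\s|\Z)")
all_fields = tolerant.findall(raw)

for field in all_fields:
    print(field)
```

Neither pattern returns the HTTP status line, because it contains no
"Name: " part.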

> Right. That shouldn't be a problem. I have a list of the atoms for PHP
> and they are close to TB.

Excellent.  Do you mind sending me either a link or the list (off list
if you like)?  I was slowly learning some PHP stuff myself, so that
could be very useful.

> I had considered that (just doing multiple reg matches), but wondered
> if there was a better way. It is a very small script, so it wouldn't
> really kill the performance by doing multiple reg matches.

Like I mentioned above, if PHP fills an array with all the matches,
you get the best of both worlds.

> So far, this one has always been first. It'll get ugly if it pops up
> somewhere else on some strange webserver.

Well then, it doesn't have to be hard, just use:
^(.*?)\s+(\S*:\s)
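
A minimal sketch of what that pattern should capture, tried in
Python's re module rather than PHP, on made-up sample strings that
assume the status line comes first and the headers are flattened onto
one line:

```python
import re

# Made-up samples with the two header orders mentioned in the thread.
iis = "HTTP/1.1 200 OK Server: Microsoft-IIS/5.0 Date: Wed, 21 Apr 2004 06:11:06 GMT"
sambar = "HTTP/1.1 200 OK Date: Wed, 21 Apr 2004 06:11:06 GMT Server: SAMBAR"

# Everything from the start of the string up to the first "Name: ".
status_re = re.compile(r"^(.*?)\s+(\S*:\s)")

m_iis = status_re.search(iis)
m_sambar = status_re.search(sambar)

print(m_iis.group(1))     # the status line
print(m_iis.group(2))     # the first header name, with its colon and space
print(m_sambar.group(1))
```

Group 1 is the same for both orders, since it only depends on where
the first "Name: " starts.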

> The order does change with exception to HTTP that I've discovered so
> far anyways.

Well, with multiple regexps, this isn't an issue.  A single TB style
match is more difficult with this restriction.  The only way around it
would be to use If..then statements, but the question becomes: which
is worse?  Running several matches, or processing the matches through
a conditional cascade?
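
To make the "several matches" option concrete, here is a sketch in
Python's re module (standing in for PHP). The helper name find_header
and the sample strings are mine, purely for illustration. It runs one
search per field, so the order of the fields stops mattering and no
conditional cascade is needed.

```python
import re

def find_header(raw, name):
    """Return the value of one header field, or None if it is absent.

    Hypothetical helper: builds a Date-style regexp for any field name.
    """
    pattern = (r"(?i)(?<!\S)" + re.escape(name) +
               r":\s*(.*?)(?:\s\S+:\s|\Z)")
    m = re.search(pattern, raw)
    return m.group(1) if m else None

# Made-up samples with the two header orders mentioned in the thread.
iis = "HTTP/1.1 200 OK Server: Microsoft-IIS/5.0 Date: Wed, 21 Apr 2004 06:11:06 GMT"
sambar = "HTTP/1.1 200 OK Date: Wed, 21 Apr 2004 06:11:06 GMT Server: SAMBAR"

# One independent search per field, whatever order the server used.
for raw in (iis, sambar):
    info = {name: find_header(raw, name) for name in ("Server", "Date")}
    print(info)
```

The cost is one pass over the string per field, which should hardly
matter for a handful of headers.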

> This might just be the best way to do it.

It certainly is the easiest, though you will probably pay in
performance if every clock cycle counts.

> Yeah, but then I'd have to read at least ten posts telling me to
> Google it. Like I really hadn't thought of that! <grin>

<sigh> That's why we need TBPHP, TBEverything_Under_The_Sun.  You'd be
willing to moderate a few more lists, right? ;-)

-- 
Thanks for writing,
 Januk Aggarwal




________________________________________________________

http://www.silverstones.com/thebat/TBUDLInfo.html
