Hi Januk,

On Tue, 20 Apr 2004, at 15:16:34 [GMT -0700] (which was 4:16 PM where
I live) you wrote:
JA> I'll try. What really helps is if you can determine what parts of
JA> your string are constant and which parts change. For example, the
JA> Date part is probably always going to be:
JA> Date: DAY, DD MMM YYYY HH:mm:ss GMT

I'm pretty sure the HTTP request type will always be first so I can
anchor with ^, but IIS likes to put the SERVER before the DATE, and
Sambar is the reverse. So, my thinking was that if I did the HTTP part
and then moved up to the first : and backed up to the first
whitespace, I could grab the next chunk (either DATE or SERVER) up to
the next : (and then back to the first whitespace), and continue that
until I hit the EOL.

I just wasn't sure what I should do to get it started.
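
As a rough sketch of that "jump from colon to colon" idea (in Python rather than PHP, purely for illustration): it assumes the headers arrive flattened onto one line, and the sample string and the names LABEL and split_headers are made up for this example. A label is taken to be a run of non-whitespace ending in a colon that is followed by whitespace, which is what keeps the colons inside HH:mm:ss from being treated as jump points.

```python
import re

# Hypothetical sample: response headers flattened onto a single line.
SAMPLE = ("HTTP/1.1 200 OK Server: Microsoft-IIS/5.0 "
          "Date: Tue, 20 Apr 2004 22:16:34 GMT Connection: close")

# A label: non-whitespace run, then ":", then whitespace or end of string.
# The lookahead rejects the colons inside a time such as 22:16:34.
LABEL = re.compile(r"(?:^|\s)(\S+):(?=\s|$)")

def split_headers(raw):
    """Return (leading HTTP part, {lowercased label: value})."""
    labels = list(LABEL.finditer(raw))
    if not labels:
        return raw.strip(), {}
    # Everything before the first label is the HTTP part.
    status = raw[:labels[0].start()].strip()
    fields = {}
    for i, m in enumerate(labels):
        # A value runs from just after its colon to the start of the next label.
        end = labels[i + 1].start() if i + 1 < len(labels) else len(raw)
        fields[m.group(1).lower()] = raw[m.end():end].strip()
    return status, fields

status, fields = split_headers(SAMPLE)
```

Because the fields are collected by label, it no longer matters whether SERVER or DATE comes first. A value that itself contains something like "word: " would still be split in the wrong place, so this is only as generic as that assumption allows.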


JA> (?i)(Date:\s*.*?)((\s\S+:\s)|\z)
JA> But increasing accuracy has the price of decreasing tolerance for
JA> errors in the string.

Exactly. I didn't check an Apache server (I'll do that tomorrow) to
see how it outputs its HTTP headers. I am looking for something
generic, hence my hoping I could use the : as jump points to back up
from.


JA> Note that I'm using TB specific atoms. You may have to modify the
JA> syntax of these to work in PHP, I don't know. "\s" means any white
JA> space character. "\S" means any non-whitespace character. "\z" is
JA> end of subject (independent of multiline settings). The "(?i)"
JA> just sets the regexp to be case-insensitive. I don't know if php
JA> requires a different method for internal option setting. The "+"
JA> means, match one or more characters of the preceding type.

Right. That shouldn't be a problem. I have a list of the atoms for PHP
and they are close to TB.

JA> So in this case, your basic regexp would be:
JA> (?i)(Date:\s*.*?)((\s\S*:\s*)|\z)

JA> And you could just change the term "Date" to "Server",
JA> "Last-modified", or "Connection" as necessary.  The desired
JA> information should be in subpattern 1

I had considered that (just doing multiple reg matches), but wondered
if there was a better way. It is a very small script, so it wouldn't
really kill the performance by doing multiple reg matches.
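
For reference, a minimal sketch of the one-search-per-field approach, again in Python for illustration and again assuming a flattened, single-line header string. The helper name get_field and the sample are hypothetical. It uses the stricter terminator from the first version of the pattern, (\s\S+:\s), and Python spells the end-of-subject atom \Z rather than \z.

```python
import re

# Hypothetical sample: response headers flattened onto a single line.
SAMPLE = ("HTTP/1.1 200 OK Server: Microsoft-IIS/5.0 "
          "Date: Tue, 20 Apr 2004 22:16:34 GMT Connection: close")

def get_field(raw, name):
    """Return 'Label: value' for one header label, or None if it is absent."""
    # Same shape as the quoted pattern, with the label swapped in.
    pattern = r"(?i)(" + re.escape(name) + r":\s*.*?)((\s\S+:\s)|\Z)"
    m = re.search(pattern, raw)
    # The desired information is in subpattern 1.
    return m.group(1) if m else None

date_line = get_field(SAMPLE, "Date")
server_line = get_field(SAMPLE, "server")
```

One thing to watch: with the looser terminator (\s\S*:\s*), the search on this sample stops at the first colon inside the time and returns only "Date: Tue, 20 Apr 2004". Requiring whitespace after the colon avoids that.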


JA> Now the HTTP one is a bit trickier. If you know that the HTTP
JA> section is always first, and the next field is always the date
JA> field, then your easiest bet is:

So far, this one has always been first. It'll get ugly if it pops up
somewhere else on some strange webserver.
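
A small sketch of anchoring on the HTTP part, under the assumption that it looks like "HTTP/1.1 200 OK" and sits at the very start of the string (Python for illustration; parse_status is a made-up name). If the HTTP part ever shows up somewhere else, the anchored match simply fails instead of returning a wrong answer.

```python
import re

# Anchored at the start: protocol version, then a three-digit status code.
STATUS = re.compile(r"^(HTTP/\d+\.\d+)\s+(\d{3})")

def parse_status(raw):
    """Return (version, code) if raw starts with the HTTP part, else None."""
    m = STATUS.match(raw)
    if m is None:
        return None
    return m.group(1), int(m.group(2))

result = parse_status("HTTP/1.1 200 OK Server: Microsoft-IIS/5.0")
```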


JA> That seems fairly reasonable. What would also work is if you
JA> definitely know the order that the tokens will be listed in. Then
JA> you could search for everything between two labels.

The order does change, with the exception of HTTP, at least on the
servers I've checked so far.

JA> No need to do that if you do a few simple searches instead of one
JA> complex one.

This might just be the best way to do it.

JA> But they would have a better shot at correct syntax... ;-)

Yeah, but then I'd have to read at least ten posts telling me to
Google it. Like I really hadn't thought of that! <grin>


Thank you very much for the help.



-- 
Cheers,
Leif Gregory 

List Moderator (and fellow registered end-user)
PCWize Editor  /  ICQ 216395  /  PGP Key ID 0x7CD4926F
Web Site <http://www.PCWize.com>
TB FAQ   <http://www.silverstones.com/thebat/FAQ.html>
Using The Bat! 2.05 Beta/16 under Windows 2000 5.0 Build 2195 Service Pack 4 
on a P4 1.6Ghz OC'd to 2.32Ghz with 512MB.

Tagline of the day:
A bad day: "Transfer completed (5720468 bytes, 56651 errors, 1 CPS)"




________________________________________________________

http://www.silverstones.com/thebat/TBUDLInfo.html
