Hi Leif,

On Tuesday, April 20, 2004 at 14:47 GMT -0600, an infinite number of
monkeys posting as Leif Gregory typed:

>   At any rate, can anyone come up with a regexp to break this down:

I'll try.  What really helps is if you can determine what parts of
your string are constant and which parts change.  For example, the
Date part is probably always going to be:

Date: DAY, DD MMM YYYY HH:mm:ss GMT

So now you can make a regexp that pulls out the info you need.  Since
you really want all of the date info in the format provided, then your
best bet is probably something fairly generic, like:

(?i)(Date:\s*.*?)((\s\S+:\s)|\z)

That uses your idea of looking for the colon after the next title and
backing up to the preceding whitespace (see below).  You could use
more of the constant text to anchor the match more tightly.  That
would improve accuracy if your matches are returning too much or too
little text.  But increased accuracy comes at the price of decreased
tolerance for errors in the string.
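
To illustrate that trade-off, here is a rough sketch in Python (chosen
only for illustration, not TB or PHP syntax).  It assumes the headers
arrive as one string separated by single spaces, and the names HEADERS
and STRICT_DATE are made up for this example.  The pattern spells out
the whole "DAY, DD MMM YYYY HH:mm:ss GMT" layout:

```python
import re

# Assumption: the headers from the original post, joined by single spaces.
HEADERS = (
    "HTTP/1.1 200 OK Date: Tue, 20 Apr 2004 17:28:23 GMT "
    "Server: SAMBAR Last-modified: Thu, 01 Jan 2004 19:56:39 GMT "
    "Connection: close Content-type: text/html"
)

# Stricter variant: every constant part of the date layout is spelled out.
STRICT_DATE = re.compile(
    r"(?i)\bDate:\s*([a-z]{3}, \d{2} [a-z]{3} \d{4} \d{2}:\d{2}:\d{2} GMT)"
)

strict_hit = STRICT_DATE.search(HEADERS)
strict_date = strict_hit.group(1) if strict_hit else None
print(strict_date)  # Tue, 20 Apr 2004 17:28:23 GMT

# The price: a small deviation (a one-digit day) makes the match fail.
sloppy = HEADERS.replace("20 Apr", "2 Apr")
strict_miss = STRICT_DATE.search(sloppy)
print(strict_miss)  # None
```

The strict version never returns too much text, but it returns nothing
at all as soon as the server formats the date slightly differently.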


Note that I'm using TB-specific atoms.  You may have to modify the
syntax of these to work in PHP, I don't know.  "\s" means any
whitespace character.  "\S" means any non-whitespace character.  "\z"
is end of subject (independent of multiline settings).  The "(?i)"
just sets the regexp to be case-insensitive.  I don't know if PHP
requires a different method for internal option setting.  The "+"
means match one or more of the preceding item.
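
As an example of how the syntax may need adjusting in another engine,
here is a hedged sketch of the generic Date regexp in Python.  Python's
re module spells the end-of-string anchor "\Z", so that is swapped in
for "\z" here.  The single-line HEADERS string and the name DATE_RE are
assumptions for the example:

```python
import re

# Assumption: the headers from the original post, joined by single spaces.
HEADERS = (
    "HTTP/1.1 200 OK Date: Tue, 20 Apr 2004 17:28:23 GMT "
    "Server: SAMBAR Last-modified: Thu, 01 Jan 2004 19:56:39 GMT "
    "Connection: close Content-type: text/html"
)

# Same regexp as in the message, with \Z standing in for \z.
# Group 1: "Date:" plus everything up to the next " Label: " or the end.
DATE_RE = re.compile(r"(?i)(Date:\s*.*?)((\s\S+:\s)|\Z)")

m = DATE_RE.search(HEADERS)
date_field = m.group(1) if m else None
print(date_field)  # Date: Tue, 20 Apr 2004 17:28:23 GMT
```

The colons inside the time do not end the match early, because the
next-label part of the regexp insists on whitespace right after the
colon.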

<snip>
>   HTTP/1.1 200 OK
>   Date: Tue, 20 Apr 2004 17:28:23 GMT
>   Server: SAMBAR
>   Last-modified: Thu, 01 Jan 2004 19:56:39 GMT
>   Connection: close Content-type: text/html

I don't know a bunch about PHP or ASP, but if you can feed the
original string to a bunch of relatively easy regexps, that would
probably be a lot simpler than trying to come up with a long one that
does all the fields.  It would also be easier to move things around as
necessary. 

So in this case, your basic regexp would be:
(?i)(Date:\s*.*?)((\s\S+:\s)|\z)

And you could just change the term "Date" to "Server",
"Last-modified", or "Connection" as necessary.  The desired
information should be in subpattern 1.
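
A minimal sketch of that "one simple regexp per field" idea, again in
Python purely for illustration.  The helper name get_field and the
single-line HEADERS string are assumptions, and "\Z" is Python's
end-of-string anchor standing in for "\z":

```python
import re

# Assumption: the headers from the original post, joined by single spaces.
HEADERS = (
    "HTTP/1.1 200 OK Date: Tue, 20 Apr 2004 17:28:23 GMT "
    "Server: SAMBAR Last-modified: Thu, 01 Jan 2004 19:56:39 GMT "
    "Connection: close Content-type: text/html"
)

def get_field(name, text):
    """Hypothetical helper: return 'Name: value' for one label, or None."""
    # Same basic regexp each time; only the label changes.
    pattern = r"(?i)(" + re.escape(name) + r":\s*.*?)((\s\S+:\s)|\Z)"
    m = re.search(pattern, text)
    return m.group(1) if m else None

# Each lookup is independent, so the order of fields in the string
# does not matter.
for label in ("Date", "Server", "Last-modified", "Connection"):
    print(get_field(label, HEADERS))
```

Because each field gets its own search, no if/elseif chain is needed to
cope with a server that lists the fields in a different order.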

Now the HTTP one is a bit trickier.  If you know that the HTTP section
is always first, and the next field is always the date field, then
your easiest bet is: 

(?i)^(.*?)\s+Date:\s

Again, the match is in subpattern 1.
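
For what it's worth, the same regexp tried in Python (a sketch only; it
assumes the headers are one space-separated string, and STATUS_RE is a
made-up name):

```python
import re

# Assumption: the headers from the original post, joined by single spaces.
HEADERS = (
    "HTTP/1.1 200 OK Date: Tue, 20 Apr 2004 17:28:23 GMT "
    "Server: SAMBAR Last-modified: Thu, 01 Jan 2004 19:56:39 GMT "
    "Connection: close Content-type: text/html"
)

# Everything from the start of the string up to the whitespace
# that precedes "Date:".
STATUS_RE = re.compile(r"(?i)^(.*?)\s+Date:\s")

m = STATUS_RE.search(HEADERS)
status_line = m.group(1) if m else None
print(status_line)  # HTTP/1.1 200 OK
```

This only works while the Date field really does come directly after
the HTTP section; if another field is listed first, it ends up in
subpattern 1 as well.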

>   I don't know if I should home in on the colon and then back up to
>   the first whitespace, or what.

That seems fairly reasonable.  Another option, if you definitely know
the order the tokens will be listed in, is to search for everything
between two labels.
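
A rough sketch of the between-two-labels approach, in Python for
illustration only.  The helper name between and the single-line HEADERS
string are assumptions:

```python
import re

# Assumption: the headers from the original post, joined by single spaces.
HEADERS = (
    "HTTP/1.1 200 OK Date: Tue, 20 Apr 2004 17:28:23 GMT "
    "Server: SAMBAR Last-modified: Thu, 01 Jan 2004 19:56:39 GMT "
    "Connection: close Content-type: text/html"
)

def between(start_label, end_label, text):
    """Hypothetical helper: the value between two labels of known order."""
    pattern = (
        r"(?i)" + re.escape(start_label) + r":\s*(.*?)\s+"
        + re.escape(end_label) + r":"
    )
    m = re.search(pattern, text)
    return m.group(1) if m else None

print(between("Date", "Server", HEADERS))           # Tue, 20 Apr 2004 17:28:23 GMT
print(between("Server", "Last-modified", HEADERS))  # SAMBAR
```

Note that this returns nothing if the two labels appear in the other
order, so it only suits servers whose field order you can rely on.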

>   My plan is to use if...then...elseif to output them regardless of
>   what order they get pulled out of the initial string (i.e. IIS
>   switches the order of the items in the original string).

No need to do that if you do a few simple searches instead of one
complex one.

>   And yes, I know this is a TB list. The PHP list is less than
>   friendly.

But they would have a better shot at correct syntax... ;-)

-- 
Thanks for writing,
 Januk Aggarwal


________________________________________________________

http://www.silverstones.com/thebat/TBUDLInfo.html
