Hi Leif,

On Tuesday, April 20, 2004 at 14:47 GMT -0600, an infinite number of monkeys posting as Leif Gregory typed:
> At any rate, can anyone come up with a regexp to break this down:

I'll try. What really helps is if you can determine which parts of your string are constant and which parts change. For example, the Date part is probably always going to be:

Date: DAY, DD MMM YYYY HH:mm:ss GMT

So now you can make a regexp that pulls out the info you need. Since you want all of the date info in the format provided, your best bet is probably something fairly generic, like:

(?i)(Date:\s*.*?)((\s\S+:\s)|\z)

That uses your idea of looking for the colon after the next title and backing up to the preceding whitespace (see below). You could use more of the constant text to anchor the match more tightly. That would improve accuracy if your matches are returning too much or too little text, but increased accuracy comes at the price of less tolerance for errors in the string.

Note that I'm using TB-specific atoms. You may have to modify the syntax to work in PHP; I don't know.

"\s" means any whitespace character.
"\S" means any non-whitespace character.
"\z" is the end of the subject (independent of multiline settings).
"(?i)" just sets the regexp to be case-insensitive. I don't know if PHP requires a different method for internal option setting.
"+" means match one or more characters of the preceding type.

<snip>

> HTTP/1.1 200 OK
> Date: Tue, 20 Apr 2004 17:28:23 GMT
> Server: SAMBAR
> Last-modified: Thu, 01 Jan 2004 19:56:39 GMT
> Connection: close
> Content-type: text/html

I don't know much about PHP or ASP, but if you can feed the original string to several relatively easy regexps, that would probably be a lot simpler than trying to come up with one long regexp that handles all the fields. It would also be easier to move things around as necessary. So in this case, your basic regexp would be:

(?i)(Date:\s*.*?)((\s\S+:\s)|\z)

And you could just change the term "Date" to "Server", "Last-modified", or "Connection" as necessary.
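
To make the one-regexp-per-field idea concrete, here is a rough sketch in Python (not TB or PHP, so treat the syntax as an approximation). Python's re module spells the end-of-string anchor "\Z", so that stands in for "\z" here. The helper name extract_field and the HEADERS sample are made up for illustration, and the sample assumes the headers arrive one per line with no trailing newline.

```python
import re

# Sample response headers from the quoted message, assumed one field per line.
HEADERS = (
    "HTTP/1.1 200 OK\n"
    "Date: Tue, 20 Apr 2004 17:28:23 GMT\n"
    "Server: SAMBAR\n"
    "Last-modified: Thu, 01 Jan 2004 19:56:39 GMT\n"
    "Connection: close\n"
    "Content-type: text/html"
)


def extract_field(label, text):
    """Return 'Label: value' for one header, or None if the label is absent.

    Hypothetical helper: it builds the generic regexp with the given label
    swapped in for "Date" and returns subpattern 1.
    """
    # Lazy match up to the next "<whitespace>Label:<whitespace>" or end of text.
    pattern = r"(?i)(" + re.escape(label) + r":\s*.*?)((\s\S+:\s)|\Z)"
    match = re.search(pattern, text)
    return match.group(1) if match else None


# One simple search per field, in whatever order you want to output them.
for label in ("Date", "Server", "Last-modified", "Connection"):
    print(extract_field(label, HEADERS))
```

Because each field gets its own search, the order of the headers in the original string should not matter.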
The desired information should be in subpattern 1.

Now the HTTP one is a bit trickier. If you know that the HTTP section is always first, and the next field is always the Date field, then your easiest bet is:

(?i)^(.*?)\s+Date:\s

Again, the match is in subpattern 1.

> I don't know if I should home in on the colon and then back up to
> the first whitespace, or what.

That seems fairly reasonable. What would also work is if you definitely know the order in which the tokens will be listed. Then you could search for everything between two labels.

> My plan is to use if...then...elseif to output them regardless of
> what order they get pulled out of the initial string (i.e. IIS
> switches the order of the items in the original string).

No need to do that if you do a few simple searches instead of one complex one.

> And yes, I know this is a TB list. The PHP list is less than
> friendly.

But they would have a better shot at correct syntax... ;-)

-- 
Thanks for writing,
Januk Aggarwal

________________________________________________________
http://www.silverstones.com/thebat/TBUDLInfo.html
