Re: [PHP] Parsing HTML href-Attribute
On 16/1/09 23:41, Shawn McKenzie wrote:
> Again, I say that it won't work on URLs with spaces, like my web page.html. When I get a minute I'll fix it. I thought spaces in URLs weren't valid markup, but it seems to validate.

Some small points of information:

An HTML4 validator will only check that a HREF value is CDATA, as required by the DTD:

http://www.w3.org/TR/REC-html40/struct/links.html#adef-href
http://www.w3.org/TR/REC-html40/sgml/dtd.html#URI
http://www.w3.org/TR/REC-html40/types.html#type-cdata

Plenty of things can be CDATA without being a valid URI:

http://gbiv.com/protocols/uri/rfc/rfc3986.html

Space characters (U+0020) that are not percent encoded are not valid in a URI:

http://gbiv.com/protocols/uri/rfc/rfc3986.html#collected-abnf

That's not to say that browsers haven't developed error handling for space characters (and other illegal characters) in HREF values. The HTML5 draft proposes an algorithm for parsing and resolving HREF values that includes such error handling:

http://www.whatwg.org/specs/web-apps/current-work/#parsing-urls
http://www.whatwg.org/specs/web-apps/current-work/#resolving-urls

--
Benjamin Hawkes-Lewis

--
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php
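To make the point about unencoded spaces concrete, here is a minimal sketch of percent-encoding a path before it goes into an href. The helper name encode_path() is made up for this illustration, and it assumes the input is a plain path with no scheme, query string, or fragment.

```php
<?php
// Hypothetical helper: percent-encode each segment of a plain path
// (no scheme, query string, or fragment), leaving the "/" separators alone.
function encode_path(string $path): string
{
    // rawurlencode() follows RFC 3986, so a space becomes %20 (not "+").
    return implode('/', array_map('rawurlencode', explode('/', $path)));
}

echo encode_path('docs/my web page.html'), "\n"; // docs/my%20web%20page.html
```

Encoding segment by segment keeps the slashes intact; running rawurlencode() over the whole path would turn them into %2F.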
Re: [PHP] Parsing HTML href-Attribute
Depending on the goal, using the base tag in the head section might help:

http://www.w3.org/TR/REC-html40/struct/links.html#h-12.4

Thank you,
Micah Gersten
onShore Networks
Internal Developer
http://www.onshore.com

Edmund Hertle wrote:
> Hey, I want to parse an href attribute in a given string to check if there is a relative link and then add an absolute path.
RE: [PHP] Parsing HTML href-Attribute
-----Original Message-----
From: farn...@googlemail.com [mailto:farn...@googlemail.com] On Behalf Of Edmund Hertle
Sent: Thursday, January 15, 2009 4:13 PM
To: PHP - General
Subject: [PHP] Parsing HTML href-Attribute

> Hey, I want to parse an href attribute in a given string to check if there is a relative link and then add an absolute path.
> Any ideas? Or can someone create a RegEx to use?

Just spitballing here, but this is probably how I would start:

RegEx pattern: /<a.*? href=(.+?)>/ig

Then, using the capture group, determine if the href attribute uses quotes (single or double, doesn't matter). If it does, you don't need to worry about splitting the capture group at the first whitespace. If it doesn't, then you must assume the first whitespace is the end of the URL and the beginning of additional attributes, and just grab the URL up to (but not including) the first whitespace. So...

<?php
# here is where $anchorText (text for the <a> tag) would be assigned
# here is where $curDir (text for the current directory) would be assigned

# find the href attribute (preg_match() has no "g" modifier, so only "i" is used)
$matches = array();
if (!preg_match('#<a.*? href=(.+?)>#i', $anchorText, $matches)) {
    exit('no href attribute found');
}
$href = $matches[1];

# determine if it has surrounding quotes
if ($href[0] == '\'' || $href[0] == '"') {
    # pull everything between the opening quote and its matching closing quote
    $endPos = strpos($href, $href[0], 1);
    $href = ($endPos === false) ? substr($href, 1) : substr($href, 1, $endPos - 1);
} else {
    # pull up to the first space (if there is one)
    $spacePos = strpos($href, ' ');
    if ($spacePos !== false)
        $href = substr($href, 0, $spacePos);
}

# now, check to see if it is relative or absolute
# (regex pattern searches for protocol spec (e.g., http://), which will be
# treated as an absolute path for the purpose of this algorithm)
if ($href != '' && $href[0] != '/' && preg_match('#^\w+://#', $href) == 0) {
    # add current directory to the beginning of the relative path
    # (nothing is done to absolute paths or URLs with protocol spec)
    $href = $curDir . '/' . $href;
}

echo $href;
?>

...UNTESTED.

HTH,

// Todd
Re: [PHP] Parsing HTML href-Attribute
Boyd, Todd M. wrote:
> Just spitballing here, but this is probably how I would start:
>
> RegEx pattern: /<a.*? href=(.+?)>/ig

Wow, that's a lot! This should work with or without quotes and assumes no spaces in the URL:

$prefix = "http://example.com/";
$html = preg_replace("|(href=['\"]?)(?!$prefix)([^'\"\s]+)(\s)?|", "$1$prefix$2$3", $html);

--
Thanks!
-Shawn
http://www.spidean.com
Re: [PHP] Parsing HTML href-Attribute
On Thu, Jan 15, 2009 at 5:13 PM, Edmund Hertle <edmund.her...@student.kit.edu> wrote:
> Hey, I want to parse an href attribute in a given string to check if there is a relative link and then add an absolute path.

You could also use DOM for this.

http://us2.php.net/manual/en/domdocument.getelementsbytagname.php
Re: [PHP] Parsing HTML href-Attribute
On Fri, Jan 16, 2009 at 10:54 AM, Eric Butera <eric.but...@gmail.com> wrote:
> You could also use DOM for this.
>
> http://us2.php.net/manual/en/domdocument.getelementsbytagname.php

only if it's parseable xml :)
Re: [PHP] Parsing HTML href-Attribute
On Fri, Jan 16, 2009 at 10:58 AM, mike <mike...@gmail.com> wrote:
> only if it's parseable xml :)

Or not! Ignore me. Supposedly this can handle HTML too. I'll have to try it next time. Normally I wind up having to use tidy to scrub a document and try to get it into xhtml and then use simplexml. I wonder how well this would work with [crappy] HTML input.
Re: [PHP] Parsing HTML href-Attribute
On Fri, Jan 16, 2009 at 1:59 PM, mike <mike...@gmail.com> wrote:
> Or not! Ignore me. Supposedly this can handle HTML too. I'll have to try it next time. Normally I wind up having to use tidy to scrub a document and try to get it into xhtml and then use simplexml. I wonder how well this would work with [crappy] HTML input.

Great if you use @. ;) I'd try to make sure all of my input was stored as proper x/html in the db before I really tried parsing it, so I'm not sure of his setup, but I use getElementsByTagName all the time and love it.
Re: [PHP] Parsing HTML href-Attribute
Shawn McKenzie wrote:
> This should work with or without quotes and assumes no spaces in the URL:
>
> $prefix = "http://example.com/";
> $html = preg_replace("|(href=['\"]?)(?!$prefix)([^'\"\s]+)(\s)?|", "$1$prefix$2$3", $html);

Might need to keep a preceding slash out of there:

$html = preg_replace("|(href=['\"]?)(?!$prefix)[/]?([^'\"\s]+)(\s)?|", "$1$prefix$2$3", $html);

--
Thanks!
-Shawn
http://www.spidean.com
RE: [PHP] Parsing HTML href-Attribute
-----Original Message-----
From: Shawn McKenzie [mailto:nos...@mckenzies.net]
Sent: Friday, January 16, 2009 1:08 PM
To: php-general@lists.php.net
Subject: Re: [PHP] Parsing HTML href-Attribute

> Might need to keep a preceding slash out of there:
>
> $html = preg_replace("|(href=['\"]?)(?!$prefix)[/]?([^'\"\s]+)(\s)?|", "$1$prefix$2$3", $html);

I believe the OP wanted to leave already-absolute paths alone (i.e., only convert relative paths). The regex does not take into account fully-qualified URLs (e.g., http://www.google.com/search?q=php) and it does not determine if a given path is relative or absolute. He was wanting to take the href attribute of an anchor tag and, **IF** it was a relative path, turn it into an absolute path (meaning to append the relative path to the absolute path of the current script).

That was my understanding. Perhaps you saw it differently, but I don't believe your pattern is enough to accomplish what the OP was asking for--hence a lot of code was in my reply. ;)

Believe me, I'm the first guy to hop on the "do it with a regex!" bandwagon... but there are just some circumstances where regex can't do what you need to do (such as more-than-superficial contextual logic).

HTH,

// Todd
Re: [PHP] Parsing HTML href-Attribute
Boyd, Todd M. wrote:
> I believe the OP wanted to leave already-absolute paths alone (i.e., only convert relative paths). The regex does not take into account fully-qualified URLs (e.g., http://www.google.com/search?q=php) and it does not determine if a given path is relative or absolute. He was wanting to take the href attribute of an anchor tag and, **IF** it was a relative path, turn it into an absolute path (meaning to append the relative path to the absolute path of the current script).

That's exactly what this regex does :-) The (?!$prefix) negative lookahead assertion fails the match if it's already an absolute URL.

--
Thanks!
-Shawn
http://www.spidean.com
Re: [PHP] Parsing HTML href-Attribute
Boyd, Todd M. wrote:
> The regex does not take into account fully-qualified URLs (e.g., http://www.google.com/search?q=php) and it does not determine if a given path is relative or absolute.

Ahh, but you uncovered a problem for me if the href contains an absolute URL that doesn't contain the prefix. Here's the fix:

$html = preg_replace("|(href=['\"]?)(?!http(?:s)?://)[/]?([^'\"\s]+)(\s)?|", "$1http://www.example.com/$2$3", $html);

--
Thanks!
-Shawn
http://www.spidean.com
RE: [PHP] Parsing HTML href-Attribute
-----Original Message-----
From: Shawn McKenzie [mailto:nos...@mckenzies.net]
Sent: Friday, January 16, 2009 2:37 PM
To: php-general@lists.php.net
Subject: Re: [PHP] Parsing HTML href-Attribute

> That's exactly what this regex does :-) The (?!$prefix) negative lookahead assertion fails the match if it's already an absolute URL.

I see that now. I didn't notice the negative look-ahead the first go 'round. However, I still have qualms with it. :) You are only checking for http://, and only for the local server. What I meant by absolute path was, for example, /index.php (the index in the root directory of the server) as opposed to somefolder/index.php (the index in a subfolder of the current directory named 'somefolder').

* http://www.google.com/search?q=php ... absolute path (yes, it's a URL, but treat it as absolute)
* https://www.example.com/index.php ... absolute path (yes, it's a URL, but to the local server)
* /index.php ... absolute path (no protocol given, true absolute path)
* index.php ... relative path (relative to current directory on current server)
* somefolder/index.php ... relative path (same reason)

That is indeed a nifty use of look-ahead, though. That will work for any anchor tag that doesn't reference the server (or any other server) with a protocol spec preceding it. However, if you want to run it through an entire list of anchor tags with any spec (http://, https://, udp://, ftp://, aim://, rss://, etc.)--or lack of spec--and only mess with those that don't have a spec and don't use absolute paths, it needs to get a bit more complex. You've convinced me, however, that it can be done entirely with one regex pattern.

Ooh--one more
Re: [PHP] Parsing HTML href-Attribute
> * http://www.google.com/search?q=php ... absolute path (yes, it's a URL, but treat it as absolute)
> * https://www.example.com/index.php ... absolute path (yes, it's a URL, but to the local server)
> * /index.php ... absolute path (no protocol given, true absolute path)
> * index.php ... relative path (relative to current directory on current server)
> * somefolder/index.php ... relative path (same reason)
>
> That is indeed a nifty use of look-ahead, though. That will work for any anchor tag that doesn't reference the server (or any other server) with a protocol spec preceding it. However, if you want to run it through an entire list of anchor tags with any spec (http://, https://, udp://, ftp://, aim://, rss://, etc.)--or lack of spec--and only mess with those that don't have a spec and don't use absolute paths, it needs to get a bit more complex. You've convinced me, however, that it can be done entirely with one regex pattern.
>
> // Todd

Hey! Wow, I think that was exactly what I was looking for. Thank you all. I haven't tested it yet (I will do that tomorrow), but it sounds very nice.

But Todd just confused me quite a bit with that statement: is /index.php a case where the RegEx will fail?

To add some background: this is about dynamically creating PDF files from HTML source code, and the links should also work in the PDF file. So protocols other than http:// shouldn't be a problem.

-eddy
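The cases in Todd's list can be told apart with a couple of checks. The following is only a sketch: classify_href() and absolutize() are hypothetical names, $curDir is borrowed from Todd's earlier code, and the scheme test assumes every protocol spec of interest is written with "://" (so something like mailto: would be treated as a relative path).

```php
<?php
// Hypothetical helper that sorts an href value into the three kinds
// discussed in the thread. The scheme check follows RFC 3986's scheme
// syntax (a letter, then letters, digits, "+", "-" or ".") plus "://".
function classify_href(string $href): string
{
    if (preg_match('#^[a-z][a-z0-9+.-]*://#i', $href)) {
        return 'url';            // has a protocol spec: leave it alone
    }
    if ($href !== '' && $href[0] === '/') {
        return 'absolute-path';  // rooted at the server: leave it alone
    }
    return 'relative-path';      // relative to the current directory
}

// Prepend $curDir only to relative paths; everything else is returned as is.
function absolutize(string $href, string $curDir): string
{
    return classify_href($href) === 'relative-path'
        ? rtrim($curDir, '/') . '/' . $href
        : $href;
}
```

Under this reading, /index.php is not a failure case; it is simply left untouched because it is already an absolute path on the server.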
Re: [PHP] Parsing HTML href-Attribute
This one time, at band camp, mike <mike...@gmail.com> wrote:
> Or not! Ignore me. Supposedly this can handle HTML too. I'll have to try it next time. Normally I wind up having to use tidy to scrub a document and try to get it into xhtml and then use simplexml. I wonder how well this would work with [crappy] HTML input.

$dom->loadHTML($html)

Kevin
http://phpro.org
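For anyone who wants to try the DOM route on the original problem, here is a rough sketch built on DOMDocument::loadHTML() and getElementsByTagName(). The function name rewrite_links() and the $base parameter are assumptions made for this example, not anything posted in the thread. It uses libxml_use_internal_errors() rather than @ to keep warnings from sloppy markup quiet.

```php
<?php
// Sketch: rewrite relative hrefs with DOMDocument instead of a regex.
// $base is an assumed site prefix; rewrite_links() is a hypothetical name.
function rewrite_links(string $html, string $base): string
{
    $dom = new DOMDocument();
    // Collect parse warnings from sloppy markup instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($dom->getElementsByTagName('a') as $a) {
        $href = $a->getAttribute('href');
        // Skip missing hrefs, in-page anchors, and anything with a scheme.
        if ($href === '' || $href[0] === '#'
            || preg_match('#^[a-z][a-z0-9+.-]*:#i', $href)) {
            continue;
        }
        $a->setAttribute('href', rtrim($base, '/') . '/' . ltrim($href, '/'));
    }
    return $dom->saveHTML();
}
```

Note that saveHTML() returns a whole normalized document (doctype, html and body wrappers, requoted attributes), so the output will not be byte-for-byte the input with only the links changed.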
Re: [PHP] Parsing HTML href-Attribute
This one time, at band camp, Eric Butera <eric.but...@gmail.com> wrote:
> You could also use DOM for this.
>
> http://us2.php.net/manual/en/domdocument.getelementsbytagname.php

http://www.phpro.org/examples/Get-Links-With-DOM.html

Kevin
Re: [PHP] Parsing HTML href-Attribute
Edmund Hertle wrote:
> Is /index.php a case where the RegEx will fail?

That regex should work on all hrefs. index.php and /index.php will be replaced with http://www.example.com/index.php and somedir/index.php and /somedir/index.php will be replaced with http://www.example.com/somedir/index.php. Any URL starting with http:// or https:// will be ignored.

Again, I say that it won't work on URLs with spaces, like my web page.html. When I get a minute I'll fix it. I thought spaces in URLs weren't valid markup, but it seems to validate.

--
Thanks!
-Shawn
http://www.spidean.com
Re: [PHP] Parsing HTML href-Attribute
On Fri, Jan 16, 2009 at 6:18 PM, Kevin Waterson ke...@phpro.org wrote:

> This one time, at band camp, Eric Butera eric.but...@gmail.com wrote:
> You could also use DOM for this.
> http://us2.php.net/manual/en/domdocument.getelementsbytagname.php
> http://www.phpro.org/examples/Get-Links-With-DOM.html
> Kevin

Nice ;)
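
For anyone reading this thread later, the DOM suggestion could look roughly like the sketch below. It is only an illustration, not code from the linked pages: the function name absolutize_hrefs() and the base URL are made up for the example, and it assumes PHP 7 or later with the DOM extension enabled. Protocol-relative links (//host/path) are not handled.

```php
<?php
// Hypothetical helper (not from the thread): prefix relative hrefs with a base URL.
function absolutize_hrefs(string $html, string $base): string
{
    $doc = new DOMDocument();
    // The @ silences the warnings libxml raises on sloppy real-world markup.
    @$doc->loadHTML($html);

    foreach ($doc->getElementsByTagName('a') as $a) {
        $href = $a->getAttribute('href');
        // Skip empty values, fragments, and anything that already has a scheme
        // (http:, https:, ftp:, mailto:, ...).
        if ($href === '' || $href[0] === '#' || preg_match('%^[a-z][a-z0-9+.-]*:%i', $href)) {
            continue;
        }
        $a->setAttribute('href', rtrim($base, '/') . '/' . ltrim($href, '/'));
    }

    // loadHTML() wraps fragments in <html><body>, so the result is a full document.
    return $doc->saveHTML();
}

$out = absolutize_hrefs(
    '<a class="sample" href="/foo/bar.php">x</a> <a href="http://www.google.com/search?q=php">y</a>',
    'http://www.example.com'
);
echo $out;
```

Because the parser reads the attribute value for you, it makes no difference whether the href was written with double quotes, single quotes, or no quotes at all.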
[PHP] Parsing HTML href-Attribute
Hey,

I want to parse an href attribute in a given string to check whether it contains a relative link, and if so prepend an absolute path. Example:

    $string = '<a class="sample" [...additional attributes...] href="/foo/bar.php" >';

I tried using regular expressions, but my knowledge of RegEx is very limited. Things to consider:

- $string could be quite long, but my only concern is the href attributes (so working with explode() would not be very handy)
- It should also work if href= is not using quotes or is using single quotes
- The link could already be an absolute path, so just searching for href= and then inserting the absolute path could mess up the link

Any ideas? Or can someone create a RegEx to use?

Thanks
Re: [PHP] Parsing HTML href-Attribute
Hi Edmund,

You want a regex that looks something like this:

    $result = preg_replace('%(href=)("|\')(?!c:/)(.+?)("|\')%', '\1\2c:/my_absolute_path\3\4', $subject);

This example assumes that your absolute path begins with c:/. You would change this to whatever suits. You would also change c:/my_absolute_path to whatever value is appropriate for the absolute path element that you want to prepend.

Note: this will NOT account for hrefs that are not enclosed in either ' or ". The problem is that while you can probably predict how the substring starts, it is more difficult to determine how it ends, unless you can provide a white list of file extensions for the regex (i.e., if you know you only ever link to, for example, files with .php and/or .html extensions). In that case, you could probably alter the regex to test for these instead of a ' or ".

M is for Murray
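
A variation on the regex idea that also copes with unquoted values and with spaces inside quoted values might look like the sketch below. This is not the pattern discussed in the thread: the helper name absolutize_hrefs_regex() and the base URL are assumptions for illustration, and it treats anything starting with a scheme (http:, https:, ftp:, ...) as already absolute.

```php
<?php
// Hypothetical helper (not from the thread): regex-based href rewriting.
function absolutize_hrefs_regex(string $html, string $base): string
{
    // (?| ... ) is a branch-reset group: whichever alternative matches
    // (double-quoted, single-quoted, or unquoted), the value ends up in group 1.
    $pattern = '%\bhref\s*=\s*(?|"([^"]*)"|\'([^\']*)\'|([^\s>"\']+))%i';

    return preg_replace_callback($pattern, function (array $m) use ($base) {
        $url = $m[1];
        // Leave empty values, fragments, and URLs with a scheme untouched.
        if ($url === '' || $url[0] === '#' || preg_match('%^[a-z][a-z0-9+.-]*:%i', $url)) {
            return $m[0];
        }
        // Always emit a double-quoted attribute.
        return 'href="' . rtrim($base, '/') . '/' . ltrim($url, '/') . '"';
    }, $html);
}

$in  = "<a href=/foo/bar.php> <a href='index.php'> <a href=\"https://www.example.com/index.php\">";
$out = absolutize_hrefs_regex($in, 'http://www.example.com');
echo $out;
```

An unquoted value is taken to end at the first whitespace or >, so an unquoted URL containing spaces still cannot be recovered; only the quoted forms survive that case.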
[PHP] Parsing HTML
I need to extract news items from several news sites. In order to do that, I need to parse the HTML data. I know how to use Regular Expressions, but I wonder if there are other ways to do that. Can anybody please give me some pointers? -- PHP General Mailing List (http://www.php.net/) To unsubscribe, visit: http://www.php.net/unsub.php
RE: [PHP] Parsing HTML
[snip] I need to extract news items from several news sites. In order to do that, I need to parse the HTML data. I know how to use Regular Expressions, but I wonder if there are other ways to do that. Can anybody please give me some pointers? [/snip] Can you be more specific here? This is awfully broad. -- PHP General Mailing List (http://www.php.net/) To unsubscribe, visit: http://www.php.net/unsub.php
Re: [PHP] Parsing HTML
Boby wrote:

> I need to extract news items from several news sites. In order to do that, I need to parse the HTML data. I know how to use Regular Expressions, but I wonder if there are other ways to do that. Can anybody please give me some pointers?

I would suggest using one of the HTML parsing libraries available on the net. Try http://www.sourceforge.net and http://www.phpclasses.org

--
Sumeet Shroff
http://www.prateeksha.com
Web Design and Ecommerce Development, Mumbai India
[PHP] Parsing HTML files
Hi all,

I was wondering if any classes/functions could help me with this little challenge (I'm hopeless at regex ;-)

    <input type="hidden" name="id" value="593" />

I want to extract the value of 'id' from a webpage. Is there any simple way to do this, or am I down to sweating over the regex functions?

Much thanks
--
Nick W
Re: [PHP] Parsing HTML files
No easy way of doing it. A regex, something like:

    $id = preg_replace("/.*<input.*name=\"id\" value=\"[0-9]+\" \/>/", $1, $string);

where $string is a line from your input HTML page.

Abdul-Wahid
Re: [PHP] Parsing HTML files
* and then Abdul-Wahid Paterson declared

> No easy way of doing it. A regex, something like:
> $id = preg_replace("/.*<input.*name=\"id\" value=\"[0-9]+\" \/>/", $1, $string);
> where $string is a line from your input HTML page.

OK, thanks Abdul, much appreciated.

--
Nick W
Re: [PHP] Parsing HTML files
On Fri, 10 Sep 2004 11:58:58 +0100, in php.general [EMAIL PROTECTED] (Abdul-Wahid Paterson) wrote:

>> I was wondering if any classes/functions could help me with this little challenge (I'm hopeless at regex ;-)
>> <input type="hidden" name="id" value="593" />
> No easy way of doing it. A regex, something like:
> $id = preg_replace("/.*<input.*name=\"id\" value=\"[0-9]+\" \/>/", $1, $string);

How about just using an XML-based function? Much cleaner, and it doesn't require the name attribute to come before the value attribute.

    <?php
    $string = '<input type="hidden" name="id" value="593" />';
    $x = xml_parser_create();
    // Fills $array with one entry per tag; attribute names are upper-cased by default
    xml_parse_into_struct($x, $string, $array);
    print $array[0]['attributes']['VALUE'];
    // or, out of curiosity:
    var_dump($array);
    ?>

(And why preg_replace? $1 wouldn't even be set, since no capturing parentheses are used.)

--
- Peter Brodersen
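
The XML parser only works here because the snippet happens to be well-formed; on a whole web page it will usually stop at the first sloppy tag. A hedged alternative, assuming PHP 7 or later with the DOM extension, is to let DOMDocument::loadHTML() read the page and then look the field up by name. The helper name input_value() is invented for this sketch.

```php
<?php
// Hypothetical helper (not from the thread): value of the first <input>
// whose name attribute matches, or null if there is none.
function input_value(string $html, string $name): ?string
{
    $doc = new DOMDocument();
    // @ hides libxml warnings about invalid markup.
    @$doc->loadHTML($html);

    foreach ($doc->getElementsByTagName('input') as $input) {
        if ($input->getAttribute('name') === $name) {
            return $input->getAttribute('value');
        }
    }
    return null;
}

// Attribute order does not matter to the parser.
$page = '<form><input value="593" name="id" type="hidden" /></form>';
var_dump(input_value($page, 'id'));
```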
[PHP] Parsing html to extract images
Hi, anyone can help me parsing html files in order to get all the images containing a file? Thanks, Simon. -- PHP General Mailing List (http://www.php.net/) To unsubscribe, visit: http://www.php.net/unsub.php
Re: [PHP] Parsing html to extract images
Hidrahyl wrote:

> Hi, anyone can help me parsing html files in order to get all the images containing a file? Thanks, Simon.

1. Use fopen() to grab the HTML file you're after.
2. Read in each line to an array using file();
3. Loop through the array, and apply the following reg. exp.:

    preg_match("/<img.*src=[\"'](.*)[\"'].*>/U", $line, $matches);

NOTE: this might need a bit of tweaking, since I'm not too hot on regular expressions... :)

Regards,
David

--
David Grant
Web Developer
[EMAIL PROTECTED]
http://www.wiredmedia.co.uk
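
A line-by-line regex misses img tags that are split over several lines or that use unquoted attributes. As a rough sketch of another route (assuming PHP 7+ with the DOM extension; image_sources() is a name made up for this example), the whole page can be handed to DOMDocument and the src attributes collected from the img elements:

```php
<?php
// Hypothetical helper (not from the thread): list the src of every <img>.
function image_sources(string $html): array
{
    $doc = new DOMDocument();
    // @ hides libxml warnings about invalid markup.
    @$doc->loadHTML($html);

    $srcs = [];
    foreach ($doc->getElementsByTagName('img') as $img) {
        if ($img->hasAttribute('src')) {
            $srcs[] = $img->getAttribute('src');
        }
    }
    return $srcs;
}

// Mixed case, mixed quoting, and a tag split across lines.
$html = "<p><img src='a.png' alt=''>\n<IMG\n  SRC=\"b/c.jpg\"></p>";
print_r(image_sources($html));
```

For a file on disk or a URL, file_get_contents() would supply the $html string.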
[PHP] Parsing HTML
Hi All,

I would like to be able to do a search and replace in an HTML document, for a list of words. For example:

    hello becomes buongiorno
    yes becomes si
    size becomes grossezza

The problem is that if I change the word "size" without considering HTML tags and HTML comments (in the case of inline JavaScript), I'll end up with broken HTML. Is there a way to do the search and replace only outside the tags and comments? It is further complicated by the fact that I would still like to do the replacements within strings, for example within meta tags!

Any ideas?

Henry
-- PHP General Mailing List (http://www.php.net/) To unsubscribe, visit: http://www.php.net/unsub.php
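
One possible approach, sketched under several assumptions: load the page with DOMDocument, touch only text nodes that are not inside script or style elements, and additionally rewrite the content attribute of meta tags. The function name translate_text() is invented here, matching is whole-word and case-sensitive, and non-ASCII pages may need explicit charset handling that this sketch leaves out.

```php
<?php
// Hypothetical helper (not from the thread): replace words in visible text
// and in <meta content="...">, leaving tags, comments, and scripts alone.
function translate_text(string $html, array $words): string
{
    $doc = new DOMDocument();
    // @ hides libxml warnings about invalid markup.
    @$doc->loadHTML($html);
    $xpath = new DOMXPath($doc);

    // One whole-word pattern built from the keys of the word list.
    $quoted  = array_map(function ($w) { return preg_quote($w, '/'); }, array_keys($words));
    $pattern = '/\b(' . implode('|', $quoted) . ')\b/';

    $swap = function (string $text) use ($pattern, $words): string {
        return preg_replace_callback($pattern, function (array $m) use ($words) {
            return $words[$m[1]];
        }, $text);
    };

    // Text nodes only; comments are a different node type and are never selected.
    foreach ($xpath->query('//text()[not(ancestor::script) and not(ancestor::style)]') as $node) {
        $node->data = $swap($node->data);
    }
    // Strings inside meta tags were explicitly wanted.
    foreach ($xpath->query('//meta[@content]') as $meta) {
        $meta->setAttribute('content', $swap($meta->getAttribute('content')));
    }
    return $doc->saveHTML();
}

$html = '<html><head><meta name="description" content="size matters"></head>'
      . '<body><p class="size">hello, size <!-- size --></p>'
      . '<script>var size = 1;</script></body></html>';
$out = translate_text($html, ['hello' => 'buongiorno', 'yes' => 'si', 'size' => 'grossezza']);
echo $out;
```

In the sample, the class attribute, the comment, and the script keep the word "size", while the paragraph text and the meta description are translated.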
[PHP] parsing HTML text
I have written a form screen which has as one of its elements a textarea box in which a user can input some text (like a simple bio) which will appear on another screen. I'd like to edit-check this text. It would be a good idea to make sure that it has, among other things, no form elements, say, or to make sure that if a <font> tag occurs, a matching </font> tag is present.

Is anyone aware of a class or a package which I can use to parse this text and do this kind of validation?

tia -lee

--
When the birdcage is open, the selfish bird flies away, but the virtuous one stays.
donate to causes I care about: http://svcs.affero.net/rm.php?r=leed_25
[PHP] Parsing HTML
Hi,

I have some HTML including several pseudo tags like

    <tag parameter1="doo" parameter2="something" parameter3="what should i say">

After parsing I have an array $content[$i]["name"], where name is the name of the parameter and $i the counter of the tag. I'm using a regular expression to find these tags and explode(" ", $my_tag_line) to get the parameters out.

How can I achieve that parameter values can contain whitespace? I suppose I'll need another regex for split(). Any help is appreciated.

Uli
-- PHP General Mailing List (http://www.php.net/) To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] To contact the list administrators, e-mail: [EMAIL PROTECTED]
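
A possible answer, as a sketch only: instead of splitting the tag on spaces, match each name=value pair with preg_match_all() and allow spaces inside quoted values. The function name tag_attributes() is an assumption for this example, and it expects one tag per call.

```php
<?php
// Hypothetical helper (not from the thread): attribute name => value pairs
// of a single tag, with spaces allowed inside quoted values.
function tag_attributes(string $tag): array
{
    // Group 1 is the name. The branch-reset group (?| ... ) puts the value in
    // group 2 whether it was double-quoted, single-quoted, or bare.
    $pattern = '/([\w:-]+)\s*=\s*(?|"([^"]*)"|\'([^\']*)\'|([^\s>]+))/';
    preg_match_all($pattern, $tag, $matches, PREG_SET_ORDER);

    $attrs = [];
    foreach ($matches as $m) {
        $attrs[$m[1]] = $m[2];
    }
    return $attrs;
}

$line  = '<tag parameter1="doo" parameter2="something" parameter3="what should i say">';
$attrs = tag_attributes($line);
print_r($attrs);
```

The result can be stored as $content[$i] for the i-th tag, which gives the same $content[$i]["name"] layout described above.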
[PHP] Parsing html table into MySQL
I need to get from another page table and put it into MySQL table dynamically for example http://66.96.230.191/table.html so I need to parse this table in database. If you have any code how to implement such operation by using php MySQL please help me; thanks in advance _ Do You Yahoo!? Get your free @yahoo.com address at http://mail.yahoo.com -- PHP General Mailing List (http://www.php.net/) To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] To contact the list administrators, e-mail: [EMAIL PROTECTED]
Re: [PHP] Parsing html table into MySQL
Wait a minute... do you want to parse the HTML to get the values to populate a MySQL table, or do you have this table in another DB and just want it copied to your MySQL one?

If it is the former, you'll need some very hardcore regex work... I once did this... it is very stressful work...

- you need to analyse the HTML document and find patterns that indicate 'begin of row', 'begin and end of column', 'end of row', and 'end of table'
- these patterns must be unique, or you'll find yourself looking for them indefinitely and going into an endless loop
- do a giant loop that only ends on 'end of table' and grab the values within these patterns...

The code to get this done is huge (not complex), and (I expect) will only be used once, right?

p.s.: my new email is [EMAIL PROTECTED]

. Christian Dechery (lemming)
. http://www.tanamesa.com.br
. Gaita-L Owner / Web Developer
Re: [PHP] Parsing html table into MySQL
No, I need to copy the row values from the HTML table; you can see it in the example at http://66.96.230.191/table.html

This is a live score system which updates every 2 minutes, so I need to get these values and parse them into MySQL. After that I need to get some elements from my database and show them in my page. I have problems with regex; I don't know coding well and need only a little support. Please help me :)
Re: Fwd: Re: [PHP] Parsing html table into MySQL
OK... so the hard way. I'm no regex wizard myself... as a matter of fact I suck at it. I do it the hard (old C-like) way. You have to find some chunks of HTML that mark the beginning and end of the data in the table, and use them to walk through the doc fetching what you want... let me give an example:

    <html>
    <table>
    <tr><td><font face=arial>Title</font></td><td><font face=arial>Price</font></td></tr>
    <tr><td><font face=arial>XXX</font></td><td><font face=arial>10.00</font></td></tr>
    <tr><td><font face=arial>YYY</font></td><td><font face=arial>25.2</font></td></tr>
    </table>
    </html>

I know that with regex this would be a lot easier, but you can do this:

    $fp = fopen("htmldoc", "r");
    while (!feof($fp)) {
        // let's find the first row of DATA (the first row holds only titles);
        // the feof() checks keep the loops from spinning forever at end of file
        while (!feof($fp) && !strstr(fgets($fp, 256), "</tr>"));
        // now we are at the first data line
        while (!feof($fp) && !strstr($line = fgets($fp, 256), "</table>")) {
            // $line holds one data row; see where I'm getting at?
        }
    }
    fclose($fp);
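
For comparison, here is a sketch of the same job done with the DOM extension instead of fgets() and strstr(). It assumes PHP 7 or later; table_rows() is a made-up name, the sample markup is the small table from the example above, and nested tables are not handled. Storing the rows in MySQL is left out; each $row would simply be bound to an INSERT statement.

```php
<?php
// Hypothetical helper (not from the thread): every <tr> as an array of cell texts.
function table_rows(string $html): array
{
    $doc = new DOMDocument();
    // @ hides libxml warnings about invalid markup.
    @$doc->loadHTML($html);

    $rows = [];
    foreach ($doc->getElementsByTagName('tr') as $tr) {
        $cells = [];
        foreach ($tr->getElementsByTagName('td') as $td) {
            // textContent drops the <font> wrappers and keeps only the text.
            $cells[] = trim($td->textContent);
        }
        if ($cells) {
            $rows[] = $cells;
        }
    }
    return $rows;
}

$html = <<<'HTML'
<html>
<table>
<tr><td><font face=arial>Title</font></td><td><font face=arial>Price</font></td></tr>
<tr><td><font face=arial>XXX</font></td><td><font face=arial>10.00</font></td></tr>
<tr><td><font face=arial>YYY</font></td><td><font face=arial>25.2</font></td></tr>
</table>
</html>
HTML;

$rows = table_rows($html);
array_shift($rows); // drop the title row
print_r($rows);
```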
[PHP] Parsing HTML files?
Is it possible to parse an HTML like at: http://hyrum.net/wwbl/HTML/watrost.htm ? I'd like to be able to grab the player name and ratings and add them to a pretty HTML output :) Jeff
RE: [PHP] Parsing HTML files?
Yeah, it is doable: just use fsockopen(), parse the input into your database, and go wild.

Keep in mind that doing it directly on each request is VERY slow. You should pre-parse the page and then show the data from your own resources.

Sincerely,
Maxim Maletsky
Founder, Chief Developer
PHPBeginner.com (Where PHP Begins)
[EMAIL PROTECTED]
www.phpbeginner.com
[PHP] Parsing HTML files from an external web server
Hello List.

I'm having a little trouble with parsing HTML files and inputting the data from the HTML file into a MySQL database. I get the following error when trying to parse the file:

    Warning: file("http://www.server.com/file.htm") - No error in d:\webpages\world\lists.php on line 8

The following is part of my PHP code:

    <?
    $url = "http://www.server.com/file.htm";
    $fileArray = file($url);
    $state = 0;
    $line = 0;
    $ProvinceCount = 0;
    $Details = Array();

I then have more code which parses the file and puts the data in an array. I was wondering whether anybody could provide information as to what the possible problem could be. If you want more information, please contact me off-list.

Thanks,
James
[PHP] Parsing HTML tags
Could anyone tell me how to extract a string from between a pair of HTML tags? Specifically, I would like to extract the page title from between the <title> and </title> tags. I have read the regular expression docs and I'm still a bit stuck. Can anyone help?

Thanks in advance,
Chris Empson
[EMAIL PROTECTED]
RE: [PHP] Parsing HTML tags
I use this function:

    function title($filename, $dir)
    {
        $loc = "path/to/dir/where/file/is";
        $title = null;
        if (is_file("$loc/$filename")) {
            $open = fopen("$loc/$filename", "r");
            while (!feof($open)) {
                $line = fgets($open, 255);
                // ereg() was removed in PHP 7; preg_match() does the same job here
                if (preg_match('%<title>([^<]*)</title>%', $line, $regs)) {
                    $title = $regs[1];
                }
            }
            fclose($open);
        }
        return $title;
    }

Call it like so:

    print(title("home.htm", "web/articles"));

The only drawback is that if there are any tags in between the <title></title> tags it will not get the title. Also, if the title is on two lines like this:

    <title>This is the title
    of my page</title>

it won't get the title either.

hth

Thank you
Brian Paulson
Sr. Web Developer
[EMAIL PROTECTED]
http://www.chieftain.com
1-800-269-6397
Re: [PHP] Parsing HTML tags
    // Get the webpage into a string
    $html = join("", file("http://www.altavista.com"));

    // Using eregi (note: the ereg functions were removed in PHP 7)
    eregi("<title>(.*)</title>", $html, $tag_contents);

    // Using preg_match (faster than eregi)
    // The i at the end means that it is a case-insensitive match
    preg_match("/<title>(.*)<\/title>/i", $html, $tag_contents);

    $title = $tag_contents[1];

// Tobias
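
Both limitations mentioned in this thread (a title spread over two lines, and upper-case tags) can be covered by running one pattern over the whole page with the s and i modifiers. The following is a sketch rather than anyone's posted code; page_title() is a hypothetical name.

```php
<?php
// Hypothetical helper (not from the thread): text of the first <title>, or null.
function page_title(string $html): ?string
{
    // s lets . match newlines, i ignores case, .*? stops at the first </title>.
    if (preg_match('%<title[^>]*>(.*?)</title>%is', $html, $m)) {
        // Collapse line breaks and runs of whitespace inside the title.
        return trim(preg_replace('/\s+/', ' ', $m[1]));
    }
    return null;
}

$html = "<html><head><TITLE>This is the title\nof my page</TITLE></head></html>";
echo page_title($html);
```

For a remote page, file_get_contents() would supply the $html string, subject to the server allowing URL access.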
RE: [PHP] parsing html / xml (more)
Hi,

I wrote php-lib-htmlparse. It doesn't do the arguments stuff yet, but that should be easy to add (the code is there already, although not functioning). Go to www.phpbuilder.com for the code snippets.

Bolke
[PHP] parsing html / xml
I'd like to parse an HTML file in much the same way the XML parser works, i.e. calling a method for every tag encountered and so on. The XML parsing methods don't seem to be forgiving enough for much of the HTML that's out there. For example, many HTML files have tags like this:

    <TABLE border=0>

but xml_parse() will choke on it because there are no quotes around the "0". Also, HTML tags are, in practice, case insensitive, so this is found in many HTML documents:

    <B>This is bold</b>

but xml_parse() doesn't like it because it expects the opening and closing tags to be same-case.

Are there other functions or libraries I'm not aware of that help in parsing HTML? Or some options in xml_parse to get by these problems?

Thanks in advance.

Nate
[PHP] parsing html / xml (more)
Here's another case that shows up often in HTML, but is illegal in XML, that I would need to parse: meta tags, p tags, hr tags, and other "singletons".

    <HEAD>
    <META HTTP-EQUIV="Content-Type" CONTENT="text/html">
    </HEAD>

xml_parse would give an error, because the HEAD block is being closed with a still-open META "block".

Nate
Re: [PHP] parsing html / xml (more)
On Wed, Mar 07, 2001 at 10:07:37AM -0700, Nathaniel Hekman wrote:

> Here's another case that shows up often in HTML, but is illegal in XML, that I would need to parse: meta tags, p tags, hr tags, and other "singletons".
> <HEAD> <META HTTP-EQUIV="Content-Type" CONTENT="text/html"> </HEAD>
> xml_parse would give an error, because the HEAD block is being closed with a still-open META "block".

Within the context of parsing HTML as XML, there's not really much that can be done. I suppose you could pre-process the HTML to make it XML-compliant, but that's probably more trouble than I would go to.

You're not going to find an XML parser that allows for most HTML, because if such a parser did exist, it would be a broken XML parser. :)

The only kind of HTML you can reliably parse with XML parsers is the XHTML variety (which is simply HTML4, made XML-compliant).

Matt
RE: [PHP] parsing html / xml
Try here to take care of problems:

http://www.w3.org/People/Raggett/tidy/

Chad
RE: [PHP] parsing html / xml (more)
Matt McClanahan wrote:

> [...] You're not going to find an XML parser that allows for most HTML, because if such a parser did exist, it would be a broken XML parser. :) [...]

Fair enough, and that's as I expected. So that brings me to the second part of my question: is there any PHP library that allows parsing of HTML? Perhaps I'll have to write one myself.

All I want really is something that parses a bunch of text and calls handlers whenever tags are encountered. Just like xml_parse, except I don't care if tags are out of order, I don't care about case, and I don't care if there is a close tag for every open.

If anyone knows of a package that does this, please advise. If anyone else would be interested in this, let me know and I could post my code when I'm done (if I have to do this myself).

Nate
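
In PHP versions that ship the DOM extension (PHP 5 and later, so well after this thread), something close to this can be sketched with DOMDocument::loadHTML(), which tolerates unquoted attributes, mixed-case tags, and unclosed elements. The walker below is only an illustration: walk_html() and the two callbacks are hypothetical names, and the callbacks fire on the repaired tree that libxml builds, not on the raw tag stream.

```php
<?php
// Hypothetical helper (not from the thread): call $onOpen / $onClose for every
// element of a leniently parsed HTML document.
function walk_html(string $html, callable $onOpen, callable $onClose): void
{
    $doc = new DOMDocument();
    // @ hides the warnings libxml emits for invalid markup.
    @$doc->loadHTML($html);

    $visit = function (DOMNode $node) use (&$visit, $onOpen, $onClose) {
        if ($node instanceof DOMElement) {
            $attrs = [];
            foreach ($node->attributes as $attr) {
                $attrs[$attr->name] = $attr->value;
            }
            $onOpen($node->nodeName, $attrs);
        }
        foreach ($node->childNodes as $child) {
            $visit($child);
        }
        if ($node instanceof DOMElement) {
            $onClose($node->nodeName);
        }
    };
    $visit($doc);
}

// The problem cases from this thread: unquoted attribute, mixed-case tags,
// an unclosed META, and missing close tags.
$seen = [];
walk_html(
    '<HEAD><META HTTP-EQUIV="Content-Type" CONTENT="text/html"></HEAD>'
    . '<TABLE border=0><TR><TD><B>This is bold</b></TABLE>',
    function (string $tag, array $attrs) use (&$seen) { $seen[$tag] = $attrs; },
    function (string $tag) { /* nothing to do on close in this demo */ }
);
print_r($seen);
```

Tag and attribute names arrive lower-cased, so handlers do not have to care about the case used in the source.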