Re: [PHP] Parsing HTML href-Attribute

2009-01-18 Thread Benjamin Hawkes-Lewis


On 16/1/09 23:41, Shawn McKenzie wrote:

Again, I say that it won't work on URLs with spaces, like my web
page.html.  When I get a minute I'll fix it.  I thought spaces in URLs
weren't valid markup, but it seems to validate.


Some small points of information:

An HTML4 validator will only check that an HREF value is CDATA, as 
required by the DTD:


http://www.w3.org/TR/REC-html40/struct/links.html#adef-href

http://www.w3.org/TR/REC-html40/sgml/dtd.html#URI

http://www.w3.org/TR/REC-html40/types.html#type-cdata

Plenty of things can be CDATA without being a valid URI:

http://gbiv.com/protocols/uri/rfc/rfc3986.html

Space characters (U+0020) that are not percent encoded are not valid in 
a URI:


http://gbiv.com/protocols/uri/rfc/rfc3986.html#collected-abnf

That's not to say that browsers haven't developed error handling for 
space characters (and other illegal characters) in HREF values.
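
To make the percent-encoding point concrete in PHP, here is a minimal sketch. It assumes the href is a plain path with no query string or fragment, and that it has not been encoded already. encode_path() is a hypothetical helper name, not something from this thread; rawurlencode() is the standard PHP function that turns a space into %20.

```php
<?php
// Hypothetical helper: percent-encode each segment of a path while keeping
// the "/" separators intact. Assumes the input is NOT already encoded
// (an existing "%20" would be double-encoded to "%2520").
function encode_path(string $path): string
{
    return implode('/', array_map('rawurlencode', explode('/', $path)));
}

echo encode_path('my web page.html'), "\n";        // my%20web%20page.html
echo encode_path('/docs/my web page.html'), "\n";  // /docs/my%20web%20page.html
```

Encoding segment by segment rather than the whole string keeps the slashes from being turned into %2F.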


The HTML5 draft proposes an algorithm for parsing and resolving HREF 
values that includes such error handling:


http://www.whatwg.org/specs/web-apps/current-work/#parsing-urls

http://www.whatwg.org/specs/web-apps/current-work/#resolving-urls

--
Benjamin Hawkes-Lewis

--
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php

Re: [PHP] Parsing HTML href-Attribute

2009-01-18 Thread Micah Gersten

Depending on the goal, using the base tag in the head section might help:
http://www.w3.org/TR/REC-html40/struct/links.html#h-12.4
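
A rough sketch of the base-tag idea in PHP, assuming the markup contains a head element and that whatever consumes the HTML actually honours base (that needs checking for the tool in question). add_base_tag() is a hypothetical helper name used only for illustration.

```php
<?php
// Hypothetical helper: insert <base href="..."> right after the first
// opening <head> tag, leaving every href in the document untouched.
function add_base_tag(string $html, string $baseUrl): string
{
    $base = '<base href="' . htmlspecialchars($baseUrl, ENT_QUOTES) . '">';
    // A callback avoids "$" in the URL being read as a backreference;
    // the limit of 1 means only the first <head ...> tag is modified.
    return preg_replace_callback(
        '~<head\b[^>]*>~i',
        function (array $m) use ($base) {
            return $m[0] . $base;
        },
        $html,
        1
    );
}

echo add_base_tag(
    '<html><head><title>t</title></head><body><a href="foo/bar.php">x</a></body></html>',
    'http://www.example.com/'
), "\n";
```

With this approach relative links are resolved by the consumer of the HTML, so no href has to be rewritten at all.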

Thank you,
Micah Gersten
onShore Networks
Internal Developer
http://www.onshore.com



Edmund Hertle wrote:
 Hey,
 I want to parse a href-attribute in a given String to check if there is a
 relative link and then adding an absolute path.
 Example:
 $string  = '<a class="sample" [...additional attributes...]
 href="/foo/bar.php" >';

 I tried using regular expressions but my knowledge of RegEx is very limited.
 Things to consider:
 - $string could be quite long but my concern are only those href attributes
 (so working with explode() would be not very handy)
 - Should also work if href= is not using quotes or using single quotes
 - link could already be an absolute path, so just searching for href= and
 then inserting absolute path could mess up the link

 Any ideas? Or can someone create a RegEx to use?

 Thanks

   


RE: [PHP] Parsing HTML href-Attribute

2009-01-16 Thread Boyd, Todd M.

Just spitballing here, but this is probably how I would start:

RegEx pattern: /<a.*? href=(.+?)>/ig

Then, using the capture group, determine if the href attribute uses quotes 
(single or double, doesn't matter). If it does, you don't need to worry about 
splitting the capture group at the first white space. If it doesn't, then you 
must assume the first whitespace is the end of the URL and the beginning of 
additional attributes, and just grab the URL up to (but not including) the 
first whitespace.

So...

<?php

# here is where $anchorText (text for the <a> tag) would be assigned
# here is where $curDir (text for the current directory) would be assigned

# find the href attribute; PHP's PCRE has no "g" modifier, and preg_match()
# stops at the first match anyway
$matches = array();
if(!preg_match('#<a.*? href=(.+?)>#is', $anchorText, $matches))
{
    die('no href attribute found');
}
$href = $matches[1];

# determine if it has surrounding quotes
if($href[0] == '\'' || $href[0] == '"')
{
    # pull everything between the opening quote and its matching close
    $endPos = strpos($href, $href[0], 1);
    $href = ($endPos === false) ? substr($href, 1) : substr($href, 1, $endPos - 1);
}
else
{
    # pull up to the first space (if there is one)
    $spacePos = strpos($href, ' ');
    if($spacePos !== false)
        $href = substr($href, 0, $spacePos);
}

# now, check to see if it is relative or absolute
# (regex pattern searches for protocol spec (i.e., http://), which will be
# treated as an absolute path for the purpose of this algorithm)
if($href !== '' && $href[0] != '/' && preg_match('#^\w+://#', $href) == 0)
{
    # add current directory to the beginning of the relative path
    # (nothing is done to absolute paths or URLs with protocol spec)
    $href = $curDir . '/' . $href;
}

echo $href;

?>

...UNTESTED.

HTH,

// Todd

Re: [PHP] Parsing HTML href-Attribute

2009-01-16 Thread Shawn McKenzie

Wow, that's a lot!  This should work with or without quotes and assumes
no spaces in the URL:

$prefix = "http://example.com/";
$html = preg_replace("|(href=['\"]?)(?!$prefix)([^>'\"\s]+)(\s)?|",
"$1$prefix$2$3", $html);

-- 
Thanks!
-Shawn
http://www.spidean.com


Re: [PHP] Parsing HTML href-Attribute

2009-01-16 Thread Eric Butera

You could also use DOM for this.

http://us2.php.net/manual/en/domdocument.getelementsbytagname.php
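
A minimal sketch of the DOM route, assuming PHP's DOM extension is available. absolutize_links() and the example base URL are made up for illustration; DOMDocument::loadHTML(), getElementsByTagName(), getAttribute(), setAttribute() and saveHTML() are the standard methods. Treat it as a starting point rather than a finished solution.

```php
<?php
// Hypothetical helper: prefix every relative <a href> with $base.
function absolutize_links(string $html, string $base): string
{
    $dom = new DOMDocument();
    // Collect parse warnings from sloppy HTML internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($dom->getElementsByTagName('a') as $a) {
        $href = $a->getAttribute('href');
        // Skip missing hrefs, anything with a scheme (http:, mailto:, ...),
        // protocol-relative URLs (//host/...) and pure fragments (#top).
        if ($href === '' || preg_match('~^(?:[a-z][a-z0-9+.-]*:|//|#)~i', $href)) {
            continue;
        }
        $a->setAttribute('href', rtrim($base, '/') . '/' . ltrim($href, '/'));
    }

    // saveHTML() returns a whole document: a doctype and <html>/<body>
    // wrappers are added around the original fragment.
    return $dom->saveHTML();
}

echo absolutize_links(
    "<a class=sample href=/foo/bar.php>one</a> <a href='http://www.google.com/'>two</a>",
    'http://www.example.com/'
);
```

Because the parser deals with unquoted, single-quoted and double-quoted attributes itself, none of the quoting cases from the original question need special handling here.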


Re: [PHP] Parsing HTML href-Attribute

2009-01-16 Thread mike

On Fri, Jan 16, 2009 at 10:54 AM, Eric Butera eric.but...@gmail.com wrote:

 You could also use DOM for this.

 http://us2.php.net/manual/en/domdocument.getelementsbytagname.php

only if it's parseable xml :)


Re: [PHP] Parsing HTML href-Attribute

2009-01-16 Thread mike

On Fri, Jan 16, 2009 at 10:58 AM, mike mike...@gmail.com wrote:

 only if it's parseable xml :)


Or not! Ignore me. Supposedly this can handle HTML too. I'll have to
try it next time. Normally I wind up having to use tidy to scrub a
document and try to get it into xhtml and then use simplexml. I wonder
how well this would work with [crappy] HTML input.


Re: [PHP] Parsing HTML href-Attribute

2009-01-16 Thread Eric Butera

On Fri, Jan 16, 2009 at 1:59 PM, mike mike...@gmail.com wrote:
 On Fri, Jan 16, 2009 at 10:58 AM, mike mike...@gmail.com wrote:

 only if it's parseable xml :)


 Or not! Ignore me. Supposedly this can handle HTML too. I'll have to
 try it next time. Normally I wind up having to use tidy to scrub a
 document and try to get it into xhtml and then use simplexml. I wonder
 how well this would work with [crappy] HTML input.


Great if you use @.  ;)  I'd try to make sure all of my input was
stored as proper x/html in the db before I really tried parsing it, so
I'm not sure of his setup, but I use getElementsByTagName all the time
and love it.


Re: [PHP] Parsing HTML href-Attribute

2009-01-16 Thread Shawn McKenzie

Shawn McKenzie wrote:
 Wow, that's a lot!  This should work with or without quotes and assumes
 no spaces in the URL:

 $prefix = "http://example.com/";
 $html = preg_replace("|(href=['\"]?)(?!$prefix)([^>'\"\s]+)(\s)?|",
 "$1$prefix$2$3", $html);

Might need to keep a preceding slash out of there:

$html = preg_replace("|(href=['\"]?)(?!$prefix)[/]?([^>'\"\s]+)(\s)?|",
"$1$prefix$2$3", $html);

-- 
Thanks!
-Shawn
http://www.spidean.com


RE: [PHP] Parsing HTML href-Attribute

2009-01-16 Thread Boyd, Todd M.

 Shawn McKenzie wrote:
 Might need to keep a preceding slash out of there:

 $html = preg_replace("|(href=['\"]?)(?!$prefix)[/]?([^>'\"\s]+)(\s)?|",
 "$1$prefix$2$3", $html);

I believe the OP wanted to leave already-absolute paths alone (i.e., only 
convert relative paths). The regex does not take into account fully-qualified 
URLs (i.e., http://www.google.com/search?q=php) and it does not determine if a 
given path is relative or absolute. He was wanting to take the href attribute 
of an anchor tag and, **IF** it was a relative path, turn it into an absolute 
path (meaning to append the relative path to the absolute path of the current 
script).

That was my understanding. Perhaps you saw it differently, but I don't believe 
your pattern is enough to accomplish what the OP was asking for--hence a lot 
of code was in my reply. ;)

Believe me, I'm the first guy to hop on the "do it with a regex!" bandwagon... 
but there are just some circumstances where regex can't do what you need to do 
(such as more-than-superficial contextual logic).

HTH,

// Todd

Re: [PHP] Parsing HTML href-Attribute

2009-01-16 Thread Shawn McKenzie

Boyd, Todd M. wrote:
 I believe the OP wanted to leave already-absolute paths alone (i.e., 
 only convert relative paths). The regex does not take into account 
 fully-qualified URLs (i.e., http://www.google.com/search?q=php) and 
 it does not determine if a given path is relative or absolute. He was
  wanting to take the href attribute of an anchor tag and, **IF** it 
 was a relative path, turn it into an absolute path (meaning to append
  the relative path to the absolute path of the current script).

That's exactly what this regex does :-)  The (?!$prefix) negative
lookahead assertion fails the match if it's already an absolute URL.

-- 
Thanks!
-Shawn
http://www.spidean.com


Re: [PHP] Parsing HTML href-Attribute

2009-01-16 Thread Shawn McKenzie

Ahh, but you uncovered a problem for me if the href contains an
absolute URL that doesn't contain the prefix.  Here's the fix:

$html =
preg_replace("|(href=['\"]?)(?!http(?:s)?://)[/]?([^>'\"\s]+)(\s)?|",
"$1http://www.example.com/$2$3", $html);

-- 
Thanks!
-Shawn
http://www.spidean.com


RE: [PHP] Parsing HTML href-Attribute

2009-01-16 Thread Boyd, Todd M.

 -Original Message-
 From: Shawn McKenzie [mailto:nos...@mckenzies.net]
 Sent: Friday, January 16, 2009 2:37 PM
 To: php-general@lists.php.net
 Subject: Re: [PHP] Parsing HTML href-Attribute

 That's exactly what this regex does :-)  The (?!$prefix) negative
 lookahead assertion fails the match if it's already an absolute URL.

I see that now. I didn't notice the negative look-ahead the first go 'round. 
However, I still have qualms with it. :) You are only checking for http://, and 
only for the local server. What I meant by absolute path was, for example, 
/index.php (the index in the root directory of the server) as opposed to 
somefolder/index.php (the index in a subfolder of the current directory named 
'somefolder').

* http://www.google.com/search?q=php ... absolute path (yes, it's a URL, but 
treat it as absolute)
* https://www.example.com/index.php ... absolute path (yes, it's a URL, but to 
the local server)
* /index.php ... absolute path (no protocol given, true absolute path)
* index.php ... relative path (relative to current directory on current server)
* somefolder/index.php ... relative path (same reason)
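
The five cases above can be sketched as a small classifier. classify_href() and its return labels are hypothetical names chosen for illustration; the scheme test assumes "name://" is enough to recognise a full URL, which matches the examples in this list but not scheme-only forms such as mailto:.

```php
<?php
// Hypothetical helper mirroring the five cases listed above.
function classify_href(string $href): string
{
    // scheme://... (http, https, ftp, ...) counts as a full URL
    if (preg_match('~^[a-z][a-z0-9+.-]*://~i', $href)) {
        return 'absolute (URL)';
    }
    // a leading slash means a path from the server root
    if ($href !== '' && $href[0] === '/') {
        return 'absolute (path)';
    }
    // everything else is relative to the current directory
    return 'relative';
}

$examples = [
    'http://www.google.com/search?q=php',
    'https://www.example.com/index.php',
    '/index.php',
    'index.php',
    'somefolder/index.php',
];
foreach ($examples as $href) {
    echo str_pad($href, 36), classify_href($href), "\n";
}
```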

That is indeed a nifty use of look-ahead, though. That will work for any anchor 
tag that doesn't reference the server (or any other server) with a protocol 
spec preceding it. However, if you want to run it through an entire list of 
anchor tags with any spec (http://, https://, udp://, ftp://, aim://, rss://, 
etc.)--or lack of spec--and only mess with those that don't have a spec and 
don't use absolute paths, it needs to get a bit more complex. You've convinced 
me, however, that it can be done entirely with one regex pattern.
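
One way the "bit more complex" single pattern might look is sketched below. It assumes any "scheme:" prefix or a leading slash means the value should be left alone, and that only truly relative paths get the current directory prepended. prefix_relative_hrefs() and the /app/docs directory are made up for the example.

```php
<?php
// Hypothetical helper: prepend $curDir to relative hrefs only.
function prefix_relative_hrefs(string $html, string $curDir): string
{
    // (?![a-z][a-z0-9+.-]*:|/) rejects values that start with a scheme
    // (http:, ftp:, aim:, ...) or with a slash (already absolute paths).
    // Excluding quotes from the value class keeps the optional quote in
    // group 1 from being skipped on backtracking.
    $pattern = '~(href=[\'"]?)(?![a-z][a-z0-9+.-]*:|/)([^>\'"\s]+)~i';

    return preg_replace_callback(
        $pattern,
        function (array $m) use ($curDir) {
            return $m[1] . rtrim($curDir, '/') . '/' . $m[2];
        },
        $html
    );
}

$html = '<a href="index.php">a</a> <a href=\'/index.php\'>b</a> '
      . '<a href=ftp://x.org/f>c</a> <a href=somefolder/index.php>d</a>';
echo prefix_relative_hrefs($html, '/app/docs'), "\n";
```

Like the other patterns in this thread, it still assumes there are no spaces inside the URL and does not check that href= belongs to an anchor tag.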

Ooh--one more

Re: [PHP] Parsing HTML href-Attribute

2009-01-16 Thread Edmund Hertle


Hey!
Wow, I think that was exactly what I was looking for... thank you all.
I haven't tested it yet (I will do that tomorrow), but it sounds very nice.

But Todd just confused me quite a bit with his statement: is /index.php a
case where the RegEx will fail?

To add some background: this is about dynamically creating PDF files out of
HTML source code, and the links should also work in the PDF file. So
protocols other than http:// shouldn't be a problem.

-eddy

Re: [PHP] Parsing HTML href-Attribute

2009-01-16 Thread Kevin Waterson

This one time, at band camp, mike mike...@gmail.com wrote:

 On Fri, Jan 16, 2009 at 10:58 AM, mike mike...@gmail.com wrote:
 
  only if it's parseable xml :)
 
 
 Or not! Ignore me. Supposedly this can handle HTML too. I'll have to
 try it next time. Normally I wind up having to use tidy to scrub a
 document and try to get it into xhtml and then use simplexml. I wonder
 how well this would work with [crappy] HTML input.

$dom->loadHTML($html);

Kevin

http://phpro.org


Re: [PHP] Parsing HTML href-Attribute

2009-01-16 Thread Kevin Waterson

This one time, at band camp, Eric Butera eric.but...@gmail.com wrote:

 
 You could also use DOM for this.
 
 http://us2.php.net/manual/en/domdocument.getelementsbytagname.php

http://www.phpro.org/examples/Get-Links-With-DOM.html


Kevin


Re: [PHP] Parsing HTML href-Attribute

2009-01-16 Thread Shawn McKenzie

Edmund Hertle wrote:
 * http://www.google.com/search?q=php ... absolute path (yes, it's a URL,
 but treat it as absolute)
 * https://www.example.com/index.php ... absolute path (yes, it's a URL,
 but to the local server)
 * /index.php ... absolute path (no protocol given, true absolute path)
 * index.php ... relative path (relative to current directory on current
 server)
 * somefolder/index.php ... relative path (same reason)

 That is indeed a nifty use of look-ahead, though. That will work for any
 anchor tag that doesn't reference the server (or any other server) with a
 protocol spec preceding it. However, if you want to run it through an entire
 list of anchor tags with any spec (http://, https://, udp://, ftp://,
 aim://, rss://, etc.)--or lack of spec--and only mess with those that don't
 have a spec and don't use absolute paths, it needs to get a bit more
 complex. You've convinced me, however, that it can be done entirely with one
 regex pattern.

 // Todd
 
 
 Hey!
 Wow, I think that was exactly what I was looking for... thank all of you...
 although I've not tested it, will do that tomorrow, but sounds very nice
 
 But Todd just confused me quite a bit with the statement: Is /index.php a
 case where the RegEx will fail?
 
 To add some background: It is about dynamically creating pdf files out of
 html source code and then the links should also work in the pdf file. So
 protocols other than http:// shouldn't be a problem
 
 -eddy
 
That regex should work on all hrefs. index.php and /index.php will be
replaced with http://www.example.com/index.php and somedir/index.php and
/somedir/index.php will be replaced with
http://www.example.com/somedir/index.php.  Any URL starting with http://
or https:// will be ignored.

Again, I say that it won't work on URLs with spaces, like my web
page.html.  When I get a minute I'll fix it.  I thought spaces in URLs
weren't valid markup, but it seems to validate.
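
For comparison, the DOM route suggested elsewhere in this thread sidesteps
the space problem, because the parser hands back the whole attribute value.
A minimal sketch, assuming PHP 7 or later with the DOM extension; the name
absolutizeHrefs() and the $base argument are made up for illustration, and
paths are joined naively onto the server root, just as the regex does:

```php
<?php
// Hypothetical helper: prefix every relative href in $html with $base.
function absolutizeHrefs(string $html, string $base): string
{
    $dom = new DOMDocument();
    // Collect libxml's complaints about sloppy markup instead of printing them.
    libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();

    foreach ($dom->getElementsByTagName('a') as $a) {
        $href = $a->getAttribute('href');
        // Skip empty values, fragment links, protocol-relative links and
        // anything that already starts with a scheme (http:, https:, mailto:).
        if ($href === '' || $href[0] === '#' || strncmp($href, '//', 2) === 0
            || preg_match('%^[a-z][a-z0-9+.-]*:%i', $href)) {
            continue;
        }
        $a->setAttribute('href', rtrim($base, '/') . '/' . ltrim($href, '/'));
    }
    // saveHTML() returns a full document (doctype, html and body included).
    return $dom->saveHTML();
}

echo absolutizeHrefs('<a href="my web page.html">x</a>', 'http://www.example.com');
```

Note that the result is a complete document, not the original fragment, so
this suits whole pages better than short snippets.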

-- 
Thanks!
-Shawn
http://www.spidean.com


Re: [PHP] Parsing HTML href-Attribute

2009-01-16 Thread Eric Butera

On Fri, Jan 16, 2009 at 6:18 PM, Kevin Waterson ke...@phpro.org wrote:
 This one time, at band camp, Eric Butera eric.but...@gmail.com wrote:


 You could also use DOM for this.

 http://us2.php.net/manual/en/domdocument.getelementsbytagname.php

 http://www.phpro.org/examples/Get-Links-With-DOM.html


 Kevin




Nice ;)


[PHP] Parsing HTML href-Attribute

2009-01-15 Thread Edmund Hertle

Hey,
I want to parse a href-attribute in a given string to check whether it is a
relative link and, if so, add an absolute path.
Example:
$string  = '<a class="sample" [...additional attributes...]
href="/foo/bar.php" >';

I tried using regular expressions but my knowledge of RegEx is very limited.
Things to consider:
- $string could be quite long but my concern are only those href attributes
(so working with explode() would be not very handy)
- Should also work if href= is not using quotes or using single quotes
- link could already be an absolute path, so just searching for href= and
then inserting absolute path could mess up the link

Any ideas? Or can someone create a RegEx to use?

Thanks

Re: [PHP] Parsing HTML href-Attribute

2009-01-15 Thread Murray

Hi Edmund,

You want a regex that looks something like this:

$result = preg_replace('%(href=)("|\')(?!c:/)(.+?)("|\')%',
'\1\2c:/my_absolute_path\3\4', $subject);

This example assumes that your absolute path begins with c:/. You would
change this to whatever suits. You would also change c:/my_absolute_path
to be whatever appropriate value indicates the absolute path element that
you want to prepend.

Note: this will NOT account for hrefs that are not encapsulated in either '
or ". The problem is that while you can probably predict how the
substring starts, it would be more difficult to determine how it ends,
unless you can provide a white list of file extensions for the regex (i.e., if
you know you only ever link to, for example, files with .php and/or .html
extensions). In that case, you probably could alter the regex to test for
these instead of a ' or ".
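
One way around the unquoted case without an extension white list is to let
an unquoted value end at the next whitespace or closing bracket. A rough
sketch, not a full HTML parser; prefixRelativeHrefs() and $base are
hypothetical names, and the result is always written back double-quoted:

```php
<?php
// Hypothetical helper: prefix relative href values with $base, whether the
// value is double-quoted, single-quoted or not quoted at all.
function prefixRelativeHrefs(string $subject, string $base): string
{
    $pattern = '%\bhref\s*=\s*(?:"([^"]*)"|\'([^\']*)\'|([^\s"\'>]+))%i';

    return preg_replace_callback($pattern, function (array $m) use ($base) {
        // Only one of the three groups took part in the match; groups after
        // it are missing from $m, groups before it are empty strings.
        $url = $m[3] ?? $m[2] ?? $m[1];
        if ($url === '' || $url[0] === '#' || strncmp($url, '//', 2) === 0
            || preg_match('%^[a-z][a-z0-9+.-]*:%i', $url)) {
            return $m[0]; // leave absolute URLs and fragments untouched
        }
        $abs = rtrim($base, '/') . '/' . ltrim($url, '/');
        return 'href="' . str_replace('"', '&quot;', $abs) . '"';
    }, $subject);
}

echo prefixRelativeHrefs("<a href=/foo/bar.php> <a href='x.html'>", 'http://www.example.com');
```

Anything that starts with a scheme, including a c:/ style path, is treated
as already absolute and left alone.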

M is for Murray


On Fri, Jan 16, 2009 at 8:13 AM, Edmund Hertle 
edmund.her...@student.kit.edu wrote:

 Hey,
 I want to parse a href-attribute in a given String to check if there is a
 relative link and then adding an absolute path.
 Example:
 $string  = '<a class="sample" [...additional attributes...]
 href="/foo/bar.php" >';

 I tried using regular expressions but my knowledge of RegEx is very
 limited.
 Things to consider:
 - $string could be quite long but my concern are only those href attributes
 (so working with explode() would be not very handy)
 - Should also work if href= is not using quotes or using single quotes
 - link could already be an absolute path, so just searching for href= and
 then inserting absolute path could mess up the link

 Any ideas? Or can someone create a RegEx to use?

 Thanks

[PHP] Parsing HTML

2006-02-16 Thread Boby


I need to extract news items from several news sites.

In order to do that, I need to parse the HTML data.

I know how to use Regular Expressions, but I wonder if there are other 
ways to do that.


Can anybody please give me some pointers?


RE: [PHP] Parsing HTML

2006-02-16 Thread Jay Blanchard

[snip]
I need to extract news items from several news sites.

In order to do that, I need to parse the HTML data.

I know how to use Regular Expressions, but I wonder if there are other 
ways to do that.

Can anybody please give me some pointers?
[/snip]

Can you be more specific here? This is awfully broad.


Re: [PHP] Parsing HTML

2006-02-16 Thread Sumeet


Boby wrote:

I need to extract news items from several news sites.

In order to do that, I need to parse the HTML data.

I know how to use Regular Expressions, but I wonder if there are other 
ways to do that.


Can anybody please give me some pointers?



I would suggest using one of the HTML parsing libraries available on the net.
Try http://www.sourceforge.net and http://www.phpclasses.org


--
Sumeet Shroff
http://www.prateeksha.com
Web Design and Ecommerce Development, Mumbai India


[PHP] Parsing HTML files

2004-09-10 Thread Nick Wilson

Hi all, 

I was wondering if any classes/functions could help me with this little
challenge (I'm hopeless at regex ;-)

<input type="hidden" name="id" value="593" />

I want to extract the value of 'id' from a webpage. Any simple way to do
this or am I down to sweating over the regex functions?

Much thanks
-- 
Nick W


Re: [PHP] Parsing HTML files

2004-09-10 Thread Abdul-Wahid Paterson

No easy way of doing it, regex something like:

$id = preg_replace("/.*<input.*name=\"id\" value=\"[0-9]+\" \/>/", "$1", $string);

where $string is a line from your input'd HTML page

Abdul-Wahid



On Fri, 10 Sep 2004 12:54:37 +0200, Nick Wilson [EMAIL PROTECTED] wrote:
 Hi all,
 
 I was wondering if any classes/functions could help me with this little
 challenge, (im hopeless at regex ;-)
 
 <input type="hidden" name="id" value="593" />
 
 I want to extract the value of 'id' from a webpage. Any simple way to do
 this or am I down to sweating of the regex functions?
 
 Much thanks
 --
 Nick W
 
 



Re: [PHP] Parsing HTML files

2004-09-10 Thread Nick Wilson


* and then Abdul-Wahid Paterson declared
 No easy way of doing it, regex something like:
 
 $id = preg_replace("/.*<input.*name=\"id\" value=\"[0-9]+\" \/>/", "$1", $string);
 
 where $string is a line from your input'd HTML page

OK, thanks abdul, much appreciated..


-- 
Nick W


Re: [PHP] Parsing HTML files

2004-09-10 Thread Peter Brodersen

On Fri, 10 Sep 2004 11:58:58 +0100, in php.general
[EMAIL PROTECTED] (Abdul-Wahid Paterson) wrote:

 I was wondering if any classes/functions could help me with this little
 challenge, (im hopeless at regex ;-)
 
 <input type="hidden" name="id" value="593" />

No easy way of doing it, regex something like:

$id = preg_replace("/.*<input.*name=\"id\" value=\"[0-9]+\" \/>/", "$1", $string);

How about just using an xml-based function? Much cleaner, doesn't
require name-attribute to be present before value-attribute.

<?php
$string = '<input type="hidden" name="id" value="593" />';
$x = xml_parser_create();
xml_parse_into_struct($x,$string,$array);
print $array[0]['attributes']['VALUE'];
// or, out of curiosity:
var_dump($array);
?>

(and why preg_replace? $1 wouldn't even be set since no capturing
parentheses are used)
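
The XML functions only cope with well-formed input. For a whole, possibly
messy page, a sketch along the same lines with the DOM extension might look
like this (assumes PHP 5 or later; the sample markup and variable names are
illustrative only):

```php
<?php
$html = '<form><input type="hidden" value="593" name="id" /></form>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);   // keep parser warnings off the page
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
// Select by name, so the order of the attributes does not matter.
$nodes = $xpath->query('//input[@name="id"]');
$id = $nodes->length > 0 ? $nodes->item(0)->getAttribute('value') : null;

var_dump($id);
```

$id stays null when the page has no input named "id", which is easier to
test for than a failed regex replace.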

-- 
- Peter Brodersen


[PHP] Parsing html to extract images

2003-05-29 Thread Hidrahyl

Hi,

can anyone help me parse html files in order to get all the images
contained in a file?

Thanks, Simon.


Re: [PHP] Parsing html to extract images

2003-05-29 Thread David Grant

Hidrahyl wrote:
Hi,

anyone can help me parsing html files in order to get all the images
containing a file?
Thanks, Simon.

1. Use fopen() to grab the HTML file you're after.
2. Read in each line to an array using file();
3. Loop through the array, and apply the following reg. exp.:
preg_match("/<img.*src=[\"'](.*)[\"'].*>/U", $line, $matches);

NOTE:  this might need a bit of tweaking, since I'm not too hot on
regular expressions... :)
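
A line-by-line match misses tags that wrap across lines and finds at most
one image per line. A sketch of a single pass over the whole document
instead, assuming quoted src values; imageSources() is a made-up name:

```php
<?php
// Hypothetical helper: return the src value of every <img> tag whose src
// attribute is quoted. Works on the whole document, not line by line.
function imageSources(string $html): array
{
    // (["\']) remembers which quote opened the value; \1 requires the same
    // quote to close it. The s modifier lets a tag span several lines.
    preg_match_all('%<img\b[^>]*?\bsrc\s*=\s*(["\'])(.*?)\1%is', $html, $m);
    return $m[2];
}

print_r(imageSources("<p><img src=\"a.png\">\n<IMG alt='x'\n SRC='b/c.jpg'></p>"));
```

To run it against a file, pass in the result of file_get_contents().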

Regards,

David

--
David Grant
Web Developer
[EMAIL PROTECTED]
http://www.wiredmedia.co.uk
Tel: 0117 930 4365, Fax: 0870 169 7625

Wired Media Ltd
Registered Office: 43 Royal Park, Bristol, BS8 3AN
Studio: Whittakers House, 32 - 34 Hotwell Road, Bristol, BS8 4UD
Company registration number: 4016744


[PHP] Parsing HTML

2002-10-28 Thread Henry

Hi All,

I would like to be able to do a search and replace in an HTML document for a
list of words;

for example:
hello becomes buongiorno
yes becomes si
size becomes grossezza

The problem is that if I change the word size without considering html
tags and html comments in the case of inline javascripts I'll end up with
broken html.

Is there a way to only do the search and replace outside the tags and
comments.

It is further complicated by the fact that I would still like to do the
replacements within strings for example within meta tags!
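
One possible approach is to split the document into markup and text and
touch only the text pieces. A rough sketch; translateText() is a made-up
name, matching is case-sensitive, attribute values such as meta content are
NOT handled here, and a ">" inside an attribute value would confuse it:

```php
<?php
// Hypothetical helper: replace whole words from $map in text, but not inside
// tags, comments or <script> blocks.
function translateText(string $html, array $map): string
{
    // The capture group makes preg_split() keep the delimiters, so even
    // indexes hold text and odd indexes hold markup.
    $parts = preg_split('%(<!--.*?-->|<script\b.*?</script>|<[^>]*>)%is',
                        $html, -1, PREG_SPLIT_DELIM_CAPTURE);
    $words = implode('|', array_map(function ($w) {
        return preg_quote($w, '%');
    }, array_keys($map)));

    foreach ($parts as $i => $part) {
        if ($i % 2 === 0 && $part !== '') {
            $parts[$i] = preg_replace_callback("%\\b($words)\\b%", function ($m) use ($map) {
                return $map[$m[1]];
            }, $part);
        }
    }
    return implode('', $parts);
}

echo translateText('<font size="2">size: yes</font>', ['size' => 'grossezza', 'yes' => 'si']);
```

The size attribute inside the font tag is left alone because it sits in an
odd-numbered (markup) piece.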

Any ideas.

Henry




[PHP] parsing HTML text

2002-05-07 Thread Lee Doolan




I have written a form screen which has as one of its elements a
textarea box in which a user can input some text --like a simple
bio-- which will appear on another screen.  I'd like to edit check
this text. It would be a good idea to make sure that it has, among other
things, no <form> elements, say, or to make sure that if a <font>
tag occurs, that a matching </font> tag is present.

Is anyone aware of a class or a package which I can use to parse this
text and do this kind of validation?
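
Short of a full parser, two simple checks cover the cases mentioned. This
is only a sketch: both function names are invented, and counting tags
confirms that the numbers match, not that the tags are properly nested:

```php
<?php
// Hypothetical check: does the text contain any form-related tag?
function containsFormElements(string $text): bool
{
    return preg_match('%<\s*/?\s*(form|input|select|textarea|button)\b%i', $text) === 1;
}

// Hypothetical check: as many <font ...> tags as </font> tags?
function fontTagsBalanced(string $text): bool
{
    $open  = preg_match_all('%<font\b[^>]*>%i', $text);
    $close = preg_match_all('%</font\s*>%i', $text);
    return $open === $close;
}

var_dump(containsFormElements('My bio <input type="text">'));
var_dump(fontTagsBalanced('<font color="red">hello</font>'));
```

Where only a handful of tags should be allowed at all, the built-in
strip_tags() with a list of allowed tags is another option.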

tia
-lee

-- 
When the birdcage is open,   | donate to causes I care about: 
the selfish bird flies away, |http://svcs.affero.net/rm.php?r=leed_25
but the virtuous one stays.  |


[PHP] Parsing HTML

2001-11-29 Thread Ulrich Hacke


Hi,
I have some HTML including several pseudo tags like
<tag parameter1="doo" parameter2="something" parameter3="what should i say">

After parsing I have an array $content[$i][name] where name is the name
of the parameters and $i the counter of the tag. I'm using a regular
expression to find these tags and explode(" ", $my_tag_line) to get the
parameters out. How can I achieve that parameter values can contain
whitespace? I suppose I'll need another regex for split(). Any help is
appreciated.
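
Rather than splitting on spaces, the name/value pairs can be matched
directly. A small sketch, assuming the values are always double-quoted (the
variable names are illustrative):

```php
<?php
$tag = '<tag parameter1="doo" parameter2="something" parameter3="what should i say">';

// Each match is one name="value" pair; the value may contain whitespace.
preg_match_all('/(\w+)\s*=\s*"([^"]*)"/', $tag, $pairs, PREG_SET_ORDER);

$params = [];
foreach ($pairs as $pair) {
    $params[$pair[1]] = $pair[2];
}
print_r($params);
```

$params then maps each parameter name to its full value, spaces included.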

Uli


-- 
PHP General Mailing List (http://www.php.net/)
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
To contact the list administrators, e-mail: [EMAIL PROTECTED]

[PHP] Parsing html table into MySQL

2001-09-16 Thread i_union




I need to get a table from another page and put it into a MySQL table
dynamically,

for example http://66.96.230.191/table.html, so I need to parse this table
into the database.

If you have any code showing how to implement such an operation using PHP and
MySQL, please help me;



thanks in advance




Re: [PHP] Parsing html table into MySQL

2001-09-16 Thread Christian Dechery


wait a minute... do you want to parse the HTML to get the values to 
populate a mysql table, or do you have this table in another DB and just 
want it copied to your mysql one??

If it is the former, you'll need some very hardcore regex work to be done... I
once did this... it is very stressful work...
- you need to analyse the HTML document and find patterns that indicate
'begin of row', 'begin and end of column', 'end of row' and 'end of table'
- these patterns must be unique or you'll find yourself looking for them
indefinitely and going into an endless loop
- do a giant loop that only ends on 'end of table' and grab the values
within these patterns...

The code to get this done is huge (not complex), and (I expect) will be only
used once, right?

At 19:25 16/9/2001 +0500, i_union wrote:


I  need to get from another page table and put it into MySQL table
dynamically



for example http://66.96.230.191/table.html  so I need to parse this table
in database.



If you have any code how to implement such operation by using php MySQL
please help me;



thanks in advance





P.S.: my new email is [EMAIL PROTECTED]

. Christian Dechery (lemming)
. http://www.tanamesa.com.br
. Gaita-L Owner / Web Developer



Re: [PHP] Parsing html table into MySQL

2001-09-16 Thread i_union


No, I need to copy the row values from the HTML table; you can see it in the
example http://66.96.230.191/table.html  This is a live score system which
updates every 2 min, so I need to get these values and parse them into MySQL.
After that I need to get some elements from my database and show them on my
page.

I have problems with regex, I don't know coding well and need only small
support. Please help me :)



- Original Message -
From: Christian Dechery [EMAIL PROTECTED]
To: i_union [EMAIL PROTECTED]; Chris Lambert [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
Sent: Monday, September 10, 2001 7:57 PM
Subject: Re: [PHP] Parsing html table into MySQL


 wait a minute... do you want to parse the HTML to get the values to
 populate a mysql table, or do you have this table in another DB and just
 want it copied to your mysql one??

 If it is the former, you'll some very hardcore regex work to be done... I
 once did this... it is very stressing work...
 - you need to analyse the HTML document and find patterns that indicate
 'begin of row' 'begin and end of column' and 'end of row', 'end of
table' -
 these patterns must be unique or you'll find yourself looking for it
 indefinetly and going into an endless loop - do a giant loop that only
ends
 on 'end of table' and grab the values within this patterns... the code to
 get this done is huge (not complex), and (I expect) will be only used
once,
 right?

 At 19:25 16/9/2001 +0500, i_union wrote:


 I  need to get from another page table and put it into MySQL table
 dynamically
 
 
 
 for example http://66.96.230.191/table.html  so I need to parse this
table
 in database.
 
 
 
 If you have any code how to implement such operation by using php MySQL
 please help me;
 
 
 
 thanks in advance
 
 
 


 P.S.: my new email is [EMAIL PROTECTED]
 
 . Christian Dechery (lemming)
 . http://www.tanamesa.com.br
 . Gaita-L Owner / Web Developer





Re: Fwd: Re: [PHP] Parsing html table into MySQL

2001-09-16 Thread Christian Dechery


ok... so the hard way.

I'm no regex wizard myself... as a matter of fact I suck at it.
I do it the hard (old C like) way.

You have to find some chunks of HTML that determine the end of the data in
the table and use them to walk through the doc fetching what you want... let
me give an example...

<html>
<table>
<tr><td><font face="arial">Title</font></td><td><font
face="arial">Price</font></td></tr>
<tr><td><font face="arial">XXX</font></td><td><font
face="arial">10.00</font></td></tr>
<tr><td><font face="arial">YYY</font></td><td><font
face="arial">25.2</font></td></tr>
</table>
</html>

I know that with regex this would be a lot easier but you can do this:

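$fp = fopen("htmldoc", "r");
while (!feof($fp))
{
 // let's find the first row of DATA (the first row holds only titles)
 while (!feof($fp) && !strstr((string) fgets($fp, 256), "</tr>"));

 // now we are at the first data line
 while (!feof($fp) && !strstr($line = (string) fgets($fp, 256), "</table>"))
 {
  // $line holds one chunk of a data row; see where I'm getting at?
 }
}
fclose($fp);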
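
On current PHP versions the same walk can be left to the DOM extension. A
minimal sketch, assuming a simple table with no nested tables; tableRows()
is a made-up name and the database insert is left out:

```php
<?php
// Hypothetical helper: return the table as an array of rows, each row an
// array of trimmed cell texts.
function tableRows(string $html): array
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);   // tolerate sloppy markup quietly
    $dom->loadHTML($html);
    libxml_clear_errors();

    $rows = [];
    foreach ($dom->getElementsByTagName('tr') as $tr) {
        $cells = [];
        foreach ($tr->getElementsByTagName('td') as $td) {
            // textContent drops the <font> wrappers and keeps the text.
            $cells[] = trim($td->textContent);
        }
        $rows[] = $cells;
    }
    return $rows;
}

$html = '<table>'
      . '<tr><td><font face="arial">Title</font></td><td><font face="arial">Price</font></td></tr>'
      . '<tr><td><font face="arial">XXX</font></td><td><font face="arial">10.00</font></td></tr>'
      . '</table>';
print_r(tableRows($html));
```

Each inner array can then be bound to a prepared INSERT statement; the
first row holds the column titles and would normally be skipped.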

No I need to copy the rows values from HTML table you can see it in exlamle
http://66.96.230.191/table.html This is a live score system which updates
every  2 min, So I need to get these values and parse it in MySQL after that
I neeed to get some element from my database and show in my page..

I have problems in regex I dont know good coding and need only smmall
support Please help me :)



- Original Message -
From: Christian Dechery [EMAIL PROTECTED]
To: i_union [EMAIL PROTECTED]; Chris Lambert [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
Sent: Monday, September 10, 2001 7:57 PM
Subject: Re: [PHP] Parsing html table into MySQL


  wait a minute... do you want to parse the HTML to get the values to
  populate a mysql table, or do you have this table in another DB and just
  want it copied to your mysql one??
 
  If it is the former, you'll some very hardcore regex work to be done... I
  once did this... it is very stressing work...
  - you need to analyse the HTML document and find patterns that indicate
  'begin of row' 'begin and end of column' and 'end of row', 'end of
table' -
  these patterns must be unique or you'll find yourself looking for it
  indefinetly and going into an endless loop - do a giant loop that only
ends
  on 'end of table' and grab the values within this patterns... the code to
  get this done is huge (not complex), and (I expect) will be only used
once,
  right?
 
  At 19:25 16/9/2001 +0500, i_union wrote:
 
 
  I  need to get from another page table and put it into MySQL table
  dynamically
  
  
  
  for example http://66.96.230.191/table.html so I need to parse this
table
  in database.
  
  
  
  If you have any code how to implement such operation by using php MySQL
  please help me;
  
  
  
  thanks in advance
  
  
  



[PHP] Parsing HTML files?

2001-07-07 Thread Jeff Lewis


Is it possible to parse an HTML like at:
http://hyrum.net/wwbl/HTML/watrost.htm ?

I'd like to be able to grab the player name and ratings and add them to a
pretty HTML output :)

Jeff

RE: [PHP] Parsing HTML files?

2001-07-07 Thread Maxim Maletsky


Yeah it is doable, just use fsockopen, and parse the input into your
database and go wild.

Keep in mind - doing it directly on request is VERY slow. You should be
pre-parsing it and then showing the data from your resources.


Sincerely,

 Maxim Maletsky
 Founder, Chief Developer

 PHPBeginner.com (Where PHP Begins)
 [EMAIL PROTECTED]
 www.phpbeginner.com




-Original Message-
From: Jeff Lewis [mailto:[EMAIL PROTECTED]]
Sent: Sunday, July 08, 2001 3:19 AM
To: [EMAIL PROTECTED]
Subject: [PHP] Parsing HTML files?


Is it possible to parse an HTML like at:
http://hyrum.net/wwbl/HTML/watrost.htm ?

I'd like to be able to grab the player name and ratings and add them to a
pretty HTML output :)

Jeff



[PHP] Parsing HTML files from an external web server

2001-04-26 Thread James Kneebone


Hello List.

I'm having a little trouble with parsing HTML files and inputting the data
from the HTML file into a MySQL database. I get the following error when
trying to parse the file.

Warning: file("http://www.server.com/file.htm") - No error in
d:\webpages\world\lists.php on line 8

The following is part of my php code

<?php

$url = "http://www.server.com/file.htm";

$fileArray = file($url);

$state = 0;
$line = 0;
$ProvinceCount = 0;

$Details = Array();



I then have more code which parses the file and parses the data and puts it in
an array.

I was wondering whether anybody could provide information as to what the
possible problem could be. If you want more information, please contact me
off-list.
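
The warning itself says little, so a first step is to check the return
value of file() and surface whatever PHP recorded. A sketch only: it does
not diagnose this particular server, and fetchLines() is a made-up name:

```php
<?php
// Hypothetical helper: read a file or URL into an array of lines, or fail
// loudly with the last error PHP recorded.
function fetchLines(string $source): array
{
    // file() needs allow_url_fopen enabled before it will open http:// URLs.
    $lines = @file($source, FILE_IGNORE_NEW_LINES);
    if ($lines === false) {
        $err = error_get_last();
        throw new RuntimeException(
            "Could not read $source: " . ($err['message'] ?? 'unknown error')
        );
    }
    return $lines;
}
```

Calling fetchLines($url) in place of file($url) either returns the lines or
throws, so the parsing code never runs on a failed download.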

Thanks,

James  


[PHP] Parsing HTML tags

2001-04-13 Thread Chris Empson


Could anyone tell me how to extract a string from between a pair of HTML 
tags?

Specifically, I would like to extract the page title from between the
<title> and </title> tags. I have read the regular expression docs and I'm
still a bit stuck.

Can anyone help?

Thanks in advance, 

Chris Empson

[EMAIL PROTECTED]



RE: [PHP] Parsing HTML tags

2001-04-13 Thread Brian Paulson


I use this function

function title($filename,$dir)
{
    $loc = "path/to/dir/where/file/is";
    if(is_file("$loc/$filename"))
    {
        $open=fopen("$loc/$filename","r");
        while(!feof($open))
        {
            $line=fgets($open,255);
            $string = $line;
            // note: ereg() was removed in PHP 7; preg_match() is the replacement
            while(ereg( '<title>([^<]*)</title>(.*)', $string, $regs ) )
            {
                $string = $regs[2];
            }
        }
        return $regs[1];
    }
}


call it like so

print(title("home.htm","web/articles"));

The only drawback is if there are any <> tags in between the <title></title>
tags it will not get the title, also if the title is on two lines like this

<title>This is the title of
my page</title>

it won't get the title either.
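
Both drawbacks come from matching one line at a time. A sketch that works
on the whole document at once, assuming PCRE is available; pageTitle() is
an invented name:

```php
<?php
// Hypothetical helper: return the page title, or null if there is none.
function pageTitle(string $html): ?string
{
    // s lets . match newlines, i ignores case, .*? stops at the first
    // closing tag instead of the last one.
    if (preg_match('%<title\b[^>]*>(.*?)</title\s*>%is', $html, $m)) {
        // Collapse line breaks and runs of spaces into single spaces.
        return trim(preg_replace('/\s+/', ' ', $m[1]));
    }
    return null;
}

var_dump(pageTitle("<html><TITLE>This is the title of\nmy page</title></html>"));
```

Tags nested inside the title are returned as part of the string; pass the
result through strip_tags() if they should go.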

hth

Thank you
Brian Paulson
Sr. Web Developer
[EMAIL PROTECTED]
http://www.chieftain.com
1-800-269-6397

-Original Message-
From: Chris Empson [mailto:[EMAIL PROTECTED]]
Sent: Friday, April 13, 2001 8:45 AM
To: [EMAIL PROTECTED]
Subject: [PHP] Parsing HTML tags


Could anyone tell me how to extract a string from between a pair of HTML
tags?

Specifically, I would like to extract the page title from between the
<title> and </title> tags. I have read the regular expression docs and I'm
still a bit stuck.

Can anyone help?

Thanks in advance,

Chris Empson

[EMAIL PROTECTED]







Re: [PHP] Parsing HTML tags

2001-04-13 Thread Tobias Talltorp


// Get the webpage into a string
$html = join ("", file ("http://www.altavista.com"));

// Using eregi
eregi("<title>(.*)</title>", $html, $tag_contents);

// Using preg_match (faster than eregi)
// The i in the end means that it is a case insensitive match
preg_match("/<title>(.*)<\/title>/i", $html, $tag_contents);

$title = $tag_contents[1];

// Tobias

"Chris Empson" [EMAIL PROTECTED] wrote in message
news:9b6vkl$jpf$[EMAIL PROTECTED]...
 Could anyone tell me how to extract a string from between a pair of HTML
 tags?

 Specifically, I would like to extract the page title from between the
 <title> and </title> tags. I have read the regular expression docs and I'm
 still a bit stuck.

 Can anyone help?

 Thanks in advance,

 Chris Empson

 [EMAIL PROTECTED]







RE: [PHP] parsing html / xml (more)

2001-03-08 Thread Bruin, Bolke de


Hi,

I wrote php-lib-htmlparse.
It doesn't do the arguments stuff, but that should be easily
added (the code is there already, although not functioning).

Go to www.phpbuilder.com for the code snippets.

Bolke

-Original Message-
From: Nathaniel Hekman [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, March 07, 2001 9:39 PM
To: '[EMAIL PROTECTED]'
Subject: RE: [PHP] parsing html / xml (more)


Matt McClanahan wrote:
[...]
 You're not going to find an XML parser that allows for most HTML,
 because if such a parser did exist, it would be a broken XML parser. :)
[...]

Fair enough, and that's as I expected.  So that brings me to the second part
of my question:  is there any php library that allows parsing of html?

Perhaps I'll have to write one myself.  All I want really is something that
parses a bunch of text and calls handlers whenever tags are encountered.
Just like xml_parse, except I don't care if tags are out of order, I don't
care about case, and I don't care if there is a close tag for every open.
If anyone knows of a package that does this, please advise.  If anyone else
would be interested in this, let me know and I could post my code when I'm
done (if I have to do this myself).


Nate


[PHP] parsing html / xml

2001-03-07 Thread Nathaniel Hekman


I'd like to parse a html file in much the same way the xml parser works.  Ie
calling a method for every tag encountered and so on.  The xml parsing
methods don't seem to be forgiving enough for much of the html that's out
there.  For example, many html files have tags like this:

<TABLE border=0>

but xml_parse() will choke on it because there are no quotes around the "0".
Also html tags are, in practice, case insensitive, so this is found in many
html documents:

<B>This is bold</b>

but xml_parse() doesn't like it because it expects the opening and closing
tags to be same-case.

Are there other functions or libraries I'm not aware of that help in parsing
html?  Or some options in xml_parse to get by these problems?
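
For readers on later PHP versions: the DOM extension (PHP 5 and up) parses
forgiving HTML through DOMDocument::loadHTML(), and a per-tag handler can
be layered on top. A rough sketch; walkElements() and the handler signature
are invented for illustration:

```php
<?php
// Hypothetical walker: call $onElement for every element, depth first.
function walkElements(DOMNode $node, callable $onElement, int $depth = 0): void
{
    foreach ($node->childNodes as $child) {
        if ($child instanceof DOMElement) {
            $onElement($child, $depth);
            walkElements($child, $onElement, $depth + 1);
        }
    }
}

$dom = new DOMDocument();
libxml_use_internal_errors(true);   // unquoted attributes etc. are tolerated
$dom->loadHTML('<TABLE border=0><tr><td><B>This is bold</b></td></tr></TABLE>');
libxml_clear_errors();

$seen = [];
walkElements($dom, function (DOMElement $el, int $depth) use (&$seen) {
    // Tag names arrive lowercased, whatever case the source used.
    $seen[] = $el->tagName;
    echo str_repeat('  ', $depth), $el->tagName, "\n";
});
```

Unlike xml_parse(), this builds the whole tree first and then visits it, so
it is not a streaming parser.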

Thanks in advance.


Nate


[PHP] parsing html / xml (more)

2001-03-07 Thread Nathaniel Hekman


Here's another case that shows up often in html, but is illegal in xml, that
I would need to parse:  meta tags, p tags, hr tags, and other
"singletons".

<HEAD>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html">
</HEAD>

xml_parse would give an error, because the HEAD block is being closed with a
still-open META "block".


Nate

-Original Message-
From: Nathaniel Hekman [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, March 07, 2001 9:57 AM
To: '[EMAIL PROTECTED]'
Subject: [PHP] parsing html / xml


I'd like to parse a html file in much the same way the xml parser works.  Ie
calling a method for every tag encountered and so on.  The xml parsing
methods don't seem to be forgiving enough for much of the html that's out
there.  For example, many html files have tags like this:

<TABLE border=0>

but xml_parse() will choke on it because there are no quotes around the "0".
Also html tags are, in practice, case insensitive, so this is found in many
html documents:

<B>This is bold</b>

but xml_parse() doesn't like it because it expects the opening and closing
tags to be same-case.

Are there other functions or libraries I'm not aware of that help in parsing
html?  Or some options in xml_parse to get by these problems?

Thanks in advance.


Nate


Re: [PHP] parsing html / xml (more)

2001-03-07 Thread Matt McClanahan


On Wed, Mar 07, 2001 at 10:07:37AM -0700, Nathaniel Hekman wrote:

> Here's another case that shows up often in html, but is illegal in xml, that
> I would need to parse:  meta tags, p tags, hr tags, and other
> "singletons".
>
>   <HEAD>
>   <META HTTP-EQUIV="Content-Type" CONTENT="text/html">
>   </HEAD>
>
> xml_parse would give an error, because the HEAD block is being closed with a
> still-open META "block".

Within the context of parsing HTML as XML, there's not really much that
can be done.  I suppose you could pre-process the HTML to make it
XML-compliant, but that's probably more trouble than I would go to.

You're not going to find an XML parser that allows for most HTML,
because if such a parser did exist, it would be a broken XML parser. :)
The only kind of HTML you can reliably parse with XML parsers is the
XHTML variety (which is simply HTML4, made XML-compliant).
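
To illustrate the XHTML point, here is a small sketch in which the same HEAD/META fragment, rewritten XHTML-style (lowercase names, quoted attributes, self-closed meta), goes through xml_parse() with element handlers attached. It assumes PHP's xml extension is available; the $events array and the "open:"/"close:" labels are made up for the example.

```php
<?php
$events = [];

$parser = xml_parser_create();
// Report element names as written instead of upper-casing them.
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);

xml_set_element_handler(
    $parser,
    // Called for every start tag (a self-closed tag triggers both handlers).
    function ($p, string $name, array $attrs) use (&$events) {
        $events[] = "open:$name";
    },
    // Called for every end tag.
    function ($p, string $name) use (&$events) {
        $events[] = "close:$name";
    }
);

$xhtml = '<head><meta http-equiv="Content-Type" content="text/html" /></head>';
$ok = xml_parse($parser, $xhtml, true);

echo $ok === 1 ? implode(', ', $events) : 'parse error', "\n";
```

The self-closing meta element fires both the start and the end handler, so the parser never sees a "still-open" block.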

Matt

-- 
PHP General Mailing List (http://www.php.net/)
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
To contact the list administrators, e-mail: [EMAIL PROTECTED]

RE: [PHP] parsing html / xml

2001-03-07 Thread Angerer, Chad


Try here to take care of problems..

http://www.w3.org/People/Raggett/tidy/
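
For readers using PHP's optional tidy extension (a binding to HTML Tidy that may not be installed), a clean-up pass along these lines could be sketched as follows. html_to_xhtml() is a hypothetical helper name; the configuration keys are standard Tidy options, and whether the result is clean enough for an XML parser depends on the input.

```php
<?php
// Hypothetical helper (not part of PHP): asks Tidy to repair $html and emit
// XHTML. Returns null if the tidy extension is missing or the repair fails.
function html_to_xhtml(string $html): ?string
{
    if (!function_exists('tidy_repair_string')) {
        return null; // tidy extension not loaded
    }
    $config = [
        'output-xhtml'   => true, // close tags, quote attributes, lowercase names
        'show-body-only' => true, // return just the body content, no wrapper
    ];
    $result = tidy_repair_string($html, $config, 'utf8');
    return $result === false ? null : $result;
}

$repaired = html_to_xhtml('<TABLE border=0><TR><TD>cell</TABLE>');
echo $repaired ?? 'tidy extension not available', "\n";
```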

Chad

-----Original Message-----
From: Nathaniel Hekman [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, March 07, 2001 10:57 AM
To: '[EMAIL PROTECTED]'
Subject: [PHP] parsing html / xml


I'd like to parse an html file in much the same way the xml parser works, i.e.
calling a method for every tag encountered and so on.  The xml parsing
methods don't seem to be forgiving enough for much of the html that's out
there.  For example, many html files have tags like this:

<TABLE border=0>

but xml_parse() will choke on it because there are no quotes around the "0".
Also html tags are, in practice, case insensitive, so this is found in many
html documents:

<B>This is bold</b>

but xml_parse() doesn't like it because it expects the opening and closing
tags to be same-case.

Are there other functions or libraries I'm not aware of that help in parsing
html?  Or some options in xml_parse to get by these problems?

Thanks in advance.


Nate

-- 
PHP General Mailing List (http://www.php.net/)
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
To contact the list administrators, e-mail: [EMAIL PROTECTED]

RE: [PHP] parsing html / xml (more)

2001-03-07 Thread Nathaniel Hekman


Matt McClanahan wrote:
[...]
> You're not going to find an XML parser that allows for most HTML,
> because if such a parser did exist, it would be a broken XML parser. :)
[...]

Fair enough, and that's as I expected.  So that brings me to the second part
of my question:  is there any php library that allows parsing of html?

Perhaps I'll have to write one myself.  All I want really is something that
parses a bunch of text and calls handlers whenever tags are encountered.
Just like xml_parse, except I don't care if tags are out of order, I don't
care about case, and I don't care if there is a close tag for every open.
If anyone knows of a package that does this, please advise.  If anyone else
would be interested in this, let me know and I could post my code when I'm
done (if I have to do this myself).
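
A minimal sketch of the kind of forgiving, handler-based scanner described above, assuming a single regular-expression pass is good enough. scan_html_tags() and parse_attributes() are hypothetical names invented for this example, not an existing package. The sketch ignores text content, comments and script bodies, and makes no attempt to check nesting.

```php
<?php
// Hypothetical helper: splits a raw attribute string into name => value pairs.
// Values may be double-quoted, single-quoted, unquoted, or absent.
function parse_attributes(string $raw): array
{
    $attrs = [];
    $pattern = '~([a-zA-Z_:][-a-zA-Z0-9_:.]*)(?:\s*=\s*(?:"([^"]*)"|\'([^\']*)\'|([^\s"\'>]+)))?~';
    preg_match_all($pattern, $raw, $matches, PREG_SET_ORDER);
    foreach ($matches as $m) {
        // At most one of the three value groups is non-empty, so joining
        // them yields the value regardless of the quoting style used.
        $value = ($m[2] ?? '') . ($m[3] ?? '') . ($m[4] ?? '');
        $attrs[strtolower($m[1])] = $value;
    }
    return $attrs;
}

// Hypothetical helper: calls $onOpen(name, attrs) for each start tag and
// $onClose(name) for each end tag, in document order. Names are lowercased,
// and unbalanced or misnested tags are reported as-is rather than rejected.
function scan_html_tags(string $html, callable $onOpen, callable $onClose): void
{
    $pattern = '~<(/?)([a-zA-Z][a-zA-Z0-9]*)((?:"[^"]*"|\'[^\']*\'|[^>"\'])*)>~';
    preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);
    foreach ($matches as $m) {
        $name = strtolower($m[2]);
        if ($m[1] === '/') {
            $onClose($name);
        } else {
            $onOpen($name, parse_attributes($m[3]));
        }
    }
}

// Usage: the fragments from this thread, which an XML parser rejects.
scan_html_tags(
    '<TABLE border=0><B>This is bold</b>',
    function (string $name, array $attrs) {
        echo "open $name ", json_encode($attrs), "\n";
    },
    function (string $name) {
        echo "close $name\n";
    }
);
```

Lowercasing names sidesteps the case problem, and because the scanner keeps no stack, a missing close tag is simply never reported.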


Nate

-- 
PHP General Mailing List (http://www.php.net/)
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
To contact the list administrators, e-mail: [EMAIL PROTECTED]
