Re: [PHP] regex pattern for extracting URLs

2009-10-23 Thread Jim Lucas
Brad Fuller wrote:
> I'm looking for a regular expression to accomplish a specific task.
>
> I'm hoping someone who's really good at regex patterns can lend a quick hand.
>
> I need a regex pattern that will grab URLs out of HTML that have a
> certain link text. (i.e. the word Continue)
>
> This is what I have so far but it does not work properly (If there are
> other attributes in the <a> tag it returns them as part of the URL.)
>
> preg_match_all('#<a[\s]+[^>]*href\s*=\s*([\"\']+)([^>]+?)(\1|>)Continue</a>#i',
> $html, $matches);
>
> It needs to be able to extract the URL and disregard arbitrary
> attributes in the HTML tag
>
> Test it with the following examples:
>
> <a href="/path/to/url.html">Continue</a>
> <a href='/path/to/url.html'>Continue</a>
> <a href="http://example.com/path/to/url.html" class="link">Continue</a>
> <a style="font-size: 12px" href="http://example.com/path/to/url.html"
> onlick="someFunction('foo','bar')">Continue</a>
>
> Please reply
>
> Your help is much appreciated.
>
> Thanks in advance,
> Brad F.
 

Looking at this document from an XML standpoint, I could see doing this rather
easily, without having to use regex.  You might look into using DOMDocument and
SimpleXML to complete the task.
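
A minimal sketch of the DOM-based approach suggested above, assuming PHP's
bundled DOM extension is available. The helper name hrefsByLinkText and the
sample markup are illustrative assumptions, not something from the thread.

```php
<?php
// Hypothetical helper (not from the thread): collect the href of every
// <a> element whose trimmed link text equals $linkText.
function hrefsByLinkText(string $html, string $linkText): array
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely well-formed; collect parser warnings
    // internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $hrefs = [];
    foreach ($doc->getElementsByTagName('a') as $a) {
        if (trim($a->textContent) === $linkText && $a->hasAttribute('href')) {
            $hrefs[] = $a->getAttribute('href');
        }
    }
    return $hrefs;
}

// Illustrative sample markup, loosely based on the test cases quoted above.
$html = <<<HTML
<a href="/path/to/url.html">Continue</a>
<a href='/other.html' class="link">Back</a>
<a style="font-size: 12px" href="http://example.com/path/to/url.html"
   onclick="someFunction('foo','bar')">Continue</a>
HTML;

print_r(hrefsByLinkText($html, 'Continue'));
```

Because the parser deals with attribute order and quoting itself, extra
attributes in the tag do not leak into the extracted URL.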

-- 
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php



Re: [PHP] regex pattern for extracting URLs

2009-10-23 Thread Ashley Sheridan
On Fri, 2009-10-23 at 13:23 -0400, Brad Fuller wrote:

> preg_match_all('#<a[\s]+[^>]*href\s*=\s*([\"\']+)([^>]+?)(\1|>)Continue</a>#i',
> $html, $matches);
 


preg_match_all('#<a[\s]+[^>]*href\s*=\s*[\"\']+([^\"\']+?).+?Continue</a>#i',
$html, $matches);

I just changed your regex a bit. Previously, your regex was matching
everything from the first quote after the href= right up until the first >
it found, which would usually be the one that closes the opening tag. You
could make it a bit more intelligent, if you wished, by using a
backreference to make sure the closing quotation character is the same type
as the one that opened the href's value.
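
The backreference idea could look something like the sketch below. The
pattern is an assumption written for illustration, not the one used in the
thread: group 1 captures the opening quote, \1 insists on the same character
to close the value, and group 2 holds the URL. The extra "Back" link in the
sample markup is also an illustrative addition.

```php
<?php
// Sketch only: group 1 captures the opening quote, \1 requires the same
// quote character to close the value, and group 2 holds the URL.
$pattern = '#<a\s[^>]*?href\s*=\s*(["\'])([^"\'>]*)\1[^>]*>\s*Continue\s*</a>#i';

$html = <<<HTML
<a href="/back.html">Back</a>
<a href="/path/to/url.html">Continue</a>
<a href='/path/to/url.html'>Continue</a>
<a href="http://example.com/path/to/url.html" class="link">Continue</a>
<a style="font-size: 12px" href="http://example.com/path/to/url.html"
onclick="someFunction('foo','bar')">Continue</a>
HTML;

preg_match_all($pattern, $html, $matches);

// $matches[2] holds the captured URLs, one per matching link.
print_r($matches[2]);
```

Keeping every part of the pattern from crossing a > means a match cannot
spill from one tag into the next, so the "Back" link is skipped.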

Thanks,
Ash
http://www.ashleysheridan.co.uk




Re: [PHP] regex pattern for extracting URLs

2009-10-23 Thread Brad Fuller
On Fri, Oct 23, 2009 at 1:28 PM, Ashley Sheridan
<a...@ashleysheridan.co.uk> wrote:


> preg_match_all('#<a[\s]+[^>]*href\s*=\s*[\"\']+([^\"\']+?).+?Continue</a>#i',
> $html, $matches);





I appreciate the help.  However, when I try this I only get the first
character of the URL.  Can you double-check it, please?

Thanks again


Re: [PHP] regex pattern for extracting URLs

2009-10-23 Thread Ashley Sheridan
On Fri, 2009-10-23 at 13:45 -0400, Brad Fuller wrote:

> I appreciate the help.  However, when I try this I only get the first
> character of the URL.  Can you double-check it, please?
>
> Thanks again


I think it's probably the first ? in ([^\"\']+?)

Remove that and it should do the trick.
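
To see why that one ? matters, here is a small sketch comparing the lazy
and greedy versions of the capture group. The single-link $html sample is an
assumption for illustration.

```php
<?php
$html = '<a href="/path/to/url.html">Continue</a>';

// Lazy group: +? takes as little as possible, and the following .+?
// is free to consume the rest of the URL, so only one character is captured.
preg_match('#<a[\s]+[^>]*href\s*=\s*[\"\']+([^\"\']+?).+?Continue</a>#i', $html, $lazy);

// Greedy group: + runs all the way up to the closing quote.
preg_match('#<a[\s]+[^>]*href\s*=\s*[\"\']+([^\"\']+).+?Continue</a>#i', $html, $greedy);

echo $lazy[1], "\n";    // prints "/"
echo $greedy[1], "\n";  // prints "/path/to/url.html"
```

One caveat with either version: the .+? before Continue can run across
other tags on the same line, so a non-matching link that sits on the same
line ahead of a Continue link may have its URL captured instead.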

Thanks,
Ash
http://www.ashleysheridan.co.uk




Re: [PHP] regex pattern for extracting URLs

2009-10-23 Thread Israel Ekpo
On Fri, Oct 23, 2009 at 1:48 PM, Ashley Sheridan
<a...@ashleysheridan.co.uk> wrote:

> I think it's probably the first ? in ([^\"\']+?)
>
> Remove that and it should do the trick
>
> Thanks,
> Ash
> http://www.ashleysheridan.co.uk



Hi Brad,

I agree with Jim.

Take a look at this. It might help.

<?php

$xml_string = <<<TEXT_BOUNDARY
<html>
<head>
<title></title>
</head>
<body>
<div>
<a href="http://example.com/path/to/urlA.html">Continue</a>
<a href="http://example.com/path/to/url2.html">Brad Fuller</a>
<a href="http://example.com/path/to/urlB.html">Continue</a>
<a href="http://example.com/path/to/url4.html">PHP.net</a>
<a href="http://example.com/path/to/urlC.html"
class="link">Continue</a>
<a style="font-size: 12px"
href="http://example.com/path/to/urlD.html"
onclick="someFunction('foo','bar')">Continue</a>
</div>
</body>
</html>
TEXT_BOUNDARY;

// Parse the markup as XML; this only works if it is well-formed.
$xml = simplexml_load_string($xml_string);

// Select the href attribute of every <a> whose text is exactly 'Continue'.
$continue_hrefs = $xml->xpath("//a[text() = 'Continue']/@href");

print_r($continue_hrefs);

?>
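
As a possible follow-up, xpath() returns SimpleXMLElement objects rather than
plain strings. The sketch below, using made-up sample markup, casts them to
strings and uses normalize-space() so that link text with stray whitespace
still matches.

```php
<?php
// Illustrative, well-formed sample markup (not from the thread).
$xml = simplexml_load_string(
    '<div><a href="/a.html">Continue</a><a href="/b.html">Back</a>'
    . '<a href="/c.html" class="link"> Continue </a></div>'
);

// normalize-space() trims surrounding whitespace before comparing,
// so " Continue " also matches.
$nodes = $xml->xpath("//a[normalize-space(.) = 'Continue']/@href");

// Cast each SimpleXMLElement attribute node to a plain string.
$urls = array_map('strval', $nodes);

print_r($urls);
```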

-- 
Good Enough is not good enough.
To give anything less than your best is to sacrifice the gift.
Quality First. Measure Twice. Cut Once.


Re: [PHP] regex pattern for extracting URLs

2009-10-23 Thread Brad Fuller
On Fri, Oct 23, 2009 at 1:48 PM, Ashley Sheridan
<a...@ashleysheridan.co.uk> wrote:

> I think it's probably the first ? in ([^\"\']+?)
>
> Remove that and it should do the trick
>
> Thanks,
> Ash
> http://www.ashleysheridan.co.uk



That did the trick.  Thanks Ash you are awesome!

Also, thanks Jim for your suggestion.  I may move to SimpleXML if the project
grows much bigger.  But for now I was looking for a nice one-liner and this
is it.

Cheers,
Brad


Re: [PHP] regex pattern for extracting URLs

2009-10-23 Thread Brad Fuller
On Fri, Oct 23, 2009 at 1:54 PM, Israel Ekpo <israele...@gmail.com> wrote:


> Hi Brad,
>
> I agree with Jim.
>
> Take a look at this. It might help.


Thanks, I'm sure I will use this at some point in the future :)




Re: [PHP] regex pattern for extracting URLs

2009-10-23 Thread Paul M Foster
On Fri, Oct 23, 2009 at 01:54:40PM -0400, Brad Fuller wrote:

> Thanks Ash you are awesome!

Brad, you're violating list rules. We never say that kind of thing to
Ash *where he can hear it*. Only behind his back. ;-}

Paul

-- 
Paul M. Foster




Re: [PHP] regex pattern for extracting URLs

2009-10-23 Thread Ashley Sheridan
On Fri, 2009-10-23 at 15:17 -0400, Paul M Foster wrote:

> On Fri, Oct 23, 2009 at 01:54:40PM -0400, Brad Fuller wrote:
>
> > Thanks Ash you are awesome!
>
> Brad, you're violating list rules. We never say that kind of thing to
> Ash *where he can hear it*. Only behind his back. ;-}
>
> Paul
>
> --
> Paul M. Foster
 


Well, it makes a refreshing change; off-list, people just want to insult
me :p

Thanks,
Ash
http://www.ashleysheridan.co.uk