Re: [PHP] regex pattern for extracting URLs
Brad Fuller wrote:
> I'm looking for a regular expression to accomplish a specific task. I'm hoping someone who's really good at regex patterns can lend a quick hand.
>
> I need a regex pattern that will grab URLs out of HTML that have a certain link text (i.e. the word "Continue").
>
> This is what I have so far, but it does not work properly (if there are other attributes in the <a> tag, it returns them as part of the URL):
>
> preg_match_all('#<a[\s]+[^>]*href\s*=\s*([\"\']+)([^>]+?)(\1|)>Continue</a>#i', $html, $matches);
>
> It needs to be able to extract the URL and disregard arbitrary attributes in the HTML tag.
>
> Test it with the following examples:
>
> <a href="/path/to/url.html">Continue</a>
> <a href='/path/to/url.html'>Continue</a>
> <a href="http://example.com/path/to/url.html" class="link">Continue</a>
> <a style="font-size: 12px" href="http://example.com/path/to/url.html" onclick="someFunction('foo','bar')">Continue</a>
>
> Please reply. Your help is much appreciated.
>
> Thanks in advance,
> Brad F.

Looking at this document from an XML standpoint, I could see doing this rather easily, without having to use regex. You might look into using DOMDocument and SimpleXML to complete the task.

--
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php
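For anyone reading this later, here is a minimal sketch of the DOM-based route suggested above. It assumes PHP's bundled DOM extension is enabled. The function name hrefs_by_link_text() and the extra "Back" link in the sample markup are illustrative only and do not come from the original post.

```php
<?php
// Hypothetical helper (not from the thread): return the href of every <a>
// element whose trimmed text content equals $linkText.
function hrefs_by_link_text(string $html, string $linkText): array
{
    $doc = new DOMDocument();

    // loadHTML() tolerates sloppy markup; collect libxml warnings quietly
    // instead of printing them, then restore the previous setting.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $hrefs = [];
    foreach ($xpath->query('//a[@href]') as $a) {
        if (trim($a->textContent) === $linkText) {
            $hrefs[] = $a->getAttribute('href');
        }
    }
    return $hrefs;
}

// The four test links from the question, plus one link that must be skipped.
$html = <<<'HTML'
<a href="/path/to/url.html">Continue</a>
<a href='/path/to/url.html'>Continue</a>
<a href="http://example.com/path/to/url.html" class="link">Continue</a>
<a style="font-size: 12px" href="http://example.com/path/to/url.html" onclick="someFunction('foo','bar')">Continue</a>
<a href="/other.html">Back</a>
HTML;

print_r(hrefs_by_link_text($html, 'Continue'));
```

Because the parser deals with quoting and attribute order, extra attributes such as class or onclick never leak into the returned URL. That is the main reason to prefer this over a pattern once the markup gets less predictable.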
Re: [PHP] regex pattern for extracting URLs
On Fri, 2009-10-23 at 13:23 -0400, Brad Fuller wrote:
> This is what I have so far but it does not work properly (If there are other attributes in the <a> tag it returns them as part of the URL.)
>
> preg_match_all('#<a[\s]+[^>]*href\s*=\s*([\"\']+)([^>]+?)(\1|)>Continue</a>#i', $html, $matches);

preg_match_all('#<a[\s]+[^>]*href\s*=\s*[\"\']+([^\"\']+?).+?Continue</a>#i', $html, $matches);

I just changed your regex a bit. What your regex was previously doing was matching everything from the first quote after the href= right up until the first > it found, which would usually be the one that closes the opening tag. You could make it a bit more intelligent if you wished with backreferencing to make sure it matches against the same type of quotation character it matched as the start of the href's value.

Thanks,
Ash
http://www.ashleysheridan.co.uk
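The backreference idea mentioned above could look roughly like the sketch below. This is one possible pattern, not the one posted in the thread. The function name continue_urls() is hypothetical, and the pattern assumes the URL itself never contains a literal > character.

```php
<?php
// Hypothetical helper (not from the thread): extract the href of links whose
// text is "Continue", requiring the closing quote to match the opening one.
function continue_urls(string $html): array
{
    // (["\'])   captures the opening quote as group 1
    // ([^>]*?)  captures the URL lazily, never crossing the end of the tag
    // \1        requires the same quote character to close the value
    // [^>]*>    skips any remaining attributes up to the end of the tag
    $pattern = '#<a\s[^>]*?href\s*=\s*(["\'])([^>]*?)\1[^>]*>\s*Continue\s*</a>#i';
    preg_match_all($pattern, $html, $matches);
    return $matches[2];
}

$html = '<a href="/skip.html">Back</a> '
      . '<a style="font-size: 12px" href="http://example.com/path/to/url.html" '
      . 'onclick="someFunction(\'foo\',\'bar\')">Continue</a>';

print_r(continue_urls($html));
```

Using [^>]* rather than .+? between the URL and the link text keeps a match from running across several tags, so a "Back" link that precedes a "Continue" link is not picked up by mistake. Known gaps in this sketch: it would also accept an attribute whose name merely ends in href (for example data-href), and it ignores unquoted attribute values.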
Re: [PHP] regex pattern for extracting URLs
On Fri, Oct 23, 2009 at 1:28 PM, Ashley Sheridan <a...@ashleysheridan.co.uk> wrote:
> preg_match_all('#<a[\s]+[^>]*href\s*=\s*[\"\']+([^\"\']+?).+?Continue</a>#i', $html, $matches);
>
> I just changed your regex a bit.

I appreciate the help. However, when I try this I only get the first character of the URL. Can you double-check it, please?

Thanks again
Re: [PHP] regex pattern for extracting URLs
On Fri, 2009-10-23 at 13:45 -0400, Brad Fuller wrote:
> I appreciate the help. However, when I try this I only get the first character of the URL. Can you double-check it, please?

I think it's probably the first ? in ([^\"\']+?)

Remove that and it should do the trick.

Thanks,
Ash
http://www.ashleysheridan.co.uk
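To see why that one ? matters: a lazy group such as ([^"']+?) is satisfied with a single character, and the .+? that follows then absorbs the rest of the URL. The short comparison below is a sketch using one of the test links from the question. The variable names are illustrative.

```php
<?php
$html = '<a href="/path/to/url.html">Continue</a>';

// Lazy capture: the group stops after one character, .+? swallows the rest.
$lazy = '#<a[\s]+[^>]*href\s*=\s*["\']+([^"\']+?).+?Continue</a>#i';

// Greedy capture: the group runs up to the closing quote.
$greedy = '#<a[\s]+[^>]*href\s*=\s*["\']+([^"\']+).+?Continue</a>#i';

preg_match($lazy, $html, $m1);
preg_match($greedy, $html, $m2);

echo $m1[1], "\n"; // prints "/"
echo $m2[1], "\n"; // prints "/path/to/url.html"
```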
Re: [PHP] regex pattern for extracting URLs
On Fri, Oct 23, 2009 at 1:48 PM, Ashley Sheridan <a...@ashleysheridan.co.uk> wrote:
> I think it's probably the first ? in ([^\"\']+?)
>
> Remove that and it should do the trick.

Hi Brad,

I agree with Jim. Take a look at this. It might help.

<?php

$xml_string = <<<TEXT_BOUNDARY
<html>
<head>
<title></title>
</head>
<body>
<div>
<a href="http://example.com/path/to/urlA.html">Continue</a>
<a href="http://example.com/path/to/url2.html">Brad Fuller</a>
<a href="http://example.com/path/to/urlB.html">Continue</a>
<a href="http://example.com/path/to/url4.html">PHP.net</a>
<a href="http://example.com/path/to/urlC.html" class="link">Continue</a>
<a style="font-size: 12px" href="http://example.com/path/to/urlD.html" onclick="someFunction('foo','bar')">Continue</a>
</div>
</body>
</html>
TEXT_BOUNDARY;

$xml = simplexml_load_string($xml_string);

$continue_hrefs = $xml->xpath("//a[text() = 'Continue']/@href");

print_r($continue_hrefs);

?>

--
Good Enough is not good enough.
To give anything less than your best is to sacrifice the gift.
Quality First. Measure Twice. Cut Once.
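One practical detail when using the SimpleXML approach: xpath() returns SimpleXMLElement objects rather than plain strings, so the values usually need a cast before they can be compared or stored. A small sketch follows, with shortened sample markup that is illustrative rather than taken from the thread. Note also that simplexml_load_string() expects well-formed XML, so this assumes the input is tidy.

```php
<?php
// Illustrative markup (assumption): two "Continue" links and one other link.
$xml = simplexml_load_string(
    '<div>'
    . '<a href="/a.html">Continue</a>'
    . '<a href="/b.html">Back</a>'
    . '<a href="/c.html" class="link">Continue</a>'
    . '</div>'
);

// Each result is a SimpleXMLElement wrapping one href attribute.
$nodes = $xml->xpath("//a[text() = 'Continue']/@href");

// Cast every node to a plain string to get the URL itself.
$urls = array_map(function ($node) {
    return (string) $node;
}, $nodes);

print_r($urls);
```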
Re: [PHP] regex pattern for extracting URLs
On Fri, Oct 23, 2009 at 1:48 PM, Ashley Sheridan <a...@ashleysheridan.co.uk> wrote:
> I think it's probably the first ? in ([^\"\']+?)
>
> Remove that and it should do the trick.

That did the trick. Thanks Ash you are awesome!

Also, thanks Jim for your suggestion. I may move to SimpleXML if the project grows much bigger, but for now I was looking for a nice one-liner and this is it.

Cheers,
Brad
Re: [PHP] regex pattern for extracting URLs
On Fri, Oct 23, 2009 at 1:54 PM, Israel Ekpo <israele...@gmail.com> wrote:
> Hi Brad,
>
> I agree with Jim. Take a look at this. It might help.
>
> $xml = simplexml_load_string($xml_string);
> $continue_hrefs = $xml->xpath("//a[text() = 'Continue']/@href");
> print_r($continue_hrefs);

Thanks, I'm sure I will use this at some point in the future :)
Re: [PHP] regex pattern for extracting URLs
On Fri, Oct 23, 2009 at 01:54:40PM -0400, Brad Fuller wrote:
> Thanks Ash you are awesome!

Brad, you're violating list rules. We never say that kind of thing to Ash *where he can hear it*. Only behind his back. ;-}

Paul

--
Paul M. Foster
Re: [PHP] regex pattern for extracting URLs
On Fri, 2009-10-23 at 15:17 -0400, Paul M Foster wrote:
> Brad, you're violating list rules. We never say that kind of thing to Ash *where he can hear it*. Only behind his back. ;-}

Well, it makes a refreshing change; off list, people just want to insult me :p

Thanks,
Ash
http://www.ashleysheridan.co.uk