On 1/11/07 5:14 AM, "David Bovill" <[EMAIL PROTECTED]> wrote:
> As I am not much good at regular expressions I thought I would share my
> ignorance with others :) Here is the best I can do with regard to extracting
> links from htmltext in fields - they only work on a single line and they do
> not find links with variable whitespace as you may get in html web pages.
>
> on html_DeconstructNameLink nextHtmlLine, @someText, @someLink
>     -- <a name="/Users/david/Movies/crossingTheBridge.mp4">Crossing The Bridge</a>
>
>     put "<a name=" & quote & "([^>]*)" & quote & ">([^<]*)</a>" into someReg
>     return matchText(nextHtmlLine, someReg, someLink, someText)
> end html_DeconstructNameLink
>
> on html_DeconstructRefLink nextHtmlLine, @someText, @someLink
>     -- <a href="/Users/david/Movies/crossingTheBridge.mp4">Crossing The Bridge</a>
>
>     put "<a href=" & quote & "([^>]*)" & quote & ">([^<]*)</a>" into someReg
>     return matchText(nextHtmlLine, someReg, someLink, someText)
> end html_DeconstructRefLink
>
> Is there a better way?

As much as I like and use RegEx, there are better ways to pull out links, and which one I use depends on the web content you encounter. Some pages are generated by JavaScript, PHP, or another server program and come out very consistent. Others are built from templates and are haphazardly composed.
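
If you do stay with RegEx, the two limits mentioned in the question (single line only, fixed whitespace) can probably be loosened with PCRE options. The handler below is only a sketch: the name html_FindRefLink and its parameter names are made up for illustration, and it assumes the href value is in double quotes. It still returns only the first link in the text it is given.

```livecode
function html_FindRefLink pHtml, @rLink, @rText
   -- (?i) ignores case, (?s) lets "." match across line breaks
   -- \s and \s* allow variable whitespace around the attribute
   -- [^>]*? skips other attributes that come before href
   put "(?is)<a\s[^>]*?href\s*=\s*" & quote & "([^" & quote & "]*)" \
         & quote & "[^>]*>(.*?)</a>" into tReg
   return matchText(pHtml, tReg, rLink, rText)
end html_FindRefLink
```

Calling it might look like this, with field "Page" standing in for wherever your html lives:

```livecode
if html_FindRefLink(the htmlText of field "Page", tLink, tText) then
   put tLink & tab & tText
end if
```
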
One of the starting points I have sent to the list in the past couple of months is this non-RegEx method:

   replace cr with empty in pageText  -- remove all cr's
   replace "<a" with (cr & "<a") in pageText
   replace "</a" with (cr & "</a") in pageText
   filter pageText with "<a*"
   -- now you only have a list of <A> tags
   filter pageText with "*href*"
   -- now you only have the <A> tags with HREF

Assuming you want the "http" links only:

   replace "http:" with (cr & "http:") in pageText
   filter pageText with "http:*"  -- drop the leftover <a href= pieces
   -- now all lines start with "http:"

What may work in your pages is:

   repeat for each line LNN in pageText
      put word 1 of LNN & cr after newList
   end repeat
   delete last char of newList
   -- the gotcha would be spaces in the link

This is a little more robust, since it cuts each link at the closing quote:

   replace "http:" with (cr & "http:") in pageText
   filter pageText with "http:*"  -- drop the leftover <a href= pieces
   set the itemDel to quote
   repeat for each line LNN in pageText
      put item 1 of LNN & cr after newList
   end repeat
   delete last char of newList

Hope this gets you close. There are more examples I have posted in the past, so you might want to search the archives on my name to find those threads.

http://www.mail-archive.com/[email protected]/

Jim Ault
Las Vegas

_______________________________________________
use-revolution mailing list
[email protected]
Please visit this url to subscribe, unsubscribe and manage your subscription preferences:
http://lists.runrev.com/mailman/listinfo/use-revolution
