On 12/22/06 4:23 PM, "David Bovill" <[EMAIL PROTECTED]> wrote:

> On an aside note - despite searching for a very long time the archives I
> cannot find the previous post on how to parse HTML to extract all image
> links or href links... I remember some clever replacing and filtering going
> on... but I forget the sequence...
>
> Anyone have some scripts for extracting all anchors (ie "a name="http:....">)
> or href/image links from htmltext?

This depends on your html page. These are the general rules I start with,
then refine later to reach my goal.

For the href links:

    put sourceTxt into htmlPage
    replace cr with empty in htmlPage  --text is now one line
    replace "href=" with "href=" & cr in htmlPage
    replace "</a" with "</a" & cr in htmlPage
    filter htmlPage with "*http://*"
    set the itemDel to ">"
    repeat for each line LNN in htmlPage
      put item 1 of LNN & cr after newLinkList
    end repeat

For the image links:

    put sourceTxt into htmlPage2  --work on a fresh copy of the source
    replace cr with empty in htmlPage2  --text is now one line
    replace "src=" with cr & "src=" in htmlPage2  --each src= starts a line
    replace ".jpg" with ".jpg" & cr in htmlPage2
    replace ".gif" with ".gif" & cr in htmlPage2
    filter htmlPage2 with "src=*"
    set the itemDel to "="
    repeat for each line LNN in htmlPage2
      put item 2 of LNN & cr after newImgList
    end repeat

You need to refine this to match your html page and your goals.
Not all 'img' tags are in link tags.
Not all links have "http://".
Other variations can occur, so I would recommend doing the following tests:

[1] do these 3 steps, then read the result to see how to proceed

    replace cr with empty in htmlPage  --text is one line
    replace "href=" with "href=" & cr in htmlPage
    replace "</a" with "</a" & cr in htmlPage

[2] do these 4 steps, then read the result to see how to proceed

    replace cr with empty in htmlPage  --text is one line
    replace "href=" with "href=" & cr in htmlPage
    replace ".jpg" with ".jpg" & cr in htmlPage
    replace ".gif" with ".gif" & cr in htmlPage

Look for lines that have multiple hits that did not get separated, etc.
Look for spacing variations such as "href=", "href = ", "href= "

The reason you cannot remove spaces from the whole container is that img src
links could contain folder names with spaces. If you know spaces won't occur
in your links or imgs, then add this line near the top of the code, right
after the text is put into htmlPage...

    replace space with empty in htmlPage

Hope this helps

Jim Ault
Las Vegas
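
PS: for what it's worth, the replace/filter steps above could be wrapped in
one function that handles both href and src. This is only a rough sketch, not
a tested parser. It assumes every attribute value is wrapped in double quotes
and written without spaces around the "=". The names extractAttrValues,
pHtml, pAttr, tValues, tLine and tValue are made up for illustration.

```livecode
function extractAttrValues pHtml, pAttr
  local tValues, tLine, tValue
  replace cr with empty in pHtml  -- text is now one line
  -- put each occurrence of the attribute at the start of its own line
  replace (pAttr & "=") with (cr & pAttr & "=") in pHtml
  -- keep only the lines that start with the attribute
  filter pHtml with (pAttr & "=*")
  set the itemDel to quote
  repeat for each line tLine in pHtml
    -- assumption: the line looks like href="value"..., so the value is item 2
    put item 2 of tLine into tValue
    if tValue is not empty then put tValue & cr after tValues
  end repeat
  delete last char of tValues  -- remove the trailing cr
  return tValues
end function
```

It would be called along these lines:

    put extractAttrValues(sourceTxt, "href") into newLinkList
    put extractAttrValues(sourceTxt, "src") into newImgList

Taking the text between the quotes avoids guessing at ".jpg" or ".gif"
endings, but unquoted or single-quoted values are skipped, and "src" will
also pick up script and frame sources, so the result still needs the same
kind of refining described above.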
