In that book I wrote there is a chapter on making a web scraper, something that 
could pull images and other media from a web page. I soon found all the 
articles talking about not using regex with HTML, so I used a mixture of 
techniques instead. Here’s the first part I wrote about it:

“A common approach when extracting a known pattern of text is to use regular 
expressions, often referred to as regex or regexp. At its simplest it's easy to 
understand, but it can get quite complex. Read the Wikipedia article if you 
want to understand it in depth:

http://en.wikipedia.org/wiki/Regular_expression

Another useful source of information is this Packt article on regular 
expressions:

http://www.packtpub.com/article/regular-expressions-python-26-text-processing

One problem though is that using regexp to parse HTML content is frowned upon. 
There are scores of articles online telling you outright not to parse HTML with 
regexp! Here's one pithy example:

http://boingboing.net/2011/11/24/why-you-shouldnt-parse-html.html

Now, parsing HTML source is exactly what we want to do here, and one solution 
to the problem is to mix and match, using LiveCode's other text matching and 
filtering abilities to do most of the work. Although it's not exactly regexp, 
LiveCode can use regular expressions in some of its matching and filtering 
functions, and they are somewhat easier to understand than full-blown regexp. 
So, let's begin by using those …”

A few pages later I do use some regex to pull text from the page:

function getText pPageSource
   -- Strip script/style blocks, HTML comments, then any remaining tags
   put replaceText(pPageSource,"(?:<(?P<tag>script|style)[\s\S]*?</(?P=tag)>)|(?:<!--[\s\S]*?-->)|(?:<[\s\S]*?>)","") into pPageSource
   -- Join the lines together and turn tabs into spaces
   replace lf with "" in pPageSource
   replace tab with " " in pPageSource
   return pPageSource
end getText
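
To try the function out, something like the following could go in a button script alongside getText. This is only a sketch: the mouseUp handler, the variable tSource and the sample HTML string are made up for illustration and are not from the book.

```livecode
on mouseUp
   local tSource
   -- Hypothetical sample page: a paragraph, a script block and an HTML comment
   put "<p>Hello <b>world</b></p><script>var x = 1;</script><!-- note -->" into tSource
   -- The script block, the comment and the tags are all removed,
   -- so the dialog should show: Hello world
   answer getText(tSource)
end mouseUp
```

The three alternatives in the pattern are tried in order at each position, so script and style blocks are removed together with their contents, comments go next, and any tag left over is stripped last.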