In that book I wrote there is a chapter on making a web scraper, something that could pull images and other media from a web page. I soon found all the articles talking about not using regex with HTML, so I used a mixture of techniques instead. Here’s the first part I wrote about it:
“A common approach when extracting a known pattern of text is to use regular expressions, often referred to as regex or regexp. At its simplest it's easy to understand, but it can get quite complex. Read the Wikipedia article if you want to understand it in depth:

http://en.wikipedia.org/wiki/Regular_expression

Another useful source of information is this Packt article on regular expressions:

http://www.packtpub.com/article/regular-expressions-python-26-text-processing

One problem though is that using regexp to parse HTML content is frowned upon. There are scores of articles online telling you outright not to parse HTML with regexp! Here's one pithy example:

http://boingboing.net/2011/11/24/why-you-shouldnt-parse-html.html

Now, parsing HTML source is exactly what we want to do here, and one solution to the problem is to mix and match, using LiveCode's other text matching and filtering abilities to do most of the work. Although it's not exactly regexp, LiveCode can use regular expressions in some of its matching and filtering functions, and they are somewhat easier to understand than full-blown regexp. So, let's begin by using those …”

A few pages later I do use some regex to pull text from the page:

function getText pPageSource
   put replaceText(pPageSource,"(?:<(?P<tag>script|style)[\s\S]*?</(?P=tag)>)|(?:<!--[\s\S]*?-->)|(?:<[\s\S]*?>)","") into pPageSource
   replace lf with "" in pPageSource
   replace tab with " " in pPageSource
   return pPageSource
end getText
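
For anyone who wants to try that pattern outside LiveCode, here is a rough Python port of getText. It is only a sketch, not code from the book. Python's re module accepts the same named-group and backreference syntax, so the pattern is copied unchanged. The names TAG_PATTERN and get_text are made up for this example. The three alternatives strip whole script/style blocks first, then comments, then any remaining tag.

```python
import re

# Same pattern as the LiveCode getText function, split over three lines:
#   1. a <script> or <style> element together with its contents
#   2. an HTML comment
#   3. any other single tag
TAG_PATTERN = re.compile(
    r"(?:<(?P<tag>script|style)[\s\S]*?</(?P=tag)>)"
    r"|(?:<!--[\s\S]*?-->)"
    r"|(?:<[\s\S]*?>)"
)


def get_text(page_source):
    """Strip tags, comments and script/style blocks (hypothetical port)."""
    text = TAG_PATTERN.sub("", page_source)
    text = text.replace("\n", "")   # mirrors: replace lf with ""
    text = text.replace("\t", " ")  # mirrors: replace tab with " "
    return text


sample = "<p>Hello</p><script>var x = 1;</script><!-- note -->\tWorld\n"
print(get_text(sample))  # prints "Hello World"
```

Like the original, this is fine for a quick text grab but will trip over things like a ">" inside an attribute value, which is the usual argument against regex for HTML.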