Eric Chatonet wrote:
>
> I searched the list archive and the net for a regex that would allow
> to retrieve the meaningful text from any web page, stripping all html
> tags, extra code, etc. but I did not find something really convincing
> :-(
> Any help would be much appreciated :-)

I have cast a few 'data mining' scripts with regex, but I tailor them to do more than just remove tags. Are you also trying to format strings (text, paragraphs) or data (tables, labeled values)? Specifics are important. For example, a page of accounting data that had been working great for 3.5 months now has a glitch because the authors changed the web page format.

tip: Check whether </HTML> is in the text, which means that the download was complete, whenever it occurred.

tip: Convert all returns to "MMMM" so that there is only one line between ^ and $ (since returns mean nothing in html, why deal with empties and multiple empties?)

One step you should try to incorporate is a 'back check': does the result have enough/too many characters, does it contain "<" or ">", are key words present/absent?

tip: Replacing some tags with a tab char means that you can copy/paste the block into a spreadsheet and see where the columns are and what excess needs to be trimmed.

Send a page or two my way and I will see if something I have conjured will work for you. I'll just toss it in my cauldron and see what bubbles to the top. </Halloween ref>

Jim Ault
Las Vegas
_______________________________________________
use-revolution mailing list
[email protected]
Please visit this url to subscribe, unsubscribe and manage your subscription preferences:
http://lists.runrev.com/mailman/listinfo/use-revolution
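
For anyone reading this in the archive, the tips above (completeness check, flattening returns, tab substitution, back check) could be sketched roughly as follows. This is Python rather than Revolution script, and it is only an illustration under assumptions, not the scripts mentioned in the post: the function names, the choice of </td> and </th> as the tags that become tabs, the row break on </tr>, and the length limits are all made up for the example.

```python
import re

# Placeholder from the post; assumes "MMMM" never occurs in the page itself.
RETURN_MARKER = "MMMM"


def download_complete(html):
    """True if the closing </HTML> tag is present, in any letter case."""
    return re.search(r"</html\s*>", html, re.IGNORECASE) is not None


def flatten_returns(html, marker=RETURN_MARKER):
    """Replace every line break with the marker so the page becomes one line."""
    return re.sub(r"\r\n|\r|\n", marker, html)


def tags_to_text(flat, marker=RETURN_MARKER):
    """Turn cell tags into tabs, drop all other tags, then drop the markers."""
    # Assumption: closing cell tags mark column breaks, closing row tags mark rows.
    text = re.sub(r"</t[dh]\s*>", "\t", flat, flags=re.IGNORECASE)
    text = re.sub(r"</tr\s*>", "\n", text, flags=re.IGNORECASE)
    text = re.sub(r"<[^>]*>", "", text)  # every remaining tag
    return text.replace(marker, " ")


def back_check(text, min_len, max_len, required=(), forbidden=()):
    """Return a list of problems in the result; an empty list means it passed."""
    problems = []
    if not min_len <= len(text) <= max_len:
        problems.append("length %d outside %d..%d" % (len(text), min_len, max_len))
    if "<" in text or ">" in text:
        problems.append("leftover angle bracket")
    problems += ["missing key word: " + w for w in required if w not in text]
    problems += ["unexpected key word: " + w for w in forbidden if w in text]
    return problems


if __name__ == "__main__":
    page = "<HTML><body>\n<table><tr><td>Cash</td>\n<td>100</td></tr></table>\n</body></HTML>"
    if download_complete(page):
        result = tags_to_text(flatten_returns(page))
        print(repr(result))
        print(back_check(result, 5, 100, required=["Cash"]))
```

The back check is what catches the kind of glitch described above: when a page's format changes, a missing key word or a stray "<" shows up in the returned list rather than passing silently into the data.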
