Re: Regex to remove all tags from a web page

Alex Tweedly Mon, 31 Oct 2005 04:00:21 -0800

Eric Chatonet wrote:

Hi all,
I searched the list archive and the net for a regex that would allow to retrieve the meaningful text from any web page, stripping all html tags, extra code, etc. but I did not find something really convincing :-(
Any help would be much appreciated :-)
PS. I don't want to use "set the htmlText/get text" using a field: this way crashes Rev unpredictably when doing batch processing.

I suspect this will be "not really convincing" :-)

Just removing tags should be:

    put "<[^><]*>" into tRex
    put replaceText(fld "in", tRex, "") into fld "out"


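That assumes the html has no stray "<" or ">" outside of its tags (for example in the text itself, in attribute values, or in scripts and comments), and is generally well-formed.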
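
To make that assumption concrete, here is a minimal sketch of the same pattern wrapped in a function and run on two small inputs. The handler name stripTags, the variables tClean and tBroken, and the sample strings are made up for illustration; only replaceText, the pattern, and fld "out" come from the lines above.

```livecode
-- Hypothetical helper: stripTags is an illustrative name, not from the original post
function stripTags pHtml
   -- Match "<", then any run of characters that are neither "<" nor ">", then ">"
   return replaceText(pHtml, "<[^><]*>", empty)
end stripTags

on mouseUp
   -- Well-formed markup: only the four tags are removed, leaving "Hello world"
   put stripTags("<p>Hello <b>world</b></p>") into tClean
   -- Stray brackets in plain text: "< 2 and 3 >" is matched as if it were a tag,
   -- so the text between the brackets is lost along with them
   put stripTags("1 < 2 and 3 > 2") into tBroken
   put tClean & return & tBroken into fld "out"
end mouseUp
```

The second call shows why the pattern is only safe when every "<" and ">" really does delimit a tag: anything sitting between a stray pair of brackets is deleted as well.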

That seems too simple - so it can't be convincing :-)


--
Alex Tweedly       http://www.tweedly.net



