Re: Regex to remove all tags from a web page

Eric Chatonet Mon, 31 Oct 2005 05:51:32 -0800

Hi Alex,

Thanks a lot.

That's a good first step, since the output text is about 20 to 30% of the input text :-) HTML tags are stripped, but extra code (PHP, Java, etc.) of course remains.

Any ideas for these ones?


On 31 Oct 05, at 13:00, Alex Tweedly wrote:

Eric Chatonet wrote:
Hi all,
I searched the list archive and the net for a regex that would allow me to retrieve the meaningful text from any web page, stripping all HTML tags, extra code, etc., but I did not find anything really convincing :-(
Any help would be much appreciated :-)
PS. I don't want to use "set the htmlText/get text" with a field: that approach crashes Rev unpredictably during batch processing.
I suspect this will be "not really convincing" :-)

Just removing tags should be
    put  "<[^><]*>" into tRex
    put replacetext(fld "in", tRex, "") into fld "out"
That assumes the HTML has no stray "<" or ">" characters outside of the tags themselves, and is generally well-formed.
That seems too simple - so it can't be convincing :-)
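
For the leftover code mentioned in the follow-up (script, style, PHP), one possible approach is to delete whole blocks, contents included, before applying the tag pattern above. The sketch below is in Python rather than Revolution so it can be run and checked on its own. The block patterns and the name strip_markup are illustrative assumptions, not something from this thread. The patterns use ordinary Perl-style syntax, so they may carry over to replaceText, but that has not been tested here.

```python
import re

# Illustrative patterns (assumptions, not from the original thread).
# Whole blocks are removed first so their contents do not survive
# as "text" once the surrounding tags are stripped.
BLOCK_PATTERNS = [
    r"(?is)<script\b.*?</script\s*>",  # script elements and their contents
    r"(?is)<style\b.*?</style\s*>",    # style elements and their contents
    r"(?s)<!--.*?-->",                 # HTML comments
    r"(?s)<\?.*?\?>",                  # PHP blocks / processing instructions
]

# The tag pattern suggested earlier in this message.
TAG_PATTERN = r"<[^><]*>"


def strip_markup(html: str) -> str:
    """Return html with code blocks, comments, and tags removed."""
    text = html
    for pattern in BLOCK_PATTERNS:
        text = re.sub(pattern, " ", text)
    text = re.sub(TAG_PATTERN, " ", text)
    # Collapse the runs of whitespace left behind by the removals.
    return re.sub(r"\s+", " ", text).strip()


sample = (
    "<html><head><style>p {color: red}</style>"
    "<script>if (a < b) { go(); }</script></head>"
    "<body><!-- note --><p>Hello <b>world</b></p>"
    '<?php echo "x"; ?></body></html>'
)
print(strip_markup(sample))
```

The order matters: if the tag pattern ran first, the script and style contents would be left in the output. This is still a heuristic, with the same well-formedness caveat as above. For example, a PHP string containing "?>" or a script containing the literal text "</script>" would end the match early.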

Alex Tweedly       http://www.tweedly.net


Best Regards from Paris,

Eric Chatonet.
----------------------------------------------------------------
So Smart Software

For institutions, companies and associations
Built-to-order applications: management, multimedia, internet, etc.
Windows, Mac OS and Linux... With the French touch

Free plugins and tutorials on my website
----------------------------------------------------------------
Web site        http://www.sosmartsoftware.com/
Email        [EMAIL PROTECTED]/
Phone        33 (0)1 43 31 77 62
Mobile        33 (0)6 20 74 50 86
----------------------------------------------------------------

_______________________________________________
use-revolution mailing list
[email protected]
Please visit this url to subscribe, unsubscribe and manage your subscription 
preferences:
http://lists.runrev.com/mailman/listinfo/use-revolution
