Re: Stripping away HTML

Wilhelm Sanke Fri, 08 Apr 2005 12:47:50 -0700

On Fri Apr 8, Eric Chatonet eric.chatonet at sosmartsoftware.com wrote:

Hi Gregory,

Put your web page into a field and get the text of the field:

put url "xyz" into fld "MyHiddenField"
put fld "MyHiddenField" into tPlainText

On 7 Apr 05, at 23:07, Gregory Lypny wrote:

> Hello Everyone,
>
> Is there a way in Revolution to strip away the HTML code from a web
> page, leaving just the content in plain text?
>
>     Greg

While this is a convenient and quick way to get a "raw" version of the text of an HTML file, the result in most cases needs further editing. If you extract text from a particular kind of HTML file often or on a regular basis, you should adapt your script to the specific structure of those files.
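
For reference, the basic field round trip can be wrapped in a small function. This is only a sketch: the function name plainTextOfURL is hypothetical, and it assumes a (possibly hidden) field named "MyHiddenField" on the current card. It sets the htmlText of the field, as the scripts below do, so that the tags are interpreted rather than displayed.

```livecode
# Hypothetical helper: returns the plain text of the page at pURL.
# Assumes a field named "MyHiddenField" exists on the current card.
function plainTextOfURL pURL
   local tHTML
   put URL pURL into tHTML
   # the result is not empty if the download failed
   if the result is not empty then return empty
   # Setting the htmlText makes the field interpret the tags...
   set the htmlText of fld "MyHiddenField" to tHTML
   # ...and the text of the field is what remains without them.
   return the text of fld "MyHiddenField"
end plainTextOfURL
```

It could then be called with something like: put plainTextOfURL(tSomeURL) into tPlainText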

Two examples:

1. Extracting text from articles in the online version of the magazine "Education Week" <www.edweek.org>

The script assumes you have three fields, two of which are named "HTMLText" and "Transtext"; field 1 holds the HTML source.

"on mouseUp
 # fld 1 contains the HTML code of an "Education Week" article from <www.edweek.org>
 set the htmltext of fld "HTMLText" to fld 1
 put fld "HTMLText" into fld "Transtext"
 put the htmltext of fld "TransText" into tInterim
 put the number of lines of tInterim into LNumber
 repeat with i = LNumber down to 1
  if line i of tInterim contains " " then delete line i of tInterim
 end repeat
 replace "â" with Quote in tInterim
 replace "â" with Quote in tInterim
 replace "â" with "'" in tInterim
 replace "â" with "'" in tInterim
 set the htmltext of fld "Transtext" to tInterim
end mouseUp"

The "replace" lines restore proper quotation marks and apostrophes.
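
If the stray characters come from curly quotes that were saved as UTF-8 and then read as single-byte text (this is an assumption, though a common cause), the replacements can also be written with byte values instead of pasted characters. In this sketch the function name straightenQuotes is hypothetical, and it is meant to be applied to the raw page data before it goes into a field.

```livecode
# Hypothetical helper: replaces UTF-8 encoded curly quotes, read as
# single-byte characters, with plain quotes and apostrophes.
# Assumption: pText is the raw page data, not yet passed through a field.
function straightenQuotes pText
   local tLead
   # In UTF-8 each of these quotes is three bytes: 226, 128, then one more
   put numToChar(226) & numToChar(128) into tLead
   replace tLead & numToChar(156) with quote in pText  # left double quote, U+201C
   replace tLead & numToChar(157) with quote in pText  # right double quote, U+201D
   replace tLead & numToChar(152) with "'" in pText    # left single quote, U+2018
   replace tLead & numToChar(153) with "'" in pText    # right single quote, U+2019
   return pText
end straightenQuotes
```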

The line "if line i of tInterim contains " " then delete line i of tInterim" serves to remove code from the beginning of the web page. If this line were left out, you would get text like the following at the beginning of your "plain" text:

"var _hbEC=0,_hbE=new Array;function _hbEvent(a,b){b=_hbE[_hbEC++]=new Object();b._N=a;b._C=0;return b;} var hbx=_hbEvent("pv");hbx.vpc="HBX0100u";hbx.gn="ehg-editorialpro.hitbox.com"; //BEGIN EDITABLE SECTION //CONFIGURATION VARIABLES hbx.acct="DM540902PMCA";//ACCOUNT NUMBER(S) hbx.pn="PUT+PAGE+NAME+HERE";//PAGE NAME(S) hbx.mlc="CONTENT+CATEGORY";//MULTI-LEVEL CONTENT CATEGORY hbx.pndef="title";//DEFAULT PAGE NAME hbx.ctdef="full";//DEFAULT CONTENT CATEGORY //OPTIONAL PAGE VARIABLES //ACTION SETTINGS hbx.fv="";//FORM VALIDATION MINIMUM ELEMENTS OR SUBMIT FUNCTION NAME hbx.lt="auto";//LINK TRACKING hbx.dlf="n";//DOWNLOAD FILTER hbx.dft="n";//DOWNLOAD FILE NAMING hbx.elf="n";//EXIT LINK FILTER //SEGMENTS AND FUNNELS hbx.seg="++";//VISITOR SEGMENTATION hbx.fnl="";//FUNNELS //CAMPAIGNS hbx.cmp="";//CAMPAIGN ID hbx.cmpn="";//CAMPAIGN ID IN QUERY hbx.dcmp="";//DYNAMIC CAMPAIGN ID hbx.dcmpn="";//DYNAMIC CAMPAIGN ID IN QUERY..." etc.

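The line-removal step can be kept in a reusable function of its own. A minimal sketch, assuming you know a marker string that only the unwanted lines contain; the function name withoutLinesContaining and its parameters are hypothetical.

```livecode
# Hypothetical helper: removes every line of pText that contains pMarker.
function withoutLinesContaining pText, pMarker
   # Count down so that deleting a line does not shift
   # the lines that still have to be checked.
   repeat with i = the number of lines of pText down to 1
      if line i of pText contains pMarker then delete line i of pText
   end repeat
   return pText
end withoutLinesContaining
```

For the page above, a marker such as "hbx." would catch the tracking code, provided it is split over several lines in the field.
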
2. Extracting the plain text worth searching from the XML files of the Rev "Dictionary"

I used similar routines to store the searchable text portions as arrays in my tool "Searchdocs" (see the latest version at <http://www.sanke.org/Software/SearchDocsXML24-Rev.zip>).

The script assumes you have two fields named "Display" and "Transtext". Because tabs can end up in the plain text during the conversion from XML to text, the line

"replace Tab with CR in tXML"

is helpful for better formatting. Compare the results when you leave this line out.


"on mouseUp
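 # Let the user pick one of the Dictionary's XML files
 answer file "Choose XML file from"&&Quote&"Dictionary"&Quote&&"folder."
 if it is empty then exit mouseUp # the user cancelled the dialog
 put it into Adresse
 put "file:"&Adresse into Fxml
 put URL Fxml into tXML
 # Title: assumes the file has "<name><![CDATA[" (15 chars) ... "]]></name>"
 put offset("<name>",tXML) + 15 into ANam
 put offset("]]></name>",tXML) -1 into ENam
 put char ANam to ENam of tXML into tTitle
 # Syntax: assumes "<syntax><![CDATA[" (17 chars) ... "]]></syntax>"
 put offset("<syntax>",tXML) + 17 into Asyn
 put offset("]]></syntax>",tXML) -1 into Esyn
 put char Asyn to Esyn of tXML into tSyntax
 # Drop the lines between line 1 and the <summary> line, then put the syntax in front
 put lineoffset("<summary>",tXML) into Zeile
 delete line 2 to (Zeile - 1) of tXML
 put tsyntax before tXML
 # Let the field strip the remaining tags
 set the htmltext of fld "Transtext" to tXML
 put the text of fld "Transtext" into tXML
 replace Tab with CR in tXML
 put tTitle&CR&CR before tXML
 put tXML into fld "Display"
 set the textstyle of line 1 of fld "Display" to bold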
end mouseUp"
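
The offset arithmetic used for the title and the syntax could also be put into one function, so that the numbers 15 and 17 need not be counted by hand. This is only a sketch: the function name cdataOf is hypothetical, and it assumes, as the script above does, that the CDATA section follows the opening tag directly.

```livecode
# Hypothetical helper: returns the CDATA content of the first <pTag>
# element in pXML, or empty if the element is not found.
function cdataOf pTag, pXML
   local tOpen, tClose, tStart, tEnd
   put "<" & pTag & "><![CDATA[" into tOpen
   put "]]></" & pTag & ">" into tClose
   put offset(tOpen, pXML) into tStart
   if tStart = 0 then return empty
   # Skip the opening marker (15 chars for "name", 17 for "syntax")
   add the length of tOpen to tStart
   # Search for the closing marker only after the opening one;
   # with a skip value, offset counts from the skipped position
   put offset(tClose, pXML, tStart - 1) into tEnd
   if tEnd = 0 then return empty
   return char tStart to (tStart + tEnd - 2) of pXML
end cdataOf
```

In the handler above this would read: put cdataOf("name", tXML) into tTitle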

Parsing the XML files to achieve a layout like that of the full Dictionary articles shown in the left pane of the "SearchDocs" stack of course requires a different and more complex approach.


Regards,

Wilhelm Sanke
<http://www.sanke.org/MetaMedia>

_______________________________________________
use-revolution mailing list
http://lists.runrev.com/mailman/listinfo/use-revolution
