On Fri Apr 8, Eric Chatonet eric.chatonet at sosmartsoftware.com wrote:

Hi Gregory,

Set the htmlText of a field to your web page and get the text of the field:

set the htmlText of fld "MyHiddenField" to url "xyz"
put fld "MyHiddenField" into tPlainText

On 7 Apr 05, at 23:07, Gregory Lypny wrote:

> Hello Everyone,
>
> Is there a way in Revolution to strip away the HTML code from a web
> page, leaving just the content in plain text?
>
>     Greg



This is a quick and convenient way to get a "raw" version of the text of an HTML file, which in most cases then needs further editing. If you extract text from a particular kind of HTML file often or on a regular basis, however, you should adapt your script to the specific structure of those files.


Two examples:

1. Extracting text from articles in the online version of the magazine "Education Week" <www.edweek.org>

The script assumes you have three fields: field 1, which holds the HTML source, and two more named "HTMLText" and "Transtext".

"on mouseUp
  # fld 1 contains the HTML code of an "Education Week" article from <www.edweek.org>
  set the htmltext of fld "HTMLText" to fld 1
  put fld "HTMLText" into fld "Transtext"
  put the htmltext of fld "TransText" into tInterim
  # go through the lines from last to first, so that deleting a line
  # does not shift the lines that still have to be checked
  put the number of lines of tInterim into LNumber
  repeat with i = LNumber down to 1
    if line i of tInterim contains "&nbsp;" then delete line i of tInterim
  end repeat
  # garbled multi-byte punctuation: opening and closing typographic quotes,
  # typographic apostrophe, long dash (bytes E2 80 9C, 9D, 99, 94)
  replace "&acirc;&#128;&#156;" with Quote in tInterim
  replace "&acirc;&#128;&#157;" with Quote in tInterim
  replace "&acirc;&#128;&#153;" with "'" in tInterim
  replace "&acirc;&#128;&#148;" with "-" in tInterim
  set the htmltext of fld "Transtext" to tInterim
end mouseUp"


The "replace" lines restore plain quotation marks and apostrophes where the typographic ones show up as garbled character sequences.

The line
"if line i of tInterim contains "&nbsp;" then delete line i of tInterim"
removes code from the beginning of the web page. If this line were left out, you would get text like the following at the beginning of your "plain" text:


"var _hbEC=0,_hbE=new Array;function _hbEvent(a,b){b=_hbE[_hbEC++]=new Object();b._N=a;b._C=0;return b;} var hbx=_hbEvent("pv");hbx.vpc="HBX0100u";hbx.gn="ehg-editorialpro.hitbox.com"; //BEGIN EDITABLE SECTION //CONFIGURATION VARIABLES hbx.acct="DM540902PMCA";//ACCOUNT NUMBER(S) hbx.pn="PUT+PAGE+NAME+HERE";//PAGE NAME(S) hbx.mlc="CONTENT+CATEGORY";//MULTI-LEVEL CONTENT CATEGORY hbx.pndef="title";//DEFAULT PAGE NAME hbx.ctdef="full";//DEFAULT CONTENT CATEGORY //OPTIONAL PAGE VARIABLES //ACTION SETTINGS hbx.fv="";//FORM VALIDATION MINIMUM ELEMENTS OR SUBMIT FUNCTION NAME hbx.lt="auto";//LINK TRACKING hbx.dlf="n";//DOWNLOAD FILTER hbx.dft="n";//DOWNLOAD FILE NAMING hbx.elf="n";//EXIT LINK FILTER //SEGMENTS AND FUNNELS hbx.seg="++";//VISITOR SEGMENTATION hbx.fnl="";//FUNNELS //CAMPAIGNS hbx.cmp="";//CAMPAIGN ID hbx.cmpn="";//CAMPAIGN ID IN QUERY hbx.dcmp="";//DYNAMIC CAMPAIGN ID hbx.dcmpn="";//DYNAMIC CAMPAIGN ID IN QUERY..."
etc.
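
A different way to keep such code out of the result is to cut the script blocks out of the HTML source before it goes into the field. The following is only a sketch and has not been tried on the "Education Week" pages: the function name stripScriptBlocks is hypothetical, and it assumes that the unwanted code sits between "<script" and "</script>" tags. It does not replace the "&nbsp;" test if that test also removes other unwanted lines.

```livecode
# Hypothetical helper: removes every <script ...> ... </script> block from pHTML
function stripScriptBlocks pHTML
  repeat forever
    put offset("<script", pHTML) into tStart
    if tStart = 0 then exit repeat
    # the third parameter skips tStart characters, so tGap is counted
    # from character tStart + 1 onwards
    put offset("</script>", pHTML, tStart) into tGap
    if tGap = 0 then exit repeat
    # "</script>" is 9 characters long, so the block ends at tStart + tGap + 8
    delete char tStart to (tStart + tGap + 8) of pHTML
  end repeat
  return pHTML
end function
```

In the script above it could be used as: set the htmltext of fld "HTMLText" to stripScriptBlocks(fld 1)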


2. Extracting the plain text worth searching from the XML files of the Rev "Dictionary"

I used similar routines to store the searchable text portions as arrays in my tool "SearchDocs" (see the latest version at
<http://www.sanke.org/Software/SearchDocsXML24-Rev.zip>).


The script assumes you have two fields named "Display" and "Transtext". Because tabs can end up in the plain text during the conversion from XML to text, the line

"replace Tab with CR in tXML"

helps to format the result better. Leave this line out to see the difference.


"on mouseUp
  answer file "Choose XML file from"&&Quote&"Dictionary"&Quote&&"folder."
  put it into Adresse
  put "file:"&Adresse into Fxml
  put URL Fxml into tXML
  # title: + 15 = length of "<name>" (6) plus, presumably, "<![CDATA[" (9)
  put offset("<name>",tXML) + 15 into ANam
  put offset("]]></name>",tXML) -1 into ENam
  put char ANam to ENam of tXML into tTitle
  # syntax: + 17 = length of "<syntax>" (8) plus, presumably, "<![CDATA[" (9)
  put offset("<syntax>",tXML) + 17 into Asyn
  put offset("]]></syntax>",tXML) -1 into Esyn
  put char Asyn to Esyn of tXML into tSyntax
  # drop everything between line 1 and the line containing <summary>
  put lineoffset("<summary>",tXML) into Zeile
  delete line 2 to (Zeile - 1) of tXML
  put tsyntax before tXML
  # let the field turn the markup into plain text
  set the htmltext of fld "Transtext" to tXML
  put the text of fld "Transtext" into tXML
  replace Tab with CR in tXML
  put tTitle&CR&CR before tXML
  put tXML into fld "Display"
  set the textstyle of line 1 of fld "Display" to bold
end mouseUp"

Parsing the XML files to achieve a layout similar to the display of the full Dictionary articles in the left pane of stack "SearchDocs" needs, of course, a different and more complex approach.
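
The offset arithmetic used for the title and the syntax in the Dictionary script can be wrapped in a small helper, sketched here. The function name textBetween and its parameters are hypothetical, and the usage line in the comment assumes that "<name>" is followed directly by "<![CDATA[", as the "+ 15" in the script suggests.

```livecode
# Hypothetical helper: returns the text between pStartMarker and pEndMarker,
# or empty if one of the markers is not found.
# Possible use in the script:
#   put textBetween(tXML, "<name><![CDATA[", "]]></name>") into tTitle
function textBetween pSource, pStartMarker, pEndMarker
  put offset(pStartMarker, pSource) into tStart
  if tStart = 0 then return empty
  # first character after the start marker
  put tStart + length(pStartMarker) into tFrom
  # skip everything up to and including the start marker; the result is
  # counted from the first character after the skipped part
  put offset(pEndMarker, pSource, tFrom - 1) into tGap
  if tGap = 0 then return empty
  return char tFrom to (tFrom + tGap - 2) of pSource
end function
```

Searching for the end marker only after the start marker avoids a wrong match if the same closing sequence happens to occur earlier in the file.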


Regards,

Wilhelm Sanke
<http://www.sanke.org/MetaMedia>

_______________________________________________
use-revolution mailing list
http://lists.runrev.com/mailman/listinfo/use-revolution
