Hi,

I am having a problem getting this to compile. Do you think you could email it in a stack, or post it somewhere so I can download it?

Thanks a lot
All the Best
Dave

On 3 Nov 2007, at 15:53, Jim Ault wrote:

The 'seriously detailed stripper' was written by Eric. I made some
adjustments for converting a web page to a formatted data set, so some
special lines were added. I did not post the complete version, since
it was a custom solution.
Sorry about the confusion when the subject was simply 'tag stripping'.

Step 1
Your correction is not actually right.
The purpose of this line is to add numtochar(160) before every "<td" tag, so
 replace "<td" with numtochar(160)&"<td" in pHtml
is intended, and not the suggested
 replace "<td" with numtochar(160)&"td>" in pHtml
Later, numtochar(160) will be replaced with a cr. In the full workflow,
numtochar(160) is inserted for many reasons, and in the end stage all of
these are converted to cr to create a table of the core data.
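
As a rough sketch of the marker idea only (not the full custom version): the name markCells is made up for illustration, and the handler assumes numtochar(160) does not already occur in the page text.

```livecode
function markCells pHtml
  -- hypothetical helper, for illustration only
  -- mark the start of every table cell with numtochar(160)
  replace "<td" with numtochar(160) & "<td" in pHtml
  -- strip the tags; the markers are not tags, so they survive
  put replacetext(pHtml, "<[^><]*>", "") into pHtml
  -- end stage: each marker becomes a line break
  replace numtochar(160) with cr in pHtml
  return pHtml
end markCells
```

Called with a row such as "<tr><td>Las Vegas</td><td>Nevada</td></tr>", each cell's text should end up at the start of its own line.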

Step 2
Yes, emailers can morph the tags.
I should have posted the handler between <pre>...</pre> to avoid this.

-----
  replace "&nbsp;" with space in pHtml
  replace "<B"&"R>" with return in pHtml --BR
  replace "<p" &">" with return in pHtml --p tag
  -----
So...
in a web page, white space and returns mean nothing to the browser, except
for the single space. A run of spaces in an html document is shown to the
viewer as a single space, so we spend a few lines in Transcript converting
&nbsp; to a space char, then dealing with what the space characters will
mean as data separators (e.g. a table of values). In this case, I wanted to
convert spaces in part of a web doc to tabs, but other sections of the
document could be discarded, so this worked well for my app.
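
A minimal sketch of that space handling, assuming the text passed in is only the part of the page where every space should act as a column separator. The name spacesToTabs is hypothetical and is not part of the posted handler.

```livecode
function spacesToTabs pText
  -- hypothetical helper, for illustration only
  -- &nbsp; is treated as an ordinary space
  replace "&nbsp;" with space in pText
  -- collapse runs of spaces to one, as a browser would display them
  put replacetext(pText, " +", space) into pText
  -- each remaining space becomes a column separator
  replace space with tab in pText
  return pText
end spacesToTabs
```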

In addition, returns mean nothing to a web browser so they can be replaced
with empty.

Also important for me was the specific order of replacements to extract the
data from a web page.
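
One generic illustration of why the order matters (not specific to the page described here; the name decodeEntities is made up): "&amp;" has to be decoded after the other entities, otherwise text that a browser would show as "&lt;" gets decoded twice and turns into "<".

```livecode
function decodeEntities pText
  -- hypothetical helper, for illustration only
  replace "&lt;" with "<" in pText
  replace "&gt;" with ">" in pText
  -- last on purpose: "&amp;lt;" must end up as "&lt;", not "<"
  replace "&amp;" with "&" in pText
  return pText
end decodeEntities
```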

Hope this clarifies some of the gymnastics I went through for tag stripping
and data mining.

Jim Ault
Las Vegas

On 11/3/07 1:48 AM, "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>
wrote:


This is a seriously detailed stripper, Jim!

Small error in syntax:

replace "<td" with  numtochar(160)&"<td" in pHtml
should be...
replace "<td"  with numtochar(160)&"td>" in pHtml

Also, a couple of lines were posted html2Txt-mangled. Could you clarify:
  -----
replace "&nbsp;" with space in pHtml
replace "
" with return in pHtml
replace "

" with return in pHtml
-----

If you post the handler as plain text, any html formatted text should be
correctly handled by the emailer.


/H

--------------------------------------------------------------------------------
function StripTags pHtml
  local tRegex, tPrevText
  -- entities to convert and their replacements, in the same order
  -- add more chars if you wish, keeping both lists the same length
  constant kHtml = "&eacute;,&agrave;,&ccedil;,&gt;,&lt;,&ecirc;,&egrave;,&copy;,&#149;,&#39;,&middot;,&amp;"
  constant kConvertedHtml = "é,à,ç,>,<,ê,è,©,•,',·,&"
  --using constants means you cannot accidentally
  --    modify these vars and damage the results
  -----
  replace numtochar(13) with empty in pHtml
  replace tab with empty in pHtml
  replace "<td" with numtochar(160)&"<td" in pHtml
  -----
  put replacetext(pHtml,"(?Usi)<SCRIPT.*</SCRIPT>","") into pHtml
  put replacetext(pHtml,"(?Usi)<STYLE.*</STYLE>","") into pHtml --also <style type=...>
  put replacetext(pHtml,"(?Usi)<\?.*\?>","") into pHtml
  -----
  replace "&nbsp;" with space in pHtml
  replace "<B"&"R>" with return in pHtml --BR
  replace "<p" &">" with return in pHtml --p tag
  -----
  put "<[^><]*>" into tRegex
  put replacetext(pHtml,tRegex,"") into pHtml
  put replacetext(pHtml,tRegex,"") into pHtml

  ----- repeat replacements until there are no changes
  repeat until tPrevText is pHtml
    put pHtml into tPrevText
    put replacetext(pHtml," +",space) into pHtml
    put replacetext(pHtml,"^ ","") into pHtml
  end repeat
  -----
  replace (space & return) with return in pHtml
  replace (return & space) with return in pHtml
  filter pHtml without empty
  replace numtochar(160) with empty in pHtml
  -----
  replace "&quot;" with quote in pHtml
  repeat with i = 1 to the number of items of kHtml
    replace item i of kHtml with item i of kConvertedHtml in pHtml
  end repeat
  -----
  --put pHtml into msg  --lets you see the result in the msg box
  return pHtml
end StripTags


Jim Ault
Las Vegas

--------------------------------------------------------------------------------




_______________________________________________
use-revolution mailing list
[email protected]
Please visit this url to subscribe, unsubscribe and manage your subscription
preferences:
http://lists.runrev.com/mailman/listinfo/use-revolution

