The 'seriously detailed stripper' was written by Eric. I made some
adjustments for converting a web page to a formatted data set, which
is why some special lines were added. I did not post the complete
version, since it was a custom solution.
Sorry about the confusion when the subject was simply 'tag stripping'.
Step 1
Your correction is not actually the right way.
The function of this line is to add numtochar(160) before every "<td"
tag, so the suggested change
replace "<td" with numtochar(160)&"<td" in pHtml
should be...
replace "<td" with numtochar(160)&"td>" in pHtml
does not apply. The original
replace "<td" with numtochar(160)&"<td" in pHtml
is intended. Later, numtochar(160) will be replaced with a cr. In the
full workflow, numtochar(160) is inserted for many reasons, and in the
end stage all of these are converted to cr to create a table of the
core data.
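A minimal sketch of the marker idea, cut down from the full workflow.
The handler name cellsToLines is only for illustration, and it assumes
the page does not already contain char 160.

```livecode
function cellsToLines pHtml
  -- hypothetical helper, not part of the posted handler
  local tMarker
  put numtochar(160) into tMarker
  -- 1. drop a marker in front of every table cell tag
  replace "<td" with tMarker & "<td" in pHtml
  -- 2. strip the tags; the markers survive because they sit outside them
  put replacetext(pHtml,"<[^><]*>","") into pHtml
  -- 3. end stage: every marker becomes a line break
  replace tMarker with cr in pHtml
  return pHtml
end cellsToLines
```

Called with "<tr><td>A</td><td>B</td></tr>", this should put a cr in
front of each cell's text, so A and B end up on their own lines.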
Step 2
Yes, emailers can morph the tags.
I should have posted between <pre>...</pre> to avoid this.
-----
replace "&nbsp;" with space in pHtml
replace "<B"&"R>" with return in pHtml --BR
replace "<p" &">" with return in pHtml --p tag
-----
So, in a web page, white space and returns mean nothing to the
browser, except for the single space. A run of spaces in an HTML
document is rendered as a single space for the viewer, so we spend a
few lines in Transcript converting &nbsp; to a space char, then
dealing with what the space characters will mean as data separators
(e.g. a table of values). In this case, I wanted to convert spaces in
part of a web doc to tabs, but other sections of the document could be
discarded, so this worked well for my app.
In addition, returns mean nothing to a web browser, so they can be
replaced with empty.
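As a rough sketch of that whitespace handling, assuming the entity
involved is &nbsp; and that every remaining space really is a column
separator (the handler name spacesToTabs is made up for the example):

```livecode
function spacesToTabs pHtml
  -- hypothetical helper, not part of the posted handler
  -- returns carry no meaning for the browser, so drop them
  replace return with empty in pHtml
  -- treat the non-breaking-space entity as an ordinary space
  replace "&nbsp;" with space in pHtml
  -- collapse each run of spaces to a single space
  put replacetext(pHtml," +",space) into pHtml
  -- what is left separates values, so turn it into a tab
  replace space with tab in pHtml
  return pHtml
end spacesToTabs
```

This only makes sense on the part of the page that holds the data;
ordinary prose would have every word split into its own column.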
Also important for me was the specific order of replacements to
extract the data from a web page.
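One small case where the order matters, as a sketch (the handler name
stripThenDecode is made up for the example): entities such as &lt;
have to be decoded after the tags are stripped, otherwise the decoded
text looks like a tag and is removed along with the real ones.

```livecode
function stripThenDecode pHtml
  -- hypothetical helper, not part of the posted handler
  -- strip real tags first, while "&lt;" and "&gt;" are still entities
  put replacetext(pHtml,"<[^><]*>","") into pHtml
  -- now it is safe to turn the entities into literal characters
  replace "&lt;" with "<" in pHtml
  replace "&gt;" with ">" in pHtml
  -- "&amp;" goes last so "&amp;lt;" is not decoded twice
  replace "&amp;" with "&" in pHtml
  return pHtml
end stripThenDecode
```

With "<p>if a &lt;b&gt; c</p>" this should return "if a <b> c". Doing
the decoding first would turn "&lt;b&gt;" into a literal "<b>" and the
tag regex would then strip it out of the data.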
Hope this clarifies some of the gymnastics I went through for tag
stripping and data mining.
Jim Ault
Las Vegas
On 11/3/07 1:48 AM, "[EMAIL PROTECTED]" <[EMAIL PROTECTED]> wrote:
This is a seriously detailed stripper, Jim!
Small error in syntax:
replace "<td" with numtochar(160)&"<td" in pHtml
should be...
replace "<td" with numtochar(160)&"td>" in pHtml
Also, a couple of lines were posted html2Txt-mangled. Could you
clarify:
-----
replace " " with space in pHtml
replace "
" with return in pHtml
replace "
" with return in pHtml
-----
If you post the handler as plain text, any html formatted text should
be correctly handled by the emailer.
/H
--------------------------------------------------------------------------------
function StripTags pHtml
  local tRegex,tPrevText
  -- entity names, one per item; "&#39;" for the apostrophe is an assumption
  -- add more entities if you wish, keeping both lists the same length
  constant kHtml = "&eacute;,&agrave;,&ccedil;,&gt;,&lt;,&ecirc;,&egrave;,&copy;,&bull;,&#39;,&middot;,&amp;"
  constant kConvertedHtml = "é,à,ç,>,<,ê,è,©,•,',·,&"
  --using constants means you cannot accidentally
  -- modify these vars and damage the results
  -----
  replace numtochar(13) with empty in pHtml
  replace tab with empty in pHtml
  -- marker in front of every table cell, see the end of the handler
  replace "<td" with numtochar(160)&"<td" in pHtml
  -----
  -- remove whole blocks whose content is not page text
  put replacetext(pHtml,"(?Usi)<SCRIPT.*</SCRIPT>","") into pHtml
  put replacetext(pHtml,"(?Usi)<STYLE.*</STYLE>","") into pHtml
  put replacetext(pHtml,"(?Usi)<\?.*\?>","") into pHtml
  -----
  replace "&nbsp;" with space in pHtml
  replace "<BR>" with return in pHtml --BR
  replace "<p>" with return in pHtml --p tag
  -----
  put "<[^><]*>" into tRegex
  put replacetext(pHtml,tRegex,"") into pHtml
  put replacetext(pHtml,tRegex,"") into pHtml
  ----- repeat replacements until there are no changes
  repeat until tPrevText is pHtml
    put pHtml into tPrevText
    put replacetext(pHtml," +",space) into pHtml
    put replacetext(pHtml,"^ ","") into pHtml
  end repeat
  -----
  replace (space & return) with return in pHtml
  replace (return & space) with return in pHtml
  filter pHtml without empty
  replace numtochar(160) with empty in pHtml
  -----
  replace "&quot;" with quote in pHtml
  -- decode entities last, after all tags are gone
  repeat with i = 1 to the number of items of kHtml
    replace item i of kHtml with item i of kConvertedHtml in pHtml
  end repeat
  -----
  --put pHtml into msg --lets you see the result in the msg box
  return pHtml
end StripTags
Jim Ault
Las Vegas
--------------------------------------------------------------------------------
_______________________________________________
use-revolution mailing list
[email protected]
Please visit this url to subscribe, unsubscribe and manage your
subscription
preferences:
http://lists.runrev.com/mailman/listinfo/use-revolution