Have a look at preg_replace in the manual. You could also get your users to
save as "filtered HTML", which gets rid of some of it. There's also a
Microsoft tool, Office HTML Filter 2, that will clean it up some more; it
says it's for Word 2000 but it works fine for Word 2002/XP.
But your best option is to use preg_replace to swap out all the smart tags etc.
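As a rough sketch of the preg_replace idea (not a tested cleaner for every
Word version): the function name clean_word_html and the four patterns below
are my own assumptions, chosen to match the kind of markup in the sample
further down. They run on the whole file as one string, so tags that span
several lines are still matched.

```php
<?php
// Hypothetical helper: the name and the patterns are illustrative
// assumptions, not an official or complete list of Word markup.
function clean_word_html(string $html): string
{
    $patterns = [
        // Conditional comments: <!--[if gte mso 9]> ... <![endif]-->
        '/<!--\[if[^\]]*\]>.*?<!\[endif\]-->/is',
        // Whole <style>, <script> and <xml> blocks, contents included
        '/<(style|script|xml)\b[^>]*>.*?<\/\1>/is',
        // Namespaced tags such as <o:p>, </o:p>, <w:...>, <st1:...>
        '/<\/?[a-z][a-z0-9]*:[^>]*>/i',
        // Any ordinary comments that are left
        '/<!--.*?-->/s',
    ];

    // The "s" modifier lets "." match newlines, so multi-line tags are caught.
    $cleaned = preg_replace($patterns, '', $html);

    // preg_replace() returns null on a regex error; fall back to the input.
    return $cleaned ?? $html;
}
```

Calling strip_tags() on the result of clean_word_html(file_get_contents($path))
should then leave only the body text, with $path standing in for wherever the
upload was saved.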
Paul Roberts
http://www.paul-roberts.com
[EMAIL PROTECTED]
----- Original Message -----
From: DL Neil [EMAIL PROTECTED]
To: [EMAIL PROTECTED]; [EMAIL PROTECTED]
Sent: Saturday, August 31, 2002 12:02 AM
Subject: Re: [PHP] Ridding myself of HTML tags
Liam,
If you were to use stristr() to remove everything up to and including the
</head> tag, would that take care of things?
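A minimal sketch of that idea, assuming the whole upload has been read into
one string rather than processed line by line (drop_head is a made-up name
for illustration):

```php
<?php
// Hypothetical helper: returns everything after the first </head> tag,
// or the input unchanged when no </head> is found.
function drop_head(string $html): string
{
    // stristr() is case-insensitive and returns the string starting
    // at the first match, or false if the needle is absent.
    $rest = stristr($html, '</head>');
    if ($rest === false) {
        return $html;
    }
    // Skip past the </head> tag itself.
    return substr($rest, strlen('</head>'));
}
```

Running strip_tags() on the returned string then only has the body markup
to deal with.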
=dn
I've got a little problem with HTML tags. Here's the description.
My site accepts HTML files by upload. A lot of these files are written in
MS Word and then saved as HTML from there. MS Word likes to put a bunch of
garbage at the beginning of the file. When users upload their HTML files,
my script runs strip_tags() to remove the unnecessary junk, but it can't
get rid of all the junk (HTML, XML, CSS, JavaScript) at the beginning of
the file. Some of these tags span multiple lines, and my script goes
through the file line by line, so it doesn't identify them as tags. Is
there a simpler way? I don't need the style sheet information, because I
have a style sheet that will take care of styling the files the way they
should be. I don't want the extra tags, even though they're invisible to
users viewing the files on the web, because these files are also e-mailed
(for HTML mail, it's fine; for text mail, I need to strip it down, and
that's the problem).
=
Just in case, I've included the HTML code below:
<html xmlns:o="urn:schemas-microsoft-com:office:office"
xmlns:w="urn:schemas-microsoft-com:office:word"
xmlns="http://www.w3.org/TR/REC-html40">
<head>
<meta http-equiv=Content-Type content="text/html; charset=windows-1252">
<meta name=ProgId content=Word.Document>
<meta name=Generator content="Microsoft Word 10">
<meta name=Originator content="Microsoft Word 10">
<link rel=File-List href="NW100_files/filelist.xml">
<title>Test test test</title>
<!--[if gte mso 9]><xml>
<o:DocumentProperties>
<o:Author>Liam Gibbs</o:Author>
<o:LastAuthor>Liam Gibbs</o:LastAuthor>
<o:Revision>1</o:Revision>
<o:TotalTime>1</o:TotalTime>
<o:Created>2002-08-30T18:09:00Z</o:Created>
<o:LastSaved>2002-08-30T18:10:00Z</o:LastSaved>
<o:Pages>1</o:Pages>
<o:Words>13</o:Words>
<o:Characters>79</o:Characters>
<o:Company>SXIA</o:Company>
<o:Lines>1</o:Lines>
<o:Paragraphs>1</o:Paragraphs>
<o:CharactersWithSpaces>91</o:CharactersWithSpaces>
<o:Version>10.3501</o:Version>
</o:DocumentProperties>
</xml><![endif]--><!--[if gte mso 9]><xml>
<w:WordDocument>
<w:SpellingState>Clean</w:SpellingState>
<w:GrammarState>Clean</w:GrammarState>
<w:Compatibility>
<w:BreakWrappedTables/>
<w:SnapToGridInCell/>
<w:WrapTextWithPunct/>
<w:UseAsianBreakRules/>
</w:Compatibility>
<w:BrowserLevel>MicrosoftInternetExplorer4</w:BrowserLevel>
</w:WordDocument>
</xml><![endif]-->
<style>
<!--
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
{mso-style-parent:"";
margin:0cm;
margin-bottom:.0001pt;
mso-pagination:widow-orphan;
font-size:12.0pt;
font-family:"Times New Roman";
mso-fareast-font-family:"Times New Roman";}
span.SpellE
{mso-style-name:"";
mso-spl-e:yes;}
@page Section1
{size:612.0pt 792.0pt;
margin:72.0pt 90.0pt 72.0pt 90.0pt;
mso-header-margin:35.4pt;
mso-footer-margin:35.4pt;
mso-paper-source:0;}
div.Section1
{page:Section1;}
-->
</style>
<!--[if gte mso 10]>
<style>
/* Style Definitions */
table.MsoNormalTable
{mso-style-name:"Table Normal";
mso-tstyle-rowband-size:0;
mso-tstyle-colband-size:0;
mso-style-noshow:yes;
mso-style-parent:"";
mso-padding-alt:0cm 5.4pt 0cm 5.4pt;
mso-para-margin:0cm;
mso-para-margin-bottom:.0001pt;
mso-pagination:widow-orphan;
font-size:10.0pt;
font-family:"Times New Roman";}
</style>
<![endif]-->
</head>
<body lang=EN-US style='tab-interval:36.0pt'>
<div class=Section1>
<p class=MsoNormal>Test <span class=SpellE>test</span> <span
class=SpellE>test</span></p>
<p class=MsoNormal align=center style='text-align:center'><span
class=SpellE>Fdjfkasdjfkla</span></p>
<p class=MsoNormal align=center style='text-align:center'><span
class=SpellE><b
style='mso-bidi-font-weight:normal'>Fdjkslafjdklaf</b></span></p>
<p class=MsoNormal style='text-align:justify'><o:p>&nbsp;</o:p></p>
<p class=MsoNormal style='text-align:justify'><span
class=SpellE>Fdasfdfasffasdfdaadfdfs</span></p>
<p class=MsoNormal style='text-align:justify'><span
class=SpellE>Dfsdfs</span></p>
<p class=MsoNormal style='text-align:justify'>Hi</p>
<p class=MsoNormal style='text-align:justify'><o:p>&nbsp;</o:p></p>
<p class=MsoNormal style='text-align:justify'><span
style='mso