Re: [PHP] Ridding myself of HTML tags

2002-08-31 Thread Paul Roberts

Have a look at preg_replace in the manual. You could also get your users to
save as "filtered HTML", which gets rid of some of it. There's also a Microsoft
tool, Office HTML Filter 2, that will clean it up some more; it says it's for
Word 2000, but it works fine for Word 2002/XP.

But your best option is to use preg_replace to swap out all the smart tags etc.
Paul Roberts
http://www.paul-roberts.com
[EMAIL PROTECTED]
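
A rough sketch of the preg_replace approach suggested above, assuming PHP 7 or later. The function name strip_word_markup() and the pattern list are illustrative assumptions, not something given in the thread; the patterns only target the constructs visible in the Word sample quoted in this thread (conditional comments, style and xml blocks, o:/w: tags) and may well miss others.

```php
<?php
// Hypothetical helper: the name and the pattern list are illustrative only.
function strip_word_markup(string $html): string
{
    $patterns = [
        // Conditional comments such as <!--[if gte mso 9]> ... <![endif]-->
        '/<!--\[if[^\]]*\]>.*?<!\[endif\]-->/is',
        // <style> and <xml> blocks, including their contents
        '/<style\b[^>]*>.*?<\/style>/is',
        '/<xml\b[^>]*>.*?<\/xml>/is',
        // Office-namespaced tags such as <o:p> and </o:p> (contents are kept)
        '/<\/?[ovw]:[^>]*>/i',
        // Any remaining ordinary HTML comments
        '/<!--.*?-->/s',
    ];

    // The patterns are applied in order; the "s" modifier lets "." match
    // newlines, so constructs spanning several lines are matched as a whole.
    $clean = preg_replace($patterns, '', $html);

    // preg_replace() returns null on a PCRE error; fall back to the input.
    return $clean ?? $html;
}
```

This has to run on the whole file contents as one string, not line by line, otherwise the multi-line blocks are never matched.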



- Original Message - 
From: DL Neil [EMAIL PROTECTED]
To: [EMAIL PROTECTED]; [EMAIL PROTECTED]
Sent: Saturday, August 31, 2002 12:02 AM
Subject: Re: [PHP] Ridding myself of HTML tags


 Liam,
 If you were to stristr()/remove everything up to and including the </head>
 tag, would that take care of things?
 =dn
 
 

Re: [PHP] Ridding myself of HTML tags

2002-08-30 Thread DL Neil

Liam,
If you were to stristr()/remove everything up to and including the </head>
tag, would that take care of things?
=dn
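
A minimal sketch of that idea, assuming PHP 7 or later and that the whole file has already been read into one string. The helper name drop_head() is hypothetical and only used for illustration.

```php
<?php
// Hypothetical helper: returns everything after the first </head> tag,
// or the input unchanged when no </head> tag is found.
function drop_head(string $html): string
{
    $closing = '</head>';

    // stristr() is case-insensitive and returns the input from the first
    // match onwards, or false when there is no match.
    $tail = stristr($html, $closing);
    if ($tail === false) {
        return $html;
    }

    // Skip past the closing tag itself; the cast guards against older PHP
    // versions where substr() could return false for an empty result.
    return (string) substr($tail, strlen($closing));
}
```

What is left is just the body, so something like trim(strip_tags(drop_head($html))) should give the plain text, as long as strip_tags() sees the whole string rather than one line at a time.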


 I've got a lil problem with HTML tags. Here's the description.

 My site accepts HTML files by upload. A lot of these files are written in
 MS Word and then saved as HTML files from that. MS Word likes to put a
 bunch of garbage at the beginning of the file. Now, when users upload
 their HTML files, my script runs strip_tags() to remove all of the
 unnecessary junk in there, except it can't get rid of all this junk (HTML,
 XML, CSS, JavaScript) at the beginning of the HTML file. Some of these
 tags span multiple lines, and my script goes through line-by-line, so it
 won't identify these as tags. Is there a simpler way? I don't need the
 junk about style sheets and stuff, because I have a style sheet that will
 take care of styling the files the way they should be. I don't want the
 extra tags, even though they're invisible to users when they web-view,
 because these are e-mailable files (for HTML mail, it's fine; for text
 mail, I need to strip it down and that's the problem).

 =
 Just in case, I've included the HTML code below:


 <html xmlns:o="urn:schemas-microsoft-com:office:office"
 xmlns:w="urn:schemas-microsoft-com:office:word"
 xmlns="http://www.w3.org/TR/REC-html40">

 <head>
 <meta http-equiv=Content-Type content="text/html; charset=windows-1252">
 <meta name=ProgId content=Word.Document>
 <meta name=Generator content="Microsoft Word 10">
 <meta name=Originator content="Microsoft Word 10">
 <link rel=File-List href="NW100_files/filelist.xml">
 <title>Test test test</title>
 <!--[if gte mso 9]><xml>
  <o:DocumentProperties>
   <o:Author>Liam Gibbs</o:Author>
   <o:LastAuthor>Liam Gibbs</o:LastAuthor>
   <o:Revision>1</o:Revision>
   <o:TotalTime>1</o:TotalTime>
   <o:Created>2002-08-30T18:09:00Z</o:Created>
   <o:LastSaved>2002-08-30T18:10:00Z</o:LastSaved>
   <o:Pages>1</o:Pages>
   <o:Words>13</o:Words>
   <o:Characters>79</o:Characters>
   <o:Company>SXIA</o:Company>
   <o:Lines>1</o:Lines>
   <o:Paragraphs>1</o:Paragraphs>
   <o:CharactersWithSpaces>91</o:CharactersWithSpaces>
   <o:Version>10.3501</o:Version>
  </o:DocumentProperties>
 </xml><![endif]--><!--[if gte mso 9]><xml>
  <w:WordDocument>
   <w:SpellingState>Clean</w:SpellingState>
   <w:GrammarState>Clean</w:GrammarState>
   <w:Compatibility>
    <w:BreakWrappedTables/>
    <w:SnapToGridInCell/>
    <w:WrapTextWithPunct/>
    <w:UseAsianBreakRules/>
   </w:Compatibility>
   <w:BrowserLevel>MicrosoftInternetExplorer4</w:BrowserLevel>
  </w:WordDocument>
 </xml><![endif]-->
 <style>
 <!--
  /* Style Definitions */
  p.MsoNormal, li.MsoNormal, div.MsoNormal
 {mso-style-parent:"";
 margin:0cm;
 margin-bottom:.0001pt;
 mso-pagination:widow-orphan;
 font-size:12.0pt;
 font-family:"Times New Roman";
 mso-fareast-font-family:"Times New Roman";}
 span.SpellE
 {mso-style-name:"";
 mso-spl-e:yes;}
 @page Section1
 {size:612.0pt 792.0pt;
 margin:72.0pt 90.0pt 72.0pt 90.0pt;
 mso-header-margin:35.4pt;
 mso-footer-margin:35.4pt;
 mso-paper-source:0;}
 div.Section1
 {page:Section1;}
 -->
 </style>
 <!--[if gte mso 10]>
 <style>
  /* Style Definitions */
  table.MsoNormalTable
 {mso-style-name:"Table Normal";
 mso-tstyle-rowband-size:0;
 mso-tstyle-colband-size:0;
 mso-style-noshow:yes;
 mso-style-parent:"";
 mso-padding-alt:0cm 5.4pt 0cm 5.4pt;
 mso-para-margin:0cm;
 mso-para-margin-bottom:.0001pt;
 mso-pagination:widow-orphan;
 font-size:10.0pt;
 font-family:"Times New Roman";}
 </style>
 <![endif]-->
 </head>

 <body lang=EN-US style='tab-interval:36.0pt'>

 <div class=Section1>

 <p class=MsoNormal>Test <span class=SpellE>test</span> <span
 class=SpellE>test</span></p>

 <p class=MsoNormal align=center style='text-align:center'><span
 class=SpellE>Fdjfkasdjfkla</span></p>

 <p class=MsoNormal align=center style='text-align:center'><span
 class=SpellE><b
 style='mso-bidi-font-weight:normal'>Fdjkslafjdklaf</b></span></p>

 <p class=MsoNormal style='text-align:justify'><o:p>&nbsp;</o:p></p>

 <p class=MsoNormal style='text-align:justify'><span
 class=SpellE>Fdasfdfasffasdfdaadfdfs</span></p>

 <p class=MsoNormal style='text-align:justify'><span
 class=SpellE>Dfsdfs</span></p>

 <p class=MsoNormal style='text-align:justify'>Hi</p>

 <p class=MsoNormal style='text-align:justify'><o:p>&nbsp;</o:p></p>

 <p class=MsoNormal style='text-align:justify'><span
 style='mso-tab-count:3'> </span><span
 class=SpellE>Jfdklas</span></p>

 <p class=MsoNormal style='text-align:justify'><o:p>&nbsp;</o:p></p>

 </div>

 </body>

 </html>

 --
 PHP General Mailing List (http://www.php.net/)
 To unsubscribe, visit: http://www.php.net/unsub.php
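
The line-by-line problem described in the question can be reproduced with a small sketch like the following. It uses one wrapped paragraph from the sample as test input; the variable names are arbitrary, and the UTF-8 charset passed to html_entity_decode() is an assumption about the desired output encoding.

```php
<?php
// A tag that wraps across two lines, as in the Word sample.
$html = "<p class=MsoNormal style='text-align:justify'><span\n"
      . "class=SpellE>Dfsdfs</span></p>\n";

// Line by line: each strip_tags() call sees only half of the <span ...> tag,
// so the second half ("class=SpellE>") is treated as ordinary text.
$perLine = '';
foreach (explode("\n", $html) as $line) {
    $perLine .= strip_tags($line) . "\n";
}

// Whole string: the tag is recognised from "<" to ">" across the line break.
$whole = strip_tags($html);

// For the text-mail version, decode entities such as &nbsp; afterwards
// (UTF-8 output is an assumption here).
$text = trim(html_entity_decode($whole, ENT_QUOTES, 'UTF-8'));
```

Reading the upload with file_get_contents() instead of looping over lines gives strip_tags() the whole document at once, which should avoid the half-tag leftovers.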



