>My site accepts HTML files by upload. A lot of these files are written in MS
>Word and then saved as HTML files from that. MS Word likes to put a bunch of
>garbage at the beginning of the file. Now, when users upload their HTML
>files, my script goes and striptags all of the unnecessary junk in there
>except it can't rid all this junk (HTML, XML, CSS, JavaScript) at the
>beginning of the HTML file.

But those are all enclosed in HTML tags, even with something as sucky as MS
Word involved.

>Some of these tags span multiple lines, and my
>script goes through line-by-line, so it won't identify these as tags. Is
>there a simpler fashion?

There's your true problem.

An HTML tag can span multiple lines, regardless of where it comes from.

Even my hand-coded HTML will occasionally end up with a multi-line HTML
tag... Well, okay, maybe not, but I could if I wanted to :-)

You need to implode (http://php.net/implode) all your HTML into one big long
string *before* you call strip_tags:

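// Assumes $html is an array of lines, e.g. from file(), each line
// still ending in its own newline.
$html = implode('', $html);   // join the lines into one string
$html = strip_tags($html);    // tags that spanned lines are now whole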
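
To see why the order matters, here is a minimal sketch comparing the two
approaches. The sample markup is made up for illustration (it only imitates
the kind of tag Word emits), and the variable names are mine:

```php
<?php
// Hypothetical input: one tag split across two lines, as file() would
// return it (each element keeps its trailing newline).
$lines = array(
    "<p class=MsoNormal\n",
    "   style='margin:0'>Hello</p>\n",
);

// Line by line: neither line holds a complete tag, so the tail of the
// tag on the second line is treated as ordinary text.
$perLine = implode('', array_map('strip_tags', $lines));

// Whole document: the tag is complete, so strip_tags can remove it.
$whole = strip_tags(implode('', $lines));

var_dump($perLine === $whole); // the two results differ
echo $whole;
```

The line-by-line result still carries leftover attribute text, while the
joined version comes out as plain "Hello".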

If you really need the stripped text turned back into an array of lines after
that, you can do:

$html = explode("\n", $html);

But you probably are storing this stuff in a file or database, and it's just
as easy to fwrite the large string as to mess with it as an array.
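
Something along these lines, as a rough sketch. The temp files stand in for
your real upload and destination paths, which I'm only guessing at:

```php
<?php
// Hypothetical files; a real script would use the uploaded file's path
// and wherever you actually store the result.
$in  = tempnam(sys_get_temp_dir(), 'in');
$out = tempnam(sys_get_temp_dir(), 'out');
file_put_contents($in, "<b\n>Hi</b> there\n");

// file() gives an array of lines; join them before stripping.
$text = strip_tags(implode('', file($in)));

// Write the cleaned text out as one string, no array needed.
$fp = fopen($out, 'w');
fwrite($fp, $text);
fclose($fp);

$saved = file_get_contents($out);

// Clean up the temp files.
unlink($in);
unlink($out);
```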

-- 
Like Music?  http://l-i-e.com/artists.htm

