On 18/09/2010, at 6:21 PM, Geoffrey van Wyk wrote:
> Hi All,
>
> I want to remove empty paragraphs from an HTML document using
> simple_html_dom.php. I know how to do it using the DOMDocument class, but,
> because the HTML files I work with are prepared in MS Word, the DOMDocument's
> loadHTMLFile() function gives this exception "Namespaces are not defined".
>
> This is the code I use with the DOMDocument object for HTML files not
> prepared in MS Word:
>
> <?php
> /* Using the DOMDocument class */
>
> /* Create a new DOMDocument object. */
> $html = new DOMDocument("1.0", "UTF-8");
>
> /* Load HTML code from an HTML file into the DOMDocument. */
> $html->loadHTMLFile("HTML File With Empty Paragraphs.html");
>
> /* Assign all the <p> elements into the $pars DOMNodeList object. */
> $pars = $html->getElementsByTagName("p");
>
> echo "The initial number of paragraphs is " . $pars->length . ".";
>
> /* The trim() function is used to remove leading and trailing spaces as well
>  * as newline characters. */
> for ($i = 0; $i < $pars->length; $i++){
>     if (trim($pars->item($i)->textContent == "")){
>         $pars->item($i)->parentNode->removeChild($pars->item($i));
>         $i--;
>     }
> }
>
> echo "The final number of paragraphs is " . $pars->length . ".";
>
> // Write the HTML code back into an HTML file.
> $html->saveHTMLFile("HTML File WithOut Empty Paragraphs.html");
> ?>
>
> This is the code I use with the simple_html_dom.php module for HTML files
> prepared in MS Word:
>
> <?php
> /* Using simple_html_dom.php */
>
> include("simple_html_dom.php");
>
> $html = file_get_html("HTML File With Empty Paragraphs.html");
>
> $pars = $html->find("p");
>
> for ($i = 0; $i < count($pars); $i++) {
>     if (trim($pars[$i]->plaintext == "")) {
>         unset($pars[$i]);
>         $i--;
>     }
> }
>
> $html->save("HTML File without Empty Paragraphs.html");
> ?>
>
> It is almost the same, except that the $pars variable is a DOMNodeList
> when using DOMDocument and an array when using simple_html_dom.php. But this
> code does not work. First it runs for two minutes and then reports these
> errors: "Undefined offset: 1" and "Trying to get property of non-object" for
> this line: "if (trim($pars[$i]->plaintext == "")) {".
>
> Does anyone know how I can fix this?
>
> Thank you.
>
> Geoffrey van Wyk
>
Personally, I'd just use regex to do it. Something like
preg_replace('#<p[^>]*?>\s*</p>#m', '', $html) should do it.
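
A minimal sketch of the regex approach, assuming the goal is to drop any <p> element that contains nothing but whitespace. The helper name remove_empty_paragraphs() is hypothetical, and treating &nbsp; as whitespace is an added assumption that is not part of the suggestion above.

```php
<?php
// Sketch only: remove_empty_paragraphs() is a hypothetical helper name.
function remove_empty_paragraphs(string $html): string
{
    // <p\b[^>]*>      an opening <p> tag with any attributes (\b avoids matching <pre>)
    // (?:\s|&nbsp;)*  nothing but whitespace or &nbsp; entities (assumption: count these as empty)
    // </p>            the closing tag; the i flag makes the tag names case-insensitive
    // preg_replace() returns null on a regex error, so fall back to the input unchanged.
    return preg_replace('#<p\b[^>]*>(?:\s|&nbsp;)*</p>#i', '', $html) ?? $html;
}

$input = "<p>First</p><p> </p><p class=\"MsoNormal\">&nbsp;</p><p>Last</p>";
echo remove_empty_paragraphs($input), "\n";
```

Since $html in the original script is a parsed document object, the string would probably come from file_get_contents() instead. A pattern like this will not catch paragraphs whose only content is another empty tag, which Word-generated markup may contain, so treat it as a starting point.
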
Otherwise, you've got trim($pars[$i]->plaintext == "") instead of
trim($pars[$i]->plaintext) == "".
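
To see why the parenthesis placement matters, here is a small standalone sketch using only built-in functions. The variable names are illustrative and not taken from the original script.

```php
<?php
$text = " \n ";  // whitespace-only, like the text of an "empty" paragraph

// Misplaced parenthesis: the comparison runs first, so trim() receives a boolean.
$buggy = (bool) trim($text == "");

// Intended form: trim the string first, then compare the result.
$fixed = trim($text) == "";

var_dump($buggy); // bool(false): the whitespace-only text is not seen as empty
var_dump($fixed); // bool(true)
```

Separately, unset($pars[$i]) only drops an entry from the array that find() returned, which would explain the "Undefined offset" notice once the loop steps back onto a missing index. It does not appear to touch the document itself. The simple_html_dom manual shows removing an element by setting its outertext to an empty string, so that is probably what the loop body needs in place of unset().
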
---
Simon Welsh
Admin of http://simon.geek.nz/
Who said Microsoft never created a bug-free program? The blue screen never,
ever crashes!
http://www.thinkgeek.com/brain/gimme.cgi?wid=81d520e5e
--
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php