[PHP] Re: Is there a good way to extract the <embed>/<object> content in HTML with/without closing tag?

Sun, 23 May 2010 13:07:52 -0700

Chian Hsieh wrote:
Hi,

I want to extract all content that starts with an <embed> or <object> tag,
with or without a closing tag.
My solution uses regular expressions, but there are some cases
I cannot handle.

The REGEXs I used are:

// With closing tag
if (preg_match_all("#(<(object|embed)[^>]+>.*?</\\2>)#is", $str, $matchObjs)) {
  // blahblah

// Without closing tag
} else if (preg_match_all("#(<(?:object|embed)[^>]+>)#",$str,$matchObjs)){
  // blahblah
}

But it can fail if $str mixes tags with and without closing tags:

$str = '<div><div><object type="application/x-shockwave-flash"><param name="zz" value="xx"></object></div>'
     . '<div><embed src="http://sample.com" /></div>';

In this case, it only matches:
<object type="application/x-shockwave-flash"><param name="zz"
value="xx"></object>

but I want to get both results:
<object type="application/x-shockwave-flash"><param name="zz"
value="xx"></object>
<embed src="http://sample.com" />


So, is there a good way to handle this with a single regex?

If you're open to using methods other than regex, one way to get pretty good results is to run the document through HTML Tidy, parse it into a DOM, and query it using XPath/XQuery. That basically mimics the way browsers do it (and the way the HTML specs recommend).
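
As a rough illustration of the DOM route, here is a minimal sketch using PHP's bundled DOM extension. The function name extractObjectAndEmbed is hypothetical, and the sketch skips the Tidy step on the assumption that DOMDocument's HTML parser is lenient enough for the input; where the tidy extension is installed, the markup could be passed through tidy_repair_string() first.

```php
<?php
// Hypothetical helper (not from the original thread): returns the markup of
// every outermost <object> and every <embed> that is not nested in an <object>.
function extractObjectAndEmbed(string $html): array
{
    if (trim($html) === '') {
        return []; // loadHTML() rejects empty input
    }

    $doc = new DOMDocument();

    // Collect parser warnings internally instead of emitting them;
    // real-world markup is rarely valid.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $query = '//object[not(ancestor::object)] | //embed[not(ancestor::object)]';

    $results = [];
    foreach ($xpath->query($query) as $node) {
        // Serialize the matched element together with its children.
        $results[] = $doc->saveHTML($node);
    }
    return $results;
}

// Usage with the sample string from the question.
$str = '<div><div><object type="application/x-shockwave-flash"><param name="zz" value="xx"></object></div>'
     . '<div><embed src="http://sample.com" /></div>';

foreach (extractObjectAndEmbed($str) as $fragment) {
    echo $fragment, "\n";
}
```

Because the parser builds a tree, it does not matter whether a given tag has a closing tag or not. The serialized fragments may differ slightly from the input (attribute quoting, closing tags), since they are written back out from the DOM rather than copied from the source string.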

Best,

Nathan

--
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php
