Hi,

I am using libxml2 and XPath to parse HTML,
find all <script> elements, and remove them.

If the <script> is inside another tag (such as a <div> or a <span>) AND
the script contains a single-quoted string that includes the same tag (a <div>
or a <span>), then the <script> element is not removed properly. It is only
removed up to the <div> or <span> that's inside the quoted string.

Here is an example of the HTML, followed by the parsed result. You can
see that a remnant of the quoted string from inside the <script> appears
in the parsed output.

<html>
<head></head>
<body>
<h1><<br>PROBLEM CASE<br><h1>
<span>
<script language="javascript">

        document.write('<span>Shop Products</span>');

</script>
</span>
</body>
</html>



PARSED RESULT:
<?xml version="1.0" standalone="yes"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
"http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
  <head>
    <link rel="stylesheet" type="text/css" href="index.css" media="all"/>
  </head>
  <body>
    <h1>&lt;<br/>PROBLEM CASE<br/></h1>
    <h1>
<span>

</span>');   

<span>Shop Products</span>

</h1>
  </body>
</html>
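
For comparison, here is a minimal sketch of the result I would expect. It is
written in Python with the standard-library html.parser, not libxml2, and the
names ScriptStripper and strip_scripts are made up for illustration. It relies
on the content of <script> being raw text (CDATA in HTML 4), so the <span>
inside the quoted string is never treated as markup. Comments and the doctype
are simply dropped to keep the sketch short.

```python
from html.parser import HTMLParser


class ScriptStripper(HTMLParser):
    """Re-emit HTML while dropping <script> elements and their content."""

    def __init__(self):
        # Keep character references as written so they can be re-emitted.
        super().__init__(convert_charrefs=False)
        self.out = []
        self.in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True
        else:
            # Re-emit the start tag exactly as it appeared in the input.
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are re-emitted as written.
        if tag != "script":
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False
        else:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # Script content arrives here as plain text, including any
        # "<span>" that sits inside a quoted JavaScript string.
        if not self.in_script:
            self.out.append(data)

    def handle_entityref(self, name):
        if not self.in_script:
            self.out.append(f"&{name};")

    def handle_charref(self, name):
        if not self.in_script:
            self.out.append(f"&#{name};")


def strip_scripts(html):
    """Return html with every <script> element removed."""
    parser = ScriptStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


if __name__ == "__main__":
    problem_case = (
        "<span><script language=\"javascript\">"
        "document.write('<span>Shop Products</span>');"
        "</script></span>"
    )
    print(strip_scripts(problem_case))
```

With the problem case above, nothing from the quoted string should remain in
the output; only the outer <span> survives.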



Can this be fixed? Could someone confirm that this is indeed a bug?

Thank you!





_______________________________________________
xml mailing list, project page  http://xmlsoft.org/
[email protected]
http://mail.gnome.org/mailman/listinfo/xml
