Hi,
I am using libxml2 and XPath to parse HTML, find all <script> elements,
and remove them.
If a <script> is nested inside another tag (such as a <div> or a <span>)
AND the script contains a single-quoted string that includes the same tag
(a <div> or a <span>), then the <script> element is not removed properly.
It is only removed up to the <div> or <span> that's inside the quoted
string.
Here is an example of the HTML, followed by the parsed result. A remnant
of the quoted string that's inside the <script> appears in the parsed
output.
<html>
<head></head>
<body>
<h1><<br>PROBLEM CASE<br><h1>
<span>
<script language="javascript">
document.write('<span>Shop Products</span>');
</script>
</span>
</body>
</html>
PARSED RESULT:
<?xml version="1.0" standalone="yes"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
"http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<head>
<link rel="stylesheet" type="text/css" href="index.css" media="all"/>
</head>
<body>
<h1><<br/>PROBLEM CASE<br/></h1>
<h1>
<span>
</span>');
<span>Shop Products</span>
</h1>
</body>
</html>
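For comparison, the result this report seems to expect can be sketched with Python's standard-library html.parser instead of libxml2. This is only an illustration under that assumption: it treats everything between <script> and </script> as raw text, so the <span> inside the quoted string is never seen as a tag. The names ScriptStripper and strip_scripts are hypothetical, and the sketch neither reproduces nor fixes the libxml2 behavior described above.

```python
from html.parser import HTMLParser


class ScriptStripper(HTMLParser):
    """Re-emit HTML source while dropping <script> elements and their content."""

    def __init__(self):
        # Keep character references as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.out = []
        self.in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True
        elif not self.in_script:
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/>; a self-closing <script/> is dropped.
        if tag != "script" and not self.in_script:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False
        elif not self.in_script:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # Script content arrives here as raw text and is skipped.
        if not self.in_script:
            self.out.append(data)

    def handle_entityref(self, name):
        if not self.in_script:
            self.out.append(f"&{name};")

    def handle_charref(self, name):
        if not self.in_script:
            self.out.append(f"&#{name};")

    def handle_comment(self, data):
        if not self.in_script:
            self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def strip_scripts(html):
    """Return the HTML text with every <script> element removed."""
    parser = ScriptStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


if __name__ == "__main__":
    sample = (
        "<span>\n"
        '<script language="javascript">\n'
        "document.write('<span>Shop Products</span>');\n"
        "</script>\n"
        "</span>"
    )
    print(strip_scripts(sample))
```

With the <span>/<script> fragment from the example above, the outer <span> survives and nothing from the document.write call is left behind. The sketch works on the token stream rather than on a tree, so it does not repair malformed markup the way an HTML tree parser would.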
Could someone confirm whether this is indeed a bug? If so, can it be fixed?
Thank you!
_______________________________________________
xml mailing list, project page http://xmlsoft.org/
http://mail.gnome.org/mailman/listinfo/xml