On Oct 6, 2010, at 10:08 AM, [email protected] wrote:

> On Wed, Oct 6, 2010 at 12:18 AM, Steven Falken wrote:
>> Hi,
>> I'm trying to parse bare.txt (attached, yes it is simply cnn.com). For
>> this purpose I'm using parse.c (also attached).
>> The output is output.txt (Attachment!).
>> If you look at bare.txt, you see a <script> block from line 826 to
>> line 886. Now if you look at output.txt, you see the
>> <script>-Tag in line 759, but the end-Tag (</script>) is in line 784;
>> the problem is, that this end-Tag is in the middle
>> of the javascript-code, which is actually bad :(
> 
> This is because cnn's HTML sucks :). They can't seem to make up their
> mind between HTML and XHTML.
> 
> Take a look at line 792 of output.txt: the for statement is mangled.
> Looks like the '<' operator was interpreted by libxml as a start tag.
> The </script> is in the place where a </a> is in bare.txt.
> 
> Perhaps libxml2 betrayed its true nature (an XML parser) and parsed
> bare.txt as XML (XHTML). In that case the content of <script> is also
> parsed as XML and must be well-formed XML (which it isn't).
> See http://javascript.about.com/library/blxhtml.htm
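
To make the quoted explanation concrete, here is a minimal sketch of the
XML-versus-HTML difference. It uses Python's standard-library parsers, not
libxml2, purely so it is self-contained; the snippet and the helper names
(parses_as_xml, script_bodies) are made up for illustration and have
nothing to do with parse.c or bare.txt.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical inline script containing a bare '<' operator.
JS = "for (var i = 0; i < n; i++) { go(i); }"
SNIPPET = "<div><script>" + JS + "</script></div>"
# Same script wrapped in CDATA, the usual XHTML workaround.
CDATA_SNIPPET = "<div><script><![CDATA[" + JS + "]]></script></div>"


def parses_as_xml(markup):
    """Return True if a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


class ScriptCollector(HTMLParser):
    """Collect the raw text of each <script> element, HTML-style."""

    def __init__(self):
        super().__init__()
        self.in_script = False
        self.scripts = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True
            self.scripts.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        # Data may arrive in several chunks, so accumulate it.
        if self.in_script:
            self.scripts[-1] += data


def script_bodies(markup):
    """Return the script contents as seen by an HTML-aware parser."""
    collector = ScriptCollector()
    collector.feed(markup)
    collector.close()
    return collector.scripts


if __name__ == "__main__":
    print("plain snippet is XML:", parses_as_xml(SNIPPET))
    print("CDATA snippet is XML:", parses_as_xml(CDATA_SNIPPET))
    print("HTML view of script: ", script_bodies(SNIPPET))
```

The XML parser rejects the plain snippet because the '<' in the loop
condition looks like the start of a tag. The CDATA-wrapped version is
accepted, and the HTML-aware parser returns the script text untouched.
libxml2's behaviour on real-world pages may differ in detail; this only
shows the general effect.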

Alternatively, this is yet another reason why inline JavaScript should be 
avoided if at all possible.  Use the src, Luke.


David

_______________________________________________
xml mailing list, project page  http://xmlsoft.org/
[email protected]
http://mail.gnome.org/mailman/listinfo/xml
