On Wed, Oct 6, 2010 at 12:18 AM, Steven Falken wrote:
> Hi,
> I'm trying to parse bare.txt (attached, yes it is simply cnn.com). For
> this purpose I'm using parse.c (also attached).
> The output is output.txt (Attachment!).
> If you look at bare.txt, you see a <script> block from line 826 to
> line 886. Now if you look at output.txt, you see the
> <script>-Tag in line 759, but the end-Tag (</script>) is in line 784;
> the problem is, that this end-Tag is in the middle
> of the javascript-code, which is actually bad :(

This is because cnn's HTML sucks :). They can't seem to make up their
mind between HTML and XHTML.

Take a look at line 792 of output.txt: the for statement is mangled.
It looks like libxml interpreted the '<' operator as the start of a tag.
The </script> in output.txt sits where a </a> is in bare.txt.

Perhaps libxml2 betrayed its true nature (an XML parser) and parsed
bare.txt as XML (XHTML). In that case the content of <script> is parsed
as XML too, and must be valid XML (which it isn't). See
http://javascript.about.com/library/blxhtml.htm
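
To illustrate the distinction, here is a minimal sketch. It does not use libxml2 or your parse.c; it uses the XML and HTML parsers from Python's standard library as stand-ins, and the page string, the ScriptCollector class and the helper names are made up for this example. The same script block is rejected when read as XML, because of the bare '<' operator, and kept intact when read as HTML, where the content of <script> is treated as raw text.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical stand-in for the relevant part of bare.txt: a script
# block whose JavaScript uses the '<' operator.
PAGE = (
    "<html><body><script>"
    "for (var i = 0; i < n; i++) { total += i; }"
    "</script></body></html>"
)


class ScriptCollector(HTMLParser):
    """Collects the raw text found inside <script> elements."""

    def __init__(self):
        super().__init__()
        self.in_script = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        if self.in_script:
            self.chunks.append(data)


def parse_as_xml(text):
    """Return the parser's error message, or None if the text is well-formed XML."""
    try:
        ET.fromstring(text)
        return None
    except ET.ParseError as exc:
        return str(exc)


def parse_as_html(text):
    """Return the script content as seen by an HTML parser."""
    collector = ScriptCollector()
    collector.feed(text)
    collector.close()
    return "".join(collector.chunks)


xml_error = parse_as_xml(PAGE)
script_body = parse_as_html(PAGE)

print("XML parser: ", xml_error)
print("HTML parser:", script_body)
```

If the page really has to go through an XML parser, the usual workaround is to wrap the script content in a CDATA section or to escape '<' as &lt;, which is what the linked article describes.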
