Re: [whatwg] Distinguishing XML and HTML by content sniffing

Julian Reschke Sun, 04 Mar 2007 02:47:49 -0800

Michael Day wrote:

...
I think that approach could easily misidentify valid HTML documents as being XML. It would be easy to parse the first 8Kb of many HTML documents with an XML parser, as unclosed tags like <link> and <meta> would not trigger any well-formedness errors unless you parsed all the way to the end of the document -- not just the first 8Kb -- and found that they were never closed.
On a more pragmatic level, I think it would also be slightly more difficult to implement this approach with libxml2, as you would have to carefully feed the parser only 8Kb (or some other amount) and then stop it before it hits the end of the buffer and complains about all the unclosed tags. However, the misidentification problem is a more serious issue affecting this approach.

Hm.

What, except efficiency, prevents you from parsing the whole file with an XML parser? If it parses, it is XML. Otherwise it isn't.
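
A rough sketch of that check, for illustration only. It uses Python's bundled expat bindings rather than the libxml2 mentioned above, and the function names, the 8192-byte limit and the sample documents are all made up for this example. The second function shows the prefix case from the quoted text: as long as the parser is never told the input has ended, unclosed <meta> and <link> elements raise no error.

```python
import xml.parsers.expat


def is_well_formed_xml(data: bytes) -> bool:
    """Parse the complete input; True only if it is well-formed XML."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(data, True)  # True: this is the end of the document
    except xml.parsers.expat.ExpatError:
        return False
    return True


def prefix_parses_without_error(data: bytes, limit: int = 8192) -> bool:
    """Feed only the first `limit` bytes and never signal end of input."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(data[:limit], False)  # False: more data may follow
    except xml.parsers.expat.ExpatError:
        return False
    return True


# Hypothetical samples: HTML with omitted end tags, and an XHTML-style document.
html_doc = (b'<html><head><title>T</title><meta name="a" content="b">'
            b'<link rel="x" href="y"><body><p>Hello')
xhtml_doc = (b'<html><head><title>T</title></head>'
             b'<body><p>Hello</p></body></html>')

print(is_well_formed_xml(xhtml_doc))          # whole-file check on XHTML
print(is_well_formed_xml(html_doc))           # whole-file check on HTML
print(prefix_parses_without_error(html_doc))  # prefix-only check on HTML
```

The whole-file check rejects the HTML sample only because the parser reaches the end of input with elements still open; the prefix-only check never gets that far, which is the misidentification risk described in the quoted message.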


Best regards, Julian
