Re: [whatwg] HTML5 doctypes incompatible with XHR if named entities present
On Thu, Nov 12, 2009 at 12:33 AM, Boris Zbarsky bzbar...@mit.edu wrote:
> I assume you meant "mostly" as in "most of the pages are well-formed", not "pages are mostly well-formed", since the latter is useless, right? I did a brief survey of obvious sites fitting those descriptions that I had in my browser history at the moment.
> . . .
> So either you're looking at a totally different dataset or "mostly" is a bit of a stretch.

I admit I didn't look closely. At a guess, maybe the default WordPress skin(s) are valid XHTML, but custom skins are very popular for WordPress and those mostly aren't valid XHTML? MediaWiki is unreasonably difficult to reskin, so that's not much of a problem for us . . .

> Sure. 0.01% of all websites is a significant number. I just think it's broken often enough, and easy enough to break by accident, that relying on it working for screen scraping is not likely to be happening on a wide scale.

You're probably right.

> Or stop using HTML named entities, yes.

That's not really a very good option, given the size of MediaWiki's code base and the size of Wikipedia's database, and the ugliness of trying to remember what &#160; is when reading the HTML source. It sounds like we're stuck with a legacy doctype if we don't want to break screen-scrapers.
Re: [whatwg] HTML5 doctypes incompatible with XHR if named entities present
Aryeh Gregor wrote:
> On Thu, Nov 12, 2009 at 12:33 AM, Boris Zbarsky bzbar...@mit.edu wrote:
>> I assume you meant "mostly" as in "most of the pages are well-formed", not "pages are mostly well-formed", since the latter is useless, right? I did a brief survey of obvious sites fitting those descriptions that I had in my browser history at the moment.
>> . . .
>> So either you're looking at a totally different dataset or "mostly" is a bit of a stretch.
>
> I admit I didn't look closely. At a guess, maybe the default WordPress skin(s) are valid XHTML, but custom skins are very popular for WordPress and those mostly aren't valid XHTML? MediaWiki is unreasonably difficult to reskin, so that's not much of a problem for us . . .

Even with the default skin it's easy to break (e.g., search for U+). That'll be output to the page and make it not well-formed.

--
Geoffrey Sneddon — Opera Software
http://gsnedders.com/
http://www.opera.com/
Re: [whatwg] HTML5 doctypes incompatible with XHR if named entities present
On 11/11/09 11:16 PM, Aryeh Gregor wrote:
> I'm pretty sure that XHR is used for screen-scraping beyond Wikipedia,

Since it'd fail any time the data is not well-formed XML, I'd actually expect such usage to be rare. It's not all that common to find XHTML on the web that happens to be well-formed XML.

> Could some reasonably minimal, distinctive doctype be invented that would avoid the problem but not make the document look to humans and validators like it thinks it's some old version of XHTML?

Yes, but browsers would have to add explicit support for it.

> Also, is this a wider problem? Are there any other tools besides browsers that might be magically allowing named entities for some doctypes only?

Sure; anything that actually goes and loads the DTD.

-Boris
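The point that named entities only exist when a DTD declares them can be reproduced outside a browser. The following is a minimal sketch, assuming Python's `xml.etree.ElementTree` as a stand-in for a strict XML parser. It is not the parser any browser uses, and it never fetches an external DTD, so it cannot reproduce the XHTML 1.0 doctype case from the thread. The constant and function names are illustrative only. It shows that `&nbsp;` is rejected with no DTD or with a bare `<!DOCTYPE html>`, and accepted once an internal subset declares the entity.

```python
import xml.etree.ElementTree as ET

# The same body under three prologs. Only the last one declares nbsp.
BODY = "<html><body><p>&nbsp;</p></body></html>"
NO_DTD = BODY
HTML5_DOCTYPE = "<!DOCTYPE html>" + BODY
INTERNAL_SUBSET = '<!DOCTYPE html [<!ENTITY nbsp "&#160;">]>' + BODY


def parse_paragraph(source: str):
    """Return the text of the first <p>, or None if the XML parse fails."""
    try:
        root = ET.fromstring(source)
    except ET.ParseError:
        # An undeclared entity is a well-formedness error for this parser.
        return None
    return root.find("./body/p").text


if __name__ == "__main__":
    for label, doc in [("no DTD", NO_DTD),
                       ("HTML5 doctype", HTML5_DOCTYPE),
                       ("internal subset", INTERNAL_SUBSET)]:
        print(label, repr(parse_paragraph(doc)))
```

Declaring the entity in an internal subset is shown only to illustrate the mechanism; nobody in the thread proposes it as a fix for Wikipedia.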
[whatwg] HTML5 doctypes incompatible with XHR if named entities present
I already filed a bug http://www.w3.org/Bugs/Public/show_bug.cgi?id=8268, but figured I'd copy it here to get more discussion.

Wikipedia just experimented with switching to an HTML5 doctype. A lot of user tools broke, and after two hours of investigation, we determined that the problem is intractable and switched back to XHTML 1.0 Transitional.

XMLHttpRequest was historically intended only for XML, and lots of scripts rely on the responseXML property being set to a Document. In current browsers, this only happens when the document is actually well-formed XML. But named entities are treated differently based on the doctype. Consider this document:

    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
    <html><head>
    <title>Hello</title>
    </head>
    <body>
    <p>&nbsp;</p>
    </body>
    </html>

This works just fine in all browsers I tested in (latestish versions of Firefox, Chrome, Opera). However, if you serve the exact same document but replace the doctype with <!DOCTYPE html>, all of them throw a syntax error on &nbsp;.

Practically speaking, this means that any site that wants to serve content compatible with XHR cannot use either of the two doctypes that the spec recommends for authors. There are a variety of widely-used scripts on Wikipedia that rely on XHR, so this is currently a blocker for us. It's very unlikely that we'll deploy HTML5 in the foreseeable future if it means our users have to rewrite all their scripts. I'm pretty sure that XHR is used for screen-scraping beyond Wikipedia, too, so this will probably crop up elsewhere too.

I don't know what the extent of the magic is that causes this problem. Could some reasonably minimal, distinctive doctype be invented that would avoid the problem but not make the document look to humans and validators like it thinks it's some old version of XHTML?

If an existing XHTML doctype must be reused, should validators continue to raise warnings as they do now, or should an XHTML doctype be promoted from "obsolete permitted DOCTYPE" to a fully permitted doctype?

Also, is this a wider problem? Are there any other tools besides browsers that might be magically allowing named entities for some doctypes only?
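To gauge how exposed a page is to this failure, one could list the entity references that an XML parser would reject when no DTD declares them. This is a rough sketch in Python under stated assumptions: the function name and regular expression are hypothetical, and the scan is purely textual, so it does not skip comments or CDATA sections and may over-report.

```python
import re

# The five entities XML predefines; these parse without any DTD.
XML_PREDEFINED = {"lt", "gt", "amp", "apos", "quot"}

# Named references only. Numeric references such as &#160; start with '#'
# and are deliberately not matched.
NAMED_REF = re.compile(r"&([A-Za-z_:][A-Za-z0-9_.:-]*);")


def undeclared_entity_refs(source: str) -> list:
    """Return the sorted names of entity references XML does not predefine."""
    names = {match.group(1) for match in NAMED_REF.finditer(source)}
    return sorted(names - XML_PREDEFINED)


if __name__ == "__main__":
    page = "<!DOCTYPE html><html><body><p>&nbsp;&amp;&#160;&eacute;</p></body></html>"
    print(undeclared_entity_refs(page))
```

An empty result means the page uses only references that are safe under a bare `<!DOCTYPE html>`, although the page may still fail to be well-formed for other reasons.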
Re: [whatwg] HTML5 doctypes incompatible with XHR if named entities present
On Wed, Nov 11, 2009 at 11:39 PM, Boris Zbarsky bzbar...@mit.edu wrote:
> Since it'd fail any time the data is not well-formed XML, I'd actually expect such usage to be rare. It's not all that common to find XHTML on the web that happens to be well-formed XML.

A number of popular web apps output mostly well-formed XML, as far as I know: vBulletin, WordPress, etc. Not even close to most websites, of course, but a significant number, I'd think. Of course, they probably don't have the same kind of script-writing community that Wikipedia does -- that's very peculiar to Wikipedia.

> Yes, but browsers would have to add explicit support for it.

That mostly defeats the point -- they could equally add explicit support for non-XML responseXML first. This should be a short- to medium-term problem only.

This makes it sound like if Wikipedia switches to HTML5 and isn't willing to break all screen-scrapers on principle, we'll have to use an obsolete but conforming doctype. That's kind of a pain, particularly from an evangelism point of view. Especially if it raises validator warnings. But I guess that's where XML has gotten us.
Re: [whatwg] HTML5 doctypes incompatible with XHR if named entities present
On 11/11/09 11:57 PM, Aryeh Gregor wrote:
> A number of popular web apps output mostly well-formed XML, as far as I know: vBulletin, WordPress, etc.

I assume you meant "mostly" as in "most of the pages are well-formed", not "pages are mostly well-formed", since the latter is useless, right?

I did a brief survey of obvious sites fitting those descriptions that I had in my browser history at the moment. These were not well-formed:

    http://www.dria.org/wordpress/archives/2009/11/10/1043/
    http://bisdaktech.wordpress.com/
    http://weekinthenee.wordpress.com/2009/11/11/sitting-in-a-park-in-paris-france/
    http://terrytao.wordpress.com/2009/10/29/displaying-mathematics-on-the-web/
    http://ehren.wordpress.com/2009/10/24/a-gcc-hack-my-0-1-release/
    http://www.nvnews.net/vbulletin/showthread.php?t=104201
    http://www.nvnews.net/vbulletin/showthread.php?t=132449

These are:

    http://boomswaggerboom.wordpress.com/
    http://fiber-space.de/wordpress/?p=1016
    http://dafizilla.wordpress.com/2009/11/08/karmic-koala-hides-firefox-context-menuitems-icons/

So either you're looking at a totally different dataset or "mostly" is a bit of a stretch.

> Not even close to most websites, of course, but a significant number, I'd think.

Sure. 0.01% of all websites is a significant number. I just think it's broken often enough, and easy enough to break by accident, that relying on it working for screen scraping is not likely to be happening on a wide scale.

>> Yes, but browsers would have to add explicit support for it.
>
> That mostly defeats the point -- they could equally add explicit support for non-XML responseXML first.

Yep.

> This makes it sound like if Wikipedia switches to HTML5 and isn't willing to break all screen-scrapers on principle, we'll have to use an obsolete but conforming doctype.

Or stop using HTML named entities, yes.

-Boris
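The "stop using HTML named entities" option amounts to emitting numeric character references instead. A minimal sketch of such a rewrite in Python, assuming the standard library's HTML 4 entity table (`html.entities.name2codepoint`) is an acceptable source of names. The function name is hypothetical, and this is not how MediaWiki generates its output. The five XML-predefined entities and any unknown names are left untouched.

```python
import re
import xml.etree.ElementTree as ET
from html.entities import name2codepoint

# Entities every XML parser understands without a DTD; leave these alone.
XML_PREDEFINED = {"lt", "gt", "amp", "apos", "quot"}

NAMED_REF = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def to_numeric_refs(markup: str) -> str:
    """Rewrite HTML named entities as decimal character references."""
    def replace(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)  # keep predefined or unknown references
        return "&#%d;" % name2codepoint[name]
    return NAMED_REF.sub(replace, markup)


if __name__ == "__main__":
    fixed = to_numeric_refs("<p>a&nbsp;b &amp; c&eacute;</p>")
    print(fixed)
    # The rewritten fragment now parses as XML without any DTD.
    print(repr(ET.fromstring(fixed).text))
```

The cost raised in the thread still applies: the numeric form is harder to read in the page source than the named one.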