Re: [whatwg] HTML5 doctypes incompatible with XHR if named entities present
On Thu, Nov 12, 2009 at 12:33 AM, Boris Zbarsky bzbar...@mit.edu wrote:
> I assume you meant "mostly" as in "most of the pages are well-formed", not "pages are mostly well-formed", since the latter is useless, right? I did a brief survey of obvious sites fitting those descriptions that I had in my browser history at the moment.
> . . .
> So either you're looking at a totally different dataset or "mostly" is a bit of a stretch.

I admit I didn't look closely. At a guess, maybe the default WordPress skin(s) are valid XHTML, but custom skins are very popular for WordPress and those mostly aren't valid XHTML? MediaWiki is unreasonably difficult to reskin, so that's not much of a problem for us . . .

> Sure. 0.01% of all websites is a significant number. I just think it's broken often enough, and easy enough to break by accident, that relying on it working for screen scraping is not likely to be happening on a wide scale.

You're probably right.

> Or stop using HTML named entities, yes.

That's not really a very good option, given the size of MediaWiki's code base and the size of Wikipedia's database, and the ugliness of trying to remember what &#160; is when reading the HTML source. It sounds like we're stuck with a legacy doctype if we don't want to break screen-scrapers.
Re: [whatwg] HTML5 doctypes incompatible with XHR if named entities present
Aryeh Gregor wrote:
> On Thu, Nov 12, 2009 at 12:33 AM, Boris Zbarsky bzbar...@mit.edu wrote:
>> I assume you meant "mostly" as in "most of the pages are well-formed", not "pages are mostly well-formed", since the latter is useless, right? I did a brief survey of obvious sites fitting those descriptions that I had in my browser history at the moment.
>> . . .
>> So either you're looking at a totally different dataset or "mostly" is a bit of a stretch.
>
> I admit I didn't look closely. At a guess, maybe the default WordPress skin(s) are valid XHTML, but custom skins are very popular for WordPress and those mostly aren't valid XHTML? MediaWiki is unreasonably difficult to reskin, so that's not much of a problem for us . . .

Even with the default skin it's easy to break (e.g., search for U+). That'll be output to the page and make it not well-formed.

-- 
Geoffrey Sneddon — Opera Software
http://gsnedders.com/ http://www.opera.com/
Re: [whatwg] HTML5 doctypes incompatible with XHR if named entities present
On 11/11/09 11:16 PM, Aryeh Gregor wrote:
> I'm pretty sure that XHR is used for screen-scraping beyond Wikipedia,

Since it'd fail any time the data is not well-formed XML, I'd actually expect such usage to be rare. It's not all that common to find XHTML on the web that happens to be well-formed XML.

> Could some reasonably minimal, distinctive doctype be invented that would avoid the problem but not make the document look to humans and validators like it thinks it's some old version of XHTML?

Yes, but browsers would have to add explicit support for it.

> Also, is this a wider problem? Are there any other tools besides browsers that might be magically allowing named entities for some doctypes only?

Sure; anything that actually goes and loads the DTD.

-Boris
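
To make the failure mode in this message concrete, here is a minimal sketch in Python. It assumes a non-validating XML parser that does not load any DTD (Python's expat-based `xml.etree.ElementTree`), which is not the same code path as a browser's XHR; the helper name `is_well_formed` and the two sample documents are hypothetical and exist only for illustration. It shows that, with the HTML5 doctype and no DTD to declare it, a named entity such as `&nbsp;` makes the document fail to parse as XML, while the numeric reference `&#160;` does not.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup: str) -> bool:
    """Return True if a non-validating XML parser accepts the markup.

    Hypothetical helper for illustration; it only checks XML
    well-formedness and never fetches or reads a DTD.
    """
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# Hypothetical sample pages, both using the HTML5 doctype.
# The first uses a named entity, the second a numeric character reference.
NAMED = "<!DOCTYPE html><html><body><p>a&nbsp;b</p></body></html>"
NUMERIC = "<!DOCTYPE html><html><body><p>a&#160;b</p></body></html>"

if __name__ == "__main__":
    print("named entity:", is_well_formed(NAMED))
    print("numeric reference:", is_well_formed(NUMERIC))
```

A tool that actually goes and loads the DTD, as mentioned above, would behave differently for a legacy XHTML doctype, because the DTD is what declares `nbsp` and the other named entities.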
Re: [whatwg] HTML5 doctypes incompatible with XHR if named entities present
On Wed, Nov 11, 2009 at 11:39 PM, Boris Zbarsky bzbar...@mit.edu wrote:
> Since it'd fail any time the data is not well-formed XML, I'd actually expect such usage to be rare. It's not all that common to find XHTML on the web that happens to be well-formed XML.

A number of popular web apps output mostly well-formed XML, as far as I know: vBulletin, WordPress, etc. Not even close to most websites, of course, but a significant number, I'd think. Of course, they probably don't have the same kind of script-writing community that Wikipedia does -- that's very peculiar to Wikipedia.

> Yes, but browsers would have to add explicit support for it.

That mostly defeats the point -- they could equally add explicit support for non-XML responseXML first. This should be a short- to medium-term problem only.

This makes it sound like if Wikipedia switches to HTML5 and isn't willing to break all screen-scrapers on principle, we'll have to use an obsolete but conforming doctype. That's kind of a pain, particularly from an evangelism point of view. Especially if it raises validator warnings. But I guess that's where XML has gotten us.
Re: [whatwg] HTML5 doctypes incompatible with XHR if named entities present
On 11/11/09 11:57 PM, Aryeh Gregor wrote:
> A number of popular web apps output mostly well-formed XML, as far as I know: vBulletin, WordPress, etc.

I assume you meant "mostly" as in "most of the pages are well-formed", not "pages are mostly well-formed", since the latter is useless, right?

I did a brief survey of obvious sites fitting those descriptions that I had in my browser history at the moment. These were not well-formed:

http://www.dria.org/wordpress/archives/2009/11/10/1043/
http://bisdaktech.wordpress.com/
http://weekinthenee.wordpress.com/2009/11/11/sitting-in-a-park-in-paris-france/
http://terrytao.wordpress.com/2009/10/29/displaying-mathematics-on-the-web/
http://ehren.wordpress.com/2009/10/24/a-gcc-hack-my-0-1-release/
http://www.nvnews.net/vbulletin/showthread.php?t=104201
http://www.nvnews.net/vbulletin/showthread.php?t=132449

These are:

http://boomswaggerboom.wordpress.com/
http://fiber-space.de/wordpress/?p=1016
http://dafizilla.wordpress.com/2009/11/08/karmic-koala-hides-firefox-context-menuitems-icons/

So either you're looking at a totally different dataset or "mostly" is a bit of a stretch.

> Not even close to most websites, of course, but a significant number, I'd think.

Sure. 0.01% of all websites is a significant number. I just think it's broken often enough, and easy enough to break by accident, that relying on it working for screen scraping is not likely to be happening on a wide scale.

>> Yes, but browsers would have to add explicit support for it.
>
> That mostly defeats the point -- they could equally add explicit support for non-XML responseXML first.

Yep.

> This makes it sound like if Wikipedia switches to HTML5 and isn't willing to break all screen-scrapers on principle, we'll have to use an obsolete but conforming doctype.

Or stop using HTML named entities, yes.

-Boris
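
For readers wondering what "stop using HTML named entities" could look like mechanically, here is a rough sketch in Python. It is not anything MediaWiki does; the function name `named_to_numeric` and the regular expression are assumptions made for illustration. It rewrites HTML 4 named entities to numeric character references using the standard library's `html.entities.name2codepoint` table and leaves XML's five predefined entities alone, since those need no DTD.

```python
import re
from html.entities import name2codepoint

# Entities that XML itself predefines; these are safe without any DTD.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}

# Matches references such as &nbsp; or &eacute; (name, then a semicolon).
_ENTITY_RE = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def named_to_numeric(markup: str) -> str:
    """Replace HTML 4 named entities with numeric character references.

    Hypothetical helper: unknown names and XML's predefined entities
    are left untouched.
    """
    def replace(match: re.Match) -> str:
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)
        return f"&#{name2codepoint[name]};"

    return _ENTITY_RE.sub(replace, markup)


if __name__ == "__main__":
    print(named_to_numeric("a&nbsp;b &amp; caf&eacute;"))
```

A regex pass like this is only a sketch: it works on raw text, so it would also rewrite matches inside scripts or CDATA sections, and `name2codepoint` covers the HTML 4 entity names only. It also does nothing about the readability concern raised elsewhere in the thread, since the output is exactly the `&#160;` style of reference.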