Re: [whatwg] HTML5 doctypes incompatible with XHR if named entities present

2009-11-12 Thread Aryeh Gregor
On Thu, Nov 12, 2009 at 12:33 AM, Boris Zbarsky bzbar...@mit.edu wrote:
 I assume you meant "mostly" as in "most of the pages are well-formed", not
 "pages are mostly well-formed", since the latter is useless, right?

 I did a brief survey of obvious sites fitting those descriptions that I had
 in my browser history at the moment. . . .

 So either you're looking at a totally different dataset or "mostly" is a bit
 of a stretch.

I admit I didn't look closely.  At a guess, maybe the default
WordPress skin(s) are valid XHTML, but custom skins are very popular
for WordPress and those mostly aren't valid XHTML?  MediaWiki is
unreasonably difficult to reskin, so that's not much of a problem for
us . . .

 Sure.  0.01% of all websites is a significant number.  I just think it's
 broken often enough, and easy enough to break by accident, that relying on
 it working for screen scraping is not likely to be happening on a wide
 scale.

You're probably right.

 Or stop using HTML named entities, yes.

That's not really a very good option, given the size of MediaWiki's
code base and the size of Wikipedia's database, and the ugliness of
trying to remember what &#160; is when reading the HTML source.  It
sounds like we're stuck with a legacy doctype if we don't want to
break screen-scrapers.
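
To illustrate the trade-off discussed above (this is not from the original thread): a minimal Python sketch, assuming a non-validating XML parser that sees no DTD. Such a parser rejects &nbsp; but accepts &#160;. The helper named_to_numeric and the sample fragment are hypothetical, and Python's expat-based parser stands in for a browser's XML parser here.

```python
import re
import xml.etree.ElementTree as ET
from html.entities import name2codepoint

# The five entities every XML parser knows without a DTD; leave them alone.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}

def named_to_numeric(markup: str) -> str:
    """Rewrite HTML named entities (e.g. &nbsp;) as numeric references."""
    def repl(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)  # keep predefined or unknown names as-is
        return "&#%d;" % name2codepoint[name]
    return re.sub(r"&([A-Za-z][A-Za-z0-9]*);", repl, markup)

# Hypothetical fragment served without any doctype.
page = "<p>a&nbsp;b &amp; c</p>"

try:
    ET.fromstring(page)
    parsed_as_is = True
except ET.ParseError:
    # &nbsp; is undefined when no DTD declares it.
    parsed_as_is = False

converted = named_to_numeric(page)
root = ET.fromstring(converted)  # parses once the entity is numeric
```

The cost Aryeh describes is visible in the converted output: the markup parses, but &#160; is harder to read in the source than &nbsp;.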


Re: [whatwg] HTML5 doctypes incompatible with XHR if named entities present

2009-11-12 Thread Geoffrey Sneddon

Aryeh Gregor wrote:

On Thu, Nov 12, 2009 at 12:33 AM, Boris Zbarsky bzbar...@mit.edu wrote:

I assume you meant "mostly" as in "most of the pages are well-formed", not
"pages are mostly well-formed", since the latter is useless, right?

I did a brief survey of obvious sites fitting those descriptions that I had
in my browser history at the moment. . . .

So either you're looking at a totally different dataset or "mostly" is a bit
of a stretch.


I admit I didn't look closely.  At a guess, maybe the default
WordPress skin(s) are valid XHTML, but custom skins are very popular
for WordPress and those mostly aren't valid XHTML?  MediaWiki is
unreasonably difficult to reskin, so that's not much of a problem for
us . . .


Even with the default skin it's easy to break (e.g., search for U+). 
That'll be output to the page and make it not well-formed.


--
Geoffrey Sneddon — Opera Software
http://gsnedders.com/
http://www.opera.com/


Re: [whatwg] HTML5 doctypes incompatible with XHR if named entities present

2009-11-11 Thread Boris Zbarsky

On 11/11/09 11:16 PM, Aryeh Gregor wrote:

I'm pretty sure that XHR is used for screen-scraping beyond Wikipedia,


Since it'd fail any time the data is not well-formed XML, I'd actually 
expect such usage to be rare.  It's not all that common to find XHTML 
on the web that happens to be well-formed XML.



  Could some reasonably minimal, distinctive doctype be invented that
would avoid the problem but not make the document look to humans and
validators like it thinks it's some old version of XHTML?


Yes, but browsers would have to add explicit support for it.


Also, is this a wider problem?  Are there any other tools besides
browsers that might be magically allowing named entities for some
doctypes only?


Sure; anything that actually goes and loads the DTD.

-Boris
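
The remark about tools that load the DTD can be illustrated with a small Python sketch (not from the original thread). It assumes Python's expat-based ElementTree parser, which does not fetch external DTDs, so an internal DTD subset declaring nbsp stands in for "loading the DTD". The helper parses and the sample documents are hypothetical.

```python
import xml.etree.ElementTree as ET

BODY = "<p>a&nbsp;b</p>"

def parses(document: str) -> bool:
    """Return True if the document parses as well-formed XML."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False

# No DTD at all: the parser has no definition for &nbsp;.
without_dtd = BODY

# Same body, but the entity is declared in an internal DTD subset,
# which the parser does read.
with_internal_subset = '<!DOCTYPE p [<!ENTITY nbsp "&#160;">]>' + BODY
```

Here parses(without_dtd) is False and parses(with_internal_subset) is True: the same markup is accepted or rejected depending only on whether the parser has seen an entity declaration.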


Re: [whatwg] HTML5 doctypes incompatible with XHR if named entities present

2009-11-11 Thread Aryeh Gregor
On Wed, Nov 11, 2009 at 11:39 PM, Boris Zbarsky bzbar...@mit.edu wrote:
 Since it'd fail any time the data is not well-formed XML, I'd actually
 expect such usage to be rare.  It's not all that common to find XHTML on
 the web that happens to be well-formed XML.

A number of popular web apps output mostly well-formed XML, as far as
I know: vBulletin, WordPress, etc.  Not even close to most websites,
of course, but a significant number, I'd think.  Of course, they
probably don't have the same kind of script-writing community that
Wikipedia does -- that's very peculiar to Wikipedia.

 Yes, but browsers would have to add explicit support for it.

That mostly defeats the point -- they could equally add explicit
support for non-XML responseXML first.  This should be a short- to
medium-term problem only.

This makes it sound like if Wikipedia switches to HTML5 and isn't
willing to break all screen-scrapers on principle, we'll have to use
an obsolete but conforming doctype.  That's kind of a pain,
particularly from an evangelism point of view.  Especially if it
raises validator warnings.  But I guess that's where XML has gotten
us.


Re: [whatwg] HTML5 doctypes incompatible with XHR if named entities present

2009-11-11 Thread Boris Zbarsky

On 11/11/09 11:57 PM, Aryeh Gregor wrote:

A number of popular web apps output mostly well-formed XML, as far as
I know: vBulletin, WordPress, etc.


I assume you meant "mostly" as in "most of the pages are well-formed",
not "pages are mostly well-formed", since the latter is useless, right?


I did a brief survey of obvious sites fitting those descriptions that I 
had in my browser history at the moment.  These were not-well-formed:


http://www.dria.org/wordpress/archives/2009/11/10/1043/
http://bisdaktech.wordpress.com/
http://weekinthenee.wordpress.com/2009/11/11/sitting-in-a-park-in-paris-france/
http://terrytao.wordpress.com/2009/10/29/displaying-mathematics-on-the-web/
http://ehren.wordpress.com/2009/10/24/a-gcc-hack-my-0-1-release/
http://www.nvnews.net/vbulletin/showthread.php?t=104201
http://www.nvnews.net/vbulletin/showthread.php?t=132449

These are:

http://boomswaggerboom.wordpress.com/
http://fiber-space.de/wordpress/?p=1016
http://dafizilla.wordpress.com/2009/11/08/karmic-koala-hides-firefox-context-menuitems-icons/

So either you're looking at a totally different dataset or "mostly" is a
bit of a stretch.



Not even close to most websites, of course, but a significant number, I'd think.


Sure.  0.01% of all websites is a significant number.  I just think 
it's broken often enough, and easy enough to break by accident, that 
relying on it working for screen scraping is not likely to be happening 
on a wide scale.



Yes, but browsers would have to add explicit support for it.


That mostly defeats the point -- they could equally add explicit
support for non-XML responseXML first.


Yep.


This makes it sound like if Wikipedia switches to HTML5 and isn't
willing to break all screen-scrapers on principle, we'll have to use
an obsolete but conforming doctype.


Or stop using HTML named entities, yes.

-Boris
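
A survey like the one in this message could be automated roughly as follows. This is a sketch, not the method actually used in the thread: the function is_well_formed and the sample strings are hypothetical, and in practice the strings would be page bodies fetched from the sites in question.

```python
import xml.etree.ElementTree as ET

def is_well_formed(markup: str) -> bool:
    """Return True if the markup parses as well-formed XML."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False

# Hypothetical stand-ins for fetched pages, showing two common ways
# that XHTML-looking output fails to be well-formed.
samples = {
    "clean": "<div><p>ok</p><br/></div>",
    "unclosed-br": "<div><p>ok</p><br></div>",  # HTML-style void element
    "bare-ampersand": "<p>Q&A</p>",             # unescaped & in text
}

results = {name: is_well_formed(markup) for name, markup in samples.items()}
well_formed_share = sum(results.values()) / len(results)
```

A single unescaped ampersand or unclosed tag anywhere in a page is enough to make the whole document fail, which is why "pages are mostly well-formed" is of no use to an XML parser.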