Re: [whatwg] HTML5 doctypes incompatible with XHR if named entities present

2009-11-12 Thread Aryeh Gregor

On Thu, Nov 12, 2009 at 12:33 AM, Boris Zbarsky bzbar...@mit.edu wrote:
> I assume you meant "mostly" as in "most of the pages are well-formed", not
> "pages are mostly well-formed", since the latter is useless, right?
>
> I did a brief survey of obvious sites fitting those descriptions that I had
> in my browser history at the moment. . . .
>
> So either you're looking at a totally different dataset or "mostly" is a bit
> of a stretch.

I admit I didn't look closely.  At a guess, maybe the default
WordPress skin(s) are valid XHTML, but custom skins are very popular
for WordPress and those mostly aren't valid XHTML?  MediaWiki is
unreasonably difficult to reskin, so that's not much of a problem for
us . . .

> Sure.  0.01% of all websites is a significant number.  I just think it's
> broken often enough, and easy enough to break by accident, that relying on
> it working for screen scraping is not likely to be happening on a wide
> scale.

You're probably right.

> Or stop using HTML named entities, yes.

That's not really a very good option, given the size of MediaWiki's
code base and the size of Wikipedia's database, and the ugliness of
trying to remember what &#160; is when reading the HTML source.  It
sounds like we're stuck with a legacy doctype if we don't want to
break screen-scrapers.
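
For scale, the "stop using HTML named entities" option amounts to a rewrite along these lines. This is only a minimal sketch: the function name named_to_numeric is hypothetical, it uses the HTML 4 entity table from Python's standard library rather than anything MediaWiki ships, and it makes no attempt to skip script blocks, comments, or CDATA sections.

```python
import re
from html.entities import name2codepoint

# The five entities every XML parser understands without a DTD; leave them alone.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}

# Matches references such as &nbsp; or &eacute; (names only, not &#160;).
_ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def named_to_numeric(markup: str) -> str:
    """Replace HTML named entities with numeric character references."""

    def replace(match: re.Match) -> str:
        name = match.group(1)
        # Keep XML's built-in entities and anything not in the HTML 4 table.
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)
        return f"&#{name2codepoint[name]};"

    return _ENTITY.sub(replace, markup)


print(named_to_numeric("<p>&nbsp;&amp;&eacute;</p>"))
```

The output is well-formed regardless of doctype, but it also shows the readability cost mentioned above: &nbsp; becomes &#160; everywhere in the source.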

Re: [whatwg] HTML5 doctypes incompatible with XHR if named entities present

2009-11-12 Thread Geoffrey Sneddon


Aryeh Gregor wrote:
> On Thu, Nov 12, 2009 at 12:33 AM, Boris Zbarsky bzbar...@mit.edu wrote:
>> I assume you meant "mostly" as in "most of the pages are well-formed", not
>> "pages are mostly well-formed", since the latter is useless, right?
>>
>> I did a brief survey of obvious sites fitting those descriptions that I had
>> in my browser history at the moment. . . .
>>
>> So either you're looking at a totally different dataset or "mostly" is a bit
>> of a stretch.
>
> I admit I didn't look closely.  At a guess, maybe the default
> WordPress skin(s) are valid XHTML, but custom skins are very popular
> for WordPress and those mostly aren't valid XHTML?  MediaWiki is
> unreasonably difficult to reskin, so that's not much of a problem for
> us . . .

Even with the default skin it's easy to break (e.g., search for U+). 
That'll be output to the page and make it not well-formed.


--
Geoffrey Sneddon — Opera Software
http://gsnedders.com/
http://www.opera.com/

Re: [whatwg] HTML5 doctypes incompatible with XHR if named entities present

2009-11-11 Thread Boris Zbarsky


On 11/11/09 11:16 PM, Aryeh Gregor wrote:

> I'm pretty sure that XHR is used for screen-scraping beyond Wikipedia,


Since it'd fail any time the data is not well-formed XML, I'd actually 
expect such usage to be rare.  It's not all that common to find XHTML 
on the web that happens to be well-formed XML.



> Could some reasonably minimal, distinctive doctype be invented that
> would avoid the problem but not make the document look to humans and
> validators like it thinks it's some old version of XHTML?


Yes, but browsers would have to add explicit support for it.


> Also, is this a wider problem?  Are there any other tools besides
> browsers that might be magically allowing named entities for some
> doctypes only?


Sure; anything that actually goes and loads the DTD.

-Boris
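
The reason loading the DTD matters is that, in XML, a general entity such as nbsp is only usable once something has declared it, and the XHTML 1.0 DTDs contain such declarations. The sketch below imitates that by declaring the entity in an internal DTD subset, so no network fetch is needed. It uses Python's standard-library parser as a stand-in for "some other tool"; the document string is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Same body as the test document in this thread, but with nbsp declared
# in an internal subset, standing in for what a loaded XHTML DTD provides.
DOC = (
    "<!DOCTYPE html [\n"
    '  <!ENTITY nbsp "&#160;">\n'
    "]>\n"
    "<html><body><p>&nbsp;</p></body></html>"
)

root = ET.fromstring(DOC)       # parses, because nbsp is now declared
text = root.find("body/p").text
print(repr(text))               # the paragraph holds a single U+00A0
```

A tool that never reads any declaration for nbsp has nothing to expand the reference to, which is the failure mode under discussion.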

[whatwg] HTML5 doctypes incompatible with XHR if named entities present

2009-11-11 Thread Aryeh Gregor

I already filed a bug
http://www.w3.org/Bugs/Public/show_bug.cgi?id=8268, but figured I'd
copy it here to get more discussion.

Wikipedia just experimented with switching to an HTML5 doctype.  A lot
of user tools broke, and after two hours of investigation, we
determined that the problem is intractable and switched back to XHTML
1.0 Transitional.

XMLHttpRequest was historically intended only for XML, and lots of
scripts rely on the responseXML property being set to a Document.  In
current browsers, this only happens when the document is actually
well-formed XML.  But named entities are treated differently based on
the doctype.  Consider this document:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html><head>
<title>Hello</title>
</head>
<body>
<p>&nbsp;</p>
</body>
</html>

This works just fine in all browsers I tested in (latestish versions
of Firefox, Chrome, Opera).  However, if you serve the exact same
document but replace the doctype with <!DOCTYPE html>, all of them
throw a syntax error on &nbsp;.
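
The HTML5-doctype half of this can be reproduced outside a browser. The following is a rough sketch using Python's standard-library XML parser, not XMLHttpRequest itself; the helper name parses_as_xml and the two test strings are made up for illustration, and the parser's handling of the XHTML doctype may differ from what browsers do, so only the <!DOCTYPE html> case is shown.

```python
import xml.etree.ElementTree as ET


def parses_as_xml(text: str) -> bool:
    """Return True if a non-validating XML parser accepts `text`."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        # Raised for any well-formedness error, including undeclared entities.
        return False


HTML5_NAMED = (
    "<!DOCTYPE html><html><head><title>Hello</title></head>"
    "<body><p>&nbsp;</p></body></html>"
)
# Identical document, with the named entity swapped for a numeric reference.
HTML5_NUMERIC = HTML5_NAMED.replace("&nbsp;", "&#160;")

print(parses_as_xml(HTML5_NAMED))
print(parses_as_xml(HTML5_NUMERIC))
```

The first call should report False and the second True: with a bare <!DOCTYPE html> nothing declares nbsp, while a numeric character reference needs no declaration.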

Practically speaking, this means that any site that wants to serve
content compatible with XHR cannot use either of the two doctypes that
the spec recommends for authors.  There are a variety of widely-used
scripts on Wikipedia that rely on XHR, so this is currently a blocker
for us.  It's very unlikely that we'll deploy HTML5 in the foreseeable
future if it means our users have to rewrite all their scripts.  I'm
pretty sure that XHR is used for screen-scraping beyond Wikipedia,
too, so this will probably crop up elsewhere too.

I don't know what the extent of the magic is that causes this problem.
 Could some reasonably minimal, distinctive doctype be invented that
would avoid the problem but not make the document look to humans and
validators like it thinks it's some old version of XHTML?  If an
existing XHTML doctype must be reused, should validators continue to
raise warnings as they do now, or should an XHTML doctype be promoted
from "obsolete permitted DOCTYPE" to a fully permitted doctype?

Also, is this a wider problem?  Are there any other tools besides
browsers that might be magically allowing named entities for some
doctypes only?

Re: [whatwg] HTML5 doctypes incompatible with XHR if named entities present

2009-11-11 Thread Aryeh Gregor

On Wed, Nov 11, 2009 at 11:39 PM, Boris Zbarsky bzbar...@mit.edu wrote:
> Since it'd fail any time the data is not well-formed XML, I'd actually
> expect such usage to be rare.  It's not all that common to find XHTML on
> the web that happens to be well-formed XML.

A number of popular web apps output mostly well-formed XML, as far as
I know: vBulletin, WordPress, etc.  Not even close to most websites,
of course, but a significant number, I'd think.  Of course, they
probably don't have the same kind of script-writing community that
Wikipedia does -- that's very peculiar to Wikipedia.

> Yes, but browsers would have to add explicit support for it.

That mostly defeats the point -- they could equally add explicit
support for non-XML responseXML first.  This should be a short- to
medium-term problem only.

This makes it sound like if Wikipedia switches to HTML5 and isn't
willing to break all screen-scrapers on principle, we'll have to use
an obsolete but conforming doctype.  That's kind of a pain,
particularly from an evangelism point of view.  Especially if it
raises validator warnings.  But I guess that's where XML has gotten
us.

Re: [whatwg] HTML5 doctypes incompatible with XHR if named entities present

2009-11-11 Thread Boris Zbarsky

On 11/11/09 11:57 PM, Aryeh Gregor wrote:

> A number of popular web apps output mostly well-formed XML, as far as
> I know: vBulletin, WordPress, etc.

I assume you meant "mostly" as in "most of the pages are well-formed",
not "pages are mostly well-formed", since the latter is useless, right?

I did a brief survey of obvious sites fitting those descriptions that I
had in my browser history at the moment. These were not-well-formed:

http://www.dria.org/wordpress/archives/2009/11/10/1043/
http://bisdaktech.wordpress.com/
http://weekinthenee.wordpress.com/2009/11/11/sitting-in-a-park-in-paris-france/
http://terrytao.wordpress.com/2009/10/29/displaying-mathematics-on-the-web/
http://ehren.wordpress.com/2009/10/24/a-gcc-hack-my-0-1-release/
http://www.nvnews.net/vbulletin/showthread.php?t=104201
http://www.nvnews.net/vbulletin/showthread.php?t=132449

These are well-formed:

http://boomswaggerboom.wordpress.com/
http://fiber-space.de/wordpress/?p=1016
http://dafizilla.wordpress.com/2009/11/08/karmic-koala-hides-firefox-context-menuitems-icons/

So either you're looking at a totally different dataset or "mostly" is a
bit of a stretch.

> Not even close to most websites, of course, but a significant number, I'd think.

Sure. 0.01% of all websites is a significant number. I just think
it's broken often enough, and easy enough to break by accident, that
relying on it working for screen scraping is not likely to be happening
on a wide scale.

>> Yes, but browsers would have to add explicit support for it.
>
> That mostly defeats the point -- they could equally add explicit
> support for non-XML responseXML first.

Yep.

> This makes it sound like if Wikipedia switches to HTML5 and isn't
> willing to break all screen-scrapers on principle, we'll have to use
> an obsolete but conforming doctype.

Or stop using HTML named entities, yes.

-Boris
