Re: [whatwg] Parsing processing instructions in HTML syntax: 10.2.4.44 Bogus comment state

2010-04-07 Thread Ian Hickson
On Wed, 3 Mar 2010, Brett Zamir wrote:
 On 3/2/2010 6:54 PM, Ian Hickson wrote:
  On Tue, 2 Mar 2010, Elliotte Rusty Harold wrote:
 
   Briefly it seems that <? causes the parser to go into Bogus comment 
   state, which is fair enough. (I wouldn't really recommend that 
   anyone use processing instructions in HTML syntax anyway.) However 
   the parser comes out of that state at the first >. Because processing 
   instructions can contain > and terminate only at the two character 
   sequence ?> this could cause PI processing to terminate early and 
   leave a lot more error handling and a confused parser state in the 
   text yet to come.
 
  In HTML4, PIs ended at the first >, not at ?>. <?target data> is the 
  syntax of PIs when the SGML options used by HTML4 are applied.
  
  In any case, the parser in HTML5 is based on what browsers do, which 
  is also to terminate at the first >. It's unlikely that we can change 
  that, given backwards-compatibility needs.
 
 Are there really a lot of folks out there depending on old HTML4-style 
 processing instructions not being broken?

Not knowingly, but I wouldn't at all be surprised if there were lots of 
pages that triggered this, yes. People rely on all kinds of weird things. 
(See for example the sample from Philip below.)


 Given that as I understand it such HTML4 processing instructions were 
 not even used by any standard at that time, and with XHTML 1.0+ 
 processing instructions bringing into practice the XML form, and 
 especially with all of the progress made in X/HTML5 on harmonizing HTML 
 and XHTML, I'd think that it'd really be ideal if this issue would not 
 get in the way (along with the unfortunate loss of external DTDs)...

In practice this issue shouldn't get in the way anyway, since PIs aren't 
allowed in HTML.


 As long as website creators have the freedom to be sloppy

Authors don't have the freedom to be sloppy.


 why not go a little further to make XML compatibility better?

XML compatibility isn't a goal. There is a minor goal of making it 
possible to transition easily from XHTML to HTML. PI-like syntax in XHTML 
is only used for two purposes:

 - the XML declaration, which can simply be removed when publishing HTML, 
   and which if not removed will just be ignored (since it never contains 
   a > character, so ending on the first > is fine).

 - the XML Stylesheet PI, which needs to be converted to a link element 
   anyway, so isn't a problem.


 It'd be a whole lot more appealing to work in both environments out of 
 the box than deal with complex server-side conversion solutions...

I don't really understand why you would ever use a PI to be honest.


On Wed, 3 Mar 2010, Philip Taylor wrote:
 
 Yes, e.g. a load of pages like 
 http://www.forex.com.cn/html/2008-01/821561.htm (to pick one example at 
 random) say:
 
   <?xml:namespace prefix = o ns = "urn:schemas-microsoft-com:office:office" />
 
 and don't have the string ?> anywhere.

Indeed.


On Fri, 5 Mar 2010, Brett Zamir wrote:
 
 Ok, fair enough.  But while it is great that HTML5 seeks to be 
 transitional and backwards compatible, HTML5 (thankfully) already breaks 
 compatibility for the sake of XML compatibility (e.g., localName or 
 getElementsByTagNameNS).

This is actually just for implementation sanity, it's not about XML syntax 
compatibility.


 It seems to me that there should still be a role of eventually 
 transitioning into something more full-featured in a fundamental, 
 language-neutral way (e.g., supporting a fuller subset of XML's features 
 such as external entities and yes, XML-style processing instructions); 
 extensible, including the ability to include XML from other namespaces 
 which may also encourage or rely on using their own XML processing 
 instructions, for those who wish to experiment or supplement the HTML 
 standard behavior; and more harmonious and compatible with a simpler 
 syntax (i.e., XML's)--even if the more complex syntax is more prominent 
 and continues to be supported indefinitely.

People can use XML if they want, but I don't really see a path from 
today's HTML to a generic language that doesn't break backwards 
compatibility. If you're ok with breaking back-compat, though, there's no 
need to worry about HTML at all. Just use XHTML.

-- 
Ian Hickson   U+1047E)\._.,--,'``.fL
http://ln.hixie.ch/   U+263A/,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'


Re: [whatwg] Parsing processing instructions in HTML syntax: 10.2.4.44 Bogus comment state

2010-03-22 Thread Ian Hickson
On Thu, 18 Mar 2010, Brett Zamir wrote:
 On 3/2/2010 6:54 PM, Ian Hickson wrote:
  On Tue, 2 Mar 2010, Elliotte Rusty Harold wrote:
  
   The handling of processing instructions in the XHTML syntax seems 
   reasonably well-defined; but it feels a little off in the HTML 
   syntax.
 
  There's no such thing as processing instructions in text/html.
  
  There was such a thing in HTML4, because of its SGML heritage, though 
  it was explicitly deprecated.
 
 Doesn't seem deprecated per 
 http://www.w3.org/TR/html401/appendix/notes.html#h-B.3.6

Section B.3.3 says, speaking of SGML features with limited support, which 
at the time of that section's writing included PIs, that "We recommend 
that authors avoid using all of these features." Section 3.2 specifically 
says "The appendix lists some SGML features that are not widely supported 
by HTML tools and user agents and should be avoided."


   Briefly it seems that <? causes the parser to go into Bogus comment 
   state, which is fair enough. (I wouldn't really recommend that 
   anyone use processing instructions in HTML syntax anyway.) However 
   the parser comes out of that state at the first >. Because processing 
   instructions can contain > and terminate only at the two character 
   sequence ?> this could cause PI processing to terminate early and 
   leave a lot more error handling and a confused parser state in the 
   text yet to come.
 
  In HTML4, PIs ended at the first >, not at ?>. <?target data> is the 
  syntax of PIs when the SGML options used by HTML4 are applied.
  
  In any case, the parser in HTML5 is based on what browsers do, which 
  is also to terminate at the first >. It's unlikely that we can change 
  that, given backwards-compatibility needs.
  
  There's a simple workaround: don't use PIs in text/html, since they 
  don't exist in HTML5 at all, and don't send XML as text/html, since 
  XML and HTML have different syntaxes and aren't compatible.
 
 In http://dev.w3.org/html5/html4-differences/ , it says:
 
 HTML5 defines an HTML syntax that is compatible with HTML4 and XHTML1 
 documents published on the Web, but is not compatible with the more 
 esoteric SGML features of HTML4, such as processing instructions 
 http://www.w3.org/TR/1999/REC-html401-19991224/appendix/notes.html#h-B.3.6 
 and shorthand markup 
 http://www.w3.org/TR/1999/REC-html401-19991224/appendix/notes.html#h-B.3.7.
 
 This seems to me to suggest that backward compatibility can be broken as 
 far as processing instructions (i.e., requiring ?> and not merely > to 
 close a processing instruction).

Backwards compatibility with legacy content can only be broken in extreme 
cases (e.g. for security reasons). That's one of the fundamental design 
goals of HTML5.


 If not, then it doesn't seem clear from the specification that 
 processing instructions are indeed not allowed because the parsing model 
 does allow them, and with processing instructions being 
 platform-specific by definition and not apparently explicitly prohibited 
 by HTML5 (unless that is what you are trying to say here), HTML5 syntax 
 does seem to be compatible with them.

HTML5 prohibits PIs in text/html. See:

   
http://www.whatwg.org/specs/web-apps/current-work/multipage/syntax.html#writing

...and notice how PIs are not listed as a possible syntax element.


 But if you are trying to prohibit them for any use whatsoever yet still 
 technically allow them to be ignored for compatibility, it seems this 
 would contradict the statement at 
 http://dev.w3.org/html5/html4-differences/ that there is no longer a 
 need for marking features deprecated. Or is the difference that these 
 are forbidden from doing anything but will be allowed (and ignored) 
 indefinitely into the future in future versions of HTML?

They are forbidden but are ignored in this (and probably future) 
version(s) of HTML.


 Btw, I've added a talk section at the wiki page 
 http://wiki.whatwg.org/wiki/Talk:HTML_vs._XHTML#Harmony to suggest 
 covering XHTML-HTML compatibility guidelines specifically, so would 
 appreciate a reply there, so I know whether we can begin edits in this 
 vein on the page.

Please feel free to edit the wiki or add new pages! Everyone is welcome to 
edit the wiki.

-- 
Ian Hickson   U+1047E)\._.,--,'``.fL
http://ln.hixie.ch/   U+263A/,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'


Re: [whatwg] Parsing processing instructions in HTML syntax: 10.2.4.44 Bogus comment state

2010-03-17 Thread Brett Zamir

On 3/2/2010 6:54 PM, Ian Hickson wrote:

On Tue, 2 Mar 2010, Elliotte Rusty Harold wrote:
   

The handling of processing instructions in the XHTML syntax seems
reasonably well-defined; but it feels a little off in the HTML syntax.
 

There's no such thing as processing instructions in text/html.

There was such a thing in HTML4, because of its SGML heritage, though it
was explicitly deprecated.

   


Doesn't seem deprecated per 
http://www.w3.org/TR/html401/appendix/notes.html#h-B.3.6



Briefly it seems that <? causes the parser to go into Bogus comment
state, which is fair enough. (I wouldn't really recommend that anyone
use processing instructions in HTML syntax anyway.) However the parser
comes out of that state at the first >. Because processing instructions
can contain > and terminate only at the two character sequence ?> this
could cause PI processing to terminate early and leave a lot more error
handling and a confused parser state in the text yet to come.
 

In HTML4, PIs ended at the first >, not at ?>. <?target data> is the
syntax of PIs when the SGML options used by HTML4 are applied.

In any case, the parser in HTML5 is based on what browsers do, which is
also to terminate at the first >. It's unlikely that we can change that,
given backwards-compatibility needs.

There's a simple workaround: don't use PIs in text/html, since they don't
exist in HTML5 at all, and don't send XML as text/html, since XML and HTML
have different syntaxes and aren't compatible.

   


In http://dev.w3.org/html5/html4-differences/ , it says:

HTML5 defines an HTML syntax that is compatible with HTML4 and XHTML1 
documents published on the Web, but is not compatible with the more 
esoteric SGML features of HTML4, such as processing instructions 
http://www.w3.org/TR/1999/REC-html401-19991224/appendix/notes.html#h-B.3.6 
and shorthand markup 
http://www.w3.org/TR/1999/REC-html401-19991224/appendix/notes.html#h-B.3.7.


This seems to me to suggest that backward compatibility can be broken as 
far as processing instructions (i.e., requiring ?> and not merely > to 
close a processing instruction). If not, then it doesn't seem clear from 
the specification that processing instructions are indeed not allowed 
because the parsing model does allow them, and with processing 
instructions being platform-specific by definition and not apparently 
explicitly prohibited by HTML5 (unless that is what you are trying to 
say here), HTML5 syntax does seem to be compatible with them. But if you 
are trying to prohibit them for any use whatsoever yet still technically 
allow them to be ignored for compatibility, it seems this would 
contradict the statement at http://dev.w3.org/html5/html4-differences/ 
that there is no longer a need for marking features deprecated. Or 
is the difference that these are forbidden from doing anything but will 
be allowed (and ignored) indefinitely into the future in future versions 
of HTML?


Btw, I've added a talk section at the wiki page 
http://wiki.whatwg.org/wiki/Talk:HTML_vs._XHTML#Harmony to suggest 
covering XHTML-HTML compatibility guidelines specifically, so would 
appreciate a reply there, so I know whether we can begin edits in this 
vein on the page.


thanks,
Brett



Re: [whatwg] Parsing processing instructions in HTML syntax: 10.2.4.44 Bogus comment state

2010-03-04 Thread Brett Zamir

On 3/3/2010 7:06 PM, Philip Taylor wrote:

On Wed, Mar 3, 2010 at 10:55 AM, Brett Zamir <bret...@yahoo.com> wrote:
   

On 3/2/2010 6:54 PM, Ian Hickson wrote:
 

On Tue, 2 Mar 2010, Elliotte Rusty Harold wrote:

   

Briefly it seems that <? causes the parser to go into Bogus comment
state, which is fair enough. (I wouldn't really recommend that anyone
use processing instructions in HTML syntax anyway.) However the parser
comes out of that state at the first >. Because processing instructions
can contain > and terminate only at the two character sequence ?> this
could cause PI processing to terminate early and leave a lot more error
handling and a confused parser state in the text yet to come.

 

In HTML4, PIs ended at the first >, not at ?>. <?target data> is the
syntax of PIs when the SGML options used by HTML4 are applied.

In any case, the parser in HTML5 is based on what browsers do, which is
also to terminate at the first >. It's unlikely that we can change that,
given backwards-compatibility needs.

   

Are there really a lot of folks out there depending on old HTML4-style
processing instructions not being broken?
 

Yes, e.g. a load of pages like
http://www.forex.com.cn/html/2008-01/821561.htm (to pick one example
at random) say:

   <?xml:namespace prefix = o ns = "urn:schemas-microsoft-com:office:office" />

and don't have the string ?> anywhere.
   


Ok, fair enough.  But while it is great that HTML5 seeks to be 
transitional and backwards compatible, HTML5 (thankfully) already breaks 
compatibility for the sake of XML compatibility (e.g., localName or 
getElementsByTagNameNS). It seems to me that there should still be a 
role of eventually transitioning into something more full-featured in a 
fundamental, language-neutral way (e.g., supporting a fuller subset of 
XML's features such as external entities and yes, XML-style processing 
instructions); extensible, including the ability to include XML from 
other namespaces which may also encourage or rely on using their own XML 
processing instructions, for those who wish to experiment or supplement 
the HTML standard behavior; and more harmonious and compatible with a 
simpler syntax (i.e., XML's)--even if the more complex syntax is more 
prominent and continues to be supported indefinitely.


Brett



Re: [whatwg] Parsing processing instructions in HTML syntax: 10.2.4.44 Bogus comment state

2010-03-03 Thread Philip Taylor
On Wed, Mar 3, 2010 at 10:55 AM, Brett Zamir <bret...@yahoo.com> wrote:
 On 3/2/2010 6:54 PM, Ian Hickson wrote:

 On Tue, 2 Mar 2010, Elliotte Rusty Harold wrote:


 Briefly it seems that <? causes the parser to go into Bogus comment
 state, which is fair enough. (I wouldn't really recommend that anyone
 use processing instructions in HTML syntax anyway.) However the parser
 comes out of that state at the first >. Because processing instructions
 can contain > and terminate only at the two character sequence ?> this
 could cause PI processing to terminate early and leave a lot more error
 handling and a confused parser state in the text yet to come.


 In HTML4, PIs ended at the first >, not at ?>. <?target data> is the
 syntax of PIs when the SGML options used by HTML4 are applied.

 In any case, the parser in HTML5 is based on what browsers do, which is
 also to terminate at the first >. It's unlikely that we can change that,
 given backwards-compatibility needs.


 Are there really a lot of folks out there depending on old HTML4-style
 processing instructions not being broken?

Yes, e.g. a load of pages like
http://www.forex.com.cn/html/2008-01/821561.htm (to pick one example
at random) say:

  <?xml:namespace prefix = o ns = "urn:schemas-microsoft-com:office:office" />

and don't have the string ?> anywhere.
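
To make the compatibility concern concrete, here is a rough Python sketch (not the HTML5 tokenizer; the `consumed` helper and the `LEGACY` fragment are made up for illustration) comparing how much of such a page each termination rule would swallow:

```python
# Hypothetical fragment modelled on the legacy markup quoted above.
LEGACY = (
    '<?xml:namespace prefix = o ns = '
    '"urn:schemas-microsoft-com:office:office" />'
    '<p>Visible paragraph</p>'
)


def consumed(text: str, terminator: str) -> str:
    """Return the leading part of text that a scan would consume, ending
    just after the first occurrence of terminator, or all of text if the
    terminator never appears."""
    end = text.find(terminator)
    return text if end == -1 else text[: end + len(terminator)]


first_gt = consumed(LEGACY, ">")    # rule: stop at the first '>'
xml_style = consumed(LEGACY, "?>")  # rule: stop only at '?>'

print(first_gt)             # only the PI-like tag is consumed
print(xml_style == LEGACY)  # the '?>' rule runs to the end of the input
```

Under the first-`>` rule the paragraph after the tag survives; under a `?>`-only rule the rest of the input would be consumed, since the terminator never occurs.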

-- 
Philip Taylor
exc...@gmail.com


Re: [whatwg] Parsing processing instructions in HTML syntax: 10.2.4.44 Bogus comment state

2010-03-02 Thread Ian Hickson
On Tue, 2 Mar 2010, Elliotte Rusty Harold wrote:

 The handling of processing instructions in the XHTML syntax seems
 reasonably well-defined; but it feels a little off in the HTML syntax.

There's no such thing as processing instructions in text/html.

There was such a thing in HTML4, because of its SGML heritage, though it 
was explicitly deprecated.


 Briefly it seems that <? causes the parser to go into Bogus comment 
 state, which is fair enough. (I wouldn't really recommend that anyone 
 use processing instructions in HTML syntax anyway.) However the parser 
 comes out of that state at the first >. Because processing instructions 
 can contain > and terminate only at the two character sequence ?> this 
 could cause PI processing to terminate early and leave a lot more error 
 handling and a confused parser state in the text yet to come.

In HTML4, PIs ended at the first >, not at ?>. <?target data> is the 
syntax of PIs when the SGML options used by HTML4 are applied.

In any case, the parser in HTML5 is based on what browsers do, which is 
also to terminate at the first >. It's unlikely that we can change that, 
given backwards-compatibility needs.
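
A minimal Python sketch of the two termination rules being contrasted, assuming a scan that starts at the opening `<?`. The function names and the sample string are hypothetical, and this is an illustration rather than the HTML5 tokenizer itself:

```python
def html_style_end(text: str, start: int = 0) -> int:
    """Index just past the first '>' at or after start (end of input if none).
    Models the 'terminate at the first >' behaviour described above."""
    pos = text.find(">", start)
    return len(text) if pos == -1 else pos + 1


def xml_style_end(text: str, start: int = 0) -> int:
    """Index just past the first '?>' at or after start (end of input if none).
    Models the XML rule that a PI ends only at the two-character sequence."""
    pos = text.find("?>", start)
    return len(text) if pos == -1 else pos + 2


# Hypothetical PI whose data contains a '>' before the closing '?>'.
sample = "<?target a > b ?><p>text</p>"

html_cut = html_style_end(sample)
xml_cut = xml_style_end(sample)

print(repr(sample[:html_cut]))  # span ended by the first-'>' rule
print(repr(sample[html_cut:]))  # leftover that is then parsed as ordinary content
print(repr(sample[:xml_cut]))   # span an XML parser would treat as the whole PI
```

With this input the first-`>` rule stops inside the PI data, so the tail of the instruction leaks into the page as text, which is the early termination raised in the question.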

There's a simple workaround: don't use PIs in text/html, since they don't 
exist in HTML5 at all, and don't send XML as text/html, since XML and HTML 
have different syntaxes and aren't compatible.

HTH,
-- 
Ian Hickson   U+1047E)\._.,--,'``.fL
http://ln.hixie.ch/   U+263A/,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'