Re: [whatwg] Valid Unicode

2008-05-23 Thread Ian Hickson
On Tue, 22 Apr 2008, Henri Sivonen wrote:
 On Apr 22, 2008, at 14:18, Ian Hickson wrote:
  On Fri, 1 Dec 2006, Elliotte Harold wrote:
   2. Are control characters allowed (probably yes, based on other parts of
   the spec).
  
  Not as raw characters. Control characters that aren't in U+80-U+9F are
  allowed as entities.
 ...
   6. Are noncharacters U+FDD0..U+FDEF allowed (?)
   7. Are the noncharacters from the last two characters of each plane
   allowed (?)
  
  Not as raw characters but, for now, as entities yes.
 
 Why the distinction between raw characters and entities? Won't that just 
 complicate things--serializers in particular?

This has now been fixed.

-- 
Ian Hickson   U+1047E)\._.,--,'``.fL
http://ln.hixie.ch/   U+263A/,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'


Re: [whatwg] Valid Unicode

2008-04-22 Thread Ian Hickson
On Fri, 1 Dec 2006, Elliotte Harold wrote:

 In 9.1.3 we see
 
 Text must consist of valid Unicode characters other than U+0000. Text should
 not contain control characters other than space characters.
 
 
 Later in 9.2.3.1 we find:
 
 If the number is not a valid Unicode character (e.g. if the number is higher
 than 1114111), or if the number is zero, then return a character token for the
 U+FFFD REPLACEMENT CHARACTER character instead.
 
 
 I do not think the Unicode spec defines the notion of a valid Unicode
 character. (It does define a valid Unicode code unit sequence, but that's a
 little different. A code unit sequence generally consists of more than one
 character.) Thus I suggest we need to be more precise here about what is and
 is not a valid Unicode character.

The spec is much more precise now. Is it ok?
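
The rule quoted from 9.2.3.1 above can be sketched in Python as follows. This is only an illustration of the quoted sentence, not the spec's algorithm: the function name is made up, and treating surrogates as invalid is an assumption taken from the answer to question 3 in this thread.

```python
REPLACEMENT_CHARACTER = "\uFFFD"


def char_for_reference(number: int) -> str:
    """Map the number of a numeric character reference to a character.

    Hypothetical helper illustrating the quoted rule only.
    """
    # Zero and anything above 1114111 (0x10FFFF) become U+FFFD.
    if number == 0 or number > 0x10FFFF:
        return REPLACEMENT_CHARACTER
    # Assumption: surrogate code points count as "not valid" too.
    if 0xD800 <= number <= 0xDFFF:
        return REPLACEMENT_CHARACTER
    return chr(number)


print(repr(char_for_reference(0x263A)))
print(repr(char_for_reference(1114112)))
```

Whether noncharacters and control characters should also map to U+FFFD is exactly what the questions below are about, so the sketch leaves them alone.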


 In particular:
 
 1. Are private use characters allowed?

Yes.

 2. Are control characters allowed (probably yes, based on other parts of 
 the spec).

Not as raw characters. Control characters that aren't in U+80-U+9F are 
allowed as entities.

 3. Are surrogate characters allowed? (probably no)

No.

 4. Are non-characters beyond 10FFFF allowed (no)

No.

 5. Are reserved but currently undefined characters allowed (yes)

Yes.

 6. Are noncharacters U+FDD0..U+FDEF allowed (?)
 7. Are the noncharacters from the last two characters of each plane 
 allowed (?)

Not as raw characters but, for now, as entities yes.
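
For reference, the two groups of noncharacters in questions 6 and 7 can be tested with a small Python helper. The function name is illustrative; the ranges are the ones Unicode defines (U+FDD0..U+FDEF plus the last two code points of each of the 17 planes).

```python
def is_noncharacter(cp: int) -> bool:
    """Return True if the code point is one of Unicode's 66 noncharacters."""
    # Question 6: the contiguous block U+FDD0..U+FDEF.
    if 0xFDD0 <= cp <= 0xFDEF:
        return True
    # Question 7: U+nFFFE and U+nFFFF at the end of every plane.
    return 0 <= cp <= 0x10FFFF and (cp & 0xFFFE) == 0xFFFE


# Count them across the whole code space: 32 + 2 * 17.
print(sum(is_noncharacter(cp) for cp in range(0x110000)))
```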


On Sun, 3 Dec 2006, Henri Sivonen wrote:
 On Dec 2, 2006, at 18:24, Sam Ruby wrote:
  
  It would not be wise for HTML5 to limit itself to the more constrained 
  character set of XML.  In particular, the form feed character is 
  pretty popular,
  
  This is yet another case where "take HTML5, read it into a DOM, and 
  serialize it as XML, and voilà: you have valid XHTML" doesn't work.
 
 What I am advocating is making sure that *conforming* HTML5 documents 
 can be serialized as XHTML5 without dataloss. This is important in order 
 to be able to promise that an XML tool chain can be used for 
 processing *conforming* HTML5 by sticking an HTML5 parser in front of 
 the processing pipeline (for *non-browser* use cases like data mining, 
 content management or conformance checking where scripts aren't executed 
 nor CSS rendering performed). The motivation is to make processing HTML5 
 in non-browser apps less expensive without giving an incentive for the 
 solutions to violate the spec ad hoc on their own.
 
 For example, an XML tool chain is important enough for my conformance 
 checking service that if at this point the assumption of *conforming* 
 HTML5 being convertible to XHTML5 was broken in corner cases, I'd 
 probably come up with ad hoc trickery for masking it instead of throwing 
 away the tool chain. I'd prefer not having to do that and not having to 
 explain to everyone else who finds an XML tool chain to be of value 
 what tricks I needed to pull off to fake it.
 
 I am not suggesting that HTML5 browsers halt and catch fire upon finding 
 a form feed. And it is obvious that lossless conversion of all possible 
 non-conforming HTML5 documents to XML is impossible anyway, so making 
 that a goal would not be worthwhile.
 
 But what legitimate and popular use would a form feed have in HTML5? Why 
 can't we call it non-conforming? Are there use cases other than 
 converting .txt RFCs to HTML with regexps without bothering to get rid 
 of the form feeds?

I don't think that it would be valuable to make that use case raise 
errors.

-- 
Ian Hickson   U+1047E)\._.,--,'``.fL
http://ln.hixie.ch/   U+263A/,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'

Re: [whatwg] Valid Unicode

2008-04-22 Thread Henri Sivonen

On Apr 22, 2008, at 14:18, Ian Hickson wrote:


On Fri, 1 Dec 2006, Elliotte Harold wrote:
2. Are control characters allowed (probably yes, based on other parts of
the spec).


Not as raw characters. Control characters that aren't in U+80-U+9F are
allowed as entities.

...

6. Are noncharacters U+FDD0..U+FDEF allowed (?)
7. Are the noncharacters from the last two characters of each plane
allowed (?)


Not as raw characters but, for now, as entities yes.



Why the distinction between raw characters and entities? Won't that  
just complicate things--serializers in particular?


--
Henri Sivonen
[EMAIL PROTECTED]
http://hsivonen.iki.fi/




Re: [whatwg] Valid Unicode

2006-12-03 Thread Henri Sivonen

On Dec 3, 2006, at 03:47, Sam Ruby wrote:


What I am advocating is making sure that *conforming* HTML5 documents
can be serialized as XHTML5 without dataloss.


Then you will also need to disallow newlines in attribute values.


I believe that is not the case. See the last line of the table at the  
end of section 3.3.3 in the XML 1.0 spec.

http://www.w3.org/TR/REC-xml/#AVNormalize

(Note that if some of this doesn't currently work in Gecko, Gecko has  
a bug. Expat does the XML-compliant thing but then nsExpatDriver runs  
whitespace normalization again, which is bogus.
https://bugzilla.mozilla.org/show_bug.cgi?id=343870 It doesn't make sense to
fix it until bug 18333 has landed.)
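
The attribute-value normalization point can be checked with a short Python sketch. It uses the standard library's xml.etree.ElementTree, which is backed by Expat; the element and attribute names are made up for the example.

```python
import xml.etree.ElementTree as ET

# A literal newline inside an attribute value is normalized to a space
# by the XML processor...
literal = ET.fromstring('<p title="line one\nline two"/>')

# ...but a newline written as a character reference survives parsing,
# so a serializer can preserve newlines by emitting &#10;.
escaped = ET.fromstring('<p title="line one&#10;line two"/>')

print(repr(literal.get("title")))
print(repr(escaped.get("title")))
```

So newlines in attribute values need not be disallowed for round-tripping, as long as the serializer writes them as character references.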



In any case, I understand the desire; my read is that the WG's desire
for backwards compatibility is higher.  Limiting the character set to
the allowable XML 1.1 character set should not be a problem for
backwards compatibility purposes.


XML 1.1 doesn't really solve anything in this area. XML 1.1 is part  
of the problem. It creates incompatibility in corner cases without  
compelling benefits. The real XML that is known to work with any XML  
tool chain is XML 1.0.


I should point out that HTML5 proclaims non-conforming some things  
that no doubt exist on the Web and are far more common than form  
feeds. You can't even achieve any useful effect by including a form  
feed in HTML.


--
Henri Sivonen
[EMAIL PROTECTED]
http://hsivonen.iki.fi/




Re: [whatwg] Valid Unicode

2006-12-02 Thread Sam Ruby

On 12/1/06, Elliotte Harold [EMAIL PROTECTED] wrote:

Henri Sivonen wrote:

 6. Are noncharacters U+FDD0..U+FDEF allowed (?)
 7. Are the noncharacters from the last two characters of each plane
 allowed (?)

 I don't have particularly strong feelings here. Putting those characters
 in HTML is a bad idea, but allowing them is not a problem for HTML5 to
 XHTML5 conversion and they aren't a common problem like C1 controls.

FFFE and FFFF are specifically forbidden by XML so they should probably
be forbidden here too. I think the others are allowed.


Unicode (not XML) reserves U+D800 – U+DFFF as well as U+FFFE and U+FFFF.

XML 1.0 only allows the following characters:

[#x7F-#x84], [#x86-#x9F], [#xFDD0-#xFDDF],
[#x1FFFE-#x1FFFF], [#x2FFFE-#x2FFFF], [#x3FFFE-#x3FFFF],
[#x4FFFE-#x4FFFF], [#x5FFFE-#x5FFFF], [#x6FFFE-#x6FFFF],
[#x7FFFE-#x7FFFF], [#x8FFFE-#x8FFFF], [#x9FFFE-#x9FFFF],
[#xAFFFE-#xAFFFF], [#xBFFFE-#xBFFFF], [#xCFFFE-#xCFFFF],
[#xDFFFE-#xDFFFF], [#xEFFFE-#xEFFFF], [#xFFFFE-#xFFFFF],
[#x10FFFE-#x10FFFF].

It would not be wise for HTML5 to limit itself to the more constrained
character set of XML.  In particular, the form feed character is
pretty popular,

This is yet another case where "take HTML5, read it into a DOM, and
serialize it as XML, and voilà: you have valid XHTML" doesn't work.


--
Elliotte Rusty Harold  [EMAIL PROTECTED]
Java I/O 2nd Edition Just Published!
http://www.cafeaulait.org/books/javaio2/
http://www.amazon.com/exec/obidos/ISBN=0596527500/ref=nosim/cafeaulaitA/


- Sam Ruby


Re: [whatwg] Valid Unicode

2006-12-02 Thread Henri Sivonen

On Dec 2, 2006, at 18:24, Sam Ruby wrote:


It would not be wise for HTML5 to limit itself to the more constrained
character set of XML.  In particular, the form feed character is
pretty popular,

This is yet another case where "take HTML5, read it into a DOM, and
serialize it as XML, and voilà: you have valid XHTML" doesn't work.


What I am advocating is making sure that *conforming* HTML5 documents  
can be serialized as XHTML5 without dataloss. This is important in  
order to be able to promise that an XML tool chain can be used for  
processing *conforming* HTML5 by sticking an HTML5 parser in front of  
the processing pipeline (for *non-browser* use cases like data  
mining, content management or conformance checking where scripts  
aren't executed nor CSS rendering performed). The motivation is to  
make processing HTML5 in non-browser apps less expensive without  
giving an incentive for the solutions to violate the spec ad hoc on  
their own.


For example, an XML tool chain is important enough for my  
conformance checking service that if at this point the assumption of  
*conforming* HTML5 being convertible to XHTML5 was broken in corner  
cases, I'd probably come up with ad hoc trickery for masking it  
instead of throwing away the tool chain. I'd prefer not having to do  
that and not having to explain to everyone else who finds an XML  
tool chain to be of value what tricks I needed to pull off to fake it.


I am not suggesting that HTML5 browsers halt and catch fire upon  
finding a form feed. And it is obvious that lossless conversion of  
all possible non-conforming HTML5 documents to XML is impossible  
anyway, so making that a goal would not be worthwhile.


But what legitimate and popular use would a form feed have in HTML5?  
Why can't we call it non-conforming? Are there use cases other than  
converting .txt RFCs to HTML with regexps without bothering to get  
rid of the form feeds?
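
To make the conversion problem concrete, here is a minimal Python sketch showing that an XML 1.0 parser refuses a form feed both as a raw character and as a character reference. It uses the standard library's Expat-backed xml.etree.ElementTree; the helper name and sample documents are illustrative.

```python
import xml.etree.ElementTree as ET


def parses_as_xml(document: str) -> bool:
    """Return True if the string is accepted by an XML 1.0 parser."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


# An RFC-style page break (U+000C) as a raw character and as a reference.
print(parses_as_xml("<pre>page one\x0cpage two</pre>"))
print(parses_as_xml("<pre>page one&#12;page two</pre>"))
# An ordinary newline is fine.
print(parses_as_xml("<pre>page one\npage two</pre>"))
```

A text node containing U+000C therefore has no lossless XML 1.0 serialization, which is why the question is whether such a document should count as conforming HTML5 in the first place.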


--
Henri Sivonen
[EMAIL PROTECTED]
http://hsivonen.iki.fi/




Re: [whatwg] Valid Unicode

2006-12-02 Thread Sam Ruby

On 12/2/06, Henri Sivonen [EMAIL PROTECTED] wrote:

On Dec 2, 2006, at 18:24, Sam Ruby wrote:

 It would not be wise for HTML5 to limit itself to the more constrained
 character set of XML.  In particular, the form feed character is
 pretty popular,


BTW, I copy and pasted the wrong table.  The characters I mentioned
were discouraged (and include such things as Microsoft smart quotes
mislabeled as iso-8859-1).  The actual allowed set in XML 1.0 is as
follows:

#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

For XML 1.1 the list is as follows:

[#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
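
As a rough illustration, the two Char productions can be written as Python predicates. The function names are made up; the ranges are those of the Char production in the XML 1.0 and XML 1.1 Recommendations (the supplementary range being #x10000-#x10FFFF in both).

```python
def is_xml10_char(cp: int) -> bool:
    """Char production of XML 1.0."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def is_xml11_char(cp: int) -> bool:
    """Char production of XML 1.1 (adds most C0 controls, still no U+0000)."""
    return (
        0x1 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


# Form feed (U+000C) is the case discussed in this thread.
print(is_xml10_char(0x0C), is_xml11_char(0x0C))
```

Note that both productions exclude surrogates, U+FFFE and U+FFFF, but both admit U+FDD0..U+FDEF and the plane-final noncharacters above the BMP.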


 This is yet another case where "take HTML5, read it into a DOM, and
 serialize it as XML, and voilà: you have valid XHTML" doesn't work.

What I am advocating is making sure that *conforming* HTML5 documents
can be serialized as XHTML5 without dataloss.


Then you will also need to disallow newlines in attribute values.

In any case, I understand the desire; my read is that the WG's desire
for backwards compatibility is higher.  Limiting the character set to
the allowable XML 1.1 character set should not be a problem for
backwards compatibility purposes.

- Sam Ruby


Re: [whatwg] Valid Unicode

2006-12-01 Thread Henri Sivonen

On Dec 1, 2006, at 14:38, Elliotte Harold wrote:


1. Are private use characters allowed?


I think the answer should be Yes, because not allowing them could  
make people subvert Unicode and use e.g. Latin-1 code points for a  
different purpose with a bogus font. Also, not allowing them would be  
a violation of Charmod requirements for specs.


2. Are control characters allowed (probably yes, based on other  
parts of the spec).


Personally, I'd like to make non-conforming the control characters  
that XML 1.0 disallows (in order to keep conforming HTML5 documents  
convertible to XHTML5) as well as C1 controls (because they have no  
legitimate use in HTML but are a sign of a common bug).
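
The "common bug" behind C1 controls can be reproduced in a few lines of Python. The byte string and helper name are illustrative: Windows-1252 smart quotes decoded as ISO-8859-1 turn into code points in U+0080..U+009F.

```python
def has_c1_controls(text: str) -> bool:
    """Return True if the text contains any code point in U+0080..U+009F."""
    return any(0x80 <= ord(ch) <= 0x9F for ch in text)


# Bytes 0x93 and 0x94 are curly quotes in Windows-1252.
raw = b"\x93quoted\x94"

as_latin1 = raw.decode("iso-8859-1")  # mislabeled: yields U+0093 and U+0094
as_cp1252 = raw.decode("cp1252")      # intended: yields U+201C and U+201D

print(has_c1_controls(as_latin1), has_c1_controls(as_cp1252))
```

A conformance checker that flags C1 controls would therefore catch most documents with this mislabeling.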



3. Are surrogate characters allowed? (probably no)


Surrogates are an artifact of UTF-16. They have no place on the  
character level. So I'd say No.
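
A short Python sketch of why surrogates belong to the encoding rather than the character level (the helper name is illustrative): a supplementary-plane character is one character, and surrogates only appear once it is encoded as UTF-16.

```python
def utf16_code_units(text: str) -> list[int]:
    """Return the UTF-16 code units used to encode the text."""
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


ch = "\U0001047E"  # a single character outside the BMP
print(len(ch), [hex(unit) for unit in utf16_code_units(ch)])

# A lone surrogate is not a character and cannot be encoded as UTF-8.
try:
    "\ud801".encode("utf-8")
except UnicodeEncodeError as error:
    print("rejected:", error.reason)
```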



6. Are noncharacters U+FDD0..U+FDEF allowed (?)
7. Are the noncharacters from the last two characters of each plane  
allowed (?)


I don't have particularly strong feelings here. Putting those  
characters in HTML is a bad idea, but allowing them is not a problem  
for HTML5 to XHTML5 conversion and they aren't a common problem like  
C1 controls.


--
Henri Sivonen
[EMAIL PROTECTED]
http://hsivonen.iki.fi/




Re: [whatwg] Valid Unicode

2006-12-01 Thread Elliotte Harold

Henri Sivonen wrote:


6. Are noncharacters U+FDD0..U+FDEF allowed (?)
7. Are the noncharacters from the last two characters of each plane 
allowed (?)


I don't have particularly strong feelings here. Putting those characters 
in HTML is a bad idea, but allowing them is not a problem for HTML5 to 
XHTML5 conversion and they aren't a common problem like C1 controls.


FFFE and FFFF are specifically forbidden by XML so they should probably 
be forbidden here too. I think the others are allowed.


--
Elliotte Rusty Harold  [EMAIL PROTECTED]
Java I/O 2nd Edition Just Published!
http://www.cafeaulait.org/books/javaio2/
http://www.amazon.com/exec/obidos/ISBN=0596527500/ref=nosim/cafeaulaitA/