Re: [whatwg] 9.2.2: replacement characters. How many?

2007-06-22 Thread Øistein E. Andersen
Ian Hickson wrote:

 On Fri, 3 Nov 2006, Elliotte Harold wrote:

 Section 9.2.2 of the current Web Apps 1.0 draft states:
 
 Bytes or sequences of bytes in the original byte stream that could not 
 be converted to Unicode characters must be converted to U+FFFD 
 REPLACEMENT CHARACTER code points.
 
 [This does not specify the exact number of replacement characters.]

 I don't really know how to define this.
 I'd like to say that it's up to the encoding specifications
 to define it. Any suggestions?

Unicode 5.0 remains vague on this point. (E.g., definition D92
defines well-formed and ill-formed UTF-8 byte sequences, but
conformance requirement C10 only requires ill-formed sequences
to be treated as an error condition and suggests that a one-byte
ill-formed sequence may be either filtered out or replaced by
a U+FFFD replacement character.) More generally, character
encoding specifications can hardly be expected to define proper
error handling, since they are usually not terribly preoccupied
with mislabelled data.

Henri Sivonen has pointed out that a strict requirement on the
number of replacement characters generated may cause
unnecessary incompatibilities with current browsers and extant
tools.

The current text may nevertheless be too liberal. It would
notably be possible to construct an arbitrarily long Chinese
text in a legacy encoding which -- according to the spec -- could
be replaced by one single U+FFFD replacement character if
incorrectly handled as UTF-8, which might lead the user to
think that the page is completely uninteresting and therefore
move on, whereas a larger number of replacement characters
would have led him to try another encoding. (This is only a
problem, of course, if an implementor chooses to emit the
minimal number of replacement characters sanctioned by the spec.)

The current upper bound (at most one replacement character per
byte replaced) seems intuitive and completely harmless.

A meaningful lower bound is less obvious, at least
if we want to give some leeway to different implementations.
http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt
details an approach for UTF-8 that basically emits a replacement
character and removes read bytes from the buffer each time a
minimal malformed byte sequence has been detected. Safari,
Opera and Firefox all mostly follow this, whereas IE7 usually
emits one replacement character per replaced byte. (Interesting
cases include byte sequences encoding forbidden characters like
U+FFFF mod U+10000, or exceeding U+10FFFF.)
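
To make the contrast concrete, a rough (untested, simplified) Python
sketch of a decoder in the spirit of that approach -- one U+FFFD per
minimal malformed byte sequence -- might look as follows; the per-byte
policy would instead emit one U+FFFD for every ill-formed byte:

    REPLACEMENT = "\N{REPLACEMENT CHARACTER}"   # U+FFFD

    def decode_minimal(data: bytes) -> str:
        """Decode UTF-8, emitting one U+FFFD each time a minimal
        malformed byte sequence is detected and discarded."""
        out = []
        i, n = 0, len(data)
        while i < n:
            b = data[i]
            if b < 0x80:                          # ASCII byte
                out.append(chr(b))
                i += 1
                continue
            if 0xC2 <= b <= 0xDF:                 # 2-byte sequence
                need, lo = 1, 0x80
            elif 0xE0 <= b <= 0xEF:               # 3-byte sequence
                need, lo = 2, 0x800
            elif 0xF0 <= b <= 0xF4:               # 4-byte sequence
                need, lo = 3, 0x10000
            else:                                 # cannot start any sequence
                out.append(REPLACEMENT)
                i += 1
                continue
            cp = b & (0x7F >> (need + 1))         # payload bits of the lead byte
            j = i + 1
            while j <= i + need and j < n and 0x80 <= data[j] <= 0xBF:
                cp = (cp << 6) | (data[j] & 0x3F)
                j += 1
            well_formed = (j == i + need + 1 and cp >= lo
                           and not 0xD800 <= cp <= 0xDFFF and cp <= 0x10FFFF)
            out.append(chr(cp) if well_formed else REPLACEMENT)
            i = j                                 # resume after the consumed bytes
        return "".join(out)

For instance, b"\xe6\x97" (a truncated three-byte sequence) yields one
replacement character here but two under the per-byte policy, and
b"\xf4\x90\x80\x80" (which would encode U+110000, beyond U+10FFFF)
yields one here but four per byte.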

It should be relatively simple to define something like this
for any multi-byte encoding, but perhaps less straightforward
for encodings using escape sequences to switch between different
alphabets or other more exotic encodings -- if we have to worry
about those.

-- 
Øistein E. Andersen


Re: [whatwg] 9.2.2: replacement characters. How many?

2007-06-14 Thread Ian Hickson
On Fri, 3 Nov 2006, Elliotte Harold wrote:

 Section 9.2.2 of the current Web Apps 1.0 draft states:
 
 Bytes or sequences of bytes in the original byte stream that could not 
 be converted to Unicode characters must be converted to U+FFFD 
 REPLACEMENT CHARACTER code points.
 
 I'm concerned about the "or". For example, suppose there are six upper 
 halves of a Unicode surrogate pair in a row and no lower halves. Does 
 that turn into six replacement characters or one? Both interpretations 
 seem possible.
 
 I suppose I prefer six rather than one, but I don't care a great deal as 
 long as this is locked down one way or the other.

I don't really know how to define this. I'd like to say that it's up to 
the encoding specifications to define it. Any suggestions?
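
For concreteness, Elliotte's six stray upper halves would presumably
arrive as three bytes each if encoded UTF-8-style, so (in Python
notation, purely as an illustration):

    high_surrogate = b"\xed\xa0\x80"   # UTF-8-style encoding of U+D800
    stream = high_surrogate * 6        # 18 bytes, no lower halves
    # defensible readings of the current text:
    #   1  -- the whole run treated as one unconvertible byte sequence
    #   6  -- one U+FFFD per three-byte surrogate encoding
    #  18  -- one U+FFFD per ill-formed byte

i.e. an implementation could defensibly report 1, 6 or 18 replacement
characters as the text stands.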

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'