> So don't say that there are one-for-one equivalences.

I was just quoting this section of the standard:
http://www.unicode.org/versions/Unicode6.2.0/ch16.pdf

> There is a simple, one-to-one mapping between 7-bit (and 8-bit) control
> codes and the Unicode control codes: every 7-bit (or 8-bit) control code is
> numerically equal to its corresponding Unicode code point.

A one-to-one equivalency between bytes and Unicode code points is exactly
what is specified here, limited to the domain of "8-bit control codes".

On Fri, Nov 16, 2012 at 9:48 PM, Philippe Verdy <[email protected]> wrote:

> If you are thinking about "byte values" you are working at the encoding
> scheme level (in fact another lower level which defines a protocol
> presentation layer, e.g. "transport syntaxes" in MIME). Unicode codepoints
> are conceptually not an encoding scheme, just a coded character set
> (independent of the encoding scheme).
>
> Separate the levels of abstraction and you'll be much better off. Forget
> the apparent homonymies that exist between distinct layers of abstraction
> and use each standard in what it is designed for (including the Unicode
> "character/glyph model" which is not defining an encoding scheme).
>
> So don't say that there are one-for-one equivalences. This is wrong: the
> adaptation layer must exist between abstraction levels and between separate
> standards, but the Unicode standard does not specify them completely (with
> the only exception of standard UTF encoding schemes, which is just one
> possible adaptation across some abstraction levels, but is not made to
> adapt alone to other standards than what is in the Unicode standard itself).
>
>
> 2012/11/17 Buck Golemon <[email protected]>
>
>> On Fri, Nov 16, 2012 at 4:11 PM, Doug Ewell <[email protected]> wrote:
>>
>>> Buck Golemon wrote:
>>>
>>>> Is it incorrect to say that 0x81 is a non-semantic byte in cp1252, and
>>>> to map it to the equally-non-semantic U+81?
>>>>
>>>> This would allow systems that follow the html5 standard and use cp1252
>>>> in place of latin1 to continue to be binary-faithful and reversible.
>>>
>>> This isn't quite as black-and-white as the question about Latin-1. If
>>> you are targeting HTML5, you are probably safe in treating an incoming 0x81
>>> (for example) as either U+0081 or U+FFFD, or throwing some kind of error.
>>
>> Why do you make this conditional on targeting html5?
>>
>> To me, replacement and error is out because it means the system loses
>> data or completely fails where it used to succeed.
>> Currently there's no reasonable way for me to implement the U+0081 option
>> other than inventing a new "cp1252+latin1" codec, which seems undesirable.
>>
>>> HTML5 insists that you treat 8859-1 as if it were CP1252, so it no
>>> longer matters what the byte is in 8859-1.
>>
>> I feel like you skipped a step. The byte is 0x81 full stop. I agree that
>> it doesn't matter how it's defined in latin1 (also it's not defined in
>> latin1).
>> The section of the unicode standard that says control codes are equal to
>> their unicode characters doesn't mention latin1. Should it?
>> I was under the impression that it meant any single-byte encoding, since
>> it goes out of its way to talk about "8-bit" control codes.
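
For concreteness, here is a rough sketch of what the "cp1252+latin1" behaviour
discussed above could look like in Python. It is only an illustration, not an
existing codec: the function names and the UNDEFINED_IN_CP1252 set are made up
for this example, and it assumes Python's built-in "cp1252" codec, which
rejects the five unassigned bytes 0x81, 0x8D, 0x8F, 0x90 and 0x9D.

```python
# Bytes that Python's "cp1252" codec leaves unassigned (assumption of this sketch).
UNDEFINED_IN_CP1252 = {0x81, 0x8D, 0x8F, 0x90, 0x9D}


def decode_cp1252_lenient(data: bytes) -> str:
    """Decode as cp1252, mapping unassigned bytes to the same-numbered code point."""
    chars = []
    for value in data:
        try:
            chars.append(bytes([value]).decode("cp1252"))
        except UnicodeDecodeError:
            # e.g. 0x81 -> U+0081, the "numerically equal" C1 control code
            chars.append(chr(value))
    return "".join(chars)


def encode_cp1252_lenient(text: str) -> bytes:
    """Inverse of decode_cp1252_lenient for any text that function can produce."""
    out = bytearray()
    for ch in text:
        try:
            out += ch.encode("cp1252")
        except UnicodeEncodeError:
            # Only the five unassigned slots fall back to their numeric value;
            # anything else (e.g. U+0080) would collide with a real cp1252 byte.
            if ord(ch) in UNDEFINED_IN_CP1252:
                out.append(ord(ch))
            else:
                raise
    return bytes(out)
```

With this fallback, decoding every possible byte value and encoding the result
gives back the original bytes, which is the binary-faithful, reversible
property asked for; the strict cp1252 codec raises an error on 0x81 instead.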

