Re: cp1252 decoder implementation

Doug Ewell Fri, 16 Nov 2012 21:07:28 -0800

Buck Golemon wrote:

This isn't quite as black-and-white as the question about Latin-1. If
you are targeting HTML5, you are probably safe in treating an
incoming 0x81 (for example) as either U+0081 or U+FFFD, or throwing
some kind of error.


Why do you make this conditional on targeting html5?

Because WHATWG has seen fit to redefine "ISO-8859-1" as an alias for
"Windows-1252", and to create its own mapping tables and rules for
decoding, superseding all existing tables and documents created over the
years by vendors and SDOs:


http://dvcs.w3.org/hg/encoding/raw-file/tip/Overview.html

If you are targeting HTML5, you will probably be considered
nonconformant if you don't follow this document and associated tables.

If you are not targeting HTML5, then use the tables for ISO 8859-1 or
CP1252 (as appropriate) from the Unicode Standard.
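
To make the three treatments of 0x81 concrete, here is a rough Python
sketch. It is an illustration only, not taken from the WHATWG document or
from any vendor table. It relies on Python's built-in "latin-1" and
"cp1252" codecs; decode_cp1252_lenient is a hypothetical helper name, and
passing unassigned bytes through as C1 controls is just one of the options
mentioned above.

```python
def decode_cp1252_lenient(data: bytes) -> str:
    """Decode CP1252, passing unassigned bytes through as C1 controls.

    Hypothetical helper: bytes that Python's cp1252 codec leaves
    unassigned (such as 0x81) are mapped to the code point with the
    same numeric value instead of raising an error.
    """
    out = []
    for b in data:
        try:
            out.append(bytes([b]).decode("cp1252"))
        except UnicodeDecodeError:
            out.append(chr(b))  # assumption: pass through as U+0080..U+009F
    return "".join(out)


raw = b"\x81"

# Option 1: treat the byte as the C1 control U+0081 (ISO 8859-1 reading).
as_control = raw.decode("latin-1")

# Option 2: substitute U+FFFD REPLACEMENT CHARACTER.
as_replacement = raw.decode("cp1252", errors="replace")

# Option 3: throw an error (the default for Python's strict cp1252 codec).
try:
    raw.decode("cp1252")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True
```

Assigned CP1252 bytes are unaffected by the lenient helper; 0x80 still
decodes to U+20AC EURO SIGN, for instance.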

HTML5 insists that you treat 8859-1 as if it were CP1252, so it no
longer matters what the byte is in 8859-1.


I feel like you skipped a step. The byte is 0x81 full stop. I agree
that it doesn't matter how it's defined in latin1 (also it's not
defined in latin1).

Are you concerned about the mapping between Latin-1 and Unicode, or
about the control semantic of the character? The former is defined by
Unicode; the latter is defined by ISO 6429.

The section of the unicode standard that says control codes are equal
to their unicode characters doesn't mention latin1. Should it?


It applies to all ISO 8859-x parts.
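
A quick way to see this for a few parts, assuming Python's built-in
codecs reflect the Unicode mapping tables (c1_is_identity is a
hypothetical helper name, and only three parts are checked here):

```python
def c1_is_identity(codec: str) -> bool:
    """Return True if bytes 0x80..0x9F decode to U+0080..U+009F."""
    return all(
        bytes([b]).decode(codec) == chr(b)
        for b in range(0x80, 0xA0)
    )


# Three ISO 8859 parts, named as Python spells them.
results = {
    name: c1_is_identity(name)
    for name in ("iso8859_1", "iso8859_2", "iso8859_15")
}
```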

--
Doug Ewell | Thornton, Colorado, USA

http://www.ewellic.org | @DougEwell
